rPPG-Toolbox: Deep Remote PPG Toolbox
Xin Liu1, Girish Narayanswamy1∗, Akshay Paruchuri2∗, Xiaoyu Zhang3, Jiankai Tang3,
Yuzhe Zhang3, Roni Sengupta2, Shwetak Patel1, Yuntao Wang3, Daniel McDuff1
1University of Washington, Seattle
2University of North Carolina at Chapel Hill
3Tsinghua University
{xliu0, girishvn, dmcduff}@cs.washington.edu, akshay@cs.unc.edu
∗Equal Contribution
Abstract
Camera-based physiological measurement is a fast-growing field of computer
vision. Remote photoplethysmography (rPPG) utilizes imaging devices (e.g.,
cameras) to measure the peripheral blood volume pulse (BVP), and enables cardiac
measurement via webcams and smartphones. However, the task is non-trivial with
important pre-processing, modeling, and post-processing steps required to obtain
state-of-the-art results. Replication of results and benchmarking of new models is
critical for scientific progress; however, as with many other applications of deep
learning, reliable codebases are not easy to find or use. We present a comprehensive
toolbox, rPPG-Toolbox, that contains unsupervised and supervised rPPG models
with support for public benchmark datasets, data augmentation, and systematic
evaluation: https://github.com/ubicomplab/rPPG-Toolbox
1 Introduction
The vision of ubiquitous computing is to embed computation into everyday objects to enable them
to perform useful tasks. The sensing of physiological vital signs is one such task and plays an
important role in how health is understood and managed. Cameras are both ubiquitous and versatile
sensors, and the transformation of cameras into accurate health sensors has the potential to make the
measurement of health signals more comfortable and accessible. Examples of the applications of this
technology include systems for monitoring neonates [1], dialysis patients [2], and the detection of arrhythmias [3].
Building on advances in computer vision, camera-based measurement of physiological vital signs has developed into a research field of its own [?]. Researchers have developed methods for measuring cardiac and pulmonary signals by analyzing skin pixel changes over time. Recently, several companies have been granted FDA De Novo status for products that use software algorithms to analyze video and estimate pulse rate, heart rate, respiratory rate, and/or breathing rate.1,2
There are hundreds of computational architectures that have been proposed for the measurement
of cardiopulmonary signals. Unsupervised signal processing methods leverage techniques such as
Independent Component Analysis (ICA) or Principal Component Analysis (PCA) and assumptions
about the periodicity or structure of the underlying blood volume pulse waveform. Neural network
architectures can be trained in a supervised fashion using videos with synchronized gold-standard
ground truth signals [4, 5, 6, 7]. Innovative data generation [8] and augmentation [9], meta-learning for personalization [10, 11], federated learning [12], and unsupervised pretraining [13, 14, 15, 16] have been widely explored in the field of camera-based physiological sensing and have led
1https://www.accessdata.fda.gov/cdrh_docs/reviews/DEN200019.pdf
2https://www.accessdata.fda.gov/cdrh_docs/reviews/DEN200038.pdf
37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks.
arXiv:2210.00716v3 [cs.CV] 25 Nov 2023
Figure 1: rPPG Pipeline. An example of the components of an rPPG pipeline including preprocessing, training, inference, and evaluation. (The diagram shows input frames passing through preprocessing (e.g., ROI selection, normalization, spatial averaging), then through either unsupervised methods (color transforms and signal decomposition of the B/G/R channels) or neural methods (convolutional and fully connected layers), yielding a predicted blood volume pulse after a band-pass filter, followed by postprocessing and metric calculation on the BVP power spectrum, e.g., a 60 beats/min peak.)
to significant improvements in state-of-the-art performance. Further information regarding the background, algorithms, and potential applications of rPPG is included in Appendices B and C.
However, standardization in the field is still severely lacking. Based on our review of literature in
the space, we identified four issues that have hindered the interpretation of results in many papers.
First, and perhaps most obviously, a number of the published works are not accompanied by public
code. While publishing code repositories with papers is now fairly common in the machine learning
and computer vision research communities, it is far less common in the field of camera-based
physiological sensing. While there are reasons that it might be difficult to release datasets (e.g.,
medical data privacy), we cannot find good arguments for not releasing code. Second, many papers
do not compare to previously published methods in an “apples-to-apples” fashion. This point is a
little more subtle, but rather than performing systematic side-by-side comparisons between methods,
the papers compare to numerical results from previous work, even if the training sets and/or test
sets are not identical (e.g., test samples were filtered because they were deemed to not have reliable
labels). Unfortunately, this often makes it unclear if performance differences are due to data, pre-
processing steps, model design, post-processing, training schemes and hardware specifications, or
a combination of the aforementioned. Continuing this thread, the third flaw is that papers use pre-
and post-processing steps that are not adequately described. Finally, different researchers compute
the “labels” (e.g., heart rate) using their own methods from the contact PPG or ECG time-series
data. Differences in these methods lead to different labels and a fundamental issue when it comes to
benchmarking performance. When combined, the aforementioned issues make it very difficult to
draw conclusions from the literature about the optimal choices for the design of rPPG systems.
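To make the last point concrete, even a simple frequency-domain heart-rate label depends on choices such as the analysis band, window length, and estimator. The sketch below shows one common variant (a periodogram peak within a 45-150 beats/min band); the band limits and function name here are illustrative assumptions, not a standard prescribed by the literature.

```python
import numpy as np
from scipy.signal import periodogram

def fft_heart_rate(bvp: np.ndarray, fs: float) -> float:
    """Illustrative FFT-based heart-rate label: the dominant frequency of a
    PPG/BVP segment within a plausible 0.75-2.5 Hz (45-150 bpm) band.
    Different band limits or windows yield different "ground truth" labels."""
    freqs, power = periodogram(bvp, fs=fs)
    band = (freqs >= 0.75) & (freqs <= 2.5)
    return 60.0 * freqs[band][np.argmax(power[band])]
```

Two papers applying different band limits or estimators to the same contact PPG can therefore report different heart-rate labels, which is exactly the benchmarking ambiguity described above.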
Open-source code allows researchers to compare novel approaches to consistent baselines without
ambiguity regarding the implementation or parameters used. This transparency is important as
subsequent research invariably builds on prior state-of-the-art. Implementing a prior method from
a paper, even if clearly written, can be difficult. Furthermore, it is an inefficient use of time for
many researchers to re-implement all baseline methods. In an effort to address this, several open
source toolboxes have been released for camera-based physiological sensing. These toolboxes have
been a significant contribution to the community and provide implementations of methods and
models [17, 18, 19]; however, they are also incomplete. McDuff and Blackford [17]3 implemented a set of source separation methods (Green, ICA, CHROM, POS), and Pilz [19] published the PPGI-Toolbox4 containing implementations of Green, SSR, POS, Local Group Invariance (LGI), Diffusion Process (DP), and Riemannian-PPGI (SPH) models. These toolboxes are implemented in MATLAB (e.g., [17]) and do not contain examples of supervised methods. Python and supervised neural models are now the focus of a large majority of computer vision and deep learning research. There are
3https://github.com/danmcduff/iphys-toolbox
4https://github.com/partofthestars/PPGI-Toolbox
Table 1: Comparison of rPPG Toolboxes. Comparison of rPPG-Toolbox with existing toolboxes in
camera-based physiological measurement.
Toolbox               Dataset Support   Unsup. Eval   DNN Training   DNN Eval
iPhys-Toolbox [20]           ✗               ✓              ✗             ✗
PPG-I Toolbox [19]           ✗               ✓              ✗             ✗
pyVHR [18, 21]               ✗               ✓              ✗             ✓
rPPG-Toolbox (Ours)          ✓               ✓              ✓             ✓
Unsup. = Unsupervised learning methods, DNN = Deep neural network methods.
several implementations of popular signal processing methods in Python: bob.rppg.base5 includes implementations of CHROM and SSR, and Boccignone et al. [18] released code for Green, CHROM,
ICA, LGI, PBV, PCA, and POS. Several published papers have included links to code; however, often
this is only inference code and not training code for neural models. Without providing training code
for neural networks, it is challenging for researchers to conduct end-to-end reproducible experiments
and build on existing research.
In this paper, we present an end-to-end toolbox6 for camera-based physiological measurement. This
toolbox includes: 1) support for six public datasets, 2) pre-processing code to format the datasets for training neural models, 3) implementations of six neural model architectures and six unsupervised learning methods, 4) evaluation and inference pipelines for supervised and unsupervised learning methods for reproducibility, and 5) support for advanced neural training and inference techniques such as weakly supervised pseudo labels, motion augmentation, and multitask learning. We use this toolbox to publish
clear and reproducible benchmarks that we hope will provide a foundation for the community to
compare methods in a more rigorous and informative manner.
2 Related Work
In the field of remote PPG sensing, there are three significant open-source toolboxes (documented in
Table 1):
iPhys-Toolbox [17]: An open-source toolbox written in MATLAB that comprises implementations of numerous algorithms for rPPG sensing. It empowers researchers to present results on
tations of numerous algorithms for rPPG sensing. It empowers researchers to present results on
their datasets using public, standard implementations of baseline methods, ensuring transparency of
parameters. This toolbox incorporates a wide range of widely employed baseline methods; however, it
falls short on Python support, public dataset data loaders, and neural network training and evaluation.
PPG-I Toolbox [19]: This toolbox provides MATLAB implementations, specifically for six unsupervised signal separation models. It incorporates four evaluation metrics, including Pearson correlation,
vised signal separation models. It incorporates four evaluation metrics, including Pearson correlation,
RMSE/MSE, SNR, and Bland-Altman plots. However, similar to the iPhys-Toolbox, it lacks support
for public dataset data loading and neural network training and evaluation.
pyVHR [21]: The most recent in the field, this toolbox adopts Python instead of MATLAB. While it offers ample support for numerous unsupervised methods, its capabilities are limited when it comes
offers ample support for numerous unsupervised methods, its capabilities are limited when it comes
to modern neural networks. Notably, pyVHR supports only two neural networks for inference, and
none for model training. This omission can be a roadblock for researchers aiming to reproduce and
further advance state-of-the-art neural methods.
3 The rPPG-Toolbox
To address the gaps in the current tooling and to promote reproducibility and clearer benchmarking
within the camera-based physiological measurement (rPPG) community, we present an open-source
toolbox designed to support six public datasets, six unsupervised methods, and six neural methods, covering data preprocessing, neural model training and evaluation, and further analysis.
3.1 Datasets
The toolbox includes pre-processing code that converts six public datasets into a form amenable for
training with neural models. The standard form for the videos we select includes raw frames and
5https://pypi.org/project/bob.rppg.base/
6https://github.com/ubicomplab/rPPG-Toolbox
Data Preprocessing & Loaders Supervised Neural Methods Unsupervised Methods
DeepPhys
TS_CAN
PhysNet
EfficientPhys
DeepPhysTrainer
TscanTrainer
PhysnetTrainer
EfficientPhysTrainer
UBFCLoader
PURELoader
SCAMPSLoader
FFT
Unsupervised Predictor
GREENICA POS
CHROMLGIPBV
Evaluation
Peak
Detection
Prediction MAE
RMSE
MAPE
Pearson
Coef.
Ground
Truth
Train Valid Test
Face
Detection
Difference
Normalize.
Standardiza
tion
Resolution /
FS / Chunk GPU
BaseLoader
Implementations of exemplary data loaders Implementations of exemplary neural
architectures & training pipelines
Implementations of exemplary
unsupervised methods
Implementations of systematic video-level
evaluation pipelines
Configuration
Implementations of end-to-end configurations to parametrize and
abstract training, validation and testing
E.g.
Figure 2: Overview. An overview of the rPPG-Toolbox codebase.
difference frames (the difference between each pair of consecutive frames) stored as numpy arrays in an [N, W, H, C] format, where N is the length of the sequence, W is the width of the frames, H is the height of the frames, and C is the number of channels. There are six channels in this case, as the raw frames and difference frames account for three color channels each. For faster data loading, all videos in the datasets are typically broken up into several "chunks" of non-overlapping N-frame (e.g., 180) sequences. All of these parameters (N, W, H, C) are easy to change and customize. The PPG waveform labels are stored as numpy arrays in an [N, 1] format. The entire pre-processing procedure supports multithreaded processing to accelerate data preparation.
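As an illustration of this layout, here is a minimal sketch (not the toolbox's exact preprocessing code; the normalization details and function name are assumptions) that builds the six-channel [N, W, H, C] representation and splits it into non-overlapping chunks:

```python
import numpy as np

def preprocess_video(frames: np.ndarray, chunk_len: int = 180) -> np.ndarray:
    """Sketch: convert [N, W, H, 3] float frames into chunked 6-channel data.

    Returns an array of shape [num_chunks, chunk_len, W, H, 6] holding
    normalized difference frames concatenated with standardized raw frames.
    """
    # Difference frames: (f[t+1] - f[t]) / (f[t+1] + f[t]), a common
    # DeepPhys-style normalization; zero-pad the last frame to keep N frames.
    diff = (frames[1:] - frames[:-1]) / (frames[1:] + frames[:-1] + 1e-7)
    diff = np.concatenate([diff, np.zeros_like(frames[:1])], axis=0)
    diff = diff / (np.std(diff) + 1e-7)

    # Standardized raw frames provide the appearance channels.
    raw = (frames - np.mean(frames)) / (np.std(frames) + 1e-7)

    data = np.concatenate([diff, raw], axis=-1)  # [N, W, H, 6]

    # Break the video into non-overlapping chunks of chunk_len frames.
    n_chunks = data.shape[0] // chunk_len
    return data[: n_chunks * chunk_len].reshape(
        n_chunks, chunk_len, *data.shape[1:]
    )
```

The PPG labels would be chunked the same way, so that each [chunk_len, W, H, 6] block pairs with a [chunk_len, 1] waveform segment.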
We have provided pre-processing code for UBFC-rPPG [22], PURE [23], SCAMPS [24], MMPD [25], BP4D+ [26], and UBFC-Phys [27]. Each of these datasets encompasses a diverse array of real-world
conditions, capturing variations in factors such as motion, lighting, skin tones/types, and backgrounds,
thus presenting robust challenges for any signal processing and machine learning algorithm. Tools
(Python notebooks) are provided for quickly visualizing pre-processed datasets and are detailed further in Appendix J. We also support the pre-processing and usage of augmented versions of the
UBFC-rPPG [22] and PURE [23] datasets, a feature which we describe further in Section 4.2.
UBFC-rPPG [22]: This dataset features RGB videos recorded using a Logitech C920 HD Pro
webcam at 30Hz. The videos have a resolution of 640x480, and they are stored in an uncompressed
8-bit RGB format. Reference PPG data was obtained using a CMS50E transmissive pulse oximeter,
thereby providing the gold-standard validation data. The subjects were positioned approximately one
meter away from the camera during the recording sessions. The videos were captured under indoor
conditions with a combination of natural sunlight and artificial illumination.
PURE [23]: This dataset consists of recordings from 10 subjects, including 8 males and 2 females.
The video footage was captured with an RGB eco274CVGE camera from SVS-Vistek GmbH, with a
frequency of 30Hz and a resolution of 640x480. Subjects were positioned approximately 1.1 meters
from the camera and were illuminated from the front by ambient natural light filtering through a
window. The gold-standard ground truth of PPG and SpO2 were obtained at 60Hz with a CMS50E
pulse oximeter affixed to the subject’s finger. Each participant completed six recordings under varied
motion conditions, thereby contributing to a range of data reflecting different physical states.
SCAMPS [24]: This dataset encompasses 2,800 video clips, comprising 1.68M frames, featuring
synthetic avatars in alignment with cardiac and respiratory signals. These waveforms and videos
were generated by employing a sophisticated facial processing pipeline, resulting in high-fidelity,
quasi-photorealistic renderings. To provide robust test conditions, the videos incorporate various
confounders such as head motions, facial expressions, and changes in ambient illumination.
MMPD [25]: This dataset includes 660 one-minute videos recorded using a Samsung Galaxy
S22 Ultra mobile phone, at 30 frames per second with a resolution of 1280x720 pixels and then
compressed to 320x240 pixels. The ground truth PPG signals were simultaneously captured using an
HKG-07C+ oximeter, at 200 Hz and then downsampled to 30 Hz. It contains Fitzpatrick skin types
3-6, four different lighting conditions (LED-low, LED-high, incandescent, natural), four distinct
activities (stationary, head rotation, talking, and walking), and exercise scenarios. With multiple
labels provided, different subsets of this dataset can be easily used for research using our toolbox.
BP4D+ [26]: This dataset contains video footage captured at a rate of 25 frames per second, for
140 subjects, each participating in 10 emotion-inducing tasks, amounting to a total of 1400 trials
and associated videos. In addition to the standard video footage, the dataset also includes 3D mesh
models and thermal video, both captured at the same frame rate. Alongside these, the dataset offers
supplementary data including blood pressure measurements (wave, systolic, diastolic, mean), heart
rate in beats per minute, respiration (wave, rate bpm), electrodermal activity, and Facial Action
Coding System (FACS) encodings for specified action units.
UBFC-Phys [27]: The UBFC-PHYS dataset, a multi-modal dataset, contains 168 RGB videos, with 56 subjects (46 women and 10 men) per task. There are three tasks with significant amounts of unconstrained motion under static lighting conditions: a rest task, a speech task, and an arithmetic task. The dataset contains gold-standard blood volume pulse (BVP) and electrodermal activity (EDA) measurements that were collected via the Empatica E4 wristband. The videos were recorded at a resolution of 1024x1024 and 35Hz with an EO-23121C RGB digital camera. We utilized all three tasks and the same subject sub-selection list provided by the authors of the dataset in the second supplementary material of Sabour et al. [27] for evaluation. We reiterate this subject sub-selection list in Appendix H.
3.2 Methods
3.2.1 Unsupervised Methods
The following methods all use linear algebra and traditional signal processing to recover the estimated PPG signal: 1) Green [28]: the green channel information is used as a proxy for the PPG after spatial averaging of the RGB video; 2) ICA [29]: Independent Component Analysis (ICA) is applied to normalized, spatially averaged color signals to recover demixing matrices; 3) CHROM [30]: a linear combination of the chrominance signals obtained from the RGB video is used for estimation; 4) POS [31]: plane-orthogonal-to-the-skin (POS) is a method that calculates a projection plane orthogonal to the skin tone based on physiological and optical principles; a fixed matrix projection is applied to the spatially normalized, averaged pixel values, which are used to recover the PPG waveform; 5) PBV [32]: a signature, determined by a given light spectrum and changes in the blood volume pulse, is used to derive the PPG waveform while offsetting motion and other noise in RGB videos; 6) LGI [33]: a feature representation method that is invariant to motion through differentiable local transformations.
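To make the flavor of these methods concrete, below is a minimal NumPy sketch of the POS projection described in 4) above. It is an illustrative re-implementation under the usual ~1.6-second sliding-window formulation, not the toolbox's exact code:

```python
import numpy as np

def pos_pulse(rgb: np.ndarray, fs: float = 30.0) -> np.ndarray:
    """Sketch of POS [31]: rgb is a [T, 3] spatially averaged RGB trace;
    returns a [T] pulse estimate."""
    proj = np.array([[0.0, 1.0, -1.0],
                     [-2.0, 1.0, 1.0]])  # fixed plane orthogonal to skin tone
    win = int(1.6 * fs)                  # ~1.6 s sliding window
    pulse = np.zeros(rgb.shape[0])
    for t in range(rgb.shape[0] - win + 1):
        c = rgb[t:t + win].T                            # [3, win] window
        c = c / (c.mean(axis=1, keepdims=True) + 1e-9)  # temporal normalization
        s = proj @ c                                    # two projected signals
        alpha = s[0].std() / (s[1].std() + 1e-9)        # alpha-tuning step
        p = s[0] + alpha * s[1]
        pulse[t:t + win] += p - p.mean()                # overlap-add
    return pulse

# Example: pulse = pos_pulse(np.random.rand(300, 3), fs=30.0)
```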
3.2.2 Supervised Neural Methods
The following implementations of supervised learning algorithms are included in the toolbox. All implementations use PyTorch [37]. Common optimization algorithms, such as Adam [38] and AdamW [39], and criteria, such as mean squared error (MSE) loss, are used for training except where noted. The learning rate scheduler typically follows the 1cycle policy [40], which
anneals the learning rate from an initial learning rate to some maximum learning rate and then, from
that maximum learning rate, to some learning rate much lower than the initial learning rate. The total
steps in this policy are determined by the number of epochs multiplied by the number of training
batches in an epoch. The 1cycle policy promotes convergence because, by the end of the cycle, the learning rate has been annealed well below the initial and maximum rates, after numerous epochs in which the learning rate is much higher than the final rate. We found the 1cycle learning
rate scheduler to provide stable results with convergence using a maximum learning rate of 0.009
and 30 epochs. We provide parameters in the toolbox that can enable the visualization of the losses
and learning rate changes for both the training and validation phases. Further details on these key
visualizations for supervised neural methods are provided in the GitHub repository.
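A minimal sketch of this training recipe is shown below. The toy model and random data are placeholders rather than the toolbox's architectures; the maximum learning rate of 9e-3 and 30 epochs come from the text above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data; substitute a real rPPG architecture and loader.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 36 * 36, 1))
train_loader = DataLoader(
    TensorDataset(torch.randn(64, 3, 36, 36), torch.randn(64, 1)), batch_size=8
)
optimizer = torch.optim.AdamW(model.parameters())
criterion = torch.nn.MSELoss()
epochs = 30
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=9e-3,                        # maximum learning rate from the text
    epochs=epochs,
    steps_per_epoch=len(train_loader),  # total steps = epochs x batches/epoch
)

for _ in range(epochs):
    for frames, target in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(frames), target)
        loss.backward()
        optimizer.step()
        scheduler.step()                # 1cycle anneals the LR every step
```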
DeepPhys [4]: A two-branch 2D convolutional attention network architecture. The two representations (appearance and difference frames) are processed by parallel branches, with the appearance
tations (appearance and difference frames) are processed by parallel branches with the appearance
branch guiding the motion branch via a gated attention mechanism. The target signal is the first
differential of the PPG waveform.
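The gating described above can be sketched as a small module (consistent with the paper's description; the class name and the exact normalization constant are assumptions, not the toolbox's code):

```python
import torch
import torch.nn as nn

class GatedAttention(nn.Module):
    """Sketch of DeepPhys-style attention: the appearance branch yields a
    soft spatial mask that gates the motion branch's features."""
    def __init__(self, channels: int):
        super().__init__()
        self.mask_conv = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, appearance: torch.Tensor, motion: torch.Tensor):
        mask = torch.sigmoid(self.mask_conv(appearance))   # [B, 1, H, W]
        _, _, h, w = mask.shape
        # L1-normalize so the mask's total attention is constant over space.
        norm = mask.abs().sum(dim=(2, 3), keepdim=True) + 1e-7
        mask = (h * w / 2) * mask / norm
        return motion * mask                               # gated motion features
```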
PhysNet [5]: A 3D convolutional network architecture. Yu et al. compared this 3D-CNN architecture with a 2D-CNN + RNN architecture, finding that the 3D-CNN version was able to achieve superior performance.
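A minimal 3D-CNN trunk in this spirit might look like the following; the layer sizes are illustrative, not the paper's or the toolbox's exact architecture:

```python
import torch
import torch.nn as nn

# 3D convolutions mix space and time, so temporal pulse structure is learned
# directly from the video volume; spatial dimensions are pooled away at the end.
physnet_like = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
    nn.BatchNorm3d(16),
    nn.ReLU(),
    nn.Conv3d(16, 32, kernel_size=3, padding=1),
    nn.BatchNorm3d(32),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d((None, 1, 1)),  # collapse space, keep time
    nn.Flatten(2),                       # [B, 32, T]
    nn.Conv1d(32, 1, kernel_size=1),     # per-frame BVP estimate
)

video = torch.randn(2, 3, 180, 72, 72)   # [B, C, T, H, W]
bvp = physnet_like(video)                # [B, 1, 180]
```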