activities (stationary, head rotation, talking, and walking), and exercise scenarios. With multiple
labels provided, different subsets of this dataset can easily be used for research with our toolbox.
BP4D+ [26]: This dataset contains video footage captured at a rate of 25 frames per second for
140 subjects, each participating in 10 emotion-inducing tasks, amounting to a total of 1400 trials
and associated videos. In addition to the standard video footage, the dataset also includes 3D mesh
models and thermal video, both captured at the same frame rate. Alongside these, the dataset offers
supplementary data including blood pressure measurements (wave, systolic, diastolic, mean), heart
rate in beats per minute, respiration (wave, rate in breaths per minute), electrodermal activity, and Facial Action
Coding System (FACS) encodings for specified action units.
UBFC-Phys [27]: The UBFC-Phys dataset is a multi-modal dataset containing 168 RGB videos, with
56 subjects (46 women and 10 men) per task. There are three tasks involving significant amounts of
unconstrained motion under static lighting conditions: a rest task, a speech task, and an arithmetic
task. The dataset contains gold-standard blood volume pulse (BVP) and electrodermal activity (EDA)
measurements that were collected via the Empatica E4 wristband. The videos were recorded at a
resolution of 1024×1024 pixels and 35 Hz with an EO-23121C RGB digital camera. We utilized all three
tasks and the same subject sub-selection list provided by the authors of the dataset in the second
supplementary material of Sabour et al. [27] for evaluation. We reiterate this subject sub-selection
list in Appendix H.
3.2 Methods
3.2.1 Unsupervised Methods
The following methods all use linear algebra and traditional signal processing to recover the estimated
PPG signal: 1) Green [28]: the green channel information is used as a proxy for the PPG after
spatial averaging of the RGB video; 2) ICA [29]: Independent Component Analysis (ICA) is applied
to normalized, spatially averaged color signals to recover demixing matrices; 3) CHROM [30]: a
linear combination of the chrominance signals obtained from the RGB video is used for estimation;
4) POS [31]: plane-orthogonal-to-the-skin (POS) calculates a projection plane orthogonal to the
skin tone based on physiological and optical principles; a fixed matrix projection is applied to the
temporally normalized, spatially averaged pixel values to recover the PPG waveform (a minimal
sketch of this procedure is given below); 5) PBV [32]: a signature, determined by a given light
spectrum and changes in the blood volume pulse, is used to derive the PPG waveform while offsetting
motion and other noise in RGB videos; 6) LGI [33]: a feature representation method that is invariant
to motion through differentiable local transformations.
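To make the POS projection concrete, the following is a minimal NumPy sketch of the core procedure. The function name pos_ppg, the 1.6 s window length, and the implementation details are illustrative assumptions, not the toolbox's exact code.

```python
import numpy as np


def pos_ppg(rgb: np.ndarray, fs: float, win_sec: float = 1.6) -> np.ndarray:
    """Sketch of POS: recover a PPG estimate from spatially averaged RGB traces.

    rgb: array of shape (T, 3) holding per-frame mean R, G, B values.
    fs:  video frame rate in Hz.
    """
    T = rgb.shape[0]
    w = int(win_sec * fs)             # sliding-window length in frames (assumed)
    P = np.array([[0.0, 1.0, -1.0],   # fixed projection onto the plane
                  [-2.0, 1.0, 1.0]])  # orthogonal to the skin tone
    h = np.zeros(T)
    for t in range(T - w + 1):
        block = rgb[t:t + w]                      # (w, 3) window
        cn = block / block.mean(axis=0)           # temporal normalization
        s = cn @ P.T                              # (w, 2) projected signals
        # Alpha-tune the two projections so motion components cancel
        p = s[:, 0] + (s[:, 0].std() / (s[:, 1].std() + 1e-8)) * s[:, 1]
        h[t:t + w] += p - p.mean()                # overlap-add into the output
    return h
```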
3.2.2 Supervised Neural Methods
The following implementations of supervised learning algorithms are included in the toolbox. All im-
plementations were done using PyTorch [
37
]. Common optimization algorithms, such as Adam [
38
]
and AdamW [
39
], and criterion, such as mean squared error (MSE) loss, are utilized for training
except for where noted. The learning rate scheduler typically follows the 1cycle policy [
40
], which
anneals the learning rate from an initial learning rate to some maximum learning rate and then, from
that maximum learning rate, to some learning rate much lower than the initial learning rate. The total
steps in this policy are determined by the number of epochs multiplied by the number of training
batches in an epoch. The 1cycle policy allows for convergence due to the learning rate being adjusted
well below the initial, maximum learning rate throughout the cycle, and after numerous epochs in
which the learning rate is much higher than the final learning rate. We found the 1cycle learning
rate scheduler to provide stable results with convergence using a maximum learning rate of 0.009
and 30 epochs. We provide parameters in the toolbox that can enable the visualization of the losses
and learning rate changes for both the training and validation phases. Further details on these key
visualizations for supervised neural methods are provided in the GitHub repository.
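For illustration, a minimal PyTorch sketch of this schedule using torch.optim.lr_scheduler.OneCycleLR with the maximum learning rate of 0.009 and 30 epochs mentioned above; the model and the number of steps per epoch are placeholders, not the toolbox's actual training loop.

```python
import torch

model = torch.nn.Conv2d(3, 1, kernel_size=3)      # placeholder for an rPPG network
optimizer = torch.optim.AdamW(model.parameters(), lr=0.009)

epochs, steps_per_epoch = 30, 100                  # total steps = epochs * batches/epoch
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.009, epochs=epochs, steps_per_epoch=steps_per_epoch
)

for _ in range(epochs):
    for _ in range(steps_per_epoch):
        # ... forward pass and loss.backward() would go here, then:
        optimizer.step()
        scheduler.step()   # anneal up to max_lr, then far below the initial rate
```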
DeepPhys [4]: A two-branch 2D convolutional attention network architecture. The two representations
(appearance and difference frames) are processed by parallel branches, with the appearance
branch guiding the motion branch via a gated attention mechanism. The target signal is the first
differential of the PPG waveform.
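A minimal PyTorch sketch of such a gated attention block, assuming the mask normalization described in the DeepPhys paper (a sigmoid-activated 1×1 convolution, L1-normalized and scaled by HW/2); channel counts and layer shapes are illustrative.

```python
import torch
import torch.nn as nn


class GatedAttention(nn.Module):
    """Appearance features produce a soft spatial mask that gates motion features."""

    def __init__(self, channels: int):
        super().__init__()
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, motion: torch.Tensor, appearance: torch.Tensor) -> torch.Tensor:
        mask = torch.sigmoid(self.attn(appearance))         # (B, 1, H, W)
        b, _, h, w = mask.shape
        norm = mask.abs().sum(dim=(2, 3), keepdim=True)     # L1 norm per sample
        mask = (h * w) * mask / (2 * norm)                  # scale to sum HW/2
        return motion * mask                                # gated motion features
```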
PhysNet [5]: A 3D convolutional network architecture. Yu et al. compared this 3D-CNN architecture
with a 2D-CNN + RNN architecture, finding that the 3D-CNN version was able to achieve superior
performance.
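As a rough illustration of the 3D-CNN idea (deliberately not the exact PhysNet topology), a tiny spatiotemporal network that maps a clip of T RGB frames to a T-sample PPG estimate by collapsing the spatial dimensions:

```python
import torch
import torch.nn as nn


class Tiny3DPPG(nn.Module):
    """Toy 3D-CNN: clip (B, 3, T, H, W) -> PPG estimate (B, T)."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.head = nn.Conv3d(32, 1, kernel_size=1)  # per-voxel PPG logits

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        x = self.head(self.features(clip))     # (B, 1, T, H, W)
        return x.mean(dim=(3, 4)).squeeze(1)   # spatial average -> (B, T)


ppg = Tiny3DPPG()(torch.randn(1, 3, 64, 36, 36))  # -> shape (1, 64)
```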