activities (stationary, head rotation, talking, and walking), and exercise scenarios. With multiple
labels provided, different subsets of this dataset can easily be used for research with our toolbox.
BP4D+ [26]: This dataset contains video footage captured at a rate of 25 frames per second for
140 subjects, each participating in 10 emotion-inducing tasks, amounting to a total of 1400 trials
and associated videos. In addition to the standard video footage, the dataset also includes 3D mesh
models and thermal video, both captured at the same frame rate. Alongside these, the dataset offers
supplementary data including blood pressure measurements (wave, systolic, diastolic, mean), heart
rate in beats per minute, respiration (wave, rate in breaths per minute), electrodermal activity, and Facial Action
Coding System (FACS) encodings for specified action units.
UBFC-Phys [27]: The UBFC-Phys dataset is a multi-modal dataset containing 168 RGB videos, with
56 subjects (46 women and 10 men) per task. There are three tasks involving significant amounts of
unconstrained motion under static lighting conditions: a rest task, a speech task, and an arithmetic
task. The dataset contains gold-standard blood volume pulse (BVP) and electrodermal activity (EDA)
measurements that were collected via the Empatica E4 wristband. The videos were recorded at a
resolution of 1024×1024 pixels and 35 Hz with an EO-23121C RGB digital camera. We utilized all three
tasks and the same subject sub-selection list provided by the authors of the dataset in the second
supplementary material of Sabour et al. [27] for evaluation. We reiterate this subject sub-selection
list in Appendix H.
3.2 Methods
3.2.1 Unsupervised Methods
The following methods all use linear algebra and traditional signal processing to recover the estimated
PPG signal: 1) Green [28]: the green channel information is used as a proxy for the PPG after
spatial averaging of the RGB video; 2) ICA [29]: Independent Component Analysis (ICA) is applied
to normalized, spatially averaged color signals to recover demixing matrices; 3) CHROM [30]: a
linear combination of the chrominance signals obtained from the RGB video is used for estimation;
4) POS [31]: plane-orthogonal-to-the-skin (POS) calculates a projection plane orthogonal to the
skin tone based on physiological and optical principles; a fixed matrix projection is applied to the
temporally normalized, spatially averaged pixel values to recover the PPG waveform (a minimal
sketch of this procedure is given below); 5) PBV [32]: a signature, determined by a given light
spectrum and changes in the blood volume pulse, is used to derive the PPG waveform while offsetting
motion and other noise in RGB videos; 6) LGI [33]: a feature representation method that is invariant
to motion through differentiable local transformations.
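To make the POS projection concrete, the following is a minimal NumPy sketch of the core procedure. The function name pos_ppg, the 1.6 s window length, and the implementation details are illustrative assumptions, not the toolbox's exact code.

```python
import numpy as np


def pos_ppg(rgb: np.ndarray, fs: float, win_sec: float = 1.6) -> np.ndarray:
    """Sketch of POS: recover a PPG estimate from spatially averaged RGB traces.

    rgb: array of shape (T, 3) holding per-frame mean R, G, B values.
    fs:  video frame rate in Hz.
    """
    T = rgb.shape[0]
    w = int(win_sec * fs)             # sliding-window length in frames (assumed)
    P = np.array([[0.0, 1.0, -1.0],   # fixed projection onto the plane
                  [-2.0, 1.0, 1.0]])  # orthogonal to the skin tone
    h = np.zeros(T)
    for t in range(T - w + 1):
        block = rgb[t:t + w]                      # (w, 3) window
        cn = block / block.mean(axis=0)           # temporal normalization
        s = cn @ P.T                              # (w, 2) projected signals
        # Alpha-tune the two projections so motion components cancel
        p = s[:, 0] + (s[:, 0].std() / (s[:, 1].std() + 1e-8)) * s[:, 1]
        h[t:t + w] += p - p.mean()                # overlap-add into the output
    return h
```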
3.2.2 Supervised Neural Methods
The following implementations of supervised learning algorithms are included in the toolbox. All im-
plementations were done using PyTorch [
37
]. Common optimization algorithms, such as Adam [
38
]
and AdamW [
39
], and criterion, such as mean squared error (MSE) loss, are utilized for training
except for where noted. The learning rate scheduler typically follows the 1cycle policy [
40
], which
anneals the learning rate from an initial learning rate to some maximum learning rate and then, from
that maximum learning rate, to some learning rate much lower than the initial learning rate. The total
steps in this policy are determined by the number of epochs multiplied by the number of training
batches in an epoch. The 1cycle policy allows for convergence due to the learning rate being adjusted
well below the initial, maximum learning rate throughout the cycle, and after numerous epochs in
which the learning rate is much higher than the final learning rate. We found the 1cycle learning
rate scheduler to provide stable results with convergence using a maximum learning rate of 0.009
and 30 epochs. We provide parameters in the toolbox that can enable the visualization of the losses
and learning rate changes for both the training and validation phases. Further details on these key
visualizations for supervised neural methods are provided in the GitHub repository.
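For illustration, a minimal PyTorch sketch of this schedule using torch.optim.lr_scheduler.OneCycleLR with the maximum learning rate of 0.009 and 30 epochs mentioned above; the model and the number of steps per epoch are placeholders, not the toolbox's actual training loop.

```python
import torch

model = torch.nn.Conv2d(3, 1, kernel_size=3)      # placeholder for an rPPG network
optimizer = torch.optim.AdamW(model.parameters(), lr=0.009)

epochs, steps_per_epoch = 30, 100                  # total steps = epochs * batches/epoch
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.009, epochs=epochs, steps_per_epoch=steps_per_epoch
)

for _ in range(epochs):
    for _ in range(steps_per_epoch):
        # ... forward pass and loss.backward() would go here, then:
        optimizer.step()
        scheduler.step()   # anneal up to max_lr, then far below the initial rate
```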
DeepPhys [4]: A two-branch 2D convolutional attention network architecture. The two representations
(appearance and difference frames) are processed by parallel branches, with the appearance
branch guiding the motion branch via a gated attention mechanism. The target signal is the first
differential of the PPG waveform.
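A minimal PyTorch sketch of such a gated attention block, assuming the mask normalization described in the DeepPhys paper (a sigmoid-activated 1×1 convolution, L1-normalized and scaled by HW/2); channel counts and layer shapes are illustrative.

```python
import torch
import torch.nn as nn


class GatedAttention(nn.Module):
    """Appearance features produce a soft spatial mask that gates motion features."""

    def __init__(self, channels: int):
        super().__init__()
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, motion: torch.Tensor, appearance: torch.Tensor) -> torch.Tensor:
        mask = torch.sigmoid(self.attn(appearance))         # (B, 1, H, W)
        b, _, h, w = mask.shape
        norm = mask.abs().sum(dim=(2, 3), keepdim=True)     # L1 norm per sample
        mask = (h * w) * mask / (2 * norm)                  # scale to sum HW/2
        return motion * mask                                # gated motion features
```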
PhysNet [5]: A 3D convolutional network architecture. Yu et al. compared this 3D-CNN architecture
with a 2D-CNN + RNN architecture, finding that the 3D-CNN version was able to achieve superior
performance.
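As a rough illustration of the 3D-CNN idea (deliberately not the exact PhysNet topology), a tiny spatiotemporal network that maps a clip of T RGB frames to a T-sample PPG estimate by collapsing the spatial dimensions:

```python
import torch
import torch.nn as nn


class Tiny3DPPG(nn.Module):
    """Toy 3D-CNN: clip (B, 3, T, H, W) -> PPG estimate (B, T)."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.head = nn.Conv3d(32, 1, kernel_size=1)  # per-voxel PPG logits

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        x = self.head(self.features(clip))     # (B, 1, T, H, W)
        return x.mean(dim=(3, 4)).squeeze(1)   # spatial average -> (B, T)


ppg = Tiny3DPPG()(torch.randn(1, 3, 64, 36, 36))  # -> shape (1, 64)
```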