stress samples from another scenario.
II. RELATED WORK
Due to the health consequences of stress, there is extensive
research on stress recognition. Summarizing the numerous studies
that improve stress recognition is beyond the scope of this
work. Instead, we focus on studies that compare various
models or stress datasets to gain insights into trends in
their performance.
There are multiple feature-based models proposed for stress
recognition in various works. Bobade and Vani [2] compare
the stress recognition performance of various machine learning
models trained on hand-crafted features from various physi-
ological signals. They use the WESAD dataset [1] to train
K-Nearest Neighbour (KNN), Linear Discriminant Analysis
(LDA), Random Forest Classifier (RFC), Support Vector Ma-
chine (SVM), etc. They also propose a simple feed-forward
Artificial Neural Network (ANN) trained on the same input.
Their comparison shows that ANN achieves higher accuracy
than other models.
As mentioned before, there are two main types of stress
recognition models: deep neural networks and feature-
based machine learning models. Naturally, questions arise as to
whether one type is better than the other. Zhang et al. [11]
address this question by studying the performance of a deep
neural network and feature-based models on a dataset they
collected. They propose a stress recognition model consisting
of both convolutional neural networks (CNN) and bidirectional
long short-term memory (BiLSTM). For comparison, they
extract HRV features and train popular machine learning
models like SVMs, RFC, AdaBoost, etc. The CNN-LSTM
model takes 10 s of raw ECG signal, whereas the other
machine learning models use HRV features extracted from
60 s of ECG data. Zhang et al. demonstrate that deep neural
networks significantly outperform HRV-based models.
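To make the feature-based input concrete, the following is a minimal sketch of how a few common time-domain HRV features can be computed from the RR intervals of one ECG window. It is an illustration only: the function name `hrv_features` and the chosen feature set (mean heart rate, SDNN, RMSSD, pNN50) are assumptions and are not taken from Zhang et al. [11], and R-peak detection is assumed to have been done beforehand.

```python
import math


def hrv_features(rr_ms):
    """Compute a few time-domain HRV features from RR intervals in milliseconds.

    Hypothetical helper for illustration; the feature set is an assumption.
    """
    if len(rr_ms) < 3:
        raise ValueError("need at least three RR intervals")
    n = len(rr_ms)
    mean_rr = sum(rr_ms) / n
    # SDNN: sample standard deviation of the RR intervals.
    sdnn = math.sqrt(sum((x - mean_rr) ** 2 for x in rr_ms) / (n - 1))
    # Differences between successive RR intervals.
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    # RMSSD: root mean square of the successive differences.
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    # pNN50: fraction of successive differences larger than 50 ms.
    pnn50 = sum(1 for d in diffs if abs(d) > 50) / len(diffs)
    return {
        "mean_hr_bpm": 60000.0 / mean_rr,
        "sdnn_ms": sdnn,
        "rmssd_ms": rmssd,
        "pnn50": pnn50,
    }


if __name__ == "__main__":
    # RR intervals (ms) from a short, made-up window.
    print(hrv_features([800, 810, 790, 860, 800]))
```

A longer window such as 60 s yields more RR intervals and therefore more stable estimates of these statistics, which is one plausible reason feature-based models use longer inputs than models operating on the raw signal.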
Dzieżyc et al. [14] compare various deep learning models
on their performance in emotion recognition tasks (including
stress conditions). They perform an extensive study on four
different datasets separately. They choose an input signal length
of 50-60 s, which is longer than the typical input length for
deep learning models. They note that CNN-based models tend
to perform better than LSTM-based models.
All the above works train and test the stress recognition
models on the same dataset. Cho et al. [15] consider two
datasets differing in size and train ECG-based deep learning
models to detect stress. They propose a transfer learning ap-
proach, which involves training a model on the bigger dataset
and then fine-tuning it on the smaller dataset. They observe
that the stress recognition on the smaller dataset improves
through transfer learning. Other than the size, the datasets were
very similar (e.g. same ECG sensor and configuration). The
authors note that when data from other datasets are used, their
model shows high bias to the type of stressor and a dependency
on the sensor used. In line with this observation, Liapis et
al. [16] demonstrate that a high stress recognition accuracy on
one dataset does not necessarily translate to high accuracy in
another dataset. To this end, they extract Skin Conductance
(SC) features from the WESAD dataset [1] and train four
machine learning models for stress recognition. These models
achieve high accuracy when tested on the WESAD dataset.
However, they do not achieve good results on input signals
from a different dataset (the UX evaluation dataset). Since the UX
evaluation dataset is annotated primarily for emotion and not
stress, it is difficult to draw conclusions about the generalizability of
the models. Nevertheless, their observation highlights the need
for cross-dataset evaluations and assessing the generalizability
of the stress recognition models.
As a first step towards combining stress datasets for devel-
oping generic models, Baird et al. [17] evaluate three datasets
on their ability to predict cortisol values. Cortisol values are
considered the ground truth for stress response. As they note,
the scales of cortisol values of the datasets are incompatible
and thus, a cross-dataset evaluation is not feasible. However,
all three datasets were collected through similar Trier Social
Stress Test (TSST) procedures. So, the responses in each
condition of the test are expected to be similar and therefore,
the trends in predicted cortisol values can be compared. To
this end, they extract features from the speech signals in the
datasets and train models for each dataset. They highlight
the feasibility of using speech signals from one dataset as
predictors of stress in another dataset.
III. APPROACH
Deep learning models trained directly on the ECG signal
typically outperform models based on hand-crafted HRV
features on a given dataset [11]. However, it remains unexplored
whether these deep learning models perform equally well in
cross-dataset evaluations. To investigate this, we train five
stress recognition models: two deep learning models using ECG
signals as input, and
three models based on hand-crafted HRV features. First, we
train and evaluate the stress models on the same dataset using
leave-one-subject-out (LOSO) cross-validation. We perform
this evaluation on two different datasets. Then, we evaluate
the LOSO models trained on dataset A using samples from
the other dataset B (cross-dataset evaluation) to assess their
generalization capabilities. Baird et al. [17] note that machine
learning models can benefit from combining stress datasets
as it increases the data available for training. It has not been
investigated whether this holds when the datasets are vastly different,
especially in terms of the stressors, the intensity of stress
experienced, and the brand of sensors used. Therefore, we additionally
train the models on a combined dataset (merging samples from
the two datasets) and evaluate them using LOSO validation.
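The three evaluation protocols described above (within-dataset LOSO, cross-dataset evaluation of the LOSO models, and LOSO on a combined dataset) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: all function names are hypothetical, accuracy is assumed as the metric, and a toy one-dimensional threshold classifier (`fit_threshold`) stands in for the five stress recognition models.

```python
def loso_splits(subject_ids):
    """Yield (held_out, train_idx, test_idx), one fold per subject."""
    for held_out in dict.fromkeys(subject_ids):  # unique IDs, first-seen order
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train_idx, test_idx


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_threshold(X, y):
    """Toy stand-in model (assumption): threshold between the class means."""
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    thr = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda X_new: [1 if x > thr else 0 for x in X_new]


def loso_and_cross_dataset(X_a, y_a, subj_a, X_b, y_b, fit):
    """Train one model per LOSO fold of dataset A.

    Each fold's model is tested on the held-out subject of A (within-dataset)
    and on all samples of dataset B (cross-dataset). Returns mean accuracies.
    """
    within, cross = [], []
    for _, train_idx, test_idx in loso_splits(subj_a):
        predict = fit([X_a[i] for i in train_idx], [y_a[i] for i in train_idx])
        y_test = [y_a[i] for i in test_idx]
        within.append(accuracy(y_test, predict([X_a[i] for i in test_idx])))
        cross.append(accuracy(y_b, predict(X_b)))
    return sum(within) / len(within), sum(cross) / len(cross)


def combine(datasets):
    """Merge datasets given as {name: (X, y, subjects)}.

    Subject IDs are prefixed with the dataset name so that subjects from
    different datasets never share a LOSO fold.
    """
    X, y, subj = [], [], []
    for name, (X_d, y_d, subj_d) in datasets.items():
        X.extend(X_d)
        y.extend(y_d)
        subj.extend(f"{name}/{s}" for s in subj_d)
    return X, y, subj


if __name__ == "__main__":
    # Made-up data: dataset B has the same labels but a shifted feature scale.
    X_a, y_a = [0.1, 0.9, 0.2, 0.8, 0.15, 0.85], [0, 1, 0, 1, 0, 1]
    subj_a = ["s1", "s1", "s2", "s2", "s3", "s3"]
    X_b, y_b, subj_b = [2.0, 3.0, 2.1, 3.1], [0, 1, 0, 1], ["s1", "s1", "s2", "s2"]
    print(loso_and_cross_dataset(X_a, y_a, subj_a, X_b, y_b, fit_threshold))
    X_c, y_c, subj_c = combine({"A": (X_a, y_a, subj_a), "B": (X_b, y_b, subj_b)})
    print(len(list(loso_splits(subj_c))), "LOSO folds on the combined dataset")
```

In this toy example the model separates the classes perfectly within dataset A but degrades on dataset B because of the shifted feature scale, which mirrors the kind of dataset dependency that cross-dataset evaluation is meant to expose.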
A. Datasets
1) WESAD: WESAD [1] is a multimodal dataset that
contains motion (ACC) and physiological (ECG, EDA, etc.)
signals, which were collected using chest-worn RespiBAN and
wrist-worn Empatica E4 devices. The data was collected from
15 participants under three conditions: baseline, stress, and
amusement. Stress was elicited using the Trier Social Stress
Test (TSST) involving public speaking and mental arithmetic