SS-VAERR: Self-Supervised Apparent Emotional Reaction
Recognition from Video
Marija Jegorova1, Stavros Petridis1,2, and Maja Pantic1,2
1Meta Reality Labs, London, United Kingdom
2Department of Computing, Imperial College London, United Kingdom
This work has been supported by Meta Reality Labs, with the exception of training the pretext architectures on the LRS3 dataset, which was conducted on the servers of Imperial College London.
Abstract— This work focuses on apparent emotional reaction recognition (AERR) from video-only input, conducted in a self-supervised fashion. The network is first pre-trained on different self-supervised pretext tasks and later fine-tuned on the downstream target task. Self-supervised learning facilitates the use of pre-trained architectures and of larger datasets that may be unfit for the target task itself, yet useful for learning informative representations, and hence provides useful initializations for further fine-tuning on smaller, more suitable data. Our contribution is two-fold: (1) an analysis of different state-of-the-art (SOTA) pretext tasks for a video-only apparent emotional reaction recognition architecture, and (2) an analysis of various combinations of regression and classification losses that are likely to improve performance further. Together, these two contributions yield the current state-of-the-art performance for video-only spontaneous apparent emotional reaction recognition with continuous annotations.
I. INTRODUCTION
Apparent emotional reaction recognition (AERR) is a broadly applicable branch of computer vision. In this paper we focus specifically on the video-only domain of AERR, for several reasons. First, the audio stream is not always available, and not every apparent emotional reaction is accompanied by sound. Second, in the audio-visual domain, active speaker detection becomes a problem of its own when multiple speakers appear in the video. Finally, generalising to noisy environments can be challenging for audio. It is therefore useful to explore efficient AERR restricted solely to the video modality, for the sake of prediction robustness and broader applicability.
Furthermore, this work explores predicting the continuous emotion characteristics of arousal and valence (which we refer to in this paper as continuous emotions), rather than the more traditional AERR task of classifying categorical emotions (sadness, fear, surprise, etc.). The reason is that categorical emotion theory is limited in its ability to express subtle and disparate emotions [1].
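To make the prediction target concrete, the following is a minimal PyTorch sketch of a regression head for continuous emotions; the module name, feature dimension, and bounding via tanh are illustrative assumptions, not the exact architecture used in this paper:

    import torch.nn as nn

    class ContinuousEmotionHead(nn.Module):
        # Maps per-frame visual features to a (valence, arousal) pair.
        def __init__(self, feat_dim=512):
            super().__init__()
            self.proj = nn.Linear(feat_dim, 2)  # two continuous outputs

        def forward(self, feats):           # feats: (batch, time, feat_dim)
            return self.proj(feats).tanh()  # bounded to [-1, 1] per frame

In contrast to a categorical classifier with a fixed label set, such a head can express arbitrarily subtle blends of affect as points in the valence-arousal plane.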
The current state of the art for video-only AERR is set by [2] and [3]. The first presents a model based on probabilistic modeling of the temporal context, with compelling results on the SEWA dataset [4]; a somewhat comparable performance is achieved by [5], using a spatio-temporal higher-order convolutional neural network. The second, TS-SATCN [3], a two-stage spatio-temporal attention temporal convolution network, is the current SOTA on the RECOLA dataset [6]. The only additional video-only AERR method to be found is the visual ResNet-50 presented in [7], also evaluated on RECOLA.
A shortage of annotated data for specific tasks and domains often poses a challenge. It can be addressed from several angles, e.g. transfer learning, semi-supervised learning, or self-supervised learning (SSL). We focus on the SSL approach, which can use both labelled and unlabelled data within the same model: pretext training leverages the additional data, and the resulting weights then serve as an initialization for the downstream training that solves the target task.
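A minimal sketch of this two-stage paradigm, assuming generic PyTorch components (the encoder, pretext head, data loaders, and optimizer settings below are placeholders, not the specific pretext methods compared in this paper):

    import torch

    def pretrain_pretext(encoder, pretext_head, unlabelled_loader, steps=10000):
        # Stage 1: learn representations from unlabelled video via a pretext loss.
        params = list(encoder.parameters()) + list(pretext_head.parameters())
        opt = torch.optim.Adam(params)
        for _, clips in zip(range(steps), unlabelled_loader):
            loss = pretext_head(encoder(clips))  # pretext objective, e.g. contrastive
            opt.zero_grad(); loss.backward(); opt.step()
        return encoder  # serves as the initialization for the downstream task

    def finetune_downstream(encoder, head, labelled_loader, epochs=10):
        # Stage 2: fine-tune on the smaller labelled emotion dataset.
        params = list(encoder.parameters()) + list(head.parameters())
        opt = torch.optim.Adam(params)
        for _ in range(epochs):
            for clips, targets in labelled_loader:
                loss = torch.nn.functional.mse_loss(head(encoder(clips)), targets)
                opt.zero_grad(); loss.backward(); opt.step()
        return encoder, head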
Works that have contributed to the SSL paradigm for facial data in adjacent domains include [8] and [9]. The first presented an SSL framework for a number of tasks, including AERR from images, reporting results on AffectNet, a large-scale facial expression image database [10]; it is the SOTA for self-supervised AERR on images [8]. The second, [9], describes contrastive learning across video sequences, specifically for categorical emotions on the acted Oulu-CASIA dataset [11], and is the SOTA for acted AERR from video. Additionally, [12] offered a unified framework for multiple tasks, but it does not surpass [7] and [3] on RECOLA [6].
To the best of our knowledge, a video-only self-supervised framework for natural apparent emotional reaction recognition has not yet been explored; this is what we present in this paper. We compare three different SSL methods for pretext training and investigate the impact of a variety of loss functions during downstream training. We evaluate the proposed method on two natural emotional reaction datasets (SEWA and RECOLA, [4], [6]) and improve on previously published models by up to 10%. Our main contributions can be summarized as follows: (1) a review of several pretext tasks for apparent emotional reaction recognition from video, assessed by their downstream performance across several spontaneous emotion datasets; (2) an analysis of the impact of combined regression and classification losses, data augmentations, and downstream learning parameters (a representative loss combination is sketched below); and (3) SS-VAERR, to our knowledge the first self-supervised visual apparent emotional reaction recognition method for spontaneous emotions with continuous annotations. Please see Tab. I for the results.
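As an illustration of the loss combinations studied in (2), one representative objective pairs a concordance-correlation-coefficient (CCC) regression term with a cross-entropy term over discretized valence/arousal bins; the choice of CCC and the weight $\lambda$ are assumptions for this sketch, not the full set of losses we evaluate:

$$\mathcal{L} = \big(1 - \mathrm{CCC}(\hat{y}, y)\big) + \lambda\, \mathrm{CE}(\hat{c}, c), \qquad \mathrm{CCC}(\hat{y}, y) = \frac{2\rho\,\sigma_{\hat{y}}\sigma_{y}}{\sigma_{\hat{y}}^{2} + \sigma_{y}^{2} + (\mu_{\hat{y}} - \mu_{y})^{2}},$$

where $\hat{y}$ and $y$ are the predicted and annotated continuous values, $\hat{c}$ and $c$ the corresponding discretized class targets, $\rho$ the Pearson correlation between $\hat{y}$ and $y$, and $\lambda$ a weighting hyperparameter.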
II. RELATED WORK
Apparent Emotional Reaction Recognition is a vast research field, spanning a variety of methods and domains.