SS-VAERR: Self-Supervised Apparent Emotional Reaction
Recognition from Video
Marija Jegorova1, Stavros Petridis1,2, and Maja Pantic1,2
1Meta Reality Labs, London, United Kingdom
2Department of Computing, Imperial College London, United Kingdom
Abstract: This work focuses on apparent emotional reaction recognition (AERR) from video-only input, conducted in a self-supervised fashion. The network is first pre-trained on different self-supervised pretext tasks and later fine-tuned on the downstream target task. Self-supervised learning facilitates the use of pre-trained architectures and larger datasets that might be deemed unfit for the target task, yet can still be useful for learning informative representations and hence provide useful initializations for further fine-tuning on smaller, more suitable data. Our contribution is two-fold: (1) an analysis of different state-of-the-art (SOTA) pretext tasks for the video-only apparent emotional reaction recognition architecture, and (2) an analysis of various combinations of the regression and classification losses that are likely to improve the performance further. Together these two contributions yield the current state-of-the-art performance for video-only spontaneous apparent emotional reaction recognition with continuous annotations.
I. INTRODUCTION
Apparent emotional reaction recognition (AERR) is a broadly applicable branch of computer vision. In this paper we focus specifically on the video-only domain for AERR, for several reasons. First, the audio stream is not always available, and not every apparent emotional reaction is accompanied by a sound. Second, in the audio-visual domain, active speaker detection becomes a whole new problem when multiple speakers appear in the video. Finally, generalising to noisy environments can pose challenges for the audio modality. Hence it is useful to explore efficient AERR restricted solely to the video modality, for the sake of prediction robustness and broader applicability.
Further, this work explores predicting continuous emotion characteristics, arousal and valence (in this paper we call these continuous emotions), instead of the more traditional AERR task of classifying categorical emotions (sadness, fear, surprise, etc.). The reason is that categorical emotion theory is limited in its ability to express subtle and disparate emotions [1].
The current state-of-the-art models for video-only AERR are [2] and [3]. The first presents a model based on probabilistic modeling of the temporal context, with compelling results on the SEWA dataset [4]. A somewhat comparable performance is achieved by [5], using a spatio-temporal higher-order convolutional neural network. Second, for the RECOLA dataset [6], the current SOTA is TS-SATCN [3], a two-stage spatio-temporal attention temporal convolution network. The only other video-only AERR method to be found is the visual ResNet-50 presented in [7], also evaluated on RECOLA.

(This work has been supported by Meta Reality Labs, with the exception of training the pretext architectures on the LRS3 dataset, which was conducted on the servers of Imperial College London.)
A shortage of annotated data for specific tasks and domains often represents a challenge. This can be addressed from several angles, e.g. transfer learning, semi-supervised learning, and self-supervised learning (SSL). We focus on the SSL approach, which can use labelled and unlabelled data within the same model. It relies on pretext training to leverage the additional data, which then serves as an initialization for the downstream training that solves the target task, as sketched below.
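To make the pretext-then-downstream pipeline concrete, the following is a minimal PyTorch-style sketch of the two phases, under illustrative assumptions: a toy stand-in encoder, dummy tensors in place of real clips and annotations, and an L1 feature-prediction pretext loss. It is not the paper's actual implementation.

import torch
import torch.nn as nn

# Hypothetical stand-in for the paper's visual front-end; names and
# dimensions are illustrative, not the actual architecture.
encoder = nn.Sequential(
    nn.Conv3d(1, 32, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(32, 512),
)
clips = torch.randn(4, 1, 16, 88, 88)  # dummy batch: (B, C, frames, H, W)

# Phase 1: pretext training on plentiful unlabelled data. The target here
# is a dummy feature vector; a real pretext task would supply it.
proj_head = nn.Linear(512, 256)
opt = torch.optim.Adam(list(encoder.parameters()) + list(proj_head.parameters()), lr=3e-4)
pretext_target = torch.randn(4, 256)
loss = nn.functional.l1_loss(proj_head(encoder(clips)), pretext_target)
opt.zero_grad(); loss.backward(); opt.step()

# Phase 2: downstream fine-tuning on the small labelled dataset. The
# encoder keeps its pretext weights; only the head is newly initialized.
head = nn.Linear(512, 2)  # arousal and valence
ft_opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-4)
av_labels = torch.empty(4, 2).uniform_(-1, 1)
preds = 2 * torch.sigmoid(head(encoder(clips))) - 1  # squash into (-1, 1)
ft_loss = nn.functional.mse_loss(preds, av_labels)
ft_opt.zero_grad(); ft_loss.backward(); ft_opt.step()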
The works that contributed to the SSL paradigm for facial data in adjacent domains are [8] and [9]. The former presented an SSL framework for a number of tasks, including AERR from images, providing results on AffectNet, a large-scale facial expression image database [10], and is the SOTA for self-supervised AERR on images [8]. The latter, [9], describes contrastive learning across video sequences, specifically for categorical emotions on the acted dataset Oulu-CASIA [11], and is the SOTA for acted AERR from video. Additionally, [12] offered a unified framework for multiple tasks, but it does not surpass [7] and [3] on RECOLA [6].
To the best of our knowledge, a video-only self-supervised framework for natural apparent emotional reaction recognition has not yet been explored; this is what we present in this paper. We compare three different SSL methods for pretext training and investigate the impact of a variety of loss functions during downstream training. We evaluate our proposed method on two different natural emotional reaction datasets (SEWA and RECOLA, [4], [6]) and achieve an improvement of up to 10% over previously published models.
Our main contributions can be summarized as follows: (1) a review of several pretext tasks for apparent emotional reaction recognition from video, assessed by their downstream performance across several spontaneous emotion datasets; (2) an analysis of the impact of combined regression and classification losses, data augmentations, and downstream learning parameters; (3) together, these add up to the first, to our knowledge, Self-Supervised Visual Apparent Emotional Reaction Recognition method for spontaneous emotions with continuous annotations, SS-VAERR. Please see Tab. I for the results.
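As a concrete illustration of the combined regression and classification objective in contribution (2), the following is a minimal sketch of a concordance correlation coefficient (CCC) loss, a standard regression objective for continuous arousal/valence prediction, mixed with a cross-entropy term over labels discretized into bins. The bin count and weighting are illustrative assumptions, not the paper's settings.

import torch
import torch.nn.functional as F

def ccc_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # 1 - CCC, with CCC = 2*cov(x, y) / (var(x) + var(y) + (mean_x - mean_y)^2).
    pm, tm = pred.mean(), target.mean()
    cov = ((pred - pm) * (target - tm)).mean()
    pv, tv = ((pred - pm) ** 2).mean(), ((target - tm) ** 2).mean()
    return 1 - 2 * cov / (pv + tv + (pm - tm) ** 2)

def combined_loss(pred, target, logits, n_bins=20, alpha=0.5):
    # Discretize the continuous target into n_bins classes on [-1, 1],
    # then mix the CCC regression term with a cross-entropy term.
    bins = ((target + 1) / 2 * (n_bins - 1)).round().long().clamp(0, n_bins - 1)
    return alpha * ccc_loss(pred, target) + (1 - alpha) * F.cross_entropy(logits, bins)

# Per-frame arousal for a 100-frame sequence (dummy data).
pred = torch.tanh(torch.randn(100))         # predictions in (-1, 1)
target = torch.empty(100).uniform_(-1, 1)   # continuous annotations
logits = torch.randn(100, 20)               # classification-head outputs
print(combined_loss(pred, target, logits))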
II. RELATED WORK
Apparent Emotional Reaction Recognition is a vast
research field spread across different methods and domains.
[Fig. 1 appears here. Panels: (a) LiRA, (b) BYOL, (c) DINO, and (d) SS-VAERR (downstream fine-tuning). Each panel is built on a 3D Conv + 2D ResNet-18 front-end. In (a), a Conformer and an MLP projection head predict frozen PASE-encoder features of the waveform audio under an L1 loss. In (b), an online network (MLP projector and predictor) predicts the target network's projection under a cross-entropy loss, with stop-gradient on the target side and a moving average over weights from online to target. In (c), a student representer matches a centered, exponential-moving-average teacher representer via cross-entropy between softmax outputs (higher temperature on the student over multiple crops, lower temperature on the teacher over global crops), with stop-gradient on the teacher. In (d), a GRU and a 2 x sigmoid - 1 output yield per-frame arousal and valence under CCC/MSE losses, alongside a softmax over one-hot encoded discretized arousal and valence under CE/nCCE losses.]

Fig. 1. A comparative overview of all the reviewed pretext architectures. Note that the ResNet-18 architecture is used instead of the original ResNet-50 and transformers for BYOL [13] and DINO [14], respectively, for comparability. Blue and green boxes represent network blocks (layers, standard structures, activations/normalization); the green blocks are used to initialize the downstream architecture. Yellow boxes represent training losses, and orange boxes are networks with interconnected weights.
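To make panel (b) concrete, the following is a minimal PyTorch-style sketch of a BYOL-style update: an online encoder/projector/predictor is trained to match the projection of a target network whose weights are an exponential moving average of the online weights, with a stop-gradient on the target side. The stand-in encoder, dimensions, and momentum value are illustrative assumptions; the canonical BYOL objective (MSE between normalized vectors) is used here, whereas the figure indicates a cross-entropy loss in the paper's variant.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Online network: encoder + projector + predictor, as in Fig. 1(b).
online_enc = nn.Sequential(nn.Linear(1024, 512), nn.ReLU())  # toy stand-in encoder
online_proj = nn.Linear(512, 256)
predictor = nn.Linear(256, 256)

# Target network: an EMA copy of the online weights, never backpropped.
target_enc = copy.deepcopy(online_enc)
target_proj = copy.deepcopy(online_proj)
for p in list(target_enc.parameters()) + list(target_proj.parameters()):
    p.requires_grad_(False)

opt = torch.optim.Adam(
    list(online_enc.parameters()) + list(online_proj.parameters())
    + list(predictor.parameters()), lr=3e-4)

view1, view2 = torch.randn(8, 1024), torch.randn(8, 1024)  # two augmented views

# Online prediction of the target projection (stop-gradient on the target side).
pred = F.normalize(predictor(online_proj(online_enc(view1))), dim=-1)
with torch.no_grad():
    targ = F.normalize(target_proj(target_enc(view2)), dim=-1)
loss = (2 - 2 * (pred * targ).sum(dim=-1)).mean()  # MSE of normalized vectors
opt.zero_grad(); loss.backward(); opt.step()

# Exponential moving average of weights, online -> target.
m = 0.996  # illustrative momentum
with torch.no_grad():
    for po, pt in zip(online_enc.parameters(), target_enc.parameters()):
        pt.mul_(m).add_(po, alpha=1 - m)
    for po, pt in zip(online_proj.parameters(), target_proj.parameters()):
        pt.mul_(m).add_(po, alpha=1 - m)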
Domain-wise, there is audio-based, image-based, video-based, and audio-visual AERR, additionally separated into acted and spontaneous/natural AERR. We mostly focus on spontaneous and visual AERR here. Results across the field are reported on different datasets, complicating comparison, which is why the SOTA is reported per dataset. There are also multiple datasets for AERR in different modalities: those discussed in this paper (Sec. IV-B) and others, such as AffNet [15], Oulu-CASIA [11], and AffectNet [10].
Audio-visual or multi-modal AERR tends to yield better results than video-only. Specifically, audio is known to provide a better signal for arousal [16]–[18]. Multi-modal AERR works include [19], a BLSTM-based method using a joint discrete and continuous emotion representation, which holds the current SOTA for multi-modal AERR on the RECOLA [6] development set. However, a ResNet-50-based method presented in [7] achieves SOTA results for RECOLA on the test set, also presenting video-only and audio-only results.
The most prominent examples of image-based AERR include EmoFAN [20], an approach for the direct estimation of facial landmarks and of discrete and continuous emotions from facial images with a single neural network, and [8], an SSL framework designed for a variety of downstream face-related applications, including SOTA results for AERR on images, reported on AffectNet [10].
Video-only AERR is a less explored field, the current SOTA being Affective Processes [2] and TS-SATCN [3]. The first is a neural-processes model with a global stochastic contextual representation, task-aware temporal context modelling, and temporal context selection. The second is a two-stage spatio-temporal attention temporal convolution network.
Self-supervised learning (SSL) focuses on minimising the use of human-generated annotations at training time. It is often used to leverage large amounts of unlabelled data to aid learning on significantly smaller annotated datasets. SSL is rooted in the assumption that solving a seemingly unrelated self-supervised pretext task can help to learn useful visual representations. These then serve as a good initialization point for the task of interest, the downstream task, provided that the model generalizes well and the tasks are similar enough in kind [21]–[24]. If these assumptions are violated, negative transfer may occur [25]: the performance would be worse than that of a model trained entirely from scratch.
There is plenty of research showing the benefits of SSL for general image datasets [26], [27]. SSL techniques vary by both the downstream and pretext tasks. Traditional pretext tasks include transformation classification [28], image inpainting [29], image colorization (from grayscale) [30], [31], and solving jigsaw puzzles [32]. A more recent branch of SSL is contrastive learning, which relies on minimizing the distance between representations of different views of the same sample (and, in many variants, maximizing the distance between representations of different samples).
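As a concrete instance of this idea, below is a minimal sketch of an InfoNCE-style contrastive loss (the common SimCLR-type formulation; the specific losses used in the cited works may differ), with illustrative batch size, embedding size, and temperature.

import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    # Pull matching rows of z1/z2 together; push all other pairs apart.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature   # pairwise cosine similarities
    labels = torch.arange(z1.size(0))    # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Two embeddings per sample, e.g. from two augmented views of the same clip.
z1, z2 = torch.randn(16, 128), torch.randn(16, 128)
print(info_nce(z1, z2))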