SS-VAERR: Self-Supervised Apparent Emotional Reaction
Recognition from Video
Marija Jegorova1, Stavros Petridis1,2, and Maja Pantic1,2
1Meta Reality Labs, London, United Kingdom
2Department of Computing, Imperial College London, United Kingdom
Abstract: This work focuses on apparent emotional reaction recognition (AERR) from video-only input, conducted in a self-supervised fashion. The network is first pre-trained on different self-supervised pretext tasks and later fine-tuned on the downstream target task. Self-supervised learning facilitates the use of pre-trained architectures and larger datasets that might be deemed unfit for the target task, yet can still be useful for learning informative representations and hence provide useful initializations for further fine-tuning on smaller, more suitable data. Our contribution is two-fold: (1) an analysis of different state-of-the-art (SOTA) pretext tasks for the video-only apparent emotional reaction recognition architecture, and (2) an analysis of various combinations of the regression and classification losses that are likely to improve the performance further. Together these two contributions yield the current state-of-the-art performance for video-only spontaneous apparent emotional reaction recognition with continuous annotations.
I. INTRODUCTION
Apparent emotional reaction recognition (AERR) is a broadly applicable branch of computer vision. In this paper we focus specifically on the video-only domain for AERR, for several reasons. First, the audio stream is not always available, and not every apparent emotional reaction is accompanied by a sound. Second, in the audio-visual domain, active speaker detection becomes a whole new problem when multiple speakers appear in the video. Finally, generalising to noisy environments can pose challenges for the audio modality. Hence it is useful to explore efficient AERR restricted solely to the video modality, for the sake of prediction robustness and broader applicability.
Further, this work explores predicting continuous emotion characteristics, arousal and valence (in this paper we call these continuous emotions), instead of the more traditional AERR task of classifying categorical emotions (sadness, fear, surprise, etc.). The reason is that categorical emotion theory is limited in its ability to express subtle and disparate emotions [1].
The current state-of-the-art models for video-only AERR are [2] and [3]. The first presents a model based on probabilistic modeling of the temporal context, with compelling results on the SEWA dataset [4]. A somewhat comparable performance is achieved by [5], using a spatio-temporal higher-order convolutional neural network. Second, for the RECOLA dataset [6], the current SOTA is TS-SATCN [3], a two-stage spatio-temporal attention temporal convolution network. The only other video-only AERR method to be found is the visual ResNet-50 presented in [7], also evaluated on RECOLA.

(This work has been supported by Meta Reality Labs, with the exception of training the pretext architectures on the LRS3 dataset, which was conducted on the servers of Imperial College London.)
A shortage of annotated data for specific tasks and domains often represents a challenge. This can be addressed from several angles, e.g. transfer learning, semi-supervised learning, and self-supervised learning (SSL). We focus on the SSL approach, which can use labelled and unlabelled data within the same model. It relies on pretext training to leverage the additional data, which then serves as an initialization for the downstream training that solves the target task, as sketched below.
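To make the pretext-then-downstream pipeline concrete, the following is a minimal PyTorch-style sketch of the two phases, under illustrative assumptions: a toy stand-in encoder, dummy tensors in place of real clips and annotations, and an L1 feature-prediction pretext loss. It is not the paper's actual implementation.

import torch
import torch.nn as nn

# Hypothetical stand-in for the paper's visual front-end; names and
# dimensions are illustrative, not the actual architecture.
encoder = nn.Sequential(
    nn.Conv3d(1, 32, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(32, 512),
)
clips = torch.randn(4, 1, 16, 88, 88)  # dummy batch: (B, C, frames, H, W)

# Phase 1: pretext training on plentiful unlabelled data. The target here
# is a dummy feature vector; a real pretext task would supply it.
proj_head = nn.Linear(512, 256)
opt = torch.optim.Adam(list(encoder.parameters()) + list(proj_head.parameters()), lr=3e-4)
pretext_target = torch.randn(4, 256)
loss = nn.functional.l1_loss(proj_head(encoder(clips)), pretext_target)
opt.zero_grad(); loss.backward(); opt.step()

# Phase 2: downstream fine-tuning on the small labelled dataset. The
# encoder keeps its pretext weights; only the head is newly initialized.
head = nn.Linear(512, 2)  # arousal and valence
ft_opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-4)
av_labels = torch.empty(4, 2).uniform_(-1, 1)
preds = 2 * torch.sigmoid(head(encoder(clips))) - 1  # squash into (-1, 1)
ft_loss = nn.functional.mse_loss(preds, av_labels)
ft_opt.zero_grad(); ft_loss.backward(); ft_opt.step()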
The works that contributed to the SSL paradigm for facial data in adjacent domains are [8] and [9]. The former presented an SSL framework for a number of tasks, including AERR from images, providing results on AffectNet, a large-scale facial expression image database [10], and is the SOTA for self-supervised AERR on images [8]. The latter, [9], describes contrastive learning across video sequences, specifically for categorical emotions on the acted dataset Oulu-CASIA [11], and is the SOTA for acted AERR from video. Additionally, [12] offered a unified framework for multiple tasks, but it does not surpass [7] and [3] on RECOLA [6].
To the best of our knowledge, a video-only self-supervised framework for natural apparent emotional reaction recognition has not yet been explored; this is what we present in this paper. We compare three different SSL methods for pretext training and investigate the impact of a variety of loss functions during downstream training. We evaluate our proposed method on two different natural emotional reaction datasets (SEWA and RECOLA, [4], [6]) and achieve an improvement of up to 10% over previously published models.
Our main contributions can be summarized as follows: (1) a review of several pretext tasks for apparent emotional reaction recognition from video, assessed by their downstream performance across several spontaneous emotion datasets; (2) an analysis of the impact of combined regression and classification losses, data augmentations, and downstream learning parameters; (3) together, these add up to the first, to our knowledge, Self-Supervised Visual Apparent Emotional Reaction Recognition method for spontaneous emotions with continuous annotations, SS-VAERR. Please see Tab. I for the results.
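As a concrete illustration of the combined regression and classification objective in contribution (2), the following is a minimal sketch of a concordance correlation coefficient (CCC) loss, a standard regression objective for continuous arousal/valence prediction, mixed with a cross-entropy term over labels discretized into bins. The bin count and weighting are illustrative assumptions, not the paper's settings.

import torch
import torch.nn.functional as F

def ccc_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # 1 - CCC, with CCC = 2*cov(x, y) / (var(x) + var(y) + (mean_x - mean_y)^2).
    pm, tm = pred.mean(), target.mean()
    cov = ((pred - pm) * (target - tm)).mean()
    pv, tv = ((pred - pm) ** 2).mean(), ((target - tm) ** 2).mean()
    return 1 - 2 * cov / (pv + tv + (pm - tm) ** 2)

def combined_loss(pred, target, logits, n_bins=20, alpha=0.5):
    # Discretize the continuous target into n_bins classes on [-1, 1],
    # then mix the CCC regression term with a cross-entropy term.
    bins = ((target + 1) / 2 * (n_bins - 1)).round().long().clamp(0, n_bins - 1)
    return alpha * ccc_loss(pred, target) + (1 - alpha) * F.cross_entropy(logits, bins)

# Per-frame arousal for a 100-frame sequence (dummy data).
pred = torch.tanh(torch.randn(100))         # predictions in (-1, 1)
target = torch.empty(100).uniform_(-1, 1)   # continuous annotations
logits = torch.randn(100, 20)               # classification-head outputs
print(combined_loss(pred, target, logits))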
II. RELATED WORK
Apparent Emotional Reaction Recognition is a vast
research field spread across different methods and domains.
[Fig. 1 appears here. Panels: (a) LiRA, (b) BYOL, (c) DINO, and (d) SS-VAERR (downstream fine-tuning). Each panel is built on a 3D Conv + 2D ResNet-18 front-end. In (a), a Conformer and an MLP projection head predict frozen PASE-encoder features of the waveform audio under an L1 loss. In (b), an online network (MLP projector and predictor) predicts the target network's projection under a cross-entropy loss, with stop-gradient on the target side and a moving average over weights from online to target. In (c), a student representer matches a centered, exponential-moving-average teacher representer via cross-entropy between softmax outputs (higher temperature on the student over multiple crops, lower temperature on the teacher over global crops), with stop-gradient on the teacher. In (d), a GRU and a 2 x sigmoid - 1 output yield per-frame arousal and valence under CCC/MSE losses, alongside a softmax over one-hot encoded discretized arousal and valence under CE/nCCE losses.]

Fig. 1. A comparative overview of all the reviewed pretext architectures. Note that the ResNet-18 architecture is used instead of the original ResNet-50 and transformers for BYOL [13] and DINO [14], respectively, for comparability. Blue and green boxes represent network blocks (layers, standard structures, activations/normalization); the green blocks are used to initialize the downstream architecture. Yellow boxes represent training losses, and orange boxes are networks with interconnected weights.
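To make panel (b) concrete, the following is a minimal PyTorch-style sketch of a BYOL-style update: an online encoder/projector/predictor is trained to match the projection of a target network whose weights are an exponential moving average of the online weights, with a stop-gradient on the target side. The stand-in encoder, dimensions, and momentum value are illustrative assumptions; the canonical BYOL objective (MSE between normalized vectors) is used here, whereas the figure indicates a cross-entropy loss in the paper's variant.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Online network: encoder + projector + predictor, as in Fig. 1(b).
online_enc = nn.Sequential(nn.Linear(1024, 512), nn.ReLU())  # toy stand-in encoder
online_proj = nn.Linear(512, 256)
predictor = nn.Linear(256, 256)

# Target network: an EMA copy of the online weights, never backpropped.
target_enc = copy.deepcopy(online_enc)
target_proj = copy.deepcopy(online_proj)
for p in list(target_enc.parameters()) + list(target_proj.parameters()):
    p.requires_grad_(False)

opt = torch.optim.Adam(
    list(online_enc.parameters()) + list(online_proj.parameters())
    + list(predictor.parameters()), lr=3e-4)

view1, view2 = torch.randn(8, 1024), torch.randn(8, 1024)  # two augmented views

# Online prediction of the target projection (stop-gradient on the target side).
pred = F.normalize(predictor(online_proj(online_enc(view1))), dim=-1)
with torch.no_grad():
    targ = F.normalize(target_proj(target_enc(view2)), dim=-1)
loss = (2 - 2 * (pred * targ).sum(dim=-1)).mean()  # MSE of normalized vectors
opt.zero_grad(); loss.backward(); opt.step()

# Exponential moving average of weights, online -> target.
m = 0.996  # illustrative momentum
with torch.no_grad():
    for po, pt in zip(online_enc.parameters(), target_enc.parameters()):
        pt.mul_(m).add_(po, alpha=1 - m)
    for po, pt in zip(online_proj.parameters(), target_proj.parameters()):
        pt.mul_(m).add_(po, alpha=1 - m)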
Domain-wise, there is audio-based, image-based, video-based, and audio-visual AERR, additionally separated into acted and spontaneous/natural AERR. We mostly focus on spontaneous and visual AERR here. Results across the field are reported on different datasets, complicating comparison, which is why the SOTA is reported per dataset. There are also multiple datasets for AERR in different modalities: those discussed in this paper (Sec. IV-B) and others, such as AffNet [15], Oulu-CASIA [11], and AffectNet [10].
Audio-visual or multi-modal AERR tends to yield better results than video-only. Specifically, audio is known to provide a better signal for arousal [16]–[18]. Multi-modal AERR works include [19], a BLSTM-based method using a joint discrete and continuous emotion representation, which holds the current SOTA for multi-modal AERR on the RECOLA [6] development set. However, a ResNet-50-based method presented in [7] achieves SOTA results for RECOLA on the test set, also presenting video-only and audio-only results.
The most prominent examples of image-based AERR include EmoFAN [20], an approach for the direct estimation of facial landmarks and of discrete and continuous emotions from facial images with a single neural network, and [8], an SSL framework designed for a variety of downstream face-related applications, including SOTA results for AERR on images, reported on AffectNet [10].
Video-only AERR is a less explored field, the current SOTA being Affective Processes [2] and TS-SATCN [3]. The first is a neural-processes model with a global stochastic contextual representation, task-aware temporal context modelling, and temporal context selection. The second is a two-stage spatio-temporal attention temporal convolution network.
Self-supervised learning (SSL) focuses on minimising the use of human-generated annotations at training time. It is often used to leverage large amounts of unlabelled data to aid learning on significantly smaller annotated datasets. SSL is rooted in the assumption that solving a seemingly unrelated self-supervised pretext task can help to learn useful visual representations. These then serve as a good initialization point for the task of interest, the downstream task, provided that the model generalizes well and the tasks are similar enough in kind [21]–[24]. If these assumptions are violated, negative transfer may occur [25]: the performance would be worse than that of a model trained entirely from scratch.
There is plenty of research showing the benefits of SSL for general image datasets [26], [27]. SSL techniques vary by both the downstream and pretext tasks. Traditional pretext tasks include transformation classification [28], image inpainting [29], image colorization (from grayscale) [30], [31], and solving jigsaw puzzles [32]. A more recent branch of SSL is contrastive learning, which relies on minimizing the distance between representations of different views of the same sample (and, in many variants, maximizing the distance between representations of different samples).
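As a concrete instance of this idea, below is a minimal sketch of an InfoNCE-style contrastive loss (the common SimCLR-type formulation; the specific losses used in the cited works may differ), with illustrative batch size, embedding size, and temperature.

import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    # Pull matching rows of z1/z2 together; push all other pairs apart.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature   # pairwise cosine similarities
    labels = torch.arange(z1.size(0))    # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Two embeddings per sample, e.g. from two augmented views of the same clip.
z1, z2 = torch.randn(16, 128), torch.randn(16, 128)
print(info_nce(z1, z2))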