
ICMI ’22 Companion, November 7–11, 2022, Bengaluru, India S. Narayana, R. Subramanian, I. Radwan, R. Goecke
Figure 2: Overview of the proposed mood recognition framework. [Diagram blocks: Input; Single-branch CNN (mood labels only); Two-branch CNN (mood and delta labels); Model-fusion (mood and delta labels); Teacher-Student Model; Mood classification.]
while very little research has so far been devoted to automated
mood recognition [15] or the joint modelling of the interplay between
emotion and mood for improved affective state recognition.
Psychological studies on mood have made substantial progress.
An eye-tracking study has revealed that positive mood results in
better global information processing than a negative mood [23].
The authors in [22] have observed a mood-congruity effect, where
positive mood hampers the recognition of mood-incongruent negative
emotions and vice-versa. The mood-emotion loop is a theory
that posits mood and emotion as distinct mechanisms, which affect
each other repeatedly and continuously. This theory argues that
mood is a high-level construct activating latent low-level states
such as emotions [29]. Recognising the interactions between mood
and emotion has the potential to lead to a better understanding
of affective phenomena, such as mood disorders and emotional
regulation.
In contrast, mood recognition has rarely been addressed
from a computational perspective and only a few studies have
explored mood [15]. Body posture and movement correlates of
mood have been explored in [27], where user mood was induced via
musical stimuli and the authors observed that head posture and
movements characterise happy and sad mood. Katsimerou et al. [15] have
examined automatic mood prediction from recognised emotions,
showing that clustered emotions in the valence-arousal space predict
single moods much better than multiple moods within a video.
Research on mood prediction has also neglected to investigate the
interplay between mood and emotion, though the psychological
literature recognises a relationship between the two [20].
From an affective computing viewpoint, developing a mood
recognition framework requires ground-truth mood labels for
model training, but only very few databases record the user mood
(directly or indirectly via an observer). Widely used affective corpora,
such as AFEW-VA [16], HUMAINE [9], SEMAINE [19] and
DECAF [1], only contain dimensional and/or categorical emotion
labels. One of the few datasets with mood ratings is EMMA [14],
where the annotations developed represent the overall emotional
impression of the human annotator (or observer) for the examined
stimulus [14]. Machine learning approaches have been extensively
used for inferring emotions from visual, acoustic, textual and
neurophysiological data [4, 5, 17, 26, 28]. Contemporary studies
emphasise the improved performance of multimodal approaches to
the detection of emotional states vis-à-vis unimodal ones [8]. Recent
studies characterise mood disorders, such as depression, by
examining speech style, eye activity, and head pose [2, 3, 25]. Deng
et al. [6] propose a multitask emotion recognition framework that
can deal with missing labels employing a teacher-student paradigm.
Knowledge Distillation (KD) is a technique that enables the transfer
of knowledge between two neural networks, unifying model
compression and learning with privileged information [12, 18]. KD
techniques have been employed for facial expression recognition
where the teacher has access to a fully visible face, whereas the
student model only has access to occluded faces [10].
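As a rough illustration of the teacher-student idea, the sketch below implements the generic soft-target distillation objective: a weighted sum of the student's cross-entropy against the ground-truth label and the KL divergence between temperature-softened teacher and student outputs. It is not taken from the works cited above; the function names and the hyperparameters `temperature` and `alpha` are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, true_class,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and soft-target KL divergence.

    `temperature` and `alpha` are illustrative hyperparameters (assumptions),
    not values reported in the paper.
    """
    # Hard-label term: cross-entropy of the student against the ground truth.
    student_probs = softmax(student_logits)
    hard = -math.log(student_probs[true_class])

    # Soft-target term: KL(teacher || student) at the raised temperature.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(p * math.log(p / q)
               for p, q in zip(p_teacher, p_student) if p > 0)

    # The T^2 factor keeps the soft term's scale comparable across temperatures.
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * soft
```

When the student reproduces the teacher's logits exactly, the soft-target term vanishes and only the hard-label term remains, which is the sense in which the teacher's knowledge has been transferred.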
While our research is ultimately aimed towards mood prediction
and understanding the interplay between mood and emotions from
video data, the present study is an initial step on this path. We
use the AFEW-VA dataset to derive (a) dominant emotion labels,
which refer to the emotion persisting for the most consecutive frames
(termed mood labels), and (b) Δ or emotion change labels, which
represent the change in emotion over a fixed window size. Given
the sparsity of in-the-wild data with mood annotations and the
preliminary nature of this study, the dominant emotion labels are used
here in lieu of actual mood labels. In the future, we will be using
actual mood labels derived from expert annotators. Fig. 1 illustrates
how emotion change is captured for an exemplar video clip, while
Fig. 2 overviews our dominant emotion or mood prediction framework.
A unimodal 3D Convolutional Neural Network (3D CNN)
is trained using only mood labels, while a two-branch (multimodal)
CNN model, a multi-layer perceptron, and a teacher-student model
are evaluated for fusing emotion-mood information for mood
prediction. Empirical evaluation reveals that incorporating emotion
change information improves mood prediction performance by as
much as 54%, confirming the salience of fine-grained emotional
information for coarse-grained mood prediction. This study makes
the following contributions:
• To the best of our knowledge, from a computational modelling
perspective, this is the first study to examine mood
prediction incorporating both mood and emotional information.
Mood labels are derived from valence annotations,
instead of subjective impressions provided by a human annotator.
• The experimental evaluation of multiple models shows that
incorporating emotional change information is beneficial
and can produce a significant improvement in mood prediction
performance.
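The label derivation outlined in the introduction (a dominant emotion label as the emotion persisting for the most consecutive frames, and Δ labels as the change in emotion over a fixed window) can be sketched as follows. This is a minimal illustration only: the sign-based mapping from valence to three classes, the tie-breaking rule, and the default window size are assumptions made here, not details given in this section.

```python
from itertools import groupby


def valence_to_class(v):
    """Map a per-frame valence value to a coarse class (assumed sign-based scheme)."""
    if v > 0:
        return "positive"
    if v < 0:
        return "negative"
    return "neutral"


def dominant_emotion(valence):
    """Return the class with the longest run of consecutive frames."""
    classes = [valence_to_class(v) for v in valence]
    best_label, best_len = None, 0
    for label, run in groupby(classes):
        n = sum(1 for _ in run)
        if n > best_len:  # on ties, the earliest run is kept (assumption)
            best_label, best_len = label, n
    return best_label


def delta_labels(valence, window=5):
    """Change in valence between frames that are `window` frames apart."""
    return [valence[i + window] - valence[i]
            for i in range(len(valence) - window)]


# Hypothetical per-frame valence annotations for one short clip.
clip = [1, 2, 2, -1, -3, -4, -4, -2, 0, 1]
print(dominant_emotion(clip))        # negative
print(delta_labels(clip, window=3))  # [-2, -5, -6, -3, 1, 4, 5]
```

In the example clip, the negative class spans five consecutive frames, the longest run, so it becomes the clip-level label, while the Δ sequence records how valence moves across each three-frame window.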
2 MATERIALS
2.1 Dataset
Here, the AFEW-VA [16] dataset, a subset of AFEW [7] comprising
600 video clips extracted from feature films at a rate of 25
frames per second, was used. Video clips in this dataset range from