Preliminary methods focused on recognising the six basic discrete emotions defined by the psychologists Ekman and Friesen [9]: anger, surprise, disgust, enjoyment, fear, and sadness. As research progressed, datasets such as the EMOTIC dataset [22, 23, 24] have expanded on these to provide a wider label set. A second class of emotion recognition methods focuses not on discrete classes but on a continuous set of labels described by Mehrabian [30]: Valence (V), Arousal (A), and Dominance (D). We evaluate the performance of our model using both the 26 discrete classes in the EMOTIC dataset [23] and the valence, arousal, and dominance errors. Our method works on visual cues, more specifically on images and body crops.
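For concreteness, the snippet below sketches this kind of evaluation, assuming per-image scores for the 26 categories and continuous VAD predictions stored as NumPy arrays; the array names and the exact metric choices (per-class average precision and mean absolute VAD error) are illustrative rather than the benchmark's official protocol.

```python
import numpy as np

def mean_vad_error(vad_pred: np.ndarray, vad_true: np.ndarray) -> np.ndarray:
    """Mean absolute error per continuous dimension (Valence, Arousal, Dominance)."""
    return np.abs(vad_pred - vad_true).mean(axis=0)

def average_precision(scores: np.ndarray, labels: np.ndarray) -> float:
    """Average precision for one discrete category (binary ground-truth labels)."""
    order = np.argsort(-scores)
    labels = labels[order]
    cum_pos = np.cumsum(labels)
    precision = cum_pos / np.arange(1, len(labels) + 1)
    return float((precision * labels).sum() / max(labels.sum(), 1))

# Example with random placeholder data: 100 images, 26 categories, 3 VAD dimensions.
rng = np.random.default_rng(0)
cat_scores = rng.random((100, 26))
cat_labels = (rng.random((100, 26)) > 0.8).astype(int)
vad_pred, vad_true = rng.random((100, 3)), rng.random((100, 3))

mAP = np.mean([average_precision(cat_scores[:, c], cat_labels[:, c]) for c in range(26)])
print("mAP over 26 categories:", mAP)
print("V/A/D mean absolute error:", mean_vad_error(vad_pred, vad_true))
```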
Emotion recognition using facial features. Most existing methods in Computer Vision for emotion recognition focus on facial expression analysis [2, 12, 33, 37]. Initial work in this field was based on the Facial Action Coding System (FACS) [4, 10, 11, 26, 34] to recognise the basic set of emotions. FACS refers to a set of facial muscle movements that correspond to a displayed emotion; for instance, raising the inner eyebrow can be considered one unit of FACS. These methods first extract facial landmarks from a face, which are then used to create facial action units, a combination of which is used to recognise the emotion.
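To illustrate this landmark-to-action-unit pipeline, the following is a minimal sketch; the landmark indices, the single action-unit rule, and the classifier stand-in are all hypothetical and only indicate the overall flow.

```python
import numpy as np

# Hypothetical landmark layout: each face is a (68, 2) array of (x, y) points.
INNER_BROW, UPPER_EYELID = 21, 38  # illustrative indices, not a fixed standard

def action_unit_features(landmarks: np.ndarray) -> np.ndarray:
    """Turn raw landmarks into a small vector of AU-like measurements."""
    face_scale = np.linalg.norm(landmarks.max(axis=0) - landmarks.min(axis=0))
    # AU-like cue: vertical gap between inner brow and upper eyelid,
    # normalised by face size (a rough proxy for an "inner brow raiser").
    brow_raise = (landmarks[UPPER_EYELID, 1] - landmarks[INNER_BROW, 1]) / face_scale
    return np.array([brow_raise])

def classify_emotion(au_vector: np.ndarray) -> str:
    """Stand-in for a classifier trained on combinations of action units."""
    return "surprise" if au_vector[0] > 0.15 else "neutral"

landmarks = np.random.rand(68, 2) * 200   # placeholder landmark detections
print(classify_emotion(action_unit_features(landmarks)))
```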
Another class of methods uses CNNs to recognise emotions [2, 19, 22, 23, 32, 42]. For instance, Emotionnet [2] uses a face detector to obtain face crops, which are then passed into a CNN to predict the emotion category. Similar to these methods, we use facial landmarks in our work. Uniquely, however, the landmarks are used to create the PAS contextual images, which in turn modulate the main network through a series of convolutional layers in the Cont-In blocks.
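As a rough illustration of this kind of landmark-driven modulation, the sketch below applies multiplicative and additive maps, derived from a single-channel landmark image, to an intermediate feature map of a backbone. The module name, channel sizes, and FiLM-style modulation form are assumptions for illustration, not the actual PAS or Cont-In implementation.

```python
import torch
import torch.nn as nn

class ContextModulation(nn.Module):
    """Modulate a backbone feature map with a landmark-derived context image."""
    def __init__(self, feat_channels: int, ctx_channels: int = 1):
        super().__init__()
        # Small conv stack mapping the context image to per-channel
        # multiplicative (gamma) and additive (beta) modulation maps.
        self.ctx_net = nn.Sequential(
            nn.Conv2d(ctx_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 2 * feat_channels, kernel_size=3, padding=1),
        )

    def forward(self, feat: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # Resize the context image to the spatial size of the feature map.
        ctx = nn.functional.interpolate(ctx, size=feat.shape[-2:],
                                        mode="bilinear", align_corners=False)
        gamma, beta = self.ctx_net(ctx).chunk(2, dim=1)
        return feat * (1 + gamma) + beta  # feature-wise modulation

# Usage: modulate a 256-channel feature map with a 1-channel landmark map.
block = ContextModulation(feat_channels=256)
feat = torch.randn(2, 256, 28, 28)
ctx = torch.randn(2, 1, 224, 224)
print(block(feat, ctx).shape)  # torch.Size([2, 256, 28, 28])
```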
Emotion recognition using body poses. Unlike facial emotion recognition, work on emotion recognition using body poses is relatively new. Research in psychology [3, 13, 14] suggests that cues from body pose, including features such as the hip, shoulder, elbow, pelvis, neck, and trunk, can provide significant insight into the emotional state of a person. Based on this hypothesis, Crenn et al. [6] sought to classify body expressions by obtaining low-level features from 3D skeleton sequences. They separate these features into three categories: geometric features, motion features, and Fourier features. From the low-level features they calculate meta-features (mean and variance), which are sent to a classifier to obtain the final expression labels.
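The sketch below illustrates this style of hand-crafted pipeline with simplified stand-ins for the geometric, motion, and Fourier descriptors and their mean/variance meta-features; the exact features in [6] differ, so treat these definitions as placeholders.

```python
import numpy as np

def meta_features(skeleton_seq: np.ndarray) -> np.ndarray:
    """skeleton_seq: (T, J, 3) array of J joints tracked over T frames."""
    T, J, _ = skeleton_seq.shape
    # Geometric: pairwise joint distances within each frame.
    diffs = skeleton_seq[:, :, None, :] - skeleton_seq[:, None, :, :]
    geometric = np.linalg.norm(diffs, axis=-1).reshape(T, -1)
    # Motion: frame-to-frame joint displacement magnitudes.
    motion = np.linalg.norm(np.diff(skeleton_seq, axis=0), axis=-1)
    # Fourier: magnitude spectrum of each joint coordinate over time.
    fourier = np.abs(np.fft.rfft(skeleton_seq, axis=0)).reshape(T // 2 + 1, -1)
    # Meta-features: mean and variance of every low-level descriptor,
    # concatenated into one vector for a standard classifier (e.g. an SVM).
    return np.concatenate([np.r_[p.mean(axis=0), p.var(axis=0)]
                           for p in (geometric, motion, fourier)])

seq = np.random.rand(120, 15, 3)   # 120 frames of a 15-joint skeleton
print(meta_features(seq).shape)
```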
Huang et al. [40] use a body pose extractor built on Actional-Structural GCN blocks as an input stream to their model. The other streams in their model extract information from images and body crops based on the architecture of Kosti et al. [22, 24]. The outputs of all the streams are concatenated using a fusion layer before the final classification.
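A minimal sketch of this kind of concatenation-based fusion is shown below; the stream feature dimensions and the fusion head layout are assumptions chosen only to show how separate streams feed a single classifier.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Concatenate per-stream features and classify into emotion categories."""
    def __init__(self, dims=(512, 512, 256), num_classes=26):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(sum(dims), 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, num_classes),
        )

    def forward(self, *stream_feats: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat(stream_feats, dim=1))

# Placeholder features from an image stream, a body-crop stream, and a pose stream.
head = FusionHead()
img_f, body_f, pose_f = torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 256)
logits = head(img_f, body_f, pose_f)   # (4, 26) category scores
print(logits.shape)
```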
Gunes et al. [15] also use body gestures. Similar to PERI, they use facial features by combining visual channels from the face and upper-body gestures. However, their approach requires two high-resolution camera streams, one focusing only on the face and the other only on the upper-body gestures, making it unsuitable for unconstrained settings. For our method, we use two forms of body posture information: body crops and body pose detections. Body crops taken from