PERI: Part Aware Emotion Recognition In The Wild

Akshita Mittel1 and Shashank Tripathi2

1 NVIDIA
amittel@nvidia.com
2 Max Planck Institute for Intelligent Systems, Tübingen, Germany
stripathi@tue.mpg.de
Abstract. Emotion recognition aims to interpret the emotional states of a person based on various inputs including audio, visual, and textual cues. This paper focuses on emotion recognition using visual features. To leverage the correlation between facial expression and the emotional state of a person, pioneering methods rely primarily on facial features. However, facial features are often unreliable in natural unconstrained scenarios, such as in crowded scenes, as the face lacks pixel resolution and contains artifacts due to occlusion and blur. To address this, methods focusing on in-the-wild emotion recognition exploit full-body person crops as well as the surrounding scene context. While effective, in a bid to use body pose for emotion recognition, such methods fail to realize the potential that facial expressions, when available, offer. Thus, the aim of this paper is two-fold. First, we demonstrate a method, PERI, to leverage both body pose and facial landmarks. We create part-aware spatial (PAS) images by extracting key regions from the input image using a mask generated from both body pose and facial landmarks. This allows us to exploit body pose in addition to facial context whenever available. Second, to reason from the PAS images, we introduce context infusion (Cont-In) blocks. These blocks attend to part-specific information and pass it on to the intermediate features of an emotion recognition network. Our approach is conceptually simple and can be applied to any existing emotion recognition method. We provide our results on the publicly available in-the-wild EMOTIC dataset. Compared to existing methods, PERI achieves superior performance and leads to significant improvements in the mAP of emotion categories, while decreasing valence, arousal, and dominance errors. Importantly, we observe that our method improves performance both in images with fully visible faces and in images with occluded or blurred faces.
1 Introduction
The objective of emotion recognition is to recognise how people feel. Humans
function on a daily basis by interpreting social cues from around them. Lecturers
can sense confusion in the class, comedians can sense engagement in their audience, and psychiatrists can sense complex emotional states in their patients.
As machines become an integral part of our lives, it is imperative that they
understand social cues in order to assist us better. By making machines more
aware of context, body language, and facial expressions, we enable them to play
a key role in numerous situations. These include monitoring critically ill patients in hospitals, helping psychologists monitor the patients they consult, detecting engagement in students, and analysing fatigue in truck drivers, to name a few. Thus,
emotion recognition and social AI have the potential to drive key technological
advancements in the future.
Facial expressions are one of the biggest indicators of how a person feels. Therefore, early work in recognizing emotions focused on detecting and analyzing faces [2, 12, 33, 37]. Although rapid strides have been made in this direction, such methods assume availability of well-aligned, fully visible and high-resolution face crops [8, 18, 21, 29, 31, 35, 39]. Unfortunately, this assumption does not hold in
realistic and unconstrained scenarios such as internet images, crowded scenes, and autonomous driving. In-the-wild emotion recognition thus presents a significant challenge for these methods, as face crops tend to be low-resolution, blurred or partially visible due to factors such as the subject's distance from the camera, person and camera motion, crowding, person-object occlusion, frame occlusion, etc. In this paper, we address in-the-wild emotion recognition by leveraging face, body and scene context in a robust and efficient framework called Part-aware Emotion Recognition In the wild, or PERI.
Research in psychology and affective computing has shown that body pose offers significant cues on how a person feels [5, 7, 20]. For example, when people are interested in something, they tilt their head forward. When someone is confident, they tend to square their shoulders. Recent methods recognize the importance of body pose for emotion recognition and tackle issues such as facial occlusion and blurring by processing image crops of the entire body [1, 19, 22, 24, 38, 41].
Kosti et al. [22, 24] expand upon previous work by adding scene context into the mix, noting that the surrounding scene plays a key role in deciphering the emotional state of an individual. An illustrative example is a person crying at a celebration such as a graduation, as opposed to a person crying at a funeral: both individuals can have identical posture but may feel vastly different sets of emotions. Huang et al. [19] expand on this by improving emotion recognition using body pose estimates.
In a bid to exploit full-body crops, body keypoints and scene context, such methods tend to ignore part-specific information such as shoulder position, head tilt, and facial expressions, which, when available, serve as powerful indicators of the emotional state. While previous approaches focus on either body pose or facial expression, we hypothesize that a flexible architecture capable of leveraging both body and facial features is needed. Such an architecture should be robust to the lack of reliable features on an occluded or blurred body or face, attend to relevant body parts, and be extensible enough to include context from the scene. To this end, we present a novel representation, called part-aware spatial (PAS) images, which encodes both facial and part-specific features and retains pixel-alignment relative to the input image. Given a person crop, we generate a part-aware mask
by fitting Gaussian functions to the detected face and body landmarks. Each Gaussian in the part-aware mask represents the spatial context around body and face regions and specifies key regions in the image that the network should attend to. We apply the part-aware mask to the input image, which gives us the final PAS image (see Fig. 2). The PAS images are agnostic to occlusion and blur and take into account both body and face features.
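The PAS construction above is described only at a high level, so the following is a minimal sketch of one plausible implementation, assuming 2D keypoint coordinates are already available from off-the-shelf body and face landmark detectors. The function names, the Gaussian width sigma, and the per-pixel max aggregation are illustrative assumptions rather than details taken from this paper.

```python
import numpy as np

def part_aware_mask(landmarks, height, width, sigma=12.0):
    """Place an isotropic Gaussian at each detected body/face landmark and
    combine them into a single soft spatial mask. `landmarks` is an (N, 2)
    array of (x, y) pixel coordinates; undetected points are simply omitted.
    The value of `sigma` is a free parameter, not specified in the paper."""
    ys, xs = np.mgrid[0:height, 0:width]
    mask = np.zeros((height, width), dtype=np.float32)
    for (x, y) in landmarks:
        g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        mask = np.maximum(mask, g)  # keep the strongest response per pixel
    return mask

def pas_image(person_crop, body_landmarks, face_landmarks, sigma=12.0):
    """Apply the part-aware mask to a person crop to obtain the PAS image.
    `person_crop` is an (H, W, 3) array; masking is applied per channel."""
    h, w = person_crop.shape[:2]
    points = [p for p in list(body_landmarks) + list(face_landmarks) if p is not None]
    mask = part_aware_mask(np.asarray(points, dtype=np.float32), h, w, sigma)
    return person_crop * mask[..., None]
```

Because the mask is built only from whichever landmarks the detectors actually return, missing facial landmarks simply leave that region unweighted instead of breaking the pipeline, which is consistent with the robustness to occluded or blurred faces claimed above.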
To reason from PAS images, we propose novel context-infusion (Cont-In) blocks to inject part-aware features at multiple depths of a deep feature backbone network. Since the PAS images are pixel-aligned, each Cont-In block implements explicit attention on part-specific features from the input image. We show that, as opposed to early fusion (e.g., channel-wise concatenation) of the PAS image with the input image I, or late fusion (concatenating the features extracted from PAS images just before the final classification), Cont-In blocks effectively utilize part-aware features from the image. Cont-In blocks do not alter the architecture of the base network, thereby allowing ImageNet pretraining on all layers. The Cont-In blocks are designed to be easy to implement, efficient to compute, and can be integrated with any emotion recognition network with minimal effort.
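The exact layer composition of a Cont-In block is not spelled out in this section, so the snippet below is a hedged PyTorch-style sketch of the general idea: a small convolutional encoder maps the PAS image to the channel width of an intermediate backbone feature map and modulates that feature map residually. The class name, the layer choices, and the additive fusion are assumptions made for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContInBlock(nn.Module):
    """Hypothetical context-infusion block: encode the PAS image and use the
    result to modulate an intermediate feature map of the base network."""

    def __init__(self, feat_channels, pas_channels=3):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(pas_channels, feat_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, feat_channels, kernel_size=3, padding=1),
        )

    def forward(self, feat, pas):
        # Resize the pixel-aligned PAS image to the spatial size of the
        # intermediate feature map, then fuse it as a residual modulation.
        pas = F.interpolate(pas, size=feat.shape[-2:], mode="bilinear",
                            align_corners=False)
        return feat + self.encode(pas)
```

In this sketch, a Cont-In block would be dropped in after selected stages of a standard backbone (for example a ResNet), with `feat` being that stage's output. Because the block only adds a modulation on top of the existing features, the backbone's ImageNet-pretrained weights remain usable, matching the remark about pretraining above.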
Closest to our work is the approach of Gunes et al. [15], which combines visual channels from the face and upper-body gestures for emotion recognition. However, unlike PERI, which takes unconstrained in-the-wild monocular images as input, their approach takes two high-resolution camera streams, one focusing only on the face and the other focusing only on upper-body gestures from the waist up. All of the training data in [15] is recorded in an indoor setting with a uniform background, a single subject, consistent lighting, a front-facing camera, and a fully visible face and body; a setting considerably simpler than our goal of emotion recognition in real-world scenarios. Further, our architecture and training scheme are fundamentally different and efficiently capture part-aware features from monocular images.
In summary, we make the following contributions:
1. Our approach, PERI, advances in-the-wild emotion recognition by introducing a novel representation (called PAS images) which efficiently combines body pose and facial landmarks such that they can supplement one another.
2. We propose context infusion (Cont-In) blocks which modulate the intermediate features of a base emotion recognition network, helping it reason from both body poses and facial landmarks. Notably, Cont-In blocks are compatible with any existing emotion recognition network with minimal effort.
3. Our approach results in significant improvements over existing approaches on the publicly available in-the-wild EMOTIC dataset [23]. We show that PERI adds robustness under occlusion, blur and low-resolution input crops.
2 Related Work
Emotion recognition is a field of research with the objective of interpreting a
person’s emotions using various cues such as audio, visual, and textual inputs.
Preliminary methods focused on recognising the six basic discrete emotions defined by the psychologists Ekman and Friesen [9]: anger, surprise, disgust, enjoyment, fear, and sadness. As research progressed, datasets such as the EMOTIC dataset [22, 23, 24] have expanded on these to provide a wider label set. A second class of emotion recognition methods focuses not on discrete classes but on a continuous set of labels described by Mehrabian [30], namely Valence (V), Arousal (A), and Dominance (D). We evaluate the performance of our model using both the 26 discrete classes in the EMOTIC dataset [23] and the valence, arousal, and dominance errors. Our method works on visual cues, more specifically on images and body crops.
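For reference, the sketch below shows one common way to compute these two kinds of metrics: multi-label mean average precision over the 26 EMOTIC categories and a per-dimension error for the continuous valence, arousal, and dominance predictions. The exact error definition used by the benchmark is not restated in this section, so the L1 error and the function name here are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def emotic_style_metrics(cat_scores, cat_labels, vad_pred, vad_true):
    """cat_scores, cat_labels: (N, 26) predicted scores and binary ground-truth
    labels for the 26 discrete categories; vad_pred, vad_true: (N, 3) continuous
    valence/arousal/dominance values."""
    # Average precision per category, then the mean over categories (mAP).
    ap = [average_precision_score(cat_labels[:, c], cat_scores[:, c])
          for c in range(cat_labels.shape[1])]
    mAP = float(np.mean(ap))
    # Per-dimension absolute error for the continuous labels (an assumption).
    vad_err = np.abs(vad_pred - vad_true).mean(axis=0)
    return mAP, dict(zip(["valence", "arousal", "dominance"], vad_err))
```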
Emotion recognition using facial features. Most existing methods in computer vision for emotion recognition focus on facial expression analysis [2, 12, 33, 37]. Initial work in this field was based on the Facial Action Coding System (FACS) [4, 10, 11, 26, 34] to recognise the basic set of emotions. FACS refers to a set of facial muscle movements that correspond to a displayed emotion; for instance, raising the inner eyebrow can be considered a unit of FACS. These methods first extract facial landmarks from a face, which are then used to create facial action units, a combination of which is used to recognise the emotion. Another class of methods uses CNNs to recognize the emotions [2, 19, 22, 23, 32, 42]. For instance, Emotionnet [2] uses a face detector to obtain face crops which are then passed into a CNN to get the emotion category. Similar to these methods, we use facial landmarks in our work. However, uniquely, the landmarks are used to create the PAS contextual images, which in turn modulate the main network through a series of convolution layers in the Cont-In blocks.
Emotion recognition using body poses. Unlike facial emotion recognition, the work on emotion recognition using body poses is relatively new. Research in psychology [3, 13, 14] suggests that cues from body pose, including features such as the hip, shoulder, elbow, pelvis, neck, and trunk, can provide significant insight into the emotional state of a person. Based on this hypothesis, Crenn et al. [6] sought to classify body expressions by obtaining low-level features from 3D skeleton sequences. They separate the features into three categories: geometric features, motion features, and Fourier features. Based on these low-level features, they calculate meta-features (mean and variance), which are sent to the classifier to obtain the final expression labels. Huang et al. [40] use a body pose extractor built on Actional-Structural GCN blocks as an input stream to their model. The other streams in their model extract information from images and body crops based on the architecture of Kosti et al. [22, 24]. The outputs of all the streams are concatenated using a fusion layer before the final classification. Gunes et al. [15] also use body gestures. Similar to PERI, they use facial features by combining visual channels from the face and upper-body gestures. However, their approach takes two high-resolution camera streams, one focusing only on the face and the other focusing only on upper-body gestures, making it unsuitable for unconstrained settings. For our method, we use two forms of body posture information: body crops and body pose detections. Body crops taken from