Preliminary methods focused on recognising the six basic discrete emotions defined by the psychologists Ekman and Friesen [9]: anger, surprise, disgust, enjoyment, fear, and sadness. As research progressed, datasets such as the EMOTIC dataset [22, 23, 24] have expanded on these to provide a wider label set. A second class of emotion recognition methods focuses not on discrete classes but on a continuous set of labels described by Mehrabian [30]: Valence (V), Arousal (A), and Dominance (D). We evaluate the performance of our model using both the 26 discrete classes in the EMOTIC dataset [23] and the valence, arousal, and dominance errors. Our method works on visual cues, more specifically on images and body crops.
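For concreteness, the snippet below sketches this kind of evaluation, assuming per-image scores for the 26 categories and continuous VAD predictions stored as NumPy arrays; the array names and the exact metric choices (per-class average precision and mean absolute VAD error) are illustrative rather than the benchmark's official protocol.

```python
import numpy as np

def mean_vad_error(vad_pred: np.ndarray, vad_true: np.ndarray) -> np.ndarray:
    """Mean absolute error per continuous dimension (Valence, Arousal, Dominance)."""
    return np.abs(vad_pred - vad_true).mean(axis=0)

def average_precision(scores: np.ndarray, labels: np.ndarray) -> float:
    """Average precision for one discrete category (binary ground-truth labels)."""
    order = np.argsort(-scores)
    labels = labels[order]
    cum_pos = np.cumsum(labels)
    precision = cum_pos / np.arange(1, len(labels) + 1)
    return float((precision * labels).sum() / max(labels.sum(), 1))

# Example with random placeholder data: 100 images, 26 categories, 3 VAD dimensions.
rng = np.random.default_rng(0)
cat_scores = rng.random((100, 26))
cat_labels = (rng.random((100, 26)) > 0.8).astype(int)
vad_pred, vad_true = rng.random((100, 3)), rng.random((100, 3))

mAP = np.mean([average_precision(cat_scores[:, c], cat_labels[:, c]) for c in range(26)])
print("mAP over 26 categories:", mAP)
print("V/A/D mean absolute error:", mean_vad_error(vad_pred, vad_true))
```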
Emotion recognition using facial features. Most existing methods in Computer Vision for emotion recognition focus on facial expression analysis [2, 12, 33, 37]. Initial work in this field was based on the Facial Action Coding System (FACS) [4, 10, 11, 26, 34] to recognise the basic set of emotions. FACS refers to a set of facial muscle movements that correspond to a displayed emotion; for instance, raising the inner eyebrow can be considered one unit of FACS. These methods first extract facial landmarks from a face, which are then used to create facial action units, a combination of which is used to recognise the emotion.
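To illustrate this landmark-to-action-unit pipeline, the following is a minimal sketch; the landmark indices, the single action-unit rule, and the classifier stand-in are all hypothetical and only indicate the overall flow.

```python
import numpy as np

# Hypothetical landmark layout: each face is a (68, 2) array of (x, y) points.
INNER_BROW, UPPER_EYELID = 21, 38  # illustrative indices, not a fixed standard

def action_unit_features(landmarks: np.ndarray) -> np.ndarray:
    """Turn raw landmarks into a small vector of AU-like measurements."""
    face_scale = np.linalg.norm(landmarks.max(axis=0) - landmarks.min(axis=0))
    # AU-like cue: vertical gap between inner brow and upper eyelid,
    # normalised by face size (a rough proxy for an "inner brow raiser").
    brow_raise = (landmarks[UPPER_EYELID, 1] - landmarks[INNER_BROW, 1]) / face_scale
    return np.array([brow_raise])

def classify_emotion(au_vector: np.ndarray) -> str:
    """Stand-in for a classifier trained on combinations of action units."""
    return "surprise" if au_vector[0] > 0.15 else "neutral"

landmarks = np.random.rand(68, 2) * 200   # placeholder landmark detections
print(classify_emotion(action_unit_features(landmarks)))
```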
Another class of methods uses CNNs to recognise emotions [2, 19, 22, 23, 32, 42]. For instance, Emotionnet [2] uses a face detector to obtain face crops, which are then passed into a CNN to predict the emotion category. Similar to these methods, we use facial landmarks in our work. Uniquely, however, the landmarks are used to create the PAS contextual images, which in turn modulate the main network through a series of convolutional layers in the Cont-In blocks.
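As a rough illustration of this kind of landmark-driven modulation, the sketch below applies multiplicative and additive maps, derived from a single-channel landmark image, to an intermediate feature map of a backbone. The module name, channel sizes, and FiLM-style modulation form are assumptions for illustration, not the actual PAS or Cont-In implementation.

```python
import torch
import torch.nn as nn

class ContextModulation(nn.Module):
    """Modulate a backbone feature map with a landmark-derived context image."""
    def __init__(self, feat_channels: int, ctx_channels: int = 1):
        super().__init__()
        # Small conv stack mapping the context image to per-channel
        # multiplicative (gamma) and additive (beta) modulation maps.
        self.ctx_net = nn.Sequential(
            nn.Conv2d(ctx_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 2 * feat_channels, kernel_size=3, padding=1),
        )

    def forward(self, feat: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # Resize the context image to the spatial size of the feature map.
        ctx = nn.functional.interpolate(ctx, size=feat.shape[-2:],
                                        mode="bilinear", align_corners=False)
        gamma, beta = self.ctx_net(ctx).chunk(2, dim=1)
        return feat * (1 + gamma) + beta  # feature-wise modulation

# Usage: modulate a 256-channel feature map with a 1-channel landmark map.
block = ContextModulation(feat_channels=256)
feat = torch.randn(2, 256, 28, 28)
ctx = torch.randn(2, 1, 224, 224)
print(block(feat, ctx).shape)  # torch.Size([2, 256, 28, 28])
```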
Emotion recognition using body poses. Unlike facial emotion recognition, work on emotion recognition using body poses is relatively new. Research in psychology [3, 13, 14] suggests that cues from body pose, including features such as the hip, shoulder, elbow, pelvis, neck, and trunk, can provide significant insight into the emotional state of a person. Based on this hypothesis, Crenn et al. [6] sought to classify body expressions by obtaining low-level features from 3D skeleton sequences. They separate these features into three categories: geometric features, motion features, and Fourier features. From the low-level features they calculate meta-features (mean and variance), which are sent to a classifier to obtain the final expression labels.
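The sketch below illustrates this style of hand-crafted pipeline with simplified stand-ins for the geometric, motion, and Fourier descriptors and their mean/variance meta-features; the exact features in [6] differ, so treat these definitions as placeholders.

```python
import numpy as np

def meta_features(skeleton_seq: np.ndarray) -> np.ndarray:
    """skeleton_seq: (T, J, 3) array of J joints tracked over T frames."""
    T, J, _ = skeleton_seq.shape
    # Geometric: pairwise joint distances within each frame.
    diffs = skeleton_seq[:, :, None, :] - skeleton_seq[:, None, :, :]
    geometric = np.linalg.norm(diffs, axis=-1).reshape(T, -1)
    # Motion: frame-to-frame joint displacement magnitudes.
    motion = np.linalg.norm(np.diff(skeleton_seq, axis=0), axis=-1)
    # Fourier: magnitude spectrum of each joint coordinate over time.
    fourier = np.abs(np.fft.rfft(skeleton_seq, axis=0)).reshape(T // 2 + 1, -1)
    # Meta-features: mean and variance of every low-level descriptor,
    # concatenated into one vector for a standard classifier (e.g. an SVM).
    return np.concatenate([np.r_[p.mean(axis=0), p.var(axis=0)]
                           for p in (geometric, motion, fourier)])

seq = np.random.rand(120, 15, 3)   # 120 frames of a 15-joint skeleton
print(meta_features(seq).shape)
```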
Huang et al. [40] use a body pose extractor built on Actional-Structural GCN blocks as an input stream to their model. The other streams in their model extract information from images and body crops based on the architecture of Kosti et al. [22, 24]. The outputs of all the streams are concatenated using a fusion layer before the final classification.
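A minimal sketch of this kind of concatenation-based fusion is shown below; the stream feature dimensions and the fusion head layout are assumptions chosen only to show how separate streams feed a single classifier.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Concatenate per-stream features and classify into emotion categories."""
    def __init__(self, dims=(512, 512, 256), num_classes=26):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(sum(dims), 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, num_classes),
        )

    def forward(self, *stream_feats: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat(stream_feats, dim=1))

# Placeholder features from an image stream, a body-crop stream, and a pose stream.
head = FusionHead()
img_f, body_f, pose_f = torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 256)
logits = head(img_f, body_f, pose_f)   # (4, 26) category scores
print(logits.shape)
```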
Gunes et al. [15] also use body gestures. Similar to PERI, they use facial features by combining visual channels from the face and upper-body gestures. However, their approach requires two high-resolution camera streams, one focusing only on the face and the other only on the upper-body gestures, making it unsuitable for unconstrained settings. For our method, we use two forms of body posture information: body crops and body pose detections. Body crops taken from