such as landmarks, textual descriptions, and expressions. We
describe previous related works below.
2.1 AU Detection with Auxiliary Information
Due to the high labor cost of AU annotations, the scale
and subject variations of AU detection datasets are usu-
ally limited. As a result, previous AU detection methods
resort to various kinds of auxiliary information to improve
the generalization performance. Facial landmarks, typically produced by pre-trained detectors, are the most widely used auxiliary information for AU detection.
JPML [30] employs facial landmark features to crop local
image patches for different AUs. EAC-Net [25] constructs
local regions of interest and spatial attention maps from the
facial landmarks. LP-Net [12] trains an individual-specific
shape regularization network from the detected facial land-
marks. J ˆ
AA-Net [26] jointly performs AU detection and
facial landmark detection from the data annotated with both
labels.
Other types of auxiliary information have also been
explored. Zhao et al. [31] pre-train a weakly supervised
embedding from a large number of web images. Cui et al.
[16] summarize the prior probabilities of AU occurrences as generic knowledge and jointly optimize emotions and AUs under these priors. Recently, SEV-Net [28] lever-
ages textual descriptions of AU occurrences by employing
a pre-trained word embedding to obtain auxiliary textual
features. To enhance generalization, our method introduces a pre-trained expression embedding as the auxiliary information.
2.2 AU Feature Learning
Due to the local definitions of AUs, it is essential to extract
local AU features. As such, some researchers have proposed to obtain local information through patch learning. For instance, Zhong et al. [19] and Liu et al. [20] divide an input image into uniform patches before encoding to
analyze facial expressions. Taking head pose into consideration, Onal et al. [21], [32] register the 3D head pose to reduce the influence of head movements and then crop AU-specific local facial patches for recognition. Besides, it is common practice to use attention mechanisms to highlight features at AU-related facial positions. Facial landmarks, which provide sparse facial geometric cues, are well suited to serve as a supervised attention prior. EAC-Net [25] creates fixed at-
tention maps related to the correlations between AUs and
landmarks. JÂA-Net [26] jointly performs AU detection and
facial landmark detection, and the predicted landmarks are
used to compute the attention map for each AU. Jacob et
al. [27] propose a multi-task method that combines the tasks
of AU detection and landmark-based attention map predic-
tion. ARL [24] learns channel-wise and spatial attention for each AU, and refines the spatial attention with a pixel-level relation learned by a CRF. Besides facial landmarks, SEV-Net [28] utilizes textual descriptions of local details to generate a regional attention map, thereby highlighting the local parts of the global features.
However, it requires extra annotations for the descriptions. In contrast, our work learns a pixel-wise self-attention map in a purely data-driven manner, without extra supervision. Our experiments demonstrate that this attention map is superior to attention maps derived from prior knowledge.
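For illustration, the sketch below shows a generic pixel-wise spatial self-attention block in the non-local style, where every pixel attends to every other pixel without any landmark or textual supervision. The class name, channel-reduction factor, and residual connection are assumptions made for this example and do not reflect our exact architecture.

```python
import torch
import torch.nn as nn

class PixelwiseSelfAttention(nn.Module):
    """Non-local style spatial self-attention: every pixel attends to all pixels."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        reduced = max(channels // reduction, 1)
        self.query = nn.Conv2d(channels, reduced, kernel_size=1)
        self.key = nn.Conv2d(channels, reduced, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                                # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)     # (B, HW, C')
        k = self.key(x).flatten(2)                       # (B, C', HW)
        attn = torch.softmax(q @ k, dim=-1)              # (B, HW, HW) data-driven weights
        v = self.value(x).flatten(2).transpose(1, 2)     # (B, HW, C)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return x + out                                   # residual connection
```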
2.3 Expression Representations
Expressions are essential auxiliary information for AU de-
tection. Some prior works leverage large amounts of accessible expression data to enhance AU detection [11], [7], [31], [16].
Recently, Chang et al. [33] utilize a large number of unlabeled images to train an encoder that extracts local representations and projects them into a low-dimensional latent space, and then improve performance through contrastive learning. Many methods map face images into a
low-dimensional manifold for subject-independent expres-
sion representations. Early works [15], [34], [35] train the em-
beddings for discrete emotion classification tasks but neglect
the facial expression variations within each class. The 3D
Morphable Model (3DMM) [36], [37] has been proposed to fit identity and expression parameters from a single face
image. Expressions are represented as the coefficients of pre-
defined blendshapes in 3DMM. The estimated expression
coefficients are then used for talking head synthesis [38], [39], expression transfer [40], [41], and face manipulation
[42]. However, the estimated expression coefficients have
weaknesses in representing fine-grained expressions. To
solve this problem, Vemulapalli and Agarwala [14] propose
a compact embedding for complicated and subtle facial
expressions, where facial expression similarity is defined
through triplet annotations. Zhang et al. [43] propose a
Deviation Learning Network (DLN) to remove the identity
information from continuous expression embeddings, and
thus achieve more compact and smooth representations.
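To illustrate how triplet annotations can supervise a compact expression embedding, the sketch below shows a standard triplet margin loss on L2-normalized embeddings. The function name, margin value, and embedding dimension are illustrative assumptions rather than the exact formulations of [14] or [43].

```python
import torch
import torch.nn.functional as F

def expression_triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss on L2-normalized expression embeddings.

    The anchor/positive pair is annotated as more similar in expression
    than the anchor/negative pair; the margin value is an assumption.
    """
    anchor, positive, negative = (F.normalize(t, dim=-1)
                                  for t in (anchor, positive, negative))
    d_pos = (anchor - positive).pow(2).sum(dim=-1)   # squared distance to positive
    d_neg = (anchor - negative).pow(2).sum(dim=-1)   # squared distance to negative
    return F.relu(d_pos - d_neg + margin).mean()     # pull positives, push negatives

# Example usage with random 16-D embeddings (dimension is illustrative).
a, p, n = (torch.randn(8, 16) for _ in range(3))
loss = expression_triplet_loss(a, p, n)
```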
3 PROPOSED METHOD
In this section, we first briefly introduce the problem defini-
tion and then describe our proposed GTLE-Net framework.
3.1 Problem Definition
The task of AU detection is to predict the AU occurrence probabilities $[o_1, o_2, \ldots, o_N]$ given an input face image with a resolution of $256 \times 256$. As described in Section 1, expressions and AUs describe the face at the global and local levels, respectively. They are highly related and can thus be utilized to promote each other. Moreover, facial expression data are easier to obtain and annotate, while AUs are more subtle and difficult to annotate. Intuitively, facial AU information can be extracted from facial expression representations when the expression representation is sufficiently fine-grained. Considering the above factors,
we propose a global expression encoder (GEE) and pre-train it on a large-scale facial expression dataset, aiming to obtain a powerful and robust expression representation. Furthermore, to capture local AU features, we propose a local AU features module (LAM) constructed with two extractors, namely, the AU mask extractor ($E_m$) and the AU feature map extractor ($E_a$). The framework is illustrated in Fig. 2. In the following, we discuss each main module in detail.
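To make the problem definition concrete, the sketch below shows the standard multi-label formulation that maps fused features to the $N$ occurrence probabilities with per-AU sigmoids and a binary cross-entropy loss. The class name, feature dimension, batch size, and loss configuration are illustrative assumptions and not necessarily the exact setup of GTLE-Net.

```python
import torch
import torch.nn as nn

class AUOccurrenceHead(nn.Module):
    """Multi-label head: one logit per AU; sigmoid yields occurrence probabilities o_1..o_N."""
    def __init__(self, feat_dim: int, num_aus: int):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_aus)

    def forward(self, fused):                      # fused: (B, feat_dim) global/local features
        return self.fc(fused)                      # raw logits; apply sigmoid at inference

# Illustrative dimensions only (not the paper's exact configuration).
head = AUOccurrenceHead(feat_dim=512, num_aus=12)
criterion = nn.BCEWithLogitsLoss()                 # per-AU binary cross-entropy
features = torch.randn(8, 512)                     # placeholder for fused features
labels = torch.randint(0, 2, (8, 12)).float()      # ground-truth AU occurrences
loss = criterion(head(features), labels)
probs = torch.sigmoid(head(features))              # predicted [o_1, ..., o_N] per image
```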