crops 5 patches from a full face based on landmarks and
feeds them with the full face into networks for learning
the global and local features. ROI-Net [22] also designs a predefined landmark-based cropping rule to crop inner feature maps. These methods usually suffer from performance degradation in the wild due to erroneous landmark estimation.
To equip the model with local features, multi-task methods usually combine AU detection with landmark detection [3,31] or landmark-based attention map prediction [13]. In this way, models can extract global features from full faces while also focusing on local details around the landmarks for better AU detection. However, these methods ignore that landmarks also contain rich identity information [14], which may aggravate identity overfitting.
SEV-Net [38] proposes to utilize textual descriptions of local details to generate a regional attention map, which highlights the local parts of the global features. However, it requires extra annotations for the descriptions. In addition, none of these previous works take the removal of identity disturbance from the global features into account.
Different from the above works, our carefully designed global branch is dedicated to eliminating identity disturbance, and the patches for our local branch are cropped at fixed image positions instead of being located by landmarks.
2.3. Expression Representations
Action units reflect facial expression information, and a model's perception of expression plays a crucial role in AU detection. The expression representation can be used to evaluate the expression perception capability of a model. A common practice for representing expressions is to map face images into a low-dimensional manifold that describes expressions without the disturbance of identity, pose, or illumination. Early works use the hidden features of the last or penultimate layer of a model trained on discrete expression classification [25,37,51] as the expression representation; such representations mainly capture the limited expression categories but neglect complicated and fine-grained facial expressions. Differently, Vemulapalli and Agarwala [36] propose a compact and continuous embedding for representing facial expressions. They construct a large-scale facial dataset annotated with expression similarity in a triplet manner, and through a large number of triplet comparisons, the trained expression embedding can perceive slight expression changes. To further reduce the influence of identity, Zhang et al. [42] develop a Deviation Learning Network (DLN) with a two-branch structure to achieve a more compact and smooth expression embedding. 3D Morphable Models (3DMMs) [4,28] have been proposed to fit identity and expression parameters from a single face image, where expressions are represented as the coefficients of predefined blendshapes. The estimated expression coefficients are then used for talking head synthesis [21,46], expression transfer [17,40], or face manipulation [11].
3. Proposed Method
The architecture of the proposed GLEE-Net is shown in Figure 2. It takes an image as input and outputs a binary vector indicating the occurrence of each AU. The whole framework consists of a global branch, a local branch, a 3D global branch, and a Transformer classifier. The global branch extracts a full-face feature to model the overall facial expression, while the local branch focuses on detailed local information. These two branches are pretrained on the FEC expression dataset [36] and then finetuned on the AU dataset to alleviate the issue of limited identities. To further enrich the 2D facial representations, the 3D global branch extracts expression coefficients through 3D face reconstruction. Finally, the Transformer classifier performs the AU detection on the combined features of the three branches with the attention mechanism, as sketched below.
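As a rough illustration of this layout, here is a minimal PyTorch-style sketch. The branch encoders are placeholders for the actual FaceNet-based 2D branches and the 3DMM coefficient extractor; the feature size, AU count, and mean-pooling fusion over branch tokens are our assumptions:

```python
import torch
import torch.nn as nn

class GLEENet(nn.Module):
    """Sketch of the three-branch layout with a Transformer classifier.
    All module internals and dimensions are illustrative assumptions."""

    def __init__(self, feat_dim=128, num_aus=12, num_layers=2, num_heads=4):
        super().__init__()
        # Placeholder encoders: each maps an input image to a feature vector.
        self.global_branch = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))
        self.local_branch = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))
        self.branch_3d = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads,
                                           batch_first=True)
        self.classifier = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(feat_dim, num_aus)

    def forward(self, img):
        # One feature token per branch; in the real model the local branch
        # sees cropped patches and the 3D branch sees expression coefficients.
        tokens = torch.stack([self.global_branch(img),
                              self.local_branch(img),
                              self.branch_3d(img)], dim=1)  # (B, 3, D)
        fused = self.classifier(tokens).mean(dim=1)         # (B, D)
        return torch.sigmoid(self.head(fused))              # per-AU occurrence

probs = GLEENet()(torch.randn(2, 3, 224, 224))  # -> (2, 12)
```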
3.1. Global Branch
Inspired by DLN [43], the global branch models the expression feature vector $V_{exp}$ as the deviation from the identity vector $V_{id}$. Specifically, the global branch consists of two siamese models, i.e., a face model and an identity model. Both models are initialized with FaceNet [30] pretrained on a face recognition task [5]. Then, we fix the identity model and train the face model to learn the expression deviation. The extracted full-face expression feature vector $V_{exp}$ is obtained by:

$V_{exp} = V_{face} - V_{id}$.  (1)

The deviation model of the global branch benefits from an effective feature initialization that can alleviate the disturbance of expression-irrelevant information, such as identity and pose. After a linear layer for dimension reduction, we obtain the global expression feature vector $G_{exp}$.
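A minimal sketch of this deviation computation is given below, assuming both siamese models are copies of one pretrained backbone; `backbone`, `proj`, and the dimensions are illustrative placeholders rather than the exact configuration:

```python
import copy
import torch.nn as nn

def build_global_branch(backbone: nn.Module, feat_dim: int, out_dim: int):
    """Two siamese copies of a pretrained face-recognition backbone:
    a frozen identity model and a trainable face model, plus a linear
    layer for dimension reduction. Sizes are illustrative."""
    identity_model = copy.deepcopy(backbone)
    for p in identity_model.parameters():
        p.requires_grad = False              # identity model stays fixed
    face_model = copy.deepcopy(backbone)     # trained to learn the deviation
    proj = nn.Linear(feat_dim, out_dim)
    return identity_model, face_model, proj

def global_expression_feature(img, identity_model, face_model, proj):
    v_id = identity_model(img)    # identity vector V_id
    v_face = face_model(img)      # face vector V_face
    v_exp = v_face - v_id         # Eq. (1): expression as deviation
    return proj(v_exp)            # dimension reduction -> G_exp
```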
3.2. Local Branch
We introduce a local branch to complement the global branch with more detailed local information, which is also beneficial for AU detection due to the local nature of AUs. First, we crop the image into 16 parts for local part extraction. Since the expression dataset contains a large number of in-the-wild images, it is hard to locate specific face regions accurately. Therefore, we crop the image according to the whole image area instead of facial landmarks. Specifically, we crop three-quarters of the image from the left, right, top, and bottom, and call the resulting patches L34, R34, T34, and B34, respectively (see the sketch below). Similarly, we crop half of the image from each of the four directions
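As an illustration of this landmark-free rule, here is a minimal sketch of the directional crops using PIL; the exact crop geometry is our reading of the description above and may differ from the original implementation in detail:

```python
from PIL import Image

def directional_crops(img: Image.Image, ratio: float = 0.75):
    """Crop `ratio` of the image from the left, right, top, and bottom;
    ratio=0.75 gives the L34/R34/T34/B34 patches, ratio=0.5 the halves."""
    w, h = img.size
    cw, ch = int(w * ratio), int(h * ratio)
    return {
        "L": img.crop((0, 0, cw, h)),      # keep the left `ratio` of width
        "R": img.crop((w - cw, 0, w, h)),  # keep the right `ratio` of width
        "T": img.crop((0, 0, w, ch)),      # keep the top `ratio` of height
        "B": img.crop((0, h - ch, w, h)),  # keep the bottom `ratio` of height
    }
```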