Facial Action Units Detection Aided by Global-Local Expression Embedding
Zhipeng Hu1*, Wei Zhang1*, Lincheng Li1, Yu Ding1*, Wei Chen2, Zhigang Deng3, Xin Yu4
1Netease Fuxi AI Lab    2Hebei Agricultural University    3University of Houston    4University of Technology Sydney
{zphu, zhangwei05, lilincheng, dingyu01}@corp.netease.com
rshchchw@hebau.edu.cn; zdeng4@uh.edu; xin.yu@uts.edu.au
Abstract
Since Facial Action Unit (AU) annotations require domain expertise, common AU datasets only contain a limited number of subjects. As a result, a crucial challenge for AU detection is addressing identity overfitting. We find that AUs and facial expressions are highly associated, and existing facial expression datasets often contain a large number of identities. In this paper, we aim to utilize expression datasets without AU labels to facilitate AU detection. Specifically, we develop a novel AU detection framework aided by the Global-Local facial Expression Embedding, dubbed GLEE-Net. Our GLEE-Net consists of three branches that extract identity-independent expression features for AU detection. We introduce a global branch for modeling the overall facial expression while eliminating the impacts of identities. We also design a local branch focusing on specific local face regions. The combined output of the global and local branches is first pre-trained on an expression dataset as an identity-independent expression embedding, and then finetuned on AU datasets. Therefore, we significantly alleviate the issue of limited identities. Furthermore, we introduce a 3D global branch that extracts expression coefficients through 3D face reconstruction to consolidate 2D AU descriptions. Finally, a Transformer-based multi-label classifier is employed to fuse all the representations for AU detection. Extensive experiments demonstrate that our method significantly outperforms the state-of-the-art on the widely-used DISFA, BP4D and BP4D+ datasets.
1. Introduction
Facial Action Units (AUs), coded by the Facial Action Coding System (FACS), include 32 atomic facial action descriptors based on facial muscle groups [9]. Face AU detection has attracted substantial research effort due to its crucial applications in emotion recognition [27], micro-expression detection [47], and mental health diagnosis [29].
*Equal contribution. Yu Ding is the corresponding author.
Since AU annotations require sophisticated expertise and are time-consuming, the size of annotated AU datasets is usually limited, especially in terms of identity variations (e.g., fewer than 50 subjects). As a result, most AU detection methods overfit to the training identities and do not generalize well to new subjects. To alleviate the overfitting problem on small AU datasets, previous methods resort to various auxiliary information as regularization, including facial landmarks [23,26,31,48], unsupervised web images [49], emotion priors [8], textual AU descriptions [38], and so on. However, these additional constraints do not directly remove the interference of the training identities from the extracted visual features, thus limiting their performance.
Different from previous works, we make the first attempt to employ an expression embedding extracted from an in-the-wild expression dataset [36] without AU labels. The embedding provides a strong prior for AU detection due to two important properties: continuity and identity-independence. First, the embedding provides a continuous space for representing fine-grained expressions, which is beneficial for AU detection since AUs usually manifest as slight variations on the face. Second, the embedding is less sensitive to identities because semantically similar expressions of different identities are analogous in the embedding space. This important property can be used to alleviate the overfitting problem in AU detection. Hence, our motivation is to leverage a continuous expression embedding space to represent AUs for accurate AU detection.
Driven by our motivation, we develop a novel AU detection framework aided by the Global-Local facial Expression Embedding, namely GLEE-Net. Our GLEE-Net consists of three branches that extract identity-independent facial expression features for AU detection.
Figure 1. t-SNE visualization of the distributions of some AU combinations (AUs(6,7,10,12,14); AUs(4,6,7,10,14,17,23,24); AUs(1,2,10); Else) in the expression embedding space of our GLEE-Net. Left: the same AU combinations distribute closely in the expression embedding space. Right: similar expressions in the expression embedding space often have similar AU labels, and learning expressions can facilitate AU detection.
To comprehend the overall facial expressions, we introduce a global branch that models the expression information as the deviation from the identity representation. In this way, our global branch is less sensitive to the identity information. Considering that AUs are defined on local face regions, we design a local branch to focus on details of specific face regions. To alleviate the problem of limited identities in AU datasets, and different from existing methods with global and local branches, we first pretrain the two branches on an expression dataset [36] and then finetune them on the target AU dataset. In this manner, our network has seen various subjects' expressions even though their AU labels are not available, and moreover acquires a compact expression embedding for AU detection. In Figure 1, we sample some images from BP4D and visualize their expression embeddings. As expected, the same AU combinations and similar expressions from different identities are close in our expression embedding space, thus facilitating AU classification.
Furthermore, in contrast to existing methods that rely only on auxiliary information from 2D images, we find that 3D facial information also provides important expression cues. Thus, we introduce a 3D global branch that obtains expression coefficients, as 3D expression features, through 3D face reconstruction. To fully exploit all the representations from our global-local branches, we design a Transformer-based multi-label classifier. Benefiting from the powerful global attention mechanism of the Transformer [35], we can effectively fuse different representations and thus explore the correlations among multiple AUs. With the co-occurrence relationships of AUs, our network can predict AUs more accurately. Extensive experiments demonstrate that our approach achieves significantly superior performance on the widely-used DISFA, BP4D and BP4D+ datasets.
In summary, the contributions of our work are three-fold:
• We propose a novel Global-Local facial Expression Embedding Network (GLEE-Net) for AU detection, which can leverage additional facial expression data (without AU labels) to improve AU detection accuracy.
• We develop the global and local branches to extract compact expression embeddings from face regions while paying attention to local facial details. To the best of our knowledge, our work is the first attempt to utilize continuous and compact expression features to represent AUs effectively. It achieves appealing generalization capability in addressing AU classification for unseen identities.
• We introduce a 3D global branch to extract expression coefficients through 3D face reconstruction for AU detection, and demonstrate that exploiting 3D face priors can further improve 2D AU detection.
2. Related Works
2.1. AU Detection with Auxiliary Information
The widely used AU datasets only contain limited subjects due to the difficulty of AU annotation, which is the main cause of overfitting. To resolve this, some works resort to various kinds of auxiliary information to enhance model generalization and facilitate AU detection. Introducing extra information from facial landmarks is a common practice in AU detection. To effectively extract local features for AUs, JPML [48] utilizes landmarks to crop facial patches instead of uniformly distributed grids. EAC-Net [23] also generates spatial attention maps according to the facial landmarks and applies them to different levels of the network. LP-Net [26] sends the detected facial landmarks into the P-Net to learn person-specific shape information. JÂA-Net [31] proposes a multi-task framework combining landmark detection and AU detection. Besides these, there exist other kinds of auxiliary information. Zhao et al. [49] utilize unlabelled large-scale web images and propose a weakly-supervised spectral embedding for AU detection. Cui et al. [8] construct an expression-AU knowledge prior based on existing anatomic and psychological research and introduce an expression recognition model for AU detection. SEV-Net [38] introduces pre-trained word embeddings to learn spatial attention maps based on the textual descriptions of AU occurrences. The aforementioned methods all directly or indirectly introduce additional data to produce extra regularization in AU detection. We propose to utilize the expression embedding as auxiliary information, which better improves the generalization capability of AU detection.
2.2. AU Detection with Global and Local Features
Due to the local definition of AUs, many methods attempt to combine full and regional facial features for AU detection. These works can be classified into three categories: patch-based, multi-task, and text-based methods.
Patch-based methods usually crop the full face into patches according to the local definitions of AUs. DSIN [7] crops 5 patches from a full face based on landmarks and feeds them, together with the full face, into networks for learning global and local features. ROI-Net [22] also designs a prior landmark-based cropping rule to crop the inner feature maps. These methods usually suffer from performance degradation in the wild due to erroneous landmark estimation.
To equip the model with local features, multi-task methods usually combine AU detection with landmark detection [3,31] or landmark-based attention map prediction [13]. In this way, models can extract global features from full faces while also focusing on local details around the landmarks for better AU detection. However, these methods ignore that landmarks also contain rich identity information [14], which may aggravate identity overfitting. SEV-Net [38] proposes to utilize textual descriptions of local details to generate a regional attention map, thereby highlighting the local parts of the global features. However, it requires extra annotations for these descriptions. In addition, the global features of previous works do not take the removal of identity disturbance into account.
Different from the above works, our carefully-designed global branch is dedicated to eliminating identity disturbance, and the patches for our local branch are cropped at fixed image positions rather than from landmarks.
2.3. Expression Representations
Action units reflect facial expression information, and a model's perception of expressions plays a crucial role in AU detection. The expression representation can be used to evaluate the expression perception capability of a model. A common practice for representing expressions is to map face images into a low-dimensional manifold, which describes the expressions without the disturbance of identity, pose, or illumination. Early works utilize the hidden features of the last or penultimate layer of a model trained on discrete expression classification tasks [25,37,51] as the expression representation; the extracted features mostly reflect the limited expression categories but neglect complicated and fine-grained facial expressions. Different from them, a compact and continuous embedding for representing facial expressions was proposed by Vemulapalli and Agarwala [36]. They construct a large-scale facial dataset annotated with expression similarity in a triplet manner. Through a large number of triplet comparisons, the trained expression embedding can perceive slight expression changes. To further reduce the identity influence, Zhang et al. [42] develop a Deviation Learning Network (DLN) with a two-branch structure to achieve a more compact and smooth expression embedding. The 3D Morphable Model (3DMM) [4,28] has been proposed to fit identity and expression parameters from a single face image. Expressions are represented as the coefficients of predefined blendshapes in the 3DMM. The estimated expression coefficients are then used for talking head synthesis [21,46], expression transfer [17,40], or face manipulation [11].
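As a rough illustration of the triplet-comparison training used by such embeddings, the following sketch shows a generic triplet loss; the margin, distance metric, and function name are assumptions made for illustration rather than the exact objective of [36] or [42].

```python
# Hedged sketch of a triplet loss for expression-similarity supervision:
# the anchor/positive pair is annotated as more similar in expression than
# the anchor/negative pair. Margin and metric are illustrative choices.
import torch
import torch.nn.functional as F

def expression_triplet_loss(emb_anchor, emb_pos, emb_neg, margin=0.2):
    """All inputs are (B, D) embedding tensors produced by the same encoder."""
    d_pos = F.pairwise_distance(emb_anchor, emb_pos)   # distance to the more-similar face
    d_neg = F.pairwise_distance(emb_anchor, emb_neg)   # distance to the less-similar face
    return F.relu(d_pos - d_neg + margin).mean()       # hinge on the distance gap
```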
3. Proposed Method
The architecture of the proposed GLEE-Net is shown in Figure 2. It takes an image as input and outputs a binary vector indicating the occurrence of each AU. The whole framework consists of a global branch, a local branch, a 3D global branch, and a Transformer classifier. The global branch extracts the full-face feature to model the full-face expression, while the local branch focuses on detailed local information. The two branches are pretrained on the FEC expression dataset [36] and then finetuned on the AU dataset to alleviate the issue of limited identities. To further enrich the 2D facial representations, the 3D global branch extracts expression coefficients through 3D face reconstruction. Finally, the Transformer classifier carries out the final AU detection from the combined features of the three branches with its powerful attention mechanism.
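As an illustrative sketch of this data flow (not the exact implementation), the three branch features can be treated as tokens and fused by a Transformer encoder before multi-label prediction; the module names, feature dimensions, and number of AUs below are assumptions.

```python
# Hedged sketch of a GLEE-Net-style forward pass: three feature branches are
# projected to a common dimension, treated as tokens, fused by a Transformer
# encoder, and mapped to per-AU occurrence probabilities.
# global_branch, local_branch and branch_3d are hypothetical placeholder modules.
import torch
import torch.nn as nn

class AUClassifier(nn.Module):
    def __init__(self, global_branch, local_branch, branch_3d,
                 dims=(512, 512, 64), d_model=256, num_aus=12):
        super().__init__()
        self.branches = nn.ModuleList([global_branch, local_branch, branch_3d])
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in dims])
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_aus)

    def forward(self, image):
        # One token per branch: [global expression, local expression, 3D coefficients]
        tokens = [proj(branch(image)) for branch, proj in zip(self.branches, self.proj)]
        tokens = torch.stack(tokens, dim=1)     # (B, 3, d_model)
        fused = self.fuse(tokens).mean(dim=1)   # attention lets tokens exchange information
        return torch.sigmoid(self.head(fused))  # per-AU occurrence probabilities
```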
3.1. Global Branch
Inspired by DLN [43], the global branch models the expression feature vector $V_{exp}$ as the deviation from the identity vector $V_{id}$. Specifically, the global branch consists of two siamese models, i.e., the face model and the identity model. The identity model and the face model are initialized with FaceNet [30] pretrained for a face recognition task [5]. Then, we fix the identity model and train the face model to learn the expression deviation. The extracted full-face expression feature vector $V_{exp}$ is obtained by:

$$V_{exp} = V_{face} - V_{id}. \quad (1)$$

The deviation model of the global branch benefits from an effective feature initialization that can alleviate the disturbance of expression-irrelevant information, such as identity, pose, etc. After a linear layer for dimension reduction, we obtain $G_{exp}$ as the global expression feature vector.
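A minimal sketch of this deviation computation is given below, assuming generic FaceNet-style backbones; the class name, feature dimension, and the size of the reduced embedding are illustrative assumptions.

```python
# Hedged sketch of the global branch: a frozen identity encoder and a trainable
# face encoder share the same initialization; their difference is the expression
# deviation, reduced by a linear layer to obtain G_exp.
import copy
import torch
import torch.nn as nn

class GlobalBranch(nn.Module):
    def __init__(self, facenet_backbone, feat_dim=512, exp_dim=16):
        super().__init__()
        self.face_model = facenet_backbone                      # trainable, produces V_face
        self.identity_model = copy.deepcopy(facenet_backbone)   # frozen, produces V_id
        for p in self.identity_model.parameters():
            p.requires_grad = False
        self.reduce = nn.Linear(feat_dim, exp_dim)               # dimension reduction

    def forward(self, image):
        v_face = self.face_model(image)
        with torch.no_grad():
            v_id = self.identity_model(image)
        v_exp = v_face - v_id                                    # Eq. (1): V_exp = V_face - V_id
        return self.reduce(v_exp)                                # G_exp
```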
3.2. Local Branch
We introduce a local branch to complement the global branch with more detailed local information, which is also beneficial for AU detection due to the local nature of AUs. First, we crop the image into 16 parts for local part extraction. Since the expression dataset contains a large number of in-the-wild images, it is hard to locate specific face regions accurately. Therefore, we choose to crop the image according to the whole image area instead of facial landmarks. Specifically, we crop three-quarters of the image from the left, right, top and bottom, and call these crops L34, R34, T34 and B34, respectively. Similarly, we crop half from the
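The four three-quarter crops can be sketched as follows; the exact pixel boundaries and any crops beyond L34, R34, T34 and B34 are assumptions made for illustration.

```python
# Hedged sketch of position-based cropping (no landmarks): keep three-quarters
# of the image from the left, right, top and bottom, respectively. Additional
# crops (e.g., half-size ones) would follow the same pattern.
from PIL import Image

def three_quarter_crops(img: Image.Image) -> dict:
    w, h = img.size
    return {
        "L34": img.crop((0, 0, int(0.75 * w), h)),   # left three-quarters
        "R34": img.crop((int(0.25 * w), 0, w, h)),   # right three-quarters
        "T34": img.crop((0, 0, w, int(0.75 * h))),   # top three-quarters
        "B34": img.crop((0, int(0.25 * h), w, h)),   # bottom three-quarters
    }
```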