Global-to-local Expression-aware Embeddings
for Facial Action Unit Detection
Rudong An, Wei Zhang, Hao Zeng, Wei Chen, Zhigang Deng, Yu Ding
Abstract—Expressions and facial action units (AUs) are two levels of facial behavior descriptors. Expression auxiliary information has
been widely used to improve the AU detection performance. However, most existing expression representations utilized in AU detection
works can only describe pre-determined discrete categories (e.g., Angry, Disgust, Happy, Sad, etc.) and cannot capture subtle
expression transformations like AUs. In this paper, we propose a novel fine-grained Global Expression representation Encoder to
capture subtle and continuous facial movements, to promote AU detection. To obtain such a global expression representation, we
propose to train an expression embedding model on a large-scale expression dataset according to global expression similarity.
Moreover, considering the local definition of AUs, it is essential to extract local AU features. Therefore, we design a Local AU Features
Module to generate local facial features for each AU. Specifically, it consists of an AU feature map extractor and a corresponding AU
mask extractor. First, the two extractors transform the global expression representation into AU feature maps and masks, respectively.
Then, AU feature maps and their corresponding AU masks are multiplied to generate AU masked features that focus on local facial
regions. Finally, the AU masked features are fed into an AU classifier to judge AU occurrence. Extensive experimental results
demonstrate the superiority of our proposed method: it consistently outperforms previous works and achieves state-of-the-art
performance on widely-used face datasets, including BP4D, DISFA, and BP4D+.
Index Terms—Facial action coding, facial action unit detection, facial expression recognition, expression-aware embedding, deep
learning
1 INTRODUCTION
Facial Affect Analysis (FAA) is an active research area in
the computer vision and affective computing communities.
FAA includes two levels of description, namely expressions
and Facial Action Units (AUs). As a universal non-verbal
means of human communication, facial expressions
can be described via a combination of AUs [1]. Facial AUs,
coded by the Facial Action Coding System (FACS) [1],
include 32 atomic facial action descriptors based on anatom-
ical facial muscle groups. Table 1 summarizes the informa-
tion of the 15 AUs used in this paper, including their names
and corresponding involved muscles. Each AU defines the
movement of a specific face region. For example, AU1 (Inner
Brow Raiser) is a descriptor focused on the medial part of
the frontalis. AUs and expression are highly related to each
other. Nearly any possible expression can be described as a
specific combination of facial AUs. For instance, as shown
in Figure 1, a happy expression can be achieved with the
occurrence of both AU6 and AU12; a doubt expression is
related to AU4, and a surprise expression is related to AU1
and AU26. From this perspective, AU features are often
considered as a kind of facial expression representation.
In fact, AUs play an important role both in conveying human
emotions and in automatic expression analysis. AU detection
has attracted much attention in recent years due to its
R. An, W. Zhang, H. Zeng, W. Chen, and Y. Ding are with the Netease
Fuxi AI Lab, Hangzhou, China.
Z. Deng is with the Department of Computer Science, University of
Houston, Houston, Texas, USA.
Manuscript received xxx xx, 2022; revised xxxx xx, 20xx.
Fig. 1. Examples of three expressions and their corresponding AUs.
From left to right: doubt, happiness, surprise. An expression is a global
description of facial muscle movements, while AUs are individual local
descriptions of muscle motions.
wide applications, including emotion recognition [2], micro-
expression detection [3], and mental health diagnosis [4].
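To make the expression-AU correspondence above concrete, it can be read as a simple lookup from a coarse expression label to a set of AUs. The sketch below encodes only the three examples from Figure 1 and is purely illustrative; it is not part of the proposed method and is not an exhaustive FACS coding.

```python
# Illustrative mapping from coarse expression labels to the AU
# combinations mentioned in Figure 1 (not an exhaustive FACS coding).
EXPRESSION_TO_AUS = {
    "happiness": ["AU6", "AU12"],  # Cheek Raiser + Lip Corner Puller
    "doubt": ["AU4"],              # Brow Lowerer
    "surprise": ["AU1", "AU26"],   # Inner Brow Raiser + Jaw Drop
}

for expression, aus in EXPRESSION_TO_AUS.items():
    print(f"{expression}: {' + '.join(aus)}")
```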
AU detection is challenging due to the lack of annotated
data and difficulty in capturing local features. Annotating
AUs is tedious: manually labeling one hundred images takes
hours, even for well-trained experts [5]. Therefore, it is
time-consuming and laborious to obtain large-scale,
well-annotated AU data [6] [7]. The lack of labeled
data, on the one hand, often makes the training of deep
networks suffer from over-fitting. On the other hand,
the AU datasets usually contain limited identities, dozens
[8] [9] or hundreds [10], which also leads to identity over-
fitting [11] [12] [13]. Compared with AU datasets, there are
large amounts of accessible and easy-to-annotate expression data,
such as FEC [14] and AffectNet [15], which are beneficial for
improving AU detection performance due to their close cor-
relations [6] [7] [11]. However, simply regarding expressions
as several rough discrete classes is sub-optimal for learning
subtle expression transformations [16] [17]. By contrast, in
this paper we measure the subtle distinctions of expres-
sions through similarity learning, leveraging continuous
and compact expression features for AU representations.
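As a rough illustration of what such similarity learning can look like in practice, the snippet below trains an embedding with a standard triplet margin loss on (anchor, positive, negative) expression triplets, where the anchor and positive show more similar expressions than the negative. The backbone, embedding size, and margin are illustrative placeholders, not the configuration used in this paper.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Minimal sketch of triplet-based expression similarity learning.
# Backbone, embedding size, and margin are illustrative choices.
class ExpressionEncoder(nn.Module):
    def __init__(self, embed_dim: int = 16):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.backbone = backbone

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so distances are comparable across samples.
        return nn.functional.normalize(self.backbone(x), dim=-1)

encoder = ExpressionEncoder()
triplet_loss = nn.TripletMarginLoss(margin=0.2)

# One hypothetical training step on a batch of expression triplets.
anchor, positive, negative = (torch.randn(8, 3, 256, 256) for _ in range(3))
loss = triplet_loss(encoder(anchor), encoder(positive), encoder(negative))
loss.backward()
```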
As a facial AU is defined anatomically according to
the movements of its corresponding facial muscles, it is
inherently related to local facial regions, a property termed the AU's
locality. Consequently, how to effectively extract local features
from an input face image is a key yet unresolved challenge,
because local facial deformations are not only subtle and
transient [5] [18] but also vary across individuals [11] [12]. The
straightforward approach is to divide the input image into
several local parts in a pre-defined way, which are then fed
into networks to capture local information, known as patch
learning [19] [20] [21]. These methods pre-define AU local
regions with prior knowledge, cropping a face image into
several regular yet coarse-grained patches. Consequently,
the feature extractor is limited by the quality of the cropped
patches.
Another category of approaches is to utilize struc-
tured geometric information like landmarks to crop more
fine-grained local partitions such as Regions Of Interest
(ROIs) [22] [23] or attention maps [24] [25] [26] [27] [28].
Some recent works [25] [26] [27] focus on learning attention
maps based on facial landmarks in a supervised manner.
The attention maps are usually initialized by pre-defined
AU centers based on landmarks and treated as the ground
truth during training. However, facial AUs occur in specific
locations but are not limited to landmark points in most
cases [29]. Hence, AU regions or centers pre-defined in a
uniform way (either by landmarks or prior domain knowledge)
are generally error-prone. For example, many prior
AU patch generation procedures are fixed for various head
poses, while landmark detection tends to deviate under
large head rotations, leading to inaccurate AU patches or
attention [21]. In sum, these approaches are sub-optimal in
two aspects. First, AU centers based on domain knowledge
and landmarks are limited to landmark positions and/or
pre-defined rules [29] [24]. Second, landmark detection er-
rors often influence the final performance [29] [24]. There-
fore, capturing local features based on manually pre-defined
rules with coarse guidance may limit the capability of the
local feature extractor. Instead, the sensitive position of
each AU in the face feature maps should be determined by the AU
detection task and the training dataset. In other words, the
attention map learning process should be data-driven.
Motivated by the above observations, we propose a
novel AU detection framework, the Global-to-Local Expression-aware
Network (GTLE-Net), to address the above-mentioned challenges.
Firstly, for the data and identity over-fitting challenge,
we pre-train a global facial expression representation
encoder (GEE) to extract identity-invariant global expression
features on a large-scale facial expression dataset, FEC [14].
Instead of roughly classifying expressions into several discrete
categories, we project expressions into a continuous
representation space, which proves more suitable and effective
for AU detection in our experiments, since AUs are intrinsically
subtle and continuous. Secondly, for the local feature
capturing challenge, we design a local AU features module
(LAM) with two extractors to produce an AU mask and an AU feature
map for each AU, respectively. Particularly, the AU mask
extractor is learned without any intermediate supervision.
TABLE 1
The descriptions and involved muscles of the AUs used in this work.

AU Name | Description           | Involved Facial Muscle(s)
AU1     | Inner Brow Raiser     | Frontalis, pars medialis
AU2     | Outer Brow Raiser     | Frontalis, pars lateralis
AU4     | Brow Lowerer          | Depressor Glabellae
AU6     | Cheek Raiser          | Orbicularis oculi
AU7     | Lid Tightener         | Orbicularis oculi
AU9     | Nose Wrinkler         | Levator labii superioris alaquae nasi
AU10    | Upper Lip Raiser      | Levator Labii Superioris
AU12    | Lip Corner Puller     | Zygomatic Major
AU14    | Dimpler               | Buccinator
AU15    | Lip Corner Depressor  | Depressor anguli oris
AU17    | Chin Raiser           | Mentalis
AU23    | Lip Tightener         | Orbicularis oris
AU24    | Lip Pressor           | Orbicularis oris
AU25    | Lips part             | Depressor labii inferioris
AU26    | Jaw Drop              | Masseter
Then, by multiplying the AU feature maps and the corresponding
AU masks, we obtain the AU masked features, which filter out
irrelevant regions and retain informative ones. In this way,
our framework can not only make full use of large-scale,
well-annotated expression data but also learn attention maps
adaptively without extra supervision, to promote AU detection.
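A minimal sketch of this mask-and-multiply idea is given below, assuming the two extractors are lightweight convolutional heads on top of a shared global feature map; the layer shapes, the sigmoid mask activation, and the average pooling are illustrative assumptions rather than the exact LAM architecture.

```python
import torch
import torch.nn as nn

# Minimal sketch of the mask-and-multiply idea behind LAM, assuming
# simple 1x1 conv heads on a shared global feature map; layer sizes and
# the sigmoid mask activation are illustrative, not the exact design.
class LocalAUFeatures(nn.Module):
    def __init__(self, in_channels: int, num_aus: int, au_channels: int = 32):
        super().__init__()
        self.num_aus, self.au_channels = num_aus, au_channels
        # AU feature map extractor: one feature map stack per AU.
        self.feature_head = nn.Conv2d(in_channels, num_aus * au_channels, 1)
        # AU mask extractor: one spatial mask per AU, learned without
        # intermediate supervision (no landmark-based ground truth).
        self.mask_head = nn.Conv2d(in_channels, num_aus, 1)

    def forward(self, global_feat: torch.Tensor) -> torch.Tensor:
        b, _, h, w = global_feat.shape
        feats = self.feature_head(global_feat).view(
            b, self.num_aus, self.au_channels, h, w)
        masks = torch.sigmoid(self.mask_head(global_feat)).unsqueeze(2)
        # Element-wise product keeps AU-relevant regions, suppresses the rest.
        masked = feats * masks
        # Pool to one descriptor per AU for the downstream AU classifier.
        return masked.mean(dim=(-2, -1))  # (b, num_aus, au_channels)

# Usage: 15 AUs on a hypothetical 512-channel, 16x16 global feature map.
lam = LocalAUFeatures(in_channels=512, num_aus=15)
au_features = lam(torch.randn(2, 512, 16, 16))  # -> shape (2, 15, 32)
```

Because the masks receive no intermediate supervision, they are shaped only by the downstream AU detection loss, which is the data-driven behavior argued for above.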
To sum up, the main contributions of this paper can be
summarized below:
(1) A global expression feature encoder (GEE) is pre-
trained with structured triplet data. This step
proves essential for improving AU detection.
(2) An effective attention learning method, the local AU
features module (LAM), is proposed to capture AU-
specific representations, which is crucial for obtaining
more informative local features and boosting AU
detection performance. Visual analyses show that
LAM adaptively produces highly AU-related local
information.
(3) Through extensive experiments, we validate the ef-
fectiveness and accuracy of the proposed method on
three widely-used benchmark datasets, outperform-
ing the state-of-the-art significantly.
The remainder of this paper is organized as follows.
Related works are presented in Section 2. The details of
our method are described in Section 3. Comprehensive
experiment results are reported in Section 4. Section 5 pro-
vides a series of visual analyses. Discussion and concluding
remarks are provided in Section 6. Our code will be released
if accepted.
2 RELATED WORK
In recent years, researchers have developed many approaches
for AU detection, including multi-task learning and attention
mechanisms, and have introduced auxiliary information
such as landmarks, textual descriptions, and expressions. We
describe previous related works below.
2.1 AU Detection with Auxiliary Information
Due to the high labor cost of AU annotations, the scale
and subject variations of AU detection datasets are usu-
ally limited. As a result, previous AU detection methods
resort to various kinds of auxiliary information to improve
the generalization performance. Facial landmarks are the
most widely used pre-trained features for AU detection.
JPML [30] employs facial landmark features to crop local
image patches for different AUs. EAC-Net [25] constructs
local regions of interest and spatial attention maps from the
facial landmarks. LP-Net [12] trains an individual-specific
shape regularization network from the detected facial land-
marks. JÂA-Net [26] jointly performs AU detection and
facial landmark detection from the data annotated with both
labels.
Other types of auxiliary information have also been
explored. Zhao et al. [31] pre-train a weakly supervised
embedding from a large number of web images. Cui et al.
[16] summarize the prior probabilities of AU occurrences as
generic knowledge. Emotions and AUs are jointly optimized
under the prior probabilities. Recently, SEV-Net [28] lever-
ages textual descriptions of AU occurrences by employing
a pre-trained word embedding to obtain auxiliary textual
features. To enhance the generalization of our model, our
method introduces a pre-trained expression embedding as
auxiliary information.
2.2 AU Feature Learning
Due to the local definitions of AUs, it is essential to extract
local AU features. As such, some researchers proposed
to obtain local information through patch learning. For
instance, Zhong et al. [19] and Liu et al. [20] preprocess
an input image into uniform patches before encoding to
analyze facial expressions. Taking the head pose into con-
sideration, Onal et al. [21], [32] first register the 3D head
pose to reduce the effect of head movements and then crop
AU-specific local facial patches for AU recognition.
Besides, it is a common practice to use
attention mechanisms to highlight the features at the facial
AU-based positions. Facial landmarks, as sparse facial
geometric features, are well suited to serve as a
supervised attention prior. EAC-Net [25] creates fixed at-
tention maps related to the correlations between AUs and
landmarks. JÂA-Net [26] jointly performs AU detection and
facial landmark detection, and the predicted landmarks are
used to compute the attention map for each AU. Jacob et
al. [27] propose a multi-task method that combines the tasks
of AU detection and landmark-based attention map predic-
tion. ARL [24] proposes channel-wise and spatial attention
learning for each AU, and a pixel-level relation learned by
a CRF further refines the spatial attention. Beyond
facial landmarks, SEV-Net [28] utilizes the textual descrip-
tions of local details to generate a regional attention map.
In this way, it highlights the local parts of global features.
However, it requires extra textual annotations. In contrast, our
work proposes a pixel-wise self-attention map that is learned
in a data-driven manner without supervision. Our experiments
show that our attention maps are superior to attention maps
derived from prior knowledge.
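For contrast with such data-driven attention, the snippet below sketches the kind of pre-defined attention map used by landmark-based methods: a fixed Gaussian bump around a rule-derived AU center. The offset rule and bandwidth here are hypothetical placeholders, included only to make the discussed limitation concrete; they do not reproduce any specific prior work.

```python
import numpy as np

# Sketch of a pre-defined, landmark-derived attention map of the kind
# used by prior works; the AU-center rule and bandwidth are hypothetical.
def gaussian_attention_map(au_center, size: int = 64, sigma: float = 4.0) -> np.ndarray:
    ys, xs = np.mgrid[0:size, 0:size]
    cy, cx = au_center
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))

# Hypothetical rule: place the AU1 (Inner Brow Raiser) center a fixed
# offset above the detected inner-brow landmark. If the landmark detector
# deviates (e.g., under a large head rotation), the attention map follows
# the error, which is exactly the drawback discussed above.
inner_brow_landmark = (24.0, 30.0)             # (y, x) on a 64x64 feature map
au1_center = (inner_brow_landmark[0] - 4.0, inner_brow_landmark[1])
attention = gaussian_attention_map(au1_center)  # fixed, not data-driven
```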
2.3 Expression Representations
Expressions are essential auxiliary information for AU de-
tection. Some prior works leverage amounts of accessible
expression data to enhance AU detection [11] [7] [31] [16].
Recently, Chang et al. [33] utilize a large number of unlabeled
images to train a representation encoder that extracts
local representations and projects them into a low-dimensional
latent space, and then improve network performance through
contrastive learning. Many methods map face images into a
low-dimensional manifold for subject-independent expres-
sion representations. Early works [15] [34] [35] train the em-
beddings for discrete emotion classification tasks but neglect
the facial expression variations within each class. The 3D
Morphable Model (3DMM) [36] [37] has been proposed to
fit identities and expression parameters from a single face
image. Expressions are represented as the coefficients of pre-
defined blendshapes in 3DMM. The estimated expression
coefficients are then used for talking head synthesis [38]
[39], expression transfer [40] [41], and face manipulation
[42]. However, the estimated expression coefficients have
weaknesses in representing fine-grained expressions. To
solve the problem, Vemulapalli and Agarwala [14] proposed
a compact embedding for complicated and subtle facial
expressions, where facial expression similarity is defined
through triplet annotations. Zhang et al. [43] proposed a
Deviation Learning Network (DLN) to remove the identity
information from continuous expression embeddings, and
thus achieve more compact and smooth representations.
3 PROPOSED METHOD
In this section, we first briefly introduce the problem defini-
tion and then describe our proposed GTLE-Net framework.
3.1 Problem Definition
The task of AU detection is to predict the AU occurrence
probabilities $[o_1, o_2, \dots, o_N]$ given an input face image with
a resolution of $256 \times 256$. As described in Section 1, expressions
and AUs are two means of description at the global
and local levels, respectively. They are highly related and
thus can be utilized to promote each other. Moreover, facial
expression data are easier to obtain and annotate, while AUs
are more subtle and difficult to annotate. Intuitively, facial
AU information can be extracted from facial expression
representations when the expression representation is
adequately fine-grained. Considering the above factors,
we propose a global expression encoder (GEE) and pre-train it
on a large-scale facial expression dataset, aiming to obtain
a powerful and robust expression representation. Furthermore,
to capture local AU features, we propose a local AU features
module (LAM) constructed with two extractors, namely, the AU
mask extractor ($E_m$) and the AU feature map extractor ($E_a$).
The framework is illustrated in Fig. 2. In the following, we
discuss each main module in detail.
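To make the data flow just described concrete, the self-contained sketch below wires the stages together in order: global expression features from GEE, per-AU feature maps from $E_a$ and masks from $E_m$, masked features, and a per-AU classifier. The backbone, channel sizes, and heads are illustrative stand-ins, not the actual GTLE-Net implementation.

```python
import torch
import torch.nn as nn

# Self-contained sketch of the overall data flow described above:
# image -> global expression features (GEE) -> per-AU masks and feature
# maps (E_m, E_a) -> masked features -> per-AU occurrence probabilities.
# Layer choices are assumptions, not the exact GTLE-Net implementation.
class GTLENetSketch(nn.Module):
    def __init__(self, num_aus: int = 15, au_channels: int = 32):
        super().__init__()
        self.num_aus, self.au_channels = num_aus, au_channels
        # GEE stand-in: a small conv stack producing a global feature map.
        self.gee = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(64, 256, 3, stride=4, padding=1), nn.ReLU(),
        )
        # E_a: AU feature map extractor; E_m: AU mask extractor.
        self.e_a = nn.Conv2d(256, num_aus * au_channels, 1)
        self.e_m = nn.Conv2d(256, num_aus, 1)
        # One binary classifier per AU, applied after pooling the
        # masked features into one descriptor per AU.
        self.classifier = nn.Linear(au_channels, 1)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        b = image.size(0)
        g = self.gee(image)                           # global expression features
        h, w = g.shape[-2:]
        feats = self.e_a(g).view(b, self.num_aus, self.au_channels, h, w)
        masks = torch.sigmoid(self.e_m(g)).unsqueeze(2)
        pooled = (feats * masks).mean(dim=(-2, -1))   # (b, num_aus, au_channels)
        logits = self.classifier(pooled).squeeze(-1)  # (b, num_aus)
        return torch.sigmoid(logits)                  # AU occurrence probabilities

# Usage on a 256x256 face crop, as in the problem definition above.
model = GTLENetSketch()
probs = model(torch.randn(1, 3, 256, 256))  # -> tensor of shape (1, 15)
```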