such as landmarks, textual descriptions, and expressions. We
describe previous related works below.
2.1 AU Detection with Auxiliary Information
Due to the high labor cost of AU annotations, the scale
and subject variations of AU detection datasets are usu-
ally limited. As a result, previous AU detection methods
resort to various kinds of auxiliary information to improve
the generalization performance. Facial landmarks, typically produced by pre-trained detectors, are the most widely used auxiliary information for AU detection.
JPML [30] employs facial landmark features to crop local
image patches for different AUs. EAC-Net [25] constructs
local regions of interest and spatial attention maps from the
facial landmarks. LP-Net [12] trains an individual-specific
shape regularization network from the detected facial land-
marks. J ˆ
AA-Net [26] jointly performs AU detection and
facial landmark detection from the data annotated with both
labels.
Other types of auxiliary information have also been
explored. Zhao et al. [31] pre-train a weakly supervised
embedding from a large number of web images. Cui et al.
[16] summarize the prior probabilities of AU occurrences as generic knowledge and jointly optimize emotions and AUs under these priors. Recently, SEV-Net [28] lever-
ages textual descriptions of AU occurrences by employing
a pre-trained word embedding to obtain auxiliary textual
features. To enhance generalization, our method introduces a pre-trained expression embedding as the auxiliary information.
2.2 AU Feature Learning
Due to the local definitions of AUs, it is essential to extract
local AU features. As such, some researchers have proposed to obtain local information through patch learning. For instance, Zhong et al. [19] and Liu et al. [20] divide an input image into uniform patches before encoding to
analyze facial expressions. Taking head pose into consideration, Onal et al. [21], [32] register the 3D head pose to reduce the influence of head movements and then crop AU-specific local facial patches for recognition. Besides, it is common practice to use attention mechanisms to highlight features at AU-related facial positions. Facial landmarks, which provide sparse facial geometric cues, are well suited to serve as a supervised attention prior. EAC-Net [25] creates fixed at-
tention maps related to the correlations between AUs and
landmarks. JÂA-Net [26] jointly performs AU detection and
facial landmark detection, and the predicted landmarks are
used to compute the attention map for each AU. Jacob et
al. [27] propose a multi-task method that combines the tasks
of AU detection and landmark-based attention map predic-
tion. ARL [24] learns channel-wise and spatial attention for each AU, and refines the spatial attention with a pixel-level relation learned by a CRF. Besides facial landmarks, SEV-Net [28] utilizes textual descriptions of local details to generate a regional attention map, thereby highlighting the local parts of the global features.
However, it requires extra annotations for the descriptions. In contrast, our work learns a pixel-wise self-attention map in a purely data-driven manner, without extra supervision. Our experiments demonstrate that this attention map is superior to attention maps derived from prior knowledge.
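For illustration, the sketch below shows a generic pixel-wise spatial self-attention block in the non-local style, where every pixel attends to every other pixel without any landmark or textual supervision. The class name, channel-reduction factor, and residual connection are assumptions made for this example and do not reflect our exact architecture.

```python
import torch
import torch.nn as nn

class PixelwiseSelfAttention(nn.Module):
    """Non-local style spatial self-attention: every pixel attends to all pixels."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        reduced = max(channels // reduction, 1)
        self.query = nn.Conv2d(channels, reduced, kernel_size=1)
        self.key = nn.Conv2d(channels, reduced, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                                # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)     # (B, HW, C')
        k = self.key(x).flatten(2)                       # (B, C', HW)
        attn = torch.softmax(q @ k, dim=-1)              # (B, HW, HW) data-driven weights
        v = self.value(x).flatten(2).transpose(1, 2)     # (B, HW, C)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return x + out                                   # residual connection
```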
2.3 Expression Representations
Expressions are essential auxiliary information for AU de-
tection. Some prior works leverage large amounts of accessible expression data to enhance AU detection [11], [7], [31], [16].
Recently, Chang et al. [33] utilize a large number of unlabeled images to train an encoder that extracts local representations and projects them into a low-dimensional latent space, and then improve performance through contrastive learning. Many methods map face images into a
low-dimensional manifold for subject-independent expres-
sion representations. Early works [15], [34], [35] train the em-
beddings for discrete emotion classification tasks but neglect
the facial expression variations within each class. The 3D
Morphable Model (3DMM) [36], [37] has been proposed to fit identity and expression parameters from a single face
image. Expressions are represented as the coefficients of pre-
defined blendshapes in 3DMM. The estimated expression
coefficients are then used for talking head synthesis [38], [39], expression transfer [40], [41], and face manipulation
[42]. However, the estimated expression coefficients have
weaknesses in representing fine-grained expressions. To
solve this problem, Vemulapalli and Agarwala [14] propose
a compact embedding for complicated and subtle facial
expressions, where facial expression similarity is defined
through triplet annotations. Zhang et al. [43] propose a
Deviation Learning Network (DLN) to remove the identity
information from continuous expression embeddings, and
thus achieve more compact and smooth representations.
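To illustrate how triplet annotations can supervise a compact expression embedding, the sketch below shows a standard triplet margin loss on L2-normalized embeddings. The function name, margin value, and embedding dimension are illustrative assumptions rather than the exact formulations of [14] or [43].

```python
import torch
import torch.nn.functional as F

def expression_triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss on L2-normalized expression embeddings.

    The anchor/positive pair is annotated as more similar in expression
    than the anchor/negative pair; the margin value is an assumption.
    """
    anchor, positive, negative = (F.normalize(t, dim=-1)
                                  for t in (anchor, positive, negative))
    d_pos = (anchor - positive).pow(2).sum(dim=-1)   # squared distance to positive
    d_neg = (anchor - negative).pow(2).sum(dim=-1)   # squared distance to negative
    return F.relu(d_pos - d_neg + margin).mean()     # pull positives, push negatives

# Example usage with random 16-D embeddings (dimension is illustrative).
a, p, n = (torch.randn(8, 16) for _ in range(3))
loss = expression_triplet_loss(a, p, n)
```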
3 PROPOSED METHOD
In this section, we first briefly introduce the problem defini-
tion and then describe our proposed GTLE-Net framework.
3.1 Problem Definition
The task of AU detection is to predict the AU occurrence probabilities $[o_1, o_2, \ldots, o_N]$ given an input face image with a resolution of $256 \times 256$. As described in Section 1, expressions and AUs describe the face at the global and local levels, respectively. They are highly related and can thus be utilized to promote each other. Moreover, facial expression data are easier to obtain and annotate, while AUs are more subtle and difficult to annotate. Intuitively, facial AU information can be extracted from facial expression representations when the expression representation is sufficiently fine-grained. Considering the above factors,
we propose a global expression encoder (GEE) and pre-train it on a large-scale facial expression dataset, aiming to obtain a powerful and robust expression representation. Furthermore, to capture local AU features, we propose a local AU features module (LAM) constructed with two extractors, namely, the AU mask extractor ($E_m$) and the AU feature map extractor ($E_a$). The framework is illustrated in Fig. 2. In the following, we discuss each main module in detail.
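To make the problem definition concrete, the sketch below shows the standard multi-label formulation that maps fused features to the $N$ occurrence probabilities with per-AU sigmoids and a binary cross-entropy loss. The class name, feature dimension, batch size, and loss configuration are illustrative assumptions and not necessarily the exact setup of GTLE-Net.

```python
import torch
import torch.nn as nn

class AUOccurrenceHead(nn.Module):
    """Multi-label head: one logit per AU; sigmoid yields occurrence probabilities o_1..o_N."""
    def __init__(self, feat_dim: int, num_aus: int):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_aus)

    def forward(self, fused):                      # fused: (B, feat_dim) global/local features
        return self.fc(fused)                      # raw logits; apply sigmoid at inference

# Illustrative dimensions only (not the paper's exact configuration).
head = AUOccurrenceHead(feat_dim=512, num_aus=12)
criterion = nn.BCEWithLogitsLoss()                 # per-AU binary cross-entropy
features = torch.randn(8, 512)                     # placeholder for fused features
labels = torch.randint(0, 2, (8, 12)).float()      # ground-truth AU occurrences
loss = criterion(head(features), labels)
probs = torch.sigmoid(head(features))              # predicted [o_1, ..., o_N] per image
```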