crops 5 patches from a full face based on landmarks and
feeds them with the full face into networks for learning
the global and local features. ROI-Net [22] also designs a predefined landmark-based cropping rule to crop inner feature maps. These methods usually suffer from performance degradation in the wild due to erroneous landmark estimation.
To equip the model with local features, multi-task methods usually combine AU detection with landmark detection [3,31] or landmark-based attention map prediction [13]. In this way, models can extract global features from full faces while also focusing on local details around the landmarks for better AU detection. However, these methods ignore that landmarks also contain rich identity information [14], which may aggravate identity overfitting.
SEV-Net [38] proposes to utilize textual descriptions of local details to generate a regional attention map, which highlights the local parts of the global features. However, it requires extra annotations for the descriptions. In addition, none of these previous works take the removal of identity disturbance from the global features into account.
Different from the above works, our carefully designed global branch is dedicated to eliminating identity disturbance, and the patches for our local branch are cropped at fixed image positions instead of being located by landmarks.
2.3. Expression Representations
Action units reflect facial expression information, and a model's perception of expression plays a crucial role in AU detection. The expression representation can be used to evaluate the expression perception capability of a model. A common practice for representing expressions is to map face images into a low-dimensional manifold that describes expressions without the disturbance of identity, pose, or illumination. Early works use the hidden features of the last or penultimate layer of a model trained on discrete expression classification [25,37,51] as the expression representation; such representations mainly capture the limited expression categories but neglect complicated and fine-grained facial expressions. Differently, Vemulapalli and Agarwala [36] propose a compact and continuous embedding for representing facial expressions. They construct a large-scale facial dataset annotated with expression similarity in a triplet manner, and through a large number of triplet comparisons, the trained expression embedding can perceive slight expression changes. To further reduce the influence of identity, Zhang et al. [42] develop a Deviation Learning Network (DLN) with a two-branch structure to achieve a more compact and smooth expression embedding. 3D Morphable Models (3DMMs) [4,28] have been proposed to fit identity and expression parameters from a single face image, where expressions are represented as the coefficients of predefined blendshapes. The estimated expression coefficients are then used for talking head synthesis [21,46], expression transfer [17,40], or face manipulation [11].
3. Proposed Method
The architecture of the proposed GLEE-Net is shown in Figure 2. It takes an image as input and outputs a binary vector indicating the occurrence of each AU. The whole framework consists of a global branch, a local branch, a 3D global branch, and a Transformer classifier. The global branch extracts a full-face feature to model the overall facial expression, while the local branch focuses on detailed local information. These two branches are pretrained on the FEC expression dataset [36] and then finetuned on the AU dataset to alleviate the issue of limited identities. To further enrich the 2D facial representations, the 3D global branch extracts expression coefficients through 3D face reconstruction. Finally, the Transformer classifier performs the AU detection on the combined features of the three branches with the attention mechanism, as sketched below.
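As a rough illustration of this layout, here is a minimal PyTorch-style sketch. The branch encoders are placeholders for the actual FaceNet-based 2D branches and the 3DMM coefficient extractor; the feature size, AU count, and mean-pooling fusion over branch tokens are our assumptions:

```python
import torch
import torch.nn as nn

class GLEENet(nn.Module):
    """Sketch of the three-branch layout with a Transformer classifier.
    All module internals and dimensions are illustrative assumptions."""

    def __init__(self, feat_dim=128, num_aus=12, num_layers=2, num_heads=4):
        super().__init__()
        # Placeholder encoders: each maps an input image to a feature vector.
        self.global_branch = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))
        self.local_branch = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))
        self.branch_3d = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads,
                                           batch_first=True)
        self.classifier = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(feat_dim, num_aus)

    def forward(self, img):
        # One feature token per branch; in the real model the local branch
        # sees cropped patches and the 3D branch sees expression coefficients.
        tokens = torch.stack([self.global_branch(img),
                              self.local_branch(img),
                              self.branch_3d(img)], dim=1)  # (B, 3, D)
        fused = self.classifier(tokens).mean(dim=1)         # (B, D)
        return torch.sigmoid(self.head(fused))              # per-AU occurrence

probs = GLEENet()(torch.randn(2, 3, 224, 224))  # -> (2, 12)
```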
3.1. Global Branch
Inspired by DLN [43], the global branch models the expression feature vector $V_{exp}$ as the deviation from the identity vector $V_{id}$. Specifically, the global branch consists of two siamese models, i.e., a face model and an identity model. Both models are initialized with FaceNet [30] pretrained on a face recognition task [5]. Then, we fix the identity model and train the face model to learn the expression deviation. The extracted full-face expression feature vector $V_{exp}$ is obtained by:

$V_{exp} = V_{face} - V_{id}$.  (1)

The deviation model of the global branch benefits from an effective feature initialization that can alleviate the disturbance of expression-irrelevant information, such as identity and pose. After a linear layer for dimension reduction, we obtain the global expression feature vector $G_{exp}$.
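A minimal sketch of this deviation computation is given below, assuming both siamese models are copies of one pretrained backbone; `backbone`, `proj`, and the dimensions are illustrative placeholders rather than the exact configuration:

```python
import copy
import torch.nn as nn

def build_global_branch(backbone: nn.Module, feat_dim: int, out_dim: int):
    """Two siamese copies of a pretrained face-recognition backbone:
    a frozen identity model and a trainable face model, plus a linear
    layer for dimension reduction. Sizes are illustrative."""
    identity_model = copy.deepcopy(backbone)
    for p in identity_model.parameters():
        p.requires_grad = False              # identity model stays fixed
    face_model = copy.deepcopy(backbone)     # trained to learn the deviation
    proj = nn.Linear(feat_dim, out_dim)
    return identity_model, face_model, proj

def global_expression_feature(img, identity_model, face_model, proj):
    v_id = identity_model(img)    # identity vector V_id
    v_face = face_model(img)      # face vector V_face
    v_exp = v_face - v_id         # Eq. (1): expression as deviation
    return proj(v_exp)            # dimension reduction -> G_exp
```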
3.2. Local Branch
We introduce a local branch to complement the global branch with more detailed local information, which is also beneficial for AU detection due to the local nature of AUs. First, we crop the image into 16 parts for local part extraction. Since the expression dataset contains a large number of in-the-wild images, it is hard to locate specific face regions accurately. Therefore, we crop the image according to the whole image area instead of facial landmarks. Specifically, we crop three-quarters of the image from the left, right, top, and bottom, and call the resulting patches L34, R34, T34, and B34, respectively (see the sketch below). Similarly, we crop half of the image from each of the four directions
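As an illustration of this landmark-free rule, here is a minimal sketch of the directional crops using PIL; the exact crop geometry is our reading of the description above and may differ from the original implementation in detail:

```python
from PIL import Image

def directional_crops(img: Image.Image, ratio: float = 0.75):
    """Crop `ratio` of the image from the left, right, top, and bottom;
    ratio=0.75 gives the L34/R34/T34/B34 patches, ratio=0.5 the halves."""
    w, h = img.size
    cw, ch = int(w * ratio), int(h * ratio)
    return {
        "L": img.crop((0, 0, cw, h)),      # keep the left `ratio` of width
        "R": img.crop((w - cw, 0, w, h)),  # keep the right `ratio` of width
        "T": img.crop((0, 0, w, ch)),      # keep the top `ratio` of height
        "B": img.crop((0, h - ch, w, h)),  # keep the bottom `ratio` of height
    }
```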