
ATTENTION BASED RELATION NETWORK FOR FACIAL ACTION UNITS RECOGNITION
Yao Wei, Haoxiang Wang*, Mingze Sun, Jiawang Liu
South China University of Technology, China
ABSTRACT
Facial action unit (AU) recognition is essential to facial expression analysis. Since there are strong positive or negative correlations between AUs, some existing AU recognition works have focused on modeling AU relations. However, previous relation-based approaches typically embed predefined rules into their models and ignore the variation of AU relations across different crowds. In this paper, we propose a novel Attention Based Relation Network (ABRNet) for AU recognition, which automatically captures AU relations without unnecessary or even disturbing predefined rules. ABRNet uses several relation learning layers to automatically capture different AU relations. The learned AU relation features are then fed into a self-attention fusion module, which refines individual AU features with attention weights to enhance feature robustness. Furthermore, we propose an AU relation dropout strategy and an AU relation loss (AURLoss) to better model AU relations, which can further improve AU recognition. Extensive experiments show that our approach achieves state-of-the-art performance on the DISFA and DISFA+ datasets.
Index Terms— Facial action unit recognition, attention mechanism, AU relation learning
1. INTRODUCTION
As AUs occur in different regions of the face, most current works [1, 2, 3] treat AU recognition as a multi-label classification problem and consider AUs to be independent of each other. However, AUs are in fact highly correlated. Due to the anatomical mechanisms of the face, a facial expression is associated with a certain set of AUs. For example, AU6 (Cheek Raiser) and AU12 (Lip Corner Puller) tend to be activated together when we express happy emotions, and it is difficult to produce AU9 (Nose Wrinkler) without the presence of AU4 (Brow Lowerer) [4]. On the other hand, some AUs are unlikely to appear simultaneously because of the structural limitations imposed by facial anatomy. For example, we can hardly make AU22 (Lip Funneler) and AU23 (Lip Tightener) at the same time.
*Corresponding author: Haoxiang Wang (hxwang@scut.edu.cn).
This work was supported by the Guangdong Basic and Applied Basic Research Foundation (2021A1515011852).
Fig. 1: Illustration of various AU relations. Subject 1 tends to
raise her brows (AU1 and AU2) when smiling, but Subject 2
tends to drop his jaw (AU26) when smiling.
Considering these intuitive AU relations, some works have made progress in modeling AU relationships. Walecki et al. [5] propose to combine deep learning with a conditional random field (CRF) to model AU dependencies for more accurate AU detection. Corneanu et al. [6] develop a complex model that exploits AU correlations via probabilistic graphical approaches. However, these widely adopted strategies [5, 6, 7] do not explicitly consider AU relationships in their model design and introduce noise from non-AU regions, which limits the performance of AU-level relation modeling. To better model AU-level relations, Li et al. [8] recently propose to apply a Gated Graph Neural Network (GGNN) to learn relationship-embedded AU feature representations with a relation graph predefined from statistics of the training data. However, due to the predefined rules, these constrained models [8, 9] can only learn limited AU relationships. Besides, people in different crowds may exhibit different AU relations. As shown in Fig. 1, some people tend to raise their brows (AU1 and AU2) when smiling, while others may tend to drop their jaw (AU26). Existing relation-based AU methods do not account for these varying AU relations across crowds.
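For concreteness, a statistics-based relation graph of the kind used in [8] is typically thresholded from pairwise label co-occurrence statistics. The following is a minimal sketch of one plausible construction; the function name and thresholds are illustrative, not the exact formulation of [8]:

```python
import numpy as np

def build_au_relation_graph(labels: np.ndarray,
                            pos_th: float = 0.1,
                            neg_th: float = -0.1) -> np.ndarray:
    """Derive a fixed AU relation graph from training-label statistics.

    labels: (num_samples, num_aus) binary AU occurrence matrix.
    Returns a (num_aus, num_aus) adjacency matrix with +1 for
    positively related AUs, -1 for mutually exclusive AUs, 0 otherwise.
    """
    corr = np.corrcoef(labels.T)        # pairwise Pearson correlation of AUs
    corr = np.nan_to_num(corr)          # guard against constant AU columns
    graph = np.zeros_like(corr, dtype=int)
    graph[corr > pos_th] = 1            # AUs that frequently co-occur
    graph[corr < neg_th] = -1           # AUs that rarely co-occur
    np.fill_diagonal(graph, 0)          # drop self-edges
    return graph
```

Because such a graph is fixed before training, it cannot adapt to the crowd-dependent relations discussed above, which motivates our learned alternative.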
Driven by this observation and inspired by the attention mechanism, we propose a novel ABRNet for facial AU recognition. Our main contributions are listed as follows:
• ABRNet is proposed to capture the various AU relations in different crowds. ABRNet uses a relation learning module to automatically capture different AU relations and a self-attention fusion module to refine the AU features with attention weights (see the sketch below);
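A minimal PyTorch sketch of this design follows; the module names, dimensions, branch pooling, and exact attention formulation are illustrative placeholders rather than our exact implementation:

```python
import torch
import torch.nn as nn

class ABRNetSketch(nn.Module):
    """Relation learning layers followed by self-attention fusion
    over per-AU features (dimensions are example values)."""

    def __init__(self, num_aus: int = 8, feat_dim: int = 64,
                 num_relation_layers: int = 3):
        super().__init__()
        # Each relation learning layer mixes information across all AUs,
        # letting the network discover a different AU relation pattern.
        self.relation_layers = nn.ModuleList(
            nn.Sequential(nn.Linear(num_aus * feat_dim, num_aus * feat_dim),
                          nn.ReLU())
            for _ in range(num_relation_layers)
        )
        # Self-attention fusion refines individual AU features
        # with attention weights computed across AUs.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4,
                                          batch_first=True)
        self.classifier = nn.Linear(feat_dim, 1)  # per-AU occurrence logit

    def forward(self, au_feats: torch.Tensor) -> torch.Tensor:
        # au_feats: (batch, num_aus, feat_dim) per-AU features from a backbone
        b, n, d = au_feats.shape
        flat = au_feats.reshape(b, n * d)
        # One relation-aware feature map per relation learning layer.
        branches = [layer(flat).reshape(b, n, d)
                    for layer in self.relation_layers]
        fused = torch.stack(branches).mean(dim=0)    # naive branch pooling
        refined, _ = self.attn(fused, fused, fused)  # self-attention fusion
        return self.classifier(refined).squeeze(-1)  # (batch, num_aus) logits
```

In the full model, the AU relation dropout strategy and AURLoss introduced above would additionally regularize these relation branches during training.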