
by section 2.2, where we discuss different fairness metrics that arise in FR and introduce two new ones that we believe are more relevant to operational use cases. In section 3, we present the von Mises-Fisher loss used to train the Ethical Module and discuss its benefits. Finally, in section 4, we present our numerical experiments at length; they partly consist of learning an Ethical Module on top of the ArcFace model, pre-trained on the MS1MV3 dataset (Deng et al., 2019b). Our results show that, remarkably, some specific choices of hyperparameters achieve both high performance and low values of the fairness metrics at the same time.
Related works. The correction of bias in FR has been the subject of several recent papers. (Liu et al., 2019) and (Wang & Deng, 2020) use reinforcement learning to learn fair decision rules; despite their mathematical relevance, such methods are computationally prohibitive. Another line of research, followed by (Yin et al., 2019), (Wang et al., 2019a) and (Huang et al., 2019), assumes that bias stems from the unbalanced nature of FR datasets and builds on imbalanced and transfer learning methods. Unfortunately, these methods do not completely remove bias, and it has recently been pointed out that balanced datasets are actually not enough to mitigate it, as illustrated by (Albiero et al., 2020) for gender bias, (Gwilliam et al., 2021) for racial bias and (Wang et al., 2019b) for gender bias in face detection. (Gong et al., 2019), (Alasadi et al., 2019) and (Dhar et al., 2021) rely on adversarial methods, which can reduce bias but are also known to be unstable and computationally expensive. All of the previously mentioned methods try to learn fair representations. In contrast, some other works do not affect the latent space but modify the decision rule instead: (Terhörst et al., 2020) act on the score function whereas (Salvador et al., 2021) rely on calibration methods. Despite encouraging results, these approaches do not address the source of the problem, which is the bias incurred by the embeddings themselves.
2. Fairness in Face Recognition
In this section, we first briefly recall the main principles of deep Face Recognition and introduce some notation. The interested reader may consult (Masi et al., 2018) or (Wang & Deng, 2018) for a detailed exposition. Then, we present the fairness metrics we adopt and argue for their relevance in our framework.
2.1. Overview of Face Recognition
Framework. A typical FR dataset consists of face images of individuals from which we wish to predict the identities. Assuming that the images are of size $h \times w$ and that there are $K$ identities among the images, this can be modeled by i.i.d. realizations of a random variable $(X, y) \in \mathbb{R}^{h \times w \times c} \times \{1, \dots, K\}$, where $c$ corresponds to the color channel dimension. In the following, we denote by $\mathbb{P}$ the corresponding probability law.
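As a concrete illustration of this data model, the following minimal sketch instantiates it with assumed values; the image size, number of identities and sample count below are ours, not the paper's:

```python
import numpy as np

# Illustrative values (assumptions, not taken from the paper):
h, w, c = 112, 112, 3   # image height, width and color channels
K = 1000                # number of identities
N = 512                 # number of training images

rng = np.random.default_rng(seed=0)
X = rng.random((N, h, w, c), dtype=np.float32)  # images in R^{h x w x c}
y = rng.integers(low=0, high=K, size=N)         # identity labels in {0, ..., K-1}
```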
Objective. The usual goal of FR is to learn an encoder function $f_\theta : \mathbb{R}^{h \times w \times c} \to \mathbb{R}^d$ that embeds the images in a way that brings same identities closer together. The resulting latent representation $Z := f_\theta(X)$ is the face embedding of $X$. Since the advent of deep learning, the encoder is a deep Convolutional Neural Network (CNN) whose parameters $\theta$ are learned on a huge FR dataset $(x_i, y_i)_{1 \le i \le N}$ made of $N$ i.i.d. realizations of the random variable $(X, y)$. There are generally two FR use cases: identification, which consists in finding the specific identity of a probe face among several previously enrolled identities, and verification (which we focus on throughout this paper), which aims at deciding whether two face images correspond to the same identity or not. To do so, the closeness between two embeddings is usually quantified with the cosine similarity measure $s(z_i, z_j) := z_i^\top z_j / (\|z_i\| \cdot \|z_j\|)$, where $\|\cdot\|$ stands for the usual Euclidean norm (the Euclidean metric $\|z_i - z_j\|$ is also used in some early works, e.g. (Schroff et al., 2015)). Therefore, an operating point $t \in [-1, 1]$ (threshold of acceptance) has to be chosen to classify a pair $(z_i, z_j)$ as genuine (same identity) if $s \ge t$ and impostor (distinct identities) otherwise.
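To make the verification rule concrete, here is a minimal NumPy sketch of the cosine similarity and the resulting thresholded decision; the function names are ours, and the embeddings are assumed to come from a trained encoder $f_\theta$:

```python
import numpy as np

def cosine_similarity(z_i: np.ndarray, z_j: np.ndarray) -> float:
    """s(z_i, z_j) = z_i^T z_j / (||z_i|| * ||z_j||), a value in [-1, 1]."""
    return float(z_i @ z_j / (np.linalg.norm(z_i) * np.linalg.norm(z_j)))

def is_genuine(z_i: np.ndarray, z_j: np.ndarray, t: float) -> bool:
    """Classify the pair as genuine iff s(z_i, z_j) >= t, where t is the
    chosen operating point (threshold of acceptance)."""
    return cosine_similarity(z_i, z_j) >= t
```

In operational systems, $t$ is typically chosen so that the false acceptance rate on impostor pairs matches a prescribed level.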
Training. For the training phase only, a fully-connected layer is added on top of the deep embeddings so that the output is a $K$-dimensional vector, predicting the identity of each image within the training set. The full model (CNN + fully-connected layer) is trained on an identity classification task. Until 2018, most of the popular FR loss functions were of the form:
$$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \log\left( \frac{e^{\kappa \mu_{y_i}^\top z_i}}{\sum_{k=1}^{K} e^{\kappa \mu_k^\top z_i}} \right), \qquad (1)$$
where the $\mu_k$'s are the fully-connected layer's parameters, $\kappa > 0$ is the inverse temperature of the softmax function used in brackets and $n$ is the batch size. Early works (Taigman et al., 2014; Sun et al., 2014) took $\kappa = 1$ and used a bias term in the fully-connected layer, but (Wang et al., 2017) showed that the bias term degrades the performance of the model. It was thus quickly discarded in later works. Since the canonical similarity measure at the test stage is the cosine similarity, the decision rule only depends on the angle between two embeddings, whereas it could depend on the norms of $\mu_k$ and $z_i$ during training. This has led (Wang et al., 2017) and (Hasnat et al., 2017) to add a normalization step during training, taking $\mu_k, z_i \in \mathcal{S}^{d-1} := \{z \in \mathbb{R}^d : \|z\| = 1\}$, as well as to introduce the re-scaling parameter $\kappa$ in Eq. 1: these ideas significantly improved upon former models and are now widely adopted. The hypersphere $\mathcal{S}^{d-1}$ to which the embeddings belong is commonly called the face hypersphere.
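As a sketch of how Eq. (1) combines with this normalization step, the following NumPy snippet computes the loss on a batch, with both the embeddings $z_i$ and the weights $\mu_k$ projected onto $\mathcal{S}^{d-1}$; the function and variable names are ours:

```python
import numpy as np

def normalized_softmax_loss(Z, y, M, kappa):
    """Loss of Eq. (1) with the normalization of (Wang et al., 2017):
    Z:     (n, d) batch of embeddings z_i
    y:     (n,)   identity labels y_i in {0, ..., K-1}
    M:     (K, d) fully-connected layer parameters mu_k (no bias term)
    kappa: inverse temperature, kappa > 0
    """
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # project z_i onto S^{d-1}
    M = M / np.linalg.norm(M, axis=1, keepdims=True)  # project mu_k onto S^{d-1}
    logits = kappa * (Z @ M.T)                        # kappa * mu_k^T z_i, shape (n, K)
    # numerically stable log-softmax over the K identities
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()
```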
Denoting by $\theta_i$ the angle between $\mu_{y_i}$ and $z_i$, the major advance over