Rethinking Bias Mitigation: Fairer Architectures
Make for Fairer Face Recognition
Samuel Dooley*
University of Maryland, Abacus.AI
samuel@abacus.ai
Rhea Sanjay Sukthanker*
University of Freiburg
sukthank@cs.uni-freiburg.de
John P. Dickerson
University of Maryland, Arthur AI
johnd@umd.edu
Colin White
Caltech, Abacus.AI
crwhite@caltech.edu
Frank Hutter
University of Freiburg
fh@cs.uni-freiburg.de
Micah Goldblum
New York University
goldblum@nyu.edu
Abstract
Face recognition systems are widely deployed in safety-critical applications, in-
cluding law enforcement, yet they exhibit bias across a range of socio-demographic
dimensions, such as gender and race. Conventional wisdom dictates that model
biases arise from biased training data. As a consequence, previous works on bias
mitigation largely focused on pre-processing the training data, adding penalties
to prevent bias from affecting the model during training, or post-processing pre-
dictions to debias them, yet these approaches have shown limited success on hard
problems such as face recognition. In our work, we discover that biases are actually
inherent to neural network architectures themselves. Following this reframing, we
conduct the first neural architecture search for fairness, jointly with a search for
hyperparameters. Our search outputs a suite of models which Pareto-dominate
all other high-performance architectures and existing bias mitigation methods in
terms of accuracy and fairness, often by large margins, on the two most widely
used datasets for face identification, CelebA and VGGFace2. Furthermore, these
models generalize to other datasets and sensitive attributes. We release our code,
models and raw data files at https://github.com/dooleys/FR-NAS.
1 Introduction
Machine learning is applied to a wide variety of socially-consequential domains, e.g., credit scoring, fraud detection, hiring decisions, criminal recidivism, loan repayment, and face recognition [78, 81, 61, 3], with many of these applications significantly impacting people's lives, often in discriminatory ways [5, 55, 114]. Dozens of formal definitions of fairness have been proposed [80], and many algorithmic techniques have been developed for debiasing according to these definitions [106].
Existing debiasing algorithms broadly fit into three (or arguably four [96]) categories: pre-processing [e.g., 32, 93, 89, 110], in-processing [e.g., 123, 124, 25, 35, 83, 110, 73, 79, 24, 59], or post-processing [e.g., 44, 114].
* indicates equal contribution
37th Conference on Neural Information Processing Systems (NeurIPS 2023).
arXiv:2210.09943v3 [cs.CV] 6 Dec 2023
Figure 1: Overview of our methodology. We train 355 models spanning diverse architectures (e.g., Inception, Vision Transformer, VoVNet, MobileNetV3, Dual Path Networks) and hyperparameters (head: MagFace, ArcFace, or CosFace; optimizer; learning rate), plot their error-bias Pareto frontier, design a search space around the best model, and run NAS+HPO over this novel search space to discover new, Pareto-dominant architectures.
Conventional wisdom is that in order to effectively mitigate bias, we should start by selecting a model architecture and set of hyperparameters which are optimal in terms of accuracy and then apply a mitigation strategy to reduce bias. This strategy has yielded little success in hard problems such as face recognition [14]. Moreover, even randomly initialized face recognition models exhibit bias, in the same ways and to the same extents as trained models, indicating that these biases are already baked into the architectures [13]. While existing methods for debiasing machine learning systems use a fixed neural architecture and hyperparameter setting, we instead ask a fundamental question which has received little attention: Does model bias arise from the architecture and hyperparameters? Following an affirmative answer to this question, we exploit advances in neural architecture search (NAS) [30] and hyperparameter optimization (HPO) [33] to search for inherently fair models.
We demonstrate our results on face identification systems where pre-, post-, and in-processing techniques have fallen short of debiasing face recognition systems. Training fair models in this setting demands addressing several technical challenges [14]. Face identification is a type of face recognition deployed worldwide by government agencies for tasks including surveillance, employment, and housing decisions. Face recognition systems exhibit disparity in accuracy based on race and gender [37, 92, 91, 61]. For example, some face recognition models are 10 to 100 times more likely to give false positives for Black or Asian people, compared to white people [2]. This bias has already led to multiple false arrests and jail time for innocent Black men in the USA [48].
In this work, we begin by conducting the first large-scale analysis of the impact of architectures and hyperparameters on bias. We train a diverse set of 29 architectures, ranging from ResNets [47] to vision transformers [28, 68] to Gluon Inception V3 [103] to MobileNetV3 [50], on the two most widely used datasets in face identification that have socio-demographic labels: CelebA [69] and VGGFace2 [8]. In doing so, we discover that architectures and hyperparameters have a significant impact on fairness, across fairness definitions.
Motivated by this discovery, we design architectures that are simultaneously fair and accurate. To this end, we initiate the study of NAS for fairness by conducting the first use of NAS+HPO to jointly optimize fairness and accuracy. We construct a search space informed by the highest-performing architecture from our large-scale analysis, and we adapt the existing Sequential Model-based Algorithm Configuration method (SMAC) [66] for multi-objective architecture and hyperparameter search. We discover a Pareto frontier of face recognition models that outperform existing state-of-the-art models on both test accuracy and multiple fairness metrics, often by large margins. An outline of our methodology can be found in Figure 1.
We summarize our primary contributions below:
• By conducting an exhaustive evaluation of architectures and hyperparameters, we uncover their strong influence on fairness. Bias is inherent to a model's inductive bias, leading to substantial differences in fairness across architectures. We conclude that the implicit convention of choosing standard architectures designed for high accuracy is a losing strategy for fairness.
• Inspired by these findings, we propose a new way to mitigate biases. We build an architecture and hyperparameter search space, and we apply existing tools from NAS and HPO to automatically design a fair face recognition system.
• Our approach finds architectures which are Pareto-optimal on a variety of fairness metrics on both CelebA and VGGFace2. Moreover, our approach is Pareto-optimal compared to previous bias mitigation techniques, finding the fairest model.
• The architectures we synthesize via NAS and HPO generalize to other datasets and sensitive attributes. Notably, these architectures also reduce the linear separability of protected attributes, indicating their effectiveness in mitigating bias across different contexts.
• We release our code and raw results at https://github.com/dooleys/FR-NAS, so that users can easily adapt our approach to any bias metric or dataset.
2 Background and Related Work
Face Identification. Face recognition tasks fall into two distinct categories: verification and identification. Our specific focus lies in face identification, which asks whether
a given person in a source image appears within a gallery composed of many target identities and
their associated images; this is a one-to-many comparison. Novel techniques in face recognition
tasks, such as ArcFace [108], CosFace [23], and MagFace [75], use deep networks (often called the
backbone) to extract feature representations of faces and then compare those to match individuals
(with mechanisms called the head). Generally, backbones take the form of image feature extractors
and heads resemble MLPs with specialized loss functions. Often, the term “head” refers to both
the last layer of the network and the loss function. Our analysis primarily centers around the face
identification task, and we focus our evaluation on examining how close images of similar identities
are in the feature space of trained models, since the technology relies on this feature representation to
differentiate individuals. An overview of these topics can be found in Wang and Deng [109].
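To make the backbone/head distinction concrete, here is a minimal sketch of identification as a one-to-many nearest-neighbor query in the backbone's feature space; the `backbone`, `probe_img`, `gallery_imgs`, and `gallery_ids` objects are hypothetical placeholders, not part of any released code.

```python
import torch

def identify(backbone, probe_img, gallery_imgs, gallery_ids):
    """Minimal 1:N identification sketch: embed the probe with the backbone
    and return the identity of the closest gallery embedding by L2 distance.
    All arguments are illustrative placeholders."""
    backbone.eval()
    with torch.no_grad():
        probe_feat = backbone(probe_img.unsqueeze(0))     # (1, d)
        gallery_feats = backbone(gallery_imgs)            # (N, d)
    # Pairwise L2 distances between the probe and every gallery image.
    dists = torch.cdist(probe_feat, gallery_feats).squeeze(0)  # (N,)
    return gallery_ids[dists.argmin().item()]
```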
Bias Mitigation in Face Recognition. The existence of differential performance of face recognition on population groups and subgroups has been explored in a variety of settings. Earlier work [e.g., 57, 82] focuses on single-demographic effects (specifically, race and gender) in pre-deep-learning face detection and recognition. Buolamwini and Gebru [5] uncover unequal performance at the phenotypic subgroup level in, specifically, a gender classification task powered by commercial systems. Raji and Buolamwini [90] provide a follow-up analysis, exploring the impact of the public disclosures of Buolamwini and Gebru [5], where they discovered that named companies (IBM, Microsoft, and Megvii) updated their APIs within a year to address some of the concerns that had surfaced. Further research continues to show that commercial face recognition systems still have socio-demographic disparities in many complex and pernicious ways [29, 27, 54, 26].
Facial recognition is a large and complex space with many different individual technologies, some with bias mitigation strategies designed just for them [63, 118]. The main bias mitigation strategies for facial identification are described in Section 4.2.
Neural Architecture Search (NAS) and Hyperparameter Optimization (HPO). Deep learning derives its success from manually designed neural architectures that serve as feature extractors, automating the feature engineering process. Neural Architecture Search (NAS) [30, 116], on the other hand, aims at automating the design of the network architecture itself for the task at hand. NAS can be seen as a subset of HPO [33], which refers to the automated search for optimal hyperparameters, such as learning rate, batch size, dropout, loss function, optimizer, and architectural choices. Rapid and extensive research on NAS for image classification and object detection has been witnessed as of late [67, 125, 121, 88, 6]. Deploying NAS techniques in face recognition systems has also seen growing interest [129, 113]. For example, reinforcement learning-based NAS strategies [121] and one-shot NAS methods [113] have been deployed to search for an efficient architecture for face recognition with low error. However, in a majority of these methods, the training hyperparameters for the architectures are fixed. We observe that this practice should be reconsidered in order to obtain the fairest possible face recognition systems. Moreover, one-shot NAS methods have also been applied for multi-objective optimization [39, 7], e.g., optimizing accuracy and parameter size. However, none of these methods can be applied to a joint architecture and hyperparameter search, and none of them have been used to optimize fairness.
For the case of tabular datasets, a few works have applied hyperparameter optimization to mitigate bias in models. Perrone et al. [87] introduced a Bayesian optimization framework to optimize the accuracy of models while satisfying a bias constraint. Schmucker et al. [97] and Cruz et al. [17] extended Hyperband [64] to the multi-objective setting and showed its applications to fairness. Lin et al. [65] proposed de-biasing face recognition models through model pruning. However, they only considered two architectures and just one set of fixed hyperparameters. To the best of our knowledge, no prior work uses any AutoML technique (NAS, HPO, or joint NAS and HPO) to design fair face recognition models, and no prior work uses NAS to design fair models for any application.
3 Are Architectures and Hyperparameters Important for Fairness?
In this section, we study the question “Are architectures and hyperparameters important for fairness?”
and report an extensive exploration of the effect of model architectures and hyperparameters.
Experimental Setup. We train and evaluate each model configuration on a gender-balanced subset of the two most popular face identification datasets: CelebA and VGGFace2. CelebA [69] is a large-scale face attributes dataset with more than 200K celebrity images and a total of 10,177 gender-labeled identities. VGGFace2 [8] is a much larger dataset designed specifically for face identification, comprising over 3.1 million images and a total of 9,131 gender-labeled identities. While this work analyzes phenotypic metadata (perceived gender), the reader should not interpret our findings absent a social lens of what these demographic groups mean inside society. We guide the reader to Hamidi et al. [40] and Keyes [56] for a look at these concepts for gender.
To study the importance of architectures and hyperparameters for fairness, we use the following training pipeline, ultimately conducting 355 training runs with different combinations of 29 architectures from the PyTorch Image Models (timm) database [117] and hyperparameters. For each model, we use the default learning rate and optimizer that was published with that model. We then train the model with these hyperparameters for each of three heads: ArcFace [108], CosFace [23], and MagFace [75]. Next, we use the model's default learning rate with both AdamW [70] and SGD optimizers (again with each head choice). Finally, we also train with AdamW and SGD with unified learning rates (SGD with learning_rate=0.1 and AdamW with learning_rate=0.001). In total, we thus evaluate a single architecture between 9 and 13 times (9 times if the default optimizer and learning rate are the same as the standardized ones, and 13 times otherwise). All other hyperparameters are held constant for training of the model.
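To make the resulting grid of runs concrete, the following sketch enumerates the head/optimizer/learning-rate combinations described above. Here `default_opt` and `default_lr` stand in for the published per-model defaults, and the exact deduplicated count may differ slightly from the 9-13 runs reported above.

```python
def configs_for(arch, default_opt, default_lr):
    """Enumerate per-architecture training configurations: the published
    defaults, the default learning rate under AdamW and SGD, and the
    unified learning rates (SGD at 0.1, AdamW at 1e-3), each crossed
    with the three heads. A sketch only; the paper's tally is 9-13 runs."""
    heads = ["ArcFace", "CosFace", "MagFace"]
    optimizer_settings = {
        (default_opt, default_lr),  # published default optimizer and LR
        ("AdamW", default_lr),      # default LR with AdamW
        ("SGD", default_lr),        # default LR with SGD
        ("SGD", 0.1),               # unified SGD learning rate
        ("AdamW", 1e-3),            # unified AdamW learning rate
    }
    return [(arch, head, opt, lr)
            for head in heads
            for opt, lr in sorted(optimizer_settings)]

# 9 configurations when the defaults coincide with the unified settings.
print(len(configs_for("dpn107", "SGD", 0.1)))
```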
Evaluation procedure. As is commonplace in face identification tasks [12, 13], we evaluate the performance of the learned representations. Recall that face recognition models usually learn representations with an image backbone and then learn a mapping from those representations onto identities of individuals with the head of the model. We pass each test image through a trained model and save the learned representation. To compute the representation error (which we will henceforth simply refer to as Error), we merely ask, for a given probe image/identity, whether the closest image in feature space, by L2 distance, is not of the same person. We split each dataset into train, validation, and test sets. We conduct our search for novel architectures using the train and validation splits, and then show the improvement of our model on the test set.
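A minimal sketch of this Error computation, assuming a matrix `feats` of saved test representations (one row per image) and an array `ids` of integer identity labels; both names are our own placeholders.

```python
import numpy as np

def representation_error(feats: np.ndarray, ids: np.ndarray) -> float:
    """Fraction of probe images whose nearest neighbor (excluding itself,
    by L2 distance) has a different identity. `feats` is (N, d), `ids` is (N,)."""
    # Pairwise squared L2 distances (squared distances preserve the argmin).
    sq = (feats ** 2).sum(axis=1)
    dists = sq[:, None] + sq[None, :] - 2.0 * feats @ feats.T
    np.fill_diagonal(dists, np.inf)      # never match an image to itself
    nearest = dists.argmin(axis=1)       # index of the closest other image
    return float((ids[nearest] != ids).mean())
```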
The most widely used fairness metric in face identification is rank disparity, which is explored in the NIST FRVT [38]. To compute the rank of a given image/identity, we ask how many images of a different identity are closer to the image in feature space. We define this index as the rank of the given image under consideration. Thus, Rank(image) = 0 if and only if Error(image) = 0, and Rank(image) > 0 if and only if Error(image) = 1. We examine the rank disparity: the absolute difference of the average ranks for each perceived gender in a dataset $D$:
$$\left| \frac{1}{|D_{\text{male}}|} \sum_{x \in D_{\text{male}}} \text{Rank}(x) \;-\; \frac{1}{|D_{\text{female}}|} \sum_{x \in D_{\text{female}}} \text{Rank}(x) \right| . \qquad (1)$$
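Under the same assumptions as above (placeholder arrays `feats`, `ids`, and a boolean `is_male` mask for perceived gender), Equation (1) can be sketched as follows; it also assumes every identity has at least two images in the evaluation set.

```python
import numpy as np

def rank_of_each(feats: np.ndarray, ids: np.ndarray) -> np.ndarray:
    """Rank(image): number of different-identity images that are closer to
    the image than its nearest same-identity image (0 iff Error is 0)."""
    sq = (feats ** 2).sum(axis=1)
    dists = sq[:, None] + sq[None, :] - 2.0 * feats @ feats.T
    np.fill_diagonal(dists, np.inf)
    ranks = np.empty(len(ids), dtype=int)
    for i in range(len(ids)):
        same = ids == ids[i]
        same[i] = False                         # exclude the image itself
        closest_same = dists[i, same].min()      # nearest same-identity image
        ranks[i] = int((dists[i, ~same] < closest_same).sum())
    return ranks

def rank_disparity(feats, ids, is_male) -> float:
    """Equation (1): absolute difference of mean ranks across perceived genders."""
    ranks = rank_of_each(feats, ids)
    return float(abs(ranks[is_male].mean() - ranks[~is_male].mean()))
```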
We focus on rank disparity throughout the main body of this paper as it is the most widely used in face
identification, but we explore other forms of fairness metrics in face recognition in Appendix C.4.
Results and Discussion. By plotting the performance of each training run on the validation set, with the error on the x-axis and rank disparity on the y-axis in Figure 2, we can easily conclude two main points. First, optimizing for error does not always optimize for fairness, and second, different architectures have different fairness properties. We also find the DPN architecture has the lowest error and is Pareto-optimal on both datasets; hence, we use that architecture to design our search space in Section 4.
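For reference, the Pareto front in Figure 2 can be extracted from the per-run (error, rank disparity) pairs with a generic dominance check like the sketch below; this is our own utility, not code from the released repository.

```python
def pareto_front(points):
    """Return indices of Pareto-optimal (error, disparity) pairs,
    where lower is better on both axes."""
    front = []
    for i, (err_i, disp_i) in enumerate(points):
        dominated = any(
            err_j <= err_i and disp_j <= disp_i and (err_j < err_i or disp_j < disp_i)
            for j, (err_j, disp_j) in enumerate(points) if j != i
        )
        if not dominated:
            front.append(i)
    return front

# e.g., pareto_front([(0.04, 0.35), (0.06, 0.20), (0.05, 0.40)]) -> [0, 1]
```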
Figure 2: (Left) CelebA; (Right) VGGFace2. Error-Rank Disparity Pareto front of the architectures with lowest error (< 0.3). Models in the lower left corner are better. The Pareto front is denoted with a dashed line; other points are architecture and hyperparameter combinations which are not Pareto-optimal. Labeled Pareto-front architectures include ReXNet, Inception, TNT, EseVoVNet, and DPN on CelebA, and DPN and ReXNet on VGGFace2.
We note that in general there is a low correlation between error and rank disparity (e.g., for models with error < 0.3, ρ = 0.113 for CelebA and ρ = 0.291 for VGGFace2). However, there are differences between the two datasets at the most extreme low errors. First, for VGGFace2, the baseline models already have very low error: there are 10 models with error < 0.05, whereas CelebA only has three such models. Additionally, models with low error also have low rank disparity on VGGFace2, but this is not the case for CelebA. This can be seen by looking at the Pareto curves in Figure 2.
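These correlations can be recomputed from per-run results with a simple filter on error. The sketch below assumes a hypothetical CSV with `error` and `rank_disparity` columns and uses the Pearson coefficient; the released raw-data format may differ.

```python
import pandas as pd

# Hypothetical file and column names; the released raw-data files may differ.
runs = pd.read_csv("results_celeba.csv")          # one row per training run
low_error = runs[runs["error"] < 0.3]
rho = low_error["error"].corr(low_error["rank_disparity"])  # Pearson by default
print(f"correlation between error and rank disparity: {rho:.3f}")
```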
The Pareto-optimal models also differ across datasets: on CelebA, they are versions of DPN, TNT, ReXNet, VoVNet, and ResNets, whereas on VGGFace2 they are DPN and ReXNet. Finally, we note that different architectures exhibit different optimal hyperparameters. For example, on CelebA, the Xception65 architecture finds the combinations (SGD, ArcFace) and (AdamW, ArcFace) Pareto-optimal, whereas the Inception-ResNet architecture finds the combinations (SGD, MagFace) and (SGD, CosFace) Pareto-optimal.
4 Neural Architecture Search for Bias Mitigation
Inspired by our findings on the importance of architecture and hyperparameters for fairness in Sec-
tion 3, we now initiate the first joint study of NAS for fairness in face recognition, also simultaneously
optimizing hyperparameters. We start by describing our search space and search strategy. We then
compare the results of our NAS+HPO-based bias mitigation strategy against other popular face
recognition bias mitigation strategies. We conclude that our strategy indeed discovers simultaneously
accurate and fair architectures.
4.1 Search Space Design and Search Strategy
We design our search space based on our analysis in Section 3, specifically around the Dual Path Networks [10] architecture, which has the lowest error and is Pareto-optimal on both datasets, yielding the best trade-off between rank disparity and accuracy, as seen in Figure 2.
Hyperparameter Search Space Design. We optimize two categorical hyperparameters (the archi-
tecture head/loss and the optimizer) and one continuous one (the learning rate). The learning rate’s
range is conditional on the choice of optimizer; the exact ranges are listed in Table 6 in the appendix.
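A minimal sketch of such a conditional hyperparameter space, written with the ConfigSpace library that SMAC consumes; the learning-rate bounds below are illustrative placeholders rather than the exact ranges from Table 6, and the API names follow ConfigSpace 0.x.

```python
import ConfigSpace as CS
import ConfigSpace.hyperparameters as CSH
from ConfigSpace.conditions import EqualsCondition

cs = CS.ConfigurationSpace(seed=0)
head = CSH.CategoricalHyperparameter("head", ["ArcFace", "CosFace", "MagFace"])
optimizer = CSH.CategoricalHyperparameter("optimizer", ["SGD", "AdamW"])
# Illustrative bounds only; the exact ranges are given in Table 6 of the appendix.
lr_sgd = CSH.UniformFloatHyperparameter("lr_sgd", lower=1e-3, upper=0.5, log=True)
lr_adamw = CSH.UniformFloatHyperparameter("lr_adamw", lower=1e-5, upper=1e-2, log=True)
cs.add_hyperparameters([head, optimizer, lr_sgd, lr_adamw])
# The learning-rate range is conditional on the chosen optimizer.
cs.add_condition(EqualsCondition(lr_sgd, optimizer, "SGD"))
cs.add_condition(EqualsCondition(lr_adamw, optimizer, "AdamW"))
print(cs.sample_configuration())
```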
Architecture Search Space Design. Dual Path Networks [10] for image classification share common features (like ResNets [46]) while possessing the flexibility to explore new features [52] through a dual path architecture. We replace the repeating 1x1_conv-3x3_conv-1x1_conv block with a simple recurring searchable block. Furthermore, we stack multiple such searched blocks to closely follow the architecture of Dual Path Networks. We have nine possible choices for each of the three operations in the DPN block, each of which we give a number 0 through 8. The choices include a vanilla convolution, a convolution with pre-normalization, and a convolution with post-normalization, each of them paired with kernel sizes 1×1, 3×3, or 5×5 (see Appendix C.2 for full details). We thus have 729 possible architectures (in addition to an infinite number of hyperparameter configurations). We denote each of these architectures by XYZ, where X, Y, Z ∈ {0, ..., 8}; e.g., architecture 180 represents the architecture which has operation 1, followed by operation 8, followed by operation 0.
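To make the XYZ encoding concrete, here is a hedged PyTorch sketch of how a code such as 180 could be expanded into a block of three searched operations. The module definitions are our own illustration of the nine choices, not the released search-space implementation (which also handles the DPN dual-path wiring and channel bookkeeping).

```python
import itertools
import torch.nn as nn

KERNELS = [1, 3, 5]
STYLES = ["plain", "pre_norm", "post_norm"]
# Nine operation choices 0..8: (normalization style, kernel size) pairs.
OPS = list(itertools.product(STYLES, KERNELS))

def make_op(idx: int, channels: int) -> nn.Module:
    """Build operation `idx` (0-8): a convolution that is plain,
    normalized before the conv, or normalized after it, with kernel 1/3/5."""
    style, k = OPS[idx]
    conv = nn.Conv2d(channels, channels, kernel_size=k, padding=k // 2)
    if style == "pre_norm":
        return nn.Sequential(nn.BatchNorm2d(channels), conv)
    if style == "post_norm":
        return nn.Sequential(conv, nn.BatchNorm2d(channels))
    return conv

def make_block(code: str, channels: int = 64) -> nn.Sequential:
    """Replace the 1x1-3x3-1x1 DPN block with the three searched operations,
    e.g. make_block("180") builds operation 1, then 8, then 0."""
    return nn.Sequential(*(make_op(int(c), channels) for c in code))
```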