Rethinking Bias Mitigation: Fairer Architectures
Make for Fairer Face Recognition
Samuel Dooley*
University of Maryland, Abacus.AI
samuel@abacus.ai
Rhea Sanjay Sukthanker*
University of Freiburg
sukthank@cs.uni-freiburg.de
John P. Dickerson
University of Maryland, Arthur AI
johnd@umd.edu
Colin White
Caltech, Abacus.AI
crwhite@caltech.edu
Frank Hutter
University of Freiburg
fh@cs.uni-freiburg.de
Micah Goldblum
New York University
goldblum@nyu.edu
Abstract
Face recognition systems are widely deployed in safety-critical applications, in-
cluding law enforcement, yet they exhibit bias across a range of socio-demographic
dimensions, such as gender and race. Conventional wisdom dictates that model
biases arise from biased training data. As a consequence, previous works on bias
mitigation largely focused on pre-processing the training data, adding penalties
to prevent bias from affecting the model during training, or post-processing pre-
dictions to debias them, yet these approaches have shown limited success on hard
problems such as face recognition. In our work, we discover that biases are actually
inherent to neural network architectures themselves. Following this reframing, we
conduct the first neural architecture search for fairness, jointly with a search for
hyperparameters. Our search outputs a suite of models which Pareto-dominate
all other high-performance architectures and existing bias mitigation methods in
terms of accuracy and fairness, often by large margins, on the two most widely
used datasets for face identification, CelebA and VGGFace2. Furthermore, these
models generalize to other datasets and sensitive attributes. We release our code,
models and raw data files at https://github.com/dooleys/FR-NAS.
1 Introduction
Machine learning is applied to a wide variety of socially-consequential domains, e.g., credit scoring, fraud detection, hiring decisions, criminal recidivism, loan repayment, and face recognition [78, 81, 61, 3], with many of these applications significantly impacting people's lives, often in discriminatory ways [5, 55, 114]. Dozens of formal definitions of fairness have been proposed [80], and many algorithmic techniques have been developed for debiasing according to these definitions [106].
Existing debiasing algorithms broadly fit into three (or arguably four [96]) categories: pre-processing [e.g., 32, 93, 89, 110], in-processing [e.g., 123, 124, 25, 35, 83, 110, 73, 79, 24, 59], or post-processing [e.g., 44, 114].
* indicates equal contribution
37th Conference on Neural Information Processing Systems (NeurIPS 2023).
arXiv:2210.09943v3 [cs.CV] 6 Dec 2023
Figure 1: Overview of our methodology. We train 355 models spanning diverse architectures (e.g., Inception, Vision Transformer, VoVNet, MobileNetV3, Dual Path Networks) and hyperparameters (head: MagFace, ArcFace, or CosFace; optimizer; learning rate), plot their error-bias Pareto frontier, design a search space around the best model, and run NAS+HPO over this novel search space to discover new, Pareto-dominant architectures.
Conventional wisdom is that in order to effectively mitigate bias, we should start by selecting a model architecture and set of hyperparameters which are optimal in terms of accuracy and then apply a mitigation strategy to reduce bias. This strategy has yielded little success in hard problems such as face recognition [14]. Moreover, even randomly initialized face recognition models exhibit bias, in the same ways and to the same extents as trained models, indicating that these biases are already baked into the architectures [13]. While existing methods for debiasing machine learning systems use a fixed neural architecture and hyperparameter setting, we instead ask a fundamental question which has received little attention: Does model bias arise from the architecture and hyperparameters? Following an affirmative answer to this question, we exploit advances in neural architecture search (NAS) [30] and hyperparameter optimization (HPO) [33] to search for inherently fair models.
We demonstrate our results on face identification systems where pre-, post-, and in-processing techniques have fallen short of debiasing face recognition systems. Training fair models in this setting demands addressing several technical challenges [14]. Face identification is a type of face recognition deployed worldwide by government agencies for tasks including surveillance, employment, and housing decisions. Face recognition systems exhibit disparity in accuracy based on race and gender [37, 92, 91, 61]. For example, some face recognition models are 10 to 100 times more likely to give false positives for Black or Asian people, compared to white people [2]. This bias has already led to multiple false arrests and jail time for innocent Black men in the USA [48].
In this work, we begin by conducting the first large-scale analysis of the impact of architectures and hyperparameters on bias. We train a diverse set of 29 architectures, ranging from ResNets [47] to vision transformers [28, 68] to Gluon Inception V3 [103] to MobileNetV3 [50], on the two most widely used datasets in face identification that have socio-demographic labels: CelebA [69] and VGGFace2 [8]. In doing so, we discover that architectures and hyperparameters have a significant impact on fairness, across fairness definitions.
Motivated by this discovery, we design architectures that are simultaneously fair and accurate. To this end, we initiate the study of NAS for fairness by conducting the first use of NAS+HPO to jointly optimize fairness and accuracy. We construct a search space informed by the highest-performing architecture from our large-scale analysis, and we adapt the existing Sequential Model-based Algorithm Configuration method (SMAC) [66] for multi-objective architecture and hyperparameter search. We discover a Pareto frontier of face recognition models that outperform existing state-of-the-art models on both test accuracy and multiple fairness metrics, often by large margins. An outline of our methodology can be found in Figure 1.
We summarize our primary contributions below:
• By conducting an exhaustive evaluation of architectures and hyperparameters, we uncover their strong influence on fairness. Bias is inherent to a model's inductive bias, leading to substantial differences in fairness across architectures. We conclude that the implicit convention of choosing standard architectures designed for high accuracy is a losing strategy for fairness.
• Inspired by these findings, we propose a new way to mitigate biases. We build an architecture and hyperparameter search space, and we apply existing tools from NAS and HPO to automatically design a fair face recognition system.
• Our approach finds architectures which are Pareto-optimal on a variety of fairness metrics on both CelebA and VGGFace2. Moreover, our approach is Pareto-optimal compared to previous bias mitigation techniques, finding the fairest model.
• The architectures we synthesize via NAS and HPO generalize to other datasets and sensitive attributes. Notably, these architectures also reduce the linear separability of protected attributes, indicating their effectiveness in mitigating bias across different contexts.
• We release our code and raw results at https://github.com/dooleys/FR-NAS, so that users can easily adapt our approach to any bias metric or dataset.
2 Background and Related Work
Face Identification. Face recognition tasks fall into two distinct categories: verification and identification. Our specific focus lies in face identification, which asks whether
a given person in a source image appears within a gallery composed of many target identities and
their associated images; this is a one-to-many comparison. Novel techniques in face recognition
tasks, such as ArcFace [108], CosFace [23], and MagFace [75], use deep networks (often called the
backbone) to extract feature representations of faces and then compare those to match individuals
(with mechanisms called the head). Generally, backbones take the form of image feature extractors
and heads resemble MLPs with specialized loss functions. Often, the term “head” refers to both
the last layer of the network and the loss function. Our analysis primarily centers around the face
identification task, and we focus our evaluation on examining how close images of similar identities
are in the feature space of trained models, since the technology relies on this feature representation to
differentiate individuals. An overview of these topics can be found in Wang and Deng [109].
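To make the backbone/head distinction concrete, here is a minimal sketch of identification as a one-to-many nearest-neighbor query in the backbone's feature space; the `backbone`, `probe_img`, `gallery_imgs`, and `gallery_ids` objects are hypothetical placeholders, not part of any released code.

```python
import torch

def identify(backbone, probe_img, gallery_imgs, gallery_ids):
    """Minimal 1:N identification sketch: embed the probe with the backbone
    and return the identity of the closest gallery embedding by L2 distance.
    All arguments are illustrative placeholders."""
    backbone.eval()
    with torch.no_grad():
        probe_feat = backbone(probe_img.unsqueeze(0))     # (1, d)
        gallery_feats = backbone(gallery_imgs)            # (N, d)
    # Pairwise L2 distances between the probe and every gallery image.
    dists = torch.cdist(probe_feat, gallery_feats).squeeze(0)  # (N,)
    return gallery_ids[dists.argmin().item()]
```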
Bias Mitigation in Face Recognition. The existence of differential performance of face recognition on population groups and subgroups has been explored in a variety of settings. Earlier work [e.g., 57, 82] focuses on single-demographic effects (specifically, race and gender) in pre-deep-learning face detection and recognition. Buolamwini and Gebru [5] uncover unequal performance at the phenotypic subgroup level in, specifically, a gender classification task powered by commercial systems. Raji and Buolamwini [90] provide a follow-up analysis, exploring the impact of the public disclosures of Buolamwini and Gebru [5], where they discovered that named companies (IBM, Microsoft, and Megvii) updated their APIs within a year to address some of the concerns that had surfaced. Further research continues to show that commercial face recognition systems still have socio-demographic disparities in many complex and pernicious ways [29, 27, 54, 26].
Facial recognition is a large and complex space with many different individual technologies, some with bias mitigation strategies designed just for them [63, 118]. The main bias mitigation strategies for facial identification are described in Section 4.2.
Neural Architecture Search (NAS) and Hyperparameter Optimization (HPO). Deep learning derives its success from manually designed neural architectures that serve as feature extractors, automating the feature engineering process. Neural Architecture Search (NAS) [30, 116], on the other hand, aims at automating the design of the network architecture itself for the task at hand. NAS can be seen as a subset of HPO [33], which refers to the automated search for optimal hyperparameters, such as learning rate, batch size, dropout, loss function, optimizer, and architectural choices. Rapid and extensive research on NAS for image classification and object detection has been witnessed as of late [67, 125, 121, 88, 6]. Deploying NAS techniques in face recognition systems has also seen growing interest [129, 113]. For example, reinforcement learning-based NAS strategies [121] and one-shot NAS methods [113] have been deployed to search for an efficient architecture for face recognition with low error. However, in a majority of these methods, the training hyperparameters for the architectures are fixed. We observe that this practice should be reconsidered in order to obtain the fairest possible face recognition systems. Moreover, one-shot NAS methods have also been applied for multi-objective optimization [39, 7], e.g., optimizing accuracy and parameter size. However, none of these methods can be applied to a joint architecture and hyperparameter search, and none of them have been used to optimize fairness.
For the case of tabular datasets, a few works have applied hyperparameter optimization to mitigate bias in models. Perrone et al. [87] introduced a Bayesian optimization framework to optimize the accuracy of models while satisfying a bias constraint. Schmucker et al. [97] and Cruz et al. [17] extended Hyperband [64] to the multi-objective setting and showed its applications to fairness. Lin et al. [65] proposed de-biasing face recognition models through model pruning. However, they only considered two architectures and just one set of fixed hyperparameters. To the best of our knowledge, no prior work uses any AutoML technique (NAS, HPO, or joint NAS and HPO) to design fair face recognition models, and no prior work uses NAS to design fair models for any application.
3 Are Architectures and Hyperparameters Important for Fairness?
In this section, we study the question “Are architectures and hyperparameters important for fairness?”
and report an extensive exploration of the effect of model architectures and hyperparameters.
Experimental Setup. We train and evaluate each model configuration on a gender-balanced subset of the two most popular face identification datasets: CelebA and VGGFace2. CelebA [69] is a large-scale face attributes dataset with more than 200K celebrity images and a total of 10,177 gender-labeled identities. VGGFace2 [8] is a much larger dataset designed specifically for face identification, comprising over 3.1 million images and a total of 9,131 gender-labeled identities. While this work analyzes phenotypic metadata (perceived gender), the reader should not interpret our findings absent a social lens of what these demographic groups mean inside society. We guide the reader to Hamidi et al. [40] and Keyes [56] for a look at these concepts for gender.
To study the importance of architectures and hyperparameters for fairness, we use the following training pipeline, ultimately conducting 355 training runs with different combinations of 29 architectures from the PyTorch Image Models (timm) database [117] and hyperparameters. For each model, we use the default learning rate and optimizer that was published with that model. We then train the model with these hyperparameters for each of three heads: ArcFace [108], CosFace [23], and MagFace [75]. Next, we use the model's default learning rate with both AdamW [70] and SGD optimizers (again with each head choice). Finally, we also train with AdamW and SGD with unified learning rates (SGD with learning_rate=0.1 and AdamW with learning_rate=0.001). In total, we thus evaluate a single architecture between 9 and 13 times (9 times if the default optimizer and learning rate are the same as the standardized ones, and 13 times otherwise). All other hyperparameters are held constant for training of the model.
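To make the resulting grid of runs concrete, the following sketch enumerates the head/optimizer/learning-rate combinations described above. Here `default_opt` and `default_lr` stand in for the published per-model defaults, and the exact deduplicated count may differ slightly from the 9-13 runs reported above.

```python
def configs_for(arch, default_opt, default_lr):
    """Enumerate per-architecture training configurations: the published
    defaults, the default learning rate under AdamW and SGD, and the
    unified learning rates (SGD at 0.1, AdamW at 1e-3), each crossed
    with the three heads. A sketch only; the paper's tally is 9-13 runs."""
    heads = ["ArcFace", "CosFace", "MagFace"]
    optimizer_settings = {
        (default_opt, default_lr),  # published default optimizer and LR
        ("AdamW", default_lr),      # default LR with AdamW
        ("SGD", default_lr),        # default LR with SGD
        ("SGD", 0.1),               # unified SGD learning rate
        ("AdamW", 1e-3),            # unified AdamW learning rate
    }
    return [(arch, head, opt, lr)
            for head in heads
            for opt, lr in sorted(optimizer_settings)]

# 9 configurations when the defaults coincide with the unified settings.
print(len(configs_for("dpn107", "SGD", 0.1)))
```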
Evaluation procedure. As is commonplace in face identification tasks [12, 13], we evaluate the performance of the learned representations. Recall that face recognition models usually learn representations with an image backbone and then learn a mapping from those representations onto identities of individuals with the head of the model. We pass each test image through a trained model and save the learned representation. To compute the representation error (which we will henceforth simply refer to as Error), we merely ask, for a given probe image/identity, whether the closest image in feature space, by L2 distance, is not of the same person. We split each dataset into train, validation, and test sets. We conduct our search for novel architectures using the train and validation splits, and then show the improvement of our model on the test set.
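A minimal sketch of this Error computation, assuming a matrix `feats` of saved test representations (one row per image) and an array `ids` of integer identity labels; both names are our own placeholders.

```python
import numpy as np

def representation_error(feats: np.ndarray, ids: np.ndarray) -> float:
    """Fraction of probe images whose nearest neighbor (excluding itself,
    by L2 distance) has a different identity. `feats` is (N, d), `ids` is (N,)."""
    # Pairwise squared L2 distances (squared distances preserve the argmin).
    sq = (feats ** 2).sum(axis=1)
    dists = sq[:, None] + sq[None, :] - 2.0 * feats @ feats.T
    np.fill_diagonal(dists, np.inf)      # never match an image to itself
    nearest = dists.argmin(axis=1)       # index of the closest other image
    return float((ids[nearest] != ids).mean())
```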
The most widely used fairness metric in face identification is rank disparity, which is explored in the NIST FRVT [38]. To compute the rank of a given image/identity, we ask how many images of a different identity are closer to the image in feature space. We define this index as the rank of the given image under consideration. Thus, Rank(image) = 0 if and only if Error(image) = 0, and Rank(image) > 0 if and only if Error(image) = 1. We examine the rank disparity: the absolute difference of the average ranks for each perceived gender in a dataset $D$:
$$\left| \frac{1}{|D_{\text{male}}|} \sum_{x \in D_{\text{male}}} \text{Rank}(x) \;-\; \frac{1}{|D_{\text{female}}|} \sum_{x \in D_{\text{female}}} \text{Rank}(x) \right| . \qquad (1)$$
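Under the same assumptions as above (placeholder arrays `feats`, `ids`, and a boolean `is_male` mask for perceived gender), Equation (1) can be sketched as follows; it also assumes every identity has at least two images in the evaluation set.

```python
import numpy as np

def rank_of_each(feats: np.ndarray, ids: np.ndarray) -> np.ndarray:
    """Rank(image): number of different-identity images that are closer to
    the image than its nearest same-identity image (0 iff Error is 0)."""
    sq = (feats ** 2).sum(axis=1)
    dists = sq[:, None] + sq[None, :] - 2.0 * feats @ feats.T
    np.fill_diagonal(dists, np.inf)
    ranks = np.empty(len(ids), dtype=int)
    for i in range(len(ids)):
        same = ids == ids[i]
        same[i] = False                         # exclude the image itself
        closest_same = dists[i, same].min()      # nearest same-identity image
        ranks[i] = int((dists[i, ~same] < closest_same).sum())
    return ranks

def rank_disparity(feats, ids, is_male) -> float:
    """Equation (1): absolute difference of mean ranks across perceived genders."""
    ranks = rank_of_each(feats, ids)
    return float(abs(ranks[is_male].mean() - ranks[~is_male].mean()))
```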
We focus on rank disparity throughout the main body of this paper as it is the most widely used in face
identification, but we explore other forms of fairness metrics in face recognition in Appendix C.4.
Results and Discussion. By plotting the performance of each training run on the validation set, with the error on the x-axis and rank disparity on the y-axis in Figure 2, we can easily conclude two main points. First, optimizing for error does not always optimize for fairness, and second, different architectures have different fairness properties. We also find the DPN architecture has the lowest error and is Pareto-optimal on both datasets; hence, we use that architecture to design our search space in Section 4.
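For reference, the Pareto front in Figure 2 can be extracted from the per-run (error, rank disparity) pairs with a generic dominance check like the sketch below; this is our own utility, not code from the released repository.

```python
def pareto_front(points):
    """Return indices of Pareto-optimal (error, disparity) pairs,
    where lower is better on both axes."""
    front = []
    for i, (err_i, disp_i) in enumerate(points):
        dominated = any(
            err_j <= err_i and disp_j <= disp_i and (err_j < err_i or disp_j < disp_i)
            for j, (err_j, disp_j) in enumerate(points) if j != i
        )
        if not dominated:
            front.append(i)
    return front

# e.g., pareto_front([(0.04, 0.35), (0.06, 0.20), (0.05, 0.40)]) -> [0, 1]
```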
Figure 2: (Left) CelebA; (Right) VGGFace2. Error-Rank Disparity Pareto front of the architectures with lowest error (< 0.3). Models in the lower left corner are better. The Pareto front is denoted with a dashed line; other points are architecture and hyperparameter combinations which are not Pareto-optimal. Labeled Pareto-front architectures include ReXNet, Inception, TNT, EseVoVNet, and DPN on CelebA, and DPN and ReXNet on VGGFace2.
We note that in general there is a low correlation between error and rank disparity (e.g., for models with error < 0.3, ρ = 0.113 for CelebA and ρ = 0.291 for VGGFace2). However, there are differences between the two datasets at the most extreme low errors. First, for VGGFace2, the baseline models already have very low error: there are 10 models with error < 0.05, whereas CelebA only has three such models. Additionally, models with low error also have low rank disparity on VGGFace2, but this is not the case for CelebA. This can be seen by looking at the Pareto curves in Figure 2.
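These correlations can be recomputed from per-run results with a simple filter on error. The sketch below assumes a hypothetical CSV with `error` and `rank_disparity` columns and uses the Pearson coefficient; the released raw-data format may differ.

```python
import pandas as pd

# Hypothetical file and column names; the released raw-data files may differ.
runs = pd.read_csv("results_celeba.csv")          # one row per training run
low_error = runs[runs["error"] < 0.3]
rho = low_error["error"].corr(low_error["rank_disparity"])  # Pearson by default
print(f"correlation between error and rank disparity: {rho:.3f}")
```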
The Pareto-optimal models also differ across datasets: on CelebA, they are versions of DPN, TNT, ReXNet, VoVNet, and ResNets, whereas on VGGFace2 they are DPN and ReXNet. Finally, we note that different architectures exhibit different optimal hyperparameters. For example, on CelebA, the Xception65 architecture finds the combinations (SGD, ArcFace) and (AdamW, ArcFace) Pareto-optimal, whereas the Inception-ResNet architecture finds the combinations (SGD, MagFace) and (SGD, CosFace) Pareto-optimal.
4 Neural Architecture Search for Bias Mitigation
Inspired by our findings on the importance of architecture and hyperparameters for fairness in Sec-
tion 3, we now initiate the first joint study of NAS for fairness in face recognition, also simultaneously
optimizing hyperparameters. We start by describing our search space and search strategy. We then
compare the results of our NAS+HPO-based bias mitigation strategy against other popular face
recognition bias mitigation strategies. We conclude that our strategy indeed discovers simultaneously
accurate and fair architectures.
4.1 Search Space Design and Search Strategy
We design our search space based on our analysis in Section 3, specifically around the Dual Path Networks [10] architecture, which has the lowest error and is Pareto-optimal on both datasets, yielding the best trade-off between rank disparity and accuracy, as seen in Figure 2.
Hyperparameter Search Space Design. We optimize two categorical hyperparameters (the archi-
tecture head/loss and the optimizer) and one continuous one (the learning rate). The learning rate’s
range is conditional on the choice of optimizer; the exact ranges are listed in Table 6 in the appendix.
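A minimal sketch of such a conditional hyperparameter space, written with the ConfigSpace library that SMAC consumes; the learning-rate bounds below are illustrative placeholders rather than the exact ranges from Table 6, and the API names follow ConfigSpace 0.x.

```python
import ConfigSpace as CS
import ConfigSpace.hyperparameters as CSH
from ConfigSpace.conditions import EqualsCondition

cs = CS.ConfigurationSpace(seed=0)
head = CSH.CategoricalHyperparameter("head", ["ArcFace", "CosFace", "MagFace"])
optimizer = CSH.CategoricalHyperparameter("optimizer", ["SGD", "AdamW"])
# Illustrative bounds only; the exact ranges are given in Table 6 of the appendix.
lr_sgd = CSH.UniformFloatHyperparameter("lr_sgd", lower=1e-3, upper=0.5, log=True)
lr_adamw = CSH.UniformFloatHyperparameter("lr_adamw", lower=1e-5, upper=1e-2, log=True)
cs.add_hyperparameters([head, optimizer, lr_sgd, lr_adamw])
# The learning-rate range is conditional on the chosen optimizer.
cs.add_condition(EqualsCondition(lr_sgd, optimizer, "SGD"))
cs.add_condition(EqualsCondition(lr_adamw, optimizer, "AdamW"))
print(cs.sample_configuration())
```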
Architecture Search Space Design. Dual Path Networks [10] for image classification share common features (like ResNets [46]) while possessing the flexibility to explore new features [52] through a dual path architecture. We replace the repeating 1x1_conv-3x3_conv-1x1_conv block with a simple recurring searchable block. Furthermore, we stack multiple such searched blocks to closely follow the architecture of Dual Path Networks. We have nine possible choices for each of the three operations in the DPN block, each of which we give a number 0 through 8. The choices include a vanilla convolution, a convolution with pre-normalization, and a convolution with post-normalization, each of them paired with kernel sizes 1×1, 3×3, or 5×5 (see Appendix C.2 for full details). We thus have 729 possible architectures (in addition to an infinite number of hyperparameter configurations). We denote each of these architectures by XYZ, where X, Y, Z ∈ {0, ..., 8}; e.g., architecture 180 represents the architecture which has operation 1, followed by operation 8, followed by operation 0.
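To make the XYZ encoding concrete, here is a hedged PyTorch sketch of how a code such as 180 could be expanded into a block of three searched operations. The module definitions are our own illustration of the nine choices, not the released search-space implementation (which also handles the DPN dual-path wiring and channel bookkeeping).

```python
import itertools
import torch.nn as nn

KERNELS = [1, 3, 5]
STYLES = ["plain", "pre_norm", "post_norm"]
# Nine operation choices 0..8: (normalization style, kernel size) pairs.
OPS = list(itertools.product(STYLES, KERNELS))

def make_op(idx: int, channels: int) -> nn.Module:
    """Build operation `idx` (0-8): a convolution that is plain,
    normalized before the conv, or normalized after it, with kernel 1/3/5."""
    style, k = OPS[idx]
    conv = nn.Conv2d(channels, channels, kernel_size=k, padding=k // 2)
    if style == "pre_norm":
        return nn.Sequential(nn.BatchNorm2d(channels), conv)
    if style == "post_norm":
        return nn.Sequential(conv, nn.BatchNorm2d(channels))
    return conv

def make_block(code: str, channels: int = 64) -> nn.Sequential:
    """Replace the 1x1-3x3-1x1 DPN block with the three searched operations,
    e.g. make_block("180") builds operation 1, then 8, then 0."""
    return nn.Sequential(*(make_op(int(c), channels) for c in code))
```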