On Learning Fairness and Accuracy on Multiple
Subgroups
Changjian Shui^{1,2,4,*}, Gezheng Xu^{3,*}, Qi Chen^{4}, Jiaqi Li^{3},
Charles X. Ling^{3}, Tal Arbel^{1,2,5}, Boyu Wang^{3,†}, Christian Gagné^{2,4,5,†}
^{1}Centre for Intelligent Machines, McGill University  ^{2}Mila, Quebec AI Institute
^{3}Department of Computer Science, University of Western Ontario
^{4}Institute Intelligence and Data, Université Laval  ^{5}CIFAR AI Chair
Abstract
We propose an analysis in fair learning that preserves the utility of the data while
reducing prediction disparities under the criterion of group sufficiency. We focus on
the scenario where the data contains multiple or even many subgroups, each with a
limited number of samples. We present a principled method for learning a fair
predictor for all subgroups by formulating the problem as a bilevel objective. In the
lower level, the subgroup-specific predictors are learned from a small amount
of data and the fair predictor. In the upper level, the fair predictor is updated to
be close to all subgroup-specific predictors. We further prove that such a bilevel
objective can effectively control the group sufficiency and the generalization error.
We evaluate the proposed framework on real-world datasets. Empirical evidence
suggests consistently improved fair predictions, with accuracy comparable to the
baselines.
1 Introduction
Machine learning has made rapid progress in sociotechnical systems such as automatic resume
screening, video surveillance, and credit scoring for loan applications. Simultaneously, it has been
observed that learning algorithms exhibit biased predictions on subgroups of the population [1, 2].
For example, an algorithm may deny a loan application based on sensitive attributes such as gender,
race, or disability, which has heightened public concern.
To this end, fair learning has recently been highlighted as a way to mitigate prediction disparities.
The high-level idea is quite straightforward: add fairness constraints during training [3]. As a result,
fair learning principally gives rise to two desiderata. On the one hand, the fair predictor should be
informative, to ensure accurate predictions for the data. On the other hand, the predictor is required
to guarantee fairness, to avoid prediction disparities across subgroups. Therefore, it is crucial to
understand the possibilities and then design provable approaches for achieving learning that is both
informative and fair.
Clearly, achieving both objectives depends on the predefined fair notions. Consider demographic
parity [1] as the fairness criterion, which requires independence between the predictor's output f(X)
and the sensitive attribute (or subgroup index) A. Thus, if the sensitive attribute A and the
ground-truth label Y are highly correlated, it is impossible to learn a predictor that is both fair
and informative.
*Equal contribution. †Corresponding authors.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.10837v2 [stat.ML] 29 Nov 2022

To avoid such intrinsic impossibilities, alternative fair notions have been developed. In this work,
we focus on the criterion of group sufficiency [1, 4], which requires that the conditional expectation
of the ground-truth label, E[Y | f(X), A], be identical across different subgroups, given the
predictor's output. Notably, the risk of violating group sufficiency has arisen in a number of
real-world scenarios. E.g., in medical artificial intelligence, machine learning algorithms are used
to assess clinical risk and guide decisions about initiating medical therapy. However, [5, 6] revealed
a significant racial bias in such algorithms: when the algorithm predicts the same clinical risk score
f(X) for white and black patients, black patients are actually at a higher risk of severe illness:
E[Y | f(X), A = black] ≥ E[Y | f(X), A = white]. The deployed algorithms have resulted in more
referrals of white patients to specialty healthcare services, resulting in both spending disparities
and racial bias [5].
In summary, this work aims to propose a novel principled framework for ensuring group sufficiency,
as well as preserving an informative prediction with a small generalization error. In particular, we
focus on one challenging scenario: the data includes multiple or even a large number of subgroups,
some with only limited samples, as often occurs in the real world. For example, datasets for
self-driving cars are collected from a wide range of geographical regions, each providing a limited
number of training samples [7]. How can we ensure group sufficiency as well as accurate predictions?
Specifically, our contributions are summarized as follows:
Figure 1: Illustration of the proposed algorithm. Consider three subgroups S_1, S_2, S_3, e.g.,
datasets for three different races. The proposed algorithm is formulated as a bilevel optimization
to learn an informative and fair predictive-distribution Q. In the lower level (cyan), we learn the
subgroup-specific predictive-distribution Q*_a from the dataset S_a (limited samples) and the prior
Q. In the upper level (brown), Q is then updated to be as close to all of the learned
subgroup-specific Q*_a as possible.
Controlling group sufficiency  We adopt the group sufficiency gap to measure fairness w.r.t. group
sufficiency of a classifier f (Sec. 3), and then derive an upper bound of the group sufficiency gap
(Theorem 4.1). Under proper assumptions, the upper bound is controlled by the discrepancy between
the classifier f and the subgroup Bayes predictors. Thus, minimizing the upper bound also
encourages an informative classifier.

Algorithmic contribution  Motivated by the upper bound of the group sufficiency gap, we develop a
principled algorithm. Concretely, we adopt a randomized algorithm that produces a
predictive-distribution Q over the classifier (f ∼ Q) to learn informative and fair classification.
We further formulate the problem as a bilevel optimization (Sec. 5.3), as shown in Fig. 1. (1) In
the lower level, the subgroup-specific dataset S_a and the fair predictive-distribution Q are used
to learn the subgroup-specific predictive-distribution Q*_a, where Q is regarded as an informative
prior for learning from limited data within each subgroup. Theorem 5.1 formally demonstrates that
under proper assumptions, the lower-level loss can effectively control the generalization error.
(2) In the upper level, the fair predictive-distribution Q is then updated to be close to all
subgroup-specific predictive-distributions, in order to minimize the upper bound of the group
sufficiency gap.
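To make the bilevel structure concrete, here is a minimal numpy sketch under strong simplifications: each predictive-distribution is collapsed to a point mass (a logistic-regression weight vector), so the KL proximity between Q*_a and Q reduces to a squared Euclidean penalty, and the upper-level update becomes averaging. The helper names (`lower_step`, `train_bilevel`) and hyperparameters are illustrative, not from the paper's released code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def lower_step(w_prior, X, y, lam=1.0, lr=0.1, steps=100):
    """Lower level: fit a subgroup-specific weight vector on its (small)
    dataset, regularized toward the shared prior w_prior."""
    w = w_prior.copy()
    for _ in range(steps):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / len(y) + lam * (w - w_prior)
        w -= lr * grad
    return w

def train_bilevel(subgroups, dim, outer_steps=50):
    """Upper level: move the shared predictor toward the average of the
    subgroup-specific solutions (the closest point in squared distance)."""
    w = np.zeros(dim)
    for _ in range(outer_steps):
        ws = [lower_step(w, X, y) for X, y in subgroups]
        w = np.mean(ws, axis=0)
    return w

rng = np.random.default_rng(0)
dim = 3
w_true = np.array([1.0, -1.0, 0.5])
subgroups = []
for a in range(4):                      # four subgroups, 30 samples each
    X = rng.normal(size=(30, dim))
    y = (sigmoid(X @ w_true) > rng.uniform(size=30)).astype(float)
    subgroups.append((X, y))
w_fair = train_bilevel(subgroups, dim)
print(w_fair.shape)  # (3,)
```

With a Gaussian predictive-distribution of fixed variance, minimizing KL(Q*_a ‖ Q) over the mean of Q is exactly this averaging step, which is why the simplification is a reasonable stand-in.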
Empirical justifications  The proposed algorithm is applicable to general parametric and
differentiable models, and we adopt neural networks in the implementation. We evaluate the proposed
algorithm on two real-world NLP datasets that have shown prediction disparities w.r.t. group
sufficiency. Compared with the baselines, the results indicate that group sufficiency is
consistently improved, with almost no loss of accuracy. Code is available at
https://github.com/xugezheng/FAMS.
2 Related Work
Algorithmic fairness  Fairness has attracted great attention and been widely studied in various
applications, such as natural language processing [8-10], natural language generation [11-13],
computer vision [14, 15], and deep learning [16, 17]. Various approaches have been proposed in
algorithmic fairness. They typically add fairness constraints during the training procedure, such as
demographic parity or equalized odds [18-23]. Apart from this, other fair notions have been adopted,
such as accuracy parity [24, 25], which requires each subgroup to attain the same accuracy; small
prediction variance [26, 27], which ensures small prediction variation among the subgroups; or small
prediction loss for all the subgroups [28-31]. Furthermore, based on the concept of independence
(e.g., demographic parity A ⊥ f(X)) or conditional independence (e.g., equalized odds A ⊥ f(X) | Y,
or group sufficiency A ⊥ Y | f(X)), another popular line in fair learning integrates fairness with
an information-theoretic framework by adding mutual information constraints, e.g., [32, 33].
Understanding the fairness-accuracy trade-off  On the theoretical side, [34] investigated the
relation between fairness (demographic parity) and algorithmic stability. [35] formally justified
the inherent trade-off between fairness (w.r.t. demographic parity and equalized odds) and accuracy,
although the analysis is conducted for a binary sensitive attribute with the population loss. [36]
studied the fairness-accuracy trade-off in multi-task learning.
Group sufficiency  The fair notion of group sufficiency has recently been highlighted in various
real-world scenarios such as healthcare [6] and crime prediction [4, 37]. Specifically, [38]
demonstrated that under proper assumptions, group sufficiency can be controlled in unconstrained
learning. However, this conclusion does not necessarily hold in overparameterized models with
limited samples per subgroup, where [6, 39, 40] revealed prediction disparities between different
subgroups under unconstrained learning. [41] recently studied fair selective classification w.r.t.
group sufficiency through an information-theoretic framework, but without theoretical guarantees.
In contrast, our proposed lower-level loss provably controls the generalization error, and the
upper-level loss controls the group sufficiency gap. Besides, a notion close to group sufficiency
is probability calibration [42], which is defined as E[Y | f(X)] = f(X) in binary classification.
We will empirically show that probability calibration is also consistently improved within our
framework, whereas the finite-sample analysis and its theoretical relation to group sufficiency
remain open [43].
Bi-level optimization in fairness  Bi-level optimization solves problems with a hierarchical
structure: two levels of optimization problems, where one task is nested inside another [44].
Several ideas related to bi-level optimization have been proposed in the context of fair learning.
For instance, one can design a min-max optimization to learn fair representations under demographic
parity (DP) or equalized odds (EO) [19, 32, 25]. In this setting, a representation function
minimizes the loss induced by the discriminator in the lower level, while in the upper level a
discriminator maximizes the loss; fair representations are then enforced through the bi-level
optimization. Besides, if accuracy and its variants are tracked as the metrics for each subgroup
[12], the bi-level objective can also be deployed to control the loss [45] or the prediction
variance [27], where the lower level minimizes the loss for each subgroup and the upper level
estimates the prediction disparities. In our paper, we theoretically justify a novel bi-level
optimization perspective: controlling group sufficiency and accuracy. Other bi-level optimization
and related meta-learning algorithms could be further considered in fair learning, such as
recurrence-based gradient updating [46], layer-wise transformation [47], or implicit-gradient-based
approaches [48].
3 Preliminaries
We assume the joint random variable (X, Y, A) follows an underlying distribution D(X, Y, A), where
X ∈ 𝒳 is the input, Y ∈ 𝒴 is the label, and the scalar discrete random variable A ∈ 𝒜 denotes the
sensitive attribute (or subgroup index). For instance, A represents gender, race, or age. We denote
E[Y|X] as the conditional expectation of Y, which is essentially a function of X, and E_{A,X}[·] as
the expectation over the marginal distribution D(A, X). Throughout the paper, we consider binary
classification with 𝒴 = {0, 1}. We further define the predictor as a scoring function
f : 𝒳 → [0, 1] that maps the input to a real value in [0, 1]. It is worth mentioning that in general
f(X) ∉ 𝒴, since f(X) is continuous. We then introduce group sufficiency and the group sufficiency
gap.
Definition 3.1 (Group sufficiency [1, 4, 38]). A predictor f satisfies group sufficiency with
respect to the sensitive attribute A if E[Y | f(X)] = E[Y | f(X), A].
Intuitively, given an output score of the predictor f(X) = τ, the conditional expectation of Y is
invariant across different subgroups. Namely, conditioning on a specific subgroup A = a does not
provide any additional information about the conditional expectation of Y. We can then naturally
define the group sufficiency gap.
Definition 3.2 (Group sufficiency gap [38]). The group sufficiency gap of a predictor f is defined
as:

Suf_f = E_{A,X} [ | E[Y | f(X)] − E[Y | f(X), A] | ]
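On finite samples, Suf_f can be estimated with a plug-in approach: bin the scores f(X), then compare the per-subgroup conditional label mean inside each bin with the overall bin mean. A sketch (the equal-width binning scheme and the function name are our illustrative choices, not from the paper):

```python
import numpy as np

def sufficiency_gap(scores, labels, groups, n_bins=10):
    """Plug-in estimate of Suf_f: bin f(X), then average
    |E[Y | bin] - E[Y | bin, A]| over the joint distribution of (A, X)."""
    bins = np.clip((scores * n_bins).astype(int), 0, n_bins - 1)
    gap, n = 0.0, len(scores)
    for b in np.unique(bins):
        in_bin = bins == b
        overall = labels[in_bin].mean()          # estimate of E[Y | f(X)]
        for a in np.unique(groups[in_bin]):
            sel = in_bin & (groups == a)
            gap += np.abs(labels[sel].mean() - overall) * sel.sum() / n
    return gap

rng = np.random.default_rng(1)
n = 20000
groups = rng.integers(0, 2, size=n)
scores = rng.uniform(size=n)
# labels depend on the group even given the score -> sufficiency is violated
p = np.clip(scores + 0.2 * (groups - 0.5), 0, 1)
labels = (rng.uniform(size=n) < p).astype(float)
print(round(sufficiency_gap(scores, labels, groups), 3))
```

For a well-specified score (p depending on the score only), the same estimator returns a value near zero, up to sampling noise within each bin.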
Specifically, Suf_f measures the extent of group sufficiency violation induced by the predictor f,
taken in expectation over (X, A). Clearly, Suf_f = 0 implies that f satisfies group sufficiency,
and vice versa. For completeness, we also discuss two other popular group fairness criteria:
demographic parity and equalized odds.
Definition 3.3 (Demographic Parity (DP)). A predictor f satisfies demographic parity with respect
to the sensitive attribute A if E[f(X)] = E[f(X) | A].
Demographic parity (DP), also known as statistical parity or the independence rule, requires that
the expectation of the output score f(X) be independent of A. [1, 4] further revealed that if A and
Y are not independent, group sufficiency and demographic parity cannot be achieved simultaneously.
Definition 3.4 (Equalized Odds (EO) [18]). A predictor f satisfies equalized odds with respect to A
if E[f(X) | Y] = E[f(X) | Y, A].
Equalized odds (EO) requires that the conditional expectation of the output f be invariant w.r.t.
A, given the ground truth Y. [1, 37] showed that if D(X, Y, A) > 0 and A and Y are not independent,
group sufficiency and equalized odds cannot both hold.
This analysis reveals a general incompatibility between group sufficiency and DP/EO when A and Y
are dependent, which often occurs in practice. Besides, DP/EO-based criteria generally suffer from
the well-known fairness-accuracy trade-off [32]: enforcing the fairness constraint degrades the
prediction performance. This paper shows that under the criterion of group sufficiency, both
objectives can be encouraged.
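The DP and EO gaps implied by Definitions 3.3 and 3.4 can be estimated from samples in the same plug-in fashion; the sketch below (hypothetical helper names `dp_gap`, `eo_gap`, and synthetic data of our choosing) reports the largest subgroup deviation:

```python
import numpy as np

def dp_gap(scores, groups):
    """max_a |E[f(X) | A=a] - E[f(X)]| : deviation from demographic parity."""
    overall = scores.mean()
    return max(abs(scores[groups == a].mean() - overall)
               for a in np.unique(groups))

def eo_gap(scores, labels, groups):
    """max over (y, a) of |E[f(X) | Y=y, A=a] - E[f(X) | Y=y]|."""
    gap = 0.0
    for y in np.unique(labels):
        base = scores[labels == y].mean()
        for a in np.unique(groups):
            sel = (labels == y) & (groups == a)
            if sel.any():
                gap = max(gap, abs(scores[sel].mean() - base))
    return gap

rng = np.random.default_rng(2)
n = 10000
groups = rng.integers(0, 2, size=n)
labels = rng.integers(0, 2, size=n)
# scores shift with both the label and the group -> both DP and EO are violated
scores = np.clip(0.5 + 0.3 * (labels - 0.5) + 0.2 * (groups - 0.5)
                 + 0.1 * rng.normal(size=n), 0, 1)
print(dp_gap(scores, groups), eo_gap(scores, labels, groups))
```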
4 Upper bound of group sufficiency gap
To derive the theoretical results, we first introduce the group Bayes predictor.
Definition 4.1 (A-group Bayes predictor). The A-group Bayes predictor f^Bayes_A is defined as:
f^Bayes_A(X) = E[Y | X, A].
The A-group Bayes predictor is associated with the underlying data distribution D(X, Y, A). Given a
fixed realization X = x, A = a, we have f^Bayes_{A=a}(x) = E[Y | X = x, A = a], which corresponds to
the ground-truth conditional data generation of subgroup A = a. Using f^Bayes_{A=a}(x), we can
derive an upper bound of the group sufficiency gap for any predictor f:

Theorem 4.1. The group sufficiency gap Suf_f is upper bounded by:
Suf_f ≤ 4 E_{A,X} [ | f − f^Bayes_A | ]
Specifically, if A takes finitely many values (|𝒜| < +∞) and follows a uniform distribution with
D(A = a) = 1/|𝒜|, then the bound on the group sufficiency gap simplifies to:

Suf_f ≤ (4 / |𝒜|) Σ_a E_X [ | f − f^Bayes_{A=a} |  |  A = a ]
The proof is inspired by [38]. Specifically, Theorem 4.1 reveals that the upper bound of the group
sufficiency gap depends on the discrepancy between the predictor f and the A-group Bayes predictor
f^Bayes_A(X). Namely, across the different subgroups A = a, the optimal predictor f ought to be
close to all the group Bayes predictors f^Bayes_{A=a}(X), ∀a ∈ 𝒜.
Underlying assumption  Theorem 4.1 also reveals the underlying assumptions on the data-generating
distribution D(X, Y, A) required for a small group sufficiency gap. If the f^Bayes_A of the
different subgroups A = a are similar, then minimizing the upper bound yields a small group
sufficiency gap Suf_f. For example, consider the extreme scenario in which the A-group Bayes
predictors are identical across A: E[Y | X, A = a] = E[Y | X], ∀a ∈ 𝒜, where E[Y | X] is the
conventional Bayes predictor defined on the marginalized distribution D(X, Y). The upper bound then
recovers the difference between the predictor f and the standard Bayes predictor. If we use a
probabilistic framework to approximate the predictor, f(X) ≈ E[Y | X] (i.e., training on the entire
dataset without any fairness constraint), both the group sufficiency gap and the prediction error
(since the Bayes predictor is optimal) will be small, which is consistent with [38]. On the
contrary, if the A-group Bayes predictors are arbitrary, with high variance across A, both the group
sufficiency gap and the prediction error are large, and informative prediction is impossible.
5 Principled Approach
Based on the upper bound, we propose a principled approach to learn the predictor that achieves both
small generalization error and group sufficiency gap.
5.1 Upper bound in randomized algorithm
To establish the theoretical result, we consider a randomized algorithm that learns a
predictive-distribution Q over scoring predictors from the data. For instance, in a Bayesian
framework, the predictor is drawn from the posterior distribution: f̃ ∼ Q. At inference time, the
predictor's output is formulated as the expectation under the learned predictive-distribution Q:
f(X) = E_{f̃∼Q} f̃(X).
In practice, it is infeasible to optimize over all possible distributions, so we restrict the
predictive-distribution Q to a distribution family Q ∈ 𝒬, such as Gaussian distributions. We also
denote Q*_a ∈ 𝒬 as the optimal predictive-distribution w.r.t. A = a under the binary cross-entropy
loss within the distribution family 𝒬: Q*_a := argmin_{Q_a ∈ 𝒬} E_{f̃_a ∼ Q_a} L^BCE_a(f̃_a). In
general, Q*_a ≠ f^Bayes_{A=a}, since the distribution family 𝒬 is only a subset of all possible
distributions (shown in Fig. 2). We then extend the upper bound to the randomized algorithm.
Corollary 5.1. The group sufficiency gap Suf_f of the randomized algorithm w.r.t. the learned
predictive-distribution Q is upper bounded by:

Suf_f ≤ (2√2 / |𝒜|) Σ_a [ √KL(Q*_a ‖ Q)  (Optimization)  +  √KL(Q*_a ‖ D(Y | X, A = a))  (Approximation) ]

where KL is the Kullback–Leibler divergence. Corollary 5.1 further reveals that the upper bound
decomposes into two terms, shown in Fig. 2.
Figure 2: Illustration of the optimization and approximation terms. With binary subgroups
𝒜 = {a, b}, the optimization term finds Q ∈ 𝒬 that minimizes the discrepancy to (Q*_a, Q*_b). The
approximation term depends solely on the distribution family 𝒬 (brown region). If the predefined 𝒬
has rich expressive power, the approximation term is treated as a small constant.
Optimization term  The optimization term is the average KL divergence between the learned
distribution Q and the optimal predictive-distribution Q*_a of each subgroup A = a. Minimizing it
implies that the learned distribution Q will be both fair and informative for prediction, because
it minimizes the upper bound of the group sufficiency gap Suf_f while staying close to the optimal
predictive-distribution of each A = a.

Approximation term  The approximation term is the average KL divergence between the optimal
distribution Q*_a and the underlying data-generating distribution. Given the distribution family 𝒬,
it is an unknown constant. If the distribution family 𝒬 has rich expressive power, such as deep
neural networks, the approximation term will be small [49]. However, an extremely large distribution
family 𝒬 could simultaneously yield potential overfitting on finite samples. In this paper, a
neural network is adopted and the approximation term is assumed to be a small constant. Thus,
controlling Suf_f amounts to minimizing the optimization term.
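When 𝒬 is, e.g., a family of diagonal Gaussians over model weights, each KL term in the optimization part has a closed form. A sketch (dimensions and parameter values are arbitrary illustrations; the 2√2 constant is omitted):

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """Closed-form KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) )."""
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

# Two subgroup posteriors Q*_a, Q*_b and a shared predictive-distribution Q
mu = np.zeros(4); var = np.ones(4)                       # shared Q
mu_a = np.array([0.5, 0.0, -0.5, 1.0]); var_a = np.full(4, 0.5)
mu_b = np.array([-0.5, 0.2, 0.5, -1.0]); var_b = np.full(4, 2.0)

# Optimization term of Corollary 5.1 with |A| = 2 (modulo the 2*sqrt(2) factor)
opt_term = 0.5 * (np.sqrt(kl_diag_gauss(mu_a, var_a, mu, var))
                  + np.sqrt(kl_diag_gauss(mu_b, var_b, mu, var)))
print(opt_term)
```

Minimizing this quantity over (mu, var) is exactly the upper-level objective: it pulls the shared Q toward both subgroup posteriors at once.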
5.2 Challenge in learning limited samples
In practice, we only have access to finite or even limited samples in each subgroup, rather than
the underlying distribution D. We denote S_a = {(x^a_i, y^a_i)}_{i=1}^m as the observed data of
subgroup A = a, drawn i.i.d. from the underlying distribution D(x, y | A = a). We also denote the
empirical binary cross-entropy loss w.r.t. A = a as:

L̂^BCE_a(f̃) = −(1/m) Σ_{i=1}^m [ y^a_i log(f̃(x^a_i)) + (1 − y^a_i) log(1 − f̃(x^a_i)) ]

A straightforward approach is then to minimize the empirical counterpart Q̂*_a:

Q̂*_a = argmin_{Q_a ∈ 𝒬} E_{f̃_a ∼ Q_a} L̂^BCE_a(f̃_a)    (1)

Then Q is updated by minimizing the average KL divergence Σ_a KL(Q̂*_a ‖ Q) from the learned Q̂*_a.
However, this idea generally does not work in our setting, because each subgroup contains limited
samples.
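For a point-mass Q̂*_a (a single deterministic predictor per subgroup, the degenerate case of Eq. (1)), directly minimizing the empirical BCE illustrates why limited per-subgroup samples are a problem: the training loss can be driven far below the test loss. A sketch with illustrative sizes (15 samples, 20 features; all names are our own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def bce(f_x, y, eps=1e-12):
    """Empirical BCE: -(1/m) sum [ y log f + (1-y) log(1-f) ]."""
    f_x = np.clip(f_x, eps, 1 - eps)
    return -np.mean(y * np.log(f_x) + (1 - y) * np.log(1 - f_x))

def fit_subgroup(X, y, lr=0.5, steps=500):
    """Directly minimize the empirical BCE for one subgroup (point-mass Q)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)
        w -= lr * X.T @ (p - y) / len(y)
    return w

rng = np.random.default_rng(3)
w_true = rng.normal(size=20)
X_tr = rng.normal(size=(15, 20))          # only 15 samples, 20 features
y_tr = (X_tr @ w_true > 0).astype(float)
X_te = rng.normal(size=(2000, 20))
y_te = (X_te @ w_true > 0).astype(float)

w = fit_subgroup(X_tr, y_tr)
train_loss = bce(sigmoid(X_tr @ w), y_tr)
test_loss = bce(sigmoid(X_te @ w), y_te)
print(train_loss < test_loss)   # overfitting on limited per-subgroup data
```

Regularizing the lower level toward the shared prior Q, as in the bilevel objective, is what counteracts this overfitting.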