On Learning Fairness and Accuracy on Multiple
Subgroups
Changjian Shui^{1,2,4,*}, Gezheng Xu^{3,*}, Qi Chen^{4}, Jiaqi Li^{3},
Charles X. Ling^{3}, Tal Arbel^{1,2,5}, Boyu Wang^{3,†}, Christian Gagné^{2,4,5,†}
^{1}Centre for Intelligent Machines, McGill University  ^{2}Mila, Quebec AI Institute
^{3}Department of Computer Science, University of Western Ontario
^{4}Institute Intelligence and Data, Université Laval  ^{5}CIFAR AI Chair
Abstract
We propose an analysis in fair learning that preserves the utility of the data while
reducing prediction disparities under the criterion of group sufficiency. We focus on
the scenario where the data contains multiple or even many subgroups, each with a
limited number of samples. We present a principled method for learning a fair
predictor for all subgroups by formulating the problem as a bilevel objective. In the
lower level, the subgroup-specific predictors are learned from a small amount
of data and the fair predictor. In the upper level, the fair predictor is updated to
be close to all subgroup-specific predictors. We further prove that such a bilevel
objective can effectively control the group sufficiency and the generalization error.
We evaluate the proposed framework on real-world datasets. Empirical evidence
suggests consistently improved fair predictions, with accuracy comparable to the
baselines.
1 Introduction
Machine learning has made rapid progress in sociotechnical systems such as automatic resume
screening, video surveillance, and credit scoring for loan applications. Simultaneously, it has been
observed that learning algorithms exhibit biased predictions on subgroups of the population [1, 2].
For example, an algorithm may deny a loan application based on sensitive attributes such as gender,
race, or disability, which has heightened public concern.
To this end, fair learning has recently been highlighted as a way to mitigate prediction disparities.
The high-level idea is quite straightforward: add fairness constraints during training [3]. As a result,
fair learning principally gives rise to two desiderata. On the one hand, the fair predictor should be
informative, to ensure accurate predictions for the data. On the other hand, the predictor is required
to guarantee fairness, to avoid prediction disparities across subgroups. Therefore, it is crucial to
understand the possibilities and then design provable approaches for achieving learning that is both
informative and fair.
Clearly, achieving both objectives depends on the predefined fair notions. Consider demographic
parity [1] as the fairness criterion, which requires independence between the predictor's output f(X)
and the sensitive attribute (or subgroup index) A. Thus, if the sensitive attribute A and the
ground-truth label Y are highly correlated, it is impossible to learn a predictor that is both fair
and informative.
*Equal contribution. †Corresponding authors.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.10837v2 [stat.ML] 29 Nov 2022

To avoid such intrinsic impossibilities, alternative fair notions have been developed. In this work,
we focus on the criterion of group sufficiency [1, 4], which requires that the conditional expectation
of the ground-truth label, E[Y | f(X), A], be identical across different subgroups, given the
predictor's output. Notably, the risk of violating group sufficiency has arisen in a number of
real-world scenarios. E.g., in medical artificial intelligence, machine learning algorithms are used
to assess clinical risk and guide decisions about initiating medical therapy. However, [5, 6] revealed
a significant racial bias in such algorithms: when the algorithm predicts the same clinical risk score
f(X) for white and black patients, black patients are actually at a higher risk of severe illness:
E[Y | f(X), A = black] ≥ E[Y | f(X), A = white]. The deployed algorithms have resulted in more
referrals of white patients to specialty healthcare services, resulting in both spending disparities
and racial bias [5].
In summary, this work aims to propose a novel principled framework for ensuring group sufficiency,
as well as preserving an informative prediction with a small generalization error. In particular, we
focus on one challenging scenario: the data includes multiple or even a large number of subgroups,
some with only limited samples, as often occurs in the real world. For example, datasets for
self-driving cars are collected from a wide range of geographical regions, each providing a limited
number of training samples [7]. How can we ensure group sufficiency as well as accurate predictions?
Specifically, our contributions are summarized as follows:
Figure 1: Illustration of the proposed algorithm. Consider three subgroups S_1, S_2, S_3, e.g.,
datasets for three different races. The proposed algorithm is formulated as a bilevel optimization
to learn an informative and fair predictive-distribution Q. In the lower level (cyan), we learn the
subgroup-specific predictive-distribution Q*_a from the dataset S_a (limited samples) and the prior
Q. In the upper level (brown), Q is then updated to be as close to all of the learned
subgroup-specific Q*_a as possible.
Controlling group sufficiency  We adopt the group sufficiency gap to measure fairness w.r.t. group
sufficiency of a classifier f (Sec. 3), and then derive an upper bound of the group sufficiency gap
(Theorem 4.1). Under proper assumptions, the upper bound is controlled by the discrepancy between
the classifier f and the subgroup Bayes predictors. Thus, minimizing the upper bound also
encourages an informative classifier.

Algorithmic contribution  Motivated by the upper bound of the group sufficiency gap, we develop a
principled algorithm. Concretely, we adopt a randomized algorithm that produces a
predictive-distribution Q over the classifier (f ∼ Q) to learn informative and fair classification.
We further formulate the problem as a bilevel optimization (Sec. 5.3), as shown in Fig. 1. (1) In
the lower level, the subgroup-specific dataset S_a and the fair predictive-distribution Q are used
to learn the subgroup-specific predictive-distribution Q*_a, where Q is regarded as an informative
prior for learning from limited data within each subgroup. Theorem 5.1 formally demonstrates that
under proper assumptions, the lower-level loss can effectively control the generalization error.
(2) In the upper level, the fair predictive-distribution Q is then updated to be close to all
subgroup-specific predictive-distributions, in order to minimize the upper bound of the group
sufficiency gap.
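To make the bilevel structure concrete, here is a minimal numpy sketch under strong simplifications: each predictive-distribution is collapsed to a point mass (a logistic-regression weight vector), so the KL proximity between Q*_a and Q reduces to a squared Euclidean penalty, and the upper-level update becomes averaging. The helper names (`lower_step`, `train_bilevel`) and hyperparameters are illustrative, not from the paper's released code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def lower_step(w_prior, X, y, lam=1.0, lr=0.1, steps=100):
    """Lower level: fit a subgroup-specific weight vector on its (small)
    dataset, regularized toward the shared prior w_prior."""
    w = w_prior.copy()
    for _ in range(steps):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / len(y) + lam * (w - w_prior)
        w -= lr * grad
    return w

def train_bilevel(subgroups, dim, outer_steps=50):
    """Upper level: move the shared predictor toward the average of the
    subgroup-specific solutions (the closest point in squared distance)."""
    w = np.zeros(dim)
    for _ in range(outer_steps):
        ws = [lower_step(w, X, y) for X, y in subgroups]
        w = np.mean(ws, axis=0)
    return w

rng = np.random.default_rng(0)
dim = 3
w_true = np.array([1.0, -1.0, 0.5])
subgroups = []
for a in range(4):                      # four subgroups, 30 samples each
    X = rng.normal(size=(30, dim))
    y = (sigmoid(X @ w_true) > rng.uniform(size=30)).astype(float)
    subgroups.append((X, y))
w_fair = train_bilevel(subgroups, dim)
print(w_fair.shape)  # (3,)
```

With a Gaussian predictive-distribution of fixed variance, minimizing KL(Q*_a ‖ Q) over the mean of Q is exactly this averaging step, which is why the simplification is a reasonable stand-in.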
Empirical justifications  The proposed algorithm is applicable to general parametric and
differentiable models, and we adopt neural networks in the implementation. We evaluate the proposed
algorithm on two real-world NLP datasets that have shown prediction disparities w.r.t. group
sufficiency. Compared with the baselines, the results indicate that group sufficiency is
consistently improved, with almost no loss of accuracy. Code is available at
https://github.com/xugezheng/FAMS.
2 Related Work
Algorithmic fairness  Fairness has attracted great attention and been widely studied in various
applications, such as natural language processing [8-10], natural language generation [11-13],
computer vision [14, 15], and deep learning [16, 17]. Various approaches have been proposed in
algorithmic fairness. They typically add fairness constraints during the training procedure, such as
demographic parity or equalized odds [18-23]. Apart from this, other fair notions have been adopted,
such as accuracy parity [24, 25], which requires each subgroup to attain the same accuracy; small
prediction variance [26, 27], which ensures small prediction variation among the subgroups; or small
prediction loss for all the subgroups [28-31]. Furthermore, based on the concept of independence
(e.g., demographic parity A ⊥ f(X)) or conditional independence (e.g., equalized odds A ⊥ f(X) | Y,
or group sufficiency A ⊥ Y | f(X)), another popular line in fair learning integrates fairness with
an information-theoretic framework by adding mutual information constraints, e.g., [32, 33].
Understanding the fairness-accuracy trade-off  On the theoretical side, [34] investigated the
relation between fairness (demographic parity) and algorithmic stability. [35] formally justified
the inherent trade-off between fairness (w.r.t. demographic parity and equalized odds) and accuracy,
although the analysis is conducted for a binary sensitive attribute with the population loss. [36]
studied the fairness-accuracy trade-off in multi-task learning.
Group sufficiency  The fair notion of group sufficiency has recently been highlighted in various
real-world scenarios such as healthcare [6] and crime prediction [4, 37]. Specifically, [38]
demonstrated that under proper assumptions, group sufficiency can be controlled in unconstrained
learning. However, this conclusion does not necessarily hold in overparameterized models with
limited samples per subgroup, where [6, 39, 40] revealed prediction disparities between different
subgroups under unconstrained learning. [41] recently studied fair selective classification w.r.t.
group sufficiency through an information-theoretic framework, but without theoretical guarantees.
In contrast, our proposed lower-level loss provably controls the generalization error, and the
upper-level loss controls the group sufficiency gap. Besides, a notion close to group sufficiency
is probability calibration [42], which is defined as E[Y | f(X)] = f(X) in binary classification.
We will empirically show that probability calibration is also consistently improved within our
framework, whereas the finite-sample analysis and its theoretical relation to group sufficiency
remain open [43].
Bi-level optimization in fairness  Bi-level optimization solves problems with a hierarchical
structure: two levels of optimization problems, where one task is nested inside another [44].
Several ideas related to bi-level optimization have been proposed in the context of fair learning.
For instance, one can design a min-max optimization to learn fair representations under demographic
parity (DP) or equalized odds (EO) [19, 32, 25]. In this setting, a representation function
minimizes the loss induced by the discriminator in the lower level, while in the upper level a
discriminator maximizes the loss; fair representations are then enforced through the bi-level
optimization. Besides, if accuracy and its variants are tracked as the metrics for each subgroup
[12], the bi-level objective can also be deployed to control the loss [45] or the prediction
variance [27], where the lower level minimizes the loss for each subgroup and the upper level
estimates the prediction disparities. In our paper, we theoretically justify a novel bi-level
optimization perspective: controlling group sufficiency and accuracy. Other bi-level optimization
and related meta-learning algorithms could be further considered in fair learning, such as
recurrence-based gradient updating [46], layer-wise transformation [47], or implicit-gradient-based
approaches [48].
3 Preliminaries
We assume the joint random variable (X, Y, A) follows an underlying distribution D(X, Y, A), where
X ∈ 𝒳 is the input, Y ∈ 𝒴 is the label, and the scalar discrete random variable A ∈ 𝒜 denotes the
sensitive attribute (or subgroup index). For instance, A represents gender, race, or age. We denote
E[Y|X] as the conditional expectation of Y, which is essentially a function of X, and E_{A,X}[·] as
the expectation over the marginal distribution D(A, X). Throughout the paper, we consider binary
classification with 𝒴 = {0, 1}. We further define the predictor as a scoring function
f : 𝒳 → [0, 1] that maps the input to a real value in [0, 1]. It is worth mentioning that in general
f(X) ∉ 𝒴, since f(X) is continuous. We then introduce group sufficiency and the group sufficiency
gap.
Definition 3.1 (Group sufficiency [1, 4, 38]). A predictor f satisfies group sufficiency with
respect to the sensitive attribute A if E[Y | f(X)] = E[Y | f(X), A].
Intuitively, given an output score of the predictor f(X) = τ, the conditional expectation of Y is
invariant across different subgroups. Namely, conditioning on a specific subgroup A = a does not
provide any additional information about the conditional expectation of Y. We can then naturally
define the group sufficiency gap.
Definition 3.2 (Group sufficiency gap [38]). The group sufficiency gap of a predictor f is defined
as:

Suf_f = E_{A,X} [ | E[Y | f(X)] − E[Y | f(X), A] | ]
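On finite samples, Suf_f can be estimated with a plug-in approach: bin the scores f(X), then compare the per-subgroup conditional label mean inside each bin with the overall bin mean. A sketch (the equal-width binning scheme and the function name are our illustrative choices, not from the paper):

```python
import numpy as np

def sufficiency_gap(scores, labels, groups, n_bins=10):
    """Plug-in estimate of Suf_f: bin f(X), then average
    |E[Y | bin] - E[Y | bin, A]| over the joint distribution of (A, X)."""
    bins = np.clip((scores * n_bins).astype(int), 0, n_bins - 1)
    gap, n = 0.0, len(scores)
    for b in np.unique(bins):
        in_bin = bins == b
        overall = labels[in_bin].mean()          # estimate of E[Y | f(X)]
        for a in np.unique(groups[in_bin]):
            sel = in_bin & (groups == a)
            gap += np.abs(labels[sel].mean() - overall) * sel.sum() / n
    return gap

rng = np.random.default_rng(1)
n = 20000
groups = rng.integers(0, 2, size=n)
scores = rng.uniform(size=n)
# labels depend on the group even given the score -> sufficiency is violated
p = np.clip(scores + 0.2 * (groups - 0.5), 0, 1)
labels = (rng.uniform(size=n) < p).astype(float)
print(round(sufficiency_gap(scores, labels, groups), 3))
```

For a well-specified score (p depending on the score only), the same estimator returns a value near zero, up to sampling noise within each bin.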
Specifically, Suf_f measures the extent of group sufficiency violation induced by the predictor f,
taken in expectation over (X, A). Clearly, Suf_f = 0 implies that f satisfies group sufficiency,
and vice versa. For completeness, we also discuss two other popular group fairness criteria:
demographic parity and equalized odds.
Definition 3.3 (Demographic Parity (DP)). A predictor f satisfies demographic parity with respect
to the sensitive attribute A if E[f(X)] = E[f(X) | A].
Demographic parity (DP), also known as statistical parity or the independence rule, requires that
the expectation of the output score f(X) be independent of A. [1, 4] further revealed that if A and
Y are not independent, group sufficiency and demographic parity cannot be achieved simultaneously.
Definition 3.4 (Equalized Odds (EO) [18]). A predictor f satisfies equalized odds with respect to A
if E[f(X) | Y] = E[f(X) | Y, A].
Equalized odds (EO) requires that the conditional expectation of the output f be invariant w.r.t.
A, given the ground truth Y. [1, 37] showed that if D(X, Y, A) > 0 and A and Y are not independent,
group sufficiency and equalized odds cannot both hold.
This analysis reveals a general incompatibility between group sufficiency and DP/EO when A and Y
are dependent, which often occurs in practice. Besides, DP/EO-based criteria generally suffer from
the well-known fairness-accuracy trade-off [32]: enforcing the fairness constraint degrades the
prediction performance. This paper shows that under the criterion of group sufficiency, both
objectives can be encouraged.
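The DP and EO gaps implied by Definitions 3.3 and 3.4 can be estimated from samples in the same plug-in fashion; the sketch below (hypothetical helper names `dp_gap`, `eo_gap`, and synthetic data of our choosing) reports the largest subgroup deviation:

```python
import numpy as np

def dp_gap(scores, groups):
    """max_a |E[f(X) | A=a] - E[f(X)]| : deviation from demographic parity."""
    overall = scores.mean()
    return max(abs(scores[groups == a].mean() - overall)
               for a in np.unique(groups))

def eo_gap(scores, labels, groups):
    """max over (y, a) of |E[f(X) | Y=y, A=a] - E[f(X) | Y=y]|."""
    gap = 0.0
    for y in np.unique(labels):
        base = scores[labels == y].mean()
        for a in np.unique(groups):
            sel = (labels == y) & (groups == a)
            if sel.any():
                gap = max(gap, abs(scores[sel].mean() - base))
    return gap

rng = np.random.default_rng(2)
n = 10000
groups = rng.integers(0, 2, size=n)
labels = rng.integers(0, 2, size=n)
# scores shift with both the label and the group -> both DP and EO are violated
scores = np.clip(0.5 + 0.3 * (labels - 0.5) + 0.2 * (groups - 0.5)
                 + 0.1 * rng.normal(size=n), 0, 1)
print(dp_gap(scores, groups), eo_gap(scores, labels, groups))
```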
4 Upper bound of group sufficiency gap
To derive the theoretical results, we first introduce the group Bayes predictor.
Definition 4.1 (A-group Bayes predictor). The A-group Bayes predictor f^Bayes_A is defined as:
f^Bayes_A(X) = E[Y | X, A].
The A-group Bayes predictor is associated with the underlying data distribution D(X, Y, A). Given a
fixed realization X = x, A = a, we have f^Bayes_{A=a}(x) = E[Y | X = x, A = a], which corresponds to
the ground-truth conditional data generation of subgroup A = a. Using f^Bayes_{A=a}(x), we can
derive an upper bound of the group sufficiency gap for any predictor f:

Theorem 4.1. The group sufficiency gap Suf_f is upper bounded by:
Suf_f ≤ 4 E_{A,X} [ | f − f^Bayes_A | ]
Specifically, if A takes finitely many values (|𝒜| < +∞) and follows a uniform distribution with
D(A = a) = 1/|𝒜|, then the bound on the group sufficiency gap simplifies to:

Suf_f ≤ (4 / |𝒜|) Σ_a E_X [ | f − f^Bayes_{A=a} |  |  A = a ]
The proof is inspired by [38]. Specifically, Theorem 4.1 reveals that the upper bound of the group
sufficiency gap depends on the discrepancy between the predictor f and the A-group Bayes predictor
f^Bayes_A(X). Namely, across the different subgroups A = a, the optimal predictor f ought to be
close to all the group Bayes predictors f^Bayes_{A=a}(X), ∀a ∈ 𝒜.
Underlying assumption  Theorem 4.1 also reveals the underlying assumptions on the data-generating
distribution D(X, Y, A) required for a small group sufficiency gap. If the f^Bayes_A of the
different subgroups A = a are similar, then minimizing the upper bound yields a small group
sufficiency gap Suf_f. For example, consider the extreme scenario in which the A-group Bayes
predictors are identical across A: E[Y | X, A = a] = E[Y | X], ∀a ∈ 𝒜, where E[Y | X] is the
conventional Bayes predictor defined on the marginalized distribution D(X, Y). The upper bound then
recovers the difference between the predictor f and the standard Bayes predictor. If we use a
probabilistic framework to approximate the predictor, f(X) ≈ E[Y | X] (i.e., training on the entire
dataset without any fairness constraint), both the group sufficiency gap and the prediction error
(since the Bayes predictor is optimal) will be small, which is consistent with [38]. On the
contrary, if the A-group Bayes predictors are arbitrary, with high variance across A, both the group
sufficiency gap and the prediction error are large, and informative prediction is impossible.
5 Principled Approach
Based on the upper bound, we propose a principled approach to learn the predictor that achieves both
small generalization error and group sufficiency gap.
5.1 Upper bound in randomized algorithm
To establish the theoretical result, we consider a randomized algorithm that learns a
predictive-distribution Q over scoring predictors from the data. For instance, in a Bayesian
framework, the predictor is drawn from the posterior distribution: f̃ ∼ Q. At inference time, the
predictor's output is formulated as the expectation under the learned predictive-distribution Q:
f(X) = E_{f̃∼Q} f̃(X).
In practice, it is infeasible to optimize over all possible distributions, so we restrict the
predictive-distribution Q to a distribution family Q ∈ 𝒬, such as Gaussian distributions. We also
denote Q*_a ∈ 𝒬 as the optimal predictive-distribution w.r.t. A = a under the binary cross-entropy
loss within the distribution family 𝒬: Q*_a := argmin_{Q_a ∈ 𝒬} E_{f̃_a ∼ Q_a} L^BCE_a(f̃_a). In
general, Q*_a ≠ f^Bayes_{A=a}, since the distribution family 𝒬 is only a subset of all possible
distributions (shown in Fig. 2). We then extend the upper bound to the randomized algorithm.
Corollary 5.1. The group sufficiency gap Suf_f of the randomized algorithm w.r.t. the learned
predictive-distribution Q is upper bounded by:

Suf_f ≤ (2√2 / |𝒜|) Σ_a [ √KL(Q*_a ‖ Q)  (Optimization)  +  √KL(Q*_a ‖ D(Y | X, A = a))  (Approximation) ]

where KL is the Kullback–Leibler divergence. Corollary 5.1 further reveals that the upper bound
decomposes into two terms, shown in Fig. 2.
Figure 2: Illustration of the optimization and approximation terms. With binary subgroups
𝒜 = {a, b}, the optimization term finds Q ∈ 𝒬 that minimizes the discrepancy to (Q*_a, Q*_b). The
approximation term depends solely on the distribution family 𝒬 (brown region). If the predefined 𝒬
has rich expressive power, the approximation term is treated as a small constant.
Optimization term  The optimization term is the average KL divergence between the learned
distribution Q and the optimal predictive-distribution Q*_a of each subgroup A = a. Minimizing it
implies that the learned distribution Q will be both fair and informative for prediction, because
it minimizes the upper bound of the group sufficiency gap Suf_f while staying close to the optimal
predictive-distribution of each A = a.

Approximation term  The approximation term is the average KL divergence between the optimal
distribution Q*_a and the underlying data-generating distribution. Given the distribution family 𝒬,
it is an unknown constant. If the distribution family 𝒬 has rich expressive power, such as deep
neural networks, the approximation term will be small [49]. However, an extremely large distribution
family 𝒬 could simultaneously yield potential overfitting on finite samples. In this paper, a
neural network is adopted and the approximation term is assumed to be a small constant. Thus,
controlling Suf_f amounts to minimizing the optimization term.
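When 𝒬 is, e.g., a family of diagonal Gaussians over model weights, each KL term in the optimization part has a closed form. A sketch (dimensions and parameter values are arbitrary illustrations; the 2√2 constant is omitted):

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """Closed-form KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) )."""
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

# Two subgroup posteriors Q*_a, Q*_b and a shared predictive-distribution Q
mu = np.zeros(4); var = np.ones(4)                       # shared Q
mu_a = np.array([0.5, 0.0, -0.5, 1.0]); var_a = np.full(4, 0.5)
mu_b = np.array([-0.5, 0.2, 0.5, -1.0]); var_b = np.full(4, 2.0)

# Optimization term of Corollary 5.1 with |A| = 2 (modulo the 2*sqrt(2) factor)
opt_term = 0.5 * (np.sqrt(kl_diag_gauss(mu_a, var_a, mu, var))
                  + np.sqrt(kl_diag_gauss(mu_b, var_b, mu, var)))
print(opt_term)
```

Minimizing this quantity over (mu, var) is exactly the upper-level objective: it pulls the shared Q toward both subgroup posteriors at once.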
5.2 Challenge in learning limited samples
In practice, we only have access to finite or even limited samples in each subgroup, rather than
the underlying distribution D. We denote S_a = {(x^a_i, y^a_i)}_{i=1}^m as the observed data of
subgroup A = a, drawn i.i.d. from the underlying distribution D(x, y | A = a). We also denote the
empirical binary cross-entropy loss w.r.t. A = a as:

L̂^BCE_a(f̃) = −(1/m) Σ_{i=1}^m [ y^a_i log(f̃(x^a_i)) + (1 − y^a_i) log(1 − f̃(x^a_i)) ]

A straightforward approach is then to minimize the empirical counterpart Q̂*_a:

Q̂*_a = argmin_{Q_a ∈ 𝒬} E_{f̃_a ∼ Q_a} L̂^BCE_a(f̃_a)    (1)

Then Q is updated by minimizing the average KL divergence Σ_a KL(Q̂*_a ‖ Q) from the learned Q̂*_a.
However, this idea generally does not work in our setting, because each subgroup contains limited
samples.
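For a point-mass Q̂*_a (a single deterministic predictor per subgroup, the degenerate case of Eq. (1)), directly minimizing the empirical BCE illustrates why limited per-subgroup samples are a problem: the training loss can be driven far below the test loss. A sketch with illustrative sizes (15 samples, 20 features; all names are our own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def bce(f_x, y, eps=1e-12):
    """Empirical BCE: -(1/m) sum [ y log f + (1-y) log(1-f) ]."""
    f_x = np.clip(f_x, eps, 1 - eps)
    return -np.mean(y * np.log(f_x) + (1 - y) * np.log(1 - f_x))

def fit_subgroup(X, y, lr=0.5, steps=500):
    """Directly minimize the empirical BCE for one subgroup (point-mass Q)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)
        w -= lr * X.T @ (p - y) / len(y)
    return w

rng = np.random.default_rng(3)
w_true = rng.normal(size=20)
X_tr = rng.normal(size=(15, 20))          # only 15 samples, 20 features
y_tr = (X_tr @ w_true > 0).astype(float)
X_te = rng.normal(size=(2000, 20))
y_te = (X_te @ w_true > 0).astype(float)

w = fit_subgroup(X_tr, y_tr)
train_loss = bce(sigmoid(X_tr @ w), y_tr)
test_loss = bce(sigmoid(X_te @ w), y_te)
print(train_loss < test_loss)   # overfitting on limited per-subgroup data
```

Regularizing the lower level toward the shared prior Q, as in the bilevel objective, is what counteracts this overfitting.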