Navigating Ensemble Configurations for Algorithmic Fairness
Michael Feffer,1Martin Hirzel,2Samuel C. Hoffman,2
Kiran Kate,2Parikshit Ram,2Avraham Shinnar2
1Carnegie Mellon University, Pittsburgh, PA, USA
2IBM Research, Yorktown Heights, NY, USA
mfeffer@andrew.cmu.edu, hirzel@us.ibm.com
Abstract
Bias mitigators can improve algorithmic fairness in machine
learning models, but their effect on fairness is often not stable
across data splits. A popular approach to train more stable mod-
els is ensemble learning, but unfortunately, it is unclear how to
combine ensembles with mitigators to best navigate trade-offs
between fairness and predictive performance. To that end, we
built an open-source library enabling the modular composi-
tion of 8 mitigators, 4 ensembles, and their corresponding
hyperparameters, and we empirically explored the space of
configurations on 13 datasets. We distilled our insights from
this exploration in the form of a guidance diagram for practi-
tioners that we demonstrate is robust and reproducible.
Introduction
Algorithmic bias in machine learning can lead to models that
discriminate against underprivileged groups in various do-
mains, including hiring, healthcare, finance, criminal justice,
education, and even child care. Of course, bias in machine
learning is a socio-technical problem that cannot be solved
with technical solutions alone. That said, to make tangible
progress, this paper focuses on bias mitigators, which im-
prove or replace an existing machine learning estimator (e.g.,
a classifier) so it makes less biased predictions (e.g., class
labels) as measured by a fairness metric (e.g., disparate im-
pact (Feldman et al. 2015)). Unfortunately, bias mitigation
often suffers from high volatility, meaning the estimator is
less stable with respect to group fairness metrics. In the worst
case, this volatility can even cause a model to appear fair
when measured on training data while being unfair on pro-
duction data. Given that ensembles (e.g., bagging or boost-
ing) can improve stability for accuracy metrics (Witten et al.
2016), we felt it was important to explore whether they also
improve stability for group fairness metrics.
Unfortunately, the sheer number of ways in which ensem-
bles and mitigators can be combined and configured with
base estimators and hyperparameters presents a dilemma. On
the one hand, the diversity of the space increases the chances
of it containing at least one combination with satisfactory
fairness and/or predictive performance for the provided data.
On the other hand, finding this combination via brute-force
exploration may be untenable if resources are limited.
To this end, we conducted experiments that navigated this
space with 8 bias mitigators from AIF360 (Bellamy et al.
2018); bagging, boosting, voting, and stacking ensembles
from the popular scikit-learn library (Buitinck et al. 2013);
and 13 datasets of various sizes and baseline fairness (earlier
papers used at most a handful). Specifically, we searched
the Cartesian product of datasets, mitigators, ensembles,
and hyperparameters both via brute-force and via Hyper-
opt (Bergstra, Yamins, and Cox 2013) for configurations that
optimized fairness while maintaining decent predictive per-
formance and vice-versa. Our findings confirm the intuition
that ensembles often improve stability of not just accuracy
but also of the group fairness metrics we explored. However,
the best configuration of mitigator and ensemble depends on
dataset characteristics, evaluation metric of choice, and even
worldview (Friedler, Scheidegger, and Venkatasubramanian
2021). Therefore, we automatically distilled a method selec-
tion guidance diagram in accordance with the results from
both brute-force search and Hyperopt exploration.
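To make the scale of this search concrete, a rough sketch of the brute-force arm is shown below; the dataset names and the evaluate_configuration stub are placeholders, not our actual experiment harness, which additionally varies base estimators and hyperparameters.

```python
import itertools

# Placeholder dataset names; the experiments use 13 datasets (see Table 1).
datasets = ["dataset_a", "dataset_b", "dataset_c"]
ensembles = ["bagging", "boosting", "voting", "stacking"]
mitigators = ["DisparateImpactRemover", "LFR", "Reweighing",
              "AdversarialDebiasing", "GerryFairClassifier",
              "MetaFairClassifier", "PrejudiceRemover",
              "CalibratedEqOddsPostprocessing"]

def evaluate_configuration(dataset, ensemble, mitigator):
    # Stand-in for fitting the configuration on several train/test splits and
    # recording predictive performance and group fairness metrics.
    return {"dataset": dataset, "ensemble": ensemble, "mitigator": mitigator}

# Brute-force search: enumerate the full Cartesian product of choices.
results = [evaluate_configuration(d, e, m)
           for d, e, m in itertools.product(datasets, ensembles, mitigators)]
```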
To support these experiments, we assembled a library of
pluggable ensembles, bias mitigators, and fairness datasets.
While we reused popular and well-established open-source
technologies, we made several new adaptations in our library
to get components to work well together. Our library is avail-
able open-source (https://github.com/IBM/lale) to encourage
research and real-world adoption.
Related Work
A few pieces of prior work used ensembles for fairness, but
they used specialized ensembles and bias mitigators, in con-
trast to our work, which uses off-the-shelf modular compo-
nents. The discrimination-aware ensemble uses a heteroge-
neous collection of base estimators (Kamiran, Karim, and
Zhang 2012); when they all agree, it returns the consensus
prediction, otherwise, it classifies instances as positive iff they
belong to the unprivileged group. The random ensemble also
uses a heterogeneous collection of base estimators, and picks
one of them at random to make a prediction (Grgic-Hlaca et al.
2017). The paper offers a synthetic case where the ensemble
is more fair and more accurate than all base estimators, but
lacks experiments with real datasets. Exponentiated gradient
reduction trains a sequence of base estimators using a game
theoretic model where one player seeks to maximize fairness
violations by the estimators so far and the other player seeks
to build a fairer next estimator (Agarwal et al. 2018). In the
end, for predictions, it uses weights to pick a random base
estimator. Fair AdaBoost modifies boosting to boost not for
accuracy but for individual fairness (Bhaskaruni, Hu, and
Lan 2019). In the end, for predictions, it gives a base estima-
tor higher weight if it was fair on more instances from the
training set. The fair voting ensemble uses a heterogeneous
collection of base estimators (Kenfack et al. 2021). Each
prediction votes among the base estimators $\varphi_t$, $t \in 1..n$, with weights
$W_t = \alpha \cdot A_t / \sum_{j=1}^{n} A_j + (1 - \alpha) \cdot F_t / \sum_{j=1}^{n} F_j$,
where $A_t$ is an accuracy metric and $F_t$ is a fairness
metric. The fair double ensemble uses stacked predictors, where
the final estimator is linear, with a novel approach to train the
weights of the final estimator to satisfy a system of accuracy
and fairness constraints (Mishler and Kennedy 2021).
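As a reading aid, the sketch below computes the fair voting weights exactly as in the formula above, assuming the per-estimator accuracy scores A_t and fairness scores F_t have already been measured (e.g., on validation data).

```python
import numpy as np

def fair_voting_weights(acc, fair, alpha=0.5):
    """W_t = alpha * A_t / sum_j A_j + (1 - alpha) * F_t / sum_j F_j."""
    acc = np.asarray(acc, dtype=float)
    fair = np.asarray(fair, dtype=float)
    return alpha * acc / acc.sum() + (1.0 - alpha) * fair / fair.sum()

# Example with three base estimators: weights sum to 1 and favor estimators
# that score well on both accuracy and fairness.
print(fair_voting_weights(acc=[0.80, 0.75, 0.90], fair=[0.95, 0.70, 0.60]))
```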
Each of the above-listed approaches used an ensemble-
specific bias mitigator, whereas we experiment with eight
different off-the-shelf modular mitigators. Moreover, each of
these approaches used one specific kind of ensemble, whereas
we experiment with off-the-shelf modular implementations
of bagging, boosting, voting, and stacking. Using off-the-
shelf mitigators and ensembles facilitates plug-and-play be-
tween the best available independently-developed implemen-
tations. Out of the work on fairness with ensembles discussed
above, one paper had an experimental evaluation with five
datasets (Agarwal et al. 2018) and the other papers used at
most three datasets. In contrast, we use 13 datasets. Finally,
unlike these earlier papers, our paper specifically explores
fairness stability and the best ways to combine mitigators and
ensembles. We auto-generate a guidance diagram from this
exploration.
Ours is not the first paper to use automated machine learn-
ing, including Bayesian optimizers, to optimize models and
mitigators for fairness (Perrone et al. 2020; Wu and Wang
2021). But unlike prior work, we specifically focus on ap-
plying AutoML to ensemble learning and bias mitigation to
validate our guidance diagram and our other search results.
Our work takes inspiration from earlier empirical stud-
ies of fairness techniques (Biswas and Rajan 2021; Friedler
et al. 2019; Holstein et al. 2019; Lee and Singh 2021; Singh
et al. 2021; Valentim, Lourenço, and Antunes 2019; Yang
et al. 2020), which help practitioners and researchers better
understand the state of the art. But unlike these works, we
experiment with ensembles and with fairness stability.
Our work also offers a new library of bias mitigators.
While there have been excellent prior fairness toolkits such
as ThemisML (Bantilan 2017), AIF360 (Bellamy et al. 2018),
and FairLearn (Agarwal et al. 2018), none of them support
ensembles. Ours is the first that is modular enough to in-
vestigate a large space of unexplored mitigator-ensemble
combinations. We previously published some aspects of our
library in a non-archival workshop with no official proceed-
ings, but that paper did not yet discuss ensembles (Hirzel,
Kate, and Ram 2021). In another non-archival workshop pa-
per, we discussed ensembles and some of these experimental
results (Feffer et al. 2022), but no Hyperopt results and only
limited analysis of the guidance diagram, both of which are
present in this work.
Library and Datasets
Aside from our experiments, one contribution of our work
is implementing compatibility between mitigators from
AIF360 (Bellamy et al. 2018) and ensembles from scikit-
learn (Buitinck et al. 2013). To provide the glue and facilitate
searching over a space of mitigator and ensemble configu-
rations, we extended the Lale open-source library for semi-
automated data science (Baudart et al. 2021).
Metrics.
This paper uses metrics from scikit-learn, including precision, recall, and F1 score. In addition, we implemented a scikit-learn compatible API for several fairness metrics from AIF360, including disparate impact (as described in Feldman et al. (2015)). We also measure time (in seconds) and memory (in MB) used when fitting models.
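For concreteness, a minimal sketch of disparate impact for binary predictions and a binary protected attribute is shown below; it illustrates the metric's definition rather than the API of our library or of AIF360.

```python
import numpy as np

def disparate_impact(y_pred, group, unprivileged=0, favorable=1):
    """P(prediction = favorable | unprivileged) / P(prediction = favorable | privileged).
    A value of 1.0 indicates parity; values below 1.0 disadvantage the unprivileged group."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    unpriv = group == unprivileged
    rate_unpriv = np.mean(y_pred[unpriv] == favorable)
    rate_priv = np.mean(y_pred[~unpriv] == favorable)
    return rate_unpriv / rate_priv

# Example: a 1/3 favorable rate for the unprivileged group (group 1) versus
# 2/3 for the privileged group (group 0) gives a disparate impact of 0.5.
print(disparate_impact(y_pred=[1, 0, 1, 1, 0, 0], group=[0, 0, 0, 1, 1, 1], unprivileged=1))
```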
Ensembles.
Ensemble learning uses multiple weak models
to form one strong model. Our experiments use four ensem-
bles supported by scikit-learn: bagging, boosting, voting, and
stacking. Following scikit-learn, we use the following ter-
minology to characterize ensembles: A base estimator is an
estimator that serves as a building block for the ensemble. An
ensemble has one of two composition types: homogeneous, where all base estimators are identical (e.g., bagging and boosting), or heterogeneous, where they differ (e.g., voting and stacking). For the homogeneous ensembles,
we used their most common base estimator in practice: the
decision-tree classifier. For the heterogeneous ensembles (vot-
ing and stacking), we used a set of typical base estimators:
XGBoost (Chen and Guestrin 2016), random forest, k-nearest
neighbors, and support vector machines. Finally, for stacking,
we also used XGBoost as the final estimator.
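The following sketch builds the four ensemble shapes described above with plain scikit-learn and XGBoost; the hyperparameter values are illustrative defaults, not the settings searched in our experiments.

```python
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Homogeneous ensembles: many copies of a decision-tree base estimator.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10)
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50)

# Heterogeneous ensembles: a set of typical, distinct base estimators.
base_estimators = [("xgb", XGBClassifier()),
                   ("forest", RandomForestClassifier()),
                   ("knn", KNeighborsClassifier()),
                   ("svm", SVC(probability=True))]
voting = VotingClassifier(estimators=base_estimators, voting="soft")
stacking = StackingClassifier(estimators=base_estimators,
                              final_estimator=XGBClassifier())
```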
Mitigators.
We added support in Lale for bias mitiga-
tion from AIF360 (Bellamy et al. 2018). AIF360 distin-
guishes three kinds of mitigators for improving group fair-
ness: pre-estimator mitigators, which are learned input ma-
nipulations that reduce bias in the data sent to downstream
estimators (we used DisparateImpactRemover (Feldman et al.
2015), LFR (Zemel et al. 2013), and Reweighing (Kamiran
and Calders 2012)); in-estimator mitigators, which are spe-
cialized estimators that directly incorporate debiasing into
their training (AdversarialDebiasing (Zhang, Lemoine, and
Mitchell 2018), GerryFairClassifier (Kearns et al. 2018),
MetaFairClassifier (Celis et al. 2019), and PrejudiceRe-
mover (Kamishima et al. 2012)); and post-estimator mitiga-
tors, which reduce bias in predictions made by an upstream
estimator (we used CalibratedEqOddsPostprocessing (Pleiss
et al. 2017)).
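To make the three kinds concrete, the sketch below marks where each kind of mitigator sits relative to the estimator; PreMitigator and PostMitigator are hypothetical stand-ins for illustration, not the AIF360 or Lale classes named above.

```python
from sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

class PreMitigator(TransformerMixin, BaseEstimator):
    """Hypothetical pre-estimator mitigator: learns an input manipulation and
    applies it to the data before the downstream estimator sees it."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X  # a real mitigator (e.g., DisparateImpactRemover) repairs features here

class PostMitigator(ClassifierMixin, BaseEstimator):
    """Hypothetical post-estimator mitigator: wraps an upstream estimator and
    adjusts its predictions after the fact."""
    def __init__(self, estimator):
        self.estimator = estimator
    def fit(self, X, y):
        self.estimator.fit(X, y)
        return self
    def predict(self, X):
        return self.estimator.predict(X)  # a real mitigator adjusts these predictions

pre_est = make_pipeline(PreMitigator(), DecisionTreeClassifier())  # mitigate, then estimate
in_est = DecisionTreeClassifier()  # stand-in for an in-estimator mitigator,
                                   # which debiases inside its own training
post_est = PostMitigator(DecisionTreeClassifier())                 # estimate, then mitigate
```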
Fig. 1 visualizes the combinations of ensemble and miti-
gator kinds we explored, while also highlighting the modu-
larity of our approach. Mitigation strategies can be applied
at the level of either the base estimator or the entire ensem-
ble, but by the nature of some ensembles and mitigators,
not all combinations are feasible. First, post-estimator miti-
gators typically do not support
predict_proba functionality
required for some ensemble methods and recommended for
others. Calibrating probabilities from post-estimator miti-
gators has been shown to be tricky (Pleiss et al. 2017), so
despite Lale support for other post-estimator mitigators, our
experiments only explored CalibratedEqOddsPostprocessing.
Additionally, it is impossible to apply an in-estimator mitigator at the ensemble level, so we exclude those combinations. Finally, we decided to omit some combinations that are technically feasible but less interesting. For example, while our library supports mitigation at multiple points, say, at both the ensemble and estimator level of bagging, we elided these configurations from Fig. 1 and from our experiments.

[Figure 1 shows a grid of ensemble-mitigator combinations, organized by kind of ensemble (bagging, boosting, voting, stacking) and kind of fairness mitigation (pre-estimator, in-estimator, post-estimator), with pre- and post-estimator mitigation applied at either the estimator level or the ensemble level:
- bagging: Pr(Bag(e, n)), Bag(Pr(e), n), Bag(In, n), Bag(Post(e), n), Post(Bag(e, n))
- boosting: Pr(Boost(e, n)), Boost(Pr(e), n), Boost(In, n), Boost(Post(e), n), Post(Boost(e, n))
- voting: Pr(Vote(e)), Vote(Pr(e)), Vote(In), Vote(Post(e)), Post(Vote(e))
- stacking: Pr(Stack(e, e)), Stack(Pr(e), e), Stack(e, Pr(e)), Stack(In, e), Stack(e, In), Stack(Post(e), e), Stack(e, Post(e)), Post(Stack(e, e))]

Figure 1: Combinations of ensembles and mitigators. Pr(e) applies a pre-estimator mitigator before an estimator e; In denotes an in-estimator mitigator, which is itself an estimator; and Post(e) applies a post-estimator mitigator after an estimator e. Bag(e, n) is short for BaggingClassifier with n instances of base estimator e; Boost(e, n) is short for AdaBoostClassifier with n instances of base estimator e; Vote(e) applies a VotingClassifier to a list of base estimators e; and Stack(e, e) applies a StackingClassifier to a list of base estimators (first e) and a final estimator (second e). For stacking, the passthrough option is represented by a dashed arrow.
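To illustrate the estimator-level versus ensemble-level distinction from Fig. 1 with plain scikit-learn, the sketch below uses StandardScaler as a stand-in for a pre-estimator mitigator; the actual mitigators in our experiments are the AIF360 operators listed earlier.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler  # stand-in for a pre-estimator mitigator
from sklearn.tree import DecisionTreeClassifier

# Estimator-level mitigation, Bag(Pr(e), n): each of the n base estimators gets
# its own copy of the mitigation, fitted on its own bootstrap sample.
estimator_level = BaggingClassifier(
    make_pipeline(StandardScaler(), DecisionTreeClassifier()), n_estimators=10)

# Ensemble-level mitigation, Pr(Bag(e, n)): a single mitigation step fitted once
# on the full training data, placed in front of the whole ensemble.
ensemble_level = make_pipeline(
    StandardScaler(),
    BaggingClassifier(DecisionTreeClassifier(), n_estimators=10))
```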
Datasets.
We gathered the datasets for our experiments
primarily from OpenML (Vanschoren et al. 2014); the excep-
tions come from the Medical Expenditure Panel Survey (MEPS)
data not hosted there (AHRQ 2015, 2016). Some have been
used extensively as benchmarks elsewhere in the algorith-
mic fairness literature. We pulled other novel datasets from
OpenML that have demographic data that could be consid-
ered protected attributes (such as race, age, or gender) and
contained associated baseline levels of disparate impact. In
addition, to get a sense for the predictive power of each pro-
tected attribute, we fit XGBoost models to each dataset with
five different seeds and found the ranking of the average fea-
ture importance (where 1 is the most important) of the most
predictive protected attribute for that dataset. In all, we used
13 datasets, with most information summarized in Table 1
and granular feature importance information summarized
in the Appendix. When running experiments, we split the
datasets using stratification by not just the target labels but
also the protected attributes (Hirzel, Kate, and Ram 2021),
leading to moderately more homogeneous fairness results
across different splits. The exact details of the preprocessing
are in the open-source code for our library for reproducibility.
We hope that bundling these datasets and default preprocess-
ing with our package, in addition to AIF360 and scikit-learn
compatibility, will improve dataset quality going forward.
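A minimal sketch of such a split is shown below, assuming X is a pandas DataFrame whose protected attribute column is named explicitly and y is the corresponding label Series; the strata combine the label with the protected attribute so that both stay balanced across splits.

```python
from sklearn.model_selection import train_test_split

def split_by_label_and_group(X, y, protected_column, test_size=0.33, seed=42):
    """Stratify on the joint (label, protected attribute) value so class balance
    and group balance are both preserved in the train and test splits."""
    strata = y.astype(str) + "_" + X[protected_column].astype(str)
    return train_test_split(X, y, test_size=test_size,
                            stratify=strata, random_state=seed)

# Hypothetical usage, assuming a "gender" column holds the protected attribute:
# X_train, X_test, y_train, y_test = split_by_label_and_group(X, y, "gender")
```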
Methodology
Given our 13 datasets, 4 types of ensembles, 8 mitigators,
and all relevant hyperparameters, we wanted to gain insights
about the best ways to combine ensemble learning and bias
mitigation in various problem contexts and data setups. To
this end, we conducted two searches over the Cartesian prod-
uct of these settings and compared their results. The first was
a manual grid search to determine optimal configurations
for each dataset. The second also involved finding suitable
configurations per dataset, but was automated via Bayesian optimization using Hyperopt (Bergstra, Yamins, and Cox 2013).
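A minimal Hyperopt sketch over a toy configuration space is given below; the space and the constant-scoring objective are placeholders for the per-dataset search spaces and the combined fairness/performance objectives used in our experiments.

```python
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe

space = {
    "ensemble": hp.choice("ensemble", ["bagging", "boosting", "voting", "stacking"]),
    "mitigator": hp.choice("mitigator", ["DisparateImpactRemover", "LFR", "Reweighing",
                                         "PrejudiceRemover", "CalibratedEqOddsPostprocessing"]),
    "n_estimators": hp.quniform("n_estimators", 2, 50, 1),
}

def objective(config):
    # Placeholder objective: a real one would fit the configuration and return
    # a loss combining a predictive metric with a group fairness metric such as
    # disparate impact, measured across cross-validation splits.
    score = 0.0
    return {"loss": -score, "status": STATUS_OK}

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=100, trials=trials)
print(best)
```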