
estimator. Fair AdaBoost modifies boosting to boost not for
accuracy but for individual fairness (Bhaskaruni, Hu, and
Lan 2019). At prediction time, it gives a base estimator
higher weight if it was fair on more instances from the
training set. The fair voting ensemble uses a heterogeneous
collection of base estimators (Kenfack et al. 2021). Each
prediction votes among the base estimators $\phi_t$, $t \in \{1, \dots, n\}$,
with weights $W_t = \alpha \cdot A_t / \sum_{j=1}^{n} A_j + (1 - \alpha) \cdot F_t / \sum_{j=1}^{n} F_j$,
where $A_t$ is an accuracy metric and $F_t$ is a fairness
metric. The fair double ensemble uses stacked predictors, where
the final estimator is linear, with a novel approach to train the
weights of the final estimator to satisfy a system of accuracy
and fairness constraints (Mishler and Kennedy 2021).
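For illustration, the fair voting weights above can be computed as in the following sketch (our own rendering of the formula, not code from Kenfack et al. (2021)), with hypothetical per-estimator accuracy and fairness scores:

```python
import numpy as np

def fair_voting_weights(accuracies, fairnesses, alpha=0.5):
    """Weight W_t for base estimator phi_t: a convex combination of its
    normalized accuracy A_t and its normalized fairness F_t."""
    A = np.asarray(accuracies, dtype=float)
    F = np.asarray(fairnesses, dtype=float)
    return alpha * A / A.sum() + (1 - alpha) * F / F.sum()

# Hypothetical scores for three base estimators; the weights sum to 1.
print(fair_voting_weights([0.85, 0.90, 0.80], [0.95, 0.70, 0.90]))
```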
Each of the above-listed approaches used an ensemble-
specific bias mitigator, whereas we experiment with eight
different off-the-shelf modular mitigators. Moreover, each of
these approaches used one specific kind of ensemble, whereas
we experiment with off-the-shelf modular implementations
of bagging, boosting, voting, and stacking. Using off-the-
shelf mitigators and ensembles facilitates plug-and-play be-
tween the best available independently-developed implemen-
tations. Out of the work on fairness with ensembles discussed
above, one paper had an experimental evaluation with five
datasets (Agarwal et al. 2018) and the other papers used at
most three datasets. In contrast, we use 13 datasets. Finally,
unlike these earlier papers, our paper specifically explores
fairness stability and the best ways to combine mitigators and
ensembles. We auto-generate a guidance diagram from this
exploration.
Ours is not the first paper to use automated machine learn-
ing, including Bayesian optimizers, to optimize models and
mitigators for fairness (Perrone et al. 2020; Wu and Wang
2021). But unlike prior work, we specifically focus on applying
AutoML to ensemble learning and bias mitigation, using it to
validate our guidance diagram and our other search results.
Our work takes inspiration from earlier empirical stud-
ies of fairness techniques (Biswas and Rajan 2021; Friedler
et al. 2019; Holstein et al. 2019; Lee and Singh 2021; Singh
et al. 2021; Valentim, Lourenço, and Antunes 2019; Yang
et al. 2020), which help practitioners and researchers better
understand the state of the art. But unlike these works, we
experiment with ensembles and with fairness stability.
Our work also offers a new library of bias mitigators.
While there have been excellent prior fairness toolkits such
as ThemisML (Bantilan 2017), AIF360 (Bellamy et al. 2018),
and FairLearn (Agarwal et al. 2018), none of them support
ensembles. Ours is the first that is modular enough to in-
vestigate a large space of unexplored mitigator-ensemble
combinations. We previously published some aspects of our
library in a non-archival workshop with no official proceed-
ings, but that paper did not yet discuss ensembles (Hirzel,
Kate, and Ram 2021). In another non-archival workshop pa-
per, we discussed ensembles and some of these experimental
results (Feffer et al. 2022), but no Hyperopt results and only
limited analysis of the guidance diagram, both of which are
present in this work.
Library and Datasets
Aside from our experiments, one contribution of our work
is implementing compatibility between mitigators from
AIF360 (Bellamy et al. 2018) and ensembles from scikit-
learn (Buitinck et al. 2013). To provide the glue and facilitate
searching over a space of mitigator and ensemble configu-
rations, we extended the Lale open-source library for semi-
automated data science (Baudart et al. 2021).
Metrics.
This paper uses metrics from scikit-learn, including
precision, recall, and $F_1$ score. In addition, we implemented
a scikit-learn compatible API for several fairness metrics
from AIF360, including disparate impact (as described
in Feldman et al. (2015)). We also measure time (in seconds)
and memory (in MB) utilized when fitting models.
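As an example, disparate impact is the rate of favorable outcomes for the unprivileged group divided by that for the privileged group. A minimal sketch of the metric itself (not our AIF360-backed implementation) is:

```python
import numpy as np

def disparate_impact(y_pred, group, unprivileged=0, privileged=1, favorable=1):
    """Disparate impact as in Feldman et al. (2015): the ratio of
    favorable-outcome rates between unprivileged and privileged groups.
    A value of 1.0 indicates parity; values below 0.8 are a common red flag."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    rate_unpriv = np.mean(y_pred[group == unprivileged] == favorable)
    rate_priv = np.mean(y_pred[group == privileged] == favorable)
    return rate_unpriv / rate_priv
```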
Ensembles.
Ensemble learning uses multiple weak models
to form one strong model. Our experiments use four ensem-
bles supported by scikit-learn: bagging, boosting, voting, and
stacking. Following scikit-learn, we use this terminology
to characterize ensembles: A base estimator is an
estimator that serves as a building block for the ensemble. An
ensemble has one of two composition types: it consists either
of identical base estimators (homogeneous, e.g. bagging and
boosting) or of different ones (heterogeneous, e.g. voting and
stacking). For the homogeneous ensembles,
we used their most common base estimator in practice: the
decision-tree classifier. For the heterogeneous ensembles (vot-
ing and stacking), we used a set of typical base estimators:
XGBoost (Chen and Guestrin 2016), random forest, k-nearest
neighbors, and support vector machines. Finally, for stacking,
we also used XGBoost as the final estimator.
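These configurations can be assembled from off-the-shelf components; the sketch below uses scikit-learn and XGBoost with illustrative (untuned) hyperparameters, not our exact experimental settings:

```python
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Homogeneous ensembles: many copies of one decision-tree base estimator.
bagging = BaggingClassifier(DecisionTreeClassifier())
boosting = AdaBoostClassifier(DecisionTreeClassifier())

# Heterogeneous ensembles: a set of typical, diverse base estimators.
base_estimators = [
    ("xgb", XGBClassifier()),
    ("forest", RandomForestClassifier()),
    ("knn", KNeighborsClassifier()),
    ("svm", SVC(probability=True)),  # soft voting needs predict_proba
]
voting = VotingClassifier(estimators=base_estimators, voting="soft")
stacking = StackingClassifier(estimators=base_estimators,
                              final_estimator=XGBClassifier())
```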
Mitigators.
We added support in Lale for bias mitiga-
tion from AIF360 (Bellamy et al. 2018). AIF360 distin-
guishes three kinds of mitigators for improving group fair-
ness: pre-estimator mitigators, which are learned input ma-
nipulations that reduce bias in the data sent to downstream
estimators (we used DisparateImpactRemover (Feldman et al.
2015), LFR (Zemel et al. 2013), and Reweighing (Kamiran
and Calders 2012)); in-estimator mitigators, which are spe-
cialized estimators that directly incorporate debiasing into
their training (AdversarialDebiasing (Zhang, Lemoine, and
Mitchell 2018), GerryFairClassifier (Kearns et al. 2018),
MetaFairClassifier (Celis et al. 2019), and PrejudiceRe-
mover (Kamishima et al. 2012)); and post-estimator mitiga-
tors, which reduce bias in predictions made by an upstream
estimator (we used CalibratedEqOddsPostprocessing (Pleiss
et al. 2017)).
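To make the three kinds concrete, the following simplified sketch shows how one mitigator of each kind might be configured with Lale-style wrappers; the dataset-specific fairness specification here is hypothetical, and exact constructor signatures may differ from the released library:

```python
from lale.lib.aif360 import (CalibratedEqOddsPostprocessing,
                             DisparateImpactRemover, PrejudiceRemover)
from lale.lib.sklearn import LogisticRegression

# Hypothetical fairness specification: favorable label and privileged group.
fairness_info = {
    "favorable_labels": [1],
    "protected_attributes": [{"feature": "sex", "reference_group": [1]}],
}

# Pre-estimator: learned input manipulation before a downstream estimator.
pre = DisparateImpactRemover(**fairness_info) >> LogisticRegression()

# In-estimator: debiasing incorporated directly into training.
in_est = PrejudiceRemover(**fairness_info)

# Post-estimator: reduces bias in predictions of an upstream estimator.
post = CalibratedEqOddsPostprocessing(**fairness_info,
                                      estimator=LogisticRegression())
```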
Fig. 1 visualizes the combinations of ensemble and miti-
gator kinds we explored, while also highlighting the modu-
larity of our approach. Mitigation strategies can be applied
at the level of either the base estimator or the entire ensem-
ble, but by the nature of some ensembles and mitigators,
not all combinations are feasible. First, post-estimator miti-
gators typically do not support the predict_proba functionality
required for some ensemble methods and recommended for
others. Calibrating probabilities from post-estimator miti-
gators has been shown to be tricky (Pleiss et al. 2017), so
despite Lale support for other post-estimator mitigators, our