
estimator. Fair AdaBoost modifies boosting to boost not for
accuracy but for individual fairness (Bhaskaruni, Hu, and
Lan 2019). At prediction time, it gives a base estimator
higher weight if it was fair on more instances from the
training set. The fair voting ensemble uses a heterogeneous
collection of base estimators (Kenfack et al. 2021). Each
prediction votes among the base estimators $\phi_t$, $t \in \{1, \dots, n\}$,
with weights $W_t = \alpha \cdot A_t / \sum_{j=1}^{n} A_j + (1 - \alpha) \cdot F_t / \sum_{j=1}^{n} F_j$,
where $A_t$ is an accuracy metric and $F_t$ is a fairness
metric. The fair double ensemble uses stacked predictors, where
the final estimator is linear, with a novel approach to train the
weights of the final estimator to satisfy a system of accuracy
and fairness constraints (Mishler and Kennedy 2021).
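For illustration, the fair voting weights above can be computed as in the following sketch (our own rendering of the formula, not code from Kenfack et al. (2021)), with hypothetical per-estimator accuracy and fairness scores:

```python
import numpy as np

def fair_voting_weights(accuracies, fairnesses, alpha=0.5):
    """Weight W_t for base estimator phi_t: a convex combination of its
    normalized accuracy A_t and its normalized fairness F_t."""
    A = np.asarray(accuracies, dtype=float)
    F = np.asarray(fairnesses, dtype=float)
    return alpha * A / A.sum() + (1 - alpha) * F / F.sum()

# Hypothetical scores for three base estimators; the weights sum to 1.
print(fair_voting_weights([0.85, 0.90, 0.80], [0.95, 0.70, 0.90]))
```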
Each of the above-listed approaches used an ensemble-
specific bias mitigator, whereas we experiment with eight
different off-the-shelf modular mitigators. Moreover, each of
these approaches used one specific kind of ensemble, whereas
we experiment with off-the-shelf modular implementations
of bagging, boosting, voting, and stacking. Using off-the-
shelf mitigators and ensembles facilitates plug-and-play be-
tween the best available independently-developed implemen-
tations. Out of the work on fairness with ensembles discussed
above, one paper had an experimental evaluation with five
datasets (Agarwal et al. 2018) and the other papers used at
most three datasets. In contrast, we use 13 datasets. Finally,
unlike these earlier papers, our paper specifically explores
fairness stability and the best ways to combine mitigators and
ensembles. We auto-generate a guidance diagram from this
exploration.
Ours is not the first paper to use automated machine learn-
ing, including Bayesian optimizers, to optimize models and
mitigators for fairness (Perrone et al. 2020; Wu and Wang
2021). But unlike prior work, we specifically focus on applying
AutoML to ensemble learning and bias mitigation, using it to
validate our guidance diagram and our other search results.
Our work takes inspiration from earlier empirical stud-
ies of fairness techniques (Biswas and Rajan 2021; Friedler
et al. 2019; Holstein et al. 2019; Lee and Singh 2021; Singh
et al. 2021; Valentim, Lourenço, and Antunes 2019; Yang
et al. 2020), which help practitioners and researchers better
understand the state of the art. But unlike these works, we
experiment with ensembles and with fairness stability.
Our work also offers a new library of bias mitigators.
While there have been excellent prior fairness toolkits such
as ThemisML (Bantilan 2017), AIF360 (Bellamy et al. 2018),
and FairLearn (Agarwal et al. 2018), none of them support
ensembles. Ours is the first that is modular enough to in-
vestigate a large space of unexplored mitigator-ensemble
combinations. We previously published some aspects of our
library in a non-archival workshop with no official proceed-
ings, but that paper did not yet discuss ensembles (Hirzel,
Kate, and Ram 2021). In another non-archival workshop pa-
per, we discussed ensembles and some of these experimental
results (Feffer et al. 2022), but no Hyperopt results and only
limited analysis of the guidance diagram, both of which are
present in this work.
Library and Datasets
Aside from our experiments, one contribution of our work
is implementing compatibility between mitigators from
AIF360 (Bellamy et al. 2018) and ensembles from scikit-
learn (Buitinck et al. 2013). To provide the glue and facilitate
searching over a space of mitigator and ensemble configu-
rations, we extended the Lale open-source library for semi-
automated data science (Baudart et al. 2021).
Metrics.
This paper uses metrics from scikit-learn, including
precision, recall, and $F_1$ score. In addition, we implemented
a scikit-learn compatible API for several fairness metrics
from AIF360, including disparate impact (as described
in Feldman et al. (2015)). We also measure time (in seconds)
and memory (in MB) utilized when fitting models.
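As an example, disparate impact is the rate of favorable outcomes for the unprivileged group divided by that for the privileged group. A minimal sketch of the metric itself (not our AIF360-backed implementation) is:

```python
import numpy as np

def disparate_impact(y_pred, group, unprivileged=0, privileged=1, favorable=1):
    """Disparate impact as in Feldman et al. (2015): the ratio of
    favorable-outcome rates between unprivileged and privileged groups.
    A value of 1.0 indicates parity; values below 0.8 are a common red flag."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    rate_unpriv = np.mean(y_pred[group == unprivileged] == favorable)
    rate_priv = np.mean(y_pred[group == privileged] == favorable)
    return rate_unpriv / rate_priv
```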
Ensembles.
Ensemble learning uses multiple weak models
to form one strong model. Our experiments use four ensem-
bles supported by scikit-learn: bagging, boosting, voting, and
stacking. Following scikit-learn, we use this terminology
to characterize ensembles: A base estimator is an
estimator that serves as a building block for the ensemble. An
ensemble has one of two composition types: it consists either
of identical base estimators (homogeneous, e.g. bagging and
boosting) or of different ones (heterogeneous, e.g. voting and
stacking). For the homogeneous ensembles,
we used their most common base estimator in practice: the
decision-tree classifier. For the heterogeneous ensembles (vot-
ing and stacking), we used a set of typical base estimators:
XGBoost (Chen and Guestrin 2016), random forest, k-nearest
neighbors, and support vector machines. Finally, for stacking,
we also used XGBoost as the final estimator.
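These configurations can be assembled from off-the-shelf components; the sketch below uses scikit-learn and XGBoost with illustrative (untuned) hyperparameters, not our exact experimental settings:

```python
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Homogeneous ensembles: many copies of one decision-tree base estimator.
bagging = BaggingClassifier(DecisionTreeClassifier())
boosting = AdaBoostClassifier(DecisionTreeClassifier())

# Heterogeneous ensembles: a set of typical, diverse base estimators.
base_estimators = [
    ("xgb", XGBClassifier()),
    ("forest", RandomForestClassifier()),
    ("knn", KNeighborsClassifier()),
    ("svm", SVC(probability=True)),  # soft voting needs predict_proba
]
voting = VotingClassifier(estimators=base_estimators, voting="soft")
stacking = StackingClassifier(estimators=base_estimators,
                              final_estimator=XGBClassifier())
```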
Mitigators.
We added support in Lale for bias mitiga-
tion from AIF360 (Bellamy et al. 2018). AIF360 distin-
guishes three kinds of mitigators for improving group fair-
ness: pre-estimator mitigators, which are learned input ma-
nipulations that reduce bias in the data sent to downstream
estimators (we used DisparateImpactRemover (Feldman et al.
2015), LFR (Zemel et al. 2013), and Reweighing (Kamiran
and Calders 2012)); in-estimator mitigators, which are spe-
cialized estimators that directly incorporate debiasing into
their training (AdversarialDebiasing (Zhang, Lemoine, and
Mitchell 2018), GerryFairClassifier (Kearns et al. 2018),
MetaFairClassifier (Celis et al. 2019), and PrejudiceRe-
mover (Kamishima et al. 2012)); and post-estimator mitiga-
tors, which reduce bias in predictions made by an upstream
estimator (we used CalibratedEqOddsPostprocessing (Pleiss
et al. 2017)).
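To make the three kinds concrete, the following simplified sketch shows how one mitigator of each kind might be configured with Lale-style wrappers; the dataset-specific fairness specification here is hypothetical, and exact constructor signatures may differ from the released library:

```python
from lale.lib.aif360 import (CalibratedEqOddsPostprocessing,
                             DisparateImpactRemover, PrejudiceRemover)
from lale.lib.sklearn import LogisticRegression

# Hypothetical fairness specification: favorable label and privileged group.
fairness_info = {
    "favorable_labels": [1],
    "protected_attributes": [{"feature": "sex", "reference_group": [1]}],
}

# Pre-estimator: learned input manipulation before a downstream estimator.
pre = DisparateImpactRemover(**fairness_info) >> LogisticRegression()

# In-estimator: debiasing incorporated directly into training.
in_est = PrejudiceRemover(**fairness_info)

# Post-estimator: reduces bias in predictions of an upstream estimator.
post = CalibratedEqOddsPostprocessing(**fairness_info,
                                      estimator=LogisticRegression())
```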
Fig. 1 visualizes the combinations of ensemble and miti-
gator kinds we explored, while also highlighting the modu-
larity of our approach. Mitigation strategies can be applied
at the level of either the base estimator or the entire ensem-
ble, but by the nature of some ensembles and mitigators,
not all combinations are feasible. First, post-estimator miti-
gators typically do not support the predict_proba functionality
required for some ensemble methods and recommended for
others. Calibrating probabilities from post-estimator miti-
gators has been shown to be tricky (Pleiss et al. 2017), so
despite Lale support for other post-estimator mitigators, our