
[Table 1 spans this area. Rows are datasets and columns are metrics. The metric columns fall into three groups: extrinsic performance metrics (Gap (Label), Gap (Stereo), Gap (Gender)); extrinsic prediction metrics (% or # of Answer Changed, % or # Model Pred Prefers Stereotype, LM Prediction Gap on Target Words); and intrinsic metrics (SEAT, CEAT, Probe, Cluster, Nearest Neighbors, Cos, PCA). The dataset rows are: Winogender (Rudinger et al., 2018); Winobias (Zhao et al., 2018a); GAP (Webster et al., 2018); CrowS-Pairs (Nangia et al., 2020a); StereoSet (Nadeem et al., 2021); Bias in Bios (De-Arteaga et al., 2019); EEC (Kiritchenko and Mohammad, 2018); STS-B for genders (Beutel et al., 2020); the NLI data of Dev et al. (2020a); PTB, WikiText, and CNN/DailyMail (Bordia and Bowman, 2019); BOLD (Dhamala et al., 2021); templates from May et al. (2019a); templates from Kurita et al. (2019); DisCo templates (Beutel et al., 2020); BEC-Pro templates (Bartl et al., 2020); an English-German news corpus (Basta et al., 2021); Reddit (Guo and Caliskan, 2021; Voigt et al., 2018); MAP (Cao and Daumé III, 2021); and GICoref (Cao and Daumé III, 2021).]
Table 1: Combinations of gender bias datasets and metrics in the literature. X marks a feasible combination of a metric and a dataset, a bold X marks the original metrics used on the dataset, and X(aug) marks metrics that can be measured after augmenting the dataset such that every example is matched with a counterexample of another gender. Extrinsic performance metrics depend on gold labels, while extrinsic prediction metrics do not. A full description of the metrics is given in Appendix A.
Table 1 shows that for most datasets, more than one metric can be measured (many X's in the same row). Some datasets, such as Bias in Bios (De-Arteaga et al., 2019), are compatible with numerous metrics, while others are compatible with fewer, but still multiple, metrics. Bias in Bios supports so many metrics because it contains the information needed to compute them: in addition to gold labels, it has gender labels and clear stereotype definitions derived from its labels, which are professions; a minimal sketch of one such label-based gap metric is given below.
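To illustrate what such an extrinsic performance gap looks like in practice, here is a minimal sketch, assuming gold task labels, model predictions, and a binary gender annotation per example; it computes the true-positive-rate (TPR) gap between genders for a single profession label, in the spirit of the Bias in Bios setup. The function name, gender encoding, and toy data are illustrative assumptions, not the original papers' code.

```python
# Minimal sketch of an extrinsic performance gap: the TPR difference between
# genders for one target label (e.g., a profession in Bias in Bios).
# Toy data and the "F"/"M" gender encoding are illustrative assumptions.
from collections import defaultdict

def tpr_gap(gold, pred, gender, target_label):
    """TPR on `target_label` for the "F" group minus TPR for the "M" group."""
    hits = defaultdict(int)    # correct predictions of target_label, per gender
    totals = defaultdict(int)  # gold occurrences of target_label, per gender
    for g, p, grp in zip(gold, pred, gender):
        if g == target_label:
            totals[grp] += 1
            if p == target_label:
                hits[grp] += 1
    tpr = {grp: hits[grp] / totals[grp] for grp in totals}
    return tpr.get("F", 0.0) - tpr.get("M", 0.0)

# Toy example: gold and predicted professions plus a gender label per example.
gold   = ["nurse", "nurse", "nurse", "surgeon", "surgeon", "surgeon"]
pred   = ["nurse", "nurse", "teacher", "surgeon", "nurse", "surgeon"]
gender = ["F", "M", "M", "F", "F", "M"]

print(tpr_gap(gold, pred, gender, "nurse"))    # 0.5: "nurse" recalled better for F
print(tpr_gap(gold, pred, gender, "surgeon"))  # -0.5: "surgeon" recalled better for M
```

Per-label gaps like these can then be aggregated over all professions (e.g., averaged, or correlated with each profession's gender balance) to obtain a single score.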
Text corpora and template data, which do not address a specific task (bottom seven rows), are mostly compatible with intrinsic metrics. The compatibility of intrinsic metrics with many datasets may explain why papers report intrinsic metrics more often (§5). Additionally, Table 1 indicates that few datasets can be used to measure extrinsic metrics, particularly extrinsic performance metrics that require gold labels. On the other hand, measuring the LM prediction gap on target words, which we consider extrinsic, can be done on many datasets. This is useful for analyzing bias when dealing with LMs: bias metrics can be computed from the LM's output predictions, such as the mean probability gap between predicting the word "he" and the word "she" in specific contexts. Also, some templates are valid for measuring extrinsic prediction metrics, especially stereotype-related metrics, since they were developed with explicit stereotypes (such as profession-related stereotypes) in mind.
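As a concrete illustration of the LM prediction gap on target words, the sketch below computes the mean difference between the probabilities a masked language model assigns to "he" and to "she" in a few contexts. The model name and templates are illustrative assumptions, not the setup of any particular dataset above.

```python
# Minimal sketch: mean P("he") - P("she") assigned by a masked LM at a masked
# position across a handful of contexts. Model and templates are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

templates = [
    f"{tokenizer.mask_token} is a nurse.",
    f"{tokenizer.mask_token} is an engineer.",
    f"{tokenizer.mask_token} studied computer science at college.",
]

he_id = tokenizer.convert_tokens_to_ids("he")
she_id = tokenizer.convert_tokens_to_ids("she")

gaps = []
for text in templates:
    inputs = tokenizer(text, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits[0, mask_pos].softmax(dim=-1).squeeze(0)
    gaps.append((probs[he_id] - probs[she_id]).item())

mean_gap = sum(gaps) / len(gaps)  # > 0 means "he" is preferred on average
print(f"Mean P(he) - P(she) over templates: {mean_gap:.4f}")
```

The same computation extends to other target-word pairs, or to sums over lexicons of gendered words.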
Based on Table 1, it is clear that there are many possible ways to measure gender bias in the literature, but they all fall under the vague category of "gender bias". Each possible combination gives a different definition, or interpretation, of gender bias. The large number of different metrics makes it difficult or even impossible to compare different studies, including proposed gender bias mitigation methods. This raises questions about the validity of results derived from specific combinations of measurements. In the next two sections, we demonstrate how the choice of datasets and metrics can affect the bias measurement.
4 Effect of Dataset on Measured Results
The choice of data used to measure bias affects the calculated bias. Many researchers have used sentence templates that are "semantically bleached" (e.g., "This is <word>.", "<person> studied <profession> at college.") to adapt metrics developed for static word embeddings to contextualized representations (May et al., 2019b; Kurita et al., 2019; Webster et al., 2020; Bartl et al., 2020). Delobelle et al. (2021) found that the choice of templates significantly affects the results, with little correlation between different templates. Additionally, May et al. (2019b) reported that templates are not as semantically bleached as expected.
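To make this adaptation concrete, the sketch below inserts words into a bleached template, encodes them with a masked language model, and compares the contextualized representations of the inserted words by cosine similarity, yielding a simple association score in the spirit of these adapted metrics. The model, template, helper names, and word lists are illustrative assumptions, not the exact procedure of the cited papers.

```python
# Minimal sketch: a cosine-based association score computed on contextualized
# representations obtained by inserting each word into a bleached template.
# Model, template, and word lists are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

TEMPLATE = "This is {}."

def word_embedding(word):
    """Mean of the hidden states of the word's subword tokens inside the template."""
    enc = tokenizer(TEMPLATE.format(word), return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, dim)
    word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    for start in range(len(ids) - len(word_ids) + 1):  # locate the word's subwords
        if ids[start:start + len(word_ids)] == word_ids:
            return hidden[start:start + len(word_ids)].mean(dim=0)
    raise ValueError(f"'{word}' not found in the encoded template")

def association(target, attrs_a, attrs_b):
    """Mean cosine similarity to list A minus mean cosine similarity to list B."""
    t = word_embedding(target)
    sim = lambda w: torch.cosine_similarity(t, word_embedding(w), dim=0).item()
    return sum(map(sim, attrs_a)) / len(attrs_a) - sum(map(sim, attrs_b)) / len(attrs_b)

print(association("nurse", ["she", "woman"], ["he", "man"]))
```

Replacing "This is {}." with a different template changes the contextualized representations, and hence the score, which is exactly the kind of sensitivity reported by Delobelle et al. (2021).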
Another common feature of bias measurement is the reliance on hand-curated word lexicons: almost every bias metric in the literature uses one. Antoniak and Mimno (2021) reported that the choice of lexicon can greatly affect the bias measurement, leading to different conclusions under different lexicons.
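This sensitivity can be seen directly by computing the same association score under two alternative gendered word lists. The sketch below uses publicly available GloVe vectors via gensim; the two lexicons are illustrative assumptions, not those studied by Antoniak and Mimno (2021).

```python
# Minimal sketch: the same word-association score computed with two different
# hand-curated gender lexicons. Lexicons and embeddings are illustrative.
import gensim.downloader

vectors = gensim.downloader.load("glove-wiki-gigaword-50")  # pretrained GloVe vectors

def association(target, female_words, male_words):
    """Mean similarity to the female lexicon minus mean similarity to the male lexicon."""
    f = sum(vectors.similarity(target, w) for w in female_words) / len(female_words)
    m = sum(vectors.similarity(target, w) for w in male_words) / len(male_words)
    return f - m

lexicon_a = (["she", "her", "woman"], ["he", "him", "man"])               # pronoun-based
lexicon_b = (["mother", "daughter", "aunt"], ["father", "son", "uncle"])  # kinship-based

for word in ["nurse", "engineer"]:
    print(word, association(word, *lexicon_a), association(word, *lexicon_b))
```

Whether the two lexicons agree for a given target word is an empirical question, which is why the lexicon choice matters and should be reported.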