
could result in completely different bias measurements.
The goal of this work is to illustrate the extent to which social bias measurements are affected by assumptions that are built into dataset constructions. To that end, we consider several alternate dataset constructions for two bias benchmarks, WINOGENDER and BIASNLI. We show that, just by the choice of certain target-bias-irrelevant elements in a dataset, it is possible to discover different degrees of bias for the same model as well as different model rankings (all preprocessed datasets, original and alternate constructions, and code are available at https://github.com/uclanlp/socialbias-dataset-construction-biases). For instance, one experiment on BIASNLI demonstrated that merely negating verbs drastically reduced the measured bias (41.64 → 13.40) on an ELMo-based Decomposable Attention model and even caused a switch in the comparative ranking with RoBERTa. Our findings demonstrate the unreliability of current benchmarks for truly measuring social bias in models and suggest caution when treating these measures as ground truth. We provide a detailed discussion (§5) of the implications of our findings, their relation to experienced harms, suggestions for improving bias benchmarks, and directions for future work.
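To make this kind of perturbation concrete, the sketch below shows how a BIASNLI-style premise/hypothesis pair could be instantiated under the original construction and under a verb-negated alternate construction. The template, word choices, and the make_pair helper are simplified stand-ins for illustration, not the released generation code.

```python
# Illustrative sketch only: a BIASNLI-style premise/hypothesis pair built under
# the original construction and under a verb-negated alternate construction.
# The template and the make_pair helper are hypothetical simplifications.

def make_pair(occupation, gendered_word, verb, obj, negate=False):
    """Return a (premise, hypothesis) pair, optionally with the verb negated."""
    verb_phrase = f"did not {verb['base']}" if negate else verb["past"]
    premise = f"The {occupation} {verb_phrase} a {obj}."
    hypothesis = f"The {gendered_word} {verb_phrase} a {obj}."
    return premise, hypothesis

verb = {"base": "buy", "past": "bought"}
print(make_pair("doctor", "woman", verb, "coat"))               # original construction
print(make_pair("doctor", "woman", verb, "coat", negate=True))  # verb-negated alternate
```

An NLI model with true language understanding should treat both versions equivalently with respect to the gendered association being probed; only its measured bias on the two constructions tells us whether that holds.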
2 Related Work
A large body of work investigates ways to evaluate biases carried inherently in language models (Bolukbasi et al., 2016; Caliskan et al., 2017; Nadeem et al., 2021) and expressed in specific tasks (Nangia et al., 2020; Kirk et al., 2021; Schramowski et al., 2022; Prabhumoye et al., 2021; Srinivasan and Bisk, 2021; Parrish et al., 2021; Baldini et al., 2022; Czarnowska et al., 2021; Dev et al., 2021a; Zhao et al., 2021). Alongside, there is also growing concern that these measures do not relate to experienced harms (Blodgett et al., 2020), are not inclusive in framing (Dev et al., 2021b), are ambiguous about what bias is measured (Blodgett et al., 2021; Goldfarb-Tarrant et al., 2023), do not correlate in their findings of bias across intrinsic versus extrinsic techniques (Goldfarb-Tarrant et al., 2021; Cao et al., 2022), and are susceptible to adversarial perturbations (Zhang et al., 2021) and seed word selection (Antoniak and Mimno, 2021).
Concurrent work by Seshadri et al. (2022) discusses the unreliability of quantifying social biases using templates by varying the templates in a semantics-preserving manner. While their findings are consistent with ours, the two works provide complementary experimental observations. Seshadri et al. (2022) study a wider range of tasks, whereas we focus our experiments on a wider set of models and alternate dataset constructions (with a greater range of syntactic and semantic variability). As a result, we are able to illustrate the effect of the observed variability on ranking large language models according to measured bias for deployment in real-world applications.
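The ranking concern can be illustrated with the tiny sketch below; the model names and scores are placeholders we made up for illustration, not measurements reported in this or the concurrent work.

```python
# Hypothetical illustration of how a bias-based model ranking can flip between
# an original benchmark and a target-bias-irrelevant alternate construction.
# All names and numbers below are placeholders, not reported results.

def rank_by_bias(scores):
    """Order model names from most to least measured bias."""
    return sorted(scores, key=scores.get, reverse=True)

original_construction  = {"model_A": 41.6, "model_B": 30.1}  # placeholder bias scores
alternate_construction = {"model_A": 13.4, "model_B": 28.7}  # after an irrelevant change

print(rank_by_bias(original_construction))   # ['model_A', 'model_B']
print(rank_by_bias(alternate_construction))  # ['model_B', 'model_A']
```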
3 Social Bias Measurements and
Alternate Constructions
Bias measures in NLP are often quantified through comparative prediction disparities on language datasets that follow existing tasks such as classification (De-Arteaga et al., 2019) or coreference resolution (Rudinger et al., 2018). As a result, these datasets are central to what eventually gets measured as “bias”. Not only do they determine the “amount” of bias measured but also the “type” of bias or stereotype measured. Datasets often vary combinations of gendered pronouns and occupations to evaluate stereotypical associations. It is important to note that these dataset constructions and their templates, which determine what gets measured, are often arbitrary choices: the sentences could be structured differently, generated from a different set of seed words, and so on. However, we expect that for any faithful bias benchmark, such dataset alterations that are not relevant to social bias should not have a significant impact on the artifact (e.g., gender bias) being measured.
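As a minimal sketch of this style of measurement, templates can be crossed with seed occupations and gendered pronouns, and bias quantified as the disparity in a model's predictions across the gendered variants of each template. The template, seed words, and scoring hook below are illustrative assumptions of ours, not the actual WINOGENDER or BIASNLI generation code.

```python
# Minimal sketch of template-based bias measurement. The template, seed words,
# and score_fn hook are illustrative assumptions, not benchmark code.
from itertools import product

OCCUPATIONS = ["nurse", "engineer", "librarian"]  # an (arbitrary) seed word choice
PRONOUNS = ["he", "she"]
TEMPLATE = "The {occupation} said that {pronoun} would finish the report."

def build_dataset():
    """Cross seed occupations with gendered pronouns via one fixed template."""
    return [{"occupation": o, "pronoun": p,
             "text": TEMPLATE.format(occupation=o, pronoun=p)}
            for o, p in product(OCCUPATIONS, PRONOUNS)]

def bias_gap(score_fn, dataset):
    """Average |score(he variant) - score(she variant)| per occupation, where
    score_fn(text) stands in for any model-specific prediction score."""
    per_occ = {}
    for ex in dataset:
        per_occ.setdefault(ex["occupation"], {})[ex["pronoun"]] = score_fn(ex["text"])
    gaps = [abs(v["he"] - v["she"]) for v in per_occ.values()]
    return sum(gaps) / len(gaps)
```

Under such a setup, the choices of TEMPLATE and OCCUPATIONS are exactly the target-bias-irrelevant elements whose variation we study.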
Thus, to evaluate the faithfulness of current benchmarks, we develop alternate dataset constructions through modifications that should not have any effect on the social bias being measured in a dataset. These are minor changes that should not influence a model with true language understanding, which is the implicit assumption made by current bias benchmarks. Any notable change observed in a model’s bias measure due to these modifications would highlight the incorrectness of this assumption. Consequently, this would bring to light the unreliability of current benchmarks in faithfully measuring the target bias and disentangling that measurement from other, non-social biases. A non-exhaustive set of such alternate constructions considered in this work is listed below.