The Tail Wagging the Dog:
Dataset Construction Biases of Social Bias Benchmarks
Nikil Roashan Selvam1   Sunipa Dev2
Daniel Khashabi3   Tushar Khot4   Kai-Wei Chang1
1University of California, Los Angeles   2Google Research
3Johns Hopkins University   4Allen Institute for AI
{nikilrselvam,kwchang}@ucla.edu, sunipadev@google.com
danielk@jhu.edu, tushark@allenai.org
Abstract
How reliably can we trust the scores obtained from social bias benchmarks as faithful indicators of problematic social biases in a given model? In this work, we study this question by contrasting social biases with non-social biases that stem from choices made during dataset construction (which might not even be discernible to the human eye). To do so, we empirically simulate various alternative constructions for a given benchmark based on seemingly innocuous modifications (such as paraphrasing or random-sampling) that maintain the essence of their social bias. On two well-known social bias benchmarks (WINOGENDER and BIASNLI), we observe that these shallow modifications have a surprising effect on the resulting degree of bias across various models and, consequently, on the relative ordering of these models when ranked by measured bias. We hope these troubling observations motivate more robust measures of social biases.
1 Introduction
The omnipresence of large pre-trained language models (Liu et al., 2019; Raffel et al., 2020; Brown et al., 2020) has fueled concerns regarding their systematic biases carried over from underlying data into the applications they are used in, resulting in disparate treatment of people with different identities (Sheng et al., 2021; Abid et al., 2021).

In response to such concerns, various benchmarks have been proposed to quantify the amount of social biases in models (Rudinger et al., 2018; Sheng et al., 2019; Li et al., 2020). These measures are composed of textual datasets built for a specific NLP task (such as question answering) and are accompanied by a metric, such as accuracy of prediction, which is used as an approximation of the amount of social bias.
Figure 1: Two potential constructions of WINOGENDER with minor differences ("The electrician warned/cautioned the homeowner that he/she might need an extra day to finish rewiring the house."): a model (span-BERT, in this case) evaluated on the original dataset might seem to have gender-occupation bias based on the change in its pronoun resolution. However, a minor change in its phrasing with no change in meaning (e.g., a synonymous verb) can drastically affect the perceived bias of the model and change the conclusion to "no bias".

These bias benchmarks are commonly used by machine learning practitioners to compare the degree of social biases (such as gender-occupation bias) in different real-world models (Chowdhery et al., 2022; Thoppilan et al., 2022) before deploying them in a myriad of applications. However, they also inadvertently measure other non-social biases in their datasets. For example, consider the sentence from WINOGENDER in Figure 1. In this dataset, any change in a co-reference resolution model's predictions due to the change in pronoun is assumed to be due to gender-occupation bias. However, this assumption only holds for a model with near-perfect language understanding and no other biases, which is often not the case: a model's positional bias (Murray and Chiang, 2018; Ko et al., 2020) (a bias to resolve "she" to a nearby entity) or spurious correlations (Schlegel et al., 2020) (a bias to resolve "he" to the object of the verb "warned") would also be measured as gender-occupation bias. As a result, a slightly different template (e.g., changing the verb to "cautioned") could result in completely different bias measurements.
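To make this per-pair check concrete, the following minimal sketch (our own illustration, not the benchmark's released code) uses a hypothetical `resolve_pronoun` function standing in for any coreference model:

```python
# Minimal sketch of the per-pair check behind WINOGENDER-style measures.
# `resolve_pronoun` is a hypothetical stand-in for a coreference model: it
# takes a sentence and returns the entity the pronoun is resolved to
# (e.g., "electrician" or "homeowner").

def shows_gender_occupation_bias(template: str, resolve_pronoun) -> bool:
    """Flag a template as biased if the resolution flips when only the pronoun changes."""
    male_variant = template.format(pronoun="he")
    female_variant = template.format(pronoun="she")
    return resolve_pronoun(male_variant) != resolve_pronoun(female_variant)

# Original construction (Figure 1, top).
original = ("The electrician warned the homeowner that {pronoun} might need "
            "an extra day to finish rewiring the house.")

# Alternate construction with a synonymous verb (Figure 1, bottom).
alternate = original.replace("warned", "cautioned")

# A model with positional bias or spurious verb-argument correlations may be
# flagged as biased on `original` but not on `alternate`, even though the two
# templates carry identical social content.
```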
The goal of this work is to illustrate the extent to which social bias measurements are affected by assumptions that are built into dataset constructions. To that end, we consider several alternate dataset constructions for two bias benchmarks, WINOGENDER and BIASNLI. We show that, just by the choice of certain target-bias-irrelevant elements in a dataset, it is possible to discover different degrees of bias for the same model as well as different model rankings.[1] For instance, one experiment on BIASNLI demonstrated that merely negating verbs drastically reduced the measured bias (41.64 → 13.40) on an ELMo-based Decomposable Attention model and even caused a switch in the comparative ranking with RoBERTa. Our findings demonstrate the unreliability of current benchmarks to truly measure social bias in models and suggest caution when treating these measures as ground truth. We provide a detailed discussion (§5) of the implications of our findings, their relation to experienced harms, suggestions for improving bias benchmarks, and directions for future work.

[1] All preprocessed datasets (original and alternate constructions) and code are available at https://github.com/uclanlp/socialbias-dataset-construction-biases.
2 Related Work
A large body of work investigates ways to evaluate biases carried inherently in language models (Bolukbasi et al., 2016; Caliskan et al., 2017; Nadeem et al., 2021) and expressed in specific tasks (Nangia et al., 2020; Kirk et al., 2021; Schramowski et al., 2022; Prabhumoye et al., 2021; Srinivasan and Bisk, 2021; Parrish et al., 2021; Baldini et al., 2022; Czarnowska et al., 2021; Dev et al., 2021a; Zhao et al., 2021). Alongside, there is growing concern that these measures do not relate to experienced harms (Blodgett et al., 2020), are not inclusive in framing (Dev et al., 2021b), are ambiguous about what bias is measured (Blodgett et al., 2021; Goldfarb-Tarrant et al., 2023), do not correlate in their findings of bias across intrinsic versus extrinsic techniques (Goldfarb-Tarrant et al., 2021; Cao et al., 2022), and are susceptible to adversarial perturbations (Zhang et al., 2021) and seed word selection (Antoniak and Mimno, 2021).
Concurrent work by Seshadri et al. (2022) discusses the unreliability of quantifying social biases using templates by varying the templates in a semantics-preserving manner. While their findings are consistent with ours, the two works provide complementary experimental observations. Seshadri et al. (2022) study a wider range of tasks, whereas we focus our experiments on a wider set of models and alternate dataset constructions (with a greater range of syntactic and semantic variability). As a result, we are able to illustrate the effect of the observed variability on the ranking of large language models according to measured bias for deployment in real-world applications.
3 Social Bias Measurements and Alternate Constructions
Bias measures in NLP are often quantified through comparative prediction disparities on language datasets that follow existing tasks such as classification (De-Arteaga et al., 2019) or coreference resolution (Rudinger et al., 2018). As a result, these datasets are central to what eventually gets measured as "bias". Not only do they determine the "amount" of bias measured but also the "type" of bias or stereotype measured. Datasets often vary combinations of gendered pronouns and occupations to evaluate stereotypical associations. It is important to note that these dataset constructs and their templates, which determine what gets measured, are often arbitrary choices. The sentences could be differently structured, be generated from a different set of seed words, and more. However, we expect that for any faithful bias benchmark, dataset alterations that are not relevant to social bias should not have a significant impact on the artifact (e.g., gender bias) being measured.

Thus, to evaluate the faithfulness of current benchmarks, we develop alternate dataset constructions through modifications that should not have any effect on the social bias being measured in a dataset. They are minor changes that should not influence models with true language understanding – the implicit assumption made by current bias benchmarks. Any notable observed change in a model's bias measure due to these modifications would highlight the incorrectness of this assumption. Consequently, this would bring to light the unreliability of current benchmarks to faithfully measure the target bias and to disentangle that measurement from the measurement of other, non-social biases. A non-exhaustive set of such alternate constructions considered in this work is listed below.
Figure 2: An instance ("The engineer informed the client that he would need to make all future payments on time") from the WINOGENDER benchmark modified under various shallow modifications (§3). To a human eye, such modifications do not necessarily affect the outcome of the given pronoun resolution problem.
Negations: A basic function in language understanding is to understand the negation of word groups such as action verbs or adjectives. Altering verbs in particular, such as changing 'the doctor bought' to 'the doctor did not buy', should typically not affect the inferences made about occupation associations.

Synonym substitutions: Another fundamental function of language understanding is the ability to parse the usage of similar words or synonyms in identical contexts and derive the same overall meaning of a sentence. For bias-measuring datasets, synonymizing non-pivotal words (such as non-identity words like verbs) should not change the outcome of how much bias is measured.

Varying length of the text: In typical evaluation datasets, the number of clauses each sentence is composed of and the overall sentence length are arbitrary experimental choices. Fixing this length is common, especially when such datasets need to be created at scale. If language is understood, adding a neutral phrase without impacting the task-specific semantics should not alter the bias measured.

Adding descriptors: Sentences used in real life are structured in complex ways and can have descriptors, such as adjectives about an action, person, or object, without changing the net message expressed by the text. For example, the sentences "The doctor bought an apple." and "The doctor bought a red apple." do not change any assumptions made about the doctor or the action of buying an apple.

Random samples: Since the sentence constructs of these datasets are not unique, a very simple alternate construction of a dataset is a different subsample of itself. The dataset is scraped or generated with specific assumptions or parameters, such as seed word lists, sentence templates, and word order; however, neither the sentence constructs or templates nor the seed word lists typically used are exhaustive or representative of entire categories of words (such as gendered words, emotions, and occupations).
See Fig. 2 for example constructions on WINOGENDER (App. A, B for detailed descriptions).
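As a rough illustration, the shallow modifications above could be generated programmatically along the following lines (a hypothetical sketch; the word lists and helper functions are our own illustrative choices, not the preprocessing scripts released with this paper):

```python
import random

# Hypothetical, deliberately tiny word maps; real constructions would need
# broader, manually validated lists.
NEGATIONS = {"bought": "did not buy", "warned": "did not warn"}
SYNONYMS = {"warned": "cautioned", "informed": "notified"}
DESCRIPTORS = {"an apple": "a red apple", "the client": "the longtime client"}

def negate(sentence: str) -> str:
    """Negate the main verb, e.g. 'the doctor bought' -> 'the doctor did not buy'."""
    for verb, negated in NEGATIONS.items():
        sentence = sentence.replace(verb, negated)
    return sentence

def synonymize(sentence: str) -> str:
    """Swap non-identity words (here, verbs) for synonyms that preserve meaning."""
    for verb, synonym in SYNONYMS.items():
        sentence = sentence.replace(verb, synonym)
    return sentence

def add_descriptor(sentence: str) -> str:
    """Attach an adjective or descriptor that leaves the task-specific semantics unchanged."""
    for phrase, described in DESCRIPTORS.items():
        sentence = sentence.replace(phrase, described)
    return sentence

def lengthen(sentence: str, clause: str = "As usual, ") -> str:
    """Prepend a neutral clause to vary sentence length."""
    return clause + sentence[0].lower() + sentence[1:]

def random_subsample(dataset: list, fraction: float = 0.5, seed: int = 0) -> list:
    """Return a different subsample of the same dataset."""
    rng = random.Random(seed)
    return rng.sample(dataset, int(len(dataset) * fraction))
```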
4 Case Studies
We discuss here the impact of alternate constructions on two task-based measures of bias.[2]

[2] We note that throughout this paper, we focus on gender-occupation bias as an illustrative example; however, our discussion can be extended to other aspects of bias too.
4.1 Coreference Resolution
Several different bias measures for coreference resolution (Rudinger et al., 2018; Zhao et al., 2018; Cao and Daumé III, 2021) work similarly to the Winograd Schema (Winograd, 1972): a sentence contains two entities, and the task is to resolve which entity a specific pronoun or noun refers to. We work here with WINOGENDER (Rudinger et al., 2018), which is popularly used to measure biases. It is worth noting that WINOGENDER was originally intended by its authors to merely be a diagnostic tool that checks for bias in a model; the authors note that it may demonstrate the presence of model bias but not prove its absence. Nonetheless, models developed today are indeed tested and compared for social bias on WINOGENDER, leading to its usage as a comparative standard or benchmark (Chowdhery et al., 2022; Thoppilan et al., 2022).

The metric used to evaluate bias is the percentage of sentence pairs where there is a mismatch in predictions for the male and female gendered pronouns. For instance, in Fig. 2, if the pronoun "he" is linked to "engineer" but the prediction switches to "client" for the pronoun "she", that would indicate gender-occupation bias.
occupation bias. Higher the number of mismatches,
2
We note that throughout this paper, we focus on gender-
occupation bias as an illustrative example; however, our dis-
cussion can be extended to other aspects of biases too.
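As a minimal sketch of this metric (assuming, with names of our own choosing, that each element of `pairs` holds a model's predicted antecedents for the male- and female-pronoun variants of one template), it could be computed as:

```python
def mismatch_rate(pairs) -> float:
    """Percentage of sentence pairs whose predicted antecedents differ across gendered pronouns.

    Each element of `pairs` is assumed to be a (male_prediction, female_prediction)
    tuple, e.g. ("engineer", "client"); a mismatch counts as evidence of
    gender-occupation bias under the benchmark's assumption.
    """
    mismatches = sum(1 for male_pred, female_pred in pairs if male_pred != female_pred)
    return 100.0 * mismatches / len(pairs)

# Scoring the same model on the original dataset and on an alternate construction
# (e.g., with synonymized verbs) and comparing the two numbers is exactly the
# kind of contrast examined in this paper.
```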