
could result in completely different bias measurements.
The goal of this work is to illustrate the extent to which social bias measurements are affected by assumptions that are built into dataset constructions. To that end, we consider several alternate dataset constructions for two bias benchmarks, WINOGENDER and BIASNLI. We show that, just by the choice of certain target-bias-irrelevant elements in a dataset, it is possible to discover different degrees of bias for the same model as well as different model rankings (all preprocessed datasets, original and alternate constructions, and code are available at https://github.com/uclanlp/socialbias-dataset-construction-biases). For instance, one experiment on BIASNLI demonstrated that merely negating verbs drastically reduced the measured bias (41.64 → 13.40) on an ELMo-based Decomposable Attention model and even caused a switch in the comparative ranking with RoBERTa. Our findings demonstrate the unreliability of current benchmarks for truly measuring social bias in models and suggest caution when treating these measures as ground truth. We provide a detailed discussion (§5) of the implications of our findings, their relation to experienced harms, suggestions for improving bias benchmarks, and directions for future work.
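To make this kind of perturbation concrete, the sketch below shows how a BIASNLI-style premise/hypothesis pair could be instantiated under the original construction and under a verb-negated alternate construction. The template, word choices, and the make_pair helper are simplified stand-ins for illustration, not the released generation code.

```python
# Illustrative sketch only: a BIASNLI-style premise/hypothesis pair built under
# the original construction and under a verb-negated alternate construction.
# The template and the make_pair helper are hypothetical simplifications.

def make_pair(occupation, gendered_word, verb, obj, negate=False):
    """Return a (premise, hypothesis) pair, optionally with the verb negated."""
    verb_phrase = f"did not {verb['base']}" if negate else verb["past"]
    premise = f"The {occupation} {verb_phrase} a {obj}."
    hypothesis = f"The {gendered_word} {verb_phrase} a {obj}."
    return premise, hypothesis

verb = {"base": "buy", "past": "bought"}
print(make_pair("doctor", "woman", verb, "coat"))               # original construction
print(make_pair("doctor", "woman", verb, "coat", negate=True))  # verb-negated alternate
```

An NLI model with true language understanding should treat both versions equivalently with respect to the gendered association being probed; only its measured bias on the two constructions tells us whether that holds.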
2 Related Work
A large body of work investigates ways to evaluate biases carried inherently in language models (Bolukbasi et al., 2016; Caliskan et al., 2017; Nadeem et al., 2021) and expressed in specific tasks (Nangia et al., 2020; Kirk et al., 2021; Schramowski et al., 2022; Prabhumoye et al., 2021; Srinivasan and Bisk, 2021; Parrish et al., 2021; Baldini et al., 2022; Czarnowska et al., 2021; Dev et al., 2021a; Zhao et al., 2021). Alongside, there is also growing concern that these measures do not relate to experienced harms (Blodgett et al., 2020), are not inclusive in framing (Dev et al., 2021b), are ambiguous about what bias is measured (Blodgett et al., 2021; Goldfarb-Tarrant et al., 2023), do not correlate in their findings of bias across intrinsic versus extrinsic techniques (Goldfarb-Tarrant et al., 2021; Cao et al., 2022), and are susceptible to adversarial perturbations (Zhang et al., 2021) and seed word selection (Antoniak and Mimno, 2021).
Concurrent work by Seshadri et al. (2022) discusses the unreliability of quantifying social biases using templates by varying the templates in a semantics-preserving manner. While their findings are consistent with ours, the two works provide complementary experimental observations. Seshadri et al. (2022) study a wider range of tasks, whereas we focus our experiments on a wider set of models and alternate dataset constructions (with a greater range of syntactic and semantic variability). As a result, we are able to illustrate the effect of the observed variability on ranking large language models according to measured bias for deployment in real-world applications.
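The ranking concern can be illustrated with the tiny sketch below; the model names and scores are placeholders we made up for illustration, not measurements reported in this or the concurrent work.

```python
# Hypothetical illustration of how a bias-based model ranking can flip between
# an original benchmark and a target-bias-irrelevant alternate construction.
# All names and numbers below are placeholders, not reported results.

def rank_by_bias(scores):
    """Order model names from most to least measured bias."""
    return sorted(scores, key=scores.get, reverse=True)

original_construction  = {"model_A": 41.6, "model_B": 30.1}  # placeholder bias scores
alternate_construction = {"model_A": 13.4, "model_B": 28.7}  # after an irrelevant change

print(rank_by_bias(original_construction))   # ['model_A', 'model_B']
print(rank_by_bias(alternate_construction))  # ['model_B', 'model_A']
```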
3 Social Bias Measurements and
Alternate Constructions
Bias measures in NLP are often quantified through comparative prediction disparities on language datasets that follow existing tasks such as classification (De-Arteaga et al., 2019) or coreference resolution (Rudinger et al., 2018). As a result, these datasets are central to what eventually gets measured as “bias”. Not only do they determine the “amount” of bias measured but also the “type” of bias or stereotype measured. Datasets often vary combinations of gendered pronouns and occupations to evaluate stereotypical associations. It is important to note that these dataset constructions and their templates, which determine what gets measured, are often arbitrary choices: the sentences could be structured differently, generated from a different set of seed words, and so on. However, we expect that for any faithful bias benchmark, such dataset alterations that are not relevant to social bias should not have a significant impact on the artifact (e.g., gender bias) being measured.
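As a minimal sketch of this style of measurement, templates can be crossed with seed occupations and gendered pronouns, and bias quantified as the disparity in a model's predictions across the gendered variants of each template. The template, seed words, and scoring hook below are illustrative assumptions of ours, not the actual WINOGENDER or BIASNLI generation code.

```python
# Minimal sketch of template-based bias measurement. The template, seed words,
# and score_fn hook are illustrative assumptions, not benchmark code.
from itertools import product

OCCUPATIONS = ["nurse", "engineer", "librarian"]  # an (arbitrary) seed word choice
PRONOUNS = ["he", "she"]
TEMPLATE = "The {occupation} said that {pronoun} would finish the report."

def build_dataset():
    """Cross seed occupations with gendered pronouns via one fixed template."""
    return [{"occupation": o, "pronoun": p,
             "text": TEMPLATE.format(occupation=o, pronoun=p)}
            for o, p in product(OCCUPATIONS, PRONOUNS)]

def bias_gap(score_fn, dataset):
    """Average |score(he variant) - score(she variant)| per occupation, where
    score_fn(text) stands in for any model-specific prediction score."""
    per_occ = {}
    for ex in dataset:
        per_occ.setdefault(ex["occupation"], {})[ex["pronoun"]] = score_fn(ex["text"])
    gaps = [abs(v["he"] - v["she"]) for v in per_occ.values()]
    return sum(gaps) / len(gaps)
```

Under such a setup, the choices of TEMPLATE and OCCUPATIONS are exactly the target-bias-irrelevant elements whose variation we study.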
Thus, to evaluate the faithfulness of current benchmarks, we develop alternate dataset constructions through modifications that should not have any effect on the social bias being measured in a dataset. These are minor changes that should not influence a model with true language understanding, which is the implicit assumption made by current bias benchmarks. Any notable change observed in a model’s bias measure due to these modifications would highlight the incorrectness of this assumption. Consequently, this would bring to light the unreliability of current benchmarks in faithfully measuring the target bias and disentangling that measurement from other, non-social biases. A non-exhaustive set of such alternate constructions considered in this work is listed below.