
[Table 1 spans this area. Rows are datasets and columns are metrics. The metric columns fall into three groups: extrinsic performance metrics (Gap (Label), Gap (Stereo), Gap (Gender)); extrinsic prediction metrics (% or # of Answer Changed, % or # Model Pred Prefers Stereotype, LM Prediction Gap on Target Words); and intrinsic metrics (SEAT, CEAT, Probe, Cluster, Nearest Neighbors, Cos, PCA). The dataset rows are: Winogender (Rudinger et al., 2018); Winobias (Zhao et al., 2018a); GAP (Webster et al., 2018); CrowS-Pairs (Nangia et al., 2020a); StereoSet (Nadeem et al., 2021); Bias in Bios (De-Arteaga et al., 2019); EEC (Kiritchenko and Mohammad, 2018); STS-B for genders (Beutel et al., 2020); the NLI data of Dev et al. (2020a); PTB, WikiText, and CNN/DailyMail (Bordia and Bowman, 2019); BOLD (Dhamala et al., 2021); templates from May et al. (2019a); templates from Kurita et al. (2019); DisCo templates (Beutel et al., 2020); BEC-Pro templates (Bartl et al., 2020); an English-German news corpus (Basta et al., 2021); Reddit (Guo and Caliskan, 2021; Voigt et al., 2018); MAP (Cao and Daumé III, 2021); and GICoref (Cao and Daumé III, 2021).]
Table 1: Combinations of gender bias datasets and metrics in the literature. X marks a feasible combination of a metric and a dataset, a bold X marks the original metrics used on the dataset, and X(aug) marks metrics that can be measured after augmenting the dataset such that every example is matched with a counterexample of another gender. Extrinsic performance metrics depend on gold labels, while extrinsic prediction metrics do not. A full description of the metrics is given in Appendix A.
Table 1 shows that for most datasets, more than one metric can be measured (many X's in the same row). Some datasets, such as Bias in Bios (De-Arteaga et al., 2019), are compatible with numerous metrics, while others are compatible with fewer, but still multiple, metrics. Bias in Bios supports so many metrics because it contains the information needed to compute them: in addition to gold labels, it has gender labels and clear stereotype definitions derived from its labels, which are professions; a minimal sketch of one such label-based gap metric is given below.
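To illustrate what such an extrinsic performance gap looks like in practice, here is a minimal sketch, assuming gold task labels, model predictions, and a binary gender annotation per example; it computes the true-positive-rate (TPR) gap between genders for a single profession label, in the spirit of the Bias in Bios setup. The function name, gender encoding, and toy data are illustrative assumptions, not the original papers' code.

```python
# Minimal sketch of an extrinsic performance gap: the TPR difference between
# genders for one target label (e.g., a profession in Bias in Bios).
# Toy data and the "F"/"M" gender encoding are illustrative assumptions.
from collections import defaultdict

def tpr_gap(gold, pred, gender, target_label):
    """TPR on `target_label` for the "F" group minus TPR for the "M" group."""
    hits = defaultdict(int)    # correct predictions of target_label, per gender
    totals = defaultdict(int)  # gold occurrences of target_label, per gender
    for g, p, grp in zip(gold, pred, gender):
        if g == target_label:
            totals[grp] += 1
            if p == target_label:
                hits[grp] += 1
    tpr = {grp: hits[grp] / totals[grp] for grp in totals}
    return tpr.get("F", 0.0) - tpr.get("M", 0.0)

# Toy example: gold and predicted professions plus a gender label per example.
gold   = ["nurse", "nurse", "nurse", "surgeon", "surgeon", "surgeon"]
pred   = ["nurse", "nurse", "teacher", "surgeon", "nurse", "surgeon"]
gender = ["F", "M", "M", "F", "F", "M"]

print(tpr_gap(gold, pred, gender, "nurse"))    # 0.5: "nurse" recalled better for F
print(tpr_gap(gold, pred, gender, "surgeon"))  # -0.5: "surgeon" recalled better for M
```

Per-label gaps like these can then be aggregated over all professions (e.g., averaged, or correlated with each profession's gender balance) to obtain a single score.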
Text corpora and template data, which do not address a specific task (bottom seven rows), are mostly compatible with intrinsic metrics. The compatibility of intrinsic metrics with many datasets may explain why papers report intrinsic metrics more often (§5). Additionally, Table 1 indicates that few datasets can be used to measure extrinsic metrics, particularly extrinsic performance metrics that require gold labels. On the other hand, measuring the LM prediction gap on target words, which we consider extrinsic, can be done on many datasets. This is useful for analyzing bias when dealing with LMs: bias metrics can be computed from the LM's output predictions, such as the mean probability gap between predicting the word "he" and the word "she" in specific contexts. Also, some templates are valid for measuring extrinsic prediction metrics, especially stereotype-related metrics, since they were developed with explicit stereotypes (such as profession-related stereotypes) in mind.
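As a concrete illustration of the LM prediction gap on target words, the sketch below computes the mean difference between the probabilities a masked language model assigns to "he" and to "she" in a few contexts. The model name and templates are illustrative assumptions, not the setup of any particular dataset above.

```python
# Minimal sketch: mean P("he") - P("she") assigned by a masked LM at a masked
# position across a handful of contexts. Model and templates are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

templates = [
    f"{tokenizer.mask_token} is a nurse.",
    f"{tokenizer.mask_token} is an engineer.",
    f"{tokenizer.mask_token} studied computer science at college.",
]

he_id = tokenizer.convert_tokens_to_ids("he")
she_id = tokenizer.convert_tokens_to_ids("she")

gaps = []
for text in templates:
    inputs = tokenizer(text, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits[0, mask_pos].softmax(dim=-1).squeeze(0)
    gaps.append((probs[he_id] - probs[she_id]).item())

mean_gap = sum(gaps) / len(gaps)  # > 0 means "he" is preferred on average
print(f"Mean P(he) - P(she) over templates: {mean_gap:.4f}")
```

The same computation extends to other target-word pairs, or to sums over lexicons of gendered words.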
Based on Table 1, it is clear that there are many possible ways to measure gender bias in the literature, but they all fall under the vague category of "gender bias". Each possible combination gives a different definition, or interpretation, of gender bias. The large number of different metrics makes it difficult or even impossible to compare different studies, including proposed gender bias mitigation methods. This raises questions about the validity of results derived from specific combinations of measurements. In the next two sections, we demonstrate how the choice of datasets and metrics can affect the bias measurement.
4 Effect of Dataset on Measured Results
The choice of data used to measure bias affects the calculated bias. Many researchers have used sentence templates that are "semantically bleached" (e.g., "This is <word>.", "<person> studied <profession> at college.") to adapt metrics developed for static word embeddings to contextualized representations (May et al., 2019b; Kurita et al., 2019; Webster et al., 2020; Bartl et al., 2020). Delobelle et al. (2021) found that the choice of templates significantly affects the results, with little correlation between different templates. Additionally, May et al. (2019b) reported that templates are not as semantically bleached as expected.
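To make this adaptation concrete, the sketch below inserts words into a bleached template, encodes them with a masked language model, and compares the contextualized representations of the inserted words by cosine similarity, yielding a simple association score in the spirit of these adapted metrics. The model, template, helper names, and word lists are illustrative assumptions, not the exact procedure of the cited papers.

```python
# Minimal sketch: a cosine-based association score computed on contextualized
# representations obtained by inserting each word into a bleached template.
# Model, template, and word lists are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

TEMPLATE = "This is {}."

def word_embedding(word):
    """Mean of the hidden states of the word's subword tokens inside the template."""
    enc = tokenizer(TEMPLATE.format(word), return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, dim)
    word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    for start in range(len(ids) - len(word_ids) + 1):  # locate the word's subwords
        if ids[start:start + len(word_ids)] == word_ids:
            return hidden[start:start + len(word_ids)].mean(dim=0)
    raise ValueError(f"'{word}' not found in the encoded template")

def association(target, attrs_a, attrs_b):
    """Mean cosine similarity to list A minus mean cosine similarity to list B."""
    t = word_embedding(target)
    sim = lambda w: torch.cosine_similarity(t, word_embedding(w), dim=0).item()
    return sum(map(sim, attrs_a)) / len(attrs_a) - sum(map(sim, attrs_b)) / len(attrs_b)

print(association("nurse", ["she", "woman"], ["he", "man"]))
```

Replacing "This is {}." with a different template changes the contextualized representations, and hence the score, which is exactly the kind of sensitivity reported by Delobelle et al. (2021).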
Another common feature of bias measurement is the reliance on hand-curated word lexicons: almost every bias metric in the literature uses one. Antoniak and Mimno (2021) reported that the choice of lexicon can greatly affect the bias measurement, leading to different conclusions under different lexicons.
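This sensitivity can be seen directly by computing the same association score under two alternative gendered word lists. The sketch below uses publicly available GloVe vectors via gensim; the two lexicons are illustrative assumptions, not those studied by Antoniak and Mimno (2021).

```python
# Minimal sketch: the same word-association score computed with two different
# hand-curated gender lexicons. Lexicons and embeddings are illustrative.
import gensim.downloader

vectors = gensim.downloader.load("glove-wiki-gigaword-50")  # pretrained GloVe vectors

def association(target, female_words, male_words):
    """Mean similarity to the female lexicon minus mean similarity to the male lexicon."""
    f = sum(vectors.similarity(target, w) for w in female_words) / len(female_words)
    m = sum(vectors.similarity(target, w) for w in male_words) / len(male_words)
    return f - m

lexicon_a = (["she", "her", "woman"], ["he", "him", "man"])               # pronoun-based
lexicon_b = (["mother", "daughter", "aunt"], ["father", "son", "uncle"])  # kinship-based

for word in ["nurse", "engineer"]:
    print(word, association(word, *lexicon_a), association(word, *lexicon_b))
```

Whether the two lexicons agree for a given target word is an empirical question, which is why the lexicon choice matters and should be reported.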