Choose Your Lenses: Flaws in Gender Bias Evaluation
Hadas Orgad Yonatan Belinkov
orgad.hadas@cs.technion.ac.il belinkov@technion.ac.il
Technion – Israel Institute of Technology
Abstract
Considerable efforts to measure and mitigate gender bias in recent years have led to the introduction of an abundance of tasks, datasets, and metrics used in this vein. In this position paper, we assess the current paradigm of gender bias evaluation and identify several flaws in it. First, we highlight the importance of extrinsic bias metrics, which measure how a model's performance on some task is affected by gender, as opposed to intrinsic evaluations of model representations, which are less strongly connected to specific harms to people interacting with systems. We find that only a few extrinsic metrics are measured in most studies, although more can be measured. Second, we find that datasets and metrics are often coupled, and discuss how their coupling hinders the ability to obtain reliable conclusions, and how one may decouple them. We then investigate how the choice of the dataset and its composition, as well as the choice of the metric, affect bias measurement, finding significant variations across each of them. Finally, we propose several guidelines for more reliable gender bias evaluation.
1 Introduction
A large body of work has been devoted to measurement and mitigation of social biases in natural language processing (NLP), with a particular focus on gender bias (Sun et al., 2019; Blodgett et al., 2020; Garrido-Muñoz et al., 2021; Stanczak and Augenstein, 2021). These considerable efforts have been accompanied by various tasks, datasets, and metrics for evaluation and mitigation of gender bias in NLP models. In this position paper, we critically assess the predominant evaluation paradigm and identify several flaws in it. These flaws hinder progress in the field, since they make it difficult to ascertain whether progress has actually been made.

* Supported by the Viterbi Fellowship in the Center for Computer Engineering at the Technion.
Gender bias metrics can be divided into two groups: extrinsic metrics, such as performance differences across genders, measure gender bias with respect to a specific downstream task, while intrinsic metrics, such as WEAT (Caliskan et al., 2017), are based on the internal representations of the language model. We argue that measuring extrinsic metrics is crucial for building confidence in proposed metrics, for defining the harms caused by the biases found, and for justifying both the motivation for debiasing a model and the use of the suggested metrics as a measure of success. However, we find that many studies on gender bias only measure intrinsic metrics. As a result, it is difficult to determine what harm the presumably found bias may be causing. When it comes to gender bias mitigation efforts, improving intrinsic metrics may create an illusion of greater success than has actually been achieved, since their correlation with downstream tasks is questionable (Goldfarb-Tarrant et al., 2021; Cao et al., 2022). In the minority of cases where extrinsic metrics are reported, only a few metrics are measured, although it is possible, and sometimes crucial, to measure more.
Additionally, gender bias measures are often applied as a dataset coupled with a measurement technique (a.k.a. a metric), but we show that they can be separated. A single gender bias metric can be measured using a wide range of datasets, and a single dataset can be applied to a wide variety of metrics. We then demonstrate how the choice of gender bias metric and the choice of dataset can each significantly affect the resulting measures. For example, measuring the same metric on the same model with an imbalanced or a balanced dataset¹ may yield very different results. It is thus difficult to compare newly proposed metrics and debiasing methods with previous ones, hindering progress in the field.

¹ Balanced with respect to the number of examples for each gender, per task label.

To summarize, our contributions are:

• We argue that extrinsic metrics are important for defining harms (§2), but researchers do not use them enough even though they can (§5).
• We demonstrate the coupling of datasets with metrics and the feasibility of other combinations (§3).

• Observing that a specific metric can be measured on many possible datasets and vice versa, we demonstrate how the choice and composition of a dataset (§4), as well as the choice of bias metric to measure (§5), can strongly influence the measured results.

• We provide guidelines for researchers on how to correctly evaluate gender bias (§6).
Bias Statement

This paper examines metrics and datasets that are used to measure gender bias, and discusses several pitfalls in the current paradigm. As a result of the observations and proposed guidelines in this work, we hope that future results and conclusions will become clearer and more reliable.
The definition of gender bias in this paper is given through the discussed metrics, as each metric reflects a different definition. Some of the examined metrics are measured on concrete downstream tasks (extrinsic metrics), while others are measured on internal model representations (intrinsic metrics). The definitions of intrinsic and extrinsic metrics do not align perfectly with the definitions of allocational and representational harms (Kate Crawford, 2017). In the case of allocational harm, resources or opportunities are unfairly allocated due to bias. Representational harm, on the other hand, occurs when a certain group is negatively represented or ignored by the system. Extrinsic metrics can be used to quantify both allocational and representational harms, while intrinsic metrics can only quantify representational harms, and only in some cases.
There are also other important pitfalls that are not discussed in this paper, such as the focus on high-resource languages like English and the binary treatment of gender (Sun et al., 2019; Stanczak and Augenstein, 2021; Dev et al., 2021). Inclusive research of non-binary genders would require a new set of methods, which could benefit from the observations in this work.
2 The Importance of Extrinsic Metrics in Defining Harms

In this paper, we divide metrics for gender bias into three groups:
• Extrinsic performance: measures how a model's performance is affected by gender, and is calculated with respect to particular gold labels. For example, the True Positive Rate (TPR) gap between female and male examples (a code sketch follows this list).

• Extrinsic prediction: measures a model's predictions, such as the output probabilities, but the bias is not calculated with respect to gold labels. Instead, the bias is measured by the effect of gender or stereotypes on model predictions. For example, the probability gap can be measured on a language model queried on two sentences, one pro-stereotypical ("he is an engineer") and another anti-stereotypical ("she is an engineer").

• Intrinsic: measures bias in internal model representations, and is not directly related to any downstream task. For example, WEAT.
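To make the first group concrete, the following is a minimal sketch of a TPR-gap computation for a binary classifier. The (gold label, gender) data layout and the binary gender values are simplifying assumptions for illustration, not a prescribed format:

```python
from collections import defaultdict

def tpr_gap(examples, predictions):
    """True Positive Rate (TPR) gap between genders for the positive class.

    `examples` is a list of (gold_label, gender) tuples and `predictions`
    the model's predicted labels; both layouts are illustrative assumptions.
    Returns TPR(female) - TPR(male).
    """
    hits = defaultdict(int)    # true positives per gender
    totals = defaultdict(int)  # gold-positive examples per gender
    for (gold, gender), pred in zip(examples, predictions):
        if gold == 1:          # only gold-positive examples define TPR
            totals[gender] += 1
            if pred == 1:
                hits[gender] += 1
    tpr = {g: hits[g] / totals[g] for g in totals}
    return tpr["female"] - tpr["male"]

# Toy usage: a negative gap means the model recovers valid "male"
# examples more often than valid "female" ones.
examples = [(1, "female"), (1, "female"), (1, "male"), (1, "male"), (0, "female")]
predictions = [1, 0, 1, 1, 0]
print(tpr_gap(examples, predictions))  # 0.5 - 1.0 = -0.5
```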
It is crucial to define how measured bias harms those interacting with the biased systems (Barocas et al., 2017; Kate Crawford, 2017; Blodgett et al., 2020; Bommasani et al., 2021). Extrinsic metrics are important for motivating bias mitigation and for accurately defining "why the system behaviors that are described as 'bias' are harmful, in what ways, and to whom" (Blodgett et al., 2020), since they clearly demonstrate the performance disparity between protected groups.
For example, in a theoretical CV-filtering system, one can measure the TPR gap between female and male candidates. A gap in TPR favoring men means that, given a set of valid candidates, the system picks valid male candidates more often than valid female candidates. The impact of this gap is clear: qualified women are overlooked because of bias. In contrast, consider an intrinsic metric such as WEAT (Caliskan et al., 2017), which is derived from the proximity (in vector space) of words like "career" or "family" to "male" or "female" names. If one finds that male names relate more to career and female names relate more to family, the consequences are unclear. In fact, Goldfarb-Tarrant et al. (2021) found that WEAT does not correlate with other extrinsic metrics. However, many studies report only intrinsic metrics (a third of the papers we reviewed, §5).
3 Coupling of Datasets and Metrics
In this section, we discuss how datasets and metrics for gender bias evaluation are typically coupled, how they may be decoupled, and why this is important. We begin with a representative test case, followed by a discussion of the general phenomenon.
3.1 Case study: Winobias
Coreference resolution aims to find all textual expressions that refer to the same real-world entities. A popular dataset for evaluating gender bias in coreference resolution systems is Winobias (Zhao et al., 2018a). It consists of Winograd schema instances (Levesque et al., 2012): two sentences that differ only by one or two words, but contain ambiguities that are resolved differently in the two sentences based on world knowledge and reasoning. Winobias sentences come in pairs of an anti-stereotypical and a pro-stereotypical sentence, as shown in Figure 1. Coreference systems should be able to resolve both sentences correctly, but most perform poorly on the anti-stereotypical ones (Zhao et al., 2018a, 2019; de Vassimon Manela et al., 2021; Orgad et al., 2022).

"The developer argued with the designer because she did not like the design."
"The developer argued with the designer because he did not like the design."

Figure 1: Coreference resolution example from Winobias: a pair of anti-stereotypical (top) and pro-stereotypical (bottom) sentences. Developers are stereotyped to be male.
Winobias was originally proposed as an extrinsic evaluation dataset, with a reported metric of anti- vs. pro-stereotypical performance disparity. However, other metrics can also be measured on it, both intrinsic and extrinsic, as shown in several studies (Zhao et al., 2019; Nangia et al., 2020b; Orgad et al., 2022). For example, one can measure how many stereotypical choices the model preferred over anti-stereotypical choices (an extrinsic performance measure), as done on Winogender (Rudinger et al., 2018), a similar dataset. Winobias sentences can also be used to evaluate language models (LMs), by evaluating whether an LM gives higher probabilities to pro-stereotypical sentences (Nangia et al., 2020b) (an extrinsic prediction measure). Winobias can also be used for intrinsic metrics, for example as a template for SEAT (May et al., 2019a) and CEAT (Guo and Caliskan, 2021), contextual extensions of WEAT. Each of these metrics reveals a different facet of gender bias in a model. An explicit measure of how many pro-stereotypical choices were preferred over anti-stereotypical choices has a different meaning than a performance-metric gap between two different genders. Additionally, measuring an intrinsic metric on Winobias may help tie the results to the model's behavior on the same dataset in the downstream coreference resolution task.
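To make the extrinsic prediction family concrete, here is a minimal sketch of a pronoun-probability gap computed with a masked LM via the Hugging Face transformers library. The model choice and template are illustrative assumptions, and this is a simplified illustration in the spirit of Kurita et al. (2019), not the exact scoring protocol of Nangia et al. (2020b):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Model choice is an illustrative assumption; any masked LM would do.
name = "bert-base-uncased"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)
model.eval()

def pronoun_gap(template: str) -> float:
    """Probability gap P("he") - P("she") at the masked position.

    `template` must contain the literal string "[MASK]", e.g.
    "[MASK] is an engineer."
    """
    inputs = tok(template.replace("[MASK]", tok.mask_token), return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tok.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    probs = logits.softmax(dim=-1)
    return (probs[tok.convert_tokens_to_ids("he")]
            - probs[tok.convert_tokens_to_ids("she")]).item()

# A positive gap suggests the LM prefers the pro-stereotypical pronoun
# for the profession "engineer"; no gold labels are needed.
print(pronoun_gap("[MASK] is an engineer."))
```

Note that this measure needs only the model's output distribution, which is why, as discussed below, it can be applied to many datasets that lack gold labels.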
3.2 Many possible combinations for datasets and metrics

Winobias is one example out of many. In fact, benchmarks for gender bias evaluation are typically proposed as a package of two components:

1. A dataset on which the benchmark task is performed.

2. A metric, which is the particular method used to calculate the bias of a model on the dataset.

Usually, these benchmarks are treated as a bundle; however, they can often be decoupled, mixed, and matched, as discussed in the Winobias test case above. The work by Delobelle et al. (2021) is an exception, in that they gathered a set of templates from diverse studies and tested them using the same metric.
In Table 1, we present possible combinations of datasets (rows) and metrics (columns) from the gender bias literature. The metrics are partitioned according to the three classes of metrics defined in Section 2. We present only metrics valid for assessing bias in contextualized LMs (rather than static word embeddings), since these are now common practice. The table does not claim to be exhaustive, but rather illustrates how metrics and datasets can be repurposed in many different ways. The metrics are described in Appendix A, but the categories are very general, and even a single column like "Gap (Label)" represents a wide variety of metrics that can be measured.

Table 1 shows that many metrics are compatible across many datasets (many ✓s in the same column), and that datasets can be used to measure a variety of metrics other than those typically measured (many ✓s in the same row).
[Table 1: Combinations of gender bias datasets and metrics in the literature. Rows (datasets): Winogender (Rudinger et al., 2018); Winobias (Zhao et al., 2018a); GAP (Webster et al., 2018); CrowS-Pairs (Nangia et al., 2020a); StereoSet (Nadeem et al., 2021); Bias in Bios (De-Arteaga et al., 2019); EEC (Kiritchenko and Mohammad, 2018); STS-B for genders (Beutel et al., 2020); Dev et al. (2020a) (NLI); PTB, WikiText, and CNN/DailyMail (Bordia and Bowman, 2019); BOLD (Dhamala et al., 2021); templates from May et al. (2019a); templates from Kurita et al. (2019); DisCo templates (Beutel et al., 2020); BEC-Pro templates (Bartl et al., 2020); English-German news corpus (Basta et al., 2021); Reddit (Guo and Caliskan, 2021; Voigt et al., 2018); MAP (Cao and Daumé III, 2021); GICoref (Cao and Daumé III, 2021). Columns (metrics), grouped by class: extrinsic performance — Gap (Label), Gap (Stereo), Gap (Gender), % or # of Answer Changed, % or # Model Prefers Stereotype; extrinsic prediction — Prediction Gap, LM Prediction on Target Words; intrinsic — SEAT, CEAT, Probe, Cluster, Nearest Neighbors, Cos, PCA. A ✓ marks a feasible combination of a metric and a dataset; ✓* marks the original metrics used on the dataset; ✓(aug) marks metrics that can be measured after augmenting the dataset such that every example is matched with a counterexample of another gender. Extrinsic performance metrics depend on gold labels, while extrinsic prediction metrics do not. A full description of the metrics is given in Appendix A.]
Some datasets, such as Bias in Bios (De-Arteaga et al., 2019), are compatible with numerous metrics, while others are compatible with fewer, but still multiple, metrics. Bias in Bios is compatible with many metrics because it contains the information needed to calculate them: in addition to gold labels, it also has gender labels and clear stereotype definitions derived from its labels, which are professions. Text corpora and template data, which do not address a specific task (bottom seven rows), are mostly compatible with intrinsic metrics. The compatibility of intrinsic metrics with many datasets may explain why papers report intrinsic metrics more often (§5). Additionally, Table 1 indicates that not many datasets can be used to measure extrinsic metrics, particularly extrinsic performance metrics, which require gold labels. On the other hand, measuring LM prediction on target words, which we consider extrinsic, can be done on many datasets. This is useful for analyzing bias when dealing with LMs: one computes bias metrics from the LM output predictions, such as the mean probability gap when predicting the word "he" versus "she" in specific contexts. Also, some templates are valid for measuring extrinsic prediction metrics, especially stereotype-related metrics, as they were developed with explicit stereotypes in mind (such as profession-related stereotypes).
Based on Table 1, it is clear that there are many possible ways to measure gender bias in the literature, but they all fall under the vague category of "gender bias". Each of the possible combinations gives a different definition, or interpretation, of gender bias. The large number of different metrics makes it difficult or even impossible to compare different studies, including proposed gender bias mitigation methods. This raises questions about the validity of results derived from specific combinations of measurements. In the next two sections, we demonstrate how the choices of datasets and metrics can affect the bias measurement.
4 Effect of Dataset on Measured Results
The choice of data on which to measure bias has an impact on the calculated bias. Many researchers have used sentence templates that are "semantically bleached" (e.g., "This is <word>.", "<person> studied <profession> at college.") to adapt metrics developed for static word embeddings to contextualized representations (May et al., 2019b; Kurita et al., 2019; Webster et al., 2020; Bartl et al., 2020). Delobelle et al. (2021) found that the choice of templates significantly affected the results, with little correlation between different templates. Additionally, May et al. (2019b) reported that templates are not as semantically bleached as expected.
Another common feature of bias metrics is the use of hand-curated word lexicons; almost every bias metric in the literature relies on one. Antoniak and Mimno (2021) reported that the lexicon choice can greatly affect bias measurement, leading to differing conclusions between different lexicons.
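To illustrate how dataset composition can drive the measured bias (cf. footnote 1), here is a minimal sketch that balances a labeled dataset per gender and task label before a metric such as the TPR gap is computed; the dict-based data layout and field names are assumptions for illustration:

```python
import random
from collections import defaultdict

def balance(dataset, seed=0):
    """Subsample so every (task label, gender) pair has equally many examples.

    `dataset` is a list of dicts with "label" and "gender" keys (an assumed
    layout). Measuring a gap metric before vs. after this step can yield
    very different bias estimates for the same model.
    """
    cells = defaultdict(list)
    for ex in dataset:
        cells[(ex["label"], ex["gender"])].append(ex)
    n = min(len(v) for v in cells.values())  # size of the smallest cell
    rng = random.Random(seed)
    balanced = []
    for group in cells.values():
        balanced.extend(rng.sample(group, n))
    return balanced

# E.g., a profession-classification set with far more male "engineer" bios
# would be cut down so each profession has equal female and male examples.
```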