Quantifying Social Biases Using Templates is Unreliable Preethi Seshadri

2025-05-02 0 0 290.59KB 13 页 10玖币

侵权投诉

Quantifying Social Biases

Using Templates is Unreliable

Preethi Seshadri

UC Irvine

preethis@uci.edu

Pouya Pezeshkpour

UC Irvine

pezeshkp@uci.edu

Sameer Singh

UC Irvine

sameer@uci.edu

Abstract

Recently, there has been an increase in efforts to understand how large language

models (LLMs) propagate and amplify social biases. Several works have utilized

templates for fairness evaluation, which allow researchers to quantify social biases

in the absence of test sets with protected attribute labels. While template evalu-

ation can be a convenient and helpful diagnostic tool to understand model deﬁ-

ciencies, it often uses a simplistic and limited set of templates. In this paper, we

study whether bias measurements are sensitive to the choice of templates used for

benchmarking. Speciﬁcally, we investigate the instability of bias measurements

by manually modifying templates proposed in previous works in a semantically-

preserving manner and measuring bias across these modiﬁcations. We ﬁnd that

bias values and resulting conclusions vary considerably across template modiﬁ-

cations on four tasks, ranging from an 81% reduction (NLI) to a 162% increase

(MLM) in (task-speciﬁc) bias measurements. Our results indicate that quantify-

ing fairness in LLMs, as done in current practice, can be brittle and needs to be

approached with more care and caution.

1 Introduction

Over the past few years, large language models (LLMs) have demonstrated impressive performance,

including few- and zero-shot performance, on many NLP tasks [Devlin et al., 2019, Liu et al., 2019,

Radford et al., 2019, Raffel et al., 2019, Brown et al., 2020]. However, LLMs have been shown to

exhibit social biases that can amplify harmful stereotypes and discriminatory practices. For example,

Abid et al. [2021] highlight that GPT-3 consistently displays anti-Muslim biases that are much more

severe than biases against other religious groups. Along with rapid developments in LLMs comes

the need for more systematic fairness evaluation to ensure models behave as expected and perform

well across various subgroups.

To address gaps in evaluation, behavioral testing is a useful framework to perform sanity checks and

validate the reliability of NLP systems. While behavioral testing has been applied more generally to

assist with debugging language models and assessing model generalization abilities [Ribeiro et al.,

2020, Goel et al., 2021, Mille et al., 2021, Ribeiro and Lundberg, 2022], these practices have also

been adopted in the bias and fairness space to help researchers understand how models can perpetu-

ate stereotypes and exacerbate existing inequities [Prabhakaran et al., 2019, Sheng et al., 2019, Kirk

et al., 2021]. A widely-used solution to quantify social biases in NLP is to generate a synthetic test

dataset in an automated manner by utilizing simple templates that test model capabilities [Dixon

et al., 2018, Kiritchenko and Mohammad, 2018, Park et al., 2018, Kurita et al., 2019, Dev et al.,

2020, Huang et al., 2020, Li et al., 2020]. With little effort, researchers can generate thousands of

instances by creating a small number of templates and iterating over combinations of the ﬁll-in-

the-blank terms. Several existing works incorporate this simple approach to evaluate and expose

undesirable model biases — for example, Kiritchenko and Mohammad [2018] use templates such

Submitted to 2022 Trustworthy and Socially Responsible Machine Learning (TSRML 2022).

arXiv:2210.04337v1 [cs.CL] 9 Oct 2022

Language

Model

The situation makes [PERSON]

feel [EMOTION WORD].

[PERSON] is feeling [EMOTION

WORD] due to the situation.

Original Template

Bias Benchmark

The situation

makes her feel

angry. Show statistically significant bias

Does not show statistically

significant bias

Modified Template

Sentiment Analysis

The situation

makes him feel

angry.

She is feeling

angry due to

the situation.

He is feeling

angry due to

the situation.

Paired T-Test

(p-value)

0.01

0.7

Figure 1: Example of the fragility of bias measurements for sentiment analysis. Although the senti-

ment analysis model demonstrates signiﬁcant bias on the original template, the modiﬁed template

(modifying the original template while preserving content) instead results in a different conclusion!

as “The situation makes [PERSON] feel [EMOTION WORD].” to analyze whether

sentiment analysis systems exhibit statistically signiﬁcant gender bias.

Although templates are a convenient, easy-to-use, and scalable diagnostic tool for model biases,

these very beneﬁts can also lead to notable limitations. Due to the ﬁll-in-the-blank nature of tem-

plates, they tend to be extremely short and convey a single idea. Therefore, templates may not

represent structural and stylistic variations that occur in natural text. Furthermore, the scalable na-

ture of templates means that most works tend to include a small set of templates (often single digits),

as opposed to a more diverse, comprehensive set. While each template captures a speciﬁc idea or

behavior, it is often unclear why template datasets are constructed the way they are, i.e., why certain

templates are included vs. excluded and why templates are phrased in a speciﬁc way. Therefore,

template evaluation may depict a limited and misleading picture of model bias. As highlighted in

Figure 1, the sentiment analysis model demonstrates statistically signiﬁcant bias on an original tem-

plate from Kiritchenko and Mohammad [2018]. On the other hand, slightly modifying this template

results in a completely different conclusion.

In this paper, we ask: How brittle is template data evaluation for assessing model fairness? To

answer this question, we examine how sensitive bias measures are to meaning-preserving changes in

templates. Ideally, we would expect the original and modiﬁed templates, conveying similar content

and containing identical ﬁll-in-the-blank terms, to result in close predictions and therefore capture

similar bias. We consider four tasks — sentiment analysis, toxicity detection, natural language

inference (NLI), and masked language modeling (MLM) — and draw on existing template datasets

for each. Template modiﬁcations are done manually and held ﬁxed, instead of using an adversarial

or human-in-the-loop procedure (an example modiﬁcation is shown in Figure 1). The reasoning

behind this choice is both to ensure modiﬁed templates remain coherent and similar to the original

versions, as well as to generate model-agnostic modiﬁcations.

We ﬁnd, however, that bias varies considerably across modiﬁed templates and differs from original

measurements on 4 different NLP tasks. For example, by categorizing examples based on statis-

tical test outcomes for gender bias, we observe that 33% of modiﬁed templates result in different

categorizations for sentiment analysis. We also observe that task-speciﬁc bias measures change up

to 81% in NLI, 127% in toxicity detection, and 162% in MLM after modiﬁcations. Our results

raise important questions about how fairness is being evaluated in LLMs currently. They indicate

that bias measurements, and any subsequent conclusions made from these measurements, are in-

consistent and highly template-speciﬁc. As a result, the process of comparing models and choosing

the “least biased” model to deploy can lead to different decisions just based on subtle wording and

phrasing choices in templates. We strongly advise researchers to leverage handcrafted fairness eval-

uation datasets when available and appropriate, or to place greater emphasis on generating more

comprehensive and diverse sets of templates for bias evaluation.

2 Behavioral Testing for Fairness

In this section, we provide an overview of template-based bias evaluation for different NLP tasks,

as well as the template modiﬁcation and training procedures.

Table 1: Counts and examples of original/modiﬁed templates for each task.

Task # Orig # Mod Example of a Template

Sentiment 7 40

Original: The situation makes [PERSON] feel [EMOTIONAL STATE].

Modiﬁed: [PERSON] is feeling [EMOTIONAL STATE] due to the sit-

uation.

NLI 1 3

Original: P: A/An [SUBJECT] [VERB] a/an [OBJECT].

H: A/An [GENDERED WORD] [VERB] a/an [OBJECT].

Modiﬁed: P: A/an [OBJECT] was [VERB] by a/an [SUBJECT].

H: A/an [OBJECT] was [VERB] by a/an [GENDERED WORD].

Toxicity 5 43 Original: [NAME] is a/an [ADJ] [IDENTITY]

Modiﬁed: [NAME] can be described as a/an [ADJ] [IDENTITY]

MLM 1 4 Original: [TARGET] is [ATTRIBUTE].

Modiﬁed: [TARGET] was [ATTRIBUTE].

2.1 How Bias is Evaluated in NLP Tasks

Sentiment Analysis is the task of predicting the sentiment or emotional tone of a text. In this

work, we focus on binary sentiment classiﬁcation, so the target labels are positive or negative sen-

timent. As the bias benchmark, we use the Equity Evaluation Corpus (EEC), designed to evaluate

submissions that took part in SemEval-2018 Task 1 [Kiritchenko and Mohammad, 2018]. We con-

sider 7 original templates from EEC that contain emotion words. These templates are then used to

test for differences in the predicted probability of a positive sentiment for two sentences that dif-

fer solely by a gendered noun phrase (e.g., names, “he” vs. “she”, “my son” vs “my daughter”,

etc.). Following the approach from the original paper, we use paired two-sample t-tests to determine

whether the mean difference between scores assigned to male and female sentences is statistically

signiﬁcant at a template level.

NLI is the task of predicting whether a hypothesis statement is true (entailment), false (contradic-

tion), or unclear (neutral) given a premise statement. We select the bias benchmark created by [Dev

et al., 2020] to measure various stereotypes in NLI and focus on the gender/occupation instances.

The authors include just a single template, with roughly 2 million instances of this template: the

premise follows the form “A/An [SUBJECT] [VERB] a/an [OBJECT]”, while the hypoth-

esis follows the form “A/An [GENDERED WORD] [VERB] a/an [OBJECT]” (the subject

becomes a gendered word). For all instances, the ground truth label is neutral since there is no infor-

mation in the premise that would entail or contradict the hypothesis. The original paper computes

the deviation from neutrality as the average probability for the neutral class and the fraction of ex-

amples that are predicted as neutral. We go one step further and measure the difference in deviation

from neutrality, using these two approaches, for instances with male vs. female-gendered words.

Toxicity Detection is the task of detecting whether a text contains toxic language (hateful, abusive,

or offensive content) or not. We adopt the benchmark created by [Dixon et al., 2018] to measure

unintended bias in toxicity detection. While there is an older version of the dataset now under

archive, we choose the most recent version, which the Jigsaw team uses to evaluate bias in the

Perspective API1. Instead of considering only binary gender bias, the researchers identify biases

against various demographic identity terms. After excluding any templates without identity terms,

we focus on 5 original templates with both toxic and non-toxic instances. We follow the original

work and compute two bias measures, the sum of absolute differences in false positive rate (FPED):

False Positive Equality Difference (FPED) =X

i∈I

|FPR−FPRi|(1)

Here Irepresents the set of all identity terms. Similarly, we compute the sum of absolute differences

in false negative rate (FNED) across all identity terms.

Masked Language Modeling (MLM) is a ﬁll-in-the-blank task where the model predicts masked

token(s) in a text. We utilize the log probability bias score method [Kurita et al., 2019], which

1Perspective API identiﬁes toxicity using machine learning: (https://perspectiveapi.com).

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

QuantifyingSocialBiasesUsingTemplatesisUnreliablePreethiSeshadriUCIrvinepreethis@uci.eduPouyaPezeshkpourUCIrvinepezeshkp@uci.eduSameerSinghUCIrvinesameer@uci.eduAbstractRecently,therehasbeenanincreaseineffortstounderstandhowlargelanguagemodels(LLMs)propagateandamplifysocialbiases.Severalworkshaveuti...

展开>> 收起<<

Quantifying Social Biases Using Templates is Unreliable Preethi Seshadri.pdf

共13页,预览3页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

Quantifying Social Biases Using Templates is Unreliable Preethi Seshadri

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: