Quantifying Social Biases
Using Templates is Unreliable
Preethi Seshadri
UC Irvine
preethis@uci.edu
Pouya Pezeshkpour
UC Irvine
pezeshkp@uci.edu
Sameer Singh
UC Irvine
sameer@uci.edu
Abstract
Recently, there has been an increase in efforts to understand how large language
models (LLMs) propagate and amplify social biases. Several works have utilized
templates for fairness evaluation, which allow researchers to quantify social biases
in the absence of test sets with protected attribute labels. While template evalu-
ation can be a convenient and helpful diagnostic tool to understand model defi-
ciencies, it often uses a simplistic and limited set of templates. In this paper, we
study whether bias measurements are sensitive to the choice of templates used for
benchmarking. Specifically, we investigate the instability of bias measurements
by manually modifying templates proposed in previous works in a semantically-
preserving manner and measuring bias across these modifications. We find that
bias values and resulting conclusions vary considerably across template modifi-
cations on four tasks, ranging from an 81% reduction (NLI) to a 162% increase
(MLM) in (task-specific) bias measurements. Our results indicate that quantify-
ing fairness in LLMs, as done in current practice, can be brittle and needs to be
approached with more care and caution.
1 Introduction
Over the past few years, large language models (LLMs) have demonstrated impressive performance,
including few- and zero-shot performance, on many NLP tasks [Devlin et al., 2019, Liu et al., 2019,
Radford et al., 2019, Raffel et al., 2019, Brown et al., 2020]. However, LLMs have been shown to
exhibit social biases that can amplify harmful stereotypes and discriminatory practices. For example,
Abid et al. [2021] highlight that GPT-3 consistently displays anti-Muslim biases that are much more
severe than biases against other religious groups. Along with rapid developments in LLMs comes
the need for more systematic fairness evaluation to ensure models behave as expected and perform
well across various subgroups.
To address gaps in evaluation, behavioral testing is a useful framework to perform sanity checks and
validate the reliability of NLP systems. While behavioral testing has been applied more generally to
assist with debugging language models and assessing model generalization abilities [Ribeiro et al.,
2020, Goel et al., 2021, Mille et al., 2021, Ribeiro and Lundberg, 2022], these practices have also
been adopted in the bias and fairness space to help researchers understand how models can perpetu-
ate stereotypes and exacerbate existing inequities [Prabhakaran et al., 2019, Sheng et al., 2019, Kirk
et al., 2021]. A widely-used solution to quantify social biases in NLP is to generate a synthetic test
dataset in an automated manner by utilizing simple templates that test model capabilities [Dixon
et al., 2018, Kiritchenko and Mohammad, 2018, Park et al., 2018, Kurita et al., 2019, Dev et al.,
2020, Huang et al., 2020, Li et al., 2020]. With little effort, researchers can generate thousands of
instances by creating a small number of templates and iterating over combinations of the fill-in-
the-blank terms. Several existing works incorporate this simple approach to evaluate and expose
undesirable model biases — for example, Kiritchenko and Mohammad [2018] use templates such
[Figure 1: A sentiment analysis model is queried with the original template "The situation makes [PERSON] feel [EMOTION WORD]." and the modified template "[PERSON] is feeling [EMOTION WORD] due to the situation.", instantiated with "she"/"he" and "angry"; the paired t-test gives p = 0.01 (statistically significant bias) for the original template and p = 0.7 (no statistically significant bias) for the modified one.]
Figure 1: Example of the fragility of bias measurements for sentiment analysis. Although the senti-
ment analysis model demonstrates significant bias on the original template, the modified template
(modifying the original template while preserving content) instead results in a different conclusion!
as “The situation makes [PERSON] feel [EMOTION WORD].” to analyze whether
sentiment analysis systems exhibit statistically significant gender bias.
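To make this expansion step concrete, the following is a minimal sketch (the person and emotion term lists are illustrative placeholders, not the actual EEC vocabularies) of how a single fill-in-the-blank template yields many test instances:

```python
from itertools import product

# Illustrative term lists (placeholders, not the actual EEC vocabularies).
PERSONS = ["he", "she", "my brother", "my sister"]
EMOTION_WORDS = ["angry", "happy", "anxious", "delighted"]

TEMPLATE = "The situation makes {person} feel {emotion}."

# One template expands into len(PERSONS) * len(EMOTION_WORDS) instances.
instances = [TEMPLATE.format(person=p, emotion=e)
             for p, e in product(PERSONS, EMOTION_WORDS)]
print(len(instances))  # 16 instances from a single template
```

Scaling the term lists to realistic sizes is what turns a handful of templates into thousands of evaluation instances.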
Although templates are a convenient, easy-to-use, and scalable diagnostic tool for model biases,
these very benefits can also lead to notable limitations. Due to the fill-in-the-blank nature of tem-
plates, they tend to be extremely short and convey a single idea. Therefore, templates may not
represent structural and stylistic variations that occur in natural text. Furthermore, the scalable na-
ture of templates means that most works tend to include a small set of templates (often single digits),
as opposed to a more diverse, comprehensive set. While each template captures a specific idea or
behavior, it is often unclear why template datasets are constructed the way they are, i.e., why certain
templates are included vs. excluded and why templates are phrased in a specific way. Therefore,
template evaluation may depict a limited and misleading picture of model bias. As highlighted in
Figure 1, the sentiment analysis model demonstrates statistically significant bias on an original tem-
plate from Kiritchenko and Mohammad [2018]. On the other hand, slightly modifying this template
results in a completely different conclusion.
In this paper, we ask: How brittle is template data evaluation for assessing model fairness? To
answer this question, we examine how sensitive bias measures are to meaning-preserving changes in
templates. Ideally, we would expect the original and modified templates, conveying similar content
and containing identical fill-in-the-blank terms, to result in close predictions and therefore capture
similar bias. We consider four tasks — sentiment analysis, toxicity detection, natural language
inference (NLI), and masked language modeling (MLM) — and draw on existing template datasets
for each. Template modifications are done manually and held fixed, instead of using an adversarial
or human-in-the-loop procedure (an example modification is shown in Figure 1). The reasoning
behind this choice is both to ensure that modified templates remain coherent and similar to the original
versions and to generate model-agnostic modifications.
We find, however, that bias varies considerably across modified templates and differs from original
measurements on 4 different NLP tasks. For example, by categorizing examples based on statis-
tical test outcomes for gender bias, we observe that 33% of modified templates result in different
categorizations for sentiment analysis. We also observe that task-specific bias measures change up
to 81% in NLI, 127% in toxicity detection, and 162% in MLM after modifications. Our results
raise important questions about how fairness is being evaluated in LLMs currently. They indicate
that bias measurements, and any subsequent conclusions made from these measurements, are in-
consistent and highly template-specific. As a result, the process of comparing models and choosing
the “least biased” model to deploy can lead to different decisions just based on subtle wording and
phrasing choices in templates. We strongly advise researchers to leverage handcrafted fairness eval-
uation datasets when available and appropriate, or to place greater emphasis on generating more
comprehensive and diverse sets of templates for bias evaluation.
2 Behavioral Testing for Fairness
In this section, we provide an overview of template-based bias evaluation for different NLP tasks,
as well as the template modification and training procedures.
Table 1: Counts and examples of original/modified templates for each task.

Task        # Orig   # Mod   Example of a Template
Sentiment   7        40      Original: The situation makes [PERSON] feel [EMOTIONAL STATE].
                             Modified: [PERSON] is feeling [EMOTIONAL STATE] due to the situation.
NLI         1        3       Original: P: A/An [SUBJECT] [VERB] a/an [OBJECT].
                                       H: A/An [GENDERED WORD] [VERB] a/an [OBJECT].
                             Modified: P: A/an [OBJECT] was [VERB] by a/an [SUBJECT].
                                       H: A/an [OBJECT] was [VERB] by a/an [GENDERED WORD].
Toxicity    5        43      Original: [NAME] is a/an [ADJ] [IDENTITY]
                             Modified: [NAME] can be described as a/an [ADJ] [IDENTITY]
MLM         1        4       Original: [TARGET] is [ATTRIBUTE].
                             Modified: [TARGET] was [ATTRIBUTE].
2.1 How Bias is Evaluated in NLP Tasks
Sentiment Analysis is the task of predicting the sentiment or emotional tone of a text. In this
work, we focus on binary sentiment classification, so the target labels are positive or negative sen-
timent. As the bias benchmark, we use the Equity Evaluation Corpus (EEC), designed to evaluate
submissions that took part in SemEval-2018 Task 1 [Kiritchenko and Mohammad, 2018]. We con-
sider 7 original templates from EEC that contain emotion words. These templates are then used to
test for differences in the predicted probability of a positive sentiment for two sentences that dif-
fer solely by a gendered noun phrase (e.g., names, “he” vs. “she”, “my son” vs “my daughter”,
etc.). Following the approach from the original paper, we use paired two-sample t-tests to determine
whether the mean difference between scores assigned to male and female sentences is statistically
significant at a template level.
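As a rough illustration (a sketch, not the authors' code), the template-level test could be run as follows; the score arrays are hypothetical positive-sentiment probabilities, aligned so that position i holds the male and female variants of the same instance:

```python
import numpy as np
from scipy import stats

# Hypothetical positive-sentiment probabilities for sentence pairs that
# differ only in the gendered noun phrase (same template, same emotion word).
male_scores = np.array([0.62, 0.48, 0.71, 0.55, 0.80])
female_scores = np.array([0.58, 0.41, 0.69, 0.50, 0.77])

# Paired t-test on per-pair score differences, as in Kiritchenko and Mohammad [2018].
t_stat, p_value = stats.ttest_rel(male_scores, female_scores)
print(f"mean difference = {(male_scores - female_scores).mean():.3f}, p = {p_value:.3f}")
```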
NLI is the task of predicting whether a hypothesis statement is true (entailment), false (contradic-
tion), or unclear (neutral) given a premise statement. We select the bias benchmark created by Dev
et al. [2020] to measure various stereotypes in NLI and focus on the gender/occupation instances.
The authors include just a single template, with roughly 2 million instances of this template: the
premise follows the form “A/An [SUBJECT] [VERB] a/an [OBJECT]”, while the hypoth-
esis follows the form “A/An [GENDERED WORD] [VERB] a/an [OBJECT]” (the subject
becomes a gendered word). For all instances, the ground truth label is neutral since there is no infor-
mation in the premise that would entail or contradict the hypothesis. The original paper computes
the deviation from neutrality as the average probability for the neutral class and the fraction of ex-
amples that are predicted as neutral. We go one step further and measure the difference in deviation
from neutrality, using these two approaches, for instances with male vs. female-gendered words.
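A minimal sketch of these two measures and of the male vs. female gap we report, assuming the model exposes per-instance class probabilities (the probability values below are hypothetical):

```python
import numpy as np

NEUTRAL = 1  # column index in [entailment, neutral, contradiction]

def deviation_measures(probs):
    """Mean neutral-class probability and fraction of instances predicted
    neutral, the two deviation-from-neutrality measures of Dev et al. [2020]."""
    probs = np.asarray(probs)
    return probs[:, NEUTRAL].mean(), (probs.argmax(axis=1) == NEUTRAL).mean()

# Hypothetical class probabilities for male- and female-gendered hypotheses.
male_probs = [[0.30, 0.45, 0.25], [0.55, 0.30, 0.15]]
female_probs = [[0.20, 0.35, 0.45], [0.60, 0.25, 0.15]]

m_mean, m_frac = deviation_measures(male_probs)
f_mean, f_frac = deviation_measures(female_probs)
# The quantity we study: the male vs. female gap in each measure.
print(m_mean - f_mean, m_frac - f_frac)
```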
Toxicity Detection is the task of detecting whether a text contains toxic language (hateful, abusive,
or offensive content) or not. We adopt the benchmark created by Dixon et al. [2018] to measure
unintended bias in toxicity detection. While an older version of the dataset is now archived, we
choose the most recent version, which the Jigsaw team uses to evaluate bias in the
Perspective API.¹ Instead of considering only binary gender bias, the researchers identify biases
against various demographic identity terms. After excluding any templates without identity terms,
we focus on 5 original templates with both toxic and non-toxic instances. We follow the original
work and compute two bias measures, the sum of absolute differences in false positive rate (FPED):
\text{False Positive Equality Difference (FPED)} = \sum_{i \in I} \left| \mathrm{FPR} - \mathrm{FPR}_i \right| \qquad (1)

Here I represents the set of all identity terms. Similarly, we compute the sum of absolute differences
in false negative rate (FNED) across all identity terms.
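A minimal sketch of both measures, assuming binary labels and predictions (1 = toxic) and a mapping from each identity term to the indices of the instances mentioning it (all names and the data layout are illustrative, not the benchmark's actual format):

```python
import numpy as np

def equality_differences(labels, preds, identities):
    """FPED and FNED following Dixon et al. [2018]: sum over identity terms of
    the absolute gap between the term-specific and overall FPR/FNR."""
    labels, preds = np.asarray(labels), np.asarray(preds)

    def fpr(y, p):  # false positive rate over non-toxic (label 0) instances
        return np.mean(p[y == 0] == 1)

    def fnr(y, p):  # false negative rate over toxic (label 1) instances
        return np.mean(p[y == 1] == 0)

    overall_fpr, overall_fnr = fpr(labels, preds), fnr(labels, preds)
    fped = sum(abs(overall_fpr - fpr(labels[idx], preds[idx]))
               for idx in identities.values())
    fned = sum(abs(overall_fnr - fnr(labels[idx], preds[idx]))
               for idx in identities.values())
    return fped, fned

# Hypothetical usage: identities maps each term to its instance indices, e.g.
# fped, fned = equality_differences(labels, preds, {"gay": [0, 3], "muslim": [1, 2]})
```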
Masked Language Modeling (MLM) is a fill-in-the-blank task where the model predicts masked
token(s) in a text. We utilize the log probability bias score method [Kurita et al., 2019], which
¹ Perspective API identifies toxicity using machine learning: https://perspectiveapi.com