A Robust Bias Mitigation Procedure Based on the Stereotype Content
Model
Eddie L. Ungless and Amy Rafferty and Hrichika Nag and Björn Ross
School of Informatics
University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, United Kingdom
e.l.ungless@sms.ed.ac.uk a.rafferty@live.com
naghrichika@gmail.com b.ross@ed.ac.uk
Abstract
The Stereotype Content Model (SCM) states that we tend to perceive minority groups as cold, incompetent, or both. In this paper we adapt existing work to demonstrate that the Stereotype Content Model holds for contextualised word embeddings, then use these results to evaluate a fine-tuning process designed to drive a language model away from stereotyped portrayals of minority groups. We find that the SCM terms are better able to capture bias than demographic-agnostic terms related to pleasantness. Further, we were able to reduce the presence of stereotypes in the model through a simple fine-tuning procedure that required minimal human and computer resources, without harming downstream performance. We present this work as a prototype of a debiasing procedure that aims to remove the need for a priori knowledge of the specifics of bias in the model.
1 Introduction
It is well established that large language models (LLMs) such as BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019), and related contextualised word embeddings such as ELMo (Peters et al., 2018), are biased against different demographic groups (Guo and Caliskan, 2021; Webster et al., 2020; Kaneko and Bollegala, 2021), in that they often reflect stereotypes in their output. For example, given the prompt “naturally, the nurse is a”, these systems will typically output “woman” (Schick et al., 2021). Given the common practice of adapting pre-trained language models for a range of tasks through fine-tuning, upstream bias mitigation may prove to be the most efficient solution (Jin et al., 2021) (though cf. Steed et al. (2022)).

In this paper, we demonstrate the success of modifying an existing debiasing algorithm to be grounded in a psychological theory of stereotypes, the SCM (Cuddy et al., 2008), in order to efficiently reduce biases in LLMs across a range of identities. Our proposed debiasing pipeline has the benefit of minimising the time spent researching identity terms and associated stereotypes. Being a fine-tuning procedure, it also reduces the amount of computational resources needed compared to training an unbiased model from scratch. This renders our approach efficient and widely applicable. We demonstrate it using BERT, but the same procedure could easily be adapted to other LLMs.
We adapt the fine-tuning procedure from Kaneko and Bollegala (2021). They reduce gender bias in a range of LLMs by fine-tuning on a data set of sentences containing (binary) gendered terms such as “he, man” or “she, lady” (which they call attributes), or stereotypes associated with different genders, such as “assertive, secretary” (which they call targets). The training objective is to remove associations with gender in the contextualised embeddings of the targets whilst maintaining these associations for the gendered attributes.
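As a rough illustration of the kind of objective this describes, the sketch below penalises the squared inner product between target embeddings and precomputed attribute vectors, while an L2 regulariser keeps all embeddings close to those of the frozen original model so that language modelling quality is preserved. This is our own illustrative code, not Kaneko and Bollegala's implementation; the function and variable names and the weighting factor alpha are assumptions.

# Minimal sketch of a Kaneko & Bollegala (2021)-style debiasing loss.
# All names and hyperparameters are illustrative.
import torch

def debias_loss(target_embs, attribute_dirs, new_all_embs, orig_all_embs, alpha=0.2):
    """target_embs: (n_t, d) contextualised embeddings of target (identity) tokens
       attribute_dirs: (n_a, d) attribute vectors (e.g. warmth/competence terms),
                       precomputed with the frozen original model
       new_all_embs / orig_all_embs: (n, d) embeddings of all batch tokens from the
                       fine-tuned and frozen models, used for the regulariser"""
    # 1) push target embeddings towards orthogonality with the attribute directions
    inner = target_embs @ attribute_dirs.T          # (n_t, n_a)
    bias_term = (inner ** 2).sum()
    # 2) keep every embedding close to its original value so the LM is not degraded
    reg_term = ((new_all_embs - orig_all_embs) ** 2).sum()
    return bias_term + alpha * reg_term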
Crucially, rather than relying on stereotypes specific to a particular demographic such as men and women (as in Kaneko and Bollegala (2021)), we use the SCM to inform our production of fine-tuning data, inspired by work by Fraser et al. (2021). The SCM states that our stereotyped perception of different demographics can be conceptualised as lying in a vector space with axes of warmth/coldness and competence/incompetence (Cuddy et al., 2008). We tend to consider our own identity group to be warm and competent, and stereotype disfavoured groups, such as people experiencing homelessness, as cold and/or incompetent (Cuddy et al., 2008).
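For intuition, the following sketch places an identity term in this warmth/competence plane by taking the cosine similarity between its embedding and axis directions built from the difference between mean "warm" and mean "cold" (or "competent" and "incompetent") term vectors. This is our own illustration, not code from the cited work; the emb function and the term lists are assumed to be available.

# Rough illustration of positioning an identity term in the SCM plane.
# `emb` maps a word to a vector (e.g. averaged contextualised embeddings).
import numpy as np

def axis(high_pole, low_pole, emb):
    """Axis direction = mean(high-pole vectors) - mean(low-pole vectors)."""
    hi = np.mean([emb(w) for w in high_pole], axis=0)
    lo = np.mean([emb(w) for w in low_pole], axis=0)
    return hi - lo

def scm_position(identity_word, warm, cold, competent, incompetent, emb):
    v = emb(identity_word)
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return {"warmth": cos(v, axis(warm, cold, emb)),
            "competence": cos(v, axis(competent, incompetent, emb))}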
In the terminology of Kaneko and Bollegala (2021), our attributes are terms relating to warmth and competence taken from Nicolas et al. (2021) (as in Fraser et al. (2021), a paper on stereotypes in static embeddings), and our targets are demographic identity terms. Because the SCM is designed to encompass many different minority groups, this avoids the need to generate lists of stereotypes unique to each minority group, reducing workload and making the tool easy to adapt to different targets. Therefore, the procedure should be effective for all identity terms we use. We demonstrate this technique for Black/white ethnicity and also for the intersectional power dynamic between white men and Mexican American women, but it could easily be expanded to other aspects of identity such as disability and sexuality. Further, whilst we focus on English language and American identities, there is evidence that the SCM may hold relatively well cross-culturally (Cuddy et al., 2009), so this approach may be transferable to other LLMs.
We adapt the Contextualised Embedding Association Test (CEAT) (Guo and Caliskan, 2021) using the vocabulary from Nicolas et al. (2021) in order to measure stereotypes in contextualised word embeddings. The CEAT provides a robust measure of bias in contextualised word embeddings for target words, and is suited for use with the SCM terms.
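The core statistic the CEAT aggregates is a WEAT-style effect size, recomputed over many random samples of contextualised embeddings and then combined with a random-effects meta-analysis. The sketch below shows only the per-sample effect size; it is a simplified illustration under the assumption that X, Y are arrays of target embeddings (e.g. names) and A, B arrays of attribute embeddings (e.g. warmth vs. coldness terms), not the reference CEAT implementation.

# Per-sample WEAT-style effect size; CEAT combines many such values.
import numpy as np

def cosine(u, M):
    return (M @ u) / (np.linalg.norm(M, axis=1) * np.linalg.norm(u))

def association(w, A, B):
    return cosine(w, A).mean() - cosine(w, B).mean()

def weat_effect_size(X, Y, A, B):
    s_x = np.array([association(x, A, B) for x in X])
    s_y = np.array([association(y, A, B) for y in Y])
    pooled = np.concatenate([s_x, s_y])
    return (s_x.mean() - s_y.mean()) / pooled.std(ddof=1)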
In addition to using the CEAT to test for bias, we also measure the performance of the model on the language understanding benchmark GLUE (Wang et al., 2018), to ensure the fine-tuning procedure does not adversely impact the quality of the model, an issue Meade et al. (2022) identify as affecting several debiasing techniques.
The main contributions of this paper are to demonstrate:

• that the SCM can be used to detect bias in contextualised word embeddings
• a debiasing procedure that is demographic agnostic and resource efficient (code available at https://github.com/MxEddie/Demagnosticdebias)
2 Related work
Several contributions have been made towards measuring and mitigating bias in NLU models with minimal a priori knowledge. Fraser and colleagues (2021) demonstrated the validity of the SCM for static word embeddings, in that the embeddings of words associated with traditionally oppressed minority groups such as Mexican Americans or Africans tend to lie in the cold, incompetent space, as determined by cosine similarity. Note that, unlike Fraser et al. (2021), we focus on the embeddings of the identity terms themselves, not of words associated with those identities, as we explicitly want to identify whether there is bias in the embeddings. Fraser et al. (2021) looked to establish whether the embeddings of associated terms followed the SCM's predictions, not whether the word embeddings were biased in a way that reflects these stereotypes.
Utama et al. (2020) propose a strategy for debiasing “unknown biases”. They train a shallow model which picks up superficial patterns in the data that are likely to indicate bias. This shallow model is then used to train the main model by downweighting the potentially biased examples, paired with an annealing mechanism which prevents the loss of useful training signals that downweighting would otherwise cause. The models obtained from this self-debiasing framework were shown to perform just as well as models debiased using prior knowledge. In our work we do not train our model from scratch and focus only on social bias, whereas Utama et al. (2020) do not target specific bias types. We chose to prioritise socially relevant biases in the hope of minimising harm done to minority communities. Further, our method requires far less compute.
Webster et al. (2020) take gendered correlations in pretrained language representations as a case study for measuring and mitigating bias. They build an evaluation framework for detecting and quantifying gendered correlations in models. They find that both dropout regularization and counterfactual data augmentation minimize gendered correlations while maintaining strong model accuracy. Their techniques are applicable when training a model from scratch, whilst ours is a fine-tuning procedure, meaning it requires fewer computational resources.
Schick et al. (2021) explore whether language models can self-diagnose undesirable outputs for self-debiasing purposes. Their approach encourages the model to output biased text, and uses the resulting distribution to tune the model's original output. We argue that our approach is more demographic agnostic, as theirs depends heavily on the biases captured by Perspective API. Their approach may also miss less salient forms of bias, as it relies on the model having some representation of the bias category beforehand. Using the SCM, we can work “backwards” from the fact that certain communities are harmed to the assumption that they will be represented as cold and/or incompetent, making our approach more universally applicable.
Cao et al. (2022b) focus on identifying stereotyped group-trait associations in language models by introducing a sensitivity test for measuring stereotypical associations. They compare US-based human judgements to language model stereotypes, and discover moderate correlations. They also extend their framework to measure language model stereotyping of intersectional identities, finding problems with identifying emergent intersectional stereotypes. Our work differs in that we additionally perform debiasing informed by the SCM.
Overall, our methodology differs from most other contributions in this field in that it targets social bias specifically, and we propose a fine-tuning debiasing approach which requires little in the way of human or computer resources and is not limited to a small number of demographics.
3 Data sets and tasks
3.1 Data for Debiasing Procedure
3.1.1 Identity terms (targets)
We established two sets of identity terms (targets) for use with the context debiasing algorithm. The first set relates to racial bias (bias against people of colour based on their (perceived) race). BERT has been shown to demonstrate racial bias in both intrinsic (Guo and Caliskan, 2021) and extrinsic measures (Nadeem et al., 2021; Sheng et al., 2019). To reduce bias against Black people compared to white people, we created lists of 20 African American (AA) and 20 European American (EA) names, 10 male and 10 female for each, to use in the debiasing procedure. We used names from Guo and Caliskan (2021) (excluding any included in the CEAT tests we deploy, see Section 3.2) and supplemented these lists with common names from a database of US first names (Tzioumis, 2018). Excluding names from the CEAT tests was crucial to ensure that any reduction in bias was due to a restructuring of the embedding space and an overall change in how Black individuals were represented, and not due to bias reduction for the specific names we ran the debiasing procedure with.
The second set relates to intersectional bias against Mexican American (MA) women, that is, bias against women based on both patriarchal beliefs about their gender and prejudice against their ethnicity. This intersectional bias is evident in the contextualised embeddings BERT produces (Guo and Caliskan, 2021). To reduce bias against MA women compared to white men, we additionally took 10 common Hispanic female names from Tzioumis (2018), manually confirming through a Google search that each was used by the Mexican American community.
The validity of using names to represent demographic groups has been questioned (Blodgett et al., 2021). However, we assume that reducing the bias present in the representations of these names will go some way to reducing racial bias in the model.
3.1.2 Stereotype Content terms (attributes)
As in Fraser et al. (2021), we use the Stereotype Content terms from Nicolas et al. (2021), whereby the high-morality, high-sociability terms are taken to indicate warmth; low-morality, low-sociability terms to indicate coldness; high-ability, high-agency terms to indicate competence; and low-ability, low-agency terms to indicate incompetence. We selected the 32 most frequent terms from each list (as measured using the Brown Corpus and the NLTK toolkit), to increase the likelihood that we would find a large number of example sentences for each. During fine-tuning, we want these terms to maintain their projection in the warmth/coldness or competence/incompetence space, respectively, whilst removing the projection in these directions for the target terms (see Section 4 and Figure 1).
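The frequency filter just described can be sketched as follows, assuming the candidate term lists from Nicolas et al. (2021) are available as plain Python lists; the exact tokenisation and preprocessing in our pipeline may differ.

# Keep the 32 most frequent candidate terms per SCM pole,
# with frequencies taken from the Brown Corpus via NLTK.
import nltk
from nltk import FreqDist
from nltk.corpus import brown

nltk.download("brown", quiet=True)
freqs = FreqDist(w.lower() for w in brown.words())

def top_k(candidates, k=32):
    """Return the k candidates that occur most often in the Brown Corpus."""
    return sorted(candidates, key=lambda w: freqs[w.lower()], reverse=True)[:k]

# e.g. warmth_terms = top_k(high_morality_terms + high_sociability_terms)
# (variable names are hypothetical placeholders for the Nicolas et al. lists)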
Whilst the exact “position” of demographic groups in this conceptual space would vary depending on who is describing them, in this work we always assume that the minority group will be represented in the original model as cold and incompetent, in other words as the most disfavoured and most likely to experience harm (Cuddy et al., 2008). This minimises workload (there is no need to establish likely predictions for every demographic considered, beyond identifying the more marginalised group) and centres our approach on improving results for the most negatively represented identity terms. Note that there is no harm in running our debiasing procedure on identities that are already equally associated with one concept (i.e. warmth), whilst also reducing stereotyped associations with the other concept (i.e. competence).
3.1.3 Fine-tuning data
Having established the list of attribute and target terms, we follow an adapted version of Kaneko and Bollegala (2021)'s procedure for generating fine-tuning development data. During early analy-