A Robust Bias Mitigation Procedure Based on the Stereotype Content
Model
Eddie L. Ungless and Amy Rafferty and Hrichika Nag and Björn Ross
School of Informatics
University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, United Kingdom
e.l.ungless@sms.ed.ac.uk a.rafferty@live.com
naghrichika@gmail.com b.ross@ed.ac.uk
Abstract
The Stereotype Content Model (SCM) states that we tend to perceive minority groups as cold, incompetent, or both. In this paper we adapt existing work to demonstrate that the Stereotype Content Model holds for contextualised word embeddings, then use these results to evaluate a fine-tuning process designed to drive a language model away from stereotyped portrayals of minority groups. We find that the SCM terms are better able to capture bias than demographic-agnostic terms related to pleasantness. Further, we were able to reduce the presence of stereotypes in the model through a simple fine-tuning procedure that required minimal human and computer resources, without harming downstream performance. We present this work as a prototype of a debiasing procedure that aims to remove the need for a priori knowledge of the specifics of bias in the model.
1 Introduction
It is well established that large language models (LLMs) such as BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019), and related contextualised word embeddings such as ELMo (Peters et al., 2018), are biased against different demographic groups (Guo and Caliskan, 2021; Webster et al., 2020; Kaneko and Bollegala, 2021), in that they often reflect stereotypes in their output. For example, given the prompt “naturally, the nurse is a”, these systems will typically output “woman” (Schick et al., 2021). Given the common practice of adapting pre-trained language models for a range of tasks through fine-tuning, upstream bias mitigation may prove to be the most efficient solution (Jin et al., 2021) (though cf. Steed et al. (2022)).

In this paper, we demonstrate the success of modifying an existing debiasing algorithm to be grounded in a psychological theory of stereotypes, the SCM (Cuddy et al., 2008), in order to efficiently reduce biases in LLMs across a range of identities. Our proposed debiasing pipeline has the benefit of minimising the time spent researching identity terms and associated stereotypes. Being a fine-tuning procedure, it also reduces the amount of computational resources needed compared to training an unbiased model from scratch. This renders our approach efficient and widely applicable. We demonstrate it using BERT, but the same procedure could easily be adapted to other LLMs.
We adapt the fine-tuning procedure from Kaneko and Bollegala (2021). They reduce gender bias in a range of LLMs by fine-tuning on a data set of sentences containing (binary) gendered terms such as “he, man” or “she, lady” (which they call attributes), or stereotypes associated with different genders, such as “assertive, secretary” (which they call targets). The training objective is to remove associations with gender in the contextualised embeddings of the targets whilst maintaining these associations for the gendered attributes.
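As a rough illustration of the kind of objective this describes, the sketch below penalises the squared inner product between target embeddings and precomputed attribute vectors, while an L2 regulariser keeps all embeddings close to those of the frozen original model so that language modelling quality is preserved. This is our own illustrative code, not Kaneko and Bollegala's implementation; the function and variable names and the weighting factor alpha are assumptions.

# Minimal sketch of a Kaneko & Bollegala (2021)-style debiasing loss.
# All names and hyperparameters are illustrative.
import torch

def debias_loss(target_embs, attribute_dirs, new_all_embs, orig_all_embs, alpha=0.2):
    """target_embs: (n_t, d) contextualised embeddings of target (identity) tokens
       attribute_dirs: (n_a, d) attribute vectors (e.g. warmth/competence terms),
                       precomputed with the frozen original model
       new_all_embs / orig_all_embs: (n, d) embeddings of all batch tokens from the
                       fine-tuned and frozen models, used for the regulariser"""
    # 1) push target embeddings towards orthogonality with the attribute directions
    inner = target_embs @ attribute_dirs.T          # (n_t, n_a)
    bias_term = (inner ** 2).sum()
    # 2) keep every embedding close to its original value so the LM is not degraded
    reg_term = ((new_all_embs - orig_all_embs) ** 2).sum()
    return bias_term + alpha * reg_term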
Crucially, rather than relying on stereotypes specific to a particular demographic such as men and women (as in Kaneko and Bollegala (2021)), we use the SCM to inform our production of fine-tuning data, inspired by work by Fraser et al. (2021). The SCM states that our stereotyped perception of different demographics can be conceptualised as lying in a vector space with axes of warmth/coldness and competence/incompetence (Cuddy et al., 2008). We tend to consider our own identity group to be warm and competent, and stereotype disfavoured groups, such as people experiencing homelessness, as cold and/or incompetent (Cuddy et al., 2008).
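For intuition, the following sketch places an identity term in this warmth/competence plane by taking the cosine similarity between its embedding and axis directions built from the difference between mean "warm" and mean "cold" (or "competent" and "incompetent") term vectors. This is our own illustration, not code from the cited work; the emb function and the term lists are assumed to be available.

# Rough illustration of positioning an identity term in the SCM plane.
# `emb` maps a word to a vector (e.g. averaged contextualised embeddings).
import numpy as np

def axis(high_pole, low_pole, emb):
    """Axis direction = mean(high-pole vectors) - mean(low-pole vectors)."""
    hi = np.mean([emb(w) for w in high_pole], axis=0)
    lo = np.mean([emb(w) for w in low_pole], axis=0)
    return hi - lo

def scm_position(identity_word, warm, cold, competent, incompetent, emb):
    v = emb(identity_word)
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return {"warmth": cos(v, axis(warm, cold, emb)),
            "competence": cos(v, axis(competent, incompetent, emb))}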
In the terminology of Kaneko and Bollegala (2021), our attributes are terms relating to warmth and competence taken from Nicolas et al. (2021) (as in Fraser et al. (2021), a paper on stereotypes in static embeddings), and our targets are demographic identity terms. Because the SCM is designed to encompass many different minority groups, this avoids the need to generate lists of stereotypes unique to each minority group, reducing workload and making the tool easy to adapt to different targets. Therefore, the procedure should be effective for all identity terms we use. We demonstrate this technique for Black/white ethnicity and also for the intersectional power dynamic between white men and Mexican American women, but it could easily be expanded to other aspects of identity such as disability and sexuality. Further, whilst we focus on English language and American identities, there is evidence that the SCM may hold relatively well cross-culturally (Cuddy et al., 2009), so this approach may be transferable to other LLMs.
We adapt the Contextualised Embedding Association Test (CEAT) (Guo and Caliskan, 2021) using the vocabulary from Nicolas et al. (2021) in order to measure stereotypes in contextualised word embeddings. The CEAT provides a robust measure of bias in contextualised word embeddings for target words, and is suited for use with the SCM terms.
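The core statistic the CEAT aggregates is a WEAT-style effect size, recomputed over many random samples of contextualised embeddings and then combined with a random-effects meta-analysis. The sketch below shows only the per-sample effect size; it is a simplified illustration under the assumption that X, Y are arrays of target embeddings (e.g. names) and A, B arrays of attribute embeddings (e.g. warmth vs. coldness terms), not the reference CEAT implementation.

# Per-sample WEAT-style effect size; CEAT combines many such values.
import numpy as np

def cosine(u, M):
    return (M @ u) / (np.linalg.norm(M, axis=1) * np.linalg.norm(u))

def association(w, A, B):
    return cosine(w, A).mean() - cosine(w, B).mean()

def weat_effect_size(X, Y, A, B):
    s_x = np.array([association(x, A, B) for x in X])
    s_y = np.array([association(y, A, B) for y in Y])
    pooled = np.concatenate([s_x, s_y])
    return (s_x.mean() - s_y.mean()) / pooled.std(ddof=1)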
In addition to using the CEAT to test for bias, we also measure the performance of the model on the language understanding benchmark GLUE (Wang et al., 2018), to ensure the fine-tuning procedure does not adversely impact the quality of the model, an issue Meade et al. (2022) identify as affecting several debiasing techniques.
The main contributions of this paper are to demonstrate:

• that the SCM can be used to detect bias in contextualised word embeddings
• a debiasing procedure that is demographic agnostic and resource efficient (code available at https://github.com/MxEddie/Demagnosticdebias)
2 Related work
Several contributions have been made towards measuring and mitigating bias in NLU models with minimal a priori knowledge. Fraser and colleagues (2021) demonstrated the validity of the SCM for static word embeddings, in that the embeddings of words associated with traditionally oppressed minority groups such as Mexican Americans or Africans tend to lie in the cold, incompetent space, as determined by cosine similarity. Note that, unlike Fraser et al. (2021), we focus on the embeddings of the identity terms themselves, not of words associated with those identities, as we explicitly want to identify whether there is bias in the embeddings. Fraser et al. (2021) looked to establish whether the embeddings of associated terms followed the SCM's predictions, not whether the word embeddings were biased in a way that reflects these stereotypes.
Utama et al. (2020) propose a strategy for debiasing “unknown biases”. They train a shallow model which picks up superficial patterns in the data that are likely to indicate bias. This shallow model is then used to train the main model by downweighting the potentially biased examples, paired with an annealing mechanism which prevents the loss of useful training signals that downweighting would otherwise cause. The models obtained from this self-debiasing framework were shown to perform just as well as models debiased using prior knowledge. In our work we do not train our model from scratch and focus only on social bias, whereas Utama et al. (2020) do not target specific bias types. We chose to prioritise socially relevant biases in the hope of minimising harm done to minority communities. Further, our method requires far less compute.
Webster et al. (2020) take gendered correlations in pretrained language representations as a case study for measuring and mitigating bias. They build an evaluation framework for detecting and quantifying gendered correlations in models. They find that both dropout regularization and counterfactual data augmentation minimize gendered correlations while maintaining strong model accuracy. Their techniques are applicable when training a model from scratch, whilst ours is a fine-tuning procedure, meaning it requires fewer computational resources.
Schick et al. (2021) explore whether language models can self-diagnose undesirable outputs for self-debiasing purposes. Their approach encourages the model to output biased text, and uses the resulting distribution to tune the model's original output. We argue that our approach is more demographic agnostic, as theirs depends heavily on the biases captured by Perspective API. Their approach may also miss less salient forms of bias, as it relies on the model having some representation of the bias category beforehand. Using the SCM, we can work “backwards” from the fact that certain communities are harmed to the assumption that they will be represented as cold and/or incompetent, making our approach more universally applicable.
Cao et al. (2022b) focus on identifying stereotyped group-trait associations in language models by introducing a sensitivity test for measuring stereotypical associations. They compare US-based human judgements to language model stereotypes, and discover moderate correlations. They also extend their framework to measure language model stereotyping of intersectional identities, finding problems with identifying emergent intersectional stereotypes. Our work differs in that we additionally perform debiasing informed by the SCM.
Overall, our methodology differs from most other contributions in this field in that it targets social bias specifically, and we propose a fine-tuning debiasing approach which requires little in the way of human or computer resources and is not limited to a small number of demographics.
3 Data sets and tasks
3.1 Data for Debiasing Procedure
3.1.1 Identity terms (targets)
We established two sets of identity terms (targets) for use with the context debiasing algorithm. The first set relates to racial bias (bias against people of colour based on their (perceived) race). BERT has been shown to demonstrate racial bias in both intrinsic (Guo and Caliskan, 2021) and extrinsic measures (Nadeem et al., 2021; Sheng et al., 2019). To reduce bias against Black people compared to white people, we created lists of 20 African American (AA) and 20 European American (EA) names, 10 male and 10 female for each, to use in the debiasing procedure. We used names from Guo and Caliskan (2021) (excluding any included in the CEAT tests we deploy, see Section 3.2) and supplemented these lists with common names from a database of US first names (Tzioumis, 2018). Excluding names from the CEAT tests was crucial to ensure that any reduction in bias was due to a restructuring of the embedding space and an overall change in how Black individuals were represented, and not due to bias reduction for the specific names we ran the debiasing procedure with.
The second set relates to intersectional bias against Mexican American (MA) women, that is, bias against women based on both patriarchal beliefs about their gender and prejudice against their ethnicity. This intersectional bias is evident in the contextualised embeddings BERT produces (Guo and Caliskan, 2021). To reduce bias against MA women compared to white men, we additionally took 10 common Hispanic female names from Tzioumis (2018), manually confirming through a Google search that each was used by the Mexican American community.
The validity of using names to represent demographic groups has been questioned (Blodgett et al., 2021). However, we assume that reducing the bias present in the representations of these names will go some way to reducing racial bias in the model.
3.1.2 Stereotype Content terms (attributes)
As in Fraser et al. (2021), we use the Stereotype Content terms from Nicolas et al. (2021), whereby the high-morality, high-sociability terms are taken to indicate warmth; low-morality, low-sociability terms to indicate coldness; high-ability, high-agency terms to indicate competence; and low-ability, low-agency terms to indicate incompetence. We selected the 32 most frequent terms from each list (as measured using the Brown Corpus and the NLTK toolkit), to increase the likelihood that we would find a large number of example sentences for each. During fine-tuning, we want these terms to maintain their projection in the warmth/coldness or competence/incompetence space, respectively, whilst removing the projection in these directions for the target terms (see Section 4 and Figure 1).
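The frequency filter just described can be sketched as follows, assuming the candidate term lists from Nicolas et al. (2021) are available as plain Python lists; the exact tokenisation and preprocessing in our pipeline may differ.

# Keep the 32 most frequent candidate terms per SCM pole,
# with frequencies taken from the Brown Corpus via NLTK.
import nltk
from nltk import FreqDist
from nltk.corpus import brown

nltk.download("brown", quiet=True)
freqs = FreqDist(w.lower() for w in brown.words())

def top_k(candidates, k=32):
    """Return the k candidates that occur most often in the Brown Corpus."""
    return sorted(candidates, key=lambda w: freqs[w.lower()], reverse=True)[:k]

# e.g. warmth_terms = top_k(high_morality_terms + high_sociability_terms)
# (variable names are hypothetical placeholders for the Nicolas et al. lists)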
Whilst the exact “position” of demographic groups in this conceptual space would vary depending on who is describing them, in this work we always assume that the minority group will be represented in the original model as cold and incompetent, in other words as the most disfavoured and most likely to experience harm (Cuddy et al., 2008). This minimises workload (there is no need to establish likely predictions for every demographic considered, beyond identifying the more marginalised group) and centres our approach on improving results for the most negatively represented identity terms. Note that there is no harm in running our debiasing procedure on identities that are already equally associated with one concept (i.e. warmth), whilst also reducing stereotyped associations with the other concept (i.e. competence).
3.1.3 Fine-tuning data
Having established the list of attribute and target terms, we follow an adapted version of Kaneko and Bollegala (2021)'s procedure for generating fine-tuning development data. During early analy-