
resentation space, wherein similarity measures between sentences better correspond to their semantic meanings (Gao et al., 2021). Meanwhile, our proposed alignment loss, which pulls together identical sentences along contrasting gender directions, is well-suited to learning a fairer semantic space.
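To make this objective concrete, the sketch below writes such an alignment loss as an InfoNCE-style contrastive objective over sentence pairs that are identical up to gender-swapped terms; the function name, use of in-batch negatives, and temperature value are illustrative assumptions rather than MABEL's exact formulation.

```python
import torch
import torch.nn.functional as F

def alignment_loss(orig_emb, swapped_emb, temperature=0.05):
    """InfoNCE-style alignment over (original, gender-swapped) sentence pairs.

    orig_emb, swapped_emb: [batch, dim] embeddings of the same sentences
    before and after swapping gendered terms (e.g., "he" <-> "she").
    Each original sentence is pulled toward its own swapped counterpart
    and pushed away from the other swapped sentences in the batch.
    """
    orig_emb = F.normalize(orig_emb, dim=-1)
    swapped_emb = F.normalize(swapped_emb, dim=-1)
    sim = orig_emb @ swapped_emb.T / temperature            # [batch, batch] cosine similarities
    labels = torch.arange(sim.size(0), device=sim.device)   # positives lie on the diagonal
    return F.cross_entropy(sim, labels)
```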
We systematically evaluate MABEL on a com-
prehensive suite of intrinsic and extrinsic measures
spanning language modeling, text classification,
NLI, and coreference resolution. MABEL per-
forms well against existing gender debiasing ef-
forts in terms of both fairness and downstream task
performance, and it also preserves language under-
standing on the GLUE benchmark (Wang et al., 2019). Altogether, these results demonstrate the
effectiveness of harnessing NLI data for bias at-
tenuation, and underscore MABEL’s potential as a
general-purpose fairer encoder.
Lastly, we identify two major issues in existing
gender bias mitigation literature. First, many pre-
vious approaches solely quantify bias through the
Sentence Encoder Association Test (SEAT) (May et al., 2019), a metric that compares the geometric
relations between sentence representations. De-
spite scoring well on SEAT, many debiasing meth-
ods do not show the same fairness gains across
other evaluation settings. Second, previous ap-
proaches evaluate on extrinsic benchmarks in an
inconsistent manner. For a fairer comparison, we
either reproduce or summarize the performance of
many recent methodologies on major evaluation
tasks. We believe that unifying the evaluation set-
tings lays the groundwork for more meaningful
methodological comparisons in future research.
2 Background
2.1 Debiasing Contextualized Representations
Debiasing attempts in NLP can be divided into two
categories. In the first category, the model learns to
disregard the influence of sensitive attributes in rep-
resentations during fine-tuning, through projection-based (Ravfogel et al., 2020, 2022), adversarial (Han et al., 2021a,b), or contrastive (Shen et al., 2021; Chi et al., 2022) downstream objectives. This approach is task-specific, as it requires fine-tuning data that is annotated for the sensitive attribute. The second type, task-agnostic training, mitigates bias by leveraging textual information from general corpora. This can involve computing a gender subspace and eliminating it from encoded representations (Dev et al., 2020; Liang et al., 2020; Dev et al., 2021; Kaneko and Bollegala, 2021), or re-training the encoder with higher dropout (Webster et al., 2020) or with equalizing objectives (Cheng et al., 2021; Guo et al., 2022) to alleviate unwanted gender associations.
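To illustrate the subspace-removal family of methods, the sketch below estimates a one-dimensional gender direction from difference vectors of gendered word pairs and projects it out of each representation. This is a simplified rendering of the common recipe, not the algorithm of any single cited work; the word-pair list and helper names are assumptions made for the example.

```python
import numpy as np

def gender_direction(word_vecs, pairs=(("he", "she"), ("man", "woman"), ("his", "her"))):
    """Estimate a 1-D gender subspace as the top principal direction of
    difference vectors between gendered word pairs."""
    diffs = np.stack([word_vecs[a] - word_vecs[b] for a, b in pairs])
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)   # top right-singular vector
    return vt[0]

def remove_subspace(embeddings, direction):
    """Project the gender direction out of each row of `embeddings`."""
    direction = direction / np.linalg.norm(direction)
    return embeddings - np.outer(embeddings @ direction, direction)
```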
We summarize recent efforts of both task-
specific and task-agnostic approaches in Table 1.
Compared to task-specific approaches that only
debias for the task at hand, task-agnostic models
produce fair encoded representations that can be
used toward a variety of applications. MABEL
is task-agnostic, as it produces a general-purpose
debiased model. Some recent efforts have broad-
ened the scope of task-specific approaches. For in-
stance, Meade et al. (2022) adapt the task-specific
Iterative Nullspace Linear Projection (INLP) (Ravfogel et al., 2020) algorithm to rely on Wikipedia data for language model probing. While task-specific approaches can potentially be adapted to general-purpose debiasing, we primarily consider other task-agnostic approaches in this work.
2.2 Evaluating Biases in NLP
The recent surge of interest in fairer NLP systems
has surfaced a key question: how should bias be
quantified? Intrinsic metrics directly probe the up-
stream language model, whether by measuring the
geometry of the embedding space (Caliskan et al., 2017; May et al., 2019; Guo and Caliskan, 2021), or through likelihood scoring (Kurita et al., 2019; Nangia et al., 2020; Nadeem et al., 2021). Extrinsic metrics evaluate fairness by comparing the system's predictions across different populations on a downstream task (De-Arteaga et al., 2019a; Zhao et al., 2019; Dev et al., 2020). Though opaque, intrinsic metrics are fast and cheap to compute, which makes them popular among contemporary works (Meade et al., 2022; Qian et al., 2022). Comparatively, though extrinsic metrics are more interpretable and reflect tangible social harms, they are often time- and compute-intensive, and so tend to be less frequently used.2
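As one concrete example of an extrinsic measurement, occupation-classification benchmarks in the style of De-Arteaga et al. (2019a) often report the gap in true positive rates between gender groups for each profession; a minimal sketch of that computation (function name and group labels are illustrative) is:

```python
import numpy as np

def tpr_gap(y_true, y_pred, gender):
    """True-positive-rate gap between two groups for a single binary label.

    y_true, y_pred: binary arrays for one occupation label.
    gender: array of group labels, e.g. "F" / "M".
    """
    def tpr(mask):
        pos = (y_true == 1) & mask
        return (y_pred[pos] == 1).mean() if pos.any() else 0.0
    return tpr(gender == "F") - tpr(gender == "M")
```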
To date, the most popular bias metric among
task-agnostic approaches is the Sentence Encoder
Association Test (SEAT) (May et al., 2019), which
compares the relative distance between the encoded
representations. Recent studies have cast doubt on
the predictive power of these intrinsic indicators.
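For reference, SEAT applies the WEAT effect size of Caliskan et al. (2017) to sentence embeddings: it contrasts the mean cosine similarity of two target sets against two attribute sets. A minimal sketch, assuming precomputed sentence embeddings and illustrative helper names, is:

```python
import numpy as np

def _assoc(w, A, B):
    """Difference of mean cosine similarity of w to attribute sets A and B."""
    cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])

def seat_effect_size(X, Y, A, B):
    """WEAT/SEAT effect size between target sets X, Y and attribute sets A, B
    (all given as lists of sentence-embedding vectors)."""
    x_assoc = np.array([_assoc(x, A, B) for x in X])
    y_assoc = np.array([_assoc(y, A, B) for y in Y])
    pooled_std = np.std(np.concatenate([x_assoc, y_assoc]), ddof=1)
    return (x_assoc.mean() - y_assoc.mean()) / pooled_std
```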
2 As Table 17 in Appendix F indicates, many previous bias mitigation approaches limit evaluation to 1 or 2 metrics.