
tators’ gender – male (M) or female (F)² – and first
language – native (N) or non-native (NN).³ Table 1
presents a summary of dataset characteristics.
2 Dataset
We introduce MozArt, a four-way multilingual
cloze test dataset with annotator demographics.
We sampled 100 sentence quadruples from each
of the four languages (English, French, German,
Spanish) in the corpus provided for the WMT 2006
Shared Task.⁴ The data was extracted from the
publicly available Europarl corpus (Koehn, 2005)
and enhanced with word-level bitext alignments
(Koehn and Monz, 2006). The word alignments
are important for what follows. We manually
verify that sentences make sense out of context
and use the data to generate comparable cloze
examples, e.g.:
en [MASK] that deplete the ozone layer
es [MASK] que agotan la capa de ozono
de [MASK], die zum Abbau der Ozonschicht führen
fr [MASK] appauvrissant la couche d’ozone
We only mask words which are (i) aligned by one-
to-one alignments, and which are (ii) either nouns,
verbs, adjectives or adverbs.⁵ We mask one word
in each sentence and verify that one-to-one align-
ments exist in all languages. Following Kleijn et al.
(2019), we rely on part-of-speech information to
avoid masking words that are too predictable, e.g.,
auxiliary verbs or constituents of multi-word ex-
pressions, or words that are unpredictable, e.g.,
proper names and technical terms.
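To make the selection criteria concrete, the following is a minimal sketch of the candidate-selection step, assuming spaCy’s small English model and a hypothetical alignment dictionary (token index to aligned target indices); the toy sentence and alignments are invented, and the actual MozArt construction scripts may differ.

import spacy

# Content-word categories eligible for masking.
MASKABLE = {"NOUN", "VERB", "ADJ", "ADV"}

def maskable_positions(sentence, alignments, nlp):
    """Return token indices that are content words and have a
    one-to-one alignment in every target language."""
    doc = nlp(sentence)
    candidates = []
    for token in doc:
        if token.pos_ not in MASKABLE:
            continue
        links = [alignments[lang].get(token.i, []) for lang in alignments]
        if all(len(link) == 1 for link in links):
            candidates.append(token.i)
    return candidates

nlp_en = spacy.load("en_core_web_sm")
sentence = "substances that deplete the ozone layer"
# Toy one-to-one alignments into Spanish, German and French
# (English token index -> aligned target token indices).
alignments = {
    "es": {0: [0], 2: [2], 4: [5], 5: [3]},
    "de": {0: [0], 2: [4], 4: [6], 5: [5]},
    "fr": {0: [0], 2: [1], 4: [4], 5: [3]},
}
print(maskable_positions(sentence, alignments, nlp_en))  # e.g. [0, 2, 4, 5]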
Annotators were recruited using Prolific.⁶ We
applied eligibility criteria to balance our annota-
tors across demographics. Participants were asked
to report (on a voluntary basis) their demographic
information regarding gender and languages spo-
ken. Each eligible participant was presented with
10 cloze examples. We collected answers from
240 annotators, 60 per language batch, divided
into four balanced demographic groups (gender ×
native language). We made sure that each sentence
had at least six annotations. Annotation guidelines
for each language were given in that language, to
avoid bias and ensure a minimum of language
understanding for non-native speakers. We manually
filtered out spammers to ensure data quality.
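As an illustration of how the collected answers can be organised for the analyses that follow, the sketch below groups responses by demographic group and sentence and flags sentences with fewer than six annotations; the record fields are hypothetical and do not necessarily match the released file format.

from collections import Counter, defaultdict

# Hypothetical answer records, one per (annotator, cloze item).
answers = [
    {"sentence_id": "en-012", "gender": "F", "native": "N",  "answer": "substances"},
    {"sentence_id": "en-012", "gender": "M", "native": "NN", "answer": "products"},
    # ... one record per collected answer
]

# Frequency-ranked answers per (demographic group, sentence).
counts = defaultdict(Counter)
for a in answers:
    group = (a["gender"], a["native"])
    counts[(group, a["sentence_id"])][a["answer"]] += 1
ranked = {key: [w for w, _ in c.most_common()] for key, c in counts.items()}

# Flag sentences that received fewer than six annotations in total.
per_sentence = Counter(a["sentence_id"] for a in answers)
print([s for s, n in per_sentence.items() if n < 6])

The frequency-ranked lists here play the role of the group answer lists $W_s$ introduced in Section 3.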
The dataset is made publicly available at
github.com/coastalcph/mozart
under a
CC-BY-4.0 license. We include all the demographic
attributes of our annotators, as agreed upon with
them. The full list of protected attributes is found
in Table 1. We hope MozArt will become a useful
resource for the community, including for evaluating
the fairness of language models across attributes
other than gender and native language.
3 Experimental Setup
Models
We evaluate three PLMs: mBERT (Devlin
et al., 2019), XLM-RoBERTa/XLM-R (Conneau
et al., 2020), and mT5 (Xue et al., 2021).⁷
All three models were trained with a masked lan-
guage modelling objective. mBERT differs from
XLM-R and mT5 in including a next sentence pre-
diction objective (Devlin et al., 2019). mT5 differs
from mBERT and XLM-R in allowing for consec-
utive spans of input tokens to be masked (Raffel
et al., 2020). We adopt beam search decoding with
early stopping and constrain the generation to sin-
gle words, so that mT5’s output can be compared
more directly with our group preferences. t-SNE
plots are included in Appendix B to show how languages
are distributed in the PLM vector spaces.
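As a concrete illustration, the snippet below shows one way to obtain ranked candidate words from base checkpoints on the Hugging Face hub (e.g. bert-base-multilingual-uncased and google/mt5-base): a fill-mask query for mBERT, with the same pattern applying to XLM-R, and beam-search generation with early stopping for mT5. The single-word constraint described above is approximated here by post-filtering the beam outputs, which may differ from the exact decoding constraints used in our experiments.

from transformers import AutoTokenizer, MT5ForConditionalGeneration, pipeline

# mBERT: ranked candidates for the masked position.
# (XLM-R works the same way, but uses <mask> as its mask token.)
fill = pipeline("fill-mask", model="bert-base-multilingual-uncased", top_k=5)
mbert_candidates = [p["token_str"] for p in fill("[MASK] that deplete the ozone layer")]

# mT5: beam search with early stopping over the sentinel span.
tok = AutoTokenizer.from_pretrained("google/mt5-base")
mt5 = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")
inputs = tok("<extra_id_0> that deplete the ozone layer", return_tensors="pt")
beams = mt5.generate(
    **inputs,
    num_beams=5,
    num_return_sequences=5,
    early_stopping=True,
    max_new_tokens=5,
)
decoded = tok.batch_decode(beams, skip_special_tokens=True)
# Approximate the single-word constraint by keeping single-word outputs only.
mt5_candidates = [d.strip() for d in decoded if len(d.strip().split()) == 1]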
Metrics
We use several metrics to compare how the PLMs
align with group preferences across languages.
These include top-$k$ precision $P@k$ with
$k \in \{1, 5\}$, mean reciprocal rank (MRR), and
two classical univariate rank correlations:
Spearman’s $\rho$ (Spearman, 1987) and Kendall’s
$\tau$ (Kendall, 1938).
Given a set of $|S|$ cloze sentences and a group of
annotators, for each sentence $s$ we denote the list
of answers, ranked by their frequency, as
$W_s = [w_1, w_2, \dots]$, and the list of the
model’s predictions, ranked by their model
likelihood, as $C_s = [c_1, c_2, \dots]$. Then, we
report $P@k = [c_i \in W_s]$ with $i \in [1, k]$,
where $[\cdot]$ is the indicator function.
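For clarity, here is a small sketch of these metrics for a single sentence, using scipy for the rank correlations. The example word lists are invented, and applying the correlations to the ranks of words shared by $W_s$ and $C_s$ is one possible reading rather than a specification of our exact procedure.

from scipy.stats import kendalltau, spearmanr

def precision_at_k(W, C, k):
    """P@k: does any of the model's top-k candidates appear among the answers?"""
    return float(any(c in W for c in C[:k]))

def reciprocal_rank(W, C):
    """1 / rank of the first model candidate that appears in the answer list."""
    for rank, c in enumerate(C, start=1):
        if c in W:
            return 1.0 / rank
    return 0.0

# W: group answers ranked by frequency; C: model predictions ranked by likelihood.
W = ["substances", "products", "gases"]
C = ["products", "chemicals", "substances", "emissions", "gases"]

print(precision_at_k(W, C, 1), precision_at_k(W, C, 5), reciprocal_rank(W, C))

# Illustrative rank correlations over the words occurring in both lists;
# per-sentence scores would be averaged over the |S| sentences.
shared = [w for w in W if w in C]
rho, _ = spearmanr([W.index(w) for w in shared], [C.index(w) for w in shared])
tau, _ = kendalltau([W.index(w) for w in shared], [C.index(w) for w in shared])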
²None of our annotators identified as non-binary.
³See Schmitz (2016); Faez (2011) for discussion of the native/non-native speaker dichotomy. Participants were asked “What is your first language?” and “Which of the following languages are you fluent in?”. We use native (N) for people whose first language coincides with the example sentences, and non-native (NN) otherwise, without any sociocultural implications.
⁴www.statmt.org/wmt06/shared-task
⁵We use spaCy’s part-of-speech tagger (Honnibal and Montani, 2017) to predict the syntactic categories of the input words.
⁶prolific.co
⁷We use the base models available from huggingface.co/models. We report results using uncased mBERT, since it performed better on our data than its cased sibling.