Are Pretrained Multilingual Models Equally Fair Across Languages?
Laura Cabello Piqueras
University of Copenhagen
lcp@di.ku.dk
Anders Søgaard
University of Copenhagen
soegaard@di.ku.dk
Abstract

Pretrained multilingual language models can help bridge the digital language divide, enabling high-quality NLP models for lower-resourced languages. Studies of multilingual models have so far focused on performance, consistency, and cross-lingual generalisation. However, with their widespread application in the wild and downstream societal impact, it is important to put multilingual models under the same scrutiny as monolingual models. This work investigates the group fairness of multilingual models, asking whether these models are equally fair across languages. To this end, we create a new four-way multilingual dataset of parallel cloze test examples (MozArt), equipped with demographic information (balanced with regard to gender and native tongue) about the test participants. We evaluate three multilingual models on MozArt – mBERT, XLM-R, and mT5 – and show that across the four target languages, the three models exhibit different levels of group disparity, e.g., exhibiting near-equal risk for Spanish, but high levels of disparity for German.
1 Introduction
Fill-in-the-gap cloze tests ask language learners to predict which words were removed from a text; the cloze test was originally proposed as a "procedure for measuring the effectiveness of communication" (Taylor, 1953). Today, language models are trained to do the same (Devlin et al., 2019). This has the advantage that we can now use fill-in-the-gap cloze tests to directly compare the linguistic preferences of humans and language models, e.g., to investigate task-independent sociolectal biases (group disparities) in language models (Zhang et al., 2021). This paper presents a novel four-way parallel cloze dataset for English, French, German, and Spanish that enables apples-to-apples comparison of group disparities in multilingual language models across languages.¹

¹The language selection was given to us because we rely on an existing word alignment dataset; see §2.
                               EN      ES      DE      FR
WordPiece (avg. #tokens)       19.7    22.0    23.6    23.1
SentencePiece (avg. #tokens)   22.3    22.9    24.9    25.3
#Sentences                     100     100     100     100
#Annotations                   600     600     600     600
#Annotators                    60      60      60      60
Demographics                   id_u, id_s, gender, age, nationality,
                               first language, fluent languages,
                               current country of residence,
                               country of birth, time taken

Table 1: MozArt details. The average number of tokens per sentence is reported using WordPiece and SentencePiece. The bottom row lists the demographic attributes shared; id_u refers to user id (anonymised) and id_s to sentence id.
Language models induced from historical data are prone to implicit biases (Zhao et al., 2017; Chang et al., 2019; Mehrabi et al., 2021), e.g., as a result of the over-representation of male-dominated text sources such as Wikipedia and newswire (Hovy and Søgaard, 2015). This may lead to language models that are unfair to groups of users in the sense that they work better for some groups than for others (Zhang et al., 2021). Multilingual language models can be unfair to their training languages in similar ways (Choudhury and Deshpande, 2021; Wan, 2022; Wang et al., 2021), but this work goes beyond previous work in evaluating whether multilingual language models are equally fair to demographic groups across languages.
To this end, we create MozArt, a multilingual dataset of fill-in-the-gap sentences covering four languages (English, French, German and Spanish). The sentences reflect diastratic variation within each language and can be used to compare biases in pretrained language models (PLMs) across languages. We study the influence of four demographic groups, i.e., the cross-product of our annotators' gender – male (M) or female (F)² – and first language – native (N) or non-native (NN).³ Table 1 presents a summary of dataset characteristics.
2 Dataset
We introduce MozArt, a four-way multilingual cloze test dataset with annotator demographics. We sampled 100 sentence quadruples covering the four languages (English, French, German, Spanish) from the corpus provided for the WMT 2006 Shared Task.⁴ The data was extracted from the publicly available Europarl corpus (Koehn, 2005) and enhanced with word-level bitext alignments (Koehn and Monz, 2006). The word alignments are important for what follows. We manually verify that sentences make sense out of context and use the data to generate comparable cloze examples, e.g.:
en [MASK] that deplete the ozone layer
es [MASK] que agotan la capa de ozono
de [MASK], die zum Abbau der Ozonschicht führen
fr [MASK] appauvrissant la couche d’ozone
We only mask words which are (i) aligned by one-to-one alignments, and which are (ii) either nouns, verbs, adjectives or adverbs.⁵ We mask one word in each sentence and verify that one-to-one alignments exist in all languages. Following Kleijn et al. (2019), we rely on part-of-speech information to avoid masking words that are too predictable, e.g., auxiliary verbs or constituents of multi-word expressions, or words that are unpredictable, e.g., proper names and technical terms.
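
To make the selection criteria concrete, the sketch below shows one way to identify mask candidates with spaCy's part-of-speech tagger (cf. footnote 5). It is a minimal illustration under our own assumptions, not the authors' released code: the multi-word-expression filter is a rough dependency-label heuristic, and the one-to-one alignment check is a hypothetical callable standing in for the WMT 2006 word alignments.

import spacy

# One spaCy pipeline per target language in practice; English shown here.
nlp = spacy.load("en_core_web_sm")

CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}   # criterion (ii): content words only
MWE_DEPS = {"compound", "fixed", "flat"}       # rough multi-word-expression filter

def mask_candidates(sentence, aligned_one_to_one):
    """Return indices of tokens eligible for masking.

    `aligned_one_to_one(i)` is a hypothetical callable implementing criterion
    (i), i.e., that token i has a one-to-one alignment in all four languages;
    its implementation is assumed, not shown.
    """
    doc = nlp(sentence)
    candidates = []
    for tok in doc:
        if tok.pos_ not in CONTENT_POS:        # excludes AUX, PROPN, etc.
            continue
        if tok.dep_ in MWE_DEPS:               # skip parts of multi-word units
            continue
        if not aligned_one_to_one(tok.i):      # criterion (i)
            continue
        candidates.append(tok.i)
    return candidates

# Toy usage on an invented sentence, pretending every token is aligned.
print(mask_candidates("The committee approved the new proposal yesterday",
                      aligned_one_to_one=lambda i: True))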
Annotators were recruited using Prolific.⁶ We applied eligibility criteria to balance our annotators across demographics. Participants were asked to report (on a voluntary basis) their demographic information regarding gender and languages spoken. Each eligible participant was presented with 10 cloze examples. We collected answers from 240 annotators, 60 per language batch, divided into four balanced demographic groups (gender × native language); with 10 answers per annotator, this yields the 600 annotations per language reported in Table 1. We made sure that each sentence had at least six annotations. Annotation guidelines for each language were given in that language, to avoid bias and ensure a minimum of language understanding for non-native speakers. We manually filtered out spammers to ensure data quality.

²None of our annotators identified as non-binary.
³See Schmitz (2016) and Faez (2011) for a discussion of the native/non-native speaker dichotomy. Participants were asked "What is your first language?" and "Which of the following languages are you fluent in?". We use native (N) for people whose first language coincides with the example sentences, and non-native (NN) otherwise, without any sociocultural implications.
⁴www.statmt.org/wmt06/shared-task
⁵We use spaCy's part-of-speech tagger (Honnibal and Montani, 2017) to predict the syntactic categories of the input words.
⁶prolific.co
The dataset is made publicly available at github.com/coastalcph/mozart under a CC-BY-4.0 license. We include all the demographic attributes of our annotators as per agreement with the annotators. The full list of protected attributes is found in Table 1. We hope MozArt will become a useful resource for the community, also for evaluating the fairness of language models across attributes other than gender and native language.
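
To give a concrete picture of what a single release entry contains, the snippet below sketches one annotation record combining the sentence and user ids, the filled-in answer, and the demographic attributes listed in Table 1. All field names, the file format, and the values are hypothetical illustrations; the repository documents the actual schema.

# Hypothetical MozArt annotation record (field names, format and values are
# invented for illustration; see the repository for the actual schema).
record = {
    "id_s": "es_042",                           # sentence id
    "id_u": "u_0137",                           # anonymised user id
    "answer": "propuesta",                      # word the annotator filled in
    "gender": "F",
    "age": 29,
    "nationality": "Mexican",
    "first_language": "Spanish",                # -> native (N) for Spanish items
    "fluent_languages": ["Spanish", "English"],
    "current_country_of_residence": "Spain",
    "country_of_birth": "Mexico",
    "time_taken": 412,                          # assumed to be seconds
}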
3 Experimental Setup
Models
We evaluate three PLMs: mBERT (Devlin et al., 2019), XLM-RoBERTa/XLM-R (Conneau et al., 2020), and mT5 (Xue et al., 2021).⁷ All three models were trained with a masked language modelling objective. mBERT differs from XLM-R and mT5 in including a next sentence prediction objective (Devlin et al., 2019). mT5 differs from mBERT and XLM-R in allowing for consecutive spans of input tokens to be masked (Raffel et al., 2020). We adopt beam search decoding with early stopping and constrain the generation to single words; this makes it possible to correlate mT5's output with our group preferences. t-SNE plots are included in Appendix B to show how languages are distributed in the PLM vector spaces.
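
As an illustration of how such cloze predictions can be obtained, the sketch below queries uncased mBERT through the Hugging Face fill-mask pipeline and mT5 with beam search and early stopping, in the spirit of the setup described above. It is a minimal sketch, not the authors' decoding code: the model names are the public base checkpoints, and the mT5 length cap is only a crude stand-in for the single-word constraint.

from transformers import AutoTokenizer, MT5ForConditionalGeneration, pipeline

# mBERT (and XLM-R) expose a masked-LM head; the fill-mask pipeline returns
# the top-k candidates for the [MASK] position with their probabilities.
fill_mask = pipeline("fill-mask", model="bert-base-multilingual-uncased")
for pred in fill_mask("[MASK] that deplete the ozone layer", top_k=5):
    print(pred["token_str"], round(pred["score"], 3))

# mT5 marks gaps with sentinel tokens instead of [MASK]; we decode with beam
# search and early stopping, capping the output length as a rough proxy for
# the single-word constraint mentioned above.
tok = AutoTokenizer.from_pretrained("google/mt5-base")
mt5 = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")
inputs = tok("<extra_id_0> that deplete the ozone layer", return_tensors="pt")
outputs = mt5.generate(**inputs, num_beams=5, num_return_sequences=5,
                       early_stopping=True, max_new_tokens=4)
print(tok.batch_decode(outputs, skip_special_tokens=True))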
Metrics
We use several metrics to compare how the PLMs align with group preferences across languages. These include top-k precision P@k with k = {1, 5}, mean reciprocal rank (MRR), and two classical univariate rank correlations: Spearman's ρ (Spearman, 1987) and Kendall's τ (Kendall, 1938). Given a set of |S| cloze sentences and a group of annotators, for each sentence s we denote the list of answers, ranked by their frequency, as W_s = [w_1, w_2, ...], and the list of the model's predictions, ranked by their model likelihood, as C_s = [c_1, c_2, ...]. Then, we report P@k = [c_i ∈ W_s] with i ∈ [1, k], where [·] is the indicator function.
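
To make the per-sentence computation concrete, the sketch below scores one sentence with these metrics. It is a minimal sketch under our own assumptions, not the authors' evaluation code: P@k is read as the indicator defined above, and the rank correlations are computed with scipy over the words that occur in both rankings, which is only one reasonable convention for handling non-overlapping items.

from scipy.stats import kendalltau, spearmanr

def precision_at_k(C_s, W_s, k):
    """Indicator reading of P@k: 1 if any of the top-k predictions occurs
    among the annotators' answers, else 0."""
    return int(any(c in W_s for c in C_s[:k]))

def mrr(C_s, W_s):
    """Reciprocal rank of the first prediction found among the answers."""
    for rank, c in enumerate(C_s, start=1):
        if c in W_s:
            return 1.0 / rank
    return 0.0

def rank_correlations(C_s, W_s):
    """Spearman's rho and Kendall's tau over words present in both rankings
    (an assumed convention; items missing from either list are ignored)."""
    shared = [w for w in W_s if w in C_s]
    if len(shared) < 2:
        return None, None
    annotator_ranks = [W_s.index(w) for w in shared]
    model_ranks = [C_s.index(w) for w in shared]
    return spearmanr(annotator_ranks, model_ranks)[0], \
           kendalltau(annotator_ranks, model_ranks)[0]

# Toy usage with invented answer and prediction lists for one sentence.
W_s = ["law", "rule", "act"]                        # annotator answers by frequency
C_s = ["rule", "law", "policy", "act", "measure"]   # model predictions by likelihood
print(precision_at_k(C_s, W_s, 1), precision_at_k(C_s, W_s, 5),
      mrr(C_s, W_s), rank_correlations(C_s, W_s))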
⁷We use the base models available from huggingface.co/models. We report results using uncased mBERT, since it performed better on our data than its cased sibling.