
tators’ gender – male (M) or female (F)² – and first
language – native (N) or non-native (NN).³ Table 1
presents a summary of dataset characteristics.
2 Dataset
We introduce MozArt, a four-way multilingual
cloze test dataset with annotator demographics.
We sampled 100 sentence quadruples from each
of the four languages (English, French, German,
Spanish) in the corpus provided for the WMT 2006
Shared Task.⁴ The data was extracted from the
publicly available Europarl corpus (Koehn, 2005)
and enhanced with word-level bitext alignments
(Koehn and Monz, 2006). The word alignments
are important for what follows. We manually
verify that sentences make sense out of context
and use the data to generate comparable cloze
examples, e.g.:
en [MASK] that deplete the ozone layer
es [MASK] que agotan la capa de ozono
de [MASK], die zum Abbau der Ozonschicht führen
fr [MASK] appauvrissant la couche d’ozone
We only mask words which are (i) aligned by one-
to-one alignments, and which are (ii) either nouns,
verbs, adjectives or adverbs.⁵ We mask one word
in each sentence and verify that one-to-one align-
ments exist in all languages. Following Kleijn et al.
(2019), we rely on part-of-speech information to
avoid masking words that are too predictable, e.g.,
auxiliary verbs or constituents of multi-word ex-
pressions, or words that are unpredictable, e.g.,
proper names and technical terms.
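To make the selection criteria concrete, the following is a minimal sketch of the candidate-selection step, assuming spaCy’s small English model and a hypothetical alignment dictionary (token index to aligned target indices); the toy sentence and alignments are invented, and the actual MozArt construction scripts may differ.

import spacy

# Content-word categories eligible for masking.
MASKABLE = {"NOUN", "VERB", "ADJ", "ADV"}

def maskable_positions(sentence, alignments, nlp):
    """Return token indices that are content words and have a
    one-to-one alignment in every target language."""
    doc = nlp(sentence)
    candidates = []
    for token in doc:
        if token.pos_ not in MASKABLE:
            continue
        links = [alignments[lang].get(token.i, []) for lang in alignments]
        if all(len(link) == 1 for link in links):
            candidates.append(token.i)
    return candidates

nlp_en = spacy.load("en_core_web_sm")
sentence = "substances that deplete the ozone layer"
# Toy one-to-one alignments into Spanish, German and French
# (English token index -> aligned target token indices).
alignments = {
    "es": {0: [0], 2: [2], 4: [5], 5: [3]},
    "de": {0: [0], 2: [4], 4: [6], 5: [5]},
    "fr": {0: [0], 2: [1], 4: [4], 5: [3]},
}
print(maskable_positions(sentence, alignments, nlp_en))  # e.g. [0, 2, 4, 5]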
Annotators were recruited using Prolific.⁶ We
applied eligibility criteria to balance our annota-
tors across demographics. Participants were asked
to report (on a voluntary basis) their demographic
information regarding gender and languages spo-
ken. Each eligible participant was presented with
10 cloze examples. We collected answers from
240 annotators, 60 per language batch, divided
into four balanced demographic groups (gender ×
native language). We made sure that each sentence
had at least six annotations. Annotation guidelines
for each language were given in that language, to
avoid bias and ensure a minimum of language
understanding for non-native speakers. We manually
filtered out spammers to ensure data quality.
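As an illustration of how the collected answers can be organised for the analyses that follow, the sketch below groups responses by demographic group and sentence and flags sentences with fewer than six annotations; the record fields are hypothetical and do not necessarily match the released file format.

from collections import Counter, defaultdict

# Hypothetical answer records, one per (annotator, cloze item).
answers = [
    {"sentence_id": "en-012", "gender": "F", "native": "N",  "answer": "substances"},
    {"sentence_id": "en-012", "gender": "M", "native": "NN", "answer": "products"},
    # ... one record per collected answer
]

# Frequency-ranked answers per (demographic group, sentence).
counts = defaultdict(Counter)
for a in answers:
    group = (a["gender"], a["native"])
    counts[(group, a["sentence_id"])][a["answer"]] += 1
ranked = {key: [w for w, _ in c.most_common()] for key, c in counts.items()}

# Flag sentences that received fewer than six annotations in total.
per_sentence = Counter(a["sentence_id"] for a in answers)
print([s for s, n in per_sentence.items() if n < 6])

The frequency-ranked lists here play the role of the group answer lists $W_s$ introduced in Section 3.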
The dataset is made publicly available at
github.com/coastalcph/mozart
under a
CC-BY-4.0 license. We include all the demographic
attributes of our annotators, as agreed upon with
them. The full list of protected attributes is found
in Table 1. We hope MozArt will become a useful
resource for the community, including for evaluating
the fairness of language models across attributes
other than gender and native language.
3 Experimental Setup
Models
We evaluate three PLMs: mBERT (Devlin
et al., 2019), XLM-RoBERTa/XLM-R (Conneau
et al., 2020), and mT5 (Xue et al., 2021).⁷
All three models were trained with a masked lan-
guage modelling objective. mBERT differs from
XLM-R and mT5 in including a next sentence pre-
diction objective (Devlin et al., 2019). mT5 differs
from mBERT and XLM-R in allowing for consec-
utive spans of input tokens to be masked (Raffel
et al., 2020). We adopt beam search decoding with
early stopping and constrain the generation to sin-
gle words, so that mT5’s output can be compared
more directly with our group preferences. t-SNE
plots are included in Appendix B to show how languages
are distributed in the PLM vector spaces.
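As a concrete illustration, the snippet below shows one way to obtain ranked candidate words from base checkpoints on the Hugging Face hub (e.g. bert-base-multilingual-uncased and google/mt5-base): a fill-mask query for mBERT, with the same pattern applying to XLM-R, and beam-search generation with early stopping for mT5. The single-word constraint described above is approximated here by post-filtering the beam outputs, which may differ from the exact decoding constraints used in our experiments.

from transformers import AutoTokenizer, MT5ForConditionalGeneration, pipeline

# mBERT: ranked candidates for the masked position.
# (XLM-R works the same way, but uses <mask> as its mask token.)
fill = pipeline("fill-mask", model="bert-base-multilingual-uncased", top_k=5)
mbert_candidates = [p["token_str"] for p in fill("[MASK] that deplete the ozone layer")]

# mT5: beam search with early stopping over the sentinel span.
tok = AutoTokenizer.from_pretrained("google/mt5-base")
mt5 = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")
inputs = tok("<extra_id_0> that deplete the ozone layer", return_tensors="pt")
beams = mt5.generate(
    **inputs,
    num_beams=5,
    num_return_sequences=5,
    early_stopping=True,
    max_new_tokens=5,
)
decoded = tok.batch_decode(beams, skip_special_tokens=True)
# Approximate the single-word constraint by keeping single-word outputs only.
mt5_candidates = [d.strip() for d in decoded if len(d.strip().split()) == 1]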
Metrics
We use several metrics to compare how the PLMs
align with group preferences across languages.
These include top-$k$ precision $P@k$ with
$k \in \{1, 5\}$, mean reciprocal rank (MRR), and
two classical univariate rank correlations:
Spearman’s $\rho$ (Spearman, 1987) and Kendall’s
$\tau$ (Kendall, 1938).
Given a set of $|S|$ cloze sentences and a group of
annotators, for each sentence $s$ we denote the list
of answers, ranked by their frequency, as
$W_s = [w_1, w_2, \dots]$, and the list of the
model’s predictions, ranked by their model
likelihood, as $C_s = [c_1, c_2, \dots]$. Then, we
report $P@k = [c_i \in W_s]$ with $i \in [1, k]$,
where $[\cdot]$ is the indicator function.
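For clarity, here is a small sketch of these metrics for a single sentence, using scipy for the rank correlations. The example word lists are invented, and applying the correlations to the ranks of words shared by $W_s$ and $C_s$ is one possible reading rather than a specification of our exact procedure.

from scipy.stats import kendalltau, spearmanr

def precision_at_k(W, C, k):
    """P@k: does any of the model's top-k candidates appear among the answers?"""
    return float(any(c in W for c in C[:k]))

def reciprocal_rank(W, C):
    """1 / rank of the first model candidate that appears in the answer list."""
    for rank, c in enumerate(C, start=1):
        if c in W:
            return 1.0 / rank
    return 0.0

# W: group answers ranked by frequency; C: model predictions ranked by likelihood.
W = ["substances", "products", "gases"]
C = ["products", "chemicals", "substances", "emissions", "gases"]

print(precision_at_k(W, C, 1), precision_at_k(W, C, 5), reciprocal_rank(W, C))

# Illustrative rank correlations over the words occurring in both lists;
# per-sentence scores would be averaged over the |S| sentences.
shared = [w for w in W if w in C]
rho, _ = spearmanr([W.index(w) for w in shared], [C.index(w) for w in shared])
tau, _ = kendalltau([W.index(w) for w in shared], [C.index(w) for w in shared])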
²None of our annotators identified as non-binary.
³See Schmitz (2016); Faez (2011) for discussion of the native/non-native speaker dichotomy. Participants were asked “What is your first language?” and “Which of the following languages are you fluent in?”. We use native (N) for people whose first language coincides with the example sentences, and non-native (NN) otherwise, without any sociocultural implications.
⁴www.statmt.org/wmt06/shared-task
⁵We use spaCy’s part-of-speech tagger (Honnibal and Montani, 2017) to predict the syntactic categories of the input words.
⁶prolific.co
⁷We use the base models available from huggingface.co/models. We report results using uncased mBERT, since it performed better on our data than its cased sibling.