
multilingual pre-trained models (and improving them) remains unexplored. Hence, we investigate: (1) are multilingual models robust to noise seen in different languages (which may be dissimilar to noise types seen in English)? (2) can we collect and leverage multilingual noise data to improve multilingual models? and (3) do automatic data-augmentation methods designed for English improve robustness to multilingual noise?
To boost the robustness of multilingual models
to diverse multilingual noise, we leverage multilin-
gual data augmentation at the pretraining stage and
use contrastive learning. Our effort complements work in computer vision showing that combining contrastive learning with adversarial learning at task-training (Fan et al., 2021; Ghosh and Lan, 2021) and pre-training time (Jiang et al., 2020; Kim et al., 2020) can improve model robustness. NLP has also seen a plethora of work that leverages contrastive learning, but seldom to alleviate robustness concerns (Jaiswal et al., 2020). Similar concepts, such as Adversarial Logit Pairing (Einolghozati et al., 2019), used at task-training time have proven to be less effective than data-augmentation approaches (Sengupta et al., 2021) in boosting robustness.
All the aforementioned works lack at least one of the two novel aspects of this paper: robustness to real-world (as opposed to adversarial) noise, and/or multilinguality. Lastly, cross-lingual knowledge transfer has been studied in the context of different NLP tasks, typically from a high-resource language to a low-resource one, as exemplified by the XTREME benchmark (Hu et al., 2020). In this paper, we investigate the cross-lingual transferability of robustness to real-world noise.
3 Constructing Noisy Test Data
As no benchmarks exist to evaluate the robustness of multilingual models, we construct noisy test sets in multiple languages for four tasks. First, we construct a word-level error-correction dictionary by leveraging the Wikipedia edit corpora. Then, we sample replacements from this dictionary and inject them into the test data for the various multilingual tasks, focusing on replacements that affect only individual words and do not change word order. Finally, we conduct a human evaluation to filter out test sets that language experts do not deem realistic.
3.1 Wiki-edit Mining
Wikipedia² is a public encyclopedia available in multiple languages. Wikipedia editors create and iteratively edit its contents. We leverage these edits to construct error-correction word dictionaries (later used to create noisy test data). Our approach to mining edits is similar to that of Tanaka et al. (2020), but we consider multiple languages (as opposed to only Japanese) and additionally create dictionaries of word-level edits.
To isolate likely useful edits, we first consider each revision page of an article and split it into a list of sentences using NLTK (Bird et al., 2009). Second, from two consecutive edit versions, we retain sentence pairs in which both sentences have (1) 2-120 tokens, (2) a difference of fewer than 5 tokens, and (3) a relative edit distance within 30% of the length of the shorter sentence. Third, we leverage language-specific tokenizers and difflib³ to extract exact token-level deltas between the sentence pair. Finally, we ensure that the word pairs (in these deltas) have a character-level Levenshtein edit distance of at least one from each other⁴ and that neither word consists only of numbers or punctuation tokens. Note that some Wikipedia edits change factual information, such as dates, rather than correct spelling or grammar; thus, the last step is necessary.
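The sketch below illustrates these filtering and delta-extraction steps in Python. The helper names (keep_sentence_pair, word_level_edits), the use of NLTK's generic word_tokenize and edit_distance (the paper uses language-specific tokenizers), and the reading of criterion (2) as a difference in token counts are our own assumptions rather than the paper's implementation.

import difflib
import string

from nltk import edit_distance, word_tokenize


def keep_sentence_pair(old_sent, new_sent):
    # Criteria (1)-(3): token lengths, token-count difference, relative edit distance.
    old_toks, new_toks = word_tokenize(old_sent), word_tokenize(new_sent)
    if not (2 <= len(old_toks) <= 120 and 2 <= len(new_toks) <= 120):
        return False
    if abs(len(old_toks) - len(new_toks)) >= 5:
        return False
    shorter = min(len(old_sent), len(new_sent))
    return edit_distance(old_sent, new_sent) <= 0.3 * shorter


def word_level_edits(old_sent, new_sent):
    # Extract word-for-word substitutions (no reordering) via difflib opcodes.
    old_toks, new_toks = word_tokenize(old_sent), word_tokenize(new_sent)
    pairs = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=old_toks, b=new_toks).get_opcodes():
        if op != "replace" or (i2 - i1) != (j2 - j1):
            continue
        for old_w, new_w in zip(old_toks[i1:i2], new_toks[j1:j2]):
            if edit_distance(old_w, new_w) < 1:
                continue  # require at least one character-level edit
            if any(all(c.isdigit() or c in string.punctuation for c in w)
                   for w in (old_w, new_w)):
                continue  # drop number- or punctuation-only edits (e.g., date changes)
            # Assumption: the newer revision corrects the older one.
            pairs.append((new_w, old_w))  # (correct, incorrect)
    return pairs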
Finally, we create a noise dictionary mapping correct words to incorrect ones, with frequency information about the different errors. For example, an element of the dictionary (in Spanish) looks like {de: [(del, 0.52), (se, 0.32), (do, 0.1), (dë, 0.04), (en, 0.02)]}.
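A small sketch of how such a dictionary could be assembled from the mined (correct, incorrect) pairs, with counts normalized into relative frequencies; build_noise_dictionary and mined_pairs are hypothetical names, not the paper's code.

from collections import Counter, defaultdict


def build_noise_dictionary(mined_pairs):
    # mined_pairs: iterable of (correct_word, incorrect_word) tuples from the mining step.
    counts = defaultdict(Counter)
    for correct, incorrect in mined_pairs:
        counts[correct][incorrect] += 1
    noise_dict = {}
    for correct, errors in counts.items():
        total = sum(errors.values())
        # Store each observed error with its relative frequency,
        # e.g. {"de": [("del", 0.52), ("se", 0.32), ...]}.
        noise_dict[correct] = [(word, count / total) for word, count in errors.most_common()]
    return noise_dict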
3.2 Injecting Noise into Test Sets
We use the noise dictionaries to create a noised version of the original test data for the four tasks: MultiATIS++ (Xu et al., 2020), MultiSNIPS, WikiANN (Pan et al., 2017), and XNLI (Conneau et al., 2018). After tokenization, we sample tokens randomly without replacement. In each sampling step, we sample based on a uniform probability distribution over the individual tokens and then check if the token exists in the noise dictionary. If so, we replace it with a noised version from the dictionary.
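A minimal sketch of this injection step for a single tokenized example is shown below. The noise budget max_replacements and drawing the incorrect form in proportion to its mined frequency are assumptions for illustration; the paper only specifies uniform sampling of candidate tokens without replacement.

import random


def inject_noise(tokens, noise_dict, max_replacements=1, rng=random):
    # Uniformly sample token positions without replacement and replace words
    # found in the noise dictionary with one of their mined incorrect forms.
    noised = list(tokens)
    replaced = 0
    for idx in rng.sample(range(len(tokens)), k=len(tokens)):
        if replaced >= max_replacements:
            break
        word = noised[idx]
        if word not in noise_dict:
            continue
        variants, weights = zip(*noise_dict[word])
        # Assumption: pick the noisy variant proportionally to its observed frequency.
        noised[idx] = rng.choices(variants, weights=weights, k=1)[0]
        replaced += 1
    return noised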
² https://meta.wikimedia.org/wiki/List_of_Wikipedias
³ https://docs.python.org/3/library/difflib.html
⁴ For Chinese characters, including Kanji, even a single character distance could imply a different word.