Robustification of Multilingual Language Models to Real-world Noise in
Crosslingual Zero-shot Settings with Robust Contrastive Pretraining

Asa Cooper Stickland*¶†, Sailik Sengupta*‡, Jason Krone¶‡, He He‡♦, Saab Mansour‡
†University of Edinburgh, ‡AWS AI Labs, ♦New York University
a.cooper.stickland@ed.ac.uk, {sailiks,saabm,hehea}@amazon.com
*Equal contribution. ¶Work done while at Amazon.
Abstract
Advances in neural modeling have achieved state-of-the-art (SOTA) results on public natural language processing (NLP) benchmarks, at times surpassing human performance. However, there is a gap between public benchmarks and real-world applications, where noise, such as typographical or grammatical mistakes, is abundant and can result in degraded performance. Unfortunately, works that evaluate the robustness of neural models on noisy data and propose improvements are limited to the English language. Upon analyzing noise in different languages, we observe that noise types vary greatly across languages. Thus, existing investigations do not generalize trivially to multilingual settings. To benchmark the performance of pretrained multilingual language models, we construct noisy datasets covering five languages and four NLP tasks, and observe a clear gap between performance on clean and noisy data in the zero-shot cross-lingual setting. After investigating several ways to boost the robustness of multilingual models in this setting, we propose Robust Contrastive Pretraining (RCP). RCP combines data augmentation with a contrastive loss term at the pretraining stage and achieves large improvements on noisy (and original) test data across two sentence-level (+3.2%) and two sequence-labeling (+10 F1-score) multilingual classification tasks.
1 Introduction
Recently, multilingual pre-trained language models like mBERT (Devlin et al., 2019), XLM-R (Conneau et al., 2020) and various others (Chi et al., 2021; Xue et al., 2021; Chi et al., 2022) have improved multilingual language understanding by pretraining large Transformer models on web-scale corpora (such as Wikipedia and CommonCrawl). These models achieve state-of-the-art performance on cross-lingual transfer and many multilingual NLP tasks (Wu and Dredze, 2019; Pires et al., 2019). However, a real-world system will encounter real-world noise, such as linguistic variations and common errors observed in textual data, that are often absent from benchmark datasets.
While prior works focused on this issue of robustness in monolingual settings (Peng et al., 2021; Sengupta et al., 2021; Tan et al., 2020), investigation has been scarce for multilingual settings. In this paper, we study the effect of realistic noise in multilingual settings and propose methods to boost the robustness of multilingual language models across four NLP tasks: Intent Classification (IC), Slot Labeling (SL), Named Entity Recognition (NER), and Natural Language Inference (NLI). Due to the lack of multilingual noisy evaluation data, we synthesize benchmarks by mining noise from publicly available corpora and injecting it into the test sets associated with each of the four tasks. We conduct human validation to ensure that this noised data is indeed realistic (see examples from MultiATIS++ in Figure 1) and identify the variety of noise types seen across languages (§3). These analyses highlight the potential of our test set for evaluating (and motivating future research on) multilingual robustness.
To benchmark the performance of multilingual systems, we consider accuracy metrics on two utterance-level tasks (IC% and NLI%) and F1 scores on two token-level classification tasks (SL-F1 and NER-F1). Specifically, we seek to evaluate the model's performance on the noised version of the test datasets in a zero-shot cross-lingual setting. In this scenario, we have training data for a task available only in one language (in our case, English) and test data in various languages (Liu et al., 2019, 2020).
Language       Noise Injection Ratio   Realistic Utt. %
French (fr)    0.1                     95.4%
German (de)    0.2                     94.5%
Spanish (es)   0.1                     96.9%
Hindi (hi)     0.05                    95.4%
Japanese (jp)  0.1                     92.3%
Chinese (zh)   0.1                     86.2%

[The figure additionally shows, for each language, example noised test-set utterances judged realistic and unrealistic by the human experts, with noised tokens highlighted.]

Figure 1: MultiATIS++ test set injected with real-world noise mined from Wikipedia edits. The highest error injection ratio found to be realistic by human experts is shown alongside the realistic utterance percentage. We do not include the noisy test sets for Chinese and Japanese in our analysis owing to low (<95%) realism.

While training data augmentation increases model robustness for monolingual (i.e., English) settings, it is not immediately obvious whether these robustness gains can transfer across languages, as error types can often be language-specific. For example, typos in Devanagari script can differ from those seen in Latin scripts (e.g., a noised form of the Devanagari word for 'school' in which a joined (conjunct) character is incorrectly separated into two characters).
Thus, to improve the robustness of pretrained multilingual models to noise across all languages, we propose Robust Contrastive Pretraining (RCP), which couples multilingual noisy data augmentation with a contrastive learning loss term during pretraining; this encourages the model to develop similar representations for the original and noised versions of a sentence.
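To make the contrastive term concrete, below is a minimal PyTorch sketch of one way such a loss can be written: an InfoNCE-style objective over a batch of (original, noised) sentence embeddings, treating each noised sentence's clean counterpart as the positive and the other in-batch sentences as negatives. The function name, the temperature value, and the use of in-batch negatives are illustrative assumptions on our part, not necessarily the exact RCP objective.

import torch
import torch.nn.functional as F

def noisy_contrastive_loss(clean_emb, noised_emb, temperature=0.1):
    """InfoNCE-style loss that pulls each noised sentence embedding
    toward its clean counterpart and pushes it away from the other
    in-batch sentences. clean_emb, noised_emb: (batch, dim)."""
    clean = F.normalize(clean_emb, dim=-1)
    noised = F.normalize(noised_emb, dim=-1)
    # (batch, batch) cosine-similarity logits; the diagonal holds positives.
    logits = noised @ clean.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

During pretraining, a term like this would be added to the usual masked language modeling loss, with clean_emb and noised_emb produced by encoding a sentence and its noised variant with the same model.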
On the noisy test sets, our method improves multilingual model performance across all metrics and tasks: IC% by 4.9% on MultiATIS++ and 4.1% on MultiSNIPS; SL-F1 by 18.4 on MultiATIS++ and 8.6 on MultiSNIPS; NER-F1 by 2.9 on WikiANN; and NLI% by 0.7% on XNLI. In summary, our primary contributions are:
1. We construct multilingual test data to evaluate the robustness of NLP models to noise (§3).
2. We show that the performance of existing multilingual language models deteriorates on four tasks when tested on the noisy test data (§5.1).
3. We introduce Robust Contrastive Pretraining (RCP) to boost the robustness of existing multilingual language models (§5.2).
Our code and data are available on GitHub (repo: amazon-science/multilingual-robust-contrastive-pretraining).
2 Related Work
Many prior works demonstrate the brittleness of neural models under different noise phenomena such as misspellings (Belinkov and Bisk, 2017; Karpukhin et al., 2019; Moradi et al., 2021), casing variation (van Miltenburg et al., 2020), paraphrases (Einolghozati et al., 2019), morphological variance (Tan et al., 2020), synonyms (Sengupta et al., 2021), and dialectal variance (Sarkar et al., 2022). A popular approach to improving robustness to noise is fine-tuning models with data augmentation (Feng et al., 2021) at either the pretraining (Tan et al., 2020; Sarkar et al., 2022) or the task-training stage (Peng et al., 2021). These works consider monolingual pre-trained models and primarily focus on English. While recent work on token-free models motivates robustness in multilingual settings (Clark et al., 2021; Xue et al., 2022; Tay et al., 2021), examining the robustness of SOTA multilingual pre-trained models (and improving them) remains unexplored. Hence, we investigate: (1) are multilingual models robust to noise seen in different languages (which may be dissimilar to noise types seen in English)? (2) can we obtain and leverage multilingual noise data to improve multilingual models? and (3) do automatic data-augmentation methods designed for English improve robustness to multilingual noise?
To boost the robustness of multilingual models to diverse multilingual noise, we leverage multilingual data augmentation at the pretraining stage together with contrastive learning. Our effort complements work in computer vision showing that contrastive learning combined with adversarial learning at task-training (Fan et al., 2021; Ghosh and Lan, 2021) and pretraining time (Jiang et al., 2020; Kim et al., 2020) can improve model robustness. NLP has also seen a plethora of work that leverages contrastive learning, but seldom to alleviate robustness concerns (Jaiswal et al., 2020). Similar concepts, such as Adversarial Logit Pairing (Einolghozati et al., 2019), used at task-training time have proven less effective than data augmentation approaches (Sengupta et al., 2021) in boosting robustness.
All the aforementioned works lack at least one of the two novel aspects of this paper: robustness to real-world (as opposed to adversarial) noise, and/or multilinguality. Lastly, cross-lingual knowledge transfer has been studied in the context of different NLP tasks, typically from a high-resource language to a low-resource one, as exemplified by the XTREME benchmark (Hu et al., 2020). In this paper, we investigate the cross-lingual transferability of robustness to real-world noise.
3 Constructing Noisy Test Data
Since no existing benchmarks evaluate the robustness of multilingual models to noise, we construct noisy test sets in multiple languages for four tasks. First, we construct a word-level error-correction dictionary by leveraging Wikipedia edit corpora. Then, we sample replacements from this dictionary and inject them into the test data for the various multilingual tasks, focusing on replacements that affect individual words without changing word order. Finally, we conduct human evaluation to filter out test sets that language experts do not deem realistic.
3.1 Wiki-edit Mining
Wikipedia² is a public encyclopedia available in multiple languages. Wikipedia editors create and iteratively edit its contents. We leverage these edits to construct error-correction word dictionaries (later used to create noisy test data). Our approach to mining edits is similar to Tanaka et al. (2020), but we consider multiple languages (as opposed to only Japanese) and additionally create dictionaries of word-level edits.
To isolate likely useful edits, we first consider each revision page of an article and split it into a list of sentences using NLTK (Bird et al., 2009). Second, we filter sentence pairs from two consecutive edit versions, ensuring that both sentences have (1) 2-120 tokens, (2) a difference of <5 tokens, and (3) a relative edit distance within 30% of the shorter sentence. Third, we leverage language-specific tokenizers and difflib³ to extract exact token-level deltas between the sentences in a pair. Finally, we ensure that the word pairs (in these deltas) are at least one character-level Levenshtein edit distance apart⁴ and that neither word consists only of numbers or punctuation tokens. Note that many Wikipedia edits involve changes to factual information, such as dates, rather than corrections of spelling or grammar; thus, the last step is necessary.
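As an illustration, the Python sketch below implements the sentence-pair filter and the difflib-based delta extraction described above, using the thresholds stated in this section; the function names and the exact opcode handling are our assumptions, and the released pipeline may differ.

import difflib

def keep_sentence_pair(old_tokens, new_tokens):
    """Filter from Section 3.1: both sentences have 2-120 tokens,
    differ by fewer than 5 tokens, and the token-level edit distance
    is within 30% of the shorter sentence."""
    if not (2 <= len(old_tokens) <= 120 and 2 <= len(new_tokens) <= 120):
        return False
    sm = difflib.SequenceMatcher(a=old_tokens, b=new_tokens)
    # Token-level edit distance: count tokens covered by non-equal opcodes.
    edits = sum(max(i2 - i1, j2 - j1)
                for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal")
    if not (0 < edits < 5):
        return False
    return edits / min(len(old_tokens), len(new_tokens)) <= 0.30

def word_level_deltas(old_tokens, new_tokens):
    """Yield (correct, incorrect) word pairs from same-length replaced
    spans, treating the newer revision as the corrected text."""
    sm = difflib.SequenceMatcher(a=old_tokens, b=new_tokens)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "replace" and (i2 - i1) == (j2 - j1):
            yield from zip(new_tokens[j1:j2], old_tokens[i1:i2])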
We can finally create a noise dictionary of correct-to-incorrect words that carries frequency information about the different errors. For example, an element of the (Spanish) dictionary looks like {de: [(del, 0.52), (se, 0.32), (do, 0.1), (dë, 0.04), (en, 0.02)]}.
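Such a dictionary could, for instance, be assembled from the mined (correct, incorrect) pairs as in the sketch below; normalizing raw counts into probabilities is our assumption, as the paper only states that frequency information is stored.

from collections import Counter, defaultdict

def build_noise_dict(word_pairs):
    """word_pairs: iterable of (correct, incorrect) tokens mined from
    consecutive Wikipedia revisions (see the sketch in Section 3.1)."""
    counts = defaultdict(Counter)
    for correct, incorrect in word_pairs:
        counts[correct][incorrect] += 1
    return {
        word: [(err, n / sum(errs.values())) for err, n in errs.most_common()]
        for word, errs in counts.items()
    }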
3.2 Injecting Noise into Test sets
We use the noise dictionaries to create a noised version of the original test data for the four tasks: MultiATIS++ (Xu et al., 2020), MultiSNIPS, WikiANN (Pan et al., 2017), and XNLI (Conneau et al., 2018). After tokenization, we sample tokens randomly without replacement. In each sampling step, we sample based on a uniform probability distribution over the individual tokens and then check whether the token exists in the noise dictionary. If so, we replace it with a noised version from the dictionary; the noised version is sampled based on its probability in the noise dictionary (which is proportional to the frequency of its occurrence in the noisy corpora). This procedure continues until we have noised a certain number of tokens, precisely between 1 and min(4, pL), where p is a controllable fraction (chosen as a hyperparameter at first, and finalized based on the human evaluation described in §3.3) and L is the number of words in the sentence.

² https://meta.wikimedia.org/wiki/List_of_Wikipedias
³ https://docs.python.org/3/library/difflib.html
⁴ For Chinese characters, including Kanji, even a single-character edit distance could imply a different word.
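Below is a minimal sketch of the injection procedure above, assuming noise_dict maps a correct word to (noised form, probability) pairs as in the Spanish example of Section 3.1; whether the number of noised tokens is drawn uniformly from [1, min(4, pL)] or fixed at the upper bound is an implementation detail we assume here.

import math
import random

def inject_noise(tokens, noise_dict, p, rng=None):
    """Noise a tokenized sentence following Section 3.2: visit token
    positions uniformly at random without replacement and, when a token
    appears in the noise dictionary, replace it with a noised form
    sampled in proportion to its Wikipedia-edit frequency."""
    rng = rng or random.Random(0)
    # Target between 1 and min(4, p * L) noised tokens (assumption:
    # the exact count is sampled uniformly from that range).
    upper = max(1, min(4, math.floor(p * len(tokens))))
    budget = rng.randint(1, upper)
    noised = list(tokens)
    positions = list(range(len(tokens)))
    rng.shuffle(positions)  # uniform sampling without replacement
    changed = 0
    for i in positions:
        if changed == budget:
            break
        candidates = noise_dict.get(tokens[i])
        if candidates:
            forms, probs = zip(*candidates)
            noised[i] = rng.choices(forms, weights=probs, k=1)[0]
            changed += 1
    return noised

For example, with the Spanish entry {de: [(del, 0.52), ...]} from Section 3.1, calling inject_noise("lista de vuelos de oakland a salt lake city".split(), noise_dict, p=0.1) would replace at most one occurrence of "de" with a noised form such as "del" or "se".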
3.3 Human Verification of Noised Test-sets
During human evaluation, we analyse the noisy data created for the MultiATIS++ dataset. We asked the language experts to assume that a user, who may not be a native speaker, or may be in a hurry or sloppy, was trying to find flight information via text chat, and to evaluate realism with this in mind. Note that the analysis of noise types for MultiATIS++ generalizes well to the other datasets, as we use the same error-correction dictionaries to inject noise into all the test sets.
Our language experts have graduate/doctoral degrees in linguistics, computational linguistics, or natural language processing, and are fluent/native speakers of the respective languages. We employed the human experts and compensated them fairly to conduct this study (see §7 for details). The experts are given 45 examples, without being told that 15 examples have 5%, 15 have 10%, and 15 have 20% noised tokens, and are asked three questions about each example: (1) Is the noised sentence realistic, moderately realistic, or unrealistic? (2) What type of noise is present in the sentence (we supply an initial list and let them add more)? (3) Are the intent and slot labels unchanged? Based on their initial feedback, we choose the most realistic noise fraction (i.e., 5, 10, or 20%) and provide them with 60 more examples from that set. We considered 15 utterances enough to determine the noise fraction, but used the ratings on 75 utterances for evaluating realism (see realistic utterance % in Figure 1).
In Figure 1, we summarize the results of the human evaluation. Column two shows the error injection ratio that was deemed to have more than 95% realistic utterances. We set a high cut-off of 95% to ensure we can make confident statements about the robustness of multilingual models to the realistic alterations exhibited in our benchmarks. Hence, Chinese and Japanese (with a realism of 86.2% and 92.3%, respectively) are omitted from our benchmarks. The last two columns highlight examples deemed realistic and unrealistic by human experts, with the noised tokens highlighted in orange.

Figure 2: The column-wise color density (which adds up to one) shows the percentage of different noise types observed for a particular language. The row-wise values show that some noise types (e.g., homophonic) are present for only a single language (e.g., zh).
Given the sentence length and similarity in task types, we use the error injection percentage determined to be the most realistic for MultiATIS++ as the error injection percentage for MultiSNIPS and WikiANN. For XNLI, experts deemed higher noise injection ratios (>0.05) to be unrealistic (15% of utterances for a ratio of 0.1, 27% for 0.2) because (1) the premise, usually much longer than sentences in MultiATIS++, received an impractically high number of noised tokens, and (2) the classification label (implies/neutral/contradicts) sometimes changed with large noise additions. Thus, for XNLI, we choose 0.05 as the default noise injection ratio. Finally, one expert noted that the Turkish data for MultiATIS++ lacked many diacritic characters, muddling the distinction between noise injected by our procedure and existing misspellings; hence, Turkish was excluded.
In Figure 2, we list the noise types identified by our experts in different languages. While certain noise types, such as typographical errors and misspellings, are common across multiple languages,