
multilingual pre-trained models (and improving them) remains unexplored. Hence, we investigate: (1) are multilingual models robust to noise seen in different languages (which may be dissimilar to noise types seen in English)? (2) can we collect and leverage multilingual noise data to improve multilingual models? and (3) do automatic data-augmentation methods designed for English improve robustness to multilingual noise?
To boost the robustness of multilingual models
to diverse multilingual noise, we leverage multilin-
gual data augmentation at the pretraining stage and
use contrastive learning. Our effort complements work in computer vision showing that combining contrastive learning with adversarial learning at task-training (Fan et al., 2021; Ghosh and Lan, 2021) and pre-training time (Jiang et al., 2020; Kim et al., 2020) can improve model robustness. NLP has also seen a plethora of work that leverages contrastive learning, but seldom to alleviate robustness concerns (Jaiswal et al., 2020). Similar concepts, such as Adversarial Logit Pairing (Einolghozati et al., 2019), used at task-training time have proven to be less effective than data-augmentation approaches (Sengupta et al., 2021) in boosting robustness.
All the aforementioned works lack at least one of the two novel aspects of this paper: robustness to real-world (as opposed to adversarial) noise, and/or multilinguality. Lastly, cross-lingual knowledge transfer has been studied in the context of different NLP tasks, typically from a high-resource language to a low-resource one, as exemplified by the XTREME benchmark (Hu et al., 2020). In this paper, we investigate the cross-lingual transferability of robustness to real-world noise.
3 Constructing Noisy Test Data
As no benchmarks exist to evaluate the robustness of multilingual models, we construct noisy test sets in multiple languages for four tasks. First, we construct a word-level error-correction dictionary by leveraging the Wikipedia edit corpora. Then, we sample replacements from this dictionary and inject them into the test data for the various multilingual tasks, focusing on replacements that affect only individual words and do not change word order. Finally, we conduct a human evaluation to filter out test sets that language experts do not deem realistic.
3.1 Wiki-edit Mining
Wikipedia² is a public encyclopedia available in multiple languages. Wikipedia editors create and iteratively edit its contents. We leverage these edits to construct error-correction word dictionaries (later used to create noisy test data). Our approach to mining edits is similar to that of Tanaka et al. (2020), but we consider multiple languages (as opposed to only Japanese) and additionally create dictionaries of word-level edits.
To isolate likely useful edits, we first consider each revision page of an article and split it into a list of sentences using NLTK (Bird et al., 2009). Second, from two consecutive edit versions, we retain sentence pairs in which both sentences have (1) 2-120 tokens, (2) a difference of fewer than 5 tokens, and (3) a relative edit distance within 30% of the length of the shorter sentence. Third, we leverage language-specific tokenizers and difflib³ to extract exact token-level deltas between the sentence pair. Finally, we ensure that the word pairs (in these deltas) have a character-level Levenshtein edit distance of at least one from each other⁴ and that neither word consists only of numbers or punctuation tokens. Note that some Wikipedia edits change factual information, such as dates, rather than correct spelling or grammar; thus, the last step is necessary.
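The sketch below illustrates these filtering and delta-extraction steps in Python. The helper names (keep_sentence_pair, word_level_edits), the use of NLTK's generic word_tokenize and edit_distance (the paper uses language-specific tokenizers), and the reading of criterion (2) as a difference in token counts are our own assumptions rather than the paper's implementation.

import difflib
import string

from nltk import edit_distance, word_tokenize


def keep_sentence_pair(old_sent, new_sent):
    # Criteria (1)-(3): token lengths, token-count difference, relative edit distance.
    old_toks, new_toks = word_tokenize(old_sent), word_tokenize(new_sent)
    if not (2 <= len(old_toks) <= 120 and 2 <= len(new_toks) <= 120):
        return False
    if abs(len(old_toks) - len(new_toks)) >= 5:
        return False
    shorter = min(len(old_sent), len(new_sent))
    return edit_distance(old_sent, new_sent) <= 0.3 * shorter


def word_level_edits(old_sent, new_sent):
    # Extract word-for-word substitutions (no reordering) via difflib opcodes.
    old_toks, new_toks = word_tokenize(old_sent), word_tokenize(new_sent)
    pairs = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=old_toks, b=new_toks).get_opcodes():
        if op != "replace" or (i2 - i1) != (j2 - j1):
            continue
        for old_w, new_w in zip(old_toks[i1:i2], new_toks[j1:j2]):
            if edit_distance(old_w, new_w) < 1:
                continue  # require at least one character-level edit
            if any(all(c.isdigit() or c in string.punctuation for c in w)
                   for w in (old_w, new_w)):
                continue  # drop number- or punctuation-only edits (e.g., date changes)
            # Assumption: the newer revision corrects the older one.
            pairs.append((new_w, old_w))  # (correct, incorrect)
    return pairs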
Finally, we create a noise dictionary mapping correct words to incorrect ones, with frequency information about the different errors. For example, an element of the dictionary (in Spanish) looks like {de: [(del, 0.52), (se, 0.32), (do, 0.1), (dë, 0.04), (en, 0.02)]}.
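A small sketch of how such a dictionary could be assembled from the mined (correct, incorrect) pairs, with counts normalized into relative frequencies; build_noise_dictionary and mined_pairs are hypothetical names, not the paper's code.

from collections import Counter, defaultdict


def build_noise_dictionary(mined_pairs):
    # mined_pairs: iterable of (correct_word, incorrect_word) tuples from the mining step.
    counts = defaultdict(Counter)
    for correct, incorrect in mined_pairs:
        counts[correct][incorrect] += 1
    noise_dict = {}
    for correct, errors in counts.items():
        total = sum(errors.values())
        # Store each observed error with its relative frequency,
        # e.g. {"de": [("del", 0.52), ("se", 0.32), ...]}.
        noise_dict[correct] = [(word, count / total) for word, count in errors.most_common()]
    return noise_dict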
3.2 Injecting Noise into Test Sets
We use the noise dictionaries to create a noised version of the original test data for the four tasks: MultiATIS++ (Xu et al., 2020), MultiSNIPS, WikiANN (Pan et al., 2017), and XNLI (Conneau et al., 2018). After tokenization, we sample tokens randomly without replacement. In each sampling step, we sample based on a uniform probability distribution over the individual tokens and then check if the token exists in the noise dictionary. If so, we replace it with a noised version from the dictionary.
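A minimal sketch of this injection step for a single tokenized example is shown below. The noise budget max_replacements and drawing the incorrect form in proportion to its mined frequency are assumptions for illustration; the paper only specifies uniform sampling of candidate tokens without replacement.

import random


def inject_noise(tokens, noise_dict, max_replacements=1, rng=random):
    # Uniformly sample token positions without replacement and replace words
    # found in the noise dictionary with one of their mined incorrect forms.
    noised = list(tokens)
    replaced = 0
    for idx in rng.sample(range(len(tokens)), k=len(tokens)):
        if replaced >= max_replacements:
            break
        word = noised[idx]
        if word not in noise_dict:
            continue
        variants, weights = zip(*noise_dict[word])
        # Assumption: pick the noisy variant proportionally to its observed frequency.
        noised[idx] = rng.choices(variants, weights=weights, k=1)[0]
        replaced += 1
    return noised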
² https://meta.wikimedia.org/wiki/List_of_Wikipedias
³ https://docs.python.org/3/library/difflib.html
⁴ For Chinese characters, including Kanji, even a single character distance could imply a different word.