Checks and Strategies for Enabling Code-Switched Machine Translation
Thamme Gowda and Mozhdeh Gheini and Jonathan May
Information Sciences Institute and Computer Science Department
University of Southern California
{tg,gheini,jonmay}@isi.edu
Abstract
Code-switching is a common phenomenon among multilingual speakers, where alternation between two or more languages occurs within the context of a single conversation. While multilingual humans can seamlessly switch back and forth between languages, multilingual neural machine translation (NMT) models are not robust to such sudden changes in input. This work explores multilingual NMT models' ability to handle code-switched text. First, we propose checks to measure switching capability. Second, we investigate simple and effective data augmentation methods that can enhance an NMT model's ability to support code-switching. Finally, by using a glass-box analysis of attention modules, we demonstrate the effectiveness of these methods in improving robustness.
1 Introduction
Neural machine translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017) has made significant progress, from supporting only a pair of languages per model to simultaneously supporting hundreds of languages (Johnson et al., 2017; Zhang et al., 2020; Tiedemann, 2020; Gowda et al., 2021b). Multilingual NMT models have been deployed in production systems and are actively used to translate across languages in day-to-day settings (Wu et al., 2016; Caswell, 2020; Mohan and Skotdal, 2021). A great many metrics for evaluation of machine translation have been proposed (Doddington, 2002; Banerjee and Lavie, 2005; Snover et al., 2006; Popović, 2015; Gowda et al., 2021a); simply citing a more comprehensive list would exceed space limitations. However, except for context-aware MT, nearly all approaches consider translation in the context of a single sentence. Even approaches that generalize to support translation of multiple languages (Zhang et al., 2020; Tiedemann, 2020; Gowda et al., 2021b) continue to use the single-sentence, single-language paradigm. In reality, however, multilingual environments often involve language alternation or code-switching (CS), where seamless alternation between two or more languages occurs (Myers-Scotton and Ury, 1977).
CS can be broadly classified into two types (Myers-Scotton, 1989): (i) intra-sentential CS, where switching occurs within a sentence or clause boundary, and (ii) inter-sentential CS, where switching occurs at sentence or clause boundaries. An example of each type is given in Table 1. CS has been studied extensively in linguistics communities (Nilep, 2006); however, the efforts in the MT community are scant (Gupta et al., 2021).
Intra: Ce moment when you start penser en deux langues at the same temps.
       (The moment when you start to think in two languages at the same time.)
Inter: Comme on fait son lit, you must lie on it.
       (As you make your bed, you must lie on it.)

Table 1: Intra- and inter-sentential code-switching examples between French and English.
In this work, we show that, as commonly built, multilingual NMT models are not robust to multi-sentence translation, especially when CS is involved. The contributions of this work are outlined as follows: First, we describe a few simple but effective checks for improving test coverage in multilingual NMT evaluation (Section 2). Second, we explore training data augmentation techniques such as concatenation and noise addition in the context of multilingual NMT (Section 3). Third, using a many-to-one multilingual translation task setup (Section 4), we investigate the relationship between training data augmentation methods and their impact on multilingual test cases. Fourth,
we conduct a glass-box analysis of cross-attention in the Transformer architecture and show, both visually and quantitatively, that models trained with concatenated training sentences learn a more sharply focused attention mechanism than others. Finally, we examine how our data augmentation strategies generalize to multi-sentence translation for a variable number of sentences, and determine that two-sentence concatenation in training is sufficient to model many-sentence concatenation at inference (Section 5.2).
2 Multilingual Translation Evaluation:
Additional Checks
Notation: For simplicity, consider a many-to-one model that translates sentences from $K$ source languages, $\{L_k \mid k = 1, 2, \ldots, K\}$, to a target language, $T$. Let $x^{(L_k)}_i$ be a sentence in the source language $L_k$, and let its translation in the target language be $y^{(T)}_i$; where unambiguous, we omit the superscripts. We propose the following checks to be used for multilingual NMT:
C-TL: Consecutive sentences in the source and target languages. This check tests if the translator can translate in the presence of inter-sentential CS, and preserve phrases that are already in the target language. For completeness, we can test both source-to-target and target-to-source CS, as follows:

$x^{(L_k)}_i + y_{i+1} \rightarrow y_i + y_{i+1}$   (1)
$y_i + x^{(L_k)}_{i+1} \rightarrow y_i + y_{i+1}$   (2)

In practice, we use a space character to join sentences, indicated by the concatenation operator '+'.1 This check requires the held-out set sentence order to preserve the coherency of the original document.

1 We focus on orthographies that use space as a word-breaker. In orthographies without a word-breaker, joining may be performed without any glue character.
C-XL: This check tests if a multilingual translator is agnostic to CS. It is created by concatenating consecutive sentences across source languages. This is possible iff the held-out sets are multi-parallel across languages and, similar to the previous check, each preserves the coherency of the original documents. Given two languages $L_k$ and $L_m$, we obtain a test sentence as follows:

$x^{(L_k)}_i + x^{(L_m)}_{i+1} \rightarrow y_i + y_{i+1}$   (3)
R-XL: This check tests if a multilingual translator can function in light of a topic switch among its supported source languages. For any two languages $L_k$ and $L_m$ and random positions $i$ and $j$ in their original corpus, we obtain a test segment by concatenating them as:

$x^{(L_k)}_i + x^{(L_m)}_j \rightarrow y_i + y_j$   (4)

This method makes the fewest assumptions about the nature of held-out datasets, i.e., unlike the previous methods, neither multi-parallelism nor coherency in sentence order is necessary.
C-SL: Concatenate consecutive sentences in the same language. While this check is not a test of CS, it helps test whether the model is invariant to a missed segmentation, as it is not always trivial to determine sentence segmentation in continuous language. This check is possible iff the held-out set sentence order preserves the coherency of the original document. Formally,

$x^{(L_k)}_i + x^{(L_k)}_{i+1} \rightarrow y_i + y_{i+1}$   (5)
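To make these checks concrete, the following Python sketch shows one plausible way to assemble the C-SL, C-TL, C-XL, and R-XL sets from multi-parallel, document-coherent held-out data; the function and variable names are ours for illustration and are not taken from any released code.

    import random

    def build_check_sets(src, tgt, seed=0):
        # src: dict mapping language id -> list of source sentences in document order;
        #      the lists are multi-parallel, i.e., src[k][i] translates to tgt[i] for every k.
        # tgt: list of target-language (English) sentences in document order.
        rng = random.Random(seed)
        langs = list(src)
        c_sl, c_tl, c_xl, r_xl = [], [], [], []
        n = len(tgt)
        for i in range(n - 1):
            for k in langs:
                # C-SL, Eq. (5): two consecutive source sentences in the same language
                c_sl.append((src[k][i] + " " + src[k][i + 1], tgt[i] + " " + tgt[i + 1]))
                # C-TL, Eqs. (1)-(2): source followed by target, and target followed by source
                c_tl.append((src[k][i] + " " + tgt[i + 1], tgt[i] + " " + tgt[i + 1]))
                c_tl.append((tgt[i] + " " + src[k][i + 1], tgt[i] + " " + tgt[i + 1]))
                # C-XL, Eq. (3): consecutive sentences in two different source languages
                m = rng.choice([l for l in langs if l != k])
                c_xl.append((src[k][i] + " " + src[m][i + 1], tgt[i] + " " + tgt[i + 1]))
        for _ in range(n):
            # R-XL, Eq. (4): sentences from two languages at random positions
            k, m = rng.sample(langs, 2)
            i, j = rng.randrange(n), rng.randrange(n)
            r_xl.append((src[k][i] + " " + src[m][j], tgt[i] + " " + tgt[j]))
        return c_sl, c_tl, c_xl, r_xl

Each returned list contains (source, reference) pairs joined with a space, matching the '+' operator defined above.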
3 Achieving Robustness via Data
Augmentation Methods
In the previous section, we described several ways of improving test coverage for multilingual translation models. In this section, we explore training data augmentation techniques to improve robustness to code-switching settings.
3.1 Concatenation
Concatenation of training sentences has been proven to be a useful data augmentation technique; Nguyen et al. (2021) investigate key factors behind the usefulness of training segment concatenation in bilingual settings. Their experiments reveal that concatenating random sentences performs as well as consecutive sentence concatenation, which suggests that discourse coherence is unlikely to be the driving factor behind the gains. They attribute the gains to three factors: context diversity, length diversity, and position shifting.
In this work, we investigate training data concatenation under multilingual settings, hypothesizing that concatenation helps achieve the robustness checks that are described in Section 2. Our training concatenation approaches are similar to our check sets, with the notable exception that we do not consider consecutive sentence training specifically, both because of Nguyen et al. (2021)'s finding and because training data gathering techniques can often restrict the availability of consecutive data (Bañón et al., 2020). We investigate the following sub-settings for concatenation:
CatSL: Concatenate a pair of source sentences in the same language, using space whenever appropriate (e.g., languages with space-separated tokens).

$x^{(L_k)}_i + x^{(L_k)}_j \rightarrow y_i + y_j$   (6)
CatXL: Concatenate a pair of source sentences, without constraint on language.

$x^{(L_k)}_i + x^{(L_m)}_j \rightarrow y_i + y_j$   (7)
CatRepeat: The same sentence is repeated and then concatenated. Although this seems uninteresting, it serves a key role in ruling out gains possibly due to data repetition and modification of sentence lengths.

$x^{(L_k)}_i + x^{(L_k)}_i \rightarrow y_i + y_i$   (8)
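As a rough illustration of how such augmented pairs could be generated from a pooled multilingual bitext, consider the sketch below; the sampling scheme, the triple-based data format, and all names are our own assumptions, and choices such as how many pairs to draw are left to the experimenter.

    import random

    def augment_concat(bitext, mode, n_pairs, seed=0):
        # bitext: list of (lang, src_sentence, tgt_sentence) triples pooled over all languages
        rng = random.Random(seed)
        by_lang = {}
        for lang, x, y in bitext:
            by_lang.setdefault(lang, []).append((x, y))
        out = []
        for _ in range(n_pairs):
            if mode == "CatRepeat":      # Eq. (8): the same sentence twice
                _, x1, y1 = rng.choice(bitext)
                x2, y2 = x1, y1
            elif mode == "CatSL":        # Eq. (6): two sentences from the same language
                lang = rng.choice(list(by_lang))
                (x1, y1), (x2, y2) = rng.choice(by_lang[lang]), rng.choice(by_lang[lang])
            elif mode == "CatXL":        # Eq. (7): two sentences, no constraint on language
                (_, x1, y1), (_, x2, y2) = rng.choice(bitext), rng.choice(bitext)
            else:
                raise ValueError("unknown mode: " + mode)
            out.append((x1 + " " + x2, y1 + " " + y2))
        return out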
3.2 Adding Noise
We hypothesize that introducing noise during training might help achieve robustness and investigate two approaches that rely on noise addition:
DenoiseTgt: Form the source side of a target segment by adding noise to it. Formally, $noise(y; r) \rightarrow y$, where hyperparameter $r$ controls the noise ratio. Denoising is an important technique in unsupervised NMT (Artetxe et al., 2018; Lample et al., 2018).
NoisySrc: Add noise to the source side of a translation pair. Formally, $noise(x; r) \rightarrow y$. This resembles back-translation (Sennrich et al., 2016a), where augmented data is formed by pairing noisy source sentences with clean target sentences.
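Both variants reuse the existing parallel data and differ only in which side receives the noise. A minimal sketch of how the augmented pairs might be formed is given below, assuming the (lang, source, target) triples from the earlier sketch and a token-level noise function like the one sketched after the next paragraph; the helper names are hypothetical.

    def denoise_tgt_pairs(bitext, r=0.10):
        # DenoiseTgt: a noised copy of the target becomes the source, i.e., noise(y; r) -> y
        return [(" ".join(noise(y.split(), r)), y) for _, _, y in bitext]

    def noisy_src_pairs(bitext, r=0.10):
        # NoisySrc: the noised source is paired with the clean target, i.e., noise(x; r) -> y
        return [(" ".join(noise(x.split(), r)), y) for _, x, y in bitext]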
The function $noise(\cdot\,; r)$ is implemented as follows: (i) $r\%$ of random tokens are dropped, (ii) $r\%$ of random tokens are replaced with random types uniformly sampled from the vocabulary, and (iii) $r\%$ of random tokens' positions are displaced within a sequence. We use $r = 10\%$ in this work.
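A simple token-level implementation consistent with this description could look like the following sketch; the exact sampling and displacement details in the authors' implementation may differ.

    import random

    def noise(tokens, r=0.10, vocab=None, rng=random):
        # (i) drop r% of tokens at random
        out = [t for t in tokens if rng.random() >= r]
        # (ii) replace r% of the remaining tokens with types sampled uniformly from the vocabulary
        if vocab:
            out = [rng.choice(vocab) if rng.random() < r else t for t in out]
        # (iii) displace r% of token positions by swapping each selected token with a random position
        for i in range(len(out)):
            if rng.random() < r:
                j = rng.randrange(len(out))
                out[i], out[j] = out[j], out[i]
        return out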
4 Setup
4.1 Dataset
We use publicly available datasets from The Workshop on Asian Translation 2021 (WAT21)'s MultiIndicMT (Nakazawa et al., 2021)2 shared task.
Language In-domain All-data
Bengali (BN) 23.3k/0.4M/0.4M 1.3M/19.5M/21.3M
Gujarati (GU) 41.6k/0.7M/0.8M 0.5M/07.2M/09.5M
Hindi (HI) 50.3k/1.1M/1.0M 3.1M/54.7M/51.8M
Kannada (KN) 28.9k/0.4M/0.6M 0.4M/04.6M/08.7M
Malayalam (ML) 26.9k/0.3M/0.5M 1.1M/11.6M/19.0M
Marathi (MR) 29.0k/0.4M/0.5M 0.6M/09.2M/13.1M
Oriya (OR) 32.0k/0.5M/0.6M 0.3M/04.4M/05.1M
Punjabi (PA) 28.3k/0.6M/0.5M 0.5M/10.1M/10.9M
Tamil (TA) 32.6k/0.4M/0.6M 1.4M/16.0M/27.0M
Telugu (TE) 33.4k/0.5M/0.6M 0.5M/05.7M/09.1M
All 326k/5.3M/6.1M 9.6M/143M/175M
Table 2: Training dataset statistics: segments / source /
target tokens, before tokenization.
Name Dev Test
Orig 10k/140.5k/163.2k 23.9k/331.1k/385.1k
C-TL 10k/303.7k/326.4k 23.9k/716.1k/770.1k
C-XL 10k/283.9k/326.4k 23.9k/670.7k/770.1k
R-XL 10k/216.0k/251.2k 23.9k/514.5k/600.5k
C-SL 10k/281.0k/326.4k 23.9k/662.1k/770.1k
Table 3: Development and test set statistics: segments / source / target tokens, before subword tokenization. The row named 'Orig' is the union of all ten individual languages' datasets, and the rest are created as per definitions in Section 2. The Dev-Orig set is used for validation and early stopping in all our multilingual models.
This task involves translation between English (EN) and 10 Indic languages, namely: Bengali (BN), Gujarati (GU), Hindi (HI), Kannada (KN), Malayalam (ML), Marathi (MR), Oriya (OR), Punjabi (PA), Tamil (TA), and Telugu (TE). The development and held-out test sets are multi-parallel and contain 1,000 and 2,390 sentences, respectively. The training set contains a small portion of data from the same domain as the held-out sets, as well as additional datasets from other domains. All the training data statistics are given in Table 2. We focus on the Indic→English (many-to-one) translation direction in this work.
Following the definitions in Section 2, we create C-SL, C-TL, C-XL, and R-XL versions of the development and test sets; statistics are given in Table 3. An example demonstrating the nuances of all four methods is shown in Table 4. Following the definitions in Section 3, we create CatSL, CatXL, CatRepeat, DenoiseTgt, and NoisySrc augmented training segments. For each of these training corpus augmentation methods, we restrict the total
2 http://lotus.kuee.kyoto-u.ac.jp/WAT/indic-multilingual/