Checks and Strategies for Enabling Code-Switched Machine Translation
Thamme Gowda and Mozhdeh Gheini and Jonathan May
Information Sciences Institute and Computer Science Department
University of Southern California
{tg,gheini,jonmay}@isi.edu
Abstract
Code-switching is a common phenomenon among multilingual speakers, where alternation between two or more languages occurs within the context of a single conversation. While multilingual humans can seamlessly switch back and forth between languages, multilingual neural machine translation (NMT) models are not robust to such sudden changes in input. This work explores multilingual NMT models' ability to handle code-switched text. First, we propose checks to measure switching capability. Second, we investigate simple and effective data augmentation methods that can enhance an NMT model's ability to support code-switching. Finally, by using a glass-box analysis of attention modules, we demonstrate the effectiveness of these methods in improving robustness.
1 Introduction
Neural machine translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017) has made significant progress, from supporting only a pair of languages per model to simultaneously supporting hundreds of languages (Johnson et al., 2017; Zhang et al., 2020; Tiedemann, 2020; Gowda et al., 2021b). Multilingual NMT models have been deployed in production systems and are actively used to translate across languages in day-to-day settings (Wu et al., 2016; Caswell, 2020; Mohan and Skotdal, 2021). A great many metrics for evaluation of machine translation have been proposed (Doddington, 2002; Banerjee and Lavie, 2005; Snover et al., 2006; Popović, 2015; Gowda et al., 2021a); simply citing a more comprehensive list would exceed space limitations. However, except for context-aware MT, nearly all approaches consider translation in the context of a single sentence. Even approaches that generalize to support translation of multiple languages (Zhang et al., 2020; Tiedemann, 2020; Gowda et al., 2021b) continue to use the single-sentence, single-language paradigm. In reality, however, multilingual environments often involve language alternation or code-switching (CS), where seamless alternation between two or more languages occurs (Myers-Scotton and Ury, 1977).
CS can be broadly classified into two types (Myers-Scotton, 1989): (i) intra-sentential CS, where switching occurs within a sentence or clause boundary, and (ii) inter-sentential CS, where switching occurs at sentence or clause boundaries. An example of each type is given in Table 1. CS has been studied extensively in linguistics communities (Nilep, 2006); however, the efforts in the MT community are scant (Gupta et al., 2021).
Intra: Ce moment when you start penser en deux langues at the same temps.
       (The moment when you start to think in two languages at the same time.)
Inter: Comme on fait son lit, you must lie on it.
       (As you make your bed, you must lie on it.)

Table 1: Intra- and inter-sentential code-switching examples between French and English.
In this work, we show that, as commonly built, multilingual NMT models are not robust to multi-sentence translation, especially when CS is involved. The contributions of this work are outlined as follows: First, we describe a few simple but effective checks for improving test coverage in multilingual NMT evaluation (Section 2). Second, we explore training data augmentation techniques such as concatenation and noise addition in the context of multilingual NMT (Section 3). Third, using a many-to-one multilingual translation task setup (Section 4), we investigate the relationship between training data augmentation methods and their impact on multilingual test cases. Fourth,
we conduct a glass-box analysis of cross-attention in the Transformer architecture and show, both visually and quantitatively, that models trained with concatenated training sentences learn a more sharply focused attention mechanism than others. Finally, we examine how our data augmentation strategies generalize to multi-sentence translation for a variable number of sentences, and determine that two-sentence concatenation in training is sufficient to model many-sentence concatenation at inference (Section 5.2).
2 Multilingual Translation Evaluation:
Additional Checks
Notation: For simplicity, consider a many-to-one model that translates sentences from $K$ source languages, $\{L_k \mid k = 1, 2, \ldots, K\}$, to a target language, $T$. Let $x^{(L_k)}_i$ be a sentence in the source language $L_k$, and let its translation in the target language be $y^{(T)}_i$; where unambiguous, we omit the superscripts. We propose the following checks to be used for multilingual NMT:
C-TL: Consecutive sentences in the source and target languages. This check tests if the translator can translate in the presence of inter-sentential CS, and preserve phrases that are already in the target language. For completeness, we can test both source-to-target and target-to-source CS, as follows:

$x^{(L_k)}_i + y_{i+1} \rightarrow y_i + y_{i+1}$   (1)
$y_i + x^{(L_k)}_{i+1} \rightarrow y_i + y_{i+1}$   (2)

In practice, we use a space character to join sentences, indicated by the concatenation operator '+'.1 This check requires the held-out set sentence order to preserve the coherency of the original document.

1 We focus on orthographies that use space as a word-breaker. In orthographies without a word-breaker, joining may be performed without any glue character.
C-XL: This check tests if a multilingual translator is agnostic to CS. It is created by concatenating consecutive sentences across source languages. This is possible iff the held-out sets are multi-parallel across languages and, similar to the previous check, each preserves the coherency of the original documents. Given two languages $L_k$ and $L_m$, we obtain a test sentence as follows:

$x^{(L_k)}_i + x^{(L_m)}_{i+1} \rightarrow y_i + y_{i+1}$   (3)
R-XL: This check tests if a multilingual translator can function in light of a topic switch among its supported source languages. For any two languages $L_k$ and $L_m$ and random positions $i$ and $j$ in their original corpus, we obtain a test segment by concatenating them as:

$x^{(L_k)}_i + x^{(L_m)}_j \rightarrow y_i + y_j$   (4)

This method makes the fewest assumptions about the nature of held-out datasets, i.e., unlike the previous methods, neither multi-parallelism nor coherency in sentence order is necessary.
C-SL: Concatenate consecutive sentences in the same language. While this check is not a test of CS, it helps test whether the model is invariant to a missed segmentation, as it is not always trivial to determine sentence segmentation in continuous language. This check is possible iff the held-out set sentence order preserves the coherency of the original document. Formally,

$x^{(L_k)}_i + x^{(L_k)}_{i+1} \rightarrow y_i + y_{i+1}$   (5)
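To make these checks concrete, the following Python sketch shows one plausible way to assemble the C-SL, C-TL, C-XL, and R-XL sets from multi-parallel, document-coherent held-out data; the function and variable names are ours for illustration and are not taken from any released code.

    import random

    def build_check_sets(src, tgt, seed=0):
        # src: dict mapping language id -> list of source sentences in document order;
        #      the lists are multi-parallel, i.e., src[k][i] translates to tgt[i] for every k.
        # tgt: list of target-language (English) sentences in document order.
        rng = random.Random(seed)
        langs = list(src)
        c_sl, c_tl, c_xl, r_xl = [], [], [], []
        n = len(tgt)
        for i in range(n - 1):
            for k in langs:
                # C-SL, Eq. (5): two consecutive source sentences in the same language
                c_sl.append((src[k][i] + " " + src[k][i + 1], tgt[i] + " " + tgt[i + 1]))
                # C-TL, Eqs. (1)-(2): source followed by target, and target followed by source
                c_tl.append((src[k][i] + " " + tgt[i + 1], tgt[i] + " " + tgt[i + 1]))
                c_tl.append((tgt[i] + " " + src[k][i + 1], tgt[i] + " " + tgt[i + 1]))
                # C-XL, Eq. (3): consecutive sentences in two different source languages
                m = rng.choice([l for l in langs if l != k])
                c_xl.append((src[k][i] + " " + src[m][i + 1], tgt[i] + " " + tgt[i + 1]))
        for _ in range(n):
            # R-XL, Eq. (4): sentences from two languages at random positions
            k, m = rng.sample(langs, 2)
            i, j = rng.randrange(n), rng.randrange(n)
            r_xl.append((src[k][i] + " " + src[m][j], tgt[i] + " " + tgt[j]))
        return c_sl, c_tl, c_xl, r_xl

Each returned list contains (source, reference) pairs joined with a space, matching the '+' operator defined above.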
3 Achieving Robustness via Data
Augmentation Methods
In the previous section, we described several ways of improving test coverage for multilingual translation models. In this section, we explore training data augmentation techniques to improve robustness to code-switching settings.
3.1 Concatenation
Concatenation of training sentences has been proven to be a useful data augmentation technique; Nguyen et al. (2021) investigate key factors behind the usefulness of training segment concatenation in bilingual settings. Their experiments reveal that concatenating random sentences performs as well as consecutive sentence concatenation, which suggests that discourse coherence is unlikely to be the driving factor behind the gains. They attribute the gains to three factors: context diversity, length diversity, and position shifting.
In this work, we investigate training data concatenation under multilingual settings, hypothesizing that concatenation helps achieve the robustness checks that are described in Section 2. Our training concatenation approaches are similar to our check sets, with the notable exception that we do not consider consecutive sentence training specifically, both because of Nguyen et al. (2021)'s finding and because training data gathering techniques can often restrict the availability of consecutive data (Bañón et al., 2020). We investigate the following sub-settings for concatenation:
CatSL: Concatenate a pair of source sentences in the same language, using space whenever appropriate (e.g., languages with space-separated tokens).

$x^{(L_k)}_i + x^{(L_k)}_j \rightarrow y_i + y_j$   (6)
CatXL: Concatenate a pair of source sentences, without constraint on language.

$x^{(L_k)}_i + x^{(L_m)}_j \rightarrow y_i + y_j$   (7)
CatRepeat: The same sentence is repeated and then concatenated. Although this seems uninteresting, it serves a key role in ruling out gains possibly due to data repetition and modification of sentence lengths.

$x^{(L_k)}_i + x^{(L_k)}_i \rightarrow y_i + y_i$   (8)
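As a rough illustration of how such augmented pairs could be generated from a pooled multilingual bitext, consider the sketch below; the sampling scheme, the triple-based data format, and all names are our own assumptions, and choices such as how many pairs to draw are left to the experimenter.

    import random

    def augment_concat(bitext, mode, n_pairs, seed=0):
        # bitext: list of (lang, src_sentence, tgt_sentence) triples pooled over all languages
        rng = random.Random(seed)
        by_lang = {}
        for lang, x, y in bitext:
            by_lang.setdefault(lang, []).append((x, y))
        out = []
        for _ in range(n_pairs):
            if mode == "CatRepeat":      # Eq. (8): the same sentence twice
                _, x1, y1 = rng.choice(bitext)
                x2, y2 = x1, y1
            elif mode == "CatSL":        # Eq. (6): two sentences from the same language
                lang = rng.choice(list(by_lang))
                (x1, y1), (x2, y2) = rng.choice(by_lang[lang]), rng.choice(by_lang[lang])
            elif mode == "CatXL":        # Eq. (7): two sentences, no constraint on language
                (_, x1, y1), (_, x2, y2) = rng.choice(bitext), rng.choice(bitext)
            else:
                raise ValueError("unknown mode: " + mode)
            out.append((x1 + " " + x2, y1 + " " + y2))
        return out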
3.2 Adding Noise
We hypothesize that introducing noise during training might help achieve robustness and investigate two approaches that rely on noise addition:
DenoiseTgt: Form the source side of a target segment by adding noise to it. Formally, $noise(y; r) \rightarrow y$, where hyperparameter $r$ controls the noise ratio. Denoising is an important technique in unsupervised NMT (Artetxe et al., 2018; Lample et al., 2018).
NoisySrc: Add noise to the source side of a translation pair. Formally, $noise(x; r) \rightarrow y$. This resembles back-translation (Sennrich et al., 2016a), where augmented data is formed by pairing noisy source sentences with clean target sentences.
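Both variants reuse the existing parallel data and differ only in which side receives the noise. A minimal sketch of how the augmented pairs might be formed is given below, assuming the (lang, source, target) triples from the earlier sketch and a token-level noise function like the one sketched after the next paragraph; the helper names are hypothetical.

    def denoise_tgt_pairs(bitext, r=0.10):
        # DenoiseTgt: a noised copy of the target becomes the source, i.e., noise(y; r) -> y
        return [(" ".join(noise(y.split(), r)), y) for _, _, y in bitext]

    def noisy_src_pairs(bitext, r=0.10):
        # NoisySrc: the noised source is paired with the clean target, i.e., noise(x; r) -> y
        return [(" ".join(noise(x.split(), r)), y) for _, x, y in bitext]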
The function $noise(\cdot\,; r)$ is implemented as follows: (i) $r\%$ of random tokens are dropped, (ii) $r\%$ of random tokens are replaced with random types uniformly sampled from the vocabulary, and (iii) $r\%$ of random tokens' positions are displaced within a sequence. We use $r = 10\%$ in this work.
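A simple token-level implementation consistent with this description could look like the following sketch; the exact sampling and displacement details in the authors' implementation may differ.

    import random

    def noise(tokens, r=0.10, vocab=None, rng=random):
        # (i) drop r% of tokens at random
        out = [t for t in tokens if rng.random() >= r]
        # (ii) replace r% of the remaining tokens with types sampled uniformly from the vocabulary
        if vocab:
            out = [rng.choice(vocab) if rng.random() < r else t for t in out]
        # (iii) displace r% of token positions by swapping each selected token with a random position
        for i in range(len(out)):
            if rng.random() < r:
                j = rng.randrange(len(out))
                out[i], out[j] = out[j], out[i]
        return out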
4 Setup
4.1 Dataset
We use publicly available datasets from The Workshop on Asian Translation 2021 (WAT21)'s MultiIndicMT (Nakazawa et al., 2021)2 shared task.
Language In-domain All-data
Bengali (BN) 23.3k/0.4M/0.4M 1.3M/19.5M/21.3M
Gujarati (GU) 41.6k/0.7M/0.8M 0.5M/07.2M/09.5M
Hindi (HI) 50.3k/1.1M/1.0M 3.1M/54.7M/51.8M
Kannada (KN) 28.9k/0.4M/0.6M 0.4M/04.6M/08.7M
Malayalam (ML) 26.9k/0.3M/0.5M 1.1M/11.6M/19.0M
Marathi (MR) 29.0k/0.4M/0.5M 0.6M/09.2M/13.1M
Oriya (OR) 32.0k/0.5M/0.6M 0.3M/04.4M/05.1M
Punjabi (PA) 28.3k/0.6M/0.5M 0.5M/10.1M/10.9M
Tamil (TA) 32.6k/0.4M/0.6M 1.4M/16.0M/27.0M
Telugu (TE) 33.4k/0.5M/0.6M 0.5M/05.7M/09.1M
All 326k/5.3M/6.1M 9.6M/143M/175M
Table 2: Training dataset statistics: segments / source /
target tokens, before tokenization.
Name Dev Test
Orig 10k/140.5k/163.2k 23.9k/331.1k/385.1k
C-TL 10k/303.7k/326.4k 23.9k/716.1k/770.1k
C-XL 10k/283.9k/326.4k 23.9k/670.7k/770.1k
R-XL 10k/216.0k/251.2k 23.9k/514.5k/600.5k
C-SL 10k/281.0k/326.4k 23.9k/662.1k/770.1k
Table 3: Development and test set statistics: segments / source / target tokens, before subword tokenization. The row named 'Orig' is the union of all ten individual languages' datasets, and the rest are created as per definitions in Section 2. The Dev-Orig set is used for validation and early stopping in all our multilingual models.
This task involves translation between English (EN) and 10 Indic languages, namely: Bengali (BN), Gujarati (GU), Hindi (HI), Kannada (KN), Malayalam (ML), Marathi (MR), Oriya (OR), Punjabi (PA), Tamil (TA), and Telugu (TE). The development and held-out test sets are multi-parallel and contain 1,000 and 2,390 sentences, respectively. The training set contains a small portion of data from the same domain as the held-out sets, as well as additional datasets from other domains. All the training data statistics are given in Table 2. We focus on the Indic→English (many-to-one) translation direction in this work.
Following the definitions in Section 2, we create C-SL, C-TL, C-XL, and R-XL versions of the development and test sets; statistics are given in Table 3. An example demonstrating the nuances of all four methods is shown in Table 4. Following the definitions in Section 3, we create CatSL, CatXL, CatRepeat, DenoiseTgt, and NoisySrc augmented training segments. For each of these training corpus augmentation methods, we restrict the total
2 http://lotus.kuee.kyoto-u.ac.jp/WAT/indic-multilingual/