Domain Curricula for Code-Switched MT at MixMT 2022
Lekan Raheem & Maab Elrashid
African Institute for Mathematical Sciences (AIMS)
{rwaliyu,mnimir}@aimsammi.org
Abstract
In multilingual colloquial settings, it is a habitual occurrence to compose expressions of text or speech containing tokens or phrases of different languages, a phenomenon popularly known as code-switching or code-mixing (CMX). We present our approach and results for the Code-mixed Machine Translation (MixMT) shared task at WMT 2022, which consists of two subtasks: monolingual to code-mixed machine translation (Subtask-1) and code-mixed to monolingual machine translation (Subtask-2). Most non-synthetic code-mixed data come from social media, where gathering a significant amount of data is laborious and writing varies more than in other domains, so for both subtasks we experimented with data schedules for out-of-domain data. We jointly learn multiple domains of text by pretraining and fine-tuning, combined with a sentence alignment objective. We found that switching between domains improved performance in the domains seen earliest during training, but degraded performance on the remaining domains. A continuous training run with strategically dispensed data of different domains showed significantly improved performance over fine-tuning.
1 Introduction
Code-mixing (CMX) denotes the alternation of two languages within a single utterance (Poplack, 1980; Sitaram et al., 2019). Code-mixing occurs mostly in informal exchanges in multilingual environments. More than 77% of Asians are multilingual (Ramakrishnan and Ahmad, 2014), and other statistics estimate that 64.5% of Europeans speak more than two languages, with more than 80% of adults in the region being bilingual (Eurostat, 2019). Code-mixing happens far more often in conversation than in writing, and mostly in unofficial settings, so it rarely appears in documented sources. This makes gathering substantial data for computational approaches to translating code-mixed language difficult. Parallel corpora for code-switched data are very scarce (Menacer et al., 2019), because code-mixing mostly occurs in informal conversations such as social media interactions.
Contemporary Neural Machine Translation (NMT) mostly relies on parametric sequence-to-sequence models (Bahdanau et al., 2014; Vaswani et al., 2017), in which an encoder receives a source sentence and outputs a set of hidden states; the decoder then attends to these hidden states at each step and outputs a sequence of softmax distributions over the target vocabulary. Since vast quantities of data would be needed to train an adequate NMT system for this task, we leverage large-scale synthetic data alongside the small amount of available natural data, and notably rank data by domain relevance: fine-tuning with the most relevant domain, or initiating training with it by strategically placing it in the earliest batches of the training data.
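As a rough illustration of this sequence-to-sequence setup (an encoder producing hidden states, a decoder attending to them and emitting distributions over the target vocabulary), the minimal PyTorch sketch below uses GRUs and dot-product attention. The module names, sizes, and the recurrent choice are assumptions for illustration only, not the system submitted to the shared task.

```python
# Minimal sketch of an attentional encoder-decoder NMT model (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySeq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(2 * dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encoder: source sentence -> a set of hidden states.
        enc_states, enc_last = self.encoder(self.src_emb(src_ids))
        dec_hidden = enc_last
        logits = []
        for t in range(tgt_ids.size(1)):
            # The decoder attends to the encoder states at every step ...
            query = dec_hidden[-1].unsqueeze(1)                     # (B, 1, dim)
            scores = torch.bmm(query, enc_states.transpose(1, 2))   # (B, 1, S)
            context = torch.bmm(F.softmax(scores, dim=-1), enc_states)
            step_in = torch.cat([self.tgt_emb(tgt_ids[:, t:t + 1]), context], dim=-1)
            dec_out, dec_hidden = self.decoder(step_in, dec_hidden)
            logits.append(self.out(dec_out))
        # Apply softmax (or cross-entropy on these logits) to obtain the
        # distribution over the target vocabulary at each step.
        return torch.cat(logits, dim=1)  # (B, T, tgt_vocab)
```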
Essentially, the characteristics of the data an NMT model is trained on, in particular its size and domain, are paramount to its translation quality, so it is essential to train NMT models with the domain relevance of the corpora in mind. Since most code-mixing occurs in unofficial communication, labeled data for every domain of interest is costly to obtain. We therefore attempt to find less expensive techniques to supplement the training data: pretraining on widely available data from different domains, strategically constructing synthetic data, and apportioning data to make up for missing domains.
In these WMT subtasks, monolingual to code-mixed machine translation (Subtask-1) and code-mixed to monolingual machine translation (Subtask-2), we also fine-tune on different domains, align representations of the data, and search for the best combination of approaches to solve the subtasks. The main intuition behind our proposed solution is that NMT models exhibit a significant translation correlation when trained on data
from the same or similar domains. When several data domains are required, a model performs better when the most relevant domain is supplied in the preliminary batches of training than when it is fine-tuned on afterwards. Since social media is the main source of natural code-mixed data, and gathering enough of it to train a model is difficult, we need a strategy that makes the model prioritize this form of data over others. Accordingly, we attempt to find less expensive techniques to supplement the training data: pretraining on widely available data from different domains, strategically constructing synthetic data, and apportioning data to make up for missing domains. Our results show improved performance on natural code-mixed data (and on the non-synthetic WMT test set samples) when this data was prioritized, and strong performance on a test set mixing several other data sources. We observed better performance on domain-specific evaluation after fine-tuning, but at the cost of sharply reduced performance on the other pretraining domains; passing the domain of interest in the preliminary batches of a single all-domain training run gave a more balanced performance.
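To make the scheduling idea concrete, the sketch below orders a single continuous training run so that examples from the priority domain (here, natural social-media code-mixed data) fill the earliest batches, followed by the remaining domains. The function name, domain names, and batching details are illustrative assumptions, not the exact pipeline used for our submissions.

```python
import random

def schedule_batches(corpora, priority_domain, batch_size, seed=0):
    """Order training batches so that examples from the priority domain
    (e.g. natural social-media code-mixed data) fill the earliest batches,
    followed by a shuffled mix of the remaining domains.

    corpora: dict mapping domain name -> list of (src, tgt) sentence pairs.
    """
    rng = random.Random(seed)
    in_domain = list(corpora[priority_domain])
    rest = [pair for name, pairs in corpora.items()
            if name != priority_domain for pair in pairs]
    rng.shuffle(in_domain)
    rng.shuffle(rest)
    ordered = in_domain + rest  # relevant domain lands in the premier batches
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]

# Illustrative usage with hypothetical domain names:
# batches = schedule_batches(
#     {"social_media": cmx_pairs, "news": news_pairs, "religious": tanzil_pairs},
#     priority_domain="social_media", batch_size=32)
```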
2 Related Work
It is laborious to obtain ‘one-fits-all’ training data for NMT. Most publicly available parallel corpora, such as Tanzil, OPUS, and UNPC, are sourced from documented communication and are often domain-specific. In NMT, data selection (e.g., Axelrod et al., 2011) has remained an important underlying research concern. Choosing training examples that are relevant to the target domain, or choosing high-quality examples for data cleaning (also known as denoising), has been essential in domain adaptation. Building a large-scale multi-domain NMT model that excels on several domains simultaneously is both technically difficult and practically demanding. Work addressing research problems such as catastrophic forgetting (Goodfellow et al., 2013), data balancing (Wang et al., 2020), and adapters (Houlsby et al., 2019) has shown improvements. Unfortunately, several domains are difficult to handle with the single-domain data selection techniques currently in use; for instance, improving the translation quality of one domain will often hurt that of another (Britz et al., 2017; van der Wees et al., 2017).
Song et al. (2019) replaced phrases with pre-specified translations to perform “soft” constraint decoding. Xu and Yvon (2021) generated code-mixed data from regular parallel texts and showed that this training strategy yields MT systems that surpass multilingual systems on code-mixed texts.
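As a toy illustration of the general idea of deriving code-mixed text from regular parallel data (not the specific procedure of Xu and Yvon, 2021), one can replace a fraction of source tokens with their aligned target-side counterparts; the alignment dictionary and mixing ratio below are assumed inputs.

```python
import random

def synth_code_mix(src_tokens, alignment, ratio=0.3, seed=0):
    """Toy synthetic code-mixing: swap a fraction of source tokens for
    their aligned target-language tokens.

    src_tokens: list of tokens in the source language.
    alignment:  dict mapping a source token to an aligned target token
                (e.g. extracted from word alignments over a parallel corpus).
    """
    rng = random.Random(seed)
    mixed = []
    for tok in src_tokens:
        if tok in alignment and rng.random() < ratio:
            mixed.append(alignment[tok])  # switch to the other language
        else:
            mixed.append(tok)
    return mixed
```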
Considering that code-mixed text belongs to less documented domains than most, domain adaptation from better-resourced data domains may be needed. Our work is inspired by the following approaches: Wang et al. (2019) performed simultaneous data selection across several domains by gradually focusing on multi-domain relevant and noise-reduced data batches, carefully introducing instance-level domain-relevance features and automatically constructing a training curriculum. Park et al. (2022) demonstrated that instance-level features distinguish between domains better than corpus-level attributes. Dou et al. (2019) proposed modeling the differences between domains instead of smoothing over them for machine translation.
Anwar et al. (2022) showed that an encoder alignment objective is beneficial for code-mixed translation, complementing Arivazhagan et al. (2019), who proposed auxiliary losses on the NMT encoder that impose representational invariance across languages for multilingual machine translation.
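As a rough sketch of what such an alignment objective can look like, the code below adds an auxiliary loss that pulls the mean-pooled encoder representations of a parallel sentence pair together with a cosine penalty. The pooling choice, the penalty form, and the weighting term are assumptions for illustration, not the exact losses of Anwar et al. (2022) or Arivazhagan et al. (2019).

```python
import torch
import torch.nn.functional as F

def alignment_loss(enc_src, enc_tgt, src_mask, tgt_mask):
    """Auxiliary loss encouraging the encoder to map a sentence and its
    translation to nearby representations (mean-pooled over real tokens).

    enc_src, enc_tgt: (batch, seq_len, dim) encoder outputs for the two sides.
    src_mask, tgt_mask: (batch, seq_len) with 1 for real tokens, 0 for padding.
    """
    def mean_pool(states, mask):
        mask = mask.unsqueeze(-1).float()
        return (states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

    src_vec = mean_pool(enc_src, src_mask)
    tgt_vec = mean_pool(enc_tgt, tgt_mask)
    # 1 - cosine similarity: zero when the two representations coincide.
    return (1.0 - F.cosine_similarity(src_vec, tgt_vec, dim=-1)).mean()

# total_loss = translation_loss + lam * alignment_loss(...)  # lam: tuned weight
```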
English                                                                    | Code-Mixed (CMX)
@dh*v*l2410*6 sure brother :)                                              | @dh*v*l2410*6 sure bhai :)
"I just need reviews like these, this motivates me a lot"                  | "Bas aise hi reviews ki zaroorat hai, kaafi protsahan milta hai in baaton se."
When the sorrow got missing in this room, the blood also became thin, #GuessTheSong | Jab gam ye rum mein kho gaya, toh khoon bhi patla hogaya #GuessTheSong

Table 1: Examples from the WMT Shared Task Dataset.
3 Data
In Table 1 we show some samples from the WMT shared task, sourced from the non-synthetic validation data. The data provided for Subtask-1 (Srivastava and Singh, 2021) contains synthetic and human-generated data, and Subtask-2 uses the Parallel Hinglish Social Media Code-Mixed Corpus (Sri-