Domain Curricula for Code-Switched MT at MixMT 2022
Lekan Raheem & Maab Elrashid
African Institute for Mathematical Sciences (AIMS)
{rwaliyu,mnimir}@aimsammi.org
Abstract
In multilingual colloquial settings, it is a habitual occurrence to compose expressions of text or speech containing tokens or phrases of different languages, a phenomenon popularly known as code-switching or code-mixing (CMX). We present our approach and results for the Code-mixed Machine Translation (MixMT) shared task at WMT 2022, which consists of two subtasks: monolingual to code-mixed machine translation (Subtask-1) and code-mixed to monolingual machine translation (Subtask-2). Most non-synthetic code-mixed data come from social media, where gathering a significant amount of data is laborious and writing varies more than in other domains, so for both subtasks we experimented with data schedules for out-of-domain data. We jointly learn multiple domains of text by pretraining and fine-tuning, combined with a sentence alignment objective. We found that switching between domains improved performance in the domains seen earliest during training, but degraded performance on the remaining domains. A continuous training run with strategically dispensed data of different domains showed significantly improved performance over fine-tuning.
1 Introduction
Code-mixing (CMX) denotes the alternation of two languages within a single utterance (Poplack, 1980; Sitaram et al., 2019). Code-mixing occurs mostly in informal exchanges in multilingual environments. More than 77% of Asians are multilingual (Ramakrishnan and Ahmad, 2014), and other statistics estimate that 64.5% of Europeans speak more than two languages, with more than 80% of adults in the region being bilingual (Eurostat, 2019). Code-mixing happens far more often in conversation than in writing, and mostly in unofficial settings, so it rarely appears in documented sources. This makes gathering substantial data for computational approaches to translating code-mixed language difficult. Parallel corpora for code-switched data are very scarce (Menacer et al., 2019), because code-mixing mostly occurs in informal conversations such as social media interactions.
Contemporary Neural Machine Translation (NMT) mostly relies on parametric sequence-to-sequence models (Bahdanau et al., 2014; Vaswani et al., 2017), in which an encoder receives a source sentence and outputs a set of hidden states; the decoder then attends to these hidden states at each step and outputs a sequence of softmax distributions over the target vocabulary. Since vast quantities of data would be needed to train an adequate NMT system for this task, we leverage large-scale synthetic data alongside the small amount of available natural data, and notably rank data by domain relevance: fine-tuning with the most relevant domain, or initiating training with it by strategically placing it in the earliest batches of the training data.
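As a rough illustration of this sequence-to-sequence setup (an encoder producing hidden states, a decoder attending to them and emitting distributions over the target vocabulary), the minimal PyTorch sketch below uses GRUs and dot-product attention. The module names, sizes, and the recurrent choice are assumptions for illustration only, not the system submitted to the shared task.

```python
# Minimal sketch of an attentional encoder-decoder NMT model (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySeq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(2 * dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encoder: source sentence -> a set of hidden states.
        enc_states, enc_last = self.encoder(self.src_emb(src_ids))
        dec_hidden = enc_last
        logits = []
        for t in range(tgt_ids.size(1)):
            # The decoder attends to the encoder states at every step ...
            query = dec_hidden[-1].unsqueeze(1)                     # (B, 1, dim)
            scores = torch.bmm(query, enc_states.transpose(1, 2))   # (B, 1, S)
            context = torch.bmm(F.softmax(scores, dim=-1), enc_states)
            step_in = torch.cat([self.tgt_emb(tgt_ids[:, t:t + 1]), context], dim=-1)
            dec_out, dec_hidden = self.decoder(step_in, dec_hidden)
            logits.append(self.out(dec_out))
        # Apply softmax (or cross-entropy on these logits) to obtain the
        # distribution over the target vocabulary at each step.
        return torch.cat(logits, dim=1)  # (B, T, tgt_vocab)
```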
Essentially, the characteristics of the data an NMT model is trained on, in particular its size and domain, are paramount to its translation quality, so it is essential to train NMT models with the domain relevance of the corpora in mind. Since most code-mixing occurs in unofficial communication, labeled data for every domain of interest is costly to obtain. We therefore attempt to find less expensive techniques to supplement the training data: pretraining on widely available data from different domains, strategically constructing synthetic data, and apportioning data to make up for missing domains.
In these WMT subtasks, monolingual to code-mixed machine translation (Subtask-1) and code-mixed to monolingual machine translation (Subtask-2), we also fine-tune on different domains, align representations of the data, and search for the best combination of approaches to solve the subtasks. The main intuition behind our proposed solution is that NMT models exhibit a significant translation correlation when trained on data
from the same or similar domains. When several data domains are required, a model performs better when the most relevant domain is supplied in the preliminary batches of training than when it is fine-tuned on afterwards. Since social media is the main source of natural code-mixed data, and gathering enough of it to train a model is difficult, we need a strategy that makes the model prioritize this form of data over others. Accordingly, we attempt to find less expensive techniques to supplement the training data: pretraining on widely available data from different domains, strategically constructing synthetic data, and apportioning data to make up for missing domains. Our results show improved performance on natural code-mixed data (and on the non-synthetic WMT test set samples) when this data was prioritized, and strong performance on a test set mixing several other data sources. We observed better performance on domain-specific evaluation after fine-tuning, but at the cost of sharply reduced performance on the other pretraining domains; passing the domain of interest in the preliminary batches of a single all-domain training run gave a more balanced performance.
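To make the scheduling idea concrete, the sketch below orders a single continuous training run so that examples from the priority domain (here, natural social-media code-mixed data) fill the earliest batches, followed by the remaining domains. The function name, domain names, and batching details are illustrative assumptions, not the exact pipeline used for our submissions.

```python
import random

def schedule_batches(corpora, priority_domain, batch_size, seed=0):
    """Order training batches so that examples from the priority domain
    (e.g. natural social-media code-mixed data) fill the earliest batches,
    followed by a shuffled mix of the remaining domains.

    corpora: dict mapping domain name -> list of (src, tgt) sentence pairs.
    """
    rng = random.Random(seed)
    in_domain = list(corpora[priority_domain])
    rest = [pair for name, pairs in corpora.items()
            if name != priority_domain for pair in pairs]
    rng.shuffle(in_domain)
    rng.shuffle(rest)
    ordered = in_domain + rest  # relevant domain lands in the premier batches
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]

# Illustrative usage with hypothetical domain names:
# batches = schedule_batches(
#     {"social_media": cmx_pairs, "news": news_pairs, "religious": tanzil_pairs},
#     priority_domain="social_media", batch_size=32)
```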
2 Related Work
It is laborious to obtain ‘one-fits-all’ training data for NMT. Most publicly available parallel corpora, such as Tanzil, OPUS, and UNPC, are sourced from documented communication and are often domain-specific. In NMT, data selection (e.g., Axelrod et al., 2011) has remained an important underlying research concern. Choosing training examples that are relevant to the target domain, or choosing high-quality examples for data cleaning (also known as denoising), has been essential in domain adaptation. Building a large-scale multi-domain NMT model that excels on several domains simultaneously is both technically difficult and practically demanding. Work addressing research problems such as catastrophic forgetting (Goodfellow et al., 2013), data balancing (Wang et al., 2020), and adapters (Houlsby et al., 2019) has shown improvements. Unfortunately, several domains are difficult to handle with the single-domain data selection techniques currently in use; for instance, improving the translation quality of one domain will often hurt that of another (Britz et al., 2017; van der Wees et al., 2017).
Song et al. (2019) replaced phrases with pre-specified translations to perform “soft” constraint decoding. Xu and Yvon (2021) generated code-mixed data from regular parallel texts and showed that this training strategy yields MT systems that surpass multilingual systems on code-mixed texts.
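As a toy illustration of the general idea of deriving code-mixed text from regular parallel data (not the specific procedure of Xu and Yvon, 2021), one can replace a fraction of source tokens with their aligned target-side counterparts; the alignment dictionary and mixing ratio below are assumed inputs.

```python
import random

def synth_code_mix(src_tokens, alignment, ratio=0.3, seed=0):
    """Toy synthetic code-mixing: swap a fraction of source tokens for
    their aligned target-language tokens.

    src_tokens: list of tokens in the source language.
    alignment:  dict mapping a source token to an aligned target token
                (e.g. extracted from word alignments over a parallel corpus).
    """
    rng = random.Random(seed)
    mixed = []
    for tok in src_tokens:
        if tok in alignment and rng.random() < ratio:
            mixed.append(alignment[tok])  # switch to the other language
        else:
            mixed.append(tok)
    return mixed
```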
Considering that code-mixed text belongs to less documented domains than most, domain adaptation from better-resourced data domains may be needed. Our work is inspired by the following approaches: Wang et al. (2019) performed simultaneous data selection across several domains by gradually focusing on multi-domain relevant and noise-reduced data batches, carefully introducing instance-level domain-relevance features and automatically constructing a training curriculum. Park et al. (2022) demonstrated that instance-level features distinguish between domains better than corpus-level attributes. Dou et al. (2019) proposed modeling the differences between domains instead of smoothing over them for machine translation.
Anwar et al. (2022) showed that an encoder alignment objective is beneficial for code-mixed translation, complementing Arivazhagan et al. (2019), who proposed auxiliary losses on the NMT encoder that impose representational invariance across languages for multilingual machine translation.
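As a rough sketch of what such an alignment objective can look like, the code below adds an auxiliary loss that pulls the mean-pooled encoder representations of a parallel sentence pair together with a cosine penalty. The pooling choice, the penalty form, and the weighting term are assumptions for illustration, not the exact losses of Anwar et al. (2022) or Arivazhagan et al. (2019).

```python
import torch
import torch.nn.functional as F

def alignment_loss(enc_src, enc_tgt, src_mask, tgt_mask):
    """Auxiliary loss encouraging the encoder to map a sentence and its
    translation to nearby representations (mean-pooled over real tokens).

    enc_src, enc_tgt: (batch, seq_len, dim) encoder outputs for the two sides.
    src_mask, tgt_mask: (batch, seq_len) with 1 for real tokens, 0 for padding.
    """
    def mean_pool(states, mask):
        mask = mask.unsqueeze(-1).float()
        return (states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

    src_vec = mean_pool(enc_src, src_mask)
    tgt_vec = mean_pool(enc_tgt, tgt_mask)
    # 1 - cosine similarity: zero when the two representations coincide.
    return (1.0 - F.cosine_similarity(src_vec, tgt_vec, dim=-1)).mean()

# total_loss = translation_loss + lam * alignment_loss(...)  # lam: tuned weight
```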
English                                                                    | Code-Mixed (CMX)
@dh*v*l2410*6 sure brother :)                                              | @dh*v*l2410*6 sure bhai :)
"I just need reviews like these, this motivates me a lot"                  | "Bas aise hi reviews ki zaroorat hai, kaafi protsahan milta hai in baaton se."
When the sorrow got missing in this room, the blood also became thin, #GuessTheSong | Jab gam ye rum mein kho gaya, toh khoon bhi patla hogaya #GuessTheSong

Table 1: Examples from the WMT Shared Task Dataset.
3 Data
In Table 1 we show some samples from the WMT shared task, sourced from the non-synthetic validation data. The data provided for Subtask-1 (Srivastava and Singh, 2021) contains synthetic and human-generated data, and Subtask-2 uses the Parallel Hinglish Social Media Code-Mixed Corpus (Sri-