
from the same or similar domains. When data requirements differ across domains, the model performs better when data from the most relevant domain is presented as preliminary batches rather than used for finetuning. Since the main source of natural code-mixed data is social media, where it is difficult to gather enough material to train a model, a strategy is needed that makes the model prioritize this form of data above others. Accordingly, we explore less expensive techniques to supplement training data: pretraining on widely available data from different domains, strategically constructing synthetic data, and apportioning data to make up for missing domains. Our results show improved performance on natural code-mixed data (and non-synthetic WMT test set samples) when this data was prioritized, as well as strong performance on a test set mixing several other data sources. Finetuning improved domain-specific evaluation but sharply degraded performance on the other 'pretraining domains', whereas passing the domain of interest in the preliminary batches of a single 'all-domain' training run yielded more balanced performance.
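For concreteness, the following is a minimal Python sketch of the batch-scheduling idea described above, i.e. placing the domain of interest in the preliminary batches of a single 'all-domain' run. The variable names and batching interface are illustrative assumptions, not our exact pipeline.

import random

def domain_first_batches(codemix_pairs, other_domain_pairs, batch_size=32):
    # Preliminary batches: only the most relevant domain (code-mixed data).
    codemix_pairs = list(codemix_pairs)
    random.shuffle(codemix_pairs)
    for i in range(0, len(codemix_pairs), batch_size):
        yield codemix_pairs[i:i + batch_size]
    # Remaining batches: all domains shuffled together in one training run.
    mixed = codemix_pairs + list(other_domain_pairs)
    random.shuffle(mixed)
    for i in range(0, len(mixed), batch_size):
        yield mixed[i:i + batch_size]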
2 Related Work
It is laborious to obtain 'one-fits-all' training data for NMT. Most publicly available parallel corpora, such as Tanzil, OPUS, and UNPC, are sourced from documented communication and are often domain-specific. Data selection, e.g. Axelrod et al. (2011), has remained an underlying and important research concern in NMT. Choosing training examples that are relevant to the target domain, or choosing high-quality examples for data cleaning (also known as denoising), has been essential in domain adaptation. Building a large-scale multi-domain NMT model that excels on several domains simultaneously is both technically difficult and practically demanding. Work addressing research problems such as catastrophic forgetting (Goodfellow et al., 2013) and data balancing (Wang et al., 2020), as well as Adapters (Houlsby et al., 2019), has shown improvements. Unfortunately, several domains are difficult to handle with the single-domain data selection techniques currently in use; for instance, improving the translation quality of one domain will often hurt that of another (Britz et al., 2017; van der Wees et al., 2017).
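As a concrete illustration of relevance-based data selection, the Python sketch below implements cross-entropy-difference scoring in the spirit of Axelrod et al. (2011); the score(sentence) interface, which we assume returns per-token cross-entropy, is a hypothetical stand-in for whatever language models are available.

def select_in_domain(candidates, in_domain_lm, general_lm, top_k=10000):
    # A lower (in-domain minus general) cross-entropy means the sentence
    # looks more like the target domain and less like generic text.
    scored = [(in_domain_lm.score(s) - general_lm.score(s), s)
              for s in candidates]
    scored.sort(key=lambda pair: pair[0])
    return [s for _, s in scored[:top_k]]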
Song et al. (2019) replaced phrases with pre-specified translations to perform "soft" constraint decoding. Xu and Yvon (2021) generated code-mixed data from regular parallel texts and showed that this training strategy yields MT systems that surpass multilingual systems on code-mixed texts.
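One simple way to construct such synthetic code-mixed data from regular parallel text is to replace aligned source words with their target-side translations. The sketch below assumes word alignments given as (source index, target index) pairs and a replacement rate; both are our own illustrative choices rather than the exact procedure of Xu and Yvon (2021).

import random

def synthesize_code_mixed(src_tokens, tgt_tokens, alignments, rate=0.3):
    # Swap a fraction of aligned source tokens for their target
    # counterparts, producing a synthetic code-mixed sentence.
    mixed = list(src_tokens)
    for src_idx, tgt_idx in alignments:
        if random.random() < rate:
            mixed[src_idx] = tgt_tokens[tgt_idx]
    return mixed

# Example: mixing romanized Hindi into an English fragment.
src = ["sure", "brother"]
tgt = ["sure", "bhai"]
align = [(0, 0), (1, 1)]
print(synthesize_code_mixed(src, tgt, align, rate=1.0))  # ['sure', 'bhai']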
Considering that code-mixed text belongs to less documented domains than most, there may be a need for domain adaptation techniques applied to sufficiently available data domains. Our work is inspired by the following approaches. Wang et al. (2019) performed simultaneous data selection across several domains by gradually focusing on multi-domain relevant and noise-reduced data batches, carefully introducing instance-level domain-relevance features and automatically constructing a training curriculum. Park et al. (2022) demonstrated that instance-level features distinguish between domains better than corpus-level attributes. Dou et al. (2019) proposed modeling the differences between domains instead of smoothing over them for machine translation.
Anwar et al. (2022) showed that an encoder alignment objective is beneficial for code-mixed translation, complementing Arivazhagan et al. (2019), who proposed auxiliary losses on the NMT encoder that impose representational invariance across languages for multilingual machine translation.
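The PyTorch sketch below shows the general shape of such an alignment objective: an auxiliary loss that pulls the encoder representations of parallel sentences together. Mean pooling and the cosine form of the penalty are simplifying assumptions on our part, not the exact losses of Anwar et al. (2022) or Arivazhagan et al. (2019).

import torch
import torch.nn.functional as F

def alignment_loss(src_states, tgt_states, src_mask, tgt_mask):
    # src_states/tgt_states: (batch, seq_len, hidden) encoder outputs
    # src_mask/tgt_mask:     (batch, seq_len), 1 for real tokens
    def mean_pool(states, mask):
        mask = mask.unsqueeze(-1).float()
        return (states * mask).sum(1) / mask.sum(1).clamp(min=1.0)

    src_vec = mean_pool(src_states, src_mask)
    tgt_vec = mean_pool(tgt_states, tgt_mask)
    # Cosine distance between pooled representations of parallel sentences.
    return (1.0 - F.cosine_similarity(src_vec, tgt_vec, dim=-1)).mean()

# The total objective would then be the translation cross-entropy plus a
# weighted alignment term, e.g. loss = ce_loss + 0.1 * alignment_loss(...).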
English | Code-Mixed (CMX)
@dh*v*l2410*6 sure brother :) | @dh*v*l2410*6 sure bhai :)
"I just need reviews like these, this motivates me a lot" | "Bas aise hi reviews ki zaroorat hai, kaafi protsahan milta hai in baaton se."
When the sorrow got missing in this room, the blood also became thin, #GuessTheSong | Jab gam ye rum mein kho gaya, toh khoon bhi patla hogaya #GuessTheSong

Table 1: Examples from the WMT Shared Task Dataset.
3 Data
In Table 1 we show some samples from the WMT shared task, sourced from the non-synthetic validation data. The data provided for Subtask-1 (Srivastava and Singh, 2021) contains synthetic and human-generated data, and Subtask-2, the Parallel Hinglish Social Media Code-Mixed Corpus (Sri-