
in NLP with code-mixed text (such as understand-
ing cultural and linguistic context, evaluation, and
a lack of user-facing applications), the most no-
table obstacle for this shared task is the lack of
data. They note that there are very few code-mixed
datasets, making it challenging to build deep learn-
ing models such as those for NMT. In this work, we
use backtranslation as our main data augmentation
method (Edunov et al., 2020; Barrault et al., 2020; Akhbardeh et al., 2021, inter alia). This allows
us to leverage the larger amount of monolingual
data for better final model performance. The XLM toolkit (Lample and Conneau, 2019) seemed an ideal choice for backtranslating our Hinglish data, as it has shown promising results in unsupervised and semi-supervised settings where parallel data is sparse but monolingual data is ample. Moreover, given that Hinglish is closely related to both languages, we believed it would be well suited to a semi-supervised setting.
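As a concrete illustration, the backtranslation step can be sketched as follows; `translate_backward` is a hypothetical stub standing in for a trained target-to-source model (in our case Hinglish-to-English), not the actual XLM system:

```python
# A minimal sketch of backtranslation for data augmentation.
# `translate_backward` is a hypothetical placeholder for a trained
# target->source NMT model (e.g. one trained with the XLM toolkit).

def translate_backward(hinglish_sentence: str) -> str:
    """Placeholder for a real target->source translation model."""
    return "english translation of: " + hinglish_sentence

def backtranslate(monolingual_target: list) -> list:
    """Turn monolingual target-side text into synthetic parallel pairs.

    The synthetic source side is machine-generated, while the target side
    is genuine human text, which is what makes the augmentation useful.
    """
    return [(translate_backward(t), t) for t in monolingual_target]

synthetic = backtranslate(["mujhe yeh movie bahut pasand aayi"])
```

The resulting synthetic pairs are then mixed with the genuine parallel data to train the final source-to-target model.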
2.2 Constrained decoding
Constrained decoding involves applying restrictions to the generation of output tokens during inference. Most implementations aim to ensure that desired vocabulary items appear in the target-side sequence (Hokamp and Liu, 2017; Hasler et al., 2018; Post and Vilar, 2018). Alternatively,
Kajiwara (2019) paraphrases an input sentence by forcing the output to exclude source words, and Chen et al. (2020) constrain NMT decoding to follow a corpus stored in a trie data structure in order to find parallel sentences.
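To make the idea concrete, here is a toy sketch of negatively constrained decoding in the spirit of Kajiwara (2019); the per-step token scores are invented for illustration, and a real system would instead mask NMT logits during beam search:

```python
# Toy negative lexical constraint: tokens from the source sentence are
# banned, so the decoder must paraphrase. The per-step token scores below
# are invented for illustration; a real system masks model logits instead.

def greedy_decode_with_ban(step_scores, banned):
    """Greedily pick the best-scoring token per step, skipping banned ones."""
    output = []
    for scores in step_scores:  # one dict of token -> score per time step
        allowed = {tok: s for tok, s in scores.items() if tok not in banned}
        output.append(max(allowed, key=allowed.get))
    return output

source_words = {"good", "film"}  # words to keep out of the output
steps = [
    {"good": 0.9, "great": 0.8, "nice": 0.1},
    {"film": 0.7, "movie": 0.6},
]
decoded = greedy_decode_with_ban(steps, banned=source_words)
# "good" and "film" are skipped in favour of "great" and "movie"
```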
To the best of our knowledge, previous linguistics research investigated and applied grammatical constraints in code-mixing (Sciullo et al., 1986; Belazi et al., 1994; Li and Fung, 2013), whereas our work introduces the novel method of lexical constraints.
2.3 Aligned augmentation
Several recent works (Yang et al., 2020a,b; Lin et al., 2020; Pan et al., 2021) have explored enhancing cross-lingual transfer learning by pretraining models on the task of ‘denoising’ artificially code-mixed text. Methods to create the necessary code-mixed data vary, and include bilingual or multilingual datasets and word aligners (Yang et al., 2020a, 2021), lexicons (Yang et al., 2020b; Lin et al., 2020; Pan et al., 2021), or combining code-mixed noising with traditional masked noising approaches (Li et al., 2022).
The most successful among these methods is Aligned Augmentation (AA) (Pan et al., 2021), which randomly substitutes words in the source sentence with their word-level translations, as obtained from a MUSE (Lample et al., 2018) dictionary. Pan et al. (2021) showed that their technique
can effectively align multilingual semantic word
representations and boost performance across var-
ious languages. However, these methods focus
on training general-purpose MT models. In this
work, we investigate their utility for translating real
human-generated code-mixed text.
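The substitution step of AA can be sketched as follows; the toy lexicon stands in for a real MUSE dictionary, and the function and parameter names (`aligned_augment`, `sub_prob`) are our own illustrative choices, not those of Pan et al. (2021):

```python
import random

# Sketch of the word-substitution step in Aligned Augmentation: each source
# word found in a bilingual dictionary is replaced by one of its word-level
# translations with some probability. The toy lexicon stands in for MUSE.

def aligned_augment(sentence, lexicon, sub_prob=0.3, seed=0):
    rng = random.Random(seed)  # seeded for reproducibility
    out = []
    for word in sentence.split():
        if word in lexicon and rng.random() < sub_prob:
            out.append(rng.choice(lexicon[word]))  # word-level translation
        else:
            out.append(word)
    return " ".join(out)

toy_lexicon = {"house": ["ghar"], "big": ["bada"]}
augmented = aligned_augment("the big house is old", toy_lexicon, sub_prob=1.0)
# with sub_prob=1.0 every dictionary word is replaced
```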
2.4 Automatic evaluation metrics
Automatic translation evaluation is usually done using BLEU (Papineni et al., 2002), yet there is no comprehensive study on its suitability for code-switched translation. For this task specifically, the organisers announced that the participating systems would be evaluated using ROUGE-L (Lin, 2004) and word error rate (WER). Nonetheless, the packages implementing these metrics were not specified. Since ROUGE comes with different language, stemming, and tokenisation settings, we instead used BLEU, ChrF++ (Popović, 2017), translation error rate (TER), and WER² for our internal validation. The first three are as implemented in sacreBLEU (Post, 2018). We stick to the default configurations, except that the ChrF word n-gram order is explicitly set to 2 to make it ChrF++. In addition, the organisers performed a small-scale human evaluation on 20 test instances for all submissions.
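For reference, WER is word-level edit distance normalised by reference length; the following self-contained sketch computes the same quantity that packages such as jiwer report for a single sentence pair:

```python
# Word error rate: Levenshtein distance over words divided by the number
# of reference words (what jiwer computes for a single sentence pair).

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

score = wer("yeh movie bahut achhi hai", "yeh film bahut achhi hai")
# one substitution out of five reference words -> 0.2
```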
In this work, we advocate for a character-based metric when evaluating the Hinglish output in subtask 1, because the code-switched language has no formal spelling or standardised grammar, and words may have a diverse range of acceptable transliterations and lexical forms.
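To illustrate the argument: a word-level exact match gives no credit to an acceptable transliteration variant, whereas a character n-gram F-score (the principle behind ChrF) gives partial credit. The bigram scorer below is a simplified sketch of that principle, not the actual ChrF implementation:

```python
# Simplified character-bigram F1, illustrating why character-based metrics
# suit Hinglish: spelling variants of the same word still share most
# character n-grams. This is a sketch, not the real ChrF computation.

def char_bigrams(text: str) -> list:
    return [text[i:i + 2] for i in range(len(text) - 1)]

def bigram_f1(ref: str, hyp: str) -> float:
    ref_bg, hyp_bg = char_bigrams(ref), char_bigrams(hyp)
    overlap = sum(min(ref_bg.count(b), hyp_bg.count(b)) for b in set(hyp_bg))
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_bg)
    recall = overlap / len(ref_bg)
    return 2 * precision * recall / (precision + recall)

# "achhi" and "acchi" are both plausible transliterations: an exact word
# match scores 0, but they share 3 of 4 character bigrams.
score = bigram_f1("achhi", "acchi")  # -> 0.75
```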
3 Subtask 1: Translating into Hinglish
Good-quality Hinglish data is hard to come by, and parallel Hinglish data with Hindi or English is scarcer still. Therefore, for both subtasks
we concentrated our efforts on generating good
Hinglish backtranslation. We planned to use the
model which produced the highest quality Hinglish
for subtask 1 as our backtranslator for subtask 2,
hence we focused our efforts on each subtask se-
quentially.
²https://github.com/jitsi/jiwer