The University of Edinburgh’s Submission to the WMT22 Code-Mixing
Shared Task (MixMT)
Faheem Kirefu Vivek Iyer Pinzhen Chen Laurie Burchell
School of Informatics, University of Edinburgh
{fkirefu,vivek.iyer,pinzhen.chen,laurie.burchell}@ed.ac.uk
Abstract
The University of Edinburgh participated in
the WMT22 shared task on code-mixed trans-
lation. This consists of two subtasks: i) generating code-mixed Hindi/English (Hinglish) text from parallel Hindi and English sentences and ii) machine translation
from Hinglish to English. As both subtasks
are considered low-resource, we focused our
efforts on careful data generation and cura-
tion, especially the use of backtranslation from
monolingual resources. For subtask 1 we ex-
plored the effects of constrained decoding on
English and transliterated subwords in order to
produce Hinglish. For subtask 2, we investi-
gated different pretraining techniques, namely
comparing simple initialisation from existing
machine translation models and aligned aug-
mentation. For both subtasks, we found that
our baseline systems worked best. Our systems for both subtasks were among the overall top-performing submissions.
1 Introduction
Code-mixing is the shift from one language to
another within a single conversation or utterance
(Sitaram et al.,2019). It is an extremely common
and diverse communicative phenomenon world-
wide (Doğruöz et al., 2021; Sitaram et al., 2019),
though one which is currently under-served by
many NLP technologies (Solorio et al.,2021).
One of the most well-known examples of code-
mixing is between Hindi and English, commonly
referred to as Hinglish¹. It is extremely common
amongst Hindi-English bilingual speakers in both
speech and text, used across a range of genres and
media (Parshad et al.,2016), and has its own dis-
tinctive features and linguistic forms (Kumar,1986;
Sailaja,2011). The process of generating Hinglish
from the written text is non-trivial, as code-mixing
¹In the scope of this paper, we designate “hg” as the language code for Hinglish.
may happen at the phrase or word level, but Hindi
and English differ substantially syntactically.
As a novel addition to the current code-mixing
NLP research, we investigated lexically constrain-
ing the Hinglish output in subtask 1 to only contain
words from English and Hindi sources. Through
analysis, we demonstrated that transliteration mis-
matches could affect performance.
Another novel approach we explore for this task,
particularly for subtask 2, is a denoising-based pre-
training technique called Aligned Augmentation
(AA) (Pan et al.,2021). AA, which trains MT
models to denoise artificially generated code-mixed
text, was shown by Pan et al. (2021) to boost trans-
lation performance across a variety of languages, thanks to the enhanced transfer learning brought
about by code-mixed pretraining. In this work, we
explored if this general-purpose approach could be
useful for translating authentic, human-generated
code-mixed text, focusing on Hinglish.
Despite these efforts, we found that for both sub-
tasks our original baselines worked better and con-
stituted our final submissions for this task, which
ranked as one of the top-performing systems for
both subtasks, by both automatic and human evalu-
ation. We hope that our methods, particularly our Hinglish data generation, which allowed us to build these systems, will be useful to the community, as will the findings from our additional research explorations.
2 Related Work
2.1 Code-mixing
Due to an increasing prevalence of code-mixed
data on the Internet, there is a growing body of re-
search into code-mixing, particularly for Hinglish,
in the NLP community. Doğruöz et al. (2021) provide a comprehensive literature review of code-mixing in the context of language technologies.
arXiv:2210.11309v1 [cs.CL] 20 Oct 2022

Whilst they highlight several challenges inherent in NLP with code-mixed text (such as understanding cultural and linguistic context, evaluation, and
table obstacle for this shared task is the lack of
data. They note that there are very few code-mixed
datasets, making it challenging to build deep learn-
ing models such as those for NMT. In this work, we
use backtranslation as our main data augmentation
method (Edunov et al.,2020;Barrault et al.,2020;
Akhbardeh et al.,2021,inter alia). This allows
us to leverage the larger amount of monolingual
data for better final model performance. The XLM
toolkit (Lample and Conneau,2019) seemed an
ideal choice to backtranslate our Hinglish. This is
because it has shown promising results in unsuper-
vised and semi-supervised settings where parallel
data is sparse but monolingual data is ample. Also, given that Hinglish is closely related to both languages, we believed it would be well suited to a semi-supervised setting.
2.2 Constrained decoding
Constrained decoding involves applying restric-
tions to the generation of output tokens during infer-
ence. Most implementations have the goal of ensur-
ing that desired vocabulary items appear in the tar-
get side sequence (Hokamp and Liu,2017;Hasler
et al.,2018;Post and Vilar,2018). Alternatively,
Kajiwara (2019) paraphrase an input sentence by
forcing the output to not include source words, and
Chen et al. (2020) constrain NMT decoding to fol-
low a corpus built in a trie data structure to find
parallel sentences.
To the best of our knowledge, previous linguis-
tics research investigated and applied the grammati-
cal constraints in code-mixing (Sciullo et al.,1986;
Belazi et al.,1994;Li and Fung,2013), rather than
the novel method in our work of introducing lexical
constraints.
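As a minimal illustration of the idea behind lexical constraints (a sketch, not the beam-search algorithms of Hokamp and Liu or Post and Vilar cited above), the decoder's output distribution can be masked at each step so that only an allowed set of token ids is eligible:

```python
import math

def mask_logits(logits, allowed_ids):
    """Set every vocabulary position outside the allowed set to -inf,
    so that disallowed tokens can never be selected."""
    return [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]

def greedy_step(logits, allowed_ids):
    """One greedy decoding step under the lexical constraint."""
    masked = mask_logits(logits, allowed_ids)
    return max(range(len(masked)), key=lambda i: masked[i])

# Token 1 has the highest raw score, but only tokens {0, 2} are allowed,
# so the constrained step picks token 2 instead.
print(greedy_step([0.1, 2.0, 0.5], {0, 2}))
```

In a real NMT decoder the same masking would be applied to the logits inside beam search rather than a greedy loop.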
2.3 Aligned augmentation
Several recent works (Yang et al.,2020a,b;Lin
et al.,2020;Pan et al.,2021) have explored enhanc-
ing cross-lingual transfer learning by pretraining
models on the task of ‘denoising’ artificially code-
mixed text. Methods to create the necessary code-
mixed data vary, and include bilingual or multilin-
gual datasets and word aligners (Yang et al.,2020a,
2021), lexicons (Yang et al.,2020b;Lin et al.,2020;
Pan et al.,2021), or combining code-mixed nois-
ing with traditional masked noising approaches (Li
et al.,2022).
The most successful among these methods is Aligned Augmentation (AA) (Pan et al., 2021), which randomly substitutes words in the source sentence with their word-level translations, as obtained from a MUSE (Lample et al., 2018) dictionary. Pan et al. (2021) showed that their technique
can effectively align multilingual semantic word
representations and boost performance across var-
ious languages. However, these methods focus
on training general-purpose MT models. In this
work, we investigate their utility for translating real
human-generated code-mixed text.
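The noising step of AA can be sketched as follows; the lexicon entries here are illustrative toy examples, whereas the real method draws on a full MUSE dictionary:

```python
import random

# Illustrative English->Hindi entries (not from the real MUSE dictionary).
LEXICON = {"house": "घर", "water": "पानी", "big": "बड़ा"}

def aligned_augment(tokens, lexicon, p=0.3, seed=0):
    """Sketch of Aligned Augmentation's noising step: each source word with
    a dictionary entry is replaced by its translation with probability p,
    yielding synthetic code-mixed text for the model to denoise."""
    rng = random.Random(seed)
    return [lexicon[t] if t in lexicon and rng.random() < p else t
            for t in tokens]

# With p=1.0 every dictionary word is swapped:
print(aligned_augment("the big house".split(), LEXICON, p=1.0))
```

The pretraining objective then asks the model to reconstruct the original monolingual sentence from this code-mixed input.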
2.4 Automatic evaluation metrics
Automatic translation evaluation is usually done
using BLEU (Papineni et al.,2002), yet there is
no comprehensive study on its suitability for code-
switched translation. Specifically for this task, the organisers announced that participating systems would be evaluated using ROUGE-L (Lin, 2004)
and word error rate (WER). Nonetheless, the pack-
ages implementing these metrics were not speci-
fied. Since ROUGE comes with different language,
stemming and tokenisation settings, we instead
used BLEU, ChrF++ (Popović, 2017), translation error rate (TER), and WER² for our internal validation. The first three are as implemented with sacreBLEU (Post, 2018). We stick to the default configurations, except that the ChrF word n-gram order is explicitly set to 2 to make it ChrF++. In
addition, the organisers performed a small-scale
human evaluation on 20 test instances for all sub-
missions.
In this work, we advocate for a character-based
metric when evaluating the Hinglish output in sub-
task 1. This is because for the code-switched lan-
guage, there is no formal spelling or defined gram-
mar, and words may have a diverse range of accept-
able transliterations and lexical forms.
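Why character-level matching is forgiving of spelling variation can be seen from a toy character n-gram F-score. This is a simplification of ChrF/ChrF++ (the real metric averages over several n-gram orders and, for ChrF++, adds word n-grams), kept minimal for illustration:

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-grams of a string, whitespace removed (as in ChrF)."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def char_fscore(hyp, ref, n=3, beta=2.0):
    """Toy single-order character n-gram F-score (not the full ChrF metric)."""
    h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
    if not h or not r:
        return 0.0
    overlap = sum((h & r).values())
    prec = overlap / sum(h.values())
    rec = overlap / sum(r.values())
    if prec + rec == 0:
        return 0.0
    return (1 + beta ** 2) * prec * rec / (beta ** 2 * prec + rec)

# Two transliteration variants of the same Hinglish phrase still share most
# character trigrams, so they score far higher than an unrelated sentence:
print(char_fscore("mujhe pata hai", "mujhe patta hai"))
print(char_fscore("mujhe pata hai", "i know it"))
```

A word-level metric such as BLEU would treat "pata" and "patta" as a complete mismatch, whereas the character-level score degrades gracefully.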
3 Subtask 1: Translating into Hinglish
Good quality Hinglish data is hard to come by, and parallel Hinglish data paired with Hindi or English is scarcer still. Therefore, for both subtasks
we concentrated our efforts on generating good
Hinglish backtranslation. We planned to use the
model which produced the highest quality Hinglish
for subtask 1 as our backtranslator for subtask 2,
hence we focused our efforts on each subtask se-
quentially.
²https://github.com/jitsi/jiwer
3.1 Data cleaning and preprocessing
After deduplicating the data, we removed non-
printing characters and normalised the punctuation.
We then ran rule-based filters, removing any sen-
tences with fewer than two or more than 150 words,
where fewer than 40% of the words are written in
the relevant script, or where over 50% of characters
are not letters in the relevant script. For English
and Hindi, we ran fasttext language ID and removed any sentence which was not classified as the relevant language.³ For Hinglish, we also removed
any sentence with a predicted probability of En-
glish greater than 0.99 in order to remove sentences
that were solely in English. We tokenised English
and Hinglish using Moses scripts (Koehn et al.,
2007) and tokenised Hindi using the
indicnlp
library (Kunchukuttan,2020).
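A rough sketch of the rule-based filters for Hindi text might look like the following. The thresholds (2-150 words, 40% script words, 50% non-letters) are taken from the description above; the detail of counting any word containing a Devanagari character as "in the relevant script" is an illustrative assumption, and the actual scripts are adapted from the Bergamot project:

```python
import re
import unicodedata

DEVANAGARI = re.compile(r"[\u0900-\u097F]")

def keep_hindi_sentence(sent, min_words=2, max_words=150,
                        min_script_word_ratio=0.4, max_nonletter_ratio=0.5):
    """Toy version of the paper's rule-based filters, for Hindi text."""
    words = sent.split()
    # Rule 1: sentence length between 2 and 150 words.
    if not (min_words <= len(words) <= max_words):
        return False
    # Rule 2: at least 40% of words written in the relevant script
    # (here: containing at least one Devanagari character).
    in_script = sum(1 for w in words if DEVANAGARI.search(w))
    if in_script / len(words) < min_script_word_ratio:
        return False
    # Rule 3: no more than 50% of non-space characters may be non-letters.
    chars = [c for c in sent if not c.isspace()]
    nonletters = sum(1 for c in chars
                     if not unicodedata.category(c).startswith("L"))
    return nonletters / len(chars) <= max_nonletter_ratio
```

Sentences passing these rules would then go through the fasttext language-ID check described above.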
We decided to add explicit preprocessing and
postprocessing capabilities for handling social me-
dia text, given that this was the domain for subtask
2. On both source and target sides, we replaced
URLs, Twitter handles, hashtags and emoticons
each with their own placeholder tokens⁴, to be replaced back from the source after inference. These
placeholders made up 1.7% of the validation set
tokens for subtask 2, far higher than would appear
in general domain data.
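The masking and restoring steps can be sketched as below. The placeholder tokens match those named in the footnote; the regular expressions themselves are illustrative assumptions (the emoticon pattern in particular covers only a few ASCII smileys):

```python
import re

# Hypothetical patterns; the paper specifies only the placeholder tokens.
PATTERNS = [
    (re.compile(r"https?://\S+"), "<URL>"),
    (re.compile(r"@\w+"), "<TH>"),           # Twitter handles
    (re.compile(r"#\w+"), "<HT>"),           # hashtags
    (re.compile(r"[:;]-?[()DP]"), "<EMO>"),  # a few ASCII emoticons
]

def mask(text):
    """Replace social-media items with placeholders, returning the masked
    text plus the original spans so they can be restored after inference."""
    saved = []
    for pat, token in PATTERNS:
        def repl(m, token=token):
            saved.append((token, m.group(0)))
            return token
        text = pat.sub(repl, text)
    return text, saved

def unmask(text, saved):
    """Copy the saved source-side items back over the placeholders."""
    for token, original in saved:
        text = text.replace(token, original, 1)
    return text
```

Because the placeholders are copied back from the source side, the model never has to learn to reproduce URLs or handles verbatim.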
3.1.1 The HinGe dataset
The primary dataset for subtask 1 was the HinGe
dataset (Srivastava and Singh,2021), which con-
sisted of hi-en-hg parallel sentences, with some ex-
amples synthetic and some human-generated. This
was provided to us pre-split into training and de-
velopment sets for both data types. However, we
noticed that these sets were not mutually exclu-
sive, and after deduplication and filtering the synthetic data on its human annotations⁵, we had 6,727
hi-en-hg examples in total.
3.1.2 Base hi↔en translation models
Firstly, we trained four Transformer-base (Vaswani
et al.,2017) models with different seeds using Mar-
ian (Junczys-Dowmunt et al., 2018) for both hi→en and en→hi directions, using the data from the hi-en
³Our cleaning scripts are adapted from those provided by the Bergamot project: https://github.com/browsermt/students/tree/master/train-student. Specifically, we add support for Hindi and Hinglish text.
⁴<URL>, <TH>, <HT> and <EMO> respectively
⁵We only kept sentences with an average rating greater than 4, and annotator disagreement less than 5.
parallel Samanantar corpus⁶ (Ramesh et al., 2021).
Given the findings of Ding et al. (2019) with regard
to vocabulary choice for low-resource scenarios,
and that our task inherently contains transliteration,
we opted for a low BPE (Sennrich et al.,2016)
merge size of 4k, resulting in a small joint vocab-
ulary of 7.9k. We used the hi-en FLORES devel-
opment set (Goyal et al.,2022) for validation and
early stopping, and noticed our model produced
surprisingly good quality translations in both di-
rections⁷. We used these models (along with vo-
cabulary) to both initialise subsequent models and
generate backtranslation for more training data.
3.1.3 Hinglish data
L3Cube-HingCorpus (Nayak and Joshi,2022) and
CC-100 Hindi Romanized (Conneau et al.,2020a)
are two Hinglish corpora that we wished to back-
translate into both English and Hindi. Given that
we only had a small amount of parallel Hinglish
data, compared to our ‘monolingual’ datasets, we
used the XLM toolkit (Lample and Conneau,2019)
to train a semi-supervised model (see Appendix A
for details). We then backtranslated the monolin-
gual Hinglish data into both English and Hindi.
However, given the noisy quality of the data and
translations themselves, we decided to evaluate
them using our hi→en and en→hi Marian models.
Specifically, for an en-hi backtranslated (XLM)
sentence pair, we translated the en/hi into hi/en re-
spectively, then evaluated the double translated out-
put using ChrF, with the XLM backtranslations as
the references. We then took a mean of the English
and Hindi ChrF score to get our final confidence
value. We used the resulting hg-en-hi sentence trios with scores of at least 0.4, as a compromise between the quality and quantity of data available for training. Most of the sentences scored quite poorly,
and filtering on 0.4 yielded 2.1M sentences, only
about 12% of the original Hinglish monolingual
dataset.
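The round-trip confidence filtering described above can be sketched as follows; the function and variable names are illustrative, and the ChrF scorer is abstracted as a callable returning a score in [0, 1]:

```python
def roundtrip_confidence(bt_en, bt_hi, rt_en, rt_hi, chrf):
    """Mean of the English and Hindi round-trip ChrF scores: bt_* are the
    XLM backtranslations (used as references) and rt_* are their re-
    translations by the hi<->en Marian models (used as hypotheses)."""
    return (chrf(rt_en, bt_en) + chrf(rt_hi, bt_hi)) / 2

def filter_trios(trios, chrf, threshold=0.4):
    """Keep hg-en-hi trios whose mean round-trip ChrF meets the threshold.
    Each trio carries the Hinglish source, both backtranslations, and both
    re-translations."""
    kept = []
    for hg, bt_en, bt_hi, rt_en, rt_hi in trios:
        if roundtrip_confidence(bt_en, bt_hi, rt_en, rt_hi, chrf) >= threshold:
            kept.append((hg, bt_en, bt_hi))
    return kept
```

In practice `chrf` would be a sentence-level scorer such as sacreBLEU's CHRF; here it is left injectable so the filtering logic stands alone.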
3.1.4 Transliteration
In order to best leverage the Samanantar hi-en par-
allel corpus, we transliterated the Hindi side into
Roman script⁸, on the word level. Although this
⁶Each sentence was annotated with the LaBSE (Feng et al., 2022) Alignment Score (between 0 and 1), so we filtered out values less than 0.65, resulting in around 10.1M sentences.
⁷sacreBLEU: 33.8 for hi→en and 32.7 for en→hi on the FLORES development set.
⁸In the scope of this paper, we use “ht” to denote pure romanised Hindi transliteration.