The University of Edinburgh’s Submission to the WMT22 Code-Mixing
Shared Task (MixMT)
Faheem Kirefu Vivek Iyer Pinzhen Chen Laurie Burchell
School of Informatics, University of Edinburgh
{fkirefu,vivek.iyer,pinzhen.chen,laurie.burchell}@ed.ac.uk
Abstract
The University of Edinburgh participated in
the WMT22 shared task on code-mixed trans-
lation. This consists of two subtasks: i) generating code-mixed Hindi/English (Hinglish) text from parallel Hindi and English sentences and ii) machine translation
from Hinglish to English. As both subtasks
are considered low-resource, we focused our
efforts on careful data generation and cura-
tion, especially the use of backtranslation from
monolingual resources. For subtask 1 we ex-
plored the effects of constrained decoding on
English and transliterated subwords in order to
produce Hinglish. For subtask 2, we investi-
gated different pretraining techniques, namely
comparing simple initialisation from existing
machine translation models and aligned aug-
mentation. For both subtasks, we found that
our baseline systems worked best. Our systems for both subtasks were among the overall top-performing submissions.
1 Introduction
Code-mixing is the shift from one language to
another within a single conversation or utterance
(Sitaram et al.,2019). It is an extremely common
and diverse communicative phenomenon world-
wide (Doğruöz et al., 2021; Sitaram et al., 2019),
though one which is currently under-served by
many NLP technologies (Solorio et al.,2021).
One of the most well-known examples of code-
mixing is between Hindi and English, commonly
referred to as Hinglish¹. It is extremely common
amongst Hindi-English bilingual speakers in both
speech and text, used across a range of genres and
media (Parshad et al.,2016), and has its own dis-
tinctive features and linguistic forms (Kumar,1986;
Sailaja,2011). The process of generating Hinglish
from the written text is non-trivial, as code-mixing
¹In the scope of this paper, we designate “hg” as the language code for Hinglish.
may happen at the phrase or word level, but Hindi
and English differ substantially syntactically.
As a novel addition to the current code-mixing
NLP research, we investigated lexically constrain-
ing the Hinglish output in subtask 1 to only contain
words from English and Hindi sources. Through
analysis, we demonstrated that transliteration mis-
matches could affect performance.
Another novel approach we explore for this task,
particularly for subtask 2, is a denoising-based pre-
training technique called Aligned Augmentation
(AA) (Pan et al.,2021). AA, which trains MT
models to denoise artificially generated code-mixed
text, was shown by Pan et al. (2021) to boost trans-
lation performance across a variety of languages, thanks to the enhanced transfer learning brought
about by code-mixed pretraining. In this work, we
explored if this general-purpose approach could be
useful for translating authentic, human-generated
code-mixed text, focusing on Hinglish.
Despite these efforts, we found that for both sub-
tasks our original baselines worked better and con-
stituted our final submissions for this task, which
ranked as one of the top-performing systems for
both subtasks, by both automatic and human evalu-
ation. We hope that our methods, particularly our Hinglish data generation, which allowed us to build these systems, will be useful to the community, as will the findings from our additional research explorations.
2 Related Work
2.1 Code-mixing
Due to an increasing prevalence of code-mixed
data on the Internet, there is a growing body of re-
search into code-mixing, particularly for Hinglish,
in the NLP community. Doğruöz et al. (2021) provide a comprehensive literature review of code-mixing in the context of language technologies.
arXiv:2210.11309v1 [cs.CL] 20 Oct 2022

Whilst they highlight several challenges inherent in NLP with code-mixed text (such as understanding cultural and linguistic context, evaluation, and
table obstacle for this shared task is the lack of
data. They note that there are very few code-mixed
datasets, making it challenging to build deep learn-
ing models such as those for NMT. In this work, we
use backtranslation as our main data augmentation
method (Edunov et al.,2020;Barrault et al.,2020;
Akhbardeh et al.,2021,inter alia). This allows
us to leverage the larger amount of monolingual
data for better final model performance. The XLM
toolkit (Lample and Conneau,2019) seemed an
ideal choice to backtranslate our Hinglish. This is
because it has shown promising results in unsuper-
vised and semi-supervised settings where parallel
data is sparse but monolingual data is ample. Also, given that Hinglish is closely related to both languages, we believed it would be well suited to a semi-supervised setting.
2.2 Constrained decoding
Constrained decoding involves applying restric-
tions to the generation of output tokens during infer-
ence. Most implementations have the goal of ensur-
ing that desired vocabulary items appear in the tar-
get side sequence (Hokamp and Liu,2017;Hasler
et al.,2018;Post and Vilar,2018). Alternatively,
Kajiwara (2019) paraphrase an input sentence by
forcing the output to not include source words, and
Chen et al. (2020) constrain NMT decoding to fol-
low a corpus built in a trie data structure to find
parallel sentences.
To the best of our knowledge, previous linguis-
tics research investigated and applied the grammati-
cal constraints in code-mixing (Sciullo et al.,1986;
Belazi et al.,1994;Li and Fung,2013), rather than
the novel method in our work of introducing lexical
constraints.
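As a minimal illustration of the idea behind lexical constraints (a sketch, not the beam-search algorithms of Hokamp and Liu or Post and Vilar cited above), the decoder's output distribution can be masked at each step so that only an allowed set of token ids is eligible:

```python
import math

def mask_logits(logits, allowed_ids):
    """Set every vocabulary position outside the allowed set to -inf,
    so that disallowed tokens can never be selected."""
    return [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]

def greedy_step(logits, allowed_ids):
    """One greedy decoding step under the lexical constraint."""
    masked = mask_logits(logits, allowed_ids)
    return max(range(len(masked)), key=lambda i: masked[i])

# Token 1 has the highest raw score, but only tokens {0, 2} are allowed,
# so the constrained step picks token 2 instead.
print(greedy_step([0.1, 2.0, 0.5], {0, 2}))
```

In a real NMT decoder the same masking would be applied to the logits inside beam search rather than a greedy loop.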
2.3 Aligned augmentation
Several recent works (Yang et al.,2020a,b;Lin
et al.,2020;Pan et al.,2021) have explored enhanc-
ing cross-lingual transfer learning by pretraining
models on the task of ‘denoising’ artificially code-
mixed text. Methods to create the necessary code-
mixed data vary, and include bilingual or multilin-
gual datasets and word aligners (Yang et al.,2020a,
2021), lexicons (Yang et al.,2020b;Lin et al.,2020;
Pan et al.,2021), or combining code-mixed nois-
ing with traditional masked noising approaches (Li
et al.,2022).
The most successful among these methods is Aligned Augmentation (AA) (Pan et al., 2021), which randomly substitutes words in the source sentence with their word-level translations, as obtained from a MUSE (Lample et al., 2018) dictionary. Pan et al. (2021) showed that their technique
can effectively align multilingual semantic word
representations and boost performance across var-
ious languages. However, these methods focus
on training general-purpose MT models. In this
work, we investigate their utility for translating real
human-generated code-mixed text.
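The noising step of AA can be sketched as follows; the lexicon entries here are illustrative toy examples, whereas the real method draws on a full MUSE dictionary:

```python
import random

# Illustrative English->Hindi entries (not from the real MUSE dictionary).
LEXICON = {"house": "घर", "water": "पानी", "big": "बड़ा"}

def aligned_augment(tokens, lexicon, p=0.3, seed=0):
    """Sketch of Aligned Augmentation's noising step: each source word with
    a dictionary entry is replaced by its translation with probability p,
    yielding synthetic code-mixed text for the model to denoise."""
    rng = random.Random(seed)
    return [lexicon[t] if t in lexicon and rng.random() < p else t
            for t in tokens]

# With p=1.0 every dictionary word is swapped:
print(aligned_augment("the big house".split(), LEXICON, p=1.0))
```

The pretraining objective then asks the model to reconstruct the original monolingual sentence from this code-mixed input.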
2.4 Automatic evaluation metrics
Automatic translation evaluation is usually done
using BLEU (Papineni et al.,2002), yet there is
no comprehensive study on its suitability for code-
switched translation. Specifically for this task, the organisers announced that participating systems would be evaluated using ROUGE-L (Lin, 2004)
and word error rate (WER). Nonetheless, the pack-
ages implementing these metrics were not speci-
fied. Since ROUGE comes with different language,
stemming and tokenisation settings, we instead
used BLEU, ChrF++ (Popović, 2017), translation error rate (TER), and WER² for our internal validation. The first three are as implemented with sacreBLEU (Post, 2018). We stick to the default configurations, except that the ChrF word n-gram order is explicitly set to 2 to make it ChrF++. In
addition, the organisers performed a small-scale
human evaluation on 20 test instances for all sub-
missions.
In this work, we advocate for a character-based
metric when evaluating the Hinglish output in sub-
task 1. This is because for the code-switched lan-
guage, there is no formal spelling or defined gram-
mar, and words may have a diverse range of accept-
able transliterations and lexical forms.
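Why character-level matching is forgiving of spelling variation can be seen from a toy character n-gram F-score. This is a simplification of ChrF/ChrF++ (the real metric averages over several n-gram orders and, for ChrF++, adds word n-grams), kept minimal for illustration:

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-grams of a string, whitespace removed (as in ChrF)."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def char_fscore(hyp, ref, n=3, beta=2.0):
    """Toy single-order character n-gram F-score (not the full ChrF metric)."""
    h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
    if not h or not r:
        return 0.0
    overlap = sum((h & r).values())
    prec = overlap / sum(h.values())
    rec = overlap / sum(r.values())
    if prec + rec == 0:
        return 0.0
    return (1 + beta ** 2) * prec * rec / (beta ** 2 * prec + rec)

# Two transliteration variants of the same Hinglish phrase still share most
# character trigrams, so they score far higher than an unrelated sentence:
print(char_fscore("mujhe pata hai", "mujhe patta hai"))
print(char_fscore("mujhe pata hai", "i know it"))
```

A word-level metric such as BLEU would treat "pata" and "patta" as a complete mismatch, whereas the character-level score degrades gracefully.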
3 Subtask 1: Translating into Hinglish
Good quality Hinglish data is hard to come by, and parallel Hinglish data paired with Hindi or English is scarcer still. Therefore, for both subtasks
we concentrated our efforts on generating good
Hinglish backtranslation. We planned to use the
model which produced the highest quality Hinglish
for subtask 1 as our backtranslator for subtask 2,
hence we focused our efforts on each subtask se-
quentially.
²https://github.com/jitsi/jiwer
3.1 Data cleaning and preprocessing
After deduplicating the data, we removed non-
printing characters and normalised the punctuation.
We then ran rule-based filters, removing any sen-
tences with fewer than two or more than 150 words,
where fewer than 40% of the words are written in
the relevant script, or where over 50% of characters
are not letters in the relevant script. For English
and Hindi, we ran fasttext language ID and removed any sentence which was not classified as the relevant language.³ For Hinglish, we also removed
any sentence with a predicted probability of En-
glish greater than 0.99 in order to remove sentences
that were solely in English. We tokenised English
and Hinglish using Moses scripts (Koehn et al.,
2007) and tokenised Hindi using the
indicnlp
library (Kunchukuttan,2020).
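A rough sketch of the rule-based filters for Hindi text might look like the following. The thresholds (2-150 words, 40% script words, 50% non-letters) are taken from the description above; the detail of counting any word containing a Devanagari character as "in the relevant script" is an illustrative assumption, and the actual scripts are adapted from the Bergamot project:

```python
import re
import unicodedata

DEVANAGARI = re.compile(r"[\u0900-\u097F]")

def keep_hindi_sentence(sent, min_words=2, max_words=150,
                        min_script_word_ratio=0.4, max_nonletter_ratio=0.5):
    """Toy version of the paper's rule-based filters, for Hindi text."""
    words = sent.split()
    # Rule 1: sentence length between 2 and 150 words.
    if not (min_words <= len(words) <= max_words):
        return False
    # Rule 2: at least 40% of words written in the relevant script
    # (here: containing at least one Devanagari character).
    in_script = sum(1 for w in words if DEVANAGARI.search(w))
    if in_script / len(words) < min_script_word_ratio:
        return False
    # Rule 3: no more than 50% of non-space characters may be non-letters.
    chars = [c for c in sent if not c.isspace()]
    nonletters = sum(1 for c in chars
                     if not unicodedata.category(c).startswith("L"))
    return nonletters / len(chars) <= max_nonletter_ratio
```

Sentences passing these rules would then go through the fasttext language-ID check described above.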
We decided to add explicit preprocessing and
postprocessing capabilities for handling social me-
dia text, given that this was the domain for subtask
2. On both source and target sides, we replaced
URLs, Twitter handles, hashtags and emoticons
each with their own placeholder tokens⁴, to be replaced back from the source after inference. These
placeholders made up 1.7% of the validation set
tokens for subtask 2, far higher than would appear
in general domain data.
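The masking and restoring steps can be sketched as below. The placeholder tokens match those named in the footnote; the regular expressions themselves are illustrative assumptions (the emoticon pattern in particular covers only a few ASCII smileys):

```python
import re

# Hypothetical patterns; the paper specifies only the placeholder tokens.
PATTERNS = [
    (re.compile(r"https?://\S+"), "<URL>"),
    (re.compile(r"@\w+"), "<TH>"),           # Twitter handles
    (re.compile(r"#\w+"), "<HT>"),           # hashtags
    (re.compile(r"[:;]-?[()DP]"), "<EMO>"),  # a few ASCII emoticons
]

def mask(text):
    """Replace social-media items with placeholders, returning the masked
    text plus the original spans so they can be restored after inference."""
    saved = []
    for pat, token in PATTERNS:
        def repl(m, token=token):
            saved.append((token, m.group(0)))
            return token
        text = pat.sub(repl, text)
    return text, saved

def unmask(text, saved):
    """Copy the saved source-side items back over the placeholders."""
    for token, original in saved:
        text = text.replace(token, original, 1)
    return text
```

Because the placeholders are copied back from the source side, the model never has to learn to reproduce URLs or handles verbatim.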
3.1.1 The HinGe dataset
The primary dataset for subtask 1 was the HinGe
dataset (Srivastava and Singh,2021), which con-
sisted of hi-en-hg parallel sentences, with some ex-
amples synthetic and some human-generated. This
was provided to us pre-split into training and de-
velopment sets for both data types. However, we
noticed that these sets were not mutually exclu-
sive, and after deduplication and filtering the synthetic data on its human annotations⁵, we had 6,727
hi-en-hg examples in total.
3.1.2 Base hi↔en translation models
Firstly, we trained four Transformer-base (Vaswani
et al.,2017) models with different seeds using Mar-
ian (Junczys-Dowmunt et al., 2018) for both hi→en and en→hi directions, using the data from the hi-en
³Our cleaning scripts are adapted from those provided by the Bergamot project: https://github.com/browsermt/students/tree/master/train-student. Specifically, we add support for Hindi and Hinglish text.
⁴<URL>, <TH>, <HT> and <EMO> respectively
⁵We only kept sentences with an average rating greater than 4, and annotator disagreement less than 5.
parallel Samanantar corpus⁶ (Ramesh et al., 2021).
Given the findings of Ding et al. (2019) with regard
to vocabulary choice for low-resource scenarios,
and that our task inherently contains transliteration,
we opted for a low BPE (Sennrich et al.,2016)
merge size of 4k, resulting in a small joint vocab-
ulary of 7.9k. We used the hi-en FLORES devel-
opment set (Goyal et al.,2022) for validation and
early stopping, and noticed our model produced
surprisingly good quality translations in both di-
rections⁷. We used these models (along with vo-
cabulary) to both initialise subsequent models and
generate backtranslation for more training data.
3.1.3 Hinglish data
L3Cube-HingCorpus (Nayak and Joshi,2022) and
CC-100 Hindi Romanized (Conneau et al.,2020a)
are two Hinglish corpora that we wished to back-
translate into both English and Hindi. Given that
we only had a small amount of parallel Hinglish
data, compared to our ‘monolingual’ datasets, we
used the XLM toolkit (Lample and Conneau,2019)
to train a semi-supervised model (see Appendix A
for details). We then backtranslated the monolin-
gual Hinglish data into both English and Hindi.
However, given the noisy quality of the data and
translations themselves, we decided to evaluate
them using our hi→en and en→hi Marian models.
Specifically, for an en-hi backtranslated (XLM)
sentence pair, we translated the en/hi into hi/en re-
spectively, then evaluated the double translated out-
put using ChrF, with the XLM backtranslations as
the references. We then took a mean of the English
and Hindi ChrF score to get our final confidence
value. We used the resulting hg-en-hi sentence trios with scores of at least 0.4, as a compromise between the quality and quantity of data available for training. Most of the sentences scored quite poorly,
and filtering on 0.4 yielded 2.1M sentences, only
about 12% of the original Hinglish monolingual
dataset.
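The round-trip confidence filtering described above can be sketched as follows; the function and variable names are illustrative, and the ChrF scorer is abstracted as a callable returning a score in [0, 1]:

```python
def roundtrip_confidence(bt_en, bt_hi, rt_en, rt_hi, chrf):
    """Mean of the English and Hindi round-trip ChrF scores: bt_* are the
    XLM backtranslations (used as references) and rt_* are their re-
    translations by the hi<->en Marian models (used as hypotheses)."""
    return (chrf(rt_en, bt_en) + chrf(rt_hi, bt_hi)) / 2

def filter_trios(trios, chrf, threshold=0.4):
    """Keep hg-en-hi trios whose mean round-trip ChrF meets the threshold.
    Each trio carries the Hinglish source, both backtranslations, and both
    re-translations."""
    kept = []
    for hg, bt_en, bt_hi, rt_en, rt_hi in trios:
        if roundtrip_confidence(bt_en, bt_hi, rt_en, rt_hi, chrf) >= threshold:
            kept.append((hg, bt_en, bt_hi))
    return kept
```

In practice `chrf` would be a sentence-level scorer such as sacreBLEU's CHRF; here it is left injectable so the filtering logic stands alone.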
3.1.4 Transliteration
In order to best leverage the Samanantar hi-en par-
allel corpus, we transliterated the Hindi side into
Roman script⁸, on the word level. Although this
⁶Each sentence was annotated with the LaBSE (Feng et al., 2022) Alignment Score (between 0 and 1), so we filtered out values less than 0.65, resulting in around 10.1M sentences.
⁷sacreBLEU: 33.8 for hi→en and 32.7 for en→hi on the FLORES development set.
⁸In the scope of this paper, we use “ht” to denote pure romanised Hindi transliteration.