Gui at MixMT 2022: English-Hinglish: An MT approach for translation of code mixed data

Akshat Gahoi Jayant Duneja Anshul Padhi Shivam Mangale

Saransh Rajput Tanvi Kamble Dipti Misra Sharma Vasudeva Varma

International Institute of Information Technology, Hyderabad

{akshat.gahoi,anshul.padhi,saransh.rajput,tanvi.kamble}@research.iiit.ac.in

{dunejajayant,shivammangale}@gmail.com

Abstract

Code-mixed machine translation has become an important task in multilingual communities, and extending machine translation to code-mixed data has become common for these languages. In the WMT 2022 shared tasks, we address this problem in both directions: English + Hindi to Hinglish, and Hinglish to English. The first task dealt with both Roman and Devanagari script, as we had monolingual data in both English and Hindi, whereas the second task only had data in Roman script. To our knowledge, we achieved one of the top ROUGE-L and WER scores for the first task of Monolingual to Code-Mixed machine translation. In this paper, we discuss in detail the use of mBART with some special pre-processing and post-processing (transliteration from Devanagari to Roman) for the first task, and the experiments that we performed for the second task of translating code-mixed Hinglish to monolingual English.

1 Introduction

Code mixing occurs when a multilingual individual uses two or more languages while communicating with others. It is the most natural form of conversation for multilinguals. It is often confused with code-switching, but there is a slight difference between the two. Both phenomena involve communicating in multiple languages, but code switching usually takes place across multiple sentences, while code mixing usually refers to words of different languages used in the same sentence. In code mixing, phrases, words and morphemes of one language may be embedded within an utterance of another language. Code mixing is extensively observed on social media sites like Facebook and Twitter. With the rapid growth of social media and the consequent increase in the use of code-mixed data, it becomes important to develop systems to process such text.
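The distinction drawn above between code mixing (languages alternating inside one sentence) and code switching (languages alternating between sentences) can be illustrated with a minimal sketch. It assumes each token already carries a language tag; the tag names ("en", "hi"), the function names and the example sentences are hypothetical and chosen only for illustration, and the rule is a simplification of what the text describes as the usual case.

```python
from typing import List, Set, Tuple

# A sentence is a list of (token, language_tag) pairs. The tags "en" and
# "hi" are assumptions for this sketch, not a tag set from the shared task.
TaggedSentence = List[Tuple[str, str]]


def languages_in(sentence: TaggedSentence) -> Set[str]:
    """Return the set of language tags that occur in one sentence."""
    return {lang for _, lang in sentence}


def classify(utterance: List[TaggedSentence]) -> str:
    """Label an utterance given as a list of tagged sentences.

    "code-mixed":    at least one sentence contains more than one language.
    "code-switched": every sentence is monolingual, but the language
                     differs between sentences.
    "monolingual":   a single language is used throughout.
    """
    per_sentence = [languages_in(s) for s in utterance]
    if any(len(langs) > 1 for langs in per_sentence):
        return "code-mixed"
    all_langs = set().union(*per_sentence)
    return "code-switched" if len(all_langs) > 1 else "monolingual"


# Hypothetical examples in Roman script.
mixed = [[("mujhe", "hi"), ("yeh", "hi"), ("movie", "en"),
          ("pasand", "hi"), ("hai", "hi")]]
switched = [[("I", "en"), ("am", "en"), ("tired", "en")],
            [("kal", "hi"), ("milte", "hi"), ("hain", "hi")]]

print(classify(mixed))     # code-mixed
print(classify(switched))  # code-switched
```

In practice the language tags would come from a token-level language identification step, which this sketch does not attempt.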

Machine Translation, also known as automated translation, is the process in which software translates text from one language to another without any human involvement. There are multiple forms of machine translation; however, over the past few years, neural machine translation has become extremely popular. The WMT shared task had two subtasks. The first subtask consisted of translating Hindi-English parallel sentence pairs into Hindi-English code-mixed sentences through machine translation. The second subtask consisted of translating Hindi-English code-mixed sentences into English.

2 Background

While there is a growing interest in code-mixed text analysis as a research problem, one bottleneck has hindered the growth of such work: the lack of data. Due to this, there aren't many robust models for code-mixed text. To build standardized datasets of code-mixed text, we need ways to generate such text. These texts would be very helpful in training language models for various code-mixed pairs, as language models only need unsupervised data.

Code-mixed text generation is a relatively new problem and is still at an early stage. One of the recent works in this field (Rizvi et al., 2021) tried to use linguistic theories to synthetically build code-mixed text using parallel monolingual corpora of two languages. The Equivalence Constraint Theory (Poplack, 1980) says that code-mixing can only occur at parts of the text where the surface structures of two languages map onto each other. So in these parts, the grammatical rules of both languages are followed. The Matrix Language Theory (McClure, 1995) tries to solve this problem by separating the two languages into a base language and a second language. The grammatical rules of the base lan-

arXiv:2210.12215v1 [cs.CL] 21 Oct 2022
