Gui at MixMT 2022 : English-Hinglish : An MT approach for translation
of code mixed data
Akshat Gahoi Jayant Duneja Anshul Padhi Shivam Mangale
Saransh Rajput Tanvi Kamble Dipti Misra Sharma Vasudeva Varma
International Institute of Information Technology, Hyderabad
{akshat.gahoi,anshul.padhi,saransh.rajput,tanvi.kamble}@research.iiit.ac.in
{dunejajayant,shivammangale}@gmail.com
Abstract
Code-mixed machine translation has become
an important task in multilingual communities,
and extending machine translation to
code-mixed data is now a common problem
for the languages they use. In the shared tasks of
WMT 2022, we address it in both directions:
English + Hindi to Hinglish, and Hinglish to
English. The first task dealt with both Roman
and Devanagari script, as we had monolingual
data in both English and Hindi, whereas the
second task had data only in Roman script.
To our knowledge, we achieved one of the
top ROUGE-L and WER scores for the first
task of Monolingual to Code-Mixed machine
translation. In this paper, we describe in detail
our use of mBART with special pre-processing
and post-processing (transliteration from
Devanagari to Roman) for the first task,
and we report the experiments that we performed
for the second task of translating code-mixed
Hinglish to monolingual English.
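As an illustration of the two metrics named above, the following is a minimal sketch of word-level WER and ROUGE-L, not the shared task's official scorer. It assumes whitespace tokenization and the F1 variant of ROUGE-L; the function names and the example sentences are hypothetical and do not come from the task data.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference word
                            curr[j - 1] + 1,      # insert a hypothesis word
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, hypothesis: str) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall (assumed variant)."""
    ref, hyp = reference.split(), hypothesis.split()
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical Hinglish reference and system output (invented for illustration).
reference = "mujhe yeh movie bahut pasand hai"
hypothesis = "mujhe yeh film bahut pasand hai"
print(f"WER     = {wer(reference, hypothesis):.3f}")
print(f"ROUGE-L = {rouge_l_f1(reference, hypothesis):.3f}")
```

Lower WER is better and higher ROUGE-L is better. In this example a single substituted word out of six yields a WER of 1/6 and a ROUGE-L F1 of 5/6.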
1 Introduction
Code mixing occurs when a multilingual
individual uses two or more languages while
communicating with others. It is the most natural
form of conversation for multilinguals. It is
often confused with code-switching, but there is
a slight difference between the two. Both
phenomena involve communicating in multiple
languages, but code-switching usually takes place
across multiple sentences, while code mixing
usually refers to words of different languages used
in the same sentence. In code mixing, phrases,
words and morphemes of one language may be
embedded within an utterance of another language.
Code mixing is extensively observed on social
media sites like Facebook and Twitter. With the
rapid growth of social media and the consequent
increase in the use of code-mixed data, it becomes
important to develop systems to process such text.
Machine translation, also known as automated
translation, is the process in which software
translates text from one language to another without
any human involvement. There are multiple forms
of machine translation; over the past few years,
however, neural machine translation has become
extremely popular. The WMT shared task had
two subtasks. The first subtask consisted of
translating Hindi-English parallel sentence pairs
into Hindi-English code-mixed sentences.
The second subtask consisted
of translating Hindi-English code-mixed
sentences into English.
2 Background
While there is growing interest in code-mixed
text analysis as a research problem, one
bottleneck has hindered the growth of such
work: the lack of data. Because of this,
there aren’t many robust models for code-mixed
text. To build standardized datasets of code-mixed
text, we need ways of generating such text.
The generated text would
be very helpful in training language models for
various code-mixed language pairs, as language
models need only unlabeled text.
Code-mixed text generation is a relatively new
problem and is still at an early stage. One
recent work in this field (Rizvi et al., 2021) used
linguistic theories to synthetically build
code-mixed text from parallel monolingual corpora of
two languages. The Equivalence Constraint Theory
(Poplack, 1980) says that code-mixing can occur
only at points in the text where the surface structures
of the two languages map onto each other, so that
the grammatical rules of both languages are
followed at those points. The Matrix Language
Theory (McClure, 1995) tries to solve this problem
by separating the
two languages into a base language and a second
language. The grammatical rules of the base lan-
arXiv:2210.12215v1 [cs.CL] 21 Oct 2022