The Effect of Normalization for Bi-directional
Amharic-English Neural Machine Translation
Tadesse Destaw Belay
College of Informatics
Wollo University
Kombolcha, Ethiopia
tadesseit@gmail.com
Atnafu Lambebo Tonja
Centro de Investigación en Computación
Instituto Politécnico Nacional
Mexico City, Mexico
alabedot2022@cic.ipn.mx
Olga Kolesnikova
Centro de Investigación en Computación
Instituto Politécnico Nacional
Mexico City, Mexico
kolesolga@gmail.com
Seid Muhie Yimam
Dept. of Informatics
Universität Hamburg
Hamburg, Germany
seid.muhie.yimam@uni-hamburg.de
Abinew Ali Ayele
ICT4D Research Center
Bahir Dar University
Bahir Dar, Ethiopia
abinewaliayele@gmail.com
Silesh Bogale Haile
Dept. of Computer Science
Assosa University
Assosa, Ethiopia
sileshibogale123@gmail.com
Grigori Sidorov
Centro de Investigación en Computación
Instituto Politécnico Nacional
Mexico City, Mexico
sidorov@cic.ipn.mx
Alexander Gelbukh
Centro de Investigación en Computación
Instituto Politécnico Nacional
Mexico City, Mexico
gelbukh@cic.ipn.mx
Abstract—Machine translation (MT) is one of the prominent tasks in natural language processing whose objective is to translate texts automatically from one natural language to another. Nowadays, using deep neural networks for the MT task has received a great deal of attention. These networks require large amounts of data to learn abstract representations of the input and store them in continuous vectors. This paper presents the first relatively large-scale Amharic-English parallel sentence dataset. Using these compiled data, we build bi-directional Amharic-English translation models by fine-tuning the existing Facebook M2M100 pre-trained model, achieving a BLEU score of 37.79 for Amharic-English translation and 32.74 for English-Amharic translation. Additionally, we explore the effect of Amharic homophone normalization on the machine translation task. The results show that normalization of Amharic homophone characters increases the performance of Amharic-English machine translation in both directions.
Index Terms—Neural machine translation, pre-trained models, Amharic-English MT, homophone normalization, low-resourced language
I. INTRODUCTION
Machine translation (MT) is a sub-field of natural language processing (NLP) that investigates how to use computer software to translate text or speech automatically from one language to another without human involvement. MT is one of the prominent tasks in NLP and has been tackled in several ways [1]. The first MT research began in the early 1950s, and in 1952 the first International Conference on Machine Translation was organized at the Massachusetts Institute of Technology (MIT). MT has a long research history and has gone through four stages, namely Rule-based MT [2], Statistical MT (SMT) [3], hybrid MT, and Neural MT (NMT) [4], [5].
The most severe drawback of the rule-based method is that it ignores contextual information in the translation process and is highly dependent on hand-crafted features. Phrase-based SMT (PBSMT), the most prevalent version of SMT, generates a translation by segmenting the source sentence into several phrases and performing phrase translation and reordering. It may miss long-distance dependencies within a sentence and requires considerable computing resources [6]. Recently, using deep neural networks for the MT task has received great attention. NMT also simplifies training, since its end-to-end procedure avoids tedious feature engineering and complex setups. NMT employs such techniques as the recurrent neural network (RNN) [7], the convolutional neural network (CNN) [8], and the self-attention network (Transformer) [9].
Transformer models combined with the pre-training approach form a new NMT strategy; the Transformer architecture, based entirely on attention mechanisms, was proposed in 2017 [9]. Among the different neural network architectures, the Transformer model has emerged as the dominant NMT paradigm [10]–[12]. It has become the state-of-the-art model for many artificial intelligence tasks, including machine translation. Transformer-based pre-trained models are fast to fine-tune, highly accurate, and have been proven to outperform the widely used recurrent networks [6], [13], [14].
The focus of MT research for the Amharic language has so far been on rule-based and SMT methods. In this work, we used the Transformer model as a baseline translation system to
explore the applicability of the Facebook M2M100 multilingual pre-trained model to Amharic-English translation in both directions. Furthermore, this work investigated the impact of normalizing Amharic homophone characters on Amharic-English MT tasks.
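As an illustration of this setup, the following minimal Python sketch shows how an M2M100 checkpoint can be loaded and queried for Amharic-to-English translation with the Hugging Face transformers library. The checkpoint size (facebook/m2m100_418M), the example sentence, and the generation settings are illustrative assumptions rather than the exact configuration used in our experiments.

# Minimal sketch: Amharic -> English translation with a pre-trained M2M100
# checkpoint via the Hugging Face transformers library. The checkpoint size
# and generation settings are illustrative assumptions, not the exact
# configuration used in this work.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_name = "facebook/m2m100_418M"  # assumed checkpoint
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

def translate(text, src_lang="am", tgt_lang="en"):
    # Set the source language so the tokenizer prepends the correct language code.
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt")
    # Force the decoder to start with the target-language token.
    generated = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.get_lang_id(tgt_lang),
        max_length=128,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

print(translate("ሰላም እንዴት ነህ?"))  # hypothetical example input

Fine-tuning then amounts to continuing to train this same model on the parallel corpus with the standard sequence-to-sequence cross-entropy objective.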
The main contributions of this work are:
1) Exploration of the Amharic-English and English-Amharic machine translation tasks.
2) Introduction of the first large-scale publicly available Amharic-English parallel translation dataset.
3) Development and implementation of state-of-the-art Amharic-English translation models.
4) Investigation of the effect of Amharic homophone character normalization on the machine translation task.
The rest of this paper is organized as follows. Section II presents a detailed description of the Amharic language, while Section III states the motivation for this research. In Section IV, we review related work. Section V describes the existing parallel corpus and the collection of a new corpus from the news domain. The general pre-processing steps applied to both corpora are presented in Section VI. Section VII discusses the proposed NMT models, and Section VIII gives the experimental results. Finally, Section IX concludes the paper and sheds some light on possible future work.
II. AMHARIC LANGUAGE
Amharic is the second most spoken Semitic language after Arabic. It has its own alphabet and writing script, called 'Fidel', which was borrowed from Ge'ez, another Ethiopian Semitic language. Fidel is a syllable-based writing system in which a consonant and a vowel co-exist within each graphic symbol. The Amharic language is spoken by more than 57 million people, with up to 32 million native speakers and 25 million non-native speakers [15]. Amharic is the working language of the Federal Democratic Republic of Ethiopia (FDRE) and of many regional states in the country. Amharic has 34 core characters, each with seven different derivatives that represent vowels. In addition, it has 20 labialized characters, more than 20 numerals, and 8 punctuation marks; in total, Amharic uses more than 310 characters. The language is known for being morphologically complex and highly inflectional. Unlike English, French, Spanish, Japanese, and Chinese, Amharic is considered low-resource because its data are not well organized and it is technologically less supported [16].
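Because the seven vowel orders of each Fidel consonant occupy consecutive code points in the Unicode Ethiopic block, the derivatives of a base character can be enumerated programmatically; the short Python sketch below illustrates this for the consonant ለ, chosen only as an example.

# The seven vowel orders of a Fidel consonant sit on consecutive code points,
# so they can be generated from the first-order (ä) form of the consonant.
base = ord("ለ")  # U+1208, first-order form of the consonant "l"
orders = [chr(base + i) for i in range(7)]
print(orders)  # ['ለ', 'ሉ', 'ሊ', 'ላ', 'ሌ', 'ል', 'ሎ']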
III. MOTIVATION
Nowadays, advances in technology have made everyday life much easier by supporting daily activities. One of the applications that helps overcome communication barriers between people speaking different languages is machine translation. Many big technology companies, such as Google, Microsoft, and IBM, provide translation services for many languages to facilitate communication between people without a human translator. However, the quality of NMT depends massively on the quantity, quality, and relevance of the training dataset [17]. Such companies have achieved promising results for high-resource language pairs, but their systems remain inadequate for low-resource languages like Amharic.
Fig. 1. Examples of Google Amharic to English translation
Figure 1 shows Google's translation of Amharic words into English. Google translated the Amharic input as "stomach empty", which is a wrong translation; the correct meaning is "he became disappointed". This shows that the translation systems provided by companies like Google still require improvement. One of the causes of the poor performance of MT systems for languages like Amharic is the limited amount of resources available in the digital space [17]. In this research work, we present a newly curated English-Amharic parallel dataset that can be used for MT research and can help address the performance issues of Amharic MT systems.
On the other hand, Amharic is a morphologically rich language, and normalization of Amharic homophone characters might have an impact on downstream NLP applications such as MT and sentiment analysis [18]. This research work is intended to study the effect of homophone normalization on Amharic-English machine translation. Furthermore, expanding the translation dataset and developing state-of-the-art bi-directional Amharic-English translation models are further motivations for carrying out this research work.
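For concreteness, the Python sketch below shows one common way to normalize Amharic homophone characters by mapping redundant consonant series onto a canonical series. The particular canonical choices (ሐ/ኀ to ሀ, ሠ to ሰ, ዐ to አ, ፀ to ጸ) follow a widely used convention and should be read as an assumption; they are not necessarily the exact mapping applied in our experiments.

# A minimal sketch of Amharic homophone character normalization: each
# redundant consonant series is mapped onto one canonical series. The seven
# vowel orders of a series occupy consecutive Unicode code points, so the
# base-to-base mapping is expanded over the whole series.
HOMOPHONE_SERIES = {
    "ሐ": "ሀ",  # hä series (U+1210..) -> canonical hä series (U+1200..)
    "ኀ": "ሀ",  # hä series (U+1280..) -> canonical hä series
    "ሠ": "ሰ",  # sä series (U+1220..) -> canonical sä series (U+1230..)
    "ዐ": "አ",  # ä series (U+12D0..) -> canonical ä series (U+12A0..)
    "ፀ": "ጸ",  # tsä series (U+1340..) -> canonical tsä series (U+1338..)
}

CHAR_MAP = {}
for src, dst in HOMOPHONE_SERIES.items():
    for i in range(7):  # expand over the seven vowel orders
        CHAR_MAP[chr(ord(src) + i)] = chr(ord(dst) + i)

def normalize_homophones(text):
    # Replace every homophone character with its canonical counterpart.
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)

print(normalize_homophones("ሰላም ሠላም"))  # -> "ሰላም ሰላም"

Normalization of this kind is typically applied to the Amharic side of the corpus before tokenization and training.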
IV. RELATED WORK
Many automatic translation studies have been carried out for the major pairs of European and Asian languages, taking advantage of large-scale parallel corpora. However, very few research studies have been conducted on low-resource language pairs such as Amharic-English, due to the scarcity of parallel data. In this section, we focus on how machine translation has been conducted for the Amharic language.
Among recent works, Biadgligne and Smaïli [19] described English-Amharic Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) experiments, achieving BLEU scores of 26.47 and 32.44, respectively. They harvested and used 225,304 parallel sentences from different domains. To the best of our knowledge, the work of Biadgligne and Smaïli [19] uses the largest dataset in Amharic machine translation research.