The Effect of Normalization for Bi-directional Amharic-English Neural Machine Translation Tadesse Destaw Belay

The Effect of Normalization for Bi-directional

Amharic-English Neural Machine Translation

Tadesse Destaw Belay

College of Informatics

Wollo University

Kombolcha, Ethiopia

tadesseit@gmail.com

Atnafu Lambebo Tonja

Centro de Investigación en Computación

Instituto Politécnico Nacional

Mexico City, Mexico

alabedot2022@cic.ipn.mx

Olga Kolesnikova

Centro de Investigación en Computación

Instituto Politécnico Nacional

Mexico City, Mexico

kolesolga@gmail.com

Seid Muhie Yimam

Dept. of Informatics

Universität Hamburg

Hamburg, Germany

seid.muhie.yimam@uni-hamburg.de

Abinew Ali Ayele

ICT4D Research Center

Bahir Dar University

Bahir Dar, Ethiopia

abinewaliayele@gmail.com

Silesh Bogale Haile

Dept. of Computer Science

Assosa University

Assosa, Ethiopia

sileshibogale123@gmail.com

Grigori Sidorov

Centro de Investigación en Computación

Instituto Politécnico Nacional

Mexico City, Mexico

sidorov@cic.ipn.mx

Alexander Gelbukh

Centro de Investigación en Computación

Instituto Politécnico Nacional

Mexico City, Mexico

gelbukh@cic.ipn.mx

Abstract—Machine translation (MT) is one of the prominent tasks in natural language processing; its objective is to translate text automatically from one natural language to another. Recently, the use of deep neural networks for MT has received a great deal of attention. These networks require large amounts of data to learn abstract representations of the input and store them in continuous vectors. This paper presents the first relatively large-scale Amharic-English parallel sentence dataset. Using the compiled data, we build bi-directional Amharic-English translation models by fine-tuning the existing Facebook M2M100 pre-trained model, achieving a BLEU score of 37.79 for Amharic-English translation and 32.74 for English-Amharic translation. Additionally, we explore the effects of Amharic homophone normalization on the machine translation task. The results show that normalizing Amharic homophone characters improves the performance of Amharic-English machine translation in both directions.

Index Terms—Neural machine translation, pre-trained models, Amharic-English MT, homophone normalization, low-resourced language

I. INTRODUCTION

Machine translation (MT) is a sub-field of natural language processing (NLP) that investigates how to use computer software to automatically translate text or speech from one language to another without human involvement. MT is one of the prominent tasks in NLP and has been tackled in several ways [1]. MT research began around the 1950s, and in 1952 the first International Conference on Machine Translation was organized at the Massachusetts Institute of Technology (MIT). The field has a long research history and has passed through four stages, namely rule-based MT [2], statistical MT (SMT) [3], hybrid MT, and neural MT (NMT) [4], [5].

The most severe drawback of the rule-based method is that it ignores the context information needed in the translation process, and it depends heavily on hand-crafted features. Phrase-based SMT (PBSMT), the most prevalent version of SMT, generates a translation by segmenting the source sentence into several phrases and performing phrase translation and replacement. It may ignore long-range dependencies within a sentence and requires substantial computing resources [6]. Recently, the use of deep neural networks for MT has received great attention. NMT also simplifies training, because it is an end-to-end procedure without tedious feature engineering and complex setups. NMT employs techniques such as recurrent neural networks (RNN) [7], convolutional neural networks (CNN) [8], and self-attention networks (Transformer) [9].

Transformer models, combined with the pre-training approach, are a recent NMT strategy based entirely on attention mechanisms, an architecture proposed in 2017 [9]. Among the different neural network architectures, the Transformer model has emerged as the dominant NMT paradigm [10]–[12]. It has become the state-of-the-art model for many artificial intelligence tasks, including machine translation. Transformer-based pre-trained models are fast to fine-tune, highly accurate, and have been shown to outperform widely used recurrent networks [6], [13], [14].

The focus of MT research for the Amharic language has been on rule-based and SMT methods. In this work, we use the Transformer model as a baseline translation system and explore the applicability of the Facebook M2M100 multilingual pre-trained language model to Amharic-English translation in both directions. Furthermore, we investigate the impact of normalizing Amharic homophones on Amharic-English MT tasks.

arXiv:2210.15224v1 [cs.CL] 27 Oct 2022

The main contributions of this work are:

1) Exploration of the Amharic-English and English-Amharic machine translation tasks.

2) Introduction of the first large-scale publicly available Amharic-English parallel translation dataset.

3) Development and implementation of state-of-the-art Amharic-English translation models.

4) Investigation of the effect of Amharic homophone character normalization on the machine translation task.

The rest of this paper is organized as follows. Section II presents a detailed description of the Amharic language, while Section III gives the motivation for this research. In Section IV, we review related work. Section V describes the existing parallel corpus and the collection of a new corpus from the news domain. The general pre-processing steps applied to both corpora are presented in Section VI. Section VII discusses the proposed NMT models, and Section VIII gives the experimental results. Finally, Section IX concludes the paper and sheds some light on possible future work.

II. AMHARIC LANGUAGE

Amharic is the second most spoken Semitic language after Arabic. It has its own alphabet and writing script, called 'Fidel', which was borrowed from Ge'ez, another Ethiopian Semitic language. Fidel is a syllable-based writing system in which consonants and vowels co-exist within each graphic symbol. Amharic is spoken by more than 57 million people, with up to 32 million native speakers and 25 million non-native speakers [15]. It is the working language of the Federal Democratic Republic of Ethiopia (FDRE) and of many regional states in the country. Amharic has 34 core characters, each with seven different derivatives to represent vowels. In addition, it has 20 labialized characters, more than 20 numerals, and 8 punctuation marks, for a total of more than 310 characters. The language is morphologically complex and highly inflectional. Unlike English, French, Spanish, Japanese, and Chinese, Amharic is considered a low-resource language because its data are not well organized and it has less technological support [16].

III. MOTIVATION

Advances in technology have made people's lives much easier by supporting daily activities. One application that helps overcome communication barriers between people speaking different languages is machine translation. Many big technology companies such as Google, Microsoft, and IBM provide translation services for many languages to facilitate communication without a human translator. However, the quality of NMT depends heavily on the quantity, quality, and relevance of the training dataset [17]. Such companies have achieved promising results for high-resource language pairs, but their systems remain inadequate for low-resource languages like Amharic.

Fig. 1. Examples of Google Amharic to English translation

Figure 1 shows Google's translation of an Amharic expression into English. Google translated the Amharic input as "stomach empty", which is a wrong translation. The correct literal meaning is "he became disappointed". This shows that the translation systems offered by companies like Google still need improvement. One cause of the poor performance of MT systems for languages like Amharic is the limited availability of resources in digital form [17]. In this work, we present a newly curated English-Amharic parallel dataset that can be used for MT research and can help address the performance issues of Amharic MT systems.

On the other hand, Amharic is a morphologically rich language, and normalization of Amharic homophone characters might affect downstream NLP applications such as MT and sentiment analysis [18]. This work studies the effect of homophone normalization on Amharic-English machine translation. Expanding the translation dataset and developing state-of-the-art bi-directional Amharic-English translation models are further motivations for carrying out this research.
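To make the idea of homophone normalization concrete, the following is a minimal Python sketch of character-level normalization. It is an illustration under stated assumptions, not the procedure used in this paper: the homophone groups listed (ሐ and ኀ mapped to ሀ, ሠ to ሰ, ዐ to አ, ፀ to ጸ), the choice of canonical form, and the names `HOMOPHONE_BASES`, `build_table`, and `normalize_homophones` are all hypothetical. It relies only on the fact that the seven vowel orders of each series occupy consecutive code points in the Unicode Ethiopic block.

```python
# Assumed homophone groups: (variant base code point, canonical base code point).
# Each base is the first of seven consecutive vowel orders in the Ethiopic block.
# The groups and the canonical direction are illustrative assumptions.
HOMOPHONE_BASES = [
    (0x1210, 0x1200),  # ሐ series -> ሀ series
    (0x1280, 0x1200),  # ኀ series -> ሀ series
    (0x1220, 0x1230),  # ሠ series -> ሰ series
    (0x12D0, 0x12A0),  # ዐ series -> አ series
    (0x1340, 0x1338),  # ፀ series -> ጸ series
]


def build_table(bases=HOMOPHONE_BASES, orders=7):
    """Build a str.translate table mapping every variant order to its canonical order."""
    table = {}
    for variant, canonical in bases:
        for offset in range(orders):
            table[variant + offset] = chr(canonical + offset)
    return table


_TABLE = build_table()


def normalize_homophones(text):
    """Replace each assumed homophone variant with its canonical character."""
    return text.translate(_TABLE)
```

With such a table, two spellings of the same word that differ only in a homophone character (for example, one written with ሠ and one with ሰ) collapse to a single string, which reduces vocabulary sparsity in the training data. A production normalizer would likely cover more cases, such as labialized forms and vowel orders that sound alike, which this sketch deliberately leaves out.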

IV. RELATED WORK

Much work on automatic translation has been carried out for the major European and Asian language pairs, taking advantage of large-scale parallel corpora. However, very few studies have addressed low-resource pairs such as Amharic-English because of the scarcity of parallel data. In this section, we focus on how machine translation has been approached for the Amharic language.

Among recent works, Biadgligne and Smaïli [19] described English-Amharic statistical machine translation (SMT) and neural machine translation (NMT) experiments that achieved BLEU scores of 26.47 and 32.44, respectively. They harvested and used 225,304 parallel sentences from different domains. To the best of our knowledge, the dataset of Biadgligne and Smaïli [19] is the largest used in Amharic machine translation research.
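Since the systems discussed here are compared by BLEU score, a short sketch of how corpus-level BLEU is computed may help in reading the numbers. The Python code below is a simplified illustration, assuming whitespace tokenization, a single reference per hypothesis, and no smoothing; the function names `ngrams` and `corpus_bleu` are hypothetical. It is not the evaluation tool used in any of the cited works, and its scores are not directly comparable to published ones, which depend on the tokenization and implementation used.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis.

    Inputs are lists of whitespace-tokenized strings; no smoothing is applied.
    """
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-gram counts per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Counter intersection clips each count by its count in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # unsmoothed BLEU is zero when any order has no match

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes output shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)
```

The score combines the geometric mean of the 1-gram to 4-gram precisions with a brevity penalty, so a system is rewarded for matching longer word sequences and penalized for producing output shorter than the reference.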
