
explore the applicability of Facebook M2M100 multi-lingual
pre-trained language model for Amharic-English translation in
both directions. Furthermore, this research work investigated
the impact of normalization of the Amharic homophones on
Amharic-English MT tasks.
The main contributions of this work are:
1) Exploration of the Amharic-English and English-
Amharic machine translation tasks.
2) Introduction of the first large-scale publicly available
Amharic-English translation parallel dataset.
3) Development and implementation of state-of-the-art
Amharic-English translation models.
4) Investigation of the effect of Amharic homophone char-
acter normalization on the machine translation task.
The rest of this paper is organized as follows. Section
II presents a detailed description of the Amharic language, while
Section III presents the motivation for this research. In Section
IV, we review related work. Section V describes the existing
parallel corpus and the collection of a new corpus from
the news domain. The general pre-processing steps applied
to both corpora are presented in Section VI. Section VII
discusses the proposed NMT models and Section VIII gives
the experimental results. Finally, Section IX concludes the
paper and outlines possible future work.
II. AMHARIC LANGUAGE
Amharic is the second most widely spoken Semitic language after
Arabic. It has its own alphabet and writing script, called
‘Fidel’, which was borrowed from Ge’ez, another Ethiopian
Semitic language. Fidel is a syllable-based writing system
where the consonants and vowels co-exist within each graphic
symbol. The Amharic language is spoken by more than 57
million people with up to 32 million native speakers and 25
million non-native speakers [15]. Amharic is the working language
of the Federal Democratic Republic of Ethiopia (FDRE)
and of many regional states in the country. In Amharic, there
are 34 core characters each having seven different derivatives
to represent vowels. In addition, it has 20 labialized characters,
more than 20 numerals, and 8 punctuation marks. Amharic
uses a total of more than 310 characters. The language is
morphologically complex and highly inflectional. Unlike English,
French, Spanish, Japanese, and Chinese, Amharic is considered
low-resource because its data are not well organized and it is
less well supported technologically [16].
III. MOTIVATION
Advances in technology have made daily life much easier.
One application that helps overcome communication barriers
between people speaking different languages is machine
translation. Many large technology companies, such as Google,
Microsoft, and IBM, provide translation services for many
languages to facilitate communication between people without
a human translator. However, the quality of NMT depends
heavily on the quantity, quality, and relevance of the
training dataset [17]. Such companies have achieved promising
results for high-resource language pairs, but the results are
inadequate for low-resource languages like Amharic.
Fig. 1. Examples of Google Amharic to English translation
Figure 1 shows the Google translation of Amharic words into
English. Google translated the Amharic input as “stomach
empty”, which is a wrong translation. The correct
meaning is “he became disappointed”. This shows that
the translation systems provided by companies such as Google
still require improvement. One cause of the
poor performance of MT systems for languages like Amharic
is the limited availability of digital resources [17]. In this
research work, we present a newly curated English-Amharic
parallel dataset that can be used for MT research and can help
address the performance issues of Amharic MT systems.
In addition, Amharic is a morphologically rich language,
and normalization of the Amharic homophone
characters might have an impact on downstream NLP
applications such as MT and sentiment analysis [18]. This research
work studies the effect of homophone normalization
on Amharic-English machine translation. Furthermore,
expanding the translation dataset and developing state-of-the-art
bi-directional Amharic-English translation models are our
motivations for carrying out this research work.
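As an illustration of what homophone normalization can look like, the following is a minimal Python sketch. It is not the procedure used in this work. It assumes that the commonly cited homophone series (ሐ and ኀ with ሀ, ሠ with ሰ, ዐ with አ, and ፀ with ጸ) are collapsed to one canonical series across the seven basic orders. The choice of canonical forms and the names HOMOPHONE_SERIES, ORDERS, and normalize_homophones are hypothetical.

```python
# Hypothetical mapping: first code point of a variant series -> first code
# point of the series chosen here (by assumption) as the canonical form.
HOMOPHONE_SERIES = {
    0x1210: 0x1200,  # ሐ series -> ሀ series
    0x1280: 0x1200,  # ኀ series -> ሀ series
    0x1220: 0x1230,  # ሠ series -> ሰ series
    0x12D0: 0x12A0,  # ዐ series -> አ series
    0x1340: 0x1338,  # ፀ series -> ጸ series
}

# Each series is assumed to cover the seven basic vowel orders, which occupy
# consecutive code points in the Unicode Ethiopic block.
ORDERS = 7


def build_table():
    """Expand the series-level mapping into a per-character translation table."""
    table = {}
    for variant, canonical in HOMOPHONE_SERIES.items():
        for order in range(ORDERS):
            table[variant + order] = canonical + order
    return table


NORMALIZATION_TABLE = build_table()


def normalize_homophones(text):
    """Replace variant homophone characters with their canonical counterparts."""
    return text.translate(NORMALIZATION_TABLE)
```

Under these assumptions, normalize_homophones("ፀሐይ") returns "ጸሀይ", while characters outside the mapped series, including Latin text, pass through unchanged. Applying the same function to both the training and the test data keeps the two sides of a comparison consistent.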
IV. RELATED WORK
Much work on automatic translation has been carried out
for the major pairs of European and Asian languages, taking
advantage of large-scale parallel corpora. However, very few
research studies have been conducted on low-resource language
pairs such as Amharic-English because of the scarcity of parallel
data. In this section, we focus on how
machine translation has been carried out for the Amharic language.
Among recent works, Biadgligne and Smaïli [19] described
English-Amharic Statistical Machine
Translation (SMT) and Neural Machine Translation (NMT)
experiments, achieving 26.47 and 32.44 BLEU scores, respectively.
They harvested and used 225,304 parallel sentences in
different domains. To the best of our knowledge, the dataset of
Biadgligne and Smaïli [19] is the largest used in Amharic
machine translation research.