
ten contains words spelled alike in MSA and DA but different in meaning due to different pronunciation. An example of these homographs occurs in a DA tweet2 built around the noun ‘نصب’, which commonly means ‘fraud’ in DA when pronounced with the diacritic ‘fatha’ (a short /a/ sound) on the first letter; the tweet should read ‘Enough of the fraud’. The online MT system flips the negative polarity as it mistakes this word for its common homograph in MSA meaning ‘monument’, pronounced with ‘damma’ (a short /u/ sound) on its first and second letters. The mistranslation of the homograph produces a neutral statement, ‘enough monument’, which completely misses the negative polarity of the source.
2https://twitter.com/Abdullahehemidy/status/221985043793444865, Accessed: Aug 2022
The third problem is that sentiment in the DA used in UGT is expressed differently from the way it is expressed in the structured DA data that is commonly used to train DA-EN NMT systems (e.g. Zbib et al. (2012); Bouamor et al. (2014); Elmahdy et al. (2014); Meftouh et al. (2015); Bouamor et al. (2018)). One of the main differences is that UGT typically contains profanity and aggressive words that are not found in the available dialectal data. Moreover, the DA used on online platforms such as Twitter usually contains unusual orthography to express emotions or to obfuscate aggression and, at times, nuanced words that are understood only within context (Ranasinghe et al., 2019). A review of the literature shows that the authentic parallel datasets for DA-EN consist mainly of hand-crafted structured data which differs significantly from this type of noisy DA used in UGT. On the other hand, there is a considerable number of large parallel MSA-EN datasets in various domains (e.g. OPUS3 open-source parallel MSA-EN datasets, which include UN documents, TEDx talks, subtitles, news commentary, etc.). Since DA in the UGT domain has peculiar qualities, since it differs on the lexico-grammatical level from Standard Arabic and since, at times, the same word can carry opposite sentiment in the two varieties, the freely available MSA datasets are not optimal for translating sentiment in UGT written in a dialectal variety.
3https://opus.nlpl.eu/
Given the scarcity of any substantial gold-standard DA-EN data within the UGT domain, we propose to improve the transfer of sentiment in Arabic UGT by training a semi-supervised NMT system where we leverage the relatively large gold-standard MSA-EN data together with DA monolingual data
from the UGT domain. We take advantage of pre-
training a cross-lingual language model with both
a Masked Language Modelling (MLM) objective
and a Translation Language Modelling (TLM) ob-
jective for creating a shared embedding space for
English, MSA and DA. We show that initialising
our NMT model with these cross-lingual pretrained
word representations has a significant impact on
the translation performance in general and on the
transfer of sentiment in particular. In this research,
therefore, we make the following contributions:
• We introduce a semi-supervised AR-EN NMT system trained on both parallel and monolingual data for a better translation of sentiment in Arabic UGT.
• We introduce an empirical evaluation method for assessing the transfer of sentiment between Arabic and English in the UGT domain.
• We make our compiled dataset, cross-lingual language models and semi-supervised NMT system publicly available4.
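To make the MLM and TLM pretraining objectives described above more concrete, the sketch below illustrates, in a deliberately simplified form, how training examples for the two objectives are typically constructed in XLM-style cross-lingual pretraining. It is an illustrative sketch rather than the authors' implementation: the whitespace tokenisation, the 15% masking rate and the mask_tokens, mlm_example and tlm_example helpers are assumptions introduced only for this example.

import random

MASK, SEP = "[MASK]", "[SEP]"

def mask_tokens(tokens, mask_prob=0.15):
    """Randomly corrupt a token sequence; return (corrupted, targets), where
    targets[i] is the original token the model must predict, or None."""
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            corrupted.append(MASK)   # hide the token
            targets.append(tok)      # the model must recover it
        else:
            corrupted.append(tok)
            targets.append(None)     # not part of the prediction loss
    return corrupted, targets

def mlm_example(sentence):
    """MLM: mask tokens in a single monolingual sentence (drawn here from the
    EN, MSA or DA monolingual streams)."""
    return mask_tokens(sentence.split())

def tlm_example(src_sentence, tgt_sentence):
    """TLM: concatenate a parallel pair (e.g. MSA-EN) and mask tokens on both
    sides, so that a masked word can be recovered either from its own context
    or from its translation, pushing the languages into a shared space."""
    joined = src_sentence.split() + [SEP] + tgt_sentence.split()
    return mask_tokens(joined)

if __name__ == "__main__":
    random.seed(0)
    # Monolingual DA/EN text from the UGT domain feeds the MLM objective.
    print(mlm_example("enough of the fraud"))
    # Gold-standard parallel MSA-EN pairs feed the TLM objective.
    print(tlm_example("an msa source sentence", "its english translation"))

In practice, a subword vocabulary shared across English, MSA and DA would be used, and the encoder pretrained with these objectives would then initialise the NMT model, as described above.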
To present our contributions, the paper is divided as follows: Section 2 provides a summary of relevant approaches to supervised and unsupervised MT as well as research attempts at the translation of DA. Section 3 describes our semi-supervised NMT system setup and its requirements. Section 4 presents the experiments we conducted on our compiled datasets as well as the assessment methods used to evaluate the improvement of sentiment translation in DA UGT. Finally, Section 5 presents our conclusions on the different experiments and the limitations of the study.
2 Related Work
The earliest attempt to address the problem of translating DA was introduced by Zbib et al. (2012). They created the largest existing DA-EN parallel dataset, which most MT research on DA relies upon. The dataset consists of around 250k parallel sentences translated from DA to EN via Mechanical Turk. Most of the DA is in the Levantine and Egyptian dialects, but none of the texts used belong to the UGT domain. They show that when translating the dialectal test sets, the DA-EN MT system performs 6.3
4Link removed to preserve anonymity