A Semi-supervised Approach for a Better Translation of Sentiment in
Dialectical Arabic UGT
Hadeel Saadany
Centre for Translation Studies
University of Surrey
United Kingdom
hadeel.saadany@surrey.ac.uk
Constantin Orăsan
Centre for Translation Studies
University of Surrey
United Kingdom
c.orasan@surrey.ac.uk
Emad Mohamed
RGCL
University of Wolverhampton
Wolverhampton, UK
e.mohamed2@wlv.ac.uk
Ashraf Tantawy
School of Computer Science and Informatics
De Montfort University
Leicester, UK
ashraf.tantavy@dmu.ac.uk
Abstract
In the online world, Machine Translation (MT) systems are extensively used to translate User-Generated Text (UGT) such as reviews, tweets, and social media posts, where the main message is often the author’s positive or negative attitude towards the topic of the text. However, MT systems still lack accuracy in some low-resource languages and sometimes make critical translation errors that completely flip the sentiment polarity of the target word or phrase and hence deliver a wrong affect message. This is particularly noticeable with texts that do not follow common lexico-grammatical standards, such as the dialectical Arabic (DA) used on online platforms. In this research, we aim to improve the translation of sentiment in UGT written in the dialectical versions of the Arabic language to English. Given the scarcity of gold-standard parallel data for DA-EN in the UGT domain, we introduce a semi-supervised approach that exploits both monolingual and parallel data for training an NMT system initialised by a cross-lingual language model trained with both supervised and unsupervised modelling objectives. We assess the accuracy of sentiment translation by our proposed system through a numerical ‘sentiment-closeness’ measure as well as human evaluation. We show that our semi-supervised MT system can significantly help with correcting sentiment errors detected in the online translation of dialectical Arabic UGT.
1 Introduction
Websites such as Twitter, amazon.com and booking.com now commonly incorporate automatic translation tools to cater for their multilingual users. In this context, sentiment preservation is of great importance because decisions about purchasing a product or service, as well as analyses of public trends, are based on an accurate translation of the user’s affect message. Arabic UGT constitutes a significant challenge for MT systems because it is commonly a mix of Dialectical Arabic (DA) and Modern Standard Arabic (MSA), which differ significantly on the lexico-grammatical level. Research has shown that code-switching between DA and MSA by online users can lead to serious mistranslation of sentiment for several reasons (Saadany and Orasan, 2020; Saadany et al., 2021b).
First, there are lexical and structural differences between the two versions of the Arabic language which confuse MT systems when choosing the correct sentiment-carrying word (Saadany et al., 2021a). On the lexical level, there are polysemous words used in both MSA and DA which can have exactly opposite sentiment polarities. To give one example, the word ‘جامد’ means ‘rigid’ in MSA, but in DA, within the UGT domain, it often means ‘great’ or ‘awesome’. Hence, we find that the positive Goodreads review ‘كتاب جامد جدا’ (A very good book)¹ is mistranslated by the online MT tool into ‘A very rigid book’, incorrectly reflecting a negative sentiment. The same word, however, in another book review written in MSA, ‘طريقة المؤلف في سرد الأحداث جامدة جدا’, is correctly translated as ‘The author’s way of narrating events is very rigid’, rightly reflecting the dissatisfaction of the author.
¹ https://www.goodreads.com/book/show/16031620
arXiv:2210.11899v2 [cs.CL] 8 Jun 2023
Second, the Arabic writing system does not have letters for short vowels; instead, short vowels are realised as diacritic symbols above or below letters. UGT commonly lacks diacritics and hence it often contains words that are spelled alike in MSA and DA but differ in meaning because they are pronounced differently. An example of these homographs is in the DA tweet² ‘كفايانا نصب’, where the noun ‘نصب’ commonly means ‘fraud’ in DA with the diacritic ‘fatha’ (a short /a/ sound) on the first letter; the tweet should read ‘Enough of the fraud’. The online MT system loses the negative polarity because it mistakes this word for its common homograph in MSA meaning ‘monument’, pronounced with ‘damma’ (a short /u/ sound) on its first and second letters. The mistranslation of the homograph produces a neutral statement, ‘enough monument’, which completely misses the negative polarity of the source.
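The homograph problem described above can be illustrated with a minimal Python sketch. It assumes the two vocalised spellings are exactly as described in the text (fatha on the first letter versus damma on the first and second letters) and builds them from explicit Unicode code points. The helper names `strip_diacritics` and `are_homographs` are hypothetical and are not part of the authors' system.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks (Unicode category Mn), which include the Arabic short-vowel signs."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def are_homographs(a: str, b: str) -> bool:
    """True if two different vocalised spellings become identical once diacritics are removed."""
    return a != b and strip_diacritics(a) == strip_diacritics(b)


# Assumed spellings, built from code points: noon, sad, beh plus vowel marks.
# Fatha on the first letter (the DA reading glossed above as 'fraud').
WITH_FATHA = "\u0646\u064e\u0635\u0628"
# Damma on the first and second letters (the MSA reading glossed above as 'monument').
WITH_DAMMA = "\u0646\u064f\u0635\u064f\u0628"

if __name__ == "__main__":
    print(are_homographs(WITH_FATHA, WITH_DAMMA))
```

Once the vowel marks are gone, both spellings reduce to the same three-letter string, so an MT system that sees only undiacritised UGT has to rely on context to choose between the two readings.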
The third problem is that the way sentiment is expressed in the DA used in UGT differs from the structured DA data that is commonly used to train DA-EN NMT systems (e.g. Zbib et al. (2012); Bouamor et al. (2014); Elmahdy et al. (2014); Meftouh et al. (2015); Bouamor et al. (2018)). One of the main differences is that UGT typically contains profanity and aggressive words that are not found in the available dialectical data. Moreover, the DA used on online platforms such as Twitter usually contains unusual orthography to express emotions or to obfuscate aggression and, at times, nuanced words that are understood only within context (Ranasinghe et al., 2019). A review of the literature shows that the authentic parallel datasets for DA-EN consist mainly of hand-crafted structured data which differ significantly from this type of noisy DA used in UGT. On the other hand, there is a considerable number of large parallel MSA-EN datasets in various domains (e.g. the OPUS³ open-source parallel MSA-EN datasets include UN documents, TEDx talks, subtitles, news commentary, etc.). Since DA in the UGT domain has peculiar qualities, since it differs on the lexico-grammatical level from Standard Arabic and since, at times, the same words can have opposite sentiment in the two versions, the freely available MSA datasets are not optimal for translating sentiment in UGT written in a dialectical version.
² https://twitter.com/Abdullahehemidy/status/221985043793444865, Accessed: Aug 2022
³ https://opus.nlpl.eu/
Given the scarcity of any substantial gold-standard DA-EN data within the UGT domain, we propose to improve the transfer of sentiment in Arabic UGT by training a semi-supervised NMT system in which we combine the relatively large gold-standard MSA-EN data with DA monolingual data from the UGT domain. We take advantage of pretraining a cross-lingual language model with both a Masked Language Modelling (MLM) objective and a Translation Language Modelling (TLM) objective to create a shared embedding space for English, MSA and DA. We show that initialising our NMT model with these cross-lingual pretrained word representations has a significant impact on translation performance in general and on the transfer of sentiment in particular. In this research, therefore, we make the following contributions:
• We introduce a semi-supervised AR-EN NMT system trained on both parallel and monolingual data for a better translation of sentiment in Arabic UGT.
• We introduce an empirical evaluation method for assessing the transfer of sentiment between Arabic and English in the UGT domain.
• We make our compiled dataset, cross-lingual language models and semi-supervised NMT system publicly available.⁴
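As an illustration of the two pretraining objectives named above, the following sketch builds training inputs in the spirit of Lample and Conneau (2019): MLM masks tokens in a single monolingual sentence, while TLM concatenates a parallel sentence pair and masks tokens on both sides so that the model can use context from either language. It is a simplification under stated assumptions, not the authors' implementation: every selected token is replaced by a `[MASK]` symbol, and the function names, the separator token and the 15% default masking rate are illustrative.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob, rng):
    """Return (masked, targets); targets holds the original token at masked positions, else None."""
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets


def mlm_example(sentence, mask_prob=0.15, rng=None):
    """MLM: mask tokens in one monolingual sentence (needs no parallel data)."""
    rng = rng or random.Random(0)
    return mask_tokens(sentence, mask_prob, rng)


def tlm_example(src, tgt, mask_prob=0.15, rng=None, sep="</s>"):
    """TLM: concatenate a parallel pair and mask tokens on both sides."""
    rng = rng or random.Random(0)
    src_masked, src_targets = mask_tokens(src, mask_prob, rng)
    tgt_masked, tgt_targets = mask_tokens(tgt, mask_prob, rng)
    tokens = src_masked + [sep] + tgt_masked
    targets = src_targets + [None] + tgt_targets  # the separator is never predicted
    # Positions restart for the target sentence; language ids mark each side.
    positions = list(range(len(src_masked) + 1)) + list(range(len(tgt_masked)))
    langs = [0] * (len(src_masked) + 1) + [1] * len(tgt_masked)
    return tokens, targets, positions, langs
```

In this setting, MLM could be applied to the monolingual DA, MSA and English text, while TLM requires sentence pairs, which is why the supervised objective depends on the available parallel data.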
To present our contributions, the paper is divided as follows: Section 2 provides a summary of relevant approaches to supervised and unsupervised MT as well as research attempts for the translation of DA. Section 3 describes our semi-supervised NMT system setup and its requirements. Section 4 presents the experiments we conducted on our compiled datasets as well as the assessment methods used to evaluate the improvement of sentiment translation in DA UGT. Finally, Section 5 presents our conclusions on the different experiments and the limitations of the study.
2 Related Work
The earliest attempt to solve the problem of translating DA was introduced by Zbib et al. (2012). They created the largest existing parallel dataset for DA to English, which is relied upon in most MT research for DA. The dataset consists of around 250k parallel sentences, which were translated from DA to EN using Mechanical Turk. Most of the DA is in the Levantine and Egyptian dialects, but none of the texts used belong to the UGT domain. They show that when translating the dialectical test sets, the DA-EN MT system performs 6.3 and 7.0 BLEU points higher than an MT system trained on a 150M-word MSA-EN parallel corpus. Another approach to solving the data scarcity problem was introduced by Salloum and Habash (2013), who propose pivoting to MSA instead of directly translating from DA to EN. They transform DA sentences into MSA using a large number of hand-written morphosyntactic transfer rules.
⁴ Link removed to preserve anonymity
There have been other attempts to create DA-EN and DA-MSA parallel datasets such as the multi-dialectical MDC and MADAR datasets (Bouamor et al., 2014, 2018), the QCA speech corpus (Elmahdy et al., 2014), and the PADIC parallel corpus, which includes five dialects and MSA but not English (Meftouh et al., 2015). These datasets, however, are relatively small (at most 14.7k parallel sentences) and differ considerably from the UGT domain. Since the scarcity of DA-EN data persists at the time of writing, the most recent attempts to improve the translation of DA to English have focused either on augmenting the available datasets with bootstrapping techniques (Abid, 2020) or on training with the large available MSA datasets and fine-tuning on the smaller DA datasets (Sajjad et al., 2020).
A recent research line in MT, introduced to overcome the sparsity of gold-standard parallel data for low-resource languages, is unsupervised MT, which relies solely on monolingual data in the source and target languages for training (Lample et al., 2017, 2018; Artetxe et al., 2017). The key idea is to build a common latent space for two (or more) languages which can be used to reconstruct a sentence in a given language from a noisy version of it (Vincent et al., 2008), or to obtain the translated sentence by using a back-translation procedure (Sennrich et al., 2015a). The use of high-quality cross-lingual word embeddings pretrained by state-of-the-art cross-lingual language models to initialise unsupervised MT systems has recently contributed to a significant improvement in their performance (Lample and Conneau, 2019; Artetxe et al., 2019; Conneau et al., 2020). In this research, we combine supervised and unsupervised MT methods to compensate for the sparsity of DA-EN data from the UGT domain. Our semi-supervised system is explained in the following section.
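The denoising idea mentioned above, reconstructing a sentence from a noisy version of it, depends on a noise function. The sketch below shows one common choice in unsupervised MT: random word dropping followed by a bounded local shuffle, in which no token moves more than `k` positions. The function name, the default values and the choice of noise are assumptions for illustration; this excerpt does not state which noise model the authors use.

```python
import random


def add_noise(tokens, drop_prob=0.1, k=3, rng=None):
    """Corrupt a sentence by dropping words and shuffling them locally.

    After dropping, each surviving token at index i gets the sort key
    i + u with u drawn uniformly from [0, k + 1), so sorting by key
    moves a token at most k positions away from where it started.
    """
    rng = rng or random.Random(0)
    kept = [tok for tok in tokens if rng.random() >= drop_prob]
    if not kept and tokens:
        kept = [tokens[0]]  # never return an empty sentence
    keys = [i + (k + 1) * rng.random() for i in range(len(kept))]
    order = sorted(range(len(kept)), key=lambda i: keys[i])
    return [kept[i] for i in order]
```

A denoising autoencoder would then be trained to map `add_noise(sentence)` back to `sentence`, which only requires monolingual text.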
Figure 1: Semi-supervised NMT system
3 Semi-supervised NMT System Set Up
3.1 Cross-Lingual Language Model
Due to their lexico-grammatical differences, we treat dialectical and standard Arabic as two distinct languages. Hence, we construct a multi-directional NMT system between the permutations of DA-MSA-EN with the objective of obtaining the highest translation accuracy in the DA-EN direction. The setup of this system is shown in Figure 1. For constructing our semi-supervised NMT system we require the following data:
1. MSA-EN clean parallel data usually used for training NMT,
2. MSA-DA clean parallel data from any domain,
3. DA-EN silver-standard parallel data from the UGT domain with sentiment lexicon infused, and
4. DA monolingual data from the UGT domain.
It should be noted that Arabic UGT is not written in DA per se; it is usually a mix of DA and MSA. Since we are treating DA and MSA as two distinct languages, we need to extract only the DA instances from the UGT dataset. For this purpose, we build our own DA detection classifier as per step (1) in Figure 1.
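Step (1) can be pictured as a simple filter over the UGT sentences. The sketch below assumes a hypothetical scoring function `dialect_score` that returns the estimated probability that a sentence is dialectical. The classifier itself, the threshold and all names are placeholders, since this excerpt does not describe how the authors' DA detection classifier is built.

```python
from typing import Callable, Iterable, List, Tuple


def split_by_dialect(
    sentences: Iterable[str],
    dialect_score: Callable[[str], float],  # hypothetical classifier: P(sentence is DA)
    threshold: float = 0.5,                 # assumed decision threshold
) -> Tuple[List[str], List[str]]:
    """Route each UGT sentence to the DA pool or the MSA pool."""
    da_pool: List[str] = []
    msa_pool: List[str] = []
    for sentence in sentences:
        if dialect_score(sentence) >= threshold:
            da_pool.append(sentence)
        else:
            msa_pool.append(sentence)
    return da_pool, msa_pool
```

Only the first returned list would feed the DA monolingual data listed as requirement 4; raising the threshold trades a smaller DA pool for less MSA contamination.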
In step (2), we pretrain a cross-lingual language model to initialise our NMT system. We follow the approach of Lample and Conneau (2019) to train a