
ten contains words spelled alike in MSA and DA but different in meaning due to different pronunciation. An example of these homographs occurs in a DA tweet2 built around the noun ‘نصب’, which commonly means ‘fraud’ in DA when pronounced with the diacritic ‘fatha’ (a short /a/ sound) on the first letter; the tweet should read ‘Enough of the fraud’. The online MT system flips the negative polarity as it mistakes this word for its common homograph in MSA meaning ‘monument’, pronounced with ‘damma’ (a short /u/ sound) on its first and second letters. The mistranslation of the homograph produces a neutral statement, ‘enough monument’, which completely misses the negative polarity of the source.
2https://twitter.com/Abdullahehemidy/status/221985043793444865, Accessed: Aug 2022
The third problem is that sentiment in the DA used in UGT is expressed differently from the way it is expressed in the structured DA data that is commonly used to train DA-EN NMT systems (e.g. Zbib et al. (2012); Bouamor et al. (2014); Elmahdy et al. (2014); Meftouh et al. (2015); Bouamor et al. (2018)). One of the main differences is that UGT typically contains profanity and aggressive words that are not found in the available dialectal data. Moreover, the DA used on online platforms such as Twitter usually contains unusual orthography to express emotions or to obfuscate aggression and, at times, nuanced words that are understood only within context (Ranasinghe et al., 2019). A review of the literature shows that the authentic parallel datasets for DA-EN consist mainly of hand-crafted structured data which differs significantly from this type of noisy DA used in UGT. On the other hand, there is a considerable number of large parallel MSA-EN datasets in various domains (e.g. OPUS3 open-source parallel MSA-EN datasets, which include UN documents, TEDx talks, subtitles, news commentary, etc.). Since DA in the UGT domain has peculiar qualities, since it differs on the lexico-grammatical level from Standard Arabic and since, at times, the same word can carry opposite sentiment in the two varieties, the freely available MSA datasets are not optimal for translating sentiment in UGT written in a dialectal variety.
3https://opus.nlpl.eu/
Given the scarcity of any substantial gold-standard DA-EN data within the UGT domain, we propose to improve the transfer of sentiment in Arabic UGT by training a semi-supervised NMT system where we leverage the relatively large gold-standard MSA-EN data together with DA monolingual data
from the UGT domain. We take advantage of pre-
training a cross-lingual language model with both
a Masked Language Modelling (MLM) objective
and a Translation Language Modelling (TLM) ob-
jective for creating a shared embedding space for
English, MSA and DA. We show that initialising
our NMT model with these cross-lingual pretrained
word representations has a significant impact on
the translation performance in general and on the
transfer of sentiment in particular. In this research,
therefore, we make the following contributions:
• We introduce a semi-supervised AR-EN NMT system trained on both parallel and monolingual data for a better translation of sentiment in Arabic UGT.
• We introduce an empirical evaluation method for assessing the transfer of sentiment between Arabic and English in the UGT domain.
• We make our compiled dataset, cross-lingual language models and semi-supervised NMT system publicly available4.
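To make the MLM and TLM pretraining objectives described above more concrete, the sketch below illustrates, in a deliberately simplified form, how training examples for the two objectives are typically constructed in XLM-style cross-lingual pretraining. It is an illustrative sketch rather than the authors' implementation: the whitespace tokenisation, the 15% masking rate and the mask_tokens, mlm_example and tlm_example helpers are assumptions introduced only for this example.

import random

MASK, SEP = "[MASK]", "[SEP]"

def mask_tokens(tokens, mask_prob=0.15):
    """Randomly corrupt a token sequence; return (corrupted, targets), where
    targets[i] is the original token the model must predict, or None."""
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            corrupted.append(MASK)   # hide the token
            targets.append(tok)      # the model must recover it
        else:
            corrupted.append(tok)
            targets.append(None)     # not part of the prediction loss
    return corrupted, targets

def mlm_example(sentence):
    """MLM: mask tokens in a single monolingual sentence (drawn here from the
    EN, MSA or DA monolingual streams)."""
    return mask_tokens(sentence.split())

def tlm_example(src_sentence, tgt_sentence):
    """TLM: concatenate a parallel pair (e.g. MSA-EN) and mask tokens on both
    sides, so that a masked word can be recovered either from its own context
    or from its translation, pushing the languages into a shared space."""
    joined = src_sentence.split() + [SEP] + tgt_sentence.split()
    return mask_tokens(joined)

if __name__ == "__main__":
    random.seed(0)
    # Monolingual DA/EN text from the UGT domain feeds the MLM objective.
    print(mlm_example("enough of the fraud"))
    # Gold-standard parallel MSA-EN pairs feed the TLM objective.
    print(tlm_example("an msa source sentence", "its english translation"))

In practice, a subword vocabulary shared across English, MSA and DA would be used, and the encoder pretrained with these objectives would then initialise the NMT model, as described above.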
To present our contributions, the paper is divided as follows: Section 2 provides a summary of relevant approaches to supervised and unsupervised MT as well as research attempts at the translation of DA. Section 3 describes our semi-supervised NMT system setup and its requirements. Section 4 presents the experiments we conducted on our compiled datasets as well as the assessment methods used to evaluate the improvement of sentiment translation in DA UGT. Finally, Section 5 presents our conclusions on the different experiments and the limitations of the study.
2 Related Work
The earliest attempt to address the problem of translating DA was introduced by Zbib et al. (2012). They created the largest existing DA-EN parallel dataset, which most MT research on DA relies upon. The dataset consists of around 250k parallel sentences translated from DA to EN via Mechanical Turk. Most of the DA is in the Levantine and Egyptian dialects, but none of the texts used belong to the UGT domain. They show that when translating the dialectal test sets, the DA-EN MT system performs 6.3
4Link removed to preserve anonymity