
Data                             en→fr       en→es
Europarl                         2,007,723   1,965,734
IWSLT                              275,085     265,625
Combined (after preprocessing)   2,155,543   2,119,686
Regular                          2,152,716   2,116,889
Idiom-train                          1,327       1,312
Idiom-test                           1,383       1,373
WMT-test                             3,003       3,000
IWSLT-test                           2,632       2,502

Table 1: Dataset statistics
use the extracted idiom-test data per language pair.
To generate the word alignments for APT-Eval, we trained a fast-align (Dyer et al., 2013) model on each language pair's training data. For decoding, we use beam search with a beam size of 5, and evaluate all models using BLEU (Papineni et al., 2002) computed with SacreBLEU (Post, 2018).
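For reference, the BLEU computation can be reproduced with the sacrebleu Python package as in the minimal sketch below; the file names are hypothetical placeholders (fast-align itself is a separate command-line tool and is not shown).

```python
# Minimal sketch: corpus-level BLEU with sacrebleu (file names are hypothetical).
import sacrebleu

with open("hyp.detok.fr") as f:          # detokenized system outputs
    hypotheses = [line.strip() for line in f]
with open("ref.detok.fr") as f:          # detokenized references
    references = [line.strip() for line in f]

# corpus_bleu takes a list of hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```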
Preprocessing
We first filter out sentence pairs with more than 80 words or with a length ratio over 1.5. Then, we tokenize the remaining sentences using SentencePiece (SPM; Kudo and Richardson, 2018), specifically the unigram model with coverage of 0.9999. For the randomly initialized models, we train SPM models with a joint vocabulary of 60K symbols on the concatenation of the source and target sides of the regular training data. For the mBART fine-tuning experiments, we use the SPM model of mBART (250K symbols).
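As an illustration, the sketch below implements the filtering and joint SPM training described above using the sentencepiece Python bindings. The thresholds and vocabulary size follow the text; file names are placeholders, and treating the footnoted coverage value as character coverage is our assumption.

```python
# Sketch of the preprocessing steps described above (file names are placeholders).
import sentencepiece as spm

def keep_pair(src: str, tgt: str, max_len: int = 80, max_ratio: float = 1.5) -> bool:
    """Drop pairs with more than 80 words or a length ratio over 1.5."""
    s, t = len(src.split()), len(tgt.split())
    if s == 0 or t == 0 or max(s, t) > max_len:
        return False
    return max(s, t) / min(s, t) <= max_ratio

# Filter the parallel data and pool both sides into one file for SPM training.
with open("train.en") as f_src, open("train.fr") as f_tgt, \
     open("train.filtered.both", "w") as f_out:
    for src, tgt in zip(f_src, f_tgt):
        if keep_pair(src, tgt):
            f_out.write(src)
            f_out.write(tgt)

# Joint unigram SPM model with a 60K vocabulary over source and target text.
spm.SentencePieceTrainer.train(
    input="train.filtered.both",
    model_prefix="spm.joint",
    vocab_size=60000,
    model_type="unigram",
    character_coverage=0.9999,  # assumption: the footnoted coverage is character coverage
)
```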
3.2 Models
Besides training models from scratch, we also investigate how pretraining on monolingual data affects idiom translation. Such pretraining yields substantial improvements in generic translation quality (Lample and Conneau, 2019; Song et al., 2019; Liu et al., 2020). However, it is not obvious whether monolingual data can help idiom translation, as it does not contain any examples of how to translate an idiom from one language into another.
We fine-tune mBART (Liu et al., 2020), which is pretrained on monolingual data from many languages. We hypothesize that one way multilingual pretraining can help is by bootstrapping over the source- and target-language contexts in which idioms occur. We also consider injecting different types of noise during fine-tuning, to corrupt the (encoder or decoder) input context and measure the effects on the targeted evaluation metrics. Specifically, we use source-side word masking and replacement (Baziotis et al., 2021) and
target-side word-replacement noise (Voita et al., 2021). In our experiments, “random” denotes a randomly initialized model, while “mBART” stands for using mBART as initialization. For noisy fine-tuning, we train the following variants: “mBART+mask”, where we mask 10% of the source tokens; “mBART+replace (enc)”, where we replace 10% of the source tokens with random ones; and “mBART+replace (dec)”, where we replace 10% of the target tokens with random ones.
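A simplified token-level sketch of the three noising schemes is given below. The 10% rate follows the text; the mask symbol and the toy vocabulary are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative token-level noising for the fine-tuning variants.
import random

MASK = "<mask>"  # assumed mask symbol

def mask_tokens(tokens, rate=0.10):
    """'mBART+mask': mask a fraction of the source tokens."""
    return [MASK if random.random() < rate else t for t in tokens]

def replace_tokens(tokens, vocab, rate=0.10):
    """'mBART+replace (enc/dec)': replace a fraction of tokens with random vocabulary items."""
    return [random.choice(vocab) if random.random() < rate else t for t in tokens]

src = "he kicked the bucket last night".split()
toy_vocab = ["cat", "run", "blue", "house", "quickly"]  # toy stand-in for the SPM vocabulary

print(mask_tokens(src))                 # source-side masking      ("mBART+mask")
print(replace_tokens(src, toy_vocab))   # source- or target-side replacement ("mBART+replace")
```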
Model Configuration
For a fair comparison, the randomly initialized models use the same architecture as mBART. Specifically, the models are based on the Transformer architecture, with 12 encoder and 12 decoder layers, an embedding size of 1024, and 16 self-attention heads. Our code is based on the official mBART implementation in Fairseq.
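For concreteness, the architecture hyper-parameters above can be summarized as follows; this is only an illustrative summary, not the actual configuration object, and the feed-forward size is our assumption based on mBART-large.

```python
# Illustrative summary of the architecture hyper-parameters (mirrors mBART-large).
from dataclasses import dataclass

@dataclass
class ModelConfig:
    encoder_layers: int = 12
    decoder_layers: int = 12
    embed_dim: int = 1024
    attention_heads: int = 16
    ffn_dim: int = 4096  # assumption: standard mBART-large feed-forward size

print(ModelConfig())
```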
Optimization
We optimized our models using Adam (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.999, and ε = 1e-6. For the random-initialization experiments, the models were trained for 140K updates with batches of 24K tokens, using a learning rate of 1e-4 with a linear warm-up of 4K steps, followed by inverse square root decay. For the mBART-initialization experiments, the models were trained for 140K updates with batches of 12K tokens, using a fixed learning rate of 3e-5 with a linear warm-up of 4K steps. In all experiments, we applied dropout of 0.3, attention dropout of 0.1, and label smoothing of 0.1. For model selection, we evaluated each model every 5K updates on the dev set and selected the one with the best BLEU.
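The two learning-rate schedules can be sketched as simple functions of the update step. The constants follow the text; reading the decay as inverse square root is our interpretation of the description.

```python
# Sketch of the two learning-rate schedules described above.
def lr_random_init(step, base_lr=1e-4, warmup=4000):
    """Linear warm-up to base_lr, then inverse square root decay."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * (warmup / step) ** 0.5

def lr_mbart_init(step, base_lr=3e-5, warmup=4000):
    """Linear warm-up to base_lr, then a fixed learning rate."""
    return base_lr * min(1.0, step / warmup)

for s in (1000, 4000, 16000, 140000):
    print(s, lr_random_init(s), lr_mbart_init(s))
```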
3.3 Results
In this section, for brevity, we discuss a subset of our results, in particular our experiments in en→fr. Results for en→es are consistent with en→fr and are included in Appendix B. Table 2 summarizes all of our main results. Besides global evaluation using BLEU (§3.3.2) on diverse test sets, we also consider two targeted evaluation methods (§3.3.1) that focus on how the idioms are translated, using our idiom-test set. For the upsampling split, we upsample the idiom-train data 20x. We also experimented with 100x upsampling, but models started to exhibit overfitting effects (see §B, §D).
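The upsampled split can be built by simply repeating the idiom-train pairs before shuffling, as in the rough sketch below; the file names are hypothetical placeholders.

```python
# Sketch: build the 20x-upsampled training split (file names are placeholders).
import random

with open("regular.train.tsv") as f:
    regular = f.readlines()
with open("idiom.train.tsv") as f:
    idioms = f.readlines()

upsampled = regular + idioms * 20   # repeat the idiom-train pairs 20 times
random.shuffle(upsampled)

with open("upsampled.train.tsv", "w") as f:
    f.writelines(upsampled)
```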
3.3.1 Targeted Evaluation
In targeted evaluation, we focus only on how models translate the source-side idioms. We present results on our proposed LitTER metric and on APT-