Automatic Evaluation and Analysis of Idioms
in Neural Machine Translation
Christos Baziotis
University of Edinburgh
c.baziotis@ed.ac.uk
Prashant Mathur
Amazon AI
pramathu@amazon.com
Eva Hasler
Amazon AI
ehasler@amazon.com
Abstract
A major open problem in neural machine trans-
lation (NMT) is the translation of idiomatic ex-
pressions, such as “under the weather”. The
meaning of these expressions is not composed from
the meanings of their constituent words, and
NMT models tend to translate them literally
(i.e., word-by-word), which leads to confusing
and nonsensical translations. Research on id-
ioms in NMT is limited and obstructed by the
absence of automatic methods for quantifying
these errors. In this work, first, we propose a
novel metric for automatically measuring the
frequency of literal translation errors without
human involvement. Equipped with this met-
ric, we present controlled translation experi-
ments with models trained in different con-
ditions (with/without the test-set idioms) and
across a wide range of (global and targeted)
metrics and test sets. We explore the role of
monolingual pretraining and find that it yields
substantial targeted improvements, even with-
out observing any translation examples of the
test-set idioms. In our analysis, we probe the
role of idiom context. We find that the ran-
domly initialized models are more local or
“myopic” as they are relatively unaffected by
variations of the idiom context, unlike the pre-
trained ones.
1 Introduction
Neural machine translation (NMT; Sutskever et al. 2014; Bahdanau et al. 2015; Vaswani et al. 2017) struggles with the translation of rare multi-word expressions (MWE) (Koehn and Knowles, 2017). Non-compositional phrases, such as idioms (e.g., “piece of cake”), are one of the most challenging types of MWEs, because their meaning is figurative and cannot be derived from the meaning of their constituents (Nunberg et al., 1994; Liu, 2017). NMT models tend to translate these expressions literally (i.e., word-by-word), which leads to erroneous translations. In this paper, our focus is on the translation of idiomatic expressions, in contrast to most prior work, which treats idioms as part of MWEs in general (Constant et al., 2017; Cook et al., 2021).

(This work was done during an internship at Amazon.)
The absence of targeted and automatic evalua-
tion is a major obstacle to advances in idiom transla-
tion. Global metrics, such as BLEU (Papineni et al.,
2002) consider the full translation, and thus, the ef-
fects of idiom translation are overshadowed. Previ-
ous efforts on targeted evaluation isolate the idiom
translation using word alignments (Fadaee et al.,
2018) or word edit distance (Zaninello and Birch,
2020). These approaches measure the accuracy of
idiom translation but do not account for literal trans-
lation errors. Shao et al. (2018) proposed a method
for estimating the frequency of such errors, but
it requires the creation of language-specific hand-
crafted lists (i.e., blocklists) with words that corre-
spond to literal translation errors.
In this work
1
, we present a study of idioms in
NMT, with the goal of facilitating future research in
this direction. First, we propose a novel metric for
the automatic evaluation of literal translation errors
(LitTER), that does not require any hand-crafted
blocklists. We incorporate LitTER, which com-
plements alignment-based metrics (Fadaee et al., 2018), into a unified targeted evaluation framework.
Next, we present translation experiments in a
controlled setting, by using different training splits
to test models under different conditions (e.g., zero-
shot). To improve idiom translation we leverage
monolingual data, which are more abundant than parallel data and contain idioms at higher frequencies and in more diverse contexts. We exploit mono-
lingual data via pretraining (mBART; Liu et al.
2020), which is a generic and task-agnostic ap-
proach, unlike prior work that considers ad-hoc so-
lutions (Fadaee et al., 2018; Zaninello and Birch,
2020). We find that monolingual pretraining yields
strong targeted gains, even when models have not
seen any translation examples of the test idioms.
¹ Code and data available at github.com/amazon-research/idiom-mt
We also present an extensive analysis of how dif-
ferent models translate idioms. Specifically, we
use a series of probing methods that encode id-
ioms within different contexts (Garcia et al., 2021; Yu and Ettinger, 2020), and measure how this af-
fects the translation outputs and the decoder dis-
tributions. We find that the randomly initialized
models are more “myopic” compared to the pre-
trained ones, as they are relatively unaffected when
we vary the idiom context. Our contributions are:
1. We propose LitTER (§2.1), a novel metric for measuring the frequency of literal translation errors, and embed it into a framework (§2) for automatic and targeted evaluation of idiom translation, complementing prior work.

2. We present translation results (§3.3) in a controlled setting and across a wide range of metrics. We find that pre-training on monolingual data yields substantial targeted improvements.

3. We present an extensive analysis (§4) with a series of probes, showing how context affects idiom translation. We find that models are more uncertain when translating idioms and that pre-training makes models more contextual.
2 Automatic Targeted Evaluation
2.1 Literal Translation Error Rate (LitTER)
We propose literal translation error rate (LitTER),
a novel metric that measures the frequency of literal translation errors made by a model. A literal translation
error occurs if any of the words of a span in the
source sentence has been wrongly translated liter-
ally in the target language. Our metric is inspired
by the method of Shao et al. (2018) which iden-
tifies possible literal translation errors, by check-
ing if a translation output contains any blocklisted
words. While this method is effective at capturing
these errors, it relies on hand-crafted blocklists. We
overcome this limitation by automatically creating
word blocklists for a given expression.
Our method is based on two key ideas. First, we use bilingual word dictionaries,² which are relatively easy to obtain, to translate the words of an annotated source span into the target language, and produce blocklists with candidate literal translation errors. Then, we use the reference translations to filter the blocklists by removing those words that occur in the reference. This avoids triggering the blocklist when the correct translation is literal.

² In this work we use the MUSE dictionaries (Lample et al., 2018).
"Ahmedabad got the first child-
friendly zebra crossing in the world."
"Tο Αχμενταμπάντ απέκτησε την
πρώτη φιλική προς τα παιδιά
διάβαση πεζών στον κόσμο."
"Tο Ahmedabad πήρε την πρώτη
φιλική προς τα παιδιά ζέβρα
διάβαση στον κόσμο."
𝑑𝑖𝑐𝑡 zebra =ζέβρα
𝑑𝑖𝑐𝑡 crossing =πέρασμα, διάβαση
Blocklists
{ζέβρα}
{πέρασμα, διάβαση}
SRC
REF
HYP
{ζέβρα}
{πέρασμα, διάβαση}
{ζέβρα}
1. Create candidate errors
2. Filter candidates
3. Check for errors
Figure 1: Overview of the algorithm for the Literal
Translation Error Rate (LitTER). For each sentence, we
first produce candidate literal translation errors (block-
list), using all the word translations of the source idiom
words. Then, we filter the candidates in the blocklist
by looking at the reference. Finally, we check if the hy-
pothesis triggers the remaining words in the blocklist.
Algorithm

1. Select from the source text the list of words s = ⟨s1, s2, ..., sN⟩ that belong to the annotated expression (i.e., idiom).

2. For each word si, obtain all its word translation(s) in the target language using a bilingual word dictionary and add them to a blocklist bi = ⟨t1, t2, ..., tM⟩, creating a candidate list of blocklists Bs = ⟨b1, b2, ..., bN⟩.³

3. For each word in the reference (R), search if it occurs in any of the blocklists bi. If so, remove the corresponding blocklist bi from Bs to avoid false positives. For example, in Figure 1, where the words διάβαση and πέρασμα are synonyms, if we removed only διάβαση but left πέρασμα as a blocklisted word and a model generated it in its translation, this would wrongly trigger a literal translation error.

4. Check if the hypothesis contains any blocklisted words. If it does, we mark this hypothesis as having a literal translation error.
The final score is the percentage of translations that trigger the blocklist. As LitTER requires source-side annotations, we collect test data with idioms on the source side and annotate the spans where they occur (§3.1). Appendix C shows examples of LitTER evaluating real sentences in our data; a minimal code sketch of the procedure is given below.
³ In practice, t1, t2, ..., tM in a blocklist are synonyms of each other, as they are translations of the same source word.
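To make the procedure concrete, here is a minimal Python sketch of LitTER for a single sentence and a whole test set. Tokenization, the dictionary format, and the corpus-level aggregation are simplified assumptions, not the exact released implementation; in particular, the reported scores in this paper are additionally macro-averaged over idioms (§2.3).

```python
# Minimal sketch of LitTER (simplified assumptions; the released code may differ).

def litter_error(idiom_words, reference, hypothesis, bilingual_dict):
    """Return True if the hypothesis triggers a literal translation error.

    idiom_words:    source-side tokens of the annotated idiom span
    reference:      target-side tokens of the reference translation
    hypothesis:     target-side tokens of the model output
    bilingual_dict: maps a source word to its word translations
    """
    # Steps 1-2: one blocklist per idiom word, built from the dictionary.
    blocklists = [set(bilingual_dict.get(w, [])) for w in idiom_words]

    # Step 3: drop any blocklist whose words appear in the reference; if the
    # reference itself uses a literal word, a literal translation is correct
    # and must not count as an error.
    ref_tokens = set(reference)
    blocklists = [b for b in blocklists if not (b & ref_tokens)]

    # Step 4: the hypothesis triggers an error if it contains any remaining
    # blocklisted word.
    hyp_tokens = set(hypothesis)
    return any(b & hyp_tokens for b in blocklists)


def litter(corpus, bilingual_dict):
    """Percentage of sentences that trigger the blocklist.

    corpus: iterable of (idiom_words, reference_tokens, hypothesis_tokens).
    """
    errors = sum(
        litter_error(idiom, ref, hyp, bilingual_dict)
        for idiom, ref, hyp in corpus
    )
    return 100.0 * errors / len(corpus)
```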
2.2 Alignment-based Evaluation
To measure idiom translation accuracy, we use
Alignment-based Phrase Translation Evaluation
(APT-Eval), by extending Fadaee et al. (2018) with
subword-level metrics. APT-Eval uses word align-
ments to find the words in the hypothesis and ref-
erence sentences, respectively, that align with the
annotated idiom source span, and then compares
the retrieved matches to each other. We consider
two evaluation metrics. First, we use unigram precision, which measures the ratio of words in the reference spans that occur in the hypothesis spans, as in Fadaee et al. (2018). We also use chrF (Popović, 2015), which measures character n-gram overlap.
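The sketch below illustrates the span-level comparison in APT-Eval, assuming the idiom-aligned word spans have already been extracted from the word alignments; using sacrebleu's sentence-level chrF here is an illustrative choice, not necessarily the exact implementation used in the paper.

```python
# Sketch of the two APT-Eval span metrics over aligned idiom spans.
import sacrebleu

def unigram_precision(ref_span, hyp_span):
    """Ratio of reference-span words that also occur in the hypothesis span."""
    if not ref_span:
        return 0.0
    hyp_tokens = set(hyp_span)
    return sum(w in hyp_tokens for w in ref_span) / len(ref_span)

def span_chrf(ref_span, hyp_span):
    """Character n-gram F-score between the two spans (sacrebleu's chrF)."""
    return sacrebleu.sentence_chrf(" ".join(hyp_span), [" ".join(ref_span)]).score
```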
LitTER vs. APT-Eval
While APT-Eval is a targeted evaluation metric, it only measures translation accuracy. This means that, given an inaccurate translation, it cannot tell us whether the error is a literal translation error. LitTER, however, quantifies this particular issue that affects NMT.
2.3 Handling Idiom Frequency Imbalances
Different idioms have significantly different fre-
quencies (Appendix A.1). However, prior work has
overlooked this fact (Zaninello and Birch, 2020; Fadaee et al., 2018; Shao et al., 2018; Rikters and Bojar, 2017). Thus, over-represented idioms can
skew the reported results and favour models that
have overfitted on them. To address this, we report
all of our targeted evaluation results (i.e., LitTER,
APT-Eval) by macro-averaging over idioms:
$E(\theta) = \frac{1}{|L|} \sum_{j=1}^{|L|} \frac{1}{|P_j|} \sum_{i=1}^{|P_j|} M(\theta(s_i), t_i)$   (1)

where $L$ denotes the set of distinct idioms in a test set and $P_j = \{\langle s_i, t_i\rangle : L_j \text{ occurs in } \langle s_i, t_i\rangle\}$ denotes the set of sentence pairs containing the idiom $L_j$. The model is denoted by $\theta$ and the translation of $x$ by $\theta(x)$. We first compute the average score for the test pairs of each idiom with a given metric $M$, and then average these values to produce $E$.
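A small sketch of the macro-averaging in Eq. (1): scores are first averaged within each idiom's test pairs and only then across idioms, so frequent idioms cannot dominate the result. Function and variable names are illustrative.

```python
# Sketch of the macro-averaged targeted score in Eq. (1).
from collections import defaultdict
from statistics import mean

def macro_average(examples, metric):
    """examples: iterable of (idiom, hypothesis, reference) tuples;
    metric: callable scoring one (hypothesis, reference) pair."""
    per_idiom = defaultdict(list)
    for idiom, hyp, ref in examples:
        per_idiom[idiom].append(metric(hyp, ref))
    # Inner mean: per-idiom average; outer mean: average over distinct idioms.
    return mean(mean(scores) for scores in per_idiom.values())
```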
3 Experiments
3.1 Data and Training Splits
We present experiments on en→fr and en→es data. For each language pair, we concatenate the data from Europarl v7⁴ (Koehn, 2005), part of the WMT news translation task (Bojar et al., 2014), and from TED talk transcripts released as part of the IWSLT 2017 shared task⁵ (Cettolo et al., 2017).

⁴ www.statmt.org/europarl/
Idiom Data
We split the parallel data into regu-
lar and idiom data using a pattern-matching tool
that we developed. Our tool takes as input a list of
idioms and extracts sentences from a corpus con-
taining these idioms. We also annotate the span in
which each idiom occurs within a sentence, to en-
able the targeted evaluation metrics. This approach
is similar to Fadaee et al. (2018), but we build
our tool on top of Spacy’s (Honnibal and Montani,
2017) rule-based matching engine. For each phrase
in the input list, we automatically create pattern-
matching rules that capture complex variations of
a given phrase. See Appendix Afor details.
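Below is a hypothetical sketch of such a matcher using spaCy's rule-based Matcher (v3 API); the actual tool builds more elaborate patterns to capture complex variations of each phrase, so this only illustrates the basic idea of lemma-based idiom matching and span annotation.

```python
# Hypothetical sketch of idiom matching and span annotation with spaCy.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

idioms = ["piece of cake", "under the weather"]
for idiom in idioms:
    # One token-level pattern per idiom word, matched on its lemma so that
    # simple inflected variants (e.g. "pieces of cake") are also captured.
    pattern = [{"LEMMA": tok.lemma_} for tok in nlp(idiom)]
    matcher.add(idiom, [pattern])

def annotate(sentence):
    """Return (idiom, start_char, end_char) for every idiom found."""
    doc = nlp(sentence)
    spans = []
    for match_id, start, end in matcher(doc):
        span = doc[start:end]
        spans.append((nlp.vocab.strings[match_id], span.start_char, span.end_char))
    return spans

print(annotate("Fixing this bug was a piece of cake."))
```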
In this work, we use a list of 225 English idioms, which we manually collected and plan to make publicly available. We feed this list into our pattern-
matching tool, and extract (and annotate) transla-
tion pairs that contain an idiom on the source side.
The regular data are used only for training. The id-
iom data are further divided into the idiom-train
and idiom-test sets. For each idiom (e.g., “under
the weather”) in our original idiom data, we put
half of its sentence pairs to the idiom-train and the
other half to the idiom-test sets, to obtain a bal-
anced distribution. We discard sentences with id-
ioms that occur only once. We conduct controlled
experiments, in the following testing conditions:
• Zero: training data includes only regular parallel data, and we measure how models perform on unseen idioms at test time.

• Joint: training data includes the regular and idiom-train data, and we measure how models perform on idioms observed (in a different context) in the training data.

• Upsampling: same as the joint split, but we up-sample the idiom-train data N times. This setting measures whether it is necessary to up-sample the targeted training data (idiom-train) to achieve better translation quality of idioms. (A small sketch of the split and up-sampling is shown after this list.)
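The sketch referenced in the Upsampling item above: a per-idiom 50/50 split of the idiom data, with an optional N-times repetition of the idiom-train portion. The data format, shuffling, and file handling are simplified assumptions.

```python
# Sketch of the per-idiom 50/50 split and the up-sampled idiom-train set.
import random
from collections import defaultdict

def split_idiom_data(idiom_pairs, upsample_n=1, seed=0):
    """idiom_pairs: list of (idiom, src_sentence, tgt_sentence) tuples."""
    random.seed(seed)
    by_idiom = defaultdict(list)
    for idiom, src, tgt in idiom_pairs:
        by_idiom[idiom].append((src, tgt))

    train, test = [], []
    for idiom, pairs in by_idiom.items():
        if len(pairs) < 2:      # discard idioms that occur only once
            continue
        random.shuffle(pairs)
        half = len(pairs) // 2
        train.extend(pairs[:half])   # idiom-train
        test.extend(pairs[half:])    # idiom-test

    # "Upsampling" condition: repeat the idiom-train data N times (e.g. 20x).
    return train * upsample_n, test
```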
Evaluation
For development, we use the IWSLT dev-set for each language pair. For general-purpose translation evaluation, we report results on the WMT newstest14 and IWSLT'17 test sets for en→fr, and on the WMT newstest13 and IWSLT'17 test sets for en→es. For the targeted idiom evaluation (i.e., LitTER and APT-Eval), we use the extracted idiom-test data per language pair. To generate the word alignments for APT-Eval, we trained a fast-align (Dyer et al., 2013) model on each language pair's training data. For decoding, we use beam search with a beam size of 5, and evaluate all models using BLEU (Papineni et al., 2002) computed with SacreBLEU (Post, 2018).

⁵ sites.google.com/site/iwsltevaluation2017/TED-tasks

Data                             en→fr      en→es
Europarl                       2,007,723  1,965,734
IWSLT                            275,085    265,625
Combined (after preprocessing) 2,155,543  2,119,686
Regular                        2,152,716  2,116,889
Idiom-train                        1,327      1,312
Idiom-test                         1,383      1,373
WMT-test                           3,003      3,000
IWSLT-test                         2,632      2,502

Table 1: Dataset statistics.
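For reference, a minimal sketch of the corpus-level BLEU computation with SacreBLEU, assuming detokenized hypotheses and one reference per sentence; the sentences here are placeholders.

```python
# Corpus BLEU with SacreBLEU (placeholder data).
import sacrebleu

hypotheses = ["the cat is under the weather"]   # system outputs
references = ["the cat is feeling unwell"]      # one reference per sentence

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(round(bleu.score, 2))
```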
Preprocessing
We first filter out sentence pairs with more than 80 words or with a length ratio over 1.5. Then, we tokenize the remaining sentences using SentencePiece⁶ (SPM; Kudo and Richardson 2018). For the randomly initialized models, we train SPM models with a joint vocabulary of 60K symbols on the concatenation of the source- and target-side of the regular training data. For the mBART fine-tuning experiments, we use the SPM model of mBART (250K symbols).
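A sketch of the subword model training described above, using the SentencePiece Python API; the file paths are placeholders, and any training options beyond those stated in the text (unigram model, coverage of 0.9999, 60K joint vocabulary) are assumptions.

```python
# Sketch of training and applying the joint SPM model (placeholder paths).
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="regular_train.en-fr.txt",   # concatenated source + target sides
    model_prefix="spm_enfr_60k",
    vocab_size=60000,
    model_type="unigram",
    character_coverage=0.9999,
)

sp = spm.SentencePieceProcessor(model_file="spm_enfr_60k.model")
print(sp.encode("under the weather", out_type=str))
```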
3.2 Models
Besides training models from scratch, we also investigate how pretraining on monolingual data affects idiom translation; such pretraining yields substantial improvements in generic translation quality (Lample and Conneau, 2019; Song et al., 2019; Liu et al., 2020). However, it is not obvious whether monolingual data can help idiom translation, as they do not contain any examples of how to translate an idiom from one language into another.
We use mBART (Liu et al., 2020), which is pretrained on monolingual data from many languages, via fine-tuning. We hypothesize that one way multilingual pre-training can help is by bootstrapping over the source and target language contexts in which idioms occur. We also consider injecting different types of noise during fine-tuning, to corrupt the (encoder or decoder) input context and measure the effects on the targeted evaluation metrics. Specifically, we use source-side word masking and replacement (Baziotis et al., 2021), and

⁶ We use the unigram model with coverage=0.9999.
target-side word-replacement noise (Voita et al., 2021). In our experiments, “random” denotes a randomly initialized model, while “mBART” stands for using mBART as initialization. For noisy fine-tuning we train the following variants: “mBART+mask”, where we mask 10% of the source tokens; “mBART+replace (enc)”, where we replace 10% of the source tokens with random ones; and “mBART+replace (dec)”, where we replace 10% of the target tokens with random ones.
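A minimal sketch of this noising, applied to sequences of subword IDs; the mask index, vocabulary size, and sampling details are placeholder assumptions rather than the exact training configuration.

```python
# Sketch of the noisy fine-tuning variants: mask or replace 10% of the tokens
# on either the encoder (source) or decoder (target) side.
import random

def add_noise(token_ids, noise="mask", ratio=0.10, mask_id=3, vocab_size=250000):
    noisy = list(token_ids)
    for i in range(len(noisy)):
        if random.random() < ratio:
            if noise == "mask":                    # mBART+mask
                noisy[i] = mask_id
            else:                                  # mBART+replace (enc/dec)
                noisy[i] = random.randrange(vocab_size)
    return noisy
```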
Model Configuration
For fair comparison, the
randomly initialized models use the same architec-
ture as mBART. Specifically, the models are based
on the Transformer architecture, with 12 encoder
and decoder layers, 1024 embedding size and 16
self-attention heads. Our code is based on the offi-
cial mBART implementation in Fairseq.
Optimization
We optimized our models using Adam (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.999, and ε = 1e-6. For the random initialization experiments, the models were trained for 140K updates with batches of 24K tokens, using a learning rate of 1e-4 with a linear warm-up of 4K steps, followed by inverse square-root decay. For the mBART initialization experiments, the models were trained for 140K updates with batches of 12K tokens, using a fixed learning rate of 3e-5 with a linear warm-up of 4K steps. In all experiments, we applied dropout of 0.3, attention dropout of 0.1 and label smoothing of 0.1. For model selection, we evaluated each model every 5K updates on the dev set, and selected the one with the best BLEU.
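For clarity, a small sketch of the learning-rate schedule used for the randomly initialized models (linear warm-up followed by inverse square-root decay); the exact scaling inside the training framework may differ.

```python
# Linear warm-up to the peak learning rate, then inverse square-root decay.
def learning_rate(step, peak_lr=1e-4, warmup=4000):
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * (warmup ** 0.5) / (step ** 0.5)
```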
3.3 Results
In this section, for brevity, we discuss a subset of our results, in particular our experiments in en→fr. Results for en→es are consistent with en→fr and are included in Appendix B. Table 2 summarizes all of our main results. Besides global evaluation using BLEU (§3.3.2) on diverse test sets, we also consider two targeted evaluation methods (§3.3.1) that focus on how the idioms are translated using our idiom-test set. For the upsampling split, we up-sample the idiom-train data 20x. We also experimented with 100x upsampling, but models started to exhibit overfitting effects (see §B, §D).
3.3.1 Targeted Evaluation
In targeted evaluation, we focus only on how mod-
els translate the source-side idioms. We present re-
sults on our proposed LitTER metric and on APT-Eval.