Integrating Translation Memories into Non-Autoregressive Machine Translation

Jitao Xu†, Josep Crego‡, François Yvon†
†Université Paris-Saclay, CNRS, LISN, 91400, Orsay, France
‡SYSTRAN, 5 rue Feydeau, 75002, Paris, France
{jitao.xu,francois.yvon}@limsi.fr, josep.crego@systrangroup.com
Abstract

Non-autoregressive machine translation (NAT) has recently made great progress. However, most works to date have focused on standard translation tasks, even though some edit-based NAT models, such as the Levenshtein Transformer (LevT), seem well suited to translate with a Translation Memory (TM). This is the scenario considered here. We first analyze the vanilla LevT model and explain why it does not do well in this setting. We then propose a new variant, TM-LevT, and show how to effectively train this model. By modifying the data presentation and introducing an extra deletion operation, we obtain performance that is on par with an autoregressive approach, while reducing the decoding load. We also show that incorporating TMs during training removes the need for knowledge distillation, a well-known trick used to mitigate the multimodality issue.
1 Introduction
Non-autoregressive neural machine translation (NAT) has been greatly advanced in recent years (Xiao et al., 2022). NAT takes advantage of parallel decoding to generate multiple tokens simultaneously and speed up inference. This often comes at the cost of a loss in translation quality when compared to autoregressive (AR) models (Gu et al., 2018a). This gap is slowly closing, and methods based on iterative refinement (Ghazvininejad et al., 2019; Gu et al., 2019; Saharia et al., 2020) and on connectionist temporal classification (Libovický and Helcl, 2018; Gu and Kong, 2021) now report BLEU scores similar to strong AR baselines.
Most works on NAT focus on the standard machine translation (MT) task, where the decoder starts from scratch, with the exception of Susanto et al. (2020) and Xu and Carpuat (2021), who use NAT to integrate lexical constraints in decoding. However, edit-based NAT models, such as the Levenshtein Transformer (LevT) of Gu et al. (2019), seem to be natural candidates to perform MT with Translation Memories (TM). LevT is able to iteratively edit an initial target sequence by performing insertion and deletion operations until convergence. This design also matches the concept of using TMs in MT, where, given a source sentence, we aim to edit a candidate translation retrieved from the TM.
This idea has been used for decades in the localization industry and implemented in basic Computer-Aided Translation tools. Translators wishing to translate a sentence can benefit from fuzzy matching techniques to retrieve similar segments from the TM. These segments can then be revised, thereby improving the productivity and consistency of the translation process (Koehn and Senellart, 2010; Yamada, 2011). The retrieval of similar examples from a TM has also proved useful in conventional (AR) neural MT systems; they can be injected into the encoder (Bulte and Tezcan, 2019; Xu et al., 2020) or as priming signals in the decoder (Pham et al., 2020) to influence the translation process. These studies report significant gains in translation performance in technical domains, where the translation of terms and phraseology greatly benefits from examples found in a TM.
Our main focus in this work is to develop an improved version of LevT suited to the revision part of TM use, where the translation retrieved from the TM is modified via edit operations in a non-autoregressive way. We first show that the original LevT cannot perform well on this task and explain that this failure is a direct consequence of its training design. We propose to fix this issue with TM-LevT, which includes an additional deletion step. Next, we propose to further improve the training procedure in two ways: (a) by also including the retrieved candidate translation on the source side, as done in AR TM-based approaches (Bulte and Tezcan, 2019; Xu et al., 2020); (b) by simultaneously training with empty and non-empty initial target sentences. In our experiments, TM-LevT achieves performance that is on par with a strong AR approach on various domains when translating with TMs, with a reduced decoding load. We also observe that incorporating an initial translation both on the source and target sides makes Knowledge Distillation (KD; Kim and Rush, 2016) unnecessary. This contrasts with standard NAT models, which rely on KD to alleviate the multimodality issue (Gu et al., 2018a). As far as we know, this work is the first to study NAT with TMs in a controlled setting.
Our contributions are the following: (a) we show that the original LevT training scheme is not suited to editing similar translations from a TM; (b) we propose a variant of LevT, TM-LevT, with an improved training procedure, which yields performance that is close to, or even on par with, AR approaches when translating with good TM matches, with a reduced decoding load; (c) we highlight the benefits of multi-task training (with and without TMs) to attain the best performance; (d) we discuss the reasons why KD hurts the training of NAT with TMs.
2 Using Translation Memories in NAT
2.1 Background
TM Retrieval
Given a source sentence f, we aim to retrieve a good match ẽ from the TM. For this, we search the TM for a pair of sentences (f̃, ẽ), where f̃ is similar to f. The corresponding target ẽ is then used to initiate the translation of f. We compute the similarity between f and f̃ as:

    \mathrm{sim}(f, \tilde{f}) = 1 - \frac{\mathrm{ED}(f, \tilde{f})}{\max(|f|, |\tilde{f}|)},    (1)

where ED(f, f̃) is the edit distance between f and f̃, and |f| is the length of f. The intuition is that the closer f and f̃ are, the more suitable ẽ will be. As is customary, we only use TM matches when the similarity score exceeds a predefined threshold; otherwise, we translate from scratch. We discuss the effect of the match similarity in Section 4.5.
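To make the retrieval step concrete, the sketch below implements the similarity of Equation (1) with a word-level edit distance and scans a TM stored as a plain list of sentence pairs. The function names, the list-based TM structure, and the 0.4 threshold are assumptions made for illustration; the paper does not prescribe a particular implementation or threshold value.

```python
# Illustrative sketch of fuzzy-match retrieval with the similarity of Eq. (1).
# The TM is modelled as a plain list of (source, target) token-list pairs;
# names and the 0.4 threshold are hypothetical choices, not from the paper.

def edit_distance(a, b):
    """Word-level Levenshtein distance between token lists a and b."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(f, f_tilde):
    """sim(f, f~) = 1 - ED(f, f~) / max(|f|, |f~|), as in Eq. (1)."""
    return 1.0 - edit_distance(f, f_tilde) / max(len(f), len(f_tilde))

def retrieve_tm_match(f, tm, threshold=0.4):
    """Return the target side of the closest TM entry, or None if below threshold."""
    best_score, best_target = 0.0, None
    for f_tilde, e_tilde in tm:
        score = similarity(f, f_tilde)
        if score > best_score:
            best_score, best_target = score, e_tilde
    return best_target if best_score >= threshold else None
```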
Levenshtein Transformer
LevT is an edit-based NAT model proposed by Gu et al. (2019). It performs translation by iteratively editing an initial target sequence with insertion and deletion operations until convergence. The insertion operation is composed of a placeholder insertion module and a token predictor. The placeholder classifier predicts the number of additional tokens that need to be inserted between any two consecutive tokens in its input sequence. The token predictor then generates a token for each placeholder position. The deletion operation aims to detect prediction errors made by the model. It makes a binary decision for each token, indicating whether it should be deleted or kept. During training, a noised initial target sequence e′ is first generated by randomly dropping tokens from the reference e. The insertion modules learn to reinsert the deleted tokens into e′. The deletion operation is then trained to erase erroneous predictions made during insertion.
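As an illustration of this corruption step, the minimal sketch below builds a noised initial target by independently dropping tokens from the reference; the per-token drop probability and the function name are assumptions made for the example, not the exact scheme used by LevT.

```python
import random

def noise_reference(e, drop_prob=0.5, rng=random):
    """Build a noised initial target e' by randomly dropping tokens from the reference e.

    The insertion modules are then trained to reinsert the dropped tokens.
    Independent per-token dropping with probability 0.5 is an illustrative
    assumption, not LevT's exact corruption procedure.
    """
    return [tok for tok in e if rng.random() > drop_prob]

reference = "A cat is sleeping .".split()
e_prime = noise_reference(reference)  # e.g. ['A', 'is', '.']
```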
During inference, LevT starts with an empty target sequence (e′ = []) and generates the translation by alternately performing deletion and insertion operations until convergence or until a maximum number of decoding rounds is reached. In the first iteration, the deletion is omitted, as no tokens can be deleted from the empty sequence. This iterative refinement procedure converges when the input and output of one iteration are the same, either because LevT predicts nothing to delete and nothing to insert, or because it enters a loop where the deleted tokens are reinserted in the same round. Unlike almost all other NAT models, LevT does not require any external prediction of the target length, as the number of target tokens is iteratively revised and adjusted by the placeholder prediction module. We refer to Gu et al. (2019) for more details about LevT.
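The decoding loop can be summarized by the schematic sketch below, where delete_step and insert_step stand in for the model's deletion pass and its combined placeholder-insertion and token-prediction pass. These callables, and the iteration cap, are placeholders chosen for illustration rather than the actual LevT interface.

```python
from typing import Callable, List

# An edit pass maps (source tokens, current hypothesis) to a revised hypothesis.
EditFn = Callable[[List[str], List[str]], List[str]]

def iterative_refinement(source: List[str],
                         init_target: List[str],
                         delete_step: EditFn,
                         insert_step: EditFn,
                         max_iters: int = 10) -> List[str]:
    """Alternate deletion and insertion until the hypothesis stops changing."""
    hyp = list(init_target)
    for _ in range(max_iters):
        prev = list(hyp)
        if hyp:  # deletion is skipped when starting from an empty target
            hyp = delete_step(source, hyp)
        hyp = insert_step(source, hyp)
        if hyp == prev:  # input and output of the round coincide: converged
            break
    return hyp
```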
2.2 Deficiencies of LevT Training
Even though the edit-based nature of LevT makes it readily able to translate with TMs, it has mostly been applied to standard MT, where the decoder starts with an empty sentence.[1] This is consistent with the overall training scheme, illustrated in Figure 1 (Vanilla LevT), where inputs for the placeholder insertion module are always subsequences of the reference and the deletion module only sees the outputs of the previous token insertion step.

[1] One notable exception is the attempt in Gu et al. (2019) to perform automatic post-editing through iterative revisions.
Settings     | Empty | Random Sent | Shuffle Ref
-------------|-------|-------------|------------
Init         |   -   |     1.3     |     5.0
LevT         |  45.4 |     2.1     |    40.2
LevT vs Init |   -   |    90.4     |     9.4

Table 1: BLEU scores of LevT decoding with various target initializations. Empty refers to standard LevT inference with an empty start. Random Sent uses a random sentence as the initial target. Shuffle Ref starts with a random shuffle of the reference translation. Init reports the BLEU score of the initialization, while LevT vs Init compares LevT's outputs with their starting points.
[Figure 1: A complete training step for TM-LevT, illustrated on the source f = "Un chat dort." with TM match ẽ = "A cat is eating." and reference e = "A cat is sleeping.". The TM-LevT encoder reads the concatenation f [sep] ẽ, and the initial target is e′ = ẽ. The step chains Initial Deletion (init-del), Placeholder Insertion, Token Prediction, and Final Deletion (final-del), each supervised with reference labels: e″ is obtained from e′ by applying deletions from the predicted and reference labels, e‴ from e″ by inserting placeholders from the reference labels, and e⁗ from e‴ by replacing placeholders with predicted tokens. Compared to the original LevT, which starts training from e″, TM-LevT adds the init-del step to delete unrelated tokens from a TM match.]
To illustrate the deficiency of this training scheme, we learned a vanilla LevT model using the datasets of Section 3.1 and initialized the decoder with a sentence randomly selected from the test set and totally unrelated to the source sentence. We observe (Table 1, Random Sent) that LevT's outputs are almost as bad as their starting point. This is because the deletion module fails to delete irrelevant input words, presenting the insertion modules with a fully fluent yet fully inadequate sequence that they are hard-pressed to revise. This contrasts with the Shuffle Ref scenario, where the decoder starts with a random shuffle of the reference. LevT can now make changes during the iterative refinement and generates translations (40.2 BLEU) that are close to standard decoding (45.4 BLEU). The TM-based scenario discussed below presents the same challenge for the deletion module: spotting and deleting irrelevant words. Our proposal will first focus on fixing this issue.
2.3 Improving Edits with TM-LevT
The experiment of the previous section suggests that LevT will have issues editing TM matches, as they often contain tokens that are unrelated to the source and should be removed (see Figure 1 for an example TM match ẽ containing the unrelated word "eating"). The distribution of unrelated tokens may greatly differ from that of the token prediction errors made by LevT, which are the tokens LevT is trained to delete.

We propose a variant of LevT denoted TM-LevT, which includes an extra deletion step (init-del) that applies before the insertion modules. As shown in Figure 1, init-del is trained to detect unrelated tokens in the initial e′, whereas the final deletion (final-del) focuses on prediction errors. During training, we generate examples for the insertion modules by removing from e′ the tokens that either are not in the reference or should be deleted according to the init-del operation. The resulting subsequence e″ is then used to train the insertion operation. TM-LevT does not change the number of parameters, as we use the same classifier for the init-del and final-del steps. During inference, TM-LevT behaves exactly like LevT, iteratively applying deletions and insertions to an initial candidate translation.
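To make the construction of these training examples concrete, the sketch below derives init-del labels and the pruned sequence e″ from a TM match and its reference. The exact labelling procedure is not detailed in this section; aligning e′ to the reference with a longest common subsequence, as done below, is one plausible choice, and all function names are illustrative.

```python
# Hedged sketch: derive init-del training labels and the pruned sequence e''
# from the TM match e' and the reference e, using an LCS alignment. This is
# an assumed labelling scheme for illustration, not the paper's exact recipe.

def lcs_keep_mask(e_prime, e_ref):
    """Mark which tokens of e_prime belong to a longest common subsequence with e_ref."""
    n, m = len(e_prime), len(e_ref)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if e_prime[i] == e_ref[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    keep, i, j = [False] * n, 0, 0
    while i < n and j < m:
        if e_prime[i] == e_ref[j]:
            keep[i] = True
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return keep

def init_del_example(e_prime, e_ref):
    """Return (delete_labels, e''): label 1 marks a token to delete; e'' keeps the rest."""
    keep = lcs_keep_mask(e_prime, e_ref)
    labels = [0 if k else 1 for k in keep]
    e_double_prime = [tok for tok, k in zip(e_prime, keep) if k]
    return labels, e_double_prime

e_prime = "A cat is eating .".split()       # TM match used as initial target
e_ref = "A cat is sleeping .".split()       # reference translation
labels, e_dp = init_del_example(e_prime, e_ref)
# labels == [0, 0, 0, 1, 0] ("eating" is marked for deletion, as in Figure 1)
# e_dp == ['A', 'cat', 'is', '.'], the subsequence used to train insertion
```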