Integrating Translation Memories into Non-Autoregressive Machine Translation

Jitao Xu†, Josep Crego‡, François Yvon†
†Université Paris-Saclay, CNRS, LISN, 91400, Orsay, France
‡SYSTRAN, 5 rue Feydeau, 75002, Paris, France
{jitao.xu,francois.yvon}@limsi.fr, josep.crego@systrangroup.com
Abstract

Non-autoregressive machine translation (NAT) has recently made great progress. However, most works to date have focused on standard translation tasks, even though some edit-based NAT models, such as the Levenshtein Transformer (LevT), seem well suited to translate with a Translation Memory (TM). This is the scenario considered here. We first analyze the vanilla LevT model and explain why it does not do well in this setting. We then propose a new variant, TM-LevT, and show how to effectively train this model. By modifying the data presentation and introducing an extra deletion operation, we obtain performance that is on par with an autoregressive approach, while reducing the decoding load. We also show that incorporating TMs during training removes the need for knowledge distillation, a well-known trick used to mitigate the multimodality issue.
1 Introduction
Non-autoregressive neural machine translation (NAT) has been greatly advanced in recent years (Xiao et al., 2022). NAT takes advantage of parallel decoding to generate multiple tokens simultaneously and speed up inference. This often comes at the cost of a loss in translation quality when compared to autoregressive (AR) models (Gu et al., 2018a). This gap is slowly closing, and methods based on iterative refinement (Ghazvininejad et al., 2019; Gu et al., 2019; Saharia et al., 2020) and on connectionist temporal classification (Libovický and Helcl, 2018; Gu and Kong, 2021) now report BLEU scores similar to strong AR baselines.
Most works on NAT focus on the standard machine translation (MT) task, where the decoder starts from scratch, with the exception of Susanto et al. (2020) and Xu and Carpuat (2021), who use NAT to integrate lexical constraints in decoding. However, edit-based NAT models, such as the Levenshtein Transformer (LevT) of Gu et al. (2019), seem to be natural candidates to perform MT with Translation Memories (TM). LevT is able to iteratively edit an initial target sequence by performing insertion and deletion operations until convergence. This design also matches the concept of using TMs in MT, where, given a source sentence, we aim to edit a candidate translation retrieved from the TM.
This idea has been used for decades in the localization industry and implemented in basic Computer-Aided Translation tools. Translators wishing to translate a sentence can benefit from fuzzy matching techniques to retrieve similar segments from the TM. These segments can then be revised, thereby improving the productivity and consistency of the translation process (Koehn and Senellart, 2010; Yamada, 2011). The retrieval of similar examples from a TM has also proved useful in conventional (AR) neural MT systems; they can be injected into the encoder (Bulte and Tezcan, 2019; Xu et al., 2020) or as priming signals in the decoder (Pham et al., 2020) to influence the translation process. These studies report significant gains in translation performance in technical domains, where the translation of terms and phraseology greatly benefits from examples found in a TM.
Our main focus in this work is to develop an improved version of LevT suited to the revision part of TM use, where the translation retrieved from the TM is modified via edit operations in a non-autoregressive way. We first show that the original LevT cannot perform well on this task and explain that this failure is a direct consequence of its training design. We propose to fix this issue with TM-LevT, which includes an additional deletion step. Next, we propose to further improve the training procedure in two ways: (a) by also including the retrieved candidate translation on the source side, as done in AR TM-based approaches (Bulte and Tezcan, 2019; Xu et al., 2020); (b) by simultaneously training with empty and non-empty initial target sentences. In our experiments, TM-LevT achieves performance that is on par with a strong AR approach on various domains when translating with TMs, with a reduced decoding load. We also observe that incorporating an initial translation both on the source and target sides makes Knowledge Distillation (KD; Kim and Rush, 2016) unnecessary. This contrasts with standard NAT models, which rely on KD to alleviate the multimodality issue (Gu et al., 2018a). As far as we know, this work is the first to study NAT with TMs in a controlled setting.
Our contributions are the following: (a) we show that the original LevT training scheme is not suited to editing similar translations from a TM; (b) we propose a variant of LevT, TM-LevT, with an improved training procedure, which yields performance that is close to, or even on par with, AR approaches when translating with good TM matches, with a reduced decoding load; (c) we highlight the benefits of multi-task training (with and without TMs) to attain the best performance; (d) we discuss the reasons why KD hurts the training of NAT with TMs.
2 Using Translation Memories in NAT
2.1 Background
TM Retrieval
Given a source sentence f, we aim to retrieve a good match ẽ from the TM. For this, we search the TM for a pair of sentences (f̃, ẽ), where f̃ is similar to f. The corresponding target ẽ is then used to initiate the translation of f. We compute the similarity between f and f̃ as:

    \mathrm{sim}(f, \tilde{f}) = 1 - \frac{\mathrm{ED}(f, \tilde{f})}{\max(|f|, |\tilde{f}|)},    (1)

where ED(f, f̃) is the edit distance between f and f̃, and |f| is the length of f. The intuition is that the closer f and f̃ are, the more suitable ẽ will be. As is customary, we only use TM matches when the similarity score exceeds a predefined threshold; otherwise, we translate from scratch. We discuss the effect of the match similarity in Section 4.5.
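To make the retrieval step concrete, the sketch below implements the similarity of Equation (1) with a word-level edit distance and scans a TM stored as a plain list of sentence pairs. The function names, the list-based TM structure, and the 0.4 threshold are assumptions made for illustration; the paper does not prescribe a particular implementation or threshold value.

```python
# Illustrative sketch of fuzzy-match retrieval with the similarity of Eq. (1).
# The TM is modelled as a plain list of (source, target) token-list pairs;
# names and the 0.4 threshold are hypothetical choices, not from the paper.

def edit_distance(a, b):
    """Word-level Levenshtein distance between token lists a and b."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(f, f_tilde):
    """sim(f, f~) = 1 - ED(f, f~) / max(|f|, |f~|), as in Eq. (1)."""
    return 1.0 - edit_distance(f, f_tilde) / max(len(f), len(f_tilde))

def retrieve_tm_match(f, tm, threshold=0.4):
    """Return the target side of the closest TM entry, or None if below threshold."""
    best_score, best_target = 0.0, None
    for f_tilde, e_tilde in tm:
        score = similarity(f, f_tilde)
        if score > best_score:
            best_score, best_target = score, e_tilde
    return best_target if best_score >= threshold else None
```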
Levenshtein Transformer
LevT is an edit-based NAT model proposed by Gu et al. (2019). It performs translation by iteratively editing an initial target sequence with insertion and deletion operations until convergence. The insertion operation is composed of a placeholder insertion module and a token predictor. The placeholder classifier predicts the number of additional tokens that need to be inserted between any two consecutive tokens in its input sequence. The token predictor then generates a token for each placeholder position. The deletion operation aims to detect prediction errors made by the model. It makes a binary decision for each token, indicating whether it should be deleted or kept. During training, a noised initial target sequence e′ is first generated by randomly dropping tokens from the reference e. The insertion modules learn to reinsert the deleted tokens into e′. The deletion operation is then trained to erase erroneous predictions made during insertion.
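As an illustration of this corruption step, the minimal sketch below builds a noised initial target by independently dropping tokens from the reference; the per-token drop probability and the function name are assumptions made for the example, not the exact scheme used by LevT.

```python
import random

def noise_reference(e, drop_prob=0.5, rng=random):
    """Build a noised initial target e' by randomly dropping tokens from the reference e.

    The insertion modules are then trained to reinsert the dropped tokens.
    Independent per-token dropping with probability 0.5 is an illustrative
    assumption, not LevT's exact corruption procedure.
    """
    return [tok for tok in e if rng.random() > drop_prob]

reference = "A cat is sleeping .".split()
e_prime = noise_reference(reference)  # e.g. ['A', 'is', '.']
```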
During inference, LevT starts with an empty target sequence (e′ = []) and generates the translation by alternately performing deletion and insertion operations until convergence or until a maximum number of decoding rounds is reached. In the first iteration, the deletion is omitted, as no tokens can be deleted from the empty sequence. This iterative refinement procedure converges when the input and output of one iteration are the same, either because LevT predicts nothing to delete and nothing to insert, or because it enters a loop where the deleted tokens are reinserted in the same round. Unlike almost all other NAT models, LevT does not require any external prediction of the target length, as the number of target tokens is iteratively revised and adjusted by the placeholder prediction module. We refer to Gu et al. (2019) for more details about LevT.
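The decoding loop can be summarized by the schematic sketch below, where delete_step and insert_step stand in for the model's deletion pass and its combined placeholder-insertion and token-prediction pass. These callables, and the iteration cap, are placeholders chosen for illustration rather than the actual LevT interface.

```python
from typing import Callable, List

# An edit pass maps (source tokens, current hypothesis) to a revised hypothesis.
EditFn = Callable[[List[str], List[str]], List[str]]

def iterative_refinement(source: List[str],
                         init_target: List[str],
                         delete_step: EditFn,
                         insert_step: EditFn,
                         max_iters: int = 10) -> List[str]:
    """Alternate deletion and insertion until the hypothesis stops changing."""
    hyp = list(init_target)
    for _ in range(max_iters):
        prev = list(hyp)
        if hyp:  # deletion is skipped when starting from an empty target
            hyp = delete_step(source, hyp)
        hyp = insert_step(source, hyp)
        if hyp == prev:  # input and output of the round coincide: converged
            break
    return hyp
```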
2.2 Deficiencies of LevT Training
Even though the edit-based nature of LevT makes it readily able to translate with TMs, it has mostly been applied to standard MT, where the decoder starts with an empty sentence.[1] This is consistent with the overall training scheme, illustrated in Figure 1 (Vanilla LevT), where inputs for the placeholder insertion module are always subsequences of the reference and the deletion module only sees the outputs of the previous token insertion step.

[1] One notable exception is the attempt in Gu et al. (2019) to perform automatic post-editing through iterative revisions.
Settings     | Empty | Random Sent | Shuffle Ref
-------------|-------|-------------|------------
Init         |   -   |     1.3     |     5.0
LevT         |  45.4 |     2.1     |    40.2
LevT vs Init |   -   |    90.4     |     9.4

Table 1: BLEU scores of LevT decoding with various target initializations. Empty refers to standard LevT inference with an empty start. Random Sent uses a random sentence as the initial target. Shuffle Ref starts with a random shuffle of the reference translation. Init reports the BLEU score of the initialization, while LevT vs Init compares LevT's outputs with their starting points.
[Figure 1: A complete training step for TM-LevT, illustrated on the source f = "Un chat dort." with TM match ẽ = "A cat is eating." and reference e = "A cat is sleeping.". The TM-LevT encoder reads the concatenation f [sep] ẽ, and the initial target is e′ = ẽ. The step chains Initial Deletion (init-del), Placeholder Insertion, Token Prediction, and Final Deletion (final-del), each supervised with reference labels: e″ is obtained from e′ by applying deletions from the predicted and reference labels, e‴ from e″ by inserting placeholders from the reference labels, and e⁗ from e‴ by replacing placeholders with predicted tokens. Compared to the original LevT, which starts training from e″, TM-LevT adds the init-del step to delete unrelated tokens from a TM match.]
To illustrate the deficiency of this training scheme, we learned a vanilla LevT model using the datasets of Section 3.1 and initialized the decoder with a sentence randomly selected from the test set and totally unrelated to the source sentence. We observe (Table 1, Random Sent) that LevT's outputs are almost as bad as their starting point. This is because the deletion module fails to delete irrelevant input words, presenting the insertion modules with a fully fluent yet fully inadequate sequence that they are hard-pressed to revise. This contrasts with the Shuffle Ref scenario, where the decoder starts with a random shuffle of the reference. LevT can now make changes during the iterative refinement and generates translations (40.2 BLEU) that are close to standard decoding (45.4 BLEU). The TM-based scenario discussed below presents the same challenge for the deletion module: spotting and deleting irrelevant words. Our proposal will first focus on fixing this issue.
2.3 Improving Edits with TM-LevT
The experiment of the previous section suggests that LevT will have issues editing TM matches, as they often contain tokens that are unrelated to the source and should be removed (see Figure 1 for an example TM match ẽ containing the unrelated word "eating"). The distribution of unrelated tokens may greatly differ from that of the token prediction errors made by LevT, which are the tokens LevT is trained to delete.

We propose a variant of LevT denoted TM-LevT, which includes an extra deletion step (init-del) that applies before the insertion modules. As shown in Figure 1, init-del is trained to detect unrelated tokens in the initial e′, whereas the final deletion (final-del) focuses on prediction errors. During training, we generate examples for the insertion modules by removing from e′ the tokens that either are not in the reference or should be deleted according to the init-del operation. The resulting subsequence e″ is then used to train the insertion operation. TM-LevT does not change the number of parameters, as we use the same classifier for the init-del and final-del steps. During inference, TM-LevT behaves exactly like LevT, iteratively applying deletions and insertions to an initial candidate translation.
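To make the construction of these training examples concrete, the sketch below derives init-del labels and the pruned sequence e″ from a TM match and its reference. The exact labelling procedure is not detailed in this section; aligning e′ to the reference with a longest common subsequence, as done below, is one plausible choice, and all function names are illustrative.

```python
# Hedged sketch: derive init-del training labels and the pruned sequence e''
# from the TM match e' and the reference e, using an LCS alignment. This is
# an assumed labelling scheme for illustration, not the paper's exact recipe.

def lcs_keep_mask(e_prime, e_ref):
    """Mark which tokens of e_prime belong to a longest common subsequence with e_ref."""
    n, m = len(e_prime), len(e_ref)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if e_prime[i] == e_ref[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    keep, i, j = [False] * n, 0, 0
    while i < n and j < m:
        if e_prime[i] == e_ref[j]:
            keep[i] = True
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return keep

def init_del_example(e_prime, e_ref):
    """Return (delete_labels, e''): label 1 marks a token to delete; e'' keeps the rest."""
    keep = lcs_keep_mask(e_prime, e_ref)
    labels = [0 if k else 1 for k in keep]
    e_double_prime = [tok for tok, k in zip(e_prime, keep) if k]
    return labels, e_double_prime

e_prime = "A cat is eating .".split()       # TM match used as initial target
e_ref = "A cat is sleeping .".split()       # reference translation
labels, e_dp = init_del_example(e_prime, e_ref)
# labels == [0, 0, 0, 1, 0] ("eating" is marked for deletion, as in Figure 1)
# e_dp == ['A', 'cat', 'is', '.'], the subsequence used to train insertion
```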