Bilingual Synchronization: Restoring Translational Relationships with Editing Operations
Jitao Xu†  Josep Crego‡  François Yvon†
†Université Paris-Saclay, CNRS, LISN, 91400, Orsay, France
‡SYSTRAN, 5 rue Feydeau, 75002, Paris, France
{jitao.xu,francois.yvon}@limsi.fr, josep.crego@systrangroup.com
Abstract
Machine Translation (MT) is usually viewed as a one-shot process that generates the target language equivalent of some source text from scratch. We consider here a more general setting which assumes an initial target sequence, that must be transformed into a valid translation of the source, thereby restoring parallelism between source and target. For this bilingual synchronization task, we consider several architectures (both autoregressive and non-autoregressive) and training regimes, and experiment with multiple practical settings such as simulated interactive MT, translating with Translation Memory (TM) and TM cleaning. Our results suggest that one single generic edit-based system, once fine-tuned, can compare with, or even outperform, dedicated systems specifically trained for these tasks.
1 Introduction
Neural Machine Translation (NMT) systems have made tangible progress in recent years (Bahdanau et al., 2015; Vaswani et al., 2017), as they have started to produce usable translations in production environments. NMT is generally viewed as a one-shot process in autoregressive approaches, which generate the target translation based on the sole source-side input. Recently, Non-autoregressive Machine Translation (NAT) models have been proposed to perform iterative refinement decoding (Lee et al., 2018; Ghazvininejad et al., 2019; Gu et al., 2019), where translations are generated through an iterative revision process, starting with a possibly empty initial hypothesis.
This paper focuses on the revision part of the machine translation (MT) process and considers bilingual synchronization (Bi-sync), which we define as follows: given a source sentence f and a target sentence ẽ, which may or may not be mutual translations, the task is to compute a revised version e of ẽ, such that e is an actual translation of f. This is necessary when the source side of an existing translation is edited, requiring an update of the target to keep both sides synchronized. Bi-sync subsumes standard MT, where the synchronization starts with an empty target (ẽ = []). Other interesting cases occur when parts of the initial target can be reused, so that the synchronization only requires a few changes.
Bi-sync encompasses several tasks: synchronization is needed in interactive MT (IMT, Knowles and Koehn, 2016) and bilingual editing (Bronner et al., 2012), with ẽ the translation of a previous version of f; in MT with lexical constraints (Hokamp and Liu, 2017), where ẽ contains target-side constraints (Susanto et al., 2020; Xu and Carpuat, 2021); in Translation Memory (TM) based approaches (Bulte and Tezcan, 2019), where ẽ is a TM match for a similar example; and in automatic post-editing (APE) (do Carmo et al., 2021), where ẽ is an MT output.
We consider here several implementations of sequence-to-sequence models dedicated to these situations, contrasting an autoregressive model with a non-autoregressive approach. The former is similar to Bulte and Tezcan (2019), where the source sentence and the initial translation are concatenated as one input sequence; the latter uses the Levenshtein Transformer (LevT) of Gu et al. (2019). We also study various ways to generate appropriate training samples (f, ẽ, e). Our experiments consider several tasks, including TM cleaning, which attempts to fix and synchronize noisy segments in a parallel corpus. This setting is more difficult than Bi-sync, as many initial translations are already correct and need to be left unchanged. Our results suggest that one single AR system, once fine-tuned, can favorably compare with dedicated systems for each of these tasks. To recap, our main contributions are: (a) the generalization of several tasks subsumed by a generic synchronization objective, allowing us to develop a unified perspective about otherwise unrelated subdomains of MT; (b) the design of a training procedure for a generic edit-based model; (c) an empirical validation on five settings and domains.
Figure 1: Methods for generating synthetic initial translations ẽ for each edit type. Purple rectangle boxes refer to separate models used to generate the desired operations. Differences in the artificial initial translations (in blue boxes) are marked in bold. Initial translations ẽ_ins for insertion are generated by randomly removing segments from the reference sentence e. For ẽ_sub, e is first back-translated into an intermediate sentence f̄ using top-5 sampling, then translated back to ẽ_sub with LCD. The first method to generate ẽ_del randomly inserts [gap] tokens into e and decodes with a GAP insertion model (Xiao et al., 2022); ẽ_del1 is obtained by replacing each [gap] with the predicted segment. The second method automatically edits e with a model trained on WikiAtomicEdits data.
2 Methods
2.1 Generating Editing Data
We consider a general scenario where, given a pair of sentences f and ẽ, assumed to be related but not necessarily parallel, we aim to generate a target sentence e that is parallel to f. We would also like ẽ and e to be close, as ẽ is often a valid translation of a sentence f̃ that is close to f. Training such models requires triplets (f, ẽ, e). While large amounts of parallel bilingual data are available for many language pairs, they are hardly ever associated with related translations ẽ (except for APE). We therefore study ways to simulate synthetic ẽ from e, while preserving large portions of e in ẽ. Since string edits can be decomposed into a sequence of three basic operations (insertions, substitutions and deletions), we design our artificial samples so that the edits from ẽ to e only involve one type of operation (Figure 1).
Insertions
We mainly follow Xiao et al. (2022) to generate initial translations ẽ_ins for insertion by randomly deleting segments from e. For each e, we first randomly sample an integer k ∈ [1,5], then randomly remove k non-overlapping segments from e. The length of each removed segment is also randomly sampled, with a maximum of 5 tokens. We also impose that the overall ratio of removed segments does not exceed 0.5 of e. Unlike Xiao et al. (2022), ẽ_ins does not include any placeholders marking the positions of the removed segments. This makes ẽ_ins a more realistic starting point, as the insertion positions are rarely known in practical settings. Our preliminary experiments also show that identifying insertion positions makes the infilling task easier than when they are unknown.
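To make this sampling scheme concrete, here is a minimal Python sketch over whitespace-tokenized sentences; parameter names are ours, and the paper's exact implementation may differ in details (e.g., it may resample when no segment fits the removal budget).

import random

def make_insertion_source(e_tokens, max_segs=5, max_seg_len=5, max_ratio=0.5):
    """Simulate an initial translation ẽ_ins by removing k ∈ [1,5]
    non-overlapping segments (≤ 5 tokens each, ≤ half of e overall)."""
    n = len(e_tokens)
    budget = int(max_ratio * n)            # at most half of e is removed
    removed = [False] * n
    dropped = 0
    for _ in range(random.randint(1, max_segs)):
        seg_len = random.randint(1, min(max_seg_len, max(1, n - 1)))
        if dropped + seg_len > budget:     # may remove nothing on very short e
            break
        start = random.randrange(n - seg_len + 1)
        if any(removed[start:start + seg_len]):   # keep segments disjoint
            continue
        removed[start:start + seg_len] = [True] * seg_len
        dropped += seg_len
    # note: no placeholder marks where a segment was removed
    return [t for i, t in enumerate(e_tokens) if not removed[i]]

print(" ".join(make_insertion_source("Cela n' arrivera pas .".split())))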
Substitutions
To simulate substitutions, we apply round-trip translation with lexically constrained decoding (LCD, Post and Vilar, 2018) to generate initial translations ẽ_sub. Round-trip translation has already been used for the APE task by Junczys-Dowmunt and Grundkiewicz (2016). It requires two standard NMT models separately trained on parallel data, one for each direction. For each training example (f, e), we first (a) translate e into an intermediate source sentence f̄ using top-5 sampling (Edunov et al., 2018);¹ then (b) generate an abbreviated version ẽ′_ins using the method described above for insertions. We then translate f̄ using LCD, with ẽ′_ins as constraints, to obtain ẽ_sub. In this way, we ensure that at least half of e remains unchanged in ẽ_sub, while the other parts have been substituted. To increase diversity, ẽ′_ins (used to create ẽ_sub) is sampled with a different random seed than ẽ_ins (used for the insertion task).
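The round-trip pipeline can be sketched as follows, reusing make_insertion_source from the previous block. The two translation functions are hypothetical stubs standing in for the two trained NMT systems (LCD itself is implemented, e.g., in the Sockeye toolkit of Post and Vilar, 2018); only the data flow is meant to mirror the description above.

def backtranslate_top5(e):
    """Hypothetical stub for an e→f NMT model decoded with top-5
    sampling (Edunov et al., 2018); see footnote 1 on why sampling is
    preferred over beam search here."""
    return f"<intermediate source f̄ sampled from: {e}>"

def translate_with_lcd(f_bar, constraints):
    """Hypothetical stub for an f→e NMT model with lexically constrained
    decoding (Post and Vilar, 2018): constraint segments are guaranteed
    to appear verbatim in the output."""
    return f"<ẽ_sub containing: {' … '.join(constraints)}>"

def make_substitution_source(e):
    f_bar = backtranslate_top5(e)               # step (a)
    kept = make_insertion_source(e.split())     # step (b), separate seed
    # a real system would pass each contiguous kept segment as its own
    # LCD constraint; we join the kept tokens for brevity
    return translate_with_lcd(f_bar, [" ".join(kept)])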
Deletions
Simulating deletions requires the initial translation ẽ_del to be an extension of e. We propose two strategies to generate ẽ_del. The first uses a GAP insertion model as in Xiao et al. (2022), in which word segments are randomly replaced with a placeholder [gap] to generate ẽ_gap. The task is then to predict the missing segments based on the concatenation of f and ẽ_gap as input. This differs from our own insertion task, as (a) insertion positions are identified by a [gap] symbol in ẽ_gap and (b) generation only computes the sequence of missing segments e_seg, rather than a complete sentence.
We use GAP to generate extra segments for a pair of parallel sentences as follows. We randomly insert k ∈ [1,5] [gap] tokens into e, concatenate it with f and use GAP to predict extra segments, yielding the synthetic target sentence ẽ_del1. This method always extends parallel sentences with additional segments on the target side. However, these segments are arbitrary and may not carry any valid semantic information, nor be syntactically correct. We thus consider a second strategy, based on actual edit operations collected in the WikiAtomicEdits dataset² (Faruqui et al., 2018), which contains pairs of an original segment x and the resulting segment x′, with exactly one insertion or deletion operation per example, collected from the Wikipedia edit history. This notably ensures that both versions of each utterance are syntactically correct. We treat the deletion data of WikiAtomicEdits as "reversed" insertions, and use both subsets to train a seq-to-seq wiki model (x_short → x_long) that generates longer sentences from shorter ones. The wiki model is then used to expand e into ẽ_del2. Compared to ẽ_del1, ẽ_del2 is syntactically more correct; however, it is also by design very close (one edit away) to e.
¹ Early experiments showed that using sampling instead of beam search increases the diversity of the generated ẽ_sub.
² https://github.com/google-research-datasets/wiki-atomic-edits
As both simulation methods have merits and flaws, we randomly select examples from ẽ_del1 and ẽ_del2 to build the final set of synthetic initial translations ẽ_del for the deletion operation.
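The first, GAP-based strategy can be sketched as follows, with the infilling model stubbed out; the final function mixes both strategies, where the 50/50 split is our assumption since the text only says examples are selected at random.

import random

def gap_fill(f, e_gapped):
    """Hypothetical stub for the GAP model of Xiao et al. (2022): given
    the concatenation of f and a target containing [gap] symbols, it
    predicts one filler segment per [gap]."""
    return ["<extra segment>"] * e_gapped.count("[gap]")

def make_deletion_source_gap(f, e_tokens):
    out = list(e_tokens)
    for _ in range(random.randint(1, 5)):        # insert k ∈ [1,5] gaps
        out.insert(random.randint(0, len(out)), "[gap]")
    fillers = iter(gap_fill(f, out))
    # ẽ_del1: each [gap] is replaced by its predicted segment, so the
    # result strictly extends e with extra target-side material
    return [next(fillers) if t == "[gap]" else t for t in out]

def make_deletion_source(f, e_tokens, wiki_expand):
    # mix both strategies; wiki_expand stands in for the wiki model
    # that expands e into ẽ_del2
    if random.random() < 0.5:
        return make_deletion_source_gap(f, e_tokens)
    return wiki_expand(e_tokens)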
Copy and Translate Operations
To handle parallel sentences that do not require any changes, we add a fourth copy operation, where the initial translation ẽ_cp is equal to the target sentence (ẽ_cp = e). Hence, the data used to learn edit operations is built from triplets (f, ẽ, e), where ẽ is selected uniformly at random from ẽ_ins, ẽ_sub, ẽ_del and ẽ_cp. Finally, to maintain the capacity to perform standard MT from scratch, we also consider samples where ẽ is empty. The implementation of standard MT varies slightly across approaches, as we explain below.
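Putting the pieces together, one training triplet can be assembled as in the sketch below, which reuses the helpers from the previous blocks. The uniform choice over the four edit types follows the text, while the proportion of empty-ẽ samples is a free parameter here, since the paper only states that such samples are included.

import random

def make_triplet(f, e, wiki_expand, p_empty=0.1):
    """Assemble one (f, ẽ, e) training triplet plus the set of required
    edit operations; p_empty is our own knob for how often ẽ is left
    empty to preserve from-scratch MT."""
    if random.random() < p_empty:
        return f, "", e, set()                        # standard MT sample
    op = random.choice(["ins", "sub", "del", "cp"])   # uniform over edits
    if op == "ins":
        e_tilde = " ".join(make_insertion_source(e.split()))
    elif op == "sub":
        e_tilde = make_substitution_source(e)
    elif op == "del":
        e_tilde = " ".join(make_deletion_source(f, e.split(), wiki_expand))
    else:                                             # copy: ẽ_cp = e
        e_tilde = e
    return f, e_tilde, e, set() if op == "cp" else {op}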
2.2 Model Architectures
We implement Bi-sync with Transformer-based (Vaswani et al., 2017) autoregressive and non-autoregressive models. The former (Edit-MT) is a regular Transformer with a combined input made of the concatenation of f and ẽ; the latter (Edit-LevT) is the LevT of Gu et al. (2019).
Edit-MT
In this model, ẽ is simply concatenated to f, with a special token separating the two sentences. This technique has been used, e.g., by Dabre et al. (2017) for multi-source MT and by Bulte and Tezcan (2019) for translating with a similar example. The input side of the editing training data is thus f [sep] ẽ, as shown in Figure 1 (top).
On the target side, we add a categorical prefix indicating the type of edit(s) associated with a given training sample, as is commonly done for multi-domain or multilingual MT. For each basic edit (insertion, substitution and deletion), we use a binary tag to indicate whether the operation is required. For instance, an ẽ_ins needing insertions would have the tags [ins] [!sub] [!del] prepended to e. Copy corresponds to all three tags set to negative: [!ins] [!sub] [!del]. The tagging scheme provides various ways to perform edit-based MT: (a) we can perform inference without knowing the required edit type of ẽ, by generating the tags first and then the translation; (b) when the edits are known, we can generate translations with the desired edits by using the corresponding tags as a forced prefix; (c) inference can also output only the edit tags, thereby predicting the relation between f and ẽ. The ability to perform standard MT is preserved by training on a balanced mixture of editing data and parallel data.
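The resulting Edit-MT example format can be summarized in a short sketch: the source concatenates f and ẽ around a separator token, and the target is prefixed with one binary tag per edit type. Tag and separator spellings follow the paper; how an empty ẽ is rendered for standard MT is our assumption.

def format_edit_mt(f, e_tilde, e, edits):
    """Format one Edit-MT training pair. edits ⊆ {"ins", "sub", "del"}
    lists the operations needed to turn ẽ into e; the empty set encodes
    the copy case."""
    # empty ẽ falls back to a plain MT input (our assumption for this case)
    src = f"{f} [sep] {e_tilde}" if e_tilde else f
    tags = " ".join(f"[{op}]" if op in edits else f"[!{op}]"
                    for op in ("ins", "sub", "del"))
    return src, f"{tags} {e}"

src, tgt = format_edit_mt("That will not happen .",
                          "Cela pas .",
                          "Cela n' arrivera pas .",
                          {"ins"})
# src == "That will not happen . [sep] Cela pas ."
# tgt == "[ins] [!sub] [!del] Cela n' arrivera pas ."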