Bilingual Synchronization: Restoring Translational Relationships with Editing Operations
Jitao Xu†  Josep Crego‡  François Yvon†
†Université Paris-Saclay, CNRS, LISN, 91400, Orsay, France
‡SYSTRAN, 5 rue Feydeau, 75002, Paris, France
{jitao.xu,francois.yvon}@limsi.fr, josep.crego@systrangroup.com
Abstract
Machine Translation (MT) is usually viewed as a one-shot process that generates the target language equivalent of some source text from scratch. We consider here a more general setting which assumes an initial target sequence, that must be transformed into a valid translation of the source, thereby restoring parallelism between source and target. For this bilingual synchronization task, we consider several architectures (both autoregressive and non-autoregressive) and training regimes, and experiment with multiple practical settings such as simulated interactive MT, translating with Translation Memory (TM) and TM cleaning. Our results suggest that one single generic edit-based system, once fine-tuned, can compare with, or even outperform, dedicated systems specifically trained for these tasks.
1 Introduction
Neural Machine Translation (NMT) systems have made tangible progress in recent years (Bahdanau et al., 2015; Vaswani et al., 2017), as they have started to produce usable translations in production environments. NMT is generally viewed as a one-shot process in autoregressive approaches, which generate the target translation based on the sole source-side input. Recently, Non-autoregressive Machine Translation (NAT) models have been proposed to perform iterative refinement decoding (Lee et al., 2018; Ghazvininejad et al., 2019; Gu et al., 2019), where translations are generated through an iterative revision process, starting with a possibly empty initial hypothesis.
This paper focuses on the revision part of the machine translation (MT) process and considers bilingual synchronization (Bi-sync), which we define as follows: given a source sentence f and a target sentence ẽ, which may or may not be mutual translations, the task is to compute a revised version e of ẽ, such that e is an actual translation of f. This is necessary when the source side of an existing translation is edited, requiring an update of the target to keep both sides synchronized. Bi-sync subsumes standard MT, where the synchronization starts with an empty target (ẽ = []). Other interesting cases occur when parts of the initial target can be reused, so that the synchronization only requires a few changes.
Bi-sync encompasses several tasks: synchronization is needed in interactive MT (IMT, Knowles and Koehn, 2016) and bilingual editing (Bronner et al., 2012), with ẽ the translation of a previous version of f; in MT with lexical constraints (Hokamp and Liu, 2017), where ẽ contains target-side constraints (Susanto et al., 2020; Xu and Carpuat, 2021); in Translation Memory (TM) based approaches (Bulte and Tezcan, 2019), where ẽ is a TM match for a similar example; and in automatic post-editing (APE) (do Carmo et al., 2021), where ẽ is an MT output.
We consider here several implementations of sequence-to-sequence models dedicated to these situations, contrasting an autoregressive model with a non-autoregressive approach. The former is similar to Bulte and Tezcan (2019), where the source sentence and the initial translation are concatenated as one input sequence; the latter uses the Levenshtein Transformer (LevT) of Gu et al. (2019). We also study various ways to generate appropriate training samples (f, ẽ, e). Our experiments consider several tasks, including TM cleaning, which attempts to fix and synchronize noisy segments in a parallel corpus. This setting is more difficult than Bi-sync, as many initial translations are already correct and need to be left unchanged. Our results suggest that one single AR system, once fine-tuned, can favorably compare with dedicated systems for each of these tasks. To recap, our main contributions are: (a) the generalization of several tasks subsumed by a generic synchronization objective, allowing us to develop a unified perspective about otherwise unrelated subdomains of MT; (b) the design of a training procedure for a generic edit-based model; (c) an empirical validation on five settings and domains.
Figure 1: Methods for generating synthetic initial translations ẽ for each edit type. Purple rectangle boxes refer to separate models used to generate the desired operations. Differences in the artificial initial translations (in blue boxes) are marked in bold. Initial translations ẽ_ins for insertion are generated by randomly removing segments from the reference sentence e. For ẽ_sub, e is first back-translated into an intermediate sentence f̄ using top-5 sampling, then translated back to ẽ_sub with LCD. The first method to generate ẽ_del randomly inserts [gap] tokens into e and decodes with a GAP insertion model (Xiao et al., 2022); ẽ_del1 is obtained by replacing each [gap] with the predicted segment. The second method automatically edits e with a model trained on WikiAtomicEdits data.
2 Methods
2.1 Generating Editing Data
We consider a general scenario where, given a pair of sentences f and ẽ, assumed to be related but not necessarily parallel, we aim to generate a target sentence e that is parallel to f. We would also like ẽ and e to be close, as ẽ is often a valid translation of a sentence f̃ that is close to f. Training such models requires triplets (f, ẽ, e). While large amounts of parallel bilingual data are available for many language pairs, they are hardly ever associated with related translations ẽ (except for APE). We therefore study ways to simulate synthetic ẽ from e, while preserving large portions of e in ẽ. Since string edits can be decomposed into a sequence of three basic operations (insertions, substitutions and deletions), we design our artificial samples so that the edits from ẽ to e only involve one type of operation (Figure 1).
Insertions
We mainly follow Xiao et al. (2022) to generate initial translations ẽ_ins for insertion by randomly deleting segments from e. For each e, we first randomly sample an integer k ∈ [1,5], then randomly remove k non-overlapping segments from e. The length of each removed segment is also randomly sampled, with a maximum of 5 tokens. We also impose that the overall ratio of removed segments does not exceed 0.5 of e. Unlike Xiao et al. (2022), ẽ_ins does not include any placeholders marking the positions of the removed segments. This makes ẽ_ins a more realistic starting point, as the insertion positions are rarely known in practical settings. Our preliminary experiments also show that identifying insertion positions makes the infilling task easier than when they are unknown.
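To make this sampling scheme concrete, here is a minimal Python sketch over whitespace-tokenized sentences; parameter names are ours, and the paper's exact implementation may differ in details (e.g., it may resample when no segment fits the removal budget).

import random

def make_insertion_source(e_tokens, max_segs=5, max_seg_len=5, max_ratio=0.5):
    """Simulate an initial translation ẽ_ins by removing k ∈ [1,5]
    non-overlapping segments (≤ 5 tokens each, ≤ half of e overall)."""
    n = len(e_tokens)
    budget = int(max_ratio * n)            # at most half of e is removed
    removed = [False] * n
    dropped = 0
    for _ in range(random.randint(1, max_segs)):
        seg_len = random.randint(1, min(max_seg_len, max(1, n - 1)))
        if dropped + seg_len > budget:     # may remove nothing on very short e
            break
        start = random.randrange(n - seg_len + 1)
        if any(removed[start:start + seg_len]):   # keep segments disjoint
            continue
        removed[start:start + seg_len] = [True] * seg_len
        dropped += seg_len
    # note: no placeholder marks where a segment was removed
    return [t for i, t in enumerate(e_tokens) if not removed[i]]

print(" ".join(make_insertion_source("Cela n' arrivera pas .".split())))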
Substitutions
To simulate substitutions, we apply round-trip translation with lexically constrained decoding (LCD, Post and Vilar, 2018) to generate initial translations ẽ_sub. Round-trip translation has already been used for the APE task by Junczys-Dowmunt and Grundkiewicz (2016). It requires two standard NMT models separately trained on parallel data, one for each direction. For each training example (f, e), we first (a) translate e into an intermediate source sentence f̄ using top-5 sampling (Edunov et al., 2018);¹ then (b) generate an abbreviated version ẽ′_ins using the method described above for insertions. We then translate f̄ using LCD, with ẽ′_ins as constraints, to obtain ẽ_sub. In this way, we ensure that at least half of e remains unchanged in ẽ_sub, while the other parts have been substituted. To increase diversity, ẽ′_ins (used to create ẽ_sub) is sampled with a different random seed than ẽ_ins (used for the insertion task).
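The round-trip pipeline can be sketched as follows, reusing make_insertion_source from the previous block. The two translation functions are hypothetical stubs standing in for the two trained NMT systems (LCD itself is implemented, e.g., in the Sockeye toolkit of Post and Vilar, 2018); only the data flow is meant to mirror the description above.

def backtranslate_top5(e):
    """Hypothetical stub for an e→f NMT model decoded with top-5
    sampling (Edunov et al., 2018); see footnote 1 on why sampling is
    preferred over beam search here."""
    return f"<intermediate source f̄ sampled from: {e}>"

def translate_with_lcd(f_bar, constraints):
    """Hypothetical stub for an f→e NMT model with lexically constrained
    decoding (Post and Vilar, 2018): constraint segments are guaranteed
    to appear verbatim in the output."""
    return f"<ẽ_sub containing: {' … '.join(constraints)}>"

def make_substitution_source(e):
    f_bar = backtranslate_top5(e)               # step (a)
    kept = make_insertion_source(e.split())     # step (b), separate seed
    # a real system would pass each contiguous kept segment as its own
    # LCD constraint; we join the kept tokens for brevity
    return translate_with_lcd(f_bar, [" ".join(kept)])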
Deletions
Simulating deletions requires the initial translation ẽ_del to be an extension of e. We propose two strategies to generate ẽ_del. The first uses a GAP insertion model as in Xiao et al. (2022), in which word segments are randomly replaced with a placeholder [gap] to generate ẽ_gap. The task is then to predict the missing segments based on the concatenation of f and ẽ_gap as input. This differs from our own insertion task, as (a) insertion positions are identified by a [gap] symbol in ẽ_gap and (b) generation only computes the sequence of missing segments e_seg, rather than a complete sentence.
We use GAP to generate extra segments for a pair of parallel sentences as follows. We randomly insert k ∈ [1,5] [gap] tokens into e, concatenate it with f and use GAP to predict extra segments, yielding the synthetic target sentence ẽ_del1. This method always extends parallel sentences with additional segments on the target side. However, these segments are arbitrary and may not carry any valid semantic information, nor be syntactically correct. We thus consider a second strategy, based on actual edit operations collected in the WikiAtomicEdits dataset² (Faruqui et al., 2018), which contains pairs of an original segment x and the resulting segment x′, with exactly one insertion or deletion operation per example, collected from the Wikipedia edit history. This notably ensures that both versions of each utterance are syntactically correct. We treat the deletion data of WikiAtomicEdits as "reversed" insertions, and use both subsets to train a seq-to-seq wiki model (x_short → x_long) that generates longer sentences from shorter ones. The wiki model is then used to expand e into ẽ_del2. Compared to ẽ_del1, ẽ_del2 is syntactically more correct; however, it is also by design very close (one edit away) to e.
¹ Early experiments showed that using sampling instead of beam search increases the diversity of the generated ẽ_sub.
² https://github.com/google-research-datasets/wiki-atomic-edits
As both simulation methods have merits and flaws, we randomly select examples from ẽ_del1 and ẽ_del2 to build the final set of synthetic initial translations ẽ_del for the deletion operation.
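The first, GAP-based strategy can be sketched as follows, with the infilling model stubbed out; the final function mixes both strategies, where the 50/50 split is our assumption since the text only says examples are selected at random.

import random

def gap_fill(f, e_gapped):
    """Hypothetical stub for the GAP model of Xiao et al. (2022): given
    the concatenation of f and a target containing [gap] symbols, it
    predicts one filler segment per [gap]."""
    return ["<extra segment>"] * e_gapped.count("[gap]")

def make_deletion_source_gap(f, e_tokens):
    out = list(e_tokens)
    for _ in range(random.randint(1, 5)):        # insert k ∈ [1,5] gaps
        out.insert(random.randint(0, len(out)), "[gap]")
    fillers = iter(gap_fill(f, out))
    # ẽ_del1: each [gap] is replaced by its predicted segment, so the
    # result strictly extends e with extra target-side material
    return [next(fillers) if t == "[gap]" else t for t in out]

def make_deletion_source(f, e_tokens, wiki_expand):
    # mix both strategies; wiki_expand stands in for the wiki model
    # that expands e into ẽ_del2
    if random.random() < 0.5:
        return make_deletion_source_gap(f, e_tokens)
    return wiki_expand(e_tokens)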
Copy and Translate Operations
To handle parallel sentences that do not require any changes, we add a fourth copy operation, where the initial translation ẽ_cp is equal to the target sentence (ẽ_cp = e). Hence, the data used to learn edit operations is built from triplets (f, ẽ, e), where ẽ is selected uniformly at random from ẽ_ins, ẽ_sub, ẽ_del and ẽ_cp. Finally, to maintain the capacity to perform standard MT from scratch, we also consider samples where ẽ is empty. The implementation of standard MT varies slightly across approaches, as we explain below.
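Putting the pieces together, one training triplet can be assembled as in the sketch below, which reuses the helpers from the previous blocks. The uniform choice over the four edit types follows the text, while the proportion of empty-ẽ samples is a free parameter here, since the paper only states that such samples are included.

import random

def make_triplet(f, e, wiki_expand, p_empty=0.1):
    """Assemble one (f, ẽ, e) training triplet plus the set of required
    edit operations; p_empty is our own knob for how often ẽ is left
    empty to preserve from-scratch MT."""
    if random.random() < p_empty:
        return f, "", e, set()                        # standard MT sample
    op = random.choice(["ins", "sub", "del", "cp"])   # uniform over edits
    if op == "ins":
        e_tilde = " ".join(make_insertion_source(e.split()))
    elif op == "sub":
        e_tilde = make_substitution_source(e)
    elif op == "del":
        e_tilde = " ".join(make_deletion_source(f, e.split(), wiki_expand))
    else:                                             # copy: ẽ_cp = e
        e_tilde = e
    return f, e_tilde, e, set() if op == "cp" else {op}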
2.2 Model Architectures
We implement Bi-sync with Transformer-based (Vaswani et al., 2017) autoregressive and non-autoregressive models. The former (Edit-MT) is a regular Transformer with a combined input made of the concatenation of f and ẽ; the latter (Edit-LevT) is the LevT of Gu et al. (2019).
Edit-MT
In this model, ẽ is simply concatenated to f, with a special token separating the two sentences. This technique has been used, e.g., by Dabre et al. (2017) for multi-source MT and by Bulte and Tezcan (2019) for translating with a similar example. The input side of the editing training data is thus f [sep] ẽ, as shown in Figure 1 (top).
On the target side, we add a categorical prefix indicating the type of edit(s) associated with a given training sample, as is commonly done for multi-domain or multilingual MT. For each basic edit (insertion, substitution and deletion), we use a binary tag to indicate whether the operation is required. For instance, an ẽ_ins needing insertions would have the tags [ins] [!sub] [!del] prepended to e. Copy corresponds to all three tags set to negative: [!ins] [!sub] [!del]. The tagging scheme provides various ways to perform edit-based MT: (a) we can perform inference without knowing the required edit type of ẽ, by generating the tags first and then the translation; (b) when the edits are known, we can generate translations with the desired edits by using the corresponding tags as a forced prefix; (c) inference can also output only the edit tags, thereby predicting the relation between f and ẽ. The ability to perform standard MT is preserved by training on a balanced mixture of editing data and parallel data.
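The resulting Edit-MT example format can be summarized in a short sketch: the source concatenates f and ẽ around a separator token, and the target is prefixed with one binary tag per edit type. Tag and separator spellings follow the paper; how an empty ẽ is rendered for standard MT is our assumption.

def format_edit_mt(f, e_tilde, e, edits):
    """Format one Edit-MT training pair. edits ⊆ {"ins", "sub", "del"}
    lists the operations needed to turn ẽ into e; the empty set encodes
    the copy case."""
    # empty ẽ falls back to a plain MT input (our assumption for this case)
    src = f"{f} [sep] {e_tilde}" if e_tilde else f
    tags = " ".join(f"[{op}]" if op in edits else f"[!{op}]"
                    for op in ("ins", "sub", "del"))
    return src, f"{tags} {e}"

src, tgt = format_edit_mt("That will not happen .",
                          "Cela pas .",
                          "Cela n' arrivera pas .",
                          {"ins"})
# src == "That will not happen . [sep] Cela pas ."
# tgt == "[ins] [!sub] [!del] Cela n' arrivera pas ."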