Focused Concatenation for Context-Aware Neural Machine Translation
Lorenzo Lupo¹, Marco Dinarelli¹, Laurent Besacier²
¹Université Grenoble Alpes, France
²Naver Labs Europe, France
lorenzo.lupo@univ-grenoble-alpes.fr
marco.dinarelli@univ-grenoble-alpes.fr
laurent.besacier@naverlabs.com
Abstract
A straightforward approach to context-aware neural machine translation consists in feeding the standard encoder-decoder architecture with a window of consecutive sentences, formed by the current sentence and a number of sentences from its context concatenated to it. In this work, we propose an improved concatenation approach that encourages the model to focus on the translation of the current sentence, discounting the loss generated by target context. We also propose an additional improvement that strengthens the notion of sentence boundaries and of relative sentence distance, facilitating model compliance with the context-discounted objective. We evaluate our approach with both average translation quality metrics and contrastive test sets for the translation of inter-sentential discourse phenomena, proving its superiority to the vanilla concatenation approach and other sophisticated context-aware systems.
1 Introduction
While current neural machine translation (NMT) systems have reached close-to-human quality in the translation of decontextualized sentences (Wu et al., 2016), they still have a wide margin of improvement ahead when it comes to translating full documents (Läubli et al., 2018). Many works have tried to reduce this margin, proposing various approaches to context-aware NMT (CANMT).¹ A common taxonomy (Kim et al., 2019; Li et al., 2020) divides them into two broad categories: multi-encoding approaches and concatenation (single-encoding) approaches. Despite their simplicity, concatenation approaches have been shown to achieve performance competitive with or superior to more sophisticated multi-encoding systems (Lopes et al., 2020; Ma et al., 2021).

¹ Unless otherwise specified, we refer to context as the sentences that precede or follow a current sentence to be translated, within the same document.

Figure 1: Example of the proposed approach applied over a window of 2 sentences, with context discount CD and segment-shifted positions by a factor of 10.
Nonetheless, it has been shown that Transformer-based NMT systems (Vaswani et al., 2017) struggle to learn locality properties (Hardmeier, 2012; Rizzi, 2013) of both the language itself and the source-target alignment when the input sequence grows in length, as in the case of concatenation (Bao et al., 2021). Unsurprisingly, the presence of context makes learning harder for concatenation models by distracting attention. Moreover, we know from recent literature that NMT systems require context only for a sparse set of inter-sentential discourse phenomena (Voita et al., 2019; Lupo et al., 2022). Therefore, it is desirable to make concatenation models more focused on local linguistic phenomena, belonging to the current sentence, while also processing its context to enable inter-sentential contextualization whenever it is needed. We propose an improved concatenation approach to CANMT that is more focused on the translation of the current sentence by means of two simple, parameter-free solutions:
- Context-discounting: a simple modification of the NMT loss that improves the context-aware translation of a sentence by making the model less distracted by its concatenated context;
- Segment-shifted positions: a simple, parameter-free modification of position embeddings that facilitates the achievement of the context-discounted objective by supporting the learning of locality properties in the document translation task.
We support our solutions with extensive experiments, analysis and benchmarking.
2 Background
2.1 Multi-encoding approaches
Multi-encoding models couple a self-standing sentence-level NMT system, with parameters $\theta_S$, with additional parameters $\theta_C$ that encode and integrate the context of the current sentence, either on the source side, the target side, or both. The full context-aware architecture has parameters $\Theta = [\theta_S; \theta_C]$. Multi-encoding models differ from each other in the way they encode the context or integrate its representations with those of the current sentence. For instance, the representations coming from the context encoder can be integrated with the encoding of the current sentence outside the decoder (Maruf et al., 2018; Voita et al., 2018; Zhang et al., 2018; Miculicich et al., 2018; Maruf et al., 2019; Zheng et al., 2020) or inside the decoder (Tu et al., 2018; Kuang et al., 2018; Bawden et al., 2018; Voita et al., 2019; Tan et al., 2019), by making it attend to the context representations directly, using its internal representation of the decoded history as query.
2.2 Single-encoder approaches
The concatenation approaches are the simplest in terms of architecture, as they mainly consist in concatenating each (current) source sentence with its context before feeding it to the standard encoder-decoder architecture (Tiedemann and Scherrer, 2017; Junczys-Dowmunt, 2019; Agrawal et al., 2018; Ma et al., 2020), without the addition of extra learnable parameters. The decoding can then be limited to the current sentence, although decoding the full target concatenation is more effective thanks to the availability of target context. A typical strategy to train a concatenation approach and generate translations is by sliding windows (Tiedemann and Scherrer, 2017). An sKtoK model decodes the translation $y_K^j$ of a source window $x_K^j$, formed by $K$ consecutive sentences belonging to the same document: the current ($j$th) sentence and $K-1$ sentences concatenated as source-side context. Besides the end-of-sequence token <E>, another special token <S> is introduced to mark sentence boundaries in the concatenation:

$x_K^j = x^{j-K+1}\ \texttt{<S>}\ x^{j-K+2}\ \texttt{<S>}\ \dots\ \texttt{<S>}\ x^{j-1}\ \texttt{<S>}\ x^{j}\ \texttt{<E>}$
$y_K^j = y^{j-K+1}\ \texttt{<S>}\ y^{j-K+2}\ \texttt{<S>}\ \dots\ \texttt{<S>}\ y^{j-1}\ \texttt{<S>}\ y^{j}\ \texttt{<E>}$
Both past and future contexts can be concatenated to the current pair $(x^j, y^j)$, although in this work we consider only the past context, for simplicity. At training time, the loss is calculated over the whole output $y_K^j$, but only the translation $y^j$ of the current sentence is kept at inference time, while the translation of the context is discarded. Then, the window is slid forward by one position to repeat the process for the $(j+1)$th sentence and its context.
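As a concrete illustration of the sliding-window construction just described, the following minimal Python sketch builds sKtoK source and target windows from a document. The function name, the whitespace-joined sentence strings, and the literal <S>/<E> markers are our own illustrative assumptions, not the authors' implementation.

def build_windows(doc_src, doc_tgt, K):
    # doc_src, doc_tgt: aligned lists of sentence strings from one document.
    # Returns one (source window, target window) pair per current sentence j,
    # each containing up to K-1 preceding sentences as context.
    windows = []
    for j in range(len(doc_src)):
        start = max(0, j - K + 1)
        src = " <S> ".join(doc_src[start:j + 1]) + " <E>"
        tgt = " <S> ".join(doc_tgt[start:j + 1]) + " <E>"
        windows.append((src, tgt))
    return windows

# With K=2, each current sentence is paired with one preceding sentence:
# build_windows(["s1", "s2"], ["t1", "t2"], K=2)[1]
# -> ("s1 <S> s2 <E>", "t1 <S> t2 <E>")

At inference time, only the part of the decoded window that follows the last <S> is kept as the translation of the current sentence.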
Concatenation approaches are trained by optimizing the same objective function as standard NMT over a window of sentences:

$\mathcal{L}(x_K^j, y_K^j) = \sum_{t=1}^{|y_K^j|} \log P(y_{K,t}^j \mid y_{K,<t}^j, x_K^j)$,   (1)

so that the likelihood of the current target sentence is conditioned on source and target context.
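In code, Equation 1 amounts to a sum of token-level log-probabilities over the whole concatenated target, context tokens included. A minimal PyTorch sketch follows; the tensor shapes and variable names are our assumptions.

import torch
import torch.nn.functional as F

def window_log_likelihood(logits, target_ids):
    # logits: [T, V] decoder outputs for the concatenated target window,
    # target_ids: [T] gold token ids of y_K^j (teacher forcing).
    log_probs = F.log_softmax(logits, dim=-1)                            # [T, V]
    token_ll = log_probs.gather(1, target_ids.unsqueeze(1)).squeeze(1)   # [T]
    return token_ll.sum()  # log P summed over all positions, as in Eq. 1
    # (in practice, training minimizes the negative of this quantity)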
2.3 Closing the gap
Concatenation approaches have the advantage of treating the task of CANMT in the same way as context-agnostic NMT, which eases learning because the learnable parameters responsible for inter-sentential contextualization are the same that undertake intra-sentential contextualization. Indeed, learning the parameters responsible for inter-sentential contextualization in multi-encoding approaches ($\theta_C$) has been shown to be challenging, because the training signal is sparse and the task of retrieving useful context elements difficult (Lupo et al., 2022). Nonetheless, encoding current and context sentences together comes at a cost. In fact, when sequences are long, the risk of paying attention to irrelevant elements increases. Paying attention to the "wrong tokens" can harm their intra- and inter-sentential contextualization, associating them with the wrong latent features. Indeed, Liu et al. (2020) and Sun et al. (2022) showed that learning to translate long sequences, comprised of many sentences, fails without the use of large-scale pre-training or data augmentation (e.g., as done by Junczys-Dowmunt (2019) and Ma et al. (2021)). Bao et al. (2021) provided some evidence about this learning difficulty, showing that failed models, i.e., models stuck in local minima with a high validation loss, present a distribution of attention weights that is flatter (with higher entropy), both in the encoder and the decoder, than the distribution occurring in models that converge to a lower validation loss. In other words, attention struggles to learn the locality properties of both the language itself and the source-target alignment (Hardmeier, 2012; Rizzi, 2013). As a solution, Zhang et al. (2020) and Bao et al. (2021) propose two slightly different masking methods that allow both the encoding of the current sentence concatenated with its context and the separate encoding of each sentence in the window. The representations generated by the two encoding schemes are then integrated together, at the cost of adding extra learnable parameters to the standard Transformer architecture.
3 Proposed approach
3.1 Context discounting
Evidently, Equation 1 defines an objective function that does not factor in the fact that we only care about the translation of the current sentence $x^j$, because the context translation will be discarded during inference. Moreover, as discussed above, we need attention to stay focused locally, relying on context only for the disambiguation of relatively sparse inter-sentential discourse phenomena that are ambiguous at sentence level. Hence, we propose to encourage the model to focus on the translation of the current sentence $x^j$ by applying a discount $0 \leq CD < 1$ to the loss generated by context tokens:

$\mathcal{L}_{CD}(x_K^j, y_K^j) = CD \cdot \mathcal{L}_{context} + \mathcal{L}_{current} = CD \cdot \mathcal{L}(x_{K-1}^{j-1}, y_{K-1}^{j-1}) + \mathcal{L}(x^j, y^j)$.   (2)
This is equivalent to considering an sKtoK concatenation approach as the result of a multi-task sequence-to-sequence setting (Luong et al., 2016), where an sKto1 model performs the reference task of translating the current sentence given a concatenation of its source with K-1 context sentences, while the translation of the context sentences is added as a secondary, complementary task. The reference task is assigned a larger weight than the secondary task in the multi-task composite loss. As we will see in Section 4.5, this simple modification of the loss allows the model to learn a self-attentive mechanism that is less distracted by noisy context information, thus achieving net improvements in the translation of inter-sentential discourse phenomena occurring in the current sentence (Section 4.3), and helping concatenation systems generalize to wider context after training (Section 4.5.3).
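A minimal sketch of how the context-discounted loss of Equation 2 could be realized at the token level in PyTorch: every target position up to and including the last <S> separator is treated as context and weighted by CD, while current-sentence positions keep weight 1. The helper name, the per-token weighting strategy, and the default value cd=0.5 are our own illustrative assumptions, not the authors' code.

import torch
import torch.nn.functional as F

def context_discounted_loss(logits, target_ids, sep_id, cd=0.5):
    # logits: [T, V] decoder outputs, target_ids: [T] gold ids of y_K^j,
    # sep_id: vocabulary id of the <S> separator, cd: context discount in [0, 1).
    log_probs = F.log_softmax(logits, dim=-1)
    token_nll = -log_probs.gather(1, target_ids.unsqueeze(1)).squeeze(1)  # [T]

    weights = torch.ones_like(token_nll)
    sep_positions = (target_ids == sep_id).nonzero(as_tuple=True)[0]
    if sep_positions.numel() > 0:
        last_sep = sep_positions[-1].item()
        weights[: last_sep + 1] = cd  # discount every context token
    return (weights * token_nll).sum()  # CD * L_context + L_current

Setting cd=1 recovers the standard objective of Equation 1 (up to sign), while smaller values shift the training signal toward the reference task of translating the current sentence.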
3.2 Segment-shifted positions
Context discounting pushes the model to discriminate between the current sentence and the context. Such discrimination can be undertaken by cross-referencing the information provided by two elements: sentence separation tokens <S>, and sinusoidal position encodings, as defined by Vaswani et al. (2017). In order to facilitate this task, we propose to provide the model with extra information about sentence boundaries and their relative distance. Devlin et al. (2019) achieve this goal by adding segment embeddings to every token representation in input to the model, on top of token and position embeddings, such that every segment embedding represents the sentence position in the window of sentences. However, we propose an alternative solution that requires neither extra learnable parameters nor memory allocation: segment-shifted positions. As shown in Figure 1, we apply a constant shift after every separation token <S>, so that the resulting token position is equal to its original position plus a total shift depending on the chosen constant shift and the index $k = 1, 2, \dots, K$ of the sentence the token belongs to: $t' = t + k \cdot shift$. As a result, the position distance between tokens belonging to different sentences is increased. For example, the distance between the first token of the current sentence and the last token of the preceding context sentence increases from $1$ to $1 + shift$. By increasing the distance between the sinusoidal position embeddings² of tokens belonging to different sentences, their dot product, which is at the core of the attention mechanism, becomes smaller, possibly resulting in smaller attention weights. In other words, the resulting attention becomes more localized, as confirmed by the empirical analysis reported in Section 4.6.1. In Section 4.3, we present results of segment-shifted positions, and then compare them with both sinusoidal segment embeddings and learned segment embeddings in Section 4.6.2.
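To make the mechanism concrete, below is a minimal PyTorch sketch of segment-shifted positions combined with sinusoidal encodings. The function names are ours, the shift value of 10 mirrors the example in Figure 1, and the sin/cos concatenation is one common formulation of the sinusoidal encodings of Vaswani et al. (2017).

import torch

def segment_shifted_positions(token_ids, sep_id, shift=10):
    # Returns t' = t + k * shift, where k = 1, ..., K is the index of the
    # sentence each token belongs to in the concatenated window (a new
    # sentence starts right after every <S> separator).
    t = torch.arange(token_ids.size(0))
    sep = (token_ids == sep_id).long()
    k = 1 + torch.cumsum(sep, dim=0) - sep  # the <S> token stays with the preceding sentence
    return t + k * shift

def sinusoidal_embeddings(positions, d_model=512):
    # Sinusoidal encodings evaluated at the (possibly shifted) positions.
    inv_freq = 1.0 / (10000 ** (torch.arange(0, d_model, 2).float() / d_model))
    angles = positions.float().unsqueeze(1) * inv_freq.unsqueeze(0)   # [T, d/2]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # [T, d]

# Example: ids of "x1 x2 <S> x3 x4 <E>" with <S>=3 and shift=10 yield
# positions [10, 11, 12, 23, 24, 25] instead of [0, 1, 2, 3, 4, 5].
pos = segment_shifted_positions(torch.tensor([7, 8, 3, 9, 9, 2]), sep_id=3)
emb = sinusoidal_embeddings(pos)

The shifted positions simply replace the default 0, 1, 2, ... indices when position encodings are added to the token embeddings; the rest of the Transformer is unchanged.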
4 Experiments
4.1 Setup³

We conduct experiments with two language pairs and domains. For En→Ru, we adopt a document-level corpus released by Voita et al. (2019), based on OpenSubtitles2018 (with dev and test sets), comprised of 1.5M parallel sentences. For En→De, we train models on TED talks subtitles released by IWSLT17 (Cettolo et al., 2012). Models are tested

² Positions can be shifted by segment also in the case of learned position embeddings, both absolute and relative. We leave such experiments for future work.

³ See Appendix A for more details.