
Input:  <start> Fire? <sep> Well, put it out, why don’t you? <end>
Output: <start> Ein Feuer? <sep> Na dann löscht er doch! <end>

Input:  <start> Well, put it out, why don’t you? <end>
Output: <start> Na dann löscht er doch! <end>

Table 1: Parallel training data contains both segments in isolation as well as concatenated segments. The example is demonstrative, from the EN-DE anaphora test set (Müller et al., 2018). At inference time, only the translations of the target segments (in bold) are used.
we restrict this work to two consecutive sentences. The source and target sides are concatenated using a special delimiter token and added to the training data. While not strictly a requirement, the special token allows the extraction of the context-aware translation of the second, target sentence. Prior context-aware architectures can be categorized by their use of context: source-side, target-side, or both. Because it generates both sentence translations jointly, the multi-segment approach takes advantage of both source- and target-side context at training time. However, it does not use the reference translation of the context during inference, and the multi-segment input is simply translated as one continuous output sequence.
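To make the format concrete, below is a minimal sketch of building a multi-segment input and extracting the target-sentence translation from the joint output. The delimiter tokens follow Table 1; the helper functions and their names are illustrative, not part of the described system.

```python
# Minimal sketch of the multi-segment format, assuming the delimiter
# tokens shown in Table 1. Helper names are illustrative.

def make_multi_segment(context: str, sentence: str) -> str:
    """Concatenate the context and the target sentence with delimiters."""
    return f"<start> {context} <sep> {sentence} <end>"

def extract_target(translation: str) -> str:
    """Return only the translation of the second (target) segment,
    i.e. the span between the delimiter and the end token."""
    return translation.split("<sep>")[-1].replace("<end>", "").strip()

src = make_multi_segment("Fire?", "Well, put it out, why don't you?")
# Joint model output for this input (from Table 1):
out = "<start> Ein Feuer? <sep> Na dann löscht er doch! <end>"
print(extract_target(out))  # -> "Na dann löscht er doch!"
```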
Training data
We aim to create single translation models which can translate both in context and in isolation. For this reason, we start from a training set that includes context for each parallel sentence and create a duplicate of it with the context information removed. All the contextual models (Ctx) are trained on this joint single- and multi-segment data, while the sentence-level baselines (Bl) use only single sentences. Note that although the data size varies between the Bl and Ctx models, the data is effectively identical, and all models are trained using the same stopping criteria, thus conferring no special advantage to any of the models. Table 1 exemplifies the training data.
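A sketch of this duplication step is shown below; the tuple layout and function name are assumptions for illustration, not the exact data pipeline used here.

```python
# Sketch of building the joint Ctx training set: every contextual example
# is kept as a multi-segment pair and also duplicated as a single-segment
# pair with the context removed (cf. Table 1).

def build_ctx_training_data(examples):
    """examples: iterable of (src_context, src, tgt_context, tgt) tuples."""
    data = []
    for src_ctx, src, tgt_ctx, tgt in examples:
        # Multi-segment variant: context + sentence on both sides.
        data.append((f"<start> {src_ctx} <sep> {src} <end>",
                     f"<start> {tgt_ctx} <sep> {tgt} <end>"))
        # Single-segment duplicate: the same pair without context.
        data.append((f"<start> {src} <end>", f"<start> {tgt} <end>"))
    return data
```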
3 Experimental setup
We perform experiments in four language arcs: English to German (EN-DE), English to French (EN-FR), English to Russian (EN-RU), and Chinese to English (ZH-EN).
3.1 Training
We use the WMT 2019 data set for EN-DE, OpenSubtitles 2018 for EN-FR and EN-RU, and the UN Parallel Corpus V1.0 for ZH-EN, all four containing document-level data. The data sets vary in size from 4M segments for EN-DE to 17.4M for ZH-EN (see Appendix A, Table 13 for details). Development data consists of the News task 2019 development set for DE, IWSLT 2019 for FR, and newstest2019 for RU and ZH respectively. In all conditions the development data mirrors the training data, meaning that it is duplicated to contain both multi- and single-segment data for contextual models, and original and distilled data for distillation experiments. In preliminary experiments we found this to play an important role.
Models use the Transformer architecture (Vaswani et al., 2017a). We start from a baseline architecture with 6:2 encoder:decoder layers and a feed-forward width of 2048, which subsequent experiments increase in decoder depth and feed-forward width respectively. Training is done with Sockeye (Domhan et al., 2020). See Appendix A for a complete list of training parameters.
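For concreteness, a hypothetical invocation of the baseline configuration might look as follows. Flag names follow Sockeye's command-line interface; the data paths are placeholders rather than the exact setup used here.

```python
import subprocess

# Hypothetical Sockeye training call for the baseline configuration above:
# 6 encoder / 2 decoder layers and feed-forward width 2048. Paths are
# placeholders; the remaining hyper-parameters are listed in Appendix A.
subprocess.run([
    "python", "-m", "sockeye.train",
    "--source", "train.src", "--target", "train.tgt",
    "--validation-source", "dev.src", "--validation-target", "dev.tgt",
    "--num-layers", "6:2",  # encoder:decoder depth
    "--transformer-feed-forward-num-hidden", "2048",
    "--output", "model_dir",
], check=True)
```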
3.2 Testing
We measure the performance of contextual models using both targeted and non-targeted testing.

Non-targeted tests
consist of contextual, document-level data which is not selected to focus on discourse phenomena. For EN-DE we use the test set splits made available by Maruf et al. (2019a): TED (2.3k segments), News-Commentary (3k) and Europarl (5.1k). We use IWSLT15 (1k) (Cettolo et al., 2012) for EN-FR, WMT newstest2020 (4k) (Barrault et al., 2020) for EN-RU, and finally WMT newstest2020 (2k) (Barrault et al., 2020) for ZH-EN. While contextual models may improve performance on these data sets, previous work suggests that the effects are minimal in high-resource scenarios with strong sentence-level baselines (Lopes et al., 2020).
Targeted tests
have been developed in order to evaluate performance on discourse phenomena. Table 2 lists the test sets used in this paper.² These test sets contain contrastive translation pairs, consisting of a correct human-generated translation and a variant of it where a pronoun, or another linguistic unit of interest, is swapped with an incorrect one.

² While highly relevant, the data created by Yin et al. (2021a) has not been released at the time of writing this paper.
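Under this protocol, contrastive accuracy can be computed as in the sketch below, where `score_fn` stands in for any model scoring function (e.g. a length-normalized log-likelihood); the function names are illustrative.

```python
# Sketch of contrastive evaluation: a pair counts as correct when the
# model assigns a higher score to the reference translation than to the
# contrastive (incorrect) variant.

def contrastive_accuracy(pairs, score_fn):
    """pairs: iterable of (source, correct_translation, incorrect_variant);
    score_fn(source, hypothesis) -> float, higher is better."""
    hits = total = 0
    for src, good, bad in pairs:
        hits += score_fn(src, good) > score_fn(src, bad)
        total += 1
    return hits / total
```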