A baseline revisited: Pushing the limits of multi-segment models for
context-aware translation
Suvodeep Majumder
NCSU, Amazon
smajumd3@ncsu.edu
Stanislas Lauly
Amazon
laulysl@amazon.com
Maria Nadejde
Amazon
mnnadejd@amazon.com
Marcello Federico
Amazon
marcfede@amazon.com
Georgiana Dinu
Amazon
gddinu@amazon.com
Abstract
This paper addresses the task of contextual translation using multi-segment models. Specifically, we show that increasing model capacity further pushes the limits of this approach, and that deeper models are better suited to capturing context dependencies. Furthermore, improvements observed with larger models can be transferred to smaller models using knowledge distillation. Our experiments show that this approach achieves competitive performance across several languages and benchmarks, without additional language-specific tuning or task-specific architectures.
1 Introduction
The quality of NMT (Neural Machine Translation) models has been improving over the years and is narrowing the gap to human translation performance (Hassan et al., 2018). Until recently, most MT research has focused on translating and evaluating sentences in isolation, ignoring the context in which these sentences occur. Simplifying the translation task this way has its advantages: data sets are easier to create, models are computationally more efficient, and human evaluations are faster [1].
While initial work failed to show significant differences in standard metrics (Tiedemann and Scherrer, 2017), the impact of ignoring context has been investigated more closely in recent years (Yin et al., 2021b). Targeted testing has shown poor performance on discourse-related phenomena (Müller et al., 2018; Bawden et al., 2018; Voita et al., 2019a; Jwalapuram et al., 2020b; Maruf et al., 2019b; Li et al., 2020) (see Table 3 for examples). Furthermore, without context, human evaluation fails to expose all translation errors and leads to rash conclusions about achieving human parity (Läubli et al., 2018). It is thus important to start addressing the MT task in a formulation that is closer to its true complexity and bridges the gap to the real communication needs of users.

* These authors contributed equally to this work.
[1] With full document context, annotation time per task increases by 68%, according to Grundkiewicz et al. (2021).
This paper tackles the problem of context-aware translation by revisiting a straightforward multi-sentence translation approach which is considered a baseline in the literature. Our comprehensive experiments show that, by leveraging deeper transformer models in combination with knowledge distillation methods, this baseline becomes an effective and robust alternative to the specialized architectures proposed in the literature. The paper’s contributions are:
- We show that multi-sentence translation can benefit from increased-capacity transformer models, and that deeper models are better at learning contextual dependencies than wider models.
- We further show that distilled models can learn contextual dependencies from larger models, while reducing computational cost and increasing robustness to input-length variations.
- Finally, results on four language pairs confirm that the approach achieves high performance on both contextual and single-segment translation tasks.
2 Multi-segment translation models
Throughout this paper, we implement context-aware translation models as multi-segment models, as initially proposed by Tiedemann and Scherrer (2017) and further used in Fernandes et al. (2021) and Lopes et al. (2020), among others.
Multi-segment data points. We use document-level parallel data which is transformed to contain concatenated, multi-segment input. Specifically, we restrict this work to two consecutive sentences. The source and target sides are concatenated using a special delimiter token and added to the training data. While not strictly a requirement, the special token allows the extraction of the context-aware translation for the second, target sentence. Prior context-aware architectures can be categorized with respect to their use of context: source-side, target-side, or both. As it generates both sentence translations jointly, the multi-segment approach takes advantage of both source- and target-side context at training time. However, it does not use the context reference translation during inference; the multi-segment input is simply translated as a continuous output sequence.

arXiv:2210.10906v2 [cs.CL] 21 Oct 2022

Input | Output
<start> Fire? <sep> Well, put it out, why don’t you? <end> | <start> Ein Feuer? <sep> Na dann löscht er doch! <end>
<start> Well, put it out, why don’t you? <end> | <start> Na dann löscht er doch! <end>

Table 1: Parallel training data contains both segments in isolation as well as concatenated segments. The example is illustrative, from the EN-DE anaphora test set (Müller et al., 2018). At inference time, only the translations of target segments (in bold) are used.
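The multi-segment format and the extraction step can be sketched as follows. This is a minimal illustration, not the paper’s implementation: the delimiter string and helper names are our own, and the <start>/<end> markers shown in Table 1 are omitted for simplicity.

```python
def make_multisegment_pair(ctx_src, src, ctx_tgt, tgt, sep=" <sep> "):
    """Concatenate the preceding (context) sentence and the current sentence
    on both source and target sides, delimited by a special token."""
    return ctx_src + sep + src, ctx_tgt + sep + tgt

def extract_target_translation(output, sep=" <sep> "):
    """At inference time, keep only the translation of the second (target)
    segment: everything after the last delimiter token."""
    return output.split(sep)[-1].strip()
```

The delimiter makes the second step trivial: the context translation is generated jointly but discarded, and only the portion after the last separator is kept as the context-aware translation.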
Training data. We aim to create single translation models which can perform translation both in context and in isolation. For this reason, we start from a training set including context for each parallel sentence and create a duplicate of it with the context information removed. All the contextual models (Ctx) are trained on this joint single- and multi-segment data, while the sentence-level baselines (Bl) use only single sentences. Note that although the data size varies between Bl and Ctx models, the data is effectively identical, and all models are trained using the same stopping criteria, thus conferring no special advantage to any of the models. Table 1 exemplifies the training data.
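The duplication scheme described above can be sketched as a simple preprocessing pass; the function name and tuple layout are illustrative assumptions, not taken from the paper.

```python
def build_ctx_training_set(doc_pairs, sep=" <sep> "):
    """For each (ctx_src, src, ctx_tgt, tgt) tuple, emit both the
    concatenated multi-segment example and its single-segment duplicate,
    so a single Ctx model learns to translate with and without context."""
    examples = []
    for ctx_src, src, ctx_tgt, tgt in doc_pairs:
        examples.append((ctx_src + sep + src, ctx_tgt + sep + tgt))  # multi-segment
        examples.append((src, tgt))                                  # context removed
    return examples
```

This is why Ctx models see twice as many examples as Bl models while the underlying parallel data is identical.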
3 Experimental setup
We perform experiments in four language arcs, En-
glish to German (EN-DE), English to French (EN-
FR), English to Russian (EN-RU) and Chinese to
English (ZH-EN).
3.1 Training
We use the WMT2019 data set for EN-DE, OpenSubtitles 2018 for EN-FR and EN-RU, and the UN Parallel Corpus V1.0 for ZH-EN, all four containing document-level data. The data sets vary in size from 4M segments for EN-DE to 17.4M for ZH-EN (see Appendix A, Table 13 for details). Development data consists of the News task 2019 development set for DE, IWSLT 2019 for FR, and newstest2019 for RU and ZH, respectively. In all conditions the development data mirrors the training data, meaning that it is duplicated to contain both multi- and single-segment data for contextual models, and original and distilled data for distillation experiments. In preliminary experiments we found this to play an important role.
Models use the Transformer architecture (Vaswani et al., 2017a). We start with a baseline architecture of 6:2 encoder:decoder layers and a feed-forward width of 2048, which subsequent experiments increase in decoder depth and feed-forward width, respectively. Training is done with Sockeye (Domhan et al., 2020). See Appendix A for a complete list of training parameters.
3.2 Testing
We measure performance of contextual models using both targeted and non-targeted testing.
Non-targeted tests consist of contextual, document-level data which is not selected to focus on discourse phenomena. For EN-DE we use the test set splits made available by Maruf et al. (2019a): TED (2.3k segments), News-Commentary (3k), and Europarl (5.1k). We use IWSLT15 (1k) (Cettolo et al., 2012) for EN-FR, WMT newstest2020 (4k) (Barrault et al., 2020) for EN-RU, and WMT newstest2020 (2k) (Barrault et al., 2020) for ZH-EN. While contextual models may improve performance on these data sets, previous work suggests that the effects are minimal in high-resource scenarios with strong sentence-level baselines (Lopes et al., 2020).
Targeted tests have been developed in order to evaluate performance on discourse phenomena. Table 2 lists the test sets used in this paper [2]. These test sets contain contrastive translation pairs, consisting of a correct human-generated translation and a variant of it where a pronoun, or another linguistic unit of interest, is swapped with an incorrect one. Table 3 shows examples from these data sets.

To complement the accuracy of contrastive evaluations, we also use the targeted test sets and their references to measure standard translation metrics.

[2] While highly relevant, the data created by Yin et al. (2021a) has not been released at the time of writing this paper.

LP    | Type          | Size   | Source
EN-DE | Anaphora      | 12,000 | Müller et al. (2018)
EN-FR | Anaphora      | 12,000 | Lopes et al. (2020)
EN-RU | Deixis        | 3,000  | Voita et al. (2019b)
      | Lex-coh       | 2,000  |
      | Ellipsis-vp   | 500    |
      | Ellipsis-infl | 500    |
ZH-EN | Anaphora      | 500    | Jwalapuram et al. (2019)

Table 2: Targeted test sets used for evaluating discourse phenomena.

DE | Src     | I forgot to confide it to you.
   | Ctx     | What’s your plan?
   | Ctx-tgt | Was hast du vor?
   | Ref     | Ich vergaß, es euch zu vertraun.
   | Contr   | Ich vergaß, sie euch zu vertraun.
FR | Src     | And where’s it coming from?
   | Ctx     | A sort of mist.
   | Ctx-tgt | Une sorte de brume.
   | Ref     | Et elle vient d’où ?
   | Contr   | Et il vient d’où ?
RU | Src     | Identity theft.
   | Ctx     | And I solved another crime.
   | Ctx-tgt |
   | Ref     |
   | Contr   |
ZH | Src     |
   | Ctx     |
   | Ctx-tgt | It was as if Fiji had been born to play 7s, while GB are still learning the trade.
   | Ref     | Which is pretty much how it is.
   | Contr   | That is pretty much how it is.

Table 3: Targeted test set examples. Models are assessed as correct if they score the reference (Ref) higher than a contrastive variant (Contr), given a source segment and its context.
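Contrastive evaluation as described above reduces to a score comparison per example. The sketch below assumes a generic `score_fn(src, ctx, hyp)` interface returning something like a log-probability; this interface is our own illustration, not the paper’s actual scorer.

```python
def contrastive_accuracy(score_fn, examples):
    """Fraction of examples where the model scores the reference (Ref)
    higher than the contrastive variant (Contr), given the same source
    segment and context. `score_fn` stands in for the model's scorer."""
    correct = sum(
        1 for src, ctx, ref, contr in examples
        if score_fn(src, ctx, ref) > score_fn(src, ctx, contr)
    )
    return correct / len(examples)
```

Note that no translation is generated: the model is used only to score the two given hypotheses, which is why contrastive accuracy is complemented with BLEU on the same references.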
4 Context-aware translation results
We begin our experiments by confirming that concatenated models are indeed able to model context dependencies (Section 4.1). We follow by testing the hypothesis that larger models are better suited for learning the more complex contextual training data (Section 4.2). In order to avoid over-fitting, we use EN-DE as a development language and subsequently test identical settings on FR, RU, and ZH in Section 4.3.
4.1 Multi-segment models
For all four language arcs, Ctx models use 6 encoder layers and 2 decoder layers (44M parameters) and are trained using both segments in isolation as well as concatenated context segments. At inference, the DE, FR, and ZH models use one preceding context sentence, matching the training. However, over 60% of the targeted RU data exhibits longer dependencies, of up to 3 previous segments. For this reason, targeted EN-RU testing concatenates all three context sentences. Baseline (Bl) models use the same original training data and model architecture, this time trained and used to translate one segment at a time.

Arc | Metric | Test set       | Targeted | Bl   | Ctx
DE  | BLEU   | TED            |          | 19.9 | 22.4
    | BLEU   | News           |          | 26.1 | 29.5
    | BLEU   | Europarl       |          | 29.3 | 31.5
    | BLEU   | ContraPro      | ✓        | 20.1 | 21.1
    | Acc    | ContraPro      | ✓        | 0.50 | 0.70
FR  | BLEU   | IWSLT          |          | 40.0 | 39.7
    | BLEU   | LCPT [3]       | ✓        | 27.9 | 32.5
    | Acc    | LCPT           | ✓        | 0.74 | 0.87
    | Acc    | Anaphora       | ✓        | 0.50 | 0.72
RU  | BLEU   | WMT20          |          | 13.6 | 14.6
    | Acc    | Deixis         | ✓        | 0.50 | 0.83
    | Acc    | Lex-coh        | ✓        | 0.46 | 0.47
    | Acc    | Ellipsis-vp    | ✓        | 0.20 | 0.60
    | Acc    | Ellipsis-infl  | ✓        | 0.52 | 0.68
ZH  | BLEU   | WMT20          |          | 21.2 | 21.4
    | Acc    | Eval-anaphora  | ✓        | 0.58 | 0.61

Table 4: Concatenated models (Ctx) vs. baseline models (Bl) of the same capacity. While all test sets have context, some are targeted towards discourse phenomena, marked as Targeted (see Section 3 for details).
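Building the inference input with a variable number of preceding sentences can be sketched as follows; the function is an illustrative assumption, not code from the paper.

```python
def build_context_input(sentences, i, k=1, sep=" <sep> "):
    """Concatenate up to k preceding sentences as context for sentence i.
    Per the setup above, k=1 for DE, FR and ZH; k=3 for targeted EN-RU tests."""
    window = sentences[max(0, i - k):i + 1]
    return sep.join(window)
```

Documents shorter than the window simply yield fewer context sentences, so the same code covers document-initial positions.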
Results are shown in Table 4. As observed in previous work, concatenated models are considerably better than their context-ignoring counterparts, particularly on targeted test sets. In contrastive testing, accuracy increases by 20-30% in absolute terms in all languages, with the exception of the lexical cohesion data set in RU and the anaphora data set in ZH.

For non-targeted testing, Ctx models significantly outperform the Bl models on 4 out of the 6 test sets. This differs from previous work, where contextual models using the concatenation approach are reported to degrade BLEU scores: Tiedemann and Scherrer (2017) measure a 0.6 BLEU drop, Voita et al. (2019b) show a 0.84 drop for RU, Lopes et al. (2020) a drop of 1.2, and Junczys-Dowmunt (2019a) a BLEU degradation of 1.5. These results indicate that training the contextual model with both contextual and non-contextual data alleviates the issue of quality degradation.
[3] Large-contrastive-pronoun-testset-EN-FR (LCPT)