
Input:  <start> Fire? <sep> Well, put it out, why don’t you? <end>
Output: <start> Ein Feuer? <sep> Na dann löscht er doch! <end>

Input:  <start> Well, put it out, why don’t you? <end>
Output: <start> Na dann löscht er doch! <end>

Table 1: Parallel training data contains both segments in isolation as well as concatenated segments. The example is demonstrative, from the EN-DE anaphora test set (Müller et al., 2018). At inference time, only the translations of the target segments (in bold) are used.
we restrict this work to two consecutive sentences. The source and target sides are concatenated using a special delimiter token and added to the training data. While not strictly a requirement, the special token allows the extraction of the context-aware translation of the second, target sentence. Prior context-aware architectures can be categorized by their use of context: source-side, target-side, or both. Because it generates both sentence translations jointly, the multi-segment approach takes advantage of both source- and target-side context at training time. However, it does not use the reference translation of the context during inference, and the multi-segment input is simply translated as one continuous output sequence.
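To make the format concrete, below is a minimal sketch of building a multi-segment input and extracting the target-sentence translation from the joint output. The delimiter tokens follow Table 1; the helper functions and their names are illustrative, not part of the described system.

```python
# Minimal sketch of the multi-segment format, assuming the delimiter
# tokens shown in Table 1. Helper names are illustrative.

def make_multi_segment(context: str, sentence: str) -> str:
    """Concatenate the context and the target sentence with delimiters."""
    return f"<start> {context} <sep> {sentence} <end>"

def extract_target(translation: str) -> str:
    """Return only the translation of the second (target) segment,
    i.e. the span between the delimiter and the end token."""
    return translation.split("<sep>")[-1].replace("<end>", "").strip()

src = make_multi_segment("Fire?", "Well, put it out, why don't you?")
# Joint model output for this input (from Table 1):
out = "<start> Ein Feuer? <sep> Na dann löscht er doch! <end>"
print(extract_target(out))  # -> "Na dann löscht er doch!"
```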
Training data
We aim to create single translation models which can translate both in context and in isolation. For this reason, we start from a training set that includes context for each parallel sentence and create a duplicate of it with the context information removed. All the contextual models (Ctx) are trained on this joint single- and multi-segment data, while the sentence-level baselines (Bl) use only single sentences. Note that although the data size varies between the Bl and Ctx models, the data is effectively identical, and all models are trained using the same stopping criteria, thus conferring no special advantage to any of the models. Table 1 exemplifies the training data.
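A sketch of this duplication step is shown below; the tuple layout and function name are assumptions for illustration, not the exact data pipeline used here.

```python
# Sketch of building the joint Ctx training set: every contextual example
# is kept as a multi-segment pair and also duplicated as a single-segment
# pair with the context removed (cf. Table 1).

def build_ctx_training_data(examples):
    """examples: iterable of (src_context, src, tgt_context, tgt) tuples."""
    data = []
    for src_ctx, src, tgt_ctx, tgt in examples:
        # Multi-segment variant: context + sentence on both sides.
        data.append((f"<start> {src_ctx} <sep> {src} <end>",
                     f"<start> {tgt_ctx} <sep> {tgt} <end>"))
        # Single-segment duplicate: the same pair without context.
        data.append((f"<start> {src} <end>", f"<start> {tgt} <end>"))
    return data
```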
3 Experimental setup
We perform experiments in four language arcs: English to German (EN-DE), English to French (EN-FR), English to Russian (EN-RU), and Chinese to English (ZH-EN).
3.1 Training
We use the WMT 2019 data set for EN-DE, OpenSubtitles 2018 for EN-FR and EN-RU, and the UN Parallel Corpus V1.0 for ZH-EN, all four containing document-level data. The data sets vary in size from 4M segments for EN-DE to 17.4M for ZH-EN (see Appendix A, Table 13 for details). Development data consists of the News task 2019 development set for DE, IWSLT 2019 for FR, and newstest2019 for RU and ZH respectively. In all conditions the development data mirrors the training data, meaning that it is duplicated to contain both multi- and single-segment data for contextual models, and original and distilled data for distillation experiments. In preliminary experiments we found this to play an important role.
Models use the Transformer architecture (Vaswani et al., 2017a). We start from a baseline architecture with 6:2 encoder:decoder layers and a feed-forward width of 2048, which subsequent experiments increase in decoder depth and feed-forward width respectively. Training is done with Sockeye (Domhan et al., 2020). See Appendix A for a complete list of training parameters.
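For concreteness, a hypothetical invocation of the baseline configuration might look as follows. Flag names follow Sockeye's command-line interface; the data paths are placeholders rather than the exact setup used here.

```python
import subprocess

# Hypothetical Sockeye training call for the baseline configuration above:
# 6 encoder / 2 decoder layers and feed-forward width 2048. Paths are
# placeholders; the remaining hyper-parameters are listed in Appendix A.
subprocess.run([
    "python", "-m", "sockeye.train",
    "--source", "train.src", "--target", "train.tgt",
    "--validation-source", "dev.src", "--validation-target", "dev.tgt",
    "--num-layers", "6:2",  # encoder:decoder depth
    "--transformer-feed-forward-num-hidden", "2048",
    "--output", "model_dir",
], check=True)
```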
3.2 Testing
We measure the performance of contextual models using both targeted and non-targeted testing.

Non-targeted tests
consist of contextual, document-level data which is not selected to focus on discourse phenomena. For EN-DE we use the test set splits made available by Maruf et al. (2019a): TED (2.3k segments), News-Commentary (3k) and Europarl (5.1k). We use IWSLT15 (1k) (Cettolo et al., 2012) for EN-FR, WMT newstest2020 (4k) (Barrault et al., 2020) for EN-RU, and finally WMT newstest2020 (2k) (Barrault et al., 2020) for ZH-EN. While contextual models may improve performance on these data sets, previous work suggests that the effects are minimal in high-resource scenarios with strong sentence-level baselines (Lopes et al., 2020).
Targeted tests
have been developed in order to evaluate performance on discourse phenomena. Table 2 lists the test sets used in this paper.² These test sets contain contrastive translation pairs, consisting of a correct human-generated translation and a variant of it where a pronoun, or another linguistic unit of interest, is swapped with an incorrect one.

² While highly relevant, the data created by Yin et al. (2021a) has not been released at the time of writing this paper.
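Under this protocol, contrastive accuracy can be computed as in the sketch below, where `score_fn` stands in for any model scoring function (e.g. a length-normalized log-likelihood); the function names are illustrative.

```python
# Sketch of contrastive evaluation: a pair counts as correct when the
# model assigns a higher score to the reference translation than to the
# contrastive (incorrect) variant.

def contrastive_accuracy(pairs, score_fn):
    """pairs: iterable of (source, correct_translation, incorrect_variant);
    score_fn(source, hypothesis) -> float, higher is better."""
    hits = total = 0
    for src, good, bad in pairs:
        hits += score_fn(src, good) > score_fn(src, bad)
        total += 1
    return hits / total
```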