
Improved Data Augmentation for Translation Suggestion
Hongxiao Zhang1, Siyu Lai1, Songming Zhang1, Hui Huang2, Yufeng Chen1∗
Jinan Xu1 and Jian Liu1
1Beijing Jiaotong University, Beijing, China
2Harbin Institute of Technology, Harbin, China
{hongxiaozhang,siyulai,smzhang22,chenyf,jaxu,jianliu}@bjtu.edu.cn,
huanghui_hit@126.com
Abstract
Translation suggestion (TS) models are used to automatically provide alternative suggestions for incorrect spans in sentences generated by machine translation. This paper introduces the system used in our submission to the WMT’22 Translation Suggestion shared task. Our system is based on the ensemble of different translation architectures, including Transformer, SA-Transformer, and DynamicConv. We use three strategies to construct synthetic data from parallel corpora to compensate for the lack of supervised data. In addition, we introduce a multi-phase pre-training strategy, adding an additional pre-training phase with in-domain data. We rank second and third on the English-German and English-Chinese bidirectional tasks, respectively.
1 Introduction
Translation suggestion (TS) is a scheme to simplify post-editing (PE) by automatically providing alternative suggestions for incorrect spans in machine translation outputs. Yang et al. (2021) formally define TS and build a high-quality dataset with human annotation, establishing a benchmark for TS. Based on the machine translation framework, the TS system takes as input the source sentence x spliced with the translation sentence ˜m, where the incorrect span of ˜m is masked, and outputs the correct alternative y for the incorrect span. The TS task is still at an early research stage; to spur research on this task, WMT released the translation suggestion shared task.
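The input format described above can be sketched as follows. This is a minimal illustration only: the <mask> token, the "</s>" separator, and the helper name are assumptions, not the shared task's actual data format.

```python
# Illustrative sketch of the TS input described above: the source
# sentence x is spliced with the translation ~m whose incorrect span
# has been masked; the model should output the alternative y.
# The mask token and separator below are assumptions for illustration.

def build_ts_input(source, translation, span, mask_token="<mask>", sep=" </s> "):
    """Splice the source sentence with the translation whose
    incorrect character span [start, end) is replaced by a mask."""
    start, end = span
    masked = translation[:start] + mask_token + translation[end:]
    return source + sep + masked

# Example: mask a wrongly translated word.
src = "I deposited money at the bank."
mt = "Ich habe Geld am Ufer eingezahlt."  # "Ufer" (riverbank) is wrong
span = (mt.index("Ufer"), mt.index("Ufer") + len("Ufer"))
ts_input = build_ts_input(src, mt, span)
# ts_input == "I deposited money at the bank. </s> Ich habe Geld am <mask> eingezahlt."
```

Given this spliced input, the model is trained to generate only the correct alternative for the masked span (e.g. "der Bank"), not the full corrected sentence.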
This WMT’22 shared task consists of two subtasks: Naive Translation Suggestion and Translation Suggestion with Hints. We participate in the former, which offers bidirectional translation suggestion tasks for two language pairs, English-Chinese and English-German; we participate in all four translation directions.
∗Yufeng Chen is the corresponding author.
Our TS systems are built on several machine translation models, including Transformer (Vaswani et al., 2017), SA-Transformer (Yang et al., 2021), and DynamicConv (Wu et al., 2018). To make up for the lack of training data, we construct synthetic data from parallel corpora based on three strategies. First, we randomly sample a sub-segment in each target sentence of the golden parallel data, mask the sampled sub-segment to simulate an incorrect span, and use the sub-segment as the alternative suggestion. Second, the same strategy is applied to pseudo-parallel data whose target side is replaced by machine translation results. Finally, we use a quality estimation (QE) model (Zheng et al., 2021) to estimate the translation quality of the words in the translation output and select a span with low confidence for masking; we then use an alignment tool to find the sub-segment in the reference sentence corresponding to the masked span and use it as the alternative suggestion.
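The first strategy can be sketched as below. This is a minimal illustration under stated assumptions: whitespace tokenization, a <mask> placeholder, and uniform span sampling; the actual sampling scheme may differ.

```python
import random

# A minimal sketch of the first synthesis strategy: randomly sample a
# sub-segment of the target sentence, mask it to simulate an incorrect
# span, and keep the removed words as the alternative suggestion.
# Whitespace tokenization and the <mask> token are assumptions here.

def make_synthetic_example(source, target, mask_token="<mask>", rng=None):
    """Turn one (source, target) parallel pair into one TS training
    triple: (source, masked target, alternative suggestion)."""
    rng = rng or random.Random()
    tokens = target.split()  # assumes a non-empty target sentence
    # choose a non-empty token span [i, j) at random
    i = rng.randrange(len(tokens))
    j = rng.randrange(i + 1, len(tokens) + 1)
    suggestion = " ".join(tokens[i:j])
    masked = " ".join(tokens[:i] + [mask_token] + tokens[j:])
    return source, masked, suggestion
```

The second strategy reuses this function unchanged, simply passing a machine-translated target instead of the reference.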
Considering that there is a domain gap between the synthetic corpus and the human-annotated corpus, we add an additional pre-training phase. Specifically, we train a discriminator and use it to select sentences from the synthetic corpus that are close to the golden corpus, which we deem in-domain data. After pre-training with large-scale synthetic data, we perform additional pre-training with this in-domain data, thereby reducing the domain gap. We describe our system in detail in Section 3.
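The filtering step can be sketched as follows. The scoring function and threshold below are placeholders; in our system the scores come from a trained discriminator that distinguishes synthetic from golden sentences.

```python
# A minimal sketch of the in-domain filtering step: a discriminator
# scores each synthetic pair, and pairs scoring above a threshold are
# kept as in-domain data for the additional pre-training phase.
# score_fn and the threshold value are placeholders, not our model.

def select_in_domain(pairs, score_fn, threshold=0.5):
    """Keep synthetic pairs the discriminator judges close to the
    golden (human-annotated) corpus."""
    return [pair for pair in pairs if score_fn(pair) >= threshold]

def dummy_score(pair):
    # toy scorer for demonstration: prefers shorter source sentences
    return 1.0 / len(pair[0].split())

pairs = [("src one", "tgt one"), ("a much longer source sentence", "tgt")]
in_domain = select_in_domain(pairs, dummy_score, threshold=0.3)
```

Training then proceeds in phases: large-scale synthetic pre-training, additional pre-training on the filtered in-domain subset, and finally fine-tuning on the human-annotated data.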
2 Related Work
The translation suggestion (TS) task is an important part of post-editing (PE), which combines machine translation (MT) and human translation (HT): human translators improve translation quality by correcting incorrect spans in MT outputs. To simplify PE, early work studied translation prediction (Green
arXiv:2210.06138v1 [cs.CL] 12 Oct 2022