
1) SOURCE: 乔恋攥紧了拳头,垂下了头。
   REFERENCE: Qiao Lian clenched her fists and lowered her head.
   MT: Joe clenched his fist and bowed his head.
2) SOURCE: 其实他说得对。
   REFERENCE: Actually, he was right.
   MT: In fact, he’s right.
3) SOURCE: ⟦⟧自己就是一个蠢货,竟然会[...]。
   REFERENCE: ⟦She⟧ was indeed an idiot, as only an idiot would [...]
   MT: ⟦I⟧ am a fool, even will [...]
5) SOURCE: 她点进去,发现是凉粉群,所有人都在@她[...]
   REFERENCE: She logged into ⟦her⟧ account and saw that a large number of fans in the Liang fan group had tagged her.[...]
   MT: She nodded in and found it was a cold powder group, and everyone was on her.[...]
7) SOURCE: 【川流不息:乔恋,快看微博头条!微博头条!】
   REFERENCE: [Chuan Forever:Qiao Lian, look at the headlines on Weibo, quickly!]
   MT: Chuan-flowing:Joe love, quickly look at the micro-blogging headlines! Weibo headlines?
8) SOURCE: 她微微一愣,拿起手机,登陆微博,在看到头条的时候,整个人一下子愣住了!
   REFERENCE: She froze momentarily, then picked up ⟦her⟧ cell phone and logged into Weibo. When ⟦she⟧ saw the headlines, ⟦her entire body⟧ immediately froze over again!
   MT: She took a slight look, picked up the phone, landed on the micro-blog, when ⟦she⟧ saw the headlines, ⟦the whole person⟧ suddenly choked!
Figure 2: Part of a chapter in BWB. The same entities are marked with the same color. Pronoun omissions are marked with ⟦⟧. The mistranslated verbs are marked in teal, and the mistranslated named entities are underlined. The full chapter is in Fig. 5. MT is the output of a Transformer-based sentence-level machine translation system.
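| Corpus | Genre | Size: #word | Size: #sent | Size: #doc | Avg. length: #w/s | Avg. length: #s/d | Avg. length: #w/d |
|---|---|---|---|---|---|---|---|
| IWSLT | TED talk | 4.2M | 0.2M | 2K | 19.5 | 100 | 2,100 |
| NewsCom | news | 6.4M | 0.2M | 5K | 30.7 | 40 | 1,288 |
| Europarl | Parliament | 7.3M | 0.2M | 15K | 35.1 | 13 | 485 |
| LDC | News | 81.8M | 2.8M | 61K | 23.7 | 46 | 1,340 |
| OpenSub | Subtitle | 16.9M | 2.2M | 3K | 5.6 | 733 | 5,647 |
| BWB | novel (chapter) | 460.8M | 9.6M | 196K | 48.1 | 49 | 2,356 |
| BWB | novel (book) | 460.8M | 9.6M | 384 | 48.1 | 25.0K | 1.2M |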
Table 1: Statistics of various document-level parallel corpora. For corpora that contain multiple language pairs (IWSLT and OpenSub), we report the statistics for ZH-EN. For corpora that do not contain ZH-EN parallel documents (NewsCom and Europarl), we report the statistics of their (largest) available language pairs (DE-EN and ET-EN). w, s and d stand for word, sentence and document, respectively. The full list is in Tab. 5.
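As a quick sanity check on the averaged-length columns of Table 1, the sketch below recomputes #w/s, #s/d and #w/d for the book-level BWB row. It uses the rounded word count from Table 1 together with the sentence-pair and book counts reported in Section 2.1. The function name `averaged_lengths` is ours, and relying on the rounded 460.8M word figure is an assumption, so small deviations from the published values are to be expected.

```python
def averaged_lengths(n_words: float, n_sents: int, n_docs: int) -> dict:
    """Return the three averaged-length statistics used in Table 1."""
    return {
        "w/s": n_words / n_sents,  # words per sentence
        "s/d": n_sents / n_docs,   # sentences per document
        "w/d": n_words / n_docs,   # words per document
    }


# BWB at the book level: 460.8M words (rounded, Table 1),
# 9,581,816 sentence pairs and 384 books (Section 2.1).
bwb_book = averaged_lengths(460.8e6, 9_581_816, 384)

for name, value in bwb_book.items():
    print(f"{name}: {value:,.1f}")
```

After rounding, these values agree with the 48.1, 25.0K and 1.2M reported in the book-level row.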
evaluations of several baseline models as well as human post-editing performance on the BWB corpus and observe large gaps between NMT models and human performance. We hope that this corpus will help us understand the deficiencies of existing systems and build better systems for document-level machine translation.
2 Dataset Creation
In this section, we describe three stages of the dataset creation process: collecting bilingual parallel documents, quality control and dataset split.
2.1 Bilingual Document Collection
We first select 385 Chinese web novels across multiple genres, including action, fantasy, romance, comedy, science fiction, martial arts, etc. The genre distribution is shown in Fig. 3. We then crawl their corresponding English translations from the Internet.² The English versions are translated by professional translators who are native speakers of English, and then corrected and aligned by professional editors at the chapter level. The text is converted to UTF-8 and certain data cleansing (e.g. deduplication) is performed in the process.
2 https://readnovelfull.com
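The UTF-8 conversion and deduplication mentioned above could look roughly like the sketch below. This is only an illustration and not the authors' cleansing script: the helper names `to_utf8_text` and `deduplicate`, the default source encoding `gb18030`, and deduplicating at the level of exact lines are all assumptions, since the text does not specify these details.

```python
from typing import Iterable, List


def to_utf8_text(raw: bytes, source_encoding: str = "gb18030") -> str:
    """Decode crawled bytes into a Python string that can be written out as UTF-8.

    `source_encoding` is a hypothetical default; the actual encoding of the
    crawled pages is not stated in the text.
    """
    # Undecodable bytes are replaced rather than raising an error.
    return raw.decode(source_encoding, errors="replace")


def deduplicate(lines: Iterable[str]) -> List[str]:
    """Drop empty lines and exact duplicates, keeping the first occurrence in order."""
    seen = set()
    unique = []
    for line in lines:
        key = line.strip()
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Example: decode a byte string and remove repeated lines.
text = to_utf8_text("乔恋\n乔恋\n其实他说得对。".encode("gb18030"))
cleaned = deduplicate(text.split("\n"))
```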
| Error Type | # | Description | Annotated |
|---|---|---|---|
| ENTITY | 43.3% | error(s) due to the mistranslation of named entities. | ✓ |
| TENSE | 38.7% | error(s) due to incorrect tense. | |
| ZEROPRO | 17.3% | error(s) caused by the omission of pronoun(s). | ✓ |
| AMBIGUITY | 7.3% | there are some ambiguous span(s) that is(are) correct in the stand-alone sentence but wrong in context. | ✓ |
| ELLIPSIS | 4.0% | error(s) caused by the omission of other span(s). | ✓ |
| SENTENCE | 51.3% | sentence-level error(s). | |
| NO ERROR | 17.1% | no errors. | |

Table 2: The types of NMT errors and their description. # represents the proportion of the error in the BWB test set. ✓ indicates “with annotation”.
Chapters that contain poetry or couplets in classical Chinese are excluded as they are difficult to translate directly into English. Further, we exclude chapters with fewer than 5 sentences and chapters where the sequence ratio is greater than 3.0. Chapter titles are also removed, since most of them are neither translated properly nor at the document level. The sentence alignment is automatically performed by Bleualign³ (Sennrich and Volk, 2011). The final corpus has 384 books with 9,581,816 sentence pairs (a total of 461.8 million words).⁴
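The two length-based chapter filters described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' released script: the names `length_ratio` and `keep_chapter` are ours, and because the text does not define how the sequence ratio is computed, the sketch assumes a longer-over-shorter ratio of sentence counts and applies the minimum-length threshold to both sides.

```python
from typing import List

MIN_SENTENCES = 5        # chapters with fewer sentences are excluded (from the text)
MAX_LENGTH_RATIO = 3.0   # chapters with a larger ratio are excluded (from the text)


def length_ratio(src_len: int, tgt_len: int) -> float:
    """Longer-over-shorter length ratio (assumed definition); inf if one side is empty."""
    shorter, longer = sorted((src_len, tgt_len))
    return float("inf") if shorter == 0 else longer / shorter


def keep_chapter(src_sents: List[str], tgt_sents: List[str]) -> bool:
    """Return True if a chapter pair passes both length-based filters."""
    # Assumption: the sentence-count threshold applies to the shorter side.
    if min(len(src_sents), len(tgt_sents)) < MIN_SENTENCES:
        return False
    # Assumption: the ratio is measured over sentence counts.
    return length_ratio(len(src_sents), len(tgt_sents)) <= MAX_LENGTH_RATIO
```

A chapter with 5 source sentences and 15 target sentences sits exactly at the 3.0 threshold and is kept, since the text excludes only ratios greater than 3.0. If the ratio were instead measured in words or characters, only the arguments passed to `length_ratio` would change.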
2.2 Quality Control
We hired four bilingual graduate students to perform the quality control of the aforementioned process. These annotators were native Chinese speakers and proficient in English. We randomly selected 163 chapters and asked the annotators to
3 https://github.com/rsennrich/Bleualign
4 We will release a crawling and cleansing script pointing to a past web archive that will enable others to reproduce our dataset faithfully.