A Bilingual Parallel Corpus with Discourse Annotations

Yuchen Eleanor Jiangζ  Tianyu Liuζ  Shuming Maγ  Dongdong Zhangγ  Mrinmaya Sachanζ  Ryan Cotterellζ
ζ ETH Zürich   γ Microsoft Research Asia
{yuchen.jiang,tianyu.liu,ryan.cotterell,mrinmaya.sachan}@inf.ethz.ch
{shuming.ma,dongdong.zhang}@microsoft.com
Abstract

Machine translation (MT) has almost achieved human parity at sentence-level translation. In response, the MT community has, in part, shifted its focus to document-level translation. However, the development of document-level MT systems is hampered by the lack of parallel document corpora. This paper describes BWB, a large parallel corpus first introduced in Jiang et al. (2022), along with an annotated test set. The BWB corpus consists of Chinese novels translated by experts into English, and the annotated test set is designed to probe the ability of machine translation systems to model various discourse phenomena. Our resource is freely available, and we hope that it will serve as a guide and inspiration for more work in the area of document-level machine translation.

https://github.com/EleanorJiang/BlonDe/tree/main/BWB
1 Introduction
Machine translation (MT) has made significant progress in the past few decades. Neural machine translation (NMT) models, which are able to leverage abundant quantities of parallel training data, have been one of the main contributors to this progress (Luong et al., 2015; Vaswani et al., 2017; Zhang et al., 2018, inter alia). Unfortunately, the majority of available parallel corpora contain sentence-level translations. As a result, models trained on these corpora translate text quite well at the sentence level, but perform poorly when the translation of an entire document is viewed in context (Voita et al., 2019b; Werlen and Sadiht, 2021). In particular, sentence-level translation models tend to omit relevant contextual information, resulting in a lack of coherence in the produced translation. For example, in Fig. 2, the sentence-level MT system fails to capture discourse dependencies across sentences, and the same concepts are not consistently referred to with the same translations (i.e., Weibo vs. micro-blog, Qiao Lian vs. Joe vs. Joe love).¹

Figure 1: Comparing the sizes of various document-level parallel corpora. BWB is the largest parallel corpus to date.
Over the past few years, there have been efforts to tackle this problem by building context-aware NMT models (Wang et al., 2017; Miculicich et al., 2018; Maruf and Haffari, 2018; Voita et al., 2019a, inter alia). Although such approaches have achieved some improvements, they nonetheless suffer from a dearth of document-level training data. Take the WMT news translation task as an example: the document-level news commentary corpus (Tiedemann, 2012) contains only 6.4M tokens, while the available sentence-level training data has around 825M tokens. To alleviate this problem, we collect a large document-level parallel corpus that consists of 196K paragraphs from Chinese novels translated into English. As shown in Fig. 1, it is, to the best of our knowledge, the largest document-level corpus. Additionally, an in-depth human analysis shows that it is very challenging for current NMT systems due to its rich discourse phenomena (see Fig. 2).
To better evaluate context-aware MT models, we further annotate the test set with characteristic discourse-level phenomena, namely ambiguity and ellipsis. The test set is specifically designed to measure models' capacity to exploit such long-range linguistic context. We then conduct systematic evaluations of several baseline models as well as human post-editing performance on the BWB corpus, and observe large gaps between NMT models and human performance. We hope that this corpus will help us understand the deficiencies of existing systems and build better systems for document-level machine translation.

¹ This discourse phenomenon is referred to as entity consistency. There are other discourse dependencies that MT fails to capture, such as tense cohesion, ellipsis, and coreference. We leave explanations of discourse phenomena to §3.1.
Figure 2: Part of a chapter in BWB, showing the Chinese source, the expert reference, and the MT output side by side. The same entities are marked with the same color, pronoun omissions are marked with brackets, mistranslated verbs are marked in teal, and mistranslated named entities are underlined. The full chapter is in Fig. 5. MT is the output of a Transformer-based sentence-level machine translation system. [The figure body is not legibly recoverable from this extraction; representative contrasts include the reference "Qiao Lian clenched her fists and lowered her head." against the MT output "Joe clenched his fist and bowed his head.", and the reference "Weibo" against the MT output "micro-blog".]
Corpus | Genre | #word | #sent | #doc | #w/s | #s/d | #w/d
IWSLT | TED talk | 4.2M | 0.2M | 2K | 19.5 | 100 | 2,100
NewsCom | news | 6.4M | 0.2M | 5K | 30.7 | 40 | 1,288
Europarl | parliament | 7.3M | 0.2M | 15K | 35.1 | 13 | 485
LDC | news | 81.8M | 2.8M | 61K | 23.7 | 46 | 1,340
OpenSub | subtitles | 16.9M | 2.2M | 3K | 5.6 | 733 | 5,647
BWB | novel (chapter) | 460.8M | 9.6M | 196K | 48.1 | 49 | 2,356
BWB | novel (book) | 460.8M | 9.6M | 384 | 48.1 | 25.0K | 1.2M

Table 1: Statistics of various document-level parallel corpora (#word, #sent, #doc give corpus size; #w/s, #s/d, #w/d give averaged lengths). For corpora that contain multiple language pairs (IWSLT and OpenSub), we report the statistics for ZH-EN. For corpora that do not contain ZH-EN parallel documents (NewsCom and Europarl), we report the statistics of their largest available language pairs (DE-EN and ET-EN). w, s, and d stand for word, sentence, and document, respectively. The full list is in Tab. 5.
2 Dataset Creation

In this section, we describe the three stages of the dataset creation process: collecting bilingual parallel documents, quality control, and dataset splitting.

2.1 Bilingual Document Collection

We first select 385 Chinese web novels across multiple genres, including action, fantasy, romance, comedy, science fiction, martial arts, etc. The genre distribution is shown in Fig. 3. We then crawl their corresponding English translations from the Internet.² The English versions are translated by professional translators who are native speakers of English, and then corrected and aligned by professional editors at the chapter level. The text is converted to UTF-8, and certain data cleansing (e.g., deduplication) is performed in the process.

² https://readnovelfull.com
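As a rough illustration of this cleansing step, the sketch below performs Unicode normalization and exact-line deduplication. The actual BWB pipeline is not specified in this section, so the function and file names here are hypothetical.

```python
# A minimal sketch of the cleansing described above (Unicode normalization
# plus exact-line deduplication). Function and file names are hypothetical.
import unicodedata

def clean_chapter(lines) -> list[str]:
    """NFC-normalize lines and drop exact duplicates, preserving order."""
    seen, cleaned = set(), []
    for line in lines:
        line = unicodedata.normalize("NFC", line.strip())
        if line and line not in seen:
            seen.add(line)
            cleaned.append(line)
    return cleaned

# Reading with errors="replace" guards against stray non-UTF-8 bytes.
with open("chapter_0001.zh.txt", encoding="utf-8", errors="replace") as f:
    zh_lines = clean_chapter(f)
```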
Error Type | # | Description | Annotated
ENTITY | 43.3% | error(s) due to the mistranslation of named entities | ✓
TENSE | 38.7% | error(s) due to incorrect tense |
ZERO PRO | 17.3% | error(s) caused by the omission of pronoun(s) | ✓
AMBIGUITY | 7.3% | ambiguous span(s) that are correct in the stand-alone sentence but wrong in context | ✓
ELLIPSIS | 4.0% | error(s) caused by the omission of other span(s) | ✓
SENTENCE | 51.3% | sentence-level error(s) |
NO ERROR | 17.1% | no errors |

Table 2: The types of NMT errors and their descriptions. # represents the proportion of the error type in the BWB test set. ✓ indicates "with annotation".
Chapters that contain poetry or couplets in classical Chinese are excluded, as they are difficult to translate directly into English. Further, we exclude chapters with fewer than 5 sentences and chapters where the sequence ratio is greater than 3.0. Chapter titles are also removed, since most of them are neither translated properly nor translated at the document level. Sentence alignment is performed automatically with Bleualign³ (Sennrich and Volk, 2011). The final corpus has 384 books with 9,581,816 sentence pairs (a total of 461.8 million words).⁴

³ https://github.com/rsennrich/Bleualign
⁴ We will release a crawling and cleansing script pointing to a past web archive that will enable others to reproduce our dataset faithfully.
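To make the chapter-level filters concrete, here is a hedged sketch. Interpreting the "sequence ratio" as the ratio of sentence counts between the two sides is our assumption; the released script may define it differently (e.g., over tokens).

```python
# A hedged sketch of the chapter filters described above: drop chapters with
# fewer than 5 sentences and chapters whose sequence ratio exceeds 3.0.
# "Sequence ratio" as a ratio of sentence counts is our assumption.

def keep_chapter(zh_sents: list[str], en_sents: list[str],
                 min_sents: int = 5, max_ratio: float = 3.0) -> bool:
    shorter = min(len(zh_sents), len(en_sents))
    longer = max(len(zh_sents), len(en_sents))
    if shorter < min_sents:
        return False
    return longer / shorter <= max_ratio
```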
2.2 Quality Control

We hired four bilingual graduate students to perform quality control of the aforementioned process. These annotators were native Chinese speakers proficient in English. We randomly selected 163 chapters and asked the annotators to determine whether each document was well aligned at the sentence level by counting the number of misalignments. A misalignment is identified if, for example, line 39 in English corresponds to lines 39 and 40 in Chinese, but the tool mistakenly combined the two sentences. We observed an alignment accuracy rate of 93.1%.
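One plausible reading of this computation is sketched below, treating accuracy as the fraction of correctly aligned lines over the 163 audited chapters. Whether accuracy was computed over lines or over chapters is not specified here, so this is an assumption, and the variable names are illustrative.

```python
# A sketch of the quality-control computation under the assumption that
# accuracy is the fraction of correctly aligned lines in the audited sample.

def alignment_accuracy(misaligned_per_chapter: list[int],
                       lines_per_chapter: list[int]) -> float:
    total = sum(lines_per_chapter)
    misaligned = sum(misaligned_per_chapter)
    return 1.0 - misaligned / total
```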
2.3 Dataset Split

We construct the development and test sets by randomly selecting 80 and 79 chapters, respectively, from 6 novels that contain 3,018 chapters in total. To prevent any train-test leakage, these 6 novels are removed from the training set. Tab. 6 provides detailed statistics of the BWB dataset split. In addition, we asked the same annotators who performed the quality control to manually correct misalignments in the development and test sets; 7.3% of the lines were corrected in total.
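A hedged sketch of this split protocol follows: the 6 novels are held out entirely, and the development and test chapters are sampled from them so that no held-out novel appears in training. The data structures and seed are illustrative, not the released split.

```python
# A sketch of the leakage-free split described above. Illustrative only.
import random

def split_bwb(novels: dict[str, list[str]], held_out: set[str],
              n_dev: int = 80, n_test: int = 79, seed: int = 0):
    """novels maps a novel id to its list of chapters."""
    rng = random.Random(seed)
    train = {nid: chs for nid, chs in novels.items() if nid not in held_out}
    pool = [ch for nid in held_out for ch in novels[nid]]  # 3,018 chapters for BWB
    rng.shuffle(pool)
    return train, pool[:n_dev], pool[n_dev:n_dev + n_test]
```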
3 Dataset Analysis and Annotation

In this section, we analyze the types of translation errors that occur in sentence-level NMT outputs and annotate the BWB test set accordingly. We also provide an analysis of coherence-related properties: the number of named entities, the number of pronouns in both English and Chinese, and the relationships between these factors. The annotation was conducted by eight professional translators.
3.1 Translation Errors

The annotators were asked to identify and categorize discourse-level translation errors made by a state-of-the-art commercial NMT system, i.e., errors that are only visible in a context larger than individual sentences. The annotators followed this guideline for the error annotation:

1. Identify cases that have translation errors: label examples as NO ERROR only if they meet both the criteria of adequacy and fluency as well as the global criterion of coherence.

2. Identify whether the translation error is at the sentence level or the document level (or both): SENTENCE examples are those that are already not adequate or fluent as stand-alone sentences.

3. Categorize the DOCUMENT examples in accordance with the discourse phenomena, mark the corresponding spans in the reference (English) that cause the MT output to be incorrect, and provide the correct versions.
The types of errors are summarized in Tab. 2.
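The three-step guideline above maps naturally onto a per-example record. The schema below is a hypothetical illustration of such a record, not the released annotation format.

```python
# A hypothetical record for one annotated example, mirroring the guideline
# above. Illustration only, not the released annotation format.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ErrorAnnotation:
    example_id: str
    has_error: bool                       # step 1: False only if adequate, fluent, and coherent
    sentence_level: bool = False          # step 2: wrong already as a stand-alone sentence
    category: Optional[str] = None        # step 3: ENTITY, TENSE, ZERO PRO, AMBIGUITY, or ELLIPSIS
    reference_span: Optional[str] = None  # span in the English reference that the MT gets wrong
    correction: Optional[str] = None      # the corrected translation provided by the annotator
```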
3.2 Named Entities

Named entities (NEs) are an essential part of sentences in terms of human understanding and readability. The mistranslation of NEs can significantly degrade translation quality, even though evaluation scores (e.g., BLEU) may not be adversely affected. Therefore, we also annotate named entities in the reference documents, following a procedure similar to that of OntoNotes (Hovy et al., 2006). In total, 2,234 entities are annotated in the BWB test set.
3.3 Pronouns

Pronoun translation has been a focus of discourse-level MT evaluation (Hardmeier, 2012; Miculicich Werlen and Popescu-Belis, 2017). As shown in Tab. 3, there are significantly fewer pronouns in Chinese due to its pronoun-dropping property. This poses extra challenges for NMT, since anaphora resolution is required.

Lang | MASCULINE | FEMININE | NEUTER | EPICENE
EN | 1,633 | 2,521 | 608 | 391
ZH | 654 | 967 | 14 | 118

Table 3: The distributions of different types of pronouns in English and Chinese in the BWB test set.
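A minimal sketch of how counts like those in Tab. 3 could be collected is given below, assuming a hand-built lexicon mapping pronoun surface forms to gender classes. The lexicon is illustrative and deliberately incomplete.

```python
# A sketch of pronoun counting with an illustrative, incomplete lexicon.
from collections import Counter

EN_PRONOUNS = {
    "he": "MASCULINE", "him": "MASCULINE", "his": "MASCULINE",
    "she": "FEMININE", "her": "FEMININE", "hers": "FEMININE",
    "it": "NEUTER", "its": "NEUTER",
    "they": "EPICENE", "them": "EPICENE", "their": "EPICENE",
}

def count_pronouns(tokens: list[str], lexicon: dict[str, str]) -> Counter:
    counts = Counter()
    for tok in tokens:
        cls = lexicon.get(tok.lower())
        if cls is not None:
            counts[cls] += 1
    return counts

print(count_pronouns("She froze , then picked up her phone .".split(), EN_PRONOUNS))
# -> Counter({'FEMININE': 2})
```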
4 Experiments

We carry out an evaluation of both baseline and state-of-the-art MT models on BWB, and also provide human post-editing (PE) performance for comparison. The following 6 baselines are adopted:⁵

SMT: a phrase-based baseline (Chiang, 2007).
BING, GOOGLE, BAIDU: commercial systems.
MT-S: a Transformer baseline that translates sentence by sentence (Vaswani et al., 2017).
MT-D: a document-level NMT model that adopts two-stage training (Zhang et al., 2018).
Evaluation Metrics  Systems are evaluated with standard automatic sentence-level MT metrics (BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), BERTScore (Zhang et al., 2020)) and a document-level metric (BLONDE; Jiang et al., 2022). We also perform evaluations targeted at specific discourse phenomena.
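As an illustration of the sentence-level part of this setup, the sketch below scores a hypothesis against a reference with the sacrebleu library, which is a standard BLEU implementation but not necessarily the exact tooling used for these experiments. The example strings are illustrative; document-level BLONDE scoring is provided by the repository linked in the abstract and is not reproduced here.

```python
# A minimal corpus-level BLEU scoring sketch with sacrebleu.
import sacrebleu

hypotheses = ["Joe clenched his fist and bowed his head."]
references = [["Qiao Lian clenched her fists and lowered her head."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```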
⁵ Professional translators were hired to conduct post-editing on the BING outputs. They were instructed to correct only discourse-level errors, with minimal modification. MT-S and MT-D are trained on BWB with fairseq (Ott et al., 2019); the training details are in App. C.