A Bilingual Parallel Corpus with Discourse Annotations

Yuchen Eleanor Jiangζ  Tianyu Liuζ  Shuming Maγ  Dongdong Zhangγ  Mrinmaya Sachanζ  Ryan Cotterellζ
ζ ETH Zürich   γ Microsoft Research Asia
{yuchen.jiang,tianyu.liu,ryan.cotterell,mrinmaya.sachan}@inf.ethz.ch
{shuming.ma,dongdong.zhang}@microsoft.com
Abstract

Machine translation (MT) has almost achieved human parity at sentence-level translation. In response, the MT community has, in part, shifted its focus to document-level translation. However, the development of document-level MT systems is hampered by the lack of parallel document corpora. This paper describes BWB, a large parallel corpus first introduced in Jiang et al. (2022), along with an annotated test set. The BWB corpus consists of Chinese novels translated by experts into English, and the annotated test set is designed to probe the ability of machine translation systems to model various discourse phenomena. Our resource is freely available, and we hope that it will serve as a guide and inspiration for more work in the area of document-level machine translation.

https://github.com/EleanorJiang/BlonDe/tree/main/BWB
1 Introduction
Machine translation (MT) has made significant progress in the past few decades. Neural machine translation (NMT) models, which are able to leverage abundant quantities of parallel training data, have been one of the main contributors to this progress (Luong et al., 2015; Vaswani et al., 2017; Zhang et al., 2018, inter alia). Unfortunately, the majority of available parallel corpora contain sentence-level translations. As a result, models trained on these corpora translate text quite well at the sentence level, but perform poorly when the translation of an entire document is viewed in context (Voita et al., 2019b; Werlen and Sadiht, 2021). In particular, sentence-level translation models tend to omit relevant contextual information, resulting in a lack of coherence in the produced translation. For example, in Fig. 2, the sentence-level MT system fails to capture discourse dependencies across sentences, and the same concepts are not consistently referred to with the same translations (i.e., Weibo vs. micro-blog, Qiao Lian vs. Joe vs. Joe love).¹

Figure 1: Comparing the sizes of various document-level parallel corpora. BWB is the largest parallel corpus to date.
Over the past few years, there have been efforts to tackle this problem by building context-aware NMT models (Wang et al., 2017; Miculicich et al., 2018; Maruf and Haffari, 2018; Voita et al., 2019a, inter alia). Although such approaches have achieved some improvements, they nonetheless suffer from a dearth of document-level training data. Take the WMT news translation task as an example: the document-level news commentary corpus (Tiedemann, 2012) contains only 6.4M tokens, while the available sentence-level training data has around 825M tokens. To alleviate this problem, we collect a large document-level parallel corpus that consists of 196K paragraphs from Chinese novels translated into English. As shown in Fig. 1, it is, to the best of our knowledge, the largest document-level corpus. Additionally, an in-depth human analysis shows that it is very challenging for current NMT systems due to its rich discourse phenomena (see Fig. 2).
To better evaluate context-aware MT models, we further annotate the test set with characteristic discourse-level phenomena, namely ambiguity and ellipsis. The test set is specifically designed to measure models' capacity to exploit such long-range linguistic context. We then conduct systematic evaluations of several baseline models as well as human post-editing performance on the BWB corpus, and observe large gaps between NMT models and human performance. We hope that this corpus will help us understand the deficiencies of existing systems and build better systems for document-level machine translation.

¹ This discourse phenomenon is referred to as entity consistency. There are other discourse dependencies that MT fails to capture, such as tense cohesion, ellipsis, and coreference. We leave explanations of discourse phenomena to §3.1.
Figure 2: Part of a chapter in BWB, showing the Chinese source, the expert reference, and the MT output side by side. The same entities are marked with the same color, pronoun omissions are marked with brackets, mistranslated verbs are marked in teal, and mistranslated named entities are underlined. The full chapter is in Fig. 5. MT is the output of a Transformer-based sentence-level machine translation system. [The figure body is not legibly recoverable from this extraction; representative contrasts include the reference "Qiao Lian clenched her fists and lowered her head." against the MT output "Joe clenched his fist and bowed his head.", and the reference "Weibo" against the MT output "micro-blog".]
Corpus | Genre | #word | #sent | #doc | #w/s | #s/d | #w/d
IWSLT | TED talk | 4.2M | 0.2M | 2K | 19.5 | 100 | 2,100
NewsCom | news | 6.4M | 0.2M | 5K | 30.7 | 40 | 1,288
Europarl | parliament | 7.3M | 0.2M | 15K | 35.1 | 13 | 485
LDC | news | 81.8M | 2.8M | 61K | 23.7 | 46 | 1,340
OpenSub | subtitles | 16.9M | 2.2M | 3K | 5.6 | 733 | 5,647
BWB | novel (chapter) | 460.8M | 9.6M | 196K | 48.1 | 49 | 2,356
BWB | novel (book) | 460.8M | 9.6M | 384 | 48.1 | 25.0K | 1.2M

Table 1: Statistics of various document-level parallel corpora (#word, #sent, #doc give corpus size; #w/s, #s/d, #w/d give averaged lengths). For corpora that contain multiple language pairs (IWSLT and OpenSub), we report the statistics for ZH-EN. For corpora that do not contain ZH-EN parallel documents (NewsCom and Europarl), we report the statistics of their largest available language pairs (DE-EN and ET-EN). w, s, and d stand for word, sentence, and document, respectively. The full list is in Tab. 5.
2 Dataset Creation

In this section, we describe the three stages of the dataset creation process: collecting bilingual parallel documents, quality control, and dataset splitting.

2.1 Bilingual Document Collection

We first select 385 Chinese web novels across multiple genres, including action, fantasy, romance, comedy, science fiction, martial arts, etc. The genre distribution is shown in Fig. 3. We then crawl their corresponding English translations from the Internet.² The English versions are translated by professional translators who are native speakers of English, and then corrected and aligned by professional editors at the chapter level. The text is converted to UTF-8, and certain data cleansing (e.g., deduplication) is performed in the process.

² https://readnovelfull.com
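As a rough illustration of this cleansing step, the sketch below performs Unicode normalization and exact-line deduplication. The actual BWB pipeline is not specified in this section, so the function and file names here are hypothetical.

```python
# A minimal sketch of the cleansing described above (Unicode normalization
# plus exact-line deduplication). Function and file names are hypothetical.
import unicodedata

def clean_chapter(lines) -> list[str]:
    """NFC-normalize lines and drop exact duplicates, preserving order."""
    seen, cleaned = set(), []
    for line in lines:
        line = unicodedata.normalize("NFC", line.strip())
        if line and line not in seen:
            seen.add(line)
            cleaned.append(line)
    return cleaned

# Reading with errors="replace" guards against stray non-UTF-8 bytes.
with open("chapter_0001.zh.txt", encoding="utf-8", errors="replace") as f:
    zh_lines = clean_chapter(f)
```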
Error Type | # | Description | Annotated
ENTITY | 43.3% | error(s) due to the mistranslation of named entities | ✓
TENSE | 38.7% | error(s) due to incorrect tense |
ZERO PRO | 17.3% | error(s) caused by the omission of pronoun(s) | ✓
AMBIGUITY | 7.3% | ambiguous span(s) that are correct in the stand-alone sentence but wrong in context | ✓
ELLIPSIS | 4.0% | error(s) caused by the omission of other span(s) | ✓
SENTENCE | 51.3% | sentence-level error(s) |
NO ERROR | 17.1% | no errors |

Table 2: The types of NMT errors and their descriptions. # represents the proportion of the error type in the BWB test set. ✓ indicates "with annotation".
Chapters that contain poetry or couplets in classical Chinese are excluded, as they are difficult to translate directly into English. Further, we exclude chapters with fewer than 5 sentences and chapters where the sequence ratio is greater than 3.0. Chapter titles are also removed, since most of them are neither translated properly nor translated at the document level. Sentence alignment is performed automatically with Bleualign³ (Sennrich and Volk, 2011). The final corpus has 384 books with 9,581,816 sentence pairs (a total of 461.8 million words).⁴

³ https://github.com/rsennrich/Bleualign
⁴ We will release a crawling and cleansing script pointing to a past web archive that will enable others to reproduce our dataset faithfully.
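To make the chapter-level filters concrete, here is a hedged sketch. Interpreting the "sequence ratio" as the ratio of sentence counts between the two sides is our assumption; the released script may define it differently (e.g., over tokens).

```python
# A hedged sketch of the chapter filters described above: drop chapters with
# fewer than 5 sentences and chapters whose sequence ratio exceeds 3.0.
# "Sequence ratio" as a ratio of sentence counts is our assumption.

def keep_chapter(zh_sents: list[str], en_sents: list[str],
                 min_sents: int = 5, max_ratio: float = 3.0) -> bool:
    shorter = min(len(zh_sents), len(en_sents))
    longer = max(len(zh_sents), len(en_sents))
    if shorter < min_sents:
        return False
    return longer / shorter <= max_ratio
```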
2.2 Quality Control

We hired four bilingual graduate students to perform quality control of the aforementioned process. These annotators were native Chinese speakers proficient in English. We randomly selected 163 chapters and asked the annotators to determine whether each document was well aligned at the sentence level by counting the number of misalignments. A misalignment is identified if, for example, line 39 in English corresponds to lines 39 and 40 in Chinese, but the tool mistakenly combined the two sentences. We observed an alignment accuracy rate of 93.1%.
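One plausible reading of this computation is sketched below, treating accuracy as the fraction of correctly aligned lines over the 163 audited chapters. Whether accuracy was computed over lines or over chapters is not specified here, so this is an assumption, and the variable names are illustrative.

```python
# A sketch of the quality-control computation under the assumption that
# accuracy is the fraction of correctly aligned lines in the audited sample.

def alignment_accuracy(misaligned_per_chapter: list[int],
                       lines_per_chapter: list[int]) -> float:
    total = sum(lines_per_chapter)
    misaligned = sum(misaligned_per_chapter)
    return 1.0 - misaligned / total
```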
2.3 Dataset Split

We construct the development and test sets by randomly selecting 80 and 79 chapters, respectively, from 6 novels that contain 3,018 chapters in total. To prevent any train-test leakage, these 6 novels are removed from the training set. Tab. 6 provides detailed statistics of the BWB dataset split. In addition, we asked the same annotators who performed the quality control to manually correct misalignments in the development and test sets; 7.3% of the lines were corrected in total.
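A hedged sketch of this split protocol follows: the 6 novels are held out entirely, and the development and test chapters are sampled from them so that no held-out novel appears in training. The data structures and seed are illustrative, not the released split.

```python
# A sketch of the leakage-free split described above. Illustrative only.
import random

def split_bwb(novels: dict[str, list[str]], held_out: set[str],
              n_dev: int = 80, n_test: int = 79, seed: int = 0):
    """novels maps a novel id to its list of chapters."""
    rng = random.Random(seed)
    train = {nid: chs for nid, chs in novels.items() if nid not in held_out}
    pool = [ch for nid in held_out for ch in novels[nid]]  # 3,018 chapters for BWB
    rng.shuffle(pool)
    return train, pool[:n_dev], pool[n_dev:n_dev + n_test]
```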
3 Dataset Analysis and Annotation

In this section, we analyze the types of translation errors that occur in sentence-level NMT outputs and annotate the BWB test set accordingly. We also provide an analysis of coherence-related properties: the number of named entities, the number of pronouns in both English and Chinese, and the relationships between these factors. The annotation was conducted by eight professional translators.
3.1 Translation Errors

The annotators were asked to identify and categorize discourse-level translation errors made by a state-of-the-art commercial NMT system, i.e., errors that are only visible in a context larger than individual sentences. The annotators followed this guideline for the error annotation:

1. Identify cases that have translation errors: label examples as NO ERROR only if they meet both the criteria of adequacy and fluency as well as the global criterion of coherence.

2. Identify whether the translation error is at the sentence level or the document level (or both): SENTENCE examples are those that are already not adequate or fluent as stand-alone sentences.

3. Categorize the DOCUMENT examples in accordance with the discourse phenomena, mark the corresponding spans in the reference (English) that cause the MT output to be incorrect, and provide the correct versions.
The types of errors are summarized in Tab. 2.
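The three-step guideline above maps naturally onto a per-example record. The schema below is a hypothetical illustration of such a record, not the released annotation format.

```python
# A hypothetical record for one annotated example, mirroring the guideline
# above. Illustration only, not the released annotation format.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ErrorAnnotation:
    example_id: str
    has_error: bool                       # step 1: False only if adequate, fluent, and coherent
    sentence_level: bool = False          # step 2: wrong already as a stand-alone sentence
    category: Optional[str] = None        # step 3: ENTITY, TENSE, ZERO PRO, AMBIGUITY, or ELLIPSIS
    reference_span: Optional[str] = None  # span in the English reference that the MT gets wrong
    correction: Optional[str] = None      # the corrected translation provided by the annotator
```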
3.2 Named Entities

Named entities (NEs) are an essential part of sentences in terms of human understanding and readability. The mistranslation of NEs can significantly degrade translation quality, even though evaluation scores (e.g., BLEU) may not be adversely affected. Therefore, we also annotate named entities in the reference documents, following a procedure similar to that of OntoNotes (Hovy et al., 2006). In total, 2,234 entities are annotated in the BWB test set.
3.3 Pronouns

Pronoun translation has been a focus of discourse-level MT evaluation (Hardmeier, 2012; Miculicich Werlen and Popescu-Belis, 2017). As shown in Tab. 3, there are significantly fewer pronouns in Chinese due to its pronoun-dropping property. This poses extra challenges for NMT, since anaphora resolution is required.

Lang | MASCULINE | FEMININE | NEUTER | EPICENE
EN | 1,633 | 2,521 | 608 | 391
ZH | 654 | 967 | 14 | 118

Table 3: The distributions of different types of pronouns in English and Chinese in the BWB test set.
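A minimal sketch of how counts like those in Tab. 3 could be collected is given below, assuming a hand-built lexicon mapping pronoun surface forms to gender classes. The lexicon is illustrative and deliberately incomplete.

```python
# A sketch of pronoun counting with an illustrative, incomplete lexicon.
from collections import Counter

EN_PRONOUNS = {
    "he": "MASCULINE", "him": "MASCULINE", "his": "MASCULINE",
    "she": "FEMININE", "her": "FEMININE", "hers": "FEMININE",
    "it": "NEUTER", "its": "NEUTER",
    "they": "EPICENE", "them": "EPICENE", "their": "EPICENE",
}

def count_pronouns(tokens: list[str], lexicon: dict[str, str]) -> Counter:
    counts = Counter()
    for tok in tokens:
        cls = lexicon.get(tok.lower())
        if cls is not None:
            counts[cls] += 1
    return counts

print(count_pronouns("She froze , then picked up her phone .".split(), EN_PRONOUNS))
# -> Counter({'FEMININE': 2})
```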
4 Experiments

We carry out an evaluation of both baseline and state-of-the-art MT models on BWB, and also provide human post-editing (PE) performance for comparison. The following 6 baselines are adopted:⁵

SMT: a phrase-based baseline (Chiang, 2007).
BING, GOOGLE, BAIDU: commercial systems.
MT-S: a Transformer baseline that translates sentence by sentence (Vaswani et al., 2017).
MT-D: a document-level NMT model that adopts two-stage training (Zhang et al., 2018).
Evaluation Metrics  Systems are evaluated with standard automatic sentence-level MT metrics (BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), BERTScore (Zhang et al., 2020)) and a document-level metric (BLONDE; Jiang et al., 2022). We also perform evaluations targeted at specific discourse phenomena.
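As an illustration of the sentence-level part of this setup, the sketch below scores a hypothesis against a reference with the sacrebleu library, which is a standard BLEU implementation but not necessarily the exact tooling used for these experiments. The example strings are illustrative; document-level BLONDE scoring is provided by the repository linked in the abstract and is not reproduced here.

```python
# A minimal corpus-level BLEU scoring sketch with sacrebleu.
import sacrebleu

hypotheses = ["Joe clenched his fist and bowed his head."]
references = [["Qiao Lian clenched her fists and lowered her head."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```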
⁵ Professional translators were hired to conduct post-editing on the BING outputs. They were instructed to correct only discourse-level errors, with minimal modification. MT-S and MT-D are trained on BWB with fairseq (Ott et al., 2019); the training details are in App. C.