
LongtoNotes: OntoNotes with Longer Coreference Chains
Kumar Shridhar†  Nicholas Monath∗‡  Raghuveer Thirukovalluru♦
Alessandro Stolfo†  Manzil Zaheer¶  Andrew McCallum‡  Mrinmaya Sachan†
†ETH Zürich  ‡UMass Amherst  ♦Duke University  ¶Google
shkumar@ethz.ch
∗Now at Google.
Abstract
OntoNotes has served as the most important benchmark for coreference resolution. However, for ease of annotation, several long documents in OntoNotes were split into smaller parts. In this work, we build a corpus of coreference-annotated documents of significantly longer length than what is currently available. We do so by providing an accurate, manually curated merging of annotations from documents that were split into multiple parts in the original OntoNotes annotation process (Pradhan et al., 2013). The resulting corpus, which we call LongtoNotes, contains documents in multiple genres of the English language with varying lengths, the longest of which are up to 8x the length of documents in OntoNotes and 2x those in LitBank. We evaluate state-of-the-art neural coreference systems on this new corpus, analyze how model architectures/hyperparameters and document length affect the performance and efficiency of the models, and demonstrate areas of improvement in long-document coreference modeling revealed by our new corpus. Our data and code are available at: https://github.com/kumar-shridhar/LongtoNotes.
1 Introduction
Coreference resolution is an important problem in discourse with applications in knowledge-base construction (Luan et al., 2018), question answering (Reddy et al., 2019), and reading assistants (Azab et al., 2013; Head et al., 2021). In many such settings, the documents of interest are significantly longer and/or span a wider variety of domains than the currently available corpora with coreference annotation (Pradhan et al., 2013; Bamman et al., 2019; Mohan and Li, 2019; Cohen et al., 2017).
Figure 1: Comparing average document length. Long documents in genres such as broadcast conversations (bc) were split into smaller parts in OntoNotes. Our proposed dataset, LongtoNotes, restores documents to their original form, revealing dramatic increases in length in certain genres.

The OntoNotes corpus (Pradhan et al., 2013) is perhaps the most widely used benchmark for coreference (Lee et al., 2013; Durrett and Klein, 2013; Wiseman et al., 2016; Lee et al., 2017; Joshi et al., 2020; Toshniwal et al., 2020b; Thirukovalluru et al., 2021; Kirstain et al., 2021). The construction process for OntoNotes, however, resulted in documents with an artificially reduced length. For ease of annotation, longer documents were split into smaller parts, and each part was annotated separately and treated as an independent document (Pradhan et al., 2013). The result is a corpus in which certain genres, such as broadcast conversation (bc), have greatly reduced length compared to their original form (Figure 1). As a result, the long, bursty spread of coreference chains in these documents is missing from the evaluation benchmark.
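The splitting is visible directly in the data: in the CoNLL-2012 release of OntoNotes, each part of a split document appears under its own "#begin document (<doc id>); part <nnn>" header. As a purely illustrative sketch (not part of our annotation pipeline; the function name and file path are hypothetical), the following Python snippet tallies how many parts each source document was divided into:

    from collections import defaultdict

    def count_parts(conll_path):
        """Count how many parts each OntoNotes document was split into,
        based on the '#begin document (<doc id>); part <nnn>' headers
        of the CoNLL-2012 file format."""
        parts = defaultdict(set)
        with open(conll_path, encoding="utf-8") as f:
            for line in f:
                if line.startswith("#begin document"):
                    # e.g. "#begin document (bc/cctv/00/cctv_0000); part 000"
                    head, _, part_id = line.strip().rpartition("; part ")
                    doc_id = head[len("#begin document ("):-1]
                    parts[doc_id].add(part_id)
        return {doc: len(p) for doc, p in parts.items()}

Documents for which this count is greater than one are exactly those whose parts LongtoNotes merges back into a single, full-length document.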
In this work, we present an extension to the OntoNotes corpus, called LongtoNotes. LongtoNotes combines the coreference annotations in the various parts of the same document, leading to a full-document coreference annotation. This was done by our annotation team, which was carefully trained to follow the annotation guidelines laid out in the original OntoNotes corpus (§3). This led to a dataset where the average document length is over 40% longer than the standard OntoNotes benchmark and the average size of coreference