LongtoNotes: OntoNotes with Longer Coreference Chains
Kumar Shridhar Nicholas Monath Raghuveer Thirukovalluru
Alessandro Stolfo Manzil Zaheer Andrew McCallum Mrinmaya Sachan
ETH Zürich UMass Amherst Duke University Google
shkumar@ethz.ch
Abstract
OntoNotes has served as the most important benchmark for coreference resolution. However, for ease of annotation, several long documents in OntoNotes were split into smaller parts. In this work, we build a corpus of coreference-annotated documents of significantly longer length than what is currently available. We do so by providing an accurate, manually curated merging of annotations from documents that were split into multiple parts in the original OntoNotes annotation process (Pradhan et al., 2013). The resulting corpus, which we call LongtoNotes, contains documents in multiple genres of the English language with varying lengths, the longest of which are up to 8x the length of documents in OntoNotes, and 2x those in LitBank. We evaluate state-of-the-art neural coreference systems on this new corpus, analyze how model architectures, hyperparameters, and document length relate to the performance and efficiency of the models, and demonstrate areas of improvement in long-document coreference modeling revealed by our new corpus. Our data and code are available at: https://github.com/kumar-shridhar/LongtoNotes.
1 Introduction
Coreference resolution is an important problem in discourse with applications in knowledge-base construction (Luan et al., 2018), question answering (Reddy et al., 2019), and reading assistants (Azab et al., 2013; Head et al., 2021). In many such settings, the documents of interest are significantly longer and/or span a wider variety of domains than those in the currently available corpora with coreference annotation (Pradhan et al., 2013; Bamman et al., 2019; Mohan and Li, 2019; Cohen et al., 2017).
The OntoNotes corpus (Pradhan et al., 2013) is perhaps the most widely used benchmark for coreference (Lee et al., 2013; Durrett and Klein, 2013; Wiseman et al., 2016; Lee et al., 2017; Joshi et al., 2020; Toshniwal et al., 2020b; Thirukovalluru et al., 2021; Kirstain et al., 2021). The construction process for OntoNotes, however, resulted in documents with an artificially reduced length. For ease of annotation, longer documents were split into smaller parts, and each part was annotated separately and treated as an independent document (Pradhan et al., 2013). The result is a corpus in which certain genres, such as broadcast conversation (bc), have greatly reduced length compared to their original form (Figure 1). As a result, the long, bursty spread of coreference chains in these documents is missing from the evaluation benchmark.
Now at Google.
Figure 1: Comparing Average Document Length. Long documents in genres such as broadcast conversation (bc) were split into smaller parts in OntoNotes. Our proposed dataset, LongtoNotes, restores documents to their original form, revealing dramatic increases in length in certain genres.
In this work, we present an extension to the OntoNotes corpus, called LongtoNotes. LongtoNotes combines coreference annotations in various parts of the same document, leading to a full-document coreference annotation. This was done by our annotation team, which was carefully trained to follow the annotation guidelines laid out in the original OntoNotes corpus (§3). This led to a dataset in which documents are on average over 40% longer than in the standard OntoNotes benchmark and the average size of coreference chains increased by 25%. While other datasets such as LitBank (Bamman et al., 2019) and CRAFT (Cohen et al., 2017) focus on long documents in specialized domains, LongtoNotes comprises documents in multiple genres (Table 1).
arXiv:2210.03650v1 [cs.CL] 7 Oct 2022
To illustrate the usefulness of LongtoNotes, we evaluate state-of-the-art coreference resolution models (Kirstain et al., 2021; Toshniwal et al., 2020b; Joshi et al., 2020) on the corpus and analyze the performance in terms of document length (§4.2). We illustrate how model architecture decisions and hyperparameters that support long-range dependencies have the greatest impact on coreference performance and, importantly, that these differences are only visible on LongtoNotes and are not seen in OntoNotes (§4.3). LongtoNotes also presents a challenge in scaling coreference models, as prediction time and memory requirements increase substantially on the long documents (§4.4).
2 Our Contribution: LongtoNotes
We present LongtoNotes, a corpus that extends the English coreference annotation in the OntoNotes Release 5.0 corpus¹ (Pradhan et al., 2013) to provide annotations for longer documents. In the original English OntoNotes corpus, genres such as broadcast conversation (bc) and telephone conversation (tc) contain long documents that were divided into smaller parts to facilitate annotation. LongtoNotes is constructed by collecting annotations that combine within-part coreference chains into coreference chains over the entire long document. The annotation procedure, in which annotators merge coreference chains, is described and analyzed in Section 3.
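Before chains from different parts can be merged, the parts have to share one token index space. The following is a minimal sketch of that preprocessing step, not the released conversion script: the dictionary layout (`tokens`, `chains`) and the function name `concatenate_parts` are assumptions made for illustration, and mentions are assumed to be inclusive `(start, end)` token spans.

```python
from typing import Dict, List, Tuple

Mention = Tuple[int, int]  # (start, end) token indices, end inclusive (assumed layout)


def concatenate_parts(parts: List[dict]) -> Tuple[List[str], Dict[Tuple[int, int], List[Mention]]]:
    """Concatenate the parts of one document and shift mention spans.

    Each part is a dict with (hypothetical) keys:
      "tokens": list of token strings
      "chains": dict mapping a within-part chain id to its list of mentions
    Returns the full token list and a dict keyed by (part_index, chain_id),
    so chains from different parts stay distinct until they are merged.
    """
    tokens: List[str] = []
    chains: Dict[Tuple[int, int], List[Mention]] = {}
    for part_idx, part in enumerate(parts):
        offset = len(tokens)  # number of tokens in all earlier parts
        for chain_id, mentions in part["chains"].items():
            chains[(part_idx, chain_id)] = [
                (start + offset, end + offset) for start, end in mentions
            ]
        tokens.extend(part["tokens"])
    return tokens, chains
```

Keying chains by `(part_index, chain_id)` is a design choice of this sketch: within-part chain ids typically restart in every part, so the part index is needed to keep them apart.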
The divided parts of a long document in OntoNotes are all assigned to the same partition (train/dev/test). This allows LongtoNotes to maintain the same train/dev/test partition, at the document level, as OntoNotes (Appendix, Table 10). The size of these partitions, however, does change as the divided parts are combined into a single annotated text in LongtoNotes. We will release scripts to convert OntoNotes to LongtoNotes in both CoNLL and CorefUD (Universal Dependencies)² formats under the Creative Commons 4.0 license.
We refer to LongtoNotes_s as the subset of LongtoNotes comprising only long documents (i.e., documents merged by the annotators).
¹ The Arabic and Chinese parts of the OntoNotes dataset are not considered in our study. See Appendix A.3.
² https://ufal.mff.cuni.cz/corefud
Figure 2: Document and Coref Chain Length. The number of coreference chains increases with the increase in token length in LongtoNotes.
2.1 Length of Documents in LongtoNotes
The average number of tokens per document (rounded to the nearest integer) in LongtoNotes is 674, 44% higher than in OntoNotes (466). Table 1 breaks down the changes in document length by genre. We observe that the genre with the longest documents is broadcast conversation, with 4071 tokens per document, a dramatic increase from the divided parts in OntoNotes, which averaged 511 tokens per document in the same genre. The number of coreference chains and the number of mentions per chain grow as well.
The long documents that were split into multiple parts during the original OntoNotes annotation are not evenly distributed among the genres of text present in the corpus. In particular, the text categories broadcast news (bn) and newswire (nw) consist exclusively of short, non-split documents, which were not affected by the LongtoNotes merging process. A detailed distribution of which documents are merged in LongtoNotes is provided in Table 9 in the Appendix.
2.2 Number of Coreference Chains
As a consequence of the increase in document length, LongtoNotes presents a higher number of coreference chains per document (16) compared to OntoNotes (12). Figure 2 shows the length and number of coreference chains for each document in the two corpora. As expected, the number of chains in a document tends to get larger as the document size increases.
For genres with longer average document lengths, like broadcast conversation (bc), the increase in the number of chains is as high as 85%, while this increase is only 25% for the pivot (pt) genre, whose documents are comparatively shorter. It is worth noting that the majority of documents had a number of chains in the range of 20 to 50, and only about 20 documents out of 3493 in the OntoNotes dataset had more than 50 chains per document. For LongtoNotes, the number increases to 96 documents. A comparison of the number of chains per document between OntoNotes and LongtoNotes is shown in Figure 3.
Figure 3: Number of Chains per Document. A histogram log plot reveals the long-tailed distribution of the number of coreference chains present per document in LongtoNotes. OntoNotes contains more documents with fewer chains.
2.3 Number of Mentions per Chain
The number of mentions per coreference chain in LongtoNotes is over 30% higher than in OntoNotes. This is primarily because of longer documents and an increase in the number of coreference chains per document. Mentions per chain increase with document length. For the broadcast conversation (bc) genre, the increase in mentions per chain is the highest at 87%, while for the pivot (pt) genre (Old Testament and New Testament text) it is only 30%, as it has shorter documents.
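The corpus-level statistics discussed in Sections 2.2 and 2.3 can be computed with a few lines of code. The sketch below is illustrative only: it assumes each document is represented as a list of chains and each chain as a list of mentions, and the function names `chain_statistics` and `relative_increase` are hypothetical helpers, not part of any released tooling.

```python
from typing import List, Sequence, Tuple


def chain_statistics(documents: Sequence[Sequence[Sequence[object]]]) -> Tuple[float, float]:
    """Return (average chains per document, average mentions per chain).

    `documents` is a list of documents; each document is a list of chains;
    each chain is a list of mentions (any mention representation works).
    """
    n_docs = len(documents)
    n_chains = sum(len(doc) for doc in documents)
    n_mentions = sum(len(chain) for doc in documents for chain in doc)
    if n_docs == 0 or n_chains == 0:
        raise ValueError("need at least one document and one chain")
    return n_chains / n_docs, n_mentions / n_chains


def relative_increase(new: float, old: float) -> float:
    """Percentage increase of `new` over `old`."""
    return 100.0 * (new - old) / old


# Toy data (not taken from the corpus): two split parts, each with one
# two-mention chain, versus one merged document holding a four-mention chain.
split_parts: List[list] = [[["m1", "m2"]], [["m3", "m4"]]]
merged_doc: List[list] = [[["m1", "m2", "m3", "m4"]]]

_, mentions_split = chain_statistics(split_parts)
_, mentions_merged = chain_statistics(merged_doc)
print(relative_increase(mentions_merged, mentions_split))
```

In the toy example, merging two chains that refer to the same entity doubles the mentions per chain, which is the mechanism behind the increases reported above.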
2.4 Distances to the Antecedents
For each coreference chain, we analyzed the distance between the mentions and their antecedents. The largest distance from a mention to its antecedent grew 3x for the LongtoNotes dataset when compared to OntoNotes, from 4,885 to 11,473 tokens. Figure 4 shows a detailed breakdown of the mention-to-antecedent distance. There are no mentions that are more than 5K tokens distant from their antecedents in OntoNotes. There are 178 such mentions in LongtoNotes.
Figure 4: Distance to Antecedent. Histogram (log-scale) shows that the largest distance from a mention to its antecedent per chain increases in LongtoNotes compared to OntoNotes.
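One way to measure the mention-to-antecedent distance is sketched below. The exact definition used for Figure 4 is not spelled out here, so this sketch makes an assumption: the antecedent of a mention is the closest preceding mention in the same chain, and distance is the difference between the start token indices. The function names are hypothetical.

```python
from typing import Iterable, List, Tuple

Mention = Tuple[int, int]  # (start, end) token indices in document space


def antecedent_distances(chain: Iterable[Mention]) -> List[int]:
    """Token distance from each non-initial mention to its closest antecedent.

    Assumption: the antecedent is the nearest preceding mention in the chain,
    and distance is measured between mention start tokens.
    """
    ordered = sorted(chain)
    return [cur[0] - prev[0] for prev, cur in zip(ordered, ordered[1:])]


def max_antecedent_distance(chains: Iterable[Iterable[Mention]]) -> int:
    """Largest mention-to-antecedent distance over all chains (0 if none)."""
    return max((d for chain in chains for d in antecedent_distances(chain)), default=0)


def count_distant_mentions(chains: Iterable[Iterable[Mention]], threshold: int) -> int:
    """Number of mentions whose antecedent is more than `threshold` tokens away."""
    return sum(1 for chain in chains for d in antecedent_distances(chain) if d > threshold)
```

With this definition, a chain whose mentions start at tokens 0, 40, and 5200 yields distances of 40 and 5160, so exactly one of its mentions lies more than 5K tokens from its antecedent.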
2.5 Comparison with other Datasets
The literature contains multiple works proposing datasets for coreference resolution: WikiCoref (Ghaddar and Langlais, 2016), LitBank (Bamman et al., 2019), PreCo (Chen et al., 2018), Quiz Bowl Questions (Rodriguez et al., 2019; Guha et al., 2015), ACE corpus (Walker et al., 2006), MUC (Chinchor and Sundheim, 1995), MedMentions (Mohan and Li, 2019), inter alia. We compare LongtoNotes to these datasets in terms of the number of documents, the total number of tokens, and document length (Table 2).
LitBank is a popular long-document coreference dataset, presenting a high tokens/document ratio. However, the dataset consists of only 100 documents, which makes model development challenging. Moreover, it focuses only on the literary domain. Other datasets containing long documents (e.g., WikiCoref) are also very small in size. On the other hand, datasets consisting of a larger number of texts tend to contain shorter documents (e.g., PreCo). Thus, by building LongtoNotes, we address the scarcity of multi-genre corpora with a collection of long documents containing long-range coreference dependencies.
3 Annotation Procedure & Quality
In this section, we describe and assess the annota-
tion procedure used to build LongtoNotes.
3.1 Annotation Task
To build LongtoNotes, it suffices to successively merge chains in the current part i+1 of the document with one of the chains in the previous parts 1, ..., i.
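The successive merging described above can be sketched as follows. This is an illustration under stated assumptions, not the annotation tool itself: the annotator decisions are modeled as a hypothetical dictionary `decisions` that maps a chain in a later part to a chain in an earlier part, chains without an entry are treated as new entities, and mention spans are assumed to be already expressed in document-level token offsets.

```python
from typing import Dict, List, Tuple

Mention = Tuple[int, int]   # (start, end) in document-level token offsets
ChainKey = Tuple[int, int]  # (part_index, within-part chain id)


def merge_parts(
    part_chains: List[Dict[int, List[Mention]]],
    decisions: Dict[ChainKey, ChainKey],
) -> List[List[Mention]]:
    """Build document-level chains by processing parts in order.

    For every chain in part i, `decisions` either names a chain from an
    earlier part that it corefers with, or has no entry, in which case the
    chain starts a new document-level chain.
    """
    doc_chains: List[List[Mention]] = []
    doc_index: Dict[ChainKey, int] = {}  # within-part chain -> document chain
    for part_idx, chains in enumerate(part_chains):
        for chain_id, mentions in chains.items():
            key = (part_idx, chain_id)
            target = decisions.get(key)
            if target is None:
                # No antecedent chain in earlier parts: new entity.
                doc_index[key] = len(doc_chains)
                doc_chains.append(list(mentions))
            else:
                if target[0] >= part_idx:
                    raise ValueError("a chain may only merge into an earlier part")
                # Follow the earlier chain to its document-level chain.
                idx = doc_index[target]
                doc_index[key] = idx
                doc_chains[idx].extend(mentions)
    return doc_chains
```

Because each decision points at a chain that has already been placed, a chain in part 3 that links to a chain in part 2, which in turn was linked to part 1, ends up in the same document-level chain without any extra bookkeeping.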