
LongtoNotes: OntoNotes with Longer Coreference Chains
Kumar Shridhar†  Nicholas Monath∗‡  Raghuveer Thirukovalluru♦
Alessandro Stolfo†  Manzil Zaheer¶  Andrew McCallum‡  Mrinmaya Sachan†
†ETH Zürich  ‡UMass Amherst  ♦Duke University  ¶Google
shkumar@ethz.ch
∗Now at Google.
Abstract
OntoNotes has served as the most important benchmark for coreference resolution. However, for ease of annotation, several long documents in OntoNotes were split into smaller parts. In this work, we build a corpus of coreference-annotated documents of significantly longer length than what is currently available. We do so by providing an accurate, manually curated merging of annotations from documents that were split into multiple parts in the original OntoNotes annotation process (Pradhan et al., 2013). The resulting corpus, which we call LongtoNotes, contains documents in multiple genres of the English language with varying lengths, the longest of which are up to 8x the length of documents in OntoNotes and 2x those in LitBank. We evaluate state-of-the-art neural coreference systems on this new corpus, analyze how model architectures/hyperparameters and document length affect the performance and efficiency of the models, and demonstrate areas of improvement in long-document coreference modeling revealed by our new corpus. Our data and code are available at: https://github.com/kumar-shridhar/LongtoNotes.
1 Introduction
Coreference resolution is an important problem in discourse with applications in knowledge-base construction (Luan et al., 2018), question answering (Reddy et al., 2019), and reading assistants (Azab et al., 2013; Head et al., 2021). In many such settings, the documents of interest are significantly longer and/or span a wider variety of domains than the currently available corpora with coreference annotation (Pradhan et al., 2013; Bamman et al., 2019; Mohan and Li, 2019; Cohen et al., 2017).
Figure 1: Comparing average document length. Long documents in genres such as broadcast conversations (bc) were split into smaller parts in OntoNotes. Our proposed dataset, LongtoNotes, restores documents to their original form, revealing dramatic increases in length in certain genres.

The OntoNotes corpus (Pradhan et al., 2013) is perhaps the most widely used benchmark for coreference (Lee et al., 2013; Durrett and Klein, 2013; Wiseman et al., 2016; Lee et al., 2017; Joshi et al., 2020; Toshniwal et al., 2020b; Thirukovalluru et al., 2021; Kirstain et al., 2021). The construction process for OntoNotes, however, resulted in documents with an artificially reduced length. For ease of annotation, longer documents were split into smaller parts, and each part was annotated separately and treated as an independent document (Pradhan et al., 2013). The result is a corpus in which certain genres, such as broadcast conversation (bc), have greatly reduced length compared to their original form (Figure 1). As a result, the long, bursty spread of coreference chains in these documents is missing from the evaluation benchmark.
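The splitting is visible directly in the data: in the CoNLL-2012 release of OntoNotes, each part of a split document appears under its own "#begin document (<doc id>); part <nnn>" header. As a purely illustrative sketch (not part of our annotation pipeline; the function name and file path are hypothetical), the following Python snippet tallies how many parts each source document was divided into:

    from collections import defaultdict

    def count_parts(conll_path):
        """Count how many parts each OntoNotes document was split into,
        based on the '#begin document (<doc id>); part <nnn>' headers
        of the CoNLL-2012 file format."""
        parts = defaultdict(set)
        with open(conll_path, encoding="utf-8") as f:
            for line in f:
                if line.startswith("#begin document"):
                    # e.g. "#begin document (bc/cctv/00/cctv_0000); part 000"
                    head, _, part_id = line.strip().rpartition("; part ")
                    doc_id = head[len("#begin document ("):-1]
                    parts[doc_id].add(part_id)
        return {doc: len(p) for doc, p in parts.items()}

Documents for which this count is greater than one are exactly those whose parts LongtoNotes merges back into a single, full-length document.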
In this work, we present an extension to the OntoNotes corpus, called LongtoNotes. LongtoNotes combines the coreference annotations in the various parts of the same document, leading to a full-document coreference annotation. This was done by our annotation team, which was carefully trained to follow the annotation guidelines laid out in the original OntoNotes corpus (§3). This led to a dataset where the average document length is over 40% longer than the standard OntoNotes benchmark and the average size of coreference