to ‘complete’ the corpus by combining crowdsourc-
ing with automatic annotation. Only about 70% of
documents in the corpus were completely anno-
tated by the players. The proposed method (i) uses
an anaphoric resolver to automatically annotate all
mentions, including the few still unannotated; (ii)
aggregates the resulting judgments using a proba-
bilistic aggregation method for anaphora, and (iii)
uses the resulting expanded dataset to retrain the
anaphoric resolver. We show that the resolve-and-aggregate method results in models with higher accuracy than models trained using only the completely annotated data, or using the full corpus without this completion step.
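As a rough illustration, the loop can be sketched as follows. This is a minimal sketch only: the resolver.predict / resolver.train interface and the aggregate() helper are hypothetical placeholders standing in for the actual resolver and the probabilistic aggregation model.

    def resolve_and_aggregate(documents, player_judgments, resolver):
        # (i) automatically annotate all mentions, including those
        #     the players left unannotated (hypothetical resolver API)
        model_judgments = {doc: resolver.predict(doc) for doc in documents}
        # (ii) merge player and model judgments into silver labels;
        #      aggregate() stands in for the probabilistic aggregation
        #      method for anaphora described above
        silver_labels = {doc: aggregate(player_judgments.get(doc, []),
                                        model_judgments[doc])
                         for doc in documents}
        # (iii) retrain the resolver on the expanded, completed dataset
        resolver.train(documents, silver_labels)
        return resolver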
2 Background
Anaphorically annotated corpora
A number of anaphorically annotated datasets have now been released, covering several languages (Hinrichs et al., 2005; Hendrickx et al., 2008; Recasens and Martí, 2010; Pradhan et al., 2012; Landragin, 2016; Nedoluzhko et al., 2016; Cohen et al., 2017; Chen et al., 2018; Bamman et al., 2020; Uryupina et al., 2020; Zeldes, 2020), turning anaphora / coreference into a very active area of research (Pradhan et al., 2012; Fernandes et al., 2014; Wiseman et al., 2015; Lee et al., 2017, 2018; Yu et al., 2020; Joshi et al., 2020). However, only a few of these
resources are genuinely large in terms of markables
(Pradhan et al., 2012; Cohen et al., 2017), and most are focused on news, with only a few corpora covering other genres such as scientific articles (e.g., CRAFT (Cohen et al., 2017)), fiction (e.g., LitBank (Bamman et al., 2020) and Phrase Detectives (Poesio et al., 2019)), and Wikipedia (e.g., WikiCoref (Ghaddar and Langlais, 2016) or again Phrase Detectives (Poesio et al., 2019)). Important genres such as dialogue are barely covered (Muzerelle et al., 2014; Khosla et al., 2021). There is evidence that this concentration on a single genre, and on ONTONOTES in particular, does not result in models that generalize well (Xia and Van Durme, 2021).
Existing resources are also limited in terms of
coverage. Most recent datasets are based on general-purpose annotation schemes with a clear linguistic foundation, but the largest ones especially focus on the simplest cases of anaphora / coreference (e.g., singletons and non-referring expressions are not
annotated in ONTONOTES). And the documents
found in existing corpora tend to be short, with the exception of CRAFT: e.g., the average document length in tokens is 329 in PRECO, 467 in ONTONOTES, 630 in ARRAU, and 753 in Phrase Detectives.
Scaling up anaphoric annotation
One ap-
proach to scale up anaphoric reference annotation
is using fully automatic methods to either anno-
tate a dataset, such as AMALGUM (Gessler et al.,
2020), or create a benchmark from scratch, such
as KNOWREF (Emami et al., 2019). While entirely automatic annotation may result in datasets of arbitrarily large size, such annotations cannot expand coverage to aspects of anaphoric reference that current models do not already handle well. And creating large-scale benchmarks for specific issues from scratch has not so far been shown to result in datasets reflecting the variety and richness of real texts.
Crowdsourcing has emerged as the dominant
paradigm for annotation in NLP (Snow et al., 2008; Poesio et al., 2017) because of its reduced costs and increased speed in comparison with traditional annotation. But the costs of truly large-scale annotation are still prohibitive even for crowdsourcing (Poesio et al., 2013, 2017). To address this issue, a number of approaches have been developed to optimize the use of crowdsourcing for coreference annotation. In particular, active learning has been used to reduce the amount of annotation work needed (Laws et al., 2012; Li et al., 2020; Yuan et al., 2022). Another issue is that anaphoric reference is a complex type of annotation whose hardest aspects require quality control typically not available with microtask crowdsourcing.
Games-With-A-Purpose
A form of crowdsourcing which has been widely used to address the issues of cost and quality is Games-With-A-Purpose (GWAP) (von Ahn, 2006; Cooper et al., 2010; Lafourcade et al., 2015): a version of crowdsourcing in which labelling is carried out through a game, so that workers are rewarded with enjoyment rather than financially. GWAPs were proposed as a solution for large-scale data labelling, and a number of them were developed for NLP, including Jeux de Mots (Lafourcade, 2007; Joubert et al., 2018), Phrase Detectives (Chamberlain et al., 2008; Poesio et al., 2013), OntoGalaxy (Krause et al., 2010), the Wordrobe platform (Basile et al., 2012), Dr Detective (Dumitrache et al., 2013), Zombilingo (Fort et al., 2014), TileAttack! (Madge et al., 2017), Wormingo (Kicikoglu et al., 2019), Name That Language! (Cieri et al., 2021) or High School Super-