Aggregating Crowdsourced and Automatic Judgments to Scale Up a
Corpus of Anaphoric Reference for Fiction and Wikipedia Texts
Juntao Yu1, Silviu Paun2, Maris Camilleri1, Paloma Carretero Garcia1,
Jon Chamberlain1, Udo Kruschwitz3 and Massimo Poesio4
1University of Essex, UK; 2Amazon Research, Romania;
3University of Regensburg, Germany; 4Queen Mary University of London, UK
j.yu@essex.ac.uk; silviupn@amazon.com; mcamil@essex.ac.uk; pcarre@essex.ac.uk;
jchamb@essex.ac.uk; udo.kruschwitz@ur.de; m.poesio@qmul.ac.uk
(Work was done prior to joining Amazon Research.)
Abstract
Although several datasets annotated for anaphoric reference / coreference exist, even the largest such datasets have limitations in terms of size, range of domains, coverage of anaphoric phenomena, and size of the documents included. Yet the approaches proposed to scale up anaphoric annotation have not so far resulted in datasets overcoming these limitations. In this paper, we introduce a new release of a corpus for anaphoric reference labelled via a game-with-a-purpose. This new release is comparable in size to the largest existing corpora for anaphoric reference, due in part to substantial activity by the players and in part to the use of a new resolve-and-aggregate paradigm to ‘complete’ markable annotations through the combination of an anaphoric resolver and an aggregation method for anaphoric reference. The proposed method could be adopted to greatly speed up annotation in other projects involving games-with-a-purpose. In addition, the corpus covers genres for which no comparably sized datasets exist (Fiction and Wikipedia); it covers singletons and non-referring expressions; and it includes a substantial number of long documents (>2K tokens in length).
1 Introduction
Many resources annotated for anaphoric reference / coreference exist, but even the largest such datasets, such as ONTONOTES (Pradhan et al., 2012), have a number of limitations. The largest resources are still medium scale (e.g., ONTONOTES (Pradhan et al., 2012) is 1.5M tokens, as is CRAFT (Cohen et al., 2017)). They only cover a limited range of domains, primarily news (as in ONTONOTES) and scientific articles (as in CRAFT), and models trained on these datasets have been shown not to generalize well to other domains (Xia and Durme, 2021). (The largest existing corpus for English, the 10M-word PRECO (Chen et al., 2018), consists of language-learning texts, but models trained on this genre have proven to have even worse performance on other domains.) The range of anaphoric phenomena covered is also narrow (Poesio et al., 2016).
Several proposals have been made to scale up anaphoric annotation in terms of size, range of domains, and phenomena covered, including automatic data augmentation (Emami et al., 2019; Gessler et al., 2020; Aloraini and Poesio, 2021) and crowdsourcing, either combined with active learning (Laws et al., 2012; Li et al., 2020; Yuan et al., 2022) or carried out through Games-With-A-Purpose (Chamberlain et al., 2008; Hladká et al., 2009; Bos et al., 2017; Kicikoglu et al., 2019). However, the largest existing anaphoric corpora created using Games-With-A-Purpose (e.g., Poesio et al., 2019) are still smaller than the largest resources created with traditional methods, and the corpora created using data augmentation techniques are focused on specific aspects of anaphoric reference. In order to use such approaches to create resources of the required scale in terms of size, variety, and range of phenomena covered, novel methods are required.
The first contribution of this paper is the Phrase Detectives 3.0 corpus of anaphoric reference annotated using a Game-With-A-Purpose. This corpus is comparable in size in tokens (1.37M) to the ONTONOTES corpus (Pradhan et al., 2012), but contains twice the number of markables. Its annotation scheme also covers singletons and non-referring expressions; it is focused on two genres, fiction and Wikipedia articles, that are not covered in ONTONOTES and for which only much smaller datasets exist; and it includes documents ranging from short to fairly long (up to 14K tokens), enabling research on NLP for long documents (Beltagy et al., 2020).
The second contribution of the paper is a new iterative resolve-and-aggregate approach developed to ‘complete’ the corpus by combining crowdsourcing with automatic annotation. Only about 70% of the documents in the corpus were completely annotated by the players. The proposed method (i) uses an anaphoric resolver to automatically annotate all mentions, including the few still unannotated; (ii) aggregates the resulting judgments using a probabilistic aggregation method for anaphora; and (iii) uses the resulting expanded dataset to retrain the anaphoric resolver. We show that the resolve-and-aggregate method results in models with higher accuracy than models trained using only the completely annotated data, or using the full corpus not completed with the method.
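To make the procedure concrete, the following is a minimal sketch of how such a resolve-and-aggregate loop could be organized. The code is illustrative only: the arguments train_resolver and aggregate stand in for an anaphoric resolver and a probabilistic aggregation model for anaphora, the data structures are assumptions, and the fixed number of iterations is used purely for exposition; none of this reproduces the actual implementation used to build the corpus.

```python
from typing import Callable, Dict, Sequence, Tuple

def resolve_and_aggregate(
    documents: Sequence,          # documents, each exposing .markables
    crowd_judgments: Dict,        # markable id -> list of player judgments (may be empty)
    train_resolver: Callable,     # (documents, labels) -> resolver; resolver(markable) -> judgment
    aggregate: Callable,          # (crowd_judgments, system_judgments) -> one label per markable
    n_iterations: int = 3,
) -> Tuple[Callable, Dict]:
    """Hypothetical sketch of the iterative resolve-and-aggregate loop."""
    # Start from a resolver trained only on the markables the players did annotate.
    labels = {m_id: js for m_id, js in crowd_judgments.items() if js}
    resolver = train_resolver(documents, labels)

    for _ in range(n_iterations):
        # (i) The resolver labels every mention, including those that
        #     received no player judgments.
        system_judgments = {
            m.id: resolver(m) for doc in documents for m in doc.markables
        }
        # (ii) Crowd and system judgments are aggregated so that every
        #      markable ends up with a single adjudicated (silver) label.
        silver_labels = aggregate(crowd_judgments, system_judgments)
        # (iii) The resolver is retrained on the completed corpus.
        resolver = train_resolver(documents, silver_labels)

    return resolver, silver_labels
```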
2 Background
Anaphorically annotated corpora
A number of anaphorically annotated datasets have now been released, covering a number of languages (Hinrichs et al., 2005; Hendrickx et al., 2008; Recasens and Martí, 2010; Pradhan et al., 2012; Landragin, 2016; Nedoluzhko et al., 2016; Cohen et al., 2017; Chen et al., 2018; Bamman et al., 2020; Uryupina et al., 2020; Zeldes, 2020), turning anaphora / coreference into a very active area of research (Pradhan et al., 2012; Fernandes et al., 2014; Wiseman et al., 2015; Lee et al., 2017, 2018; Yu et al., 2020; Joshi et al., 2020). However, only a few of these resources are genuinely large in terms of markables (Pradhan et al., 2012; Cohen et al., 2017), and most are focused on news, with only a few corpora covering other genres such as scientific articles (e.g., CRAFT (Cohen et al., 2017)), fiction (e.g., LitBank (Bamman et al., 2020) and Phrase Detectives (Poesio et al., 2019)), and Wikipedia (e.g., WikiCoref (Ghaddar and Langlais, 2016) or, again, Phrase Detectives (Poesio et al., 2019)). Important genres such as dialogue are barely covered (Muzerelle et al., 2014; Khosla et al., 2021). There is evidence that this concentration on a single genre, and on ONTONOTES in particular, does not result in models that generalize well (Xia and Durme, 2021).
Existing resources are also limited in terms of coverage. Most recent datasets are based on general-purpose annotation schemes with a clear linguistic foundation, but the largest ones in particular focus on the simplest cases of anaphora / coreference (e.g., singletons and non-referring expressions are not annotated in ONTONOTES). And the documents found in existing corpora tend to be short, with the exception of CRAFT: e.g., average document length is 329 tokens in PRECO, 467 in ONTONOTES, 630 in ARRAU, and 753 in Phrase Detectives.
Scaling up anaphoric annotation
One approach to scaling up anaphoric reference annotation is to use fully automatic methods, either to annotate a dataset, as in AMALGUM (Gessler et al., 2020), or to create a benchmark from scratch, as in KNOWREF (Emami et al., 2019). While entirely automatic annotation may result in datasets of arbitrarily large size, such annotations cannot expand current models' coverage to aspects of anaphoric reference they do not already handle well. And creating large-scale benchmarks from scratch for specific issues has not so far been shown to result in datasets reflecting the variety and richness of real texts.

Crowdsourcing has emerged as the dominant paradigm for annotation in NLP (Snow et al., 2008; Poesio et al., 2017) because of its reduced costs and increased speed in comparison with traditional annotation. But the costs of really large-scale annotation are still prohibitive even for crowdsourcing (Poesio et al., 2013, 2017). To address this issue, a number of approaches have been developed to optimize the use of crowdsourcing for coreference annotation. In particular, active learning has been used to reduce the amount of annotation work needed (Laws et al., 2012; Li et al., 2020; Yuan et al., 2022). Another issue is that anaphoric reference is a complex type of annotation whose most complex aspects require special quality control, typically not available with microtask crowdsourcing.
Games-With-A-Purpose
A form of crowdsourcing which has been widely used to address the issues of cost and quality is the Game-With-A-Purpose (GWAP) (von Ahn, 2006; Cooper et al., 2010; Lafourcade et al., 2015). GWAPs are the version of crowdsourcing in which labelling is carried out through a game, so that the reward for the workers is enjoyment rather than financial; they were proposed as a solution for large-scale data labelling. A number of GWAPs have been developed for NLP, including Jeux de Mots (Lafourcade, 2007; Joubert et al., 2018), Phrase Detectives (Chamberlain et al., 2008; Poesio et al., 2013), OntoGalaxy (Krause et al., 2010), the Wordrobe platform (Basile et al., 2012), Dr Detective (Dumitrache et al., 2013), Zombilingo (Fort et al., 2014), TileAttack! (Madge et al., 2017), Wormingo (Kicikoglu et al., 2019), Name That Language! (Cieri et al., 2021) and High School Superhero (Bonetti and Tonelli, 2021). GWAPs for coreference include Phrase Detectives (Chamberlain et al., 2008; Poesio et al., 2013), the Pointers game in Wordrobe (Bos et al., 2017) and Wormingo (Kicikoglu et al., 2019), all deployed, and PlayCoref (Hladká et al., 2009), proposed but not tested.

However, whereas truly successful GWAPs such as FOLDIT have been developed in other areas of science (Cooper et al., 2010), even the most successful GWAPs for NLP have only collected moderate amounts of data (Poesio et al., 2019; Joubert et al., 2018). In part, this is because the games used to actually collect NLP labels are not very entertaining, which has led to efforts to develop more engaging designs (Jurgens and Navigli, 2014; Dziedzic and Włodarczyk, 2017; Madge et al., 2019).
An interesting solution to this issue was proposed, although not fully developed, for Wordrobe (Bos et al., 2017). The solution is a hybrid between automatic annotation and crowdsourcing: a combination of crowd-produced and automatically computed judgments is aggregated to ensure that every item has at least one label. This solution was not properly tested in Wordrobe, which collected only very few judgments and over a small corpus; moreover, the approach could not be applied to anaphora / coreference, due to the lack of a suitable aggregation mechanism for that task. In this paper we propose a method for aggregating crowd and automatic judgments inspired by this idea, but using an aggregation method designed for anaphora, and tested on a dataset containing a very large number of anaphoric judgments.
3 Phrase Detectives
The Phrase Detectives Game-With-A-Purpose (Chamberlain et al., 2008; Poesio et al., 2013; Chamberlain, 2016; Poesio et al., 2019) was designed to collect multiple judgments about anaphoric reference.
Game design
Phrase Detectives does not follow the design of some of the original von Ahn games (von Ahn and Dabbish, 2008), in that it is a one-person game and is not timed; both competition and timing were found to have orthogonal effects on the quality of the annotation (Chamberlain, 2016). Points are used as the main incentive, with weekly and monthly leaderboards being displayed.
Players play two different games: one aimed at labelling new data, the other at validating judgments expressed by the other players. In the annotation game, Name the Culprit, the player provides an anaphoric judgment about a highlighted markable (the possible judgments according to the annotation scheme are discussed next). If different participants enter different interpretations for a markable, then each interpretation is presented to other participants in the validation game, Detectives Conference, in which the participants have to agree or disagree with the interpretation.
Every item is annotated by at least 8 players (20 on average), and each distinct interpretation is validated by at least four players. Players get points for each label they produce, and especially when their interpretation is agreed upon by other players, thus rewarding accuracy. Initially, players play against gold data and are periodically evaluated against the gold; when they achieve a sufficient level of accuracy, they start seeing incompletely annotated data. Extensive analyses of the data suggest that, although there is a great number of noisy judgments, this simple training and validation method delivers extremely accurate aggregated labels (Poesio et al., 2013; Chamberlain, 2016; Poesio et al., 2019).
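As a rough illustration of how annotation and validation judgments relate to a final label for a markable, the toy function below scores each candidate interpretation by the number of players who entered it, plus the validations in its favour and minus the disagreements, and returns the best-supported one. This raw vote count and the label strings in the example are invented purely for exposition; the released corpus is built with a probabilistic aggregation model rather than simple counts.

```python
from collections import Counter
from typing import List, Tuple

def best_interpretation(
    annotations: List[str],               # labels entered in the annotation game
    validations: List[Tuple[str, bool]],  # (label, agreed?) pairs from the validation game
) -> str:
    """Toy aggregation: pick the interpretation with the strongest raw support."""
    support = Counter(annotations)
    for label, agreed in validations:
        support[label] += 1 if agreed else -1
    return support.most_common(1)[0][0]

# Hypothetical markable with 8 annotations and 4 validations.
labels = ["same-as:mention-17"] * 5 + ["discourse-new"] * 3
checks = [("same-as:mention-17", True), ("same-as:mention-17", True),
          ("discourse-new", False), ("discourse-new", True)]
print(best_interpretation(labels, checks))  # -> same-as:mention-17
```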
Annotation Scheme
The annotation scheme used in Phrase Detectives is a simplified version of the ARRAU annotation scheme (Uryupina et al., 2020), covering all the main aspects of anaphoric annotation: the distinction between referring and non-referring expressions (all noun phrases are annotated as either referring or non-referring, with two types of non-referring expressions being annotated, expletives and predicative NPs); the distinction between discourse-new and discourse-old referring expressions (Prince, 1992); and the annotation of all types of identity reference (including split-antecedent plural anaphora). Only the most complex types of anaphoric reference (bridging references and discourse deixis) are not annotated. The main differences between the annotation scheme used in Phrase Detectives and those used in ARRAU, ONTONOTES, and PRECO are summarized in Table 1, modelled on a similar table in Chen et al. (2018). In the Phrase Detectives corpus, predication and coreference are clearly distinguished, as in ONTONOTES and ARRAU but unlike in PRECO. Singletons are considered markables. Expletives and split-antecedent plurals are marked, unlike in either ONTONOTES or PRECO.
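For illustration, the snippet below lists the kinds of judgments the scheme calls for on two invented sentences; the sentences, field names, and label strings are hypothetical and do not reproduce the markup of the released corpus.

```python
# Two invented sentences and the judgments the scheme would elicit for them.
examples = [
    "It is raining, so John and Mary took their umbrellas.",
    "John is a teacher.",
]

markables = [
    # Non-referring expressions: expletives and predicative NPs.
    {"span": "It",        "type": "non-referring", "subtype": "expletive"},
    {"span": "a teacher", "type": "non-referring", "subtype": "predicative"},
    # Referring expressions, marked as discourse-new or discourse-old.
    {"span": "John",  "type": "referring", "status": "discourse-new"},
    {"span": "Mary",  "type": "referring", "status": "discourse-new"},
    # Split-antecedent plural anaphora: "their" refers to John and Mary jointly.
    {"span": "their", "type": "referring", "status": "discourse-old",
     "antecedents": ["John", "Mary"]},
]
# Singletons remain markables; bridging reference and discourse deixis
# are the only phenomena left unannotated.
```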
Possibly the most distinctive feature of the annotation scheme is that disagreements among annotators are preserved, encoding a form of implicit ambiguity.