to ‘complete’ the corpus by combining crowdsourc-
ing with automatic annotation. Only about 70% of
documents in the corpus were completely anno-
tated by the players. The proposed method (i) uses
an anaphoric resolver to automatically annotate all
mentions, including the few still unannotated; (ii)
aggregates the resulting judgments using a proba-
bilistic aggregation method for anaphora, and (iii)
uses the resulting expanded dataset to retrain the
anaphoric resolver. We show that the resolve-and-aggregate method results in models with higher accuracy than models trained using only the completely annotated data, or using the full corpus without this completion step.
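As a rough illustration, the loop can be sketched as follows. This is a minimal sketch only: the resolver.predict / resolver.train interface and the aggregate() helper are hypothetical placeholders standing in for the actual resolver and the probabilistic aggregation model.

    def resolve_and_aggregate(documents, player_judgments, resolver):
        # (i) automatically annotate all mentions, including those
        #     the players left unannotated (hypothetical resolver API)
        model_judgments = {doc: resolver.predict(doc) for doc in documents}
        # (ii) merge player and model judgments into silver labels;
        #      aggregate() stands in for the probabilistic aggregation
        #      method for anaphora described above
        silver_labels = {doc: aggregate(player_judgments.get(doc, []),
                                        model_judgments[doc])
                         for doc in documents}
        # (iii) retrain the resolver on the expanded, completed dataset
        resolver.train(documents, silver_labels)
        return resolver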
2 Background
Anaphorically annotated corpora
A number of anaphorically annotated datasets have now been released, covering several languages (Hinrichs et al., 2005; Hendrickx et al., 2008; Recasens and Martí, 2010; Pradhan et al., 2012; Landragin, 2016; Nedoluzhko et al., 2016; Cohen et al., 2017; Chen et al., 2018; Bamman et al., 2020; Uryupina et al., 2020; Zeldes, 2020), turning anaphora / coreference into a very active area of research (Pradhan et al., 2012; Fernandes et al., 2014; Wiseman et al., 2015; Lee et al., 2017, 2018; Yu et al., 2020; Joshi et al., 2020). However, only a few of these
resources are genuinely large in terms of markables
(Pradhan et al., 2012; Cohen et al., 2017), and most are focused on news, with only a few corpora covering other genres such as scientific articles (e.g., CRAFT (Cohen et al., 2017)), fiction (e.g., LitBank (Bamman et al., 2020) and Phrase Detectives (Poesio et al., 2019)), and Wikipedia (e.g., WikiCoref (Ghaddar and Langlais, 2016) or again Phrase Detectives (Poesio et al., 2019)). Important genres such as dialogue are barely covered (Muzerelle et al., 2014; Khosla et al., 2021). There is evidence that this concentration on a single genre, and on ONTONOTES in particular, does not result in models that generalize well (Xia and Van Durme, 2021).
Existing resources are also limited in terms of
coverage. Most recent datasets are based on general-purpose annotation schemes with a clear linguistic foundation, but the largest ones especially focus on the simplest cases of anaphora / coreference (e.g., singletons and non-referring expressions are not
annotated in ONTONOTES). And the documents
found in existing corpora tend to be short, with the exception of CRAFT: e.g., the average document length in tokens is 329 in PRECO, 467 in ONTONOTES, 630 in ARRAU, and 753 in Phrase Detectives.
Scaling up anaphoric annotation
One ap-
proach to scale up anaphoric reference annotation
is using fully automatic methods to either anno-
tate a dataset, such as AMALGUM (Gessler et al.,
2020), or create a benchmark from scratch, such
as KNOWREF (Emami et al., 2019). While entirely automatic annotation may result in datasets of arbitrarily large size, such annotations cannot expand coverage to aspects of anaphoric reference that current models do not already handle well. And creating large-scale benchmarks for specific issues from scratch has not so far been shown to result in datasets reflecting the variety and richness of real texts.
Crowdsourcing has emerged as the dominant
paradigm for annotation in NLP (Snow et al., 2008; Poesio et al., 2017) because of its reduced costs and increased speed in comparison with traditional annotation. But the costs of truly large-scale annotation are still prohibitive even for crowdsourcing (Poesio et al., 2013, 2017). To address this issue, a number of approaches have been developed to optimize the use of crowdsourcing for coreference annotation. In particular, active learning has been used to reduce the amount of annotation work needed (Laws et al., 2012; Li et al., 2020; Yuan et al., 2022). Another issue is that anaphoric reference is a complex type of annotation whose hardest aspects require quality control typically not available with microtask crowdsourcing.
Games-With-A-Purpose
A form of crowdsourcing which has been widely used to address the issues of cost and quality is Games-With-A-Purpose (GWAP) (von Ahn, 2006; Cooper et al., 2010; Lafourcade et al., 2015): a version of crowdsourcing in which labelling is carried out through a game, so that workers are rewarded with enjoyment rather than financially. GWAPs were proposed as a solution for large-scale data labelling, and a number of them were developed for NLP, including Jeux de Mots (Lafourcade, 2007; Joubert et al., 2018), Phrase Detectives (Chamberlain et al., 2008; Poesio et al., 2013), OntoGalaxy (Krause et al., 2010), the Wordrobe platform (Basile et al., 2012), Dr Detective (Dumitrache et al., 2013), Zombilingo (Fort et al., 2014), TileAttack! (Madge et al., 2017), Wormingo (Kicikoglu et al., 2019), Name That Language! (Cieri et al., 2021) or High School Super-