
Cross-document Event Coreference Search: Task, Dataset and Modeling

Alon Eirew1,2 Avi Caciularu1 Ido Dagan1

1Bar Ilan University, Ramat-Gan, Israel 2Intel Labs, Israel

alon.eirew@intel.com

avi.c33@gmail.com

dagan@cs.biu.ac.il

Abstract

The task of Cross-document Coreference Resolution has traditionally been formulated as identifying all coreference links across a given set of documents. We propose an appealing, and often more applicable, complementary setup for the task, Cross-document Coreference Search, focusing in this paper on event coreference. Concretely, given a mention in context of an event of interest, considered as a query, the task is to find all coreferring mentions for the query event in a large document collection. To support research on this task, we create a corresponding dataset, which is derived from Wikipedia while leveraging annotations in the available Wikipedia Event Coreference dataset (WEC-Eng). Observing that the coreference search setup is largely analogous to the setting of Open Domain Question Answering, we adapt the prominent Dense Passage Retrieval (DPR) model to our setting, as an appealing baseline. Finally, we present a novel model that integrates a powerful coreference scoring scheme into the DPR architecture, yielding improved performance.

1 Introduction

Cross-Document Event Coreference (CDEC) resolution is the task of identifying clusters of text mentions, across multiple texts, that refer to the same event. For example, consider the following two underlined event mentions from the WEC-Eng CDEC dataset (Eirew et al., 2021):

...On 14 April 2010, an earthquake struck the prefecture, registering a magnitude of 6.9 (USGS, EMSC) or 7.1 (Xinhua). It originated in the Yushu Tibetan Autonomous Prefecture...

...a school mostly for Tibetan orphans in Chindu County, Qinghai, after the 2010 Yushu earthquake destroyed the old school...

Figure 1: Example of Coreference Search. Provided with a query passage containing a mention of interest, a coreference search system retrieves from a large corpus the best candidate passages containing mentions coreferring with the query.

Both event mentions refer to the same earthquake, as can be determined by the shared event arguments (2010, Yushu, Tibetan). In event coreference resolution, the goal is to cluster event mentions that refer to the same event, whether within a single document or across a document collection.

With the growing number of documents describing real-world events and event-oriented information, the need for efficient methods for accessing such information is apparent. Successful and efficient identification, clustering, and access to event-related information may benefit a broad range of applications at the multi-text level that need to match and integrate information across documents, such as multi-document summarization (Falke et al., 2017; Liao et al., 2018), multi-hop question answering (Dhingra et al., 2018; Wang et al., 2019) and Knowledge Base Population (KBP) (Lin et al., 2020).

arXiv:2210.12654v1 [cs.CL] 23 Oct 2022

Currently, the CDEC task, as framed in the corresponding datasets, is aimed at creating models that exhaustively resolve all coreference links in a given dataset. However, a realistic application scenario may require efficiently searching for and extracting coreferring mentions of only specific events of interest. A typical use case is a user reading a text and encountering an event of interest (for example, the plane crash event in Figure 1), and then wishing to further explore and learn about the event from a large document collection.

To address such needs, we propose an appealing, and often more applicable, complementary setup for the task, Cross-document Coreference Search (Figure 1), focusing in this paper on event coreference. Concretely, given a mention in context of an event of interest, considered as a query, the task is to find all coreferring mentions for the query event in a large corpus.

This coreference search use case cannot currently be addressed, for two main reasons: (1) existing CDEC datasets are too small to realistically represent a search task; (2) current CDEC models, which are designed to resolve all coreference links in a given dataset, are computationally inapplicable to the much larger search space required by realistic coreference search scenarios.

To facilitate research on this setup, we present a large dataset, derived from Wikipedia, by leveraging existing annotations in the Wikipedia Event Coreference dataset (WEC) (Eirew et al., 2021). Our curated dataset resembles in structure an Open-domain QA (ODQA) dataset (Berant et al., 2013; Baudiš and Šedivý, 2015; Joshi et al., 2017; Kwiatkowski et al., 2019; Rajpurkar et al., 2016), containing a set of coreference queries and a large passage collection for retrieval.

Observing that the coreference search setup is largely analogous to the setting of Open Domain Question Answering, we adapt the prominent Dense Passage Retrieval (DPR) model to our setting, as an appealing baseline. Further, motivated to integrate coreference modeling into DPR, we adapted components inspired by a prominent within-document end-to-end coreference resolution model (Lee et al., 2017), which was previously also applied to the CDEC task (Cattan et al., 2020). Thus, we developed an integrated model that leverages components from both DPR and the coreference model of Lee et al. (2017). Our novel model yields substantially improved performance on several important evaluation metrics.

Our dataset and code are released for open access.

2 Background

In this section, we first describe the Cross-Document Event Coreference (CDEC) task, datasets and models (§2.1) and then review the common open-domain QA model architecture (§2.2).

2.1 Cross-Document Event Coreference Resolution

ECB+ (Cybulska and Vossen, 2014) is the most commonly used dataset for training and testing models for cross-document event coreference resolution. This corpus consists of documents partitioned into 43 clusters, each corresponding to a certain news topic. ECB+ is relatively small: on average only 1.9 sentences per document were selected for annotation, yielding only 722 non-singleton coreference clusters in total (that is, clusters containing more than a single event mention, while singleton clusters correspond to mentions that do not hold a coreference relation with any other mention in the data).
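The distinction between singleton and non-singleton clusters can be illustrated with a minimal sketch. The cluster identifiers and the helper name `count_non_singleton_clusters` below are hypothetical and chosen only for illustration; they are not taken from ECB+ or any released tooling.

```python
from collections import Counter


def count_non_singleton_clusters(mention_cluster_ids):
    """Count clusters that contain more than one mention."""
    sizes = Counter(mention_cluster_ids)
    return sum(1 for size in sizes.values() if size > 1)


# Hypothetical toy data: one cluster id per annotated mention.
toy_mentions = [
    "quake_2010", "quake_2010",                # two coreferring mentions
    "election_2008",                           # a singleton
    "crash_1996", "crash_1996", "crash_1996",  # three coreferring mentions
]
non_singletons = count_non_singleton_clusters(toy_mentions)
```

Here `election_2008` is a singleton, so only the two remaining clusters are counted.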

Since annotating a CDEC dataset is a very challenging task, several annotation methods create a CDEC dataset semi-automatically by taking advantage of available resources. The Gun Violence Corpus (GVC) (Vossen et al., 2018) leveraged a structured database recording gun violence events to create an annotation scheme for gun-violence-related events. In total, GVC annotated 7,298 mentions distributed into 1,046 non-singleton clusters.

More recently, WEC-Eng (Eirew et al., 2021) and HyperCoref (Bugert and Gurevych, 2021) leveraged article hyperlinks pointing to the same concept in order to create an automatic annotation process. This annotation scheme helped HyperCoref curate 2.7M event mentions distributed among 0.8M event clusters, extracted from news articles. The smaller WEC-Eng curates 43,672 event mentions distributed among 7,597 non-singleton clusters. Unlike HyperCoref, the WEC-Eng development set (containing 1,250 mentions and 233 clusters) and test set (containing 1,893 mentions and 322 clusters) have gone through a manual validation process (see Table 1), ensuring their high quality.

1 https://huggingface.co/datasets/Intel/CoreSearch
2 https://github.com/AlonEirew/CoreSearch
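The core idea of hyperlink-based annotation, treating anchors that point to the same target page as mentions of the same event, can be sketched as follows. This is only a simplified illustration under assumed inputs: the triples, the helper name `cluster_by_link_target` and the target page names are hypothetical, and the actual WEC-Eng and HyperCoref pipelines include filtering and validation steps that are not reproduced here.

```python
from collections import defaultdict


def cluster_by_link_target(anchors):
    """Group (passage_id, anchor_text, target_page) triples by target page.

    Each resulting group is a candidate coreference cluster: all anchors
    in it point to the same concept page.
    """
    clusters = defaultdict(list)
    for passage_id, anchor_text, target_page in anchors:
        clusters[target_page].append((passage_id, anchor_text))
    return dict(clusters)


# Hypothetical toy anchors; page names are illustrative only.
toy_anchors = [
    ("p1", "earthquake", "2010 Yushu earthquake"),
    ("p2", "2010 Yushu earthquake", "2010 Yushu earthquake"),
    ("p3", "the crash", "Hypothetical plane crash page"),
]
clusters = cluster_by_link_target(toy_anchors)
```

In this toy input, the two anchors linking to the earthquake page form one non-singleton cluster, while the third anchor remains a singleton.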

All the above-mentioned datasets are targeted at models which exhaustively resolve all coreference links within a given dataset (Barhom et al., 2019; Meged et al., 2020; Cattan et al., 2020; Caciularu et al., 2021; Yu et al., 2020; Held et al., 2021; Allaway et al., 2021; Hsu and Horwood, 2022). This setting resembles the within-document coreference resolution setting, where all links are similarly resolved exhaustively in a given single document. However, while within-document coreference resolution is confined to a single document, cross-document coreference resolution (CDCR) might relate to an unbounded multi-text search space (e.g., news articles, Wikipedia articles, court and police records and so on). To that end, we aim at a task and dataset for modeling CDEC as a search problem. To provide a large corpus for a realistic representation of such a task, while ensuring reliable development and test sets, we adopted WEC-Eng as the basis for our dataset creation (§3).

Within-Document Coreference Resolution. Recent within-document coreference resolution models (Lee et al., 2018; Joshi et al., 2019; Kantor and Globerson, 2019; Wu et al., 2020) were inspired by the end-to-end model architecture introduced by Lee et al. (2017). In particular, two distinct components were adopted in those works, which were shown to be effective in detecting mentions and their coreference relations, both in the within-document and cross-document (Cattan et al., 2020) settings. In our proposed model, we similarly adopt those two components to better represent coreference relations in the coreference search setting.

2.2 Open-Domain Question Answering

Open-domain question answering (ODQA) (Voorhees, 1999) is concerned with answering factoid questions based on a large collection of documents. Modern open-domain QA systems have been restructured and simplified by combining information retrieval (IR) techniques and neural reading comprehension models (Chen et al., 2017). In those approaches, a retriever component finds documents that might contain an answer from a large collection of documents, followed by a reader component that finds a candidate answer in a given document (Lee et al., 2019; Yang et al., 2019; Karpukhin et al., 2020).

3 The larger magnitude of HyperCoref makes it a suitable candidate for our CoreSearch. However, since HyperCoref is not publicly released, we could not evaluate on it. We leave this part to future work.

                   Mentions   Non-Singleton Clusters
WEC-Eng (train)    40,529     7,042
WEC-Eng (dev)      1,250      216
WEC-Eng (test)     1,893      306

Table 1: WEC-Eng Dataset Statistics. Mentions: The total number of event mentions within the corresponding section. Non-Singleton Clusters: Number of event clusters containing more than a single event mention.

                               Train      Dev        Test       Total
WEC-Eng Validated Data
# Clusters                     237        49         236        522
# Passages (with Mentions)     1,503      341        1,266      3,110
# Added Distractor Passages    922,736    923,376    923,746    2,769,858
# Total Passages               924,239    923,717    925,012    2,772,968

Table 2: CoreSearch dataset statistics.
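The retriever step of such two-step systems can be sketched in a few lines. In DPR-style retrieval, the query and the passages are encoded into vectors and passages are ranked by inner product with the query vector. The sketch below is only an illustration of that ranking scheme: the bag-of-words `encode` function is a toy stand-in (an assumption made here for self-containment) for the trained neural encoders used in practice, and all names and passages are hypothetical.

```python
def build_vocab(texts):
    """Map each distinct lowercased token to a vector index."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab):
    """Toy stand-in for a neural encoder: a bag-of-words count vector."""
    vec = [0.0] * len(vocab)
    for token in text.lower().split():
        if token in vocab:
            vec[vocab[token]] += 1.0
    return vec


def dot(u, v):
    """Inner product, the similarity used to rank passages."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(query, passages, k=2):
    """Return indices of the k passages scoring highest against the query."""
    vocab = build_vocab(passages + [query])
    q_vec = encode(query, vocab)
    scored = [(dot(q_vec, encode(p, vocab)), i) for i, p in enumerate(passages)]
    # Sort by descending score; break ties by passage index for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]


# Hypothetical toy collection.
passages = [
    "the 2010 yushu earthquake destroyed the old school",
    "the election was held in november",
    "an earthquake struck the prefecture in 2010",
]
top = retrieve("earthquake in yushu 2010", passages, k=2)
```

A reader component would then inspect only the passages in `top` to extract the answer span, which is what keeps the approach tractable over a large collection.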

We observe that the Cross-Document Event Coreference Search (CDES) setting resembles the ODQA task. Specifically, given a passage containing a mention of interest, considered as a query, CDES is concerned with finding mentions coreferring with the query event in a large document collection. To facilitate research in this task, we created a dataset similar in structure to ODQA datasets (Berant et al., 2013; Baudiš and Šedivý, 2015; Joshi et al., 2017; Kwiatkowski et al., 2019; Rajpurkar et al., 2016), and established a suitable model resembling in architecture the recent two-step (retriever/reader) systems, as described in the following sections.

3 The CoreSearch Dataset

We formulated the Cross-Document Event Coreference Search task following a similar approach to open-domain question answering (illustrated in Figure 1). Specifically, given a query containing a marked target event mention, along with a passage collection, the goal is to retrieve all the passages from the passage collection that contain an event mention coreferring with the query event, and extract the coreferring mention span of each retrieved passage.
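The input and expected output of the task formulated above can be made concrete with a small sketch. The class names, field names and toy passages below are hypothetical and serve only to illustrate the formulation; they do not reflect the released dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Passage:
    passage_id: str
    text: str
    mention_span: Optional[Tuple[int, int]] = None  # character offsets, if any
    cluster_id: Optional[str] = None                # gold event cluster, if any


def make_passage(passage_id, text, mention=None, cluster_id=None):
    """Build a Passage, locating the mention's character span in the text."""
    span = None
    if mention is not None:
        start = text.index(mention)
        span = (start, start + len(mention))
    return Passage(passage_id, text, span, cluster_id)


def gold_results(query: Passage, collection: List[Passage]):
    """Return (passage_id, mention_text) for each passage coreferring with the query."""
    results = []
    for p in collection:
        if p.cluster_id is not None and p.cluster_id == query.cluster_id:
            start, end = p.mention_span
            results.append((p.passage_id, p.text[start:end]))
    return results


# Hypothetical toy instance: one query, one coreferring passage, two distractors.
query = make_passage("q1", "On 14 April 2010, an earthquake struck the prefecture",
                     mention="earthquake", cluster_id="yushu_2010")
collection = [
    make_passage("p1", "after the 2010 Yushu earthquake destroyed the old school",
                 mention="2010 Yushu earthquake", cluster_id="yushu_2010"),
    make_passage("p2", "The election was held in November"),
    make_passage("p3", "The storm flooded the valley",
                 mention="storm", cluster_id="storm_x"),
]
expected = gold_results(query, collection)
```

A coreference search system is expected to produce the same pairs as `expected` without access to the gold `cluster_id` values: it has to retrieve `p1` from among the distractors and then extract the mention span within it.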

To facilitate research on this task, we present a large dataset, derived from Wikipedia, termed CoreSearch. In this section we describe the CoreSearch dataset structure (§3.1), followed by the structure of a single query instance (§3.2).
