Cross-document Event Coreference Search: Task, Dataset and Modeling
Alon Eirew1,2 Avi Caciularu1 Ido Dagan1
1Bar Ilan University, Ramat-Gan, Israel 2Intel Labs, Israel
alon.eirew@intel.com
avi.c33@gmail.com
dagan@cs.biu.ac.il
Abstract
The task of Cross-document Coreference Resolution has traditionally been
formulated as identifying all coreference links across a given set of
documents. We propose an appealing, and often more applicable,
complementary setup for the task: Cross-document Coreference Search,
focusing in this paper on event coreference. Concretely, given a mention
in context of an event of interest, considered as a query, the task is to
find all coreferring mentions for the query event in a large document
collection. To support research on this task, we create a corresponding
dataset, which is derived from Wikipedia while leveraging annotations in
the available Wikipedia Event Coreference dataset (WEC-Eng). Observing
that the coreference search setup is largely analogous to the setting of
Open Domain Question Answering, we adapt the prominent Dense Passage
Retrieval (DPR) model to our setting, as an appealing baseline. Finally,
we present a novel model that integrates a powerful coreference scoring
scheme into the DPR architecture, yielding improved performance.
1 Introduction
Cross-Document Event Coreference (CDEC) res-
olution is the task of identifying clusters of text
mentions, across multiple texts, that refer to the
same event. For example, consider the following
two underlined event mentions from the WEC-Eng
CDEC dataset (Eirew et al.,2021):
1. ...On 14 April 2010, an earthquake struck the prefecture, registering
   a magnitude of 6.9 (USGS, EMSC) or 7.1 (Xinhua). It originated in the
   Yushu Tibetan Autonomous Prefecture...
2. ...a school mostly for Tibetan orphans in Chindu County, Qinghai,
   after the 2010 Yushu earthquake destroyed the old school...
Figure 1: Example of Coreference Search. Provided
with a query passage containing a mention of interest, a
coreference search system retrieves from a large corpus
the best candidate passages containing mentions core-
ferring with the query.
Both event mentions refer to the same earth-
quake, as can be determined by the shared event
arguments (2010, Yushu, Tibetan). In event coref-
erence resolution, the goal is to cluster event men-
tions that refer to the same event, whether within a
single document or across a document collection.
With the growing number of documents describing real-world events and
event-oriented information, the need for efficient methods for accessing
such information is apparent. Successful and efficient identification,
clustering, and access to event-related information may benefit a broad
range of applications at the multi-text level that need to match and
integrate information across documents, such as multi-document
summarization (Falke et al., 2017; Liao et al., 2018), multi-hop question
answering (Dhingra et al., 2018; Wang et al., 2019) and Knowledge Base
Population (KBP) (Lin et al., 2020).
arXiv:2210.12654v1 [cs.CL] 23 Oct 2022
Currently, the CDEC task, as framed in the corresponding datasets, is
aimed at creating models that exhaustively resolve all coreference links
in a given dataset. However, a realistic scenario may instead require
efficiently searching for and extracting the mentions that corefer with
only specific events of interest. A typical use case is a user who is
reading a text, encounters an event of interest (for example, the plane
crash event in Figure 1), and then wishes to further explore and learn
about the event from a large document collection.
To address such needs, we propose an appealing, and often more
applicable, complementary setup for the task: Cross-document Coreference
Search (Figure 1), focusing in this paper on event coreference.
Concretely, given a mention in context of an event of interest,
considered as a query, the task is to find all coreferring mentions for
the query event in a large corpus.
Such a coreference search use case cannot currently be addressed, for
two main reasons: (1) existing CDEC datasets are too small to
realistically represent a search task; (2) current CDEC models, which
are designed to resolve all coreference links in a given dataset, are
computationally inapplicable to the much larger search space required by
realistic coreference search scenarios.
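
To make the computational argument concrete, the following back-of-the-envelope sketch compares exhaustive pairwise scoring with scoring a single query against every candidate. It assumes that an exhaustive resolver scores each unordered pair of items once, which is a simplification and not a description of any specific CDEC model. The collection size is the CoreSearch test-set passage count from Table 2, used here only as an illustrative magnitude; the function names are hypothetical.

```python
from math import comb


def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n items, i.e. n * (n - 1) / 2."""
    return comb(n, 2)


def single_query_comparisons(n: int) -> int:
    """One query scored once against each of the n candidates."""
    return n


# CoreSearch test-set passage count (Table 2), used only as an illustrative size.
n_passages = 925_012

print("exhaustive pairwise:", pairwise_comparisons(n_passages))
print("single query:       ", single_query_comparisons(n_passages))
```

Under this assumption the exhaustive setting grows quadratically with the collection size, while the per-query cost of search grows linearly.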
To facilitate research on this setup, we present a large dataset,
derived from Wikipedia, by leveraging existing annotations in the
Wikipedia Event Coreference dataset (WEC) (Eirew et al., 2021). Our
curated dataset resembles in structure an Open-domain QA (ODQA) dataset
(Berant et al., 2013; Baudiš and Šedivý, 2015; Joshi et al., 2017;
Kwiatkowski et al., 2019; Rajpurkar et al., 2016), containing a set of
coreference queries and a large passage collection for retrieval.
Observing that the coreference search setup is largely analogous to the
setting of Open Domain Question Answering, we adapt the prominent Dense
Passage Retrieval (DPR) model to our setting, as an appealing baseline.
Further, motivated to integrate coreference modeling into DPR, we
adapted components inspired by a prominent within-document end-to-end
coreference resolution model (Lee et al., 2017), which was previously
also applied to the CDEC task (Cattan et al., 2020). Thus, we developed
an integrated model that leverages components from both DPR and the
coreference model of Lee et al. (2017). Our novel model yields
substantially improved performance on several important evaluation
metrics.
Our dataset1 and code2 are released for open access.
2 Background
In this section, we first describe the Cross Doc-
ument Event Coreference (CDEC) task, datasets
and models (§2.1) and then review the common
open-domain QA model architecture (§2.2).
2.1 Cross-Document Event Coreference
Resolution
ECB+ (Cybulska and Vossen,2014) is the most
commonly used dataset for training and testing
models for cross-document event coreference res-
olution. This corpus consists of documents par-
titioned into 43 clusters, each corresponding to
a certain news topic. ECB+ is relatively small,
where on average only 1.9 sentences per document
were selected for annotation, yielding only 722
non-singleton coreference clusters in total (that is,
clusters containing more than a single event men-
tion, while singleton clusters correspond to men-
tions that do not hold a coreference relation with
any other mention in the data).
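
The distinction between singleton and non-singleton clusters can be sketched in a few lines. The mention identifiers, cluster labels, and function name below are hypothetical toy values, not taken from ECB+.

```python
from collections import Counter


def count_non_singleton_clusters(mention_to_cluster: dict[str, str]) -> int:
    """Count clusters that contain more than one mention."""
    cluster_sizes = Counter(mention_to_cluster.values())
    return sum(1 for size in cluster_sizes.values() if size > 1)


# Toy annotation: five mentions assigned to three event clusters.
toy_annotation = {
    "m1": "quake",
    "m2": "quake",
    "m3": "crash",      # singleton: no other mention refers to this event
    "m4": "election",
    "m5": "election",
}

print(count_non_singleton_clusters(toy_annotation))
```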
Since annotating a CDEC dataset is a very chal-
lenging task, several annotation methods try to
semi-automatically create a CDEC dataset by tak-
ing advantage of available resources. The Gun Vio-
lence Corpus (GVC) (Vossen et al.,2018) leveraged
a structured database recording gun violence events
for creating an annotation scheme for gun violence
related events. In total GVC annotated 7,298 men-
tions distributed into 1,046 non-singleton clusters.
More recently, WEC-Eng (Eirew et al., 2021) and HyperCoref (Bugert and
Gurevych, 2021) leveraged article hyperlinks pointing to the same
concept in order to create an automatic annotation process. This
annotation scheme helped HyperCoref curate 2.7M event mentions
distributed among 0.8M event clusters, extracted from news articles. The
smaller WEC-Eng curates 43,672 event mentions distributed among 7,597
non-singleton clusters. Unlike HyperCoref, the WEC-Eng development set
(containing 1,250 mentions and 233 clusters) and test set (containing
1,893 mentions and 322 clusters) have gone through a manual validation
process (see Table 1), ensuring their high quality.
1 https://huggingface.co/datasets/Intel/CoreSearch
2 https://github.com/AlonEirew/CoreSearch
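
The core idea of hyperlink-based annotation, treating anchor texts whose links point to the same concept as coreferring mentions, can be sketched as follows. This is a minimal illustration under assumptions: the `Mention` fields, identifiers, and link targets are hypothetical, and any filtering or validation performed by the actual WEC-Eng and HyperCoref pipelines is not shown.

```python
from collections import defaultdict
from typing import NamedTuple


class Mention(NamedTuple):
    passage_id: str   # hypothetical identifier of the passage holding the anchor
    anchor_text: str  # the hyperlinked text, treated as the event mention
    link_target: str  # hypothetical identifier of the page the link points to


def cluster_by_link_target(mentions: list[Mention]) -> dict[str, list[Mention]]:
    """Group mentions whose hyperlinks point to the same target concept."""
    clusters: dict[str, list[Mention]] = defaultdict(list)
    for mention in mentions:
        clusters[mention.link_target].append(mention)
    return dict(clusters)


toy_mentions = [
    Mention("p1", "earthquake", "2010_Yushu_earthquake"),
    Mention("p2", "2010 Yushu earthquake", "2010_Yushu_earthquake"),
    Mention("p3", "crash", "some_plane_crash"),
]

for target, members in cluster_by_link_target(toy_mentions).items():
    print(target, [m.anchor_text for m in members])
```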
All the above-mentioned datasets are targeted at models that
exhaustively resolve all coreference links within a given dataset
(Barhom et al., 2019; Meged et al., 2020; Cattan et al., 2020; Caciularu
et al., 2021; Yu et al., 2020; Held et al., 2021; Allaway et al., 2021;
Hsu and Horwood, 2022). This setting resembles the within-document
coreference resolution setting, where all links are similarly resolved
exhaustively within a single document. However, while within-document
coreference resolution is confined to a single document, cross-document
coreference resolution (CDCR) might relate to an unbounded multi-text
search space (e.g., news articles, Wikipedia articles, court and police
records and so on). To that end, we aim at a task and dataset for
modeling CDEC as a search problem. To facilitate a large corpus for a
realistic representation of such a task, while ensuring reliable
development and test sets, we adopted WEC-Eng3 as the basis for our
dataset creation (§3).
Within-Document Coreference Resolution. Recent within-document
coreference resolution models (Lee et al., 2018; Joshi et al., 2019;
Kantor and Globerson, 2019; Wu et al., 2020) were inspired by the
end-to-end model architecture introduced by Lee et al. (2017). In
particular, two distinct components were adopted in those works, which
were shown to be effective in detecting mentions and their coreference
relations, both in the within-document and cross-document (Cattan et
al., 2020) settings. In our proposed model, we similarly adopt those two
components to better represent coreference relations in the coreference
search setting.
2.2 Open-Domain Question Answering
Open-domain question answering (ODQA) (Voorhees, 1999) is concerned with
answering factoid questions based on a large collection of documents.
Modern open-domain QA systems have been restructured and simplified by
combining information retrieval (IR) techniques and neural reading
comprehension models (Chen et al., 2017). In those approaches, a
retriever component finds documents that might contain an answer from a
large collection of documents, followed by a reader component that finds
a candidate answer in a given document (Lee et al., 2019; Yang et al.,
2019; Karpukhin et al., 2020).

3 The larger magnitude of HyperCoref makes it a suitable candidate for
our CoreSearch. However, since HyperCoref is not publicly released, we
could not evaluate on it. We leave this part to future work.

                   Mentions   Non-Singleton Clusters
WEC-Eng (train)      40,529                    7,042
WEC-Eng (dev)         1,250                      216
WEC-Eng (test)        1,893                      306

Table 1: WEC-Eng Dataset Statistics. Mentions: The total number of event
mentions within the corresponding section. Non-Singleton Clusters:
Number of event clusters containing more than a single event mention.

                                  Train       Dev      Test      Total
WEC-Eng Validated Data
# Clusters                          237        49       236        522
# Passages (with Mentions)        1,503       341     1,266      3,110
# Added Distractor Passages     922,736   923,376   923,746  2,769,858
# Total Passages                924,239   923,717   925,012  2,772,968

Table 2: CoreSearch dataset statistics.
We observe that the Cross-Document Event Coreference Search (CDES)
setting resembles the ODQA task. Specifically, given a passage
containing a mention of interest, considered as a query, CDES is
concerned with finding mentions coreferring with the query event in a
large document collection. To facilitate research in this task, we
created a dataset similar in structure to ODQA datasets (Berant et al.,
2013; Baudiš and Šedivý, 2015; Joshi et al., 2017; Kwiatkowski et al.,
2019; Rajpurkar et al., 2016), and established a suitable model whose
architecture resembles the recent two-step (retriever/reader) systems,
as described in the following sections.
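
The two-step retriever/reader structure can be sketched with toy components. This is an illustration under assumptions, not the model proposed in this paper: the vectors are hand-written stand-ins for learned encodings, passages are ranked by inner product with the query vector, and the "reader" is a simple string match standing in for a neural span extractor. All names and data are hypothetical.

```python
from typing import Callable, Optional, Sequence

Vector = Sequence[float]
Span = tuple[int, int]  # character offsets (start, end) of an extracted mention


def dot(u: Vector, v: Vector) -> float:
    """Inner product, used here as the query-passage similarity score."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(query_vec: Vector, passage_vecs: dict[str, Vector], k: int) -> list[str]:
    """Step 1 (retriever): return the ids of the k highest-scoring passages."""
    ranked = sorted(
        passage_vecs,
        key=lambda pid: dot(query_vec, passage_vecs[pid]),
        reverse=True,
    )
    return ranked[:k]


def search(
    query_vec: Vector,
    passage_vecs: dict[str, Vector],
    passages: dict[str, str],
    reader: Callable[[str], Optional[Span]],
    k: int,
) -> list[tuple[str, Span]]:
    """Step 2 (reader): extract a mention span from each retrieved passage."""
    results = []
    for pid in retrieve(query_vec, passage_vecs, k):
        span = reader(passages[pid])
        if span is not None:  # the reader may find no coreferring mention
            results.append((pid, span))
    return results


def toy_reader(text: str) -> Optional[Span]:
    """Stand-in reader: locate the literal word 'earthquake', if present."""
    start = text.find("earthquake")
    return None if start < 0 else (start, start + len("earthquake"))


passages = {
    "p1": "the 2010 Yushu earthquake destroyed the old school",
    "p2": "the election was held in the spring",
    "p3": "a new school was built in the county",
}
passage_vecs = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [0.9, 0.1]}

hits = search([1.0, 0.0], passage_vecs, passages, toy_reader, k=2)
for pid, (start, end) in hits:
    print(pid, passages[pid][start:end])
```

Separating the two steps keeps the expensive span-level model off the full collection: only the k retrieved passages are ever read.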
3 The CoreSearch Dataset
We formulated the Cross-Document Event Coref-
erence Search task following a similar approach
to open-domain question answering (illustrated in
Figure 1). Specifically, given a query containing a
marked target event mention, along with a passage
collection, the goal is to retrieve all the passages
from the passage collection that contain an event
mention coreferring with the query event, and ex-
tract the coreferring mention span of each retrieved
passage.
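
The task formulation above can be sketched as a data structure plus a function that derives the expected answers for a query from gold cluster annotations. The field names, the character-offset representation of the marked mention, and the toy passages are assumptions made for illustration; they are not the CoreSearch file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MentionPassage:
    passage_id: str
    text: str
    mention_start: int  # character offset where the marked mention begins
    mention_end: int    # character offset just past the marked mention
    cluster_id: str     # gold event cluster of the marked mention

    @property
    def mention(self) -> str:
        return self.text[self.mention_start:self.mention_end]


def mark(passage_id: str, text: str, mention: str, cluster_id: str) -> MentionPassage:
    """Build a passage by locating the first occurrence of the mention string."""
    start = text.index(mention)
    return MentionPassage(passage_id, text, start, start + len(mention), cluster_id)


def gold_answers(query: MentionPassage, collection: list[MentionPassage]) -> list[MentionPassage]:
    """Passages whose marked mention corefers with the query, excluding the query itself."""
    return [
        p for p in collection
        if p.cluster_id == query.cluster_id and p.passage_id != query.passage_id
    ]


query = mark("q", "an earthquake struck the prefecture", "earthquake", "yushu_2010")
collection = [
    query,
    mark("p1", "the 2010 Yushu earthquake destroyed the old school", "earthquake", "yushu_2010"),
    mark("p2", "the plane crash was reported widely", "crash", "crash_1"),
]

for answer in gold_answers(query, collection):
    print(answer.passage_id, answer.mention)
```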
To facilitate research on this task, we present a large dataset, derived
from Wikipedia, termed CoreSearch. In this section we describe the
CoreSearch dataset structure (§3.1), followed by the structure of a
single query instance (§3.2).