CrossRE: A Cross-Domain Dataset for Relation Extraction
Elisa Bassignana* and Barbara Plank*†‡
*Department of Computer Science, IT University of Copenhagen, Denmark
†Center for Information and Language Processing (CIS), LMU Munich, Germany
‡Munich Center for Machine Learning (MCML), Munich, Germany
elba@itu.dk bplank@cis.lmu.de
Abstract
Relation Extraction (RE) has attracted increasing attention, but current RE evaluation is limited to in-domain evaluation setups. Little is known about how well an RE system fares in challenging but realistic out-of-distribution evaluation setups. To address this gap, we propose CROSSRE, a new, freely available cross-domain benchmark for RE, which comprises six distinct text domains and includes multi-label annotations. An additional innovation is that we release meta-data collected during annotation, which includes explanations and flags of difficult instances. We provide an empirical evaluation with a state-of-the-art model for relation classification. As the meta-data enables us to shed new light on the state-of-the-art model, we provide a comprehensive analysis of the impact of difficult cases and find correlations between model and human annotations. Overall, our empirical investigation highlights the difficulty of cross-domain RE. We release our dataset to spur more research in this direction.¹
1 Introduction
Relation Extraction (RE) is the task of extracting structured knowledge from unstructured text. Although the task has attracted increasing attention in recent years, there is still a large gap in comprehensive evaluations of such systems that include out-of-domain setups (Bassignana and Plank, 2022). Despite the dearth of research on cross-domain evaluation of RE, its practical importance remains. RE feeds a wide range of downstream tasks, from question answering to knowledge-base population, summarization, and any other task that requires extracting structured information from unstructured text, so out-of-domain generalization capabilities are extremely beneficial. It is essential to build RE models that transfer well to new unseen domains, that can be learned from limited data, and that work well even on data for which new relations or entity types have to be recognized.
¹ https://github.com/mainlp/CrossRE
[Figure 1, top: Literature sample]
He then went to live at Chalcedon, whence in 367 he was banished to Mauretania for harbouring the rebel Procopius.
e1: location   e2: country   e3: writer
Ent A   Ent B   Exp            SA   UN
e3      e1      harboured_in   -    -
[Figure 1, bottom: Artificial Intelligence sample]
For many years starting from 1986, Miller directed the development of WordNet, a large computer-readable electronic reference usable in applications such as search engines.
e1: researcher   e2: product   e3: product
Ent A   Ent B   Exp   SA   UN
e1      e2      -     -    -
e3      e2      -     -    X
Figure 1: CROSSRE Samples from Literature and Artificial Intelligence Domains. At the top, the relation is enriched with the EXPLANATION (Exp) "harboured_in". At the bottom, instead, the second relation is marked with UNCERTAINTY (UN) by the annotator.
One direction that is gaining attention is to study RE systems under the assumption that new relation types have to be learned from few examples (few-shot learning; Han et al., 2018; Gao et al., 2019). Another direction is to study how sensitive an RE system is when the features of the input text change (domain shift; Plank and Moschitti, 2013). Only a limited number of studies focus on the latter aspect, and, to the best of our knowledge, only one paper proposes to study both, namely few-shot relation classification under domain shift (Gao et al., 2019). However, this last work considers only two domains (Wikipedia text for training and biomedical literature for testing) and has been criticized for its unrealistic setup (Sabo et al., 2021).
arXiv:2210.09345v1 [cs.CL] 17 Oct 2022
In this paper, we propose CROSSRE, a new challenging cross-domain evaluation benchmark for RE for English (samples in Figure 1). CROSSRE is manually curated with hand-annotated relations covering up to 17 types, and includes multi-label annotations. It contains six diverse text domains, namely: news, literature, natural sciences, music, politics and artificial intelligence. One of the challenges of CROSSRE is that both entity and relation type distributions vary considerably across domains. CROSSRE is heavily inspired by CrossNER (Liu et al., 2021), a recently proposed challenging benchmark for Named Entity Recognition (NER). We extend CrossNER to RE and collect additional meta-data including explanations and flags of difficult instances. To the best of our knowledge, CROSSRE is the most diverse RE dataset available to date, enabling research on domain adaptation and few-shot learning. In this paper we contribute:
• A new, comprehensive, manually-curated and freely-available RE dataset covering six diverse text domains and over 5k sentences.
• We release meta-data collected during annotation, and the annotation guidelines.
• An empirical evaluation of a state-of-the-art relation classification model and an experimental analysis of the meta-data provided.
2 Related Work
Despite the popularity of the RE task (e.g. Nguyen and Grishman, 2015b; Miwa and Bansal, 2016; Baldini Soares et al., 2019; Wang and Lu, 2020; Zhong and Chen, 2021), the cross-domain direction has not been widely explored. Only two datasets can be considered an initial step towards cross-domain RE. The ACE dataset (Doddington et al., 2004) has been analyzed considering its five domains: news (broadcast news, newswire), weblogs, telephone conversations, usenet and broadcast conversations (Plank and Moschitti, 2013; Nguyen and Grishman, 2014, 2015a). In contrast to ACE, the domains in CROSSRE are more distinctive, with specific and more diverse entity types in each of them.
More recently, the FewRel 2.0 dataset (Gao et al., 2019) has been published. It builds upon the original FewRel dataset (Han et al., 2018), collected from Wikipedia, and adds a new test set in the biomedical domain, collected from PubMed.
3 CrossRE
3.1 Motivation
RE aims to extract semantically informative triples from unstructured text. Each triple comprises an ordered pair of text spans, which represent named entities or mentions, and the semantic relation that holds between them. The latter is usually taken from a pre-defined set of relation types, which typically changes across datasets, even within the same domain. The absence of standards in RE leads to models that are designed to extract specific relations from specific datasets. As a consequence, the ability to generalize over out-of-domain distributions and unseen data is usually lacking. While such specialized models could be useful in applications where particular knowledge is required (e.g. the bioNLP field), in most cases a more generic level is enough to supply the information required for the downstream task. In conclusion, RE models that are able to generalize over domain-specific data would be beneficial in terms of the costs of both developing and training RE systems designed to work in pre-defined scenarios. To fill this gap, and to encourage the community to explore the cross-domain RE angle further, we publish CROSSRE, a new dataset for RE which includes six different domains, with a unified label set of 17 relation types.²
3.2 Dataset Overview
CROSSRE includes the following domains: news, politics, natural science, music, literature and artificial intelligence (AI). Our semantic relations are annotated on top of CrossNER (Liu et al., 2021), a cross-domain dataset for NER which contains domain-specific entity types.³ The news domain (collected from Reuters News) corresponds to the data released for the CoNLL-2003 shared task (Tjong Kim Sang and De Meulder, 2003), while the other five domains have been collected from Wikipedia. The six domains have been proposed and defined by previous work, and shown to contain diverse vocabularies. We refer to Liu et al. (2021) for details on e.g. vocabulary overlap across domains.
During our relation annotation process, we additionally correct some mistakes in named entities previously annotated in CrossNER (entity type, entity boundaries). We only revise existing entity mentions that are involved in a semantic relation, and add new entities that are involved in semantic relations (see samples in Appendix C).
² Our data statement (Bender and Friedman, 2018) can be found in Appendix A.
³ https://github.com/zliucr/CrossNER/tree/main/ner_data
Figure 2: CROSSRE Label Distribution. Percentage label distribution over the 17 relation types divided by CROSSRE's six domains. Detailed counts and percentages in Appendix D.
The final dataset statistics are reported in Table 1. We keep the train/dev/test data split by Liu et al. (2021) and, because of resource constraints, we fix as lower bound the number of sentences of the smallest domain (AI). We follow their design choice of making training sets relatively small, as cross-domain models are expected to adapt quickly with a small number of target-domain data samples. Our annotations are at the sentence level, and the number of relations indicates the number of directed entity pairs that are annotated with at least one of the 17 relation labels.
The final dataset contains 17 relation labels for the six domains: PART-OF, PHYSICAL, USAGE, ROLE, SOCIAL, GENERAL-AFFILIATION, COMPARE, TEMPORAL, ARTIFACT, ORIGIN, TOPIC, OPPOSITE, CAUSE-EFFECT, WIN-DEFEAT, TYPE-OF, NAMED, and RELATED-TO. The last of these is very generic and encapsulates all the semantic relations occurring with an extremely low frequency. With this label we take a step forward with respect to Sabo et al. (2021), who merge the ‘other’ and ‘no-relation’ cases into the ‘None-of-the-above’ (NOTA) label. We provide the description of each relation type in Appendix B, and the full annotation guidelines in our repository. The resulting label distribution is illustrated in Figure 2, showing that relations vary substantially across domains. We will return to this point in the experimental section and provide further details in the next section. After that, we describe the process that resulted in the final annotation guidelines and relation types. This includes the details on annotation agreement.
                 SENTENCES                   RELATIONS
                 train  dev    test   tot.   train  dev    test   tot.
news             164    350    400    914    175    300    396    871
politics         101    350    400    851    502    1,616  1,831  3,949
natural science  103    351    400    854    355    1,340  1,393  3,088
music            100    350    399    849    496    1,861  2,333  4,690
literature       100    400    416    916    397    1,539  1,591  3,527
AI               100    350    431    881    350    1,006  1,127  2,483
tot.             668    2,151  2,446  5,265  2,275  7,662  8,671  18,608
Table 1: CROSSRE Statistics. Number of sentences and number of relations annotated for each domain.
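As a rough illustration of how differently the domains behave, the totals in Table 1 can be turned into an average number of annotated relations per sentence. The sketch below only re-uses the numbers from the table; the dictionary layout and the function name are assumptions made for this example and are not part of the released dataset or code.

```python
# Totals copied from Table 1: domain -> (total sentences, total relations).
# The dictionary layout is an assumption made for this illustration only.
TABLE1_TOTALS = {
    "news": (914, 871),
    "politics": (851, 3949),
    "natural science": (854, 3088),
    "music": (849, 4690),
    "literature": (916, 3527),
    "AI": (881, 2483),
}


def relations_per_sentence(totals):
    """Return the average number of annotated relations per sentence."""
    return {
        domain: relations / sentences
        for domain, (sentences, relations) in totals.items()
    }


density = relations_per_sentence(TABLE1_TOTALS)

# Print the domains from the sparsest to the densest.
for domain, value in sorted(density.items(), key=lambda item: item[1]):
    print(f"{domain:16s} {value:.2f}")
```

On these totals, news has fewer than one relation per sentence on average, while music has more than five.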
As mentioned, our guidelines allow for multi-label annotations (Jiang et al., 2016). This means that each entity pair can be assigned to multiple relation types, except for the RELATED-TO label, which is exclusive and has to be used when none of the other 16 labels fit the data (see example in Appendix E). The combination of labels enables more precise annotations which better represent the meaning expressed in the text (e.g. domain-specific scenarios), while keeping the relation label set relatively small and generic, as motivated in Section 3.1. Overall, 6% of the relations in CROSSRE are annotated with multiple labels, specifically: news 2%, politics 15%, natural science 5%, music 4%, literature 2%, and AI 4%. Note that because of the directionality of the relations, entity pairs containing the same entities, but in reverse order, do not count as multi-labeled.
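The counting rules described above (directed pairs, multiple labels per pair, RELATED-TO as an exclusive label) can be made concrete with a small sketch. The tuple representation (head, tail, label), the function names, and the toy annotations are all assumptions made for this illustration; they do not reflect the file format of the released dataset.

```python
from collections import defaultdict

# The 17 relation labels listed in Section 3.2.
LABELS = {
    "PART-OF", "PHYSICAL", "USAGE", "ROLE", "SOCIAL", "GENERAL-AFFILIATION",
    "COMPARE", "TEMPORAL", "ARTIFACT", "ORIGIN", "TOPIC", "OPPOSITE",
    "CAUSE-EFFECT", "WIN-DEFEAT", "TYPE-OF", "NAMED", "RELATED-TO",
}
EXCLUSIVE_LABEL = "RELATED-TO"


def group_by_directed_pair(annotations):
    """Map each directed (head, tail) pair to the set of labels assigned to it."""
    pairs = defaultdict(set)
    for head, tail, label in annotations:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        pairs[(head, tail)].add(label)
    # RELATED-TO may not be combined with any other label.
    for pair, labels in pairs.items():
        if EXCLUSIVE_LABEL in labels and len(labels) > 1:
            raise ValueError(f"{EXCLUSIVE_LABEL} must be the only label for {pair}")
    return dict(pairs)


def multi_label_share(pairs):
    """Fraction of directed pairs that carry more than one label."""
    if not pairs:
        return 0.0
    return sum(len(labels) > 1 for labels in pairs.values()) / len(pairs)


# Hypothetical toy annotations: (head entity, tail entity, label).
toy = [
    ("e1", "e2", "ROLE"),
    ("e1", "e2", "TEMPORAL"),    # same directed pair, second label: multi-labeled
    ("e2", "e1", "PART-OF"),     # reversed order is a different directed pair
    ("e3", "e2", "RELATED-TO"),  # exclusive label, used on its own
]
pairs = group_by_directed_pair(toy)
share = multi_label_share(pairs)
```

Keying on the ordered pair rather than on an unordered set of entities is what keeps (e1, e2) and (e2, e1) apart, so two single-label relations in opposite directions are never counted as one multi-labeled relation.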
3.3 Label Distributions
CROSSRE includes the same label set over its six domains. This implementation choice is motivated by the aim of studying cross-domain RE models which are able to generalize over domain-specific data, and abstract to non-domain-specific relations. The result is a dataset with divergent label distributions across the different domains. Figure 2 shows