Transformer-based Subject Entity Detection in
Wikipedia Listings
Nicolas Heist1,*, Heiko Paulheim1
1 Data and Web Science Group, University of Mannheim, Germany
Abstract
In tasks like question answering or text summarisation, it is essential to have background knowledge
about the relevant entities. The information about entities - and in particular, about long-tail or emerging
entities - in publicly available knowledge graphs like DBpedia or CaLiGraph is far from complete. In this
paper, we present an approach that exploits the semi-structured nature of listings (like enumerations and
tables) to identify the main entities of the listing items (i.e., of entries and rows). These entities, which
we call subject entities, can be used to increase the coverage of knowledge graphs. Our approach uses
a transformer network to identify subject entities on token-level and surpasses an existing approach
in terms of performance while being bound by fewer limitations. Due to a flexible input format, it
is applicable to any kind of listing and is, unlike prior work, not dependent on entity boundaries as
input. We demonstrate our approach by applying it to the complete Wikipedia corpus and extract 40
million mentions of subject entities with an estimated precision of 71% and recall of 77%. The results are
incorporated in the most recent version of CaLiGraph.
Keywords
Subject Entity Detection, Named Entity Recognition, Wikipedia Listings, CaLiGraph
1. Introduction
1.1. Motivation
Background knowledge provides an essential advantage in tasks like text summarisation or question answering. With ready-to-use entity linking tools like Falcon [1], entities in text can be identified and additional information can be drawn from background knowledge graphs (e.g. DBpedia [2] or CaLiGraph¹ [3]). Of course, this is only possible if the necessary information about the entity is included in the knowledge graph [4].
Hence, it is important to equip knowledge graphs with as much entity knowledge as possible.
While this is easily possible for prominent entities that are mentioned frequently, the retrieval of information about long-tail and emerging entities that are mentioned only very infrequently is tedious [5]. Still, approaches for automatic information extraction can be applied to increase the coverage of knowledge graphs to a certain extent.
ISWC 2022: Deep Learning for Knowledge Graphs, October 23–27, 2022, Virtual Conference
* Corresponding author.
nico@informatik.uni-mannheim.de (N. Heist); heiko@informatik.uni-mannheim.de (H. Paulheim)
http://www.uni-mannheim.de/dws/people/researchers/phd-students/nicolas-heist/ (N. Heist); http://www.heikopaulheim.com/ (H. Paulheim)
0000-0002-4354-9138 (N. Heist)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
¹ http://caligraph.org
[Figure 1 shows the page Gilby Clarke with a section Discography containing three listings: Listing 1, an enumeration titled "Albums with Guns N' Roses" with the entries The Spaghetti Incident? (1993) and Greatest Hits (1999); Listing 2, an enumeration titled "Albums with Nancy Sinatra" with the entry California Girl; and Listing 3, a table of solo albums with the columns Name and Year, listing Rubber (1998) and Swag (2001). Page title, section, and the individual listings are marked in the figure.]
Figure 1: Simplified view on the listings of the Wikipedia page of Gilby Clarke.
One strand of research is concerned with open information extraction systems that try to extract facts from web text (e.g. [6, 7]). While they perform strongly on well-known entities, the extraction quality for long-tail entities is considerably worse [6].
The extraction of information from semi-structured data is in general less error-prone and has already proven to yield high-quality results as, for example, DBpedia itself is extracted primarily from Wikipedia infoboxes; other approaches use the category system of Wikipedia [8, 9, 10]; many more approaches focus on tables (in Wikipedia or the web) as a semi-structured data source to extract entities and relations (see [11] for a comprehensive survey).
In this work, we generalize over structures like enumerations (Listings 1 and 2) and tables (Listing 3 in Figure 1) by simply considering them as listings with listing items (i.e., enumeration entries or table rows). Further, we call the main entity that a listing item is about a subject entity (SE). In previous work, we defined SEs as all entities in a listing appearing as instances of a common concept [12]. In the case of Figure 1, the SEs are the mentioned albums (e.g. The Spaghetti Incident? or California Girl). Here, the common concept is made explicit through the section labels above the listings (Albums with...), but it may as well be the case that it is only implicitly defined through the respective SEs. As a listing item typically mentions only one SE together with some context (in this case, the publication year of the album), we assume that at most one SE per listing item exists.
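To make these notions concrete, the following minimal Python sketch models listings, listing items, and SE annotations as plain data structures (the class and attribute names are illustrative and not taken from our actual implementation):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ListingItem:
    """One enumeration entry or table row."""
    text: str                              # raw text of the item
    subject_entity: Optional[str] = None   # at most one SE per item

@dataclass
class Listing:
    """An enumeration or table, viewed uniformly as a listing."""
    page_title: str
    section: str
    items: list                            # list of ListingItem

# Listing 2 from Figure 1: the section label makes the common
# concept ("albums") explicit; the single item mentions one SE.
listing2 = Listing(
    page_title="Gilby Clarke",
    section="Albums with Nancy Sinatra",
    items=[ListingItem(text="California Girl", subject_entity="California Girl")],
)
```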
In the English Wikipedia alone, we find almost five million listings in roughly two million articles. From our estimation, about 80% of the listings are suitable for the extraction of SEs, bearing an immense potential for knowledge graph completion (for details, see Section 3.1). Upon extraction, they can easily be digested by downstream applications: due to the semi-structured nature of listings, the quality of extraction is higher than extraction from plain text, and SEs are typically extracted in groups of instances sharing a common concept (as given by the definition above). Especially the latter point makes a subsequent disambiguation step much easier, as the group of extracted instances provides context for every individual instance.
Another example of the downstream use of SEs is a work of ours where we used groups of SEs to learn lexical patterns that entail axioms [12]. For example, if a listing is in a section that starts with Albums with, we learn that the SEs are of the type Album.
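To illustrate the idea, a hand-coded version of such a pattern could look as follows (a minimal sketch; the approach in [12] learns these patterns from data rather than hard-coding them, and all names here are hypothetical):

```python
import re
from typing import Optional

# Lexical pattern over section labels -> entailed type of the SEs.
SECTION_PATTERNS = {
    re.compile(r"^Albums with\b"): "Album",
}

def entailed_type(section_label: str) -> Optional[str]:
    """Return the type entailed for the SEs of a listing, if any pattern matches."""
    for pattern, entity_type in SECTION_PATTERNS.items():
        if pattern.match(section_label):
            return entity_type
    return None

assert entailed_type("Albums with Nancy Sinatra") == "Album"
```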
The combination of these two ideas, i.e. of extracting novel SEs and learning defining axioms for them, can bring a big benefit. In Figure 1, instead of simply discovering California Girl as a new entity, we additionally assign it the type Album. Thinking further, we can learn an axiom that all albums mentioned in the discography of Gilby Clarke are albums that are authored by him. The additional information can be used to refine the description of the extracted entity in the knowledge graph.
1.2. Problem Statement
Given an arbitrary listing, we want to identify the SEs among all entities mentioned in the listing. In the literature, there are only very few approaches that deal with this problem. The most related approach is a previous work of the authors that is concerned with the detection of SEs in Wikipedia list pages [3].² The approach uses a hand-crafted set of features to classify entities in tables or enumerations of list pages as SEs. However, the approach has several limitations:

• It is only applicable to list pages and not to listings in any other context, as the features are primarily designed for the list page context.
• Dependencies between individual SEs of listing items are not taken into account, as the classification is done separately for every item.
• The approach needs mention boundaries of entities as input for the classification. Consequently, it cannot identify any new entities but only categorize existing entities into subject and non-subject entities.
1.3. Contributions
To harness the information expressed through SEs in more general settings, we aim to overcome the previously mentioned limitations in this work. In particular, we make the following contributions:

• We present a Transformer-based approach for SE detection with a flexible input format that allows us to apply it to any kind of listing. Further, the model takes dependencies between listing items into account (Section 4.1).
• During prediction, the approach detects SEs end-to-end without relying on mention boundaries of the entities in the input sequence (Section 4.2).
• We introduce a novel mechanism for generating negative samples of listings (Section 4.3) and a fine-tuning mechanism on noisy listing labels (Section 4.4), leading to more accurate prediction results.
• In our evaluation, we show that the performance of our approach is superior to previous work (Section 5.3); further, we analyse its performance in a more general scenario, that is, arbitrary listings of Wikipedia pages (Section 5.4).
• We run the extraction of SEs on the complete Wikipedia corpus and incorporate the results in a new version of CaLiGraph (Section 5.6).

The produced code is publicly available and part of the CaLiGraph extraction framework.³

² List pages are special Wikipedia pages that contain only listings describing entities of a certain topic.
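To give an intuition for the flexible input format, the following sketch serializes a listing together with its page and section context into a single input sequence (the special tokens [SEP] and [ITEM] are illustrative assumptions; the actual format is described in Section 4.1):

```python
def serialize_listing(page_title, section, items):
    """Pack page context and all listing items into one input sequence."""
    context = f"{page_title} [SEP] {section}"
    body = " ".join(f"[ITEM] {item}" for item in items)
    return f"{context} {body}"

sequence = serialize_listing(
    "Gilby Clarke",
    "Albums with Guns N' Roses",
    ["The Spaghetti Incident? (1993)", "Greatest Hits (1999)"],
)
# -> "Gilby Clarke [SEP] Albums with Guns N' Roses [ITEM] The Spaghetti
#     Incident? (1993) [ITEM] Greatest Hits (1999)"
```

Packing all items of a listing into one sequence is what allows the model to exploit dependencies between listing items.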
2. Related Work
With the presented approach we detect SEs end-to-end, directly from listing text. For a given
listing, we identify mentions of named entities and decide at the same time whether they are
SEs of a listing or not. In the following, we first review Named Entity Recognition (NER) and
subsequently discuss approaches that detect SEs.
2.1. Named Entity Recognition
NER is a subproblem of Entity Linking (EL) which only tries to identify mentions of named entities in the text without actually disambiguating them [13]. As opposed to general Entity Recognition, NER only deals with the identification of named entities and ignores the linking of concepts (also called Wikification) [14].
Early NER systems were based on hand-crafted rules and lexicons, followed by systems using feature engineering and machine learning [15]. One of the first competitive NER systems that used neural networks was presented by Collobert et al. in 2011 [16]. This eventually led to more sophisticated architectures based on word embeddings and LSTMs (e.g. from Lample et al. [17]).
With the rise of transformer networks [18] like BERT [19] in 2018, they also found their direct application in NER (e.g. by Liang et al. [20]), or as part of an end-to-end EL system like the one from Broscheit [21]. The latter uses a simple but effective prediction scheme, where entities are predicted at token level and multiple subsequent tokens with the same predicted entity are collapsed into the actual entity prediction. In our work, we use a similar token-level prediction scheme to detect SEs.
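The collapsing step of such a token-level scheme can be sketched in a few lines (a simplification using binary SE/O token labels; Broscheit's system predicts entity identifiers per token rather than binary labels):

```python
from itertools import groupby

def collapse_predictions(tokens, labels):
    """Collapse runs of consecutive tokens sharing a non-O label into mentions."""
    mentions = []
    for label, group in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        if label != "O":
            mentions.append(" ".join(token for token, _ in group))
    return mentions

# Token-level predictions for an item of Listing 1 in Figure 1.
tokens = ["Greatest", "Hits", "(", "1999", ")"]
labels = ["SE", "SE", "O", "O", "O"]
assert collapse_predictions(tokens, labels) == ["Greatest Hits"]
```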
2.2. Subject Entity Detection
Although SE detection has not explicitly been addressed in the literature very frequently, there are some approaches that deal with related problems or subproblems of it. In table interpretation, an important task is the identification of the subject column, i.e. the column containing the entity with outgoing relations to all other columns. TAIPAN [22] is an approach that aims to recover the semantics of tables and names subject column identification as the first major task towards relation extraction in tables. To identify subject columns, they choose the columns having entities with the most outgoing edges to entities in other columns w.r.t. a background knowledge graph. While this is a viable approach for tables that are already annotated with
³ https://github.com/nheist/CaLiGraph
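As an illustration of the outgoing-edge heuristic described above, consider the following sketch (a simplified rendering in the spirit of TAIPAN, not its actual implementation; the edge index kg_edges is a hypothetical stand-in for the background knowledge graph):

```python
def subject_column(table, kg_edges):
    """Return the index of the column whose cell entities have the most
    outgoing edges (in a background KG) to entities in the same row."""
    scores = [0] * len(table[0])
    for row in table:
        for i, entity in enumerate(row):
            others = {e for j, e in enumerate(row) if j != i}
            scores[i] += len(kg_edges.get(entity, set()) & others)
    return max(range(len(scores)), key=scores.__getitem__)

# Tiny example: entities of the first column link to those of the second.
table = [["Rubber", "Gilby Clarke"], ["Swag", "Gilby Clarke"]]
kg_edges = {"Rubber": {"Gilby Clarke"}, "Swag": {"Gilby Clarke"}}
assert subject_column(table, kg_edges) == 0
```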