Transformer-based Subject Entity Detection in
Wikipedia Listings
Nicolas Heist1,*, Heiko Paulheim1
1 Data and Web Science Group, University of Mannheim, Germany
Abstract
In tasks like question answering or text summarisation, it is essential to have background knowledge
about the relevant entities. The information about entities - and in particular, about long-tail or emerging
entities - in publicly available knowledge graphs like DBpedia or CaLiGraph is far from complete. In this
paper, we present an approach that exploits the semi-structured nature of listings (like enumerations and
tables) to identify the main entities of the listing items (i.e., of entries and rows). These entities, which
we call subject entities, can be used to increase the coverage of knowledge graphs. Our approach uses
a transformer network to identify subject entities on token-level and surpasses an existing approach
in terms of performance while being bound by fewer limitations. Due to a flexible input format, it
is applicable to any kind of listing and is, unlike prior work, not dependent on entity boundaries as
input. We demonstrate our approach by applying it to the complete Wikipedia corpus and extract 40
million mentions of subject entities with an estimated precision of 71% and recall of 77%. The results are
incorporated in the most recent version of CaLiGraph.
Keywords
Subject Entity Detection, Named Entity Recognition, Wikipedia Listings, CaLiGraph
1. Introduction
1.1. Motivation
Background knowledge provides an essential advantage in tasks like text summarisation or question answering. With ready-to-use entity linking tools like Falcon [1], entities in text can be identified and additional information can be drawn from background knowledge graphs (e.g. DBpedia [2] or CaLiGraph¹ [3]). Of course, this is only possible if the necessary information about the entity is included in the knowledge graph [4].
Hence, it is important to equip knowledge graphs with as much entity knowledge as possible.
While this is easily possible for prominent entities that are mentioned frequently, the retrieval of information about long-tail and emerging entities that are mentioned only very infrequently is tedious [5]. Still, approaches for automatic information extraction can be applied to increase the coverage of knowledge graphs to a certain extent.
ISWC 2022: Deep Learning for Knowledge Graphs, October 23–27, 2022, Virtual Conference
* Corresponding author.
nico@informatik.uni-mannheim.de (N. Heist); heiko@informatik.uni-mannheim.de (H. Paulheim)
http://www.uni-mannheim.de/dws/people/researchers/phd-students/nicolas-heist/ (N. Heist); http://www.heikopaulheim.com/ (H. Paulheim)
0000-0002-4354-9138 (N. Heist)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
¹ http://caligraph.org
[Figure 1 shows the page Gilby Clarke with a section Discography containing three listings: Listing 1, an enumeration titled "Albums with Guns N' Roses" with the entries The Spaghetti Incident? (1993) and Greatest Hits (1999); Listing 2, an enumeration titled "Albums with Nancy Sinatra" with the entry California Girl; and Listing 3, a table of solo albums with the columns Name and Year, listing Rubber (1998) and Swag (2001). Page title, section, and the individual listings are marked in the figure.]
Figure 1: Simplified view on the listings of the Wikipedia page of Gilby Clarke.
One strand of research is concerned with open information extraction systems that try to extract facts from web text (e.g. [6, 7]). While they perform strongly on well-known entities, the extraction quality for long-tail entities is considerably worse [6].
The extraction of information from semi-structured data is in general less error-prone and has already proven to yield high-quality results as, for example, DBpedia itself is extracted primarily from Wikipedia infoboxes; other approaches use the category system of Wikipedia [8, 9, 10]; many more approaches focus on tables (in Wikipedia or the web) as a semi-structured data source to extract entities and relations (see [11] for a comprehensive survey).
In this work, we generalize over structures like enumerations (Listings 1 and 2) and tables (Listing 3 in Figure 1) by simply considering them as listings with listing items (i.e., enumeration entries or table rows). Further, we call the main entity that a listing item is about a subject entity (SE). In previous work, we defined SEs as all entities in a listing appearing as instances of a common concept [12]. In the case of Figure 1, the SEs are the mentioned albums (e.g. The Spaghetti Incident? or California Girl). Here, the common concept is made explicit through the section labels above the listings (Albums with...), but it may as well be the case that it is only implicitly defined through the respective SEs. As a listing item typically mentions only one SE together with some context (in this case, the publication year of the album), we assume that at most one SE per listing item exists.
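To make these notions concrete, the following minimal Python sketch models listings, listing items, and SE annotations as plain data structures (the class and attribute names are illustrative and not taken from our actual implementation):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ListingItem:
    """One enumeration entry or table row."""
    text: str                              # raw text of the item
    subject_entity: Optional[str] = None   # at most one SE per item

@dataclass
class Listing:
    """An enumeration or table, viewed uniformly as a listing."""
    page_title: str
    section: str
    items: list                            # list of ListingItem

# Listing 2 from Figure 1: the section label makes the common
# concept ("albums") explicit; the single item mentions one SE.
listing2 = Listing(
    page_title="Gilby Clarke",
    section="Albums with Nancy Sinatra",
    items=[ListingItem(text="California Girl", subject_entity="California Girl")],
)
```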
In the English Wikipedia alone, we find almost five million listings in roughly two million articles. From our estimation, about 80% of the listings are suitable for the extraction of SEs, bearing an immense potential for knowledge graph completion (for details, see Section 3.1). Upon extraction, they can easily be digested by downstream applications: due to the semi-structured nature of listings, the quality of extraction is higher than extraction from plain text, and SEs are typically extracted in groups of instances sharing a common concept (as given by the definition above). Especially the latter point makes a subsequent disambiguation step much easier, as the group of extracted instances provides context for every individual instance.
Another example of the downstream use of SEs is a work of ours where we used groups of SEs to learn lexical patterns that entail axioms [12]. For example, if a listing is in a section that starts with Albums with, we learn that the SEs are of the type Album.
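To illustrate the idea, a hand-coded version of such a pattern could look as follows (a minimal sketch; the approach in [12] learns these patterns from data rather than hard-coding them, and all names here are hypothetical):

```python
import re
from typing import Optional

# Lexical pattern over section labels -> entailed type of the SEs.
SECTION_PATTERNS = {
    re.compile(r"^Albums with\b"): "Album",
}

def entailed_type(section_label: str) -> Optional[str]:
    """Return the type entailed for the SEs of a listing, if any pattern matches."""
    for pattern, entity_type in SECTION_PATTERNS.items():
        if pattern.match(section_label):
            return entity_type
    return None

assert entailed_type("Albums with Nancy Sinatra") == "Album"
```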
The combination of these two ideas, i.e. of extracting novel SEs and learning defining axioms for them, can bring a big benefit. In Figure 1, instead of simply discovering California Girl as a new entity, we additionally assign it the type Album. Thinking further, we can learn an axiom that all albums mentioned in the discography of Gilby Clarke are albums that are authored by him. The additional information can be used to refine the description of the extracted entity in the knowledge graph.
1.2. Problem Statement
Given an arbitrary listing, we want to identify the SEs among all entities mentioned in the listing. In the literature, there are only very few approaches that deal with this problem. The most related approach is a previous work of the authors that is concerned with the detection of SEs in Wikipedia list pages [3].² The approach uses a hand-crafted set of features to classify entities in tables or enumerations of list pages as SEs. However, the approach has several limitations:

• It is only applicable to list pages and not to listings in any other context, as the features are primarily designed for the list page context.
• Dependencies between individual SEs of listing items are not taken into account, as the classification is done separately for every item.
• The approach needs mention boundaries of entities as input for the classification. Consequently, it cannot identify any new entities but only categorize existing entities into subject and non-subject entities.
1.3. Contributions
To harness the information expressed through SEs in more general settings, we aim to overcome the previously mentioned limitations in this work. In particular, we make the following contributions:

• We present a Transformer-based approach for SE detection with a flexible input format that allows us to apply it to any kind of listing. Further, the model takes dependencies between listing items into account (Section 4.1).
• During prediction, the approach detects SEs end-to-end without relying on mention boundaries of the entities in the input sequence (Section 4.2).
• We introduce a novel mechanism for generating negative samples of listings (Section 4.3) and a fine-tuning mechanism on noisy listing labels (Section 4.4), leading to more accurate prediction results.
• In our evaluation, we show that the performance of our approach is superior to previous work (Section 5.3); further, we analyse its performance in a more general scenario, that is, arbitrary listings of Wikipedia pages (Section 5.4).
• We run the extraction of SEs on the complete Wikipedia corpus and incorporate the results in a new version of CaLiGraph (Section 5.6).

The produced code is publicly available and part of the CaLiGraph extraction framework.³

² List pages are special Wikipedia pages that contain only listings describing entities of a certain topic.
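To give an intuition for the flexible input format, the following sketch serializes a listing together with its page and section context into a single input sequence (the special tokens [SEP] and [ITEM] are illustrative assumptions; the actual format is described in Section 4.1):

```python
def serialize_listing(page_title, section, items):
    """Pack page context and all listing items into one input sequence."""
    context = f"{page_title} [SEP] {section}"
    body = " ".join(f"[ITEM] {item}" for item in items)
    return f"{context} {body}"

sequence = serialize_listing(
    "Gilby Clarke",
    "Albums with Guns N' Roses",
    ["The Spaghetti Incident? (1993)", "Greatest Hits (1999)"],
)
# -> "Gilby Clarke [SEP] Albums with Guns N' Roses [ITEM] The Spaghetti
#     Incident? (1993) [ITEM] Greatest Hits (1999)"
```

Packing all items of a listing into one sequence is what allows the model to exploit dependencies between listing items.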
2. Related Work
With the presented approach we detect SEs end-to-end, directly from listing text. For a given
listing, we identify mentions of named entities and decide at the same time whether they are
SEs of a listing or not. In the following, we first review Named Entity Recognition (NER) and
subsequently discuss approaches that detect SEs.
2.1. Named Entity Recognition
NER is a subproblem of Entity Linking (EL) which only tries to identify mentions of named entities in the text without actually disambiguating them [13]. As opposed to general Entity Recognition, NER only deals with the identification of named entities and ignores the linking of concepts (also called Wikification) [14].
Early NER systems were based on hand-crafted rules and lexicons, followed by systems using feature engineering and machine learning [15]. One of the first competitive NER systems that used neural networks was presented by Collobert et al. in 2011 [16]. This eventually led to more sophisticated architectures based on word embeddings and LSTMs (e.g. from Lample et al. [17]).
With the rise of transformer networks [18] like BERT [19] in 2018, they also found their direct application in NER (e.g. by Liang et al. [20]), or as part of an end-to-end EL system like the one from Broscheit [21]. The latter uses a simple but effective prediction scheme, where entities are predicted at token level and multiple subsequent tokens with the same predicted entity are collapsed into the actual entity prediction. In our work, we use a similar token-level prediction scheme to detect SEs.
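The collapsing step of such a token-level scheme can be sketched in a few lines (a simplification using binary SE/O token labels; Broscheit's system predicts entity identifiers per token rather than binary labels):

```python
from itertools import groupby

def collapse_predictions(tokens, labels):
    """Collapse runs of consecutive tokens sharing a non-O label into mentions."""
    mentions = []
    for label, group in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        if label != "O":
            mentions.append(" ".join(token for token, _ in group))
    return mentions

# Token-level predictions for an item of Listing 1 in Figure 1.
tokens = ["Greatest", "Hits", "(", "1999", ")"]
labels = ["SE", "SE", "O", "O", "O"]
assert collapse_predictions(tokens, labels) == ["Greatest Hits"]
```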
2.2. Subject Entity Detection
Although SE detection has not explicitly been addressed in the literature very frequently, there are some approaches that deal with related problems or subproblems of it. In table interpretation, an important task is the identification of the subject column, i.e. the column containing the entity with outgoing relations to all other columns. TAIPAN [22] is an approach that aims to recover the semantics of tables and names subject column identification as the first major task towards relation extraction in tables. To identify subject columns, they choose the columns having entities with the most outgoing edges to entities in other columns w.r.t. a background knowledge graph. While this is a viable approach for tables that are already annotated with
³ https://github.com/nheist/CaLiGraph
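As an illustration of the outgoing-edge heuristic described above, consider the following sketch (a simplified rendering in the spirit of TAIPAN, not its actual implementation; the edge index kg_edges is a hypothetical stand-in for the background knowledge graph):

```python
def subject_column(table, kg_edges):
    """Return the index of the column whose cell entities have the most
    outgoing edges (in a background KG) to entities in the same row."""
    scores = [0] * len(table[0])
    for row in table:
        for i, entity in enumerate(row):
            others = {e for j, e in enumerate(row) if j != i}
            scores[i] += len(kg_edges.get(entity, set()) & others)
    return max(range(len(scores)), key=scores.__getitem__)

# Tiny example: entities of the first column link to those of the second.
table = [["Rubber", "Gilby Clarke"], ["Swag", "Gilby Clarke"]]
kg_edges = {"Rubber": {"Gilby Clarke"}, "Swag": {"Gilby Clarke"}}
assert subject_column(table, kg_edges) == 0
```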