DBkWik++ - Multi Source Matching of
Knowledge Graphs
Sven Hertling [0000-0003-0333-5888] and Heiko Paulheim [0000-0003-4386-8195]
Data and Web Science Group, University of Mannheim, Germany
{sven,heiko}@informatik.uni-mannheim.de
Abstract. Large knowledge graphs like DBpedia and YAGO are all based
on the same source, i.e., Wikipedia. But there are other wikis that
contain information about long-tail entities, e.g., on wiki hosting
platforms like Fandom. In this paper, we present the approach behind and
an analysis of DBkWik++, a fused knowledge graph built from thousands of wikis.
A modified version of the DBpedia extraction framework is applied to each wiki,
which results in many isolated knowledge graphs. With an incremental
merge-based approach, we reuse one-to-one matching systems to solve
the multi-source KG matching task. Based on this alignment, we create
a consolidated knowledge graph with more than 15 million instances.
Keywords: Knowledge Graph Matching · Incremental Merge · Fusion
1 Introduction
There are many knowledge graphs (KGs) available in the Linked Open Data
Cloud1, some of which have a special focus, e.g., life science or governmental data. In
the course of time, a few hubs evolved which have a high link degree to other
datasets – two of them are DBpedia and YAGO. Both cover general knowledge
extracted from Wikipedia, and thus many applications use these datasets.
One drawback is that they all originate from the same source and thus cover
nearly the same concepts. For many applications, like recommender systems [1],
information about less well-known entities (also called long-tail entities) is
required to find similar concepts. Additional sources for such entities can be
found in wikis other than Wikipedia.
One example is the wiki farm Fandom2, where everyone can create wikis about any
topic. Due to the restricted scope of each of these wikis, pages about less
well-known entities are also created. As an example, William Riker (a fictional
character in the Star Trek universe) appears in Wikipedia because this character
is famous enough to be added (see also the notability criterion for people3). For
other characters, like his mother Betty Riker, this notability is not given, so they
only appear in dedicated wikis like Memory Alpha4 (a Star Trek wiki on Fandom).
1 https://lod-cloud.net
2 https://www.fandom.com
3 https://en.wikipedia.org/wiki/Wikipedia:Notability_(people)
4 https://memory-alpha.fandom.com/wiki/Betty_Riker
Table 1. Comparison of public knowledge graphs based on [10], sorted by the number
of instances.
Knowledge Graph # Instances # Assertions # Classes # Relations Source
Voldemort 55,861 693,428 621 294 Web
Cyc 122,441 2,229,266 116,821 148 Experts
DBpedia 5,044,223 854,294,312 760 1,355 Wikipedia
NELL 5,120,688 60,594,443 1,187 440 Web
YAGO 6,349,359 479,392,870 819,292 77 Wikipedia
CaLiGraph 7,315,918 517,099,124 755,963 271 Wikipedia
BabelNet 7,735,436 178,982,397 6,044,564 22 multiple
DBkWik 11,163,719 91,526,001 12,029 128,566 Fandom
DBkWik++ 15,346,033 106,347,347 15,642 215,273 Fandom
Wikidata 52,252,549 732,420,508 2,356,259 6,236 Community
The idea in this work is to use these wikis and apply a modified version
of the DBpedia extraction framework to create knowledge graphs out of them,
containing information about long-tail entities. Each resulting KG is isolated, but
the KGs can share the same instances, properties, and classes, which need to be
matched. For this multi-source knowledge graph matching task, we reuse a one-to-one
matcher and apply it multiple times in an incremental merge-based setup to create an
alignment over all KGs.
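The following minimal sketch illustrates the incremental merge idea under strong simplifying assumptions: each KG is reduced to a dictionary from entity URI to label, and a toy label-equality matcher stands in for an actual one-to-one matching system. It is not our implementation, only a conceptual illustration.

# Minimal sketch: KGs as URI-to-label dicts; a toy one-to-one matcher.

def match_pair(kg_new, kg_merged):
    """Toy one-to-one matcher: align entities with identical labels."""
    labels = {label.lower(): uri for uri, label in kg_merged.items()}
    return {(uri, labels[label.lower()])
            for uri, label in kg_new.items()
            if label.lower() in labels}

def incremental_merge(kgs):
    """Merge KGs one by one; each new KG is matched against the current merge."""
    merged = dict(kgs[0])
    alignment = set()
    for kg in kgs[1:]:
        pairs = match_pair(kg, merged)
        alignment |= pairs
        matched = {uri for uri, _ in pairs}
        # add only the entities that were not matched to the merged KG
        merged.update({uri: label for uri, label in kg.items() if uri not in matched})
    return merged, alignment

memory_alpha = {"ma:William_Riker": "William Riker", "ma:Betty_Riker": "Betty Riker"}
memory_beta = {"mb:William_Riker": "William Riker", "mb:Thomas_Riker": "Thomas Riker"}
merged, alignment = incremental_merge([memory_alpha, memory_beta])
print(alignment)  # {('mb:William_Riker', 'ma:William_Riker')}

The key property of this setup is that the pairwise matcher only ever sees two inputs (the next KG and the current merged KG), which is what allows existing one-to-one systems to be reused for the multi-source task.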
After fusing all KGs together based on the generated alignment, we end
up with DBkWik++, a fused knowledge graph of more than 40,000 wikis from
Fandom. The contributions of this paper are threefold:
– presentation of the overall approach to generate the KGs
– matching of 40,000 KGs on schema and instance level by reusing a one-to-one matcher
– analysis of the resulting knowledge graph
The rest of this paper is structured as follows. Section 2 describes related work
such as different general-purpose knowledge graphs and matching techniques.
Afterward, section 3 gives details about the retrieval of wikis, the application of the
DBpedia extraction framework, and the incremental merge of the KGs. After describing
the fusion and the provenance information of the KG, we profile the resulting
alignment in section 4 and the KG itself in section 5. We conclude with an outlook
and future work.
2 Related Work
This section is divided into two parts: 1) a description of other cross-domain
knowledge graphs and 2) multi-source matching approaches to combine isolated KGs
into one large KG.
Table 1 shows the public cross-domain KGs together with the number of
instances, assertions, classes, and relations. In addition, the main source of the
content is provided in the last column. The KGs are sorted by the number of
instances starting with the smallest.
VoldemortKG [34] uses data extracted from webpages (Common Crawl) via
structured annotations using approaches like RDFa, Microdata, and Microformats.
The resulting set of KGs (one for each webpage) is then merged
and linked to DBpedia by using Wikipedia links occurring on those webpages.
The overall graph is relatively small and contains only 55,861 instances.
Cyc [18] was generated by a small number of experts. It focuses on common-sense
knowledge and contains more assertions than instances. The scalability is
quite limited because of the manual creation; the total cost was estimated at
120 million USD. The numbers in the table refer to the openly available subset
OpenCyc.
DBpedia [2] instead uses an approach that scales much better. The main
source is Wikipedia, where many entries contain information in so-called
infoboxes (MediaWiki templates). These templates contain attribute-value pairs
whose values are shown on the webpage. When processing these pages without
resolving the templates, those key-value pairs can be extracted and transformed
into triples where the wiki page is the subject, the template key is the
property, and the template value is the corresponding literal or resource (in case
it is a URL). This automatic extraction opens the door for other data-driven
approaches.
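A minimal sketch of this kind of template-based extraction is shown below; the wikitext snippet, regular expressions, and property names are purely illustrative and do not reflect the actual DBpedia extraction framework.

import re

# Hypothetical wikitext with an unresolved infobox template.
WIKITEXT = """{{Infobox character
| name    = Betty Riker
| species = Human
| spouse  = [[Kyle Riker]]
}}"""

def extract_triples(page_title, wikitext):
    """Turn template key-value pairs into (subject, property, object) triples."""
    triples = []
    for key, value in re.findall(r"\|\s*(\w+)\s*=\s*(.+)", wikitext):
        value = value.strip()
        link = re.fullmatch(r"\[\[(.+?)\]\]", value)
        obj = link.group(1) if link else value  # resource if a wiki link, else a literal
        triples.append((page_title, key, obj))
    return triples

print(extract_triples("Betty Riker", WIKITEXT))
# [('Betty Riker', 'name', 'Betty Riker'),
#  ('Betty Riker', 'species', 'Human'),
#  ('Betty Riker', 'spouse', 'Kyle Riker')]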
NELL [22] (Never-Ending Language Learning) is an approach to extract
information from free text appearing on web pages. Based on some initial facts,
textual patterns for the corresponding relations are extracted and applied to unseen
text to extract more subjects and objects for these relations. The resulting facts are
again used to derive new patterns. With a human-in-the-loop approach, the authors
try to increase the quality by removing incorrect triples.
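As an illustration only (not NELL's actual implementation), the following toy bootstrapping loop learns a textual pattern from a seed fact and reapplies it to propose new candidate facts; the seed, relation name, and corpus are made up.

import re

seeds = {("Paris", "capitalOf", "France")}
corpus = ["Paris is the capital of France.",
          "Berlin is the capital of Germany."]

def learn_patterns(facts, sentences):
    """Turn sentences mentioning a known fact into reusable textual patterns."""
    patterns = set()
    for subj, _, obj in facts:
        for s in sentences:
            if subj in s and obj in s:
                patterns.add(s.replace(subj, "{subj}").replace(obj, "{obj}"))
    return patterns

def apply_patterns(patterns, sentences, relation="capitalOf"):
    """Apply learned patterns to unseen sentences to extract new candidate facts."""
    facts = set()
    for p in patterns:
        regex = re.escape(p).replace(r"\{subj\}", "(.+?)").replace(r"\{obj\}", "(.+?)")
        for s in sentences:
            m = re.fullmatch(regex, s)
            if m:
                facts.add((m.group(1), relation, m.group(2)))
    return facts

patterns = learn_patterns(seeds, corpus)  # {'{subj} is the capital of {obj}.'}
print(apply_patterns(patterns, corpus))   # also yields ('Berlin', 'capitalOf', 'Germany')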
YAGO [7] uses the same source as DBpedia (namely Wikipedia) but creates
the class hierarchy based on the categories defined in Wikipedia instead of
manually creating the hierarchy as DBpedia does.
CaLiGraph [11] also uses the category tree but converts the information in
category names into formal axioms, e.g., for “List of people from New York City”,
each instance in this category should have the triple <instance, bornIn,
New York City>.
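A small, purely illustrative sketch of such a category-to-axiom rule (not CaLiGraph's actual pipeline) could look as follows; the relation name bornIn is taken from the example above.

import re

def category_to_axiom(category_name):
    """Derive a triple pattern from a 'List of people from <place>' category name."""
    m = re.fullmatch(r"List of people from (.+)", category_name)
    if m:
        place = m.group(1)
        # every member of this category is asserted to be born in <place>
        return lambda instance: (instance, "bornIn", place)
    return None

axiom = category_to_axiom("List of people from New York City")
print(axiom("Herman Melville"))  # ('Herman Melville', 'bornIn', 'New York City')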
BabelNet [24] integrates many sources like Wikipedia and WordNet [21] to
collect synonyms and translations in many languages.
DBkWik [12] is generated from Fandom wikis with the DBpedia extraction
framework. Thus it has the same structure as DBpedia but includes more long-
tail entities, especially from the entertainment domain. It uses information from
12,840 wikis.
Wikidata [36] is a community-driven approach like Wikipedia but allows adding
factual information in the form of triples instead of free text. Furthermore, it
includes and fuses other large-scale datasets such as national libraries'
bibliographies.