DBkWik++ - Multi Source Matching of
Knowledge Graphs
Sven Hertling [0000-0003-0333-5888] and Heiko Paulheim [0000-0003-4386-8195]
Data and Web Science Group, University of Mannheim, Germany
{sven,heiko}@informatik.uni-mannheim.de
Abstract. Large knowledge graphs like DBpedia and YAGO are all based
on the same source, i.e., Wikipedia. But there are other wikis that
contain information about long-tail entities, e.g., on wiki hosting
platforms like Fandom. In this paper, we present the approach behind and
an analysis of DBkWik++, a fused knowledge graph built from thousands of wikis.
A modified version of the DBpedia extraction framework is applied to each wiki,
which results in many isolated knowledge graphs. With an incremental
merge-based approach, we reuse one-to-one matching systems to solve
the multi-source KG matching task. Based on this alignment, we create
a consolidated knowledge graph with more than 15 million instances.
Keywords: Knowledge Graph Matching · Incremental Merge · Fusion
1 Introduction
There are many knowledge graphs (KGs) available in the Linked Open Data
Cloud1, some of which have a special focus, e.g., life science or governmental data. In
the course of time, a few hubs evolved which have a high link degree to other
datasets – two of them are DBpedia and YAGO. Both cover general knowledge
extracted from Wikipedia, and thus many applications use these datasets.
One drawback is that they all originate from the same source and thus cover
nearly the same concepts. For many applications, like recommender systems [1],
information about less well-known entities (also called long-tail entities) is
required to find similar concepts. Additional sources for such entities can be
found in wikis other than Wikipedia.
One example is the wiki farm Fandom2, where everyone can create wikis about any
topic. Due to the restricted scope of each of these wikis, pages about less
well-known entities are also created. As an example, William Riker (a fictional
character in the Star Trek universe) appears in Wikipedia because this character
is famous enough to be added (see also the notability criterion for people3). For
other characters, like his mother Betty Riker, this notability is not given, so they
only appear in dedicated wikis like Memory Alpha4 (a Star Trek wiki on Fandom).
1 https://lod-cloud.net
2 https://www.fandom.com
3 https://en.wikipedia.org/wiki/Wikipedia:Notability_(people)
4 https://memory-alpha.fandom.com/wiki/Betty_Riker
Table 1. Comparison of public knowledge graphs based on [10], sorted by the number
of instances.
Knowledge Graph # Instances # Assertions # Classes # Relations Source
Voldemort 55,861 693,428 621 294 Web
Cyc 122,441 2,229,266 116,821 148 Experts
DBpedia 5,044,223 854,294,312 760 1,355 Wikipedia
NELL 5,120,688 60,594,443 1,187 440 Web
YAGO 6,349,359 479,392,870 819,292 77 Wikipedia
CaLiGraph 7,315,918 517,099,124 755,963 271 Wikipedia
BabelNet 7,735,436 178,982,397 6,044,564 22 multiple
DBkWik 11,163,719 91,526,001 12,029 128,566 Fandom
DBkWik++ 15,346,033 106,347,347 15,642 215,273 Fandom
Wikidata 52,252,549 732,420,508 2,356,259 6,236 Community
The idea in this work is to use these wikis and apply a modified version
of the DBpedia extraction framework to create knowledge graphs out of them,
containing information about long-tail entities. Each resulting KG is isolated, but
the KGs can share the same instances, properties, and classes, which need to be
matched. For this multi-source knowledge graph matching task, we reuse a one-to-one
matcher and apply it multiple times in an incremental merge-based setup to create an
alignment over all KGs.
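The following minimal sketch illustrates the incremental merge idea under strong simplifying assumptions: each KG is reduced to a dictionary from entity URI to label, and a toy label-equality matcher stands in for an actual one-to-one matching system. It is not our implementation, only a conceptual illustration.

# Minimal sketch: KGs as URI-to-label dicts; a toy one-to-one matcher.

def match_pair(kg_new, kg_merged):
    """Toy one-to-one matcher: align entities with identical labels."""
    labels = {label.lower(): uri for uri, label in kg_merged.items()}
    return {(uri, labels[label.lower()])
            for uri, label in kg_new.items()
            if label.lower() in labels}

def incremental_merge(kgs):
    """Merge KGs one by one; each new KG is matched against the current merge."""
    merged = dict(kgs[0])
    alignment = set()
    for kg in kgs[1:]:
        pairs = match_pair(kg, merged)
        alignment |= pairs
        matched = {uri for uri, _ in pairs}
        # add only the entities that were not matched to the merged KG
        merged.update({uri: label for uri, label in kg.items() if uri not in matched})
    return merged, alignment

memory_alpha = {"ma:William_Riker": "William Riker", "ma:Betty_Riker": "Betty Riker"}
memory_beta = {"mb:William_Riker": "William Riker", "mb:Thomas_Riker": "Thomas Riker"}
merged, alignment = incremental_merge([memory_alpha, memory_beta])
print(alignment)  # {('mb:William_Riker', 'ma:William_Riker')}

The key property of this setup is that the pairwise matcher only ever sees two inputs (the next KG and the current merged KG), which is what allows existing one-to-one systems to be reused for the multi-source task.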
After fusing all KGs together based on the generated alignment, we end
up with DBkWik++, a fused knowledge graph of more than 40,000 wikis from
Fandom. The contributions of this paper are threefold:
– presentation of the overall approach to generate the KGs
– matching of 40,000 KGs on schema and instance level by reusing a one-to-one matcher
– analysis of the resulting knowledge graph
The rest of this paper is structured as follows. Section 2 describes related work
such as different general-purpose knowledge graphs and matching techniques.
Afterward, section 3 gives details about the retrieval of wikis, the application of the
DBpedia extraction framework, and the incremental merge of the KGs. After describing
the fusion and the provenance information of the KG, we profile the resulting
alignment in section 4 and the KG itself in section 5. We conclude with an outlook
and future work.
2 Related Work
This section is divided into two parts: 1) a description of other cross-domain
knowledge graphs and 2) multi-source matching approaches to combine isolated KGs
into one large KG.
Table 1 shows the public cross-domain KGs together with the number of
instances, assertions, classes, and relations. In addition, the main source of the
content is provided in the last column. The KGs are sorted by the number of
instances starting with the smallest.
VoldemortKG [34] uses data extracted from webpages (Common Crawl) via
structured annotations using approaches like RDFa, Microdata, and Microformats.
The resulting set of KGs (one for each webpage) is then merged
and linked to DBpedia by using Wikipedia links occurring on those webpages.
The overall graph is relatively small and contains only 55,861 instances.
Cyc [18] was generated by a small number of experts. It focuses on common-sense
knowledge and contains more assertions than instances. The scalability is
quite limited because of the manual creation; the total cost was estimated at
120 million USD. The numbers in the table refer to the openly available subset
OpenCyc.
DBpedia [2] instead uses an approach that scales much better. The main
source is Wikipedia, where many entries contain information in so-called
infoboxes (MediaWiki templates). These templates contain attribute-value pairs
whose values are shown on the webpage. When processing these pages without
resolving the templates, those key-value pairs can be extracted and transformed
into triples where the wiki page is the subject, the template key is the
property, and the template value is the corresponding literal or resource (in case
it is a URL). This automatic extraction opens the door for other data-driven
approaches.
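A minimal sketch of this kind of template-based extraction is shown below; the wikitext snippet, regular expressions, and property names are purely illustrative and do not reflect the actual DBpedia extraction framework.

import re

# Hypothetical wikitext with an unresolved infobox template.
WIKITEXT = """{{Infobox character
| name    = Betty Riker
| species = Human
| spouse  = [[Kyle Riker]]
}}"""

def extract_triples(page_title, wikitext):
    """Turn template key-value pairs into (subject, property, object) triples."""
    triples = []
    for key, value in re.findall(r"\|\s*(\w+)\s*=\s*(.+)", wikitext):
        value = value.strip()
        link = re.fullmatch(r"\[\[(.+?)\]\]", value)
        obj = link.group(1) if link else value  # resource if a wiki link, else a literal
        triples.append((page_title, key, obj))
    return triples

print(extract_triples("Betty Riker", WIKITEXT))
# [('Betty Riker', 'name', 'Betty Riker'),
#  ('Betty Riker', 'species', 'Human'),
#  ('Betty Riker', 'spouse', 'Kyle Riker')]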
NELL [22] (Never-Ending Language Learning) is an approach to extract
information from free text appearing on web pages. Based on some initial facts,
textual patterns for the corresponding relations are extracted and applied to unseen
text to extract more subjects and objects for these relations. The resulting facts are
again used to derive new patterns. With a human-in-the-loop approach, the authors
try to increase the quality by removing incorrect triples.
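As an illustration only (not NELL's actual implementation), the following toy bootstrapping loop learns a textual pattern from a seed fact and reapplies it to propose new candidate facts; the seed, relation name, and corpus are made up.

import re

seeds = {("Paris", "capitalOf", "France")}
corpus = ["Paris is the capital of France.",
          "Berlin is the capital of Germany."]

def learn_patterns(facts, sentences):
    """Turn sentences mentioning a known fact into reusable textual patterns."""
    patterns = set()
    for subj, _, obj in facts:
        for s in sentences:
            if subj in s and obj in s:
                patterns.add(s.replace(subj, "{subj}").replace(obj, "{obj}"))
    return patterns

def apply_patterns(patterns, sentences, relation="capitalOf"):
    """Apply learned patterns to unseen sentences to extract new candidate facts."""
    facts = set()
    for p in patterns:
        regex = re.escape(p).replace(r"\{subj\}", "(.+?)").replace(r"\{obj\}", "(.+?)")
        for s in sentences:
            m = re.fullmatch(regex, s)
            if m:
                facts.add((m.group(1), relation, m.group(2)))
    return facts

patterns = learn_patterns(seeds, corpus)  # {'{subj} is the capital of {obj}.'}
print(apply_patterns(patterns, corpus))   # also yields ('Berlin', 'capitalOf', 'Germany')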
YAGO [7] uses the same source as DBpedia (namely Wikipedia) but creates
the class hierarchy based on the categories defined in Wikipedia instead of
manually creating the hierarchy as DBpedia does.
CaLiGraph [11] also uses the category tree but converts the information in
category names into formal axioms, e.g., for “List of people from New York City”,
each instance in this category should have the triple <instance, bornIn,
New York City>.
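A small, purely illustrative sketch of such a category-to-axiom rule (not CaLiGraph's actual pipeline) could look as follows; the relation name bornIn is taken from the example above.

import re

def category_to_axiom(category_name):
    """Derive a triple pattern from a 'List of people from <place>' category name."""
    m = re.fullmatch(r"List of people from (.+)", category_name)
    if m:
        place = m.group(1)
        # every member of this category is asserted to be born in <place>
        return lambda instance: (instance, "bornIn", place)
    return None

axiom = category_to_axiom("List of people from New York City")
print(axiom("Herman Melville"))  # ('Herman Melville', 'bornIn', 'New York City')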
BabelNet [24] integrates many sources like Wikipedia and WordNet [21] to
collect synonyms and translations in many languages.
DBkWik [12] is generated from Fandom wikis with the DBpedia extraction
framework. Thus it has the same structure as DBpedia but includes more long-
tail entities, especially from the entertainment domain. It uses information from
12,840 wikis.
Wikidata [36] is a community-driven approach like Wikipedia but allows adding
factual information in the form of triples instead of free text. Furthermore, it
includes and fuses other large-scale datasets such as national libraries'
bibliographies.