Leveraging Wikidata’s edit history in knowledge graph refinement tasks
Alejandro Gonzalez-Hevia (corresponding author), Daniel Gayo-Avello
Department of Computer Science, University of Oviedo, Spain
Email addresses: uo251513@uniovi.es (Alejandro Gonzalez-Hevia), dani@uniovi.es (Daniel Gayo-Avello)
URL: www.alejgh.com (Alejandro Gonzalez-Hevia), www.danigayo.info (Daniel Gayo-Avello)
Abstract
Knowledge graphs have been adopted in many diverse fields for a variety of purposes. Most of those applications rely
on valid and complete data to deliver their results, underscoring the need to improve the quality of knowledge graphs. A
number of solutions have been proposed to that end, ranging from rule-based approaches to the use of probabilistic
methods, but there is an element that has not been considered yet: the edit history of the graph. In the case of
collaborative knowledge graphs (e.g., Wikidata), those edits represent the process in which the community reaches some
kind of fuzzy and distributed consensus over the information that best represents each entity, and can hold potentially
interesting information to be used by knowledge graph refinement methods. In this paper, we explore the use of edit
history information from Wikidata to improve the performance of type prediction methods. To do that, we have first
built a JSON dataset containing the edit history of every instance from the 100 most important classes in Wikidata.
This edit history information is then explored and analyzed, with a focus on its potential applicability in knowledge
graph refinement tasks. Finally, we propose and evaluate two new methods to leverage this edit history information in
knowledge graph embedding models for type prediction tasks. Our results show that one of the proposed methods improves over current approaches, demonstrating the potential of using edit information in knowledge graph refinement tasks and opening promising new research lines within the field.
Keywords: Semantic Web, Wikidata, Edit History, Knowledge Graph Refinement, Type Prediction, Knowledge Graph
Embeddings
1. Introduction
Different fields have incorporated the use of domain-
specific knowledge graphs during recent years to solve their
tasks. Some concrete examples of such domain-specific
tasks include performing investment analysis [1], manag-
ing diseases and symptoms from medical records [2], or
automatically generating test cases for software projects
[3], among many others. Furthermore, the emergence of
several open and general-purpose knowledge graphs, such
as DBpedia [4] and Wikidata [5], has also attracted new
communities closer to the Semantic Web by allowing them
to exploit this structured information for many different
applications. It goes without saying that most of those
applications rely on the correctness and completeness of
the data in the knowledge graph to deliver their results.
It is therefore crucial to ensure a high level of quality
for those knowledge graphs. This has led to works that de-
fine quality metrics and dimensions to better analyze and
understand data quality [6]. Those works reveal the existence of constraint violations and missing information in modern knowledge graphs, among other quality problems [7, 8].
Therefore, a number of different approaches have been
proposed to improve the quality of knowledge graphs. Some
of them follow a deductive approach, where a set of rules
or constraints that each triple must follow are defined to
enforce data quality [9, 10]. Other proposals follow an in-
ductive approach, using predictive models or alternative
probabilistic methods to try to fill incomplete information
or fix errors in the knowledge graph [11].
However, in the specific case of collaborative knowledge
graphs like Wikidata, there is an element that has not been
fully explored yet: its edit history information. One of the
main features of Wikidata, setting it apart from other open general-purpose knowledge graphs, is its collaborative approach: anyone can start editing entities in Wikidata from scratch. At the time of this writing, there have been
1,640,933,943 edits made to Wikidata. In those edits, the
community has progressively built a consensus –fuzzy and
somewhat distributed among the editors– over the infor-
mation that best represents each entity within the knowl-
edge graph, while also capturing the natural evolution of
those entities across time.
In this paper we explore the possibilities of leverag-
ing edit information to refine the contents of a knowledge
graph, laying the foundations for future work in this area.
Our main contributions are:
1. The creation of a JSON dataset containing the com-
plete edit history of every entity of the 100 most
important classes in Wikidata, following Wikidata’s
data model (Section 3).
2. An analysis of the main editing patterns from con-
tributors, edits made by class, and divisiveness in
Wikidata based on the edit information (Section 4).
This information is analyzed with a focus on its pos-
sible applications to knowledge graph refinement tasks.
3. The proposal of two approaches to leverage edit his-
tory data in type prediction tasks: the use of edits
in the negative sampling process of knowledge graph
embedding models, and the use of the edit information
as labeled data fed to a classifier (Section 5.1). We
perform an evaluation of these approaches against a
set of baselines and analyze the impact of using edit
history information in both approaches.
4. An RDF dataset containing edit history informa-
tion about Wikidata, following a custom data format
where each operation and revision is serialized to the
graph. This RDF dataset is also available without
edit history information, and can serve as a baseline
to measure the impact of using edit history data in
knowledge graph refinement models (Section 5.2).
The rest of this paper is structured as follows. In the
next section, we go over Wikidata’s data model and pro-
vide a formal definition of a knowledge graph and its ed-
its. These concepts are needed to better understand the
successive aspects of our work. Section 3 goes over the
process of acquiring edit information from Wikidata. This
edit information is explored in Section 4. In Section 5,
we propose two new methods to leverage edit information
to improve existing knowledge graph embedding models,
and we evaluate their performance with respect to current
approaches. Related work is reviewed in Section 6. Finally, in Section 7 we present the conclusions of this work and future research directions.
2. Background
2.1. Wikidata data model
Entities are the basic building block of Wikidata’s data
model. There are two different types of entities: items and
properties. Each entity is given a unique incremental nu-
meric id, with items being prefixed by a ‘Q’ and properties
by a ‘P’.
A statement is composed of a property and a value assigned to that property, optionally having 1 to n qualifiers and 1 to n references. In the rest of this paper we will
use the term simple statement to refer to statements that
are just composed of a property and a value. A statement
group is the set of statements that an item has of a given
property. Each entity in Wikidata is composed of 0 to n
statement groups.
Qualifiers are used to give further information about
a given statement (e.g., the point in time when the state-
ment holds true). Each qualifier is also composed of a
property and a value assigned to that property. The com-
bination of a property, value, and qualifier is called a claim.
References are also property-value pairs, and they provide
the source that validates a statement.
Aliases, descriptions, and labels constitute the finger-
print of an entity. These elements are mapped internally to
skos:altLabel, schema:description, and rdfs:label
URIs in the RDF representation of an entity. The com-
bination of description and label of an entity in a given
language must be unique. An entity can have multiple
aliases but only a single description and label for a given
language.
Snaks are the most basic information structure in Wiki-
data, and provide information about the value of a prop-
erty. There are three types of snaks: value,somevalue,
and novalue. Value snaks indicate that the property has a
known value, which is then represented using Wikidata’s
available datatypes1. Somevalue snaks indicate that the
property has a value but its value is unknown2. Finally,
novalue snaks indicate that the property does not have a
value.
Wikidata also introduces three ranks which can be assigned to each statement: preferred, deprecated, and
normal. These ranks are generally used to decide which
statements must be returned when querying Wikidata, and
also to clean up its user interface when exploring an entity.
Statements can also have an order within each statement
group. Although the order of statements within each rank
is not relevant, it can be changed by users.
All these elements are internally serialized in Wikidata
to JSON and different RDF serialization formats. It must
be noted that in this section we have covered all the ele-
ments that are mentioned in the rest of this paper, but the
list is not exhaustive. Additional information about Wiki-
data’s data model and its serialization is available online³.
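As an illustration of how these elements fit together, the following minimal Python sketch shows a heavily simplified item in the spirit of Wikidata’s JSON serialization: a fingerprint, one statement group with a single statement, its rank, its main snak, a qualifier, and a reference. The structure is an abridged approximation rather than the exact format, and the concrete identifiers and values are only illustrative.

# Heavily simplified sketch of a Wikidata item, loosely following the field
# names of Wikidata's JSON serialization. Many details are omitted and the
# concrete identifiers/values are illustrative only.
item = {
    "id": "Q42",
    # Fingerprint: labels, descriptions, and aliases per language.
    "labels": {"en": {"language": "en", "value": "Douglas Adams"}},
    "descriptions": {"en": {"language": "en", "value": "English writer"}},
    "aliases": {"en": [{"language": "en", "value": "Douglas Noel Adams"}]},
    "claims": {
        # Statement group: all statements the item has for property P69.
        "P69": [
            {
                "type": "statement",
                "rank": "normal",  # one of: preferred, normal, deprecated
                # Main snak: a "value" snak (known value) for property P69.
                "mainsnak": {
                    "snaktype": "value",  # or "somevalue" / "novalue"
                    "property": "P69",
                    "datavalue": {
                        "type": "wikibase-entityid",
                        "value": {"entity-type": "item", "id": "Q691283"},
                    },
                },
                # Qualifiers: further property-value pairs about the statement.
                "qualifiers": {
                    "P582": [
                        {
                            "snaktype": "value",
                            "property": "P582",
                            "datavalue": {"type": "time",
                                          "value": {"time": "+1974-00-00T00:00:00Z"}},
                        }
                    ]
                },
                # References: property-value pairs pointing to the source.
                "references": [
                    {
                        "snaks": {
                            "P854": [
                                {
                                    "snaktype": "value",
                                    "property": "P854",
                                    "datavalue": {"type": "string",
                                                  "value": "https://example.org/source"},
                                }
                            ]
                        }
                    }
                ],
            }
        ]
    },
}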
2.2. Revisions
Wikidata allows any user to edit entities. Therefore,
each entity is composed of a revision history that holds
all the changes made to the entity by Wikidata contribu-
tors. Any of the elements described in the previous section
can be changed, and a single revision may hold changes
to any number and combination of elements of an entity.
Throughout this paper we will use the terms edit and revision interchangeably, following Wikidata’s terminology.
¹ More information available at https://www.wikidata.org/wiki/Special:ListDatatypes
² Somevalue snaks are represented with blank nodes in the RDF serialization of the data model.
³ https://www.mediawiki.org/wiki/Wikibase/DataModel?tableofcontents=0
A revision also contains additional metadata, including
its timestamp, the author of the revision, tags, and a descrip-
tion. Tags are usually used to indicate the device from
which the revision was made (or the tool that made the
edit, if it was an automated process) and also to indicate
the cause of the edit⁴ (e.g., to revert vandalism).
In the context of this paper we will use the term oper-
ation to refer to a single modification made to an entity’s
element in a revision. One revision may be composed of
1 to n operations. An operation may represent the addi-
tion or removal of a single element from the entity. For
the sake of simplicity, we will also consider replacements
as operations, which are a combination of an addition and a removal operation on an entity.
Wikidata allows restoring and undoing revisions made
to an entity. Restoring allows users to undo all the ed-
its made to an entity up to the selected restoration state.
Undoing is more versatile: it undoes from 1 to n edits selected by the user, which, unlike in the restoration process, do not need to be consecutive. In neither case are the undone revisions removed from the revision history of the entity. Instead, a new revision is created that includes the operations necessary to undo the selected revisions.
Edits can be manually removed by Wikidata adminis-
trators under specific circumstances. These include revi-
sions that contain private information, a violation of copy-
right, or personal attacks of a serious nature.
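To make the undo mechanism concrete, the following minimal Python sketch models a revision as its metadata plus a list of addition/removal operations over triples, and builds the new revision that undoes a selection of revisions by inverting their operations in reverse order. This is a simplified model of our own for illustration, not Wikidata’s actual implementation, and the class and function names are hypothetical.

from dataclasses import dataclass, field
from typing import List, Tuple

Triple = Tuple[str, str, str]

@dataclass
class Operation:
    kind: str      # "add" or "remove"
    triple: Triple

@dataclass
class Revision:
    author: str
    timestamp: str
    tags: List[str] = field(default_factory=list)
    operations: List[Operation] = field(default_factory=list)

def undo(revisions: List[Revision], author: str, timestamp: str) -> Revision:
    """Build a new revision that undoes the given revisions.

    The undone revisions stay in the history; the undo is expressed as a
    fresh revision whose operations invert theirs (additions become removals
    and vice versa), applied in reverse order.
    """
    inverse_ops = [
        Operation(kind="remove" if op.kind == "add" else "add", triple=op.triple)
        for rev in reversed(revisions)
        for op in reversed(rev.operations)
    ]
    return Revision(author=author, timestamp=timestamp, tags=["undo"],
                    operations=inverse_ops)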
2.3. Formal definitions
We will now formally introduce the main elements used throughout this paper. Let I, L, and B be disjoint countably infinite sets of IRIs, literals, and blank nodes, respectively. From a static point of view, a knowledge graph can be formally defined as a set of triples (s, p, o) ∈ (I ∪ B) × I × (I ∪ L ∪ B).

From a dynamic point of view, a knowledge graph is built from a sequence of operations Op = {op_j : 1 ≤ j ≤ ∞}. Each operation op_j is composed of a triple t = (s, p, o) ∈ (I ∪ B) × I × (I ∪ L ∪ B). Op⁺ = {t_1, t_2, ..., t_n} represents the set of addition operations of the graph, while Op⁻ = {t'_1, t'_2, ..., t'_m} represents the set of removal operations. The set of all operations is therefore defined as Op = Op⁺ ∪ Op⁻. A knowledge graph is built out of n operations, with K_i representing the state of the graph after applying all operations up to operation i. Applying an addition operation op⁺_{i+1} = (s, p, o) to a graph K_i results in the graph K_{i+1} = K_i ∪ {(s, p, o)}. On the other hand, applying a removal operation op⁻_{i+1} = (s, p, o) to a graph K_i results in the graph K_{i+1} = K_i \ {(s, p, o)}. The final state of a knowledge graph can be obtained by a successive application of all its operations.
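These definitions can be read operationally as a fold over the operation sequence. The following minimal Python sketch (our own illustration; the triples in the example are hypothetical) materializes K_i as a set of triples and applies additions and removals exactly as defined above.

from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]
# An operation is a triple tagged with "+" (addition) or "-" (removal).
Op = Tuple[str, Triple]

def apply_operations(ops: Iterable[Op]) -> Set[Triple]:
    """Return the final state K_n of a graph built from a sequence of operations.

    K_0 is the empty graph; K_{i+1} = K_i ∪ {(s, p, o)} for an addition and
    K_{i+1} = K_i \\ {(s, p, o)} for a removal, as in the definitions above.
    """
    graph: Set[Triple] = set()
    for sign, triple in ops:
        if sign == "+":
            graph.add(triple)
        else:
            graph.discard(triple)
    return graph

# Example: add two triples, then remove one of them again.
ops = [
    ("+", ("wd:Q42", "wdt:P31", "wd:Q5")),
    ("+", ("wd:Q42", "wdt:P69", "wd:Q691283")),
    ("-", ("wd:Q42", "wdt:P69", "wd:Q691283")),
]
assert apply_operations(ops) == {("wd:Q42", "wdt:P31", "wd:Q5")}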
⁴ A list of the most common tags can be accessed at https://www.wikidata.org/wiki/Special:Tags
3. Extracting edit history data from Wikidata
We now present the approach followed to extract the
edit history information from Wikidata to conduct our ex-
periments.
3.1. Subset selection
Wikidata was composed of 97,795,169 entities at the
time of this writing, with more than 1,640,933,943 revi-
sions in total made by users⁵. Given that working with
the entire Wikidata revision history could be too compu-
tationally expensive to validate our proposal, we extracted
a subset to conduct our experiments.
This subset is composed of instances from the most
important Wikidata classes. To that end, we have com-
puted the ClassRank [12] score of every class in Wikidata,
choosing the top 100 classes with the highest score. Then,
we extracted the edit history information of every entity
that is an instance of any of those classes. We preferred
choosing the most important classes for our experiments
over producing a random sample since –in general– en-
tities belonging to central classes receive more attention
from the community, and are therefore more promising for
exploiting their edit history information.
To run ClassRank we defined the P31 property (in-
stance of) of Wikidata as a class-pointer⁶. The ClassRank
score of a class is computed by aggregating the PageRank
[13] scores of all its instances. However, since computing
the PageRank score of every entity in Wikidata was too
computationally expensive, we used a set of pre-computed
PageRank scores. These scores were obtained using the
Danker⁷ tool, which periodically computes the PageRank
score of every existing entity in Wikipedia [14]. The scores
are then mapped from Wikipedia pages to their respec-
tive Wikidata entities, yielding an approximation of their PageRank value⁸.
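Conceptually, the score of a class is obtained by summing the pre-computed PageRank scores of its P31 instances. The following minimal Python sketch illustrates that aggregation step only; it assumes a hypothetical pre-loaded mapping from entity id to PageRank score and an iterable of (instance, class) pairs extracted through P31, and it is not the ClassRank tool itself.

from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def aggregate_class_scores(
    instance_of: Iterable[Tuple[str, str]],  # (instance QID, class QID) pairs via P31
    pagerank: Dict[str, float],              # pre-computed PageRank per entity (e.g., from Danker)
) -> List[Tuple[str, float]]:
    """Approximate a ClassRank-style score by summing the PageRank of each class' instances."""
    scores: Dict[str, float] = defaultdict(float)
    for instance, cls in instance_of:
        scores[cls] += pagerank.get(instance, 0.0)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical usage: keep the 100 classes with the highest aggregated score.
# top_classes = aggregate_class_scores(p31_pairs, danker_scores)[:100]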
The 20 most important classes based on their Class-
Rank score can be seen in Table 1. These results were then
filtered manually, since some of those entities could be con-
sidered Wikidata classes at the ontological level, but not at
a conceptual one. To do so, we have removed those classes
that contained the term “Wikimedia” in their labels, since
they are used to organize Wikimedia content but do not
represent classes at a conceptual level.
The final subset is composed of 89 classes and 9.3 million instances, around 10% of the total number of entities in Wikidata. Although it contains just 10% of the entities, it accounts for around 35% of the total size of Wikidata. This can be explained by
⁵ Source: https://www.wikidata.org/wiki/Wikidata:Statistics
⁶ The class-pointer is used by ClassRank to fetch those entities from Wikidata that are classes.
⁷ https://github.com/athalhammer/danker
⁸ These dumps can be accessed at https://danker.s3.amazonaws.com/index.html
Table 1: Top 20 most important classes based on their ClassRank score

Name | ClassRank score | Number of instances
human | 2,167,439 | 3,873,812
Wikimedia category | 1,057,559 | 2,207,283
sovereign state | 837,883 | 203
taxon | 755,681 | 1,962,491
country | 746,208 | 193
point in time with respect to recurrent timeframe | 635,401 | 3,273
calendar year | 499,551 | 666
big city | 321,893 | 3,238
human settlement | 321,637 | 512,417
Wikimedia disambiguation page | 317,631 | 1,294,218
Wikimedia administration category | 302,196 | 12,146
Wikimedia list article | 287,605 | 301,370
language | 284,433 | 8,894
modern language | 257,324 | 6,875
city | 247,022 | 8,650
academic discipline | 204,426 | 1,603
time zone named for a UTC offset | 170,254 | 72
metacategory in Wikimedia projects | 167,871 | 2,545
republic | 165,881 | 78
capital | 161,384 | 388
taxonomic rank | 160,822 | 67
the fact that the most important entities have, in general, more content introduced by the community than entities of less important classes.
3.2. Data extraction
Wikidata periodically releases public dumps of its con-
tents⁹. We have selected the pages-meta-history dumps
to extract the edit history of every entity from our sub-
set, since these dumps are the only ones containing every
revision of each entity and not just their final content.
This dataset is composed of several XML files containing
metadata of every revision made to each entity, and also
a JSON blob with the complete content of the entity after
each revision. Our final dataset is built from the pages-
meta-history dumps from 2021-11-01.
Since working with the complete content of every en-
tity after each revision leads to a lot of redundant entity
information, we applied some preprocessing steps to reduce
the dataset size. Instead of storing the complete JSON content of each entity after every revision, we computed the diff between the JSON content in the previous revision (r_{t-1}) and the current one (r_t). These diffs are stored in the JSON Patch format¹⁰, therefore allowing the reconstruction of the entity contents after any revision. To obtain the complete JSON content of an entity at revision r_t we just need to apply the patches of every revision up to r_t. A simplified example of the decomposition of an entity into diffs is illustrated in Figure 1.
⁹ Available at https://dumps.wikimedia.org/wikidatawiki/
¹⁰ https://datatracker.ietf.org/doc/html/rfc6902
Wikidata revision data:

r1: {"name": "Alice", "friends": ["Bob", "Carol"], "height": 1.78}
r2: {"name": "Alice", "friends": ["Bob"], "height": 1.80}
r3: {"name": "Alice", "friends": ["Bob", "Carol"], "height": 1.80, "weight": 80}

Diff-based revision data:

r1: [{"op": "add", "path": "/name", "value": "Alice"},
     {"op": "add", "path": "/height", "value": 1.78},
     {"op": "add", "path": "/friends", "value": ["Bob", "Carol"]}]
r2: [{"op": "replace", "path": "/height", "value": 1.8},
     {"op": "remove", "path": "/friends/1"}]
r3: [{"op": "add", "path": "/weight", "value": 80},
     {"op": "add", "path": "/friends/1", "value": "Carol"}]

Figure 1: Example of revision decompositions into JSON Patch. At the top of the figure we show the simplified content of a Wikidata entity at each revision r. At the bottom of the figure we show the decomposition of this entity into diffs in JSON format.
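The diff-and-reconstruct workflow described above can be illustrated with an off-the-shelf JSON Patch implementation. The sketch below uses the Python jsonpatch package on the simplified revisions of Figure 1; it is only an illustration of the idea, not necessarily the tooling used to build our dataset.

import jsonpatch  # pip install jsonpatch

# Simplified revision contents from Figure 1.
r1 = {"name": "Alice", "friends": ["Bob", "Carol"], "height": 1.78}
r2 = {"name": "Alice", "friends": ["Bob"], "height": 1.80}
r3 = {"name": "Alice", "friends": ["Bob", "Carol"], "height": 1.80, "weight": 80}

# Store one JSON Patch per revision instead of the full entity content.
revisions = [r1, r2, r3]
patches = [jsonpatch.make_patch(prev, curr)
           for prev, curr in zip([{}] + revisions[:-1], revisions)]

def content_at(patches, t):
    """Rebuild the entity content at revision t by applying the first t patches."""
    doc = {}
    for patch in patches[:t]:
        doc = patch.apply(doc)
    return doc

assert content_at(patches, 2) == r2
assert content_at(patches, 3) == r3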