Leveraging Wikidata’s edit history in knowledge graph refinement tasks
Alejandro Gonzalez-Hevia (corresponding author), Daniel Gayo-Avello
Department of Computer Science, University of Oviedo, Spain
Email addresses: uo251513@uniovi.es (Alejandro Gonzalez-Hevia), dani@uniovi.es (Daniel Gayo-Avello)
URL: www.alejgh.com (Alejandro Gonzalez-Hevia), www.danigayo.info (Daniel Gayo-Avello)
Abstract
Knowledge graphs have been adopted in many diverse fields for a variety of purposes. Most of those applications rely
on valid and complete data to deliver their results, underscoring the need to improve the quality of knowledge graphs. A
number of solutions have been proposed to that end, ranging from rule-based approaches to the use of probabilistic
methods, but there is an element that has not been considered yet: the edit history of the graph. In the case of
collaborative knowledge graphs (e.g., Wikidata), those edits represent the process in which the community reaches some
kind of fuzzy and distributed consensus over the information that best represents each entity, and can hold potentially
interesting information to be used by knowledge graph refinement methods. In this paper, we explore the use of edit
history information from Wikidata to improve the performance of type prediction methods. To do that, we have first
built a JSON dataset containing the edit history of every instance from the 100 most important classes in Wikidata.
This edit history information is then explored and analyzed, with a focus on its potential applicability in knowledge
graph refinement tasks. Finally, we propose and evaluate two new methods to leverage this edit history information in
knowledge graph embedding models for type prediction tasks. Our results show that one of the proposed methods improves over current approaches, demonstrating the potential of using edit information in knowledge graph refinement tasks and opening promising new research lines within the field.
Keywords: Semantic Web, Wikidata, Edit History, Knowledge Graph Refinement, Type Prediction, Knowledge Graph
Embeddings
1. Introduction
Different fields have incorporated the use of domain-
specific knowledge graphs during recent years to solve their
tasks. Some concrete examples of such domain-specific
tasks include performing investment analysis [1], manag-
ing diseases and symptoms from medical records [2], or
automatically generating test cases for software projects
[3], among many others. Furthermore, the emergence of
several open and general-purpose knowledge graphs, such
as DBpedia [4] and Wikidata [5], has also attracted new
communities closer to the Semantic Web by allowing them
to exploit this structured information for many different
applications. It goes without saying that most of those
applications rely on the correctness and completeness of
the data in the knowledge graph to deliver their results.
It is therefore crucial to ensure a high level of quality
for those knowledge graphs. This has led to works that de-
fine quality metrics and dimensions to better analyze and
understand data quality [6]. Those works reveal the existence of constraint violations and missing information in modern knowledge graphs, among other quality problems [7, 8].
Therefore, a number of different approaches have been
proposed to improve the quality of knowledge graphs. Some
of them follow a deductive approach, where a set of rules
or constraints that each triple must follow are defined to
enforce data quality [9, 10]. Other proposals follow an in-
ductive approach, using predictive models or alternative
probabilistic methods to try to fill incomplete information
or fix errors in the knowledge graph [11].
However, in the specific case of collaborative knowledge
graphs like Wikidata, there is an element that has not been
fully explored yet: its edit history information. One of the
main features of Wikidata, setting it apart from other open general-purpose knowledge graphs, is its collaborative approach: anyone can start editing entities in Wikidata from scratch. At the time of this writing, there have been
1,640,933,943 edits made to Wikidata. In those edits, the
community has progressively built a consensus –fuzzy and
somewhat distributed among the editors– over the infor-
mation that best represents each entity within the knowl-
edge graph, while also capturing the natural evolution of
those entities across time.
In this paper we explore the possibilities of leverag-
ing edit information to refine the contents of a knowledge
graph, laying the foundations for future work in this area.
Our main contributions are:
1. The creation of a JSON dataset containing the com-
plete edit history of every entity of the 100 most
important classes in Wikidata, following Wikidata’s
data model (Section 3).
2. An analysis of the main editing patterns from con-
tributors, edits made by class, and divisiveness in
Wikidata based on the edit information (Section 4).
This information is analyzed with a focus on its pos-
sible applications to knowledge graph refinement tasks.
3. The proposal of two approaches to leverage edit his-
tory data in type prediction tasks: the use of edits
in the negative sampling process of knowledge graph
embedding models, and the use of the edit information
as labeled data fed to a classifier (Section 5.1). We
perform an evaluation of these approaches against a
set of baselines and analyze the impact of using edit
history information in both approaches.
4. An RDF dataset containing edit history informa-
tion about Wikidata, following a custom data format
where each operation and revision is serialized to the
graph. This RDF dataset is also available without
edit history information, and can serve as a baseline
to measure the impact of using edit history data in
knowledge graph refinement models (Section 5.2).
The rest of this paper is structured as follows. In the
next section, we go over Wikidata’s data model and pro-
vide a formal definition of a knowledge graph and its ed-
its. These concepts are needed to better understand the
successive aspects of our work. Section 3 goes over the
process of acquiring edit information from Wikidata. This
edit information is explored in Section 4. In Section 5,
we propose two new methods to leverage edit information
to improve existing knowledge graph embedding models,
and we evaluate their performance with respect to current
approaches. Related work is reviewed in Section 6. Finally, in Section 7 we present the conclusions of this work and future research directions.
2. Background
2.1. Wikidata data model
Entities are the basic building block of Wikidata’s data
model. There are two different types of entities: items and
properties. Each entity is given a unique incremental nu-
meric id, with items being prefixed by a ‘Q’ and properties
by a ‘P’.
A statement is composed of a property and a value assigned to that property, optionally having 1 to n qualifiers and 1 to n references. In the rest of this paper we will
use the term simple statement to refer to statements that
are just composed of a property and a value. A statement
group is the set of statements that an item has of a given
property. Each entity in Wikidata is composed of 0 to n
statement groups.
Qualifiers are used to give further information about
a given statement (e.g., the point in time when the state-
ment holds true). Each qualifier is also composed of a
property and a value assigned to that property. The com-
bination of a property, value, and qualifier is called a claim.
References are also property-value pairs, and they provide
the source that validates a statement.
Aliases, descriptions, and labels constitute the finger-
print of an entity. These elements are mapped internally to
skos:altLabel, schema:description, and rdfs:label
URIs in the RDF representation of an entity. The com-
bination of description and label of an entity in a given
language must be unique. An entity can have multiple
aliases but only a single description and label for a given
language.
Snaks are the most basic information structure in Wiki-
data, and provide information about the value of a prop-
erty. There are three types of snaks: value,somevalue,
and novalue. Value snaks indicate that the property has a
known value, which is then represented using Wikidata’s
available datatypes1. Somevalue snaks indicate that the
property has a value but its value is unknown2. Finally,
novalue snaks indicate that the property does not have a
value.
Wikidata also introduces three ranks which can be assigned to each statement: preferred, deprecated, and
normal. These ranks are generally used to decide which
statements must be returned when querying Wikidata, and
also to clean up its user interface when exploring an entity.
Statements can also have an order within each statement
group. Although the order of statements within each rank
is not relevant, it can be changed by users.
All these elements are internally serialized in Wikidata
to JSON and different RDF serialization formats. It must
be noted that in this section we have covered all the ele-
ments that are mentioned in the rest of this paper, but the
list is not exhaustive. Additional information about Wiki-
data’s data model and its serialization is available online³.
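As an illustration of how these elements fit together, the following minimal Python sketch shows a heavily simplified item in the spirit of Wikidata’s JSON serialization: a fingerprint, one statement group with a single statement, its rank, its main snak, a qualifier, and a reference. The structure is an abridged approximation rather than the exact format, and the concrete identifiers and values are only illustrative.

# Heavily simplified sketch of a Wikidata item, loosely following the field
# names of Wikidata's JSON serialization. Many details are omitted and the
# concrete identifiers/values are illustrative only.
item = {
    "id": "Q42",
    # Fingerprint: labels, descriptions, and aliases per language.
    "labels": {"en": {"language": "en", "value": "Douglas Adams"}},
    "descriptions": {"en": {"language": "en", "value": "English writer"}},
    "aliases": {"en": [{"language": "en", "value": "Douglas Noel Adams"}]},
    "claims": {
        # Statement group: all statements the item has for property P69.
        "P69": [
            {
                "type": "statement",
                "rank": "normal",  # one of: preferred, normal, deprecated
                # Main snak: a "value" snak (known value) for property P69.
                "mainsnak": {
                    "snaktype": "value",  # or "somevalue" / "novalue"
                    "property": "P69",
                    "datavalue": {
                        "type": "wikibase-entityid",
                        "value": {"entity-type": "item", "id": "Q691283"},
                    },
                },
                # Qualifiers: further property-value pairs about the statement.
                "qualifiers": {
                    "P582": [
                        {
                            "snaktype": "value",
                            "property": "P582",
                            "datavalue": {"type": "time",
                                          "value": {"time": "+1974-00-00T00:00:00Z"}},
                        }
                    ]
                },
                # References: property-value pairs pointing to the source.
                "references": [
                    {
                        "snaks": {
                            "P854": [
                                {
                                    "snaktype": "value",
                                    "property": "P854",
                                    "datavalue": {"type": "string",
                                                  "value": "https://example.org/source"},
                                }
                            ]
                        }
                    }
                ],
            }
        ]
    },
}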
2.2. Revisions
Wikidata allows any user to edit entities. Therefore,
each entity is composed of a revision history that holds
all the changes made to the entity by Wikidata contribu-
tors. Any of the elements described in the previous section
can be changed, and a single revision may hold changes
to any number and combination of elements of an entity.
Throughout this paper we will use the terms edit and revision interchangeably, following Wikidata’s terminology.
¹ More information available at https://www.wikidata.org/wiki/Special:ListDatatypes
² Somevalue snaks are represented with blank nodes in the RDF serialization of the data model.
³ https://www.mediawiki.org/wiki/Wikibase/DataModel?tableofcontents=0
A revision also contains additional metadata, including
its timestamp, the author of the revision, tags, and a descrip-
tion. Tags are usually used to indicate the device from
which the revision was made (or the tool that made the
edit, if it was an automated process) and also to indicate
the cause of the edit⁴ (e.g., to revert vandalism).
In the context of this paper we will use the term oper-
ation to refer to a single modification made to an entity’s
element in a revision. One revision may be composed of
1 to n operations. An operation may represent the addi-
tion or removal of a single element from the entity. For
the sake of simplicity, we will also consider replacements
as operations, which are a combination of an addition and a removal operation on an entity.
Wikidata allows restoring and undoing revisions made
to an entity. Restoring allows users to undo all the ed-
its made to an entity up to the selected restoration state.
Undoing is more versatile: it undoes from 1 to n edits selected by the user, which, unlike in the restoration process, do not need to be consecutive. In neither case are the undone revisions removed from the revision history of the entity. Instead, a new revision is created that includes the operations necessary to undo the selected revisions.
Edits can be manually removed by Wikidata adminis-
trators under specific circumstances. These include revi-
sions that contain private information, a violation of copy-
right, or personal attacks of a serious nature.
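To make the undo mechanism concrete, the following minimal Python sketch models a revision as its metadata plus a list of addition/removal operations over triples, and builds the new revision that undoes a selection of revisions by inverting their operations in reverse order. This is a simplified model of our own for illustration, not Wikidata’s actual implementation, and the class and function names are hypothetical.

from dataclasses import dataclass, field
from typing import List, Tuple

Triple = Tuple[str, str, str]

@dataclass
class Operation:
    kind: str      # "add" or "remove"
    triple: Triple

@dataclass
class Revision:
    author: str
    timestamp: str
    tags: List[str] = field(default_factory=list)
    operations: List[Operation] = field(default_factory=list)

def undo(revisions: List[Revision], author: str, timestamp: str) -> Revision:
    """Build a new revision that undoes the given revisions.

    The undone revisions stay in the history; the undo is expressed as a
    fresh revision whose operations invert theirs (additions become removals
    and vice versa), applied in reverse order.
    """
    inverse_ops = [
        Operation(kind="remove" if op.kind == "add" else "add", triple=op.triple)
        for rev in reversed(revisions)
        for op in reversed(rev.operations)
    ]
    return Revision(author=author, timestamp=timestamp, tags=["undo"],
                    operations=inverse_ops)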
2.3. Formal definitions
We will now formally introduce the main elements used throughout this paper. Let I, L, and B be disjoint countably infinite sets of IRIs, literals, and blank nodes, respectively. From a static point of view, a knowledge graph can be formally defined as a set of triples (s, p, o) ∈ (I ∪ B) × I × (I ∪ L ∪ B).

From a dynamic point of view, a knowledge graph is built from a sequence of operations Op = {op_j : 1 ≤ j ≤ ∞}. Each operation op_j is composed of a triple t = (s, p, o) ∈ (I ∪ B) × I × (I ∪ L ∪ B). Op⁺ = {t_1, t_2, ..., t_n} represents the set of addition operations of the graph, while Op⁻ = {t'_1, t'_2, ..., t'_m} represents the set of removal operations. The set of all operations is therefore defined as Op = Op⁺ ∪ Op⁻. A knowledge graph is built out of n operations, with K_i representing the state of the graph after applying all operations up to operation i. Applying an addition operation op⁺_{i+1} = (s, p, o) to a graph K_i results in the graph K_{i+1} = K_i ∪ {(s, p, o)}. On the other hand, applying a removal operation op⁻_{i+1} = (s, p, o) to a graph K_i results in the graph K_{i+1} = K_i \ {(s, p, o)}. The final state of a knowledge graph can be obtained by a successive application of all its operations.
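These definitions can be read operationally as a fold over the operation sequence. The following minimal Python sketch (our own illustration; the triples in the example are hypothetical) materializes K_i as a set of triples and applies additions and removals exactly as defined above.

from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]
# An operation is a triple tagged with "+" (addition) or "-" (removal).
Op = Tuple[str, Triple]

def apply_operations(ops: Iterable[Op]) -> Set[Triple]:
    """Return the final state K_n of a graph built from a sequence of operations.

    K_0 is the empty graph; K_{i+1} = K_i ∪ {(s, p, o)} for an addition and
    K_{i+1} = K_i \\ {(s, p, o)} for a removal, as in the definitions above.
    """
    graph: Set[Triple] = set()
    for sign, triple in ops:
        if sign == "+":
            graph.add(triple)
        else:
            graph.discard(triple)
    return graph

# Example: add two triples, then remove one of them again.
ops = [
    ("+", ("wd:Q42", "wdt:P31", "wd:Q5")),
    ("+", ("wd:Q42", "wdt:P69", "wd:Q691283")),
    ("-", ("wd:Q42", "wdt:P69", "wd:Q691283")),
]
assert apply_operations(ops) == {("wd:Q42", "wdt:P31", "wd:Q5")}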
⁴ A list of the most common tags can be accessed at https://www.wikidata.org/wiki/Special:Tags
3. Extracting edit history data from Wikidata
We now present the approach followed to extract the
edit history information from Wikidata to conduct our ex-
periments.
3.1. Subset selection
Wikidata was composed of 97,795,169 entities at the
time of this writing, with more than 1,640,933,943 revi-
sions in total made by users⁵. Given that working with
the entire Wikidata revision history could be too compu-
tationally expensive to validate our proposal, we extracted
a subset to conduct our experiments.
This subset is composed of instances from the most
important Wikidata classes. To that end, we have com-
puted the ClassRank [12] score of every class in Wikidata,
choosing the top 100 classes with the highest score. Then,
we extracted the edit history information of every entity
that is an instance of any of those classes. We preferred
choosing the most important classes for our experiments
over producing a random sample since –in general– en-
tities belonging to central classes receive more attention
from the community, and are therefore more promising for
exploiting their edit history information.
To run ClassRank we defined the P31 property (in-
stance of) of Wikidata as a class-pointer⁶. The ClassRank
score of a class is computed by aggregating the PageRank
[13] scores of all its instances. However, since computing
the PageRank score of every entity in Wikidata was too
computationally expensive, we used a set of pre-computed
PageRank scores. These scores were obtained using the
Danker⁷ tool, which periodically computes the PageRank
score of every existing entity in Wikipedia [14]. The scores
are then mapped from Wikipedia pages to their respec-
tive Wikidata entities, yielding an approximation of their PageRank value⁸.
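Conceptually, the score of a class is obtained by summing the pre-computed PageRank scores of its P31 instances. The following minimal Python sketch illustrates that aggregation step only; it assumes a hypothetical pre-loaded mapping from entity id to PageRank score and an iterable of (instance, class) pairs extracted through P31, and it is not the ClassRank tool itself.

from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def aggregate_class_scores(
    instance_of: Iterable[Tuple[str, str]],  # (instance QID, class QID) pairs via P31
    pagerank: Dict[str, float],              # pre-computed PageRank per entity (e.g., from Danker)
) -> List[Tuple[str, float]]:
    """Approximate a ClassRank-style score by summing the PageRank of each class' instances."""
    scores: Dict[str, float] = defaultdict(float)
    for instance, cls in instance_of:
        scores[cls] += pagerank.get(instance, 0.0)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical usage: keep the 100 classes with the highest aggregated score.
# top_classes = aggregate_class_scores(p31_pairs, danker_scores)[:100]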
The 20 most important classes based on their Class-
Rank score can be seen in Table 1. These results were then
filtered manually, since some of those entities could be con-
sidered Wikidata classes at the ontological level, but not at
a conceptual one. To do so, we have removed those classes
that contained the term “Wikimedia” in their labels, since
they are used to organize Wikimedia content but do not
represent classes at a conceptual level.
The final subset is composed of 89 classes and 9.3 million instances, around 10% of the total number of entities in Wikidata. Although it contains just 10% of the entities, it accounts for around 35% of the total size of Wikidata. This can be explained by
⁵ Source: https://www.wikidata.org/wiki/Wikidata:Statistics
⁶ The class-pointer is used by ClassRank to fetch those entities from Wikidata that are classes.
⁷ https://github.com/athalhammer/danker
⁸ These dumps can be accessed at https://danker.s3.amazonaws.com/index.html
Table 1: Top 20 most important classes based on their ClassRank score

Name | ClassRank score | Number of instances
human | 2,167,439 | 3,873,812
Wikimedia category | 1,057,559 | 2,207,283
sovereign state | 837,883 | 203
taxon | 755,681 | 1,962,491
country | 746,208 | 193
point in time with respect to recurrent timeframe | 635,401 | 3,273
calendar year | 499,551 | 666
big city | 321,893 | 3,238
human settlement | 321,637 | 512,417
Wikimedia disambiguation page | 317,631 | 1,294,218
Wikimedia administration category | 302,196 | 12,146
Wikimedia list article | 287,605 | 301,370
language | 284,433 | 8,894
modern language | 257,324 | 6,875
city | 247,022 | 8,650
academic discipline | 204,426 | 1,603
time zone named for a UTC offset | 170,254 | 72
metacategory in Wikimedia projects | 167,871 | 2,545
republic | 165,881 | 78
capital | 161,384 | 388
taxonomic rank | 160,822 | 67
the fact that the most important entities have, in general, more content introduced by the community than entities of less important classes.
3.2. Data extraction
Wikidata periodically releases public dumps of its con-
tents⁹. We have selected the pages-meta-history dumps
to extract the edit history of every entity from our sub-
set, since these dumps are the only ones containing every
revision of each entity and not just their final content.
This dataset is composed of several XML files containing
metadata of every revision made to each entity, and also
a JSON blob with the complete content of the entity after
each revision. Our final dataset is built from the pages-
meta-history dumps from 2021-11-01.
Since working with the complete content of every en-
tity after each revision leads to a lot of redundant entity
information, we applied some preprocessing steps to reduce
the dataset size. Instead of storing the complete JSON content of each entity after every revision, we computed the diff between the JSON content in the previous revision (r_{t-1}) and the current one (r_t). These diffs are stored in the JSON Patch format¹⁰, therefore allowing the reconstruction of the entity contents after any revision. To obtain the complete JSON content of an entity at revision r_t we just need to apply the patches of every revision up to r_t. A simplified example of the decomposition of an entity into diffs is illustrated in Figure 1.
⁹ Available at https://dumps.wikimedia.org/wikidatawiki/
¹⁰ https://datatracker.ietf.org/doc/html/rfc6902
Wikidata revision data:

r1: {"name": "Alice", "friends": ["Bob", "Carol"], "height": 1.78}
r2: {"name": "Alice", "friends": ["Bob"], "height": 1.80}
r3: {"name": "Alice", "friends": ["Bob", "Carol"], "height": 1.80, "weight": 80}

Diff-based revision data:

r1: [{"op": "add", "path": "/name", "value": "Alice"},
     {"op": "add", "path": "/height", "value": 1.78},
     {"op": "add", "path": "/friends", "value": ["Bob", "Carol"]}]
r2: [{"op": "replace", "path": "/height", "value": 1.8},
     {"op": "remove", "path": "/friends/1"}]
r3: [{"op": "add", "path": "/weight", "value": 80},
     {"op": "add", "path": "/friends/1", "value": "Carol"}]

Figure 1: Example of revision decompositions into JSON Patch. At the top of the figure we show the simplified content of a Wikidata entity at each revision r. At the bottom of the figure we show the decomposition of this entity into diffs in JSON format.
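The diff-and-reconstruct workflow described above can be illustrated with an off-the-shelf JSON Patch implementation. The sketch below uses the Python jsonpatch package on the simplified revisions of Figure 1; it is only an illustration of the idea, not necessarily the tooling used to build our dataset.

import jsonpatch  # pip install jsonpatch

# Simplified revision contents from Figure 1.
r1 = {"name": "Alice", "friends": ["Bob", "Carol"], "height": 1.78}
r2 = {"name": "Alice", "friends": ["Bob"], "height": 1.80}
r3 = {"name": "Alice", "friends": ["Bob", "Carol"], "height": 1.80, "weight": 80}

# Store one JSON Patch per revision instead of the full entity content.
revisions = [r1, r2, r3]
patches = [jsonpatch.make_patch(prev, curr)
           for prev, curr in zip([{}] + revisions[:-1], revisions)]

def content_at(patches, t):
    """Rebuild the entity content at revision t by applying the first t patches."""
    doc = {}
    for patch in patches[:t]:
        doc = patch.apply(doc)
    return doc

assert content_at(patches, 2) == r2
assert content_at(patches, 3) == r3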