
A revision also contains additional metadata, including its timestamp, the author of the revision, tags, and a description. Tags are usually used to indicate the device from which the revision was made (or the tool that made the edit, if it was an automated process) and also to indicate the cause of the edit⁴ (e.g., to revert vandalism).
In the context of this paper we will use the term operation to refer to a single modification made to an entity's element in a revision. One revision may be composed of 1 to n operations. An operation may represent the addition or removal of a single element from the entity. For the sake of simplicity, we will also consider replacements as operations: a replacement is a combination of an addition and a removal operation on an entity.
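To make this notion concrete, the following Python sketch (the class and function names are ours, purely illustrative) models operations over an entity, with a replacement expressed as a removal plus an addition:

from dataclasses import dataclass
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

@dataclass(frozen=True)
class Operation:
    kind: str       # "add" or "remove"
    triple: Triple

def replacement(old: Triple, new: Triple) -> List[Operation]:
    # A replacement is simply a removal of the old element
    # followed by an addition of the new one.
    return [Operation("remove", old), Operation("add", new)]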
Wikidata allows restoring and undoing revisions made to an entity. Restoring allows users to undo all the edits made to an entity up to the selected restoration state. Undoing is more versatile, since it undoes from 1 to n edits selected by the user, which do not need to be consecutive edits as in the restoration process. In neither case are the undone revisions removed from the revision history of the entity. Instead, a new revision is created which includes the operations necessary to undo the selected revisions.
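Using the Operation model sketched above, this can be pictured as follows (a toy illustration of the semantics, not Wikidata's actual implementation): the inverse of each operation in the selected revisions is computed and bundled into a single new revision.

def invert(op: Operation) -> Operation:
    # The inverse of an addition is a removal, and vice versa.
    return Operation("remove" if op.kind == "add" else "add", op.triple)

def undo(revisions: List[List[Operation]]) -> List[Operation]:
    # Build the operations of the new revision that undoes the
    # selected revisions, reverting the most recent ones first.
    return [invert(op) for rev in reversed(revisions) for op in rev]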
Edits can be manually removed by Wikidata administrators under specific circumstances. These include revisions that contain private information, a violation of copyright, or personal attacks of a serious nature.
2.3. Formal definitions
We will now formally introduce the main elements used throughout this paper. Let I, L and B be disjoint countably infinite sets of IRIs, literals and blank nodes, respectively. A knowledge graph can be formally defined from a static point of view as a set of triples (s, p, o) ∈ (I ∪ B) × I × (I ∪ L ∪ B).
From a dynamic point of view, a knowledge graph is built from a sequence of operations Op = {opⱼ : 1 ≤ j ≤ n}. Each operation opⱼ is composed of a triple t = (s, p, o) ∈ (I ∪ B) × I × (I ∪ L ∪ B). Op⁺ = {t₁, t₂, ..., tₙ} represents the set of addition operations of the graph, while Op⁻ = {t′₁, t′₂, ..., t′ₘ} represents the set of removal operations. The set of all operations is therefore defined as Op = Op⁺ ∪ Op⁻. A knowledge graph is built out of n operations, with Kᵢ representing the state of the graph after applying all operations up to operation i. Applying an addition operation op⁺ᵢ₊₁ = (s, p, o) to a graph Kᵢ results in the graph Kᵢ₊₁ = Kᵢ ∪ {(s, p, o)}. On the other hand, applying a removal operation op⁻ᵢ₊₁ = (s, p, o) to a graph Kᵢ results in the graph Kᵢ₊₁ = Kᵢ \ {(s, p, o)}. The final state of a knowledge graph can be obtained by the successive application of all its operations.
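These definitions translate directly into a small executable sketch over an in-memory set of triples (again using the illustrative Operation type introduced earlier; the function names are ours):

def apply(graph: set, op: Operation) -> set:
    # K_{i+1} = K_i ∪ {t} for an addition, K_i \ {t} for a removal.
    return graph | {op.triple} if op.kind == "add" else graph - {op.triple}

def replay(ops: List[Operation]) -> set:
    # The final state K_n is obtained by the successive application
    # of all operations, starting from the empty graph.
    graph: set = set()
    for op in ops:
        graph = apply(graph, op)
    return graph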
⁴A list of the most common tags can be accessed at https://www.wikidata.org/wiki/Special:Tags
3. Extracting edit history data from Wikidata
We now present the approach followed to extract the
edit history information from Wikidata to conduct our ex-
periments.
3.1. Subset selection
Wikidata was composed of 97,795,169 entities at the time of this writing, with more than 1,640,933,943 revisions in total made by users.⁵ Given that working with the entire Wikidata revision history could be too computationally expensive to validate our proposal, we extracted a subset to conduct our experiments.
This subset is composed of instances of the most important Wikidata classes. To that end, we computed the ClassRank [12] score of every class in Wikidata and chose the 100 classes with the highest scores. Then, we extracted the edit history information of every entity that is an instance of any of those classes. We preferred choosing the most important classes over producing a random sample since, in general, entities belonging to central classes receive more attention from the community and are therefore more promising for exploiting their edit history information.
To run ClassRank we defined the P31 property (instance of) of Wikidata as the class-pointer.⁶ The ClassRank score of a class is computed by aggregating the PageRank [13] scores of all its instances. However, since computing the PageRank score of every entity in Wikidata was too computationally expensive, we used a set of pre-computed PageRank scores. These scores were obtained using the Danker⁷ tool, which periodically computes the PageRank score of every existing entity in Wikipedia [14]. The scores are then mapped from Wikipedia pages to their respective Wikidata entities, yielding an approximation of their PageRank values.⁸
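Conceptually, the aggregation step can be sketched as follows (a simplification of what the ClassRank tool [12] actually does; the dictionaries stand for the data we obtained from Wikidata and Danker):

from collections import defaultdict
from typing import Dict, List

def classrank(instance_of: Dict[str, List[str]],
              pagerank: Dict[str, float]) -> Dict[str, float]:
    # instance_of maps each entity to the classes it is a P31 instance of;
    # pagerank holds the pre-computed PageRank score of each entity.
    scores: Dict[str, float] = defaultdict(float)
    for entity, classes in instance_of.items():
        for cls in classes:
            scores[cls] += pagerank.get(entity, 0.0)
    return dict(scores)

The 100 classes with the highest aggregated scores are then the candidates for our subset.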
The 20 most important classes based on their ClassRank scores can be seen in Table 1. These results were then filtered manually, since some of those entities could be considered Wikidata classes at the ontological level, but not at a conceptual one. To do so, we removed those classes whose labels contain the term "Wikimedia", since they are used to organize Wikimedia content but do not represent classes at a conceptual level.
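This manual filter is essentially a label check, which could be expressed as follows (a sketch; English labels are assumed):

def filter_conceptual(class_labels: Dict[str, str]) -> Dict[str, str]:
    # Drop classes whose label mentions "Wikimedia": they organize
    # Wikimedia content rather than model a conceptual class.
    return {qid: label for qid, label in class_labels.items()
            if "Wikimedia" not in label}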
The final subset is composed of 89 classes and 9.3 million instances, around 10% of the total number of entities in Wikidata. Although this subset contains just 10% of the entities in Wikidata, its size is around 35% of the total size of Wikidata. This can be explained due to
⁵Source: https://www.wikidata.org/wiki/Wikidata:Statistics
⁶The class-pointer is used by ClassRank to fetch those entities from Wikidata that are classes.
⁷https://github.com/athalhammer/danker
⁸These dumps can be accessed at https://danker.s3.amazonaws.com/index.html