
Performing live time-traversal queries via SPARQL on RDF datasets
Arcangelo Massari1 and Silvio Peroni1,2
1 Research Centre for Open Scholarly Metadata, Department of Classical Philology and
Italian Studies, University of Bologna, Bologna, Italy
2 Digital Humanities Advanced Research Centre (/DH.arc), Department of Classical
Philology and Italian Studies, University of Bologna, Bologna, Italy
arcangelo.massari@unibo.it
silvio.peroni@unibo.it
Abstract
This article introduces a methodology to perform live time-traversal SPARQL queries on
RDF datasets and software based on this methodology that offers a solution to manage the
provenance and change-tracking of entities described using RDF. These are crucial factors
in ensuring verifiability and trust. Nevertheless, some of the most prominent knowledge
bases – including DBpedia, Wikidata, Yago, and the Dynamic Linked Data Observatory – do
not support time-agnostic queries, i.e., queries across different snapshots together with
provenance information. The OpenCitations Data Model (OCDM) describes one possible way
to track provenance and entities’ changes in RDF datasets, and it allows restoring an
entity to a specific status in time (i.e., a snapshot) by applying SPARQL update queries.
The methodology and library presented in this article are based on the rationale
introduced in the OCDM. We also developed benchmarks proving that such a procedure is
efficient for specific queries and less efficient for others. To the best of our
knowledge, our library is the only one to support all the time-related retrieval
functionalities live, i.e., enabling real-time searches and updates. Moreover, since OCDM
complies with standard RDF, queries are expressed via standard SPARQL.
Keywords
semantic web, query processing, trust
1 Introduction
Data reliability is based on provenance: who produced information, when, and the primary
source. Such provenance information is essential because the truth value of an assertion
on the Web is never absolute, as claimed by Wikipedia, which in its policy on the subject
states: “the threshold for inclusion in Wikipedia is verifiability, not truth”
(Garfinkel, 2008). The Semantic Web reinforces this aspect since each application
processing information must evaluate trustworthiness by probing the statements’ context
(i.e., the provenance) (Koivunen & Miller, 2001).
Moreover, data changes over time, for either the natural evolution of concepts or
the correction of mistakes. Indeed, the latest version of knowledge may not be the most
accurate. Such phenomena are particularly tangible in the Web of Data, as highlighted
in a study by the Dynamic Linked Data Observatory, which noted the modification
of about 38% of the nearly 90,000 RDF documents monitored for 29 weeks and the
permanent disappearance of 5% (Käfer et al., 2013).
Notwithstanding these premises, the most extensive RDF datasets to date – DBpedia,
Wikidata, Yago, and the Dynamic Linked Data Observatory – either do not use RDF to
track changes or do not provide provenance information at the entity level (Dooley &
Božić, 2019; Orlandi & Passant, 2011; Project, 2021; Umbrich et al., 2010). Therefore,
they do not allow SPARQL time-traversal queries on previous statuses of their entities
together with provenance information. For instance, Wikidata allows SPARQL queries on
entities temporally annotated via its proprietary RDF extension but does not allow
queries on change-tracking data.
The main reason behind this phenomenon is that the founding technologies of the
Semantic Web – namely SPARQL, OWL, and RDF – did not initially provide an effective
mechanism to annotate statements with metadata information. This shortcoming led to
the introduction of numerous metadata representation models, none of which succeeded
in establishing itself over the others and becoming a widely accepted standard to track
both provenance and changes to RDF entities (Berners-Lee, 2005; Board, 2020; Caplan,
2017; Carroll et al., 2005; Ciccarese et al., 2008; Damiani et al., 2019; da Silva et
al., 2006; Dividino et al., 2009; Flouris et al., 2009; Hartig & Thompson, 2019;
Hoffart et al., 2013; Lebo et al., 2013; Moreau et al., 2011; Nguyen et al., 2014;
Pediaditis et al., 2009; Sahoo et al., 2010; Sahoo & Sheth, 2009; Suchanek et al., 2019;
Zimmermann et al., 2012).
In the past, some software was developed to perform time-traversal queries on
RDF datasets, enabling the reconstruction of the status of a particular entity at a
given time. However, as far as we know, all existing solutions need to preprocess
and index RDF data to work efficiently (Cerdeira-Pena et al., 2016; Im et al., 2012;
Neumann & Weikum, 2010; Pellissier Tanon & Suchanek, 2019; Taelman et al., 2019).
This requirement is impractical for linked open datasets that constantly receive many
updates, such as Wikidata. For example, “Ostrich requires 22 hours to ingest
revision 9 of DBpedia (2.43M added and 2.46M deleted triples)” (Pelgrin et al., 2021).
Conversely, software operating on the fly either does not support all query types (Noy
& Musen, 2002), or supports them non-generically by imposing a custom database
(Graube et al., 2016) or a specific triplestore (Arndt et al., 2019; Sande et al., 2013).
This work introduces a methodology and a Python library enabling all the
time-related retrieval functionalities identified by Fernández et al. (2016) live, i.e.,
allowing real-time queries and updates without preprocessing the data. Moreover, data can
be stored on any RDF-compliant storage system (e.g., RDF-serialized textual files
and triplestores) when the provenance and data changes are tracked according to the
OpenCitations Data Model (Daquino et al., 2020).
The rest of the paper is organized as follows. Section 2 reviews the literature on
metadata representation models, retrieval functionalities, and archiving policies for
dynamic linked data. Section 3 showcases the methodology underlying the
time-agnostic-library implementation, and Section 4 discusses the final product from a
quantitative point of view, reporting the benchmark results on execution times and memory.
2 Related works
This section reviews related metadata representation models (Section 2.1) before
delving into query typologies, query languages (Section 2.2), and existing methodologies
for performing such queries (Section 2.3).
2.1 Representing dynamic linked data
The landscape of strategies to formally represent provenance in RDF is vast and
fragmented (Sikos & Philp, 2020).
To date, the only W3C standard syntax for annotating triples’ provenance is RDF
reification (Manola & Miller, 2004), which is also the only one to be
backward-compatible with all RDF-based systems. However, there are several proposals
to deprecate this syntax (Beckett, 2010) due to its poor scalability.
Different approaches have been proposed since 2005, and four categories of solutions
can be identified:
• Encapsulating provenance in RDF triples: n-ary relations (W3C, 2006), PaCE
(Sahoo et al., 2010), and singleton properties (Nguyen et al., 2014).
• Associating provenance to the triple through RDF quadruples: named graphs
(Carroll et al., 2005), RDF/S graphsets (Pediaditis et al., 2009), RDF triple
coloring (Flouris et al., 2009), and nanopublications (Groth et al., 2010).
• Extending the RDF data model: Notation 3 Logic (Berners-Lee, 2005), RDF+
(Dividino et al., 2009), SPOTL(X) (Hoffart et al., 2013), annotated RDF (aRDF)
(Udrea et al., 2010; Zimmermann et al., 2012), and RDF* (Hartig & Thompson, 2019).
• Using ontologies: Proof Markup Language (da Silva et al., 2006), SWAN Ontology
(Ciccarese et al., 2008), Provenir Ontology (Sahoo & Sheth, 2009), Provenance
Ontology (Gil et al., 2010), Open Provenance Model (Moreau et al., 2011),
PREMIS (Caplan, 2017), Dublin Core Metadata Terms (Board, 2020), and the
OpenCitations Data Model (Daquino et al., 2020).
For a complete analysis and comparison, refer to Sikos & Philp (2020). In this
context, it is important to stress that most of these solutions do not comply with
RDF 1.1 (i.e., RDF/S graphsets, N3Logic, aRDF, RDF+, SPOTL(X), and RDF*),
are domain-specific (i.e., Provenir, SWAN, and PREMIS ontologies), rely on blank
nodes (n-ary relations), or suffer from scalability issues (singleton properties, PaCE).
Although RDF* is incompatible with RDF 1.1, it is worth mentioning that a W3C working
group has recently published the first draft to make it a standard (Gschwend
& Lassila, 2022).
To date, named graphs (Carroll et al., 2005) and the Provenance Ontology (Moreau
& Missier, 2013) are the most adopted approaches to attach provenance metadata to
RDF triples. On the one hand, named graphs are widespread because they are
compliant with RDF 1.1, can be queried with SPARQL 1.1, are scalable, and have
several serialization formats (i.e., TriX, TriG, and N-Quads). On the other hand,
the Provenance Ontology was published by the Provenance Working Group as a W3C
Recommendation in 2013, meeting all the requirements for provenance on the Web
and collecting existing ontologies into a single general model.
The OpenCitations Data Model (Daquino et al., 2020) represents provenance and
tracks changes in a way that complies with RDF 1.1 and relies on well-known and widely
adopted standards, namely PROV-O, named graphs, and Dublin Core, as will be detailed in
Section 3.
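To make the combination of named graphs and PROV-O more concrete, the following minimal sketch serializes a handful of provenance statements as N-Quads, where the fourth term names the graph that holds them. The example.org IRIs, the snapshot naming pattern, and the helper nquad are illustrative assumptions and are not taken from the OpenCitations Data Model; only the PROV-O and Dublin Core property IRIs are real vocabulary terms.

```python
# Illustrative sketch: every example.org IRI and the helper name are assumptions.
PROV = "http://www.w3.org/ns/prov#"
DCTERMS = "http://purl.org/dc/terms/"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"


def nquad(subject, predicate, obj, graph):
    """Serialize one statement as an N-Quads line.

    IRIs are wrapped in angle brackets; literals must be passed already quoted.
    """
    obj_term = obj if obj.startswith('"') else f"<{obj}>"
    return f"<{subject}> <{predicate}> {obj_term} <{graph}> ."


entity = "https://example.org/br/1"              # hypothetical described entity
snapshot = "https://example.org/br/1/prov/se/1"  # hypothetical snapshot of it
prov_graph = "https://example.org/br/1/prov/"    # named graph with provenance

quads = [
    # The snapshot describes one state of the entity.
    nquad(snapshot, PROV + "specializationOf", entity, prov_graph),
    # When that state was recorded.
    nquad(snapshot, PROV + "generatedAtTime",
          f'"2014-01-01T00:00:00Z"^^<{XSD_DATETIME}>', prov_graph),
    # Who is responsible for it.
    nquad(snapshot, PROV + "wasAttributedTo",
          "https://example.org/agent/1", prov_graph),
    # A human-readable note, using Dublin Core.
    nquad(snapshot, DCTERMS + "description",
          '"The entity has been created."', prov_graph),
]

print("\n".join(quads))
```

Because the output is plain RDF 1.1, it can in principle be loaded into any quad-aware store and queried with a standard SPARQL GRAPH pattern.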
2.2 Querying dynamic linked data
Fernández, Polleres, and Umbrich (2016) provided two classifications of time-agnostic
queries, a low-level one relating to “query atoms” and a high-level one about “retrieval
needs”. In this article, we use the high-level classification, which explicitly covers
the queries that reconstruct a full version of an entity, an entire delta, and multiple
or all deltas, without the need to derive them by composing multiple query atoms.
Before detailing such queries, it is necessary to define what an entity, a time-aware
dataset, and a version are.
Definition 1 (Entity). An entity $E$ is the set of RDF triples $(s, p, o)$ having the
same subject $s$.
Definition 2 (Time-aware dataset). A version-annotated entity is an entity $E$
annotated with a label $i$ representing the version in which this entity holds, denoted
by the notation $E_i$, where $i \in \mathbb{N}$. A time-aware dataset $A$ is a set of
version-annotated entities.
Definition 3 (Version). A version of a time-aware dataset $A$ at snapshot $i$ is the
RDF graph $A_i = \{E \mid E_i \in A\}$.
In the query definitions, the evaluation of a SPARQL query $Q$ on a graph $G$
produces a bag of solution mappings $[[Q]]_G$.
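Definitions 1 to 3 can be mirrored with plain Python sets, which may help when reading the query types that follow. This is only a sketch under simplifying assumptions: triples are tuples of strings, the prefixed names are invented, and a version is flattened into a single set of triples rather than kept as a set of entities.

```python
# Toy triples; all prefixed names are illustrative assumptions.
T1 = ("ex:shotton", "foaf:name", "David Shotton")
T2 = ("ex:shotton", "ex:authorOf", "ex:paper1")
T3 = ("ex:shotton", "ex:authorOf", "ex:paper2")
T4 = ("ex:paper1", "dcterms:title", "A title")


def entity(triples, subject):
    """Definition 1: the set of triples sharing the same subject."""
    return frozenset(t for t in triples if t[0] == subject)


# Definition 2: a time-aware dataset is a set of version-annotated entities,
# modeled here as (entity, version label) pairs.
snapshot_1 = {T1, T2, T4}
snapshot_2 = {T1, T2, T3, T4}
dataset = {
    (entity(snapshot_1, "ex:shotton"), 1),
    (entity(snapshot_1, "ex:paper1"), 1),
    (entity(snapshot_2, "ex:shotton"), 2),
    (entity(snapshot_2, "ex:paper1"), 2),
}


def version(dataset, i):
    """Definition 3: the graph formed by every entity labeled with version i."""
    graph = set()
    for triples, label in dataset:
        if label == i:
            graph |= triples
    return graph


print(sorted(version(dataset, 2)))
```

Under this reading, version materialization of an entity amounts to selecting the pair whose label matches the requested version.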
Version materialization (VM) retrieves the full version of a specific entity.
Formally: $VM(E, i) = E_i$. For example, “Get the 2014 snapshot of the entity
representing David Shotton”.
Single-version structured query (SV) retrieves the results of a SPARQL query
targeted at a specific version. Formally: $SV(Q, V_i) = [[Q]]_{V_i}$. For example,
“Which of David Shotton’s papers were featured in the dataset in 2014?”.
Cross-version structured query (CV), also called time-traversal query, retrieves
the results of a SPARQL query targeted at multiple versions. Formally:
$CV(Q, V_i, V_j) = SV(Q, V_i) \bowtie SV(Q, V_j)$. For example, “Which of David
Shotton’s papers were featured in the dataset in 2013 and in 2014?”.
Delta materialization (DM) retrieves the differences of a specific entity between
two consecutive versions. Formally: $DM(E, V_i) = (\Delta^+, \Delta^-)$, with
$\Delta^+ = E_i \setminus E_j$, $\Delta^- = E_j \setminus E_i$, and
$i, j \in \mathbb{N}$, $i > j$, $\nexists k \in \mathbb{N} : j < k < i$. For example,
“What data changed about the entity representing David Shotton in 2014?”.
Single-delta structured query (SD) retrieves the change-sets of a SPARQL
query’s results between one pair of consecutive versions. Formally:
$SD(Q, V_i, V_j) = (\Delta^+, \Delta^-)$, with
$\Delta^+ = [[Q]]_{V_i} \setminus [[Q]]_{V_j}$,
$\Delta^- = [[Q]]_{V_j} \setminus [[Q]]_{V_i}$, and $i, j \in \mathbb{N}$, $i > j$,
$\nexists k \in \mathbb{N} : j < k < i$. For example, “Which of David Shotton’s papers
were featured in the dataset in 2014 but not in 2013?”.
Cross-delta structured query (CD) retrieves the change-sets of a SPARQL
query’s results between more than one pair of consecutive versions. Formally:
$CD(Q, V_i, V_j, V_m) = SD(Q, V_i, V_j) \bowtie SD(Q, V_j, V_m)$. For example, “When
were articles by David Shotton added to or removed from the collection?”.
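The set differences behind DM and SD can be sketched in a few lines of Python. The sketch makes two simplifying assumptions: query results are modeled as sets rather than bags of solution mappings, and the SPARQL query is replaced by a hypothetical Python function, papers_by, that filters triples. The data and prefixed names are invented for illustration.

```python
def delta(newer, older):
    """Return (added, removed) between two consecutive versions."""
    return newer - older, older - newer


def papers_by(author):
    """Toy stand-in for a SPARQL query Q: maps a graph to a set of results."""
    def query(graph):
        return {o for (s, p, o) in graph if s == author and p == "ex:authorOf"}
    return query


# Two consecutive versions of the same entity (illustrative data).
v2013 = {
    ("ex:shotton", "ex:authorOf", "ex:paper1"),
    ("ex:shotton", "ex:authorOf", "ex:paper2"),
}
v2014 = {
    ("ex:shotton", "ex:authorOf", "ex:paper2"),
    ("ex:shotton", "ex:authorOf", "ex:paper3"),
}

# Delta materialization: differences between the triples themselves.
dm_added, dm_removed = delta(v2014, v2013)

# Single-delta structured query: differences between the query results.
q = papers_by("ex:shotton")
sd_added, sd_removed = delta(q(v2014), q(v2013))

print(sd_added, sd_removed)
```

Here sd_added answers the example question about papers featured in 2014 but not in 2013, while sd_removed gives the opposite direction.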
Extensions of SPARQL exist to support queries on time-aware RDF datasets. However,
they either require using non-standard languages to map data, such as τ-SPARQL
(Tappolet & Bernstein, 2009), T-SPARQL (Grandi, 2010), and AnQL (Zimmermann
et al., 2012), or only work on a purpose-built database, i.e., SPARQLT on the
RDF-TX system (Zaniolo et al., 2018). This article proposes a methodology to support
all query types on any triplestore in standard SPARQL.
In this direction, SPARQ-LTL (Fionda et al., 2016) proposes a relevant approach:
it extends SPARQL but also describes an algorithm for rewriting queries in standard
SPARQL, provided that all triples are annotated with revision numbers and the
revisions are accessible as named graphs. However, to the best of our knowledge, this
strategy has no implementations.
2.3 Storing dynamic linked open data
This section reviews existing storage and querying methodologies, focusing on
supported queries, real-time operation, and generality. We consider a model generic
if it complies with standard RDF and can be queried via standard SPARQL on any
RDF-compatible storage system.
Various archiving policies have been elaborated to store and query the evolution
of RDF datasets, namely independent copies, change-based, timestamp-based, and
fragment-based policies (Pelgrin et al., 2021).
Independent copies consist of storing each version separately. It is the most
straightforward model to implement and allows performing VM, SV, and CV easily.
However, this approach needs a massive amount of space for storing and time for
processing. Furthermore, given the different statements’ versions, further diff
mechanisms are required to identify what changed. Nevertheless, to date, this is the
archiving policy adopted by most systems and knowledge bases, such as DBpedia (Lehmann
et al., 2015), Wikidata (Dooley & Božić, 2019; Erxleben et al., 2014;
“Wikidata:Database download”, 2021), and YAGO (Project, 2021).
The first version control system for RDF was SemVersion (Völkel et al., 2005),
specially tailored for ontologies. It saves each version of an ontology in a separate
snapshot, and differences are calculated on the fly. SemVersion supports VM, SV, DM,
and SD but not via SPARQL, because SPARQL became a W3C Recommendation in
2008 and SemVersion has not been updated since 2005.
The change-based policy was introduced to solve scalability problems caused by the
independent copies approach. It consists of saving only the deltas between one version
and the next. For this reason, DM is costless. The drawback is that additional
computational costs for delta propagation are required to support version-focused
queries.
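The trade-off of the change-based policy can be illustrated with a minimal sketch: a delta is read back with a simple lookup, whereas a full version has to be rebuilt by propagating deltas. The sketch assumes that an initial version plus forward deltas are stored and that versions are numbered with integers; the names Delta, materialize, and delta_materialization are hypothetical and do not describe any of the systems reviewed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delta:
    """Changes that turn version `version - 1` into version `version`."""
    version: int
    added: frozenset
    removed: frozenset


def delta_materialization(deltas, version):
    """DM is a lookup: the stored delta is returned as-is."""
    for d in deltas:
        if d.version == version:
            return d.added, d.removed
    raise KeyError(version)


def materialize(base, deltas, target):
    """VM needs delta propagation: replay every delta up to `target`."""
    state = set(base)
    for d in sorted(deltas, key=lambda d: d.version):
        if d.version > target:
            break
        state -= d.removed
        state |= d.added
    return state


A = ("ex:s", "ex:p", "a")
B = ("ex:s", "ex:p", "b")
C = ("ex:s", "ex:p", "c")

base = {A}  # version 0
deltas = [
    Delta(1, added=frozenset({B}), removed=frozenset()),
    Delta(2, added=frozenset({C}), removed=frozenset({A})),
]

print(materialize(base, deltas, 2))
```

The cost of materialize grows with the number of deltas to replay, which is the propagation overhead mentioned above.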
The first proposal of this approach relied on an RDBMS to store the original dataset
and the deltas between two consecutive versions (Im et al., 2012). To improve
performance, deltas are pre-processed, and duplicated or unnecessary modifications are
deleted. There is no support for SPARQL, and queries must be formulated in SQL.
A concrete implementation of a change-based policy is R&Wbase, a version control
system inspired by Git but designed for RDF (Sande et al., 2013). Additions and
deletions are stored in separate named graphs, and all queries are supported. However,
this model is not fully semantic, since it requires hash tables to map revisions to
change-sets. In addition, it is not triplestore-agnostic, as it supports only Fuseki and
Virtuoso.
R43ples is inspired by R&Wbase and improves on it by adopting a fully semantic
model (Graube et al., 2016). This model, called the Revision Management Ontology,
records change-sets and the related provenance metadata in separate graphs using PROV-O
and some new properties (e.g., rmo:deltaAdded and rmo:deltaRemoved). R43ples
acts as a proxy between the data triplestore and the provenance triplestore. However,
R43ples cannot be considered a generic solution, as it extends SPARQL with
some keywords to simplify the queries (e.g., REVISION, TAG, MERGE), and the current
implementation mandates using Jena TDB as the provenance triplestore.
The timestamp-based policy annotates each triple with its transaction time, that
is, the timestamp of the version in which that statement was in the dataset.
x-RDF-3X is a database for RDF designed to manage high-frequency online updates,
versioning, time-traversal queries, and transactions (Neumann & Weikum, 2010).
The triples are never deleted but are annotated with two fields: the insertion and
deletion timestamps, where the latter is set to zero for versions that are still live.
Updates are first saved in a separate workspace and then merged into various indexes
at occasional savepoints. x-RDF-3X supports VM and SV queries.
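The timestamp-based idea of never deleting a triple, and instead recording when it was inserted and deleted, can be sketched as follows. This is not x-RDF-3X's actual storage layout: the StampedTriple type, the integer timestamps, and the version_at helper are assumptions made for illustration, keeping only the convention that a deletion timestamp of zero marks a live triple.

```python
from typing import NamedTuple


class StampedTriple(NamedTuple):
    s: str
    p: str
    o: str
    inserted: int  # timestamp at which the triple entered the dataset
    deleted: int   # timestamp at which it left; 0 means it is still live


def version_at(store, t):
    """Return the triples that were in the dataset at time t."""
    return {
        (x.s, x.p, x.o)
        for x in store
        if x.inserted <= t and (x.deleted == 0 or t < x.deleted)
    }


store = [
    StampedTriple("ex:s", "ex:p", "a", inserted=1, deleted=3),
    StampedTriple("ex:s", "ex:p", "b", inserted=2, deleted=0),
]

for t in range(4):
    print(t, sorted(version_at(store, t)))
```

Materializing a version becomes a filter over validity intervals, at the price of keeping every triple ever asserted.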
v-RDFCSA uses a similar strategy but excels in reducing space requirements, com-