Performing live time-traversal queries via SPARQL on RDF datasets

Arcangelo Massari∗1 and Silvio Peroni†1,2

1 Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy

2 Digital Humanities Advanced Research Centre (/DH.arc), Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy

Abstract

This article introduces a methodology to perform live time-traversal SPARQL queries on RDF datasets, together with software based on this methodology that offers a solution to manage the provenance and change-tracking of entities described using RDF. These are crucial factors in ensuring verifiability and trust. Nevertheless, some of the most prominent knowledge bases (including DBpedia, Wikidata, YAGO, and the Dynamic Linked Data Observatory) do not support time-agnostic queries, i.e., queries across different snapshots together with provenance information. The OpenCitations Data Model (OCDM) describes one possible way to track provenance and entities’ changes in RDF datasets, and it allows restoring an entity to a specific status in time (i.e., a snapshot) by applying SPARQL update queries. The methodology and library presented in this article are based on the rationale introduced in the OCDM. We also developed benchmarks showing that such a procedure is efficient for specific queries and less efficient for others. To the best of our knowledge, our library is the only one to support all the time-related retrieval functionalities live, i.e., enabling real-time searches and updates. Moreover, since OCDM complies with standard RDF, queries are expressed via standard SPARQL.

Keywords: semantic web, query processing, trust

1 Introduction

Data reliability rests on provenance: who produced a piece of information, when, and from which primary source. Such provenance information is essential because the truth value of an assertion on the Web is never absolute, as acknowledged by Wikipedia, whose policy on the subject states: “the threshold for inclusion in Wikipedia is verifiability, not truth” (Garfinkel, 2008). The Semantic Web reinforces this aspect since each application processing information must evaluate trustworthiness by probing the statements’ context (i.e., the provenance) (Koivunen & Miller, 2001).

∗ arcangelo.massari@unibo.it

† silvio.peroni@unibo.it

Moreover, data changes over time, owing either to the natural evolution of concepts or to the correction of mistakes. Indeed, the latest version of knowledge may not be the most accurate. Such phenomena are particularly tangible in the Web of Data, as highlighted in a study by the Dynamic Linked Data Observatory, which noted the modification of about 38% of the nearly 90,000 RDF documents monitored for 29 weeks and the permanent disappearance of 5% (Käfer et al., 2013).

Notwithstanding these premises, the most extensive RDF datasets to date (DBpedia, Wikidata, YAGO, and the Dynamic Linked Data Observatory) either do not use RDF to track changes or do not provide provenance information at the entity level (Dooley & Božić, 2019; Orlandi & Passant, 2011; Project, 2021; Umbrich et al., 2010). Therefore, they do not allow SPARQL time-traversal queries on previous statuses of their entities together with provenance information. For instance, Wikidata allows SPARQL queries on entities temporally annotated via its proprietary RDF extension but does not allow queries on change-tracking data.

The main reason behind this phenomenon is that the founding technologies of the Semantic Web (namely SPARQL, OWL, and RDF) did not initially provide an effective mechanism to annotate statements with metadata information. This shortcoming led to the introduction of numerous metadata representation models, none of which succeeded in establishing itself over the others and becoming a widely accepted standard to track both provenance and changes to RDF entities (Berners-Lee, 2005; Board, 2020; Caplan, 2017; Carroll et al., 2005; Ciccarese et al., 2008; Damiani et al., 2019; da Silva et al., 2006; Dividino et al., 2009; Flouris et al., 2009; Hartig & Thompson, 2019; Hoffart et al., 2013; Lebo et al., 2013; Moreau et al., 2011; Nguyen et al., 2014; Pediaditis et al., 2009; Sahoo et al., 2010; Sahoo & Sheth, 2009; Suchanek et al., 2019; Zimmermann et al., 2012).

In the past, some software was developed to perform time-traversal queries on RDF datasets, enabling the reconstruction of the status of a particular entity at a given time. However, as far as we know, all existing solutions need to preprocess and index RDF data to work efficiently (Cerdeira-Pena et al., 2016; Im et al., 2012; Neumann & Weikum, 2010; Pellissier Tanon & Suchanek, 2019; Taelman et al., 2019). This requirement is impractical for linked open datasets that constantly receive many updates, such as Wikidata. For example, “Ostrich requires ∼22 hours to ingest revision 9 of DBpedia (2.43M added and 2.46M deleted triples)” (Pelgrin et al., 2021). Conversely, software operating on the fly either does not support all query types (Noy & Musen, 2002), or supports them non-generically by imposing a custom database (Graube et al., 2016) or a specific triplestore (Arndt et al., 2019; Sande et al., 2013).

This work introduces a methodology and a Python library enabling all the time-related retrieval functionalities identified by Fernández et al. (2016) live, i.e., allowing real-time queries and updates without preprocessing the data. Moreover, data can be stored on any RDF-compliant storage system (e.g., RDF-serialized textual files and triplestores) when the provenance and data changes are tracked according to the OpenCitations Data Model (Daquino et al., 2020).

The rest of the paper is organized as follows. Section 2 reviews the literature on metadata representation models, retrieval functionalities, and archiving policies for dynamic linked data. Section 3 showcases the methodology underlying the time-agnostic-library implementation, and Section 4 discusses the final product from a quantitative point of view, reporting the benchmark results on execution times and memory.

2 Related works

This section reviews related metadata representation models (Section 2.1) before delving into query typologies, query languages (Section 2.2), and existing methodologies to perform such queries (Section 2.3).

2.1 Representing dynamic linked data

The landscape of strategies to formally represent provenance in RDF is vast and fragmented (Sikos & Philp, 2020).

To date, the only W3C standard syntax for annotating triples’ provenance is RDF reification (Manola & Miller, 2004), which is also the only one to be backward-compatible with all RDF-based systems. However, there are several deprecation proposals for this syntax (Beckett, 2010), due to its poor scalability.

Dierent approaches have been proposed since 2005, and four categories of solu-

tions can be identied:

•Encapsulating provenance in RDF triples: n-ary relations (

3C, 2006), PaCE

(Sahoo et al., 2010) and singleton properties (Nguyen et al., 2014)

•Associating provenance to the triple through RDF quadruples: named graphs

(Carroll et al., 2005), RDF/S graphsets (Pediaditis et al., 2009), RDF triple

coloring (Flouris et al., 2009), and nanopublications (Groth et al., 2010).

•Extending the RDF data model: Notation 3 Logic (Berners-Lee, 2005), RDF+

(Dividino et al., 2009),

SPOTL(X) (Hoart et al., 2013), annotated RDF (aRDF) (Udrea et al., 2010;

Zimmermann et al., 2012), and RDF* (Hartig & Thompson, 2019).

•Using ontologies: Proof Markup Language (da Silva et al., 2006), S

AN Ontol-

ogy (Ciccarese et al., 2008), Provenir Ontology (Sahoo & Sheth, 2009), Prove-

nance Ontology (Gil et al., 2010), Open Provenance Model (Moreau et al., 2011),

PREMIS (Caplan, 2017), Dublin Core Metadata Terms (Board, 2020), and the

OpenCitations Data Model (Daquino et al., 2020).

For a complete analysis and comparison, refer to Sikos & Philp (2020). In this context, it is important to stress that most of these solutions do not comply with RDF 1.1 (RDF/S graphsets, N3Logic, aRDF, RDF+, SPOTL(X), and RDF*), are domain-specific (the Provenir, SWAN, and PREMIS ontologies), rely on blank nodes (n-ary relations), or suffer from scalability issues (singleton properties, PaCE).

Although RDF* is incompatible with RDF 1.1, it is worth mentioning that a W3C working group has recently published a first draft aimed at making it a standard (Gschwend & Lassila, 2022).

To date, named graphs (Carroll et al., 2005) and the Provenance Ontology (Moreau & Missier, 2013) are the most adopted approaches to attach provenance metadata to RDF triples. On the one hand, named graphs are widespread because they are compliant with RDF 1.1, can be queried with SPARQL 1.1, are scalable, and have several serialization formats (i.e., TriX, TriG, and N-Quads). On the other hand, the Provenance Ontology was published by the Provenance Working Group as a W3C Recommendation in 2013, meeting all the requirements for provenance on the Web and collecting existing ontologies into a single general model.

The OpenCitations Data Model (Daquino et al., 2020) represents provenance and tracks changes in a way that complies with RDF 1.1 and relies on well-known and widely adopted standards, namely PROV-O, named graphs, and Dublin Core, as will be detailed in Section 3.

2.2 Querying dynamic linked data

Fernández, Polleres, and Umbrich (2016) provided two classifications of time-agnostic queries, a low-level one relating to “query atoms” and a high-level one about “retrieval needs”. In this article, we use the high-level classification, which explicitly names the queries that reconstruct a full version of an entity, an entire delta, and multiple or all deltas, without the need to derive them by composing multiple query atoms. Before detailing such queries, it is necessary to define what an entity, a time-aware dataset, and a version are.

Denition 1 (Entity).An entity Eis the set of RDF triples (s,p,o) having the same

subject s.

Denition 2 (Time-aware dataset).A version annotated entity is an entity E

annotated with a label irepresenting the version in which this entity holds, denoted by

the notation Ei, where i∈N. A time-aware dataset Ais a set of version-annotated

entities.

Denition 3 (Version).A version of a time-aware dataset Aat snapshot iis the

RDF graph Ai={E|Ei∈A}.
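To make the three definitions concrete, the following is a minimal sketch in plain Python. It assumes that a triple is a 3-tuple of strings, that a version-annotated entity is a pair (entity, label), and that the "RDF graph" of Definition 3 is obtained by merging the triples of all selected entities. The names `entity`, `version`, and the `ex:` identifiers are hypothetical and are not taken from the library described in this article.

```python
Triple = tuple[str, str, str]  # (subject, predicate, object)


def entity(graph: set[Triple], subject: str) -> frozenset[Triple]:
    """Definition 1: all triples of `graph` that share the given subject."""
    return frozenset(t for t in graph if t[0] == subject)


def version(dataset: set[tuple[frozenset[Triple], int]], i: int) -> set[Triple]:
    """Definition 3: merge every entity E whose annotated form E_i is in the dataset."""
    return {t for (e, label) in dataset if label == i for t in e}


# Two toy snapshots; every identifier and value here is invented.
g1 = {("ex:shotton", "ex:name", "D. Shotton"),
      ("ex:paper1", "ex:author", "ex:shotton")}
g2 = {("ex:shotton", "ex:name", "David Shotton"),
      ("ex:paper1", "ex:author", "ex:shotton")}

# Definition 2: a time-aware dataset is a set of version-annotated entities.
A = {(entity(g, s), i)
     for i, g in ((1, g1), (2, g2))
     for s in {t[0] for t in g}}

print(version(A, 1) == g1)
```

Under these assumptions, asking for version 1 or 2 of `A` gives back exactly the snapshot from which the annotated entities were built.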

In the query definitions, the evaluation of a SPARQL query $Q$ on a graph $G$ produces a bag of solution mappings $[[Q]]_G$.

Version materialization (VM) retrieves the full version of a specific entity. Formally: $VM(E, i) = E_i$. For example, “Get the 2014 snapshot of the entity representing David Shotton”.

Single-version structured query (SV) retrieves the results of a SPARQL query targeted at a specific version. Formally: $SV(Q, V_i) = [[Q]]_{V_i}$. For example, “Which of David Shotton’s papers were featured in the dataset in 2014?”.
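The two retrieval needs defined so far can be sketched in a few lines of Python. This is only an illustration: it assumes that every version is available as a full set of triples, and it replaces SPARQL evaluation with a toy matcher for a single triple pattern. The names `snapshots`, `vm`, `match`, and `sv`, as well as the `ex:` identifiers, are hypothetical.

```python
Triple = tuple[str, str, str]

# Toy snapshots keyed by version label; all data is invented.
snapshots: dict[int, set[Triple]] = {
    2013: {("ex:paper1", "ex:author", "ex:shotton")},
    2014: {("ex:paper1", "ex:author", "ex:shotton"),
           ("ex:paper2", "ex:author", "ex:shotton"),
           ("ex:shotton", "ex:name", "David Shotton")},
}


def vm(subject: str, i: int) -> set[Triple]:
    """Version materialization: the triples of one entity in version i."""
    return {t for t in snapshots[i] if t[0] == subject}


def match(pattern: Triple, graph: set[Triple]) -> list[dict[str, str]]:
    """Toy stand-in for query evaluation: terms starting with '?' are variables."""
    solutions = []
    for triple in graph:
        binding: dict[str, str] = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must be bound to the same value.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            solutions.append(binding)
    return solutions


def sv(pattern: Triple, i: int) -> list[dict[str, str]]:
    """Single-version structured query: evaluate the pattern on version i only."""
    return match(pattern, snapshots[i])


papers_2014 = sorted(b["?paper"] for b in sv(("?paper", "ex:author", "ex:shotton"), 2014))
print(papers_2014)
```

A real implementation would delegate `match` to a SPARQL engine; the point of the sketch is only that SV is ordinary query evaluation once the target version has been fixed.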

Cross-version structured query (CV), also called time-traversal query, retrieves the results of a SPARQL query targeted at multiple versions. Formally: $CV(Q, V_i, V_j) = SV(Q, V_i) \bowtie SV(Q, V_j)$. For example, “Which of David Shotton’s papers were featured in the dataset in 2013 and in 2014?”.
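The cross-version definition combines two single-version results through a join of solution mappings. A minimal sketch of that join follows, assuming that solution mappings are Python dictionaries from variable names to values and that two mappings are compatible when they agree on every shared variable. The helper names and the sample bindings are hypothetical.

```python
Mapping = dict[str, str]


def compatible(m1: Mapping, m2: Mapping) -> bool:
    """Two mappings are compatible if they agree on all shared variables."""
    return all(m2[var] == val for var, val in m1.items() if var in m2)


def join(left: list[Mapping], right: list[Mapping]) -> list[Mapping]:
    """Join of two bags of solution mappings: merge every compatible pair."""
    return [{**m1, **m2} for m1 in left for m2 in right if compatible(m1, m2)]


# Invented single-version results for the same query on two versions.
sv_2013 = [{"?paper": "ex:paper1"}]
sv_2014 = [{"?paper": "ex:paper1"}, {"?paper": "ex:paper2"}]

cv = join(sv_2013, sv_2014)
print(cv)  # [{'?paper': 'ex:paper1'}]
```

Only the paper present in both versions survives the join, which matches the reading of the example question as "in 2013 and in 2014".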

Delta materialization (DM) retrieves the differences of a specific entity between two consecutive versions. Formally: $DM(E, V_i) = (\Delta^+, \Delta^-)$, with $\Delta^+ = E_i \setminus E_j$, $\Delta^- = E_j \setminus E_i$, and $i, j \in \mathbb{N}$, $i > j$, $\nexists k \in \mathbb{N} : j < k < i$. For example, “What data changed about the entity representing David Shotton in 2014?”.
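Since both versions of an entity are sets of triples, the delta reduces to two set differences. The sketch below assumes that the newer version E_i and its immediate predecessor E_j are already available in memory; the function name `dm` and the sample triples are hypothetical.

```python
Triple = tuple[str, str, str]


def dm(e_i: set[Triple], e_j: set[Triple]) -> tuple[set[Triple], set[Triple]]:
    """Delta materialization between a version e_i and its predecessor e_j.

    Returns (added, removed): triples only in the newer version and
    triples only in the older one.
    """
    return e_i - e_j, e_j - e_i


# Invented versions of the same entity.
e_2013 = {("ex:shotton", "ex:name", "D. Shotton"),
          ("ex:shotton", "ex:affiliation", "ex:oxford")}
e_2014 = {("ex:shotton", "ex:name", "David Shotton"),
          ("ex:shotton", "ex:affiliation", "ex:oxford")}

added, removed = dm(e_2014, e_2013)
print(added)    # {('ex:shotton', 'ex:name', 'David Shotton')}
print(removed)  # {('ex:shotton', 'ex:name', 'D. Shotton')}
```

A modified value appears as one removal plus one addition, because RDF triples have no in-place update.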

Single-delta structured query (SD) retrieves the change-sets of a SPARQL query’s results between one consecutive couple of versions. Formally: $SD(Q, V_i, V_j) = (\Delta^+, \Delta^-)$, with $\Delta^+ = [[Q]]_{V_i} \setminus [[Q]]_{V_j}$, $\Delta^- = [[Q]]_{V_j} \setminus [[Q]]_{V_i}$, and $i, j \in \mathbb{N}$, $i > j$, $\nexists k \in \mathbb{N} : j < k < i$. For example, “Which of David Shotton’s papers were featured in the dataset in 2014 but not in 2013?”.

Cross-delta structured query (CD) retrieves the change-sets of a SPARQL query’s results between more than one consecutive couple of versions. Formally: $CD(Q, V_i, V_j, V_m) = SD(Q, V_i, V_j) \bowtie SD(Q, V_j, V_m)$. For example, “When were articles by David Shotton added to or removed from the collection?”.
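The single-delta and cross-delta queries can be illustrated on precomputed query results. The sketch below makes two simplifying assumptions: each version's result is treated as a set of values rather than a bag of solution mappings, and the cross-delta result is presented as a mapping from each version to its change-set with respect to the previous one, rather than as the formal join of the definition. The names `sd`, `cd`, and `results` are hypothetical.

```python
def sd(results_i: set[str], results_j: set[str]) -> tuple[set[str], set[str]]:
    """Single-delta query: (added, removed) between version i and its predecessor j."""
    return results_i - results_j, results_j - results_i


def cd(results: dict[int, set[str]]) -> dict[int, tuple[set[str], set[str]]]:
    """Cross-delta query, simplified: one change-set per consecutive pair of versions."""
    versions = sorted(results)
    return {newer: sd(results[newer], results[older])
            for older, newer in zip(versions, versions[1:])}


# Invented results of the same query ("papers by the author") on three versions.
results = {
    2012: {"ex:paper1"},
    2013: {"ex:paper1", "ex:paper2"},
    2014: {"ex:paper2", "ex:paper3"},
}

changes = cd(results)
for year, (added, removed) in changes.items():
    print(year, sorted(added), sorted(removed))
```

Read this way, the answer to "when were articles added or removed" is the set of versions whose change-set is not empty.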

Several SPARQL extensions support queries on time-aware RDF datasets, but they either require non-standard languages to map data, as is the case for τ-SPARQL (Tappolet & Bernstein, 2009), T-SPARQL (Grandi, 2010), and AnQL (Zimmermann et al., 2012), or work only on a purpose-built database, as SPARQL^T does on the RDF-TX system (Zaniolo et al., 2018). This article proposes a methodology to support all query types on any triplestore in standard SPARQL.

In this direction, SPARQ-LTL (Fionda et al., 2016) proposes a relevant approach: it extends SPARQL but also describes an algorithm for rewriting queries into standard SPARQL, provided that all triples are annotated with revision numbers and the revisions are accessible as named graphs. However, to the best of our knowledge, this strategy has no implementations.
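To show what "revisions accessible as named graphs" buys in standard SPARQL, the sketch below builds a query string that restricts a graph pattern to one revision through the standard `GRAPH` keyword. It is not the SPARQ-LTL rewriting algorithm, which is not reproduced in this article; the function name `scope_to_revision` and the example IRIs are hypothetical.

```python
def scope_to_revision(select_vars: list[str], graph_pattern: str, revision_iri: str) -> str:
    """Wrap a basic graph pattern in a GRAPH clause naming one revision.

    Assumes each revision is stored as a named graph identified by revision_iri.
    """
    variables = " ".join(select_vars)
    return (
        f"SELECT {variables}\n"
        f"WHERE {{\n"
        f"  GRAPH <{revision_iri}> {{\n"
        f"    {graph_pattern}\n"
        f"  }}\n"
        f"}}"
    )


query = scope_to_revision(
    ["?paper"],
    "?paper <http://example.org/author> <http://example.org/shotton> .",
    "http://example.org/revision/3",
)
print(query)
```

The resulting string is plain SPARQL 1.1 and can be sent to any endpoint that exposes the revisions as named graphs.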

2.3 Storing dynamic linked open data

This section reviews existing storage and querying methodologies, focusing on supported queries, real-time operation, and generality. We consider a model generic if it complies with standard RDF and can be queried via standard SPARQL on any RDF-compatible storage system.

Various archiving policies have been elaborated to store and query the evolution of RDF datasets, namely independent copies, change-based, timestamp-based, and fragment-based policies (Pelgrin et al., 2021).

The independent copies policy consists of storing each version separately. It is the most straightforward model to implement and allows performing VM, SV, and CV easily. However, this approach needs a massive amount of space for storing and time for processing. Furthermore, given the different versions of the statements, further diff mechanisms are required to identify what changed. Nevertheless, to date, this is the archiving policy adopted by most systems and knowledge bases, such as DBpedia (Lehmann et al., 2015), Wikidata (Dooley & Božić, 2019; Erxleben et al., 2014; “Wikidata:Database download”, 2021), and YAGO (Project, 2021).

The first version control system for RDF was SemVersion (Völkel et al., 2005), specially tailored for ontologies. It saves each version of an ontology in a separate snapshot, and differences are calculated on the fly. SemVersion supports VM, SV, DM, and SD but not via SPARQL, because SPARQL became a W3C Recommendation in 2008 and SemVersion has not been updated since 2005.

The change-based policy was introduced to solve the scalability problems caused by the independent copies approach. It consists of saving only the deltas between one version and the next. For this reason, DM is costless. The drawback is that version-focused queries incur additional computational costs, since deltas must be propagated to rebuild the requested version.
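The trade-off of the change-based policy can be seen in a small sketch. It assumes that the store keeps one base graph plus an ordered list of deltas, each a pair (added, removed) leading from one version to the next. The names `base`, `deltas`, and `materialize` are hypothetical and do not describe any of the systems reviewed here.

```python
Triple = tuple[str, str, str]
Delta = tuple[set[Triple], set[Triple]]  # (added, removed)

# Version 0 is stored in full; deltas[k] leads from version k to version k + 1.
base: set[Triple] = {("ex:paper1", "ex:author", "ex:shotton")}
deltas: list[Delta] = [
    ({("ex:paper2", "ex:author", "ex:shotton")}, set()),
    ({("ex:paper3", "ex:author", "ex:shotton")},
     {("ex:paper1", "ex:author", "ex:shotton")}),
]


def materialize(target: int) -> set[Triple]:
    """Rebuild version `target` by propagating the first `target` deltas over the base."""
    graph = set(base)
    for added, removed in deltas[:target]:
        graph -= removed
        graph |= added
    return graph


# Delta materialization is a direct lookup...
print(deltas[1])
# ...whereas a version must be recomputed, at a cost that grows with `target`.
print(sorted(materialize(2)))
```

Reading a delta costs a single index access, while materializing version n requires applying n deltas, which is the extra cost for version-focused queries mentioned above.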

The first proposal of this approach relied on an RDBMS to store the original dataset and the deltas between two consecutive versions (Im et al., 2012). To improve performance, deltas are pre-processed, and duplicated or unnecessary modifications are deleted. There is no support for SPARQL, and queries must be formulated in SQL.

A concrete implementation of a change-based policy is R&Wbase, a version control system inspired by Git but designed for RDF (Sande et al., 2013). Additions and deletions are stored in separate named graphs, and all queries are supported. However, this model is not fully semantic, since it requires hash tables to map revisions with change-sets. In addition, it is not triplestore-agnostic, as it supports only Fuseki and Virtuoso.

R43ples is inspired by R&Wbase and refines it by adopting a fully semantic model (Graube et al., 2016). This model, called the Revision Management Ontology, records change-sets and the related provenance metadata in separate graphs using PROV-O and some new properties (e.g., rmo:deltaAdded and rmo:deltaRemoved). R43ples acts as a proxy between the data triplestore and the provenance triplestore. However, R43ples cannot be considered a generic solution, as it extends SPARQL with some keywords to simplify the queries (e.g., REVISION, TAG, MERGE), and the current implementation mandates using Jena TDB as the provenance triplestore.

The timestamp-based policy annotates each triple with its transaction time, that is, the timestamp of the version in which that statement was in the dataset.

x-RDF-3X is a database for RDF designed to manage high-frequency online updates, versioning, time-traversal queries, and transactions (Neumann & Weikum, 2010). Triples are never deleted but are annotated with two fields, the insertion timestamp and the deletion timestamp, the latter being zero for versions that are currently live. Updates are saved in a separate workspace and merged into the various indexes at occasional savepoints. x-RDF-3X supports VM and SV queries.
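The two-timestamp annotation can be sketched as follows. This is an illustration of the general timestamp-based idea, not of x-RDF-3X's internal indexes or workspace: it assumes integer timestamps, a deletion timestamp of 0 for live triples, and a triple that counts as present from its insertion time up to, but excluding, its deletion time. The names `Annotated`, `store`, and `snapshot_at` are hypothetical.

```python
from typing import NamedTuple


class Annotated(NamedTuple):
    triple: tuple[str, str, str]
    inserted: int
    deleted: int  # 0 means the triple is still live


# Invented store: nothing is physically removed, deletions are only recorded.
store = [
    Annotated(("ex:paper1", "ex:author", "ex:shotton"), inserted=2012, deleted=2014),
    Annotated(("ex:paper2", "ex:author", "ex:shotton"), inserted=2013, deleted=0),
    Annotated(("ex:paper3", "ex:author", "ex:shotton"), inserted=2014, deleted=0),
]


def snapshot_at(t: int) -> set[tuple[str, str, str]]:
    """Return the triples that were valid at time t."""
    return {a.triple for a in store
            if a.inserted <= t and (a.deleted == 0 or t < a.deleted)}


print(sorted(snapshot_at(2013)))
print(sorted(snapshot_at(2014)))
```

With this layout a version is obtained by filtering on the two fields, so no delta propagation is needed, at the price of scanning or indexing the annotations.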

v-RDFCSA uses a similar strategy but excels in reducing space requirements, com-
