Our focus in this article is not on decentralizing
data, but on finding data after it has been decentral-
ized, which can be done via query processing. The is-
sue of query processing over data has been primarily
tackled from a Big Data standpoint so far. However, if
decentralization efforts such as Solid become a
reality, we must be prepared to query
over a huge number of data sources. For example, de-
centralized social networking applications will need to
be able to query over networks of friends containing
hundreds or thousands of data documents. As such, we
need new query techniques that are specifically de-
signed for such levels of decentralization. One of the
most promising techniques that could achieve this is
called Link-Traversal-based Query Processing
(LTQP) [2, 3]. LTQP is able to query over a set of doc-
uments that are connected to each other via links. An
LTQP query engine typically starts from one or more
documents and traverses links between them in a
crawling manner in order to resolve the given query.
LTQP is still a relatively young area of research,
in which a number of open problems remain to be
tackled, notably result completeness and query
termination [2]. Aside from these known issues, we
also stress the importance of security.
Security is a highly important and well-investigated
topic in the context of Web applications [4, 5], but it
has not yet been investigated in the context of LTQP.
As such, we investigate in this article security issues
related to LTQP engines, which may threaten the in-
tegrity of the user’s data, machine, and user experi-
ence, but also lead to privacy issues if personal data is
unintentionally leaked. Specifically, we focus on data-
driven security issues that are inherent to LTQP due to
the fact that it requires a query engine to follow links
on the Web, which is an uncontrolled, unpredictable
and potentially unsafe environment. Instead of analyz-
ing a single security threat in-depth, we perform a
broader high-level analysis of multiple security
threats.
Since LTQP is still a relatively new area of research,
its real-world applications are currently limited. As
such, we cannot learn from security issues that arose
in existing systems. Instead of waiting for widespread,
potentially unsafe applications of LTQP, we draw
inspiration from related domains that are already well-
established. Specifically, we draw inspiration from the
domains of crawling and Web browsers in Section 2,
and relate these known security issues to their
impact on LTQP query engines. In Section 3,
we introduce a guiding use case that will be used to
illustrate the different threats. After that, we discuss
our method of categorizing vulnerabilities in
Section 4. Next, we list 10 data-driven security vulner-
abilities related to LTQP in Section 5, which are de-
rived from known vulnerabilities in similar domains.
For each vulnerability, we provide examples, and
sketch possible high-level mitigations. Finally, we dis-
cuss the future of LTQP security and conclude in
Section 6.
2. Related Work
This section discusses related work on the topics
of LTQP and security.
2.1. Link-Traversal-Based Query Processing
More than a decade ago, Link-Traversal-based
Query Processing (LTQP) [2, 3] was introduced
as an alternative query paradigm for enabling query
execution over document-oriented interfaces. These
documents are usually Linked Data [6] serialized using
any RDF [7] serialization. RDF is suitable for
LTQP and decentralization because of its global
semantics, which allow queries to be written
independently of the schemas of specific documents. In order
to execute these queries, LTQP operates over
live data and discovers links to other documents via
the follow-your-nose principle during query execution.
This is in contrast to the typical query execution over
centralized database-oriented interfaces such as
SPARQL endpoints [8], where data is assumed to be
loaded into the endpoint beforehand, and no additional
data is discovered during query execution.
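The follow-your-nose discovery described above can be illustrated with a minimal Python sketch. It is not taken from any existing LTQP engine, and all names in it (WEB, document_url, matches, traverse) are hypothetical. As simplifying assumptions, an in-memory dictionary stands in for dereferencing documents over HTTP, a single triple pattern stands in for a full query, prefixed names are treated as plain strings, every IRI found in a triple is followed, and a max_documents bound is added so that traversal always stops.

```python
from collections import deque
from typing import Dict, Iterator, List, Optional, Sequence, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]

# Hypothetical in-memory "Web": document URL -> triples found in that document.
# Assumption: a real engine would dereference each URL over HTTP and parse RDF.
WEB: Dict[str, List[Triple]] = {
    "https://alice.example/profile": [
        ("https://alice.example/profile#me", "foaf:name", '"Alice"'),
        ("https://alice.example/profile#me", "foaf:knows",
         "https://bob.example/profile#me"),
    ],
    "https://bob.example/profile": [
        ("https://bob.example/profile#me", "foaf:name", '"Bob"'),
        ("https://bob.example/profile#me", "foaf:knows",
         "https://alice.example/profile#me"),
    ],
}


def document_url(term: str) -> Optional[str]:
    """Return the document URL of an IRI (fragment removed), or None otherwise."""
    if term.startswith(("http://", "https://")):
        return term.split("#", 1)[0]
    return None


def matches(triple: Triple, pattern: Pattern) -> bool:
    """A pattern position set to None acts as a variable and matches anything."""
    return all(p is None or p == t for t, p in zip(triple, pattern))


def traverse(seeds: Sequence[str], pattern: Pattern,
             max_documents: int = 100) -> Iterator[Triple]:
    """Yield matching triples while documents are still being discovered."""
    queue = deque(seeds)
    visited = set()
    while queue and len(visited) < max_documents:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        # Stand-in for dereferencing: unknown documents yield no triples.
        for triple in WEB.get(url, []):
            if matches(triple, pattern):
                yield triple
            # Follow-your-nose: every IRI in the triple is a candidate link.
            for term in triple:
                link = document_url(term)
                if link is not None and link not in visited:
                    queue.append(link)


results = list(traverse(["https://alice.example/profile"],
                        (None, "foaf:name", None)))
```

Starting from Alice's document alone, the sketch also reaches Bob's document through the foaf:knows link, so both names end up in results even though only one seed was given. Because results are yielded during traversal, a consumer can start processing them before discovery has finished.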
Concretely, LTQP typically starts off with an input
query and a set of seed documents. The query engine
then dereferences all seed documents via an HTTP GET
request, discovers links to other documents inside
those documents, and recursively dereferences those
discovered documents. Since document discovery can
be a very long (or infinite) process, query execution
happens during the discovery process based on all the
RDF triples that are extracted from the discovered
documents. This is typically done by implementing
these processes in an iterative pipeline [9]. Further-
more, since this discovery approach can lead to a large
number of discovered documents, different reachabili-