A Prospective Analysis of Security
Vulnerabilities within Link Traversal-Based
Query Processing (Extended Version)
Ruben Taelman and Ruben Verborgh
IDLab, Department of Electronics and Information Systems, Ghent University – imec,
{firstname.lastname}@ugent.be
This is an extended version of an article with the same title published in the proceedings of the QuWeDa workshop at ISWC 2022. In addition to more detailed related work and conclusions sections, this extension introduces concrete mitigations for each vulnerability.
Abstract.
The societal and economical consequences surrounding Big Data-driven platforms have increased the call for de-
centralized solutions. However, retrieving and querying data in more decentralized environments requires funda-
mentally different approaches, whose properties are not yet well understood. Link-Traversal-based Query Process-
ing (LTQP) is a technique for querying over decentralized data networks, in which a client-side query engine dis-
covers data by traversing links between documents. Since decentralized environments are potentially unsafe due
to their non-centrally controlled nature, there is a need for client-side LTQP query engines to be resistant against
security threats aimed at the query engine’s host machine or the query initiator’s personal data. As such, we have
performed an analysis of potential security vulnerabilities of LTQP. This article provides an overview of security
threats in related domains, which are used as inspiration for the identification of 10 LTQP security threats. Each
threat is explained, together with an example, and one or more avenues for mitigations are proposed. We conclude
with several concrete recommendations for LTQP query engine developers and data publishers as a first step to
mitigate some of these issues. With this work, we start filling the unknowns for enabling querying over decentral-
ized environments. Aside from future work on security, wider research is needed to uncover missing building
blocks for enabling true decentralization.
1. Introduction
Contrary to the Web’s initial design as a decentral-
ized ecosystem, the Web has grown to be a very cen-
tralized place, as large parts of the Web are currently
made up of a few large Big Data-driven centralized platforms [1]. This large-scale centralization has led to a number of problems related to personal information abuse, and other economic and societal problems.
In order to solve these problems, there are calls to go
back to the original vision of a decentralized Web. The
leading effort to achieve this decentralization is
Solid [1]. Solid proposes a radical decentralization of
data across personal data vaults, where everyone is in full control of their own personal data vault. This vault can contain any number of documents, whose owner can determine who or what can access which parts of the data. In contrast to the current state of the Web, where data primarily resides in a small number of huge data sources, Solid leads to a Web where data is spread over a huge number of data sources.
Our focus in this article is not on decentralizing
data, but on finding data after it has been decentral-
ized, which can be done via query processing. The issue of query processing over data has primarily been tackled from a Big Data standpoint so far. However, if decentralization efforts such as Solid are to become a reality, we need to be prepared to query
over a huge number of data sources. For example, de-
centralized social networking applications will need to
be able to query over networks of friends containing
hundreds or thousands of data documents. As such, we
need new query techniques that are specifically de-
signed for such levels of decentralization. One of the
most promising techniques that could achieve this is
called Link-Traversal-based Query Processing
(LTQP) [2, 3]. LTQP is able to query over a set of doc-
uments that are connected to each other via links. An
LTQP query engine typically starts from one or more
documents, and traverses links between them in a
crawling manner in order to resolve the given query.
LTQP is still a relatively young area of research, in which a number of open problems remain to be tackled, notably result completeness and query termination [2]. Aside from these known issues, we also stress the importance of security.
Security is a highly important and well-investigated
topic in the context of Web applications [4, 5], but it
has not yet been investigated in the context of LTQP.
As such, we investigate in this article security issues
related to LTQP engines, which may threaten the integrity of the user's data, machine, and user experience, but may also lead to privacy issues if personal data is
unintentionally leaked. Specifically, we focus on data-
driven security issues that are inherent to LTQP due to
the fact that it requires a query engine to follow links
on the Web, which is an uncontrolled, unpredictable
and potentially unsafe environment. Instead of analyz-
ing a single security threat in-depth, we perform a
broader high-level analysis of multiple security
threats.
Since LTQP is still a relatively new area of research,
its real-world applications are currently limited. As
such, we cannot learn from security issues that arose in existing systems. Instead of waiting for potentially unsafe widespread applications of LTQP, we draw
inspiration from related domains that are already well-
established. Specifically, we draw inspiration from the
domains of crawling and Web browsers in Section 2,
and draw links to what impact these known security
issues will have on LTQP query engines. In Section 3,
we introduce a guiding use case that will be used to
illustrate the different threats. After that, we discuss
our method of categorizing vulnerabilities in
Section 4. Next, we list 10 data-driven security vulner-
abilities related to LTQP in Section 5, which are de-
rived from known vulnerabilities in similar domains.
For each vulnerability, we provide examples, and
sketch possible high-level mitigations. Finally, we dis-
cuss the future of LTQP security and conclude in
Section 6.
2. Related Work
This section lists relevant related work in the topics
of LTQP and security.
2.1. Link-Traversal-Based Query Processing
More than a decade ago, Link-Traversal-based
Query Processing (LTQP) [3, 2] was introduced
as an alternative query paradigm for enabling query
execution over document-oriented interfaces. These
documents are usually Linked Data [6] serialized us-
ing any RDF [7] serialization. RDF is well suited to
LTQP and decentralization because of its global se-
mantics, which allows queries to be written indepen-
dently of the schemas of specific documents. In order to execute these queries, LTQP engines process live data, and discover links to other documents via the follow-your-nose principle during query execution.
This is in contrast to the typical query execution over
centralized database-oriented interfaces such as
SPARQL endpoints [8], where data is assumed to be
loaded into the endpoint beforehand, and no additional
data is discovered during query execution.
Concretely, LTQP typically starts off with an input
query and a set of seed documents. The query engine
then dereferences all seed documents via an HTTP GET
request, discovers links to other documents inside
those documents, and recursively dereferences those
discovered documents. Since document discovery can
be a very long (or infinite) process, query execution
happens during the discovery process based on all the
RDF triples that are extracted from the discovered
documents. This is typically done by implementing
these processes in an iterative pipeline [9]. Further-
more, since this discovery approach can lead to a large
number of discovered documents, different reachability criteria [10] have been introduced as a way to re-
strict what links are to be followed for a given query.
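The traversal process described above can be sketched as follows; this is a minimal in-memory simulation, where a dictionary stands in for HTTP dereferencing and all IRIs, triples, and the `follow` reachability predicate are illustrative:

```python
from collections import deque

# Toy "Web": each document maps to RDF-like triples; an object value
# that is itself a document IRI acts as a link to traverse.
DOCS = {
    "https://ex.org/alice": [("alice", "knows", "https://ex.org/bob")],
    "https://ex.org/bob": [("bob", "knows", "https://ex.org/carol")],
    "https://ex.org/carol": [("carol", "name", "Carol")],
}

def traverse(seeds, follow=lambda link: True):
    """Dereference seed documents, extract triples, and recursively
    follow links that satisfy the reachability predicate `follow`."""
    queue, seen, triples = deque(seeds), set(seeds), []
    while queue:
        doc = queue.popleft()
        for s, p, o in DOCS.get(doc, []):  # stands in for an HTTP GET
            triples.append((s, p, o))
            if o in DOCS and o not in seen and follow(o):
                seen.add(o)
                queue.append(o)
    return triples

triples = traverse(["https://ex.org/alice"])
```

A reachability criterion then corresponds to a stricter `follow` predicate, for example one that only follows links within a given domain.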
So far, most research into LTQP has happened in
the areas of formalization [10, 11], performance im-
provements [12, 13, 14], and query syntax [15]. One
work has indicated the importance of
trustworthiness [16] during link traversal, as people
may publish false or contradicting information, which
would need to be avoided or filtered out during query
execution. Another work mentioned the need for
LTQP engines to adhere to robots.txt files [17] in order not to cause unintentional denial-of-service attacks on data publishers. Given the focus of our work
on data-driven security vulnerabilities related to LTQP
engines, we only consider this issue of trustworthiness
further in this work, and omit the security vulnerabilities from a data publisher's perspective.
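The robots.txt adherence mentioned above can be implemented with standard tooling; a minimal sketch using Python's built-in urllib.robotparser, with a hypothetical robots.txt body and agent name:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a data publisher might serve; an engine would
# normally fetch this from https://ex.org/robots.txt before crawling.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Check each candidate link before dereferencing it.
allowed = parser.can_fetch("ltqp-engine", "https://ex.org/profile")
blocked = parser.can_fetch("ltqp-engine", "https://ex.org/private/data")
```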
2.2. Vulnerabilities Of RDF Query Processing
Research involving the security vulnerabilities of
RDF query processing has been primarily focused on
injection attacks within Web applications that internal-
ly send SPARQL queries to a SPARQL endpoint. So
far, no research has been done on vulnerabilities spe-
cific to RDF federated querying or link traversal. As
such, we list the relevant work on single-source
SPARQL querying hereafter.
The most significant type of security vulnerability
in Web applications in general is Injection through
User Input, of which SQL injection attacks [4] are a
primary example. Orduna et al. [5] investigate this
type of attack in the context of SPARQL queries, and
show that parameterized queries can help avoid this
type of attack. A parameterized query is a query tem-
plate that can contain multiple parameters, which can
be instantiated with different values. To avoid injec-
tion attacks, parameterized query libraries will per-
form the necessary validation and escaping on the in-
serted values. The authors implemented parameterized
queries in the Jena framework [18] as a mitigation
example.
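A minimal illustration of the kind of escaping such parameterized query libraries perform (deliberately simplified, not Jena's actual implementation; the query template is hypothetical):

```python
def escape_literal(value: str) -> str:
    # Minimal escaping for a double-quoted SPARQL literal; real libraries
    # such as Jena's ParameterizedSparqlString also handle IRIs,
    # datatypes, and further escape sequences.
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n"))

# Hypothetical query template with a single literal parameter.
TEMPLATE = 'SELECT ?s WHERE {{ ?s <http://xmlns.com/foaf/0.1/name> "{name}" }}'

def build_query(name: str) -> str:
    return TEMPLATE.format(name=escape_literal(name))

# An attacker-supplied value cannot break out of the quoted literal,
# because its quote character is escaped before insertion:
query = build_query('x" } UNION { ?s ?p ?o')
```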
SemGuard [19] is a system that aims to detect injec-
tion attacks in both SPARQL and SQL queries for
query engines that support both. A motivation of this
work is that the use of parameterized queries is not al-
ways desirable, as systems may already have been im-
plemented without them, and updating them would be
too expensive. This approach is based on the automat-
ic analysis of the incoming query's parse tree. It checks whether, compared to the original template query's parse tree, the parse tree contains only a leaf node at the position of the expected user input. If the input corresponds to more than a leaf node, the user is attempting to execute queries that were not intended by the application developer.
Asdhar et al. [20] analyzed injection attacks on Web applications via the SPARQL query language [21] and
the SPARQL update language [22]. Furthermore, they
provide SemWebGoat, a deliberately insecure RDF-
based Web application for educational purposes
around security. All of the discussed attacks involve
some form of injection, leading to retrieval or modification of unwanted data, or denial of service, for example by injecting the ?s ?p ?o pattern. Such ?s ?p ?o patterns cause all data to be fetched, which for large datasets can require long execution times; this may delay subsequent SPARQL queries, or even crash the server and lead to availability issues [23].
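The denial-of-service injection described above can be illustrated with naive string concatenation (a deliberately vulnerable sketch; the vocabulary and values are hypothetical):

```python
# Deliberately vulnerable: the user value is concatenated into the query
# without escaping, so a crafted value can close the quoted literal and
# inject a catch-all ?s ?p ?o pattern that forces the endpoint to touch
# all of its data.
def build_query_unsafe(name: str) -> str:
    return ('SELECT ?s WHERE { ?s <http://xmlns.com/foaf/0.1/name> "'
            + name + '" }')

malicious = 'x" . ?s ?p ?o . FILTER("a" = "a'
query = build_query_unsafe(malicious)
```

The resulting query string now contains an injected `?s ?p ?o` triple pattern outside the intended literal, which the escaping shown earlier in this section would have prevented.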
2.3. Linked Data Access Control
Kirrane et al. [24] surveyed the existing approaches
for achieving access control in RDF, for both authenti-
cation and authorization. The authors mention that
only a minority of those works apply specifically to
the document-oriented nature of Linked Data. They do
however mention that non-Linked-Data-specific ap-
proaches could potentially be applied to Linked Data
in future work. Hereafter, we briefly discuss the rele-
vant aspects of access control research that applies to
Linked Data. To the best of our knowledge, no securi-
ty vulnerabilities have yet been identified for any of
these.
2.3.1. Authentication
Authentication involves verifying an agent’s identi-
ty through certain credentials. A WebID
(https://www.w3.org/wiki/WebID) (Web Identity and
Discovery) is a URL through which agents can be
identified on the Web. WebID-TLS [25] is a protocol
that allows authentication of WebID agents via TLS
certificates. However, due to the limited support of
such certificates in Web browsers, its usage is hin-
dered. WebID-OIDC [26] is a more recent protocol, based on OpenID Connect [27], for authenticating WebID agents. Due to its compatibility
with modern Web browsers, WebID-OIDC is frequent-
ly used inside the Solid ecosystem.
2.3.2. Authorization
Authorization involves determining who can read or
write what kind of data. Web Access Control [28] is an
RDF-based access control system that works in a de-
centralized fashion. It enables declarative access con-
trol policies for documents to be assigned to users and
groups. Due to its properties, it is used as the default access control mechanism in the Solid ecosystem. Sacco et al. [29] extend Web Access Control to declare access not only at the document level, but also at the resource, statement and graph level. Costabello et al. [30] intro-
duce the Shi3ld framework that enables access control
for the Linked Data Platform [31]. Two variants of this
framework exist; one based on a SPARQL query en-
gine, and one more limited variant that works without
SPARQL queries. Kirrane et al. [32] introduce a
framework for enabling query-based access control via
query rewriting of simple graph pattern queries. Fur-
ther, Steyskal et al. [33] provide an approach that is
based on the Open Digital Rights Language. Finally,
Taelman et al. [34] introduce a framework to optimize
federated querying over documents that require access
control, by incorporating authorizations into privacy-
preserving data summaries.
2.4. Web Crawlers
Web crawling [35] is a process that involves collect-
ing information on the Web by following links be-
tween pages. Web crawlers are typically used for Web
indexing to aid search engines. Focused crawling [36]
is a special form of Web crawling that prioritizes cer-
tain Web pages, such as Web pages about a certain
topic, or domains for a certain country. LTQP can therefore be considered an area of focused crawling where the priority lies in achieving query results.
Web crawlers are often used for discovering vulner-
able Web sites, for example through Google
Dorking [37], which involves using Google Search to
find Web sites that are misconfigured or use vulnera-
ble software. Furthermore, crawlers are often used to
find private information on Web sites. Such issues are
however not the focus of this work. Instead, we are in-
terested in the security of the crawling process itself,
for which little research has been done to the best of
our knowledge.
One related work in this area involves abusing
crawlers to initiate attacks on other Web sites [38].
This may cause performance degradation on the at-
tacked Web site, or could even cause the crawling
agent to be blocked by the server. These attacks in-
volve convincing the crawler to follow a link to a
third-party Web site that exploits a certain vulnerabili-
ty, such as an SQL injection. Additionally, this work
describes a type of attack that allows vulnerable Web
sites to be used for improving the PageRank [39] of an
attacker-owned Web site via forged backlinks.
Some other works focus on mitigation of so-called
crawler traps [40, 41] or spider traps. These are sets
of URLs that cause an infinite crawling process, which
can either be intentional or accidental. Such crawler
traps can have multiple causes:
- Links between dynamic pages that are based on URLs with query parameters;
- Infinite redirection loops using the HTTP 3xx range;
- Links to search APIs;
- Infinitely paged resources, such as calendars;
- Incorrect relative URLs that continuously increase the URL length.
Crawler traps are mostly detected through human intervention when many documents within a single domain are discovered. Recently, a new detection technique
was introduced [42] that attempts to measure the dis-
tance between documents, and rejects links to docu-
ments that are too similar.
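Some of the causes listed above can be guarded against with simple client-side heuristics; the sketch below (with illustrative, made-up thresholds) rejects overly long URLs, which catches relative-URL loops that keep growing, and caps the number of documents fetched per host:

```python
from collections import Counter
from urllib.parse import urlparse

MAX_URL_LENGTH = 2000    # guards against ever-growing relative URLs
MAX_DOCS_PER_HOST = 500  # guards against infinitely paged resources

docs_per_host = Counter()

def may_follow(url: str) -> bool:
    """Return whether a traversal engine should dereference this link."""
    if len(url) > MAX_URL_LENGTH:
        return False
    host = urlparse(url).netloc
    if docs_per_host[host] >= MAX_DOCS_PER_HOST:
        return False
    docs_per_host[host] += 1
    return True
```

Distance-based detection, as in the technique mentioned above, would complement such static limits by also rejecting links to near-duplicate documents.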
2.5. Web Browsers
Web browsers enable users to visualize and interact
with Web pages. This interaction is closely related to
LTQP, with the main difference that LTQP works au-
tonomously, while Web browsers are user-driven.
Considering this close resemblance between these two
domains, we give an overview of the main security
vulnerabilities in Web browsers.
2.5.1. Modern Web Browser Architecture
Silic et al. [43] analyzed the architectures of modern
Web browsers, determined the main vulnerabilities,
and discussed how these issues are coped with.
Architecture-wise, browsers can be categorized into
monolithic and modular browser architectures. The
difference between the two is that the former does not
provide isolation between concurrently executed Web
programs, while the latter does. The authors argue that
a modular architecture is important for security, fault-