A Prospective Analysis of Security
Vulnerabilities within Link Traversal-Based
Query Processing (Extended Version)
Ruben Taelman and Ruben Verborgh
IDLab, Department of Electronics and Information Systems, Ghent University – imec,
{firstname.lastname}@ugent.be
This is an extended version of an article with the same title published in the proceedings of the QuWeDa workshop at ISWC 2022. In addition to more detailed related work and conclusions sections, this extension introduces concrete mitigations for each vulnerability.
Abstract.
The societal and economical consequences surrounding Big Data-driven platforms have increased the call for de-
centralized solutions. However, retrieving and querying data in more decentralized environments requires funda-
mentally different approaches, whose properties are not yet well understood. Link-Traversal-based Query Process-
ing (LTQP) is a technique for querying over decentralized data networks, in which a client-side query engine dis-
covers data by traversing links between documents. Since decentralized environments are potentially unsafe due
to their non-centrally controlled nature, there is a need for client-side LTQP query engines to be resistant against
security threats aimed at the query engine’s host machine or the query initiator’s personal data. As such, we have
performed an analysis of potential security vulnerabilities of LTQP. This article provides an overview of security
threats in related domains, which are used as inspiration for the identification of 10 LTQP security threats. Each
threat is explained, together with an example, and one or more avenues for mitigations are proposed. We conclude
with several concrete recommendations for LTQP query engine developers and data publishers as a first step to
mitigate some of these issues. With this work, we start filling the unknowns for enabling querying over decentral-
ized environments. Aside from future work on security, wider research is needed to uncover missing building
blocks for enabling true decentralization.
1. Introduction
Contrary to the Web’s initial design as a decentral-
ized ecosystem, the Web has grown to be a very cen-
tralized place, as large parts of the Web are currently
made up of a few large Big Data-driven centralized platforms [1]. This large-scale centralization has led to a number of problems related to personal information abuse, and other economic and societal problems.
In order to solve these problems, there are calls to go
back to the original vision of a decentralized Web. The
leading effort to achieve this decentralization is
Solid [1]. Solid proposes a radical decentralization of
data across personal data vaults, where everyone is in full control of their own personal data vault. This vault can contain any number of documents, whose owner can determine who or what can access which parts of the data. In contrast to the current state of the Web, where data primarily resides in a small number of huge data sources, Solid leads to a Web where data is spread over a huge number of data sources.
Our focus in this article is not on decentralizing
data, but on finding data after it has been decentral-
ized, which can be done via query processing. The issue of query processing over data has primarily been tackled from a Big Data standpoint so far. However, if decentralization efforts such as Solid are to become a reality, we need to be prepared to query
over a huge number of data sources. For example, de-
centralized social networking applications will need to
be able to query over networks of friends containing
hundreds or thousands of data documents. As such, we
need new query techniques that are specifically de-
signed for such levels of decentralization. One of the
most promising techniques that could achieve this is
called Link-Traversal-based Query Processing
(LTQP) [2, 3]. LTQP is able to query over a set of doc-
uments that are connected to each other via links. An
LTQP query engine typically starts from one or more
documents, and traverses links between them in a
crawling manner in order to resolve the given query.
LTQP is still a relatively young area of research, in which a number of open problems remain to be tackled, notably result completeness and query termination [2]. Aside from these known issues, we also stress the importance of security.
Security is a highly important and well-investigated
topic in the context of Web applications [4, 5], but it
has not yet been investigated in the context of LTQP.
As such, we investigate in this article security issues
related to LTQP engines, which may threaten the integrity of the user's data, machine, and user experience, but may also lead to privacy issues if personal data is
unintentionally leaked. Specifically, we focus on data-
driven security issues that are inherent to LTQP due to
the fact that it requires a query engine to follow links
on the Web, which is an uncontrolled, unpredictable
and potentially unsafe environment. Instead of analyz-
ing a single security threat in-depth, we perform a
broader high-level analysis of multiple security
threats.
Since LTQP is still a relatively new area of research,
its real-world applications are currently limited. As
such, we cannot learn from security issues that arose in existing systems. Instead of waiting for potentially unsafe widespread applications of LTQP, we draw
inspiration from related domains that are already well-
established. Specifically, we draw inspiration from the
domains of crawling and Web browsers in Section 2,
and draw links to what impact these known security
issues will have on LTQP query engines. In Section 3,
we introduce a guiding use case that will be used to
illustrate the different threats. After that, we discuss
our method of categorizing vulnerabilities in
Section 4. Next, we list 10 data-driven security vulner-
abilities related to LTQP in Section 5, which are de-
rived from known vulnerabilities in similar domains.
For each vulnerability, we provide examples, and
sketch possible high-level mitigations. Finally, we dis-
cuss the future of LTQP security and conclude in
Section 6.
2. Related Work
This section lists relevant related work in the topics
of LTQP and security.
2.1. Link-Traversal-Based Query Processing
More than a decade ago, Link-Traversal-based
Query Processing (LTQP) [3, 2] was introduced
as an alternative query paradigm for enabling query
execution over document-oriented interfaces. These
documents are usually Linked Data [6] serialized us-
ing any RDF [7] serialization. RDF is well suited to
LTQP and decentralization because of its global se-
mantics, which allows queries to be written indepen-
dently of the schemas of specific documents. In order to execute these queries, LTQP engines process live data, and discover links to other documents via the follow-your-nose principle during query execution.
This is in contrast to the typical query execution over
centralized database-oriented interfaces such as
SPARQL endpoints [8], where data is assumed to be
loaded into the endpoint beforehand, and no additional
data is discovered during query execution.
Concretely, LTQP typically starts off with an input
query and a set of seed documents. The query engine
then dereferences all seed documents via an HTTP GET
request, discovers links to other documents inside
those documents, and recursively dereferences those
discovered documents. Since document discovery can
be a very long (or infinite) process, query execution
happens during the discovery process based on all the
RDF triples that are extracted from the discovered
documents. This is typically done by implementing
these processes in an iterative pipeline [9]. Further-
more, since this discovery approach can lead to a large
number of discovered documents, different reachability criteria [10] have been introduced as a way to re-
strict what links are to be followed for a given query.
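The traversal process described above can be sketched as follows; this is a minimal in-memory simulation, where a dictionary stands in for HTTP dereferencing and all IRIs, triples, and the `follow` reachability predicate are illustrative:

```python
from collections import deque

# Toy "Web": each document maps to RDF-like triples; an object value
# that is itself a document IRI acts as a link to traverse.
DOCS = {
    "https://ex.org/alice": [("alice", "knows", "https://ex.org/bob")],
    "https://ex.org/bob": [("bob", "knows", "https://ex.org/carol")],
    "https://ex.org/carol": [("carol", "name", "Carol")],
}

def traverse(seeds, follow=lambda link: True):
    """Dereference seed documents, extract triples, and recursively
    follow links that satisfy the reachability predicate `follow`."""
    queue, seen, triples = deque(seeds), set(seeds), []
    while queue:
        doc = queue.popleft()
        for s, p, o in DOCS.get(doc, []):  # stands in for an HTTP GET
            triples.append((s, p, o))
            if o in DOCS and o not in seen and follow(o):
                seen.add(o)
                queue.append(o)
    return triples

triples = traverse(["https://ex.org/alice"])
```

A reachability criterion then corresponds to a stricter `follow` predicate, for example one that only follows links within a given domain.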
So far, most research into LTQP has happened in
the areas of formalization [10, 11], performance im-
provements [12, 13, 14], and query syntax [15]. One
work has indicated the importance of
trustworthiness [16] during link traversal, as people
may publish false or contradicting information, which
would need to be avoided or filtered out during query
execution. Another work mentioned the need for
LTQP engines to adhere to robots.txt files [17] in order not to cause unintentional denial-of-service attacks on data publishers. Given the focus of our work
on data-driven security vulnerabilities related to LTQP
engines, we only consider this issue of trustworthiness
further in this work, and omit the security vulnerabilities from a data publisher's perspective.
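The robots.txt adherence mentioned above can be implemented with standard tooling; a minimal sketch using Python's built-in urllib.robotparser, with a hypothetical robots.txt body and agent name:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a data publisher might serve; an engine would
# normally fetch this from https://ex.org/robots.txt before crawling.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Check each candidate link before dereferencing it.
allowed = parser.can_fetch("ltqp-engine", "https://ex.org/profile")
blocked = parser.can_fetch("ltqp-engine", "https://ex.org/private/data")
```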
2.2. Vulnerabilities Of RDF Query Processing
Research involving the security vulnerabilities of
RDF query processing has been primarily focused on
injection attacks within Web applications that internal-
ly send SPARQL queries to a SPARQL endpoint. So
far, no research has been done on vulnerabilities spe-
cific to RDF federated querying or link traversal. As
such, we list the relevant work on single-source
SPARQL querying hereafter.
The most significant type of security vulnerability
in Web applications in general is Injection through
User Input, of which SQL injection attacks [4] are a
primary example. Orduna et al. [5] investigate this
type of attack in the context of SPARQL queries, and
show that parameterized queries can help avoid this
type of attack. A parameterized query is a query tem-
plate that can contain multiple parameters, which can
be instantiated with different values. To avoid injec-
tion attacks, parameterized query libraries will per-
form the necessary validation and escaping on the in-
serted values. The authors implemented parameterized
queries in the Jena framework [18] as a mitigation
example.
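A minimal illustration of the kind of escaping such parameterized query libraries perform (deliberately simplified, not Jena's actual implementation; the query template is hypothetical):

```python
def escape_literal(value: str) -> str:
    # Minimal escaping for a double-quoted SPARQL literal; real libraries
    # such as Jena's ParameterizedSparqlString also handle IRIs,
    # datatypes, and further escape sequences.
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n"))

# Hypothetical query template with a single literal parameter.
TEMPLATE = 'SELECT ?s WHERE {{ ?s <http://xmlns.com/foaf/0.1/name> "{name}" }}'

def build_query(name: str) -> str:
    return TEMPLATE.format(name=escape_literal(name))

# An attacker-supplied value cannot break out of the quoted literal,
# because its quote character is escaped before insertion:
query = build_query('x" } UNION { ?s ?p ?o')
```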
SemGuard [19] is a system that aims to detect injec-
tion attacks in both SPARQL and SQL queries for
query engines that support both. A motivation of this
work is that the use of parameterized queries is not al-
ways desirable, as systems may already have been im-
plemented without them, and updating them would be
too expensive. This approach is based on the automat-
ic analysis of the incoming query's parse tree. It checks whether, compared to the original template query's parse tree, the parse tree contains only a leaf node at the position of the expected user input. If the input corresponds to more than a leaf node, the user is attempting to execute queries that were not intended by the application developer.
Asdhar et al. [20] analyzed injection attacks on Web applications via the SPARQL query language [21] and
the SPARQL update language [22]. Furthermore, they
provide SemWebGoat, a deliberately insecure RDF-
based Web application for educational purposes
around security. All of the discussed attacks involve
some form of injection, leading to retrieval or modification of unwanted data, or denial of service, for example by injecting the ?s ?p ?o pattern. Such ?s ?p ?o patterns cause all data to be fetched, which for large datasets can require long execution times; this may delay subsequent SPARQL queries, or even crash the server and lead to availability issues [23].
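The denial-of-service injection described above can be illustrated with naive string concatenation (a deliberately vulnerable sketch; the vocabulary and values are hypothetical):

```python
# Deliberately vulnerable: the user value is concatenated into the query
# without escaping, so a crafted value can close the quoted literal and
# inject a catch-all ?s ?p ?o pattern that forces the endpoint to touch
# all of its data.
def build_query_unsafe(name: str) -> str:
    return ('SELECT ?s WHERE { ?s <http://xmlns.com/foaf/0.1/name> "'
            + name + '" }')

malicious = 'x" . ?s ?p ?o . FILTER("a" = "a'
query = build_query_unsafe(malicious)
```

The resulting query string now contains an injected `?s ?p ?o` triple pattern outside the intended literal, which the escaping shown earlier in this section would have prevented.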
2.3. Linked Data Access Control
Kirrane et al. [24] surveyed the existing approaches
for achieving access control in RDF, for both authenti-
cation and authorization. The authors mention that
only a minority of those works apply specifically to
the document-oriented nature of Linked Data. They do
however mention that non-Linked-Data-specific ap-
proaches could potentially be applied to Linked Data
in future work. Hereafter, we briefly discuss the rele-
vant aspects of access control research that applies to
Linked Data. To the best of our knowledge, no securi-
ty vulnerabilities have yet been identified for any of
these.
2.3.1. Authentication
Authentication involves verifying an agent’s identi-
ty through certain credentials. A WebID
(https://www.w3.org/wiki/WebID) (Web Identity and
Discovery) is a URL through which agents can be
identified on the Web. WebID-TLS [25] is a protocol
that allows authentication of WebID agents via TLS
certificates. However, due to the limited support of
such certificates in Web browsers, its usage is hin-
dered. WebID-OIDC [26] is a more recent protocol, based on OpenID Connect [27], for authenticating WebID agents. Due to its compatibility
with modern Web browsers, WebID-OIDC is frequent-
ly used inside the Solid ecosystem.
2.3.2. Authorization
Authorization involves determining who can read or
write what kind of data. Web Access Control [28] is an
RDF-based access control system that works in a de-
centralized fashion. It enables declarative access con-
trol policies for documents to be assigned to users and
groups. Due to its properties, it is used as the default access control mechanism in the Solid ecosystem. Sacco et al. [29] extend Web Access Control to declare access not only at the document level, but also at the resource, statement and graph level. Costabello et al. [30] intro-
duce the Shi3ld framework that enables access control
for the Linked Data Platform [31]. Two variants of this
framework exist; one based on a SPARQL query en-
gine, and one more limited variant that works without
SPARQL queries. Kirrane et al. [32] introduce a
framework for enabling query-based access control via
query rewriting of simple graph pattern queries. Fur-
ther, Steyskal et al. [33] provide an approach that is
based on the Open Digital Rights Language. Finally,
Taelman et al. [34] introduce a framework to optimize
federated querying over documents that require access
control, by incorporating authorizations into privacy-
preserving data summaries.
2.4. Web Crawlers
Web crawling [35] is a process that involves collect-
ing information on the Web by following links be-
tween pages. Web crawlers are typically used for Web
indexing to aid search engines. Focused crawling [36]
is a special form of Web crawling that prioritizes cer-
tain Web pages, such as Web pages about a certain
topic, or domains for a certain country. LTQP can therefore be considered an area of focused crawling where the priority lies in achieving query results.
Web crawlers are often used for discovering vulner-
able Web sites, for example through Google
Dorking [37], which involves using Google Search to
find Web sites that are misconfigured or use vulnera-
ble software. Furthermore, crawlers are often used to
find private information on Web sites. Such issues are
however not the focus of this work. Instead, we are in-
terested in the security of the crawling process itself,
for which little research has been done to the best of
our knowledge.
One related work in this area involves abusing
crawlers to initiate attacks on other Web sites [38].
This may cause performance degradation on the at-
tacked Web site, or could even cause the crawling
agent to be blocked by the server. These attacks in-
volve convincing the crawler to follow a link to a
third-party Web site that exploits a certain vulnerabili-
ty, such as an SQL injection. Additionally, this work
describes a type of attack that allows vulnerable Web
sites to be used for improving the PageRank [39] of an
attacker-owned Web site via forged backlinks.
Some other works focus on mitigation of so-called
crawler traps [40, 41] or spider traps. These are sets
of URLs that cause an infinite crawling process, which
can either be intentional or accidental. Such crawler
traps can have multiple causes:
- Links between dynamic pages that are based on URLs with query parameters;
- Infinite redirection loops using the HTTP 3xx range;
- Links to search APIs;
- Infinitely paged resources, such as calendars;
- Incorrect relative URLs that continuously increase the URL length.
Crawler traps are mostly detected through human intervention when many documents within a single domain are discovered. Recently, a new detection technique
was introduced [42] that attempts to measure the dis-
tance between documents, and rejects links to docu-
ments that are too similar.
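Some of the causes listed above can be guarded against with simple client-side heuristics; the sketch below (with illustrative, made-up thresholds) rejects overly long URLs, which catches relative-URL loops that keep growing, and caps the number of documents fetched per host:

```python
from collections import Counter
from urllib.parse import urlparse

MAX_URL_LENGTH = 2000    # guards against ever-growing relative URLs
MAX_DOCS_PER_HOST = 500  # guards against infinitely paged resources

docs_per_host = Counter()

def may_follow(url: str) -> bool:
    """Return whether a traversal engine should dereference this link."""
    if len(url) > MAX_URL_LENGTH:
        return False
    host = urlparse(url).netloc
    if docs_per_host[host] >= MAX_DOCS_PER_HOST:
        return False
    docs_per_host[host] += 1
    return True
```

Distance-based detection, as in the technique mentioned above, would complement such static limits by also rejecting links to near-duplicate documents.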
2.5. Web Browsers
Web browsers enable users to visualize and interact
with Web pages. This interaction is closely related to
LTQP, with the main difference that LTQP works au-
tonomously, while Web browsers are user-driven.
Considering this close resemblance between these two
domains, we give an overview of the main security
vulnerabilities in Web browsers.
2.5.1. Modern Web Browser Architecture
Silic et al. [43] analyzed the architectures of modern
Web browsers, determined the main vulnerabilities,
and discussed how these issues are coped with.
Architecture-wise, browsers can be categorized into
monolithic and modular browser architectures. The
difference between the two is that the former does not
provide isolation between concurrently executed Web
programs, while the latter does. The authors argue that
a modular architecture is important for security, fault-