Efficient Evaluation of Arbitrary Relational Calculus Queries


Logical Methods in Computer Science

Volume 19, Issue 4, 2023, pp. 38:1–38:40

https://lmcs.episciences.org/

Submitted Oct. 21, 2022

Published Dec. 22, 2023

EFFICIENT EVALUATION OF ARBITRARY RELATIONAL CALCULUS QUERIES

MARTIN RASZYK a, DAVID BASIN b, SRÐAN KRSTIĆ b, AND DMITRIY TRAYTEL c

a DFINITY, Zurich, Switzerland

e-mail address: martin.raszyk@dfinity.org

b Department of Computer Science, ETH Zürich, Zurich, Switzerland

e-mail address: {basin, srdan.krstic}@inf.ethz.ch

c Department of Computer Science, University of Copenhagen, Copenhagen, Denmark

e-mail address: traytel@di.ku.dk

Abstract. The relational calculus (RC) is a concise, declarative query language. However, existing RC query evaluation approaches are inefficient and often deviate from established algorithms based on finite tables used in database management systems. We devise a new translation of an arbitrary RC query into two safe-range queries, for which the finiteness of the query’s evaluation result is guaranteed. Assuming an infinite domain, the two queries have the following meaning: The first is closed and characterizes the original query’s relative safety, i.e., whether given a fixed database, the original query evaluates to a finite relation. The second safe-range query is equivalent to the original query, if the latter is relatively safe. We compose our translation with other, more standard ones to ultimately obtain two SQL queries. This allows us to use standard database management systems to evaluate arbitrary RC queries. We show that our translation improves the time complexity over existing approaches, which we also empirically confirm in both realistic and synthetic experiments.

1. Introduction

Codd’s theorem states that all domain-independent queries of the relational calculus (RC) can be expressed in relational algebra (RA) [Cod72]. A popular interpretation of this result is that RA suffices to express all interesting queries. This interpretation justifies why SQL evolved as the practical database query language with the RA as its mathematical foundation. SQL is declarative and abstracts over the actual RA expression used to evaluate a query. Yet, SQL’s syntax inherits RA’s deliberate syntactic limitations, such as union-compatibility, which ensure domain independence. RC does not have such syntactic limitations, which arguably makes it a more attractive declarative query language than both RA and SQL. The main problem of RC is that it is not immediately clear how to evaluate even domain-independent queries, much less how to handle the domain-dependent (i.e., not domain-independent) ones.

Key words and phrases: Relational calculus, relative safety, safe range, query translation.

Logical Methods in Computer Science, DOI:10.46298/LMCS-19(4:38)2023. © M. Raszyk, D. Basin, S. Krstić, and D. Traytel, Creative Commons.

As a running example, consider a shop in which brands (unary finite relation $B$ of brands) sell products (binary finite relation $P$ relating brands and products) and products are reviewed by users with a score (ternary finite relation $S$ relating products, users, and scores). We consider a brand suspicious if there is a user and a score such that all the brand’s products were reviewed by that user with that score. An RC query computing suspicious brands is

$$Q^{susp} := B(b) \land \exists u, s.\ \forall p.\ P(b, p) \longrightarrow S(p, u, s).$$

This query is domain independent and follows closely our informal description. It is not, however, clear how to evaluate it because its second conjunct is domain dependent as it is satisfied for every brand that does not occur in $P$. Finding suspicious brands using RA or SQL is a challenge, which only the best students from an undergraduate database course will accomplish. We give away an RA answer next (where $-$ is the set difference operator and $\triangleright$ is the anti-join, also known as the generalized difference operator [AHV95]):

$$\pi_{brand}((\pi_{user,score}(S) \times B) - \pi_{brand,user,score}((\pi_{user,score}(S) \times P) \triangleright S)) \cup (B - \pi_{brand}(P)).$$

The highlighted expressions $\pi_{user,score}(S)$ are called generators. They ensure that the left operands of the anti-join and set difference operators include or have the same columns (i.e., are union-compatible) as the corresponding right operands. (Following Codd [Cod72], one could also use the active domain to obtain canonical, but far less efficient, generators.)
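The RA expression above can be replayed on a toy database to see what the generators do. The following is a minimal sketch, assuming relations are plain Python sets of tuples; the function names and the sample data are hypothetical and not taken from the paper. It evaluates the RA expression step by step and cross-checks it against a direct reading of $Q^{susp}$ in which a brand without products is suspicious (any user and score from the non-empty domain is a witness).

```python
def suspicious_brands_ra(B, P, S):
    """Evaluate the RA expression for Q^susp on sets of tuples.

    B: set of brands, P: set of (brand, product), S: set of (product, user, score).
    """
    gen = {(u, s) for (_, u, s) in S}                    # generator: pi_{user,score}(S)
    gen_x_B = {(b, u, s) for (u, s) in gen for b in B}   # gen x B
    # Anti-join (gen x P) |> S: keep rows that have no matching review in S.
    anti = {(b, p, u, s) for (u, s) in gen for (b, p) in P if (p, u, s) not in S}
    counterexamples = {(b, u, s) for (b, _, u, s) in anti}
    with_witness = {b for (b, _, _) in gen_x_B - counterexamples}
    without_products = set(B) - {b for (b, _) in P}      # B - pi_brand(P)
    return with_witness | without_products


def suspicious_brands_direct(B, P, S):
    """Read Q^susp directly; candidate (user, score) pairs must occur in S
    whenever the brand has at least one product."""
    gen = {(u, s) for (_, u, s) in S}
    result = set()
    for b in B:
        products = {p for (b2, p) in P if b2 == b}
        if not products or any(all((p, u, s) in S for p in products) for (u, s) in gen):
            result.add(b)
    return result


# Hypothetical sample data: brand "C" has no products at all.
B = {"A", "B", "C"}
P = {("A", "p1"), ("A", "p2"), ("B", "p3"), ("B", "p4")}
S = {("p1", "u1", 5), ("p2", "u1", 5), ("p3", "u1", 5), ("p4", "u2", 5)}
```

On this data, brand A qualifies because user u1 gave score 5 to both of its products, and brand C qualifies only through the last operand $B - \pi_{brand}(P)$ when $S$ is empty.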

Van Gelder and Topor [GT87, GT91] present a translation from a decidable class of domain-independent RC queries, called evaluable, to RA expressions. Their translation of the evaluable $Q^{susp}$ query would yield different generators, replacing both highlighted parts with $\pi_{user}(S) \times \pi_{score}(S)$. That one can avoid this Cartesian product as shown above is subtle: Replacing only the first highlighted generator with the product results in an inequivalent RA expression.
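To get a feeling for why the choice of generator matters, one can compare the sizes of the two candidate generators on a small instance. This is only an illustrative sketch with hypothetical data (every review has a distinct user and a distinct score); it is not the paper's complexity analysis. The projection $\pi_{user,score}(S)$ never has more rows than $S$, whereas the product $\pi_{user}(S) \times \pi_{score}(S)$ can be quadratic in the size of $S$.

```python
def generator_sizes(S):
    """Return (|pi_{user,score}(S)|, |pi_user(S) x pi_score(S)|) for a set S
    of (product, user, score) tuples."""
    pairs = {(u, s) for (_, u, s) in S}
    users = {u for (_, u, _) in S}
    scores = {s for (_, _, s) in S}
    return len(pairs), len(users) * len(scores)


# Hypothetical worst-case-like instance: 100 reviews, all users and scores distinct.
S_example = {(f"p{i}", f"u{i}", i) for i in range(100)}
pair_size, product_size = generator_sizes(S_example)
```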

Once we have identified suspicious brands, we may want to obtain the users whose scoring made the brands suspicious. In RC, omitting $u$’s quantifier from $Q^{susp}$ achieves just that:

$$Q^{susp}_{user} := B(b) \land \exists s.\ \forall p.\ P(b, p) \longrightarrow S(p, u, s).$$

In contrast, RA cannot express the same property as it is domain dependent (hence also not evaluable and thus out of scope for Van Gelder and Topor’s translation): $Q^{susp}_{user}$ is satisfied for every user if a brand has no products, i.e., it does not occur in $P$. Yet, $Q^{susp}_{user}$ is satisfied for finitely many users on every database instance where $P$ contains at least one row for every brand from the relation $B$; in other words, $Q^{susp}_{user}$ is relatively safe on such database instances.

How does one evaluate queries that are not evaluable or even domain dependent? The main approaches from the literature (Section 2) are either to use variants of the active domain semantics [BL00, HS94, AGSS86] or to abandon finite relations entirely and evaluate queries using finite representations of infinite (but well-behaved) relations such as systems of constraints [Rev02] or automatic structures [BG04]. These approaches favor expressiveness over efficiency. But unlike query translations, they cannot benefit from decades of practical database research and engineering.

In this work, we translate arbitrary RC queries to RA expressions under the assumption of an infinite domain. To deal with queries that are domain dependent, our translation produces two RA expressions, instead of a single equivalent one. The first RA expression characterizes the original RC query’s relative safety, the decidable question of whether the query evaluates to a finite relation for a given database, which can be the case even for a domain-dependent query, e.g., $Q^{susp}_{user}$. If the original query is relatively safe on a given database, i.e., produces some finite result, then the second RA expression evaluates to the same finite result. Taken together, the two RA expressions solve the query capturability problem [AH91]: they allow us to enumerate the original RC query’s finite evaluation result, or to learn that it would be infinite using RA operations on the unmodified database.
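For the running example $Q^{susp}_{user}$, the idea of a pair of queries can be sketched by hand. The code below is an assumption-laden illustration, not the output of the paper's translation: the relative safety check is derived directly from the observation that the result is infinite exactly when some brand in $B$ has no row in $P$, and the function names are hypothetical. The second function returns the finite result only when the first check succeeds.

```python
def relatively_safe_susp_user(B, P):
    """Closed check: is the result of Q^susp_user finite on this database?

    Under an infinite domain, it is finite iff every brand in B occurs in P.
    """
    return set(B) <= {b for (b, _) in P}


def eval_susp_user(B, P, S):
    """Return the finite set of (brand, user) pairs, or None if the result
    would be infinite."""
    if not relatively_safe_susp_user(B, P):
        return None
    gen = {(u, s) for (_, u, s) in S}  # candidate (user, score) pairs
    result = set()
    for b in B:
        products = {p for (b2, p) in P if b2 == b}  # non-empty by the check above
        for (u, s) in gen:
            if all((p, u, s) in S for p in products):
                result.add((b, u))
    return result


# Hypothetical sample data.
P_ex = {("A", "p1"), ("B", "p2")}
S_ex = {("p1", "u1", 5), ("p2", "u2", 4)}
```

Calling `eval_susp_user({"A", "B"}, P_ex, S_ex)` enumerates the finite answer, while adding a brand without products to the first argument makes the function report an infinite result by returning `None`.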


Figure 1: Overview of our translation. [Diagram not reproduced: it shows the pipeline RC, safe-range RC, SRNF, RANF, RA, SQL, labeled with Sections 3.1 to 3.4, Section 4, and Sections 6.1 to 6.5.]

Figure 1 summarizes our translation’s steps and the sections where they are presented. Starting from an RC query, it produces two SQL queries via transformations to safe-range queries, the safe-range normal form (SRNF), the relational algebra normal form (RANF), and RA, respectively (Section 3). This article’s main contribution is the first step: translating an RC query into two safe-range RC queries (Section 4), which fundamentally differs from Van Gelder and Topor’s approach and produces better generators, like $\pi_{user,score}(S)$ above. Our generators strictly improve the time complexity of query evaluation (Section 5).

After the standard transformations from safe-range to RANF queries and from there to RA expressions, we translate the RA expressions into SQL using the radb tool [Yan19] (Section 6). We leverage various ideas from the literature to optimize the overall result. For example, we generalize Claußen et al. [CKMP97]’s approach to avoid evaluating Cartesian products like $\pi_{user,score}(S) \times P$ in RANF queries by using count aggregations (Section 6.3).
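The counting idea can be illustrated on $Q^{susp}$: instead of materializing the product of the generator with $P$ and removing counterexamples, one counts, per brand and (user, score) pair, how many of the brand's products carry that review, and compares the count with the brand's total number of products. The sketch below is written in the spirit of this idea only; it is not the generalization developed in Section 6.3, and the function name and data layout are assumptions.

```python
from collections import Counter


def suspicious_by_counting(B, P, S):
    """Evaluate Q^susp with count aggregations instead of a Cartesian product.

    B: set of brands, P: set of (brand, product), S: set of (product, user, score).
    """
    # Number of products per brand; P is a set, so its rows are distinct.
    n_products = Counter(b for (b, _) in P)

    # Index reviews by product so that the join of P and S touches matching rows only.
    reviews_of = {}
    for (p, u, s) in S:
        reviews_of.setdefault(p, set()).add((u, s))

    # For each (brand, user, score): number of the brand's products with that review.
    matched = Counter()
    for (b, p) in P:
        for (u, s) in reviews_of.get(p, ()):
            matched[(b, u, s)] += 1

    # The universal quantifier holds iff the two counts agree.
    with_products = {b for (b, _, _), n in matched.items() if n == n_products[b]}
    without_products = {b for b in B if n_products[b] == 0}
    return (with_products & set(B)) | without_products


# Hypothetical sample data, matching the shop example.
B_cnt = {"A", "B", "C"}
P_cnt = {("A", "p1"), ("A", "p2"), ("B", "p3"), ("B", "p4")}
S_cnt = {("p1", "u1", 5), ("p2", "u1", 5), ("p3", "u1", 5), ("p4", "u2", 5)}
```

The work done here is bounded by the size of the join of $P$ and $S$ on the product column, rather than by the size of $\pi_{user,score}(S) \times P$.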

The translation to SQL enables any standard database management system (DBMS) to evaluate RC queries. We implement our translation and then use either PostgreSQL or MySQL for query evaluation. Using a real Amazon review dataset [NLM19] and our synthetic benchmark that generates hard database instances for random RC queries (Section 7), we evaluate our translation’s performance (Section 8). The evaluation shows that our approach outperforms Van Gelder and Topor’s translation (which also uses a standard DBMS for evaluation) and other RC evaluation approaches based on constraint databases and structure reduction.
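To make the "RC query in, SQL out" workflow concrete, the sketch below runs a hand-written SQL formulation of $Q^{susp}$ on an in-memory database. Several things here are assumptions: the SQL text is written by hand and is not what RC2SQL produces, Python's built-in `sqlite3` module stands in for PostgreSQL or MySQL, and the table and column names (`brand`, `product`, `reviewer`, `score`) are hypothetical.

```python
import sqlite3

# Hand-written SQL for Q^susp; the derived table g plays the role of the
# generator pi_{user,score}(S).
QUERY = """
SELECT b.brand FROM B AS b
WHERE NOT EXISTS (SELECT 1 FROM P AS p WHERE p.brand = b.brand)
   OR EXISTS (
        SELECT 1 FROM (SELECT DISTINCT reviewer, score FROM S) AS g
        WHERE NOT EXISTS (
            SELECT 1 FROM P AS p
            WHERE p.brand = b.brand
              AND NOT EXISTS (
                  SELECT 1 FROM S AS s
                  WHERE s.product = p.product
                    AND s.reviewer = g.reviewer
                    AND s.score = g.score)))
ORDER BY b.brand
"""


def suspicious_brands_sql(B, P, S):
    """Load the three relations into an in-memory SQLite database and run QUERY."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE B (brand TEXT)")
    con.execute("CREATE TABLE P (brand TEXT, product TEXT)")
    con.execute("CREATE TABLE S (product TEXT, reviewer TEXT, score INTEGER)")
    con.executemany("INSERT INTO B VALUES (?)", [(b,) for b in B])
    con.executemany("INSERT INTO P VALUES (?, ?)", list(P))
    con.executemany("INSERT INTO S VALUES (?, ?, ?)", list(S))
    rows = con.execute(QUERY).fetchall()
    con.close()
    return [row[0] for row in rows]


# Hypothetical sample data, matching the shop example.
B_sql = {"A", "B", "C"}
P_sql = {("A", "p1"), ("A", "p2"), ("B", "p3"), ("B", "p4")}
S_sql = {("p1", "u1", 5), ("p2", "u1", 5), ("p3", "u1", 5), ("p4", "u2", 5)}
```

The nested `NOT EXISTS` clauses encode the universal quantifier, and the first disjunct covers brands that do not occur in `P`.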

In summary, our three main contributions are as follows:

• We devise a translation of an arbitrary RC query into a pair of RA expressions as described above. The time complexity of evaluating our translation’s results improves upon Van Gelder and Topor’s approach [GT91].

• We implement our translation and extend it to produce SQL queries. The resulting tool RC2SQL makes RC a viable input language for any standard DBMS. We evaluate our tool on synthetic and real data and confirm that our translation’s improved asymptotic time complexity carries over into practice.

• To challenge RC2SQL (and its competitors) in our evaluation, we devise the Data Golf benchmark that generates hard database instances for randomly generated RC queries.

This article extends our ICDT 2022 conference paper [RBKT22b] with a more complete description of the translation. In particular, it describes the steps that follow our main contribution, the translation of RC queries into two safe-range queries. In addition, we formally verify the functional correctness (but not the complexity analysis) of the main contribution using the Isabelle/HOL proof assistant [RT22]. The theorems and examples that have been verified in Isabelle are marked with a special symbol. The formalization helped us identify and correct a technical oversight in the algorithm from the conference paper (even though the problem was compensated for by the subsequent steps of the translation in our implementation).


2. Related Work

We recall Trakhtenbrot’s theorem and the fundamental notions of capturability and data complexity. Given an RC query over a finite domain, Trakhtenbrot [Tra50] showed that it is undecidable whether there exists a (finite) structure and a variable assignment satisfying the query. In contrast, the question of whether a fixed structure and a fixed variable assignment satisfies the given RC query is decidable [AGSS86].

Kifer [Kif88] calls a query class capturable if there is an algorithm that, given a query in the class and a database instance, enumerates the query’s evaluation result, i.e., all tuples satisfying the query. Avron and Hirshfeld [AH91] observe that Kifer’s notion is restricted because it requires every query in a capturable class to be domain independent. Hence, they propose an alternative definition that we also use: A query class is capturable if there is an algorithm that, given a query in the class, a (finite or infinite) domain, and a database instance, determines whether the query’s evaluation result on the database instance over the domain is finite and enumerates the result in this case. Our work solves Avron and Hirshfeld’s capturability problem additionally assuming an infinite domain.

Data complexity [Var82] is the complexity of recognizing if a tuple satisfies a fixed query over a database, as a function of the database size. Our capturability algorithm provides an upper bound on the data complexity for RC queries over an infinite domain that have a finite evaluation result (but it cannot decide if a tuple belongs to a query’s result if the result is infinite).

Next, we group related approaches to evaluating RC queries into three categories.

Structure reduction. The classical approach to handling arbitrary RC queries is to evaluate them under a finite structure [Lib04]. The core question here is whether the evaluation produces the same result as defined by the natural semantics, which typically considers infinite domains. Codd’s theorem [Cod72] affirmatively answers this question for domain-independent queries, restricting the structure to the active domain. Ailamazyan et al. [AGSS86] show that RC is a capturable query class by extending the active domain with a few additional elements, whose number depends only on the query, and evaluating the query over this finite domain. Natural–active collapse results [BL00] generalize Ailamazyan et al.’s [AGSS86] result to extensions of RC (e.g., with order relations) by combining the structure reduction with a translation-based approach. Hull and Su [HS94] study several semantics of RC that guarantee the finiteness of the query’s evaluation result. In particular, the “output-restricted unlimited interpretation” only restricts the query’s evaluation result to tuples that only contain elements in the active domain, but the quantified variables still range over the (finite or infinite) underlying domain. Our work is inspired by all these theoretical landmarks, in particular Hull and Su’s work (Section 4.1). Yet we avoid using (extended) active domains, which make query evaluation impractical.
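The flavor of structure reduction with an extended active domain can be sketched on the query $B(b) \land \exists s.\ \forall p.\ P(b, p) \longrightarrow S(p, u, s)$ with free variables $b$ and $u$. The code below is a simplified illustration under stated assumptions, not Ailamazyan et al.'s algorithm: it adds a single fresh element to the active domain, which suffices for this particular query, whereas the general construction chooses the number of extra elements based on the query. All names are hypothetical. If the fresh element shows up in the result, the result over an infinite domain is infinite.

```python
from itertools import product

FRESH = object()  # stands for "some element outside the active domain"


def active_domain(B, P, S):
    """Collect all values that occur in the database."""
    adom = set(B)
    for row in P | S:
        adom.update(row)
    return adom


def eval_over(domain, B, P, S):
    """Brute-force evaluation of the query over a finite domain."""
    return {
        (b, u)
        for (b, u) in product(domain, repeat=2)
        if b in B
        and any(all((b, p) not in P or (p, u, s) in S for p in domain) for s in domain)
    }


def capture(B, P, S):
    """Return the finite result, or None if the fresh element appears in it."""
    domain = active_domain(B, P, S) | {FRESH}
    result = eval_over(domain, B, P, S)
    if any(FRESH in row for row in result):
        return None
    return result


# Hypothetical sample data.
P_sr = {("A", "p1"), ("B", "p2")}
S_sr = {("p1", "u1", 5), ("p2", "u2", 4)}
```

The brute-force enumeration over all pairs, scores, and products from the (extended) active domain hints at why such approaches scale poorly compared to generators tailored to the query.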

Query translation. Another strategy is to translate a given query into one that can be evaluated efficiently, for example as a sequence of RA operations on finite tables. Van Gelder and Topor pioneered this approach [GT87, GT91] for RC. A core component of their translation is the choice of generators, which replace the active domain restrictions from structure reduction approaches and thereby improve the time complexity. Extensions to scalar and complex function symbols have also been studied [EHJ93, LYL08]. All these approaches focus on syntactic classes of RC, for which domain independence is given, e.g., the evaluable queries of Van Gelder and Topor (Appendix A). Our approach is inspired by Van Gelder and Topor’s work but generalizes it to handle arbitrary RC queries at the cost of assuming an infinite domain. Also, we further improve the time complexity of Van Gelder and Topor’s approach by choosing better generators.

Evaluation with infinite relations. Constraint databases [Rev02] obviate the need for using RA operations on finite tables. This yields significant expressiveness gains as domain independence need not be assumed. Yet the efficiency of the quantifier elimination procedures employed cannot compare with the simple evaluation of the RA’s projection operation. Similarly, automatic structures [BG04] can represent the results of arbitrary RC queries finitely, but struggle with large quantities of data. We demonstrate this in our evaluation where we compare our translation to several modern incarnations of the above approaches, all based on binary decision diagrams [MLAH99, Møl02, CGS09, KM01, BKMZ15].

3. Preliminaries

We introduce the RC syntax and semantics and define relevant classes of RC queries.

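3.1. Relational Calculus. A signature $\sigma$ is a triple $(\mathcal{C}, \mathcal{R}, \iota)$, where $\mathcal{C}$ and $\mathcal{R}$ are disjoint finite sets of constant and predicate symbols, and the function $\iota : \mathcal{R} \to \mathbb{N}$ maps each predicate symbol $r \in \mathcal{R}$ to its arity $\iota(r)$. Let $\sigma = (\mathcal{C}, \mathcal{R}, \iota)$ be a signature and $\mathcal{V}$ a countably infinite set of variables disjoint from $\mathcal{C} \cup \mathcal{R}$. The following grammar defines the syntax of RC queries:

$$Q ::= \bot \mid \top \mid x \approx t \mid r(t_1, \ldots, t_{\iota(r)}) \mid \neg Q \mid Q \lor Q \mid Q \land Q \mid \exists x.\ Q.$$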

Here, $r \in \mathcal{R}$ is a predicate symbol, $t, t_1, \ldots, t_{\iota(r)} \in \mathcal{V} \cup \mathcal{C}$ are terms, and $x \in \mathcal{V}$ is a variable. We write $\exists \vec{v}.\ Q$ for $\exists v_1. \ldots \exists v_k.\ Q$ and $\forall \vec{v}.\ Q$ for $\neg \exists \vec{v}.\ \neg Q$, where $\vec{v}$ is a variable sequence $v_1, \ldots, v_k$. If $k = 0$, then both $\exists \vec{v}.\ Q$ and $\forall \vec{v}.\ Q$ denote just $Q$. Quantifiers have lower precedence than conjunctions and disjunctions, e.g., $\exists x.\ Q_1 \land Q_2$ means $\exists x.\ (Q_1 \land Q_2)$. We use $\approx$ to denote the equality of terms in RC to distinguish it from $=$, which denotes syntactic object identity. We also write $Q_1 \longrightarrow Q_2$ for $\neg Q_1 \lor Q_2$. However, writing $Q_1 \lor Q_2$ for $\neg(\neg Q_1 \land \neg Q_2)$ would complicate later definitions, e.g., the safe-range queries (Section 3.2).

We define the subquery partial order $\sqsubseteq$ on queries as the (reflexive and transitive) subterm relation on the datatype of RC queries. For example, $Q_2$ is a subquery of the query $Q_1 \land \neg \exists y.\ Q_2$. We denote by $\mathsf{sub}(Q)$ the set of subqueries of a query $Q$, by $\mathsf{fv}(Q)$ the set of free variables in $Q$, and by $\mathsf{av}(Q)$ the set of all (free and bound) variables in a query $Q$. Furthermore, we denote by $\vec{\mathsf{fv}}(Q)$ the sequence of free variables in $Q$ based on some fixed ordering of variables. We lift this notation to sets of queries in the standard way. A query $Q$ with no free variables, i.e., $\mathsf{fv}(Q) = \emptyset$, is called closed. Queries of the form $r(t_1, \ldots, t_{\iota(r)})$ and $x \approx \mathsf{c}$ are called atomic predicates. We define the predicate $\mathsf{ap}(\cdot)$ characterizing atomic predicates, i.e., $\mathsf{ap}(Q)$ is true iff $Q$ is an atomic predicate. Queries of the form $\exists \vec{v}.\ r(t_1, \ldots, t_{\iota(r)})$ and $\exists \vec{v}.\ x \approx \mathsf{c}$ are called quantified predicates. We denote by $\tilde{\exists} x.\ Q$ the query obtained by existentially quantifying a variable $x$ from a query $Q$ if $x$ is free in $Q$, i.e., $\tilde{\exists} x.\ Q := \exists x.\ Q$ if $x \in \mathsf{fv}(Q)$ and $\tilde{\exists} x.\ Q := Q$ otherwise. We lift this notation to sets of queries in the standard way. We use $\tilde{\exists} x.\ Q$ (instead of $\exists x.\ Q$) when constructing a query to avoid introducing bound variables that never occur in $Q$.
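The syntax and the auxiliary notions above map naturally onto an algebraic datatype. The following sketch is one possible encoding, assuming Python dataclasses; the class and function names (`Var`, `Pred`, `fv`, `sub`, `exists_tilde`, and so on) are hypothetical and do not come from the paper's implementation or its Isabelle formalization. Universal quantification and implication are derived forms, exactly as in the text.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Const:
    name: str

Term = Union[Var, Const]

@dataclass(frozen=True)
class Bot:                      # the query "false"
    pass

@dataclass(frozen=True)
class Top:                      # the query "true"
    pass

@dataclass(frozen=True)
class Eq:                       # x ≈ t
    x: Var
    t: Term

@dataclass(frozen=True)
class Pred:                     # r(t1, ..., tn)
    r: str
    args: Tuple[Term, ...]

@dataclass(frozen=True)
class Neg:
    q: "Query"

@dataclass(frozen=True)
class Or:
    left: "Query"
    right: "Query"

@dataclass(frozen=True)
class And:
    left: "Query"
    right: "Query"

@dataclass(frozen=True)
class Exists:
    x: Var
    q: "Query"

Query = Union[Bot, Top, Eq, Pred, Neg, Or, And, Exists]


def term_vars(*terms):
    return frozenset(t.name for t in terms if isinstance(t, Var))

def fv(q):
    """Names of the free variables of a query."""
    if isinstance(q, (Bot, Top)):
        return frozenset()
    if isinstance(q, Eq):
        return term_vars(q.x, q.t)
    if isinstance(q, Pred):
        return term_vars(*q.args)
    if isinstance(q, Neg):
        return fv(q.q)
    if isinstance(q, (Or, And)):
        return fv(q.left) | fv(q.right)
    if isinstance(q, Exists):
        return fv(q.q) - {q.x.name}
    raise TypeError(f"not a query: {q!r}")

def sub(q):
    """Set of subqueries (reflexive, transitive subterm relation)."""
    if isinstance(q, (Neg, Exists)):
        return {q} | sub(q.q)
    if isinstance(q, (Or, And)):
        return {q} | sub(q.left) | sub(q.right)
    return {q}

def forall(x, q):               # derived form: ∀x. Q is ¬∃x. ¬Q
    return Neg(Exists(x, Neg(q)))

def implies(q1, q2):            # derived form: Q1 → Q2 is ¬Q1 ∨ Q2
    return Or(Neg(q1), q2)

def exists_tilde(x, q):         # quantify x only if it is free in q
    return Exists(x, q) if x.name in fv(q) else q


# The running examples Q^susp and Q^susp_user from the introduction.
b, u, s, p = (Var(n) for n in "busp")
body = forall(p, implies(Pred("P", (b, p)), Pred("S", (p, u, s))))
q_susp_user = And(Pred("B", (b,)), Exists(s, body))
q_susp = And(Pred("B", (b,)), Exists(u, Exists(s, body)))
```

With this encoding, `fv(q_susp)` contains only `b`, while `fv(q_susp_user)` also contains `u`, which reflects that omitting the quantifier turns the user into an output column.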

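A structure $\mathcal{S}$ over a signature $(\mathcal{C}, \mathcal{R}, \iota)$ consists of a non-empty domain $\mathcal{D}$ and interpretations $\mathsf{c}^{\mathcal{S}} \in \mathcal{D}$ and $r^{\mathcal{S}} \subseteq \mathcal{D}^{\iota(r)}$, for each $\mathsf{c} \in \mathcal{C}$ and $r \in \mathcal{R}$. We assume that all the relations $r^{\mathcal{S}}$ are finite. Note that this assumption does not yield a finite structure (as defined in finite model theory [Lib04]) since the domain $\mathcal{D}$ can still be infinite. A (variable) assignment is a mapping $\alpha : \mathcal{V} \to \mathcal{D}$. We extend $\alpha$ to constant symbols $\mathsf{c} \in \mathcal{C}$ with $\alpha(\mathsf{c}) = \mathsf{c}^{\mathcal{S}}$. We write $\alpha[x \mapsto d]$ for the assignment that maps $x$ to $d \in \mathcal{D}$ and is otherwise identical to $\alpha$. We lift this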
