Efficient Evaluation of Arbitrary Relational Calculus Queries

2025-05-03 0 0 707.02KB 40 页 10玖币
侵权投诉
Logical Methods in Computer Science
Volume 19, Issue 4, 2023, pp. 38:1–38:40
https://lmcs.episciences.org/
Submitted Oct. 21, 2022
Published Dec. 22, 2023
EFFICIENT EVALUATION OF
ARBITRARY RELATIONAL CALCULUS QUERIES
MARTIN RASZYK a, DAVID BASIN b, SR¯
DAN KRSTI´
Cb, AND DMITRIY TRAYTEL c
aDFINITY, Zurich, Switzerland
e-mail address: martin.raszyk@dfinity.org
bDepartment of Computer Science, ETH Z¨urich, Zurich, Switzerland
e-mail address:{basin, srdan.krstic}@inf.ethz.ch
cDepartment of Computer Science, University of Copenhagen, Copenhagen, Denmark
e-mail address: traytel@di.ku.dk
Abstract.
The relational calculus (RC) is a concise, declarative query language. However,
existing RC query evaluation approaches are inefficient and often deviate from established
algorithms based on finite tables used in database management systems. We devise a new
translation of an arbitrary RC query into two safe-range queries, for which the finiteness of
the query’s evaluation result is guaranteed. Assuming an infinite domain, the two queries
have the following meaning: The first is closed and characterizes the original query’s relative
safety, i.e., whether given a fixed database, the original query evaluates to a finite relation.
The second safe-range query is equivalent to the original query, if the latter is relatively safe.
We compose our translation with other, more standard ones to ultimately obtain two SQL
queries. This allows us to use standard database management systems to evaluate arbitrary
RC queries. We show that our translation improves the time complexity over existing
approaches, which we also empirically confirm in both realistic and synthetic experiments.
1. Introduction
Codd’s theorem states that all domain-independent queries of the relational calculus (RC) can
be expressed in relational algebra (RA) [
Cod72
]. A popular interpretation of this result is that
RA suffices to express all interesting queries. This interpretation justifies why SQL evolved as
the practical database query language with the RA as its mathematical foundation. SQL is
declarative and abstracts over the actual RA expression used to evaluate a query. Yet, SQL’s
syntax inherits RA’s deliberate syntactic limitations, such as union-compatibility, which
ensure domain independence. RC does not have such syntactic limitations, which arguably
makes it a more attractive declarative query language than both RA and SQL. The main
problem of RC is that it is not immediately clear how to evaluate even domain-independent
queries, much less how to handle the domain-dependent (i.e., not domain-independent) ones.
As a running example, consider a shop in which brands (unary finite relation Bof brands)
sell products (binary finite relation Prelating brands and products) and products are reviewed
by users with a score (ternary finite relation Srelating products, users, and scores). We
Key words and phrases: Relational calculus, relative safety, safe range, query translation.
LOGICAL METHODS
IN COMPUTER SCIENCE DOI:10.46298/LMCS-19(4:38)2023
© M. Raszyk, D. Basin, S. Krsti´
c, and D. Traytel
CC
Creative Commons
38:2 M. Raszyk, D. Basin, S. Krsti´
c, and D. Traytel Vol. 19:4
consider a brand suspicious if there is a user and a score such that all the brand’s products
were reviewed by that user with that score. An RC query computing suspicious brands is
Qsusp BB(b)∧ ∃u, s. p. P(b, p)S(p, u, s).
This query is domain independent and follows closely our informal description. It is not,
however, clear how to evaluate it because its second conjunct is domain dependent as it is
satisfied for every brand that does not occur in P. Finding suspicious brands using RA or
SQL is a challenge, which only the best students from an undergraduate database course
will accomplish. We give away an RA answer next (where
is the set difference operator
and is the anti-join, also known as the generalized difference operator [AHV95]):
πbrand ((πuser ,score (S)×B)πbrand,user,score ((πuser,score (S)×P)S)) (Bπbrand (P)).
The highlighted expressions
πuser,score
(S) are called generators. They ensure that the
left operands of the anti-join and set difference operators include or have the same columns
(i.e., are union-compatible) as the corresponding right operands. (Following Codd [
Cod72
],
one could also use the active domain to obtain canonical, but far less efficient, generators.)
Van Gelder and Topor [
GT87
,
GT91
] present a translation from a decidable class of
domain-independent RC queries, called evaluable, to RA expressions. Their translation of
the evaluable
Qsusp
query would yield different generators, replacing both highlighted parts
by
πuser
(S)
×πscore
(S). That one can avoid this Cartesian product as shown above is subtle:
Replacing only the first highlighted generator with the product results in an inequivalent
RA expression.
Once we have identified suspicious brands, we may want to obtain the users whose scoring
made the brands suspicious. In RC, omitting u’s quantifier from Qsusp achieves just that:
Qsusp
user BB(b)∧ ∃s. p. P(b, p)S(p, u, s).
In contrast, RA cannot express the same property as it is domain dependent (hence also not
evaluable and thus out of scope for Van Gelder and Topor’s translation):
Qsusp
user
is satisfied
for every user if a brand has no products, i.e., it does not occur in P. Yet,
Qsusp
user
is satisfied
for finitely many users on every database instance where Pcontains at least one row for every
brand from the relation B, in other words
Qsusp
user
is relatively safe on such database instances.
How does one evaluate queries that are not evaluable or even domain dependent? The
main approaches from the literature (Section 2) are either to use variants of the active
domain semantics [
BL00
,
HS94
,
AGSS86
] or to abandon finite relations entirely and evaluate
queries using finite representations of infinite (but well-behaved) relations such as systems of
constraints [
Rev02
] or automatic structures [
BG04
]. These approaches favor expressiveness
over efficiency. But unlike query translations, they cannot benefit from decades of practical
database research and engineering.
In this work, we translate arbitrary RC queries to RA expressions under the assumption
of an infinite domain. To deal with queries that are domain dependent, our translation
produces two RA expressions, instead of a single equivalent one. The first RA expression
characterizes the original RC query’s relative safety, the decidable question of whether the
query evaluates to a finite relation for a given database, which can be the case even for
a domain-dependent query, e.g.,
Qsusp
user
. If the original query is relatively safe on a given
database, i.e., produces some finite result, then the second RA expression evaluates to the
same finite result. Taken together, the two RA expressions solve the query capturability
problem [
AH91
]: they allow us to enumerate the original RC query’s finite evaluation result,
or to learn that it would be infinite using RA operations on the unmodified database.
Vol. 19:4 EFFICIENT EVALUATION OF ARBITRARY RELATIONAL CALCULUS QUERIES 38:3
RC
(Section 3.1)
Safe-range RC
(Section 3.2)
SRNF
(Section 3.3)
RANF
(Section 3.4) RA SQL
Section 4 Section 6.1 Section 6.2
Section 6.3
Section 6.4 Section 6.5
Figure 1: Overview of our translation.
Figure 1 summarizes our translation’s steps and the sections where they are presented.
Starting from an RC query, it produces two SQL queries via transformations to safe-range
queries, the safe-range normal form (SRNF), the relational algebra normal form (RANF), and
RA, respectively (Section 3). This article’s main contribution is the first step: translating
an RC query into two safe-range RC queries (Section 4), which fundamentally differs from
Van Gelder and Topor’s approach and produces better generators, like
πuser,score
(S) above.
Our generators strictly improve the time complexity of query evaluation (Section 5).
After the standard transformations from safe-range to RANF queries and from there
to RA expressions, we translate the RA expressions into SQL using the
radb
tool [
Yan19
]
(Section 6). We leverage various ideas from the literature to optimize the overall result. For
example, we generalize Claußen et al. [
CKMP97
]’s approach to avoid evaluating Cartesian
products like πuser,score (S)×Pin RANF queries by using count aggregations (Section 6.3).
The translation to SQL enables any standard database management system (DBMS) to
evaluate RC queries. We implement our translation and then use either
PostgreSQL
or
MySQL
for query evaluation. Using a real Amazon review dataset [
NLM19
] and our synthetic bench-
mark that generates hard database instances for random RC queries (Section 7), we evaluate
our translation’s performance (Section 8). The evaluation shows that our approach outper-
forms Van Gelder and Topor’s translation (which also uses a standard DBMS for evaluation)
and other RC evaluation approaches based on constraint databases and structure reduction.
In summary, our three main contributions are as follows:
We devise a translation of an arbitrary RC query into a pair of RA expressions as described
above. The time complexity of evaluating our translation’s results improves upon Van
Gelder and Topor’s approach [GT91].
We implement our translation and extend it to produce SQL queries. The resulting tool
RC2SQL
makes RC a viable input language for any standard DBMS. We evaluate our tool
on synthetic and real data and confirm that our translation’s improved asymptotic time
complexity carries over into practice.
To challenge
RC2SQL
(and its competitors) in our evaluation, we devise the Data Golf
benchmark that generates hard database instances for randomly generated RC queries.
This article extends our ICDT 2022 conference paper [
RBKT22b
] with a more complete
description of the translation. In particular, it describes the steps that follow our main contri-
bution – the translation of RC queries into two safe-range queries. In addition, we formally ver-
ify the functional correctness (but not the complexity analysis) of the main contribution using
the Isabelle/HOL proof assistant [
RT22
]. The theorems and examples that have been verified
in Isabelle are marked with a special symbol ( ). The formalization helped us identify and
correct a technical oversight in the algorithm from the conference paper (even though the prob-
lem was compensated for by the subsequent steps of the translation in our implementation).
38:4 M. Raszyk, D. Basin, S. Krsti´
c, and D. Traytel Vol. 19:4
2. Related Work
We recall Trakhtenbrot’s theorem and the fundamental notions of capturability and data
complexity. Given an RC query over a finite domain, Trakhtenbrot [
Tra50
] showed that it is
undecidable whether there exists a (finite) structure and a variable assignment satisfying the
query. In contrast, the question of whether a fixed structure and a fixed variable assignment
satisfies the given RC query is decidable [AGSS86].
Kifer [
Kif88
] calls a query class capturable if there is an algorithm that, given a query
in the class and a database instance, enumerates the query’s evaluation result, i.e., all tuples
satisfying the query. Avron and Hirshfeld [
AH91
] observe that Kifer’s notion is restricted
because it requires every query in a capturable class to be domain independent. Hence, they
propose an alternative definition that we also use: A query class is capturable if there is
an algorithm that, given a query in the class, a (finite or infinite) domain, and a database
instance, determines whether the query’s evaluation result on the database instance over
the domain is finite and enumerates the result in this case. Our work solves Avron and
Hirshfeld’s capturability problem additionally assuming an infinite domain.
Data complexity [
Var82
] is the complexity of recognizing if a tuple satisfies a fixed query
over a database, as a function of the database size. Our capturability algorithm provides an up-
per bound on the data complexity for RC queries over an infinite domain that have a finite eval-
uation result (but it cannot decide if a tuple belongs to a query’s result if the result is infinite).
Next, we group related approaches to evaluating RC queries into three categories.
Structure reduction. The classical approach to handling arbitrary RC queries is
to evaluate them under a finite structure [
Lib04
]. The core question here is whether the
evaluation produces the same result as defined by the natural semantics, which typically
considers infinite domains. Codd’s theorem [Cod72] affirmatively answers this question for
domain-independent queries, restricting the structure to the active domain. Ailamazyan et
al. [
AGSS86
] show that RC is a capturable query class by extending the active domain with
a few additional elements, whose number depends only on the query, and evaluating the
query over this finite domain. Natural–active collapse results [
BL00
] generalize Ailamazyan
et al.’s [
AGSS86
] result to extensions of RC (e.g., with order relations) by combining the
structure reduction with a translation-based approach. Hull and Su [
HS94
] study several
semantics of RC that guarantee the finiteness of the query’s evaluation result. In particular,
the “output-restricted unlimited interpretation” only restricts the query’s evaluation result
to tuples that only contain elements in the active domain, but the quantified variables still
range over the (finite or infinite) underlying domain. Our work is inspired by all these
theoretical landmarks, in particular Hull and Su’s work (Section 4.1). Yet we avoid using
(extended) active domains, which make query evaluation impractical.
Query translation. Another strategy is to translate a given query into one that can
be evaluated efficiently, for example as a sequence of RA operations on finite tables. Van
Gelder and Topor pioneered this approach [
GT87
,
GT91
] for RC. A core component of their
translation is the choice of generators, which replace the active domain restrictions from
structure reduction approaches and thereby improve the time complexity. Extensions to
scalar and complex function symbols have also been studied [
EHJ93
,
LYL08
]. All these
approaches focus on syntactic classes of RC, for which domain independence is given, e.g.,
the evaluable queries of Van Gelder and Topor (Appendix A). Our approach is inspired by
Van Gelder and Topor’s work but generalizes it to handle arbitrary RC queries at the cost
Vol. 19:4 EFFICIENT EVALUATION OF ARBITRARY RELATIONAL CALCULUS QUERIES 38:5
of assuming an infinite domain. Also, we further improve the time complexity of Van Gelder
and Topor’s approach by choosing better generators.
Evaluation with infinite relations. Constraint databases [
Rev02
] obviate the need for
using RA operations on finite tables. This yields significant expressiveness gains as domain
independence need not be assumed. Yet the efficiency of the quantifier elimination procedures
employed cannot compare with the simple evaluation of the RA’s projection operation.
Similarly, automatic structures [
BG04
] can represent the results of arbitrary RC queries
finitely, but struggle with large quantities of data. We demonstrate this in our evaluation
where we compare our translation to several modern incarnations of the above approaches,
all based on binary decision diagrams [MLAH99, Møl02, CGS09, KM01, BKMZ15].
3. Preliminaries
We introduce the RC syntax and semantics and define relevant classes of RC queries.
3.1. Relational Calculus. A signature
σ
is a triple (
C,R, ι
), where
C
and
R
are disjoint
finite sets of constant and predicate symbols, and the function
ι
:
R → N
maps each predicate
symbol
r∈ R
to its arity
ι
(
r
). Let
σ
= (
C,R, ι
) be a signature and
V
a countably infinite set
of variables disjoint from C ∪ R. The following grammar defines the syntax of RC queries:
Q::= ⊥ | ⊤ | xt|r(t1, . . . , tι(r))| ¬Q|QQ|QQ| ∃x. Q.
Here,
r∈ R
is a predicate symbol,
t, t1, . . . , tι(r) V ∪ C
are terms, and
x∈ V
is a variable.
We write
v. Q
for
v1. . . . vk. Q
and
v. Q
for
¬∃v. ¬Q
, where
v
is a variable sequence
v1, . . . , vk
. If
k
= 0, then both
v. Q
and
v. Q
denote just
Q
. Quantifiers have lower
precedence than conjunctions and disjunctions, e.g.,
x. Q1Q2
means
x.
(
Q1Q2
). We
use
to denote the equality of terms in RC to distinguish it from =, which denotes syntactic
object identity. We also write
Q1Q2
for
¬Q1Q2
. However, writing
Q1Q2
for
¬(¬Q1∧ ¬Q2) would complicate later definitions, e.g., the safe-range queries (Section 3.2).
We define the subquery partial order
on queries as the (reflexive and transitive) sub-
term relation on the datatype of RC queries. For example,
Q1
is a subquery of the query
Q1
¬∃y. Q2
. We denote by
sub
(
Q
) the set of subqueries of a query
Q
, by
fv
(
Q
) the set of free vari-
ables in
Q
, and by
av
(
Q
) be the set of all (free and bound) variables in a query
Q
. Furthermore,
we denote by
fv
(
Q
) the sequence of free variables in
Q
based on some fixed ordering of variables.
We lift this notation to sets of queries in the standard way. A query
Q
with no free variables,
i.e.,
fv
(
Q
) =
, is called closed. Queries of the form
r
(
t1, . . . , tι(r)
) and
x
care called atomic
predicates. We define the predicate
ap
(
·
) characterizing atomic predicates, i.e.,
ap
(
Q
) is true
iff
Q
is an atomic predicate. Queries of the form
v. r
(
t1, . . . , tι(r)
) and
v. x
care called
quantified predicates. We denote by
˜
x. Q
the query obtained by existentially quantifying a
variable
x
from a query
Q
if
x
is free in
Q
, i.e.,
˜
x. Q Bx. Q
if
xfv
(
Q
) and
˜
x. Q BQ
oth-
erwise. We lift this notation to sets of queries in the standard way. We use
˜
x. Q
(instead of
x. Q
) when constructing a query to avoid introducing bound variables that never occur in
Q
.
A structure
S
over a signature (
C,R, ι
) consists of a non-empty domain
D
and interpre-
tations c
S∈ D
and
rS⊆ Dι(r)
, for each c
∈ C
and
r∈ R
. We assume that all the relations
rS
are finite. Note that this assumption does not yield a finite structure (as defined in finite
model theory [
Lib04
]) since the domain
D
can still be infinite. A (variable)assignment is
a mapping
α
:
V → D
. We extend
α
to constant symbols c
∈ C
with
α
(c) = c
S
. We write
α
[
x7→ d
] for the assignment that maps
x
to
d∈ D
and is otherwise identical to
α
. We lift this
摘要:

LogicalMethodsinComputerScienceVolume19,Issue4,2023,pp.38:1–38:40https://lmcs.episciences.org/SubmittedOct.21,2022PublishedDec.22,2023EFFICIENTEVALUATIONOFARBITRARYRELATIONALCALCULUSQUERIESMARTINRASZYKa,DAVIDBASINb,SR¯DANKRSTI´Cb,ANDDMITRIYTRAYTELcaDFINITY,Zurich,Switzerlande-mailaddress:martin.rasz...

展开>> 收起<<
Efficient Evaluation of Arbitrary Relational Calculus Queries.pdf

共40页,预览5页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!

相关推荐

分类:图书资源 价格:10玖币 属性:40 页 大小:707.02KB 格式:PDF 时间:2025-05-03

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 40
客服
关注