Quantifying the Loss of Acyclic Join Dependencies

Batya Kenig
Technion, Israel Institute of Technology
Haifa, Israel
batyak@technion.ac.il
Nir Weinberger
Technion, Israel Institute of Technology
Haifa, Israel
nirwein@technion.ac.il
ABSTRACT
Acyclic schemes possess known benefits for database design, speeding up queries, and reducing space requirements. An acyclic join dependency (AJD) is lossless with respect to a universal relation if joining the projections associated with the schema results in the original universal relation. An intuitive and standard measure of the loss entailed by an AJD is the number of redundant tuples generated by the acyclic join. Recent work has shown that the loss of an AJD can also be characterized by an information-theoretic measure. Motivated by the problem of automatically fitting an acyclic schema to a universal relation, we investigate the connection between these two characterizations of loss. We first show that the loss of an AJD is captured using the notion of KL-divergence. We then show that the KL-divergence can be used to bound the number of redundant tuples. We prove a deterministic lower bound on the percentage of redundant tuples. For an upper bound, we propose a random database model, and establish a high-probability bound on the percentage of redundant tuples, which coincides with the lower bound for large databases.
ACM Reference Format:
Batya Kenig and Nir Weinberger. 2023. Quantifying the Loss of Acyclic
Join Dependencies. In Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI
Symposium on Principles of Database Systems (PODS ’23), June 18–23, 2023,
Seattle, WA, USA. ACM, New York, NY, USA, 35 pages. https://doi.org/10.
1145/3584372.3588658
1 INTRODUCTION
In the traditional approach to database design, the designer has a clear conceptual model of the data, and of the dependencies between the attributes. This information guides the database normalization process, which leads to a database schema consisting of multiple relation schemas that have the benefit of reduced redundancies, and more efficient querying and updating of the data. This approach requires that the data precisely meet the constraints of the model; various data-repair techniques [1, 4] have been developed to address the case where the data does not meet the constraints of the schema exactly. Current data management applications are required to handle data that is noisy, erroneous, and inconsistent. The presumption that such data meet a predefined set of constraints is not likely to hold. In many cases, such applications are willing to tolerate the "loss" entailed by an imperfect database schema, and will be content with a database schema that only "approximately fits" the data. Motivated by the task of schema discovery for a given dataset, in this work we investigate different ways of measuring the "loss" of an imperfect database schema, and the relationship between these different measures.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
PODS '23, June 18–23, 2023, Seattle, WA, USA
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-0127-6/23/06. . . $15.00
https://doi.org/10.1145/3584372.3588658
Decomposing a relation schema Ω is the process of breaking the relation scheme into two or more relation schemes Ω1, . . . , Ω𝑘 whose union is Ω. The decomposition is lossless if, for any relation instance 𝑅 over Ω, it holds that 𝑅 = ΠΩ1(𝑅) ⋈ ··· ⋈ ΠΩ𝑘(𝑅). The loss of a database scheme {Ω1, . . . , Ω𝑘} with respect to a relation instance 𝑅 over Ω is the set of tuples (ΠΩ1(𝑅) ⋈ ··· ⋈ ΠΩ𝑘(𝑅)) \ 𝑅 that are in the join, but not in 𝑅 (formal definitions in Section 2). We say that such tuples are spurious. Normalizing a relation scheme is the process of (losslessly) decomposing it into a database scheme where each of the resulting relation schemes has certain properties. The specific properties imposed on the resulting relation schemes define different normal forms such as 3NF [7], BCNF [8], 4NF [11], and 5NF [10, 12].
A data dependency defines a relationship between sets of attributes in a database. A Join Dependency (JD) defines a 𝑘-way decomposition (where 𝑘 ≥ 2) of a relation schema Ω, and is said to hold in a relation instance 𝑅 over Ω if the join is lossless with respect to 𝑅 (formal definitions in Section 2). Join Dependencies generalize Multivalued Dependencies (MVDs), which are effectively Join Dependencies where 𝑘 = 2, which in turn generalize Functional Dependencies (FDs), which are perhaps the most widely studied data dependencies due to their simple and intuitive semantics.
An Acyclic Join Dependency (AJD) is a type of JD that is specified by an acyclic schema [2]. Acyclic schemes have many applications in databases and in machine learning; they enable efficient query evaluation [26], play a key role in database normalization and design [11, 20], and improve the performance of many well-known machine learning algorithms over relational data [16, 23, 24].
Consider how we may measure the loss of an acyclic schema S = {Ω1, . . . , Ω𝑘} with respect to a given relation instance 𝑅 over Ω. An intuitive approach, based on the definition of a lossless join, is to simply count the number of spurious tuples generated by the join (i.e., tuples that are not included in the relation instance 𝑅). In previous work, Kenig et al. [14] presented an algorithm that discovers "approximate acyclic schemas" for a given dataset. Building on earlier work by Lee [18, 19], the authors proposed to measure the loss of an acyclic schema, with respect to a given dataset, using an information-theoretic metric called the J-measure, formally defined in Section 2. Lee has shown that this information-theoretic measure characterizes lossless AJDs. That is, the J-measure of an acyclic schema with respect to a dataset is 0 if and only if the AJD defined by this schema is lossless with respect to the dataset [18, 19] (i.e., no spurious tuples). Beyond this characterization, not much is known about the relationship between these two measures of loss. In fact, the relationship is not necessarily monotonic, and may vary widely even between two acyclic schemas with similar J-measure values [14]. Nevertheless, empirical studies have shown that low values of the J-measure generally lead to acyclic schemas that incur a small number of spurious tuples [14].

arXiv:2210.14572v2 [cs.DB] 10 Apr 2023
Our rst result is a characterization of the
J
-measure of an
acyclic schema
S={Ω1, . . . , Ω𝑘}
, with respect to a relation instance
𝑅
, as the KL-divergence between two empirical distributions; the
one associated with
𝑅
, and the one associated with
𝑅def
=ΠΩ1(𝑅)
··· ΠΩ𝑘(𝑅)
. The empirical distribution is a standard notion used
to associate a multivariate probability distribution with a multiset
of tuples, and a formal denition is deferred to Section 2. The KL-
divergence is a non-symmetric measure of the similarity between
two probability distributions
𝑃(Ω)
and
𝑄(Ω)
. It has numerous
information-theoretic applications, and can be loosely thought of
as a measure of the information lost when
𝑄(Ω)
is used to ap-
proximate
𝑃(Ω)
. Our result that the
J
-measure is, in fact, the
KL-divergence between the empirical distributions associated with
the original relation instance and the one implied by the acyclic
schema, explains the success of the
J
-measure for identifying
“approximate acyclic schemas” in [
14
], and how the
J
-measure
characterizes lossless AJDs [18, 19].
With this result at hand, we address the following problem. Given a relation schema Ω, an acyclic schema S = {Ω1, . . . , Ω𝑘}, and a value J ≥ 0, compute the minimum and maximum number of spurious tuples generated by S with respect to any relation instance 𝑅 over Ω whose KL-divergence from 𝑅^⋈ := ΠΩ1(𝑅) ⋈ ··· ⋈ ΠΩ𝑘(𝑅) is J. To this end, we prove a deterministic lower bound on the number of spurious tuples that depends only on J. We then consider the problem of determining an upper bound on the number of spurious tuples. As it turns out, this problem is more challenging, as a deterministic upper bound does not hold. We thus propose a random relation model, in which a relation is drawn uniformly at random from all possible empirical distributions of a given size 𝑁. We then show that a bound analogous to the deterministic lower bound on the relative number of spurious tuples also holds as an upper bound, with two differences: first, it holds with high probability over the random choice of relation (and not with probability 1), and second, it holds with an additive term, though one which vanishes for asymptotically large relation instances. The proof of this result is fairly complicated, and as discussed in Section 5, requires applications of multiple techniques from information theory [9] and concentration of measure [6].
Beyond its theoretical interest, understanding the relationship between the information-theoretic KL-divergence and the tangible property of loss, as measured by the number of spurious tuples, has practical consequences for the task of discovering acyclic schemas that fit a dataset. Currently, the system of [14] can discover acyclic schemas that fit the data well in terms of the J-measure. Understanding how the J-measure relates to the loss in terms of spurious tuples will enable finding acyclic schemas that generate a bounded number of spurious tuples. This is important for applications that apply factorization as a means of compression, while wishing to maintain the integrity of the data [22].
To summarize, in this paper we: (1) show that the J-measure of Lee [18, 19] is the KL-divergence between the empirical distribution associated with the original relation and the one induced by the acyclic schema, (2) prove a general lower bound on the loss (i.e., spurious tuples) in terms of the KL-divergence, and present a simple family of relation instances for which this bound is tight, and (3) propose a random relation model and prove an upper bound on the loss, which holds with high probability, and which converges to the lower bound for large relation instances.
2 BACKGROUND
For the sake of consistency, we adopt some of the notation from [14]. We denote [𝑛] := {1, . . . , 𝑛}. Let Ω be a set of variables, also called attributes. If 𝑋, 𝑌 ⊆ Ω, then 𝑋𝑌 denotes 𝑋 ∪ 𝑌.
2.1 Data Dependencies
Let Ω := {𝑋1, . . . , 𝑋𝑛} denote a set of attributes with domains D(𝑋1), . . . , D(𝑋𝑛). We denote by Rel(Ω) := {𝑅 : 𝑅 ⊆ D(𝑋1) × ··· × D(𝑋𝑛)} the set of all possible relation instances over Ω. Fix a relation instance 𝑅 ∈ Rel(Ω) of size 𝑁 = |𝑅|. For 𝑌 ⊆ Ω we let 𝑅[𝑌] denote the projection of 𝑅 onto the attributes 𝑌. A schema is a set S = {Ω1, . . . , Ω𝑚} such that Ω1 ∪ ··· ∪ Ω𝑚 = Ω and Ω𝑖 ⊈ Ω𝑗 for 𝑖 ≠ 𝑗.
We say that the relation instance 𝑅 satisfies the join dependency JD(S), and write 𝑅 |= JD(S), if 𝑅 = 𝑅[Ω1] ⋈ ··· ⋈ 𝑅[Ω𝑚]. If 𝑅 ̸|= JD(S), then S incurs a loss with respect to 𝑅, denoted 𝜌(𝑅, S), defined:

    𝜌(𝑅, S) := ( |𝑅[Ω1] ⋈ ··· ⋈ 𝑅[Ω𝑚]| − |𝑅| ) / |𝑅|        (1)

We call the set of tuples (𝑅[Ω1] ⋈ ··· ⋈ 𝑅[Ω𝑚]) \ 𝑅 spurious tuples. Clearly, 𝑅 |= JD(S) if and only if 𝜌(𝑅, S) = 0.
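To make the definition concrete, the following is a small, self-contained Python sketch (our own illustration, not code from the paper) that computes 𝜌(𝑅, S) of equation (1) by joining the projections of 𝑅. The relation, the schema, and all helper names here are assumptions introduced for the example.

```python
from itertools import product

def project(R, attrs, schema):
    """Project relation R (a set of tuples over `schema`) onto `attrs`."""
    idx = [schema.index(a) for a in attrs]
    return {tuple(t[i] for i in idx) for t in R}

def natural_join(R1, s1, R2, s2):
    """Natural join of R1 (over schema s1) and R2 (over schema s2)."""
    common = [a for a in s1 if a in s2]
    out_schema = s1 + [a for a in s2 if a not in s1]
    out = set()
    for t1, t2 in product(R1, R2):
        if all(t1[s1.index(a)] == t2[s2.index(a)] for a in common):
            row = dict(zip(s1, t1))
            row.update(zip(s2, t2))
            out.add(tuple(row[a] for a in out_schema))
    return out, out_schema

def loss(R, schema, S):
    """rho(R, S) of equation (1): relative number of spurious tuples."""
    J, js = project(R, S[0], schema), list(S[0])
    for omega in S[1:]:
        J, js = natural_join(J, js, project(R, omega, schema), list(omega))
    # re-order joined tuples to the original attribute order before comparing
    J = {tuple(t[js.index(a)] for a in schema) for t in J}
    return (len(J) - len(R)) / len(R)
```

For instance, for 𝑅 = {(𝑎1, 𝑏, 𝑐1), (𝑎2, 𝑏, 𝑐2)} and S = {𝐴𝐵, 𝐵𝐶}, the join produces four tuples, two of them spurious, so the loss is 1.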
We say that 𝑅 satisfies the multivalued dependency (MVD) 𝜙 = 𝑋 ↠ 𝑌1 | 𝑌2 | . . . | 𝑌𝑚, where 𝑚 ≥ 2, the 𝑌𝑖s are pairwise disjoint, and 𝑋𝑌1 ··· 𝑌𝑚 = Ω, if 𝑅 = 𝑅[𝑋𝑌1] ⋈ ··· ⋈ 𝑅[𝑋𝑌𝑚], or, equivalently, if the schema S = {𝑋𝑌1, . . . , 𝑋𝑌𝑚} is lossless (i.e., 𝜌(𝑅, S) = 0). We review the concept of a join tree from [3]:
Denition 2.1.
Ajoin tree or junction tree is a pair
T, 𝜒
where
T
is an undirected tree, and
𝜒
is a function that maps each
𝑢
nodes(T)
to a set of variables
𝜒(𝑢)
, called a bag, such that the
running intersection property holds: for every variable
𝑋
, the set
{𝑢nodes(T) | 𝑋𝜒(𝑢)}
is a connected component of
T
. We
denote by 𝜒(T) def
=Ð𝑢𝜒(𝑢), the set of variables of the join tree.
We often denote the join tree as T, dropping 𝜒 when it is clear from the context. The schema defined by T is S = {Ω1, . . . , Ω𝑚}, where Ω1, . . . , Ω𝑚 are the bags of T. We call a schema S acyclic if there exists a join tree whose schema is S. When Ω𝑖 ⊈ Ω𝑗 for all 𝑖 ≠ 𝑗, then the acyclic schema S = {Ω1, . . . , Ω𝑚} satisfies 𝑚 ≤ |Ω| [3]. We say that a relation 𝑅 satisfies the acyclic join dependency S, and denote 𝑅 |= AJD(S), if S is acyclic and 𝑅 |= JD(S). An MVD 𝑋 ↠ 𝑌1 | ··· | 𝑌𝑚 represents a simple acyclic schema, namely S = {𝑋𝑌1, 𝑋𝑌2, . . . , 𝑋𝑌𝑚}.
Let S = {Ω1, . . . , Ω𝑚} be an acyclic schema with join tree (T, 𝜒). We associate with every (𝑢, 𝑣) ∈ edges(T) an MVD 𝜙𝑢,𝑣 as follows. Let T𝑢 and T𝑣 be the two subtrees obtained by removing the edge (𝑢, 𝑣). Then, we denote 𝜙𝑢,𝑣 := 𝜒(𝑢) ∩ 𝜒(𝑣) ↠ 𝜒(T𝑢) | 𝜒(T𝑣).
We call the support of T the set of 𝑚 − 1 MVDs associated with its edges, in notation MVD(T) = {𝜙𝑢,𝑣 | (𝑢, 𝑣) ∈ edges(T)}. Beeri et al. have shown that if T defines the acyclic schema S, then 𝑅 satisfies AJD(S) (i.e., 𝑅 |= AJD(S)) if and only if 𝑅 satisfies all MVDs in its support: 𝑅 |= 𝜙𝑢,𝑣 for all 𝜙𝑢,𝑣 ∈ MVD(T) [3, Thm. 8.8].
2.2 Information Theory
Lee [18, 19] gave an equivalent formulation of functional, multivalued, and acyclic join dependencies in terms of information measures; we review this briefly here, after a short background on information theory.
Let 𝑋 be a random variable with a finite domain D and probability mass function 𝑃 (thus, Σ𝑥∈D 𝑃(𝑥) = 1). Its entropy is:

    𝐻(𝑋) := Σ𝑥∈D 𝑃(𝑥) log(1/𝑃(𝑥))        (2)

It holds that 𝐻(𝑋) ≤ log |D|, and equality holds if and only if 𝑃 is uniform. The definition of entropy naturally generalizes to sets of jointly distributed random variables Ω = {𝑋1, . . . , 𝑋𝑛}, by defining the function 𝐻 : 2^Ω → R as the entropy of the joint random variables in the set. For example,

    𝐻(𝑋1𝑋2) = Σ𝑥1∈D1, 𝑥2∈D2 𝑃(𝑥1, 𝑥2) log(1/𝑃(𝑥1, 𝑥2))        (3)
Let 𝐴, 𝐵, 𝐶 ⊆ Ω. The conditional mutual information 𝐼(𝐴; 𝐵 | 𝐶) is defined as:

    𝐼(𝐴; 𝐵 | 𝐶) := 𝐻(𝐵𝐶) + 𝐻(𝐴𝐶) − 𝐻(𝐴𝐵𝐶) − 𝐻(𝐶)        (4)

It is known that the conditional independence 𝑃 |= 𝐴 ⊥ 𝐵 | 𝐶 (i.e., 𝐴 is independent of 𝐵 given 𝐶) holds if and only if 𝐼(𝐴; 𝐵 | 𝐶) = 0. When 𝐶 = ∅, or if 𝐶 is a constant (i.e., 𝐻(𝐶) = 0), then the conditional mutual information reduces to the standard mutual information, denoted 𝐼(𝐴; 𝐵).
Let X := {𝑋1, . . . , 𝑋𝑛} be a set of discrete random variables, and let 𝑃(X) and 𝑄(X) denote discrete probability distributions. The KL-divergence between 𝑃 and 𝑄 is:

    𝐷𝐾𝐿(𝑃 || 𝑄) := E𝑃[log(𝑃(X)/𝑄(X))] = Σx∈D(X) 𝑃(x) log(𝑃(x)/𝑄(x))        (5)

For any pair of probability distributions 𝑃, 𝑄 over the same probability space, it holds that 𝐷𝐾𝐿(𝑃 || 𝑄) ≥ 0, with equality if and only if 𝑃 = 𝑄. It is an easy observation that

    𝐼(𝐴; 𝐵 | 𝐶) = 𝐷𝐾𝐿(𝑃(𝐴𝐵𝐶) || 𝑃(𝐴|𝐶) 𝑃(𝐵|𝐶) 𝑃(𝐶))        (6)

for any probability distribution 𝑃.
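As a quick sanity check of identity (6), the following Python sketch (ours, not part of the paper) computes both sides on a small hand-picked joint distribution 𝑃(𝐴, 𝐵, 𝐶); the particular distribution is an arbitrary assumption for illustration.

```python
from collections import defaultdict
from math import log2

def kl(P, Q):
    """D_KL(P || Q) in bits, equation (5); P, Q map outcomes to probabilities."""
    return sum(p * log2(p / Q[x]) for x, p in P.items() if p > 0)

def marginal(P, idx):
    """Marginalize a joint distribution onto the coordinates in idx."""
    M = defaultdict(float)
    for x, p in P.items():
        M[tuple(x[i] for i in idx)] += p
    return M

def H(M):
    """Entropy in bits, equation (2)."""
    return -sum(p * log2(p) for p in M.values() if p > 0)

# A small joint distribution P(A, B, C); A and B are correlated given C.
P = {(0, 0, 0): 0.3, (1, 1, 0): 0.2, (0, 1, 1): 0.25, (1, 0, 1): 0.25}

PC, PAC, PBC = marginal(P, [2]), marginal(P, [0, 2]), marginal(P, [1, 2])

# Q(a, b, c) = P(a|c) P(b|c) P(c) = P(a, c) P(b, c) / P(c)
Q = {(a, b, c): PAC[(a, c)] * PBC[(b, c)] / PC[(c,)] for (a, b, c) in P}

# I(A;B|C) via the entropy formula (4): H(BC) + H(AC) - H(ABC) - H(C)
cmi = H(PBC) + H(PAC) - H(P) - H(PC)
```

Here `kl(P, Q)` and `cmi` agree up to floating-point error, as (6) asserts; they are positive because 𝐴 and 𝐵 are not conditionally independent given 𝐶 in this 𝑃.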
Let 𝑅 be a multiset of tuples over the attribute set Ω = {𝑋1, . . . , 𝑋𝑛}, and let |𝑅| = 𝑁. The empirical distribution associated with 𝑅 is the multivariate probability distribution over D1 × ··· × D𝑛 that assigns a probability of 𝑃(𝑡) := 𝐾/𝑁 to every tuple 𝑡 in 𝑅 with multiplicity 𝐾. When 𝑅 is a relation instance, and hence a set of 𝑁 tuples, its empirical distribution is the uniform distribution over its tuples, i.e., 𝑃(𝑡) = 1/𝑁 for all 𝑡 ∈ 𝑅, and so its entropy is 𝐻(Ω) = log 𝑁. For a relation instance 𝑅 ∈ Rel(Ω), we let 𝑃𝑅 denote the empirical distribution over 𝑅. For 𝛼 ⊆ [𝑛], we denote by 𝑋𝛼 the set of variables 𝑋𝑖, 𝑖 ∈ 𝛼, and denote by 𝑅(𝑋𝛼 = 𝑥𝛼) the subset of tuples 𝑡 ∈ 𝑅 where 𝑡[𝑋𝛼] = 𝑥𝛼, for fixed values 𝑥𝛼. When 𝑃𝑅 is the empirical distribution over 𝑅, then the marginal probability is 𝑃𝑅(𝑋𝛼 = 𝑥𝛼) = |𝑅(𝑋𝛼 = 𝑥𝛼)| / 𝑁.
Lee [18, 19] formalized the following connection between database constraints and entropic measures. Let (T, 𝜒) be a join tree (Definition 2.1). The J-measure is defined as:

    J(T, 𝜒) := Σ𝑣∈nodes(T) 𝐻(𝜒(𝑣)) − Σ(𝑣1,𝑣2)∈edges(T) 𝐻(𝜒(𝑣1) ∩ 𝜒(𝑣2)) − 𝐻(𝜒(T))        (7)

where 𝐻 is the entropy (see (2)) taken over the empirical distribution associated with 𝑅. We abbreviate J(T, 𝜒) with J(T), or J, when T, 𝜒 are clear from the context. Observe that J depends only on the schema S defined by the join tree, and not on the tree itself. For a simple example, consider the MVD 𝑋 ↠ 𝑈 | 𝑉 | 𝑊 and its associated acyclic schema {𝑋𝑈, 𝑋𝑉, 𝑋𝑊}. For the join tree 𝑋𝑈 − 𝑋𝑉 − 𝑋𝑊, it holds that J = 𝐻(𝑋𝑈) + 𝐻(𝑋𝑉) + 𝐻(𝑋𝑊) − 2𝐻(𝑋) − 𝐻(𝑋𝑈𝑉𝑊). Another join tree is 𝑋𝑈 − 𝑋𝑊 − 𝑋𝑉, and J is the same. Therefore, if S is acyclic, then we write J(S) to denote J(T) for any join tree of S. When S = {𝑋𝑍, 𝑋𝑌}, the J-measure reduces to the conditional mutual information (see (4)), to wit J(S) = 𝐼(𝑍; 𝑌 | 𝑋).
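The claim that J depends only on the schema can be checked mechanically. The sketch below (our own illustration, not code from the paper) evaluates equation (7) over the empirical distribution of a relation; the sample relation and function names are assumptions introduced for the example.

```python
from collections import Counter
from math import log2

def entropy(R, attrs, schema):
    """Entropy of the empirical distribution of R projected onto attrs."""
    idx = [schema.index(a) for a in attrs]
    counts = Counter(tuple(t[i] for i in idx) for t in R)
    N = len(R)
    return -sum(c / N * log2(c / N) for c in counts.values())

def J_measure(R, schema, bags, edges):
    """J(T, chi) of equation (7); `edges` are index pairs into `bags`."""
    s = sum(entropy(R, b, schema) for b in bags)
    s -= sum(entropy(R, sorted(set(bags[u]) & set(bags[v])), schema)
             for u, v in edges)
    return s - entropy(R, schema, schema)

schema = ["X", "U", "V", "W"]
bags = [["X", "U"], ["X", "V"], ["X", "W"]]      # schema of X ->> U | V | W
R = {(0, 0, 0, 0), (0, 1, 1, 1), (1, 0, 1, 0)}  # an arbitrary instance
J1 = J_measure(R, schema, bags, [(0, 1), (1, 2)])  # join tree XU - XV - XW
J2 = J_measure(R, schema, bags, [(0, 2), (2, 1)])  # join tree XU - XW - XV
```

Both trees give the same value, and a relation that is the full product over a fixed 𝑋-value (hence lossless for this schema) gives J = 0, matching Theorem 2.1 below.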
Theorem 2.1. ([19]) Let S be an acyclic schema over Ω, and let 𝑅 be a relation instance over Ω. Then 𝑅 |= AJD(S) if and only if J(S) = 0.

In the particular case of a standard MVD, Lee's result implies that 𝑅 |= 𝑋 ↠ 𝑌 | 𝑍 if and only if 𝐼(𝑌; 𝑍 | 𝑋) = 0 in the empirical distribution associated with 𝑅.
2.3 MVDs and Acyclic Join Dependencies
Beeri et al. [3] have shown that an acyclic join dependency defined by the acyclic schema S, over 𝑚 relation schemas, is equivalent to 𝑚 − 1 MVDs (called its support). In [14], this characterization was generalized as follows. Let (T, 𝜒) be the join tree corresponding to the acyclic schema S. Root the tree T at node 𝑢1, orient the tree accordingly, and let 𝑢1, . . . , 𝑢𝑚 be a depth-first enumeration of nodes(T). Thus, 𝑢1 is the root, and for every 𝑖 > 1, parent(𝑢𝑖) is some node 𝑢𝑗 with 𝑗 < 𝑖. For every 𝑖, we define Ω𝑖 := 𝜒(𝑢𝑖), Ω𝑖:𝑗 := ⋃ℓ=𝑖,...,𝑗 Ωℓ, and Δ𝑖 := 𝜒(parent(𝑢𝑖)) ∩ 𝜒(𝑢𝑖). By the running intersection property (Definition 2.1), it holds that Δ𝑖 = Ω1:(𝑖−1) ∩ Ω𝑖.
Theorem 2.2. ([14]) Let (T, 𝜒) be a join tree over variables 𝜒(T) = Ω, where nodes(T) = {𝑢1, . . . , 𝑢𝑚} with corresponding bags Ω𝑖 = 𝜒(𝑢𝑖). Then:

    max𝑖∈[2,𝑚] 𝐼(Ω1:(𝑖−1); Ω𝑖:𝑚 | Δ𝑖) ≤ J(T, 𝜒) ≤ Σ𝑖=2..𝑚 𝐼(Ω1:(𝑖−1); Ω𝑖:𝑚 | Δ𝑖)        (8)

Since the support of (T, 𝜒) is precisely the set of MVDs

    {Δ𝑖 ↠ Ω1:(𝑖−1) | Ω𝑖:𝑚}𝑖∈[2,𝑚],        (9)

(8) generalizes the result of Beeri et al. [3].
Denition 2.2
(models
T
)
.
We say that a relation instance
𝑅
over attributes
Ω
models the join tree
(T, 𝜒)
, denoted
𝑅|=T
if
𝐼(Ω1,(𝑖1);Ω𝑖:𝑚|Δ𝑖)=0for every 𝑖∈ [2,𝑚].
3 CHARACTERIZING ACYCLIC SCHEMAS WITH KL-DIVERGENCE
Let X := {𝑋1, . . . , 𝑋𝑛} be a set of discrete random variables over domains D(𝑋𝑖), and let 𝑃(𝑋1, . . . , 𝑋𝑛) be a joint probability distribution over X. Let x ∈ D(𝑋1) × ··· × D(𝑋𝑛). For a subset Y ⊆ X, we denote by x[Y] the assignment x restricted to the variables Y. We denote by 𝑃[Y] the marginal probability distribution over Y.

Let (T, 𝜒) be a join tree where nodes(T) = {𝑢1, . . . , 𝑢𝑚}. Let S = {Ω1, . . . , Ω𝑚} be the acyclic schema associated with (T, 𝜒) over the variables Ω = 𝜒(T), where Ω𝑖 = 𝜒(𝑢𝑖).
Proposition 3.1. Let 𝑃(𝑋1, . . . , 𝑋𝑛) be any joint probability distribution over 𝑛 variables, and let (T, 𝜒) be a join tree where 𝜒(T) = {𝑋1, . . . , 𝑋𝑛}. Then 𝑃 |= T (Definition 2.2) if and only if 𝑃 = 𝑃T, where:

    𝑃T(𝑥1, . . . , 𝑥𝑛) := Π𝑖=1..𝑚 𝑃[Ω𝑖](x[Ω𝑖]) / Π𝑖=1..𝑚−1 𝑃[Δ𝑖](x[Δ𝑖])        (10)

where 𝑃[Ω𝑖] (resp. 𝑃[Δ𝑖]) denotes the marginal probability over Ω𝑖 (resp. Δ𝑖).

The proof of Proposition 3.1 is deferred to Appendix A. It follows from Definition 2.2, and a simple induction on the number of nodes in T.
In this section, we refine the statement of Proposition 3.1 and prove the following variational representation.

Theorem 3.2. For any joint probability distribution 𝑃(𝑋1, . . . , 𝑋𝑛) and any join tree (T, 𝜒) with 𝜒(T) = {𝑋1, . . . , 𝑋𝑛}, it holds that

    J(T) = min𝑄|=T 𝐷𝐾𝐿(𝑃 || 𝑄) = 𝐷𝐾𝐿(𝑃 || 𝑃T)        (11)

In words, this theorem states that when the join tree (T, 𝜒) is given, then out of all probability distributions 𝑄 over Ω = {𝑋1, . . . , 𝑋𝑛} that model T (see (10)), the one closest to 𝑃 in terms of KL-divergence is 𝑃T. Importantly, this KL-divergence is precisely J(T) (i.e., J(T) = 𝐷𝐾𝐿(𝑃 || 𝑃T)). While the theorem holds for all probability distributions 𝑃, a special case is when 𝑃 is the empirical distribution associated with a relation 𝑅. The proof of Theorem 3.2 follows from the following two lemmas, interesting on their own, and is deferred to the complete version of this paper [15].
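Theorem 3.2 is easy to verify numerically on small cases. The following sketch (our illustration; the distribution and the path join tree are arbitrary assumptions) builds 𝑃T from equation (10) for the tree 𝑋𝑈 − 𝑋𝑉 − 𝑋𝑊 and checks that 𝐷𝐾𝐿(𝑃 || 𝑃T) matches J(T) computed from equation (7).

```python
from collections import Counter
from math import log2

# Empirical distribution of a small relation over ("X", "U", "V", "W")
schema = ("X", "U", "V", "W")
R = [(0, 0, 0, 0), (0, 0, 1, 1), (0, 1, 0, 1), (1, 1, 1, 0), (1, 0, 1, 1)]
N = len(R)
P = {t: c / N for t, c in Counter(R).items()}

def marg(attrs):
    """Marginal of P onto the given attribute names."""
    idx = [schema.index(a) for a in attrs]
    M = Counter()
    for t, p in P.items():
        M[tuple(t[i] for i in idx)] += p
    return dict(M)

def H(M):
    """Entropy in bits, equation (2)."""
    return -sum(p * log2(p) for p in M.values() if p > 0)

# Join tree XU - XV - XW: bags and the m-1 separators (both equal {X})
bags = [("X", "U"), ("X", "V"), ("X", "W")]
seps = [("X",), ("X",)]
Pb = [marg(b) for b in bags]
Ps = [marg(s) for s in seps]

# J(T) via (7): bag entropies - separator entropies - H(chi(T))
J = sum(H(m) for m in Pb) - sum(H(m) for m in Ps) - H(P)

def PT(t):
    """P_T(t) of equation (10): product of bag marginals over separators."""
    num = 1.0
    for b, M in zip(bags, Pb):
        num *= M[tuple(t[schema.index(a)] for a in b)]
    den = 1.0
    for s, M in zip(seps, Ps):
        den *= M[tuple(t[schema.index(a)] for a in s)]
    return num / den

# D_KL(P || P_T) via equation (5), summed over the support of P
D = sum(p * log2(p / PT(t)) for t, p in P.items())
```

The equality J = 𝐷𝐾𝐿(𝑃 || 𝑃T) holds exactly here, since the cross term E𝑃[log 𝑃T] decomposes into bag and separator entropies of 𝑃.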
Lemma 3.3. Let 𝑃(𝑋1, . . . , 𝑋𝑛) be a joint probability distribution over 𝑛 random variables, and let T be a join tree over 𝑋1, . . . , 𝑋𝑛 with bags Ω1, . . . , Ω𝑚. Then 𝑃[Ω𝑖] = 𝑃T[Ω𝑖] for every 𝑖 ∈ [1, 𝑚], and 𝑃[Δ𝑖] = 𝑃T[Δ𝑖] for every 𝑖 ∈ [1, 𝑚 − 1].

The proof of Lemma 3.3 follows from an easy induction on 𝑚, the number of nodes in the join tree T, and is deferred to Appendix A.
Lemma 3.4. The following holds for any joint probability distribution 𝑃(𝑋1, . . . , 𝑋𝑛), and any join tree T over variables 𝑋1, . . . , 𝑋𝑛:

    arg min𝑄|=T 𝐷𝐾𝐿(𝑃 || 𝑄) = 𝑃T        (12)

Proof. From Lemma 3.3 we have that, for every 𝑖 ∈ {1, . . . , 𝑚}, 𝑃T[Ω𝑖] = 𝑃[Ω𝑖], where Ω𝑖 = 𝜒(𝑢𝑖). Since Δ𝑖 ⊆ Ω𝑖, then 𝑃T[Δ𝑖] = 𝑃[Δ𝑖]. Now,

    min𝑄|=T 𝐷𝐾𝐿(𝑃 || 𝑄) = min𝑄|=T E𝑃[log(𝑃(𝑋1, . . . , 𝑋𝑛) / 𝑄(𝑋1, . . . , 𝑋𝑛))]        (13)
    = min𝑄|=T E𝑃[log( (𝑃(𝑋1, . . . , 𝑋𝑛) / 𝑃T(𝑋1, . . . , 𝑋𝑛)) · (𝑃T(𝑋1, . . . , 𝑋𝑛) / 𝑄(𝑋1, . . . , 𝑋𝑛)) )]        (14)
    = 𝐷𝐾𝐿(𝑃 || 𝑃T) + min𝑄|=T E𝑃[log(𝑃T(𝑋1, . . . , 𝑋𝑛) / 𝑄(𝑋1, . . . , 𝑋𝑛))]        (15)

Since the chosen distribution 𝑄(X) has no consequence on the first term 𝐷𝐾𝐿(𝑃 || 𝑃T), we take a closer look at the second term, min𝑄|=T E𝑃[log(𝑃T(𝑋1, . . . , 𝑋𝑛) / 𝑄(𝑋1, . . . , 𝑋𝑛))]. Since 𝑄 |= T, then by Proposition 3.1 it holds that 𝑄(𝑋1, . . . , 𝑋𝑛) = Π𝑖=1..𝑚 𝑄[Ω𝑖](X[Ω𝑖]) / Π𝑖=1..𝑚−1 𝑄[Δ𝑖](X[Δ𝑖]). Hence, in what follows, we refer to 𝑄 as 𝑄T. In the remainder of the proof we show that:

    E𝑃[log(𝑃T(𝑋1, . . . , 𝑋𝑛) / 𝑄T(𝑋1, . . . , 𝑋𝑛))] = E𝑃T[log(𝑃T(𝑋1, . . . , 𝑋𝑛) / 𝑄T(𝑋1, . . . , 𝑋𝑛))]        (16)
    = 𝐷𝐾𝐿(𝑃T || 𝑄T),        (17)

where the last equality follows from (5). Since 𝐷𝐾𝐿(𝑃T || 𝑄T) ≥ 0, with equality if and only if 𝑃T = 𝑄T, choosing 𝑄T to be 𝑃T minimizes 𝐷𝐾𝐿(𝑃 || 𝑄), thus proving the claim. The remainder of the proof, proving (16), follows from Lemma 3.3, which states that 𝑃[Ω𝑖] = 𝑃T[Ω𝑖] for every 𝑖 ∈ [1, 𝑚], and 𝑃[Δ𝑗] = 𝑃T[Δ𝑗] for every 𝑗 ∈ [𝑚 − 1]. The proof is quite technical, and hence deferred to Appendix A. □
4 SPURIOUS TUPLES: A LOWER BOUND BASED ON J(T)
Let 𝑃𝑅 be the empirical distribution over a relation instance 𝑅 with 𝑁 tuples and 𝑛 attributes Ω = {𝑋1, . . . , 𝑋𝑛}. That is, the probability associated with every record in 𝑅 is 1/𝑁. Let T be any junction tree over the variables Ω, and let S := {Ω1, . . . , Ω𝑚} where Ω𝑖 = 𝜒(𝑢𝑖). By Theorem 2.1 and Theorem 3.2, it holds that 𝑅 |= AJD(S) if and only if J(T) = 𝐷𝐾𝐿(𝑃 || 𝑃T) = 0. In what follows, given an acyclic schema S, we provide a lower bound for 𝜌(𝑅, S) that is based on the J-measure of its associated junction tree, J(T).
Lemma 4.1. Let 𝑃 be the empirical distribution over a relation instance 𝑅 with 𝑁 tuples, and let S = {Ω1, . . . , Ω𝑚} denote an acyclic schema with junction tree (T, 𝜒). Then:

    J(T) ≤ log(1 + 𝜌(𝑅, S)).        (18)

So if 𝜌(𝑅, S) = 0, then J(T) = 0 as well.
Proof. From Theorem 3.2 we have that

    J(T) = 𝐷𝐾𝐿(𝑃 || 𝑃T),        (19)

where

    𝑃T = arg min𝑄|=T 𝐷𝐾𝐿(𝑃 || 𝑄).        (20)

Let us define

    𝑄𝑇𝑈 = arg min𝑄|=T, 𝑄 uniform 𝐷𝐾𝐿(𝑃 || 𝑄),        (21)

and verify that such a distribution 𝑄𝑇𝑈 always exists: let 𝑅^⋈ := 𝑅[Ω1] ⋈ ··· ⋈ 𝑅[Ω𝑚], and let 𝑄𝑇𝑈 denote the empirical distribution over 𝑅^⋈. By construction, 𝑄𝑇𝑈 is a uniform distribution (i.e., over the tuples of 𝑅^⋈), and 𝑄𝑇𝑈 |= T. By definition, we have that |𝑅^⋈| = 𝑁(1 + 𝜌(𝑅, S)), where |𝑅| = 𝑁, and 𝜌(𝑅, S) is the loss of S with respect to 𝑅 (see (1)).

By limiting the minimization region to uniform distributions, the minimum can only increase. Therefore:

    J(T) = 𝐷𝐾𝐿(𝑃 || 𝑃T) ≤ 𝐷𝐾𝐿(𝑃 || 𝑄𝑇𝑈)        (22)

Evaluating the KL-divergence term on the right-hand side, and using the fact that 𝑄𝑇𝑈 is uniform, we get:

    𝐷𝐾𝐿(𝑃 || 𝑄𝑇𝑈) = Σ𝑟∈𝑅 𝑃(𝑟) log(𝑃(𝑟) / 𝑄𝑇𝑈(𝑟))        (23)
    = Σ𝑟∈𝑅 (1/𝑁) log( (1/𝑁) / (1/(𝑁 + 𝑁·𝜌(𝑅, S))) )        (24)
    = Σ𝑟∈𝑅 (1/𝑁) log(1 + 𝜌(𝑅, S))        (25)
    = log(1 + 𝜌(𝑅, S))        (26)

Hence, J(T) ≤ log(1 + 𝜌(𝑅, S)), and 𝜌(𝑅, S) ≥ 2^J(T) − 1. □
The following simple example shows that the lower bound of (18) is tight. That is, there exists a family of relation instances 𝑅, and a schema S, where J(S) = log(1 + 𝜌(𝑅, S)).

Example 4.1. Let Ω = {𝐴, 𝐵}, where D(𝐴) = {𝑎1, . . . , 𝑎𝑁} and D(𝐵) = {𝑏1, . . . , 𝑏𝑁}, with D(𝐴) ∩ D(𝐵) = ∅. Let

    𝑅 = {𝑡1 = (𝑎1, 𝑏1), . . . , 𝑡𝑁 = (𝑎𝑁, 𝑏𝑁)}        (27)

be a relation instance over Ω. By definition, 𝑃𝑅(𝑡𝑖) = 1/𝑁. Noting that 𝐻(𝐴) = 𝐻(𝐵) = 𝐻(𝐴𝐵) = log 𝑁 (see (2)), we have that 𝐼(𝐴; 𝐵) = log 𝑁 (see (4)). Now, consider the schema S = {{𝐴}, {𝐵}}, and let 𝑅^⋈ = Π𝐴(𝑅) ⋈ Π𝐵(𝑅). Clearly, |𝑅^⋈| = 𝑁², and 𝜌(𝑅, S) = 𝑁 − 1 (see (1)). In particular, we have that J(S) = 𝐼(𝐴; 𝐵) = log 𝑁 = log(1 + 𝜌(𝑅, S)), and this holds for every 𝑁 ≥ 2.
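Example 4.1 is easy to check programmatically. The short sketch below (ours, not from the paper) builds the diagonal relation of (27) over disjoint domains and confirms that log 𝑁 = log(1 + 𝜌(𝑅, S)); the encoding of the domain values is an assumption for illustration.

```python
from math import log2

def diagonal_example(N):
    """R = {(a_i, b_i)} over disjoint domains; schema S = {{A}, {B}}."""
    R = {(i, N + i) for i in range(N)}        # encode a_i = i, b_i = N + i
    A = {a for a, _ in R}
    B = {b for _, b in R}
    joined = {(a, b) for a in A for b in B}   # Pi_A(R) join Pi_B(R): N^2 tuples
    rho = (len(joined) - len(R)) / len(R)     # equation (1): rho = N - 1
    J = log2(N)                               # J(S) = I(A;B) = log N (in bits)
    return J, log2(1 + rho)
```

For every 𝑁 ≥ 2 the two returned values coincide, which is exactly the tightness claim.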
5 SPURIOUS TUPLES: AN UPPER BOUND BASED ON MUTUAL INFORMATION
In the previous section, we have shown that given a relation 𝑅 and an acyclic schema S defined by a join tree T, it holds that log(1 + 𝜌(𝑅, S)) ≥ J(T) (Lemma 4.1). In this section, we derive an upper bound on 𝜌(𝑅, S) in terms of an information-theoretic measure. To this end, recall that Theorem 2.2 shows that if {Ω1, . . . , Ω𝑚} are the bags of T, then J(T) ≤ Σ𝑖=2..𝑚 𝐼(Ω1:(𝑖−1); Ω𝑖:𝑚 | Δ𝑖). That is, J(T) is upper bounded by the sum of the conditional mutual information of the 𝑚 − 1 MVDs in the support of T, given by Δ𝑖 ↠ Ω1:(𝑖−1) | Ω𝑖:𝑚, for 𝑖 ∈ [2, 𝑚].¹ In this section, we relate this sum of conditional mutual information of the 𝑚 − 1 MVDs in the support of T to an approximate upper bound on 𝜌(𝑅, S), which holds with high probability.
To this end, we begin by relating the relative number of spurious tuples of the relation 𝑅 with respect to a schema S, that is, 𝜌(𝑅, S), with the spurious tuples of each of the MVDs in its support. Concretely, let 𝜙𝑖 := Δ𝑖 ↠ Ω1:(𝑖−1) | Ω𝑖:𝑚 be the 𝑖th MVD in the support of T. Then, the relative number of spurious tuples for 𝜙𝑖 is defined as

    𝜌(𝑅, 𝜙𝑖) := ( |ΠΩ1:(𝑖−1)(𝑅) ⋈ ΠΩ𝑖:𝑚(𝑅)| − |𝑅| ) / |𝑅|.        (28)

¹ More accurately, the MVD should be written as 𝜙𝑖 := Δ𝑖 ↠ (Ω1:(𝑖−1) \ Δ𝑖) | (Ω𝑖:𝑚 \ Δ𝑖), so that the bags are disjoint. However, it can easily be shown, using the chain rule of mutual information [9, Theorem 2.5.2], that 𝐼(Ω1:(𝑖−1); Ω𝑖:𝑚 | Δ𝑖) = 𝐼(Ω1:(𝑖−1) \ Δ𝑖; Ω𝑖:𝑚 \ Δ𝑖 | Δ𝑖), and so we adopt the simplified notation for the MVD.

Proposition 5.1. Let a relation 𝑅 be given, and let S be an acyclic schema over the attributes of 𝑅 with join tree T, whose support is the set of MVDs 𝜙𝑖 = Δ𝑖 ↠ Ω1:(𝑖−1) | Ω𝑖:𝑚, for 𝑖 ∈ [2, 𝑚]. Then,

    log(1 + 𝜌(𝑅, S)) ≤ Σ𝑖=2..𝑚 log(1 + 𝜌(𝑅, 𝜙𝑖)).        (29)
Proof. We prove the claim by induction on the number of MVDs (or nodes) in the schema. Let 𝑚 be the number of nodes (and 𝑚 − 1 the number of MVDs) in the schema. The base case 𝑚 − 1 = 1 is immediate. Assuming the claim holds for 𝑚 − 1 < 𝑘, we prove it for 𝑚 − 1 = 𝑘. Let T be a join tree representing 𝑘 MVDs (and hence 𝑘 + 1 nodes). Let 𝑢𝑘+1 be a leaf in this join tree with parent 𝑝 := parent(𝑢𝑘+1). Let T′ be the join tree where nodes 𝑢𝑘+1 and 𝑝 are merged into the node 𝑢 with Ω(𝑢) = Ω(𝑢𝑘+1) ∪ Ω(𝑝). Hence, by the induction hypothesis,

    1 + 𝜌(𝑅, T′) ≤ Π𝑖=2..𝑘 (1 + 𝜌(𝑅, 𝜙𝑖)).        (30)

Now, let 𝑅′ = 𝑅[Ω1] ⋈ ··· ⋈ 𝑅[Ω𝑘]. Consider the MVD 𝜙𝑘+1 = Ω(𝑢𝑘+1) ∩ Ω(𝑝) ↠ Ω(𝑢𝑘+1) | Ω1:𝑘. Then, 𝑅″ := ΠΩ1:𝑘(𝑅′) ⋈ ΠΩ𝑘+1(𝑅′), and |𝑅″| ≤ |𝑅′| · [1 + 𝜌(𝑅, 𝜙𝑘+1)]. By (30),

    |𝑅′| ≤ |𝑅| · Π𝑖=2..𝑘 (1 + 𝜌(𝑅, 𝜙𝑖)),        (31)

and hence |𝑅″| ≤ |𝑅| · Π𝑖=2..𝑘+1 [1 + 𝜌(𝑅, 𝜙𝑖)], which proves the claim. □
Proposition 5.1 reduces the problem of upper bounding log[1 + 𝜌(𝑅, S)] to bounding each of the terms log[1 + 𝜌(𝑅, 𝜙𝑖)] in (29), each of them corresponding to the relative number of spurious tuples of one of the 𝑚 − 1 MVDs in the support of T. Considering an arbitrary MVD, which we henceforth denote for simplicity by 𝜙 := 𝐶 ↠ 𝐴 | 𝐵, Lemma 4.1 implies the lower bound log[1 + 𝜌(𝑅, 𝜙)] ≥ 𝐼(𝐴; 𝐵 | 𝐶), since an MVD is a simple instance of an acyclic schema. However, obtaining an upper bound on log[1 + 𝜌(𝑅, 𝜙)] in terms of 𝐼(𝐴; 𝐵 | 𝐶) is challenging, because the mutual information 𝐼(𝐴; 𝐵 | 𝐶) varies wildly for an MVD 𝜙 even when 𝜌(𝑅, 𝜙) and the domain sizes 𝑑𝐴, 𝑑𝐵 and 𝑑𝐶 remain constant (where 𝑑𝐴 := |Π𝐴(𝑅)|, and similarly for 𝑑𝐵 and 𝑑𝐶). Figure 1 illustrates this phenomenon in the simple case in which 𝑑𝐶 = 1, and so 𝐶 is a degenerate random variable, and 𝑑𝐴 = 𝑑𝐵. In other words, the value 𝐼(𝐴; 𝐵 | 𝐶) depends on the actual contents of the relation instance 𝑅. However, while 𝐼(𝐴; 𝐵 | 𝐶) might not be an accurate upper bound on log[1 + 𝜌(𝑅, 𝜙)] for an arbitrary relation, it may hold that it is an approximate upper bound for most relations. Therefore, we next propose a random relation model, in which the tuples of the relation 𝑅 are chosen at random. We then establish an upper bound on log[1 + 𝜌(𝑅, 𝜙)] that holds with high probability over this randomly chosen relation.
Definition 5.2 (Random relation model). Let Ω := {𝑋1, . . . , 𝑋𝑛} be a set of attributes with domains D(𝑋1), . . . , D(𝑋𝑛), and assume w.l.o.g. that D(𝑋𝑖) := [𝑑𝑖] for {𝑑𝑖}𝑖=1..𝑛 ⊆ N₊. Let 𝑁 ∈ N₊ be given