
PODS ’23, June 18–23, 2023, Seattle, WA, USA Batya Kenig and Nir Weinberger
(i.e., no spurious tuples). Beyond this characterization, not much is
known about the relationship between these two measures of loss.
In fact, the relationship is not necessarily monotonic, and may vary
widely even between two acyclic schemas with similar
J
-measure
values [
14
]. Nevertheless, empirical studies have shown that low
values of the
J
-measure generally lead to acyclic schemas that
incur a small number of spurious tuples [14].
Our rst result is a characterization of the
J
-measure of an
acyclic schema
S={Ω1, . . . , Ω𝑘}
, with respect to a relation instance
𝑅
, as the KL-divergence between two empirical distributions; the
one associated with
𝑅
, and the one associated with
𝑅′def
=ΠΩ1(𝑅)⊲⊳
··· ⊲⊳ ΠΩ𝑘(𝑅)
. The empirical distribution is a standard notion used
to associate a multivariate probability distribution with a multiset
of tuples, and a formal denition is deferred to Section 2. The KL-
divergence is a non-symmetric measure of the similarity between
two probability distributions
𝑃(Ω)
and
𝑄(Ω)
. It has numerous
information-theoretic applications, and can be loosely thought of
as a measure of the information lost when
𝑄(Ω)
is used to ap-
proximate
𝑃(Ω)
. Our result that the
J
-measure is, in fact, the
KL-divergence between the empirical distributions associated with
the original relation instance and the one implied by the acyclic
schema, explains the success of the
J
-measure for identifying
“approximate acyclic schemas” in [
14
], and how the
J
-measure
characterizes lossless AJDs [18, 19].
With this result at hand, we address the following problem. Given
a relation schema
Ω
, an acyclic schema
S={Ω1, . . . , Ω𝑘}
, and a
value
J ≥ 0
, compute the minimum and maximum number of spu-
rious tuples generated by
S
with respect to any relation instance
𝑅
over
Ω
, whose KL-divergence from
𝑅′def
=ΠΩ1(𝑅)⊲⊳ ··· ⊲⊳ ΠΩ𝑘(𝑅)
is
J
. To this end, we prove a deterministic lower bound on the
number of spurious tuples that depends only on
J
. We then con-
sider the problem of determining an upper bound on the number of
spurious tuples. As it turns out, this problem is more challenging,
as a deterministic upper bound does not hold. We thus propose a
random relation model, in which a relation is drawn uniformly at
random from all possible empirical distributions of a given size
𝑁
.
We then show that a bound analogous to the deterministic lower
bound on the relative number of spurious tuples also holds as an
upper bound with two dierences: First, it holds with high proba-
bility over the random choice of relation (and not with probability
1
), and second, is holds with an additive term, though one which
vanishes for asymptotically large relation instances. The proof of
this result is fairly complicated, and as discussed in Section 5, re-
quires applications of multiple techniques from information theory
[9] and concentration of measure [6].
Beyond its theoretical interest, understanding the relationship
between the information-theoretic KL-divergence, and the tangible
property of loss, as measured by the number of spurious tuples, has
practical consequences for the task of discovering acyclic schemas
that t a dataset. Currently, the system of [
14
] can discover acyclic
schemas that t the data well in terms of its
J
-measure. Under-
standing how the
J
-measure relates to the loss in terms of spurious
tuples will enable nding acyclic schemas that generate a bounded
number of spurious tuples. This is important for applications that
apply factorization as a means of compression, while wishing to
maintain the integrity of the data [22].
To summarize, in this paper we: (1) Show that the
J
-measure of
Lee [
18
,
19
] is the KL-divergence between the empirical distribution
associated with the original relation and the one induced by the
acyclic schema, (2) Prove a general lower bound on the loss (i.e.,
spurious tuples) in terms of the KL-divergence, and present a simple
family of relation instances for which this bound is tight, and (3)
Propose a random relation model and prove an upper bound on the
loss, which holds with high probability, and which converges to
the lower bound for large relational instances.
2 BACKGROUND
For the sake of consistency, we adopt some of the notation from [
14
].
We denote by
[𝑛]={1, . . . , 𝑛}
. Let
Ω
be a set of variables, also called
attributes. If 𝑋 , 𝑌 ⊆Ω, then 𝑋𝑌 denotes 𝑋∪𝑌.
2.1 Data Dependencies
Let
Ωdef
={𝑋1, . . . , 𝑋𝑛}
denote a set of attributes with domains
D(𝑋1), . . . , D(𝑋𝑛)
. We denote by
Rel(Ω)def
={𝑅:𝑅⊆>𝑛
𝑖=1D(𝑋𝑖)}
the set of all possible relation instances over
Ω
. Fix a relation in-
stance
𝑅∈Rel(Ω)
of size
𝑁=|𝑅|
. For
𝑌⊆Ω
we let
𝑅[𝑌]
de-
note the projection of
𝑅
onto the attributes
𝑌
. A schema is a set
S={Ω1, . . . , Ω𝑚}such that Ð𝑚
𝑖=1Ω𝑖=Ωand Ω𝑖⊈Ω𝑗for 𝑖≠𝑗.
We say that the relation instance
𝑅
satises the join dependency
JD(S)
, and write
𝑅|=JD(S)
, if
𝑅=Z𝑚
𝑖=1𝑅[Ω𝑖]
. If
𝑅̸|=JD(S)
, then
Sincurs a loss with respect to 𝑅, denoted 𝜌(𝑅, S), dened:
𝜌(𝑅, S)def
=Z𝑚
𝑖=1𝑅[Ω𝑖]−|𝑅|
|𝑅|(1)
We call the set of tuples
Z𝑚
𝑖=1𝑅[Ω𝑖]\𝑅
spurious tuples. Clearly,
𝑅|=JD(S)if and only if 𝜌(𝑅, S)=0.
We say that
𝑅
satises the multivalued dependency (MVD)
𝜙=
𝑋↠𝑌1|𝑌2|. . . |𝑌𝑚
where
𝑚≥2
, the
𝑌𝑖
s are pairwise disjoint, and
𝑋𝑌1···𝑌𝑚=Ω
, if
𝑅=𝑅[𝑋𝑌1]Z··· Z𝑅[𝑋𝑌𝑚]
, or if the schema
S={𝑋𝑌1, . . . , 𝑋𝑌𝑚}
is lossless (i.e.,
𝜌(𝑅, S)=0
). We review the
concept of a join tree from [3]:
Denition 2.1.
Ajoin tree or junction tree is a pair
T, 𝜒
where
T
is an undirected tree, and
𝜒
is a function that maps each
𝑢∈
nodes(T)
to a set of variables
𝜒(𝑢)
, called a bag, such that the
running intersection property holds: for every variable
𝑋
, the set
{𝑢∈nodes(T) | 𝑋∈𝜒(𝑢)}
is a connected component of
T
. We
denote by 𝜒(T) def
=Ð𝑢𝜒(𝑢), the set of variables of the join tree.
We often denote the join tree as
T
, dropping
𝜒
when it is clear
from the context. The schema dened by
T
is
S={Ω1, . . . , Ω𝑚}
,
where
Ω1, . . . , Ω𝑚
are the bags of
T
. We call a schema
S
acyclic
if there exists a join tree whose schema is
S
. When
Ω𝑖⊈Ω𝑗
for
all
𝑖≠𝑗
, then the acyclic schema
S={Ω1, . . . , Ω𝑚}
satises
𝑚≤
|Ω|
[
3
]. We say that a relation
𝑅
satises the acyclic join dependency
S
, and denote
𝑅|=AJD(S)
, if
S
is acyclic and
𝑅|=JD(S)
. An
MVD
𝑋↠𝑌1| ··· |𝑌𝑚
represents a simple acyclic schema, namely
S={𝑋𝑌1, 𝑋𝑌2, . . . , 𝑋𝑌𝑚}.
Let
S={Ω1, . . . , Ω𝑚}
be an acyclic schema with join tree
(T, 𝜒)
.
We associate to every
(𝑢, 𝑣) ∈ edges(T)
an MVD
𝜙𝑢,𝑣
as follows.
Let
T
𝑢
and
T𝑣
be the two subtrees obtained by removing the edge
(𝑢, 𝑣)
. Then, we denote by
𝜙𝑢,𝑣 def
=𝜒(𝑢) ∩ 𝜒(𝑣)↠𝜒(T
𝑢)|𝜒(T𝑣)
.