Quantifying the Loss of Acyclic Join Dependencies

Batya Kenig
Technion, Israel Institute of Technology
Haifa, Israel
batyak@technion.ac.il
Nir Weinberger
Technion, Israel Institute of Technology
Haifa, Israel
nirwein@technion.ac.il
ABSTRACT
Acyclic schemes possess known benefits for database design, speeding up queries, and reducing space requirements. An acyclic join dependency (AJD) is lossless with respect to a universal relation if joining the projections associated with the schema results in the original universal relation. An intuitive and standard measure of the loss entailed by an AJD is the number of redundant tuples generated by the acyclic join. Recent work has shown that the loss of an AJD can also be characterized by an information-theoretic measure. Motivated by the problem of automatically fitting an acyclic schema to a universal relation, we investigate the connection between these two characterizations of loss. We first show that the loss of an AJD is captured using the notion of KL-divergence. We then show that the KL-divergence can be used to bound the number of redundant tuples. We prove a deterministic lower bound on the percentage of redundant tuples. For an upper bound, we propose a random database model, and establish a high-probability bound on the percentage of redundant tuples, which coincides with the lower bound for large databases.
ACM Reference Format:
Batya Kenig and Nir Weinberger. 2023. Quantifying the Loss of Acyclic
Join Dependencies. In Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI
Symposium on Principles of Database Systems (PODS ’23), June 18–23, 2023,
Seattle, WA, USA. ACM, New York, NY, USA, 35 pages. https://doi.org/10.
1145/3584372.3588658
1 INTRODUCTION
In the traditional approach to database design, the designer has a clear conceptual model of the data, and of the dependencies between the attributes. This information guides the database normalization process, which leads to a database schema consisting of multiple relation schemas that have the benefit of reduced redundancies, and more efficient querying and updating of the data. This approach requires that the data precisely meet the constraints of the model; various data-repair techniques [1, 4] have been developed to address the case where the data does not meet the constraints of the schema exactly. Current data management applications are required to handle data that is noisy, erroneous, and inconsistent. The presumption that such data meet a predefined set of constraints is not likely to hold. In many cases, such applications are willing to tolerate the "loss" entailed by an imperfect database schema, and will be content with a database schema that only "approximately fits" the data. Motivated by the task of schema discovery for a given dataset, in this work we investigate different ways of measuring the "loss" of an imperfect database schema, and the relationship between these different measures.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
PODS '23, June 18–23, 2023, Seattle, WA, USA
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-0127-6/23/06. . . $15.00
https://doi.org/10.1145/3584372.3588658
Decomposing a relation schema Ω is the process of breaking the relation scheme into two or more relation schemes Ω1, . . . , Ω𝑘 whose union is Ω. The decomposition is lossless if, for any relation instance 𝑅 over Ω, it holds that 𝑅 = ΠΩ1(𝑅) ⋈ ··· ⋈ ΠΩ𝑘(𝑅). The loss of a database scheme {Ω1, . . . , Ω𝑘} with respect to a relation instance 𝑅 over Ω is the set of tuples (ΠΩ1(𝑅) ⋈ ··· ⋈ ΠΩ𝑘(𝑅)) \ 𝑅 that are in the join, but not in 𝑅 (formal definitions in Section 2). We say that such tuples are spurious. Normalizing a relation scheme is the process of (losslessly) decomposing it into a database scheme where each of the resulting relation schemes has certain properties. The specific properties imposed on the resulting relation schemes define different normal forms such as 3NF [7], BCNF [8], 4NF [11], and 5NF [10, 12].
A data dependency defines a relationship between sets of attributes in a database. A Join Dependency (JD) defines a 𝑘-way decomposition (where 𝑘 ≥ 2) of a relation schema Ω, and is said to hold in a relation instance 𝑅 over Ω if the join is lossless with respect to 𝑅 (formal definitions in Section 2). Join Dependencies generalize Multivalued Dependencies (MVDs), which are effectively Join Dependencies where 𝑘 = 2, which in turn generalize Functional Dependencies (FDs), which are perhaps the most widely studied data dependencies due to their simple and intuitive semantics.
An Acyclic Join Dependency (AJD) is a type of JD that is specified by an acyclic schema [2]. Acyclic schemes have many applications in databases and in machine learning; they enable efficient query evaluation [26], play a key role in database normalization and design [11, 20], and improve the performance of many well-known machine learning algorithms over relational data [16, 23, 24].
Consider how we may measure the loss of an acyclic schema S = {Ω1, . . . , Ω𝑘} with respect to a given relation instance 𝑅 over Ω. An intuitive approach, based on the definition of a lossless join, is to simply count the number of spurious tuples generated by the join (i.e., tuples that are not included in the relation instance 𝑅). In previous work, Kenig et al. [14] presented an algorithm that discovers "approximate acyclic schemas" for a given dataset. Building on earlier work by Lee [18, 19], the authors proposed to measure the loss of an acyclic schema, with respect to a given dataset, using an information-theoretic metric called the J-measure, formally defined in Section 2. Lee has shown that this information-theoretic measure characterizes lossless AJDs. That is, the J-measure of an acyclic schema with respect to a dataset is 0 if and only if the AJD defined by this schema is lossless with respect to the dataset [18, 19] (i.e., no spurious tuples). Beyond this characterization, not much is known about the relationship between these two measures of loss. In fact, the relationship is not necessarily monotonic, and may vary widely even between two acyclic schemas with similar J-measure values [14]. Nevertheless, empirical studies have shown that low values of the J-measure generally lead to acyclic schemas that incur a small number of spurious tuples [14].

arXiv:2210.14572v2 [cs.DB] 10 Apr 2023
Our rst result is a characterization of the
J
-measure of an
acyclic schema
S={Ω1, . . . , Ω𝑘}
, with respect to a relation instance
𝑅
, as the KL-divergence between two empirical distributions; the
one associated with
𝑅
, and the one associated with
𝑅def
=ΠΩ1(𝑅)
··· ΠΩ𝑘(𝑅)
. The empirical distribution is a standard notion used
to associate a multivariate probability distribution with a multiset
of tuples, and a formal denition is deferred to Section 2. The KL-
divergence is a non-symmetric measure of the similarity between
two probability distributions
𝑃(Ω)
and
𝑄(Ω)
. It has numerous
information-theoretic applications, and can be loosely thought of
as a measure of the information lost when
𝑄(Ω)
is used to ap-
proximate
𝑃(Ω)
. Our result that the
J
-measure is, in fact, the
KL-divergence between the empirical distributions associated with
the original relation instance and the one implied by the acyclic
schema, explains the success of the
J
-measure for identifying
“approximate acyclic schemas” in [
14
], and how the
J
-measure
characterizes lossless AJDs [18, 19].
With this result at hand, we address the following problem. Given a relation schema Ω, an acyclic schema S = {Ω1, . . . , Ω𝑘}, and a value J ≥ 0, compute the minimum and maximum number of spurious tuples generated by S with respect to any relation instance 𝑅 over Ω whose KL-divergence from 𝑅^⋈ := ΠΩ1(𝑅) ⋈ ··· ⋈ ΠΩ𝑘(𝑅) is J. To this end, we prove a deterministic lower bound on the number of spurious tuples that depends only on J. We then consider the problem of determining an upper bound on the number of spurious tuples. As it turns out, this problem is more challenging, as a deterministic upper bound does not hold. We thus propose a random relation model, in which a relation is drawn uniformly at random from all possible empirical distributions of a given size 𝑁. We then show that a bound analogous to the deterministic lower bound on the relative number of spurious tuples also holds as an upper bound, with two differences: first, it holds with high probability over the random choice of relation (and not with probability 1), and second, it holds with an additive term, though one which vanishes for asymptotically large relation instances. The proof of this result is fairly complicated, and as discussed in Section 5, requires applications of multiple techniques from information theory [9] and concentration of measure [6].
Beyond its theoretical interest, understanding the relationship between the information-theoretic KL-divergence and the tangible property of loss, as measured by the number of spurious tuples, has practical consequences for the task of discovering acyclic schemas that fit a dataset. Currently, the system of [14] can discover acyclic schemas that fit the data well in terms of the J-measure. Understanding how the J-measure relates to the loss in terms of spurious tuples will enable finding acyclic schemas that generate a bounded number of spurious tuples. This is important for applications that apply factorization as a means of compression, while wishing to maintain the integrity of the data [22].
To summarize, in this paper we: (1) show that the J-measure of Lee [18, 19] is the KL-divergence between the empirical distribution associated with the original relation and the one induced by the acyclic schema, (2) prove a general lower bound on the loss (i.e., spurious tuples) in terms of the KL-divergence, and present a simple family of relation instances for which this bound is tight, and (3) propose a random relation model and prove an upper bound on the loss, which holds with high probability, and which converges to the lower bound for large relation instances.
2 BACKGROUND
For the sake of consistency, we adopt some of the notation from [14]. We denote [𝑛] := {1, . . . , 𝑛}. Let Ω be a set of variables, also called attributes. If 𝑋, 𝑌 ⊆ Ω, then 𝑋𝑌 denotes 𝑋 ∪ 𝑌.
2.1 Data Dependencies
Let Ω := {𝑋1, . . . , 𝑋𝑛} denote a set of attributes with domains D(𝑋1), . . . , D(𝑋𝑛). We denote by Rel(Ω) := {𝑅 : 𝑅 ⊆ D(𝑋1) × ··· × D(𝑋𝑛)} the set of all possible relation instances over Ω. Fix a relation instance 𝑅 ∈ Rel(Ω) of size 𝑁 = |𝑅|. For 𝑌 ⊆ Ω we let 𝑅[𝑌] denote the projection of 𝑅 onto the attributes 𝑌. A schema is a set S = {Ω1, . . . , Ω𝑚} such that Ω1 ∪ ··· ∪ Ω𝑚 = Ω and Ω𝑖 ⊈ Ω𝑗 for 𝑖 ≠ 𝑗.
We say that the relation instance 𝑅 satisfies the join dependency JD(S), and write 𝑅 |= JD(S), if 𝑅 = 𝑅[Ω1] ⋈ ··· ⋈ 𝑅[Ω𝑚]. If 𝑅 ̸|= JD(S), then S incurs a loss with respect to 𝑅, denoted 𝜌(𝑅, S), defined:

    𝜌(𝑅, S) := ( |𝑅[Ω1] ⋈ ··· ⋈ 𝑅[Ω𝑚]| − |𝑅| ) / |𝑅|        (1)

We call the set of tuples (𝑅[Ω1] ⋈ ··· ⋈ 𝑅[Ω𝑚]) \ 𝑅 spurious tuples. Clearly, 𝑅 |= JD(S) if and only if 𝜌(𝑅, S) = 0.
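To make the definition concrete, the following is a small, self-contained Python sketch (our own illustration, not code from the paper) that computes 𝜌(𝑅, S) of equation (1) by joining the projections of 𝑅. The relation, the schema, and all helper names here are assumptions introduced for the example.

```python
from itertools import product

def project(R, attrs, schema):
    """Project relation R (a set of tuples over `schema`) onto `attrs`."""
    idx = [schema.index(a) for a in attrs]
    return {tuple(t[i] for i in idx) for t in R}

def natural_join(R1, s1, R2, s2):
    """Natural join of R1 (over schema s1) and R2 (over schema s2)."""
    common = [a for a in s1 if a in s2]
    out_schema = s1 + [a for a in s2 if a not in s1]
    out = set()
    for t1, t2 in product(R1, R2):
        if all(t1[s1.index(a)] == t2[s2.index(a)] for a in common):
            row = dict(zip(s1, t1))
            row.update(zip(s2, t2))
            out.add(tuple(row[a] for a in out_schema))
    return out, out_schema

def loss(R, schema, S):
    """rho(R, S) of equation (1): relative number of spurious tuples."""
    J, js = project(R, S[0], schema), list(S[0])
    for omega in S[1:]:
        J, js = natural_join(J, js, project(R, omega, schema), list(omega))
    # re-order joined tuples to the original attribute order before comparing
    J = {tuple(t[js.index(a)] for a in schema) for t in J}
    return (len(J) - len(R)) / len(R)
```

For instance, for 𝑅 = {(𝑎1, 𝑏, 𝑐1), (𝑎2, 𝑏, 𝑐2)} and S = {𝐴𝐵, 𝐵𝐶}, the join produces four tuples, two of them spurious, so the loss is 1.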
We say that 𝑅 satisfies the multivalued dependency (MVD) 𝜙 = 𝑋 ↠ 𝑌1 | 𝑌2 | . . . | 𝑌𝑚, where 𝑚 ≥ 2, the 𝑌𝑖s are pairwise disjoint, and 𝑋𝑌1 ··· 𝑌𝑚 = Ω, if 𝑅 = 𝑅[𝑋𝑌1] ⋈ ··· ⋈ 𝑅[𝑋𝑌𝑚], or, equivalently, if the schema S = {𝑋𝑌1, . . . , 𝑋𝑌𝑚} is lossless (i.e., 𝜌(𝑅, S) = 0). We review the concept of a join tree from [3]:
Denition 2.1.
Ajoin tree or junction tree is a pair
T, 𝜒
where
T
is an undirected tree, and
𝜒
is a function that maps each
𝑢
nodes(T)
to a set of variables
𝜒(𝑢)
, called a bag, such that the
running intersection property holds: for every variable
𝑋
, the set
{𝑢nodes(T) | 𝑋𝜒(𝑢)}
is a connected component of
T
. We
denote by 𝜒(T) def
=Ð𝑢𝜒(𝑢), the set of variables of the join tree.
We often denote the join tree as T, dropping 𝜒 when it is clear from the context. The schema defined by T is S = {Ω1, . . . , Ω𝑚}, where Ω1, . . . , Ω𝑚 are the bags of T. We call a schema S acyclic if there exists a join tree whose schema is S. When Ω𝑖 ⊈ Ω𝑗 for all 𝑖 ≠ 𝑗, then the acyclic schema S = {Ω1, . . . , Ω𝑚} satisfies 𝑚 ≤ |Ω| [3]. We say that a relation 𝑅 satisfies the acyclic join dependency S, and denote 𝑅 |= AJD(S), if S is acyclic and 𝑅 |= JD(S). An MVD 𝑋 ↠ 𝑌1 | ··· | 𝑌𝑚 represents a simple acyclic schema, namely S = {𝑋𝑌1, 𝑋𝑌2, . . . , 𝑋𝑌𝑚}.
Let S = {Ω1, . . . , Ω𝑚} be an acyclic schema with join tree (T, 𝜒). We associate with every (𝑢, 𝑣) ∈ edges(T) an MVD 𝜙𝑢,𝑣 as follows. Let T𝑢 and T𝑣 be the two subtrees obtained by removing the edge (𝑢, 𝑣). Then, we denote 𝜙𝑢,𝑣 := 𝜒(𝑢) ∩ 𝜒(𝑣) ↠ 𝜒(T𝑢) | 𝜒(T𝑣).
We call the support of T the set of 𝑚 − 1 MVDs associated with its edges, in notation MVD(T) = {𝜙𝑢,𝑣 | (𝑢, 𝑣) ∈ edges(T)}. Beeri et al. have shown that if T defines the acyclic schema S, then 𝑅 satisfies AJD(S) (i.e., 𝑅 |= AJD(S)) if and only if 𝑅 satisfies all MVDs in its support: 𝑅 |= 𝜙𝑢,𝑣 for all 𝜙𝑢,𝑣 ∈ MVD(T) [3, Thm. 8.8].
2.2 Information Theory
Lee [18, 19] gave an equivalent formulation of functional, multivalued, and acyclic join dependencies in terms of information measures; we review this briefly here, after a short background on information theory.
Let 𝑋 be a random variable with a finite domain D and probability mass function 𝑃 (thus, Σ𝑥∈D 𝑃(𝑥) = 1). Its entropy is:

    𝐻(𝑋) := Σ𝑥∈D 𝑃(𝑥) log(1/𝑃(𝑥))        (2)

It holds that 𝐻(𝑋) ≤ log |D|, and equality holds if and only if 𝑃 is uniform. The definition of entropy naturally generalizes to sets of jointly distributed random variables Ω = {𝑋1, . . . , 𝑋𝑛}, by defining the function 𝐻 : 2^Ω → R as the entropy of the joint random variables in the set. For example,

    𝐻(𝑋1𝑋2) = Σ𝑥1∈D1, 𝑥2∈D2 𝑃(𝑥1, 𝑥2) log(1/𝑃(𝑥1, 𝑥2))        (3)
Let 𝐴, 𝐵, 𝐶 ⊆ Ω. The conditional mutual information 𝐼(𝐴; 𝐵 | 𝐶) is defined as:

    𝐼(𝐴; 𝐵 | 𝐶) := 𝐻(𝐵𝐶) + 𝐻(𝐴𝐶) − 𝐻(𝐴𝐵𝐶) − 𝐻(𝐶)        (4)

It is known that the conditional independence 𝑃 |= 𝐴 ⊥ 𝐵 | 𝐶 (i.e., 𝐴 is independent of 𝐵 given 𝐶) holds if and only if 𝐼(𝐴; 𝐵 | 𝐶) = 0. When 𝐶 = ∅, or if 𝐶 is a constant (i.e., 𝐻(𝐶) = 0), then the conditional mutual information reduces to the standard mutual information, denoted 𝐼(𝐴; 𝐵).
Let X := {𝑋1, . . . , 𝑋𝑛} be a set of discrete random variables, and let 𝑃(X) and 𝑄(X) denote discrete probability distributions. The KL-divergence between 𝑃 and 𝑄 is:

    𝐷𝐾𝐿(𝑃 || 𝑄) := E𝑃[log(𝑃(X)/𝑄(X))] = Σx∈D(X) 𝑃(x) log(𝑃(x)/𝑄(x))        (5)

For any pair of probability distributions 𝑃, 𝑄 over the same probability space, it holds that 𝐷𝐾𝐿(𝑃 || 𝑄) ≥ 0, with equality if and only if 𝑃 = 𝑄. It is an easy observation that

    𝐼(𝐴; 𝐵 | 𝐶) = 𝐷𝐾𝐿(𝑃(𝐴𝐵𝐶) || 𝑃(𝐴|𝐶) 𝑃(𝐵|𝐶) 𝑃(𝐶))        (6)

for any probability distribution 𝑃.
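As a quick sanity check of identity (6), the following Python sketch (ours, not part of the paper) computes both sides on a small hand-picked joint distribution 𝑃(𝐴, 𝐵, 𝐶); the particular distribution is an arbitrary assumption for illustration.

```python
from collections import defaultdict
from math import log2

def kl(P, Q):
    """D_KL(P || Q) in bits, equation (5); P, Q map outcomes to probabilities."""
    return sum(p * log2(p / Q[x]) for x, p in P.items() if p > 0)

def marginal(P, idx):
    """Marginalize a joint distribution onto the coordinates in idx."""
    M = defaultdict(float)
    for x, p in P.items():
        M[tuple(x[i] for i in idx)] += p
    return M

def H(M):
    """Entropy in bits, equation (2)."""
    return -sum(p * log2(p) for p in M.values() if p > 0)

# A small joint distribution P(A, B, C); A and B are correlated given C.
P = {(0, 0, 0): 0.3, (1, 1, 0): 0.2, (0, 1, 1): 0.25, (1, 0, 1): 0.25}

PC, PAC, PBC = marginal(P, [2]), marginal(P, [0, 2]), marginal(P, [1, 2])

# Q(a, b, c) = P(a|c) P(b|c) P(c) = P(a, c) P(b, c) / P(c)
Q = {(a, b, c): PAC[(a, c)] * PBC[(b, c)] / PC[(c,)] for (a, b, c) in P}

# I(A;B|C) via the entropy formula (4): H(BC) + H(AC) - H(ABC) - H(C)
cmi = H(PBC) + H(PAC) - H(P) - H(PC)
```

Here `kl(P, Q)` and `cmi` agree up to floating-point error, as (6) asserts; they are positive because 𝐴 and 𝐵 are not conditionally independent given 𝐶 in this 𝑃.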
Let 𝑅 be a multiset of tuples over the attribute set Ω = {𝑋1, . . . , 𝑋𝑛}, and let |𝑅| = 𝑁. The empirical distribution associated with 𝑅 is the multivariate probability distribution over D1 × ··· × D𝑛 that assigns a probability of 𝑃(𝑡) := 𝐾/𝑁 to every tuple 𝑡 in 𝑅 with multiplicity 𝐾. When 𝑅 is a relation instance, and hence a set of 𝑁 tuples, its empirical distribution is the uniform distribution over its tuples, i.e., 𝑃(𝑡) = 1/𝑁 for all 𝑡 ∈ 𝑅, and so its entropy is 𝐻(Ω) = log 𝑁. For a relation instance 𝑅 ∈ Rel(Ω), we let 𝑃𝑅 denote the empirical distribution over 𝑅. For 𝛼 ⊆ [𝑛], we denote by 𝑋𝛼 the set of variables 𝑋𝑖, 𝑖 ∈ 𝛼, and denote by 𝑅(𝑋𝛼 = 𝑥𝛼) the subset of tuples 𝑡 ∈ 𝑅 where 𝑡[𝑋𝛼] = 𝑥𝛼, for fixed values 𝑥𝛼. When 𝑃𝑅 is the empirical distribution over 𝑅, then the marginal probability is 𝑃𝑅(𝑋𝛼 = 𝑥𝛼) = |𝑅(𝑋𝛼 = 𝑥𝛼)| / 𝑁.
Lee [18, 19] formalized the following connection between database constraints and entropic measures. Let (T, 𝜒) be a join tree (Definition 2.1). The J-measure is defined as:

    J(T, 𝜒) := Σ𝑣∈nodes(T) 𝐻(𝜒(𝑣)) − Σ(𝑣1,𝑣2)∈edges(T) 𝐻(𝜒(𝑣1) ∩ 𝜒(𝑣2)) − 𝐻(𝜒(T))        (7)

where 𝐻 is the entropy (see (2)) taken over the empirical distribution associated with 𝑅. We abbreviate J(T, 𝜒) with J(T), or J, when T, 𝜒 are clear from the context. Observe that J depends only on the schema S defined by the join tree, and not on the tree itself. For a simple example, consider the MVD 𝑋 ↠ 𝑈 | 𝑉 | 𝑊 and its associated acyclic schema {𝑋𝑈, 𝑋𝑉, 𝑋𝑊}. For the join tree 𝑋𝑈 − 𝑋𝑉 − 𝑋𝑊, it holds that J = 𝐻(𝑋𝑈) + 𝐻(𝑋𝑉) + 𝐻(𝑋𝑊) − 2𝐻(𝑋) − 𝐻(𝑋𝑈𝑉𝑊). Another join tree is 𝑋𝑈 − 𝑋𝑊 − 𝑋𝑉, and J is the same. Therefore, if S is acyclic, then we write J(S) to denote J(T) for any join tree of S. When S = {𝑋𝑍, 𝑋𝑌}, the J-measure reduces to the conditional mutual information (see (4)), to wit J(S) = 𝐼(𝑍; 𝑌 | 𝑋).
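The claim that J depends only on the schema can be checked mechanically. The sketch below (our own illustration, not code from the paper) evaluates equation (7) over the empirical distribution of a relation; the sample relation and function names are assumptions introduced for the example.

```python
from collections import Counter
from math import log2

def entropy(R, attrs, schema):
    """Entropy of the empirical distribution of R projected onto attrs."""
    idx = [schema.index(a) for a in attrs]
    counts = Counter(tuple(t[i] for i in idx) for t in R)
    N = len(R)
    return -sum(c / N * log2(c / N) for c in counts.values())

def J_measure(R, schema, bags, edges):
    """J(T, chi) of equation (7); `edges` are index pairs into `bags`."""
    s = sum(entropy(R, b, schema) for b in bags)
    s -= sum(entropy(R, sorted(set(bags[u]) & set(bags[v])), schema)
             for u, v in edges)
    return s - entropy(R, schema, schema)

schema = ["X", "U", "V", "W"]
bags = [["X", "U"], ["X", "V"], ["X", "W"]]      # schema of X ->> U | V | W
R = {(0, 0, 0, 0), (0, 1, 1, 1), (1, 0, 1, 0)}  # an arbitrary instance
J1 = J_measure(R, schema, bags, [(0, 1), (1, 2)])  # join tree XU - XV - XW
J2 = J_measure(R, schema, bags, [(0, 2), (2, 1)])  # join tree XU - XW - XV
```

Both trees give the same value, and a relation that is the full product over a fixed 𝑋-value (hence lossless for this schema) gives J = 0, matching Theorem 2.1 below.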
Theorem 2.1. ([19]) Let S be an acyclic schema over Ω, and let 𝑅 be a relation instance over Ω. Then 𝑅 |= AJD(S) if and only if J(S) = 0.

In the particular case of a standard MVD, Lee's result implies that 𝑅 |= 𝑋 ↠ 𝑌 | 𝑍 if and only if 𝐼(𝑌; 𝑍 | 𝑋) = 0 in the empirical distribution associated with 𝑅.
2.3 MVDs and Acyclic Join Dependencies
Beeri et al. [3] have shown that an acyclic join dependency defined by the acyclic schema S, over 𝑚 relation schemas, is equivalent to 𝑚 − 1 MVDs (called its support). In [14], this characterization was generalized as follows. Let (T, 𝜒) be the join tree corresponding to the acyclic schema S. Root the tree T at node 𝑢1, orient the tree accordingly, and let 𝑢1, . . . , 𝑢𝑚 be a depth-first enumeration of nodes(T). Thus, 𝑢1 is the root, and for every 𝑖 > 1, parent(𝑢𝑖) is some node 𝑢𝑗 with 𝑗 < 𝑖. For every 𝑖, we define Ω𝑖 := 𝜒(𝑢𝑖), Ω𝑖:𝑗 := ⋃ℓ=𝑖,...,𝑗 Ωℓ, and Δ𝑖 := 𝜒(parent(𝑢𝑖)) ∩ 𝜒(𝑢𝑖). By the running intersection property (Definition 2.1), it holds that Δ𝑖 = Ω1:(𝑖−1) ∩ Ω𝑖.
Theorem 2.2. ([14]) Let (T, 𝜒) be a join tree over variables 𝜒(T) = Ω, where nodes(T) = {𝑢1, . . . , 𝑢𝑚} with corresponding bags Ω𝑖 = 𝜒(𝑢𝑖). Then:

    max𝑖∈[2,𝑚] 𝐼(Ω1:(𝑖−1); Ω𝑖:𝑚 | Δ𝑖) ≤ J(T, 𝜒) ≤ Σ𝑖=2..𝑚 𝐼(Ω1:(𝑖−1); Ω𝑖:𝑚 | Δ𝑖)        (8)

Since the support of (T, 𝜒) is precisely the set of MVDs

    {Δ𝑖 ↠ Ω1:(𝑖−1) | Ω𝑖:𝑚}𝑖∈[2,𝑚],        (9)

(8) generalizes the result of Beeri et al. [3].
Denition 2.2
(models
T
)
.
We say that a relation instance
𝑅
over attributes
Ω
models the join tree
(T, 𝜒)
, denoted
𝑅|=T
if
𝐼(Ω1,(𝑖1);Ω𝑖:𝑚|Δ𝑖)=0for every 𝑖∈ [2,𝑚].
3 CHARACTERIZING ACYCLIC SCHEMAS WITH KL-DIVERGENCE
Let X := {𝑋1, . . . , 𝑋𝑛} be a set of discrete random variables over domains D(𝑋𝑖), and let 𝑃(𝑋1, . . . , 𝑋𝑛) be a joint probability distribution over X. Let x ∈ D(𝑋1) × ··· × D(𝑋𝑛). For a subset Y ⊆ X, we denote by x[Y] the assignment x restricted to the variables Y. We denote by 𝑃[Y] the marginal probability distribution over Y.

Let (T, 𝜒) be a join tree where nodes(T) = {𝑢1, . . . , 𝑢𝑚}. Let S = {Ω1, . . . , Ω𝑚} be the acyclic schema associated with (T, 𝜒) over the variables Ω = 𝜒(T), where Ω𝑖 = 𝜒(𝑢𝑖).
Proposition 3.1. Let 𝑃(𝑋1, . . . , 𝑋𝑛) be any joint probability distribution over 𝑛 variables, and let (T, 𝜒) be a join tree where 𝜒(T) = {𝑋1, . . . , 𝑋𝑛}. Then 𝑃 |= T (Definition 2.2) if and only if 𝑃 = 𝑃T, where:

    𝑃T(𝑥1, . . . , 𝑥𝑛) := Π𝑖=1..𝑚 𝑃[Ω𝑖](x[Ω𝑖]) / Π𝑖=1..𝑚−1 𝑃[Δ𝑖](x[Δ𝑖])        (10)

where 𝑃[Ω𝑖] (resp. 𝑃[Δ𝑖]) denotes the marginal probability over Ω𝑖 (resp. Δ𝑖).

The proof of Proposition 3.1 is deferred to Appendix A. It follows from Definition 2.2, and a simple induction on the number of nodes in T.
In this section, we refine the statement of Proposition 3.1 and prove the following variational representation.

Theorem 3.2. For any joint probability distribution 𝑃(𝑋1, . . . , 𝑋𝑛) and any join tree (T, 𝜒) with 𝜒(T) = {𝑋1, . . . , 𝑋𝑛}, it holds that

    J(T) = min𝑄|=T 𝐷𝐾𝐿(𝑃 || 𝑄) = 𝐷𝐾𝐿(𝑃 || 𝑃T)        (11)

In words, this theorem states that when the join tree (T, 𝜒) is given, then out of all probability distributions 𝑄 over Ω = {𝑋1, . . . , 𝑋𝑛} that model T (see (10)), the one closest to 𝑃 in terms of KL-divergence is 𝑃T. Importantly, this KL-divergence is precisely J(T) (i.e., J(T) = 𝐷𝐾𝐿(𝑃 || 𝑃T)). While the theorem holds for all probability distributions 𝑃, a special case is when 𝑃 is the empirical distribution associated with a relation 𝑅. The proof of Theorem 3.2 follows from the following two lemmas, interesting on their own, and is deferred to the complete version of this paper [15].
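Theorem 3.2 is easy to verify numerically on small cases. The following sketch (our illustration; the distribution and the path join tree are arbitrary assumptions) builds 𝑃T from equation (10) for the tree 𝑋𝑈 − 𝑋𝑉 − 𝑋𝑊 and checks that 𝐷𝐾𝐿(𝑃 || 𝑃T) matches J(T) computed from equation (7).

```python
from collections import Counter
from math import log2

# Empirical distribution of a small relation over ("X", "U", "V", "W")
schema = ("X", "U", "V", "W")
R = [(0, 0, 0, 0), (0, 0, 1, 1), (0, 1, 0, 1), (1, 1, 1, 0), (1, 0, 1, 1)]
N = len(R)
P = {t: c / N for t, c in Counter(R).items()}

def marg(attrs):
    """Marginal of P onto the given attribute names."""
    idx = [schema.index(a) for a in attrs]
    M = Counter()
    for t, p in P.items():
        M[tuple(t[i] for i in idx)] += p
    return dict(M)

def H(M):
    """Entropy in bits, equation (2)."""
    return -sum(p * log2(p) for p in M.values() if p > 0)

# Join tree XU - XV - XW: bags and the m-1 separators (both equal {X})
bags = [("X", "U"), ("X", "V"), ("X", "W")]
seps = [("X",), ("X",)]
Pb = [marg(b) for b in bags]
Ps = [marg(s) for s in seps]

# J(T) via (7): bag entropies - separator entropies - H(chi(T))
J = sum(H(m) for m in Pb) - sum(H(m) for m in Ps) - H(P)

def PT(t):
    """P_T(t) of equation (10): product of bag marginals over separators."""
    num = 1.0
    for b, M in zip(bags, Pb):
        num *= M[tuple(t[schema.index(a)] for a in b)]
    den = 1.0
    for s, M in zip(seps, Ps):
        den *= M[tuple(t[schema.index(a)] for a in s)]
    return num / den

# D_KL(P || P_T) via equation (5), summed over the support of P
D = sum(p * log2(p / PT(t)) for t, p in P.items())
```

The equality J = 𝐷𝐾𝐿(𝑃 || 𝑃T) holds exactly here, since the cross term E𝑃[log 𝑃T] decomposes into bag and separator entropies of 𝑃.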
Lemma 3.3. Let 𝑃(𝑋1, . . . , 𝑋𝑛) be a joint probability distribution over 𝑛 random variables, and let T be a join tree over 𝑋1, . . . , 𝑋𝑛 with bags Ω1, . . . , Ω𝑚. Then 𝑃[Ω𝑖] = 𝑃T[Ω𝑖] for every 𝑖 ∈ [1, 𝑚], and 𝑃[Δ𝑖] = 𝑃T[Δ𝑖] for every 𝑖 ∈ [1, 𝑚 − 1].

The proof of Lemma 3.3 follows from an easy induction on 𝑚, the number of nodes in the join tree T, and is deferred to Appendix A.
Lemma 3.4. The following holds for any joint probability distribution 𝑃(𝑋1, . . . , 𝑋𝑛), and any join tree T over variables 𝑋1, . . . , 𝑋𝑛:

    arg min𝑄|=T 𝐷𝐾𝐿(𝑃 || 𝑄) = 𝑃T        (12)

Proof. From Lemma 3.3 we have that, for every 𝑖 ∈ {1, . . . , 𝑚}, 𝑃T[Ω𝑖] = 𝑃[Ω𝑖], where Ω𝑖 = 𝜒(𝑢𝑖). Since Δ𝑖 ⊆ Ω𝑖, then 𝑃T[Δ𝑖] = 𝑃[Δ𝑖]. Now,

    min𝑄|=T 𝐷𝐾𝐿(𝑃 || 𝑄) = min𝑄|=T E𝑃[log(𝑃(𝑋1, . . . , 𝑋𝑛) / 𝑄(𝑋1, . . . , 𝑋𝑛))]        (13)
    = min𝑄|=T E𝑃[log( (𝑃(𝑋1, . . . , 𝑋𝑛) / 𝑃T(𝑋1, . . . , 𝑋𝑛)) · (𝑃T(𝑋1, . . . , 𝑋𝑛) / 𝑄(𝑋1, . . . , 𝑋𝑛)) )]        (14)
    = 𝐷𝐾𝐿(𝑃 || 𝑃T) + min𝑄|=T E𝑃[log(𝑃T(𝑋1, . . . , 𝑋𝑛) / 𝑄(𝑋1, . . . , 𝑋𝑛))]        (15)

Since the chosen distribution 𝑄(X) has no consequence on the first term 𝐷𝐾𝐿(𝑃 || 𝑃T), we take a closer look at the second term, min𝑄|=T E𝑃[log(𝑃T(𝑋1, . . . , 𝑋𝑛) / 𝑄(𝑋1, . . . , 𝑋𝑛))]. Since 𝑄 |= T, then by Proposition 3.1 it holds that 𝑄(𝑋1, . . . , 𝑋𝑛) = Π𝑖=1..𝑚 𝑄[Ω𝑖](X[Ω𝑖]) / Π𝑖=1..𝑚−1 𝑄[Δ𝑖](X[Δ𝑖]). Hence, in what follows, we refer to 𝑄 as 𝑄T. In the remainder of the proof we show that:

    E𝑃[log(𝑃T(𝑋1, . . . , 𝑋𝑛) / 𝑄T(𝑋1, . . . , 𝑋𝑛))] = E𝑃T[log(𝑃T(𝑋1, . . . , 𝑋𝑛) / 𝑄T(𝑋1, . . . , 𝑋𝑛))]        (16)
    = 𝐷𝐾𝐿(𝑃T || 𝑄T),        (17)

where the last equality follows from (5). Since 𝐷𝐾𝐿(𝑃T || 𝑄T) ≥ 0, with equality if and only if 𝑃T = 𝑄T, choosing 𝑄T to be 𝑃T minimizes 𝐷𝐾𝐿(𝑃 || 𝑄), thus proving the claim. The remainder of the proof, proving (16), follows from Lemma 3.3, which states that 𝑃[Ω𝑖] = 𝑃T[Ω𝑖] for every 𝑖 ∈ [1, 𝑚], and 𝑃[Δ𝑗] = 𝑃T[Δ𝑗] for every 𝑗 ∈ [𝑚 − 1]. The proof is quite technical, and hence deferred to Appendix A. □
4 SPURIOUS TUPLES: A LOWER BOUND BASED ON J(T)
Let 𝑃𝑅 be the empirical distribution over a relation instance 𝑅 with 𝑁 tuples and 𝑛 attributes Ω = {𝑋1, . . . , 𝑋𝑛}. That is, the probability associated with every record in 𝑅 is 1/𝑁. Let T be any junction tree over the variables Ω, and let S := {Ω1, . . . , Ω𝑚} where Ω𝑖 = 𝜒(𝑢𝑖). By Theorem 2.1 and Theorem 3.2, it holds that 𝑅 |= AJD(S) if and only if J(T) = 𝐷𝐾𝐿(𝑃 || 𝑃T) = 0. In what follows, given an acyclic schema S, we provide a lower bound for 𝜌(𝑅, S) that is based on the J-measure of its associated junction tree, J(T).
Lemma 4.1. Let 𝑃 be the empirical distribution over a relation instance 𝑅 with 𝑁 tuples, and let S = {Ω1, . . . , Ω𝑚} denote an acyclic schema with junction tree (T, 𝜒). Then:

    J(T) ≤ log(1 + 𝜌(𝑅, S)).        (18)

So if 𝜌(𝑅, S) = 0, then J(T) = 0 as well.
Proof. From Theorem 3.2 we have that

    J(T) = 𝐷𝐾𝐿(𝑃 || 𝑃T),        (19)

where

    𝑃T = arg min𝑄|=T 𝐷𝐾𝐿(𝑃 || 𝑄).        (20)

Let us define

    𝑄𝑇𝑈 = arg min𝑄|=T, 𝑄 uniform 𝐷𝐾𝐿(𝑃 || 𝑄),        (21)

and verify that such a distribution 𝑄𝑇𝑈 always exists: let 𝑅^⋈ := 𝑅[Ω1] ⋈ ··· ⋈ 𝑅[Ω𝑚], and let 𝑄𝑇𝑈 denote the empirical distribution over 𝑅^⋈. By construction, 𝑄𝑇𝑈 is a uniform distribution (i.e., over the tuples of 𝑅^⋈), and 𝑄𝑇𝑈 |= T. By definition, we have that |𝑅^⋈| = 𝑁(1 + 𝜌(𝑅, S)), where |𝑅| = 𝑁, and 𝜌(𝑅, S) is the loss of S with respect to 𝑅 (see (1)).

By limiting the minimization region to uniform distributions, the minimum can only increase. Therefore:

    J(T) = 𝐷𝐾𝐿(𝑃 || 𝑃T) ≤ 𝐷𝐾𝐿(𝑃 || 𝑄𝑇𝑈)        (22)

Evaluating the KL-divergence term on the right-hand side, and using the fact that 𝑄𝑇𝑈 is uniform, we get:

    𝐷𝐾𝐿(𝑃 || 𝑄𝑇𝑈) = Σ𝑟∈𝑅 𝑃(𝑟) log(𝑃(𝑟) / 𝑄𝑇𝑈(𝑟))        (23)
    = Σ𝑟∈𝑅 (1/𝑁) log( (1/𝑁) / (1/(𝑁 + 𝑁·𝜌(𝑅, S))) )        (24)
    = Σ𝑟∈𝑅 (1/𝑁) log(1 + 𝜌(𝑅, S))        (25)
    = log(1 + 𝜌(𝑅, S))        (26)

Hence, J(T) ≤ log(1 + 𝜌(𝑅, S)), and 𝜌(𝑅, S) ≥ 2^J(T) − 1. □
The following simple example shows that the lower bound of (18) is tight. That is, there exists a family of relation instances 𝑅, and a schema S, where J(S) = log(1 + 𝜌(𝑅, S)).

Example 4.1. Let Ω = {𝐴, 𝐵}, where D(𝐴) = {𝑎1, . . . , 𝑎𝑁} and D(𝐵) = {𝑏1, . . . , 𝑏𝑁}, with D(𝐴) ∩ D(𝐵) = ∅. Let

    𝑅 = {𝑡1 = (𝑎1, 𝑏1), . . . , 𝑡𝑁 = (𝑎𝑁, 𝑏𝑁)}        (27)

be a relation instance over Ω. By definition, 𝑃𝑅(𝑡𝑖) = 1/𝑁. Noting that 𝐻(𝐴) = 𝐻(𝐵) = 𝐻(𝐴𝐵) = log 𝑁 (see (2)), we have that 𝐼(𝐴; 𝐵) = log 𝑁 (see (4)). Now, consider the schema S = {{𝐴}, {𝐵}}, and let 𝑅^⋈ = Π𝐴(𝑅) ⋈ Π𝐵(𝑅). Clearly, |𝑅^⋈| = 𝑁², and 𝜌(𝑅, S) = 𝑁 − 1 (see (1)). In particular, we have that J(S) = 𝐼(𝐴; 𝐵) = log 𝑁 = log(1 + 𝜌(𝑅, S)), and this holds for every 𝑁 ≥ 2.
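Example 4.1 is easy to check programmatically. The short sketch below (ours, not from the paper) builds the diagonal relation of (27) over disjoint domains and confirms that log 𝑁 = log(1 + 𝜌(𝑅, S)); the encoding of the domain values is an assumption for illustration.

```python
from math import log2

def diagonal_example(N):
    """R = {(a_i, b_i)} over disjoint domains; schema S = {{A}, {B}}."""
    R = {(i, N + i) for i in range(N)}        # encode a_i = i, b_i = N + i
    A = {a for a, _ in R}
    B = {b for _, b in R}
    joined = {(a, b) for a in A for b in B}   # Pi_A(R) join Pi_B(R): N^2 tuples
    rho = (len(joined) - len(R)) / len(R)     # equation (1): rho = N - 1
    J = log2(N)                               # J(S) = I(A;B) = log N (in bits)
    return J, log2(1 + rho)
```

For every 𝑁 ≥ 2 the two returned values coincide, which is exactly the tightness claim.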
5 SPURIOUS TUPLES: AN UPPER BOUND BASED ON MUTUAL INFORMATION
In the previous section, we have shown that given a relation 𝑅 and an acyclic schema S defined by a join tree T, it holds that log(1 + 𝜌(𝑅, S)) ≥ J(T) (Lemma 4.1). In this section, we derive an upper bound on 𝜌(𝑅, S) in terms of an information-theoretic measure. To this end, recall that Theorem 2.2 shows that if {Ω1, . . . , Ω𝑚} are the bags of T, then J(T) ≤ Σ𝑖=2..𝑚 𝐼(Ω1:(𝑖−1); Ω𝑖:𝑚 | Δ𝑖). That is, J(T) is upper bounded by the sum of the conditional mutual information of the 𝑚 − 1 MVDs in the support of T, given by Δ𝑖 ↠ Ω1:(𝑖−1) | Ω𝑖:𝑚, for 𝑖 ∈ [2, 𝑚].¹ In this section, we relate this sum of conditional mutual information of the 𝑚 − 1 MVDs in the support of T to an approximate upper bound on 𝜌(𝑅, S), which holds with high probability.
To this end, we begin by relating the relative number of spurious tuples of the relation 𝑅 with respect to a schema S, that is, 𝜌(𝑅, S), with the spurious tuples of each of the MVDs in its support. Concretely, let 𝜙𝑖 := Δ𝑖 ↠ Ω1:(𝑖−1) | Ω𝑖:𝑚 be the 𝑖th MVD in the support of T. Then, the relative number of spurious tuples for 𝜙𝑖 is defined as

    𝜌(𝑅, 𝜙𝑖) := ( |ΠΩ1:(𝑖−1)(𝑅) ⋈ ΠΩ𝑖:𝑚(𝑅)| − |𝑅| ) / |𝑅|.        (28)

¹ More accurately, the MVD should be written as 𝜙𝑖 := Δ𝑖 ↠ (Ω1:(𝑖−1) \ Δ𝑖) | (Ω𝑖:𝑚 \ Δ𝑖), so that the bags are disjoint. However, it can easily be shown, using the chain rule of mutual information [9, Theorem 2.5.2], that 𝐼(Ω1:(𝑖−1); Ω𝑖:𝑚 | Δ𝑖) = 𝐼(Ω1:(𝑖−1) \ Δ𝑖; Ω𝑖:𝑚 \ Δ𝑖 | Δ𝑖), and so we adopt the simplified notation for the MVD.

Proposition 5.1. Let a relation 𝑅 be given, and let S be an acyclic schema over the attributes of 𝑅 with join tree T, whose support is the set of MVDs 𝜙𝑖 = Δ𝑖 ↠ Ω1:(𝑖−1) | Ω𝑖:𝑚, for 𝑖 ∈ [2, 𝑚]. Then,

    log(1 + 𝜌(𝑅, S)) ≤ Σ𝑖=2..𝑚 log(1 + 𝜌(𝑅, 𝜙𝑖)).        (29)
Proof. We prove the claim by induction on the number of MVDs (or nodes) in the schema. Let 𝑚 be the number of nodes (and 𝑚 − 1 the number of MVDs) in the schema. The base case 𝑚 − 1 = 1 is immediate. Assuming the claim holds for 𝑚 − 1 < 𝑘, we prove it for 𝑚 − 1 = 𝑘. Let T be a join tree representing 𝑘 MVDs (and hence 𝑘 + 1 nodes). Let 𝑢𝑘+1 be a leaf in this join tree with parent 𝑝 := parent(𝑢𝑘+1). Let T′ be the join tree where nodes 𝑢𝑘+1 and 𝑝 are merged into the node 𝑢 with Ω(𝑢) = Ω(𝑢𝑘+1) ∪ Ω(𝑝). Hence, by the induction hypothesis,

    1 + 𝜌(𝑅, T′) ≤ Π𝑖=2..𝑘 (1 + 𝜌(𝑅, 𝜙𝑖)).        (30)

Now, let 𝑅′ = 𝑅[Ω1] ⋈ ··· ⋈ 𝑅[Ω𝑘]. Consider the MVD 𝜙𝑘+1 = Ω(𝑢𝑘+1) ∩ Ω(𝑝) ↠ Ω(𝑢𝑘+1) | Ω1:𝑘. Then, 𝑅″ := ΠΩ1:𝑘(𝑅′) ⋈ ΠΩ𝑘+1(𝑅′), and |𝑅″| ≤ |𝑅′| · [1 + 𝜌(𝑅, 𝜙𝑘+1)]. By (30),

    |𝑅′| ≤ |𝑅| · Π𝑖=2..𝑘 (1 + 𝜌(𝑅, 𝜙𝑖)),        (31)

and hence |𝑅″| ≤ |𝑅| · Π𝑖=2..𝑘+1 [1 + 𝜌(𝑅, 𝜙𝑖)], which proves the claim. □
Proposition 5.1 reduces the problem of upper bounding log[1 + 𝜌(𝑅, S)] to bounding each of the terms log[1 + 𝜌(𝑅, 𝜙𝑖)] in (29), each of them corresponding to the relative number of spurious tuples of one of the 𝑚 − 1 MVDs in the support of T. Considering an arbitrary MVD, which we henceforth denote for simplicity by 𝜙 := 𝐶 ↠ 𝐴 | 𝐵, Lemma 4.1 implies the lower bound log[1 + 𝜌(𝑅, 𝜙)] ≥ 𝐼(𝐴; 𝐵 | 𝐶), since an MVD is a simple instance of an acyclic schema. However, obtaining an upper bound on log[1 + 𝜌(𝑅, 𝜙)] in terms of 𝐼(𝐴; 𝐵 | 𝐶) is challenging, because the mutual information 𝐼(𝐴; 𝐵 | 𝐶) varies wildly for an MVD 𝜙 even when 𝜌(𝑅, 𝜙) and the domain sizes 𝑑𝐴, 𝑑𝐵 and 𝑑𝐶 remain constant (where 𝑑𝐴 := |Π𝐴(𝑅)|, and similarly for 𝑑𝐵 and 𝑑𝐶). Figure 1 illustrates this phenomenon in the simple case in which 𝑑𝐶 = 1, and so 𝐶 is a degenerate random variable, and 𝑑𝐴 = 𝑑𝐵. In other words, the value 𝐼(𝐴; 𝐵 | 𝐶) depends on the actual contents of the relation instance 𝑅. However, while 𝐼(𝐴; 𝐵 | 𝐶) might not be an accurate upper bound on log[1 + 𝜌(𝑅, 𝜙)] for an arbitrary relation, it may hold that it is an approximate upper bound for most relations. Therefore, we next propose a random relation model, in which the tuples of the relation 𝑅 are chosen at random. We then establish an upper bound on log[1 + 𝜌(𝑅, 𝜙)] that holds with high probability over this randomly chosen relation.
Definition 5.2 (Random relation model). Let Ω := {𝑋1, . . . , 𝑋𝑛} be a set of attributes with domains D(𝑋1), . . . , D(𝑋𝑛), and assume w.l.o.g. that D(𝑋𝑖) := [𝑑𝑖] for {𝑑𝑖}𝑖=1..𝑛 ⊆ N₊. Let 𝑁 ∈ N₊ be given