Coresets for Relational Data and The Applications
Jiaxiang Chen¹, Qingyuan Yang², Ruomin Huang¹, Hu Ding²
1School of Data Science
2School of Computer Science and Technology
University of Science and Technology of China
{czar, yangqingyuan, hrm}@mail.ustc.edu.cn, huding@ustc.edu.cn
Abstract
A coreset is a small set that approximately preserves the structure of the original input data set. Running an algorithm on a coreset instead of the full data therefore reduces the total computational complexity. Conventional coreset techniques assume that the input data set is available to process explicitly. However, this assumption may not hold in real-world scenarios. In this paper, we consider the problem of coreset construction over relational data. Namely, the data is decoupled into several relational tables, and it could be very expensive to directly materialize the data matrix by joining the tables. We propose a novel approach called "aggregation tree with pseudo-cube" that builds a coreset from the bottom up. Moreover, our approach neatly circumvents several troublesome issues of relational learning problems [Khamis et al., PODS 2019]. Under some mild assumptions, we show that our coreset approach can be applied to machine learning tasks such as clustering, logistic regression and SVM.
1 Introduction
With the rapid development of information technologies, we are often confronted with large-scale data in many practical scenarios. To reduce the computational complexity, a number of data summarization techniques have been proposed [47]. The coreset is a well-known sampling technique for compressing large-scale data sets [44, 28]. Roughly speaking, for a given large data set P, the coreset approach constructs a new (weighted) set P̃ that approximately preserves the structure of P, while the size |P̃| is much smaller than |P|. Therefore, if we run an existing algorithm on P̃ rather than P, the runtime can be significantly reduced and the computational quality (e.g., the optimization objectives of machine learning tasks [54]) can be approximately guaranteed. In the past decades, coreset techniques have been successfully applied to many optimization problems, such as k-means clustering [28, 19], logistic regression [34, 45], Gaussian mixture models [40], and continual learning [11].
Usually we assume that the input data is stored in a matrix so that the coreset algorithm can easily access the whole data. But this assumption may not hold in real-world scenarios. For example, according to the Kaggle 2017 "State of Data Science Survey", 65.5% of the data sets are relational data. Namely, the data is decoupled into several relational tables, and we cannot obtain the data matrix (also called the "design matrix") unless we join all the tables. Relational databases have a very large and fast-growing market in this big data era, expected to reach USD 122.38 billion by 2027 [49]. Relational databases also enjoy several favorable properties [55]: e.g., the decoupled relational data can save a large amount of space, and it is friendly to several common statistical queries such as the aggregate functions COUNT, SUM and AVG in SQL. In recent years, the study of
“in-database learning” has gained a significant amount of attention in the areas of machine learning
and data management [46, 38, 23, 37].
However, constructing a coreset for a relational database is much more challenging. Suppose there are s > 1 relational tables and the size of the largest table is N. Let n be the number of tuples obtained by joining all the tables; then n can be as large as O(N^s) [7] (we illustrate an instance of joining two tables in Table 1). Obviously, joining all the tables and training a model on data of that scale would require prohibitively large time and space. To remedy this issue, a straightforward idea is to complete the computing task without explicitly constructing the design matrix. Unfortunately, even for some very simple computational tasks, the implementations on relational data can be NP-hard.
To see why these implementations are so hard, consider the following simple clustering assignment task (which is often used as a building block for coreset construction [28, 19]): suppose the k cluster centers have already been given, and we want to determine the size of each cluster. Note that if we have the design matrix, this task is trivial (just assign each point to its nearest center, and then calculate the cluster sizes based on the assignment). However, if the data is decoupled into several relational tables, the problem of computing the assignment can be quite complicated. Khamis et al. [36] recently defined the problem of answering Functional Aggregate Queries (FAQ) in which some of the input factors are defined by a collection of Additive Inequalities² between variables (denoted by FAQ-AI(m), where m is the number of additive inequalities). The problem of computing the clustering assignment can be formulated as an instance of FAQ-AI(k−1) (to determine the cluster membership of one point (tuple), we need to certify that its distance to its own cluster center is lower than the distances to the other k−1 centers). Recently, Khamis et al. [1] showed that evaluating even an FAQ-AI(1) instance is #P-hard, and approximating an FAQ-AI(m) instance within any finite factor is NP-hard for any m > 1. Moseley et al. [43] also showed that even approximating the clustering assignment to any factor is NP-hard for k ≥ 3. An intuitive understanding is that a query satisfying those inequalities requires access to the information of each tuple on all the dimensions, which is almost equivalent to constructing the entire design matrix.
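To make the contrast concrete, the following is a minimal sketch (our own illustration, not part of the referenced algorithms) of the trivial case mentioned above: computing the cluster sizes when the design matrix is explicitly available. The centers below are made-up toy values; the point of the hardness results above is that this short routine has no efficient counterpart when the rows only exist implicitly as a join.

```python
import numpy as np

def cluster_sizes(design_matrix: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """Given an explicit n x d design matrix and k centers (k x d),
    assign each point to its nearest center and return the k cluster sizes."""
    # Pairwise squared Euclidean distances between points and centers (n x k).
    dists = ((design_matrix[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assignment = dists.argmin(axis=1)               # nearest center of each point
    return np.bincount(assignment, minlength=len(centers))

# Toy usage: the 6 joined tuples of Table 1 as points in R^3, with k = 2 centers.
P = np.array([[1, 1, 1], [1, 1, 4], [2, 1, 1], [2, 1, 4], [3, 3, 1], [3, 3, 3]], dtype=float)
C = np.array([[1.5, 1.0, 1.0], [3.0, 3.0, 2.0]])
print(cluster_sizes(P, C))                          # array([3, 3])
```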
1.1 Our Contributions
We consider Empirical Risk Minimization (ERM) problems in machine learning [57]. Let R^d be the data space and P = {p_1, p_2, ..., p_n} ⊂ R^d be the training set. We assume that this set P is not explicitly given; instead, it is decoupled into s relational tables (the formal definitions are given in Section 2). The goal is to learn the hypothesis θ (from the hypothesis space H) so as to minimize the empirical risk

    F(θ) = (1/n) Σ_{i=1}^{n} f(θ, p_i),        (1)

where f(·, ·) is a non-negative real-valued loss function.
Several coreset techniques for relational data have been studied very recently. Samadian et al. [50] showed that simple uniform sampling yields a coreset for regularized loss minimization problems (note that uniform sampling can be efficiently implemented for relational databases [60]). Their sample size is Θ(n^κ · dim), where κ ∈ (0, 1) (usually κ is set to 1/2 in practice) and dim is the VC dimension of the loss function (usually Θ(d)). Thus the size can be Θ(√n · d), which is too large especially for high-dimensional data. Curtin et al. [23] constructed a coreset for k-means clustering by building a weighted grid among the input s tables; the coreset yields a 9-approximation for the clustering objective. Independently, Ding et al. [25] also applied the "grid" idea to achieve a (9 + ε)-approximation for k-means clustering with distributed dimensions (attributes). The major drawback of this "grid" idea is that the resulting coreset size can be as large as k^s, which is exponential in the number of tables s. Moseley et al. [43] recently proposed a coreset for k-means clustering on relational data by using the k-means++ initialization method [6]; however, their approximation ratio is too high (> 400) for real-world applications. Moreover, the coreset techniques proposed in [23, 25, 43] can only handle k-means clustering, and it is unclear whether they can be extended to other machine learning problems of the form (1).
² Each additive inequality defines a constraint for the query; for example, in SVM, the "positive" class is defined by the inequality ⟨x, ω⟩ > 0, where ω is the normal vector of the separating hyperplane.
In this paper, we aim to develop an efficient and general coreset construction method for optimizing (1) on relational data. First, we observe that real-world data sets often have low intrinsic dimensions (e.g., they lie near a low-dimensional manifold) [10]. Our coreset approach is inspired by the well-known Gonzalez algorithm for k-center clustering [30]. The algorithm greedily selects k points from the input point set, and the k balls centered at these selected points with some appropriate radius cover the whole input; if the input data has a low intrinsic dimension (e.g., the doubling dimension in Definition 3), the radius is no larger than an upper bound depending on k. Therefore, the set of cluster centers (each center with weight equal to the corresponding cluster size) can serve as a coreset for the ERM problem (1), where the error introduced by the coreset is determined by the radius. This k-center clustering based intuition has been used before to construct coresets for various applications [35, 53, 20].

However, we have to address two challenging issues to realize this approach on relational data. First, the greedy selection procedure for the k cluster centers cannot be directly implemented on relational data. Second, it is hard to obtain the size of each cluster, especially when the balls overlap (the k balls can partition the space into as many as 2^k parts). In fact, both of these issues can be regarded as instances of FAQ-AI(k−1), which are NP-hard to compute [1, 43].

Our approach relies on a novel and easy-to-implement structure called the "aggregation tree with pseudo-cube". We build the tree from the bottom up, where each relational table is a leaf. Informally speaking, each node consists of a set of k cluster centers obtained by merging its two children; each center also associates with a "pseudo-cube" region in the space. Eventually, the root of the tree yields a qualified coreset for the implicit design matrix. The aggregation manner helps us avoid building the high-complexity grid coreset of [23, 25]. Also, with the aid of the "pseudo-cube", we can efficiently estimate the size of each cluster without tackling the troublesome FAQ-AI counting issue [1, 43].

Compared with previous coreset methods, our approach enjoys several significant advantages. For example, it can handle more general applications: for most ERM problems of the form (1), under some mild assumptions, we can construct their coresets by applying our approach. Also, our coresets have much lower complexities. It is worth emphasizing that our coreset size is independent of the dimensionality d; instead, it only depends on the doubling dimension of the input data.
1.2 Other Related Works
The earliest research on relational data was proposed by Codd [17]. A fundamental topic closely related to machine learning on relational data is how to take uniform and independent samples from the full join results [13]. Recently, Zhao et al. [60] proposed a random walk approach for this problem with acyclic joins; Chen and Yi [14] generalized the result to some specific cyclic join scenarios.

Besides the aforementioned works [46, 38, 23, 25, 36, 1, 43], there also exist a number of results on other machine learning problems over relational data. Khamis et al. [2] and Yang et al. [58] considered training SVMs and SVMs with Gaussian kernels on relational data, respectively. For the problem of linear regression on relational data, it is common to use factorization techniques [51, 3]. Algorithms for training Gaussian mixture models and neural networks on relational data have also been studied recently [15, 16]. We also refer the reader to the survey [52] for more details on learning over relational data.
2 Preliminaries
Suppose the training set for problem (1) is a set P of n points in R^d, and it is decoupled into s relational tables {T_1, ..., T_s} with the feature (attribute) sets {D_1, ..., D_s}. Let D = ∪_i D_i; therefore the size |D| = d. Each table T_i can be viewed as the projection of P onto the subspace spanned by D_i. To materialize the design matrix of P, a straightforward way is to compute the join over these s tables. With a slight abuse of notation, we still use "P" to denote the design matrix. We also let [s] = {1, 2, ..., s} for simplicity.
T1:                T2:                T1 ⋈ T2:
 d1  d2             d2  d3             d1  d2  d3
  1   1              1   1              1   1   1
  2   1              1   4              1   1   4
  2   2              3   1              2   1   1
  3   3              3   3              2   1   4
                                        3   3   1
                                        3   3   3

Table 1: An illustration for the join over two tables.
Definition 1 (Join). The join over the given s tables returns an n × d design matrix P = T_1 ⋈ ··· ⋈ T_s, such that for any vector (point) p ∈ R^d, p ∈ P if and only if ∀i ∈ [s], Proj_{D_i}(p) ∈ T_i, where Proj_{D_i}(p) is the projection of p onto the subspace spanned by the features of D_i.
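As a small illustration (ours, not part of the paper's implementation), the following sketch materializes the join of two tables by matching on their shared attributes, reproducing the result shown in Table 1. It is only meant to show the semantics of Definition 1; materializing the join like this is exactly what we want to avoid for large inputs.

```python
from itertools import product

def join(t1, t2, attrs1, attrs2):
    """Natural join of two tables given as lists of tuples, with attribute name lists."""
    shared = [a for a in attrs1 if a in attrs2]
    out_attrs = attrs1 + [a for a in attrs2 if a not in attrs1]
    rows = []
    for r1, r2 in product(t1, t2):
        m1, m2 = dict(zip(attrs1, r1)), dict(zip(attrs2, r2))
        if all(m1[a] == m2[a] for a in shared):     # the tuples agree on shared attributes
            merged = {**m1, **m2}
            rows.append(tuple(merged[a] for a in out_attrs))
    return out_attrs, rows

# The two tables of Table 1.
T1 = [(1, 1), (2, 1), (2, 2), (3, 3)]               # attributes (d1, d2)
T2 = [(1, 1), (1, 4), (3, 1), (3, 3)]               # attributes (d2, d3)
attrs, P = join(T1, T2, ["d1", "d2"], ["d2", "d3"])
print(attrs, P)                                     # ['d1','d2','d3'] and the 6 tuples of Table 1
```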
Table 1 is a simple illustration of joining two relational tables. To have a more intuitive understanding of the join, we can generate a hypergraph G = (V, E). Each vertex of V corresponds to an individual feature of D; each hyperedge of E corresponds to an individual relational table T_i, and it connects all the vertices (features) of D_i. Then we can define acyclic and cyclic join queries. A join is acyclic if the hypergraph G = (V, E) can be reduced to the empty hypergraph by performing the following operation iteratively: if there exists a vertex that connects with only one hyperedge, remove this vertex together with the hyperedge. Otherwise, the join is cyclic. A cyclic query is usually extremely difficult and has a much higher complexity than an acyclic query; for example, for a cyclic join, it is even NP-hard to determine whether the join is empty or not [42]. Fortunately, most real-world joins are acyclic, which allows us to take full advantage of relational data. Similar to most previous articles on relational learning [1, 2], we also assume in this paper that the join of the given tables is acyclic.
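A minimal sketch (ours, for illustration) of the reduction described above; the standard GYO-style reduction has additional cases, so here we implement exactly the operation stated in the text and declare the join acyclic once no hyperedge remains.

```python
def is_acyclic(hyperedges):
    """hyperedges: one attribute set per table, e.g. [{'d1','d2'}, {'d2','d3'}]."""
    edges = [set(e) for e in hyperedges]
    changed = True
    while changed and edges:
        changed = False
        for i, e in enumerate(edges):
            # A vertex that appears in only this hyperedge triggers the removal step.
            lonely = [v for v in e if sum(v in other for other in edges) == 1]
            if lonely:
                edges.pop(i)        # remove the vertex together with its hyperedge
                changed = True
                break
    return len(edges) == 0

print(is_acyclic([{"d1", "d2"}, {"d2", "d3"}]))            # True  (the join of Table 1)
print(is_acyclic([{"a", "b"}, {"b", "c"}, {"a", "c"}]))    # False (a triangle join is cyclic)
```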
Counting is one of the most common aggregation queries on relational data. It returns the number of tuples that satisfy some particular conditions. Counting on an acyclic join (without additive inequalities) can be implemented efficiently via dynamic programming [26]. But when additive inequalities are included, the counting problem can be much more challenging, and we may even need to materialize the whole design matrix [1]. For example, to count the tuples of {t ∈ T_1 ⋈ T_2 : ‖t‖^2 ≤ 25} in Table 1, we need to check the whole design matrix T_1 ⋈ T_2 to select the tuples satisfying the constraint "‖t‖^2 ≤ 25".
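To illustrate the contrast, here is a small sketch (ours, under the assumption of a simple two-table acyclic join) of counting the join size without materializing it: group each table by the shared attribute and multiply the group counts. Once an inequality such as ‖t‖^2 ≤ 25 is added, this factorized counting no longer applies, which is exactly the difficulty emphasized above.

```python
from collections import Counter

# Table 1 again: T1 over (d1, d2) and T2 over (d2, d3), joined on d2.
T1 = [(1, 1), (2, 1), (2, 2), (3, 3)]
T2 = [(1, 1), (1, 4), (3, 1), (3, 3)]

# Count the join size WITHOUT materializing the join:
# for each value of the shared attribute d2, multiply the per-table frequencies.
cnt1 = Counter(d2 for _, d2 in T1)      # d2 -> number of T1 tuples with that value
cnt2 = Counter(d2 for d2, _ in T2)      # d2 -> number of T2 tuples with that value
join_size = sum(cnt1[v] * cnt2[v] for v in cnt1)
print(join_size)                        # 6, matching Table 1, in O(|T1| + |T2|) time
```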
For the machine learning problems studied in this paper, we make the following assumption on the loss function in (1).

Assumption 1 (Continuity). There exist real constants α, z ≥ 0 and β ∈ [0, 1), such that for any p, q ∈ R^d and any θ in the hypothesis space, we have

    |f(θ, p) − f(θ, q)| ≤ α ‖p − q‖^z + β |f(θ, q)|,        (2)

where ‖·‖ is the Euclidean norm.
Remark 1. Different machine learning problems have different values of α, β and z. For example, for k-means clustering, we have z = 2, β = ε and α = O(1/ε) (where ε can be any small number in (0, 1)); for k-median clustering, we have z = 1, β = 0 and α = 1. Actually, for a large number of problems, β is small or even 0; e.g., logistic regression and soft-margin SVM have β = 0.
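As a quick sanity check of Remark 1 (our own derivation, covering only the k-median case, where f(θ, p) = min_{c∈θ} ‖p − c‖ for a set θ of k centers), the constants follow directly from the triangle inequality:

```latex
% k-median loss: f(\theta,p) = \min_{c\in\theta}\|p-c\|; let c_p, c_q be the minimizers for p, q.
\begin{align*}
f(\theta, p) &\le \|p - c_q\| \le \|q - c_q\| + \|p - q\| = f(\theta, q) + \|p - q\|,\\
f(\theta, q) &\le \|q - c_p\| \le \|p - c_p\| + \|p - q\| = f(\theta, p) + \|p - q\|,\\
\Longrightarrow\quad |f(\theta, p) - f(\theta, q)| &\le 1\cdot\|p - q\|^{1} + 0\cdot|f(\theta, q)|.
\end{align*}
```

That is, Assumption 1 holds with z = 1, β = 0 and α = 1, matching Remark 1; the k-means constants can be obtained similarly with an extra AM-GM step, which we do not reproduce here.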
We define the following coreset with both multiplicative error and additive error. We are aware that standard coresets usually only have multiplicative errors [44, 28]. However, the deviation bounds for ERM problems with a finite training data set only yield additive error guarantees, and the additive error is usually acceptable in practice [8, 56]; hence coresets with additive error have also been proposed recently [9]. Let P = {p_1, p_2, ..., p_n} be the training set (which is not explicitly given), and denote by ∆ the diameter of P (i.e., the largest pairwise distance in P).

Definition 2 ((ε_1, ε_2)_z-Coreset). Suppose ε_1 ∈ (0, 1) and ε_2, z > 0. The (ε_1, ε_2)_z-coreset, denoted by P̃, is a point set {c_1, ..., c_{|P̃|}} with a weight vector W = [w_1, w_2, ..., w_{|P̃|}] satisfying

    F̃(θ) ∈ (1 ± ε_1) F(θ) ± ε_2 ∆^z,        (3)

for any θ in the hypothesis space H, where F̃(θ) = (1 / Σ_{i=1}^{|P̃|} w_i) Σ_{i=1}^{|P̃|} w_i f(θ, c_i).
Usually we want the coreset size |P̃| to be much smaller than |P|, so that running our algorithm on P̃ saves a large amount of running time.
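The following small sketch (ours, using the k-median loss as a stand-in for f) shows how the weighted objective F̃(θ) of Definition 2 is evaluated and compared against the exact F(θ). The coreset points and weights here are placeholders chosen for illustration, not the output of the construction in Section 3.

```python
import numpy as np

def kmedian_loss(theta, p):
    """f(theta, p): distance from point p to its nearest center in theta (k x d)."""
    return np.min(np.linalg.norm(theta - p, axis=1))

def F(theta, P):
    """Exact empirical risk (1): the average loss over the full data set."""
    return np.mean([kmedian_loss(theta, p) for p in P])

def F_tilde(theta, C, w):
    """Coreset estimate of the risk: the weighted average over coreset points."""
    losses = np.array([kmedian_loss(theta, c) for c in C])
    return np.dot(w, losses) / np.sum(w)

rng = np.random.default_rng(0)
P = rng.normal(size=(1000, 3))                          # stand-in for the (implicit) design matrix
C = P[rng.choice(len(P), size=20, replace=False)]       # placeholder coreset points
w = np.full(20, len(P) / 20.0)                          # placeholder weights
theta = rng.normal(size=(2, 3))                         # two candidate centers
print(F(theta, P), F_tilde(theta, C, w))                # the two values should be close
```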
As mentioned in Section 1.1, we also assume that the training set P has a low intrinsic dimension. The "doubling dimension" is a widely used measure of the intrinsic dimension of a given data set [41]. Roughly speaking, it measures the growth rate of the given data in the space. Recently it has gained a lot of attention in the study of its relation to coresets and machine learning problems [39, 33, 18]. For any c ∈ R^d and r ≥ 0, we use B(c, r) to denote the ball centered at c with radius r.

Definition 3 (Doubling Dimension). The doubling dimension of a data set P is the smallest number ρ > 0 such that for any p ∈ P and r ≥ 0, P ∩ B(p, 2r) is always covered by the union of at most 2^ρ balls of radius r.

It is easy to obtain the following claim by recursively applying Definition 3 log(∆/r) times (each application halves the radius while multiplying the number of balls by at most 2^ρ).

Claim 1. For a given data set P and radius r > 0, if P has doubling dimension ρ, then P can be covered by (∆/r)^ρ balls of radius r.
3 Relational Coreset via Aggregation Tree
In this section, we present an efficient coreset construction method for relational data. First, we introduce the technique of the aggregation tree with pseudo-cube. There is actually an obstacle to obtaining the coreset from the aggregation tree: the weight of each point of the coreset is determined by the size of its corresponding pseudo-cube; however, the pseudo-cubes may overlap, and it is challenging to separate them in the relational setting. To explain our idea more clearly, we temporarily ignore this "overlap" issue and show the "ideal" coreset construction result in Section 3.1. Then we show that the overlap issue can be efficiently solved by using random sampling and some novel geometric insights in Section 3.2.
Recall that our input is a set of s relational tables {T_1, ..., T_s} with the feature (attribute) sets {D_1, ..., D_s}. Also D = ∪_i D_i and |D| = d. For ease of presentation, we generate a new feature set D̂_i for each D_i as follows: initially, D̂_1 = D_1; then, from i = 2 to s, we let D̂_i = D_i \ (∪_{l=1}^{i−1} D̂_l). It is easy to see that ∪_i D̂_i = D, and ∀i, j ∈ [s], i ≠ j, D̂_i ∩ D̂_j = ∅. With a slight abuse of notation, we also use D̂_i to represent the subspace spanned by the features of D̂_i (so the Cartesian product D̂_1 × ··· × D̂_s is the whole space R^d). For any point set Q ⊂ R^d (resp., any point c ∈ R^d) and any subspace H of R^d, we use Proj_H(Q) (resp., Proj_H(c)) to denote the projection of Q (resp., c) onto H.
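A tiny sketch (ours) of this disjointification step, which strips from each attribute set the attributes already claimed by earlier tables:

```python
def disjointify(feature_sets):
    """Given D_1, ..., D_s (lists of attribute names), return hat{D}_1, ..., hat{D}_s:
    hat{D}_i keeps only the attributes of D_i not already in hat{D}_1, ..., hat{D}_{i-1}."""
    seen, hat = set(), []
    for D_i in feature_sets:
        hat_i = [a for a in D_i if a not in seen]   # preserve the original attribute order
        seen.update(hat_i)
        hat.append(hat_i)
    return hat

print(disjointify([["d1", "d2"], ["d2", "d3"]]))    # [['d1', 'd2'], ['d3']]
```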
3.1 Coreset Construction
Since our coreset construction algorithm is closely related to Gonzalez's algorithm for k-center clustering [30], we briefly introduce it first. Initially, it arbitrarily selects a point from P, say c_1, and sets C = {c_1}; then it iteratively selects the point farthest from the set C, i.e., arg max_{p∈P} min_{q∈C} ‖p − q‖, and adds it to C. After k iterations, we obtain k points in C (suppose C = {c_1, ..., c_k}). Gonzalez's algorithm yields a 2-approximation for k-center clustering: if the optimal radius of k-center clustering on P is r_opt, then P can be covered by the k balls B(c_1, 2r_opt), ..., B(c_k, 2r_opt). Together with Claim 1, for any given radius r > 0, if we run Gonzalez's algorithm on P with k = (2∆/r)^ρ, then P can be covered by the balls B(c_1, r), ..., B(c_k, r) (since r_opt ≤ r/2). If we set r small enough, the obtained C can be a good approximation (informally, a coreset) for P.
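For intuition, the following sketch (ours) runs exactly this procedure on an explicit point set and weights each selected center by its cluster size; as stressed below, this direct implementation is precisely what is unavailable in the relational setting, where P only exists implicitly.

```python
import numpy as np

def gonzalez_coreset(P, k, seed=0):
    """Greedy k-center selection on an explicit n x d array P; returns the k centers,
    their cluster sizes (the coreset weights), and the covering radius."""
    rng = np.random.default_rng(seed)
    centers = [P[rng.integers(len(P))]]                  # arbitrary first center
    dists = np.linalg.norm(P - centers[0], axis=1)       # distance of each point to C
    for _ in range(k - 1):
        far = int(np.argmax(dists))                      # farthest point from the current C
        centers.append(P[far])
        dists = np.minimum(dists, np.linalg.norm(P - centers[-1], axis=1))
    C = np.stack(centers)
    assignment = np.argmin(
        np.linalg.norm(P[:, None, :] - C[None, :, :], axis=2), axis=1)
    weights = np.bincount(assignment, minlength=k)       # cluster sizes = coreset weights
    return C, weights, dists.max()                       # covering radius is at most 2 * r_opt

P = np.random.default_rng(1).normal(size=(500, 2))
C, w, radius = gonzalez_coreset(P, k=16)
print(w.sum(), radius)       # all 500 points accounted for; the radius shrinks as k grows
```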
However, as mentioned in Section 1, such a k-center clustering approach cannot be implemented on relational data, since it is equivalent to an instance of FAQ-AI(k−1). A straightforward idea to remedy this issue is to run Gonzalez's algorithm in each subspace D̂_i, and then compute the Cartesian product to build a grid of size k^s as in the methods of [23, 25]. But this approach results in an exponentially large coreset (e.g., with s = 8 tables and k = 1000, the grid contains 1000^8 = 10^24 points). Below, we introduce our algorithm, which achieves a coreset of quasi-polynomial size.