Coresets for Relational Data and The Applications
Jiaxiang Chen¹, Qingyuan Yang², Ruomin Huang¹, Hu Ding²
1School of Data Science
2School of Computer Science and Technology
University of Science and Technology of China
{czar, yangqingyuan, hrm}@mail.ustc.edu.cn, huding@ustc.edu.cn
Abstract
A coreset is a small set that approximately preserves the structure of the original input data set. Running an algorithm on a coreset instead of the full data therefore reduces the total computational complexity. Conventional coreset techniques assume that the input data set is available to process explicitly. However, this assumption may not hold in real-world scenarios. In this paper, we consider the problem of coreset construction over relational data. Namely, the data is decoupled into several relational tables, and it could be very expensive to directly materialize the data matrix by joining the tables. We propose a novel approach called "aggregation tree with pseudo-cube" that builds a coreset from the bottom up. Moreover, our approach neatly circumvents several troublesome issues of relational learning problems [Khamis et al., PODS 2019]. Under some mild assumptions, we show that our coreset approach can be applied to machine learning tasks such as clustering, logistic regression and SVM.
1 Introduction
With the rapid development of information technologies, we are often confronted with large-scale data in many practical scenarios. To reduce the computational complexity, a number of data summarization techniques have been proposed [47]. The coreset is a well-known sampling technique for compressing large-scale data sets [44, 28]. Roughly speaking, for a given large data set P, the coreset approach constructs a new (weighted) set P̃ that approximately preserves the structure of P, while the size |P̃| is much smaller than |P|. Therefore, if we run an existing algorithm on P̃ rather than P, the runtime can be significantly reduced and the computational quality (e.g., the optimization objectives of machine learning tasks [54]) can be approximately guaranteed. In the past decades, coreset techniques have been successfully applied to many optimization problems, such as k-means clustering [28, 19], logistic regression [34, 45], Gaussian mixture models [40], and continual learning [11].
Usually we assume that the input data is stored in a matrix so that the coreset algorithm can easily access the whole data. But this assumption may not hold in real-world scenarios. For example, according to the Kaggle 2017 "State of Data Science Survey", 65.5% of the data sets are relational data. Namely, the data is decoupled into several relational tables, and we cannot obtain the data matrix (also called the "design matrix") unless we join all the tables. Relational databases have a very large and fast-growing market in this big data era, expected to reach USD 122.38 billion by 2027 [49]. Relational databases also enjoy several favorable properties [55]: e.g., the decoupled relational data can save a large amount of space, and it is friendly to several common statistical queries such as the aggregate functions COUNT, SUM and AVG in SQL. In recent years, the study of
“in-database learning” has gained a significant amount of attention in the areas of machine learning
and data management [46, 38, 23, 37].
However, constructing a coreset for a relational database is much more challenging. Suppose there are s > 1 relational tables and the size of the largest table is N. Let n be the number of tuples obtained by joining all the tables; then n can be as large as O(N^s) [7] (we illustrate an instance of joining two tables in Table 1). Obviously, joining all the tables and training a model on data of that scale would require prohibitively large time and space. To remedy this issue, a straightforward idea is to complete the computing task without explicitly constructing the design matrix. Unfortunately, even for some very simple computational tasks, the implementations on relational data can be NP-hard.
To see why these implementations are so hard, consider the following simple clustering assignment task (which is often used as a building block for coreset construction [28, 19]): suppose the k cluster centers have already been given, and we want to determine the size of each cluster. Note that if we have the design matrix, this task is trivial (just assign each point to its nearest center, and then calculate the cluster sizes based on the assignment). However, if the data is decoupled into several relational tables, the problem of computing the assignment can be quite complicated. Khamis et al. [36] recently defined the problem of answering Functional Aggregate Queries (FAQ) in which some of the input factors are defined by a collection of Additive Inequalities² between variables (denoted by FAQ-AI(m), where m is the number of additive inequalities). The problem of computing the clustering assignment can be formulated as an instance of FAQ-AI(k−1) (to determine the cluster membership of one point (tuple), we need to certify that its distance to its own cluster center is lower than the distances to the other k−1 centers). Recently, Khamis et al. [1] showed that evaluating even an FAQ-AI(1) instance is #P-hard, and approximating an FAQ-AI(m) instance within any finite factor is NP-hard for any m > 1. Moseley et al. [43] also showed that even approximating the clustering assignment to any factor is NP-hard for k ≥ 3. An intuitive understanding is that a query satisfying those inequalities requires access to the information of each tuple on all the dimensions, which is almost equivalent to constructing the entire design matrix.
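To make the contrast concrete, the following is a minimal sketch (our own illustration, not part of the referenced algorithms) of the trivial case mentioned above: computing the cluster sizes when the design matrix is explicitly available. The centers below are made-up toy values; the point of the hardness results above is that this short routine has no efficient counterpart when the rows only exist implicitly as a join.

```python
import numpy as np

def cluster_sizes(design_matrix: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """Given an explicit n x d design matrix and k centers (k x d),
    assign each point to its nearest center and return the k cluster sizes."""
    # Pairwise squared Euclidean distances between points and centers (n x k).
    dists = ((design_matrix[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assignment = dists.argmin(axis=1)               # nearest center of each point
    return np.bincount(assignment, minlength=len(centers))

# Toy usage: the 6 joined tuples of Table 1 as points in R^3, with k = 2 centers.
P = np.array([[1, 1, 1], [1, 1, 4], [2, 1, 1], [2, 1, 4], [3, 3, 1], [3, 3, 3]], dtype=float)
C = np.array([[1.5, 1.0, 1.0], [3.0, 3.0, 2.0]])
print(cluster_sizes(P, C))                          # array([3, 3])
```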
1.1 Our Contributions
We consider Empirical Risk Minimization (ERM) problems in machine learning [57]. Let R^d be the data space and P = {p_1, p_2, ..., p_n} ⊂ R^d be the training set. We assume that this set P is not explicitly given; instead, it is decoupled into s relational tables (the formal definitions are given in Section 2). The goal is to learn the hypothesis θ (from the hypothesis space H) so as to minimize the empirical risk

    F(θ) = (1/n) Σ_{i=1}^{n} f(θ, p_i),        (1)

where f(·, ·) is a non-negative real-valued loss function.
Several coreset techniques for relational data have been studied very recently. Samadian et al. [50] showed that simple uniform sampling yields a coreset for regularized loss minimization problems (note that uniform sampling can be efficiently implemented for relational databases [60]). Their sample size is Θ(n^κ · dim), where κ ∈ (0, 1) (usually κ is set to 1/2 in practice) and dim is the VC dimension of the loss function (usually Θ(d)). Thus the size can be Θ(√n · d), which is too large especially for high-dimensional data. Curtin et al. [23] constructed a coreset for k-means clustering by building a weighted grid among the input s tables; the coreset yields a 9-approximation for the clustering objective. Independently, Ding et al. [25] also applied the "grid" idea to achieve a (9 + ε)-approximation for k-means clustering with distributed dimensions (attributes). The major drawback of this "grid" idea is that the resulting coreset size can be as large as k^s, which is exponential in the number of tables s. Moseley et al. [43] recently proposed a coreset for k-means clustering on relational data by using the k-means++ initialization method [6]; however, their approximation ratio is too high (> 400) for real-world applications. Moreover, the coreset techniques proposed in [23, 25, 43] can only handle k-means clustering, and it is unclear whether they can be extended to other machine learning problems of the form (1).
² Each additive inequality defines a constraint for the query; for example, in SVM, the "positive" class is defined by the inequality ⟨x, ω⟩ > 0, where ω is the normal vector of the separating hyperplane.
In this paper, we aim to develop an efficient and general coreset construction method for optimizing (1) on relational data. First, we observe that real-world data sets often have low intrinsic dimensions (e.g., they lie near a low-dimensional manifold) [10]. Our coreset approach is inspired by the well-known Gonzalez algorithm for k-center clustering [30]. The algorithm greedily selects k points from the input point set, and the k balls centered at these selected points with some appropriate radius cover the whole input; if the input data has a low intrinsic dimension (e.g., the doubling dimension in Definition 3), the radius is no larger than an upper bound depending on k. Therefore, the set of cluster centers (each center with weight equal to the corresponding cluster size) can serve as a coreset for the ERM problem (1), where the error introduced by the coreset is determined by the radius. This k-center clustering based intuition has been used before to construct coresets for various applications [35, 53, 20].

However, we have to address two challenging issues to realize this approach on relational data. First, the greedy selection procedure for the k cluster centers cannot be directly implemented on relational data. Second, it is hard to obtain the size of each cluster, especially when the balls overlap (the k balls can partition the space into as many as 2^k parts). In fact, both of these issues can be regarded as instances of FAQ-AI(k−1), which are NP-hard to compute [1, 43].

Our approach relies on a novel and easy-to-implement structure called the "aggregation tree with pseudo-cube". We build the tree from the bottom up, where each relational table is a leaf. Informally speaking, each node consists of a set of k cluster centers obtained by merging its two children; each center also associates with a "pseudo-cube" region in the space. Eventually, the root of the tree yields a qualified coreset for the implicit design matrix. The aggregation manner helps us avoid building the high-complexity grid coreset of [23, 25]. Also, with the aid of the "pseudo-cube", we can efficiently estimate the size of each cluster without tackling the troublesome FAQ-AI counting issue [1, 43].

Compared with previous coreset methods, our approach enjoys several significant advantages. For example, it can handle more general applications: for most ERM problems of the form (1), under some mild assumptions, we can construct their coresets by applying our approach. Also, our coresets have much lower complexities. It is worth emphasizing that our coreset size is independent of the dimensionality d; instead, it only depends on the doubling dimension of the input data.
1.2 Other Related Works
The earliest research on relational data was proposed by Codd [17]. A fundamental topic closely related to machine learning on relational data is how to take uniform and independent samples from the full join results [13]. Recently, Zhao et al. [60] proposed a random walk approach for this problem with acyclic joins; Chen and Yi [14] generalized the result to some specific cyclic join scenarios.

Besides the aforementioned works [46, 38, 23, 25, 36, 1, 43], there also exist a number of results on other machine learning problems over relational data. Khamis et al. [2] and Yang et al. [58] considered training SVMs and SVMs with Gaussian kernels on relational data, respectively. For the problem of linear regression on relational data, it is common to use factorization techniques [51, 3]. Algorithms for training Gaussian mixture models and neural networks on relational data have also been studied recently [15, 16]. We also refer the reader to the survey [52] for more details on learning over relational data.
2 Preliminaries
Suppose the training set for problem (1) is a set P of n points in R^d, and it is decoupled into s relational tables {T_1, ..., T_s} with the feature (attribute) sets {D_1, ..., D_s}. Let D = ∪_i D_i; therefore the size |D| = d. Each table T_i can be viewed as the projection of P onto the subspace spanned by D_i. To materialize the design matrix of P, a straightforward way is to compute the join over these s tables. With a slight abuse of notation, we still use "P" to denote the design matrix. We also let [s] = {1, 2, ..., s} for simplicity.
T1:                T2:                T1 ⋈ T2:
 d1  d2             d2  d3             d1  d2  d3
  1   1              1   1              1   1   1
  2   1              1   4              1   1   4
  2   2              3   1              2   1   1
  3   3              3   3              2   1   4
                                        3   3   1
                                        3   3   3

Table 1: An illustration for the join over two tables.
Definition 1 (Join). The join over the given s tables returns an n × d design matrix P = T_1 ⋈ ··· ⋈ T_s, such that for any vector (point) p ∈ R^d, p ∈ P if and only if ∀i ∈ [s], Proj_{D_i}(p) ∈ T_i, where Proj_{D_i}(p) is the projection of p onto the subspace spanned by the features of D_i.
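As a small illustration (ours, not part of the paper's implementation), the following sketch materializes the join of two tables by matching on their shared attributes, reproducing the result shown in Table 1. It is only meant to show the semantics of Definition 1; materializing the join like this is exactly what we want to avoid for large inputs.

```python
from itertools import product

def join(t1, t2, attrs1, attrs2):
    """Natural join of two tables given as lists of tuples, with attribute name lists."""
    shared = [a for a in attrs1 if a in attrs2]
    out_attrs = attrs1 + [a for a in attrs2 if a not in attrs1]
    rows = []
    for r1, r2 in product(t1, t2):
        m1, m2 = dict(zip(attrs1, r1)), dict(zip(attrs2, r2))
        if all(m1[a] == m2[a] for a in shared):     # the tuples agree on shared attributes
            merged = {**m1, **m2}
            rows.append(tuple(merged[a] for a in out_attrs))
    return out_attrs, rows

# The two tables of Table 1.
T1 = [(1, 1), (2, 1), (2, 2), (3, 3)]               # attributes (d1, d2)
T2 = [(1, 1), (1, 4), (3, 1), (3, 3)]               # attributes (d2, d3)
attrs, P = join(T1, T2, ["d1", "d2"], ["d2", "d3"])
print(attrs, P)                                     # ['d1','d2','d3'] and the 6 tuples of Table 1
```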
Table 1 is a simple illustration of joining two relational tables. To have a more intuitive understanding of the join, we can generate a hypergraph G = (V, E). Each vertex of V corresponds to an individual feature of D; each hyperedge of E corresponds to an individual relational table T_i, and it connects all the vertices (features) of D_i. Then we can define acyclic and cyclic join queries. A join is acyclic if the hypergraph G = (V, E) can be reduced to the empty hypergraph by performing the following operation iteratively: if there exists a vertex that connects with only one hyperedge, remove this vertex together with the hyperedge. Otherwise, the join is cyclic. A cyclic query is usually extremely difficult and has a much higher complexity than an acyclic query; for example, for a cyclic join, it is even NP-hard to determine whether the join is empty or not [42]. Fortunately, most real-world joins are acyclic, which allows us to take full advantage of relational data. Similar to most previous articles on relational learning [1, 2], we also assume in this paper that the join of the given tables is acyclic.
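A minimal sketch (ours, for illustration) of the reduction described above; the standard GYO-style reduction has additional cases, so here we implement exactly the operation stated in the text and declare the join acyclic once no hyperedge remains.

```python
def is_acyclic(hyperedges):
    """hyperedges: one attribute set per table, e.g. [{'d1','d2'}, {'d2','d3'}]."""
    edges = [set(e) for e in hyperedges]
    changed = True
    while changed and edges:
        changed = False
        for i, e in enumerate(edges):
            # A vertex that appears in only this hyperedge triggers the removal step.
            lonely = [v for v in e if sum(v in other for other in edges) == 1]
            if lonely:
                edges.pop(i)        # remove the vertex together with its hyperedge
                changed = True
                break
    return len(edges) == 0

print(is_acyclic([{"d1", "d2"}, {"d2", "d3"}]))            # True  (the join of Table 1)
print(is_acyclic([{"a", "b"}, {"b", "c"}, {"a", "c"}]))    # False (a triangle join is cyclic)
```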
Counting is one of the most common aggregation queries on relational data. It returns the number of tuples that satisfy some particular conditions. Counting on an acyclic join (without additive inequalities) can be implemented efficiently via dynamic programming [26]. But when additive inequalities are included, the counting problem can be much more challenging, and we may even need to materialize the whole design matrix [1]. For example, to count the tuples of {t ∈ T_1 ⋈ T_2 : ‖t‖^2 ≤ 25} in Table 1, we need to check the whole design matrix T_1 ⋈ T_2 to select the tuples satisfying the constraint "‖t‖^2 ≤ 25".
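To illustrate the contrast, here is a small sketch (ours, under the assumption of a simple two-table acyclic join) of counting the join size without materializing it: group each table by the shared attribute and multiply the group counts. Once an inequality such as ‖t‖^2 ≤ 25 is added, this factorized counting no longer applies, which is exactly the difficulty emphasized above.

```python
from collections import Counter

# Table 1 again: T1 over (d1, d2) and T2 over (d2, d3), joined on d2.
T1 = [(1, 1), (2, 1), (2, 2), (3, 3)]
T2 = [(1, 1), (1, 4), (3, 1), (3, 3)]

# Count the join size WITHOUT materializing the join:
# for each value of the shared attribute d2, multiply the per-table frequencies.
cnt1 = Counter(d2 for _, d2 in T1)      # d2 -> number of T1 tuples with that value
cnt2 = Counter(d2 for d2, _ in T2)      # d2 -> number of T2 tuples with that value
join_size = sum(cnt1[v] * cnt2[v] for v in cnt1)
print(join_size)                        # 6, matching Table 1, in O(|T1| + |T2|) time
```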
For the machine learning problems studied in this paper, we make the following assumption on the loss function in (1).

Assumption 1 (Continuity). There exist real constants α, z ≥ 0 and β ∈ [0, 1), such that for any p, q ∈ R^d and any θ in the hypothesis space, we have

    |f(θ, p) − f(θ, q)| ≤ α ‖p − q‖^z + β |f(θ, q)|,        (2)

where ‖·‖ is the Euclidean norm.
Remark 1. Different machine learning problems have different values of α, β and z. For example, for k-means clustering, we have z = 2, β = ε and α = O(1/ε) (where ε can be any small number in (0, 1)); for k-median clustering, we have z = 1, β = 0 and α = 1. Actually, for a large number of problems, β is small or even 0; e.g., logistic regression and soft-margin SVM have β = 0.
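As a quick sanity check of Remark 1 (our own derivation, covering only the k-median case, where f(θ, p) = min_{c∈θ} ‖p − c‖ for a set θ of k centers), the constants follow directly from the triangle inequality:

```latex
% k-median loss: f(\theta,p) = \min_{c\in\theta}\|p-c\|; let c_p, c_q be the minimizers for p, q.
\begin{align*}
f(\theta, p) &\le \|p - c_q\| \le \|q - c_q\| + \|p - q\| = f(\theta, q) + \|p - q\|,\\
f(\theta, q) &\le \|q - c_p\| \le \|p - c_p\| + \|p - q\| = f(\theta, p) + \|p - q\|,\\
\Longrightarrow\quad |f(\theta, p) - f(\theta, q)| &\le 1\cdot\|p - q\|^{1} + 0\cdot|f(\theta, q)|.
\end{align*}
```

That is, Assumption 1 holds with z = 1, β = 0 and α = 1, matching Remark 1; the k-means constants can be obtained similarly with an extra AM-GM step, which we do not reproduce here.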
We define the following coreset with both multiplicative error and additive error. We are aware that standard coresets usually only have multiplicative errors [44, 28]. However, the deviation bounds for ERM problems with a finite training data set only yield additive error guarantees, and the additive error is usually acceptable in practice [8, 56]; hence coresets with additive error have also been proposed recently [9]. Let P = {p_1, p_2, ..., p_n} be the training set (which is not explicitly given), and denote by ∆ the diameter of P (i.e., the largest pairwise distance in P).

Definition 2 ((ε_1, ε_2)_z-Coreset). Suppose ε_1 ∈ (0, 1) and ε_2, z > 0. The (ε_1, ε_2)_z-coreset, denoted by P̃, is a point set {c_1, ..., c_{|P̃|}} with a weight vector W = [w_1, w_2, ..., w_{|P̃|}] satisfying

    F̃(θ) ∈ (1 ± ε_1) F(θ) ± ε_2 ∆^z,        (3)

for any θ in the hypothesis space H, where F̃(θ) = (1 / Σ_{i=1}^{|P̃|} w_i) Σ_{i=1}^{|P̃|} w_i f(θ, c_i).
Usually we want the coreset size |P̃| to be much smaller than |P|, so that running our algorithm on P̃ saves a large amount of running time.
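The following small sketch (ours, using the k-median loss as a stand-in for f) shows how the weighted objective F̃(θ) of Definition 2 is evaluated and compared against the exact F(θ). The coreset points and weights here are placeholders chosen for illustration, not the output of the construction in Section 3.

```python
import numpy as np

def kmedian_loss(theta, p):
    """f(theta, p): distance from point p to its nearest center in theta (k x d)."""
    return np.min(np.linalg.norm(theta - p, axis=1))

def F(theta, P):
    """Exact empirical risk (1): the average loss over the full data set."""
    return np.mean([kmedian_loss(theta, p) for p in P])

def F_tilde(theta, C, w):
    """Coreset estimate of the risk: the weighted average over coreset points."""
    losses = np.array([kmedian_loss(theta, c) for c in C])
    return np.dot(w, losses) / np.sum(w)

rng = np.random.default_rng(0)
P = rng.normal(size=(1000, 3))                          # stand-in for the (implicit) design matrix
C = P[rng.choice(len(P), size=20, replace=False)]       # placeholder coreset points
w = np.full(20, len(P) / 20.0)                          # placeholder weights
theta = rng.normal(size=(2, 3))                         # two candidate centers
print(F(theta, P), F_tilde(theta, C, w))                # the two values should be close
```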
As mentioned in Section 1.1, we also assume that the training set P has a low intrinsic dimension. The "doubling dimension" is a widely used measure of the intrinsic dimension of a given data set [41]. Roughly speaking, it measures the growth rate of the given data in the space. Recently it has gained a lot of attention in the study of its relation to coresets and machine learning problems [39, 33, 18]. For any c ∈ R^d and r ≥ 0, we use B(c, r) to denote the ball centered at c with radius r.

Definition 3 (Doubling Dimension). The doubling dimension of a data set P is the smallest number ρ > 0 such that for any p ∈ P and r ≥ 0, P ∩ B(p, 2r) is always covered by the union of at most 2^ρ balls of radius r.

It is easy to obtain the following claim by recursively applying Definition 3 log(∆/r) times (each application halves the radius while multiplying the number of balls by at most 2^ρ).

Claim 1. For a given data set P and radius r > 0, if P has doubling dimension ρ, then P can be covered by (∆/r)^ρ balls of radius r.
3 Relational Coreset via Aggregation Tree
In this section, we present an efficient coreset construction method for relational data. First, we introduce the technique of the aggregation tree with pseudo-cube. There is actually an obstacle to obtaining the coreset from the aggregation tree: the weight of each point of the coreset is determined by the size of its corresponding pseudo-cube; however, the pseudo-cubes may overlap, and it is challenging to separate them in the relational setting. To explain our idea more clearly, we temporarily ignore this "overlap" issue and show the "ideal" coreset construction result in Section 3.1. Then we show that the overlap issue can be efficiently solved by using random sampling and some novel geometric insights in Section 3.2.
Recall that our input is a set of s relational tables {T_1, ..., T_s} with the feature (attribute) sets {D_1, ..., D_s}. Also D = ∪_i D_i and |D| = d. For ease of presentation, we generate a new feature set D̂_i for each D_i as follows: initially, D̂_1 = D_1; then, from i = 2 to s, we let D̂_i = D_i \ (∪_{l=1}^{i−1} D̂_l). It is easy to see that ∪_i D̂_i = D, and ∀i, j ∈ [s], i ≠ j, D̂_i ∩ D̂_j = ∅. With a slight abuse of notation, we also use D̂_i to represent the subspace spanned by the features of D̂_i (so the Cartesian product D̂_1 × ··· × D̂_s is the whole space R^d). For any point set Q ⊂ R^d (resp., any point c ∈ R^d) and any subspace H of R^d, we use Proj_H(Q) (resp., Proj_H(c)) to denote the projection of Q (resp., c) onto H.
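A tiny sketch (ours) of this disjointification step, which strips from each attribute set the attributes already claimed by earlier tables:

```python
def disjointify(feature_sets):
    """Given D_1, ..., D_s (lists of attribute names), return hat{D}_1, ..., hat{D}_s:
    hat{D}_i keeps only the attributes of D_i not already in hat{D}_1, ..., hat{D}_{i-1}."""
    seen, hat = set(), []
    for D_i in feature_sets:
        hat_i = [a for a in D_i if a not in seen]   # preserve the original attribute order
        seen.update(hat_i)
        hat.append(hat_i)
    return hat

print(disjointify([["d1", "d2"], ["d2", "d3"]]))    # [['d1', 'd2'], ['d3']]
```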
3.1 Coreset Construction
Since our coreset construction algorithm is closely related to Gonzalez's algorithm for k-center clustering [30], we briefly introduce it first. Initially, it arbitrarily selects a point from P, say c_1, and sets C = {c_1}; then it iteratively selects the point farthest from the set C, i.e., arg max_{p∈P} min_{q∈C} ‖p − q‖, and adds it to C. After k iterations, we obtain k points in C (suppose C = {c_1, ..., c_k}). Gonzalez's algorithm yields a 2-approximation for k-center clustering: if the optimal radius of k-center clustering on P is r_opt, then P can be covered by the k balls B(c_1, 2r_opt), ..., B(c_k, 2r_opt). Together with Claim 1, for any given radius r > 0, if we run Gonzalez's algorithm on P with k = (2∆/r)^ρ, then P can be covered by the balls B(c_1, r), ..., B(c_k, r) (since r_opt ≤ r/2). If we set r small enough, the obtained C can be a good approximation (informally, a coreset) for P.
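For intuition, the following sketch (ours) runs exactly this procedure on an explicit point set and weights each selected center by its cluster size; as stressed below, this direct implementation is precisely what is unavailable in the relational setting, where P only exists implicitly.

```python
import numpy as np

def gonzalez_coreset(P, k, seed=0):
    """Greedy k-center selection on an explicit n x d array P; returns the k centers,
    their cluster sizes (the coreset weights), and the covering radius."""
    rng = np.random.default_rng(seed)
    centers = [P[rng.integers(len(P))]]                  # arbitrary first center
    dists = np.linalg.norm(P - centers[0], axis=1)       # distance of each point to C
    for _ in range(k - 1):
        far = int(np.argmax(dists))                      # farthest point from the current C
        centers.append(P[far])
        dists = np.minimum(dists, np.linalg.norm(P - centers[-1], axis=1))
    C = np.stack(centers)
    assignment = np.argmin(
        np.linalg.norm(P[:, None, :] - C[None, :, :], axis=2), axis=1)
    weights = np.bincount(assignment, minlength=k)       # cluster sizes = coreset weights
    return C, weights, dists.max()                       # covering radius is at most 2 * r_opt

P = np.random.default_rng(1).normal(size=(500, 2))
C, w, radius = gonzalez_coreset(P, k=16)
print(w.sum(), radius)       # all 500 points accounted for; the radius shrinks as k grows
```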
However, as mentioned in Section 1, such a k-center clustering approach cannot be implemented on relational data, since it is equivalent to an instance of FAQ-AI(k−1). A straightforward idea to remedy this issue is to run Gonzalez's algorithm in each subspace D̂_i, and then compute the Cartesian product to build a grid of size k^s as in the methods of [23, 25]. But this approach results in an exponentially large coreset (e.g., with s = 8 tables and k = 1000, the grid contains 1000^8 = 10^24 points). Below, we introduce our algorithm, which achieves a coreset of quasi-polynomial size.