In this paper, we aim to develop an efficient and general coreset construction method for optimizing (1) on relational data. First, we observe that real-world data sets often have low intrinsic dimensions (e.g., they lie near a low-dimensional manifold) [10]. Our coreset approach is inspired by the well-known Gonzalez’s algorithm for $k$-center clustering [30]. The algorithm greedily selects $k$ points from the input point set, and the $k$ balls centered at these selected points with some appropriate radius can cover the whole input; if the input data has a low intrinsic dimension (e.g., the doubling dimension in Definition 3), the radius is no larger than an upper bound depending on $k$. Therefore, the set of cluster centers (each center weighted by the size of its cluster) can serve as a coreset for the ERM problem (1), where the error induced by the coreset is determined by the radius. This $k$-center clustering based intuition has been used to construct coresets for various applications before [35, 53, 20].
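To make this intuition concrete, the following is a minimal sketch of the (non-relational) Gonzalez selection and the resulting weighted coreset; the function names are ours and this is not our relational construction:

```python
import numpy as np

def gonzalez_k_center(points, k):
    """Greedy farthest-point selection (Gonzalez); the k chosen
    centers give a 2-approximation of the optimal k-center radius."""
    centers = [0]                      # start from an arbitrary point
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        far = int(np.argmax(dist))     # farthest point from current centers
        centers.append(far)
        dist = np.minimum(dist, np.linalg.norm(points - points[far], axis=1))
    return centers, float(dist.max())  # center indices and covering radius

def weighted_coreset(points, centers):
    """Weight each center by its cluster size (nearest-center rule)."""
    d2c = np.stack([np.linalg.norm(points - points[c], axis=1) for c in centers])
    assign = d2c.argmin(axis=0)        # nearest-center assignment
    weights = np.bincount(assign, minlength=len(centers))
    return points[centers], weights
```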
However, we have to address two challenging issues to realize this approach on relational data. First, the greedy selection procedure for the $k$ cluster centers cannot be directly implemented on relational data. Second, it is hard to obtain the size of each cluster, especially when the balls overlap: each point is characterized by which of the $k$ balls contain it, so the balls can partition the space into as many as $2^k$ parts. In fact, both of these issues can be regarded as instances of FAQ-AI($k-1$), which is NP-hard to compute [1, 43].
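The $2^k$ bound can be checked empirically: each point’s ball-membership pattern is a length-$k$ bit vector, and distinct patterns correspond to distinct parts. A toy illustration (all data here is synthetic):

```python
import numpy as np

# Count the distinct ball-membership patterns induced by k balls.
rng = np.random.default_rng(0)
k = 4
pts = rng.uniform(-1, 1, size=(100_000, 2))
centers = rng.uniform(-1, 1, size=(k, 2))
inside = np.linalg.norm(pts[:, None] - centers[None], axis=2) <= 0.9
print(np.unique(inside, axis=0).shape[0])  # at most 2**k = 16 parts
```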
Our approach relies on a novel and easy-to-implement structure called the “aggregation tree with pseudo-cubes”. We build the tree from the bottom up, where each relational table corresponds to a leaf. Informally speaking, each node consists of a set of $k$ cluster centers obtained by merging its two children; each center is also associated with a “pseudo-cube” region in the space. Eventually, the root of the tree yields a qualified coreset for the implicit design matrix. The aggregation manner allows us to avoid building the high-complexity grid coresets of [23, 25]. Also, with the aid of the pseudo-cubes, we can efficiently estimate the size of each cluster without tackling the troublesome FAQ-AI counting issue [1, 43].
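The sketch below is only illustrative: it mimics the join of two children by a Cartesian product of their (small) center sets over disjoint attribute sets, and it reassigns weights by the nearest-center rule instead of the pseudo-cube estimates defined later. All helper names are ours, not part of the algorithm as stated:

```python
import itertools
import numpy as np

def gonzalez(points, k):
    # greedy farthest-point selection, as in the earlier sketch
    centers = [0]
    dist = np.linalg.norm(points - points[0], axis=1)
    while len(centers) < min(k, len(points)):
        far = int(np.argmax(dist))
        centers.append(far)
        dist = np.minimum(dist, np.linalg.norm(points - points[far], axis=1))
    return centers

def merge_children(left, right, k):
    """Toy merge of two children's weighted center sets.

    left/right: (centers, weights) over disjoint attribute sets, so the
    join is mimicked by a Cartesian product of the center sets. The real
    algorithm replaces the weight reassignment below with the pseudo-cube
    estimates; that part is NOT implemented here."""
    (lc, lw), (rc, rw) = left, right
    cand = np.array([np.concatenate([a, b])
                     for a, b in itertools.product(lc, rc)])
    wts = np.array([wa * wb for wa, wb in itertools.product(lw, rw)],
                   dtype=float)
    idx = gonzalez(cand, k)                 # prune back to k centers
    d2c = np.stack([np.linalg.norm(cand - cand[i], axis=1) for i in idx])
    assign = d2c.argmin(axis=0)             # nearest surviving center
    new_w = np.bincount(assign, weights=wts, minlength=len(idx))
    return cand[idx], new_w
```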
Compared with previous coreset methods, our approach enjoys several significant advantages. For example, it can handle more general applications: for most ERM problems in the form of (1), under some mild assumptions, we can construct coresets by applying our approach. Our coresets also have much lower complexities. It is worth emphasizing that our coreset size is independent of the dimensionality $d$; instead, it depends only on the doubling dimension of the input data.
1.2 Other Related Works
The earliest research on relational data was proposed by Codd [17]. A fundamental topic closely related to machine learning on relational data is how to take uniform and independent samples from the full join results [13]. Recently, Zhao et al. [60] proposed a random walk approach for this problem on acyclic joins; Chen and Yi [14] generalized the result to some specific cyclic join scenarios.
Besides the aforementioned works [46, 38, 23, 25, 36, 1, 43], there also exist a number of results on other machine learning problems over relational data. Khamis et al. [2] and Yang et al. [58] considered training SVMs and SVMs with Gaussian kernels on relational data, respectively. For linear regression on relational data, it is common to use factorization techniques [51, 3]. Algorithms for training Gaussian Mixture Models and Neural Networks on relational data have also been studied recently [15, 16]. We also refer the reader to the survey [52] for more details on learning over relational data.
2 Preliminaries
Suppose the training set for problem (1) is a set $P$ of $n$ points in $\mathbb{R}^d$, and it is decoupled into $s$ relational tables $\{T_1, \ldots, T_s\}$ with the feature (attribute) sets $\{D_1, \ldots, D_s\}$. Let $D = \bigcup_i D_i$; therefore the size $|D| = d$. Each table $T_i$ can be viewed as the projection of $P$ onto the subspace spanned by $D_i$. To materialize the design matrix of $P$, a straightforward way is to compute the join over these $s$ tables. With a slight abuse of notation, we still use “$P$” to denote the design matrix. We also let $[s] = \{1, 2, \ldots, s\}$ for simplicity.
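As a toy illustration of this materialization (the table and attribute names are hypothetical), the natural join over the $s$ tables can blow up the number of rows, which is exactly what our approach avoids:

```python
import pandas as pd
from functools import reduce

# Two hypothetical tables sharing the key attribute "uid".
T1 = pd.DataFrame({"uid": [1, 1, 2], "x1": [0.5, 0.7, 0.1]})
T2 = pd.DataFrame({"uid": [1, 2, 2], "x2": [1.2, 0.3, 0.9]})

# pd.merge with no explicit keys joins on the shared columns
# (a natural join); chaining it over all s tables yields P.
P = reduce(lambda a, b: a.merge(b), [T1, T2])
print(len(P))  # 4 rows from 3 + 3 input rows: the join result can be
               # much larger than the total size of the input tables.
```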