Bagged k-Distance for Mode-Based Clustering Using the
Probability of Localized Level Sets
Hanyuan Hang
Department of Applied Mathematics
University of Twente, The Netherlands
h.hang@utwente.nl
October 19, 2022
Abstract
In this paper, we propose an ensemble learning algorithm named bagged k-distance for
mode-based clustering (BDMBC) by putting forward a new measurement called the proba-
bility of localized level sets (PLLS), which enables us to find all clusters for varying densities
with a global threshold. On the theoretical side, we show that with a properly chosen number
of nearest neighbors $k_D$ in the bagged k-distance, the sub-sample size $s$, the bagging
rounds $B$, and the number of nearest neighbors $k_L$ for the localized level sets, BDMBC can
achieve optimal convergence rates for mode estimation. It turns out that with a relatively
small $B$, the sub-sample size $s$ can be much smaller than the number of training data $n$ at
each bagging round, and the number of nearest neighbors $k_D$ can be reduced simultaneously.
Moreover, we establish optimal convergence results for the level set estimation of the PLLS
in terms of Hausdorff distance, which reveals that BDMBC can find localized level sets for
varying densities and thus enjoys local adaptivity. On the practical side, we conduct nu-
merical experiments to empirically verify the effectiveness of BDMBC for mode estimation
and level set estimation, which demonstrates the promising accuracy and efficiency of our
proposed algorithm.
1 Introduction
In the field of density-based clustering, many algorithms share the common assumption that all
clusters have similar density levels. In detail, these algorithms employ a global density threshold
to define the high-density regions and categorize them as clusters. Due to its algorithmic
simplicity, this paradigm, also called single-level density-based clustering, attracted much
attention in the early stage of clustering research [22,61,30,34,36]. However, with the rapid
development of information technology, this assumption becomes hard to maintain as the
number of clusters and the size of data keep growing. Experiments have also shown that
well-performing single-level clustering algorithms are incompetent on datasets whose clusters
have varying densities [73,51]. Consequently, a more general setting for density-based clustering
called multi-level density clustering has come into vogue [73,52,14] and is applied in various
subjects including computer vision [57,11,31], medicine and biometrics [46,60], etc.
To solve the multi-level density clustering problem, a natural idea is to extend the solutions
proposed for single-level clustering problems. Some researchers therefore argue that increasing
the number of thresholds can help seek clusters with different densities, and propose the
paradigm called hierarchical density-based clustering. The term hierarchical means that the
algorithm follows either an agglomerative (bottom-up) or a divisive (top-down) order to move the
threshold, estimates the clusters at each threshold, and finally grows a clustering tree based on
the clustering results. By carefully selecting the nodes in the clustering tree, hierarchical
methods can obtain promising results in multi-level density situations [51,48,2]. For example,
[51] takes advantage of DBSCAN and proposes an automatic framework called HDBSCAN
to decide which thresholds are better according to some pre-determined informational metric;
in addition, [48] also studies how to choose the optimal clustering results from a clustering tree.
Nevertheless, hierarchical methods are criticized for the heavy computational cost of growing
the clustering tree. Moreover, automatically determining the clusters from a cluster tree
remains an unsolved problem.
Therefore, to improve the computational efficiency and directly obtain the clustering result,
another line of research aims at finding a suitable transformation of the current density measure
that balances the density levels of all clusters to a similar level [16,37,73,72,53]. To be specific,
such algorithms care little about the absolute density value of samples. Instead, they attach
more importance to the relative information of samples in a local area. For example, [16]
proposes the mean shift method to find the density hill of each cluster by iteratively searching
the center of mass from several randomly chosen initial points. In [73], the estimated probability
density function is transformed into a new measure called the density ratio to help balance the
density measure. Since these methods amount to seeking the bumps or hills of the distribution,
they are also called mode-based clustering algorithms.
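
For illustration, the following is a minimal sketch of mean shift with a Gaussian kernel. It illustrates the general idea of iteratively moving to a local center of mass rather than the exact procedure of [16]; the function name `mean_shift_mode`, the bandwidth `h`, and the initial point `x0` are illustrative choices.

```python
import numpy as np

def mean_shift_mode(X, x0, h, n_iter=100, tol=1e-6):
    """Gaussian-kernel mean shift: starting from x0, repeatedly move to
    the kernel-weighted center of mass of the sample; the fixed point
    is a density hill (mode)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        # kernel weights of all samples relative to the current point
        w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * h ** 2))
        x_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x
```

Running this iteration from several initial points and merging the fixed points that coincide yields one representative mode per cluster.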
Although mode-based methods largely improve the computational efficiency of multi-level
density clustering, they still suffer from two inevitable shortcomings. Firstly, many
mode-based algorithms require the estimation of the probability density function, e.g. [73].
However, density estimation suffers from the curse of dimensionality, which means that
estimating a satisfactory density function becomes much harder and requires more training
samples when the dimension of the input variables is high. Hence, it is hard to apply
mode-based clustering algorithms to high-dimensional datasets. Secondly, the computational
efficiency of mode-based algorithms may not be as satisfactory as expected. On the one hand,
mode-based algorithms require much training time when the sample size is large. On the other
hand, the procedure of searching for an optimal combination of parameters can be even more
tiresome.
Against this background, in this paper we propose an ensemble learning algorithm called
bagged k-distance for mode-based clustering (BDMBC) to solve multi-level density clustering
problems. To be specific, we first introduce a new measurement called the probability of
localized level sets (PLLS) to deal with multi-level density problems. The PLLS represents
the local rank of the density, which makes it possible to employ a global threshold to recognize
multi-level density clusters. Secondly, to resist the curse of dimensionality in density
estimation, we introduce the k-distance as an implicit representation of the density, which is
then plugged into the localized level set estimation. As a distance-based measure, the
k-distance has strong resistance to the curse of dimensionality, and hence enables BDMBC to
deal with high-dimensional datasets. Last but not least, we further employ the bagging
technique to enhance the computational efficiency of calculating the k-distance. In particular,
when dealing with large-scale datasets, the bagging technique can accelerate the algorithm
with a small sampling ratio and thus uses a much smaller training dataset in each bagging
iteration. Since the size of the training dataset in each iteration is largely decreased by
sub-sampling, the search grid for sample-size-based hyper-parameters can also be simplified,
saving practitioners from tedious hyper-parameter tuning.
The theoretical and experimental contributions of this paper are summarized as follows:
(i) From the theoretical perspective, we first conduct a learning theory analysis of the bagged
k-distance by introducing the hypothetical density estimation. Under the Hölder smoothness of
the density function, with properly chosen $k$, we establish optimal convergence rates of the
hypothetical density estimation in terms of the $L_\infty$-norm. It is worth pointing out that
our finite sample results demonstrate the explicit relationship among the bagging rounds $B$,
the number of nearest neighbors $k$, and the sub-sample size $s$.
Then we propose a novel mode estimation built from the probability of localized level sets.
Based on the convergence rates of the hypothetical density estimation, we show that, under
mild assumptions on the modes, optimal convergence rates for mode estimation are attained
with properly chosen parameters. We show that the bagging technique helps to reduce the
subsample size and the number of neighbors simultaneously for mode estimation and thus
increases the computational efficiency.
Moreover, under mild assumptions on the density function, we establish convergence results
of level set estimation for the probability of localized level sets in terms of Hausdorff distance.
Compared to previous works on level set estimation in clustering that focus on a single threshold,
our results capture level sets for varying densities. This demonstrates the local adaptivity of
our BDMBC in multi-level density clustering.
(ii) From the experimental perspective, we conduct numerical experiments to illustrate the
properties of our proposed BDMBC. Firstly, we verify our theoretical results about mode
estimation by conducting mode estimation experiments on synthetic datasets. We demonstrate
that our BDMBC can detect all modes successfully and thus recover all mode-based clusters.
Secondly, we verify our theoretical results about level set estimation by conducting numerical
comparisons with other competing methods. We show the promising accuracy and efficiency of
our proposed algorithm compared with other density-based, cluster-tree-based, and mode-based
methods. Thirdly, we conduct a parameter analysis of our proposed BDMBC and empirically
demonstrate that with a relatively small subsample ratio, bagging can significantly narrow the
search grid of parameters. Moreover, we compare the bagging and non-bagging versions of
BDMBC on large-scale synthetic datasets and verify that bagging can substantially shorten
the computation time without sacrificing accuracy.
The remainder of this paper is organized as follows. Section 2 is a warm-up section introducing
some notations and the new measurement, the probability of localized level sets (PLLS); there
we also propose the bagged k-distance for mode-based clustering (BDMBC). In Section 3, we
first present our main results on the convergence rates for mode estimation and level set
estimation, and then provide some comments and discussions concerning these results. In
Section 4, we conduct the error analysis for the bagged k-distance and calculate its
computational complexity. Section 5 presents experimental results on both real and synthetic
data, together with scalability experiments that show the computational efficiency of our
algorithm. In Section 6, we provide the detailed proofs. Finally, we summarize our work in
Section 7.
2 Methodology
In this section, we briefly recall some necessary notations and algorithms as preliminaries in
Section 2.1. Then, to avoid the drawback of classical density-based clustering methods, we
propose a new measurement, the probability of localized level sets (PLLS) in Section 2.2, introduce
the bagged k-distance in Section 2.3, and construct the corresponding density-based clustering
algorithm called bagged k-distance for mode-based clustering (BDMBC ) in Section 2.4.
2.1 Preliminaries
First, we introduce some basic notations that will be frequently used in this paper. We use the
notations $a \vee b := \max\{a, b\}$ and $a \wedge b := \min\{a, b\}$. For any $x \in \mathbb{R}$, let $\lfloor x \rfloor$ denote the largest
integer less than or equal to $x$ and $\lceil x \rceil$ the smallest integer greater than or equal to $x$. Recall
that for $1 \le p < \infty$, the $\ell_p$-norm is defined as $\|x\|_p := (x_1^p + \cdots + x_d^p)^{1/p}$, and the $\ell_\infty$-norm is
defined as $\|x\|_\infty := \max_{i=1,\ldots,d} |x_i|$. Let $(\Omega, \mathcal{A}, \mu)$ be a probability space. We denote by $L_p(\mu)$ the
space of (equivalence classes of) measurable functions $g : \Omega \to \mathbb{R}$ with finite $L_p$-norm $\|g\|_p$. For
any $x \in \mathbb{R}^d$ and $r > 0$, denote $B(x, r) := \{x' \in \mathbb{R}^d : \|x' - x\|_2 \le r\}$ as the closed ball centered at
$x$ with radius $r$. For a set $A \subset \mathbb{R}^d$, the cardinality of $A$ is denoted by $\#(A)$ and the indicator
function on $A$ is denoted by $\mathbf{1}_A$ or $\mathbf{1}\{A\}$.

In the sequel, the notations $a_n \lesssim b_n$ and $a_n = O(b_n)$ denote that there exists some positive
constant $c \in (0,1)$ such that $a_n \le c^{-1} b_n$, and $a_n \gtrsim b_n$ denotes that there exists some positive
constant $c \in (0,1)$ such that $a_n \ge c b_n$. Moreover, the notation $a_n \asymp b_n$ means that both
$a_n \lesssim b_n$ and $b_n \lesssim a_n$ hold simultaneously. Let $\mathrm{P}$ be a probability distribution on $\mathbb{R}^d$ with the
underlying density $f$ which has a compact support $\mathcal{X} \subset [-R, R]^d$ for some $R > 0$. Suppose that
the data $D_n = (X_1, \ldots, X_n) \in \mathcal{X}^n$ are drawn from $\mathrm{P}$ in an i.i.d. fashion. With a slight abuse
of notation, in this paper $c$, $c'$, $C$ will be used interchangeably for positive constants whose
values may vary across different lemmas, propositions, theorems, and corollaries.
2.2 Probability of Localized Level Sets
One of the main drawbacks of density-based clustering based on density estimation is that it
cannot find all clusters with varying densities using a global threshold, see, e.g., [12,72,13]. Here
we give a simple univariate example of this phenomenon in Figure 1. For the univariate trimodal
density, there are three different clusters that are visually identifiable, yet none of the level sets
of the density has three connected components. In fact, if the level is chosen too low, the two
clusters of high density are merged into a single cluster; if the density level is chosen too
high, the cluster exhibiting a lower density is lost. Clearly, in such a case, the clusters
derived from a single density level cannot completely describe the inherent clustering structure
of the data set.
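
This failure can be checked numerically. The following sketch builds a trimodal Gaussian mixture of the kind depicted in Figure 1 (the component weights and locations are our own choice, not the ones used for the figure) and scans all thresholds: no single level set ever has three connected components.

```python
import numpy as np
from scipy.stats import norm

# Trimodal density: two high modes close together and one low mode.
def f(x):
    return (0.45 * norm.pdf(x, loc=0.0, scale=0.5)
            + 0.45 * norm.pdf(x, loc=1.5, scale=0.5)
            + 0.10 * norm.pdf(x, loc=5.0, scale=0.5))

xs = np.linspace(-3.0, 8.0, 4000)
fx = f(xs)

def n_components(level):
    """Number of connected components of {x : f(x) >= level},
    counted as runs of consecutive grid points above the level."""
    above = np.concatenate(([False], fx >= level)).astype(int)
    return int(np.sum(np.diff(above) == 1))

levels = np.linspace(1e-3, fx.max(), 500)
print(max(n_components(t) for t in levels))
# prints 2: every threshold either merges the two high-density
# clusters or misses the low-density one, never separating all three
```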
To deal with this issue, we propose a local measurement named the probability of localized
level sets (PLLS) to implement density-based clustering.

Definition 1 (Probability of Localized Level Sets). Let $x \in \mathcal{X}$ and let $\eta(x) > 0$ be the local radius
parameter. Given the true density function $f : \mathcal{X} \to \mathbb{R}$, the probability of localized level sets
(PLLS) is defined by
$$p_\eta(x) = \mathrm{P}\bigl(f(y) \le f(x) \,\big|\, y \in B(x, \eta(x))\bigr) = \frac{\mathrm{P}\bigl(f(y) \le f(x),\, y \in B(x, \eta(x))\bigr)}{\mathrm{P}\bigl(y \in B(x, \eta(x))\bigr)}. \tag{1}$$
Figure 1: Univariate trimodal density for which it is not possible to capture its whole cluster structure
using a global threshold.
Note that the PLLS is the conditional probability of the event that the density at the
instance is larger than the density in its neighborhood. To explain the advantages of the PLLS
over the original probability density function for clustering, we point out two critical observations
from (1). On the one hand, if $x$ is a mode of $f$, then $f(y) \le f(x)$ for all $y \in B(x, \eta(x))$, which
yields $p_\eta(x) = 1$. On the other hand, if $f(x)$ is a local minimum of the density, i.e., if
$f(y) \ge f(x)$ for all $y \in B(x, \eta(x))$, then we have $p_\eta(x) = 0$. Therefore, unlike the probability
density function, the PLLS captures the relative position to the modes of the density $f$. As
a result, we can deal with the variation in density across different clusters and thus allow for a
single threshold to identify all the modes and the corresponding clusters simultaneously.
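
For intuition, a minimal plug-in sketch of the PLLS at the sample points is given below; it takes any density estimate evaluated at the samples (in BDMBC, the bagged k-distance introduced next plays this role). The function name `plls`, the argument `f_hat`, and the use of a constant radius `eta` (Definition 1 allows $\eta(x)$ to vary with $x$) are assumptions for illustration.

```python
import numpy as np

def plls(X, f_hat, eta):
    """Plug-in estimate of the PLLS in (1) at each sample point:
    the fraction of sample points inside B(x_i, eta) whose estimated
    density does not exceed the estimated density at x_i."""
    n = X.shape[0]
    p = np.empty(n)
    for i in range(n):
        # points of the sample falling in the ball B(x_i, eta);
        # always includes x_i itself, so the mean is well defined
        in_ball = np.linalg.norm(X - X[i], axis=1) <= eta
        p[i] = np.mean(f_hat[in_ball] <= f_hat[i])
    return p
```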
2.3 Bagged k-Distance
In this section, we introduce the bagged k-distance, which represents the density implicitly, for
the construction of mode-based clustering. For any $x \in \mathbb{R}^d$ and any subset $D \subset D_n$, we denote
$X_{(k)}(x) := X_{(k)}(x; D)$ as the $k$-th nearest neighbor of $x$ in $D$. Then we denote $R_k(x; D)$ as
the distance between $x$ and $X_{(k)}(x; D)$, termed the $k$-distance of $x$ in $D$. Specifically, we let
$R_k(x) := R_k(x; D_n)$.
We first recall the $k$-nearest neighbor ($k$-NN) density estimator. To be specific, denote
$\mu(B(x, r))$ as the volume (with respect to the Lebesgue measure) of the ball $B(x, r)$. Then the
$k$-NN density estimator [8, Definition 3.1] is defined by
$$f_k(x) = \frac{k/n}{\mu(B(x, R_k(x)))} = \frac{k/n}{V_d R_k(x)^d}, \tag{2}$$
where $V_d := \pi^{d/2} / \Gamma(d/2 + 1)$ is the volume of the unit ball.
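
For concreteness, (2) can be evaluated at the sample points with a k-d tree, as in the following sketch; the helper name `knn_density` and the convention of excluding each sample point from its own neighbors are our choices, not prescribed by [8].

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gamma

def knn_density(X, k):
    """k-NN density estimate (2) at the sample points themselves."""
    n, d = X.shape
    tree = cKDTree(X)
    # query k+1 neighbors: the nearest neighbor of a sample point
    # is the point itself at distance 0, which we exclude
    dist, _ = tree.query(X, k=k + 1)
    R_k = dist[:, -1]                             # k-distance R_k(x)
    V_d = np.pi ** (d / 2) / gamma(d / 2 + 1)     # unit-ball volume
    return (k / n) / (V_d * R_k ** d)
```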
However, in practice, a major problem is numerical: when computing the k-NN density
estimate for high-dimensional data, samples in a finite dataset can be distributed quite
sparsely. As a consequence, the target density can be extremely small in areas of the input space.
In this case, $R_k(x)^d$ in (2) can be extremely large when $d$ is big, leading to arithmetic overflow
during the computation. Therefore, direct density estimation is problematic for
high-dimensional data. On the other hand, for large-scale datasets, the computational burden
of searching for $k$-nearest neighbors can be heavy. To deal with these two problems, in this
work, we adopt the bagging technique to reduce both the number of nearest neighbors to search
and the size of the dataset searched at each bagging round.
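
To fix ideas before the formal construction, the following sketch implements one natural reading of the bagged k-distance: the $k_D$-distance averaged over $B$ subsamples of size $s$ drawn without replacement. This aggregation, together with the function name `bagged_k_distance`, is an assumption for illustration; the precise definition follows in the remainder of this section.

```python
import numpy as np
from scipy.spatial import cKDTree

def bagged_k_distance(X, query, k_D, s, B, seed=0):
    """Sketch of a bagged k-distance: average the k_D-distance over
    B subsamples of size s drawn without replacement."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    R = np.zeros(query.shape[0])
    for _ in range(B):
        idx = rng.choice(n, size=s, replace=False)        # one bagging round
        dist, _ = cKDTree(X[idx]).query(query, k=k_D)
        R += dist[:, -1] if k_D > 1 else dist             # k_D-distance in subsample
    return R / B
```

Because each round searches only $s \ll n$ points, both the neighbor search and the memory footprint per round shrink, which is the computational benefit discussed above.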