Bagged k-Distance for Mode-Based Clustering Using the
Probability of Localized Level Sets
Hanyuan Hang
Department of Applied Mathematics
University of Twente, The Netherlands
h.hang@utwente.nl
October 19, 2022
Abstract
In this paper, we propose an ensemble learning algorithm named bagged k-distance for
mode-based clustering (BDMBC) by putting forward a new measurement called the proba-
bility of localized level sets (PLLS), which enables us to find all clusters for varying densities
with a global threshold. On the theoretical side, we show that with a properly chosen number
of nearest neighbors $k_D$ in the bagged k-distance, the sub-sample size $s$, the bagging
rounds $B$, and the number of nearest neighbors $k_L$ for the localized level sets, BDMBC can
achieve optimal convergence rates for mode estimation. It turns out that with a relatively
small $B$, the sub-sample size $s$ can be much smaller than the number of training data $n$ at
each bagging round, and the number of nearest neighbors $k_D$ can be reduced simultaneously.
Moreover, we establish optimal convergence results for the level set estimation of the PLLS
in terms of Hausdorff distance, which reveals that BDMBC can find localized level sets for
varying densities and thus enjoys local adaptivity. On the practical side, we conduct nu-
merical experiments to empirically verify the effectiveness of BDMBC for mode estimation
and level set estimation, which demonstrates the promising accuracy and efficiency of our
proposed algorithm.
1 Introduction
In the field of density-based clustering, many algorithms share the common assumption that all
clusters have similar density levels. In detail, these algorithms employ a global density threshold
to define the high-density regions and categorize them as clusters. Due to its algorithmic
simplicity, this paradigm, also called single-level density-based clustering, attracted much
attention in the early stage of clustering research [22,61,30,34,36]. However, with the rapid
development of information technology, this assumption becomes hard to maintain as the
number of clusters and the size of data keep growing. Experiments have also shown that
well-performing single-level clustering algorithms are incompetent on datasets whose clusters
have varying densities [73,51]. Consequently, a more general setting for density-based clustering
called multi-level density clustering has come into vogue [73,52,14] and is applied in various
subjects including computer vision [57,11,31], medicine and biometrics [46,60], etc.
To solve the multi-level density clustering problem, a natural idea is to extend the solutions
proposed for single-level clustering problems. Some researchers therefore argue that increasing
the number of thresholds can help seek clusters with different densities, and propose the
paradigm called hierarchical density-based clustering. The term hierarchical means that the
algorithm follows either an agglomerative (bottom-up) or a divisive (top-down) order to move the
threshold, estimates the clusters at each threshold, and finally grows a clustering tree based on
the clustering results. By carefully selecting the nodes in the clustering tree, hierarchical
methods can obtain promising results in multi-level density situations [51,48,2]. For example,
[51] takes advantage of DBSCAN and proposes an automatic framework called HDBSCAN
to decide which thresholds are better according to some pre-determined informational metric;
in addition, [48] also studies how to choose the optimal clustering results from a clustering tree.
Nevertheless, hierarchical methods are criticized for the heavy computational cost of growing
the clustering tree. Moreover, automatically determining the clusters from a cluster tree
remains an unsolved problem.
Therefore, to improve the computational efficiency and directly obtain the clustering result,
another line of research aims at finding a suitable transformation of the current density measure
that balances the density levels of all clusters to a similar level [16,37,73,72,53]. To be specific,
such algorithms care little about the absolute density value of samples. Instead, they attach
more importance to the relative information of samples in a local area. For example, [16]
proposes the mean shift method to find the density hill of each cluster by iteratively searching
the center of mass from several randomly chosen initial points. In [73], the estimated probability
density function is transformed into a new measure called the density ratio to help balance the
density measure. Since these methods amount to seeking the bumps or hills of the distribution,
they are also called mode-based clustering algorithms.
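
For illustration, the following is a minimal sketch of mean shift with a Gaussian kernel. It illustrates the general idea of iteratively moving to a local center of mass rather than the exact procedure of [16]; the function name `mean_shift_mode`, the bandwidth `h`, and the initial point `x0` are illustrative choices.

```python
import numpy as np

def mean_shift_mode(X, x0, h, n_iter=100, tol=1e-6):
    """Gaussian-kernel mean shift: starting from x0, repeatedly move to
    the kernel-weighted center of mass of the sample; the fixed point
    is a density hill (mode)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        # kernel weights of all samples relative to the current point
        w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * h ** 2))
        x_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x
```

Running this iteration from several initial points and merging the fixed points that coincide yields one representative mode per cluster.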
Although mode-based methods largely improve the computational efficiency of multi-level
density clustering, they still suffer from two inevitable shortcomings. Firstly, many
mode-based algorithms require the estimation of the probability density function, e.g. [73].
However, density estimation suffers from the curse of dimensionality, which means that
estimating a satisfactory density function becomes much harder and requires more training
samples when the dimension of the input variables is high. Hence, it is hard to apply
mode-based clustering algorithms to high-dimensional datasets. Secondly, the computational
efficiency of mode-based algorithms may not be as satisfactory as expected. On the one hand,
mode-based algorithms require much training time when the sample size is large. On the other
hand, the procedure of searching for an optimal combination of parameters can be even more
tiresome.
Against this background, in this paper we propose an ensemble learning algorithm called
bagged k-distance for mode-based clustering (BDMBC) to solve multi-level density clustering
problems. To be specific, we first introduce a new measurement called the probability of
localized level sets (PLLS) to deal with multi-level density problems. The PLLS represents
the local rank of the density, which makes it possible to employ a global threshold to recognize
multi-level density clusters. Secondly, to resist the curse of dimensionality in density
estimation, we introduce the k-distance as an implicit representation of the density, which is
then plugged into the localized level set estimation. As a distance-based measure, the
k-distance has strong resistance to the curse of dimensionality, and hence enables BDMBC to
deal with high-dimensional datasets. Last but not least, we further employ the bagging
technique to enhance the computational efficiency of calculating the k-distance. In particular,
when dealing with large-scale datasets, the bagging technique can accelerate the algorithm
with a small sampling ratio and thus uses a much smaller training dataset in each bagging
iteration. Since the size of the training dataset in each iteration is largely decreased by
sub-sampling, the search grid for sample-size-based hyper-parameters can also be simplified,
saving practitioners from tedious hyper-parameter tuning.
The theoretical and experimental contributions of this paper are summarized as follows:
(i) From the theoretical perspective, we first conduct a learning theory analysis of the bagged
k-distance by introducing the hypothetical density estimation. Under the Hölder smoothness of
the density function, with properly chosen $k$, we establish optimal convergence rates of the
hypothetical density estimation in terms of the $L_\infty$-norm. It is worth pointing out that
our finite sample results demonstrate the explicit relationship among the bagging rounds $B$,
the number of nearest neighbors $k$, and the sub-sample size $s$.
Then we propose a novel mode estimation built from the probability of localized level sets.
Based on the convergence rates of the hypothetical density estimation, we show that, under
mild assumptions on the modes, optimal convergence rates for mode estimation are attained
with properly chosen parameters. We show that the bagging technique helps to reduce the
subsample size and the number of neighbors simultaneously for mode estimation and thus
increases the computational efficiency.
Moreover, under mild assumptions on the density function, we establish convergence results
of level set estimation for the probability of localized level sets in terms of Hausdorff distance.
Compared to previous works on level set estimation in clustering that focus on a single threshold,
our results capture level sets for varying densities. This demonstrates the local adaptivity of
our BDMBC in multi-level density clustering.
(ii) From the experimental perspective, we conduct numerical experiments to illustrate the
properties of our proposed BDMBC. Firstly, we verify our theoretical results about mode
estimation by conducting mode estimation experiments on synthetic datasets. We demonstrate
that our BDMBC can detect all modes successfully and thus recover all mode-based clusters.
Secondly, we verify our theoretical results about level set estimation by conducting numerical
comparisons with other competing methods. We show the promising accuracy and efficiency of
our proposed algorithm compared with other density-based, cluster-tree-based, and mode-based
methods. Thirdly, we conduct a parameter analysis of our proposed BDMBC and empirically
demonstrate that with a relatively small subsample ratio, bagging can significantly narrow the
search grid of parameters. Moreover, we compare the bagging and non-bagging versions of
BDMBC on large-scale synthetic datasets and verify that bagging can substantially shorten
the computation time without sacrificing accuracy.
The remainder of this paper is organized as follows. Section 2 is a warm-up section introducing
some notations and the new measurement, the probability of localized level sets (PLLS); there
we also propose the bagged k-distance for mode-based clustering (BDMBC). In Section 3, we
first present our main results on the convergence rates for mode estimation and level set
estimation, and then provide some comments and discussions concerning these results. In
Section 4, we conduct the error analysis for the bagged k-distance and calculate its
computational complexity. Section 5 presents experimental results on both real and synthetic
data, together with scalability experiments that show the computational efficiency of our
algorithm. In Section 6, we provide the detailed proofs. Finally, we summarize our work in
Section 7.
2 Methodology
In this section, we briefly recall some necessary notations and algorithms as preliminaries in
Section 2.1. Then, to avoid the drawback of classical density-based clustering methods, we
propose a new measurement, the probability of localized level sets (PLLS) in Section 2.2, introduce
the bagged k-distance in Section 2.3, and construct the corresponding density-based clustering
algorithm called bagged k-distance for mode-based clustering (BDMBC ) in Section 2.4.
2.1 Preliminaries
First, we introduce some basic notations that will be frequently used in this paper. We use the
notations $a \vee b := \max\{a, b\}$ and $a \wedge b := \min\{a, b\}$. For any $x \in \mathbb{R}$, let $\lfloor x \rfloor$ denote the largest
integer less than or equal to $x$ and $\lceil x \rceil$ the smallest integer greater than or equal to $x$. Recall
that for $1 \le p < \infty$, the $\ell_p$-norm is defined as $\|x\|_p := (x_1^p + \cdots + x_d^p)^{1/p}$, and the $\ell_\infty$-norm is
defined as $\|x\|_\infty := \max_{i=1,\ldots,d} |x_i|$. Let $(\Omega, \mathcal{A}, \mu)$ be a probability space. We denote by $L_p(\mu)$ the
space of (equivalence classes of) measurable functions $g : \Omega \to \mathbb{R}$ with finite $L_p$-norm $\|g\|_p$. For
any $x \in \mathbb{R}^d$ and $r > 0$, denote $B(x, r) := \{x' \in \mathbb{R}^d : \|x' - x\|_2 \le r\}$ as the closed ball centered at
$x$ with radius $r$. For a set $A \subset \mathbb{R}^d$, the cardinality of $A$ is denoted by $\#(A)$ and the indicator
function on $A$ is denoted by $\mathbf{1}_A$ or $\mathbf{1}\{A\}$.

In the sequel, the notations $a_n \lesssim b_n$ and $a_n = O(b_n)$ denote that there exists some positive
constant $c \in (0,1)$ such that $a_n \le c^{-1} b_n$, and $a_n \gtrsim b_n$ denotes that there exists some positive
constant $c \in (0,1)$ such that $a_n \ge c b_n$. Moreover, the notation $a_n \asymp b_n$ means that both
$a_n \lesssim b_n$ and $b_n \lesssim a_n$ hold simultaneously. Let $\mathrm{P}$ be a probability distribution on $\mathbb{R}^d$ with the
underlying density $f$ which has a compact support $\mathcal{X} \subset [-R, R]^d$ for some $R > 0$. Suppose that
the data $D_n = (X_1, \ldots, X_n) \in \mathcal{X}^n$ are drawn from $\mathrm{P}$ in an i.i.d. fashion. With a slight abuse
of notation, in this paper $c$, $c'$, $C$ will be used interchangeably for positive constants whose
values may vary across different lemmas, propositions, theorems, and corollaries.
2.2 Probability of Localized Level Sets
One of the main drawbacks of density-based clustering based on density estimation is that it
cannot find all clusters with varying densities using a global threshold, see, e.g., [12,72,13]. Here
we give a simple univariate example of this phenomenon in Figure 1. For the univariate trimodal
density, there are three different clusters that are visually identifiable, yet none of the level sets
of the density has three connected components. In fact, if the level is chosen too low, the two
clusters of high density are merged into a single cluster; if the density level is chosen too
high, the cluster exhibiting a lower density is lost. Clearly, in such a case, the clusters
derived from a single density level cannot completely describe the inherent clustering structure
of the data set.
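
This failure can be checked numerically. The following sketch builds a trimodal Gaussian mixture of the kind depicted in Figure 1 (the component weights and locations are our own choice, not the ones used for the figure) and scans all thresholds: no single level set ever has three connected components.

```python
import numpy as np
from scipy.stats import norm

# Trimodal density: two high modes close together and one low mode.
def f(x):
    return (0.45 * norm.pdf(x, loc=0.0, scale=0.5)
            + 0.45 * norm.pdf(x, loc=1.5, scale=0.5)
            + 0.10 * norm.pdf(x, loc=5.0, scale=0.5))

xs = np.linspace(-3.0, 8.0, 4000)
fx = f(xs)

def n_components(level):
    """Number of connected components of {x : f(x) >= level},
    counted as runs of consecutive grid points above the level."""
    above = np.concatenate(([False], fx >= level)).astype(int)
    return int(np.sum(np.diff(above) == 1))

levels = np.linspace(1e-3, fx.max(), 500)
print(max(n_components(t) for t in levels))
# prints 2: every threshold either merges the two high-density
# clusters or misses the low-density one, never separating all three
```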
To deal with this issue, we propose a local measurement named the probability of localized
level sets (PLLS) to implement density-based clustering.

Definition 1 (Probability of Localized Level Sets). Let $x \in \mathcal{X}$ and let $\eta(x) > 0$ be the local radius
parameter. Given the true density function $f : \mathcal{X} \to \mathbb{R}$, the probability of localized level sets
(PLLS) is defined by
$$p_\eta(x) = \mathrm{P}\bigl(f(y) \le f(x) \,\big|\, y \in B(x, \eta(x))\bigr) = \frac{\mathrm{P}\bigl(f(y) \le f(x),\, y \in B(x, \eta(x))\bigr)}{\mathrm{P}\bigl(y \in B(x, \eta(x))\bigr)}. \tag{1}$$
Figure 1: Univariate trimodal density for which it is not possible to capture its whole cluster structure
using a global threshold.
Note that the PLLS is the conditional probability of the event that the density at the
instance is larger than the density in its neighborhood. To explain the advantages of the PLLS
over the original probability density function for clustering, we point out two critical observations
from (1). On the one hand, if $x$ is a mode of $f$, then $f(y) \le f(x)$ for all $y \in B(x, \eta(x))$, which
yields $p_\eta(x) = 1$. On the other hand, if $f(x)$ is a local minimum of the density, i.e., if
$f(y) \ge f(x)$ for all $y \in B(x, \eta(x))$, then we have $p_\eta(x) = 0$. Therefore, unlike the probability
density function, the PLLS captures the relative position to the modes of the density $f$. As
a result, we can deal with the variation in density across different clusters and thus allow for a
single threshold to identify all the modes and the corresponding clusters simultaneously.
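
For intuition, a minimal plug-in sketch of the PLLS at the sample points is given below; it takes any density estimate evaluated at the samples (in BDMBC, the bagged k-distance introduced next plays this role). The function name `plls`, the argument `f_hat`, and the use of a constant radius `eta` (Definition 1 allows $\eta(x)$ to vary with $x$) are assumptions for illustration.

```python
import numpy as np

def plls(X, f_hat, eta):
    """Plug-in estimate of the PLLS in (1) at each sample point:
    the fraction of sample points inside B(x_i, eta) whose estimated
    density does not exceed the estimated density at x_i."""
    n = X.shape[0]
    p = np.empty(n)
    for i in range(n):
        # points of the sample falling in the ball B(x_i, eta);
        # always includes x_i itself, so the mean is well defined
        in_ball = np.linalg.norm(X - X[i], axis=1) <= eta
        p[i] = np.mean(f_hat[in_ball] <= f_hat[i])
    return p
```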
2.3 Bagged k-Distance
In this section, we introduce the bagged k-distance, which represents the density implicitly, for
the construction of mode-based clustering. For any $x \in \mathbb{R}^d$ and any subset $D \subset D_n$, we denote
$X_{(k)}(x) := X_{(k)}(x; D)$ as the $k$-th nearest neighbor of $x$ in $D$. Then we denote $R_k(x; D)$ as
the distance between $x$ and $X_{(k)}(x; D)$, termed the $k$-distance of $x$ in $D$. Specifically, we let
$R_k(x) := R_k(x; D_n)$.
We first recall the $k$-nearest neighbor ($k$-NN) density estimator. To be specific, denote
$\mu(B(x, r))$ as the volume (with respect to the Lebesgue measure) of the ball $B(x, r)$. Then the
$k$-NN density estimator [8, Definition 3.1] is defined by
$$f_k(x) = \frac{k/n}{\mu(B(x, R_k(x)))} = \frac{k/n}{V_d R_k(x)^d}, \tag{2}$$
where $V_d := \pi^{d/2} / \Gamma(d/2 + 1)$ is the volume of the unit ball.
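
For concreteness, (2) can be evaluated at the sample points with a k-d tree, as in the following sketch; the helper name `knn_density` and the convention of excluding each sample point from its own neighbors are our choices, not prescribed by [8].

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gamma

def knn_density(X, k):
    """k-NN density estimate (2) at the sample points themselves."""
    n, d = X.shape
    tree = cKDTree(X)
    # query k+1 neighbors: the nearest neighbor of a sample point
    # is the point itself at distance 0, which we exclude
    dist, _ = tree.query(X, k=k + 1)
    R_k = dist[:, -1]                             # k-distance R_k(x)
    V_d = np.pi ** (d / 2) / gamma(d / 2 + 1)     # unit-ball volume
    return (k / n) / (V_d * R_k ** d)
```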
However, in practice, a major problem is numerical: when computing the k-NN density
estimate for high-dimensional data, samples in a finite dataset can be distributed quite
sparsely. As a consequence, the target density can be extremely small in areas of the input space.
In this case, $R_k(x)^d$ in (2) can be extremely large when $d$ is big, leading to arithmetic overflow
during the computation. Therefore, direct density estimation is problematic for
high-dimensional data. On the other hand, for large-scale datasets, the computational burden
of searching for $k$-nearest neighbors can be heavy. To deal with these two problems, in this
work, we adopt the bagging technique to reduce both the number of nearest neighbors to search
and the size of the dataset searched at each bagging round.
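
To fix ideas before the formal construction, the following sketch implements one natural reading of the bagged k-distance: the $k_D$-distance averaged over $B$ subsamples of size $s$ drawn without replacement. This aggregation, together with the function name `bagged_k_distance`, is an assumption for illustration; the precise definition follows in the remainder of this section.

```python
import numpy as np
from scipy.spatial import cKDTree

def bagged_k_distance(X, query, k_D, s, B, seed=0):
    """Sketch of a bagged k-distance: average the k_D-distance over
    B subsamples of size s drawn without replacement."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    R = np.zeros(query.shape[0])
    for _ in range(B):
        idx = rng.choice(n, size=s, replace=False)        # one bagging round
        dist, _ = cKDTree(X[idx]).query(query, k=k_D)
        R += dist[:, -1] if k_D > 1 else dist             # k_D-distance in subsample
    return R / B
```

Because each round searches only $s \ll n$ points, both the neighbor search and the memory footprint per round shrink, which is the computational benefit discussed above.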