To solve the multi-level density clustering problem, a natural idea is to extend the solutions proposed for single-level clustering problems. Some researchers therefore argue that using multiple thresholds can help detect clusters with different densities, and propose the paradigm called hierarchical density-based clustering. The term hierarchical means that the algorithm moves the threshold in either an agglomerative (bottom-up) or a divisive (top-down) order, estimates the clusters at each threshold, and finally grows a clustering tree from the clustering results. By carefully selecting nodes in the clustering tree, hierarchical methods can obtain promising results in multi-level density situations [51,48,2]. For example, [51] takes advantage of DBSCAN and proposes an automatic framework called HDBSCAN that decides which thresholds are preferable according to a pre-determined information metric; [48] also studies how to choose the optimal clustering result from a clustering tree. Nevertheless, hierarchical methods are criticized for the heavy computational cost of growing the clustering tree, and automatically determining the clusters from a clustering tree remains an unsolved problem.
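To make the threshold-sweeping idea concrete, the following is a minimal sketch (in Python with scikit-learn, on a hypothetical two-density toy dataset); it illustrates the paradigm only and is not the implementation of [51] or [48]:

```python
# Sketch: sweep the density threshold (eps in DBSCAN) and record the flat
# clustering at each level; a hierarchical method organizes these slices
# into a clustering tree and then selects suitable nodes from it.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Toy data with two clusters of very different densities (an assumption).
X = np.vstack([rng.normal(0.0, 0.1, size=(200, 2)),
               rng.normal(3.0, 0.8, size=(200, 2))])

tree_slices = {}
for eps in np.linspace(0.05, 1.0, 20):            # move the threshold step by step
    tree_slices[eps] = DBSCAN(eps=eps, min_samples=5).fit_predict(X)

# No single eps recovers both clusters well; the hierarchy keeps every slice
# so that good nodes can be chosen afterwards (e.g. by a stability score).
```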
Therefore, to improve computational efficiency and directly obtain the clustering result, another line of research aims at finding a suitable transformation of the current density measure that brings the density levels of all clusters to a similar scale [16,37,73,72,53]. To be specific, such algorithms care little about the absolute density values of samples; instead, they attach more importance to the relative information of samples in a local area. For example, [16] proposes the mean shift method, which finds the density hills of clusters by iteratively moving toward the local center of mass from several randomly chosen initial points. In [73], the estimated probability density function is transformed into a new measure called the density ratio to balance the density levels. Since these methods amount to seeking the bumps, or hills, of the distribution, they are also called mode-based clustering algorithms.
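As a concrete illustration of the mode-seeking idea, here is a hedged sketch using scikit-learn's mean shift implementation on assumed toy data (the bandwidth choice is illustrative, and this is not the exact procedure of [16] or [73]):

```python
# Sketch: mean shift climbs the estimated density hill from each point and
# assigns clusters by the mode each point converges to, so no absolute
# density threshold has to be fixed in advance.
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, size=(200, 2)),    # dense mode
               rng.normal(3.0, 0.8, size=(200, 2))])   # sparse mode

bandwidth = estimate_bandwidth(X, quantile=0.2)         # kernel width (assumed value)
labels = MeanShift(bandwidth=bandwidth).fit_predict(X)
print(np.unique(labels))                                # one label per recovered mode
```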
Although mode-based methods greatly improve the computational efficiency of multi-level density clustering, they still suffer from two inevitable shortcomings. Firstly, many mode-based algorithms require the estimation of the probability density function, e.g. [73]. Nevertheless, density estimation suffers from the curse of dimensionality, which means that estimating a satisfactory density function becomes much harder and requires more training samples when the dimension of the input variables is high. Hence, it is hard to apply mode-based clustering algorithms to high-dimensional datasets. Secondly, the computational efficiency of mode-based algorithms may not be as satisfactory as expected. On the one hand, mode-based algorithms require considerable training time when the sample size is large; on the other hand, searching for an optimal combination of parameters can be even more time-consuming.
Against this background, in this paper we propose an ensemble learning algorithm called bagged k-distance for mode-based clustering (BDMBC) to solve the multi-level density clustering problem. To be specific, we first introduce a new measure called the probability of localized level sets (PLLS) to deal with multi-level density problems. The PLLS represents the local rank of the density, which makes it possible to employ a global threshold to recognize multi-level density clusters. Secondly, to resist the curse of dimensionality in density estimation, we introduce the k-distance as the density function, which is then plugged into the localized level set estimation. As a distance-based measure, the k-distance has strong resistance to the curse of dimensionality and hence enables BDMBC to deal with high-dimensional datasets.
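To fix ideas, a minimal sketch of these two ingredients follows (in Python with scikit-learn); the function names, neighborhood sizes, and the exact form of the local rank are our illustrative assumptions, not the formal definitions of the PLLS and k-distance used by BDMBC:

```python
# Sketch: k-distance as a density surrogate (small k-distance = high density)
# and a local-rank score in the spirit of the PLLS described above.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distance(X, k):
    # Distance from each sample to its k-th nearest neighbor (excluding itself).
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    return dist[:, -1]

def local_rank(X, k, n_local):
    # Local rank of the density: fraction of the n_local nearest neighbors whose
    # k-distance is at least as large (i.e. whose density is no higher).
    kd = k_distance(X, k)
    _, idx = NearestNeighbors(n_neighbors=n_local).fit(X).kneighbors(X)
    return (kd[idx] >= kd[:, None]).mean(axis=1)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 0.1, size=(200, 2)),
               rng.normal(3.0, 0.8, size=(200, 2))])

scores = local_rank(X, k=10, n_local=50)
# A single global threshold on the local rank can flag high-level points in
# both the dense and the sparse cluster, which a raw density threshold cannot.
high_level_points = scores > 0.9
```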
Last but not least, we further employ the bagging technique to enhance the computational efficiency of calculating the k-distance. In particular, when dealing with large-scale datasets, bagging can accelerate the algorithm with a small sampling ratio, and thus uses a much smaller training dataset in each bagging iteration. Since the size of the training dataset in each iteration