An enhanced method of initial cluster center
selection for K-means algorithm
1st Zillur Rahman
Department of Electrical and Electronic Engineering
Chittagong University of Engineering and Technology
Chittagong, Bangladesh
u1502056@student.cuet.ac.bd
2nd Md. Sabir Hossain
Department of Computer Science and Engineering
Chittagong University of Engineering and Technology
Chittagong, Bangladesh
sabir.cse@cuet.ac.bd
3rd Mohammad Hasan
Department of Computer Science and Engineering
Chittagong University of Engineering and Technology
Chittagong, Bangladesh
hasancse.cuet13@gmail.com
4th Ahmed Imteaj
Department of Computer Science and Engineering
Chittagong University of Engineering and Technology
Chittagong, Bangladesh
aimteaj@gmail.com
Abstract—Clustering is a widely used technique for discovering patterns in a dataset and is applied in many kinds of analysis. K-means, the most popular and simplest clustering algorithm, is conventionally initialized with randomly chosen centroids and may therefore become trapped in local minima. In this paper, we propose a novel approach to improve initial cluster center selection for the K-means algorithm. The approach rests on the observation that, since the final clusters form separated groups in feature space, the initial centroids should be well separated from each other. The Convex Hull algorithm is used to compute the first two centroids, and the remaining ones are selected according to their distance from the previously chosen centers. To ensure that one center is selected per cluster, we use the nearest-neighbor technique. To assess the robustness of the proposed algorithm, we evaluate it on several real-world datasets. We obtain clustering errors of only 7.33%, 7.90%, and 0% on the Iris, Letter, and Ruspini data respectively, outperforming existing systems. The results also indicate that the proposed method accelerates computation relative to conventional K-means when the number of clusters is greater than 2.
Index Terms—Clustering, Initial Centroid, K-means, Error
Percentage, Rand Index
I. INTRODUCTION
Clustering is a technique for grouping data according to their features such that samples within the same cluster are maximally similar and samples in different clusters are minimally similar. It is used in numerous applications, including data analysis, image segmentation, vector quantization, and pattern recognition. There are several families of clustering algorithms, such as hierarchical, partitioning-based, grid-based, and density-based algorithms [1]. Among them, partitioning-based algorithms are widely used.
K-means is a well-known partitioning-based clustering algorithm that groups the data samples into K disjoint clusters [2]. If the dataset contains N samples $x_1, x_2, \ldots, x_N$ and there are K clusters $C_1, C_2, \ldots, C_K$, then the cost function used in K-means is defined by (1), where $x_j$ is a data sample of cluster $C_i$ and $c_i$ is the centroid of that cluster:

$$J = \sum_{i=1}^{K} \sum_{x_j \in C_i} d(x_j, c_i) \tag{1}$$
The centroid of each cluster is the center of mass of that cluster, computed using (2), where $|C_i|$ is the number of samples in cluster $C_i$ and $0 < |C_i| < N$:

$$c_i = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j \tag{2}$$
The simple K-means algorithm proceeds as follows (a minimal code sketch is given after the list):
1) Take K and the dataset as input and repeat steps 2-6 for a certain number of restarts.
2) Randomly select K samples from the dataset as the initial cluster centers and repeat steps 3-5 until convergence.
3) Calculate the Euclidean distance between each data sample and each centroid.
4) Assign each data sample to its nearest cluster.
5) Recompute the centroid of each cluster using (2).
6) Compute the error using (1) and keep the initial centroids corresponding to the minimum error.
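For concreteness, here is a minimal NumPy sketch of the procedure above. The function names, the restart count, and the handling of empty clusters are illustrative choices, not taken from the paper.

```python
import numpy as np

def kmeans_once(X, K, max_iters=100, seed=None):
    """One restart: steps 2-6 (random initialization, Lloyd iterations, cost)."""
    rng = np.random.default_rng(seed)
    # Step 2: choose K distinct samples as the initial centroids.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # Steps 3-4: Euclidean distances, then nearest-cluster assignment.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: recompute each centroid as the cluster mean, eq. (2);
        # an empty cluster keeps its previous centroid in this sketch.
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(K)
        ])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    # Step 6: cost J of eq. (1), the summed distance of samples to their centroid.
    cost = np.take_along_axis(dists, labels[:, None], axis=1).sum()
    return centroids, labels, cost

def kmeans_best_of(X, K, restarts=10):
    """Step 1: repeat with fresh random seeds and keep the lowest-cost run."""
    return min((kmeans_once(X, K, seed=r) for r in range(restarts)),
               key=lambda run: run[2])
```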
From the above algorithm, it is clear that the error depends on the randomly chosen initial centroids. Each random selection results in a different error, and finding the best solution may require many iterations and considerable computational time. This is the biggest limitation of the K-means algorithm.
In this paper, we propose a simple yet efficient approach to selecting the initial cluster centers that yields very fast convergence along with excellent clustering performance on four real-world datasets.
The rest of the paper is organized as follows. Section II briefly reviews existing related systems, and Section III
presents our proposed algorithm in detail. The results of the four experiments, along with dataset descriptions, are given in Section IV. The paper ends with the conclusion in Section V.
II. RELATED WORKS
Several systems have been developed to find a better initialization for the K-means clustering algorithm. A novel approach to selecting the initial centroids of the clusters was presented in [3]. A set of sub-samples is drawn from the entire dataset, and K-means is applied to each sub-sample, yielding K × J centroids from the J sub-samples. K-means is then applied to these pooled centroids, using the K centroids of each sub-sample in turn as the initialization. The centroids that give the minimum error are used as the initial centroids for the whole dataset. This method proved especially useful for large datasets.
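A hedged sketch of this sub-sample refinement scheme follows, using scikit-learn's KMeans for the inner runs; the number of sub-samples J, the sub-sample size, and the function name are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def refined_init(X, K, J=10, subsample_size=500, seed=0):
    """Cluster J sub-samples, pool the K*J centroids, then re-cluster the
    pool seeded with each sub-sample's centroids; keep the lowest-error run."""
    rng = np.random.default_rng(seed)
    sub_centroids = []
    for _ in range(J):
        # Draw a sub-sample and run K-means on it.
        idx = rng.choice(len(X), size=min(subsample_size, len(X)), replace=False)
        km = KMeans(n_clusters=K, n_init=1, random_state=seed).fit(X[idx])
        sub_centroids.append(km.cluster_centers_)
    pooled = np.vstack(sub_centroids)                     # (K*J, d) centroids
    # Try each sub-sample's centroids as the initialization on the pooled set.
    best = min((KMeans(n_clusters=K, init=c, n_init=1).fit(pooled)
                for c in sub_centroids),
               key=lambda km: km.inertia_)
    return best.cluster_centers_  # seeds for K-means on the full dataset
```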
A fast global k-means algorithm was proposed in [4], where a global search technique adds one cluster center at a time. The process is repeated until K centroids are found, and principal component analysis (PCA) is used for dimensionality reduction on several real-world datasets.
A very simple initial centroid selection algorithm was developed in [5]. The first centroid is selected randomly from among all samples. Each subsequent centroid is the sample with the maximum distance from its nearest already-selected centroid, and the process continues until K initial centroids have been chosen.
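Because this maximin rule is so compact, a short sketch may help; Euclidean distance and the function name are assumptions here.

```python
import numpy as np

def maximin_init(X, K, seed=0):
    """First center random; each next center is the sample farthest
    from its nearest already-selected center."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    while len(centers) < K:
        # Distance from every sample to its nearest chosen center.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers)
```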
Another method for selecting initial centroids was proposed in [6]. All pairwise Euclidean distances between samples are computed, and the two samples with the shortest distance are selected. These two points, together with their n nearest neighbors, are then removed from the main sample set to form a new set. Repeating this process K times creates K sets, and the mean of each set is used as an initial centroid for the K-means algorithm.
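The following sketch is one possible reading of this procedure; the neighborhood size n is user-supplied, and the exact membership of each removed set (the closest pair plus its n nearest neighbors) is an assumption.

```python
import numpy as np
from scipy.spatial.distance import cdist

def nearest_pair_init(X, K, n):
    """Build K sets, each from the closest remaining pair plus the pair's
    n nearest neighbours; use each set's mean as an initial centroid.
    Assumes the dataset holds at least K * (n + 2) samples."""
    remaining = X.copy()
    centers = []
    for _ in range(K):
        D = cdist(remaining, remaining)
        np.fill_diagonal(D, np.inf)
        i, j = np.unravel_index(D.argmin(), D.shape)  # closest pair
        d_pair = np.minimum(D[i], D[j])               # distance to the pair
        d_pair[[i, j]] = np.inf                       # exclude the pair itself
        neigh = np.argsort(d_pair)[:n]                # its n nearest neighbours
        members = np.concatenate(([i, j], neigh))
        centers.append(remaining[members].mean(axis=0))
        remaining = np.delete(remaining, members, axis=0)  # remove the set
    return np.array(centers)
```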
In [7], the authors proposed the Cluster Center Initialization Algorithm (CCIA) to select initial centroids. In this system, all attributes are first normalized and one attribute is selected. The mean, standard deviation, and percentile values from the normal curve are used to find K centers for that attribute. K-means is then applied to that attribute alone, and each sample is labeled with a cluster number. Repeating this for the remaining attributes creates a pattern for each sample. The unique patterns are then identified, and the samples corresponding to each such pattern are assigned to a single cluster.
Another approach to initial centroid selection was presented in [8], where the whole dataset is considered a single cell and the axis with the highest variance is selected as the principal axis. The cell is divided in two by a cutting plane, and the error is calculated before and after the division. Cells are divided in this way until the number of cells equals the number of clusters, and the mean of each cell is then used as an initial centroid for the K-means algorithm.
A novel approach for selecting the number of clusters was proposed in [9]. The author developed a new cost function as well as a new function for assigning each sample to a cluster. Simple K-means is applied for different values of K, and the value corresponding to the global minimum of the cost function is taken as the actual number of clusters. Each sample is assigned to the cluster for which the corresponding function value is minimal.
The authors in [10] presented another technique for initial centroid selection in which all pairwise distances between samples are calculated first. The mean distance and the density of each sample are then computed using a density function developed by the authors. The sample with the maximum density is selected as the first centroid, and some neighbors of this sample are removed from the stored density array. This process is repeated K times.
A simple technique based on a weighted average of the attributes was developed in [11]. A weighted average of the attribute values is computed for each sample, and the samples are sorted by this value. The sorted list is then divided into K sets, and the mean of each set serves as an initial centroid for the simple K-means algorithm.
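A minimal sketch of this scheme; since the weights are not specified in this summary, equal weights are assumed for illustration.

```python
import numpy as np

def weighted_average_init(X, K, weights=None):
    """Score samples by a weighted average of attributes, sort by score,
    split into K parts, and use each part's mean as an initial centroid."""
    if weights is None:
        weights = np.full(X.shape[1], 1.0 / X.shape[1])  # assumed equal weights
    order = np.argsort(X @ weights)          # samples sorted by weighted score
    parts = np.array_split(X[order], K)      # K roughly equal sets
    return np.array([p.mean(axis=0) for p in parts])
```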
The authors in [12] proposed another initial centroid selection method. First, all pairwise distances between samples are calculated. The sum of distances is then computed for each point, and the points are sorted by this sum in descending order. The point with the maximum sum of distances becomes the first centroid; the next N/K samples are discarded, and the first sample after them is selected as the second initial centroid. This process is repeated K times.
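A sketch of this selection rule; the exact off-by-one handling of the N/K discard window is not specified above and is an assumption here.

```python
import numpy as np
from scipy.spatial.distance import cdist

def distance_sum_init(X, K):
    """Sort samples by their total distance to all others (descending),
    then pick every (N // K)-th sample starting from the top."""
    sums = cdist(X, X).sum(axis=1)     # sum of distances for each point
    order = np.argsort(-sums)          # maximum sum first
    step = len(X) // K                 # discard N/K samples between picks
    return X[order[::step][:K]]
```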
Another solution for initial centroid selection was presented in [13]. This algorithm is based on the farthest samples: the mean of the dataset is calculated, and the sample farthest from this mean is selected as the first initial centroid. The sample with the maximum distance from the previously selected centroids becomes the second centroid, and so on.
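A short sketch of this farthest-sample rule; whether later picks maximize the summed or the minimum distance to the chosen centers is not stated in this summary, so the summed distance is assumed.

```python
import numpy as np

def farthest_init(X, K):
    """First center: sample farthest from the dataset mean; each next
    center: sample with the largest summed distance to chosen centers."""
    mean = X.mean(axis=0)
    centers = [X[np.linalg.norm(X - mean, axis=1).argmax()]]
    while len(centers) < K:
        d = np.sum([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers)
```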
III. METHODOLOGY
This section explains the proposed algorithm to compute the
initial centroids for the K-means. To illustrate the procedures,
we generated a small dataset using random samples namely
synthetic dataset 1 that consists of two attributes and five
clusters. There is total 35 samples, and we named the attributes
variable1 and variable2. The dataset is shown in Fig. 1.
Fig. 1. Synthetic dataset 1
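Readers wishing to reproduce a comparable toy set can generate one as below; the cluster locations, spread, and seed are illustrative choices, not the paper's exact values.

```python
import numpy as np

rng = np.random.default_rng(42)
# Five illustrative cluster centers in the (variable1, variable2) plane.
centers = np.array([[2, 2], [8, 3], [5, 8], [1, 9], [9, 9]], dtype=float)
X = np.vstack([
    c + 0.5 * rng.standard_normal((7, 2))   # 7 samples per cluster, 35 total
    for c in centers
])
```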