
presents our proposed algorithm in detail. The results obtained
from the four experiments, along with descriptions of the datasets,
are presented in Section IV. The paper ends with the conclusion in Section V.
II. RELATED WORK
Several methods have been developed to find a better initialization
for the K-means clustering algorithm. A novel approach to selecting
the initial cluster centroids was presented in [3]. A set of J
sub-samples is drawn from the entire dataset, and K-means is applied
to each sub-sample, yielding K x J candidate centroids in total.
K-means is then applied to these candidates, using the K centroids of
each sub-sample in turn as the initialization, and the solution with
the minimum error provides the initial centroids for the whole
dataset. This method proved useful for large datasets.
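The two-stage refinement described above can be expressed compactly. The following is a minimal sketch, not the authors' code; the function name, the number of sub-samples J, and the sub-sample fraction are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def refined_centroids(X, k, J=10, frac=0.1, seed=0):
    """Sketch of the sub-sample refinement scheme of [3] (illustrative)."""
    rng = np.random.default_rng(seed)
    # Stage 1: K-means on J random sub-samples -> J sets of K candidates.
    candidate_sets = []
    for _ in range(J):
        sub = X[rng.choice(len(X), size=int(frac * len(X)), replace=False)]
        candidate_sets.append(
            KMeans(n_clusters=k, n_init=1).fit(sub).cluster_centers_)
    candidates = np.vstack(candidate_sets)          # K x J candidates
    # Stage 2: cluster the candidates, trying each sub-sample's centroids
    # as the initialization, and keep the solution with minimum error.
    best_inertia, best_centers = np.inf, None
    for init in candidate_sets:
        km = KMeans(n_clusters=k, init=init, n_init=1).fit(candidates)
        if km.inertia_ < best_inertia:
            best_inertia, best_centers = km.inertia_, km.cluster_centers_
    return best_centers   # initial centroids for K-means on the full data
```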
A fast global k-means algorithm was proposed in [4], where a global
search technique adds one cluster center at a time. The process is
repeated until all K centroids are found, and principal component
analysis (PCA) is used for dimensionality reduction of the various
real-world datasets.
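A minimal sketch of this incremental scheme follows; it selects each new center as the point with the largest guaranteed reduction in clustering error, which is the fast global k-means selection rule, though the implementation details here are assumptions rather than the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans

def fast_global_kmeans_centers(X, k):
    """Sketch of incremental center addition in the spirit of [4]."""
    centers = X.mean(axis=0, keepdims=True)  # the optimal single center
    while len(centers) < k:
        # Squared distance of every sample to its nearest current center.
        d2 = np.min(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # Guaranteed error reduction if sample n became the next center
        # (O(n^2) memory; fine for a sketch, not for large datasets).
        pair2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        gain = np.maximum(d2[None, :] - pair2, 0).sum(axis=1)
        init = np.vstack([centers, X[np.argmax(gain)]])
        centers = KMeans(n_clusters=len(init), init=init,
                         n_init=1).fit(X).cluster_centers_
    return centers
```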
A very simple initial centroid selection algorithm was developed
in [5]. The first centroid is selected randomly from all the samples.
Each subsequent centroid is the sample with the maximum distance from
its nearest previously selected centroid. This process continues until
the number of initial centroids reaches K.
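Since [5] specifies the procedure completely, a short sketch is straightforward (the function name is ours):

```python
import numpy as np

def maximin_centroids(X, k, seed=0):
    """Sketch of the farthest-point seeding described in [5]."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]        # first centroid: random
    while len(centroids) < k:
        # Distance of every sample to its nearest chosen centroid; the
        # sample maximizing this distance becomes the next centroid.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[np.argmax(d)])
    return np.array(centroids)
```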
Another method for selecting initial centroids was proposed in [6].
All the pairwise Euclidean distances between the samples are computed,
and the two samples with the shortest distance are selected. These two
points are then removed from the main sample set along with their n
nearest neighbors, forming a new set. Repeating this process K times
creates K sets, and the mean of each set is used as an initial
centroid for the K-means algorithm.
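A minimal sketch of this pair-and-neighbors scheme is given below; how the n neighbors of the pair are measured is not stated in the text, so distance to the nearer member of the pair is an assumption.

```python
import numpy as np

def closest_pair_centroids(X, k, n_neighbors):
    """Sketch of the closest-pair seeding described in [6] (illustrative)."""
    remaining, centroids = X.copy(), []
    for _ in range(k):
        d = np.linalg.norm(remaining[:, None] - remaining[None, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        i, j = np.unravel_index(np.argmin(d), d.shape)   # closest pair
        # Assumed rule: a sample's distance to the pair is the distance
        # to the nearer of its two members.
        pair_dist = np.minimum(d[i], d[j])
        pair_dist[[i, j]] = np.inf                       # exclude the pair
        group = np.concatenate([[i, j], np.argsort(pair_dist)[:n_neighbors]])
        centroids.append(remaining[group].mean(axis=0))  # set mean
        remaining = np.delete(remaining, group, axis=0)
    return np.array(centroids)
```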
In [7], the authors proposed the Cluster Center Initialization
Algorithm (CCIA) to select initial centroids. In this method, all the
attributes are first normalized and one attribute is selected. The
mean, standard deviation, and percentile values of the normal curve
are used to find K centers for that attribute. K-means is then applied
to that attribute only, and each sample is labeled with a cluster
number. Repeating this for the remaining attributes produces a pattern
for each sample. The unique patterns are then identified, and the
samples corresponding to each such pattern are assigned to a single
cluster.
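The per-attribute seeding can be sketched as follows. The evenly spaced normal-curve percentiles used here are an assumption, as is the handling of duplicate patterns; the cited paper additionally merges similar clusters when the number of unique patterns exceeds K.

```python
import numpy as np
from scipy.stats import norm
from sklearn.cluster import KMeans

def ccia_patterns(X, k):
    """Sketch of CCIA-style seeding in the spirit of [7] (illustrative)."""
    n, d = X.shape
    labels = np.empty((n, d), dtype=int)
    for j in range(d):
        col = X[:, j]
        mu, sigma = col.mean(), col.std()
        # K seed values from percentiles of the fitted normal curve
        # (assumed: the (2s - 1) / 2K quantiles, s = 1..K).
        qs = (2 * np.arange(1, k + 1) - 1) / (2 * k)
        seeds = (mu + norm.ppf(qs) * sigma).reshape(-1, 1)
        km = KMeans(n_clusters=k, init=seeds, n_init=1)
        labels[:, j] = km.fit_predict(col.reshape(-1, 1))
    # Samples sharing the same tuple of per-attribute labels (a "pattern")
    # form one candidate cluster; its mean is a candidate centroid.
    patterns, inverse = np.unique(labels, axis=0, return_inverse=True)
    return np.array([X[inverse == p].mean(axis=0)
                     for p in range(len(patterns))])
```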
Another approach to initial centroid selection was presented in [8],
where the whole dataset is considered a single cell and the axis with
the highest variance is selected as the principal axis. The cell is
divided in two with the help of a cutting plane, and the error is
calculated before and after the division. Cells are divided in this
way until the number of cells equals the number of clusters, and the
mean of each cell is then used as an initial centroid for the K-means
algorithm.
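A simplified sketch of this idea follows; for brevity it always splits the highest-variance cell at the mean of its highest-variance axis, rather than choosing the cutting plane by comparing the error before and after each division as the paper does.

```python
import numpy as np

def cell_split_centroids(X, k):
    """Simplified sketch of variance-based cell splitting, after [8]."""
    cells = [X]
    while len(cells) < k:
        # Split the cell whose points have the largest total variance.
        idx = int(np.argmax([c.var(axis=0).sum() for c in cells]))
        cell = cells.pop(idx)
        axis = int(np.argmax(cell.var(axis=0)))   # highest-variance axis
        cut = cell[:, axis].mean()                # assumed cutting plane
        halves = [cell[cell[:, axis] <= cut], cell[cell[:, axis] > cut]]
        cells.extend(h for h in halves if len(h) > 0)
    return np.array([c.mean(axis=0) for c in cells])
```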
A novel approach for selecting the number of clusters was proposed
in [9]. The author developed a new cost function as well as a new
function for assigning each sample to a cluster. Simple K-means is
applied for different values of K, and the value corresponding to the
global minimum of the cost function is taken as the actual number of
clusters. Each sample is assigned to the cluster for which the
corresponding function value is minimum.
The authors in [10] presented another technique for initial centroid
selection, where all pairwise distances between samples are calculated
first. The mean distance and the density of each sample are then
computed using a density function developed by the authors. The sample
with the maximum density is selected as the first centroid, and some
neighbors of this sample are removed from the stored density array.
This process is repeated K times.
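The paper's density function is not reproduced here, so the sketch below substitutes a simple stand-in (the number of samples within the mean pairwise distance); everything except the select-then-suppress loop is therefore an assumption.

```python
import numpy as np

def density_centroids(X, k, n_neighbors):
    """Sketch of density-based seeding in the spirit of [10]."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    radius = d[np.triu_indices(len(X), k=1)].mean()   # mean pairwise distance
    density = (d < radius).sum(axis=1).astype(float)  # stand-in density
    centroids = []
    for _ in range(k):
        c = int(np.argmax(density))          # densest remaining sample
        centroids.append(X[c])
        # Suppress the chosen sample and its nearest neighbors so the
        # next centroid comes from a different region.
        density[np.argsort(d[c])[:n_neighbors + 1]] = -np.inf
    return np.array(centroids)
```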
A simple technique based on a weighted average of the attributes was
developed in [11]. A weighted average of the attribute values is
computed for each sample, and the samples are sorted by this value.
The sorted list is then divided into K sets, and the mean of each set
becomes an initial centroid for the simple K-means algorithm.
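This scheme reduces to a few lines; since the attribute weights are not specified in the text, uniform weights are assumed below.

```python
import numpy as np

def weighted_average_centroids(X, k, weights=None):
    """Sketch of the weighted-average seeding described in [11]."""
    w = np.ones(X.shape[1]) if weights is None else np.asarray(weights)
    scores = X @ w / w.sum()            # weighted average per sample
    order = np.argsort(scores)          # sort samples by that value
    # Divide the sorted list into K sets; each set's mean is a centroid.
    return np.array([X[part].mean(axis=0)
                     for part in np.array_split(order, k)])
```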
The authors in [12] proposed another initial centroid selection
method. First, all pairwise distances between samples are calculated.
The sum of distances to all other points is then computed for each
sample, and the samples are sorted by this sum from maximum to
minimum. The sample with the maximum sum of distances is the first
centroid. The next N/K samples are then discarded, and the first
sample after them is selected as the second initial centroid. This
process is repeated K times.
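A minimal sketch, assuming the picks land at every N/K-th position of the sorted list:

```python
import numpy as np

def distance_sum_centroids(X, k):
    """Sketch of the sum-of-distances seeding described in [12]."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    order = np.argsort(-d.sum(axis=1))  # max total distance first
    step = len(X) // k                  # N/K samples skipped between picks
    return X[order[::step][:k]]
```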
Another solution for initial centroid selection was presented in [13].
This algorithm is based on the farthest samples. The mean of the
dataset is calculated, and the sample farthest from the mean is
selected as the first initial centroid. The sample with the maximum
distance from the previously selected centroids becomes the second
centroid, and so on.
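This differs from the scheme in [5] only in the first pick, so the sketch is nearly identical; taking the distance to the nearest previously selected centroid is our assumption (the paper may use the sum instead).

```python
import numpy as np

def farthest_sample_centroids(X, k):
    """Sketch of the farthest-sample seeding described in [13]."""
    # First centroid: the sample farthest from the dataset mean.
    centroids = [X[np.argmax(np.linalg.norm(X - X.mean(axis=0), axis=1))]]
    while len(centroids) < k:
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[np.argmax(d)])   # farthest from chosen centroids
    return np.array(centroids)
```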
III. METHODOLOGY
This section explains the proposed algorithm for computing the initial
centroids for K-means. To illustrate the procedure, we generated a
small dataset of random samples, namely synthetic dataset 1, which
consists of two attributes and five clusters. There are 35 samples in
total, and we named the attributes variable1 and variable2. The
dataset is shown in Fig. 1.
Fig. 1. Synthetic dataset 1