
presents our proposed algorithm in detail. The results obtained
from the four experiments, along with descriptions of the datasets,
are presented in Section IV. The paper ends with the conclusion in Section V.
II. RELATED WORK
Several methods have been developed to find a better initialization
for the K-means clustering algorithm. A novel approach to selecting
the initial cluster centroids was presented in [3]. A set of J
sub-samples is drawn from the entire dataset, and K-means is applied
to each sub-sample, yielding K x J candidate centroids in total.
K-means is then applied to these candidates, using the K centroids of
each sub-sample in turn as the initialization, and the solution with
the minimum error provides the initial centroids for the whole
dataset. This method proved useful for large datasets.
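The two-stage refinement described above can be expressed compactly. The following is a minimal sketch, not the authors' code; the function name, the number of sub-samples J, and the sub-sample fraction are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def refined_centroids(X, k, J=10, frac=0.1, seed=0):
    """Sketch of the sub-sample refinement scheme of [3] (illustrative)."""
    rng = np.random.default_rng(seed)
    # Stage 1: K-means on J random sub-samples -> J sets of K candidates.
    candidate_sets = []
    for _ in range(J):
        sub = X[rng.choice(len(X), size=int(frac * len(X)), replace=False)]
        candidate_sets.append(
            KMeans(n_clusters=k, n_init=1).fit(sub).cluster_centers_)
    candidates = np.vstack(candidate_sets)          # K x J candidates
    # Stage 2: cluster the candidates, trying each sub-sample's centroids
    # as the initialization, and keep the solution with minimum error.
    best_inertia, best_centers = np.inf, None
    for init in candidate_sets:
        km = KMeans(n_clusters=k, init=init, n_init=1).fit(candidates)
        if km.inertia_ < best_inertia:
            best_inertia, best_centers = km.inertia_, km.cluster_centers_
    return best_centers   # initial centroids for K-means on the full data
```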
A fast global k-means algorithm was proposed in [4], where a global
search technique adds one cluster center at a time. The process is
repeated until all K centroids are found, and principal component
analysis (PCA) is used for dimensionality reduction of the various
real-world datasets.
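A minimal sketch of this incremental scheme follows; it selects each new center as the point with the largest guaranteed reduction in clustering error, which is the fast global k-means selection rule, though the implementation details here are assumptions rather than the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans

def fast_global_kmeans_centers(X, k):
    """Sketch of incremental center addition in the spirit of [4]."""
    centers = X.mean(axis=0, keepdims=True)  # the optimal single center
    while len(centers) < k:
        # Squared distance of every sample to its nearest current center.
        d2 = np.min(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # Guaranteed error reduction if sample n became the next center
        # (O(n^2) memory; fine for a sketch, not for large datasets).
        pair2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        gain = np.maximum(d2[None, :] - pair2, 0).sum(axis=1)
        init = np.vstack([centers, X[np.argmax(gain)]])
        centers = KMeans(n_clusters=len(init), init=init,
                         n_init=1).fit(X).cluster_centers_
    return centers
```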
A very simple initial centroid selection algorithm was developed
in [5]. The first centroid is selected randomly from all the samples.
Each subsequent centroid is the sample with the maximum distance from
its nearest previously selected centroid. This process continues until
the number of initial centroids reaches K.
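Since [5] specifies the procedure completely, a short sketch is straightforward (the function name is ours):

```python
import numpy as np

def maximin_centroids(X, k, seed=0):
    """Sketch of the farthest-point seeding described in [5]."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]        # first centroid: random
    while len(centroids) < k:
        # Distance of every sample to its nearest chosen centroid; the
        # sample maximizing this distance becomes the next centroid.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[np.argmax(d)])
    return np.array(centroids)
```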
Another method for selecting initial centroids was proposed in [6].
All the pairwise Euclidean distances between the samples are computed,
and the two samples with the shortest distance are selected. These two
points are then removed from the main sample set along with their n
nearest neighbors, forming a new set. Repeating this process K times
creates K sets, and the mean of each set is used as an initial
centroid for the K-means algorithm.
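A minimal sketch of this pair-and-neighbors scheme is given below; how the n neighbors of the pair are measured is not stated in the text, so distance to the nearer member of the pair is an assumption.

```python
import numpy as np

def closest_pair_centroids(X, k, n_neighbors):
    """Sketch of the closest-pair seeding described in [6] (illustrative)."""
    remaining, centroids = X.copy(), []
    for _ in range(k):
        d = np.linalg.norm(remaining[:, None] - remaining[None, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        i, j = np.unravel_index(np.argmin(d), d.shape)   # closest pair
        # Assumed rule: a sample's distance to the pair is the distance
        # to the nearer of its two members.
        pair_dist = np.minimum(d[i], d[j])
        pair_dist[[i, j]] = np.inf                       # exclude the pair
        group = np.concatenate([[i, j], np.argsort(pair_dist)[:n_neighbors]])
        centroids.append(remaining[group].mean(axis=0))  # set mean
        remaining = np.delete(remaining, group, axis=0)
    return np.array(centroids)
```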
In [7], the authors proposed the Cluster Center Initialization
Algorithm (CCIA) to select initial centroids. In this method, all the
attributes are first normalized and one attribute is selected. The
mean, standard deviation, and percentile values of the normal curve
are used to find K centers for that attribute. K-means is then applied
to that attribute only, and each sample is labeled with a cluster
number. Repeating this for the remaining attributes produces a pattern
for each sample. The unique patterns are then identified, and the
samples corresponding to each such pattern are assigned to a single
cluster.
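The per-attribute seeding can be sketched as follows. The evenly spaced normal-curve percentiles used here are an assumption, as is the handling of duplicate patterns; the cited paper additionally merges similar clusters when the number of unique patterns exceeds K.

```python
import numpy as np
from scipy.stats import norm
from sklearn.cluster import KMeans

def ccia_patterns(X, k):
    """Sketch of CCIA-style seeding in the spirit of [7] (illustrative)."""
    n, d = X.shape
    labels = np.empty((n, d), dtype=int)
    for j in range(d):
        col = X[:, j]
        mu, sigma = col.mean(), col.std()
        # K seed values from percentiles of the fitted normal curve
        # (assumed: the (2s - 1) / 2K quantiles, s = 1..K).
        qs = (2 * np.arange(1, k + 1) - 1) / (2 * k)
        seeds = (mu + norm.ppf(qs) * sigma).reshape(-1, 1)
        km = KMeans(n_clusters=k, init=seeds, n_init=1)
        labels[:, j] = km.fit_predict(col.reshape(-1, 1))
    # Samples sharing the same tuple of per-attribute labels (a "pattern")
    # form one candidate cluster; its mean is a candidate centroid.
    patterns, inverse = np.unique(labels, axis=0, return_inverse=True)
    return np.array([X[inverse == p].mean(axis=0)
                     for p in range(len(patterns))])
```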
Another approach to initial centroid selection was presented in [8],
where the whole dataset is considered a single cell and the axis with
the highest variance is selected as the principal axis. The cell is
divided in two with the help of a cutting plane, and the error is
calculated before and after the division. Cells are divided in this
way until the number of cells equals the number of clusters, and the
mean of each cell is then used as an initial centroid for the K-means
algorithm.
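A simplified sketch of this idea follows; for brevity it always splits the highest-variance cell at the mean of its highest-variance axis, rather than choosing the cutting plane by comparing the error before and after each division as the paper does.

```python
import numpy as np

def cell_split_centroids(X, k):
    """Simplified sketch of variance-based cell splitting, after [8]."""
    cells = [X]
    while len(cells) < k:
        # Split the cell whose points have the largest total variance.
        idx = int(np.argmax([c.var(axis=0).sum() for c in cells]))
        cell = cells.pop(idx)
        axis = int(np.argmax(cell.var(axis=0)))   # highest-variance axis
        cut = cell[:, axis].mean()                # assumed cutting plane
        halves = [cell[cell[:, axis] <= cut], cell[cell[:, axis] > cut]]
        cells.extend(h for h in halves if len(h) > 0)
    return np.array([c.mean(axis=0) for c in cells])
```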
A novel approach for selecting the number of clusters was proposed
in [9]. The author developed a new cost function as well as a new
function for assigning each sample to a cluster. Simple K-means is
applied for different values of K, and the value corresponding to the
global minimum of the cost function is taken as the actual number of
clusters. Each sample is assigned to the cluster for which the
corresponding function value is minimum.
The authors in [10] presented another technique for initial centroid
selection, where all pairwise distances between samples are calculated
first. The mean distance and the density of each sample are then
computed using a density function developed by the authors. The sample
with the maximum density is selected as the first centroid, and some
neighbors of this sample are removed from the stored density array.
This process is repeated K times.
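The paper's density function is not reproduced here, so the sketch below substitutes a simple stand-in (the number of samples within the mean pairwise distance); everything except the select-then-suppress loop is therefore an assumption.

```python
import numpy as np

def density_centroids(X, k, n_neighbors):
    """Sketch of density-based seeding in the spirit of [10]."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    radius = d[np.triu_indices(len(X), k=1)].mean()   # mean pairwise distance
    density = (d < radius).sum(axis=1).astype(float)  # stand-in density
    centroids = []
    for _ in range(k):
        c = int(np.argmax(density))          # densest remaining sample
        centroids.append(X[c])
        # Suppress the chosen sample and its nearest neighbors so the
        # next centroid comes from a different region.
        density[np.argsort(d[c])[:n_neighbors + 1]] = -np.inf
    return np.array(centroids)
```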
A simple technique based on a weighted average of the attributes was
developed in [11]. A weighted average of the attribute values is
computed for each sample, and the samples are sorted by this value.
The sorted list is then divided into K sets, and the mean of each set
becomes an initial centroid for the simple K-means algorithm.
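This scheme reduces to a few lines; since the attribute weights are not specified in the text, uniform weights are assumed below.

```python
import numpy as np

def weighted_average_centroids(X, k, weights=None):
    """Sketch of the weighted-average seeding described in [11]."""
    w = np.ones(X.shape[1]) if weights is None else np.asarray(weights)
    scores = X @ w / w.sum()            # weighted average per sample
    order = np.argsort(scores)          # sort samples by that value
    # Divide the sorted list into K sets; each set's mean is a centroid.
    return np.array([X[part].mean(axis=0)
                     for part in np.array_split(order, k)])
```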
The authors in [12] proposed another initial centroid selection
method. First, all pairwise distances between samples are calculated.
The sum of distances to all other points is then computed for each
sample, and the samples are sorted by this sum from maximum to
minimum. The sample with the maximum sum of distances is the first
centroid. The next N/K samples are then discarded, and the first
sample after them is selected as the second initial centroid. This
process is repeated K times.
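A minimal sketch, assuming the picks land at every N/K-th position of the sorted list:

```python
import numpy as np

def distance_sum_centroids(X, k):
    """Sketch of the sum-of-distances seeding described in [12]."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    order = np.argsort(-d.sum(axis=1))  # max total distance first
    step = len(X) // k                  # N/K samples skipped between picks
    return X[order[::step][:k]]
```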
Another solution for initial centroid selection was presented in [13].
This algorithm is based on the farthest samples. The mean of the
dataset is calculated, and the sample farthest from the mean is
selected as the first initial centroid. The sample with the maximum
distance from the previously selected centroids becomes the second
centroid, and so on.
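This differs from the scheme in [5] only in the first pick, so the sketch is nearly identical; taking the distance to the nearest previously selected centroid is our assumption (the paper may use the sum instead).

```python
import numpy as np

def farthest_sample_centroids(X, k):
    """Sketch of the farthest-sample seeding described in [13]."""
    # First centroid: the sample farthest from the dataset mean.
    centroids = [X[np.argmax(np.linalg.norm(X - X.mean(axis=0), axis=1))]]
    while len(centroids) < k:
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[np.argmax(d)])   # farthest from chosen centroids
    return np.array(centroids)
```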
III. METHODOLOGY
This section explains the proposed algorithm for computing the initial
centroids for K-means. To illustrate the procedure, we generated a
small dataset of random samples, namely synthetic dataset 1, which
consists of two attributes and five clusters. There are 35 samples in
total, and we named the attributes variable1 and variable2. The
dataset is shown in Fig. 1.
Fig. 1. Synthetic dataset 1