APREPRINT - OCTOBER 25, 2022
Maldonado et al. (2022) demonstrated that traditional SMOTE faces major difficulties in defining the neighborhood used to generate additional minority samples. To overcome these concerns, the authors proposed a feature-weighted oversampling approach known as FW-SMOTE [13]. Aside from that, SMOTE is often computationally costly in terms of time and memory usage for high-dimensional data. Berando et al. (2022) pioneered the use of C-SMOTE to address time-complexity issues in binary classification problems [14]. Obiedat et al. (2022) presented SVM-SMOTE combined with particle swarm optimization (PSO) for sentiment analysis of customer evaluations; however, the proposed algorithm remained sensitive to multidimensional data [15].
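As a minimal sketch of the common idea behind these variants (not the weighted neighborhood of FW-SMOTE itself), SMOTE-style oversampling interpolates between a minority sample and one of its nearest minority neighbors; the function name below is illustrative, not from any particular library:

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating each chosen
    minority sample toward one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # squared distances from sample i to every minority sample
        d = ((X_min - X_min[i]) ** 2).sum(axis=1)
        # indices of the k nearest neighbors, excluding the sample itself
        nn = np.argsort(d)[1:k + 1]
        j = rng.choice(nn)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.2]])
X_new = smote_like(X_min, n_new=5)
```

Because each synthetic point is a convex combination of two minority samples, it always lies between them, which is exactly why the neighborhood definition matters: a poorly chosen neighbor places synthetic points in majority-class regions.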
On the other hand, undersampling procedures reduce the sample size of the major class to create a balanced dataset. Near-miss undersampling, Condensed Nearest Neighbour, and Tomek links are three of the most frequently used undersampling methods [11]. Once the data sample is reduced using undersampling techniques, there is a higher chance that much crucial information from the major class is also eliminated. As a result, researchers and practitioners generally prefer oversampling over undersampling [16].
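To illustrate one of the methods above, a Tomek link is a pair of opposite-class samples that are each other's nearest neighbor; removing the major-class member of each link cleans the class boundary. This is a minimal sketch with an illustrative function name, not a reference implementation:

```python
import numpy as np

def tomek_link_mask(X, y, major_label=0):
    """Return a boolean mask that drops major-class points involved
    in a Tomek link (mutual nearest neighbors of opposite classes)."""
    # pairwise squared Euclidean distances, self-distance set to infinity
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)  # index of each point's nearest neighbor
    keep = np.ones(len(X), dtype=bool)
    for i, j in enumerate(nn):
        # mutual nearest neighbors with different labels form a Tomek link
        if nn[j] == i and y[i] != y[j] and y[i] == major_label:
            keep[i] = False
    return keep

X = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 2.0], [2.1, 2.0], [5.0, 5.0]])
y = np.array([0, 1, 0, 0, 0])
keep = tomek_link_mask(X, y, major_label=0)  # drops only the first point
```

Here only the first major-class point is removed, because it forms a link with the lone minor-class point; this shows why undersampling can discard informative major-class samples near the decision boundary.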
Most ML algorithms assume that all misclassifications made by the model are equally costly, which is rarely the case for CIP, where misclassifying a positive (minor) class sample is a worse scenario than misclassifying a negative (major) class sample. Therefore, the cost-sensitive approach introduces a higher penalty to the model for misclassifying minor samples. In this process, the cost is assigned based on the error made by the model: if the algorithm fails to classify the minor class, the penalty is higher (e.g., 10 for each misclassification), and if it fails to classify the major class, the penalty is lower (e.g., 1 for each misclassification).
Shon et al. (2020) proposed hybrid deep learning-based cost-sensitive approaches to classify kidney cancer [17]. Wang et al. (2020) used multiple kernel learning-based cost-sensitive approaches to generate synthetic instances and train the classifier simultaneously using the same feature space [18]. One potential drawback of the cost-sensitive approach is that no defined protocol exists for setting the misclassification penalty. Therefore, weight adjustment is less preferred due to its complexity of use. The cost (weight) of misclassification is set by expert opinion or by manual experimentation until an appropriate cost is identified, which is very time-consuming. Further, determining the penalty requires measuring the impact of the features and considering various criteria. Such a procedure becomes even more complex with multidimensional and multiclass label data.
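The asymmetric penalty described above can be sketched as a simple cost computation. The 10:1 costs are the illustrative values from the text, not a recommended setting, and the cost table itself is exactly what the expert or manual experimentation must supply:

```python
# illustrative cost table: key = (true class, predicted class)
# class 1 is the minor (positive) class, class 0 the major (negative) one
COST = {(1, 0): 10.0,  # minor sample misclassified as major: heavy penalty
        (0, 1): 1.0}   # major sample misclassified as minor: light penalty

def total_cost(y_true, y_pred):
    """Sum the penalty over every misclassified sample;
    correct predictions incur zero cost."""
    return sum(COST.get((t, p), 0.0) for t, p in zip(y_true, y_pred))

y_true = [1, 1, 0, 0, 0]
y_pred = [0, 1, 1, 0, 0]  # one minor-class miss, one major-class miss
cost = total_cost(y_true, y_pred)  # 10 + 1 = 11
```

A cost-sensitive learner minimizes this weighted total instead of the plain error count, so a single minor-class mistake outweighs many major-class mistakes.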
Several algorithm-based solutions have been proposed to improve the effect of ML classification on imbalanced datasets. Galar et al. (2013) suggested an ensemble-based solution (EUSBoost), which integrates random undersampling and boosting algorithms. The authors assert that their proposed approach can resolve the overfitting issues of the imbalanced class problem [19]. Shi et al. (2022) proposed an ensemble resampling-based approach considering sample concatenation (ENRe-SC). According to the authors, the proposed strategy can mitigate the adverse effect of removing the major class caused by undersampling approaches [20]. Muhammad et al. (2021) proposed an evolving SVM decision function that employs a genetic method to tackle class-imbalanced situations [21]. Jiang et al. (2019) proposed generative adversarial network (GAN)-based approaches to handle the imbalanced class problem in time-series data [22].
Majid et al. (2014) employed K-nearest neighbors (KNN) and support vector machines (SVM) to detect human breast and colon cancer. The authors used a two-step process to address the imbalance problem: a preprocessor and a predictor. Mega-trend diffusion (MTD) is used in the preprocessing stage to increase the minority sample size and balance the dataset, while KNN and SVM are used in the predictor stage to create the hybrid techniques MTD-SVM and MTD-KNN. Their results show that MTD-SVM outperformed all other proposed techniques, demonstrating an accuracy of 96.71% [23]. Xiao et al. (2021) introduced a deep learning (DL)-based method known as the Wasserstein generative adversarial network (WGAN) and applied it to three different datasets: lung, stomach, and breast cancer. WGAN can generate new instances from the minor class and solve the CIP ratio problem [24].
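As background (in our notation, not taken from [24]), the Wasserstein formulation replaces the standard GAN minimax loss with

$$\min_{G}\;\max_{D \in \mathcal{D}}\;\mathbb{E}_{x \sim p_{\mathrm{data}}}\left[D(x)\right] \;-\; \mathbb{E}_{z \sim p_{z}}\left[D(G(z))\right],$$

where $G$ is the generator, $D$ the critic, $p_z$ the noise prior, and $\mathcal{D}$ the set of 1-Lipschitz functions. Minimizing this objective approximately minimizes the Wasserstein-1 distance between the real and generated distributions, which is what makes WGAN training more stable than the standard GAN loss.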
GAN has become a widely utilized technique in computer vision domains. GAN's capacity to generate realistic images from random noise is one of its potential benefits [25]. This dynamic characteristic contributes to GAN's appeal, as it has been used on nearly every data format (i.e., time-series data, audio data, image data) [26]. Sharma et al. (2022) showed that, using GAN, it is possible to generate data whose samples follow a better Gaussian distribution, which is often difficult to achieve using traditional imbalanced-data approaches. Their proposed GAN-based approaches show comparatively 10% higher performance than other existing techniques while producing minor samples that are almost real [27]. However, one major drawback of their suggested approach is that the model is hardly stable and very time-consuming in generating new samples. Therefore, an updated, stable GAN-based oversampling technique might play a crucial role in tackling class-imbalanced problems. Considering this opportunity, in this work we present an updated GAN-based oversampling technique. Our technical contributions can be summarized as follows:
1. Taking into account the advantages of two algorithms, SVM-SMOTE and GAN, we propose two oversampling strategies: GAN-based oversampling (GBO) and SVM-SMOTE-GAN (SSG).