IMBALANCED CLASS DATA PERFORMANCE EVALUATION AND
IMPROVEMENT USING NOVEL GENERATIVE ADVERSARIAL
NETWORK-BASED APPROACH: SSG AND GBO
A PREPRINT
Md Manjurul Ahsan
Department of Industrial and Systems Engineering
University of Oklahoma
Norman, Oklahoma-73071
ahsan@ou.edu
Md Shahin Ali
Department of Biomedical Engineering
Islamic University
Kushtia, 7003, Bangladesh
shahin@std.iu.ac.bd
Zahed Siddique
School of Aerospace and Mechanical Engineering
University of Oklahoma
Norman, Oklahoma-73019
zsiddique@ou.edu
October 25, 2022
ABSTRACT
Class imbalance in a dataset is one of the major challenges that can significantly impact the performance of machine learning models, resulting in biased predictions. Numerous techniques have been proposed to address class imbalance, including, but not limited to, Oversampling, Undersampling, and cost-sensitive approaches. Due to their ability to generate synthetic data, Oversampling techniques such as the Synthetic Minority Oversampling Technique (SMOTE) are among the methodologies most widely used by researchers. However, one of SMOTE's potential disadvantages is that newly created minor samples may overlap with major samples. As a result, the probability that ML models' performance will be biased toward the major classes increases. Recently, the generative adversarial network (GAN) has garnered much attention due to its ability to create almost-real samples. However, despite its potential, GAN is hard to train. This study proposes two novel techniques, GAN-based Oversampling (GBO) and Support Vector Machine-SMOTE-GAN (SSG), to overcome the limitations of existing Oversampling approaches. Preliminary computational results show that SSG and GBO performed better on eight expanded imbalanced benchmark datasets than the original SMOTE. The study also reveals that minor samples generated by SSG demonstrate Gaussian distributions, which is often difficult to achieve using the original SMOTE.
Keywords GAN · Imbalanced class data · Minor sample · Neural network · Machine learning · Oversampling · SVM-SMOTE
arXiv:2210.12870v1 [cs.LG] 23 Oct 2022

1 Introduction

An imbalanced class ratio within a dataset is a potential challenge in machine learning (ML)-based model development [1]. Class imbalance occurs when the total number of samples from one class is significantly higher than that of the other classes [2]. This inequality can be observed in both binary and multiclass classification settings [3]. The class with the fewest samples is called the minor class, and the class with the most samples is called the major class [3]. In binary classification problems, the major class frequently refers to the negative class, whereas the minor class refers to the positive class. Class imbalance is currently a significant issue in various domains such as biology, health, finance, telecommunications, and disease diagnosis [4]. As a result, it is considered one of the most severe problems in data mining [5]. Figure 1 depicts a two-dimensional representation of the major and minor classes.
Figure 1: Hypothetical example of majority and minority class
Most ML algorithms are built in such a way that they perform well with balanced data but are unable to perform on imbalanced datasets [6]. Therefore, several tasks, such as detecting credit card fraud or identifying malignant tumor cells, are difficult to accomplish using typical ML algorithms, where the primary goal is to identify the positive (rare) samples [7]. A well-known example is Caruana et al. (2015)'s study, which sought to determine which pneumonia patients should be hospitalized and which could be discharged home [8]. Unfortunately, their proposed approach generated misleading results for patients with asthma or chest pain by estimating a lower likelihood of dying. Numerous ML algorithms have been proposed, yet their performance remains biased toward the major class. For instance, consider an imbalanced dataset comprising 10,924 non-cancerous (majority class) and 260 malignant (minority class) cell images. Using traditional ML algorithms, there is a high chance that classification will exhibit 100% accuracy for the major class and 0%-10% accuracy for the minor class, meaning that up to 234 minority-class samples would be classified as the major class [9]. Consequently, 234 patients with cancer would be misdiagnosed as non-cancerous. Such an error is especially costly in medical treatment: a misdiagnosed malignant cell has significant health repercussions and may result in a patient's death [10].
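The arithmetic behind this "accuracy paradox" can be checked directly. The counts below are the hypothetical numbers from the paragraph above; the always-predict-majority baseline is an illustrative strawman, not any specific model:

```python
# Hypothetical cell-image dataset from the example above: a classifier
# that labels every sample as the majority class still reports a high
# overall accuracy while missing every malignant case.
majority = 10924   # non-cancerous samples (majority class)
minority = 260     # malignant samples (minority class)

total = majority + minority

# "Always predict majority" baseline: every minority sample is missed.
correct = majority
accuracy = correct / total

print(f"Overall accuracy: {accuracy:.1%}")  # high, yet clinically useless
print(f"Minority recall:  {0 / minority:.1%}")
```

Despite an overall accuracy near 98%, minority recall is zero, which is exactly the biased behavior the paragraph describes.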
Algorithms and techniques to address class imbalanced problems (CIP) are usually classified into three main categories: data-level, cost-sensitive, and ensemble algorithms (as shown in Figure 2).
Figure 2: Major approaches to handle CIP in the machine learning domain.
In data-level solutions, Oversampling approaches are most often used: minor-class data is oversampled by applying different techniques. Commonly employed Oversampling techniques include Adaptive Synthetic sampling (ADASYN), Random Oversampling, the Synthetic Minority Oversampling Technique (SMOTE), and Borderline-SMOTE [11]. Among all Oversampling approaches, Chawla's SMOTE is the most popular and most commonly utilized [3]. However, traditional SMOTE produces more noise and is unsuitable for high-dimensional data. To resolve these issues, Wang et al. (2021) proposed an active learning-based SMOTE [3]. Zhang et al. (2022) proposed SMOTE-RkNN (reverse k-Nearest Neighbors), a hybrid oversampling technique that identifies noise rather than relying on local neighborhood information [12].
2
APREPRINT - OCTOBER 25, 2022
Maldonado et al. (2022) demonstrated that traditional SMOTE faces major difficulties in defining the neighborhood used to generate additional minority samples. To overcome these concerns, the authors proposed a feature-weighted oversampling approach known as FW-SMOTE [13]. Aside from that, SMOTE is often computationally costly in terms of time and memory usage for high-dimensional data. Berando et al. (2022) pioneered the use of C-SMOTE to address time complexity issues in binary classification problems [14]. Obiedat et al. (2022) presented SVM-SMOTE combined with particle swarm optimization (PSO) for sentiment analysis of customer evaluations; however, the proposed algorithm remained sensitive to multidimensional data [15].
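The core interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration only — the `smote_like` helper and toy data are invented for the example, and real implementations (including Chawla's original algorithm and its variants above) add further safeguards:

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Generate synthetic minority points by interpolating between a
    random minority sample and one of its k nearest minority neighbours
    -- the core idea behind SMOTE, stripped of all refinements."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi)
                               for xi, ni in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=5)
print(new_points)
```

Because each synthetic point lies on a segment between two existing minority points, the new samples stay inside the minority region — which is also why, near a class boundary, they can overlap with majority samples, the drawback noted in the abstract.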
On the other hand, Undersampling procedures reduce the sample size of the major classes to create a balanced dataset. Near-Miss Undersampling, Condensed Nearest Neighbour, and Tomek Links are three of the most frequently used Undersampling methods [11]. When the data sample is reduced using Undersampling techniques, there is a high chance that much crucial information from the major class will also be eliminated. As a result, researchers and practitioners generally prefer Oversampling to Undersampling [16].
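Random Undersampling, the simplest member of this family, can be sketched directly; the `random_undersample` helper and toy data below are invented for illustration. The sketch makes the drawback above concrete: the discarded majority samples, and whatever information they carried, are simply gone:

```python
import random

def random_undersample(major, minor, seed=0):
    """Randomly drop majority samples until both classes have the same
    size. Simple and fast, but the removed samples are lost entirely,
    which is the information-loss drawback discussed above."""
    rng = random.Random(seed)
    kept = rng.sample(major, len(minor))
    return kept, minor

major = list(range(1000))        # stand-in majority-class samples
minor = list(range(1000, 1050))  # 50 stand-in minority-class samples
kept_major, minor_out = random_undersample(major, minor)
print(len(kept_major), len(minor_out))  # 50 50
```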
Most ML algorithms assume that all misclassifications made by the model are equally costly, which is rarely the case for CIP, where misclassifying a positive (minor) sample is considered worse than misclassifying a negative (major) sample. Therefore, in the cost-sensitive approach, a higher penalty is imposed on the model for misclassifying minor samples. In this process, the cost is assigned based on the error made by the model: if the algorithm fails to classify the minor class, the penalty is higher (e.g., 10 per misclassification), and if it fails to classify the major class, the penalty is lower (e.g., 1 per misclassification). Shon et al. (2020) proposed hybrid deep learning-based cost-sensitive approaches to classify kidney cancer [17]. Wang et al. (2020) used multiple kernel learning-based cost-sensitive approaches to generate synthetic instances and train the classifier simultaneously in the same feature space [18]. One potential drawback of the cost-sensitive approach is that no defined protocol exists for setting the misclassification penalty. Adjusting the weights is therefore less preferred due to its complexity: the cost (weight) of misclassification is set by expert opinion or by manual experimentation until an appropriate cost is identified, which is very time-consuming. Further, determining the penalty requires measuring the impact of the features and considering various criteria, and such a procedure becomes even more complex with multidimensional, multiclass data.
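The penalty scheme described above can be written as a small cost function. The `total_cost` helper and labels below are invented for illustration, and the 10:1 costs are the example values from the text, not a recommended setting (as the paragraph notes, there is no defined protocol for choosing them):

```python
# Asymmetric misclassification costs, per the illustrative 10:1 example:
COST_FN = 10  # missing a minority (positive) sample -- false negative
COST_FP = 1   # missing a majority (negative) sample -- false positive

def total_cost(y_true, y_pred):
    """Total misclassification cost; label 1 denotes the minority class."""
    cost = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 0:      # minority sample misclassified
            cost += COST_FN
        elif t == 0 and p == 1:    # majority sample misclassified
            cost += COST_FP
    return cost

y_true = [1, 1, 0, 0, 0, 1]
y_pred = [0, 1, 0, 1, 0, 0]  # two missed minority samples, one false alarm
print(total_cost(y_true, y_pred))  # 21
```

A cost-sensitive learner would minimize this weighted cost rather than the plain error count, so the two missed minority samples (cost 20) dominate the single false alarm (cost 1).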
Several algorithm-based solutions have been proposed to improve ML classification on imbalanced datasets. Galar et al. (2013) suggested an ensemble-based solution (EUSBoost) that integrates random Undersampling and boosting algorithms; the authors assert that their approach can resolve the overfitting issues of the imbalanced class problem [19]. Shi et al. (2022) proposed an ensemble resampling approach based on sample concatenation (ENRe-SC); according to the authors, the strategy can mitigate the adverse effect of removing major-class samples caused by Undersampling approaches [20]. Muhammad et al. (2021) proposed an evolving SVM decision function that employs a genetic method to tackle class imbalanced situations [21]. Jiang et al. (2019) proposed generative adversarial network (GAN)-based approaches to handle the imbalanced class problem in time series data [22]. Majid et al. (2014) employed K-nearest neighbors (KNN) and support vector machines (SVM) to detect human breast and colon cancer. The authors used a two-step process to address the imbalance problem: a preprocessor and a predictor. In the preprocessing stage, Mega-Trend Diffusion (MTD) is used to increase the minority sample size and balance the dataset; in the predictor stage, KNN and SVM are used to create the hybrid techniques MTD-SVM and MTD-KNN. Their results show that MTD-SVM outperformed all the other proposed techniques, demonstrating an accuracy of 96.71% [23]. Xiao et al. (2021) introduced a deep learning (DL)-based method, the Wasserstein generative adversarial network (WGAN) model, and applied it to three different datasets: lung, stomach, and breast cancer. WGAN can generate new instances of the minor class and thereby address the CIP ratio problem [24].
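For reference, the adversarial objective these GAN-based methods build on can be computed directly. The sketch below evaluates the standard (non-saturating) GAN losses for one batch of discriminator outputs; the `gan_batch_losses` helper and example probabilities are invented for illustration, and WGAN replaces this cross-entropy objective with a Wasserstein critic loss:

```python
import math

def gan_batch_losses(d_real, d_fake):
    """Binary cross-entropy losses of the original GAN game for one
    batch. d_real / d_fake are the discriminator's probability outputs
    on real minority samples and on generator outputs, respectively."""
    n_r, n_f = len(d_real), len(d_fake)
    # Discriminator: push D(x) -> 1 on real data, D(G(z)) -> 0 on fakes.
    d_loss = -(sum(math.log(p) for p in d_real) / n_r
               + sum(math.log(1 - p) for p in d_fake) / n_f)
    # Generator (non-saturating form): push D(G(z)) -> 1.
    g_loss = -sum(math.log(p) for p in d_fake) / n_f
    return d_loss, g_loss

# A discriminator that is fooled half the time on fakes:
d_loss, g_loss = gan_batch_losses(d_real=[0.9, 0.8], d_fake=[0.5, 0.5])
print(round(d_loss, 3), round(g_loss, 3))  # 0.857 0.693
```

When the discriminator outputs 0.5 on generated samples (it cannot tell them apart), the generator loss equals ln 2 ≈ 0.693 — the equilibrium value at which the generated minority samples are indistinguishable from real ones.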
GAN has become a widely utilized technique in computer vision. GAN's capacity to generate realistic images from random noise is one of its potential benefits [25]. This characteristic contributes to GAN's appeal, as it has been applied to nearly every data format (i.e., time-series, audio, and image data) [26]. Sharma et al. (2022) showed that GAN can generate data whose samples demonstrate a better Gaussian distribution, which is often difficult to achieve using traditional imbalanced-data approaches. Their proposed GAN-based approaches show roughly 10% higher performance than other existing techniques while producing minor samples that are almost real [27]. However, one major drawback of their suggested approach is that the model is hardly stable and is very time-consuming in generating new samples. Therefore, an updated, stable GAN-based Oversampling technique might play a crucial role in tackling class imbalanced problems. Considering this opportunity, in this work we present an updated GAN-based Oversampling technique. Our technical contributions can be summarized as follows:
1. Taking into account the advantages of two algorithms, SVM-SMOTE and GAN, we propose two Oversampling strategies: GAN-based Oversampling (GBO) and SVM-SMOTE-GAN (SSG).