IMBALANCED CLASS DATA PERFORMANCE EVALUATION AND
IMPROVEMENT USING NOVEL GENERATIVE ADVERSARIAL
NETWORK-BASED APPROACH: SSG AND GBO
A PREPRINT
Md Manjurul Ahsan
Department of Industrial and Systems Engineering
University of Oklahoma
Norman, Oklahoma-73071
ahsan@ou.edu
Md Shahin Ali
Department of Biomedical Engineering
Islamic University
Kushtia, 7003, Bangladesh
shahin@std.iu.ac.bd
Zahed Siddique
School of Aerospace and Mechanical Engineering
University of Oklahoma
Norman, Oklahoma-73019
zsiddique@ou.edu
October 25, 2022
ABSTRACT
Class imbalance in a dataset is one of the major challenges that can significantly impact the performance of machine learning models, resulting in biased predictions. Numerous techniques have been proposed to address class imbalance, including, but not limited to, Oversampling, Undersampling, and cost-sensitive approaches. Due to their ability to generate synthetic data, Oversampling techniques such as the Synthetic Minority Oversampling Technique (SMOTE) are among the methodologies most widely used by researchers. However, one of SMOTE's potential disadvantages is that newly created minor samples may overlap with major samples. As a result, the probability that ML models' performance will be biased toward the major classes increases. Recently, the generative adversarial network (GAN) has garnered much attention due to its ability to create almost-real samples. However, despite its potential, GAN is hard to train. This study proposes two novel techniques, GAN-based Oversampling (GBO) and Support Vector Machine-SMOTE-GAN (SSG), to overcome the limitations of existing Oversampling approaches. Preliminary computational results show that SSG and GBO performed better on eight expanded imbalanced benchmark datasets than the original SMOTE. The study also reveals that minor samples generated by SSG demonstrate Gaussian distributions, which is often difficult to achieve using the original SMOTE.
Keywords GAN · Imbalanced class data · Minor sample · Neural network · Machine learning · Oversampling · SVM-SMOTE
arXiv:2210.12870v1 [cs.LG] 23 Oct 2022

1 Introduction

An imbalanced class ratio within a dataset is a potential challenge in machine learning (ML)-based model development [1]. Class imbalance occurs when the total number of samples from one class is significantly higher than that of the other classes [2]. This inequality can be observed in both binary and multiclass classification settings [3]. The class with the fewest samples is called the minor class, and the class with the most samples is called the major class [3]. In binary classification problems, the major class frequently refers to the negative class, whereas the minor class refers to the positive class. Class imbalance is currently a significant issue in various domains such as biology, health, finance, telecommunications, and disease diagnosis [4]. As a result, it is considered one of the most severe problems in data mining [5]. Figure 1 depicts a two-dimensional representation of the major and minor classes.
Figure 1: Hypothetical example of majority and minority class
Most ML algorithms are built in such a way that they perform well with balanced data but are unable to perform on imbalanced datasets [6]. Therefore, several tasks, such as detecting credit card fraud or identifying malignant tumor cells, are difficult to accomplish using typical ML algorithms, where the primary goal is to identify the positive (rare) samples [7]. A well-known example is Caruana et al. (2015)'s study, which sought to determine which pneumonia patients should be hospitalized and which could be discharged home [8]. Unfortunately, their proposed approach generated misleading results for patients with asthma or chest pain by estimating a lower likelihood of dying. Numerous ML algorithms have been proposed, yet their performance remains biased toward the major class. For instance, consider an imbalanced dataset comprising 10,924 non-cancerous (majority class) and 260 malignant (minority class) cell images. Using traditional ML algorithms, there is a high chance that classification will exhibit 100% accuracy for the major class and 0%-10% accuracy for the minor class, meaning that up to 234 minority-class samples would be classified as the major class [9]. Consequently, 234 patients with cancer would be misdiagnosed as non-cancerous. Such an error is especially costly in medical treatment: a misdiagnosed malignant cell has significant health repercussions and may result in a patient's death [10].
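The arithmetic behind this "accuracy paradox" can be checked directly. The counts below are the hypothetical numbers from the paragraph above; the always-predict-majority baseline is an illustrative strawman, not any specific model:

```python
# Hypothetical cell-image dataset from the example above: a classifier
# that labels every sample as the majority class still reports a high
# overall accuracy while missing every malignant case.
majority = 10924   # non-cancerous samples (majority class)
minority = 260     # malignant samples (minority class)

total = majority + minority

# "Always predict majority" baseline: every minority sample is missed.
correct = majority
accuracy = correct / total

print(f"Overall accuracy: {accuracy:.1%}")  # high, yet clinically useless
print(f"Minority recall:  {0 / minority:.1%}")
```

Despite an overall accuracy near 98%, minority recall is zero, which is exactly the biased behavior the paragraph describes.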
Algorithms and techniques to address class imbalanced problems (CIP) are usually classified into three main categories: data-level, cost-sensitive, and ensemble algorithms (as shown in Figure 2).
Figure 2: Major approaches to handle CIP in the machine learning domain.
In data-level solutions, Oversampling approaches are most often used: minor-class data is oversampled by applying different techniques. Commonly employed Oversampling techniques include Adaptive Synthetic sampling (ADASYN), Random Oversampling, the Synthetic Minority Oversampling Technique (SMOTE), and Borderline-SMOTE [11]. Among all Oversampling approaches, Chawla's SMOTE is the most popular and most commonly utilized [3]. However, traditional SMOTE produces more noise and is unsuitable for high-dimensional data. To resolve these issues, Wang et al. (2021) proposed an active learning-based SMOTE [3]. Zhang et al. (2022) proposed SMOTE-RkNN (reverse k-Nearest Neighbors), a hybrid oversampling technique that identifies noise rather than relying on local neighborhood information [12].
2
APREPRINT - OCTOBER 25, 2022
Maldonado et al. (2022) demonstrated that traditional SMOTE faces major difficulties in defining the neighborhood used to generate additional minority samples. To overcome these concerns, the authors proposed a feature-weighted oversampling approach known as FW-SMOTE [13]. Aside from that, SMOTE is often computationally costly in terms of time and memory usage for high-dimensional data. Berando et al. (2022) pioneered the use of C-SMOTE to address time complexity issues in binary classification problems [14]. Obiedat et al. (2022) presented SVM-SMOTE combined with particle swarm optimization (PSO) for sentiment analysis of customer evaluations; however, the proposed algorithm remained sensitive to multidimensional data [15].
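The core interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration only — the `smote_like` helper and toy data are invented for the example, and real implementations (including Chawla's original algorithm and its variants above) add further safeguards:

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Generate synthetic minority points by interpolating between a
    random minority sample and one of its k nearest minority neighbours
    -- the core idea behind SMOTE, stripped of all refinements."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi)
                               for xi, ni in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=5)
print(new_points)
```

Because each synthetic point lies on a segment between two existing minority points, the new samples stay inside the minority region — which is also why, near a class boundary, they can overlap with majority samples, the drawback noted in the abstract.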
On the other hand, Undersampling procedures reduce the sample size of the major classes to create a balanced dataset. Near-Miss Undersampling, Condensed Nearest Neighbour, and Tomek Links are three of the most frequently used Undersampling methods [11]. When the data sample is reduced using Undersampling techniques, there is a high chance that much crucial information from the major class will also be eliminated. As a result, researchers and practitioners generally prefer Oversampling to Undersampling [16].
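Random Undersampling, the simplest member of this family, can be sketched directly; the `random_undersample` helper and toy data below are invented for illustration. The sketch makes the drawback above concrete: the discarded majority samples, and whatever information they carried, are simply gone:

```python
import random

def random_undersample(major, minor, seed=0):
    """Randomly drop majority samples until both classes have the same
    size. Simple and fast, but the removed samples are lost entirely,
    which is the information-loss drawback discussed above."""
    rng = random.Random(seed)
    kept = rng.sample(major, len(minor))
    return kept, minor

major = list(range(1000))        # stand-in majority-class samples
minor = list(range(1000, 1050))  # 50 stand-in minority-class samples
kept_major, minor_out = random_undersample(major, minor)
print(len(kept_major), len(minor_out))  # 50 50
```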
Most ML algorithms assume that all misclassifications made by the model are equally costly, which is rarely the case for CIP, where misclassifying a positive (minor) sample is considered worse than misclassifying a negative (major) sample. Therefore, in the cost-sensitive approach, a higher penalty is imposed on the model for misclassifying minor samples. In this process, the cost is assigned based on the error made by the model: if the algorithm fails to classify the minor class, the penalty is higher (e.g., 10 per misclassification), and if it fails to classify the major class, the penalty is lower (e.g., 1 per misclassification). Shon et al. (2020) proposed hybrid deep learning-based cost-sensitive approaches to classify kidney cancer [17]. Wang et al. (2020) used multiple kernel learning-based cost-sensitive approaches to generate synthetic instances and train the classifier simultaneously in the same feature space [18]. One potential drawback of the cost-sensitive approach is that no defined protocol exists for setting the misclassification penalty. Adjusting the weights is therefore less preferred due to its complexity: the cost (weight) of misclassification is set by expert opinion or by manual experimentation until an appropriate cost is identified, which is very time-consuming. Further, determining the penalty requires measuring the impact of the features and considering various criteria, and such a procedure becomes even more complex with multidimensional, multiclass data.
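The penalty scheme described above can be written as a small cost function. The `total_cost` helper and labels below are invented for illustration, and the 10:1 costs are the example values from the text, not a recommended setting (as the paragraph notes, there is no defined protocol for choosing them):

```python
# Asymmetric misclassification costs, per the illustrative 10:1 example:
COST_FN = 10  # missing a minority (positive) sample -- false negative
COST_FP = 1   # missing a majority (negative) sample -- false positive

def total_cost(y_true, y_pred):
    """Total misclassification cost; label 1 denotes the minority class."""
    cost = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 0:      # minority sample misclassified
            cost += COST_FN
        elif t == 0 and p == 1:    # majority sample misclassified
            cost += COST_FP
    return cost

y_true = [1, 1, 0, 0, 0, 1]
y_pred = [0, 1, 0, 1, 0, 0]  # two missed minority samples, one false alarm
print(total_cost(y_true, y_pred))  # 21
```

A cost-sensitive learner would minimize this weighted cost rather than the plain error count, so the two missed minority samples (cost 20) dominate the single false alarm (cost 1).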
Several algorithm-based solutions have been proposed to improve ML classification on imbalanced datasets. Galar et al. (2013) suggested an ensemble-based solution (EUSBoost) that integrates random Undersampling and boosting algorithms; the authors assert that their approach can resolve the overfitting issues of the imbalanced class problem [19]. Shi et al. (2022) proposed an ensemble resampling approach based on sample concatenation (ENRe-SC); according to the authors, the strategy can mitigate the adverse effect of removing major-class samples caused by Undersampling approaches [20]. Muhammad et al. (2021) proposed an evolving SVM decision function that employs a genetic method to tackle class imbalanced situations [21]. Jiang et al. (2019) proposed generative adversarial network (GAN)-based approaches to handle the imbalanced class problem in time series data [22]. Majid et al. (2014) employed K-nearest neighbors (KNN) and support vector machines (SVM) to detect human breast and colon cancer. The authors used a two-step process to address the imbalance problem: a preprocessor and a predictor. In the preprocessing stage, Mega-Trend Diffusion (MTD) is used to increase the minority sample size and balance the dataset; in the predictor stage, KNN and SVM are used to create the hybrid techniques MTD-SVM and MTD-KNN. Their results show that MTD-SVM outperformed all the other proposed techniques, demonstrating an accuracy of 96.71% [23]. Xiao et al. (2021) introduced a deep learning (DL)-based method, the Wasserstein generative adversarial network (WGAN) model, and applied it to three different datasets: lung, stomach, and breast cancer. WGAN can generate new instances of the minor class and thereby address the CIP ratio problem [24].
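For reference, the adversarial objective these GAN-based methods build on can be computed directly. The sketch below evaluates the standard (non-saturating) GAN losses for one batch of discriminator outputs; the `gan_batch_losses` helper and example probabilities are invented for illustration, and WGAN replaces this cross-entropy objective with a Wasserstein critic loss:

```python
import math

def gan_batch_losses(d_real, d_fake):
    """Binary cross-entropy losses of the original GAN game for one
    batch. d_real / d_fake are the discriminator's probability outputs
    on real minority samples and on generator outputs, respectively."""
    n_r, n_f = len(d_real), len(d_fake)
    # Discriminator: push D(x) -> 1 on real data, D(G(z)) -> 0 on fakes.
    d_loss = -(sum(math.log(p) for p in d_real) / n_r
               + sum(math.log(1 - p) for p in d_fake) / n_f)
    # Generator (non-saturating form): push D(G(z)) -> 1.
    g_loss = -sum(math.log(p) for p in d_fake) / n_f
    return d_loss, g_loss

# A discriminator that is fooled half the time on fakes:
d_loss, g_loss = gan_batch_losses(d_real=[0.9, 0.8], d_fake=[0.5, 0.5])
print(round(d_loss, 3), round(g_loss, 3))  # 0.857 0.693
```

When the discriminator outputs 0.5 on generated samples (it cannot tell them apart), the generator loss equals ln 2 ≈ 0.693 — the equilibrium value at which the generated minority samples are indistinguishable from real ones.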
GAN has become a widely utilized technique in computer vision. GAN's capacity to generate realistic images from random noise is one of its potential benefits [25]. This characteristic contributes to GAN's appeal, as it has been applied to nearly every data format (i.e., time-series, audio, and image data) [26]. Sharma et al. (2022) showed that GAN can generate data whose samples demonstrate a better Gaussian distribution, which is often difficult to achieve using traditional imbalanced-data approaches. Their proposed GAN-based approaches show roughly 10% higher performance than other existing techniques while producing minor samples that are almost real [27]. However, one major drawback of their suggested approach is that the model is hardly stable and is very time-consuming in generating new samples. Therefore, an updated, stable GAN-based Oversampling technique might play a crucial role in tackling class imbalanced problems. Considering this opportunity, in this work we present an updated GAN-based Oversampling technique. Our technical contributions can be summarized as follows:
1. Taking into account the advantages of two algorithms, SVM-SMOTE and GAN, we propose two Oversampling strategies: GAN-based Oversampling (GBO) and SVM-SMOTE-GAN (SSG).