Learning Classifiers for Imbalanced and Overlapping Data

Shivaditya Shivganesh

19BPS1103

Nitin Narayanan N

19BPS1050

Pranav Murali

19BPS1035

Ajaykumar M

19BPS1093

Abstract— This study concerns inducing classifiers from imbalanced data, in which a minority class is under-represented relative to the majority classes. The first part of this research focuses on the main characteristics of data that give rise to this problem. Following a study of previous, relevant research, a variety of artificial, imbalanced data sets shaped by these key factors were created. These data sets were used to induce decision trees and rule-based classifiers. The second part of this research looks into how to improve classifiers by pre-processing data with resampling approaches. The subsequent experiments compare the performance of distinct pre-processing resampling methods: two variants of random over-sampling and focused under-sampling with NCR. This paper further addresses class imbalance with a new method called Sparsity. The data is made more sparse from its class centers, hence making it more homogeneous.

I. INTRODUCTION

Supervised learning of classifiers from examples is one of the main tasks in machine learning and data mining. However, the usefulness of such classifiers for obtaining high predictive accuracy on real-life data depends on several factors, including the difficulty of the learning problem and the characteristics of its data. Class imbalance is one source of these difficulties. Many real-life problems are characterized by a highly imbalanced distribution of examples across classes. Typical examples are rare medical diagnoses, recognition of oil spills in satellite images, detection of specific astronomical objects in sky surveys, and technical diagnostics of equipment failures. Moreover, in fraud detection, whether in card transactions or in telephone calls, the number of legitimate transactions is much higher than the number of fraudulent ones. Similar situations occur in direct marketing, where the response class is usually very small in most marketing campaigns, and in information filtering, where some important categories contain only a few messages.

A. Issues with Class Imbalance

If the imbalance in the class distribution is extensive, i.e., some classes are strongly under-represented, then typical learning methods do not work properly. An even class distribution is often assumed (sometimes implicitly), and the classifiers are biased to focus their search on the more frequent classes while “missing” examples from the minority class. As a result, the constructed classifiers are also biased toward recognition of the majority classes, and they usually have difficulty classifying new objects from the minority class correctly, or are unable to do so at all.
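The effect of this bias on evaluation can be illustrated with a minimal sketch. The 95:5 class ratio, the label names, and the helper functions below are assumptions chosen for illustration and do not come from the experiments in this paper. A classifier that always predicts the majority class reaches high overall accuracy while recognising no minority examples at all.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return a classifier that always predicts the most frequent training label."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return lambda example: most_common_label


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of examples of class `positive` that were predicted as `positive`."""
    predicted_for_positive = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not predicted_for_positive:
        return 0.0
    return sum(p == positive for p in predicted_for_positive) / len(predicted_for_positive)


# Hypothetical imbalanced label set: 95 majority examples, 5 minority examples.
labels = ["majority"] * 95 + ["minority"] * 5
predict = majority_baseline(labels)
predictions = [predict(None) for _ in labels]

overall_accuracy = accuracy(labels, predictions)
minority_recall = recall(labels, predictions, "minority")
print(overall_accuracy)  # 0.95
print(minority_recall)   # 0.0
```

This is why overall accuracy alone can hide a complete failure on the minority class, and why per-class measures are worth reporting alongside it.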

B. Possible Solutions

This paper concerns classifier-independent methods that rely on transforming the original data to change the distribution of classes, e.g., by resampling. These methods are more universal, and they can be used in a pre-processing stage before applying many learning algorithms. A small number of examples in the minority class is not the only source of difficulties for classifiers. Recent works relate the degradation of performance to other factors as well, mainly to the decomposition of the minority class into many sub-clusters with very few examples. These rare sub-concepts correspond to so-called small disjuncts, which lead to classification errors more often than examples from larger parts of the class.
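As a rough illustration of resampling as a classifier-independent pre-processing step, the sketch below implements plain random over-sampling. The function name `random_oversample`, its `seed` parameter, and the choice to grow every class to the size of the largest one are assumptions made for this example; the paper does not specify a target ratio here.

```python
import random
from collections import Counter


def random_oversample(examples, labels, seed=0):
    """Duplicate randomly chosen examples of the smaller classes until
    every class has as many examples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_examples, out_labels = list(examples), list(labels)
    for cls, count in counts.items():
        # All original examples of this class form the sampling pool.
        pool = [x for x, y in zip(examples, labels) if y == cls]
        for _ in range(target - count):
            out_examples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_examples, out_labels


# Hypothetical toy data: 8 majority points and 2 minority points.
toy_x = [(float(i), 0.0) for i in range(8)] + [(0.5, 1.0), (0.7, 1.2)]
toy_y = ["majority"] * 8 + ["minority"] * 2
balanced_x, balanced_y = random_oversample(toy_x, toy_y)
print(Counter(balanced_y))
```

Because the output is just a new list of examples and labels, it can be passed to any learning algorithm unchanged. Note that duplicating examples does nothing about small disjuncts or class overlap, which is one motivation for the focused methods discussed in this paper.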

C. Goals

Studying the role of these factors in class imbalance is still an open research problem. Therefore, the main aim of this study is to examine experimentally which of these factors are more critical for the performance of the classifier. Carrying out such experiments requires preparing a new collection of artificial data sets that are affected by the above-mentioned factors. Proposing such data sets is another sub-aim of this paper. In this paper we are particularly interested in focused (also called informed) resampling methods, which modify the class distribution while taking into account the local characteristics of examples. Representative methods are used: SMOTE for selective over-sampling of the minority class, and one-sided sampling and NCR (Neighbourhood Cleaning Rule) for removing examples from the majority classes.
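To make the idea of focused over-sampling more concrete, the following is a simplified, SMOTE-style sketch: each synthetic example is placed on the line segment between a minority example and one of its nearest minority neighbours. It is not the reference SMOTE implementation and not necessarily the configuration used in this paper; the function name `smote_like` and the parameters `n_new`, `k`, and `seed` are hypothetical.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority examples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical minority-class points in two dimensions.
minority_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority_points, n_new=6)
print(len(new_points))  # 6
```

Unlike random over-sampling, the new examples are not exact copies, so the minority region is filled in locally rather than merely re-weighted. Under-sampling methods such as NCR work in the opposite direction, removing majority examples based on their neighbourhoods.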

II. LITERATURE REVIEW

Jerzy Stefanowski works on inducing classifiers from imbalanced data, in which one class (the minority class) is under-represented compared to the other classes (the majority classes). The minority class is usually the focus of attention, and it must be recognised as accurately as possible. Because most algorithms for learning classifiers are biased toward the majority classes, class imbalance is a challenge for them. The first part of this research focuses on the main characteristics of data that give rise to this problem. Following a review of related research, several types of artificial, imbalanced data sets shaped by these key factors were created. These data sets were used to induce decision trees and rule-based classifiers. The results of the initial experiments suggest that a lack of examples from the minority class is not the primary cause of problems. These findings support the

arXiv:2210.12446v1 [cs.LG] 22 Oct 2022
