Learning Classifiers for Imbalanced and Overlapping Data
Shivaditya Shivganesh
19BPS1103
Nitin Narayanan N
19BPS1050
Pranav Murali
19BPS1035
Ajaykumar M
19BPS1093
Abstract: This study concerns inducing classifiers from
imbalanced data, in which a minority class is under-represented
relative to the majority classes. The first part of this
research focuses on the main characteristics of data that
cause this problem. Following a study of previous, relevant
research, a variety of artificial, imbalanced data sets affected
by these key factors were created. These data sets were used
to induce decision trees and rule-based classifiers. The second
part of this research looks into how to improve classifiers by
pre-processing data with resampling approaches. The
experiments compare the performance of several pre-processing
resampling methods: two variants of random over-sampling and
focused under-sampling with the Neighbourhood Cleaning Rule
(NCR). This paper further addresses class imbalance with a new
method called Sparsity, in which the data is made more sparse
from its class centers, making it more homogeneous.
I. INTRODUCTION
Supervised learning of classifiers from examples is one
of the main tasks in machine learning and data mining.
However, how well such classifiers achieve high predictive
accuracy on real-life data depends on several factors,
including the difficulty of the learning problem and the
characteristics of its data. Class imbalance is one source of
these difficulties. Many real-life problems are characterized
by a highly imbalanced distribution of examples across classes.
Typical examples are rare medical diagnoses, recognition of
oil spills in satellite images, detection of specific
astronomical objects in sky surveys, and technical diagnostics
of equipment failures. Moreover, in fraud detection, whether in
card transactions or in telephone calls, the number of
legitimate transactions is much higher than the number of
fraudulent ones. Similar situations occur in direct marketing,
where the response rate is usually very small in most
campaigns, and in information filtering, where some
important categories contain only a few messages.
A. Issues with Class Imbalance
If the imbalance in the class distribution is extensive, i.e.,
some classes are strongly under-represented, then typical
learning methods do not work properly. An even class
distribution is often assumed (sometimes only implicitly), so
classifiers are biased to focus their search on the
more frequent classes while missing examples from the
minority class. As a result, the constructed classifiers are
also biased toward recognizing the majority classes, and they
usually have difficulty classifying (or are even unable to
classify) new objects from the minority class correctly.
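
The effect described above can be illustrated with a minimal sketch. It assumes a hypothetical data set with 95 majority and 5 minority labels and a trivial baseline that always predicts the most frequent class; the names majority_baseline, accuracy and recall are illustrative and do not come from the paper. Overall accuracy looks high while no minority example is recognised.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return a classifier that always predicts the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return lambda example: most_common_label


def accuracy(y_true, y_pred):
    """Fraction of all examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of examples of class `positive` that were predicted as `positive`."""
    n_positive = sum(t == positive for t in y_true)
    hits = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    return hits / n_positive if n_positive else 0.0


# Hypothetical labels: 95 majority ("neg") and 5 minority ("pos") examples.
labels = ["neg"] * 95 + ["pos"] * 5
predict = majority_baseline(labels)
predictions = [predict(None) for _ in labels]

print(accuracy(labels, predictions))       # 0.95
print(recall(labels, predictions, "pos"))  # 0.0
```

This is why accuracy alone can be misleading on imbalanced data, and why the recognition of the minority class is usually inspected separately.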
B. Possible Solutions
This paper concerns classifier-independent methods that
rely on transforming the original data to change the
distribution of classes, e.g., by resampling. These methods are
more universal and can be used in a pre-processing stage
before applying many learning algorithms. A small number
of examples in the minority class is not the only source of
difficulty for classifiers. Recent works suggest that the
degradation of performance is also related to other factors,
mainly the decomposition of the minority class into many
sub-clusters with very few examples. These rare sub-concepts
correspond to so-called small disjuncts, which lead to
classification errors more often than examples from larger
parts of the class.
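
As a minimal sketch of resampling used as a pre-processing step, the function below performs plain random over-sampling: it duplicates randomly chosen examples of each smaller class until all classes are the same size. The function name random_oversample, the seed parameter and the toy data are assumptions made for illustration; the paper's own over-sampling variants may differ in detail.

```python
import random
from collections import Counter


def random_oversample(examples, labels, seed=0):
    """Duplicate random examples of each smaller class until every class
    has as many examples as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    new_examples, new_labels = list(examples), list(labels)
    for cls, n in counts.items():
        # All original examples that belong to this class.
        pool = [x for x, y in zip(examples, labels) if y == cls]
        for _ in range(target - n):
            new_examples.append(rng.choice(pool))
            new_labels.append(cls)
    return new_examples, new_labels


# Hypothetical toy data: 8 majority points and 2 minority points.
X = [(0.1 * i, 1.0) for i in range(8)] + [(5.0, 5.0), (5.5, 4.5)]
y = ["maj"] * 8 + ["min"] * 2

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))  # Counter({'maj': 8, 'min': 8})
```

Because the output is an ordinary list of examples and labels, it can be passed to any learning algorithm, which is what makes this kind of method classifier-independent. Note that duplication adds no new information about small sub-clusters of the minority class.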
C. Goals
Studying the role of these factors in class imbalance is still
an open research problem. Therefore, the main aim of this
study is to examine experimentally which of these factors are
more critical for classifier performance. Carrying
out such experiments requires preparing a new collection of
artificial data sets that are affected by the above-mentioned
factors. Proposing such data sets is another sub-aim of this
paper. We are particularly interested in focused
(also called informed) resampling methods, which modify
the class distribution by taking into account local
characteristics of examples. Representative methods are used:
SMOTE for selective over-sampling of the minority class, and
one-sided sampling and NCR for removing examples from the
majority classes.
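
To make the idea of focused over-sampling concrete, the sketch below follows the core step of SMOTE: each synthetic point is placed on the line segment between a minority example and one of its k nearest minority neighbours. It is a simplified illustration, not the configuration used in this paper; the function name smote_like, the default k=3, the seed and the toy points are all assumptions.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each lying between a minority example
    and one of its k nearest minority neighbours (needs >= 2 examples)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding the point itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class points in two dimensions.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=6)
print(len(new_points))  # 6
```

Unlike random over-sampling, the new points are not copies of existing examples, and they depend on the local neighbourhood of each minority example, which is what "focused" or "informed" refers to.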
II. LITERATURE REVIEW
Jerzy Stefanowski works on inducing classifiers from
imbalanced data, in which one class (the minority class) is
under-represented compared to the other classes (the majority
classes). The minority class is usually the focus of attention,
and it must be recognised as accurately as possible. Because
most algorithms for learning classifiers are biased toward the
majority classes, class imbalance is a challenge for them. The
first section of this research focuses on the main
characteristics of data that cause this problem. Following a
review of related research, numerous forms
of artificial, imbalanced data sets affected by important
factors were created. These data sets were used to induce
decision trees and rule-based classifiers. The results of the
initial studies suggest that a lack of examples from the
minority class is not the primary cause of problems. These findings support the
arXiv:2210.12446v1 [cs.LG] 22 Oct 2022