An Instance Selection Algorithm for Big Data in High
imbalanced datasets based on LSH
Germán E. Melo-Acostaa,∗, Freddy Duitama-Muñoza,b, Julián D. Arias-Londoñoc
aIntelligent Information Systems Lab, Universidad de Antioquia, Calle 67 No. 53 - 108, 050010,
Medellín, Colombia.
bDepartment of Systems Engineering, Universidad de Antioquia, Calle 67 No. 53 - 108, 050010,
Medellín, Colombia.
cGAPS, SSR Department, ETSIT-Universidad Politécnica de Madrid, Av. Complutense, 30, 28040
Madrid, Spain.
Abstract
Training of Machine Learning (ML) models in real contexts often deals with big data
sets and a high class imbalance, where the class of interest is underrepresented (minority
class). Practical solutions using classical ML models address the problem of large data
sets through parallel/distributed implementations of the training algorithms, approximate
model-based solutions, or instance selection (IS) algorithms that eliminate redundant
information. However, the combined problem of big and highly imbalanced datasets has
received much less attention. This work proposes three new IS methods able to deal with
large and imbalanced data sets. The proposed methods use Locality Sensitive Hashing
(LSH) as a base clustering technique, and then apply three different sampling strategies
on top of the clusters (or buckets) generated by LSH. The algorithms were developed in
the Apache Spark framework, which guarantees their scalability. The experiments carried
out on three different datasets suggest that the proposed IS methods can improve the
performance of a base ML model by between 5% and 19% in terms of the geometric mean.
Keywords: Instance Selection, Big Data, Locality-Sensitive Hashing, Imbalanced data
sets, Imbalanced Classification
1. Introduction
The large amounts of data currently being generated in different fields of knowledge
bring about several challenges, not only for data storage and management systems but
also for data analysis tools. In the Big Data context, many applications use historical
data to train machine learning (ML)-based systems that provide information systems
with the ability to make predictions on new data.
∗Corresponding author
Email addresses: geduardo.melo@udea.edu.co (Germán E. Melo-Acosta),
john.duitama@udea.edu.co (Freddy Duitama-Muñoz), julian.arias@upm.es (Julián D. Arias-Londoño)
The increase in the amount of data available for training purposes has boosted the
performance of many ML applications, because such an increase adds more information
to the learning process, reduces the epistemic uncertainty and also helps to prevent
over-fitting [1]. However, having more data is problematic for many learning algorithms,
since the computational complexity of most of them grows at least superlinearly with the
number of training samples. Moreover, some non-parametric models use a prototype-based
approach, where some (or all) of the training samples are stored in order to be used for
new predictions (e.g., the well-known k-nearest neighbors — k-NN, Support Vector
Machines [2] and Gaussian Process — GP — classification and regression methods [3]).
In those models, the predictions are based on comparisons (through a distance, a
similarity measure or a kernel function) between the new samples and a set of the stored
training samples, which demands a large amount of memory and processing time [4].
The specialized literature shows that there are mainly three approaches to tackle this
problem: parallel/distributed implementations of ML training algorithms [5]; model-based
solutions that use variants of the original models and provide approximate solutions
(e.g., pruned or cluster-based k-NN, sparse or inducing-variable GPs) [6, 7]; and instance
selection (IS) methods that reduce the number of samples to be considered during training
[4]. The first alternative is perhaps the most attractive one, but it requires a
considerable effort from the academic community to obtain a solution for every single
model one wants to apply in a Big data context. Moreover, some training algorithms are
inherently sequential; thus, their parallel/distributed variants provide approximate
solutions with an unavoidable performance degradation. The second alternative is similar
to the first one, but its purpose is to reduce the computational complexity of the
training process instead of parallelizing it. In this case, variants of the original
models or training algorithms are also required in order to use the training data in a
different and less demanding way, but most of the time they also yield approximate
solutions. The last alternative focuses on the data used for training; in this case, the
solutions reduce the number of samples required to achieve proper model fitting and
performance without modifying the original models or processing all the data. This
strategy can also be combined with the approaches based on parallel or distributed model
training. Notwithstanding their advantages, the methods in the last group must satisfy
two conditions: the selected samples must represent the underlying data distribution,
and their computational cost needs to be light enough to make their use worthwhile.
In this sense, many of the IS algorithms proposed in the literature are not suitable for
applications in the context of large volumes of data, because their complexity is at
least O(n log n) and because they rely on sorting algorithms that, by design, require all
samples to fit in a single machine [8]. The computational load of these algorithms comes
from the fact that they usually must search for critical or close-to-the-decision-boundary
instances within the whole dataset, which results in comparisons between every pair of
instances in the dataset, or in a dependence on iterative optimization algorithms [9, 10].
In addition to the computational complexity issue, ML models have to deal with skewed
data distributions in many real-life situations [11], such as the detection of fraudulent
transactions, mammogram lesions, or credit defaults. In those contexts, the severity of
the class imbalance can vary from minor to severe and is measured by the Imbalance Ratio
(IR), computed as the number of majority-class examples divided by the number of
minority-class examples. According to [12, 13], an IR higher than 100 characterizes a
highly imbalanced dataset; consequently, in those scenarios, IS methods must pay special
attention to the sampling of the minority class, since it can undermine the possibility
of training a model properly. In fact, some state-of-the-art works have reported that IS
can help reduce the impact of the IR on the training process [14, 15]; however, neither
experiments on highly imbalanced datasets nor their effects on the performance of ML
models have been thoroughly analyzed.
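For concreteness, the IR can be computed directly from the class labels, as in the short
sketch below (the function name and the list-of-labels input are our own choices):

```python
from collections import Counter

def imbalance_ratio(labels):
    """Imbalance Ratio (IR): majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Example: 990 majority-class vs. 10 minority-class samples -> IR = 99.0
print(imbalance_ratio([0] * 990 + [1] * 10))
```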
The adaptation of IS methods to tackle large data volumes and high-IR problems is usually
treated separately. For instance, in [8] an algorithm called Democratic Instance Selection
(DIS) is proposed, which consists in running several rounds of an IS algorithm on different
subsets of the original dataset (partitioning the training set into several disjoint
subsets). In each round, the well-known Decremental Reduction Optimization Procedure
Version 3 (DROP3) [16] or Iterative Case Filtering (ICF) [17] IS algorithm “votes” for the
instances to remove. The rounds are then combined using a voting scheme whose threshold
is computed by optimizing both the training error and the number of instances kept in
memory. Using a similar idea of splitting the original dataset to scale up IS methods,
[4] proposed the use of Locality Sensitive Hashing (LSH) to create partitions of the
original dataset, taking advantage of the fact that LSH has a linear cost. The authors
used the LSH partitions to create two new IS algorithms. The first one, called LSH-IS-S,
keeps only one instance of each class inside each partition. The second one, called
LSH-IS-F, evaluates each class within the partition: if only one instance of a class is
present, that instance is rejected; otherwise, if two or more instances of the same class
are present, one of them is randomly chosen. In the same vein, [9] introduced an algorithm
called XLDIS, which splits the original dataset using a modified k-NN method. Then, in
each subset, the most representative instances of each class are selected using the local
density of the instances (the average similarity between an instance and all its peers).
Although the results showed that the time complexity of XLDIS is lower than that of other
traditional IS approaches, the experimental setup did not use datasets with more than
20,000 samples.
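Following the descriptions in [4], the two per-bucket rules can be sketched as follows;
the bucket construction is omitted, and the function names and the data layout (a list of
(features, label) pairs per bucket) are our own assumptions:

```python
import random

def lsh_is_s(bucket):
    """LSH-IS-S: keep a single instance of each class present in the bucket."""
    kept, seen = [], set()
    for features, label in bucket:
        if label not in seen:
            seen.add(label)
            kept.append((features, label))
    return kept

def lsh_is_f(bucket):
    """LSH-IS-F: a class represented by a single instance in the bucket is rejected;
    when two or more instances of a class are present, one of them is kept at random."""
    by_class = {}
    for features, label in bucket:
        by_class.setdefault(label, []).append((features, label))
    return [random.choice(group) for group in by_class.values() if len(group) >= 2]
```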
In these three works, although the datasets used had some level of imbalance, their IR
was not analyzed, and the main results focused on demonstrating the near-linear
computational complexity of the algorithms. Furthermore, although the authors claimed
that the proposed algorithms are suitable for Big data contexts, only DIS was actually
implemented on a distributed processing framework [18].
Conversely, [19] and [20] proposed new IS algorithms for high-IR problems (with IRs
varying from 2 to 130). Both works used similarity-based clustering algorithms (such as
k-means or affinity propagation) to group the instances of each class separately; then,
IS was performed within the clusters of the majority class. The first method used a
parallelizable agent-based optimization algorithm to select the instances that maximized
the accuracy of the problem. However, even though the algorithm can be parallelized, it
must complete multiple training loops of the same classifier to optimize its accuracy.
The second method applied standard DROP3, Instance-Based Learning version 3 (IB3) [21],
and Genetic Algorithms inside the clusters (which were created only for the majority-class
samples) to perform the IS process. Additionally, both works focused on small datasets
(fewer than 5,000 instances), and the complexity of the techniques was not analyzed.
To the best of our knowledge, only [10] has aimed at adapting IS to solve the two problems
simultaneously. The authors proposed the Oligarchic IS (OligoIS) algorithm, which
introduces two modifications to the DIS algorithm: a random splitting of the dataset to
create the partitions, and a voting threshold that favors minority-class instances.
However, even though the results showed better accuracy when the IR of the problem was
higher, and the algorithm was tested on a large dataset (more than 5,000,000 instances),
no distributed processing framework was used. Moreover, the final performance of the
models trained with the reduced dataset depends on the IS method used: the authors showed
that the difference in performance between an evolutionary IS-based method and DROP3 can
reach up to 20% in relative terms, which could be explained by the fact that the work did
not propose any adaptation of the IS methods to the splitting process.
Bearing this in mind, the present work proposes three new IS algorithms designed for
high-IR problems that can also be used in a Big data context. One method balances the
trade-off between fast computation and a basic informed IS process, using an entropy
measure to guide the amount of sampling required. The other two methods adapt the popular
DROP3 algorithm to deal with imbalanced-class problems and perform a more informed IS.
As in [4], the proposed algorithms use LSH as a base clustering method to achieve a linear
computational cost on large-volume datasets, and subsequently apply the IS methods in a
divide-and-conquer manner. The influence of three LSH families on the performance of the
models is studied through several experiments on different datasets. The IS algorithms
are also analyzed in terms of computational complexity and IR. All the algorithms were
implemented using the distributed computing framework Apache Spark [22], and are publicly
available on Gitlab1.

1 https://gitlab.com/germanEduardo/lsh-analysis/
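As a rough illustration of the entropy-guided idea mentioned above (this is our own
sketch, not the exact criterion, which is defined in Section 2.3; in particular, the
mapping from entropy to a sampling fraction below is an assumption), the class entropy
of a bucket can be used to decide how aggressively that bucket is sampled:

```python
import math

def class_entropy(labels):
    """Shannon entropy (in bits) of the class distribution inside a bucket."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def sampling_fraction(labels, low=0.1, high=1.0):
    """Hypothetical rule: pure (low-entropy) buckets are sampled heavily, while
    mixed (high-entropy) buckets, likely close to the decision boundary, are kept."""
    classes = set(labels)
    max_entropy = math.log2(len(classes)) if len(classes) > 1 else 1.0
    return low + (high - low) * class_entropy(labels) / max_entropy
```

Under this illustrative rule, a bucket containing a single class would be reduced to
roughly 10% of its instances, whereas a perfectly mixed bucket would be kept entirely.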
The rest of the paper is organized as follows. Section 2 presents the LSH concepts and
families, together with a detailed explanation of the three LSH-based IS algorithms
introduced in this work and of how they are used to address the class imbalance problem.
Section 3 then describes the experimental setup, the results obtained, and their analysis.
Finally, Section 4 presents the main conclusions of the work and lines of future research.
2. Methods
2.1. Instance Selection using Locality Sensitive Hashing
LSH is a method for determining which items (e.g., samples in a feature space) of a given
set are similar. It is based on the simple idea that, if two points are close together,
then after applying a “projection” operation these two points will remain close together
[23]. Because it uses hash functions instead of distances to estimate the similarity of
the objects, LSH can create clusters or groups of similar objects with linear computational
complexity. Figure 1 shows a simplified diagram of how LSH can be used to perform IS. The
first step is to divide the feature space of the training data into regions using “buckets”
created by LSH; the buckets group the instances according to their similarity. Next, within
every bucket, a sampling algorithm is applied to select the most representative instances.
The sampling techniques proposed in this work (Sections 2.3 and 2.4) try to preserve the
most significant instances, taking into account the particular challenges of high-IR
problems discussed above; a minimal sketch of this bucket-then-sample flow is given after
Figure 1.
Figure 1: Methodology proposed to use LSH to perform Instance Selection
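The following single-machine sketch illustrates the bucket-then-sample flow of Figure 1
(the distributed Spark implementation in the repository is organized differently; the
random-hyperplane hashing and the helper names used here are illustrative assumptions):

```python
import numpy as np
from collections import defaultdict

def lsh_buckets(X, y, n_hashes=8, seed=0):
    """Group instances into buckets: the key of an instance is the sign pattern of
    its projections onto a set of random hyperplanes (cosine-style LSH)."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(n_hashes, X.shape[1]))   # one normal vector per hyperplane
    buckets = defaultdict(list)
    for features, label in zip(X, y):
        key = tuple(bool(b) for b in planes @ features >= 0)
        buckets[key].append((features, label))
    return buckets

def select_instances(X, y, sample_bucket):
    """Apply a per-bucket sampling rule and concatenate the surviving instances."""
    selected = []
    for bucket in lsh_buckets(X, y).values():
        selected.extend(sample_bucket(bucket))
    return selected
```

Passing one of the per-bucket rules sketched earlier (e.g., lsh_is_s) as sample_bucket
reproduces the keep-one-instance-per-class behaviour inside every bucket.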
As mentioned above, LSH requires the definition of hash functions with particular
properties. Additionally, it is essential to understand how those functions divide the
feature space in order to account for the high class imbalance: ultimately, the choice of
those functions determines which samples are seen by the underlying sampling method.
Therefore, the following sections formalize the properties of the hash functions, how they
are addressed in this study, and the sampling methods proposed.
2.2. Locality-sensitive functions
In the LSH approach, the central point is the definition of a hash function suitable for
“hashing” the items under analysis. The functions to be used in LSH must take two items
and provide a decision about whether these items are likely to be similar or not; such
functions are called locality-sensitive functions. In Figure 2, although the vectors ~x
and ~y may live in a space of many dimensions, they always define a plane and form an
angle θ between them, which is measured in this plane (gray region). Suppose we pick two
additional hyperplanes through the origin; each hyperplane intersects the plane of ~x and
~y in a line (dotted line). To pick a random hyperplane, we actually pick the normal
vector to the hyperplane, say ~v. Note that, with respect to the blue hyperplane, the
vectors ~x and ~y lie on different sides; as a consequence, the dot products ~v · ~x and
~v · ~y will
have different signs. In the same way, the randomly chosen vector ~v′ is normal to the red
hyperplane; in this case, ~x and ~y lie on the same side of it, so the dot products
~v′ · ~x and ~v′ · ~y will have the same sign.
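This behaviour is easy to verify numerically; the following small check (our own
illustration, not code from the paper) estimates how often two vectors separated by an
angle θ land on the same side of a random hyperplane, which for the cosine family happens
with probability 1 − θ/π:

```python
import numpy as np

def same_side(x, y, v):
    """True when the hyperplane with normal v leaves x and y on the same side,
    i.e. when the dot products v·x and v·y have the same sign."""
    return (v @ x >= 0) == (v @ y >= 0)

rng = np.random.default_rng(0)
theta = np.pi / 3
x = np.array([1.0, 0.0])
y = np.array([np.cos(theta), np.sin(theta)])
agreement = np.mean([same_side(x, y, rng.normal(size=2)) for _ in range(20_000)])
print(agreement, 1 - theta / np.pi)   # both values are close to 2/3
```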