An Instance Selection Algorithm for Big Data in High
imbalanced datasets based on LSH
Germán E. Melo-Acostaa,∗, Freddy Duitama-Muñoza,b, Julián D. Arias-Londoñoc
aIntelligent Information Systems Lab, Universidad de Antioquia, Calle 67 No. 53 - 108, 050010,
Medellín, Colombia.
bDepartment of Systems Engineering, Universidad de Antioquia, Calle 67 No. 53 - 108, 050010,
Medellín, Colombia.
cGAPS, SSR Department, ETSIT-Universidad Politécnica de Madrid, Av. Complutense, 30, 28040
Madrid, Spain.
Abstract
Training of Machine Learning (ML) models in real contexts often deals with big data
sets and a high class imbalance, where the class of interest is underrepresented (minority
class). Practical solutions using classical ML models address the problem of large data
sets through parallel/distributed implementations of the training algorithms, approximate
model-based solutions, or instance selection (IS) algorithms that eliminate redundant
information. However, the combined problem of big and highly imbalanced datasets has
received much less attention. This work proposes three new IS methods able to deal with
large and imbalanced data sets. The proposed methods use Locality Sensitive Hashing
(LSH) as a base clustering technique, and then apply three different sampling strategies
on top of the clusters (or buckets) generated by LSH. The algorithms were developed in
the Apache Spark framework, which guarantees their scalability. The experiments carried
out on three different datasets suggest that the proposed IS methods can improve the
performance of a base ML model by between 5% and 19% in terms of the geometric mean.
Keywords: Instance Selection, Big Data, Locality-Sensitive Hashing, Imbalanced data
sets, Imbalanced Classification
1. Introduction
The large amounts of data currently being generated in different fields of knowledge
bring about several challenges, not only for data storage and management systems but
also for data analysis tools. In the Big Data context, many applications use historical
data to train machine learning (ML)-based systems that provide information systems
with the ability to make predictions on new data.
∗Corresponding author
Email addresses: geduardo.melo@udea.edu.co (Germán E. Melo-Acosta),
john.duitama@udea.edu.co (Freddy Duitama-Muñoz), julian.arias@upm.es (Julián D. Arias-Londoño)
The increase in the amount of data available for training purposes has boosted the
performance of many ML applications, because such an increase adds more information
to the learning process, reduces the epistemic uncertainty and also helps to prevent
over-fitting [1]. However, having more data is problematic for many learning algorithms,
since the computational complexity of most of them grows at least superlinearly with the
number of training samples. Moreover, some non-parametric models use a prototype-based
approach, where some (or all) of the training samples are stored in order to be used for
new predictions (e.g., the well-known k-nearest neighbors — k-NN, Support Vector
Machines [2] and Gaussian Process — GP — classification and regression methods [3]).
In those models, the predictions are based on comparisons (through a distance, a
similarity measure or a kernel function) between the new samples and a set of the stored
training samples, which demands a large amount of memory and processing time [4].
The specialized literature shows that there are mainly three approaches to tackle this
problem: parallel/distributed implementations of ML training algorithms [5]; model-based
solutions that use variants of the original models and provide approximate solutions
(e.g., pruned or cluster-based k-NN, sparse or inducing-variable GPs) [6, 7]; and instance
selection (IS) methods that reduce the number of samples to be considered during training
[4]. The first alternative is perhaps the most attractive one, but it requires a
considerable effort from the academic community to obtain a solution for every single
model one wants to apply in a Big data context. Moreover, some training algorithms are
inherently sequential; thus, their parallel/distributed variants provide approximate
solutions with an unavoidable performance degradation. The second alternative is similar
to the first one, but its purpose is to reduce the computational complexity of the
training process instead of parallelizing it. In this case, variants of the original
models or training algorithms are also required in order to use the training data in a
different and less demanding way, but most of the time they also yield approximate
solutions. The last alternative focuses on the data used for training; in this case, the
solutions reduce the number of samples required to achieve proper model fitting and
performance without modifying the original models or processing all the data. This
strategy can also be combined with the approaches based on parallel or distributed model
training. Notwithstanding their advantages, the methods in the last group must satisfy
two conditions: the selected samples must represent the underlying data distribution,
and their computational cost needs to be light enough to make their use worthwhile.
In this sense, many of the IS algorithms proposed in the literature are not suitable for
applications in the context of large volumes of data, because their complexity is at
least O(n log n) and because they rely on sorting algorithms that, by design, require all
samples to fit in a single machine [8]. The computational load of these algorithms comes
from the fact that they usually must search for critical or close-to-the-decision-boundary
instances within the whole dataset, which results in comparisons between every pair of
instances in the dataset, or in a dependence on iterative optimization algorithms [9, 10].
In addition to the computational complexity issue, ML models have to deal with skewed
data distributions in many real-life situations [11], such as the detection of fraudulent
transactions, mammogram lesions, or credit defaults. In those contexts, the severity of
the class imbalance can vary from minor to severe and is measured by the Imbalance Ratio
(IR), computed as the number of majority-class examples divided by the number of
minority-class examples. According to [12, 13], an IR higher than 100 characterizes a
highly imbalanced dataset; consequently, in those scenarios, IS methods must pay special
attention to the sampling of the minority class, since it can undermine the possibility
of training a model properly. In fact, some state-of-the-art works have reported that IS
can help reduce the impact of the IR on the training process [14, 15]; however, neither
experiments on highly imbalanced datasets nor their effects on the performance of ML
models have been thoroughly analyzed.
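For concreteness, the IR can be computed directly from the class labels, as in the short
sketch below (the function name and the list-of-labels input are our own choices):

```python
from collections import Counter

def imbalance_ratio(labels):
    """Imbalance Ratio (IR): majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Example: 990 majority-class vs. 10 minority-class samples -> IR = 99.0
print(imbalance_ratio([0] * 990 + [1] * 10))
```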
The adaptation of IS methods to tackle large data volumes and high-IR problems is usually
treated separately. For instance, in [8] an algorithm called Democratic Instance Selection
(DIS) is proposed, which consists in running several rounds of an IS algorithm on different
subsets of the original dataset (partitioning the training set into several disjoint
subsets). In each round, the well-known Decremental Reduction Optimization Procedure
Version 3 (DROP3) [16] or Iterative Case Filtering (ICF) [17] IS algorithm “votes” for the
instances to remove. The rounds are then combined using a voting scheme whose threshold
is computed by optimizing both the training error and the number of instances kept in
memory. Using a similar idea of splitting the original dataset to scale up IS methods,
[4] proposed the use of Locality Sensitive Hashing (LSH) to create partitions of the
original dataset, taking advantage of the fact that LSH has a linear cost. The authors
used the LSH partitions to create two new IS algorithms. The first one, called LSH-IS-S,
keeps only one instance of each class inside each partition. The second one, called
LSH-IS-F, evaluates each class within the partition: if only one instance of a class is
present, that instance is rejected; otherwise, if two or more instances of the same class
are present, one of them is randomly chosen. In the same vein, [9] introduced an algorithm
called XLDIS, which splits the original dataset using a modified k-NN method. Then, in
each subset, the most representative instances of each class are selected using the local
density of the instances (the average similarity between an instance and all its peers).
Although the results showed that the time complexity of XLDIS is lower than that of other
traditional IS approaches, the experimental setup did not use datasets with more than
20,000 samples.
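Following the descriptions in [4], the two per-bucket rules can be sketched as follows;
the bucket construction is omitted, and the function names and the data layout (a list of
(features, label) pairs per bucket) are our own assumptions:

```python
import random

def lsh_is_s(bucket):
    """LSH-IS-S: keep a single instance of each class present in the bucket."""
    kept, seen = [], set()
    for features, label in bucket:
        if label not in seen:
            seen.add(label)
            kept.append((features, label))
    return kept

def lsh_is_f(bucket):
    """LSH-IS-F: a class represented by a single instance in the bucket is rejected;
    when two or more instances of a class are present, one of them is kept at random."""
    by_class = {}
    for features, label in bucket:
        by_class.setdefault(label, []).append((features, label))
    return [random.choice(group) for group in by_class.values() if len(group) >= 2]
```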
In these three works, although the datasets used had some level of imbalance, their IR
was not analyzed, and the main results focused on demonstrating the near-linear
computational complexity of the algorithms. Furthermore, although the authors claimed
that the proposed algorithms are suitable for Big data contexts, only DIS was actually
implemented on a distributed processing framework [18].
Conversely, [19] and [20] proposed new IS algorithms for high-IR problems (with IRs
varying from 2 to 130). Both works used similarity-based clustering algorithms (such as
k-means or affinity propagation) to group the instances of each class separately; then,
IS was performed within the clusters of the majority class. The first method used a
parallelizable agent-based optimization algorithm to select the instances that maximized
the accuracy of the problem. However, even though the algorithm can be parallelized, it
must complete multiple training loops of the same classifier to optimize its accuracy.
The second method applied standard DROP3, Instance-Based Learning version 3 (IB3) [21],
and Genetic Algorithms inside the clusters (which were created only for the majority-class
samples) to perform the IS process. Additionally, both works focused on small datasets
(fewer than 5,000 instances), and the complexity of the techniques was not analyzed.
To the best of our knowledge, only [10] has aimed at adapting IS to solve the two problems
simultaneously. The authors proposed the Oligarchic IS (OligoIS) algorithm, which
introduces two modifications to the DIS algorithm: a random splitting of the dataset to
create the partitions, and a voting threshold that favors minority-class instances.
However, even though the results showed better accuracy when the IR of the problem was
higher, and the algorithm was tested on a large dataset (more than 5,000,000 instances),
no distributed processing framework was used. Moreover, the final performance of the
models trained with the reduced dataset depends on the IS method used: the authors showed
that the difference in performance between an evolutionary IS-based method and DROP3 can
reach up to 20% in relative terms, which could be explained by the fact that the work did
not propose any adaptation of the IS methods to the splitting process.
Bearing this in mind, the present work proposes three new IS algorithms designed for
high-IR problems that can also be used in a Big data context. One method balances the
trade-off between fast computation and a basic informed IS process, using an entropy
measure to guide the amount of sampling required. The other two methods adapt the popular
DROP3 algorithm to deal with imbalanced-class problems and perform a more informed IS.
As in [4], the proposed algorithms use LSH as a base clustering method to achieve a linear
computational cost on large-volume datasets, and subsequently apply the IS methods in a
divide-and-conquer manner. The influence of three LSH families on the performance of the
models is studied through several experiments on different datasets. The IS algorithms
are also analyzed in terms of computational complexity and IR. All the algorithms were
implemented using the distributed computing framework Apache Spark [22], and are publicly
available on Gitlab1.

1 https://gitlab.com/germanEduardo/lsh-analysis/
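As a rough illustration of the entropy-guided idea mentioned above (this is our own
sketch, not the exact criterion, which is defined in Section 2.3; in particular, the
mapping from entropy to a sampling fraction below is an assumption), the class entropy
of a bucket can be used to decide how aggressively that bucket is sampled:

```python
import math

def class_entropy(labels):
    """Shannon entropy (in bits) of the class distribution inside a bucket."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def sampling_fraction(labels, low=0.1, high=1.0):
    """Hypothetical rule: pure (low-entropy) buckets are sampled heavily, while
    mixed (high-entropy) buckets, likely close to the decision boundary, are kept."""
    classes = set(labels)
    max_entropy = math.log2(len(classes)) if len(classes) > 1 else 1.0
    return low + (high - low) * class_entropy(labels) / max_entropy
```

Under this illustrative rule, a bucket containing a single class would be reduced to
roughly 10% of its instances, whereas a perfectly mixed bucket would be kept entirely.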
The rest of the paper is organized as follows. Section 2 presents the LSH concepts and
families, together with a detailed explanation of the three LSH-based IS algorithms
introduced in this work and of how they are used to address the class imbalance problem.
Section 3 then describes the experimental setup, the results obtained, and their analysis.
Finally, Section 4 presents the main conclusions of the work and lines of future research.
2. Methods
2.1. Instance Selection using Locality Sensitive Hashing
LSH is a method for determining which items (e.g., samples in a feature space) of a given
set are similar. It is based on the simple idea that, if two points are close together,
then after applying a “projection” operation these two points will remain close together
[23]. Because it uses hash functions instead of distances to estimate the similarity of
the objects, LSH can create clusters or groups of similar objects with linear computational
complexity. Figure 1 shows a simplified diagram of how LSH can be used to perform IS. The
first step is to divide the feature space of the training data into regions using “buckets”
created by LSH; the buckets group the instances according to their similarity. Next, within
every bucket, a sampling algorithm is applied to select the most representative instances.
The sampling techniques proposed in this work (Sections 2.3 and 2.4) try to preserve the
most significant instances, taking into account the particular challenges of high-IR
problems discussed above; a minimal sketch of this bucket-then-sample flow is given after
Figure 1.
Figure 1: Methodology proposed to use LSH to perform Instance Selection
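The following single-machine sketch illustrates the bucket-then-sample flow of Figure 1
(the distributed Spark implementation in the repository is organized differently; the
random-hyperplane hashing and the helper names used here are illustrative assumptions):

```python
import numpy as np
from collections import defaultdict

def lsh_buckets(X, y, n_hashes=8, seed=0):
    """Group instances into buckets: the key of an instance is the sign pattern of
    its projections onto a set of random hyperplanes (cosine-style LSH)."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(n_hashes, X.shape[1]))   # one normal vector per hyperplane
    buckets = defaultdict(list)
    for features, label in zip(X, y):
        key = tuple(bool(b) for b in planes @ features >= 0)
        buckets[key].append((features, label))
    return buckets

def select_instances(X, y, sample_bucket):
    """Apply a per-bucket sampling rule and concatenate the surviving instances."""
    selected = []
    for bucket in lsh_buckets(X, y).values():
        selected.extend(sample_bucket(bucket))
    return selected
```

Passing one of the per-bucket rules sketched earlier (e.g., lsh_is_s) as sample_bucket
reproduces the keep-one-instance-per-class behaviour inside every bucket.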
As mentioned above, LSH requires the definition of hash functions with particular
properties. Additionally, it is essential to understand how those functions divide the
feature space in order to account for the high class imbalance: ultimately, the choice of
those functions determines which samples are seen by the underlying sampling method.
Therefore, the following sections formalize the properties of the hash functions, how they
are addressed in this study, and the sampling methods proposed.
2.2. Locality-sensitive functions
In the LSH approach, the central point is the definition of a hash function suitable for
“hashing” the items under analysis. The functions to be used in LSH must take two items
and provide a decision about whether these items are likely to be similar or not; such
functions are called locality-sensitive functions. In Figure 2, although the vectors ~x
and ~y may live in a space of many dimensions, they always define a plane and form an
angle θ between them, which is measured in this plane (gray region). Suppose we pick two
additional hyperplanes through the origin; each hyperplane intersects the plane of ~x and
~y in a line (dotted line). To pick a random hyperplane, we actually pick the normal
vector to the hyperplane, say ~v. Note that, with respect to the blue hyperplane, the
vectors ~x and ~y lie on different sides; as a consequence, the dot products ~v · ~x and
~v · ~y will
have different signs. In the same way, the randomly chosen vector ~v′ is normal to the red
hyperplane; in this case, ~x and ~y lie on the same side of it, so the dot products
~v′ · ~x and ~v′ · ~y will have the same sign.
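This behaviour is easy to verify numerically; the following small check (our own
illustration, not code from the paper) estimates how often two vectors separated by an
angle θ land on the same side of a random hyperplane, which for the cosine family happens
with probability 1 − θ/π:

```python
import numpy as np

def same_side(x, y, v):
    """True when the hyperplane with normal v leaves x and y on the same side,
    i.e. when the dot products v·x and v·y have the same sign."""
    return (v @ x >= 0) == (v @ y >= 0)

rng = np.random.default_rng(0)
theta = np.pi / 3
x = np.array([1.0, 0.0])
y = np.array([np.cos(theta), np.sin(theta)])
agreement = np.mean([same_side(x, y, rng.normal(size=2)) for _ in range(20_000)])
print(agreement, 1 - theta / np.pi)   # both values are close to 2/3
```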