Evaluating k-NN in the Classification of Data Streams with Concept Drift
Roberto Souto Maior de Barros (a,*), Silas Garrido Teixeira de Carvalho Santos (a), Jean Paul Barddal (b)
(a) Centro de Informática, Universidade Federal de Pernambuco, Cidade Universitária, 50740-560, Recife, Brazil
(b) Programa de Pós-Graduação em Informática, Pontifícia Universidade Católica do Paraná, 80215-901, Curitiba, Brazil
(*) Corresponding author. Email addresses: roberto@cin.ufpe.br (Roberto Souto Maior de Barros), sgtcs@cin.ufpe.br (Silas Garrido Teixeira de Carvalho Santos), jean.barddal@ppgia.pucpr.br (Jean Paul Barddal)
Abstract
Data streams are often defined as large amounts of data flowing continuously at high speed. Moreover,
these data are likely subject to changes in data distribution, known as concept drift. For these reasons, learning from streams is often performed online, under restrictions of memory consumption and run-time. Although many classification algorithms exist, most of the works published in the area use Naive Bayes (NB) and Hoeffding Trees (HT) as base learners in their experiments. This article proposes an in-depth evaluation of k-Nearest Neighbors (k-NN) as a candidate for classifying data streams subjected to concept drift. It also analyses the time complexity of k-NN and its two main parameters, i.e., the number of
nearest neighbors used for predictions (k), and window size (w). We compare different parameter values for
k-NN and contrast it to NB and HT both with and without a drift detector (RDDM) in many datasets. We
formulated and answered 10 research questions which led to the conclusion that k-NN is a worthy candidate
for data stream classification, especially when the run-time constraint is not too restrictive.
Keywords: k-Nearest Neighbors, data stream, concept drift, online learning, large-scale evaluation.
1. Introduction
Data stream environments involve large amounts of data that may be infinite and that flow continuously at high speed. Learning from such stream environments typically requires online methods and is often restricted in
memory consumption and run-time. In addition, the reading and processing of each instance are often limited
to single access, an arrangement called single-pass processing. Finally, changes in the data distribution are
expected, a phenomenon known as concept drift [1, 2, 3].
Concept drift can be categorized in several ways, and the most common one is based on the speed of
the changes. Such drifts are named abrupt when the changes between concepts are sudden or rapid, and
are considered gradual when the transition from one concept to the next one occurs over a larger number of
instances [4, 5, 6, 7].
In the real world, concept drifts occur for several reasons, including attacks and intrusions, mechanical or technical equipment failures, seasonal changes, movement detected by sensors, changes in the characteristics of spam e-mail messages, and others. Consequently, it is essential to detect and adapt to these concept drifts swiftly.
Different approaches have been proposed to learn from data streams containing concept drift [1, 8]. A
common approach to tackle concept drift is the use of ensembles, which combine a base learner with different strategies or weighting functions to estimate the resulting classification. Examples of such
ensembles include Dynamic Weighted Majority (DWM) [9], Diversity for Dealing with Drifts (DDD) [10],
Adaptable Diversity-based Online Boosting (ADOB) [11], Boosting-like Online Learning Ensemble (BOLE)
[12], Fast Adaptive Stacking of Ensembles (FASE) [13], Online AdaBoost-based M1 (OABM1) and M2
(OABM2) [14]. In addition, some ensemble algorithms reuse previous classifiers on recurring concepts,
e.g. Recurring Concept Drifts (RCD) [5] and Enhanced Concept Profiling Framework (ECPF) [15].
Another common approach is based on concept drift detectors, lightweight methods that monitor the
prediction results of a base learner and are meant to identify concept drifts in the distribution of the data
explicitly. Many concept drift detection methods have been proposed in the literature [2, 16], with current
emphasis on Drift Detection Method (DDM) [4], Reactive Drift Detection Method (RDDM) [17], Drift
Detection Methods based on Hoeffding’s Bounds (HDDM) [6] with versions HDDM_A and HDDM_W, Fast
Hoeffding Drift Detection Method (FHDDM) [7], Wilcoxon Rank Sum Test Drift Detector (WSTD) [18]
and Fisher Test Drift Detector (FTDD) [19]. It is also worth pointing out that concept drift detectors have
often been used as auxiliary methods in ensembles [5, 10, 12, 13, 14, 15], as well as to form ensembles that
share a single base classifier [20, 21], a comparatively less explored arrangement.
In this work, we target a roughly overlooked approach for classifying data streams: k-Nearest Neighbors
(k-NN) [22]. k-NN is a lazy and instance-based classifier: in the training step, instances are buffered for
posterior querying and distance computation in the test step. When using k-NN in data streams, two issues
arise [23, 24]. First, how to choose k, the number of nearest buffered instances used for prediction. Intuitively,
larger values of k reduce the impact of noisy data, yet render the class boundaries less distinct. Second, it is not possible to buffer instances indefinitely, as doing so jeopardizes both run-time and memory consumption constraints. This is probably the main reason why k-NN is overlooked in data stream mining, frequently dismissed based on the argument that it is too slow, and why Bayesian [25] and decision tree [26]
models are usually preferred, even though they can become very slow too.
This article aims to analyze in-depth the general performance of k-NN applied to data stream environments, surveying its advantages and limitations as well as discussing the reasons for its low popularity
in this context. For this purpose, large-scale experiments were conducted using several artificial dataset
generators, configured with both abrupt and gradual drift versions of several sizes, comparing k-NN to the
two most commonly used base classifiers in the data stream area — Naive Bayes (NB) [25] and Hoeffding
Tree (HT) [26, 27], and run in the Massive Online Analysis (MOA) framework [28]. More specifically, these
experiments were designed to answer the following research questions:
RQ1: What is the impact of the neighborhood (k) on the accuracy of k-NN?
RQ2: Does the accuracy of k-NN benefit from concept drift detection? If so, how much does it
improve with different values of k?
RQ3: What is the impact of the neighborhood (k) on the run-time of k-NN?
RQ4: How does concept drift detection impact the run-time of k-NN for different values of k?
RQ5: What is the impact of the window size (w) on the accuracy of k-NN?
RQ6: How does drift detection affect the accuracy of k-NN given different values of w?
RQ7: What is the impact of the window size (w) on k-NN’s run-time?
RQ8: How does concept drift detection impact the run-time of k-NN for different values of w?
RQ9: How does k-NN compare to NB and HT in terms of accuracy?
RQ10: How does k-NN compare to NB and HT regarding run-time?
The remainder of the document is organized into six sections. Section 2 reviews the main concepts and
approaches involving data stream classification and concept drift. Section 3 shows the configuration of the
experiments, also including descriptions of the datasets used in the tests. Section 4 analyses the accuracy
and run-time results of k-NN varying k, evaluates them statistically and answers research questions RQ1–
RQ4. Section 5 addresses the accuracy and run-time results of k-NN varying w, evaluates them statistically,
and answers research questions RQ5–RQ8. Section 6 compares the best configurations of k-NN with NB
and HT and answers RQ9 and RQ10. Finally, Section 7 concludes this article and proposes future work.
2. Background
This section is divided into two subsections. The first describes the characteristics of online classification and then reviews the NB, HT, and k-NN classifiers. The second subsection defines concept drift, a
common problem in data stream environments, and surveys some of the main approaches that deal with it.
The contents presented here were selected to cover the topics explored in this article.
2.1. Online Classification
Classification is the task that distributes a set of instances into discrete classes according to relations or
affinities. More formally, from a set of n instances in the (~x, y) form, such that ~x is a d-dimensional vector of attributes and y ∈ Y is the target, a classifier builds a model f : ~x → y that is able to predict the unseen targets of ~x values with accuracy.
Data stream classification, or online classification, is a variant of the traditional batch classification.
The main difference between these approaches lies in how data is presented to the learner. In the batch
configuration, the dataset is static and entirely accessible. In contrast, in streaming environments, instances
are not readily available to the classifier for training; instead, they are presented sequentially over time,
and the learner must adapt its model according to the arrival of instances from the stream [29]. Therefore,
we denote S = [(~x_t, y_t)]_{t=0}^{n→∞} as a data stream providing instances (~x_t, y_t), each of which arrives at a timestamp t.
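To make this setting concrete, the following minimal sketch (an illustration only, not the evaluation procedure used later in this article) shows the usual interleaved test-then-train loop over such a stream; the learner object and its predict(x)/train(x, y) methods are hypothetical names.

```python
def prequential_accuracy(stream, learner):
    """Interleaved test-then-train: predict each arriving instance, then learn from it."""
    correct, tested, seen = 0, 0, 0
    for x, y in stream:                  # instances (x_t, y_t) arrive one at a time
        if seen > 0:                     # only score once the learner has seen data
            correct += int(learner.predict(x) == y)
            tested += 1
        learner.train(x, y)              # single-pass: the instance is then discarded
        seen += 1
    return correct / tested if tested else 0.0
```

The classifier sketches later in this section follow the same predict/train interface.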
Learning from data streams shares issues common to traditional classification, such as high dimensionality, class imbalance, generalization to new data, and scalability, but it also suffers from specific issues [30, 1, 31]. First, the classifiers must be able to process instances sequentially, according to their arrival, and discard them right after. Though there is no restriction against buffering instances for a small period of time, as done in k-NN, the algorithm that we assess in this article, this should be done cautiously w.r.t. memory space and processing time. The following subsections present three different approaches used
in the online classification context.
2.1.1. Naive Bayes
Naive Bayes (NB) is a classifier based on the Bayes theorem. Its rationale is to compute class probabilities
based on the values of instance features and select the most likely class according to the product of the
conditional probabilities between feature values and the class. It is referred to as a naive classifier since it
works under the assumption that all attributes are independent and that all features have the same impact
on the outcome. Even though both assumptions are unverified in most scenarios [32], NB does surprisingly
well for its simplicity in many classification tasks.
Formally, considering that ~x is a d-dimensional vector of attributes of a given problem and y ∈ Y is the associated label, the probability of a class y, given the ~x attributes, is defined by Equation 1, where P(~x) and P(y) represent the evidence and the class prior probability, respectively, and P(~x | y) is the likelihood.

P(y | ~x) = P(y) P(~x | y) / P(~x),    (1)
The estimation of P(~x |y) can be performed in several ways, such as using the binomial or multinomial
distribution for discrete values or the normal distribution for continuous values.
In terms of time complexity, for each training instance, NB needs to go through the attributes, updating the metrics that will be used for the calculation of P(~x | y), and thus has an asymptotic performance of Θ(d). The frequency of the labels should also be updated to define P(y), which is a constant-cost operation.
On the other hand, the cost involved in classification is related to the calculation of P(~x | y), which requires covering each of the attributes associated with the different classes, demanding Θ(c × d) time complexity, where c corresponds to the number of classes. It is important to note that the calculation
of P(~x) (Equation 1) can be omitted since it will be the same for all classes. Thus, the probability of a class
y given a vector of attributes ~x is defined as follows:
P(y | ~x) ∝ P(y) ∏_{j=1}^{d} P(~x_j | y).    (2)
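As an illustration of the costs discussed above (a minimal sketch with hypothetical names, not the MOA implementation), the following incremental Naive Bayes assumes numeric attributes with Gaussian likelihoods: training updates per-feature statistics in Θ(d), and prediction evaluates Equation 2 in log-space for every class in Θ(c × d).

```python
import math
from collections import defaultdict

class IncrementalGaussianNB:
    """Minimal sketch of an online Naive Bayes with Gaussian likelihoods."""

    def __init__(self, n_features):
        self.d = n_features
        self.n = 0
        self.class_count = defaultdict(int)                   # for the prior P(y)
        self.sum = defaultdict(lambda: [0.0] * n_features)    # per-class feature sums
        self.sum_sq = defaultdict(lambda: [0.0] * n_features)

    def train(self, x, y):
        # Theta(d): one pass over the attributes of the instance.
        self.n += 1
        self.class_count[y] += 1
        for j in range(self.d):
            self.sum[y][j] += x[j]
            self.sum_sq[y][j] += x[j] * x[j]

    def predict(self, x):
        # Theta(c * d): log P(y) + sum_j log P(x_j | y) for every class.
        best, best_score = None, -math.inf
        for y, cnt in self.class_count.items():
            score = math.log(cnt / self.n)
            for j in range(self.d):
                mean = self.sum[y][j] / cnt
                var = max(self.sum_sq[y][j] / cnt - mean * mean, 1e-9)
                score += -0.5 * (math.log(2 * math.pi * var) + (x[j] - mean) ** 2 / var)
            if score > best_score:
                best, best_score = y, score
        return best
```

Working in log-space avoids numerical underflow when the product of Equation 2 involves many small probabilities.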
2.1.2. Hoeffding Tree
Decision trees are a popular choice for classification tasks as they are simple, robust, and human-
understandable. They are learned recursively by replacing leaves with split nodes that guide class separation and classification. The attribute used in a split node is chosen by comparing all available features and selecting the best option according to some heuristic function J, e.g.,
Conditional Entropy, Information Gain, and Symmetrical Uncertainty. The splitting process is repeated on
top of a set of training examples stored in main memory. As a result, classical decision trees are limited to
learning from this specific set of instances and, consequently, are not tailored to evolve over time.
In streaming scenarios, the assumption that the entire dataset is available for training does not hold and,
thus, the Hoeffding Tree (HT) was proposed in [26] to learn tree models from streaming data incrementally.
HT relaxes this constraint by determining which feature is the most appropriate, according to J, based on a
small data sample. To determine how big this sample should be, the authors [26] proposed the use of the
Hoeffding bound [33]. Assuming the heuristic function J with range R, the Hoeffding bound states that, with probability (1 − δ), the true mean of J is at least (J̄ − ε), where ε is the bound calculated following Equation 3.

ε = √( R² × ln(1/δ) / (2n) )    (3)
Therefore, with high probability, the data distribution observed in a data sample of size n adheres to the population distribution, which is potentially unbounded in data streams. Consequently, HTs attempt a new split every time a leaf contains n instances. Assuming that the goal is to maximize J, that X_a is the best-ranked feature in terms of J, and that X_b is the second best, then a split will be performed on X_a if ΔJ = J(X_a, Y) − J(X_b, Y) > ε. As a result of empirical experiments, [26] noticed that reasonably small values of n, e.g., 200, achieve effective results, and the same value has been adopted in the MOA framework, which has been used in the experiments of this article.
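The split test can be summarized in a few lines; the sketch below (hypothetical helper names, not the MOA code) computes the bound of Equation 3 and applies the ΔJ comparison, with R, δ, n, and the heuristic values supplied by the caller.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Equation 3: epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(j_best, j_second_best, value_range, delta, n):
    """Split on the best-ranked feature when Delta-J exceeds the Hoeffding bound."""
    epsilon = hoeffding_bound(value_range, delta, n)
    return (j_best - j_second_best) > epsilon

# Example: information gain has range R = log2(c) for c classes.
# With c = 2 (R = 1), delta = 1e-7, and n = 200 leaf instances:
print(should_split(j_best=0.32, j_second_best=0.10, value_range=1.0, delta=1e-7, n=200))
```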
For training, the cost associated with HT involves traversing the tree and, upon finding the leaf node, updating the corresponding statistics. In addition, a check must be performed from time to time to determine whether a leaf split should occur. All of these operations result in a time complexity of O(l + c × d × v), or simply O(c × d × v), for training, where l corresponds to the height of the tree, c is the number of classes, d represents the number of attributes, and v is the maximum number of values per attribute.
Finally, to classify an instance, the associated cost is basically that of traversing the tree to the leaf node, with time complexity O(l). However, a common strategy to improve the accuracy of HT is to use NB learners at the leaves instead of the majority class classifier [32]. Thus, it is possible to calculate P(y | ~x) (see Subsection 2.1.1) using the statistics available in the leaf node, though adopting this strategy brings an additional cost, making the complexity O(l + c × d).
2.1.3. k-Nearest Neighbors
k-Nearest Neighbors (k-NN) is one of the most fundamental, simple, and widely used classification
methods, which can learn complex (non-linear) functions [22]. k-NN is a lazy learner since it does not
require building a model before its actual use. It classifies unlabeled instances according to the closest
previously seen labeled ones stored in a buffer. Here, closeness is determined by a distance measure that quantifies how similar or dissimilar two instances are.
Several approaches compute distances between instances. Although there is no optimal distance for all
cases, specific scenarios can benefit from choosing the most suitable one. A recent and in-depth study on
this topic can be found in [34]. As this type of analysis is outside the scope of this article, we concentrate
on the most popular, which is the Euclidean distance [35], given by Equation 4. The vectors ~x_i and ~x_j are two arbitrary instances and the summation occurs over all features X_k ∈ X.

dist_Euclid(~x_i, ~x_j) = √( Σ_{X_k ∈ X} (~x_i[X_k] − ~x_j[X_k])² )    (4)
k-NN classifies unlabeled instances according to the majority label among the k closest instances and, thus, picking an appropriate value of k for each application is also significant. If k is too small, k-NN becomes more prone to over-fitting and tends to misclassify instances in easy situations. Conversely, bigger values of k may mislead classification when an instance is surrounded by several instances of a different label near fuzzy decision borders.
The classification of data streams using k-NN requires an additional important trait: dealing with time
and memory limitations. Continuously buffering instances as they arrive is unfeasible since the stream is
potentially unbounded. Therefore, an incremental version of k-NN must forget older instances as the stream
progresses. A simple approach for forgetting is storing instances in a queue of size w. Again, defining a value for w is nontrivial, and it must be set according to constraints of memory and processing time.
Given that a queue of size w stores the most recent instances of the stream, the classification process consists of computing the distance between the current instance and the instances in the queue. After that, the k instances with the shortest distances are selected. A linear implementation of k-NN, which uses a heap to keep the k shortest distances as they are calculated, has time complexity O(w × (d + log k)) for classification. On the other hand, because it is a lazy algorithm that postpones all processing to the classification step, training has constant time complexity, requiring only storing the instances as they arrive.
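A minimal sketch of such a window-based k-NN follows (hypothetical names, not the MOA implementation): a fixed-capacity deque provides the forgetting mechanism, and a bounded heap keeps the k shortest distances, matching the O(w × (d + log k)) classification cost and the constant-time training discussed above.

```python
import heapq
from collections import Counter, deque

class WindowedKNN:
    """Sketch of k-NN over a sliding window of the w most recent instances."""

    def __init__(self, k=3, w=1000):
        self.k = k
        self.window = deque(maxlen=w)   # oldest instances are forgotten automatically

    def train(self, x, y):
        # Constant time: lazy learner, just buffer the labeled instance.
        self.window.append((x, y))

    def predict(self, x):
        # O(w * (d + log k)): one distance per buffered instance, heap of size k.
        heap = []  # max-heap (by negated distance) holding the k nearest so far
        for xi, yi in self.window:
            dist = sum((a - b) ** 2 for a, b in zip(x, xi))  # squared Euclidean
            if len(heap) < self.k:
                heapq.heappush(heap, (-dist, yi))
            elif -dist > heap[0][0]:
                heapq.heapreplace(heap, (-dist, yi))
        if not heap:
            return None
        return Counter(label for _, label in heap).most_common(1)[0][0]
```

Using the squared Euclidean distance preserves the neighbor ordering of Equation 4 while avoiding a square root per comparison.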
It is worth highlighting that instance-based approaches such as IBLStreams [23] and Self-adjusting Memory k-NN (SAM-kNN) [24] have been proposed for data streams. IBLStreams provides two mechanisms for self-adjusting the neighborhood hyper-parameter k: the first accounts for the error rates obtained by the (k−1), k, and (k+1) versions of IBLStreams over the previous 100 instances, while the second tests kernel sizes coupled with centroids that deviate 5% from the current kernel size. However, both mechanisms induce relevant computational overheads because multiple instances
of IBLStreams are to be run in parallel. On the other hand, SAM-kNN proposes a combination of k-
NN with short-term and long-term memories. Despite this approach not requiring explicitly determining
window sizes, it does require the definition of hyper-parameters that determine the size of long-term and
short-term memories, as well as the neighborhood size k. In this paper, we focus solely on the original k-NN
algorithm and conduct an exploratory study on the impact of the neighborhood size (k) and window size
(w) hyper-parameters on the performance of k-NN.
2.2. Concept Drift Problem
Due to the temporal and ephemeral characteristics of data streams, these are expected to change their
data distributions, thus giving rise to concept drifts [36, 37]. Following the definition in [38], we denote a concept C as a set of class probabilities and class-conditional probability density functions, as stated in C = ⋃_{y ∈ Y} {(P[y], P[~x | y])}.
Consequently, given a stream S, instances (~x_t, y_t) are generated according to the current concept C_t. If during every instant t_i it follows that C_{t_i} = C_{t_i−1}, then the concept is stable. Otherwise, if between any two timestamps t_i and t_j = t_i + Δ it occurs that C_{t_i} ≠ C_{t_j}, we have a concept drift. The cause of concept drifts can neither be determined nor predicted by learning algorithms, as the information behind their causes is often unavailable (hidden context) or too costly to obtain [39]. As a result, learners tailored for data streams
must detect and adapt to these changes automatically and autonomously.
Concept drifts are often categorized according to their speed. Given that a drift occurs at timestamp t_i and that the concept becomes stable once again at timestamp t_j = t_i + 1, the drift is said to be abrupt. Otherwise, if more instances elapse between the drift and the moment when the new concept becomes stable, the drift is called gradual.
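As a simple illustration of these two drift profiles (a hedged sketch, not one of the stream generators used in the experiments of this article), the code below switches between two synthetic concepts either abruptly at a given instance or gradually, by sampling the new concept with increasing probability over a transition window.

```python
import random

def concept_a():
    """Concept A: label is 1 when the sum of the two features exceeds 1."""
    x = [random.random(), random.random()]
    return x, int(x[0] + x[1] > 1.0)

def concept_b():
    """Concept B: the decision rule is inverted after the drift."""
    x = [random.random(), random.random()]
    return x, int(x[0] + x[1] <= 1.0)

def drifting_stream(n, drift_at, width=1):
    """Yield n instances; width=1 gives an abrupt drift, width>1 a gradual one."""
    for t in range(n):
        # Probability of drawing from the new concept grows from 0 to 1
        # across the transition window [drift_at, drift_at + width).
        p_new = min(max((t - drift_at) / width, 0.0), 1.0)
        yield concept_b() if random.random() < p_new else concept_a()

# Abrupt drift at instance 5000, and a gradual drift spread over 2000 instances:
abrupt = list(drifting_stream(10000, drift_at=5000))
gradual = list(drifting_stream(10000, drift_at=5000, width=2000))
```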