Evaluating k-NN in the Classification of Data Streams with Concept Drift
Roberto Souto Maior de Barros (a,*), Silas Garrido Teixeira de Carvalho Santos (a), Jean Paul Barddal (b)
(a) Centro de Informática, Universidade Federal de Pernambuco, Cidade Universitária, 50740-560, Recife, Brazil
(b) Programa de Pós-Graduação em Informática, Pontifícia Universidade Católica do Paraná, 80215-901, Curitiba, Brazil
(*) Corresponding author. Email addresses: roberto@cin.ufpe.br (Roberto Souto Maior de Barros), sgtcs@cin.ufpe.br (Silas Garrido Teixeira de Carvalho Santos), jean.barddal@ppgia.pucpr.br (Jean Paul Barddal)
Abstract
Data streams are often defined as large amounts of data flowing continuously at high speed. Moreover,
these data are likely subject to changes in data distribution, known as concept drift. For these reasons, learning from streams is often performed online, under restrictions of memory consumption and run-time. Although many classification algorithms exist, most of the works published in the area use Naive Bayes (NB) and Hoeffding Trees (HT) as base learners in their experiments. This article proposes an in-depth evaluation of k-Nearest Neighbors (k-NN) as a candidate for classifying data streams subjected to concept drift. It also analyses the time complexity of k-NN and its two main parameters, i.e., the number of
nearest neighbors used for predictions (k), and window size (w). We compare different parameter values for
k-NN and contrast it to NB and HT both with and without a drift detector (RDDM) in many datasets. We
formulated and answered 10 research questions which led to the conclusion that k-NN is a worthy candidate
for data stream classification, especially when the run-time constraint is not too restrictive.
Keywords: k-Nearest Neighbors, data stream, concept drift, online learning, large-scale evaluation.
1. Introduction
Data stream environments involve large amounts of data that may be infinite and that flow continuously at high speed. Learning from such stream environments typically requires online methods and is often restricted in
memory consumption and run-time. In addition, the reading and processing of each instance are often limited
to single access, an arrangement called single-pass processing. Finally, changes in the data distribution are
expected, a phenomenon known as concept drift [1, 2, 3].
Concept drift can be categorized in several ways, and the most common one is based on the speed of
the changes. Such drifts are named abrupt when the changes between concepts are sudden or rapid, and
are considered gradual when the transition from one concept to the next one occurs over a larger number of
instances [4, 5, 6, 7].
In the real world, concept drifts occur for several reasons, including attacks and intrusions, mechanical or technical equipment failures, seasonal changes, movement detected by sensors, changes in the characteristics of spam e-mail messages, and others. Consequently, it is essential to detect and adapt to these concept drifts swiftly.
Different approaches have been proposed to learn from data streams containing concept drift [1, 8]. A
common approach to tackle concept drift is the use of ensembles, which combine a base learner with different strategies or weighting functions to estimate the resulting classification. Examples of such
ensembles include Dynamic Weighted Majority (DWM) [9], Diversity for Dealing with Drifts (DDD) [10],
Adaptable Diversity-based Online Boosting (ADOB) [11], Boosting-like Online Learning Ensemble (BOLE)
[12], Fast Adaptive Stacking of Ensembles (FASE) [13], Online AdaBoost-based M1 (OABM1) and M2
(OABM2) [14]. In addition, some ensemble algorithms reuse previous classifiers on recurring concepts,
e.g. Recurring Concept Drifts (RCD) [5] and Enhanced Concept Profiling Framework (ECPF) [15].
Another common approach is based on concept drift detectors, lightweight methods that monitor the
prediction results of a base learner and are meant to identify concept drifts in the distribution of the data
explicitly. Many concept drift detection methods have been proposed in the literature [2, 16], with current
emphasis on Drift Detection Method (DDM) [4], Reactive Drift Detection Method (RDDM) [17], Drift
Detection Methods based on Hoeffding’s Bounds (HDDM) [6] with versions HDDM_A and HDDM_W, Fast
Hoeffding Drift Detection Method (FHDDM) [7], Wilcoxon Rank Sum Test Drift Detector (WSTD) [18]
and Fisher Test Drift Detector (FTDD) [19]. It is also worth pointing out that concept drift detectors have
often been used as auxiliary methods in ensembles [5, 10, 12, 13, 14, 15], as well as to form ensembles that
share a single base classifier [20, 21], a comparatively less explored arrangement.
In this work, we target a roughly overlooked approach for classifying data streams: k-Nearest Neighbors
(k-NN) [22]. k-NN is a lazy and instance-based classifier: in the training step, instances are buffered for
posterior querying and distance computation in the test step. When using k-NN in data streams, two issues
arise [23, 24]. First, how to choose k, the number of nearest buffered instances used for prediction. Intuitively,
larger values of k reduce the impact of noisy data, yet render the class boundaries less distinct. Second, it is not possible to buffer instances indefinitely, as doing so jeopardizes both run-time and memory consumption constraints. This is probably the main reason why k-NN is overlooked in data stream mining, frequently dismissed based on the argument that it is too slow, and why Bayesian [25] and decision tree [26]
models are usually preferred, even though they can become very slow too.
This article aims to analyze in-depth the general performance of k-NN applied to data stream environments, surveying its advantages and limitations as well as discussing the reasons for its low popularity
in this context. For this purpose, large-scale experiments were conducted using several artificial dataset
generators, configured with both abrupt and gradual drift versions of several sizes, comparing k-NN to the
two most commonly used base classifiers in the data stream area — Naive Bayes (NB) [25] and Hoeffding
Tree (HT) [26, 27], and run in the Massive Online Analysis (MOA) framework [28]. More specifically, these
experiments were designed to answer the following research questions:
RQ1: What is the impact of the neighborhood (k) on the accuracy of k-NN?
RQ2: Does the accuracy of k-NN benefit from concept drift detection? If so, how much does it
improve with different values of k?
RQ3: What is the impact of the neighborhood (k) on the run-time of k-NN?
RQ4: How does concept drift detection impact the run-time of k-NN for different values of k?
RQ5: What is the impact of the window size (w) on the accuracy of k-NN?
RQ6: How does drift detection affect the accuracy of k-NN given different values of w?
RQ7: What is the impact of the window size (w) on k-NN’s run-time?
RQ8: How does concept drift detection impact the run-time of k-NN for different values of w?
RQ9: How does k-NN compare to NB and HT in terms of accuracy?
RQ10: How does k-NN compare to NB and HT regarding run-time?
The remainder of the document is organized into six sections. Section 2 reviews the main concepts and
approaches involving data stream classification and concept drift. Section 3 shows the configuration of the
experiments, also including descriptions of the datasets used in the tests. Section 4 analyses the accuracy
and run-time results of k-NN varying k, evaluates them statistically and answers research questions RQ1–
RQ4. Section 5 addresses the accuracy and run-time results of k-NN varying w, evaluates them statistically,
and answers research questions RQ5–RQ8. Section 6 compares the best configurations of k-NN with NB
and HT and answers RQ9 and RQ10. Finally, Section 7 concludes this article and proposes future work.
2. Background
This section is divided into two subsections. The first describes the characteristics of online classification and then reviews the NB, HT, and k-NN classifiers. The second subsection defines concept drift, a
common problem in data stream environments, and surveys some of the main approaches that deal with it.
The contents presented here were selected to cover the topics explored in this article.
2.1. Online Classification
Classification is the task that distributes a set of instances into discrete classes according to relations or
affinities. More formally, from a set of n instances in the (~x, y) form, such that ~x is a d-dimensional vector of attributes and y ∈ Y is the target, a classifier builds a model f : ~x → y that is able to predict the unseen targets of ~x values with accuracy.
Data stream classification, or online classification, is a variant of the traditional batch classification.
The main difference between these approaches lies in how data is presented to the learner. In the batch
configuration, the dataset is static and entirely accessible. In contrast, in streaming environments, instances
are not readily available to the classifier for training; instead, they are presented sequentially over time,
and the learner must adapt its model according to the arrival of instances from the stream [29]. Therefore,
we denote S = [(~x_t, y_t)]_{t=0}^{n→∞} as a data stream providing instances (~x_t, y_t), each of which arrives at a timestamp t.
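To make this setting concrete, the following minimal sketch (an illustration only, not the evaluation procedure used later in this article) shows the usual interleaved test-then-train loop over such a stream; the learner object and its predict(x)/train(x, y) methods are hypothetical names.

```python
def prequential_accuracy(stream, learner):
    """Interleaved test-then-train: predict each arriving instance, then learn from it."""
    correct, tested, seen = 0, 0, 0
    for x, y in stream:                  # instances (x_t, y_t) arrive one at a time
        if seen > 0:                     # only score once the learner has seen data
            correct += int(learner.predict(x) == y)
            tested += 1
        learner.train(x, y)              # single-pass: the instance is then discarded
        seen += 1
    return correct / tested if tested else 0.0
```

The classifier sketches later in this section follow the same predict/train interface.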
Learning from data streams shares issues common to traditional classification, such as high dimensionality, class imbalance, generalization to new data, and scalability, but it also suffers from specific issues [30, 1, 31]. First, the classifiers must be able to process instances sequentially, according to their arrival, and discard them right after. Though there is no restriction against buffering instances for a small period of time, as done in k-NN, the algorithm that we assess in this article, this should be done cautiously w.r.t. memory space and processing time. The following subsections present three different approaches used
in the online classification context.
2.1.1. Naive Bayes
Naive Bayes (NB) is a classifier based on the Bayes theorem. Its rationale is to compute class probabilities
based on the values of instance features and select the most likely class according to the product of the
conditional probabilities between feature values and the class. It is referred to as a naive classifier since it
works under the assumption that all attributes are independent and that all features have the same impact
on the outcome. Even though both assumptions are unverified in most scenarios [32], NB does surprisingly
well for its simplicity in many classification tasks.
Formally, considering that ~x is a d-dimensional vector of attributes of a given problem and y ∈ Y is the associated label, the probability of a class y, given the ~x attributes, is defined by Equation 1, where P(~x) and P(y) represent the evidence and the class prior probability, respectively, and P(~x | y) is the likelihood.

P(y | ~x) = P(y) P(~x | y) / P(~x),    (1)
The estimation of P(~x |y) can be performed in several ways, such as using the binomial or multinomial
distribution for discrete values or the normal distribution for continuous values.
In terms of time complexity, for each training instance, NB needs to go through the attributes, updating the metrics that will be used for the calculation of P(~x | y), and thus has an asymptotic performance of Θ(d). The frequency of the labels should also be updated to define P(y), which is a constant-cost operation.
On the other hand, the cost involved in classification is related to the calculation of P(~x | y), which requires covering each of the attributes associated with the different classes, demanding Θ(c × d) time complexity, where c corresponds to the number of classes. It is important to note that the calculation
of P(~x) (Equation 1) can be omitted since it will be the same for all classes. Thus, the probability of a class
y given a vector of attributes ~x is defined as follows:
P(y | ~x) ∝ P(y) ∏_{j=1}^{d} P(~x_j | y).    (2)
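As an illustration of the costs discussed above (a minimal sketch with hypothetical names, not the MOA implementation), the following incremental Naive Bayes assumes numeric attributes with Gaussian likelihoods: training updates per-feature statistics in Θ(d), and prediction evaluates Equation 2 in log-space for every class in Θ(c × d).

```python
import math
from collections import defaultdict

class IncrementalGaussianNB:
    """Minimal sketch of an online Naive Bayes with Gaussian likelihoods."""

    def __init__(self, n_features):
        self.d = n_features
        self.n = 0
        self.class_count = defaultdict(int)                   # for the prior P(y)
        self.sum = defaultdict(lambda: [0.0] * n_features)    # per-class feature sums
        self.sum_sq = defaultdict(lambda: [0.0] * n_features)

    def train(self, x, y):
        # Theta(d): one pass over the attributes of the instance.
        self.n += 1
        self.class_count[y] += 1
        for j in range(self.d):
            self.sum[y][j] += x[j]
            self.sum_sq[y][j] += x[j] * x[j]

    def predict(self, x):
        # Theta(c * d): log P(y) + sum_j log P(x_j | y) for every class.
        best, best_score = None, -math.inf
        for y, cnt in self.class_count.items():
            score = math.log(cnt / self.n)
            for j in range(self.d):
                mean = self.sum[y][j] / cnt
                var = max(self.sum_sq[y][j] / cnt - mean * mean, 1e-9)
                score += -0.5 * (math.log(2 * math.pi * var) + (x[j] - mean) ** 2 / var)
            if score > best_score:
                best, best_score = y, score
        return best
```

Working in log-space avoids numerical underflow when the product of Equation 2 involves many small probabilities.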
2.1.2. Hoeffding Tree
Decision trees are a popular choice for classification tasks as they are simple, robust, and human-
understandable. They are learned recursively by replacing leaves with split nodes that guide class separation and classification. The attribute used in a split node is chosen by comparing all available features and selecting the best option according to some heuristic function J, e.g.,
Conditional Entropy, Information Gain, and Symmetrical Uncertainty. The splitting process is repeated on
top of a set of training examples stored in main memory. As a result, classical decision trees are limited to
learning from this specific set of instances and, consequently, are not tailored to evolve over time.
In streaming scenarios, the assumption that the entire dataset is available for training does not hold and,
thus, the Hoeffding Tree (HT) was proposed in [26] to learn tree models from streaming data incrementally.
HT relaxes this constraint by determining which feature is the most appropriate, according to J, based on a
small data sample. To determine how big this sample should be, the authors [26] proposed the use of the
Hoeffding bound [33]. Assuming the heuristic function J with range R, the Hoeffding bound states that, with probability (1 − δ), the true mean of J is at least (J̄ − ε), where ε is the bound calculated following Equation 3.

ε = √( R² × ln(1/δ) / (2n) )    (3)
Therefore, with high probability, the data distribution observed in a data sample of size n adheres to the population distribution, which is potentially unbounded in data streams. Consequently, HTs attempt a new split every time a leaf contains n instances. Assuming that the goal is to maximize J, that X_a is the best-ranked feature in terms of J, and that X_b is the second best, then a split will be performed on X_a if ΔJ = J(X_a, Y) − J(X_b, Y) > ε. As a result of empirical experiments, [26] noticed that reasonably small values of n, e.g., 200, achieve effective results, and the same value has been adopted in the MOA framework, which has been used in the experiments of this article.
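The split test can be summarized in a few lines; the sketch below (hypothetical helper names, not the MOA code) computes the bound of Equation 3 and applies the ΔJ comparison, with R, δ, n, and the heuristic values supplied by the caller.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Equation 3: epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(j_best, j_second_best, value_range, delta, n):
    """Split on the best-ranked feature when Delta-J exceeds the Hoeffding bound."""
    epsilon = hoeffding_bound(value_range, delta, n)
    return (j_best - j_second_best) > epsilon

# Example: information gain has range R = log2(c) for c classes.
# With c = 2 (R = 1), delta = 1e-7, and n = 200 leaf instances:
print(should_split(j_best=0.32, j_second_best=0.10, value_range=1.0, delta=1e-7, n=200))
```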
For training, the cost associated with HT involves traversing the tree and, upon finding the leaf node, updating the corresponding statistics. In addition, a check must be performed from time to time to determine whether a leaf split should occur. All of these operations result in a time complexity of O(l + c × d × v), or simply O(c × d × v), for training, where l corresponds to the height of the tree, c is the number of classes, d represents the number of attributes, and v is the maximum number of values per attribute.
Finally, to classify an instance, the associated cost is basically that of traversing the tree to the leaf node, with time complexity O(l). However, a common strategy to improve the accuracy of HT is to use NB learners at the leaves instead of the majority class classifier [32]. Thus, it is possible to calculate P(y | ~x) (see Subsection 2.1.1) using the statistics available in the leaf node, though adopting this strategy brings an additional cost, making the complexity O(l + c × d).
2.1.3. k-Nearest Neighbors
k-Nearest Neighbors (k-NN) is one of the most fundamental, simple, and widely used classification
methods, which can learn complex (non-linear) functions [22]. k-NN is a lazy learner since it does not
require building a model before its actual use. It classifies unlabeled instances according to the closest
previously seen labeled ones stored in a buffer. Here, closeness is determined by a distance measure that quantifies how similar or dissimilar two instances are.
Several approaches compute distances between instances. Although there is no optimal distance for all
cases, specific scenarios can benefit from choosing the most suitable one. A recent and in-depth study on
this topic can be found in [34]. As this type of analysis is outside the scope of this article, we concentrate
on the most popular, which is the Euclidean distance [35], given by Equation 4. The vectors ~x_i and ~x_j are two arbitrary instances and the summation occurs over all features X_k ∈ X.

dist_Euclid(~x_i, ~x_j) = √( Σ_{X_k ∈ X} (~x_i[X_k] − ~x_j[X_k])² )    (4)
k-NN classifies unlabeled instances according to the majority label among the k closest instances and, thus, picking an appropriate value of k for each application is also significant. If k is too small, k-NN becomes more prone to over-fitting and tends to misclassify instances in easy situations. Conversely, bigger values of k may mislead classification when an instance is surrounded by several instances of a different label near fuzzy decision borders.
The classification of data streams using k-NN requires an additional important trait: dealing with time
and memory limitations. Continuously buffering instances as they arrive is unfeasible since the stream is
potentially unbounded. Therefore, an incremental version of k-NN must forget older instances as the stream
progresses. A simple approach for forgetting is storing instances in a queue of size w. Again, defining a value for w is nontrivial, and it must be set according to constraints of memory and processing time.
Given that a queue of size w stores the most recent instances of the stream, the classification process consists of computing the distance between the current instance and the instances in the queue. After that, the k instances with the shortest distances are selected. A linear implementation of k-NN, which uses a heap to keep the k shortest distances as they are calculated, has time complexity O(w × (d + log k)) for classification. On the other hand, because it is a lazy algorithm that postpones all processing to the classification step, training has constant time complexity, requiring only storing the instances as they arrive.
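A minimal sketch of such a window-based k-NN follows (hypothetical names, not the MOA implementation): a fixed-capacity deque provides the forgetting mechanism, and a bounded heap keeps the k shortest distances, matching the O(w × (d + log k)) classification cost and the constant-time training discussed above.

```python
import heapq
from collections import Counter, deque

class WindowedKNN:
    """Sketch of k-NN over a sliding window of the w most recent instances."""

    def __init__(self, k=3, w=1000):
        self.k = k
        self.window = deque(maxlen=w)   # oldest instances are forgotten automatically

    def train(self, x, y):
        # Constant time: lazy learner, just buffer the labeled instance.
        self.window.append((x, y))

    def predict(self, x):
        # O(w * (d + log k)): one distance per buffered instance, heap of size k.
        heap = []  # max-heap (by negated distance) holding the k nearest so far
        for xi, yi in self.window:
            dist = sum((a - b) ** 2 for a, b in zip(x, xi))  # squared Euclidean
            if len(heap) < self.k:
                heapq.heappush(heap, (-dist, yi))
            elif -dist > heap[0][0]:
                heapq.heapreplace(heap, (-dist, yi))
        if not heap:
            return None
        return Counter(label for _, label in heap).most_common(1)[0][0]
```

Using the squared Euclidean distance preserves the neighbor ordering of Equation 4 while avoiding a square root per comparison.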
It is worth highlighting that instance-based approaches such as IBLStreams [23] and Self-adjusting Memory k-NN (SAM-kNN) [24] have been proposed for data streams. IBLStreams provides two mechanisms for self-adjusting the neighborhood hyper-parameter k: the first accounts for the error rates obtained by the (k−1), k, and (k+1) versions of IBLStreams over the previous 100 instances, while the second tests kernel sizes coupled with centroids that deviate 5% from the current kernel size. However, both mechanisms induce relevant computational overheads because multiple instances
of IBLStreams are to be run in parallel. On the other hand, SAM-kNN proposes a combination of k-
NN with short-term and long-term memories. Despite this approach not requiring explicitly determining
window sizes, it does require the definition of hyper-parameters that determine the size of long-term and
short-term memories, as well as the neighborhood size k. In this paper, we focus solely on the original k-NN
algorithm and conduct an exploratory study on the impact of the neighborhood size (k) and window size
(w) hyper-parameters on the performance of k-NN.
2.2. Concept Drift Problem
Due to the temporal and ephemeral characteristics of data streams, these are expected to change their
data distributions, thus giving rise to concept drifts [36, 37]. Following the definition in [38], we denote a concept C as a set of class probabilities and class-conditional probability density functions, as stated in C = ⋃_{y ∈ Y} {(P[y], P[~x | y])}.
Consequently, given a stream S, instances (~x_t, y_t) are generated according to the current concept C_t. If during every instant t_i it follows that C_{t_i} = C_{t_i−1}, then the concept is stable. Otherwise, if between any two timestamps t_i and t_j = t_i + Δ it occurs that C_{t_i} ≠ C_{t_j}, we have a concept drift. The cause of concept drifts can neither be determined nor predicted by learning algorithms, as the information behind their causes is often unavailable (hidden context) or too costly to obtain [39]. As a result, learners tailored for data streams
must detect and adapt to these changes automatically and autonomously.
Concept drifts are often categorized according to their speed. Given that a drift occurs at timestamp t_i and that the concept becomes stable once again at timestamp t_j = t_i + 1, the drift is said to be abrupt. Otherwise, if more instances elapse between the drift and the moment when the new concept becomes stable, the drift is called gradual.
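As a simple illustration of these two drift profiles (a hedged sketch, not one of the stream generators used in the experiments of this article), the code below switches between two synthetic concepts either abruptly at a given instance or gradually, by sampling the new concept with increasing probability over a transition window.

```python
import random

def concept_a():
    """Concept A: label is 1 when the sum of the two features exceeds 1."""
    x = [random.random(), random.random()]
    return x, int(x[0] + x[1] > 1.0)

def concept_b():
    """Concept B: the decision rule is inverted after the drift."""
    x = [random.random(), random.random()]
    return x, int(x[0] + x[1] <= 1.0)

def drifting_stream(n, drift_at, width=1):
    """Yield n instances; width=1 gives an abrupt drift, width>1 a gradual one."""
    for t in range(n):
        # Probability of drawing from the new concept grows from 0 to 1
        # across the transition window [drift_at, drift_at + width).
        p_new = min(max((t - drift_at) / width, 0.0), 1.0)
        yield concept_b() if random.random() < p_new else concept_a()

# Abrupt drift at instance 5000, and a gradual drift spread over 2000 instances:
abrupt = list(drifting_stream(10000, drift_at=5000))
gradual = list(drifting_stream(10000, drift_at=5000, width=2000))
```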