
of $P(\vec{x})$ (Equation 1) can be omitted since it will be the same for all classes. Thus, the probability of a class
$y$ given a vector of attributes $\vec{x}$ is defined as follows:

$P(y \mid \vec{x}) \propto P(y) \prod_{j=1}^{d} P(\vec{x}_j \mid y).$    (2)
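To make Equation 2 concrete, the following sketch (a minimal Python illustration, not the implementation evaluated in this article) scores each class using pre-computed prior and per-attribute conditional estimates; the function name, the toy statistics, and the fallback probability for unseen values are assumptions made purely for illustration.

import math

def naive_bayes_posterior(x, priors, cond_probs):
    """Score each class y as P(y) * prod_j P(x_j | y) (Equation 2).

    priors:     dict mapping class -> P(y)
    cond_probs: dict mapping class -> list of dicts, one per attribute,
                each mapping an attribute value -> P(x_j | y)
    Log-space is used to avoid underflow when many attributes are present.
    """
    scores = {}
    for y, prior in priors.items():
        log_score = math.log(prior)
        for j, value in enumerate(x):
            # A small default avoids zero probabilities for unseen values
            # (a simple stand-in for Laplace smoothing).
            log_score += math.log(cond_probs[y][j].get(value, 1e-9))
        scores[y] = log_score
    # The normalizing constant P(x) is omitted: it is identical for all
    # classes, so the argmax is unaffected.
    return max(scores, key=scores.get)

# Toy example: two classes, two categorical attributes.
priors = {"A": 0.6, "B": 0.4}
cond_probs = {
    "A": [{"red": 0.7, "blue": 0.3}, {"small": 0.4, "large": 0.6}],
    "B": [{"red": 0.2, "blue": 0.8}, {"small": 0.9, "large": 0.1}],
}
print(naive_bayes_posterior(("red", "large"), priors, cond_probs))  # -> "A"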
2.1.2. Hoeffding Tree
Decision trees are a popular choice for classification tasks as they are simple, robust, and human-
understandable. They are learned by recursively replacing leaves with split nodes that guide class
separation and classification. The attribute used in a split node is chosen by comparing all available features
and selecting the best option according to some heuristic function $J$, e.g., Conditional Entropy, Information
Gain, or Symmetrical Uncertainty. The splitting process is repeated over a set of training examples stored in
main memory. As a result, classical decision trees are limited to learning from this specific set of instances
and, consequently, are not tailored to evolve over time.
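For illustration, the sketch below shows how one such heuristic, Information Gain, could be computed for a single categorical attribute from a batch of labeled examples; the helper names and the toy data are hypothetical.

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """J(X, Y) = H(Y) - H(Y | X) for one categorical attribute X."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

# Toy data: this attribute separates the two classes reasonably well.
values = ["red", "red", "blue", "blue", "red"]
labels = ["+",   "+",   "-",    "-",    "-"]
print(information_gain(values, labels))  # ~0.42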
In streaming scenarios, the assumption that the entire dataset is available for training does not hold and,
thus, the Hoeffding Tree (HT) was proposed in [26] to learn tree models from streaming data incrementally.
HT relaxes this constraint by selecting the most appropriate feature, according to $J$, based on a
small data sample. To determine how big this sample should be, the authors of [26] proposed the use of the
Hoeffding bound [33]. Assuming the heuristic function $J$ with range $R$, the Hoeffding bound states that,
with probability $(1 - \delta)$, the true mean of $J$ is at least $(\bar{J} - \epsilon)$, where $\epsilon$ is the bound calculated following
Equation 3.

$\epsilon = \sqrt{R^2 \times \ln(1/\delta) / (2n)}$    (3)
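For illustration, assuming Information Gain as $J$ on a binary classification problem (so $R = \log_2 2 = 1$), $\delta = 10^{-7}$, and $n = 200$, Equation 3 yields $\epsilon = \sqrt{\ln(10^7)/400} \approx 0.20$.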
Therefore, with high probability, the data distribution observed in a data sample of size $n$ adheres to
the population distribution, which is potentially unbounded in data streams. Consequently, HTs attempt
a new split every time a leaf contains $n$ instances. Assuming that the goal is to maximize $J$, that $X_a$ is
the best-ranked feature in terms of $J$, and that $X_b$ is the second best, then a split is performed on
$X_a$ if $\Delta J = J(X_a, Y) - J(X_b, Y) \geq \epsilon$. Based on empirical experiments, the authors of [26] noticed that reasonably
small values of $n$, e.g., 200, achieve effective results, and the same value has been adopted in the MOA
framework, which has been used in the experiments of this article.
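A minimal sketch of this split test, assuming the heuristic values of the two best-ranked features are already available, is shown below; the function names are illustrative and do not correspond to MOA's API, and the parameter values mirror the defaults discussed above.

import math

def hoeffding_bound(value_range, delta, n):
    """Equation 3: epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(j_best, j_second_best, value_range, delta, n):
    """Split on the best feature if Delta J = J(Xa, Y) - J(Xb, Y) >= epsilon."""
    return (j_best - j_second_best) >= hoeffding_bound(value_range, delta, n)

# Illustrative values: Information Gain on a binary problem (R = 1),
# delta = 1e-7, and a leaf that has accumulated n = 200 instances.
print(hoeffding_bound(1.0, 1e-7, 200))           # ~0.20
print(should_split(0.55, 0.30, 1.0, 1e-7, 200))  # True, since 0.25 >= ~0.20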
For training, the cost associated with an HT involves traversing the tree and, upon reaching the leaf
node, updating the corresponding statistics. In addition, from time to time, a check should be performed to decide
whether the leaf should be split. All of these operations result in a time complexity of $O(l + c \times d \times v)$, or
simply $O(c \times d \times v)$, for training, where $l$ corresponds to the height of the tree, $c$ is the number of classes, $d$
represents the number of attributes, and $v$ is the maximum number of values per attribute.
Finally, to classify an instance, the associated cost is basically that of traversing the tree to the leaf node,
with time complexity of $O(l)$. However, a common strategy to improve the accuracy of HT is to use NB
learners at the leaves instead of the majority class classifier [32]. Thus, it is possible to calculate $P(y \mid \vec{x})$
(details in Subsection 2.1.1) using the statistics available in the leaf node, though adopting this strategy
brings an additional cost, making the complexity become $O(l + c \times d)$.
2.1.3. k-Nearest Neighbors
k-Nearest Neighbors (k-NN) is one of the most fundamental, simple, and widely used classification
methods, which can learn complex (non-linear) functions [22]. k-NN is a lazy learner since it does not
require building a model before its actual use. It classifies unlabeled instances according to the closest
previously seen labeled ones stored in a buffer. In this context, close means that a distance measure is
used to determine how similar or dissimilar two instances are.
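As a minimal sketch of this behavior, the snippet below assumes numeric attributes, Euclidean distance, and a fixed-size buffer holding the most recent labeled instances; the class name and parameters are illustrative only.

import math
from collections import Counter, deque

class StreamingKNN:
    """k-NN over a sliding buffer of the most recent labeled instances."""

    def __init__(self, k=3, buffer_size=1000):
        self.k = k
        self.buffer = deque(maxlen=buffer_size)  # oldest instances are dropped

    def learn_one(self, x, y):
        """Lazy learning: just store the labeled instance."""
        self.buffer.append((x, y))

    def predict_one(self, x):
        """Majority vote among the k buffered instances closest to x."""
        distances = sorted((math.dist(x, xi), yi) for xi, yi in self.buffer)
        votes = Counter(yi for _, yi in distances[: self.k])
        return votes.most_common(1)[0][0]

# Toy usage with two numeric attributes.
knn = StreamingKNN(k=3, buffer_size=100)
for x, y in [((1.0, 1.0), "A"), ((1.2, 0.9), "A"),
             ((5.0, 5.0), "B"), ((5.1, 4.8), "B")]:
    knn.learn_one(x, y)
print(knn.predict_one((1.1, 1.0)))  # -> "A"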
Several approaches compute distances between instances. Although there is no optimal distance for all
cases, specific scenarios can benefit from choosing the most suitable one. A recent and in-depth study on
this topic can be found in [34]. As this type of analysis is outside the scope of this article, we concentrate