Sampling Streaming Data with Parallel Vector Quantization - PVQ
Mujahid Sultan
mujahid.sultan@mlsoft.ai
October 5, 2022
Abstract—Accumulation of corporate data in the cloud has attracted more enterprise applications to the cloud, creating data gravity. As a consequence, network traffic has become more
cloud-centric. This increase in cloud-centric traffic poses new
challenges in designing learning systems for streaming data due
to class imbalance. The number of classes plays a vital role in
the accuracy of the classifiers built from the data streams.
In this paper, we present a vector quantization based sampling
method, which substantially reduces the class imbalance in data
streams. We demonstrate its effectiveness by conducting exper-
iments on network traffic and anomaly dataset with commonly
used ML model building methods — Multilayered Perceptron
on TensorFlow backend, Support Vector Machines, K-Nearest
Neighbour, and Random Forests. We built models using parallel
processing, batch processing, and randomly selecting samples. We
show that the accuracy of classification models improves when
the data streams are pre-processed with our method. We used out-of-the-box hyper-parameters of these classifiers, as well as auto-sklearn (https://www.automl.org/) for hyper-parameter optimization.
Keywords-Data Streams; Class Imbalance; Vector Quantization; Cloud-Centric Traffic; TensorFlow; Classification.
I. INTRODUCTION
With more and more enterprise data moving to the cloud, enterprise applications and traffic have followed.
This accumulation of data in the cloud and its gravity has
resulted in an enormous volume of cloud-centric traffic, which
poses new computational challenges in developing machine
learning (ML) models from these data streams. We observe
that when ML models are built from data streams, they do not
perform well as the number of observations per class differs
significantly in these data streams.
Class balancing techniques like ROSE [1] and SMOTE [2], with its variants given in [3], are generally used to address class imbalance. In our experiments, we observe that even when the data streams are balanced with the methods given in [1], [2], [3] (i.e., when the number of observations per class is roughly the same), the classification models do not perform well unless some intelligent stream pre-processing or sampling is done.
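For concreteness, this kind of balancing can be sketched with the imbalanced-learn implementation of SMOTE on synthetic data (a minimal illustration, not the exact pipeline of [2] or of our experiments):

from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic stand-in for one imbalanced stream window.
X, y = make_classification(n_samples=10_000, n_classes=3,
                           weights=[0.90, 0.08, 0.02],
                           n_informative=5, random_state=0)
print("before:", Counter(y))  # heavily skewed toward class 0

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_bal))  # classes roughly equalized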
When a dataset is balanced, trivial random sampling may yield good results [4]. However, when the dataset is imbalanced, random sampling fails to capture all the classes in the dataset [4]. One can improve “vanilla” random sampling by sampling from each class of the data independently, as sketched below. Hybrid approaches, such as SMOTE and ROSE, down-sample the majority class and up-sample the minority classes. These approaches require knowledge of the labels, limiting them to supervised learning. Besides, we find that no substantial gain in modeling accuracy is achieved by applying these methods to streaming data.
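A minimal sketch of such per-class sampling (the helper below is our own illustration, not an existing library function):

import numpy as np

def per_class_sample(X, y, n_per_class, seed=None):
    """Draw up to n_per_class rows from each class independently, so
    minority classes survive sampling instead of being drowned out."""
    rng = np.random.default_rng(seed)
    keep = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        take = min(n_per_class, idx.size)
        keep.append(rng.choice(idx, size=take, replace=False))
    keep = np.concatenate(keep)
    return X[keep], y[keep]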
II. BACKGROUND AND MOTIVATION
In this publication, we propose a novel method to sample
imbalanced streaming datasets for building high-accuracy ML
models. To demonstrate this on network traffic data streams, we
select a labeled intrusion and anomaly detection dataset. Few labeled anomaly datasets are publicly available, mainly because organizational Security Information and Event Management (SIEM) systems are domain-specific and depend upon many factors, e.g., time, geographical location, and the nature of the business, to label anomalies; above all, attack and anomaly labels are proprietary. Therefore, to
generalize the problem, we use a publicly available and well-researched labeled intrusion detection dataset, KDDCUP [5].
Though newer datasets are available [6], [7], [8], these are either smaller or not as widely used by the research community. The KDDCUP dataset is a perfect fit for our purposes: it is highly imbalanced, and its test set is not from the same domain as the train set, making it very close to real-world problems.
A general rule in ML is that the more data there is, the better ML algorithms learn. Therefore, we use the full KDDCUP dataset as a single streaming window to create a baseline, first analyzing the entire dataset in a single batch so that the results on small data streams (mini-batches) can be compared against it.
We perform a number of ad-hoc experiments. Our testbed
is a computer with two Intel Xeon(R) E5-2680v4 processors
(allowing execution of 56 hyper-threads in parallel) and 512
GB of memory.
To create a baseline, we use the full batch of the KDDCUP dataset and train a Naive Bayes [9] model with 10-fold cross-validation in Weka v3.8.2; training takes ≈30 seconds and testing another ≈100 seconds. The performance of the model, not surprisingly, is not stellar (precision of ≈0.83 and recall of ≈0.70). We also tried to train a support vector machine (SVM) model in Weka on the same dataset, but training had not completed after 15 days, so we had to terminate it.
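An approximate Python analogue of this baseline, for readers without Weka, might look as follows (a minimal sketch: scikit-learn's GaussianNB and its bundled KDDCUP loader stand in for the Weka setup, so the exact figures will differ):

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_kddcup99
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# The 10% KDDCUP subset keeps the demo fast; columns 1-3 (protocol,
# service, flag) are categorical byte strings and must be encoded.
X, y = fetch_kddcup99(percent10=True, return_X_y=True)
encode = ColumnTransformer(
    [("cat", OrdinalEncoder(handle_unknown="use_encoded_value",
                            unknown_value=-1), [1, 2, 3])],
    remainder="passthrough")
model = make_pipeline(encode, GaussianNB())

# Rare attack classes have fewer than 10 instances, so 10-fold
# stratified CV emits warnings but still runs.
scores = cross_validate(model, X, y, cv=10,
                        scoring=("precision_macro", "recall_macro"))
print("precision:", scores["test_precision_macro"].mean())
print("recall:   ", scores["test_recall_macro"].mean())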
We then switch to the automated machine learning tool auto-sklearn [10], which automates data pre-processing and parameter optimization for machine learning algorithms. Hyper-parameter optimization and automated machine learning are not new concepts and have been experimented with for several decades.
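A minimal usage sketch, reusing the encoded KDDCUP features from the baseline above (the one-hour budget is illustrative, not the budget used in our experiments):

from autosklearn.classification import AutoSklearnClassifier

# auto-sklearn expects numeric features and plain-string labels,
# so encode the features and decode the byte-string labels first.
X_num = encode.fit_transform(X).astype(float)
y_str = y.astype(str)

automl = AutoSklearnClassifier(time_left_for_this_task=3600,  # 1 hour
                               per_run_time_limit=300)
automl.fit(X_num, y_str)
print(automl.sprint_statistics())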