Sampling Streaming Data with Parallel Vector Quantization - PVQ
Mujahid Sultan
mujahid.sultan@mlsoft.ai
October 5, 2022
Abstract—Accumulation of corporate data in the cloud has attracted more enterprise applications to the cloud, creating data gravity. As a consequence, network traffic has become more cloud-centric. This increase in cloud-centric traffic poses new challenges in designing learning systems for streaming data due to class imbalance. The number of classes plays a vital role in the accuracy of the classifiers built from the data streams.
In this paper, we present a vector quantization based sampling method that substantially reduces the class imbalance in data streams. We demonstrate its effectiveness by conducting experiments on a network traffic and anomaly dataset with commonly used ML model building methods: a Multilayered Perceptron on a TensorFlow backend, Support Vector Machines, K-Nearest Neighbour, and Random Forests. We build models using parallel processing, batch processing, and randomly selected samples. We show that the accuracy of classification models improves when the data streams are pre-processed with our method. We use the out-of-the-box hyper-parameters of these classifiers as well as auto-sklearn1 for hyper-parameter optimization.
Keywords-Data Streams; Class Imbalance; Vector Quantization; Cloud-centric Traffic; TensorFlow; Classification
I. INTRODUCTION
As more and more enterprise data moves to the cloud, enterprise applications and traffic have followed. This accumulation of data in the cloud, and the gravity it exerts, has resulted in an enormous volume of cloud-centric traffic, which poses new computational challenges in developing machine learning (ML) models from these data streams. We observe that ML models built from data streams do not perform well, as the number of observations per class differs significantly in these streams.
Class balancing techniques like ROSE [1] and SMOTE [2], with the variants given in [3], are generally used to address class imbalance. In our experiments, however, we observe that even when the data streams are balanced (i.e., when the number of observations per class is roughly the same) with the methods given in [1], [2], [3], the classification models do not perform well unless some intelligent stream pre-processing or sampling is done.
When a dataset is balanced, trivial random sampling may yield good results [4]. However, when the dataset is imbalanced, random sampling fails to capture all classes in the dataset [4]. One can improve "vanilla" random sampling by sampling from each class of the data independently. Hybrid approaches, such as SMOTE and ROSE, down-sample the majority class and up-sample the minority classes. These approaches require knowledge of the labels, limiting them to supervised learning. Besides, we find that no substantial gain in modeling accuracy is achieved by applying these methods to streaming data.

1 https://www.automl.org/
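The per-class random sampling described above can be sketched as follows. This is a minimal illustration, not the paper's method; the function name `sample_per_class` and the toy label array are ours.

```python
import numpy as np

def sample_per_class(X, y, n_per_class, rng=None):
    """Draw up to n_per_class observations from each class independently,
    so minority classes are not drowned out by the majority class."""
    rng = np.random.default_rng(rng)
    picked = []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        take = min(n_per_class, idx.size)
        picked.append(rng.choice(idx, size=take, replace=False))
    picked = np.concatenate(picked)
    return X[picked], y[picked]

# Imbalanced toy stream: 1000 "normal" rows, 20 "attack" rows.
X = np.arange(1020).reshape(-1, 1)
y = np.array([0] * 1000 + [1] * 20)
Xs, ys = sample_per_class(X, y, n_per_class=50, rng=0)
print(np.bincount(ys))  # class 0 capped at 50, class 1 keeps all 20
```

Unlike vanilla random sampling, every class is guaranteed to appear in the sample, but labels must be known up front.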
II. BACKGROUND AND MOTIVATION
In this publication, we propose a novel method to sample imbalanced streaming datasets for building high-accuracy ML models. To demonstrate it on network traffic data streams, we select a labeled intrusion and anomaly detection dataset. Few labeled anomaly datasets are publicly available, mainly because organizational Security Information and Event Management systems (SIEMs) are domain-specific and depend upon many factors (e.g., time, geographical location, and the nature of the business) to label the anomalies; above all, attack and anomaly labels are proprietary. Therefore, to generalize the problem, we use the publicly available and well-researched labeled intrusion detection dataset KDDCUP [5]. Though newer datasets are available [6], [7], [8], these are either smaller or not as widely used by the research community. The KDDCUP dataset is a perfect fit for our purposes as it is highly imbalanced, and its test set is not from the same domain as the train set, making it very close to real-world problems.
A general rule in ML is: the more the data, the better the learning by ML algorithms. Therefore, we use the full KDDCUP dataset as a single streaming window to create a baseline, first analyzing the entire dataset in a single batch so that the results on small data streams (mini-batches) can be compared against it.
We perform a number of ad-hoc experiments. Our testbed
is a computer with two Intel Xeon(R) E5-2680v4 processors
(allowing execution of 56 hyper-threads in parallel) and 512
GB of memory.
To create a baseline, we use the full batch of the KDDCUP dataset and train a Naive Bayes [9] model using 10-fold cross-validation in Weka v3.8.2; training takes about 30 seconds and testing another 100 seconds. The performance of the model, not surprisingly, is not stellar (precision of 0.83 and recall of 0.70). We also tried to train a support vector machine (SVM) model using Weka on the same dataset, but it did not complete after 15 days, so we had to terminate the training.
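A minimal Python analogue of this baseline, using scikit-learn's GaussianNB in place of Weka and a small synthetic stand-in for the KDDCUP data, might look like the following; the dataset and scoring choices here are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in for the KDDCUP features/labels.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           weights=[0.9, 0.1], random_state=0)

# 10-fold cross-validation, reporting precision and recall as in the paper.
scores = cross_validate(GaussianNB(), X, y, cv=10,
                        scoring=("precision", "recall"))
print(scores["test_precision"].mean(), scores["test_recall"].mean())
```

On the real, heavily imbalanced KDDCUP stream, the same evaluation is what yields the modest 0.83/0.70 precision/recall figures quoted above.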
We then switch to an automated machine learning tool, auto-sklearn [10], which automates data pre-processing and parameter optimization for machine learning algorithms. Hyper-parameter optimization and automated machine learning are not new concepts and have been experimented with for several decades.
arXiv:2210.01792v1 [cs.LG] 4 Oct 2022
A comprehensive survey is given in [11]. We set the system memory limit to the available memory (512 GB); the rest of the default parameters are used and are described in Section VI-C2. The training process runs for 15 days on the system described above before we kill it without any results.
We then used the approach of building models in parallel by sharding the data and training an ensemble of models (one model per data shard) [12]. Training of each member of the ensemble is independent of the others, hence the ease of parallelization. We implement this approach using the classification algorithms mentioned above and find that it does not work well for the KDDCUP dataset: no individual classifier can pick up all the classes in the dataset, resulting in very low precision; details are given in Section VI-C1.
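The shard-and-ensemble scheme of [12] can be sketched as follows: each shard trains its own model independently (and hence in parallel), and predictions are combined by majority vote. The function names and the decision-tree base learner are our illustrative choices, not the paper's.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_shards(X, y, n_shards):
    """Split the data into shards and fit one model per shard in parallel."""
    parts = zip(np.array_split(X, n_shards), np.array_split(y, n_shards))
    return Parallel(n_jobs=2, prefer="threads")(
        delayed(DecisionTreeClassifier(random_state=0).fit)(Xi, yi)
        for Xi, yi in parts)

def majority_vote(models, X):
    votes = np.stack([m.predict(X) for m in models])  # (n_models, n_rows)
    # Per-row majority over the ensemble's predictions.
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
models = fit_shards(X, y, n_shards=4)
pred = majority_vote(models, X)
print((pred == y).mean())
```

The failure mode we observe on KDDCUP shows up precisely here: a shard that never sees a rare class trains a model that can never predict it, and the vote then suppresses that class entirely.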
We then tried parallel versions of the classification models. Though there are ways to parallelize the processing of most classification algorithms, not all of them have parallel versions available [13]. Most of these methods use work-arounds and either compromise on accuracy or perform some sort of data reduction to design the parallel version of an algorithm; e.g., random forest (RF) is a parallelizable algorithm, but the implementations often compromise on accuracy or performance [14]. As discussed above, not every model is parallelizable (and even if it is, it may be challenging to do so efficiently, not to mention the cost associated with parallelization). Thus, this approach is not universal.
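Random forest is the easy case mentioned above: because its trees are independent, scikit-learn can fit them across all available cores with a single flag. This sketch (on a synthetic stand-in dataset) shows the mechanism; it does not reproduce the paper's experiments.

```python
from time import perf_counter
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)

# n_jobs=-1 fits the independent trees in parallel on all cores.
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
t0 = perf_counter()
rf.fit(X, y)
print(f"fit in {perf_counter() - t0:.2f}s, "
      f"train accuracy {rf.score(X, y):.3f}")
```

Algorithms without this embarrassingly parallel structure (e.g., kernel SVMs) have no such one-line path, which is what makes the approach non-universal.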
Finally, we used graph-based parallel processing systems, such as TensorFlow [15], CNTK [16], and Theano [17], as the backend to speed up the computations on multicore systems. Not all classification algorithms can be used with graph-based backends, and even the ones that can, e.g., Neural Networks, are degraded to the extent that the resulting models cannot be generalized. We demonstrate this in detail in Section VII-C.
III. PROPOSED METHOD: PARALLEL VECTOR
QUANTIZATION
To address the issues discussed in the previous section, we designed a Parallel Vector Quantization (PVQ) method that improves the classification accuracy of the classifiers mentioned above by removing the class imbalance. This method gives much better results when used with graph-based backends like TensorFlow.
The schematic diagram of PVQ is shown in Figure 1. The intuition behind this approach comes from scalar quantization. Quantization takes a continuous function, like a sine wave, samples it at a much coarser scale, and produces a much smaller dataset than the original one. The steps of the scalar quantization are separated by the quantization error. In telecommunications, the analog signal is quantized by the coder, a much smaller signal is transmitted across the network, and the decoder then reconstructs it using the quantization error. Similarly, Vector Quantization (VQ) is a data sampling mechanism in high-dimensional spaces that preserves the characteristics of the full dataset. The details of the PVQ method are given in Section V.
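As a rough illustration of the underlying idea (the full PVQ algorithm is given in Section V), one can quantize each class of a batch to a fixed codebook of k centroids, here via scikit-learn's MiniBatchKMeans, so that every class contributes the same number of representative vectors regardless of its original size. The function name and the choice of k-means as the quantizer are our assumptions, not the paper's.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def vq_sample(X, y, k=32, seed=0):
    """Replace each class with the k codebook vectors of its quantizer,
    yielding a small, class-balanced summary of the batch."""
    Xq, yq = [], []
    for cls in np.unique(y):
        Xc = X[y == cls]
        k_c = min(k, len(Xc))           # tiny classes keep all their points
        km = MiniBatchKMeans(n_clusters=k_c, random_state=seed, n_init=3)
        km.fit(Xc)
        Xq.append(km.cluster_centers_)
        yq.append(np.full(k_c, cls))
    return np.vstack(Xq), np.concatenate(yq)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (5000, 8)),    # majority class
               rng.normal(3, 1, (100, 8))])    # minority class
y = np.array([0] * 5000 + [1] * 100)
Xq, yq = vq_sample(X, y, k=32)
print(Xq.shape, np.bincount(yq))
```

The 5000-row majority class and the 100-row minority class each shrink to 32 codebook vectors, so the quantized batch handed to the classifier is balanced by construction.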
Fig. 1. Schematic Diagram of PVQ
Fig. 2. Architecture for real-time classification of streaming data using PVQ in the cloud. PVQ can be used with distributed stream processing systems, as shown in this high-level diagram. Network data streams are captured using Snort or Kafka, and PVQ can be applied to weekly or daily streams. The models built with PVQ-sampled data are then used by the classifier to predict anomalies in real-time traffic.
A. PVQ Cloud Architecture
The architecture for using PVQ with distributed stream processing systems in the cloud is shown in Figure 2, and a schematic representation of the data streams is shown in Figure 3. Mini-batches of network traffic data streams are captured by a streaming service such as Google's Cloud Dataflow [18]. Network traffic can be captured by widely used network security monitors (e.g., Zeek [19]) or widely used intrusion prevention systems (e.g., Snort [20]). The network traffic data streams are intercepted by the PVQ segment, as shown in Figure 2.
The streaming data is sampled by PVQ and passed to the ML model building layer. To demonstrate our method, we stream mini-batches of 80k network traffic packets and a full batch of 4.9M packets, as shown in Figure 3.
B. Our Contributions
The purpose of our work is not to evaluate how different ML algorithms perform on streaming datasets (for which the reader is directed to [21]) nor to build a new intrusion detection