Sampling Streaming Data with Parallel Vector Quantization - PVQ
Mujahid Sultan
mujahid.sultan@mlsoft.ai
October 5, 2022
Abstract—Accumulation of corporate data in the cloud has attracted more enterprise applications to the cloud, creating data gravity. As a consequence, network traffic has become more
cloud-centric. This increase in cloud-centric traffic poses new
challenges in designing learning systems for streaming data due
to class imbalance. The number of classes plays a vital role in
the accuracy of the classifiers built from the data streams.
In this paper, we present a vector quantization based sampling
method, which substantially reduces the class imbalance in data
streams. We demonstrate its effectiveness by conducting exper-
iments on network traffic and anomaly dataset with commonly
used ML model building methods — Multilayered Perceptron
on TensorFlow backend, Support Vector Machines, K-Nearest
Neighbour, and Random Forests. We built models using parallel
processing, batch processing, and randomly selecting samples. We
show that the accuracy of classification models improves when
the data streams are pre-processed with our method. We used out-of-the-box hyper-parameters of these classifiers, as well as auto-sklearn (https://www.automl.org/) for hyper-parameter optimization.
Keywords-Data Streams; Class Imbalance; Vector Quantization; Cloud-Centric Traffic; TensorFlow; Classification.
I. INTRODUCTION
With more and more enterprise data moving to the cloud, enterprise applications and traffic have followed.
This accumulation of data in the cloud and its gravity has
resulted in an enormous volume of cloud-centric traffic, which
poses new computational challenges in developing machine
learning (ML) models from these data streams. We observe
that when ML models are built from data streams, they do not
perform well as the number of observations per class differs
significantly in these data streams.
Class balancing techniques like ROSE [1] and SMOTE [2], with its variants given in [3], are generally used to address class imbalance. In our experiments, we observe that even when the data streams are balanced with the methods given in [1], [2], [3] (i.e., when the number of observations per class is roughly the same), the classification models do not perform well unless some intelligent stream pre-processing or sampling is done.
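For concreteness, this kind of balancing can be sketched with the imbalanced-learn implementation of SMOTE on synthetic data (a minimal illustration, not the exact pipeline of [2] or of our experiments):

from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic stand-in for one imbalanced stream window.
X, y = make_classification(n_samples=10_000, n_classes=3,
                           weights=[0.90, 0.08, 0.02],
                           n_informative=5, random_state=0)
print("before:", Counter(y))  # heavily skewed toward class 0

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_bal))  # classes roughly equalized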
When a dataset is balanced, trivial random sampling may yield good results [4]. However, when the dataset is imbalanced, random sampling fails to capture all the classes in the dataset [4]. One can improve “vanilla” random sampling by sampling from each class of the data independently, as sketched below. Hybrid approaches, such as SMOTE and ROSE, down-sample the majority class and up-sample the minority classes. These approaches require knowledge of the labels, limiting them to supervised learning. Besides, we find that no substantial gain in modeling accuracy is achieved by applying these methods to streaming data.
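A minimal sketch of such per-class sampling (the helper below is our own illustration, not an existing library function):

import numpy as np

def per_class_sample(X, y, n_per_class, seed=None):
    """Draw up to n_per_class rows from each class independently, so
    minority classes survive sampling instead of being drowned out."""
    rng = np.random.default_rng(seed)
    keep = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        take = min(n_per_class, idx.size)
        keep.append(rng.choice(idx, size=take, replace=False))
    keep = np.concatenate(keep)
    return X[keep], y[keep]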
II. BACKGROUND AND MOTIVATION
In this publication, we propose a novel method to sample
imbalanced streaming datasets for building high-accuracy ML
models. To demonstrate this on network traffic data streams, we
select a labeled intrusion and anomaly detection dataset. Few labeled anomaly datasets are publicly available, mainly because organizational Security Information and Event Management (SIEM) systems are domain-specific and depend upon many factors, e.g., time, geographical location, and the nature of the business, to label anomalies; above all, attack and anomaly labels are proprietary. Therefore, to
generalize the problem, we use a publicly available and well-researched labeled intrusion detection dataset, KDDCUP [5].
Though newer datasets are available [6], [7], [8], these are either smaller or not as widely used by the research community. The KDDCUP dataset is a perfect fit for our purposes: it is highly imbalanced, and its test set is not from the same domain as the train set, making it very close to real-world problems.
A general rule in ML is that the more data there is, the better ML algorithms learn. Therefore, we use the full KDDCUP dataset as a single streaming window to create a baseline, first analyzing the entire dataset in a single batch so that the results on small data streams (mini-batches) can be compared against it.
We perform a number of ad-hoc experiments. Our testbed
is a computer with two Intel Xeon(R) E5-2680v4 processors
(allowing execution of 56 hyper-threads in parallel) and 512
GB of memory.
To create a baseline, we use the full batch of the KDDCUP dataset and train a Naive Bayes [9] model with 10-fold cross-validation in Weka v3.8.2; training takes ≈30 seconds and testing another ≈100 seconds. The performance of the model, not surprisingly, is not stellar (precision of ≈0.83 and recall of ≈0.70). We also tried to train a support vector machine (SVM) model in Weka on the same dataset, but training had not completed after 15 days, so we had to terminate it.
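An approximate Python analogue of this baseline, for readers without Weka, might look as follows (a minimal sketch: scikit-learn's GaussianNB and its bundled KDDCUP loader stand in for the Weka setup, so the exact figures will differ):

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_kddcup99
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# The 10% KDDCUP subset keeps the demo fast; columns 1-3 (protocol,
# service, flag) are categorical byte strings and must be encoded.
X, y = fetch_kddcup99(percent10=True, return_X_y=True)
encode = ColumnTransformer(
    [("cat", OrdinalEncoder(handle_unknown="use_encoded_value",
                            unknown_value=-1), [1, 2, 3])],
    remainder="passthrough")
model = make_pipeline(encode, GaussianNB())

# Rare attack classes have fewer than 10 instances, so 10-fold
# stratified CV emits warnings but still runs.
scores = cross_validate(model, X, y, cv=10,
                        scoring=("precision_macro", "recall_macro"))
print("precision:", scores["test_precision_macro"].mean())
print("recall:   ", scores["test_recall_macro"].mean())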
We then switch to the automated machine learning tool auto-sklearn [10], which automates data pre-processing and parameter optimization for machine learning algorithms. Hyper-parameter optimization and automated machine learning are not new concepts and have been experimented with for several decades.
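A minimal usage sketch, reusing the encoded KDDCUP features from the baseline above (the one-hour budget is illustrative, not the budget used in our experiments):

from autosklearn.classification import AutoSklearnClassifier

# auto-sklearn expects numeric features and plain-string labels,
# so encode the features and decode the byte-string labels first.
X_num = encode.fit_transform(X).astype(float)
y_str = y.astype(str)

automl = AutoSklearnClassifier(time_left_for_this_task=3600,  # 1 hour
                               per_run_time_limit=300)
automl.fit(X_num, y_str)
print(automl.sprint_statistics())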