AnoML-IoT: An End to End Re-configurable Multi-protocol Anomaly Detection Pipeline for Internet of Things

Hakan Kayan^a, Yasar Majib^a, Wael Alsafery^a, Mahmoud Barhamgi^b, Charith Perera^a

^a Cardiff University, UK

^b Claude Bernard Lyon 1 University, France

Abstract

The rapid development in ubiquitous computing has enabled the use of microcontrollers as edge devices. These devices are used to develop truly distributed IoT-based mechanisms where machine learning (ML) models are utilized. However, integrating ML models into edge devices requires an understanding of various software tools such as programming languages and domain-specific knowledge. Anomaly detection is one of the domains where a high level of expertise is required to achieve promising results. In this work, we present AnoML, which is an end-to-end data science pipeline that allows the integration of multiple wireless communication protocols, anomaly detection algorithms, and deployment to the edge, fog, and cloud platforms with minimal user interaction. We facilitate the development of IoT anomaly detection mechanisms by reducing the barriers that are formed due to the heterogeneity of an IoT environment. The proposed pipeline supports four main phases: (i) data ingestion, (ii) model training, (iii) model deployment, and (iv) inference and maintenance. We evaluate the pipeline with two anomaly detection datasets while comparing the efficiency of several machine learning algorithms within different nodes. We also provide the source code of the developed tools, which are the main components of the pipeline.

Keywords: Internet of Things, Data Science, Pipeline, Data Analytics, Multi-Protocol

1. Introduction

Edge AI, which is critical for resource-constrained environments in the Internet of Things (IoT) domain where intelligent tasks are performed, has become a hot topic with the arrival of Industry 4.0 [1]. It manages the interaction with the physical world that is provided by sensors and actuators. Managing such an environment requires a series of tasks (e.g., data collection, anomaly detection) that are performed by microcontrollers running ML models. Data professionals (e.g., data scientists, ML engineers) define rules/ranges and search for the best practices to increase the operability of edge mechanisms in their relevant scientific disciplines. Finding hidden information in big data can enhance the quality of living, but it is not a straightforward task [2].

Data scientists are not expected to be experts in edge-related infrastructures (e.g., programming languages, microcontrollers, sensors), but they should be able to use data science pipelines, which are executable workflows of data-related tasks that automate the desired process. Thus, we developed a reconfigurable data science pipeline based on an IoT sensing infrastructure that utilizes open-source software to facilitate developing an interconnected anomaly detection

Preprint submitted to Internet of Things Journal October 5, 2022

arXiv:2210.01771v1 [cs.NI] 4 Oct 2022

system that runs on edge, fog, and cloud platforms. We define the edge as the platform where the first interaction between the cyber and physical worlds happens. Hence, microcontrollers (e.g., Raspberry Pi Pico) that gather physical data are edge devices. We define fog as the platform where several edge devices can be supervised. Hence, single-board computers (e.g., Raspberry Pi 4B) are fog devices that might act as edge devices as well. Cloud is the platform where real-world data gathered by the edge and fog devices are processed. We implemented our system based on an example use case scenario to describe how the proposed system works while providing some results.

The contributions of this paper are as follows:

• We provide a reconfigurable IoT sensing infrastructure that consists of two main open-source components: (i) the Edge to Cloud Code Generator (EECG), which generates ready-to-deploy code to enable data circulation from edge to fog, and (ii) the Node-RED package, hosted on Node-RED servers, which enables accessing and processing the edge data from anywhere that has access to those servers while offering visualization via a graphical user interface (GUI). We also provide a Python library and an executable shell script that facilitate the training and inference phases.

• We propose a data science pipeline that interconnects edge, fog, and cloud devices/services to provide end-to-end anomaly detection system development. The pipeline contains four main stages: (i) data collection, which is provided by the components mentioned in the first contribution point, (ii) anomaly detection model training, (iii) model deployment to the edge, fog, and cloud, and (iv) inference and maintenance of the model based on new data. We demonstrate how the proposed tools are utilized during these stages.

• We provide a dataset that is generated using the proposed tool. We analyze the performance of Convolutional Neural Network (CNN) [3], Recurrent Neural Network (RNN) [4], Isolation Forest [5], and One-class Support Vector Machines (OC-SVM) [6] on the proposed dataset [7] and the WADI dataset [8]. We also evaluate them according to the platform (edge, fog, or cloud) where the anomaly detection model is deployed using the proposed pipeline. On the edge, we only evaluated CNN due to the lack of an application programming interface (API).

Structure of the Paper: This section provides a high-level understanding of what we propose. We outline the previous commercial and academic works in Section 2. Section 3 contains the architecture of an IoT anomaly detection pipeline infrastructure. Section 4 presents details about how the data are circulated and processed within the AnoML-IoT pipeline. We present our evaluations and results in Section 5. Then, we discuss the results in Section 6 and finally provide our conclusions in Section 7.

2. Related Work

In this section, we introduce the data science pipelines that are offered either by academia or commercial entities, and anomaly detection techniques in time-series sensor data. We also analyze the capabilities of open-source platforms that facilitate data circulation.

Figure 1: The unsupervised anomaly detection algorithms used in this study: neural networks (CNN, RNN), decision trees (Isolation Forest), and one-class SVM, with TensorFlow and scikit-learn as the implementation libraries. TensorFlow also has an API [27] for decision trees, but we prefer scikit-learn for implementing Isolation Forest because of its better documentation.

2.1. Unsupervised Anomaly Detection in Time Series Sensor Data

Anomaly detection is one of the fundamental fields that utilize machine learning (ML) models as the main component. Extensive research is being done in this field [9, 10, 11]. There are three types of anomalies: (i) point anomalies, (ii) contextual anomalies, and (iii) collective anomalies. If the anomalies are contextual and the context is time, time series anomaly detection models are applied. For example, in an environment where the temperature decreases at night, a sensor reading that behaves otherwise is a contextual anomaly. While point anomalies are easier to detect, contextual and collective anomaly detection requires additional tasks to identify the normal behavior of the system.
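
The night-time temperature example can be made concrete with a minimal sketch. It assumes a per-hour baseline (mean and standard deviation) learned from past readings and a 3-sigma threshold; the function names, the synthetic history, and the threshold are illustrative assumptions, not part of the proposed pipeline.

```python
from statistics import mean, stdev

def hourly_baseline(history):
    """Group (hour, value) readings by hour and return {hour: (mean, stdev)}."""
    by_hour = {}
    for hour, value in history:
        by_hour.setdefault(hour, []).append(value)
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}

def is_contextual_anomaly(hour, value, baseline, k=3.0):
    """Flag a reading that deviates more than k standard deviations
    from what is usual for that hour of the day (the context)."""
    mu, sigma = baseline[hour]
    return abs(value - mu) > k * max(sigma, 1e-9)

# Synthetic week of readings: about 12 degrees at 03:00, about 24 at 15:00.
history = []
for day in range(7):
    jitter = 0.2 * (day % 3 - 1)  # cycles through -0.2, 0.0, +0.2
    history.append((3, 12.0 + jitter))
    history.append((15, 24.0 + jitter))

baseline = hourly_baseline(history)

# The same value, 24.0, is judged differently depending on its context.
night_flag = is_contextual_anomaly(3, 24.0, baseline)
day_flag = is_contextual_anomaly(15, 24.0, baseline)
print(night_flag, day_flag)
```

A fixed range check such as "10 to 26 degrees" would accept 24.0 at both times, which is why a point-anomaly rule alone misses this case.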

The nature of the input data is the core element that determines the efficiency of the ML model. The features of the data may depend on several complementary terms such as labels, context, and domain. For example, if the input data do not contain any labels that define normality, unsupervised algorithms [12] are applied; if the data are related to a certain context, context-aware [13] methods are selected; and if the environment is industrial, faster models with reduced complexity [14] are preferred because of the importance of detection time.

In an interconnected domain such as IoT, cyber-physical systems [15] are utilized to supervise the environment. These systems observe the behavioral changes (e.g., change in the temperature or movement) in their surroundings through modules that manage sensors [16]. They can also act as controllers if they contain actuators. In such environments, anomalies might be caused by either independent or dependent events. If the events are independent, univariate analysis [17] is applied. For example, the behavioral changes in temperature, loudness, light density, and humidity can each be detected from the related data alone, and hence require univariate analysis. However, changes in the angular momentum or acceleration are measured by sensors (e.g., accelerometer, gyroscope) that generate data per dimension. Hence, the relation between the data points should also be analyzed to detect anomalies, and multivariate analysis [18] is applied.
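
The difference between the two kinds of analysis can be sketched with a three-axis accelerometer. The example assumes a device at rest, so the acceleration vector should have a magnitude of about 1 g; the per-axis range, the tolerance, and the function names are illustrative assumptions rather than values from this paper.

```python
import math

AXIS_RANGE = (-1.1, 1.1)    # assumed valid range per axis, in g
REST_MAGNITUDE = 1.0        # assumed magnitude of gravity for a device at rest
MAGNITUDE_TOLERANCE = 0.15  # assumed allowed deviation, in g

def univariate_flags(sample):
    """Check each axis on its own, ignoring the other axes."""
    lo, hi = AXIS_RANGE
    return [not (lo <= v <= hi) for v in sample]

def multivariate_flag(sample):
    """Check the axes jointly through the magnitude of the (x, y, z) vector."""
    magnitude = math.sqrt(sum(v * v for v in sample))
    return abs(magnitude - REST_MAGNITUDE) > MAGNITUDE_TOLERANCE

resting = (0.0, 0.0, 1.0)
suspicious = (0.9, 0.9, 0.9)  # every axis looks fine, the combination does not

print(univariate_flags(suspicious), multivariate_flag(suspicious))
print(univariate_flags(resting), multivariate_flag(resting))
```

Each component of `suspicious` lies inside the per-axis range, so the univariate checks pass, while the joint check fails because the magnitude is roughly 1.56 g.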

While academia keeps offering new anomaly detection algorithms [19, 20], most of the time these are based on the fundamental ones [21]. Hence, for this work, we selected the following algorithms, as they are the most common ones used for unsupervised anomaly detection in time series data and are accessible via common ML programming libraries/frameworks (e.g., scikit-learn [22], TensorFlow [23]): (i) convolutional neural networks (CNN) [24, 25], (ii) recurrent neural networks (RNN) [26], (iii) isolation forest (IF) [5], and (iv) one-class support vector machines (OC-SVM) [6]. Figure 1 shows the algorithms used in this study.
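
As a rough illustration of how two of the selected algorithms are reached through scikit-learn, the sketch below fits an Isolation Forest and a one-class SVM on synthetic temperature-like readings. The data, the hyperparameters (`contamination`, `nu`), and the variable names are assumptions made for illustration; they are not the settings used in this paper. The example requires NumPy and scikit-learn to be installed.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

# Synthetic "normal" readings around 22 degrees (illustrative data only).
rng = np.random.default_rng(0)
train = rng.normal(loc=22.0, scale=0.5, size=(300, 1))

# Two probe readings: one typical, one far outside the training range.
probe = np.array([[22.0], [35.0]])

# Both estimators are unsupervised: fit() receives no labels.
iso = IsolationForest(n_estimators=100, contamination=0.05, random_state=0).fit(train)
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(train)

# predict() returns +1 for inliers and -1 for outliers.
iso_pred = iso.predict(probe)
ocsvm_pred = ocsvm.predict(probe)
print("IsolationForest:", iso_pred)
print("OneClassSVM:    ", ocsvm_pred)
```

Both estimators share the same `fit`/`predict` interface, which is one reason a pipeline can swap between them with little glue code.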

2.2. Machine Learning Platforms and Data Science Pipelines

Machine learning platforms. After showing promising results in a variety of tasks including speech recognition, image processing, anomaly detection, and medical diagnosis, ML has attracted interest from both academic and commercial entities, resulting in the creation of many open-source and proprietary ML platforms and pipelines. ML models can be generated via hand-coding, code generators, or interpreters.

Hand-coding. There are many machine learning libraries available [22, 23, 28] that allow users to create ML models or to deploy and evaluate ML algorithms. An experienced practitioner might prefer hand-coding as it offers high customization, allows the development and employment of novel algorithms, and is easy to maintain. However, hand-coding might be very resource-consuming, thus most of the time it is done by a group of programmers.

Code generators. ML consists of many steps (e.g., data acquisition, data pre-processing, and fitting). Rather than hand-coding all these steps, code generators [29, 30] might be utilized to facilitate the process. Due to the variety of complicated tasks, most code generators provide a specific code for a specific task.

Interpreters. One of the main challenges of ML is the portability of the generated model. Interpreters provide portability by generating a model file that can be run on other platforms with minimal coding. TensorFlow [23] is the most common one that offers model generation for resource-constrained platforms.

Data science pipelines. Raw data need to be interpreted before they can be utilized within data science-related tasks. If the data science pipeline contains all the steps that are required to interpret the data, from data gathering to deployment of a machine learning model, it is called end-to-end. These end-to-end pipelines can be either manual, where the user provides many inputs and sets parameters each time before a new model is generated, or automated, where little to no input is taken. Due to the variety of data types, automated pipelines impose a certain set of rules (e.g., time format) that input data must satisfy to be accepted [31]. These pipelines can also be named according to the performed tasks (e.g., anomaly detection pipeline). Now we introduce pipelines that are presented by either industry or academia.
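
The idea of an automated end-to-end pipeline that enforces an input rule can be sketched in a few lines. The sketch assumes an ISO-like timestamp format as the rule and a deliberately simple mean and standard deviation model; the stage names (`ingest`, `train`, `deploy`), the format string, and the threshold are hypothetical and do not correspond to any of the pipelines discussed here.

```python
from datetime import datetime
from statistics import mean, stdev

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"  # assumed input rule, for illustration only

def ingest(rows):
    """Stage 1: parse (timestamp, value) rows.
    datetime.strptime raises ValueError when a timestamp breaks the rule."""
    return [(datetime.strptime(ts, TIME_FORMAT), float(v)) for ts, v in rows]

def train(records):
    """Stage 2: fit a very simple model (mean and standard deviation)."""
    values = [v for _, v in records]
    return {"mean": mean(values), "stdev": stdev(values)}

def deploy(model, k=3.0):
    """Stage 3: wrap the fitted model in a callable that a node could run."""
    def is_anomaly(value):
        return abs(value - model["mean"]) > k * model["stdev"]
    return is_anomaly

rows = [
    ("2022-10-04T10:00:00", "20.0"),
    ("2022-10-04T10:01:00", "20.5"),
    ("2022-10-04T10:02:00", "19.5"),
    ("2022-10-04T10:03:00", "20.2"),
    ("2022-10-04T10:04:00", "19.8"),
]

# The stages are chained without further user input.
detector = deploy(train(ingest(rows)))

# Stage 4: inference on new readings.
print(detector(20.1), detector(30.0))
```

Passing a row such as `("05/10/2022 10:00", "20.0")` to `ingest` raises `ValueError`, which is how this sketch rejects data that does not follow the agreed time format.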

Azure Machine Learning Pipeline [32]. Microsoft provides an ML pipeline based on running Python scripts on the cloud while automatically handling resource usage. Each step of the pipeline can be independently customized, hence offering scalability to the end-user. One of the practical features that Azure Machine Learning Pipeline offers is the automated dependency handling that allows the usage of a variety of hardware and software environments. Microsoft also provides Azure Cognitive Services [33], where users can utilize its ML pipeline and Anomaly Detector [34] service. The service applies a Graph Attention Network (GAT) [35] for multivariate analysis and SR-CNN [31] for univariate analysis.

Amazon Web Services (AWS) Machine Learning Pipeline [36]. Amazon provides an end-to-end ML pipeline as a service for detecting anomalies in real time. Inside the pipeline, there are many different services (e.g., database, data formatting) that can be utilized for pipeline tasks. Amazon SageMaker [37] is the main service that provides anomaly detection for both univariate and multivariate data. It allows users to either use a built-in unsupervised anomaly detection algorithm based on Random Cut Forest (RCF) [38] or use a custom algorithm that can be deployed via a Docker image. Now we introduce the pipelines proposed by academia.

Prado et al. [39] propose an end-to-end modular AI pipeline that allows users with less expertise to implement their AI applications, such as keyword spotting, image classification, and object detection, on systems that contain embedded devices. Their framework relies on Low Power Deep Neural Network (LPDNN), which contains an Inference Engine (LNE) that is compatible with Caffe [40]. LNE is a code generator that facilitates the deployment to the embedded devices. The authors use FIWARE [41] for IoT hub integration and Kurento Media Server [42] for media streaming, which are required to run live inference. The authors define Raspberry Pi devices as edge devices and evaluate the efficiency of LPDNN compared to TF Lite [43] on these devices by running benchmarks that are included in the TF Lite repository.

Drori et al. [44] propose an automatic ML (AutoML) system that optimizes the ML pipeline according to the given dataset. Their pipeline utilizes LSTM-RNN as a base ML algorithm. Monte Carlo Tree Search (MCTS) [45] is applied to the predictions generated by the LSTM-RNN to evaluate the performance of the pipeline and decide on the better pipeline. They evaluate the proposed pipeline against baseline stochastic gradient descent (SGD) [46] estimators from scikit-learn [22]. They claim their pipeline provides a faster run time than its peers.

Sutton et al. [47] propose an open-source ML pipeline that receives physiological data as input in real time and uses it to identify anomalous behaviors. The authors try to detect paroxysmal atrial fibrillation (PAF) by applying Probabilistic Symbolic Pattern Recognition (PSPR) to electrocardiogram (ECG) signals. PSPR is used for online feature extraction, while random forest (RF) is applied to classify the ECG data. The proposed pipeline is based on Spark's ML library (MLlib) [48], and hence supports other anomaly detection techniques included within MLlib.

Nitsche and Halbritter [49] propose a data science pipeline that is optimized for text classification. The authors benchmark different GPUs to evaluate the performance of their hardware setup, which consists of 10 NVIDIA Quadro P6000 GPUs, and the effect of the number of GPUs on the image processing time. They apply the Naive Bayes classifier that is included in the scikit-learn API and achieve above 90% accuracy on the Deutsche Presse-Agentur (dpa) dataset.

Shaikh et al. [50] focus on the challenges of ensuring policy fairness within end-to-end ML pipelines. They claim that ML-based tasks are done by engineers in a variety of roles, including data creators and feature engineers. Hence, each step of the ML pipeline might be subject to a policy violation. The authors provide an end-to-end ML pipeline that is based on log management to prevent these violations, as manually ensuring policy fairness is highly resource-consuming.

Boovaraghavan et al. [51] propose an adaptive end-to-end ML system for IoT applications. Their pipeline is optimized for activity recognition-based tasks including object recognition. The authors claim that the main challenge regarding end-to-end pipelines is the heterogeneity of IoT applications. They evaluate their pipeline with various hardware platforms and datasets while comparing the prediction time and accuracy of each machine learning technique they applied.

Molinara et al. [52] propose an end-to-end ML-based indoor air monitoring system for contaminant classification. The authors compare the performance of a Multi Layer Perceptron (MLP) to CNN and LSTM based deep learning techniques while testing the performance of MLP and CNN on an ESP32 MCU. They investigate the power consumption of the MCU with respect to the utilized ML technique. They claim the proposed system only fails to distinguish alcohol and acetone, due to their chemical similarities.

Vinzamuri et al. [53] propose an end-to-end context-aware anomaly detection system that requires time-series data. The proposed system utilizes a semi-supervised algorithm with Sparse Gaussian Graphical Models. They benchmark the pipeline on several public datasets. The authors claim semantics can improve the Gaussian Graphical Models further beyond other anomaly detection techniques. Their ML comparison is based on F-score, and the authors mention that the proposed pipeline is promising for industrial IoT environments.

Li et al. [54] develop an end-to-end automated anomaly detection system. They utilize an Apache Spark backend server to run the query-based operations. After the user provides a dataset, the proposed system automatically selects the most appropriate algorithm then applies
