AnoML-IoT: An End to End Re-configurable Multi-protocol
Anomaly Detection Pipeline for Internet of Things
Hakan Kayan (a), Yasar Majib (a), Wael Alsafery (a), Mahmoud Barhamgi (b), Charith Perera (a)
(a) Cardiff University, UK
(b) Claude Bernard Lyon 1 University, France
Abstract
The rapid development in ubiquitous computing has enabled the use of microcontrollers as edge
devices. These devices are used to develop truly distributed IoT-based mechanisms where machine
learning (ML) models are utilized. However, integrating ML models into edge devices requires an
understanding of various software tools, such as programming languages, as well as domain-specific
knowledge. Anomaly detection is one of the domains where a high level of expertise is
required to achieve promising results. In this work, we present AnoML, an end-to-end
data science pipeline that allows the integration of multiple wireless communication protocols
and anomaly detection algorithms, and deployment to edge, fog, and cloud platforms with minimal
user interaction. We facilitate the development of IoT anomaly detection mechanisms by reducing
the barriers formed by the heterogeneity of an IoT environment. The proposed
pipeline supports four main phases: (i) data ingestion, (ii) model training, (iii) model deployment,
and (iv) inference and maintenance. We evaluate the pipeline with two anomaly detection
datasets while comparing the efficiency of several machine learning algorithms on different
nodes. We also provide the source code of the developed tools, which are the main components
of the pipeline.
Keywords: Internet of Things, Data Science, Pipeline, Data Analytics, Multi-Protocol
1. Introduction
Edge AI, which is critical for resource-constrained environments that operate in the Internet
of Things (IoT) domain and perform intelligent tasks, has become a hot topic
with the arrival of Industry 4.0 [1]. It manages the interaction with the physical world that
is provided by sensors and actuators. Managing such an environment requires a series of
tasks (e.g., data collection, anomaly detection) that are operated by microcontrollers running
ML models. Data-related professionals (e.g., data scientists, ML engineers) define rules/ranges
and search for the best practices to increase the operability of edge mechanisms in their relevant
scientific disciplines. Finding hidden information in big data can enhance the quality of living,
but it is not a straightforward task [2].
While data scientists are not expected to be experts in edge-related infrastructures (e.g.,
programming languages, microcontrollers, sensors), they should be able to utilize data science
pipelines, which are executable workflows of data-related tasks that automate the desired process.
Thus, we developed a reconfigurable data science pipeline based on an IoT sensing infrastructure
that utilizes open-source software to facilitate developing an interconnected anomaly detection
system that runs on edge, fog, and cloud platforms. We define the edge as the platform where the
first interaction between the cyber and physical world happens. Hence, microcontrollers (e.g.,
Raspberry Pi Pico) that gather physical data are edge devices. We define fog as the platform
where several edge devices can be supervised. Hence, single-board computers (e.g., Raspberry
Pi 4B) are fog devices that might act as edge devices as well. Cloud is the platform where
real-world data gathered by the edge and fog devices are processed. We implemented our system
based on an example use case scenario to describe how the proposed system works while providing
some results.
Preprint submitted to Internet of Things Journal, October 5, 2022
arXiv:2210.01771v1 [cs.NI] 4 Oct 2022
The contributions of this paper are as follows:
• We provide a reconfigurable IoT sensing infrastructure that consists of two main open-source
  components: (i) the Edge to Cloud Code Generator (EECG), which generates ready-to-deploy
  code to enable data circulation from edge to fog, and (ii) the Node-RED package, hosted on
  Node-RED servers, which enables accessing and processing the edge data from anywhere that
  has access to Node-RED servers while offering visualization via a graphical user interface
  (GUI). We also provide a Python library and an executable shell script that facilitate the
  data training and inference phases.
• We propose a data science pipeline that interconnects edge, fog, and cloud devices/services
  to provide end-to-end anomaly detection system development. The pipeline contains four
  main stages: (i) data collection, which is provided by the components mentioned in the
  first contribution point, (ii) anomaly detection model training, (iii) model deployment
  to the edge, fog, and cloud, and (iv) inference and maintenance of the model based on new
  data. We demonstrate how the proposed tools are utilized during these stages.
• We provide a dataset generated using the proposed tools. We analyze
  the performance of Convolutional Neural Network (CNN) [3], Recurrent Neural Network
  (RNN) [4], Isolation Forest [5], and One-class Support Vector Machines (OC-SVM) [6]
  on the proposed dataset [7] and the WADI dataset [8]. We also evaluate them according
  to the platform (edge, fog, or cloud) where the anomaly detection model is deployed using
  the proposed pipeline. On the edge, we only evaluated CNN due to the lack of an
  application programming interface (API).
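The four stages named in the second contribution can be pictured as a chain of functions. The following Python sketch is only an illustration under stated assumptions: the function names (`ingest`, `train`, `deploy`, `infer_and_maintain`), the toy mean/standard-deviation model, and the sample readings are all hypothetical and are not part of the AnoML-IoT tools.

```python
from statistics import mean, stdev


def ingest(raw_readings):
    """Stage (i): keep only numeric sensor readings (hypothetical cleaning rule)."""
    return [float(r) for r in raw_readings if isinstance(r, (int, float))]


def train(values, k=3.0):
    """Stage (ii): fit a toy model, a band of k standard deviations around the mean."""
    mu, sigma = mean(values), stdev(values)
    return {"low": mu - k * sigma, "high": mu + k * sigma}


def deploy(model, platform):
    """Stage (iii): stand-in for shipping the model to one of the three platforms."""
    if platform not in ("edge", "fog", "cloud"):
        raise ValueError(f"unknown platform: {platform}")
    return {"platform": platform, "model": dict(model)}


def infer_and_maintain(deployment, history, new_values, k=3.0):
    """Stage (iv): flag anomalies, then retrain on the readings judged normal."""
    model = deployment["model"]
    flags = [not (model["low"] <= v <= model["high"]) for v in new_values]
    normal = [v for v, bad in zip(new_values, flags) if not bad]
    deployment["model"] = train(history + normal, k)
    return flags


# Hypothetical temperature readings; None stands for a dropped sample.
history = ingest([21.0, 21.4, 20.8, None, 21.1, 20.9, 21.3])
deployment = deploy(train(history), "edge")
flags = infer_and_maintain(deployment, history, [21.2, 35.0, 20.7])
print(flags)  # [False, True, False]
```

Retraining only on readings judged normal is one possible maintenance policy; the sketch does not claim that the pipeline maintains its models this way.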
Structure of the Paper: This section provides a high-level understanding of what we propose.
Section 2 outlines previous commercial and academic work. Section 3 presents
the architecture of an IoT anomaly detection pipeline infrastructure. Section 4 details
how the data are circulated and processed within the AnoML-IoT pipeline. We present
our evaluations and results in Section 5. Then, we discuss the results in Section 6 and
finally provide our conclusions in Section 7.
2. Related Work
In this section, we introduce the data science pipelines that are offered either by academia
or commercial entities, and anomaly detection techniques for time-series sensor data. We also
analyze the capabilities of open-source platforms that facilitate data circulation.
Figure 1: The unsupervised anomaly detection algorithms utilized in this work: neural networks
(CNN, RNN), decision trees (Isolation Forest), and one-class SVM, implemented with TensorFlow
and scikit-learn. TensorFlow also has an API [27] for decision trees, but because scikit-learn
is better documented, we prefer it for implementing Isolation Forest.
2.1. Unsupervised Anomaly Detection in Time Series Sensor Data
Anomaly detection is one of the fundamental fields that use machine learning (ML)
models as their main component. There is extensive research being done in this field [9, 10, 11].
There are three types of anomalies: (i) point anomalies, (ii) contextual anomalies, and (iii)
collective anomalies. If the anomalies are contextual and the context is time, time series anomaly
detection models are applied. For example, in an environment where the temperature
decreases at night, a sensor reading that behaves otherwise is a contextual
anomaly. While point anomalies are easier to detect, contextual and collective anomaly
detection requires additional tasks to identify the normal behavior of the system.
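The difference between a point anomaly and a contextual anomaly in the temperature example can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the readings, the day/night boundary hours, and the three-standard-deviation bands are hypothetical and do not come from the evaluated datasets.

```python
from statistics import mean, stdev

# Hypothetical (hour_of_day, temperature in Celsius) readings of normal behavior:
# cool nights and warm days.
normal = [(1, 11.0), (2, 10.5), (3, 10.2), (4, 10.8),
          (13, 24.0), (14, 24.6), (15, 24.3), (16, 23.8)]


def context_of(hour):
    """Assumed context rule: night is 20:00 to 05:59, everything else is day."""
    return "night" if hour < 6 or hour >= 20 else "day"


def band(values, k=3.0):
    """Range of k standard deviations around the mean of the given values."""
    mu, sigma = mean(values), stdev(values)
    return mu - k * sigma, mu + k * sigma


# One band over all readings (no context) and one band per context.
global_band = band([t for _, t in normal])
context_bands = {
    ctx: band([t for h, t in normal if context_of(h) == ctx])
    for ctx in ("night", "day")
}


def classify(hour, temp):
    lo, hi = global_band
    if not lo <= temp <= hi:
        return "point anomaly"        # abnormal regardless of the time of day
    lo, hi = context_bands[context_of(hour)]
    if not lo <= temp <= hi:
        return "contextual anomaly"   # plausible value, wrong time of day
    return "normal"


print(classify(2, 23.9))    # contextual anomaly: a daytime value at 02:00
print(classify(14, 60.0))   # point anomaly
print(classify(14, 24.1))   # normal
```

The reading of 23.9 at 02:00 passes the context-free check because the same value is normal during the day, which is why identifying the normal behavior per context is the additional task that contextual detection requires.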
The nature of the input data is the core element that determines the efficiency of the ML
model. The features of the data may depend on several complementary terms such as labels,
context, and domain. For example, if the input data do not contain any labels that define normality,
unsupervised algorithms [12] are applied; if the data are related to a certain context, context-aware
methods [13] are selected; and if the environment is industrial, faster models with reduced
complexity [14] are preferred because of the importance of detection time.
In an interconnected domain such as IoT, cyber-physical systems [15] are utilized to supervise
the environment. These systems observe behavioral changes (e.g., changes in temperature
or movement) in their surroundings through modules that manage sensors [16]. They can also act as
controllers if they contain actuators. In such environments, an anomaly might be caused by either
independent or dependent events. If the events are independent, univariate analysis [17] is
applied. For example, behavioral changes in temperature, loudness, light intensity, and humidity
can each be detected from the related data alone, and hence require univariate analysis. However,
changes in angular momentum or acceleration are measured by sensors (e.g., accelerometer, gyroscope)
that generate data per dimension. Hence, the relation between the data points should also be
analyzed to detect anomalies, and multivariate analysis [18] is applied.
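A small sketch can show why per-dimension sensor data call for a joint check. It assumes a hypothetical three-axis accelerometer on a device at rest, so the length of the acceleration vector should stay close to gravity; the sample values, the tolerance, and the function names are illustrative assumptions only.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2

# Hypothetical (x, y, z) readings of a device at rest in several orientations.
at_rest = [
    (0.0, 0.0, 9.81),
    (9.81, 0.0, 0.0),
    (0.0, 9.81, 0.0),
    (6.94, 6.94, 0.0),
    (5.66, 5.66, 5.66),
]

# Univariate view: an independent (min, max) range for each axis.
axis_ranges = [(min(axis), max(axis)) for axis in zip(*at_rest)]


def univariate_ok(sample):
    """Check every axis against its own range, ignoring the other axes."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(sample, axis_ranges))


def multivariate_ok(sample, tolerance=0.5):
    """Check the relation between the axes: the vector length must be close to G."""
    magnitude = math.sqrt(sum(v * v for v in sample))
    return abs(magnitude - G) <= tolerance


suspicious = (9.0, 9.0, 9.0)
print(univariate_ok(suspicious))    # True: each axis looks fine on its own
print(multivariate_ok(suspicious))  # False: the combined magnitude is about 15.6
```

The magnitude test stands in for multivariate analysis in general; a learned model would capture the relation between dimensions from data rather than from a known physical constant.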
While academia keeps offering new anomaly detection algorithms [19, 20], most of the time
these are based on the fundamental ones [21]. Hence, for this work, we selected the following
algorithms, as they are the most common ones utilized for unsupervised anomaly
detection in time series data and are accessible via common ML programming libraries/frameworks
(e.g., scikit-learn [22], TensorFlow [23]): (i) convolutional neural networks (CNN) [24, 25], (ii)
recurrent neural networks (RNN) [26], (iii) isolation forest (IF) [5], and (iv) one-class support
vector machines (OC-SVM) [6]. Figure 1 shows the algorithms used in this study.
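To give an intuition for one of the selected algorithms, the sketch below illustrates the idea behind isolation forest: anomalies are separated from the rest of the data by fewer random splits than normal points. It is a simplified pure-Python illustration and not the scikit-learn implementation used in this work: it follows only the branch that contains the query point, skips subsampling and the path-length normalization of the original algorithm, and runs on hypothetical (temperature, humidity) readings.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=12):
    """Number of random axis-aligned splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # Only dimensions in which the remaining rows still differ can be split.
    dims = [d for d in range(len(point))
            if min(r[d] for r in data) < max(r[d] for r in data)]
    if not dims:
        return depth
    dim = rng.choice(dims)
    lo = min(r[dim] for r in data)
    hi = max(r[dim] for r in data)
    split = rng.uniform(lo, hi)
    # Keep only the rows that fall on the same side of the split as the point.
    same_side = [r for r in data if (r[dim] < split) == (point[dim] < split)]
    return path_length(point, same_side, rng, depth + 1, max_depth)


def mean_path_length(point, data, n_trees=200, seed=0):
    """Average isolation depth over many random trees; a shorter path is more anomalous."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees


# Hypothetical readings: a dense 7 x 5 grid of normal values plus one outlier.
normal = [(20.0 + (i % 7) * 0.3, 40.0 + (i % 5) * 1.5) for i in range(35)]
outlier = (35.0, 80.0)
inlier = normal[17]  # a point in the interior of the grid
data = normal + [outlier]

outlier_depth = mean_path_length(outlier, data)
inlier_depth = mean_path_length(inlier, data)
print(outlier_depth < inlier_depth)  # True: the outlier is isolated much sooner
```

In practice a library implementation such as scikit-learn's `IsolationForest` would replace this sketch; the point here is only the relation between path length and abnormality.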
2.2. Machine Learning Platforms and Data Science Pipelines
Machine learning platforms. After showing promising results in a variety of tasks including
speech recognition, image processing, anomaly detection, and medical diagnosis, ML has
attracted interest from both academic and commercial entities, resulting in the creation of
many open-source and proprietary ML platforms and pipelines. ML models can be generated via
hand-coding, code generators, or interpreters.
Hand-coding. There are many machine learning
libraries available [22, 23, 28] that allow users to create ML models or to deploy and evaluate ML
algorithms. A professional might prefer hand-coding as it offers high customization,
allows the development and employment of novel algorithms, and is easy to maintain.
However, hand-coding might be very resource-consuming, thus most of the time it is done by a
group of programmers.
Code generators. ML consists of many steps (e.g., data acquisition, data
pre-processing, and fitting). Rather than hand-coding all these steps, code generators [29, 30]
might be utilized to facilitate the process. Due to the variety of complicated tasks, most code
generators provide specific code for a specific task.
Interpreters. One of the main challenges
of ML is the portability of the generated model. Interpreters provide portability by generating a
model file that can be run on other platforms with minimal coding. TensorFlow [23] is the most
common one that offers model generation for resource-constrained platforms.
Data science pipelines. Raw data must be interpreted before they can be utilized within data
science-related tasks. If a data science pipeline contains all the steps that are required to
interpret the data, from data gathering to deployment of a machine learning model, it is called
end-to-end. These end-to-end pipelines can be either manual, where the user provides many
inputs and sets parameters each time before a new model is generated, or automated, where little to
no input is taken. Due to the variety of data types, automated pipelines impose a certain set of rules
(e.g., time format) that the input data must satisfy to be accepted [31]. These pipelines can also be
named according to the performed tasks (e.g., anomaly detection pipeline). We now introduce
pipelines that are presented by either industry or academia.
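The kind of input rule that an automated pipeline imposes can be sketched as a small validation step. The two rules below (ISO 8601 timestamps and numeric readings), the function name `validate_rows`, and the sample rows are assumptions chosen for illustration; they are not the rules of any of the pipelines discussed here.

```python
from datetime import datetime


def validate_rows(rows):
    """Split (timestamp, value) rows into accepted and rejected lists.

    Assumed rules: the timestamp must be an ISO 8601 string and the value
    must be convertible to a float.
    """
    accepted, rejected = [], []
    for ts, value in rows:
        try:
            parsed = (datetime.fromisoformat(ts), float(value))
        except (TypeError, ValueError):
            rejected.append((ts, value))
        else:
            accepted.append(parsed)
    return accepted, rejected


rows = [
    ("2022-10-04T12:00:00", "21.5"),
    ("04/10/2022 12:01", "21.6"),     # violates the assumed time format
    ("2022-10-04T12:02:00", "n/a"),   # non-numeric reading
]
accepted, rejected = validate_rows(rows)
print(len(accepted), len(rejected))  # 1 2
```

Rejecting rows early keeps the later training and inference stages free of format handling, which is one reason an automated pipeline can run with little to no user input.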
Azure Machine Learning Pipeline [32]. Microsoft provides an ML pipeline based on running
Python scripts on the cloud while automatically handling resource usage. Each step of the
pipeline can be independently customized, hence offering scalability to the end-user. One of the
practical features that Azure Machine Learning Pipeline offers is automated dependency
handling, which allows the usage of a variety of hardware and software environments. Microsoft also
provides Azure Cognitive Services [33], through which users can utilize its ML pipeline and Anomaly
Detector [34] service. The service applies a Graph Attention Network (GAT) [35] for multivariate
analysis and SR-CNN [31] for univariate analysis.
Amazon Web Services (AWS) Machine Learning Pipeline [36]. Amazon provides an end-to-end
ML pipeline as a service for detecting anomalies in real time. Inside the pipeline, there
are many different services (e.g., database, data formatting) that can be utilized for pipeline
tasks. Amazon SageMaker [37] is the main service that provides anomaly detection for both
univariate and multivariate data. It allows users to either use a built-in unsupervised anomaly
detection algorithm based on Random Cut Forest (RCF) [38] or use a custom algorithm that can
be deployed via a Docker image. We now introduce the pipelines proposed by academia.
Prado et al. [39] propose an end-to-end modular AI pipeline that allows users with less
expertise to implement their AI applications, such as keyword spotting, image classification, and
object detection, on systems that contain embedded devices. Their framework relies on Low
Power Deep Neural Network (LPDNN), which contains an Inference Engine (LNE) that is
compatible with Caffe [40]. LNE is a code generator that facilitates deployment to embedded
devices. The authors use FIWARE [41] for IoT hub integration and Kurento Media Server [42]
for media streaming, which are required to run live inference. The authors define Raspberry Pi
devices as edge and evaluate the efficiency of LPDNN compared to TF Lite [43] on these devices
by running benchmarks that are included in the TF Lite repository.
Drori et al. [44] propose an automatic ML (AutoML) system that optimizes the ML pipeline
according to the given dataset. Their pipeline utilizes LSTM-RNN as a base ML algorithm.
Monte Carlo Tree Search (MCTS) [45] is applied to the predictions generated by the LSTM-RNN
to evaluate the performance of the pipeline and decide on the better pipeline. They evaluate the
proposed pipeline against baseline stochastic gradient descent (SGD) [46] estimators from
scikit-learn [22]. They claim their pipeline provides a faster run time than its peers.
Sutton et al. [47] propose an open-source ML pipeline that receives physiological data
as input in real time and uses it to identify anomalous behavior. The authors try to detect
paroxysmal atrial fibrillation (PAF) by applying Probabilistic Symbolic Pattern Recognition (PSPR)
to electrocardiogram (ECG) signals. PSPR is used for online feature extraction, while they
apply random forest (RF) to classify ECG data. The proposed pipeline is based on Spark’s ML
library (MLlib) [48], hence it also allows the use of other anomaly detection techniques included within MLlib.
Nitsche and Halbritter [49] propose a data science pipeline that is optimized for text
classification. The authors benchmark different GPUs to evaluate the performance of their hardware
setup, which consists of 10 NVIDIA Quadro P6000 GPUs, and the effect of the number of GPUs on the
image processing time. They apply the Naive Bayes classifier that is included in the scikit-learn API
and achieve above 90% accuracy on the Deutsche Presse-Agentur (dpa) dataset.
Shaikh et al. [50] focus on the challenges of ensuring policy fairness within end-to-end ML
pipelines. They claim that ML-based tasks are done by engineers with a variety of roles,
including data creators and feature engineers. Hence, each step of the ML pipeline might be
subject to a policy violation. The authors provide an end-to-end ML pipeline that is based
on log management to prevent these violations, as manually ensuring policy fairness is highly
resource-consuming.
Boovaraghavan et al. [51] propose an adaptive end-to-end ML system for IoT applications.
Their pipeline is optimized for activity recognition-based tasks including object recognition. The
authors claim that the main challenge regarding end-to-end pipelines is the heterogeneity of
IoT applications. They evaluate their pipeline with various hardware platforms and datasets
while comparing the prediction time and accuracy of each machine learning technique they applied.
Molinara et al. [52] propose an end-to-end ML-based indoor air monitoring system for
contaminant classification. The authors compare the performance of Multi Layer Perceptron (MLP) to
CNN and LSTM based deep learning techniques while testing the performance of MLP and CNN
on an ESP32 MCU. They investigate the power consumption of the MCU with respect to the utilized ML
technique. They claim the proposed system only fails to distinguish alcohol from acetone, due to
their chemical similarity.
Vinzamuri et al. [53] propose an end-to-end context-aware anomaly detection system that
requires time-series data. The proposed system utilizes a semi-supervised algorithm with Sparse
Gaussian Graphical Models. They benchmark the pipeline on several public datasets. The
authors claim semantics can improve the Gaussian Graphical Models beyond other anomaly
detection techniques. Their ML comparison is based on the F-score, and the authors state that the
proposed pipeline is promising for industrial IoT environments.
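For readers unfamiliar with the metric, the F-score combines precision and recall into a single number. The sketch below is a generic illustration: the text does not state which variant is used, so it assumes the common F1 setting (`beta=1.0`), and the label vectors are hypothetical.

```python
def f_score(y_true, y_pred, beta=1.0):
    """F-beta score for binary labels, where 1 marks an anomaly and 0 a normal sample."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no anomaly was correctly detected
    precision = tp / (tp + fp)  # share of raised alarms that were real anomalies
    recall = tp / (tp + fn)     # share of real anomalies that raised an alarm
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical ground truth and detector output.
y_true = [0, 0, 1, 1, 1, 0, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0, 0, 1]
print(f_score(y_true, y_pred))  # 3 true positives, 1 false positive, 1 false negative
```

Because anomalies are usually rare, a detector that never raises an alarm can still reach high accuracy, which is why a precision and recall based score is a common choice for comparing anomaly detectors.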
Li et al. [54] develop an end-to-end automated anomaly detection system. They utilize an
Apache Spark backend server to run the query-based operations. After the user provides a
dataset, the proposed system automatically selects the most appropriate algorithm then applies