2.2. Machine Learning Platforms and Data Science Pipelines
Machine learning platforms. After showing promising results in a variety of tasks including speech recognition, image processing, anomaly detection, and medical diagnosis, ML has attracted interest from both academic and commercial entities, resulting in the creation of many open-source and proprietary ML platforms and pipelines. ML models can be generated via hand-coding, code generators, or interpreters. Hand coding. There are many machine learning libraries available [22, 23, 28] that allow users to create ML models or to deploy and evaluate ML algorithms. A professional might prefer hand-coding because it offers high customization, allows the development and deployment of novel algorithms, and is easy to maintain. However, hand-coding can be very resource-consuming, so it is most often carried out by a team of programmers. Code generators. ML consists of many steps (e.g., data acquisition, data
pre-processing, and fitting). Rather than hand-coding all these steps, code generators [29, 30]
might be utilized to facilitate the process. Because tasks vary widely in complexity, most code generators produce code tailored to a specific task. Interpreters. One of the main challenges of ML is the portability of the generated model. Interpreters provide portability by generating a model file that can be run on other platforms with minimal coding. TensorFlow [23] is the most widely used of these and offers model generation for resource-constrained platforms.
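As a rough illustration of this interpreter workflow, the sketch below fits a small Keras model and exports it as a TensorFlow Lite model file that a lightweight interpreter can load on a resource-constrained device; the data, architecture, and file name are illustrative assumptions rather than details from the cited works.

    import numpy as np
    import tensorflow as tf

    # Hand-code and fit a small model on illustrative data.
    x = np.random.rand(256, 4).astype("float32")
    y = (x.sum(axis=1) > 2.0).astype("float32")

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(4,)),
        tf.keras.layers.Dense(8, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.fit(x, y, epochs=5, verbose=0)

    # Export a portable model file; the TensorFlow Lite interpreter can then
    # run it on resource-constrained platforms with minimal extra code.
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    with open("model.tflite", "wb") as f:
        f.write(converter.convert())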
Data science pipelines. Raw data must be interpreted before they can be used in data science tasks. If a data science pipeline contains all the steps required to interpret the data, from data gathering to the deployment of a machine learning model, it is called end-to-end. These end-to-end pipelines can be either manual, where the user provides many inputs and sets parameters each time a new model is generated, or automated, where little to no input is required. Due to the variety of data types, automated pipelines impose a certain set of rules (e.g., a time format) that the input data must satisfy [31]. These pipelines can also be named after the task they perform (e.g., anomaly detection pipeline). Before introducing pipelines presented by industry and academia, we sketch a minimal example of such a pipeline below.
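The sketch below chains pre-processing and model fitting with scikit-learn under illustrative assumptions about the data and the estimator; a full end-to-end pipeline would additionally cover data gathering and deployment. In a manual pipeline the user would set these parameters on each run, whereas an automated pipeline would fix or search them with little user input.

    import numpy as np
    from sklearn.ensemble import IsolationForest
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Illustrative raw data: 500 samples with 3 features each.
    rng = np.random.default_rng(0)
    raw = rng.normal(size=(500, 3))

    # Pre-processing followed by model fitting, the core of the pipeline.
    pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("detect", IsolationForest(contamination=0.05, random_state=0)),
    ])
    pipeline.fit(raw)
    labels = pipeline.predict(raw)  # -1 marks points flagged as anomalous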
Azure Machine Learning Pipeline [32]. Microsoft provides an ML pipeline based on run-
ning Python scripts on the cloud while automatically handling resource usage. Each step of the
pipeline can be independently customized, hence offering scalability to the end user. One of the practical features that Azure Machine Learning Pipeline offers is automated dependency handling, which allows the use of a variety of hardware and software environments. Microsoft also provides Azure Cognitive Services [33], through which users can utilize its ML pipeline and the Anomaly Detector [34] service. The Anomaly Detector applies a graph attention network (GAT) [35] for multivariate analysis and SR-CNN [31] for univariate analysis.
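A minimal sketch of how such a script-based pipeline might be assembled with the v1 azureml-sdk is shown below; the workspace configuration, script names, and compute target name are placeholders, not values from the cited documentation.

    from azureml.core import Experiment, Workspace
    from azureml.pipeline.core import Pipeline
    from azureml.pipeline.steps import PythonScriptStep

    # Connect to an existing workspace (assumes a local config.json).
    ws = Workspace.from_config()

    # Each step wraps a user-provided Python script and runs it on a managed
    # compute target; the service resolves the environment for every step.
    prep = PythonScriptStep(name="prepare_data",
                            script_name="prepare.py",
                            source_directory="./scripts",
                            compute_target="cpu-cluster")
    train = PythonScriptStep(name="train_model",
                             script_name="train.py",
                             source_directory="./scripts",
                             compute_target="cpu-cluster")

    # Assemble the steps into a pipeline and submit it as an experiment run.
    pipeline = Pipeline(workspace=ws, steps=[prep, train])
    Experiment(ws, "anomaly-pipeline").submit(pipeline)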
Amazon Web Services (AWS) Machine Learning Pipeline [36]. Amazon provides an end-
to-end ML pipeline as a service for detecting anomalies in real time. Inside the pipeline, there
are many different services (e.g., database, data formatting) that can be utilized for pipeline
tasks. Amazon SageMaker [37] is the main service that provides anomaly detection for both
univariate and multivariate data. It allows users to either use a built-in unsupervised anomaly
detection algorithm based on Random Cut Forest (RCF) [38] or use a custom algorithm that can
be deployed via a Docker image; a brief sketch of the built-in RCF workflow is given below. We then turn to pipelines proposed by academia.
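The sketch below shows how the built-in RCF estimator might be used through the SageMaker Python SDK; the IAM role, instance types, hyperparameters, and training data are placeholders, and parameter names can differ across SDK versions.

    import numpy as np
    import sagemaker
    from sagemaker import RandomCutForest

    session = sagemaker.Session()
    role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder IAM role

    # Built-in unsupervised Random Cut Forest estimator for anomaly detection.
    rcf = RandomCutForest(role=role,
                          instance_count=1,
                          instance_type="ml.m5.large",
                          num_trees=50,
                          num_samples_per_tree=256,
                          sagemaker_session=session)

    # Illustrative training data: 1000 univariate observations.
    train = np.random.rand(1000, 1).astype("float32")
    rcf.fit(rcf.record_set(train))

    # Deploy an endpoint and score points; higher scores suggest anomalies.
    predictor = rcf.deploy(initial_instance_count=1, instance_type="ml.m5.large")
    scores = predictor.predict(train[:10])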
Prado et al. [39] propose an end-to-end modular AI pipeline that allows users with little expertise to deploy AI applications such as keyword spotting, image classification, and object detection on systems that contain embedded devices. Their framework relies on Low Power Deep Neural Network (LPDNN), which contains an Inference Engine (LNE) compatible with Caffe [40]. LNE is a code generator that facilitates the deployment to the embedded