network-related data, many organizations may find it challenging to share data for training ML models. Consequently,
these organizations end up with an ML model that does not
achieve its maximum potential. To resolve this issue, FL
can be implemented. FL is a decentralized collaborative ML
technique [1]. Instead of aggregating data to create a single
ML model, models are trained iteratively at every node, and
the model parameters from each node are fused together using
FL fusion algorithms [1].
FL is often implemented with a central FL server node
orchestrating training rounds over multiple participating client
nodes. At the beginning of each training round, the FL
server shares a global FL model with each client node. Upon
receiving the global FL model, each client trains it on its own local data. Each client then sends its updated model parameters back to the FL server for
aggregation. The FL server collects all the updates and fuses
them by using one of the FL fusion algorithms. FedAvg is one
of the pioneering fusion algorithms [5]. With FedAvg, the global model update is computed as a weighted average of each client model's parameters, with weights proportional to each client's local data size [5]. This completes
one training round. Several training rounds are orchestrated by
the FL server until the desired performance is achieved. This
helps to ensure that client data never leaves its source location,
and it allows multiple client nodes to collaborate and build a
common ML model without directly sharing sensitive data.
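For concreteness, the following is a minimal sketch of this fusion step, assuming each client model is represented as a list of NumPy parameter arrays; the function and variable names are illustrative rather than any specific FL library's API.

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """FedAvg-style fusion: average each parameter across clients,
    weighting every client by its number of local training samples."""
    total = sum(client_sizes)
    weights = [n / total for n in client_sizes]
    fused = []
    for layer_idx in range(len(client_params[0])):
        layer = sum(w * params[layer_idx]
                    for w, params in zip(weights, client_params))
        fused.append(layer)
    return fused

# One round: two clients return updated parameters for a one-layer model.
client_a = [np.array([[0.2, 0.4]])]  # 100 local samples
client_b = [np.array([[0.6, 0.8]])]  # 300 local samples
global_update = fedavg([client_a, client_b], [100, 300])
# -> [[0.5, 0.7]]; client_b's update counts three times as much
```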
III. LITERATURE REVIEW
With the constant advancement of ML techniques and the increased availability of intrusion detection data sets, researchers have set out to improve upon existing IDSs. The variety of ML-based anomaly detection methods has provided insight into the challenges of dealing with cyberattack data, as well as possible solutions to overcome them.
The Canadian Institute for Cybersecurity 2017 Intrusion De-
tection System (CIC-IDS2017) and Canadian Institute for Cy-
bersecurity 2018 Intrusion Detection System (CIC-IDS2018)
data sets contain labeled network activity data for benign
and malicious behavior [6], [7]. Given that the CIC-IDS data sets contain labeled data, a classification model is a logical approach to determine whether the data are benign or malicious. Zhou and Pezaros experimented with six different types of classification models on the CIC-IDS2018 data set to determine whether the data are ‘evil’ or ‘benign’ [8]. The experiment
initially tested each model on individual attacks, but in the
final experiment the team used a decision tree classifier with
each of the attack types grouped together as ‘evil’ data [8].
The decision tree had an F1 score of 1.0 when detecting benign data and 0.57 when detecting the attack data [8]. The classifier performed well when detecting a single type of attack, but differentiating between attacks becomes increasingly difficult as more types are added.
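As a rough sketch of this kind of binary ‘evil’/‘benign’ classification, assuming scikit-learn and synthetic placeholder features standing in for CIC-IDS2018 flow records (this is not Zhou and Pezaros’s exact configuration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# Synthetic stand-in for labeled flow features: 0 = benign, 1 = 'evil'.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

# Per-class F1 scores, mirroring the separate benign/attack results in [8].
print(f1_score(y_test, pred, pos_label=0))  # benign
print(f1_score(y_test, pred, pos_label=1))  # 'evil'
```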
Although classification models have been shown to be a viable approach, autoencoders have been very successful at detecting
anomalies. Hindy et al. conducted an experiment on the CIC-
IDS2017 data set using an autoencoder with various threshold
levels [4]. The autoencoder was trained using benign data so
that the reconstruction loss would be higher when process-
ing attack data. With the optimal threshold, the autoencoder
had the following accuracies: 90.01%, 98.43%, 98.47%, and 99.67% for DoS GoldenEye, DoS Hulk, Port Scanning, and DDoS attacks, respectively [4]. These results are very promising, but the variation in accuracy across threshold levels highlights the importance of choosing an optimal threshold.
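The core of this technique can be sketched as follows, assuming Keras; the layer sizes, feature count, and placeholder training data are illustrative rather than Hindy et al.’s configuration:

```python
import numpy as np
import tensorflow as tf

n_features = 78  # placeholder for the number of extracted flow features

# Small fully connected autoencoder, trained only on benign traffic.
autoencoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_features,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(8, activation="relu"),   # bottleneck
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(n_features, activation="linear"),
])
autoencoder.compile(optimizer="adam", loss="mse")

X_benign = np.random.rand(1024, n_features).astype("float32")  # placeholder
autoencoder.fit(X_benign, X_benign, epochs=5, batch_size=128, verbose=0)

def flag_attacks(x, threshold):
    """Flag records whose reconstruction error exceeds the threshold;
    attack traffic was never seen in training, so it reconstructs poorly."""
    recon = autoencoder.predict(x, verbose=0)
    errors = np.mean(np.square(x - recon), axis=1)
    return errors > threshold
```

Sweeping `threshold` over validation data is precisely where the choice of an optimal value becomes consequential.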
In another experiment, Li combined the autoencoder and classifier approaches to detect attacks [9]. To start the process, the normal data are sent through the autoencoder for dimensionality reduction [9]. The reduced data are then fed into a dense neural network that consists of four hidden layers and an output layer for binary classification [9]. The classifier’s predictions were then used to train and test a decision tree [9]. Rezvy et al. followed a similar approach using an autoencoder and a classifier [10]. The difference, however, is that Rezvy et al. used the autoencoder to minimize the reconstruction error, and the reconstruction error is then used as the input data for the classification model [10]. The results from this experiment are very promising, and the idea of using a classifier along with the autoencoder is a possible solution to finding the optimal threshold level.
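A rough sketch of such a two-stage pipeline is given below, again assuming Keras; the architecture sizes are placeholders, not the exact networks from [9] or [10], and the final decision tree stage from [9] is omitted for brevity:

```python
import tensorflow as tf

n_features = 78  # placeholder feature count

# Stage 1: encoder half of an autoencoder, used for dimensionality
# reduction (in [9] this would be pretrained as part of an autoencoder).
encoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_features,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(8, activation="relu"),  # reduced representation
])

# Stage 2: dense network with four hidden layers and a binary output,
# following the structure described for Li's classifier [9].
classifier = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
classifier.compile(optimizer="adam", loss="binary_crossentropy",
                   metrics=["accuracy"])
# classifier.fit(X_train, y_train, ...) would follow; in [9], the
# classifier's predictions are then used to train a decision tree.
```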
In “Chained Anomaly Detection Models for Federated
Learning: An Intrusion Detection Case Study”, Preuveneers et
al. built autoencoder-based intrusion detection models using the CIC-IDS2017 data set [11]. They partitioned the data into 12 parties based on the internet protocol (IP) addresses of the victim machines. The autoencoders were trained using only benign traffic from the first day of the CIC-IDS2017 simulations. In the experiments, the authors varied the number of parties from 1 to 12, where 1 represented centralized training and 12 represented the extreme case in which each victim machine is a separate FL party [11]. They observed that FL setups with more parties
required more epochs for the model to converge. In their
results, they report that it took around 20 epochs for the central model to converge, while the 12-party FL setup took around 50 epochs [11]. While more epochs are needed, the time required for each epoch decreases because the parties train their local models in parallel.
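As a small illustration of this style of partitioning, assuming pandas and a hypothetical dst_ip column identifying the victim machine:

```python
import pandas as pd

# Toy stand-in for preprocessed CIC-IDS2017 flows; 'dst_ip' is a
# hypothetical column name marking the victim machine.
df = pd.DataFrame({
    "dst_ip": ["192.168.10.50", "192.168.10.51", "192.168.10.50"],
    "flow_duration": [120, 4500, 88],
    "label": ["BENIGN", "BENIGN", "BENIGN"],
})

# One FL party per victim machine (the 12-party extreme in [11]);
# merging groups yields fewer parties, down to the centralized case.
parties = {ip: g.drop(columns="dst_ip") for ip, g in df.groupby("dst_ip")}
```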
In “Federated Learning for Malware Detection in IoT De-
vices”, Marmol Campos et al. worked with malware detection
using the N-BaIoT IoT data set [5]. They described and compared two variations of the FedAvg algorithm: Mini-Batch Aggregation and Multi-Epoch Aggregation. In Mini-Batch Aggregation, the data at each party node are grouped into mini-batches, and in each FL round only a single mini-batch is used for training before the updated model parameters are sent back to the FL server [5]. This process is repeated until all mini-batches are covered. In Multi-Epoch Aggregation, the received
model is trained for multiple epochs using all the available
data at a party node before sending model updates back to
the server [5].
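The distinction between the two variants can be sketched as follows for a single party’s contribution per round; train_step is a placeholder for one local gradient update, and this is an illustration rather than the authors’ implementation:

```python
def minibatch_round(model, batches, round_idx, train_step):
    """Mini-Batch Aggregation: each FL round trains on exactly one
    mini-batch, so the server fuses updates after every batch."""
    train_step(model, batches[round_idx % len(batches)])
    return model  # sent back to the FL server immediately

def multiepoch_round(model, batches, n_epochs, train_step):
    """Multi-Epoch Aggregation: each FL round trains on all local
    data for several epochs before returning the update."""
    for _ in range(n_epochs):
        for batch in batches:
            train_step(model, batch)
    return model  # sent back only after the full local training
```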
They reported that an FL model trained with mini-batch aggregation converges better than the multi-epoch