
D.P. Giakatos, S. Kostoglou, et al.
Figure 1: Overall methodology pipeline.
future research. Our experimental results (Section 4) provide initial insights about the efficiency of GNNs on Internet-data-related tasks (e.g., the role of graph structure and/or node attributes for different tasks), and reveal several challenges.
• Open data and code: We make publicly available the compiled dataset and our code (using a popular GNN library [2, 33])¹ in [1].
2 DATASET
In this section, we present the data sources (Section 2.1) and the
preprocessing (Section 2.2) we applied on the data to generate the
compiled dataset. The overall methodology is depicted in Fig. 1.
2.1 Data sources
Each network or Autonomous System (AS) can be characterized by a multitude of features, such as location, connectivity, traffic levels, etc. We collect data from multiple online (public) data sources to compile a dataset that contains multiple pieces of information for each AS.
The first three data sources are widely used by Internet researchers and operators for multiple purposes:
• CAIDA AS-rank [4]: various information about ASes, such as location, network size, topology, etc.
• CAIDA AS-relationship [5]: a list of AS links (i.e., edges), which are used to build the AS-graph.
• PeeringDB [7, 21]: an online database where network operators register information about the connectivity, network types, traffic, etc., of their networks.
We also use the following sources that provide data related to the
routing properties of ASes and their business types:
• AS hegemony [23]
• Country-level Transit Influence (CTI) [6]
• ASDB [29]
From the above sources, we collect the most relevant attributes per AS, resulting in a dataset of 19 attributes/features (see Table 1 for the detailed list). For ease of analysis, we also provide in the online repository [1] a visual exploratory data analysis with the detailed distributions of all attributes.
2.2 Data preprocessing
The collected data are highly heterogeneous, including both numerical and categorical attributes. Moreover, numerical attributes take values in different ranges, and some of them span ranges several orders of magnitude larger than others (see Table 1). Since it is well known that non-homogeneous data values can impact the performance of deep learning models, we need to preprocess the data. In the following, we describe the transformation we apply to each type of attribute to generate a dataset with normalized attributes taking values in the interval [0, 1].

¹As well as all the experimental results of the paper, for reproducibility purposes.
Categorical features.
For every categorical feature, we apply one-hot encoding: a new binary feature is created for every value of the categorical feature. For example, the "Location-continent" feature contains 6 values (Africa, Asia, Europe, N. America, S. America, Oceania), so after one-hot encoding 6 new numerical columns are created; hence, an AS located in Europe will have a value of 1 in the new feature for Europe, and a value of 0 in the 5 new features that correspond to the other continents.
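The continent example above can be sketched as follows (a minimal pure-Python illustration; real pipelines would typically use `pandas.get_dummies` or scikit-learn's `OneHotEncoder`, and the function name here is ours, not from the paper's released code):

```python
# Hypothetical one-hot encoding of the "Location-continent" feature.
CONTINENTS = ["Africa", "Asia", "Europe", "N. America", "S. America", "Oceania"]

def one_hot(continent):
    """Return a 6-element 0/1 vector with a single 1 at the continent's index."""
    return [1 if c == continent else 0 for c in CONTINENTS]

# An AS located in Europe gets 1 in the "Europe" column and 0 elsewhere.
print(one_hot("Europe"))  # → [0, 0, 1, 0, 0, 0]
```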
Numerical features.
As can be seen in Table 1, some numerical attributes take values in very large ranges (e.g., the customer cone of ASes spans from 1 to more than 48k ASNs). Also, for many of these attributes the values for different ASes are not distributed uniformly, but follow a heavy-tailed distribution (e.g., almost 95% of ASes have a customer cone of 1 ASN). To alleviate this large heterogeneity and variability of the numerical features, we perform the following transformations.
• First, for every numerical feature, except for the AS hegemony and the CTI top features that only take values less than 1, we apply a logarithmic transformation to decrease their variability, as follows: 𝑥 → log(𝑥 + 1).
• Then, we normalize all numerical features according to the Min-Max scaling method: 𝑥 → (𝑥 − min(𝑥)) / (max(𝑥) − min(𝑥)). As a result, all the resulting values are in the range [0, 1].
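The two-step transformation above can be sketched as follows (a minimal illustration; function names and the sample values are ours, not from the paper's released code):

```python
import math

def log_transform(values):
    """x -> log(x + 1), applied elementwise to reduce variability."""
    return [math.log(v + 1) for v in values]

def min_max_scale(values):
    """Min-Max scaling: x -> (x - min(x)) / (max(x) - min(x))."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# e.g., a heavy-tailed attribute such as the customer cone size
cone = [1, 1, 2, 10, 48000]
scaled = min_max_scale(log_transform(cone))
print(scaled)  # all values now lie in [0, 1]
```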
Graph preprocessing.
The AS graph contains a large number of leaf nodes (i.e., edge networks with a single upstream). These nodes are of limited interest in the ML downstream tasks we consider (see Section 3.2): for (i) link prediction, they only have a single link, and for (ii) node classification, the characteristics/classes we consider can be easily inferred for edge networks. Moreover, taking them into account would lead to a graph structure that is more challenging for a GNN or graph-ML model to capture. Hence, we preprocess the graph and remove all nodes with degree equal to one (and repeat this process two more times); the resulting graph has around 46K nodes and 434K edges.
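This pruning step can be sketched on a plain adjacency-set representation as follows (a toy illustration under our own naming; a real pipeline would more likely operate on the CAIDA AS-relationship data with a graph library such as networkx):

```python
def prune_leaves(adj, rounds=3):
    """Remove all degree-1 nodes, repeating the pass `rounds` times.

    adj: dict mapping node -> set of neighbors (undirected graph).
    """
    for _ in range(rounds):
        leaves = [n for n, nbrs in adj.items() if len(nbrs) == 1]
        for n in leaves:
            for m in adj[n]:        # detach the leaf from its neighbor
                adj[m].discard(n)
            del adj[n]
    return adj

# Toy chain 0-1-2-3-4: each round strips the current degree-1 endpoints,
# leaving only the (now isolated) middle node after three rounds.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
prune_leaves(adj)
print(sorted(adj))  # → [2]
```

Note that repeating the pass matters: removing a leaf can turn its neighbor into a new leaf, which is only caught on the next round.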
3 GNN BENCHMARKING METHODOLOGY
To benchmark GNNs on the compiled dataset, we use a set of GNN, graph-ML, and traditional ML models (Section 3.1), and design the downstream tasks on which the efficiency of the models will be tested (Section 3.2).
3.1 Models
GNN models: We consider three widely used GNN models.
GraphSAGE [15] learns a function (neural network) that generates embeddings for a node by sampling and aggregating node features from its local neighborhood. The embeddings capture both the local graph structure of a node and the feature distribution of its neighborhood.
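As a rough illustration of the aggregation idea, one GraphSAGE-style layer with a mean aggregator can be sketched as below (a toy pure-Python version with scalar weights of our own choosing; real implementations, e.g. `SAGEConv` in PyTorch Geometric, use learned weight matrices and neighbor sampling):

```python
def sage_layer(features, adj, w_self, w_neigh):
    """One mean-aggregator layer: h_v = ReLU(w_self*h_v + w_neigh*mean_{u in N(v)} h_u).

    features: dict node -> feature vector (list of floats)
    adj:      dict node -> list of neighbors
    """
    out = {}
    for v, h in features.items():
        nbrs = adj[v]
        # mean of the neighbors' feature vectors, elementwise
        mean = [sum(features[u][i] for u in nbrs) / len(nbrs)
                for i in range(len(h))]
        # combine self and neighborhood signals, then apply ReLU
        out[v] = [max(0.0, w_self * h[i] + w_neigh * mean[i])
                  for i in range(len(h))]
    return out

features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
adj = {0: [1, 2], 1: [0], 2: [0]}
emb = sage_layer(features, adj, w_self=0.5, w_neigh=0.5)
```

Each output embedding thus mixes a node's own features with a summary of its neighborhood, which is what lets the embeddings reflect both local structure and neighbor feature distributions.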