Benchmarking Graph Neural Networks for Internet Routing Data

Dimitrios P. Giakatos
dgiakatos@csd.auth.gr
Aristotle University of Thessaloniki
Greece
Soa Kostoglou
sokost@csd.auth.gr
Aristotle University of Thessaloniki
Greece
Pavlos Sermpezis
sermpezis@csd.auth.gr
Aristotle University of Thessaloniki
Greece
Athena Vakali
avakali@csd.auth.gr
Aristotle University of Thessaloniki
Greece
ABSTRACT
The Internet is composed of networks, called Autonomous Systems
(or, ASes), interconnected to each other, thus forming a large graph.
While the AS-graph is known and a multitude of data is available
for the ASes (i.e., node attributes), research on applying
graph machine learning (ML) methods to Internet data has not
attracted much attention. In this work, we provide a benchmarking
framework aiming to facilitate research on Internet data using
graph-ML and graph neural network (GNN) methods. Specifically,
we compile a dataset with heterogeneous node/AS attributes by
collecting data from multiple online sources, and preprocessing
them so that they can be easily used as input in GNN architectures.
Then, we create a framework/pipeline for applying GNNs on the
compiled data. For a set of tasks, we perform a benchmarking of
dierent GNN models (as well as, non-GNN ML models) to test
their eciency; our results can serve as a common baseline for
future research and provide initial insights for the application of
GNNs on Internet data.
1 INTRODUCTION
The Internet is a network of networks, which are called Autonomous
Systems (or, ASes). Today there exist more than 100k ASes originating
IP prefixes in the Internet routing table, which are connected
to each other through private or public peering links. Representing
the ASes as nodes and their interconnections as edges results in a
large and sparse (density <0.01%) graph.
Since ASes and their interconnections play a significant role
in network operations, Internet policies, routing optimization,
etc., there have been many efforts to characterize these networks.
Hence, there exist rich datasets with information about ASes (open
datasets [4, 5, 17, 29], self-declared databases [21], data from custom
measurements, etc.).
These datasets with AS attributes have been used by several
works employing (traditional) ML methodologies for various
applications [9, 10, 12, 16, 25, 31]. One would expect that with the advent
of Graph Neural Networks (GNNs) many works would exploit the
known AS-graph structure alongside the AS attributes to devise GNN-
based methodologies for problems related to Internet routing and
operations. However, there exist only a few efforts generating graph
embeddings [27, 28], and, in fact, they are not based on GNNs (but
on methods from the natural language processing field) and they
do not take into account the node attributes (but only the graph
structure).
While there can be many reasons behind this lack of GNN-based
works for Internet routing data (and it is out of our scope to inves-
tigate them), a main challenge for applying GNNs on Internet data
is that signicant expertise is needed in both domains: namely, a
researcher needs (i) rich Internet data and (ii) a good understanding
of advanced deep learning techniques and graph theory concepts.
On one hand, it may be straightforward for Internet researchers
to access sources of Internet data (which are typically well known
within this community), but it may be a more tedious task for
researchers of other domains (e.g., more focused on GNNs) to compile
a rich dataset that would be needed by a GNN architecture. On
the other hand, while there are widely used and well-documented
libraries (PyTorch Geometric [3], DGL [2], etc.) that have made access
to GNNs easy, there are many intricacies in the application of GNNs
to Internet data (e.g., imbalanced data, heavy-tailed distributions,
etc.), which render their efficient application a non-trivial task for
an Internet-focused researcher.
Motivated by the aforementioned observation, in this paper we
aim to facilitate research with GNNs on Internet data through the
following contributions:
Dataset:
We compile a rich dataset of Internet data that can be
used as input to GNN models (Section 2). Specifically, we collect
from multiple online sources a set of 19 AS attributes, including
both numerical and categorical variables. We then preprocess
the data and transform them to a format that is readily available
to be used as input to GNNs (e.g., all values normalized in [0,1]).
The compiled dataset not only oers easy access to researchers,
but it also serves as a benchmark dataset. The lack of benchmark
datasets, has been identied as a key barrier that challenge ML re-
search in networking applications [
8
]. Having a common dataset,
on which dierent ML approaches are applied and compared
(e.g., similarly to the ImageNet [
11
] and CIFAR-10 [
19
] datasets
in computer vision), can further boost GNN research on Internet
data.
GNN benchmarking & initial insights:
We test several GNN,
graph-ML, and (non-graph) ML models on the compiled dataset,
for several downstream tasks (Section 3). Our goal is not to propose
a specific GNN architecture, and thus we refrain from extensive
model optimization. Hence, we use a basic architecture and
hyperparameter tuning for all models, and we produce initial
results which can serve as a point of reference (e.g., baselines) for
arXiv:2210.14189v1 [cs.NI] 25 Oct 2022
Figure 1: Overall methodology pipeline.
future research. Our experimental results (Section 4) provide
initial insights about the efficiency of GNNs on Internet-data-related
tasks (e.g., the role of graph structure and/or node attributes for
different tasks), and reveal several challenges.
Open data and code:
We make publicly available the compiled
dataset and our code (using a popular GNN library [2, 33])¹ in [1].
2 DATASET
In this section, we present the data sources (Section 2.1) and the
preprocessing (Section 2.2) we applied to the data to generate the
compiled dataset. The overall methodology is depicted in Fig. 1.
2.1 Data sources
Each network or Autonomous System (AS) can be characterized by
a multitude of features, such as location, connectivity, traffic levels,
etc. We collect data from multiple online (public) data sources to
compile a dataset, which contains rich information for each
AS.
The rst three data sources are widely used by Internet re-
searchers and operators for multiple purposes:
CAIDA AS-rank [
4
]: various information about ASes, such as,
location, network size, topology, etc.
CAIDA AS-relationship [
5
]: a list of AS links (i.e., edges), which
are used to build the AS-graph.
PeeringDB [
7
,
21
]: online database, where network operators reg-
ister information about the connectivity, network types, trac,
etc., of their networks
We also use the following sources that provide data related to the
routing properties of ASes and their business types:
AS hegemony [23]
Country-level Transit Inuence (CTI) [6]
ASDB [29]
From the above sources, we collect the most relevant attributes
per AS, resulting in a dataset of 19 attributes/features (see Table 1
for the detailed list). For ease of analysis, in the online repository [1]
we also provide a visual exploratory data analysis with the detailed
distributions of all attributes.
2.2 Data preprocessing
The collected data are highly heterogeneous, including both
numerical and categorical attributes. Moreover, numerical attributes
take values in different ranges, and some of them span ranges several
orders of magnitude larger than others (see Table 1). Since it is
well known that non-homogeneous data values can impact the
performance of deep learning models, we need to preprocess the data.
In the following, we describe the transformation we apply to each
type of attribute to generate a dataset with normalized attributes
taking values in the interval [0,1].
¹As well as all the experimental results of the paper, for reproducibility purposes.
Categorical features.
For every categorical feature, one-hot en-
coding is applied. In the one-hot encoding technique, a new feature
is created for every value of the categorical feature. For example,
the "Location-continent" feature contains 6 values (Africa, Asia,
Europe, N. America, S. America, Oceania), which means that after
the one-hot encoding 6 new numerical columns are created; hence,
an AS located in Europe will have a value of 1 in the respective new
feature for Europe, and a value of 0 in the other 5 new features that
correspond to the other continents.
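The step above can be sketched in a few lines of plain Python (a toy illustration; the helper name and variable names are ours, not taken from the paper's code):

```python
# One-hot encoding sketch for the "Location-continent" attribute.
# The six continent values mirror the example in the text.

CONTINENTS = ["Africa", "Asia", "Europe", "N. America", "S. America", "Oceania"]

def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    return [1 if c == value else 0 for c in categories]

# An AS located in Europe: 1 in the Europe column, 0 in the other five.
europe_vector = one_hot("Europe", CONTINENTS)
```

In practice, library encoders such as scikit-learn's OneHotEncoder or pandas' get_dummies perform this transformation directly on a full attribute column.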
Numerical features.
As can be seen in Table 1, some numerical
attributes take values in very large ranges (e.g., the customer cone
of ASes spans from 1 to more than 48k ASNs). Also, for many of
these attributes the values for different ASes are not distributed
uniformly, but follow a heavy-tailed distribution (e.g., almost 95%
of ASes have a customer cone of 1 ASN). To alleviate this large
heterogeneity and variability of the numerical features, we perform
the following transformations.
First, for every numerical feature, except for the AS hegemony
and the CTI top features that only take values less than 1, we
apply a logarithmic transformation to decrease their variability,
as follows: x → log(x + 1).
Then, we normalize all numerical features according to the Min-Max
scaling method: x → (x − min(x)) / (max(x) − min(x)). As a result, all the
resulting values are in the range [0,1].
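The two transformations can be combined in a small sketch (plain Python; the function name and the example values are illustrative, not from the paper's code):

```python
import math

def normalize_feature(values, log_transform=True):
    """Apply x -> log(x + 1), then Min-Max scale to [0, 1].
    Set log_transform=False for features that already take values
    below 1 (e.g., AS hegemony, CTI top)."""
    if log_transform:
        values = [math.log(v + 1) for v in values]
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: map every value to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# A heavy-tailed, customer-cone-like feature spanning orders of magnitude:
scaled = normalize_feature([1, 1, 1, 10, 48000])
```

The log step compresses the tail before scaling, so a single huge value (e.g., a 48k customer cone) does not squash all other values toward zero.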
Graph preprocessing.
The AS graph contains a large number of
leaf nodes (i.e., edge networks with a single upstream). These nodes
are of limited interest in the ML downstream tasks we consider
(see Section 3.2), namely, for (i) link prediction: they only have a
single link, and (ii) node classication: the characteristics/classes
we consider can be easily inferred for edge networks. Moreover,
taking them into account would lead to a graph structure that is
more challenging to be captured by a GNN or graph-ML model.
Hence, we preprocess the graph and remove all nodes with degree
equal to one (and repeat this process two more times); the resulting
graph has around 46K nodes and 434K edges.
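The pruning can be sketched as repeated removal of degree-one nodes from an undirected edge set (a toy illustration under that representation, not the paper's code):

```python
from collections import defaultdict

def prune_degree_one(edges, passes=3):
    """Remove all nodes with degree 1, repeating the pass `passes`
    times in total (the paper performs the removal three times)."""
    edges = set(edges)
    for _ in range(passes):
        degree = defaultdict(int)
        for u, v in edges:
            degree[u] += 1
            degree[v] += 1
        leaves = {n for n, d in degree.items() if d == 1}
        if not leaves:
            break
        edges = {(u, v) for u, v in edges
                 if u not in leaves and v not in leaves}
    return edges

# A two-hop chain hanging off a triangle is pruned away over two passes:
pruned = prune_degree_one({(1, 2), (2, 3), (3, 1), (3, 4), (4, 5)})
```

Note that each pass can expose new degree-one nodes (removing node 5 leaves node 4 with degree one), which is why the removal is repeated; the repeated passes resemble the first iterations of a k-core (k=2) decomposition.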
3 GNN BENCHMARKING METHODOLOGY
To benchmark GNNs on the compiled dataset, we use a set of GNN,
graph-ML, and traditional ML models (Section 3.1), and design the
downstream tasks on which the eciency of the models will be
tested (Section 3.2).
3.1 Models
GNN models: We consider three widely used GNN models.
GraphSAGE [15] learns a function (neural network) that generates
embeddings for a node by sampling and aggregating node
features from its local neighborhood. The embeddings capture both
the local graph structure of a node and the feature distribution of
its neighborhood.
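The core aggregation idea can be illustrated with a stripped-down sketch: each node's new representation concatenates its own features with the mean of its neighbors' features (the real model additionally applies learned weight matrices, a nonlinearity, and neighborhood sampling, all omitted here; function and variable names are ours):

```python
def sage_mean_step(features, neighbors):
    """One simplified GraphSAGE-style layer: for each node, concatenate
    its own feature vector with the element-wise mean of its neighbors'
    feature vectors. Assumes every node has at least one neighbor."""
    out = {}
    for node, h in features.items():
        nbrs = neighbors[node]
        mean = [sum(features[n][i] for n in nbrs) / len(nbrs)
                for i in range(len(h))]
        out[node] = h + mean  # list concatenation: [self || mean(neighbors)]
    return out

# A 3-node path graph with 1-dimensional features:
features = {0: [1.0], 1: [0.0], 2: [1.0]}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
embeddings = sage_mean_step(features, neighbors)
```

Stacking several such layers lets information propagate over multi-hop neighborhoods, which is how the embeddings come to capture both local structure and neighborhood feature distributions.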