
D.P. Giakatos, S. Kostoglou, et al.
Figure 1: Overall methodology pipeline.
future research. Our experimental results (Section 4) provide initial insights about the efficiency of GNNs on Internet-data-related tasks (e.g., the role of graph structure and/or node attributes for different tasks), and reveal several challenges.
• Open data and code: We make publicly available the compiled dataset and our code (using a popular GNN library [2, 33])¹ in [1].
2 DATASET
In this section, we present the data sources (Section 2.1) and the
preprocessing (Section 2.2) we applied on the data to generate the
compiled dataset. The overall methodology is depicted in Fig. 1.
2.1 Data sources
Each network or Autonomous System (AS) can be characterized by a multitude of features, such as location, connectivity, traffic levels, etc. We collect data from multiple online (public) data sources to compile a dataset that contains multiple pieces of information for each AS.
The first three data sources are widely used by Internet researchers and operators for multiple purposes:
• CAIDA AS-rank [4]: various information about ASes, such as location, network size, topology, etc.
• CAIDA AS-relationship [5]: a list of AS links (i.e., edges), which are used to build the AS-graph.
• PeeringDB [7, 21]: an online database where network operators register information about the connectivity, network types, traffic, etc., of their networks.
We also use the following sources that provide data related to the
routing properties of ASes and their business types:
• AS hegemony [23]
• Country-level Transit Influence (CTI) [6]
• ASDB [29]
From the above sources, we collect the most relevant attributes per AS, resulting in a dataset of 19 attributes/features (see Table 1 for the detailed list). For ease of analysis, we also provide in the online repository [1] a visual exploratory data analysis with the detailed distributions of all attributes.
2.2 Data preprocessing
The collected data are highly heterogeneous, including both numerical and categorical attributes. Moreover, numerical attributes take values in different ranges, and some of them span ranges several orders of magnitude larger than others (see Table 1). Since it is well known that non-homogeneous data values can impact the performance of deep learning models, we need to preprocess the data. In the following, we describe the transformation we apply to each type of attribute to generate a dataset with normalized attributes taking values in the interval [0, 1].

¹As well as all the experimental results of the paper, for reproducibility purposes.
Categorical features.
For every categorical feature, we apply one-hot encoding: a new binary feature is created for every value of the categorical feature. For example, the "Location-continent" feature contains 6 values (Africa, Asia, Europe, N. America, S. America, Oceania), so after one-hot encoding 6 new numerical columns are created; hence, an AS located in Europe will have a value of 1 in the new feature for Europe, and a value of 0 in the 5 new features that correspond to the other continents.
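The continent example above can be sketched as follows (a minimal pure-Python illustration; real pipelines would typically use `pandas.get_dummies` or scikit-learn's `OneHotEncoder`, and the function name here is ours, not from the paper's released code):

```python
# Hypothetical one-hot encoding of the "Location-continent" feature.
CONTINENTS = ["Africa", "Asia", "Europe", "N. America", "S. America", "Oceania"]

def one_hot(continent):
    """Return a 6-element 0/1 vector with a single 1 at the continent's index."""
    return [1 if c == continent else 0 for c in CONTINENTS]

# An AS located in Europe gets 1 in the "Europe" column and 0 elsewhere.
print(one_hot("Europe"))  # → [0, 0, 1, 0, 0, 0]
```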
Numerical features.
As can be seen in Table 1, some numerical attributes take values in very large ranges (e.g., the customer cone of ASes spans from 1 to more than 48k ASNs). Also, for many of these attributes the values for different ASes are not distributed uniformly, but follow a heavy-tailed distribution (e.g., almost 95% of ASes have a customer cone of 1 ASN). To alleviate this large heterogeneity and variability of the numerical features, we perform the following transformations.
• First, for every numerical feature, except for the AS hegemony and the CTI top features that only take values less than 1, we apply a logarithmic transformation to decrease their variability, as follows: 𝑥 → log(𝑥 + 1).
• Then, we normalize all numerical features according to the Min-Max scaling method: 𝑥 → (𝑥 − min(𝑥)) / (max(𝑥) − min(𝑥)). As a result, all the resulting values are in the range [0, 1].
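The two-step transformation above can be sketched as follows (a minimal illustration; function names and the sample values are ours, not from the paper's released code):

```python
import math

def log_transform(values):
    """x -> log(x + 1), applied elementwise to reduce variability."""
    return [math.log(v + 1) for v in values]

def min_max_scale(values):
    """Min-Max scaling: x -> (x - min(x)) / (max(x) - min(x))."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# e.g., a heavy-tailed attribute such as the customer cone size
cone = [1, 1, 2, 10, 48000]
scaled = min_max_scale(log_transform(cone))
print(scaled)  # all values now lie in [0, 1]
```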
Graph preprocessing.
The AS graph contains a large number of leaf nodes (i.e., edge networks with a single upstream). These nodes are of limited interest in the ML downstream tasks we consider (see Section 3.2): for (i) link prediction, they only have a single link, and for (ii) node classification, the characteristics/classes we consider can be easily inferred for edge networks. Moreover, taking them into account would lead to a graph structure that is more challenging for a GNN or graph-ML model to capture. Hence, we preprocess the graph and remove all nodes with degree equal to one (and repeat this process two more times); the resulting graph has around 46K nodes and 434K edges.
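This pruning step can be sketched on a plain adjacency-set representation as follows (a toy illustration under our own naming; a real pipeline would more likely operate on the CAIDA AS-relationship data with a graph library such as networkx):

```python
def prune_leaves(adj, rounds=3):
    """Remove all degree-1 nodes, repeating the pass `rounds` times.

    adj: dict mapping node -> set of neighbors (undirected graph).
    """
    for _ in range(rounds):
        leaves = [n for n, nbrs in adj.items() if len(nbrs) == 1]
        for n in leaves:
            for m in adj[n]:        # detach the leaf from its neighbor
                adj[m].discard(n)
            del adj[n]
    return adj

# Toy chain 0-1-2-3-4: each round strips the current degree-1 endpoints,
# leaving only the (now isolated) middle node after three rounds.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
prune_leaves(adj)
print(sorted(adj))  # → [2]
```

Note that repeating the pass matters: removing a leaf can turn its neighbor into a new leaf, which is only caught on the next round.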
3 GNN BENCHMARKING METHODOLOGY
To benchmark GNNs on the compiled dataset, we use a set of GNN, graph-ML, and traditional ML models (Section 3.1), and design the downstream tasks on which the efficiency of the models will be tested (Section 3.2).
3.1 Models
GNN models: We consider three widely used GNN models.
GraphSAGE [15] learns a function (neural network) that generates embeddings for a node by sampling and aggregating node features from its local neighborhood. The embeddings capture both the local graph structure of a node and the feature distribution of its neighborhood.
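As a rough illustration of the aggregation idea, one GraphSAGE-style layer with a mean aggregator can be sketched as below (a toy pure-Python version with scalar weights of our own choosing; real implementations, e.g. `SAGEConv` in PyTorch Geometric, use learned weight matrices and neighbor sampling):

```python
def sage_layer(features, adj, w_self, w_neigh):
    """One mean-aggregator layer: h_v = ReLU(w_self*h_v + w_neigh*mean_{u in N(v)} h_u).

    features: dict node -> feature vector (list of floats)
    adj:      dict node -> list of neighbors
    """
    out = {}
    for v, h in features.items():
        nbrs = adj[v]
        # mean of the neighbors' feature vectors, elementwise
        mean = [sum(features[u][i] for u in nbrs) / len(nbrs)
                for i in range(len(h))]
        # combine self and neighborhood signals, then apply ReLU
        out[v] = [max(0.0, w_self * h[i] + w_neigh * mean[i])
                  for i in range(len(h))]
    return out

features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
adj = {0: [1, 2], 1: [0], 2: [0]}
emb = sage_layer(features, adj, w_self=0.5, w_neigh=0.5)
```

Each output embedding thus mixes a node's own features with a summary of its neighborhood, which is what lets the embeddings reflect both local structure and neighbor feature distributions.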