Bottleneck Analysis of Dynamic Graph
Neural Network Inference on CPU and GPU
Hanqiu Chen1, Yahya Alhinai1§, Yihan Jiang1§, Eunjee Na2§, Cong Hao1
{hchen799,yhinai3,yjiang400}@gatech.edu, jasmin7907@kaist.ac.kr, callie.hao@gatech.edu
1School of Electrical and Computer Engineering, Georgia Institute of Technology
2Korea Advanced Institute of Science & Technology
Abstract
Dynamic graph neural networks (DGNNs) are becoming
increasingly popular because of their widespread use in
capturing the dynamic features of the real world. A variety
of dynamic graph neural networks designed from algorith-
mic perspectives have succeeded in incorporating temporal
information into graph processing. Despite the promising
algorithmic performance, deploying DGNNs on hardware
presents additional challenges due to the model complexity,
diversity, and their time-dependent nature. Meanwhile,
the differences between DGNNs and static graph neural
networks make hardware-related optimizations for static
graph neural networks unsuitable for DGNNs. In this paper,
we select eight prevailing DGNNs with different characteristics
and profile them on both CPU and GPU. The profiling
results are summarized and analyzed, providing in-depth
insights into the bottlenecks of DGNNs on hardware and
identifying potential optimization opportunities for future
DGNN acceleration. Following a comprehensive survey, we
provide a detailed analysis of DGNN performance bottlenecks
on hardware, including temporal data dependency, workload
imbalance, data movement, and GPU warm-up. We suggest
several optimizations from both software and hardware
perspectives. This paper is the rst to provide an in-depth
analysis of the hardware performance of DGNN
. Code is
available at https:// github.com/sharc-lab/ DGNN_analysis.
1. Introduction
Deep neural networks (DNNs) have made tremendous
breakthroughs in various domains such as speech [1–
3], image [4–7], and natural language processing [8–10].
While DNNs can effectively capture the hidden patterns in
Euclidean data, they do not perform well in processing
non-Euclidean data presented in graph format, such as
social networks, recommendation systems, and knowledge
graphs. Facing these challenges of DNNs, researchers have
shown increasing interest in graph data processing. The
Graph Neural Network (GNN) is a powerful tool for processing
graphs. GNNs have shown the ability to capture the expressive
power of graphs in many research areas, including social
networks [11], physical systems [12], and new drug discovery [13].
Fig. 1(a-b) demonstrates how a social network can be represented
as a graph, and how GNNs can be applied to the graph. This setting
is called a static GNN, in which neither the graph nor the model
parameters change.

Author contributions: Hanqiu Chen profiled the EvolveGCN and TGAT models and
led the paper writing. Yahya Alhinai profiled the JODIE, TGN, and MolDGNN
models. Yihan Jiang profiled the DyRep and LDG models. Eunjee Na profiled the
ASTGNN model and prepared the codebase; the work was done during her
internship at Georgia Tech. §Equal contribution.
Although GNNs have strong representational power on
static graphs, in real life most graphs change over
time. For instance, as shown in Fig. 1 (c),
social networks are constantly growing as more people
are joining, and the edge weights are also changing as
the relationships between people evolve. In this case, the
graph topology changes over time, with nodes and edges
appearing and disappearing as time progresses. To better
represent complex graph structures that change over time,
dynamic graph neural networks (DGNNs) are a promising
solution. Fig. 1(c) illustrates an example of a dynamically
changing graph, and Fig. 1(d) is an example of a dynamic
GNN, which uses a graph structure encoder to capture
the spatial information and a time encoder to encode the
temporal information.
According to a recent survey [14], DGNNs can be divided
into two categories based on the temporal granularity of the
graph representation: (i) discrete time dynamic graph neural
networks (DTDGs) and (ii) continuous time dynamic graph
neural networks (CTDGs). A DTDG captures the status of a
changing graph at different time steps and uses the information
from multiple snapshots of the dynamic graph, together with
their temporal ordering, to gain a deeper understanding of the
graph features. A CTDG is event-based: it can update node and
edge embeddings at any time an edge between two nodes appears.
This update scheme is closer to real-world scenarios, where
interactions arrive as a continuous stream of events rather
than at fixed time steps.
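To make the two categories concrete, here is a minimal Python sketch (our own illustration, not code from any of the profiled models) of how a dynamic graph could be stored under each representation: a DTDG as a list of snapshots, and a CTDG as a time-ordered stream of edge events.

```python
# Hypothetical storage layouts for the two dynamic-graph representations.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DiscreteTimeDynamicGraph:
    """DTDG: the graph is observed as a sequence of full snapshots."""
    # One entry per time step: (edge_list, node_features).
    snapshots: List[Tuple[list, dict]] = field(default_factory=list)

    def add_snapshot(self, edge_list, node_features):
        self.snapshots.append((edge_list, node_features))


@dataclass
class ContinuousTimeDynamicGraph:
    """CTDG: the graph is observed as a time-ordered stream of edge events."""
    # One entry per interaction: (src, dst, timestamp).
    events: List[Tuple[int, int, float]] = field(default_factory=list)

    def add_event(self, src, dst, timestamp):
        self.events.append((src, dst, timestamp))

    def history_before(self, t):
        """All interactions observed up to time t; embeddings can be updated per event."""
        return [e for e in self.events if e[2] <= t]
```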
Although DGNNs have achieved great success from an
algorithmic perspective, their hardware performance is often
poor because of the lack of hardware-oriented optimization,
which leaves hardware bottlenecks unaddressed and results in
high latency and a low degree of parallelism. Motivated by
this software-hardware performance gap, this paper provides
a quantitative analysis of eight representative DGNN models
with different characteristics.
Figure 1: Introduction to graph neural networks (GNNs) and dynamic graph neural networks (DGNNs). (a) Static graph
modelling of a social network. (b) An example structure of a static GNN with message passing (MP) and node transformation
(NT). (c) Dynamic graph modelling of a social network varying with time. (d) A high-level architecture of a DGNN with a
graph structure encoder to capture spatial information and a time encoder to capture temporal information.
By profiling these models on both CPU and GPU, we aim to
identify the root causes of the hardware bottlenecks during
inference and to suggest further acceleration optimizations.
The models analyzed are: JODIE [15], EvolveGCN [16],
TGAT [17], MolDGNN [18], ASTGNN [19], DyRep [20],
TGN [21], and LDG [22].
To date, little research has been conducted on the
hardware performance of DGNNs. Thus, this paper provides
the rst comprehensive DGNN performance analysis on
hardware with detailed bottleneck analysis. Our contribu-
tions are summarized as follows:
1) Comprehensive review. We first provide an overview
of the background of GNNs and DGNNs. We then introduce
eight representative DGNNs with their high-level dataflow
and data dependency graphs to explain their computational
behavior, which largely impacts their hardware performance.
2) Hardware profiling. We use PyTorch Profiler [23] and
NVIDIA Nsight Systems [24] to profile the eight DGNN models
on both CPU (6226R) and GPU (A6000). PyTorch Profiler is
used to investigate the memory usage and execution time of
different modules in different DGNNs; NVIDIA Nsight Systems
is used for a detailed breakdown and analysis of CUDA and
kernel operations (a minimal profiling sketch is shown after
Table 1 below).
3) DGNN hardware bottleneck analysis. We showcase and
analyze the profiling results of DGNN inference on both CPU
and GPU. We identify four common hardware bottlenecks that
degrade DGNN hardware performance: (1) Temporal data
dependencies between different snapshots, node embeddings,
and edge embeddings limit parallelism on hardware, which
causes low GPU utilization and degrades the GPU acceleration
rate. (2) The time-evolving graph topology of DGNNs causes
irregular memory accesses and a high graph sampling overhead,
leading to a workload imbalance between CPU and GPU that
leaves CPU or GPU computation kernels idle and limits
hardware utilization. (3) The time-evolving graph topology
and the node and edge embeddings cause frequent data exchange
between CPU and GPU. (4) Before inference starts on the GPU,
the GPU needs to warm up because of model initialization and
memory allocation.
4) Proposal of promising optimizations. We propose several
optimizations that can help overcome the DGNN hardware
bottlenecks uncovered in this work. We classify them into
three categories according to their application scope:
(1) software optimizations, including reducing the graph
preprocessing overhead and implementing a smarter framework;
(2) hardware optimizations, including improving parallelism
on the GPU and reducing the data transfer overhead between
CPU and GPU; (3) hardware/software co-design, including
extending current static GNN hardware optimizations to DGNNs
and designing hardware-specific accelerators to improve DGNN
hardware efficiency.

TABLE 1: Summary of the DGNNs profiled in this work: the type of each DGNN
(discrete or continuous time), the time encoding method it uses, and examples
of the tasks it can be applied to.

JODIE [15]: continuous; time encoding: RNN; tasks: future interaction
prediction, state change prediction.
TGN [21]: continuous; time encoding: time embedding; tasks: future edge
prediction.
EvolveGCN [16]: discrete; time encoding: RNN; tasks: link prediction, node
classification, edge classification.
TGAT [17]: continuous; time encoding: time embedding; tasks: link prediction,
link classification.
ASTGNN [19]: discrete; time encoding: self-attention; tasks: traffic flow
prediction.
DyRep [20]: continuous; time encoding: RNN; tasks: dynamic link prediction,
time prediction.
LDG [22]: continuous; time encoding: RNN and self-attention; tasks: dynamic
link prediction.
MolDGNN [18]: discrete; time encoding: RNN; tasks: adjacency matrix prediction.
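As a concrete example of the measurement setup in contribution 2), the sketch below wraps a hypothetical DGNN inference loop with PyTorch Profiler; `model` and `snapshots` are placeholders rather than objects from our codebase, and Nsight Systems is attached externally by launching the script under `nsys profile`.

```python
# Hypothetical profiling harness; `model` and `snapshots` are placeholders for
# one of the profiled DGNNs and its preprocessed inputs.
import torch
from torch.profiler import profile, record_function, ProfilerActivity


def profile_inference(model, snapshots, device="cuda"):
    model = model.to(device).eval()

    # Warm-up pass: the first GPU iteration pays for model initialization and
    # memory allocation, one of the bottlenecks discussed in this paper.
    with torch.no_grad():
        model(snapshots[0].to(device))

    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        profile_memory=True,      # per-operator memory usage
        record_shapes=True,
    ) as prof:
        with torch.no_grad():
            for snap in snapshots:                    # temporal loop runs sequentially
                with record_function("snapshot_inference"):
                    model(snap.to(device))            # CPU-to-GPU copy plus forward pass

    # Per-operator breakdown of execution time and memory.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```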
The rest of the paper is organized as follows. Section 2
claries the current development of hardware accelerators
for GNNs and outlines the motivation of this paper. Sec-
tion 3 provides detailed information about the data ow
and data dependency graph for eight DGNN models for
proling. Section 4 presents proling results with bottleneck
analysis. Section 5 proposes various optimizations in terms
of software, hardware, or a combination of both software
and hardware (software/hardware co-design) in order to
address the bottlenecks associated with the graph topology
of DGNNs. Section 6 summarizes the work presented in
this paper, points out the limitations of this research, and
discusses future directions.
2. Motivations
Aiming to gain orders-of-magnitude performance im-
provement, various GNN accelerators have been proposed
to handle the computation- and communication-intensive
GNN challenges targeting different hardware platforms
including CPUs, GPUs, ASICs, FPGAs, and heterogeneous
architectures. To speed up the sparse matrix multiplication
(SpMM) used for embedding transformation, which is ubiquitous
in GNNs, AWB-GCN [25], I-GCN [26], and BoostGCN [27] have
been proposed. To address the irregular memory accesses in
message passing, GraphACT [28], HyGCN [29], and VersaGNN [30]
have been proposed; these methods exploit data locality or
partition the graph on the CPU.
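For reference, the SpMM at the core of embedding transformation can be written in a few lines of PyTorch; this is a generic illustration of the operation these accelerators target, not code taken from any of them.

```python
# Aggregation in a GCN layer expressed as SpMM: H_next = A_sparse @ (H @ W).
import torch

num_nodes, in_dim, out_dim = 5, 8, 4
# Sparse adjacency matrix in COO format (edges 0->1, 1->2, 2->0, 3->4).
indices = torch.tensor([[0, 1, 2, 3], [1, 2, 0, 4]])
values = torch.ones(indices.shape[1])
adj = torch.sparse_coo_tensor(indices, values, (num_nodes, num_nodes))

H = torch.randn(num_nodes, in_dim)    # node embeddings
W = torch.randn(in_dim, out_dim)      # layer weights

H_next = torch.sparse.mm(adj, H @ W)  # the irregular, memory-bound SpMM step
print(H_next.shape)                   # torch.Size([5, 4])
```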
Although GNN accelerators have achieved state-of-the-art
hardware performance in recent years, they are only applicable
to static graphs. Some DGNN hardware accelerators [31, 32]
have been proposed recently; however, several challenges that
dynamic graphs introduce to DGNN hardware deployment remain
unaddressed. We believe the hardware analysis of DGNN models
we present in this paper is meaningful and significant for
two reasons:
• Most existing DGNNs are designed from the algorithmic
perspective: they focus only on harnessing network structures
and temporal encoding methods to improve the performance of
DGNNs on specific tasks. The lack of attention to hardware
efficiency induces huge computational overheads, low hardware
parallelism, and high energy consumption. Identifying the
hardware bottlenecks of existing DGNNs helps researchers avoid
algorithmic inefficiency and potentially leads to the design
of hardware-compatible DGNNs in the future.
• Hardware optimizations in current GNN accelerators
primarily focus on matrix multiplications and achieve
significant performance improvements when processing static
graphs. However, dynamic graph processing must take temporal
data dependencies into account, which brings new challenges:
previous optimizations for static graph processing do not
behave well in the presence of temporal data dependencies,
and new parallel computing schemes are needed to overcome
them.
3. Background
Graph Neural Networks (GNNs) are neural networks
capable of leveraging the structure and properties of graphs
in graph-based tasks. GNNs take graphs, endowed with node
and edge features, as inputs and perform the computation
based on these features and the graph structure. Prevailing
GNN models follow the message passing mechanism, which
includes two major stages: message passing and node
transformation. In each layer of the GNN, these two steps
are applied to the input graph in turn. Fig. 1 (b)
demonstrates the message passing procedure for a single
node 0 at one of the layers. In the message passing stage,
node 0 aggregates features from adjacent nodes using a
self-defined or learned aggregator, such as mean or max.
In the node transformation stage, the feature of node 0 is
updated by a linear layer or a multi-layer perceptron (MLP).
There are many variations of GNNs; a more detailed
introduction can be found in recent surveys [33, 34].

Figure 2: The computation flow of (a) EvolveGCN and (b) TGAT.
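To make the two stages concrete, the following minimal PyTorch sketch of one GNN layer (our own simplification, not the implementation of any profiled model) performs mean-aggregation message passing followed by an MLP-based node transformation.

```python
# One GNN layer: message passing (mean aggregation) then node transformation (MLP).
import torch
import torch.nn as nn


class SimpleGNNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.transform = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())

    def forward(self, x, edge_index):
        # x: [num_nodes, in_dim]; edge_index: [2, num_edges] of (src, dst) pairs.
        src, dst = edge_index
        agg = torch.zeros_like(x)
        agg.index_add_(0, dst, x[src])                      # sum incoming neighbor features
        deg = torch.bincount(dst, minlength=x.size(0)).clamp(min=1)
        agg = agg / deg.unsqueeze(-1)                       # mean aggregation (message passing)
        return self.transform(agg)                          # node transformation


# Example: 4 nodes, 16-dim features, edges 0->1, 1->2, 2->3.
x = torch.randn(4, 16)
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])
out = SimpleGNNLayer(16, 32)(x, edge_index)
print(out.shape)  # torch.Size([4, 32])
```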
Dynamic Graph Neural Networks (DGNNs) are
GNNs used to process the temporal information in dynamic
graph-structured data, as shown in Fig. 1 (c) and (d).
What separates DGNNs from static GNNs is the additional time
encoder, such as a Recurrent Neural Network (RNN) or
Time2Vec [35], which encodes the time-varying features of
the graph.
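As an illustration, a Time2Vec-style encoder fits in a few lines; this is our own sketch of the general form in [35], not the exact module used by the profiled DGNNs.

```python
# Time2Vec-style time encoder: one linear ("trend") component plus periodic components.
import torch
import torch.nn as nn


class Time2Vec(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Parameter(torch.randn(dim))  # learnable frequencies
        self.b = nn.Parameter(torch.randn(dim))  # learnable phases

    def forward(self, t):
        # t: [batch] of timestamps -> [batch, dim] time embeddings.
        v = t.unsqueeze(-1) * self.w + self.b
        # The first component stays linear; the remaining ones pass through sin().
        return torch.cat([v[..., :1], torch.sin(v[..., 1:])], dim=-1)


emb = Time2Vec(8)(torch.tensor([0.0, 1.5, 3.0]))  # shape [3, 8]
```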
DGNNs are specialized with various features, such as
different graph structures and time encoding methods, to
achieve the best performance in individual real-world
applications. In this section, we present a brief description
of the eight representative DGNNs together with their dataflow
and data dependency graphs, and summarize their key features
in Table 1, including the type of each DGNN, which parts of
the graph change with time, its time encoding method, and the
tasks it can be applied to.
3.1. EvolveGCN
EvolveGCN [16] is a discrete time dynamic GNN. It can
be applied to link prediction and node/edge classification
tasks in social networks, where the graph topology and
embedding are changing over time.
EvolveGCN takes a snapshot of the whole graph at
each time step to process temporal subgraphs using a
graph convolution network (GCN), and meanwhile adopts a
recurrent neural network (RNN), e.g., gated recurrent unit
(GRU) [36] and long short-term memory (LSTM) [37], to
regulate the network parameters. There are two versions
of EvolveGCN: EvolveGCN-H and EvolveGCN-O, and their
architectures are shown in Fig. 2 (a). In each layer, the RNN
and the GNN are executed sequentially because the GNN needs
the weights renewed by the RNN.
The difference between the -H and -O versions is that in the
-O version the input of the RNN is the GNN weights, while in
the -H version both the GNN weights and the node embeddings
are inputs of the RNN. Since the dimension of the node
embedding matrix is inconsistent with that of the GCN weight
matrix, an additional module (top-k) samples node embeddings
to ensure dimension consistency.
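The sketch below illustrates the EvolveGCN-O update pattern described above (a simplified illustration, not the authors' implementation): at every snapshot a GRU evolves the GCN weight matrix before the graph convolution is applied, which is why the RNN and the GNN must run sequentially.

```python
# Simplified EvolveGCN-O-style layer: a GRU evolves the GCN weights per snapshot.
import torch
import torch.nn as nn


class EvolveGCNOLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(in_dim, out_dim) * 0.1)  # initial GCN weights
        # Each column of W (a vector of size in_dim) is one element of the GRU batch.
        self.gru = nn.GRUCell(in_dim, in_dim)

    def forward(self, snapshots):
        # snapshots: list of (adj, x) pairs, processed strictly in temporal order.
        W = self.W
        outputs = []
        for adj, x in snapshots:
            # Evolve the weights first: the previous weights serve as both the GRU
            # input and its hidden state in this simplification.
            W = self.gru(W.t(), W.t()).t()
            outputs.append(torch.relu(adj @ x @ W))  # GCN propagation with evolved weights
        return outputs


# Example: three snapshots of a 4-node graph with 8-dim node features.
adj = torch.eye(4)
snaps = [(adj, torch.randn(4, 8)) for _ in range(3)]
outs = EvolveGCNOLayer(8, 16)(snaps)
print(outs[0].shape)  # torch.Size([4, 16])
```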
3.2. TGAT
The Temporal Graph Attention Network (TGAT) [17] is relevant
for applications that predict the future behavior of nodes in
a temporal graph and detect changes in node behavior over
time. Examples of such applications include social networks,
transportation networks, and biological networks.
TGAT is inspired, first, by the self-attention mechanism,
which it uses to aggregate information from neighboring
nodes, and, second, by the classical Bochner's theorem of
harmonic analysis, which it uses to encode the temporal
information. TGAT generates the time embedding based on two
assumptions: the critical temporal information is revealed in
the relative time value, and this value is concatenated with
the node embedding as the input of the attention layer to
renew the target nodes' embeddings.
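A minimal sketch of this attention step is shown below; it is our own simplification (single-head attention and a cosine time encoder) rather than TGAT's exact functional time encoding and multi-head implementation.

```python
# Simplified TGAT-style aggregation: relative-time encodings are concatenated
# with neighbor embeddings, and attention updates the target node's embedding.
import torch
import torch.nn as nn


class SimpleTemporalAttention(nn.Module):
    def __init__(self, node_dim, time_dim):
        super().__init__()
        self.freq = nn.Parameter(torch.randn(time_dim))  # time-encoder frequencies
        d = node_dim + time_dim
        self.attn = nn.MultiheadAttention(d, num_heads=1, batch_first=True)
        self.out = nn.Linear(d, node_dim)

    def time_enc(self, dt):
        # Bochner-style functional time encoding: cosine features of relative time.
        return torch.cos(dt.unsqueeze(-1) * self.freq)

    def forward(self, target_emb, target_t, neigh_emb, neigh_t):
        # target_emb: [node_dim]; neigh_emb: [k, node_dim]; neigh_t: [k] timestamps.
        q = torch.cat([target_emb, self.time_enc(torch.tensor(0.0))])          # relative time 0
        kv = torch.cat([neigh_emb, self.time_enc(target_t - neigh_t)], dim=-1)
        q, kv = q.view(1, 1, -1), kv.unsqueeze(0)                              # add batch dims
        h, _ = self.attn(q, kv, kv)                   # attend over sampled temporal neighbors
        return self.out(h.view(-1))                   # renewed target node embedding


# Example: one target node (time 5.0) with three sampled neighbors.
layer = SimpleTemporalAttention(node_dim=16, time_dim=8)
out = layer(torch.randn(16), torch.tensor(5.0),
            torch.randn(3, 16), torch.tensor([1.0, 2.0, 4.5]))
print(out.shape)  # torch.Size([16])
```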