Bottleneck Analysis of Dynamic Graph
Neural Network Inference on CPU and GPU
Hanqiu Chen1, Yahya Alhinai1§, Yihan Jiang1§, Eunjee Na2§, Cong Hao1
{hchen799,yhinai3,yjiang400}@gatech.edu, jasmin7907@kaist.ac.kr, callie.hao@gatech.edu
1School of Electrical and Computer Engineering, Georgia Institute of Technology
2Korea Advanced Institute of Science & Technology
Abstract
Dynamic graph neural networks (DGNNs) are becoming
increasingly popular because of their widespread use in
capturing the dynamic features of the real world. A variety
of dynamic graph neural networks designed from algorith-
mic perspectives have succeeded in incorporating temporal
information into graph processing. Despite the promising
algorithmic performance, deploying DGNNs on hardware
presents additional challenges due to the model complexity,
diversity, and their time-dependent nature. Meanwhile,
the differences between DGNNs and static graph neural
networks make hardware-related optimizations for static
graph neural networks unsuitable for DGNNs. In this paper,
we select eight prevailing DGNNs with different characteristics
and profile them on both CPU and GPU. The profiling
results are summarized and analyzed, providing in-depth
insights into the bottlenecks of DGNNs on hardware and
identifying potential optimization opportunities for future
DGNN acceleration. Following a comprehensive survey, we
provide a detailed analysis of DGNN performance bottlenecks
on hardware, including temporal data dependency, workload
imbalance, data movement, and GPU warm-up. We suggest
several optimizations from both software and hardware
perspectives. This paper is the rst to provide an in-depth
analysis of the hardware performance of DGNN
. Code is
available at https:// github.com/sharc-lab/ DGNN_analysis.
1. Introduction
Deep neural networks (DNNs) have made tremendous
breakthroughs in various domains such as speech [1–
3], image [4–7], and natural language processing [8–10].
While DNNs can effectively capture the hidden patterns in
Euclidean data, they do not perform well in processing
non-Euclidean data presented in graph format, such as
social networks, recommendation systems, and knowledge
graphs. Facing these challenges of DNNs, researchers have
shown increasing interest in graph data processing. The
Graph Neural Network (GNN) is a powerful tool for processing
graphs. GNNs have shown the ability to capture the expressive
power of graphs in many research areas, including social
networks [11], physical systems [12], and new drug discovery [13].
Fig. 1(a-b) demonstrates how a social network can be represented
as a graph, and how GNNs can be applied to the graph. This setting
is called a static GNN, in which neither the graph nor the model
parameters change.

Author contributions: Hanqiu Chen profiled the EvolveGCN and TGAT models and
led the paper writing. Yahya Alhinai profiled the JODIE, TGN, and MolDGNN
models. Yihan Jiang profiled the DyRep and LDG models. Eunjee Na profiled the
ASTGNN model and prepared the codebase; the work was done during her
internship at Georgia Tech. §Equal contribution.
Although GNNs have strong representational power on
static graphs, in real life most graphs change over
time. For instance, as shown in Fig. 1 (c),
social networks are constantly growing as more people
are joining, and the edge weights are also changing as
the relationships between people evolve. In this case, the
graph topology changes over time, with nodes and edges
appearing and disappearing as time progresses. To better
represent complex graph structures that change over time,
dynamic graph neural networks (DGNNs) are a promising
solution. Fig. 1(c) illustrates an example of a dynamically
changing graph, and Fig. 1(d) is an example of a dynamic
GNN, which uses a graph structure encoder to capture
the spatial information and a time encoder to encode the
temporal information.
According to a recent survey [14], DGNNs can be divided
into two categories based on the temporal granularity of the
graph representation: (i) discrete time dynamic graph neural
networks (DTDGs) and (ii) continuous time dynamic graph
neural networks (CTDGs). A DTDG captures the status of a
changing graph at different time steps and uses the information
from multiple snapshots of the dynamic graph, together with
their temporal ordering, to gain a deeper understanding of the
graph features. A CTDG is event-based: it can update node and
edge embeddings at any time an edge between two nodes appears.
This update scheme is closer to real-world scenarios, where
interactions arrive as a continuous stream of events rather
than at fixed time steps.
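To make the two categories concrete, here is a minimal Python sketch (our own illustration, not code from any of the profiled models) of how a dynamic graph could be stored under each representation: a DTDG as a list of snapshots, and a CTDG as a time-ordered stream of edge events.

```python
# Hypothetical storage layouts for the two dynamic-graph representations.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DiscreteTimeDynamicGraph:
    """DTDG: the graph is observed as a sequence of full snapshots."""
    # One entry per time step: (edge_list, node_features).
    snapshots: List[Tuple[list, dict]] = field(default_factory=list)

    def add_snapshot(self, edge_list, node_features):
        self.snapshots.append((edge_list, node_features))


@dataclass
class ContinuousTimeDynamicGraph:
    """CTDG: the graph is observed as a time-ordered stream of edge events."""
    # One entry per interaction: (src, dst, timestamp).
    events: List[Tuple[int, int, float]] = field(default_factory=list)

    def add_event(self, src, dst, timestamp):
        self.events.append((src, dst, timestamp))

    def history_before(self, t):
        """All interactions observed up to time t; embeddings can be updated per event."""
        return [e for e in self.events if e[2] <= t]
```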
Although DGNNs have achieved great success from an
algorithmic perspective, their hardware performance is often
poor because of the lack of hardware-oriented optimization,
which leaves hardware bottlenecks unaddressed and results in
high latency and a low degree of parallelism. Motivated by
this software-hardware performance gap, this paper provides
a quantitative analysis of eight representative DGNN models
with different characteristics.
Figure 1: Introduction to graph neural networks (GNNs) and dynamic graph neural networks (DGNNs). (a) Static graph
modelling of a social network. (b) An example structure of a static GNN with message passing (MP) and node transformation
(NT). (c) Dynamic graph modelling of a social network varying with time. (d) A high-level architecture of a DGNN with a
graph structure encoder to capture spatial information and a time encoder to capture temporal information.
By profiling these models on both CPU and GPU, we aim to
identify the root causes of the hardware bottlenecks during
inference and to suggest further acceleration optimizations.
The models analyzed are: JODIE [15], EvolveGCN [16],
TGAT [17], MolDGNN [18], ASTGNN [19], DyRep [20],
TGN [21], and LDG [22].
To date, little research has been conducted on the
hardware performance of DGNNs. Thus, this paper provides
the rst comprehensive DGNN performance analysis on
hardware with detailed bottleneck analysis. Our contribu-
tions are summarized as follows:
1) Comprehensive review. We first provide an overview
of the background of GNNs and DGNNs. We then introduce
eight representative DGNNs with their high-level dataflow
and data dependency graphs to explain their computational
behavior, which largely impacts their hardware performance.
2) Hardware profiling. We use PyTorch Profiler [23] and
NVIDIA Nsight Systems [24] to profile the eight DGNN models
on both CPU (6226R) and GPU (A6000). PyTorch Profiler is
used to investigate the memory usage and execution time of
different modules in different DGNNs; NVIDIA Nsight Systems
is used for a detailed breakdown and analysis of CUDA and
kernel operations (a minimal profiling sketch is shown after
Table 1 below).
3) DGNN hardware bottleneck analysis. We showcase and
analyze the profiling results of DGNN inference on both CPU
and GPU. We identify four common hardware bottlenecks that
degrade DGNN hardware performance: (1) Temporal data
dependencies between different snapshots, node embeddings,
and edge embeddings limit parallelism on hardware, which
causes low GPU utilization and degrades the GPU acceleration
rate. (2) The time-evolving graph topology of DGNNs causes
irregular memory accesses and a high graph sampling overhead,
leading to a workload imbalance between CPU and GPU that
leaves CPU or GPU computation kernels idle and limits
hardware utilization. (3) The time-evolving graph topology
and the node and edge embeddings cause frequent data exchange
between CPU and GPU. (4) Before inference starts on the GPU,
the GPU needs to warm up because of model initialization and
memory allocation.
4) Proposal of promising optimizations. We propose several
optimizations that can help overcome the DGNN hardware
bottlenecks uncovered in this work. We classify them into
three categories according to their application scope:
(1) software optimizations, including reducing the graph
preprocessing overhead and implementing a smarter framework;
(2) hardware optimizations, including improving parallelism
on the GPU and reducing the data transfer overhead between
CPU and GPU; (3) hardware/software co-design, including
extending current static GNN hardware optimizations to DGNNs
and designing hardware-specific accelerators to improve DGNN
hardware efficiency.

TABLE 1: Summary of the DGNNs profiled in this work: the type of each DGNN
(discrete or continuous time), the time encoding method it uses, and examples
of the tasks it can be applied to.

JODIE [15]: continuous; time encoding: RNN; tasks: future interaction
prediction, state change prediction.
TGN [21]: continuous; time encoding: time embedding; tasks: future edge
prediction.
EvolveGCN [16]: discrete; time encoding: RNN; tasks: link prediction, node
classification, edge classification.
TGAT [17]: continuous; time encoding: time embedding; tasks: link prediction,
link classification.
ASTGNN [19]: discrete; time encoding: self-attention; tasks: traffic flow
prediction.
DyRep [20]: continuous; time encoding: RNN; tasks: dynamic link prediction,
time prediction.
LDG [22]: continuous; time encoding: RNN and self-attention; tasks: dynamic
link prediction.
MolDGNN [18]: discrete; time encoding: RNN; tasks: adjacency matrix prediction.
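As a concrete example of the measurement setup in contribution 2), the sketch below wraps a hypothetical DGNN inference loop with PyTorch Profiler; `model` and `snapshots` are placeholders rather than objects from our codebase, and Nsight Systems is attached externally by launching the script under `nsys profile`.

```python
# Hypothetical profiling harness; `model` and `snapshots` are placeholders for
# one of the profiled DGNNs and its preprocessed inputs.
import torch
from torch.profiler import profile, record_function, ProfilerActivity


def profile_inference(model, snapshots, device="cuda"):
    model = model.to(device).eval()

    # Warm-up pass: the first GPU iteration pays for model initialization and
    # memory allocation, one of the bottlenecks discussed in this paper.
    with torch.no_grad():
        model(snapshots[0].to(device))

    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        profile_memory=True,      # per-operator memory usage
        record_shapes=True,
    ) as prof:
        with torch.no_grad():
            for snap in snapshots:                    # temporal loop runs sequentially
                with record_function("snapshot_inference"):
                    model(snap.to(device))            # CPU-to-GPU copy plus forward pass

    # Per-operator breakdown of execution time and memory.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```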
The rest of the paper is organized as follows. Section 2
claries the current development of hardware accelerators
for GNNs and outlines the motivation of this paper. Sec-
tion 3 provides detailed information about the data ow
and data dependency graph for eight DGNN models for
proling. Section 4 presents proling results with bottleneck
analysis. Section 5 proposes various optimizations in terms
of software, hardware, or a combination of both software
and hardware (software/hardware co-design) in order to
address the bottlenecks associated with the graph topology
of DGNNs. Section 6 summarizes the work presented in
this paper, points out the limitations of this research, and
discusses future directions.
2. Motivations
Aiming to gain orders-of-magnitude performance im-
provement, various GNN accelerators have been proposed
to handle the computation- and communication-intensive
GNN challenges targeting different hardware platforms
including CPUs, GPUs, ASICs, FPGAs, and heterogeneous
architectures. To speed up the sparse matrix multiplication
(SpMM) used for embedding transformation, which is ubiquitous
in GNNs, AWB-GCN [25], I-GCN [26], and BoostGCN [27] have
been proposed. To address the irregular memory accesses in
message passing, GraphACT [28], HyGCN [29], and VersaGNN [30]
have been proposed; these methods exploit data locality or
partition the graph on the CPU.
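For reference, the SpMM at the core of embedding transformation can be written in a few lines of PyTorch; this is a generic illustration of the operation these accelerators target, not code taken from any of them.

```python
# Aggregation in a GCN layer expressed as SpMM: H_next = A_sparse @ (H @ W).
import torch

num_nodes, in_dim, out_dim = 5, 8, 4
# Sparse adjacency matrix in COO format (edges 0->1, 1->2, 2->0, 3->4).
indices = torch.tensor([[0, 1, 2, 3], [1, 2, 0, 4]])
values = torch.ones(indices.shape[1])
adj = torch.sparse_coo_tensor(indices, values, (num_nodes, num_nodes))

H = torch.randn(num_nodes, in_dim)    # node embeddings
W = torch.randn(in_dim, out_dim)      # layer weights

H_next = torch.sparse.mm(adj, H @ W)  # the irregular, memory-bound SpMM step
print(H_next.shape)                   # torch.Size([5, 4])
```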
Although GNN accelerators have achieved state-of-the-art
hardware performance in recent years, they are only applicable
to static graphs. Some DGNN hardware accelerators [31, 32]
have been proposed recently; however, several challenges that
dynamic graphs introduce to DGNN hardware deployment remain
unaddressed. We believe the hardware analysis of DGNN models
we present in this paper is meaningful and significant for
two reasons:
• Most existing DGNNs are designed from the algorithmic
perspective: they focus only on harnessing network structures
and temporal encoding methods to improve the performance of
DGNNs on specific tasks. The lack of attention to hardware
efficiency induces huge computational overheads, low hardware
parallelism, and high energy consumption. Identifying the
hardware bottlenecks of existing DGNNs helps researchers avoid
algorithmic inefficiency and potentially leads to the design
of hardware-compatible DGNNs in the future.
• Hardware optimizations in current GNN accelerators
primarily focus on matrix multiplications and achieve
significant performance improvements when processing static
graphs. However, dynamic graph processing must take temporal
data dependencies into account, which brings new challenges:
previous optimizations for static graph processing do not
behave well in the presence of temporal data dependencies,
and new parallel computing schemes are needed to overcome
them.
3. Background
Graph Neural Networks (GNNs) are neural networks
capable of leveraging the structure and properties of graphs
in graph-based tasks. GNNs take graphs, endowed with node
and edge features, as inputs and perform the computation
based on these features and the graph structure. Prevailing
GNN models follow the message passing mechanism, which
includes two major stages: message passing and node
transformation. In each layer of the GNN, these two steps
are applied to the input graph in turn. Fig. 1 (b)
demonstrates the message passing procedure for a single
node 0 at one of the layers. In the message passing stage,
node 0 aggregates features from adjacent nodes using a
self-defined or learned aggregator, such as mean or max.
In the node transformation stage, the feature of node 0 is
updated by a linear layer or a multi-layer perceptron (MLP).
There are many variations of GNNs; a more detailed
introduction can be found in recent surveys [33, 34].

Figure 2: The computation flow of (a) EvolveGCN and (b) TGAT.
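To make the two stages concrete, the following minimal PyTorch sketch of one GNN layer (our own simplification, not the implementation of any profiled model) performs mean-aggregation message passing followed by an MLP-based node transformation.

```python
# One GNN layer: message passing (mean aggregation) then node transformation (MLP).
import torch
import torch.nn as nn


class SimpleGNNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.transform = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())

    def forward(self, x, edge_index):
        # x: [num_nodes, in_dim]; edge_index: [2, num_edges] of (src, dst) pairs.
        src, dst = edge_index
        agg = torch.zeros_like(x)
        agg.index_add_(0, dst, x[src])                      # sum incoming neighbor features
        deg = torch.bincount(dst, minlength=x.size(0)).clamp(min=1)
        agg = agg / deg.unsqueeze(-1)                       # mean aggregation (message passing)
        return self.transform(agg)                          # node transformation


# Example: 4 nodes, 16-dim features, edges 0->1, 1->2, 2->3.
x = torch.randn(4, 16)
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])
out = SimpleGNNLayer(16, 32)(x, edge_index)
print(out.shape)  # torch.Size([4, 32])
```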
Dynamic Graph Neural Networks (DGNNs) are
GNNs used to process the temporal information in dynamic
graph-structured data, as shown in Fig. 1 (c) and (d).
What separates DGNNs from static GNNs is the additional time
encoder, such as a Recurrent Neural Network (RNN) or
Time2Vec [35], which encodes the time-varying features of
the graph.
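As an illustration, a Time2Vec-style encoder fits in a few lines; this is our own sketch of the general form in [35], not the exact module used by the profiled DGNNs.

```python
# Time2Vec-style time encoder: one linear ("trend") component plus periodic components.
import torch
import torch.nn as nn


class Time2Vec(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Parameter(torch.randn(dim))  # learnable frequencies
        self.b = nn.Parameter(torch.randn(dim))  # learnable phases

    def forward(self, t):
        # t: [batch] of timestamps -> [batch, dim] time embeddings.
        v = t.unsqueeze(-1) * self.w + self.b
        # The first component stays linear; the remaining ones pass through sin().
        return torch.cat([v[..., :1], torch.sin(v[..., 1:])], dim=-1)


emb = Time2Vec(8)(torch.tensor([0.0, 1.5, 3.0]))  # shape [3, 8]
```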
DGNNs are specialized with various features, such as
different graph structures and time encoding methods, to
achieve the best performance in individual real-world
applications. In this section, we present a brief description
of the eight representative DGNNs together with their dataflow
and data dependency graphs, and summarize their key features
in Table 1, including the type of each DGNN, which parts of
the graph change with time, its time encoding method, and the
tasks it can be applied to.
3.1. EvolveGCN
EvolveGCN [16] is a discrete time dynamic GNN. It can
be applied to link prediction and node/edge classification
tasks in social networks, where the graph topology and
embedding are changing over time.
EvolveGCN takes a snapshot of the whole graph at
each time step to process temporal subgraphs using a
graph convolution network (GCN), and meanwhile adopts a
recurrent neural network (RNN), e.g., gated recurrent unit
(GRU) [36] and long short-term memory (LSTM) [37], to
regulate the network parameters. There are two versions
of EvolveGCN: EvolveGCN-H and EvolveGCN-O, and their
architectures are shown in Fig. 2 (a). In each layer, the RNN
and the GNN are executed sequentially because the GNN needs
the weights renewed by the RNN.
The difference between the -H and -O versions is that in the
-O version the input of the RNN is the GNN weights, while in
the -H version both the GNN weights and the node embeddings
are inputs of the RNN. Since the dimension of the node
embedding matrix is inconsistent with that of the GCN weight
matrix, an additional module (top-k) samples node embeddings
to ensure dimension consistency.
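The sketch below illustrates the EvolveGCN-O update pattern described above (a simplified illustration, not the authors' implementation): at every snapshot a GRU evolves the GCN weight matrix before the graph convolution is applied, which is why the RNN and the GNN must run sequentially.

```python
# Simplified EvolveGCN-O-style layer: a GRU evolves the GCN weights per snapshot.
import torch
import torch.nn as nn


class EvolveGCNOLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(in_dim, out_dim) * 0.1)  # initial GCN weights
        # Each column of W (a vector of size in_dim) is one element of the GRU batch.
        self.gru = nn.GRUCell(in_dim, in_dim)

    def forward(self, snapshots):
        # snapshots: list of (adj, x) pairs, processed strictly in temporal order.
        W = self.W
        outputs = []
        for adj, x in snapshots:
            # Evolve the weights first: the previous weights serve as both the GRU
            # input and its hidden state in this simplification.
            W = self.gru(W.t(), W.t()).t()
            outputs.append(torch.relu(adj @ x @ W))  # GCN propagation with evolved weights
        return outputs


# Example: three snapshots of a 4-node graph with 8-dim node features.
adj = torch.eye(4)
snaps = [(adj, torch.randn(4, 8)) for _ in range(3)]
outs = EvolveGCNOLayer(8, 16)(snaps)
print(outs[0].shape)  # torch.Size([4, 16])
```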
3.2. TGAT
The Temporal Graph Attention Network (TGAT) [17] is relevant
for applications that predict the future behavior of nodes in
a temporal graph and detect changes in node behavior over
time. Examples of such applications include social networks,
transportation networks, and biological networks.
TGAT is inspired, first, by the self-attention mechanism,
which it uses to aggregate information from neighboring
nodes, and, second, by the classical Bochner's theorem of
harmonic analysis, which it uses to encode the temporal
information. TGAT generates the time embedding based on two
assumptions: the critical temporal information is revealed in
the relative time value, and this value is concatenated with
the node embedding as the input of the attention layer to
renew the target nodes' embeddings.
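A minimal sketch of this attention step is shown below; it is our own simplification (single-head attention and a cosine time encoder) rather than TGAT's exact functional time encoding and multi-head implementation.

```python
# Simplified TGAT-style aggregation: relative-time encodings are concatenated
# with neighbor embeddings, and attention updates the target node's embedding.
import torch
import torch.nn as nn


class SimpleTemporalAttention(nn.Module):
    def __init__(self, node_dim, time_dim):
        super().__init__()
        self.freq = nn.Parameter(torch.randn(time_dim))  # time-encoder frequencies
        d = node_dim + time_dim
        self.attn = nn.MultiheadAttention(d, num_heads=1, batch_first=True)
        self.out = nn.Linear(d, node_dim)

    def time_enc(self, dt):
        # Bochner-style functional time encoding: cosine features of relative time.
        return torch.cos(dt.unsqueeze(-1) * self.freq)

    def forward(self, target_emb, target_t, neigh_emb, neigh_t):
        # target_emb: [node_dim]; neigh_emb: [k, node_dim]; neigh_t: [k] timestamps.
        q = torch.cat([target_emb, self.time_enc(torch.tensor(0.0))])          # relative time 0
        kv = torch.cat([neigh_emb, self.time_enc(target_t - neigh_t)], dim=-1)
        q, kv = q.view(1, 1, -1), kv.unsqueeze(0)                              # add batch dims
        h, _ = self.attn(q, kv, kv)                   # attend over sampled temporal neighbors
        return self.out(h.view(-1))                   # renewed target node embedding


# Example: one target node (time 5.0) with three sampled neighbors.
layer = SimpleTemporalAttention(node_dim=16, time_dim=8)
out = layer(torch.randn(16), torch.tensor(5.0),
            torch.randn(3, 16), torch.tensor([1.0, 2.0, 4.5]))
print(out.shape)  # torch.Size([16])
```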