Efficient Graph Neural Network Inference at Large Scale
Xinyi Gao
The University of Queensland
xinyi.gao@uq.edu.au
Wentao Zhang
Mila - Québec AI Institute
HEC Montréal
wentao.zhang@mila.quebec
Yingxia Shao
Beijing University of Posts and
Telecommunications
shaoyx@bupt.edu.cn
Quoc Viet Hung Nguyen
Griffith University
henry.nguyen@griffith.edu.au
Bin Cui
Peking University
bin.cui@pku.edu.cn
Hongzhi Yin
The University of Queensland
h.yin1@uq.edu.au
ABSTRACT
Graph neural networks (GNNs) have demonstrated excellent performance in a wide range of applications. However, the enormous size of large-scale graphs hinders their application in real-time inference scenarios. Although existing scalable GNNs leverage linear propagation to preprocess the features and accelerate the training and inference procedure, these methods still suffer from scalability issues when making inferences on unseen nodes, as the feature preprocessing requires the graph to be known and fixed. To speed up inference in the inductive setting, we propose a novel adaptive propagation order approach that generates a personalized propagation order for each node based on its topological information. This successfully avoids redundant computation in feature propagation. Moreover, the trade-off between accuracy and inference latency can be flexibly controlled by simple hyper-parameters to match the different latency constraints of application scenarios. To compensate for the potential loss in inference accuracy, we further propose Inception Distillation, which exploits multi-scale receptive field information to improve inference performance. Extensive experiments are conducted on four public datasets with different scales and characteristics, and the results show that our proposed inference acceleration framework outperforms the SOTA graph inference acceleration baselines in terms of both accuracy and efficiency. In particular, the advantage of our proposed method is more significant on larger-scale datasets, and our framework achieves a 75× inference speedup on the largest dataset, Ogbn-products.
1 INTRODUCTION
Developing a graph neural network (GNN) for very large graphs has drawn increasing attention due to the powerful expressiveness of GNNs and their enormous success in many industrial applications [11, 29, 33]. Although GNNs provide a universal framework to tackle various downstream tasks, running the model on large-scale industrial graphs suffers from heavy computation and high latency. This severely limits their application in latency-sensitive scenarios. For example, recommender systems designed for streaming sessions must perform real-time inference on user-item interaction graphs [1, 24, 31, 34]. Fraud and spam detection tasks require millisecond-level inference on million-scale graphs to identify malicious users and prevent losses for the victim users [18, 21, 30]. In some computer vision applications, GNNs are designed for 3D point cloud data and deployed on edge devices such as self-driving cars to perform object detection or semantic segmentation tasks [17, 23, 25]. In such scenarios, real-time inference response is essential.
The root cause of the heavy computation and high latency of GNNs is the neighbor explosion problem. Generally, most GNNs adopt the message-passing pipeline and construct the model from feature propagation and transformation procedures. By executing 𝑘 steps of feature propagation, the 𝑘-th order propagated features capture node information from 𝑘-hop neighborhoods. Especially in large-scale and sparsely labeled graphs, multiple propagation layers are needed to aggregate enough label information from distant neighbors under the message-passing pipeline [12, 28, 40, 43, 47]. However, as the number of propagation layers increases, the number of supporting nodes grows exponentially, which directly incurs the high computation cost of feature propagation.
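To make this growth concrete, the toy sketch below (our own illustration, not code from the paper; the adjacency-list representation and function name are hypothetical) counts the supporting nodes a single target node requires for 𝑘 hops of propagation:

```python
# Toy illustration of the neighbor explosion problem: count how many
# supporting nodes one target node needs for k hops of propagation.
from collections import deque

def num_supporting_nodes(adj: dict, target, k: int) -> int:
    """Size of the k-hop neighborhood of `target` (including itself)."""
    seen, queue = {target}, deque([(target, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, depth + 1))
    return len(seen)

# With average degree d, this count grows roughly as d**k, so each extra
# propagation layer multiplies the cost of inferring a single node.
```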
To mitigate the expensive computation resulting from feature propagation, several linear propagation-based GNNs [4, 6, 32, 45, 46, 49, 52], e.g., SGC, were proposed to remove the non-linearity between feature propagation steps and aggregate node features during the preprocessing procedure. Instead of performing feature propagation during each training epoch, this time-consuming process is executed only once in linear propagation-based GNNs. As a result, the time complexity of model training is significantly reduced, and the training of these models scales well with graph size. However, linear propagation-based GNNs still struggle with efficient inference at scale because the preprocessing of feature propagation is based on the premise that the graph is known and fixed. This strong premise severely limits real-world applications, and more practical scenarios require inference on unseen nodes, where feature propagation has to be executed online. In addition, these existing methods adopt a fixed propagation order for all nodes. Due to the complex topological structures, the fixed propagation order restricts the flexibility of exploiting multi-scale receptive fields and also tends to over-smooth the high-degree nodes, leading to wasted computation and performance degradation.
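As a concrete illustration, SGC-style linear propagation can be sketched as follows (a minimal sketch in SciPy, with a hypothetical helper name; it is not the reference implementation of any cited method):

```python
import numpy as np
import scipy.sparse as sp

def precompute_propagated_features(adj: sp.csr_matrix, X: np.ndarray,
                                   k: int) -> np.ndarray:
    """Run k steps of symmetric-normalized feature propagation offline."""
    n = adj.shape[0]
    adj_tilde = adj + sp.eye(n, format="csr")        # add self-loops
    deg = np.asarray(adj_tilde.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
    A_hat = d_inv_sqrt @ adj_tilde @ d_inv_sqrt      # D^(-1/2) (A+I) D^(-1/2)
    for _ in range(k):
        X = A_hat @ X                                # X^(t) = A_hat X^(t-1)
    return X  # training and inference then reduce to a plain MLP on X
```

For a node unseen at preprocessing time, this 𝑘-step product must instead be evaluated online over the node's 𝑘-hop neighborhood, which is exactly where the inference cost reappears.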
To this end, we propose to reduce the redundant computation of feature propagation to further accelerate the inference of scalable GNNs. Specifically, we design a plug-and-play technique: Node-Adaptive Inference (NAI), which introduces a node-wise adaptive propagation order (or propagation depth) to customize the propagation order for each node. By measuring the distance between the current feature and the stationary state, the smoothing status of the propagated feature is evaluated. We then introduce simple global hyper-parameters to adaptively determine the propagation order for each node and efficiently trade off between inference latency and accuracy.
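The per-node stopping rule can be pictured with the following sketch. It assumes a degree-based stationary state (the dominant eigenvector of the symmetric-normalized adjacency, a standard result in the over-smoothing literature), and the threshold tau and cap k_max stand in for the global hyper-parameters; this is a sketch of the idea, not the exact NAI criterion:

```python
import numpy as np
import scipy.sparse as sp

def node_adaptive_propagate(A_hat: sp.csr_matrix, X: np.ndarray,
                            deg: np.ndarray, tau: float,
                            k_max: int) -> np.ndarray:
    """Stop propagating each node once it is close to its stationary state."""
    w = np.sqrt(deg + 1.0)                 # dominant eigenvector of A_hat
    # Rank-1 stationary state that repeated propagation converges toward.
    X_inf = np.outer(w / (w ** 2).sum(), w @ X)
    H, out = X.copy(), X.copy()
    done = np.zeros(X.shape[0], dtype=bool)
    for _ in range(k_max):
        H = A_hat @ H                      # one more propagation step
        out[~done] = H[~done]              # undecided nodes take the deeper feature
        dist = np.linalg.norm(H - X_inf, axis=1)
        done |= dist < tau                 # smooth enough: freeze this node's depth
        if done.all():
            break
    return out
```

In this sketch the feature matrix is still propagated densely for simplicity; the point is that each node reads its output at a personalized depth, so high-degree nodes, which converge toward the stationary state quickly, skip the deepest and most expensive hops.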