SA-MLP: Distilling Graph Knowledge from GNNs into
Structure-Aware MLP
Jie Chen1, Shouzhen Chen1, Mingyuan Bai1,2, Junbin Gao2, Junping Zhang1, Jian Pu3
1Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University, China
2Discipline of Business Analytics, The University of Sydney Business School, The University of Sydney, Australia
3Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, China
{chenj19, chensz19, jpzhang, jianpu}@fudan.edu.cn, yvonne.mingyuanbai@gmail.com, junbin.gao@sydney.edu.au
ABSTRACT
The message-passing mechanism helps Graph Neural Networks (GNNs) achieve remarkable results on various node classification tasks. Nevertheless, the recursive node fetching and aggregation in message-passing cause inference latency when deploying GNNs to large-scale graphs. One promising direction for inference acceleration is to distill GNNs into message-passing-free student multi-layer perceptrons (MLPs). However, an MLP student cannot fully learn the structure knowledge due to the lack of structure inputs, which causes inferior performance in heterophily and inductive scenarios. To address this, we intend to inject structure information into MLP-like students in a low-latency and interpretable way. Specifically, we first design a Structure-Aware MLP (SA-MLP) student that encodes both features and structures without message-passing. Then, we introduce a novel structure-mixing knowledge distillation strategy to enhance the ability of MLPs to learn structure information. Furthermore, we design a latent structure embedding approximation technique with two-stage distillation for inductive scenarios. Extensive experiments on eight benchmark datasets under both transductive and inductive settings show that our SA-MLP can consistently outperform the teacher GNNs while maintaining inference as fast as MLPs. The source code of our work can be found at https://github.com/JC-202/SA-MLP.
CCS CONCEPTS
• Computing methodologies → Neural networks; • Information systems → Deep web.
KEYWORDS
Graph Neural Networks, Knowledge Distillation, Node Classification
ACM Reference Format:
Jie Chen1, Shouzhen Chen1, Mingyuan Bai1,2, Junbin Gao2, Junping Zhang1, Jian Pu3. 2018. SA-MLP: Distilling Graph Knowledge from GNNs into Structure-Aware MLP. In Proceedings of Conference acronym 'XX (Conference acronym 'XX). ACM, New York, NY, USA, 10 pages. https://doi.org/XXXXXXX.XXXXXXX
Figure 1: An overview of our distillation framework. The teacher GNN performs inference with recursive neighbor expansion and aggregation, while the structure-aware MLP-like student learns from the GNN via a structure-mixing knowledge distillation strategy and achieves substantially faster inference without message-passing.
1 INTRODUCTION
Graph Neural Networks (GNNs) [10, 20] have recently emerged as a powerful class of deep learning architectures to analyze graph datasets in diverse domains such as social networks [29], traffic networks [35] and recommendation systems [34]. Most GNNs follow a message-passing mechanism [9] that extracts graph knowledge by iteratively aggregating neighborhood information to build node representations. However, the number of neighbors of each node grows exponentially as the number of layers increases [37, 41]. Hence, as illustrated by the yellow node in the upper part of Figure 1, the recursive neighbor fetching induced by message-passing leads to inference latency, making GNNs hard to deploy for latency-constrained applications that require fast inference, especially on large-scale graphs.
Common inference acceleration methods, such as pruning [44] and quantization [32, 42], can speed up GNNs to some extent by reducing the Multiplication-and-ACcumulation (MAC) operations. However, they are still limited by the recursive aggregation of message-passing. Knowledge distillation (KD) is another generic neural network learning paradigm for deployment that transfers
knowledge from high-performance but resource-intensive teacher models to resource-efficient students [13]. Motivated by the promising results of MLP-like models in computer vision [23, 24, 26], graph-less neural networks (GLNN) [41] were proposed to transfer graph knowledge from GNNs to standard MLP students via KD. This idea improves the performance of MLPs on node classification while remaining easy to deploy in production systems because message-passing is discarded.
However, traditional MLPs cannot fully capture graph knowledge due to the lack of structure inputs. As shown in GLNN [41], distillation may fail when node labels are highly correlated with structure information, e.g., on heterophily datasets, and the improvement from distillation is mainly attributed to the strong memorization ability of MLPs [31]. To better understand the limitations of GLNN, we consider two scenarios, i.e., transductive (trans) and inductive (ind), according to the graph information available during the training phase. In the trans setting [20], all node features and graph structures are given at training time, and student MLPs can overfit (memorize) the teachers' outputs on all nodes, which leads to superior performance [41]. Nevertheless, in the more challenging ind scenario [10], where test node information is unavailable at the training stage, an MLP without graph dependency has limited generalizability to these test nodes. In this scenario, the structure information is a key clue that binds the training and test nodes together to improve generalization. Furthermore, the standard logit-based KD [13], which merely considers label information on existing nodes, cannot fully transfer the graph knowledge due to the sparsity of the graph structure [21].
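For reference, the standard logit-based KD mentioned above matches the teacher's and student's class distributions node by node. The sketch below is a minimal PyTorch rendering of that objective; the function name and temperature value are illustrative choices, not taken from [13] or from this paper.

```python
import torch.nn.functional as F

def logit_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Standard logit-based distillation: KL divergence between the
    temperature-softened teacher and student distributions per node."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # 'batchmean' averages the KL over nodes; t**2 keeps the gradient scale stable
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)
```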
To address the above problems, we intend to inject structure information into MLPs in a low-latency and interpretable way. To this end, as shown in Figure 1, we first design a simple yet effective and interpretable Structure-Aware MLP (SA-MLP) student model (Section 4.1) that encodes both node features and structure information. Specifically, SA-MLP decouples the feature and structure information of each node with two encoders and utilizes an adaptive fusion decoder to generate the final prediction. All modules of SA-MLP are implemented as MLPs, and the structure inputs are batches of rows of the sparse adjacency matrix, so the model benefits from both mini-batch training and faster inference. Second, we propose a novel structure-mixing knowledge distillation strategy (Section 4.2) based on the mixup [40] technique to improve the ability of SA-MLP to learn structure information from GNNs. It generates virtual mixed structure samples and the corresponding teacher outputs to increase the distillation density for structure knowledge. Compared with standard logit-based distillation, this strategy is more appropriate for SA-MLP because it reduces structure sparsity. Third, for the ind scenario without connection, e.g., new users on Twitter who have not yet interacted with anyone else, we propose an implicit structure embedding approximation technique with a two-stage distillation procedure to enhance generalization ability (Section 4.3).
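To make the student architecture concrete, the following is a minimal PyTorch sketch of a structure-aware MLP in the spirit of Section 4.1. The layer sizes, the gated fusion rule, and all names (SAMLPSketch, feat_encoder, struct_encoder) are our own illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class SAMLPSketch(nn.Module):
    """Illustrative structure-aware MLP: one MLP encodes node features,
    another encodes the node's adjacency-matrix row (its local structure),
    and an adaptive fusion decoder combines both for the final prediction."""

    def __init__(self, num_nodes, feat_dim, hidden_dim, num_classes):
        super().__init__()
        self.feat_encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim))
        self.struct_encoder = nn.Sequential(
            nn.Linear(num_nodes, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim))
        self.gate = nn.Linear(2 * hidden_dim, 2)   # adaptive fusion weights
        self.decoder = nn.Linear(hidden_dim, num_classes)

    def forward(self, x, adj_row):
        hf = self.feat_encoder(x)                  # feature embedding
        hs = self.struct_encoder(adj_row)          # structure embedding
        alpha = torch.softmax(self.gate(torch.cat([hf, hs], dim=-1)), dim=-1)
        h = alpha[..., :1] * hf + alpha[..., 1:] * hs   # adaptive fusion
        return self.decoder(h)
```

Because the structure input of a node is simply its row of the adjacency matrix, a mini-batch of nodes can be classified with plain matrix multiplications and no recursive neighbor fetching at inference time.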
We conduct extensive experiments on eight public benchmark
datasets under trans and ind scenarios. The results show that the
learned SA-MLP can achieve performance similar to or even better than that of the teacher GNNs in all scenarios, with substantially faster inference.
Furthermore, we also conduct an in-depth analysis to investigate the
compatibility, interpretability, and efficiency of the learned SA-MLP.
To summarize, this work has the following main contributions:
• We propose SA-MLP, a message-passing-free model that has low-latency inference and a more interpretable prediction process, and that naturally preserves the feature/structure information distilled from GNNs.
• We design a mixup-based structure-mixing knowledge distillation strategy that improves the performance and structure awareness of SA-MLP via KD (see the sketch after this list).
• For the missing-structure scenario, we propose a latent structure embedding approximation technique and a two-stage distillation procedure to enhance generalization ability.
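As a rough illustration of the second contribution, the sketch below mixes pairs of structure inputs and the corresponding teacher soft targets with a shared mixup coefficient and trains the student on the resulting virtual samples. The Beta prior, the decision to mix node features as well, and the KL loss form are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def structure_mixing_step(student, x, adj_rows, teacher_probs, alpha=1.0):
    """One illustrative structure-mixing distillation step.
    `teacher_probs` holds the teacher's softmax outputs for the batch."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[perm]                  # mixed features (assumption)
    adj_mix = lam * adj_rows + (1 - lam) * adj_rows[perm]  # mixed structure rows
    target_mix = lam * teacher_probs + (1 - lam) * teacher_probs[perm]
    log_pred = F.log_softmax(student(x_mix, adj_mix), dim=-1)
    return F.kl_div(log_pred, target_mix, reduction="batchmean")
```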
2 PRELIMINARY
2.1 Notation and Problem Setting
Consider a graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ with $N$ nodes and $E$ edges. Let $\mathbf{A}\in\mathbb{R}^{N\times N}$ be the adjacency matrix, with $\mathbf{A}_{i,j}=1$ if edge $(i,j)\in\mathcal{E}$, and $0$ otherwise. Let $\mathbf{D}\in\mathbb{R}^{N\times N}$ be the diagonal degree matrix. Each node $v_i$ is given a $d$-dimensional feature representation $\mathbf{x}_i$ and a $c$-dimensional one-hot class label $\mathbf{y}_i$. The feature inputs are then formed by $\mathbf{X}\in\mathbb{R}^{N\times d}$, and the labels are represented by $\mathbf{Y}\in\mathbb{R}^{N\times c}$. The labeled and unlabeled node sets are denoted as $\mathcal{V}_L$ and $\mathcal{V}_U$, and we have $\mathcal{V}=\mathcal{V}_L\cup\mathcal{V}_U$.
The task of node classification is to predict the labels $\mathbf{Y}$ by exploiting the node features $\mathbf{X}$ and the graph structure $\mathbf{A}$. The goal of our paper is to learn an MLP-like student such that the learned MLP achieves performance similar to or even better than that of a GNN trained on the same training set, with a much lower computational cost at inference time.
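As a quick illustration of this setup, the snippet below materializes the objects defined above for a toy graph; all values are made up.

```python
import torch

# Toy graph: N = 4 nodes, d = 3 features, c = 2 classes.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
N, d, c = 4, 3, 2

A = torch.zeros(N, N)
for i, j in edges:                       # undirected graph: A is symmetric
    A[i, j] = A[j, i] = 1.0
D = torch.diag(A.sum(dim=1))             # diagonal degree matrix
X = torch.randn(N, d)                    # node feature matrix
Y = torch.eye(c)[torch.tensor([0, 1, 0, 1])]   # one-hot label matrix
labeled_idx = torch.tensor([0, 2])       # V_L; the remaining nodes form V_U
```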
2.2 Graph Neural Networks
Most existing GNNs follow the message-passing paradigm, which combines node feature transformation with information aggregation from connected neighbors over the graph structure [9]. The general $k$-th layer graph convolution for a node $v_i$ can be formulated as
$$\mathbf{h}_i^{(k)} = f\left(\mathbf{h}_i^{(k-1)}, \left\{\mathbf{h}_j^{(k-1)} : j \in \mathcal{N}(v_i)\right\}\right), \qquad (1)$$
where the representation $\mathbf{h}_i$ is updated iteratively in each layer by collecting messages from the node's neighbors, denoted as $\mathcal{N}(v_i)$. The graph convolution operator $f$ is usually implemented as a weighted sum of node representations according to the adjacency matrix $\mathbf{A}$, as in GCN [20] and GraphSAGE [10], or via the attention mechanism in GAT [33]. However, this recursive expansion and aggregation of neighbors causes inference latency, because the number of neighbors to fetch increases exponentially with the number of layers [37, 41].
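For concreteness, the layer below is one possible instance of Eq. (1), using mean aggregation over neighbors (a GraphSAGE-style choice of $f$). It is a generic illustration, not the specific teacher architecture used in this paper.

```python
import torch
import torch.nn as nn

class MeanAggregationLayer(nn.Module):
    """One message-passing layer in the form of Eq. (1): each node combines
    its own representation with the mean of its neighbors' representations."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(2 * in_dim, out_dim)

    def forward(self, H, A):
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)   # avoid division by zero
        neigh = (A @ H) / deg                           # mean over N(v_i)
        return torch.relu(self.linear(torch.cat([H, neigh], dim=-1)))
```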
The objective function for training GNNs is the cross-entropy between the ground-truth labels $\mathbf{Y}$ and the output of the network $\hat{\mathbf{Y}}\in\mathbb{R}^{N\times c}$:
$$\mathcal{L}_{CE}(\hat{\mathbf{Y}}_L, \mathbf{Y}_L) = -\sum_{i\in\mathcal{V}_L}\sum_{j=1}^{c} \mathbf{Y}_{ij} \ln \hat{\mathbf{Y}}_{ij}. \qquad (2)$$
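In code, Eq. (2) is a cross-entropy evaluated only on the labeled node set; a minimal PyTorch rendering (assuming integer class indices rather than one-hot label vectors) is:

```python
import torch.nn.functional as F

def gnn_training_loss(logits, labels, labeled_idx):
    """Eq. (2): cross-entropy between ground-truth labels and network outputs,
    restricted to the labeled nodes V_L. `logits` are unnormalized scores;
    F.cross_entropy applies the softmax internally."""
    return F.cross_entropy(logits[labeled_idx], labels[labeled_idx])
```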
2.3 Transductive and Inductive Scenarios
Although the idealized transductive setting is the one most commonly studied for node classification, it cannot account for unseen nodes in real applications. We therefore consider node classification under three settings to give a broad evaluation of models: transductive (trans), inductive with connection (ind w/c), and inductive without connection (ind w/o c), as shown in Figure 2. For trans, models can utilize all node