knowledge from high-performance but resource-intensive teacher models to resource-efficient students [13]. Motivated by the promising results of MLP-like models in computer vision [23, 24, 26],
graph-less neural networks (GLNN) [41] were proposed to transfer
the graph knowledge from GNNs to standard MLP students via KD.
This idea improves the performance of MLPs on node classification while remaining easy to deploy in production systems, since message passing is discarded.
However, traditional MLPs cannot fully capture graph knowledge due to the lack of structure inputs. As shown in GLNN [41], distillation may fail when node labels are highly correlated with structure information, e.g., on heterophilous datasets. Hence, the improvement from distillation is mainly attributed to the strong memorization ability of MLPs [31].
To better understand the limitations of GLNN, we consider two scenarios, i.e., transductive (trans) and inductive (ind), according to the graph information available during the training phase. In the trans setting [20], all node features and graph structures are given at training time, so student MLPs can overfit (memorize) the teacher's outputs on all nodes, which leads to superior performance [41]. Nevertheless, in the more challenging ind scenario [10], where test node information is unavailable during training, an MLP without graph dependency has limited generalizability on these test nodes. In this scenario, the structure information is a key clue that binds the training and test nodes to improve generalization. Furthermore, the standard logit-based KD [13], which merely considers label information on existing nodes, cannot fully transfer the graph knowledge due to the sparsity of the graph structure [21].
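For reference, the standard logit-based KD objective [13] only matches each student prediction to the corresponding teacher prediction, softened by a temperature $\tau$; a common form (up to a task-loss term and weighting, notation ours) is
$$\mathcal{L}_{KD} = \tau^2 \sum_{i \in \mathcal{V}} \mathrm{KL}\big(\mathrm{softmax}(\mathbf{z}_i^t / \tau) \,\|\, \mathrm{softmax}(\mathbf{z}_i^s / \tau)\big),$$
where $\mathbf{z}_i^t$ and $\mathbf{z}_i^s$ are the teacher and student logits of node $v_i$. Because this loss is defined per node, it carries no explicit signal about the sparse structure around each node.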
To address the above problems, we intend to inject the structure
information into MLPs in a low-latency and interpretable way. To
this end, as shown in Figure 1, we first design a simple yet effective
and interpretable Structure-Aware MLP (SA-MLP) student model
(Section 4.1) to encode both node features and structure informa-
tion. Specifically, SA-MLP decouples the feature and structure information of each node with two encoders, and utilizes an adaptive fusion decoder to generate the final prediction. All modules of SA-MLP are implemented by MLPs, and the structure inputs are batches of rows of the sparse adjacency matrix. Hence, it benefits from both mini-batch training and faster inference. Second,
we propose a novel structure-mixing knowledge distillation strategy (Section 4.2) based on the mixup [40] technique to improve SA-MLP's ability to learn structure information from GNNs. It generates virtual samples by mixing structure inputs and the corresponding teacher outputs, which increases the distillation density for structure knowledge. Compared with standard logit-based distillation, our strategy is more suitable for SA-MLP because it reduces structure sparsity. Third, for the ind scenario without connection, e.g., newly registered users on Twitter who have not yet interacted with anyone, we propose an implicit structure embedding approximation technique with a two-stage distillation procedure to enhance the generalization ability (Section 4.3).
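To make the design concrete, the following is a minimal PyTorch-style sketch of an SA-MLP forward pass and a structure-mixing distillation step as described above; all names (SAMLP, the gating-based fusion, structure_mixing_kd_loss, and the Beta-mixup parameters) are illustrative assumptions rather than the exact implementation in Section 4.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SAMLP(nn.Module):
    """Minimal sketch of SA-MLP; layer sizes and fusion details are illustrative."""
    def __init__(self, feat_dim, num_nodes, hidden, num_classes):
        super().__init__()
        # Two MLP encoders: one for node features, one for adjacency rows.
        self.feat_enc = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden))
        self.struct_enc = nn.Sequential(nn.Linear(num_nodes, hidden), nn.ReLU(),
                                        nn.Linear(hidden, hidden))
        # Adaptive fusion decoder: gate the two embeddings, then classify.
        self.gate = nn.Linear(2 * hidden, hidden)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, x, a_row):
        # x: [B, feat_dim] node features; a_row: [B, num_nodes] adjacency rows.
        h_f = self.feat_enc(x)
        h_s = self.struct_enc(a_row)
        alpha = torch.sigmoid(self.gate(torch.cat([h_f, h_s], dim=-1)))
        h = alpha * h_f + (1.0 - alpha) * h_s   # adaptive fusion of the two views
        return self.classifier(h)

def structure_mixing_kd_loss(student, x, a_row, teacher_logits, tau=1.0, beta=1.0):
    """Mix pairs of adjacency rows and the matching teacher outputs (sketch).

    Only the structure inputs and teacher outputs are mixed here; whether node
    features are also mixed is an implementation detail not fixed by this sketch.
    """
    lam = torch.distributions.Beta(beta, beta).sample().item()
    perm = torch.randperm(a_row.size(0))
    a_mix = lam * a_row + (1.0 - lam) * a_row[perm]                 # virtual structure
    t_mix = lam * teacher_logits + (1.0 - lam) * teacher_logits[perm]
    s_logits = student(x, a_mix)
    # Soft-label distillation on the mixed (virtual) samples.
    return F.kl_div(F.log_softmax(s_logits / tau, dim=-1),
                    F.softmax(t_mix / tau, dim=-1),
                    reduction="batchmean") * tau * tau

In this sketch the structure encoder consumes a dense copy of an adjacency row of length N; the exact input formatting, fusion mechanism, and loss weighting of SA-MLP may differ.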
We conduct extensive experiments on eight public benchmark
datasets under trans and ind scenarios. The results show that the
learned SA-MLP can achieve similar or even better performance than
teacher GNNs in all scenarios, with substantially faster inference.
Furthermore, we also conduct an in-depth analysis to investigate the
compatibility, interpretability, and efficiency of the learned SA-MLP.
To summarize, this work has the following main contributions:
• We propose a message-passing-free model, SA-MLP, which has low-latency inference and a more interpretable prediction process, and naturally preserves the feature/structure information after distillation from GNNs.
• We design a mixup-based structure-mixing knowledge distillation strategy that improves the performance and structure awareness of SA-MLP via KD.
• For the missing-structure scenario, we propose a latent structure embedding approximation technique and a two-stage distillation to enhance the generalization ability.
2 PRELIMINARY
2.1 Notation and Problem Setting
Consider a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with $N$ nodes and $E$ edges. Let $\mathbf{A} \in \mathbb{R}^{N \times N}$ be the adjacency matrix, with $\mathbf{A}_{i,j} = 1$ if edge $(i, j) \in \mathcal{E}$, and 0 otherwise. Let $\mathbf{D} \in \mathbb{R}^{N \times N}$ be the diagonal degree matrix. Each node $v_i$ is given a $d$-dimensional feature representation $\mathbf{x}_i$ and a $c$-dimensional one-hot class label $\mathbf{y}_i$. The feature inputs are then formed by $\mathbf{X} \in \mathbb{R}^{N \times d}$, and the labels are represented by $\mathbf{Y} \in \mathbb{R}^{N \times c}$. The labeled and unlabeled node sets are denoted as $\mathcal{V}_L$ and $\mathcal{V}_U$, and we have $\mathcal{V} = \mathcal{V}_L \cup \mathcal{V}_U$.
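As a small illustration of these objects (a sketch under assumed NumPy/SciPy conventions, not taken from the paper), $\mathbf{A}$ and $\mathbf{D}$ can be built from an edge list, and a mini-batch of structure inputs then corresponds to a set of rows of $\mathbf{A}$:

import numpy as np
import scipy.sparse as sp

# Toy edge list for an undirected graph with N = 4 nodes (illustrative only).
edges = np.array([[0, 1], [1, 2], [2, 3]])
N = 4

# Sparse adjacency matrix A (symmetrized) and diagonal degree matrix D.
A = sp.coo_matrix((np.ones(len(edges)), (edges[:, 0], edges[:, 1])), shape=(N, N))
A = A + A.T
D = sp.diags(np.asarray(A.sum(axis=1)).ravel())

# A mini-batch of structure inputs: dense copies of the selected rows of A.
batch_idx = [0, 2]
a_rows = A.tocsr()[batch_idx].toarray()   # shape [batch_size, N]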
The task of node classification is to predict the labels $\mathbf{Y}$ by exploiting the node features $\mathbf{X}$ and the graph structure $\mathbf{A}$. The goal of this paper is to learn an MLP-like student such that the learned MLP can achieve similar or even better performance than a GNN trained on the same training set, with a much lower computational cost at inference time.
2.2 Graph Neural Networks
Most existing GNNs follow the message-passing paradigm, which contains node feature transformation and information aggregation from connected neighbors on the graph structure [9]. The general $k$-th layer graph convolution for a node $v_i$ can be formulated as
$$\mathbf{h}_i^{(k)} = f\left(\mathbf{h}_i^{(k-1)}, \big\{\mathbf{h}_j^{(k-1)} : j \in \mathcal{N}(v_i)\big\}\right), \quad (1)$$
where the representation $\mathbf{h}_i$ is updated iteratively in each layer by collecting messages from the neighbors of $v_i$, denoted as $\mathcal{N}(v_i)$. The graph convolution operator $f$ is usually implemented as a weighted sum of node representations according to the adjacency matrix $\mathbf{A}$, as in GCN [20] and GraphSAGE [10], or via the attention mechanism in GAT [33]. However, this recursive expansion and aggregation of neighbors causes inference latency, because the number of fetched neighbors grows exponentially with the number of layers [37, 41].
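For example, the graph convolution of GCN [20] can be written in matrix form as
$$\mathbf{H}^{(k)} = \sigma\big(\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{H}^{(k-1)} \mathbf{W}^{(k)}\big),$$
where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ is the adjacency matrix with self-loops, $\tilde{\mathbf{D}}$ is its degree matrix, $\mathbf{W}^{(k)}$ is a learnable weight matrix, and $\sigma$ is a nonlinearity; computing it for a test node still requires fetching the representations of all its (multi-hop) neighbors.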
The objective function for training GNNs is the cross-entropy between the ground-truth labels $\mathbf{Y}$ and the output of the network $\hat{\mathbf{Y}} \in \mathbb{R}^{N \times c}$:
$$\mathcal{L}_{CE}(\hat{\mathbf{Y}}_L, \mathbf{Y}_L) = -\sum_{i \in \mathcal{V}_L} \sum_{j=1}^{c} \mathbf{Y}_{ij} \ln \hat{\mathbf{Y}}_{ij}. \quad (2)$$
2.3 Transductive and Inductive Scenarios
Although the idealized transductive setting is the most commonly studied for node classification, it is incompatible with the unseen nodes that arise in real applications. We therefore consider node classification under three settings to give a broad evaluation of models: transductive (trans), inductive with connection (ind w/c), and inductive without connection (ind w/o c), as shown in Figure 2. For trans, models can utilize all node