Boosting Graph Neural Networks via Adaptive Knowledge Distillation
Zhichun Guo1, Chunhui Zhang2, Yujie Fan3, Yijun Tian1, Chuxu Zhang2, Nitesh V. Chawla1
1University of Notre Dame, Notre Dame, IN 46556
2Brandeis University, Waltham, MA 02453
3Case Western Reserve University, Cleveland, OH 44106
{zguo5,yijun.tian,nchawla}@nd.edu
{chunhuizhang,chuxuzhang}@brandeis.edu, yxf370@case.edu
Abstract
Graph neural networks (GNNs) have shown remarkable performance on diverse graph mining tasks. While GNNs share the same message passing framework, our study shows that different GNNs learn distinct knowledge from the same graph. This implies potential performance improvement by distilling the complementary knowledge from multiple models. However, knowledge distillation (KD) transfers knowledge from high-capacity teachers to a lightweight student, which deviates from our scenario: GNNs are often shallow. To transfer knowledge effectively, we need to tackle two challenges: how to transfer knowledge from compact teachers to a student with the same capacity, and how to exploit the student GNN's own learning ability. In this paper, we propose a novel adaptive KD framework, called BGNN, which sequentially transfers knowledge from multiple GNNs into a student GNN. We also introduce an adaptive temperature module and a weight boosting module. These modules guide the student toward the appropriate knowledge for effective learning. Extensive experiments have demonstrated the effectiveness of BGNN. In particular, we achieve up to 3.05% improvement for node classification and 6.35% improvement for graph classification over vanilla GNNs.
Introduction
Recent years have witnessed the significant development of graph neural networks (GNNs). Various GNNs have been developed and applied to different graph mining tasks (Kipf and Welling 2017; Hamilton, Ying, and Leskovec 2017; Velickovic et al. 2018; Klicpera, Bojchevski, and Günnemann 2019; Xu et al. 2019; Wu et al. 2019; Jin et al. 2021; Guo et al. 2022a). Although most GNNs can be unified under the Message Passing Neural Network framework (Gilmer et al. 2017), their learning abilities diverge (Xu et al. 2019; Balcilar et al. 2020). In our preliminary study, we observe that the graph representations learned by different GNNs are not similar, especially in deeper layers. This suggests that different GNNs may encode complementary knowledge due to their different aggregation schemes. Based on this observation, it is natural to ask: can we boost vanilla GNNs by effectively utilizing the complementary knowledge learned by different GNNs from the same dataset?
An intuitive solution is to compose multiple models into
an ensemble (Hansen and Salamon 1990; Breiman 2001) that
would achieve better performance than each of its constituent models. However, an ensemble is not always effective, especially when the base classifiers are strong learners (Zhang et al. 2020). Thus, we seek a different approach to take advantage of knowledge from different GNNs: knowledge distillation (KD) (Hinton et al. 2015; Romero et al. 2014; Touvron et al. 2021), which distills information from one (teacher) model into another (student) model. However, KD is typically accompanied by model compression (Yim et al. 2017; Heo et al. 2019; Yuan et al. 2019), where the teacher network is a high-capacity neural network and the student network is a compact and fast-to-execute model. In that setting, there can be a significant performance gap between students and teachers. Such a gap may not exist in our scenario: GNNs are all very shallow due to the over-smoothing issue (Zhao and Akoglu 2019; Li, Han, and Wu 2018; Alon and Yahav 2020). Hence, it is more difficult to distill extra knowledge from teacher GNNs to boost the student GNN. To achieve this goal, two major challenges arise: the first is how to transfer knowledge from a teacher GNN into a student GNN with the same capacity so that the student can produce the same or even better performance (teaching effectiveness); the second is how to fully exploit the student model's own learning ability, which is ignored in traditional KD, where the student's performance relies heavily on the teacher (learning ability).
In this work, we propose a novel framework, namely BGNN, which combines the knowledge from different GNNs in a "boosting" way to strengthen a vanilla GNN through knowledge distillation. To improve the teaching effectiveness, we propose two strategies to increase the useful knowledge transferred from the teachers to the student. One is the sequential training strategy, where the student is encouraged to focus on learning from one teacher at a time. This allows the student to learn diverse knowledge from individual GNNs. The other is an adaptive temperature module. Unlike existing KD methods that use a uniform temperature for all samples, the temperature in BGNN is adjusted based on the teacher's confidence in a specific sample. To enhance the learning ability, we develop a weight boosting module. This module redistributes the weights of samples, making the student GNN pay more attention to misclassified samples. Our proposed BGNN is a general model that can be applied to both graph classification and node classification tasks.
We conduct extensive experimental studies on both tasks, and the results demonstrate the superior performance of BGNN compared with a set of baseline methods.

Figure 1: CKA similarity between graph representations at different layers of GNNs on Enzymes.
To summarize, our contributions are listed as follows:
• Through empirical study, we show that the representations learned by different GNNs are not similar, indicating that they encode different knowledge from the same input.
• Motivated by our observation, we propose a novel framework BGNN that transfers knowledge from different GNNs in a "boosting" way to elevate a vanilla GNN.
• Rather than using a uniform temperature for all samples, we design an adaptive temperature for each sample, which benefits the knowledge transfer from teacher to student.
• Empirical results have demonstrated the effectiveness of BGNN. Particularly, we achieve up to 3.05% and 6.35% improvement over vanilla GNNs for node classification and graph classification, respectively.
Related Work and Background
Graph Neural Networks.
Most GNNs follow a message-passing scheme, which consists of message, update, and readout functions that learn node embeddings by iteratively aggregating information from neighbors (Xu et al. 2019; Wu et al. 2019; Klicpera, Bojchevski, and Günnemann 2019). For example, GCN (Kipf and Welling 2017) simplifies graph convolutions and aggregates neighbors' information by averaging; GraphSage (Hamilton, Ying, and Leskovec 2017) fixes the number of sampled neighbors used for aggregation; GAT (Velickovic et al. 2018) introduces an attention mechanism (Vaswani et al. 2017) to weight neighbors differently during aggregation. The aim of this work is not to design a new GNN architecture, but to propose a new framework that boosts existing GNNs by leveraging the diverse learning abilities of different GNNs.
GNN Knowledge Distillation. Many models apply the KD framework to GNNs for better efficiency in different settings (Yang, Liu, and Shi 2021; Zheng et al. 2022; Zhang et al. 2020; Deng and Zhang 2021; Feng et al. 2022). For example, Yan et al. (2020) proposed TinyGNN to distill a large GNN into a small GNN. GLNN (Zhang et al. 2022) was proposed to distill GNNs into MLPs. All of these works distill knowledge by penalizing the difference between the softened logits of a teacher and a student, following (Hinton et al. 2015). Beyond this vanilla KD, Yang et al. (2020) proposed LSP, a local-structure-preserving KD method from the computer vision area, to transfer knowledge effectively between different GCN models. Wang et al. (2021) proposed MulDE, a novel multi-teacher KD method for link prediction based on knowledge graph embeddings. LLP (Guo et al. 2022b) is another KD framework designed specifically for link prediction. In this work, we also adopt a logits-based KD method to distill knowledge, but we design two modules to increase the useful knowledge transferred from the teachers to the student and to fully exploit the student model's own learning ability. Unlike MulDE, which combines teachers' knowledge in parallel, we adopt a sequential training strategy to combine different teacher models.
Background and Preliminary Study
Background
Notations. Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ denote a graph, where $\mathcal{V}$ stands for all nodes and $\mathcal{E}$ stands for all edges. Each node $v_i \in \mathcal{V}$ in the graph has a corresponding $D$-dimensional feature vector $\mathbf{x}_i \in \mathbb{R}^{D}$. There are $N$ nodes in the graph, and the entire node feature matrix is $\mathbf{X} \in \mathbb{R}^{N \times D}$.
Graph Neural Networks. A GNN iteratively updates node embeddings by aggregating information from neighboring nodes. We initialize the embedding of node $v$ as $\mathbf{h}_v^{(0)} = \mathbf{x}_v$. Its embedding at the $l$-th layer is updated to $\mathbf{h}_v^{(l)}$ by aggregating its neighbors' embeddings, which is formulated as:
$$\mathbf{h}_v^{(l)} = \mathrm{UPDATE}^{l}\Big(\mathbf{h}_v^{(l-1)}, \mathrm{AGG}^{l}\big(\{\mathbf{h}_u^{(l-1)} : u \in \mathcal{N}(v)\}\big)\Big),$$
where $\mathrm{AGG}$ and $\mathrm{UPDATE}$ are the aggregation function and the update function, respectively, and $\mathcal{N}(v)$ denotes the neighbors of node $v$. Furthermore, the whole-graph representation can be computed from all nodes' representations as $\mathbf{h}_{\mathcal{G}} = \mathrm{READOUT}(\{\mathbf{h}_v^{(l)} \mid v \in \mathcal{V}\})$, where $\mathrm{READOUT}$ is a graph-level pooling function.
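For concreteness, here is a minimal PyTorch sketch of one such message-passing layer and a graph-level readout. The mean aggregation, concatenation-based update, and mean pooling are illustrative placeholders for AGG, UPDATE, and READOUT, not the specific functions of any particular GNN discussed in this paper.

```python
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """One message-passing layer: AGG = mean over neighbors, UPDATE = linear + ReLU."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.update = nn.Linear(2 * in_dim, out_dim)

    def forward(self, h, adj):
        # h: [N, in_dim] node embeddings; adj: [N, N] binary adjacency matrix
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)      # avoid division by zero
        agg = adj @ h / deg                                  # AGG: mean of neighbor embeddings
        return torch.relu(self.update(torch.cat([h, agg], dim=1)))  # UPDATE(h, AGG)

def readout(h):
    # READOUT: graph-level mean pooling over all node embeddings
    return h.mean(dim=0)

# Toy usage on a 4-node graph
adj = torch.tensor([[0, 1, 1, 0],
                    [1, 0, 0, 1],
                    [1, 0, 0, 1],
                    [0, 1, 1, 0]], dtype=torch.float)
x = torch.randn(4, 8)            # initial features, h^(0) = x
layer = SimpleGNNLayer(8, 16)
h1 = layer(x, adj)               # h^(1) for every node
hg = readout(h1)                 # whole-graph representation
```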
Node Classification. Node classification is a typical supervised learning task for GNNs. The target is to predict the labels of unlabeled nodes in the graph. Let $\mathbf{Y} \in \mathbb{R}^{N \times C}$ be the set of node labels. The ground truth of node $v$ is $\mathbf{y}_v$, a $C$-dimensional one-hot vector.
Graph Classification. Graph classification is commonly used in chemistry tasks such as molecular property prediction (Hu et al. 2019; Guo et al. 2021), where the goal is to predict properties of entire graphs. Here, the ground-truth matrix $\mathbf{Y} \in \mathbb{R}^{M \times C}$ is the set of graph labels, where $M$ and $C$ are the number of graphs and the number of graph categories, respectively.
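As a small illustration of the label conventions above (one-hot node labels $\mathbf{Y} \in \mathbb{R}^{N \times C}$ and graph labels $\mathbf{Y} \in \mathbb{R}^{M \times C}$), here is a short sketch; the class indices are made up for the example.

```python
import torch
import torch.nn.functional as F

# Node classification: N = 5 nodes, C = 3 classes -> Y in R^{5x3}, one-hot per node
node_classes = torch.tensor([0, 2, 1, 1, 0])          # made-up class indices
Y_node = F.one_hot(node_classes, num_classes=3).float()

# Graph classification: M = 4 graphs, C = 3 categories -> Y in R^{4x3}
graph_classes = torch.tensor([2, 0, 1, 2])
Y_graph = F.one_hot(graph_classes, num_classes=3).float()

print(Y_node.shape, Y_graph.shape)   # torch.Size([5, 3]) torch.Size([4, 3])
```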
[Figure 2 panels: (a) Model Overview; (b) Adaptive Temperature; (c) Weight Boosting. Note in the figure: GNN$_p$ is different from any GNN$_q$ ($q < p$).]
Figure 2: The overall framework of BGNN. (a) shows our sequential training process, where we first train a teacher GNN$_1$ with labels. Then we repeatedly take the GNN model generated in the previous step as the teacher to produce soft labels and update the training nodes' weights, which are used to train a newly initialized student. (b) presents the adaptive temperature module, where the temperature is adjusted based on the teacher's logits distribution for each node. (c) shows the weight boosting module, where the weights of nodes misclassified by the teacher GNN are boosted (drawn larger in the figure).
Preliminary Study on GNNs’ Representation
Next, we perform a preliminary study to answer the following question: do different GNNs encode different knowledge from the same input graphs? We train 4-layer GCN, GAT, and GraphSage models on Enzymes in a supervised way. After training, we use Centered Kernel Alignment (CKA) (Kornblith et al. 2019) as a similarity index to evaluate the relationship among different representations: a higher CKA value means the compared representations are more similar. We take the average of all the embeddings at each layer as the representation of that layer. Figure 1 illustrates the CKA between the representations of each layer learned by GCN, GAT, and GraphSage on Enzymes. We observe that the similarities between the representations learned at different layers of GCN, GAT, and GraphSage are diverse. For example, the CKA value between the representation from layer 1/2/3/4 of GCN and that from GAT is around 0.7/0.35/0.4/0.3. This indicates that different GNNs may encode different knowledge.
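For reference, a minimal sketch of how the linear variant of CKA can be computed; the arrays `rep_gcn` and `rep_gat` are hypothetical placeholders for layer representations collected from two trained GNNs (random values here, only to make the snippet runnable).

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices of shape [num_examples, dim]."""
    # Center each representation along the example axis
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    denominator = (np.linalg.norm(X.T @ X, ord="fro") *
                   np.linalg.norm(Y.T @ Y, ord="fro"))
    return numerator / denominator

# Hypothetical layer-2 representations from two trained GNNs (e.g., 600 graphs, 64 dims)
rep_gcn = np.random.randn(600, 64)
rep_gat = np.random.randn(600, 64)
print(f"CKA(GCN layer 2, GAT layer 2) = {linear_cka(rep_gcn, rep_gat):.3f}")
```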
We posit that the different aggregation schemes of these GNNs cause the differences in the learned representations. In particular, GCN aggregates neighborhoods with predefined weights; GAT aggregates neighborhoods using learnable weights; and GraphSage randomly samples neighbors during aggregation. Given such differences, it is promising to boost one GNN by incorporating the knowledge from other GNNs, which motivates us to design a framework that can take advantage of the diverse knowledge from different GNNs.
The Proposed Framework
In this section, we introduce the BGNN framework, which boosts GNNs by utilizing complementary knowledge from other GNNs. An illustration of the framework is shown in Figure 2. Our framework adopts a sequential training strategy to encourage the student to focus on learning from one single teacher at a time. To calibrate the information distilled from the teacher, we propose an adaptive temperature module that adjusts the soft labels from teachers. Further, we propose a weight boosting mechanism to enhance the training of the student model.
Model Overview
To boost a GNN, we take it as the student and aim to transfer diverse knowledge from other teacher GNNs into it. In this work, we utilize the KD method proposed by Hinton et al. (2015), where a teacher's knowledge is transferred to a student by encouraging the student model to imitate the teacher's behavior. In our framework, we pre-train a teacher GNN (GNN$_T$) with ground-truth labels and keep its parameters fixed during KD. Then, we transfer the knowledge from GNN$_T$ by letting the student GNN (GNN$_S$) optimize the soft cross-entropy loss between the student network's logits $\mathbf{z}_v$ and the teacher's logits $\mathbf{t}_v$. Let $\tau_v$ be the temperature for node $v$ used to soften the logits distribution of the teacher GNN.
Then we incorporate the target of knowledge distillation into the training process of the student model by minimizing the following loss:
$$\mathcal{L} = \mathcal{L}_{label} + \lambda \mathcal{L}_{KD} = \mathcal{L}_{label} - \lambda \sum_{v \in \mathcal{V}} \hat{\mathbf{y}}_v^{T} \log\big(\hat{\mathbf{y}}_v^{S}\big), \quad \text{with } \hat{\mathbf{y}}_v^{T} = \mathrm{softmax}(\mathbf{t}_v/\tau_v), \ \hat{\mathbf{y}}_v^{S} = \mathrm{softmax}(\mathbf{z}_v/\tau_v), \tag{1}$$
where $\mathcal{L}_{label}$ is the supervised training loss w.r.t. the ground-truth labels and $\lambda$ is a trade-off factor that balances their importance.
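A minimal PyTorch sketch of the objective in Eq. (1) follows. The per-node temperature vector `tau` is a stand-in for the adaptive temperature module described later, and averaging the KD term over nodes (rather than summing) and the value $\lambda = 0.5$ are arbitrary choices for illustration.

```python
import torch
import torch.nn.functional as F

def bgnn_style_loss(student_logits, teacher_logits, labels, tau, lam=0.5):
    """Supervised loss plus soft cross-entropy against temperature-softened teacher logits.

    student_logits, teacher_logits: [N, C]; labels: [N]; tau: [N] per-node temperatures.
    """
    # Standard supervised loss on ground-truth labels (L_label)
    loss_label = F.cross_entropy(student_logits, labels)

    # Soften both distributions with the per-node temperature tau_v
    tau = tau.unsqueeze(1)                                       # [N, 1] for broadcasting
    y_teacher = F.softmax(teacher_logits / tau, dim=1)           # \hat{y}^T_v
    log_y_student = F.log_softmax(student_logits / tau, dim=1)   # log \hat{y}^S_v

    # L_KD = - sum_v y^T_v . log y^S_v  (averaged over nodes here)
    loss_kd = -(y_teacher * log_y_student).sum(dim=1).mean()

    return loss_label + lam * loss_kd

# Toy usage: 5 nodes, 3 classes
student_logits = torch.randn(5, 3, requires_grad=True)
teacher_logits = torch.randn(5, 3)
labels = torch.tensor([0, 2, 1, 1, 0])
tau = torch.full((5,), 2.0)          # a uniform temperature as a placeholder
loss = bgnn_style_loss(student_logits, teacher_logits, labels, tau)
loss.backward()
```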
Following the above objective, we adopt a sequential distillation process, in which we can freely boost the student GNN with one teacher GNN or with multiple teachers. Such a sequential manner encourages the student model to focus on the knowledge from one single teacher. In contrast, when using multiple teachers simultaneously, the student may receive mixed noisy signals, which can harm the distillation process. As illustrated in Figure 2(a), we first train a teacher GNN$_1$ using true labels and then train a student GNN$_2$ with the dual targets of predicting the true labels and matching the logits distribution of GNN$_1$. The logits distribution is softened by our proposed adaptive temperature (Figure 2(b)) for each node, and the weights of the nodes misclassified by the teacher GNN are boosted when predicting true labels (Figure 2(c)). The parameters of teacher GNN$_1$ are not updated while we train GNN$_2$. Such a process is repeated for $p$ steps, and the student GNN obtained in the final step serves as the boosted model.
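To make the sequential procedure concrete, here is a rough, self-contained sketch of the training loop under our reading of the description above. Plain MLPs stand in for the GNN architectures, and the weight-doubling rule, $\lambda = 0.5$, and $\tau = 2.0$ are placeholders rather than the actual weight boosting and adaptive temperature modules, which are defined later in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_model_a(d_in=8, d_out=3):
    # Stand-in "GNN" architecture A (a plain MLP, purely for illustration)
    return nn.Sequential(nn.Linear(d_in, 16), nn.ReLU(), nn.Linear(16, d_out))

def make_model_b(d_in=8, d_out=3):
    # Stand-in "GNN" architecture B with a different design
    return nn.Sequential(nn.Linear(d_in, 32), nn.Tanh(), nn.Linear(32, d_out))

def train_one_step(model, x, labels, weights, teacher_logits=None, lam=0.5, tau=2.0, epochs=200):
    """Train one model on weighted ground-truth labels, optionally distilling from a frozen teacher."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(epochs):
        logits = model(x)
        # Weighted supervised loss (weight boosting acts through `weights`)
        loss = (F.cross_entropy(logits, labels, reduction="none") * weights).mean()
        if teacher_logits is not None:
            # Soft cross-entropy against temperature-softened teacher logits, as in Eq. (1)
            y_t = F.softmax(teacher_logits / tau, dim=1)
            loss = loss - lam * (y_t * F.log_softmax(logits / tau, dim=1)).sum(dim=1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

# Toy data: 100 samples, 8 features, 3 classes
x, labels = torch.randn(100, 8), torch.randint(0, 3, (100,))
weights = torch.ones(100)

# Step 1: train the first teacher with ground-truth labels only
teacher = train_one_step(make_model_a(), x, labels, weights)

# Steps 2..p: each step trains a freshly initialized student of a different architecture,
# distilling from the previous step's model; the student then becomes the next teacher.
for make_student in [make_model_b, make_model_a]:
    with torch.no_grad():
        t_logits = teacher(x)
    # Placeholder weight-boosting rule: up-weight samples the current teacher misclassifies
    wrong = t_logits.argmax(dim=1) != labels
    weights = torch.where(wrong, weights * 2.0, weights)
    teacher = train_one_step(make_student(), x, labels, weights, teacher_logits=t_logits)

boosted_gnn = teacher  # the model produced by the final step
```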