Boosting Graph Neural Networks via Adaptive Knowledge Distillation
Zhichun Guo1, Chunhui Zhang2, Yujie Fan3, Yijun Tian1, Chuxu Zhang2, Nitesh V. Chawla1
1University of Notre Dame, Notre Dame, IN 46556
2Brandeis University, Waltham, MA 02453
3Case Western Reserve University, Cleveland, OH 44106
{zguo5,yijun.tian,nchawla}@nd.edu
{chunhuizhang,chuxuzhang}@brandeis.edu, yxf370@case.edu
Abstract
Graph neural networks (GNNs) have shown remarkable performance on diverse graph mining tasks. While GNNs share the same message passing framework, our study shows that different GNNs learn distinct knowledge from the same graph. This implies potential performance improvement by distilling the complementary knowledge from multiple models. However, knowledge distillation (KD) transfers knowledge from high-capacity teachers to a lightweight student, which deviates from our scenario: GNNs are often shallow. To transfer knowledge effectively, we need to tackle two challenges: how to transfer knowledge from compact teachers to a student with the same capacity, and how to exploit the student GNN's own learning ability. In this paper, we propose a novel adaptive KD framework, called BGNN, which sequentially transfers knowledge from multiple GNNs into a student GNN. We also introduce an adaptive temperature module and a weight boosting module. These modules guide the student toward the appropriate knowledge for effective learning. Extensive experiments have demonstrated the effectiveness of BGNN. In particular, we achieve up to 3.05% improvement for node classification and 6.35% improvement for graph classification over vanilla GNNs.
Introduction
Recent years have witnessed the significant development of graph neural networks (GNNs). Various GNNs have been developed and applied to different graph mining tasks (Kipf and Welling 2017; Hamilton, Ying, and Leskovec 2017; Velickovic et al. 2018; Klicpera, Bojchevski, and Günnemann 2019; Xu et al. 2019; Wu et al. 2019; Jin et al. 2021; Guo et al. 2022a). Although most GNNs can be unified under the Message Passing Neural Network framework (Gilmer et al. 2017), their learning abilities diverge (Xu et al. 2019; Balcilar et al. 2020). In our preliminary study, we observe that the graph representations learned by different GNNs are not similar, especially in deeper layers. This suggests that different GNNs may encode complementary knowledge due to their different aggregation schemes. Based on this observation, it is natural to ask: can we boost vanilla GNNs by effectively utilizing the complementary knowledge learned by different GNNs from the same dataset?
An intuitive solution is to compose multiple models into
an ensemble (Hansen and Salamon 1990; Breiman 2001) that
would achieve better performance than each of its constituent models. However, an ensemble is not always effective, especially when the base classifiers are strong learners (Zhang et al. 2020). Thus, we seek a different approach to take advantage of knowledge from different GNNs: knowledge distillation (KD) (Hinton et al. 2015; Romero et al. 2014; Touvron et al. 2021), which distills information from one (teacher) model into another (student) model. However, KD is typically accompanied by model compression (Yim et al. 2017; Heo et al. 2019; Yuan et al. 2019), where the teacher network is a high-capacity neural network and the student network is a compact and fast-to-execute model. In that setting, there can be a significant performance gap between students and teachers. Such a gap may not exist in our scenario: GNNs are all very shallow due to the over-smoothing issue (Zhao and Akoglu 2019; Li, Han, and Wu 2018; Alon and Yahav 2020). Hence, it is more difficult to distill extra knowledge from teacher GNNs to boost the student GNN. To achieve this goal, two major challenges arise: the first is how to transfer knowledge from a teacher GNN into a student GNN with the same capacity so that the student can produce the same or even better performance (teaching effectiveness); the second is how to fully exploit the student model's own learning ability, which is ignored in traditional KD, where the student's performance relies heavily on the teacher (learning ability).
In this work, we propose a novel framework, namely BGNN, which combines the knowledge from different GNNs in a "boosting" way to strengthen a vanilla GNN through knowledge distillation. To improve the teaching effectiveness, we propose two strategies to increase the useful knowledge transferred from the teachers to the student. One is the sequential training strategy, where the student is encouraged to focus on learning from one teacher at a time. This allows the student to learn diverse knowledge from individual GNNs. The other is an adaptive temperature module. Unlike existing KD methods that use a uniform temperature for all samples, the temperature in BGNN is adjusted based on the teacher's confidence in a specific sample. To enhance the learning ability, we develop a weight boosting module. This module redistributes the weights of samples, making the student GNN pay more attention to misclassified samples. Our proposed BGNN is a general model that can be applied to both graph classification and node classification tasks.
We conduct extensive experimental studies on both tasks, and the results demonstrate the superior performance of BGNN compared with a set of baseline methods.

Figure 1: CKA similarity between graph representations at different layers of GNNs on Enzymes.
To summarize, our contributions are listed as follows:
• Through empirical study, we show that the representations learned by different GNNs are not similar, indicating that they encode different knowledge from the same input.
• Motivated by our observation, we propose a novel framework BGNN that transfers knowledge from different GNNs in a "boosting" way to elevate a vanilla GNN.
• Rather than using a uniform temperature for all samples, we design an adaptive temperature for each sample, which benefits the knowledge transfer from teacher to student.
• Empirical results have demonstrated the effectiveness of BGNN. Particularly, we achieve up to 3.05% and 6.35% improvement over vanilla GNNs for node classification and graph classification, respectively.
Related Work and Background
Graph Neural Networks.
Most GNNs follow a message-passing scheme, which consists of message, update, and readout functions that learn node embeddings by iteratively aggregating information from neighbors (Xu et al. 2019; Wu et al. 2019; Klicpera, Bojchevski, and Günnemann 2019). For example, GCN (Kipf and Welling 2017) simplifies graph convolutions and aggregates neighbors' information by averaging; GraphSage (Hamilton, Ying, and Leskovec 2017) fixes the number of sampled neighbors used for aggregation; GAT (Velickovic et al. 2018) introduces an attention mechanism (Vaswani et al. 2017) to weight neighbors differently during aggregation. The aim of this work is not to design a new GNN architecture, but to propose a new framework that boosts existing GNNs by leveraging the diverse learning abilities of different GNNs.
GNN Knowledge Distillation. Many models apply the KD framework to GNNs for better efficiency in different settings (Yang, Liu, and Shi 2021; Zheng et al. 2022; Zhang et al. 2020; Deng and Zhang 2021; Feng et al. 2022). For example, Yan et al. (2020) proposed TinyGNN to distill a large GNN into a small GNN. GLNN (Zhang et al. 2022) was proposed to distill GNNs into MLPs. All of these works distill knowledge by penalizing the difference between the softened logits of a teacher and a student, following (Hinton et al. 2015). Beyond this vanilla KD, Yang et al. (2020) proposed LSP, a local-structure-preserving KD method from the computer vision area, to transfer knowledge effectively between different GCN models. Wang et al. (2021) proposed MulDE, a novel multi-teacher KD method for link prediction based on knowledge graph embeddings. LLP (Guo et al. 2022b) is another KD framework designed specifically for link prediction. In this work, we also adopt a logits-based KD method to distill knowledge, but we design two modules to increase the useful knowledge transferred from the teachers to the student and to fully exploit the student model's own learning ability. Unlike MulDE, which combines teachers' knowledge in parallel, we adopt a sequential training strategy to combine different teacher models.
Background and Preliminary Study
Background
Notations. Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ denote a graph, where $\mathcal{V}$ stands for all nodes and $\mathcal{E}$ stands for all edges. Each node $v_i \in \mathcal{V}$ in the graph has a corresponding $D$-dimensional feature vector $\mathbf{x}_i \in \mathbb{R}^{D}$. There are $N$ nodes in the graph, and the entire node feature matrix is $\mathbf{X} \in \mathbb{R}^{N \times D}$.
Graph Neural Networks. A GNN iteratively updates node embeddings by aggregating information from neighboring nodes. We initialize the embedding of node $v$ as $\mathbf{h}_v^{(0)} = \mathbf{x}_v$. Its embedding at the $l$-th layer is updated to $\mathbf{h}_v^{(l)}$ by aggregating its neighbors' embeddings, which is formulated as:
$$\mathbf{h}_v^{(l)} = \mathrm{UPDATE}^{l}\Big(\mathbf{h}_v^{(l-1)}, \mathrm{AGG}^{l}\big(\{\mathbf{h}_u^{(l-1)} : u \in \mathcal{N}(v)\}\big)\Big),$$
where $\mathrm{AGG}$ and $\mathrm{UPDATE}$ are the aggregation function and the update function, respectively, and $\mathcal{N}(v)$ denotes the neighbors of node $v$. Furthermore, the whole-graph representation can be computed from all nodes' representations as $\mathbf{h}_{\mathcal{G}} = \mathrm{READOUT}(\{\mathbf{h}_v^{(l)} \mid v \in \mathcal{V}\})$, where $\mathrm{READOUT}$ is a graph-level pooling function.
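For concreteness, here is a minimal PyTorch sketch of one such message-passing layer and a graph-level readout. The mean aggregation, concatenation-based update, and mean pooling are illustrative placeholders for AGG, UPDATE, and READOUT, not the specific functions of any particular GNN discussed in this paper.

```python
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """One message-passing layer: AGG = mean over neighbors, UPDATE = linear + ReLU."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.update = nn.Linear(2 * in_dim, out_dim)

    def forward(self, h, adj):
        # h: [N, in_dim] node embeddings; adj: [N, N] binary adjacency matrix
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)      # avoid division by zero
        agg = adj @ h / deg                                  # AGG: mean of neighbor embeddings
        return torch.relu(self.update(torch.cat([h, agg], dim=1)))  # UPDATE(h, AGG)

def readout(h):
    # READOUT: graph-level mean pooling over all node embeddings
    return h.mean(dim=0)

# Toy usage on a 4-node graph
adj = torch.tensor([[0, 1, 1, 0],
                    [1, 0, 0, 1],
                    [1, 0, 0, 1],
                    [0, 1, 1, 0]], dtype=torch.float)
x = torch.randn(4, 8)            # initial features, h^(0) = x
layer = SimpleGNNLayer(8, 16)
h1 = layer(x, adj)               # h^(1) for every node
hg = readout(h1)                 # whole-graph representation
```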
Node Classification. Node classification is a typical supervised learning task for GNNs. The target is to predict the labels of unlabeled nodes in the graph. Let $\mathbf{Y} \in \mathbb{R}^{N \times C}$ be the set of node labels. The ground truth of node $v$ is $\mathbf{y}_v$, a $C$-dimensional one-hot vector.
Graph Classification. Graph classification is commonly used in chemistry tasks such as molecular property prediction (Hu et al. 2019; Guo et al. 2021), where the goal is to predict properties of entire graphs. Here, the ground-truth matrix $\mathbf{Y} \in \mathbb{R}^{M \times C}$ is the set of graph labels, where $M$ and $C$ are the number of graphs and the number of graph categories, respectively.
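As a small illustration of the label conventions above (one-hot node labels $\mathbf{Y} \in \mathbb{R}^{N \times C}$ and graph labels $\mathbf{Y} \in \mathbb{R}^{M \times C}$), here is a short sketch; the class indices are made up for the example.

```python
import torch
import torch.nn.functional as F

# Node classification: N = 5 nodes, C = 3 classes -> Y in R^{5x3}, one-hot per node
node_classes = torch.tensor([0, 2, 1, 1, 0])          # made-up class indices
Y_node = F.one_hot(node_classes, num_classes=3).float()

# Graph classification: M = 4 graphs, C = 3 categories -> Y in R^{4x3}
graph_classes = torch.tensor([2, 0, 1, 2])
Y_graph = F.one_hot(graph_classes, num_classes=3).float()

print(Y_node.shape, Y_graph.shape)   # torch.Size([5, 3]) torch.Size([4, 3])
```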
[Figure 2 panels: (a) Model Overview; (b) Adaptive Temperature; (c) Weight Boosting. Note in the figure: GNN$_p$ is different from any GNN$_q$ ($q < p$).]
Figure 2: The overall framework of BGNN. (a) shows our sequential training process, where we first train a teacher GNN$_1$ with labels. Then we repeatedly take the GNN model generated in the previous step as the teacher to produce soft labels and update the training nodes' weights, which are used to train a newly initialized student. (b) presents the adaptive temperature module, where the temperature is adjusted based on the teacher's logits distribution for each node. (c) shows the weight boosting module, where the weights of nodes misclassified by the teacher GNN are boosted (drawn larger in the figure).
Preliminary Study on GNNs’ Representation
Next, we perform a preliminary study to answer the following question: do different GNNs encode different knowledge from the same input graphs? We train 4-layer GCN, GAT, and GraphSage models on Enzymes in a supervised way. After training, we use Centered Kernel Alignment (CKA) (Kornblith et al. 2019) as a similarity index to evaluate the relationship among different representations: a higher CKA value means the compared representations are more similar. We take the average of all the embeddings at each layer as the representation of that layer. Figure 1 illustrates the CKA between the representations of each layer learned by GCN, GAT, and GraphSage on Enzymes. We observe that the similarities between the representations learned at different layers of GCN, GAT, and GraphSage are diverse. For example, the CKA value between the representation from layer 1/2/3/4 of GCN and that from GAT is around 0.7/0.35/0.4/0.3. This indicates that different GNNs may encode different knowledge.
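For reference, a minimal sketch of how the linear variant of CKA can be computed; the arrays `rep_gcn` and `rep_gat` are hypothetical placeholders for layer representations collected from two trained GNNs (random values here, only to make the snippet runnable).

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices of shape [num_examples, dim]."""
    # Center each representation along the example axis
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    denominator = (np.linalg.norm(X.T @ X, ord="fro") *
                   np.linalg.norm(Y.T @ Y, ord="fro"))
    return numerator / denominator

# Hypothetical layer-2 representations from two trained GNNs (e.g., 600 graphs, 64 dims)
rep_gcn = np.random.randn(600, 64)
rep_gat = np.random.randn(600, 64)
print(f"CKA(GCN layer 2, GAT layer 2) = {linear_cka(rep_gcn, rep_gat):.3f}")
```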
We posit that the different aggregation schemes of these GNNs cause the differences in the learned representations. In particular, GCN aggregates neighborhoods with predefined weights; GAT aggregates neighborhoods using learnable weights; and GraphSage randomly samples neighbors during aggregation. Given such differences, it is promising to boost one GNN by incorporating the knowledge from other GNNs, which motivates us to design a framework that can take advantage of the diverse knowledge from different GNNs.
The Proposed Framework
In this section, we introduce the BGNN framework, which boosts GNNs by utilizing complementary knowledge from other GNNs. An illustration of the framework is shown in Figure 2. Our framework adopts a sequential training strategy to encourage the student to focus on learning from one single teacher at a time. To calibrate the information distilled from the teacher, we propose an adaptive temperature module that adjusts the soft labels from teachers. Further, we propose a weight boosting mechanism to enhance the training of the student model.
Model Overview
To boost a GNN, we take it as the student and aim to transfer diverse knowledge from other teacher GNNs into it. In this work, we utilize the KD method proposed by Hinton et al. (2015), where a teacher's knowledge is transferred to a student by encouraging the student model to imitate the teacher's behavior. In our framework, we pre-train a teacher GNN (GNN$_T$) with ground-truth labels and keep its parameters fixed during KD. Then, we transfer the knowledge from GNN$_T$ by letting the student GNN (GNN$_S$) optimize the soft cross-entropy loss between the student network's logits $\mathbf{z}_v$ and the teacher's logits $\mathbf{t}_v$. Let $\tau_v$ be the temperature for node $v$ used to soften the logits distribution of the teacher GNN.
Then we incorporate the target of knowledge distillation into the training process of the student model by minimizing the following loss:
$$\mathcal{L} = \mathcal{L}_{label} + \lambda \mathcal{L}_{KD} = \mathcal{L}_{label} - \lambda \sum_{v \in \mathcal{V}} \hat{\mathbf{y}}_v^{T} \log\big(\hat{\mathbf{y}}_v^{S}\big), \quad \text{with } \hat{\mathbf{y}}_v^{T} = \mathrm{softmax}(\mathbf{t}_v/\tau_v), \ \hat{\mathbf{y}}_v^{S} = \mathrm{softmax}(\mathbf{z}_v/\tau_v), \tag{1}$$
where $\mathcal{L}_{label}$ is the supervised training loss w.r.t. the ground-truth labels and $\lambda$ is a trade-off factor that balances their importance.
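A minimal PyTorch sketch of the objective in Eq. (1) follows. The per-node temperature vector `tau` is a stand-in for the adaptive temperature module described later, and averaging the KD term over nodes (rather than summing) and the value $\lambda = 0.5$ are arbitrary choices for illustration.

```python
import torch
import torch.nn.functional as F

def bgnn_style_loss(student_logits, teacher_logits, labels, tau, lam=0.5):
    """Supervised loss plus soft cross-entropy against temperature-softened teacher logits.

    student_logits, teacher_logits: [N, C]; labels: [N]; tau: [N] per-node temperatures.
    """
    # Standard supervised loss on ground-truth labels (L_label)
    loss_label = F.cross_entropy(student_logits, labels)

    # Soften both distributions with the per-node temperature tau_v
    tau = tau.unsqueeze(1)                                       # [N, 1] for broadcasting
    y_teacher = F.softmax(teacher_logits / tau, dim=1)           # \hat{y}^T_v
    log_y_student = F.log_softmax(student_logits / tau, dim=1)   # log \hat{y}^S_v

    # L_KD = - sum_v y^T_v . log y^S_v  (averaged over nodes here)
    loss_kd = -(y_teacher * log_y_student).sum(dim=1).mean()

    return loss_label + lam * loss_kd

# Toy usage: 5 nodes, 3 classes
student_logits = torch.randn(5, 3, requires_grad=True)
teacher_logits = torch.randn(5, 3)
labels = torch.tensor([0, 2, 1, 1, 0])
tau = torch.full((5,), 2.0)          # a uniform temperature as a placeholder
loss = bgnn_style_loss(student_logits, teacher_logits, labels, tau)
loss.backward()
```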
Following the above objective, we adopt a sequential distillation process, in which we can freely boost the student GNN with one teacher GNN or with multiple teachers. Such a sequential manner encourages the student model to focus on the knowledge from one single teacher. In contrast, when using multiple teachers simultaneously, the student may receive mixed noisy signals, which can harm the distillation process. As illustrated in Figure 2(a), we first train a teacher GNN$_1$ using true labels and then train a student GNN$_2$ with the dual targets of predicting the true labels and matching the logits distribution of GNN$_1$. The logits distribution is softened by our proposed adaptive temperature (Figure 2(b)) for each node, and the weights of the nodes misclassified by the teacher GNN are boosted when predicting true labels (Figure 2(c)). The parameters of teacher GNN$_1$ are not updated while we train GNN$_2$. Such a process is repeated for $p$ steps, and the student GNN obtained in the final step serves as the boosted model.
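To make the sequential procedure concrete, here is a rough, self-contained sketch of the training loop under our reading of the description above. Plain MLPs stand in for the GNN architectures, and the weight-doubling rule, $\lambda = 0.5$, and $\tau = 2.0$ are placeholders rather than the actual weight boosting and adaptive temperature modules, which are defined later in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_model_a(d_in=8, d_out=3):
    # Stand-in "GNN" architecture A (a plain MLP, purely for illustration)
    return nn.Sequential(nn.Linear(d_in, 16), nn.ReLU(), nn.Linear(16, d_out))

def make_model_b(d_in=8, d_out=3):
    # Stand-in "GNN" architecture B with a different design
    return nn.Sequential(nn.Linear(d_in, 32), nn.Tanh(), nn.Linear(32, d_out))

def train_one_step(model, x, labels, weights, teacher_logits=None, lam=0.5, tau=2.0, epochs=200):
    """Train one model on weighted ground-truth labels, optionally distilling from a frozen teacher."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(epochs):
        logits = model(x)
        # Weighted supervised loss (weight boosting acts through `weights`)
        loss = (F.cross_entropy(logits, labels, reduction="none") * weights).mean()
        if teacher_logits is not None:
            # Soft cross-entropy against temperature-softened teacher logits, as in Eq. (1)
            y_t = F.softmax(teacher_logits / tau, dim=1)
            loss = loss - lam * (y_t * F.log_softmax(logits / tau, dim=1)).sum(dim=1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

# Toy data: 100 samples, 8 features, 3 classes
x, labels = torch.randn(100, 8), torch.randint(0, 3, (100,))
weights = torch.ones(100)

# Step 1: train the first teacher with ground-truth labels only
teacher = train_one_step(make_model_a(), x, labels, weights)

# Steps 2..p: each step trains a freshly initialized student of a different architecture,
# distilling from the previous step's model; the student then becomes the next teacher.
for make_student in [make_model_b, make_model_a]:
    with torch.no_grad():
        t_logits = teacher(x)
    # Placeholder weight-boosting rule: up-weight samples the current teacher misclassifies
    wrong = t_logits.argmax(dim=1) != labels
    weights = torch.where(wrong, weights * 2.0, weights)
    teacher = train_one_step(make_student(), x, labels, weights, teacher_logits=t_logits)

boosted_gnn = teacher  # the model produced by the final step
```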