
Boosting Graph Neural Networks via Adaptive Knowledge Distillation
Zhichun Guo1, Chunhui Zhang2, Yujie Fan3, Yijun Tian1, Chuxu Zhang2, Nitesh V. Chawla1
1University of Notre Dame, Notre Dame, IN 46556
2Brandeis University, Waltham, MA 02453
3Case Western Reserve University, Cleveland, OH 44106
{zguo5,yijun.tian,nchawla}@nd.edu
{chunhuizhang,chuxuzhang}@brandeis.edu, yxf370@case.edu
Abstract
Graph neural networks (GNNs) have shown remarkable performance on diverse graph mining tasks. While sharing the same message passing framework, our study shows that different GNNs learn distinct knowledge from the same graph. This implies potential performance improvement by distilling the complementary knowledge from multiple models. However, knowledge distillation (KD) transfers knowledge from high-capacity teachers to a lightweight student, which deviates from our scenario: GNNs are often shallow. To transfer knowledge effectively, we need to tackle two challenges: how to transfer knowledge from compact teachers to a student with the same capacity, and how to exploit the student GNN's own learning ability. In this paper, we propose a novel adaptive KD framework, called BGNN, which sequentially transfers knowledge from multiple GNNs into a student GNN. We also introduce an adaptive temperature module and a weight boosting module. These modules guide the student toward the appropriate knowledge for effective learning. Extensive experiments demonstrate the effectiveness of BGNN. In particular, we achieve up to 3.05% improvement for node classification and 6.35% improvement for graph classification over vanilla GNNs.
Introduction
Recent years have witnessed the significant development of graph neural networks (GNNs). Various GNNs have been developed and applied to different graph mining tasks (Kipf and Welling 2017; Hamilton, Ying, and Leskovec 2017; Velickovic et al. 2018; Klicpera, Bojchevski, and Günnemann 2019; Xu et al. 2019; Wu et al. 2019; Jin et al. 2021; Guo et al. 2022a). Although most GNNs can be unified under the Message Passing Neural Network framework (Gilmer et al. 2017), their learning abilities diverge (Xu et al. 2019; Balcilar et al. 2020). In our preliminary study, we observe that the graph representations learned by different GNNs are not similar, especially in deeper layers. This suggests that different GNNs may encode complementary knowledge due to their different aggregation schemes. Based on this observation, it is natural to ask: can we boost vanilla GNNs by effectively utilizing the complementary knowledge learned by different GNNs from the same dataset?
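As an illustration of the kind of comparison behind this observation, the following sketch measures the similarity of node representations produced by two trained GNNs using linear centered kernel alignment (CKA); the choice of CKA and the variable names here are illustrative assumptions rather than the exact protocol of our preliminary study.

import torch

def linear_cka(X, Y):
    # Linear CKA between two representation matrices of shape (num_nodes, dim);
    # values near 1 indicate highly similar representations.
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    cross = torch.norm(X.t() @ Y) ** 2
    return (cross / (torch.norm(X.t() @ X) * torch.norm(Y.t() @ Y))).item()

# Hypothetical usage: h_gcn and h_gat hold node embeddings from a trained GCN
# and GAT at the same layer of the same graph; low similarity at deeper layers
# suggests the two models capture complementary knowledge.
# similarity = linear_cka(h_gcn, h_gat)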
An intuitive solution is to compose multiple models into
an ensemble (Hansen and Salamon 1990; Breiman 2001) that
would achieve better performance than each of its constituent models. However, an ensemble is not always effective, especially when its base classifiers are already strong learners (Zhang et al. 2020). Thus, we seek a different approach to take advantage of the knowledge from different GNNs: knowledge distillation (KD) (Hinton et al. 2015; Romero et al. 2014; Touvron et al. 2021), which distills information from one (teacher) model into another (student) model. However, KD is usually coupled with model compression (Yim et al. 2017; Heo et al. 2019; Yuan et al. 2019), where the teacher network is a high-capacity neural network and the student network is a compact, fast-to-execute model. In that setting, there can be a significant performance gap between student and teacher. Such a gap may not exist in our scenario: GNNs are all very shallow due to the over-smoothing issue (Zhao and Akoglu 2019; Li, Han, and Wu 2018; Alon and Yahav 2020). Hence, it is more difficult to distill extra knowledge from teacher GNNs to boost the student GNN. To achieve this goal, two major challenges arise: the first is how to transfer knowledge from a teacher GNN into a student GNN with the same capacity such that the student achieves the same or even better performance (teaching effectiveness); the second is how to push the student model to learn as much as possible by itself, which is ignored in traditional KD where the student's performance heavily relies on the teacher (learning ability).
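For concreteness, the standard soft-target distillation loss of Hinton et al. (2015), which the discussion above refers to, can be sketched as below; this is the generic formulation with a single fixed temperature rather than BGNN's exact objective, and the function and variable names are illustrative.

import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Cross-entropy on the ground-truth labels plus KL divergence between the
    # student's and the teacher's temperature-softened distributions.
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # the T^2 factor keeps soft-target gradients on a comparable scale
    return alpha * hard + (1.0 - alpha) * soft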
In this work, we propose a novel framework, namely BGNN, which combines the knowledge from different GNNs in a “boosting” way to strengthen a vanilla GNN through knowledge distillation. To improve the teaching effectiveness, we propose two strategies that increase the useful knowledge transferred from the teachers to the student. The first is a sequential training strategy, in which the student is encouraged to focus on learning from one teacher at a time; this allows the student to learn diverse knowledge from individual GNNs. The second is an adaptive temperature module: unlike existing KD methods that use a uniform temperature for all samples, the temperature in BGNN is adjusted per sample based on the teacher's confidence on that sample. To enhance the learning ability, we develop a weight boosting module, which reweights training samples so that the student GNN pays more attention to the samples it misclassifies (see the sketch below).
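To make the three ingredients concrete, the following is a simplified sketch of one distillation step, assuming PyTorch-style teacher and student models; the adaptive-temperature rule and the boosting update shown here are illustrative placeholders, and the exact formulations are given in the method section.

import torch
import torch.nn.functional as F

def distill_step(student, teacher, inputs, labels, sample_weights, base_T=2.0):
    with torch.no_grad():
        t_logits = teacher(inputs)
    s_logits = student(inputs)

    # Adaptive temperature (placeholder rule): the more confident the teacher
    # is on a sample, the sharper (lower) the temperature applied to it.
    confidence = F.softmax(t_logits, dim=-1).max(dim=-1).values  # in (0, 1]
    T = (base_T * (2.0 - confidence)).unsqueeze(-1)

    soft = F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="none",
    ).sum(dim=-1)
    hard = F.cross_entropy(s_logits, labels, reduction="none")
    loss = (sample_weights * (hard + soft)).mean()

    # Weight boosting (placeholder update): misclassified samples receive
    # larger weights for the next step, in an AdaBoost-like fashion.
    with torch.no_grad():
        wrong = (s_logits.argmax(dim=-1) != labels).float()
        sample_weights = sample_weights * torch.exp(wrong)
        sample_weights = sample_weights * len(sample_weights) / sample_weights.sum()
    return loss, sample_weights

# Sequential training: the student attends to one teacher at a time, e.g.
# for teacher in [gcn_teacher, gat_teacher]:
#     for _ in range(epochs_per_teacher):
#         loss, sample_weights = distill_step(student, teacher, inputs, labels, sample_weights)
#         loss.backward(); optimizer.step(); optimizer.zero_grad()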
Our proposed BGNN is a general model that can be applied to both graph classification and node classification tasks. We