Geometric Knowledge Distillation:
Topology Compression for Graph Neural Networks
Chenxiao Yang, Qitian Wu, Junchi Yan
Department of Computer Science and Engineering
MoE Key Lab of Artificial Intelligence
Shanghai Jiao Tong University
{chr26195,echo740,yanjunchi}@sjtu.edu.cn
Abstract
We study a new paradigm of knowledge transfer that aims at encoding graph topological information into graph neural networks (GNNs) by distilling knowledge from a teacher GNN model trained on a complete graph to a student GNN model operating on a smaller or sparser graph. To this end, we revisit the connection between thermodynamics and the behavior of GNNs, based on which we propose Neural Heat Kernel (NHK) to encapsulate the geometric property of the underlying manifold concerning the architecture of GNNs. A fundamental and principled solution is derived by aligning NHKs on teacher and student models, dubbed as Geometric Knowledge Distillation. We develop non-parametric and parametric instantiations and demonstrate their efficacy in various experimental settings for knowledge distillation regarding different types of privileged topological information and teacher-student schemes.
1 Introduction
Modern graph neural networks (GNNs) [28; 49; 53] have shown remarkable performance in learning representations for structured instances. From the perspective of geometric deep learning [5; 4; 38], part of the achievement of GNNs can be attributed to their implementation of the permutation invariance property as geometric priors² into the architecture design. Nevertheless, in practice, GNNs highly rely on graph topology, as essential input information, to explore the relational knowledge implicit in interactions of instance pairs throughout the entire message passing process, termed as geometric knowledge in this paper. As advances in generalized distillation [33; 47] reveal the possibility of encoding input features into model construction, natural questions arise:

Is it possible, and if so, how can we encode graph topology as a special type of 'geometric prior' into a GNN model, such that the model could precisely capture the underlying geometric knowledge even without full graph topology as input?
Specifically, we are interested in the following geometric knowledge transfer problem: a GNN model (with node-specific outputs for node-level prediction [23]) is exposed to a partial graph, which is a subset of the complete graph. Formally, we have the notation:
$$\mathcal{G}=\{\mathcal{V},\mathcal{E}\}\ \text{(partial graph)},\qquad \tilde{\mathcal{G}}=\{\tilde{\mathcal{V}},\tilde{\mathcal{E}}\}\ \text{(complete graph)},\qquad \text{where } \mathcal{V}\subseteq\tilde{\mathcal{V}},\ \mathcal{E}\subseteq\{\mathcal{V}\times\mathcal{V}\}\cap\tilde{\mathcal{E}}. \tag{1}$$
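To make the setup concrete, the following is a minimal sketch (hypothetical helper and variable names, plain Python sets; not from the paper) of the relation in Eqn. (1): the student keeps a node subset and at most the edges of the complete graph among those nodes.

```python
# Minimal sketch of Eqn. (1): V is a subset of V_tilde, and E is a subset of
# {V x V} intersected with E_tilde. All names are illustrative.
from typing import Set, Tuple

Edge = Tuple[int, int]

def restrict_graph(v_tilde: Set[int], e_tilde: Set[Edge],
                   v: Set[int], e_kept: Set[Edge]) -> Tuple[Set[int], Set[Edge]]:
    """Return the partial graph (V, E) observed by the student."""
    assert v <= v_tilde, "V must be a subset of V_tilde"
    e = {(i, j) for (i, j) in e_kept if i in v and j in v and (i, j) in e_tilde}
    return v, e
```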
Our goal is to transfer or encode geometric knowledge extracted from $\tilde{\mathcal{G}}$ to the target GNN model that is only aware of $\mathcal{G}$. Studying this problem is also of much practical value.
Junchi Yan is the corresponding author, who is also with Shanghai AI Laboratory.
² Geometric priors originally refer to the geometric principles naturally encoded in deep learning architectures, e.g., translational symmetry for CNNs, permutation invariance for GNNs.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.13014v2 [cs.LG] 15 Jan 2023
Figure 1: Feature propagation on the underlying manifold $\mathcal{M}$. (a) Teacher: aware of the complete graph topology, and faithfully explores geometric knowledge about the underlying manifold. (b) Student before GKD: only aware of partial graph topology, and estimates biased geometric properties. (c) Student after GKD: able to propagate features on the same space as the teacher by alignment of NHKs.
As a non-exhaustive list of applications: improving efficiency without compromising effectiveness for coarsened graphs [14; 24; 69]; privacy-constrained scenarios in social recommenders or federated learning where the complete graph is unavailable [34; 50; 68]; and promoting concentration on a targeted community to bring economic benefits [57].
Achieving this target is non-trivial because we first need to find a principled and fundamental way to encapsulate the geometric knowledge extracted by a GNN model, which requires an in-depth investigation of the role of graph topology throughout the progressive process of message passing. Therefore, we take a thermodynamic view borrowed from physics and propose a new methodology built upon recent advances revealing the connection between heat diffusion and the architectures of GNNs [8; 51; 7]. Specifically, we interpret feature propagation as heat flows on the underlying Riemannian manifold, whose characteristics (which depend on graph topology and the GNN model) pave the way for a principled representation of the latent geometric knowledge.
1.1 Our Contributions
New theoretical perspective for analyzing latent graph geometry. On top of the connection between the heat equation and GNNs, we step further to inspect the implication of the heat kernel for GNNs, and propose a novel notion of Neural Heat Kernel (NHK) with rigorous proof of its existence. The heat kernel intrinsically defines the unique solution to the heat equation and can be a fundamental characterization of the geometric properties of the underlying manifold [18; 19]. Likewise, NHK uncovers geometric properties of the latent graph manifold for GNNs and governs how information flows between pairs of instances, which lends us a mathematical tool to encapsulate geometric knowledge extracted from a GNN model and enables geometric knowledge transfer. This result alone might also be useful in broader contexts for understanding GNNs.
Flexible distillation framework with versatile instantiations. Based on the above insights, we treat NHK matrices as representations of the latent geometric knowledge, upon which we build a flexible and principled distillation framework dubbed Geometric Knowledge Distillation (GKD), which aims at encoding and transferring geometric knowledge by aligning the latent manifolds behind GNN models, as illustrated in Fig. 1. We also develop non-parametric and parametric versions of GKD, corresponding to different ways of approximating the NHK computation. Specifically, the former derives explicit NHKs via assumptions on the latent space, and the latter learns the NHK in a data-driven manner.
Applications for geometric knowledge transfer and conventional KD purposes. We verify the efficacy of GKD in terms of different geometric knowledge types (i.e., edge-aware and node-aware ones), and further show its effectiveness for conventional KD purposes (e.g., model compression, self-distillation, online distillation) for broader applicability. We highlight that our methods consistently exceed the teacher model and rival the oracle model that in principle gives the performance upper bound.
1.2 Links to Related Works
Geometric Deep Learning. The study of geometric deep learning [5; 38] provides fundamental principles and methodology to generalize deep learning methods to non-Euclidean domains (e.g., graphs and manifolds). From this perspective, architectures for off-the-shelf GNNs [28; 49; 53; 20] and graph Transformers [67; 56] have naturally incorporated geometric prior knowledge for graphs such as permutation invariance. Despite their remarkable success, they rely heavily on the graph topology. This work extends the idea of geometric deep learning by treating the global graph topology as a special type of prior knowledge and attempts to encode it into GNNs themselves, such that the trained model can leverage information from the global graph topology even without explicitly taking it as input.
Graph-Based Knowledge Distillation. Knowledge distillation (KD) [22; 1] uses the outputs of a teacher model as alternative supervision signals to teach a student model, with various new paradigms including feature-based [42; 66] and relation-based [64; 65] ones. While some prior arts [9; 61; 50; 69; 63; 60] attempted to combine KD and GNNs, i.e., graph-based KD, they are largely straightforward adaptations of KD without an in-depth investigation of the role of graph topology, and are also restricted to a specific choice of GNN architecture or application scenario. In contrast, we first formalize the problem of geometric knowledge transfer, theoretically answer the question "how to represent graph geometric knowledge and encode it into GNN models", and propose a geometric distillation approach based on the theoretical results. More discussions on related works are deferred to Appendix G.
2 Preliminaries
We commence with a brief detour to the heat equation on Riemannian manifolds and its connection with modern GNN architectures. Moreover, we bring forth the notion of the heat kernel to motivate this work.
Heat Equation on Manifolds. We are interested in the heat equation defined on a smooth $k$-dimensional Riemannian manifold $\mathcal{M}$. Suppose the manifold is associated with a scalar- or vector-valued function $x(u, t) : \mathcal{M} \times [0,\infty) \rightarrow \mathbb{R}^d$, quantifying a specific type of signal such as heat at a point $u \in \mathcal{M}$ and time $t$. Fourier's law of heat conductivity describes the flow of heat with respect to time and space via a partial differential equation (PDE) called the heat equation [6], i.e.,
$$\frac{\partial x(u, t)}{\partial t} = c\,\Delta x(u, t), \tag{2}$$
where $c > 0$ is the thermal conductivity coefficient, and $\Delta$ is the natural Laplace–Beltrami operator associated with $\mathcal{M}$. Rewriting $\Delta$ as the functional composition of the divergence operator $\nabla\cdot$ and the gradient operator $\nabla$, i.e., $\Delta = \nabla\cdot \circ \nabla$, we can interpret the heat equation as: the variation of temperature within an infinitesimal time interval at a point is equivalent to the divergence between its own temperature and the average temperature on an infinitesimal sphere around it.
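For intuition, here is a small sketch (not from the paper) of Eqn. (2) discretised on a one-dimensional grid with unit spacing: the Laplace–Beltrami operator reduces to a second difference, and one forward-Euler step moves each point toward the average of its neighbours.

```python
# One explicit (forward-Euler) step of the 1-D heat equation, Eqn. (2).
# Grid spacing is 1; the scheme is stable for c * dt <= 0.5. Illustrative only.
import numpy as np

def heat_step(x: np.ndarray, c: float = 1.0, dt: float = 0.1) -> np.ndarray:
    lap = np.zeros_like(x)
    lap[1:-1] = x[:-2] - 2.0 * x[1:-1] + x[2:]   # discrete divergence of the gradient
    return x + dt * c * lap                       # endpoints held fixed (Dirichlet boundary)
```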
Implications on Graphs. A spatial discretisation of a continuous manifold yields a graph $\mathcal{G}=\{\mathcal{V},\mathcal{E}\}$, whose nodes can be thought of as embedded on the base manifold. In fact, the heat equation along with variants thereof (e.g., the Schrödinger equation) have found widespread use in modeling graph dynamics [11; 25; 36]. More importantly, it has recently been revealed to be intimately related to the architectures of modern GNNs [51; 8; 7]: suppose $X(0) = \{x(u, 0)\}_{u\in\mathcal{V}} \in \mathbb{R}^{n\times d}$ denotes the initial condition for Eqn. (2) determined by input node features; then solving the heat equation under certain definitions of $\nabla$ and $\nabla\cdot$ (i.e., the definition of $\Delta$) amounts to different architectures of GNNs. For instance:
Example 1. [51] Define the discretised counterpart of $\Delta$ as the graph Laplacian matrix $L = \tilde{D}^{-\frac{1}{2}}(\tilde{D}-\tilde{A})\tilde{D}^{-\frac{1}{2}}$. Numerically solving Eqn. (2) using the forward Euler method with step size $\tau = 1$ yields the formulation of Simple Graph Convolution (SGC) [53], where $\Theta$ denotes a learnable transformation matrix:
$$\hat{X}(t) = \left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}\right)^{t} X(0), \qquad \hat{Y} = \mathrm{softmax}\left(\hat{X}(t)\,\Theta\right). \tag{3}$$
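As a concrete reading of Eqn. (3), the sketch below (plain NumPy with illustrative names; not the authors' implementation) runs $t$ forward-Euler steps with the symmetrically normalized, self-loop-augmented adjacency, which is SGC's propagation before the final linear layer and softmax.

```python
# Sketch of SGC propagation, Eqn. (3): X(t) = (D~^-1/2 A~ D~^-1/2)^t X(0).
import numpy as np

def sgc_propagate(adj: np.ndarray, feats: np.ndarray, n_steps: int) -> np.ndarray:
    a_tilde = adj + np.eye(adj.shape[0])               # A~ = A + I (self-loops)
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))    # diagonal of D~^-1/2
    prop = d_inv_sqrt[:, None] * a_tilde * d_inv_sqrt[None, :]
    x = feats
    for _ in range(n_steps):                           # each iteration = one Euler step (tau = 1)
        x = prop @ x
    return x

# A trained Theta then gives predictions: Y_hat = softmax(sgc_propagate(A, X, t) @ Theta).
```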
Example 2. [8] Define the gradient operator $(\nabla)_{ij}$ as the difference of source and target node features, and the divergence operator $(\nabla\cdot)_{i}$ as the sum of features of all edges incident to the node. Numerically solving Eqn. (2) using the explicit Euler scheme with step size $\tau$ yields the following recursive formulation
$$\hat{X}(t+\tau) = \tau\left(G - I\right)\hat{X}(t) + \hat{X}(t), \tag{4}$$
where $G$ is a diffusivity coefficient matrix in place of $c$.

Moreover, stacking a non-linear transformation layer after each step yields the formulation of Graph Convolutional Networks (GCN) [28] for Eqn. (3), Graph Attention Networks (GAT) [49] with residual connection for Eqn. (4), and even more GNN architectures by virtue of the flexibility in interpreting the heat equation on graphs.
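For Eqn. (4), one explicit-Euler step can be sketched as below. This is illustrative code: the diffusivity matrix $G$ is built here as a simple row-normalized attention over existing edges, which is one plausible choice rather than the exact parametrization of [8].

```python
# One explicit-Euler diffusion step, Eqn. (4): X(t + tau) = tau (G - I) X(t) + X(t).
import numpy as np

def diffusion_step(x: np.ndarray, adj: np.ndarray, tau: float = 0.5) -> np.ndarray:
    mask = (adj + np.eye(len(x))) > 0                   # allow self-interaction to avoid empty rows
    scores = x @ x.T                                    # pairwise feature similarity
    scores = np.where(mask, scores, -np.inf)            # restrict diffusion to existing edges
    g = np.exp(scores - scores.max(axis=1, keepdims=True))
    g = g / g.sum(axis=1, keepdims=True)                # row-stochastic diffusivity matrix G
    return tau * (g - np.eye(len(x))) @ x + x
```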
Heat Kernels. Intriguingly, it turns out that the initial value problem of the heat equation on any manifold $\mathcal{M}$ has a smallest positive fundamental solution depending on the Laplace operator $\Delta$, known as the heat kernel [2]. It is denoted as a kernel function $\kappa(x, y, t)$, such that
$$x(u_i, t) = e^{t\Delta} x(u_i, 0) = \int_{\mathcal{M}} \kappa(u_i, u_j, t)\, x(u_j, 0)\, d\mu(u_j), \tag{5}$$
where $\mu$ is a non-negative measure associated with $\mathcal{M}$. In physics, the heat kernel $\kappa(x, y, t)$ can be interpreted as a transition density that describes the asymptotic behavior of a natural Brownian motion on the manifold. Its formulation thus can be treated as a unique reflection or representation of the geometry of the underlying manifold. For example, if the manifold is a $k$-dimensional Euclidean space $\mathbb{R}^k$ or a hyperbolic space $\mathbb{H}^k$, the explicit formula of the heat kernel is respectively given by
$$\kappa(u_i, u_j, t) = \frac{1}{(4\pi t)^{k/2}} \exp\left(-\frac{\rho^2}{4t}\right) \quad \text{and} \quad \kappa(u_i, u_j, t) = \frac{(-1)^m}{2^m \pi^m} \frac{1}{(4\pi t)^{\frac{1}{2}}} \left(\frac{1}{\sinh \rho}\frac{\partial}{\partial \rho}\right)^{m} e^{-m^2 t - \frac{\rho^2}{4t}}, \tag{6}$$
where $\rho = d(u_i, u_j)$ denotes the geodesic distance. The heat kernel has also been adopted for graph-related applications such as community detection [31] and graph clustering [58].
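For reference, the Euclidean case of Eqn. (6) is simply a Gaussian in the geodesic distance; a small sketch (illustrative, not from the paper):

```python
# Euclidean heat kernel, left formula of Eqn. (6):
# kappa(u_i, u_j, t) = (4 pi t)^(-k/2) exp(-rho^2 / (4t)), with rho = ||u_i - u_j||.
import numpy as np

def euclidean_heat_kernel(u_i: np.ndarray, u_j: np.ndarray, t: float) -> float:
    k = u_i.shape[0]                           # dimension of R^k
    rho_sq = float(np.sum((u_i - u_j) ** 2))   # squared geodesic distance
    return (4.0 * np.pi * t) ** (-k / 2.0) * np.exp(-rho_sq / (4.0 * t))

# Larger t spreads the kernel out, matching the Brownian-motion transition-density reading above.
```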
3 Extending Heat Kernel to GNNs
The starting point of this work is the development of the neural heat kernel, built upon the previously mentioned connection between GNNs and the heat equation. As will be discussed later, this novel notion lends us a thermodynamic perspective on the intrinsic geometric properties of the latent graph manifold embodied in GNNs, and hence paves the way for distilling geometric knowledge.
3.1 Neural Heat Kernel
Consider the graph signal $X(t)$ at time $t$ and node features $H^{(l)}$ at layer $l$ as interchangeable notions. Consequently, feature propagation using one layer of a GNN amounts to heat diffusion on the base manifold $\mathcal{M}$ within a certain time interval $\tau$, leading to the equivalence of $X(t+\tau)$ and $H^{(l+1)}$:
$$H^{(l+1)} = f_\theta(H^{(l)}, \mathcal{G}), \qquad X(t+\tau) = e^{\tau\Delta(f_\theta,\mathcal{G})} X(t), \tag{7}$$
where $f_\theta$ denotes an arbitrary GNN model with parameters $\theta$, and $\Delta(f_\theta,\mathcal{G})$ denotes a generalization of the Laplace–Beltrami operator defined over the base manifold $\mathcal{M}$ associated with the graph $\mathcal{G}$ and the arbitrary backbone GNN model $f_\theta$.
Remark. The equivalence of the two equations in Eqn. (7) is based on the recently established connection between the heat equation and GNNs [8; 7; 51; 13; 12], which reveals that the formulation of a GNN layer can be thought of as a discretisation (corresponding to the left equation in Eqn. (7)) of the continuous diffusion process (corresponding to the right equation in Eqn. (7)) described by the heat equation. Furthermore, different definitions of the Laplace–Beltrami operator $\Delta$ and schemes for solving Eqn. (2) yield different GNNs (e.g., SGC [53], GAT [49], GRAND [8]). While it is unclear whether there exists such a definition of $\Delta$ for every GNN architecture, we write the operator as $\Delta(f_\theta,\mathcal{G})$ to associate it with the model $f_\theta$, and then use the analogy between GNNs and the heat equation as an analytical tool, in a similar manner to [3; 46; 51; 45; 8; 7], for studying the geometric properties of GNNs. See more detailed justifications in Appendix E.
In light of this connection, we consider a natural generalization of the heat kernel for GNNs, termed the neural heat kernel (NHK) to highlight its difference from the heat kernel in the thermodynamic context. In particular, a single-layer NHK is defined as a positive definite symmetric kernel function denoted as $\kappa^{(l)}_\theta(v_i, v_j)$, where the subscript $\theta$ implies that it is associated with the architecture and parameters of the backbone GNN, and the superscript $(l)$ implies that it is specific to each layer, analogous to the role of continuous time $t$ in Eqn. (5).
Theorem 1. (Existence of Single-Layer NHK) Suppose the two expressions in Eqn. (7) are equivalent (see Appendix E for more discussions); then for any graph $\mathcal{G}$ and GNN model $f_\theta$, there exists a unique single-layer NHK function $\kappa^{(l)}_\theta(\cdot)$ such that for any node $v_i \in \mathcal{V}$ and $l > 0$,
$$h^{(l)}_i = \sum_{v_j \in \mathcal{V}} \kappa^{(l)}_\theta(v_i, v_j) \cdot h^{(l-1)}_j \mu(v_j), \tag{8}$$
where $h^{(l)}_i \in \mathbb{R}^d$ denotes the feature of node $v_i$ at the $l$-th layer, and $\mu$ is a measure over vertices that could be specified as the inverse of node degree, $1/d_i$.
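To see the structure of Eqn. (8) concretely, consider the toy special case of a purely linear propagation layer $H^{(l)} = P H^{(l-1)}$: matching coefficients gives $\kappa^{(l)}_\theta(v_i, v_j)\,\mu(v_j) = P_{ij}$, so with $\mu(v_j) = 1/d_j$ the kernel matrix is $K_{ij} = P_{ij}\, d_j$. The sketch below is illustrative only (it ignores the symmetry and positive-definiteness required of a proper NHK) and is not the paper's general construction.

```python
# Toy reading of Eqn. (8) for a linear layer H^(l) = P H^(l-1).
import numpy as np

def nhk_from_linear_layer(prop: np.ndarray, degrees: np.ndarray) -> np.ndarray:
    """K_ij = P_ij / mu(v_j) with mu(v_j) = 1 / d_j."""
    return prop * degrees[None, :]

def apply_nhk(kernel: np.ndarray, h_prev: np.ndarray, degrees: np.ndarray) -> np.ndarray:
    """Eqn. (8): h_i^(l) = sum_j kappa(v_i, v_j) * h_j^(l-1) * mu(v_j)."""
    return (kernel / degrees[None, :]) @ h_prev

# Sanity check: apply_nhk(nhk_from_linear_layer(P, d), H, d) reproduces P @ H.
```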
To push further, we can generalize the NHK across multiple layers of a GNN, termed a cross-layer NHK $\kappa_\theta(v_i, v_j, l \mapsto l+k)$ (e.g., from the $l$-th layer to the $(l+k)$-th layer of the GNN). Its existence can be induced recursively by the semigroup identity property of the NHK concerning consecutive GNN layers.

Theorem 2. (Semigroup Identity Property of NHK) The NHK satisfies the semigroup identity property: $\forall v_i, v_j \in \mathcal{V}$ and $l > 0$, there exists a cross-layer NHK across two consecutive layers
$$\kappa_\theta(v_i, v_j, l \mapsto l+2) = \sum_{v_k \in \mathcal{V}} \kappa^{(l+1)}_\theta(v_i, v_k)\, \kappa^{(l+2)}_\theta(v_k, v_j)\, \mu(v_k). \tag{9}$$

This theorem indicates that stacks of multiple GNN layers also constitute a valid kernel, i.e.,
$$h^{(l+k)}_i = \sum_{v_j \in \mathcal{V}} \kappa_\theta(v_i, v_j, l \mapsto l+k) \cdot h^{(l)}_j \mu(v_j). \tag{10}$$
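In matrix form, the semigroup identity of Eqn. (9) is a weighted product of consecutive single-layer kernel matrices; a minimal sketch, assuming $\mu(v_k) = 1/d_k$:

```python
# Cross-layer NHK via Eqn. (9): K(l -> l+2) = K^(l+1) diag(mu) K^(l+2).
import numpy as np

def compose_nhk(k_first: np.ndarray, k_second: np.ndarray, degrees: np.ndarray) -> np.ndarray:
    mu = np.diag(1.0 / degrees)          # measure over vertices, mu(v_k) = 1 / d_k
    return k_first @ mu @ k_second       # sums kappa(i,k) * kappa(k,j) * mu(v_k) over v_k
```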
Analogous to the heat kernel as a unique characterization of the underlying space, the NHK characterizes the geometric properties of the latent graph manifold for GNNs. Additionally, the NHK depends on the GNN model through the definition of the associated Laplace–Beltrami operator $\Delta(f_\theta,\mathcal{G})$, inheriting the expressiveness of neural networks and varying through the course of training. Intuitively, the NHK can be thought of as a model-driven encoding of topological information, encapsulating the geometric knowledge learned by GNNs in a tractable functional form.
3.2 Application in Geometric Distillation
Consider the problem of distilling geometric knowledge, which involves an intelligent teacher model $f_{\theta^*}$ that is exposed to and pre-trained over the (relatively) complete graph $\tilde{\mathcal{G}} = (\tilde{\mathcal{V}}, \tilde{\mathcal{E}})$, and a student model $f_\theta$ that is exposed to the partial graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} \subseteq \tilde{\mathcal{V}}$ and $\mathcal{E} \subseteq \{\mathcal{V} \times \mathcal{V}\} \cap \tilde{\mathcal{E}}$. Our target is to train a student model (with the help of the teacher model) that operates on $\mathcal{G}$ yet is as competitive as models operating on $\tilde{\mathcal{G}}$ during inference. Since $\mathcal{G}$ is a sub-graph of $\tilde{\mathcal{G}}$, they should lie in the same space (i.e., latent manifold) governed by the underlying mechanism of data generation, and hence we expect the student and teacher models to capture the same geometric properties of this shared space. This leads to the principle of Geometric Knowledge Distillation (GKD): transfer the geometric knowledge of the intelligent teacher to the student such that the student can propagate features as if it were aware of the complete graph topology (see the example in Fig. 1).
To this end, we resort to NHK matrices on the teacher (resp. student) model over the complete (resp. partial) graph as instantiations of their geometric knowledge, denoted as
$$\text{(Teacher)}\quad \mathbf{K}_{\theta^*}(\tilde{\mathcal{G}}, l \mapsto l+k) = \{\kappa_{\theta^*}(v_i, v_j, l \mapsto l+k)\}_{|\tilde{\mathcal{V}}| \times |\tilde{\mathcal{V}}|},$$
$$\text{(Student)}\quad \mathbf{K}_{\theta}(\mathcal{G}, l \mapsto l+k) = \{\kappa_{\theta}(v_i, v_j, l \mapsto l+k)\}_{|\mathcal{V}| \times |\mathcal{V}|},$$
written compactly as $\mathbf{K}^{(l+1)}(\mathcal{G})$ when $k = 1$. The NHK matrix is a positive semi-definite symmetric matrix and, like $\kappa$, is dependent on the GNN model $f_\theta$ and graph $\mathcal{G}$. Denote $\mathbf{K}^{(l)}_{\theta^*,\mathcal{V}}(\tilde{\mathcal{G}}) \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}$ as the sub-matrix of $\mathbf{K}^{(l)}_{\theta^*}(\tilde{\mathcal{G}})$ with row and column indices in $\mathcal{V}$. The distillation loss for GKD is
$$\mathcal{L}_{dis}(\mathbf{K}_{\theta^*,\mathcal{V}}, \mathbf{K}_{\theta}, l \mapsto l+k) = d\left(\mathbf{K}_{\theta^*,\mathcal{V}}(\tilde{\mathcal{G}}, l \mapsto l+k),\ \mathbf{K}_{\theta}(\mathcal{G}, l \mapsto l+k)\right), \tag{11}$$
where $d(\cdot,\cdot)$ is a similarity measure, for which we choose the Frobenius distance as implementation, i.e.,
$$d(\mathbf{K}_{\theta^*,\mathcal{V}}, \mathbf{K}_{\theta}) = \left\|(\mathbf{K}_{\theta^*,\mathcal{V}} - \mathbf{K}_{\theta}) \odot \mathbf{W}\right\|^2_F, \qquad \mathbf{W}_{v_i,v_j} = \begin{cases} 1 & \text{if } (v_i, v_j) \in \mathcal{E}, \\ \delta & \text{if } (v_i, v_j) \notin \mathcal{E}, \end{cases} \tag{12}$$
where $\mathbf{W} \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}$ is a weighting matrix to trade off the distillation loss with respect to different node pairs depending on their connectivity. For $k = 1$, the loss can be re-written as $\mathcal{L}^{(l+1)}_{dis}(\mathbf{K}_{\theta^*,\mathcal{V}}, \mathbf{K}_{\theta})$. Note that one can also specify different $k$ for teacher and student models in Eqn. (11) in case the teacher model is deeper.
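A minimal sketch of Eqns. (11)–(12) in PyTorch follows (illustrative function and argument names; the Hadamard weighting by $\mathbf{W}$ and the default $\delta$ here are assumptions about one plausible implementation, not the authors' released code):

```python
# GKD distillation loss: weighted Frobenius distance between the teacher's NHK
# sub-matrix (restricted to the student's node set V) and the student's NHK.
import torch

def gkd_loss(k_teacher_sub: torch.Tensor,  # teacher NHK with rows/cols indexed by V, shape |V| x |V|
             k_student: torch.Tensor,      # student NHK on the partial graph, shape |V| x |V|
             edge_mask: torch.Tensor,      # boolean mask, True where (v_i, v_j) is in E
             delta: float = 0.1) -> torch.Tensor:
    w = torch.where(edge_mask,
                    torch.ones_like(k_student),
                    torch.full_like(k_student, delta))      # weighting matrix W of Eqn. (12)
    return (((k_teacher_sub - k_student) * w) ** 2).sum()   # squared Frobenius norm
```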