Geometric Knowledge Distillation:
Topology Compression for Graph Neural Networks
Chenxiao Yang, Qitian Wu, Junchi Yan
Department of Computer Science and Engineering
MoE Key Lab of Artificial Intelligence
Shanghai Jiao Tong University
{chr26195,echo740,yanjunchi}@sjtu.edu.cn
Abstract
We study a new paradigm of knowledge transfer that aims at encoding graph topological information into graph neural networks (GNNs) by distilling knowledge from a teacher GNN model trained on a complete graph to a student GNN model operating on a smaller or sparser graph. To this end, we revisit the connection between thermodynamics and the behavior of GNNs, based on which we propose Neural Heat Kernel (NHK) to encapsulate the geometric property of the underlying manifold concerning the architecture of GNNs. A fundamental and principled solution is derived by aligning NHKs on teacher and student models, dubbed as Geometric Knowledge Distillation. We develop non-parametric and parametric instantiations and demonstrate their efficacy in various experimental settings for knowledge distillation regarding different types of privileged topological information and teacher-student schemes.
1 Introduction
Modern graph neural networks (GNNs) [28; 49; 53] have shown remarkable performance in learning representations for structured instances. From the perspective of geometric deep learning [5; 4; 38], part of the achievement of GNNs can be attributed to their implementation of the permutation invariance property as geometric priors² into the architecture design. Nevertheless, in practice, GNNs highly rely on graph topology, as essential input information, to explore the relational knowledge implicit in interactions of instance pairs throughout the entire message passing process, termed as geometric knowledge in this paper. As advances in generalized distillation [33; 47] reveal the possibility of encoding input features into model construction, natural questions arise:

Is it possible, and if so, how can we encode graph topology as a special type of 'geometric prior' into a GNN model, such that the model could precisely capture the underlying geometric knowledge even without full graph topology as input?
Specifically, we are interested in the following geometric knowledge transfer problem: a GNN model (with node-specific outputs for node-level prediction [23]) is exposed to a partial graph, which is a subset of the complete graph. Formally, we have the notation:
$$\mathcal{G}=\{\mathcal{V},\mathcal{E}\}\ \text{(partial graph)},\qquad \tilde{\mathcal{G}}=\{\tilde{\mathcal{V}},\tilde{\mathcal{E}}\}\ \text{(complete graph)},\qquad \text{where } \mathcal{V}\subseteq\tilde{\mathcal{V}},\ \mathcal{E}\subseteq\{\mathcal{V}\times\mathcal{V}\}\cap\tilde{\mathcal{E}}. \tag{1}$$
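To make the setup concrete, the following is a minimal sketch (hypothetical helper and variable names, plain Python sets; not from the paper) of the relation in Eqn. (1): the student keeps a node subset and at most the edges of the complete graph among those nodes.

```python
# Minimal sketch of Eqn. (1): V is a subset of V_tilde, and E is a subset of
# {V x V} intersected with E_tilde. All names are illustrative.
from typing import Set, Tuple

Edge = Tuple[int, int]

def restrict_graph(v_tilde: Set[int], e_tilde: Set[Edge],
                   v: Set[int], e_kept: Set[Edge]) -> Tuple[Set[int], Set[Edge]]:
    """Return the partial graph (V, E) observed by the student."""
    assert v <= v_tilde, "V must be a subset of V_tilde"
    e = {(i, j) for (i, j) in e_kept if i in v and j in v and (i, j) in e_tilde}
    return v, e
```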
Our goal is to transfer or encode geometric knowledge extracted from $\tilde{\mathcal{G}}$ to the target GNN model that is only aware of $\mathcal{G}$. Studying this problem is also of much practical value.
Junchi Yan is the corresponding author, who is also with Shanghai AI Laboratory.
² Geometric priors originally refer to the geometric principles naturally encoded in deep learning architectures, e.g., translational symmetry for CNNs, permutation invariance for GNNs.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.13014v2 [cs.LG] 15 Jan 2023
Figure 1: Feature propagation on the underlying manifold $\mathcal{M}$. (a) Teacher: aware of the complete graph topology, and faithfully explores geometric knowledge about the underlying manifold. (b) Student before GKD: only aware of partial graph topology, and estimates biased geometric properties. (c) Student after GKD: able to propagate features on the same space as the teacher by alignment of NHKs.
As a non-exhaustive list of applications: improving efficiency without compromising effectiveness for coarsened graphs [14; 24; 69]; privacy-constrained scenarios in social recommenders or federated learning where the complete graph is unavailable [34; 50; 68]; and promoting concentration on a targeted community to bring economic benefits [57].
Achieving this target is non-trivial because we first need to find a principled and fundamental way to encapsulate the geometric knowledge extracted by a GNN model, which requires an in-depth investigation of the role of graph topology throughout the progressive process of message passing. Therefore, we take a thermodynamic view borrowed from physics and propose a new methodology built upon recent advances revealing the connection between heat diffusion and the architectures of GNNs [8; 51; 7]. Specifically, we interpret feature propagation as heat flows on the underlying Riemannian manifold, whose characteristics (which depend on graph topology and the GNN model) pave the way for a principled representation of the latent geometric knowledge.
1.1 Our Contributions
New theoretical perspective for analyzing latent graph geometry. On top of the connection between the heat equation and GNNs, we step further to inspect the implication of the heat kernel for GNNs, and propose a novel notion of Neural Heat Kernel (NHK) with rigorous proof of its existence. The heat kernel intrinsically defines the unique solution to the heat equation and can be a fundamental characterization of the geometric properties of the underlying manifold [18; 19]. Likewise, NHK uncovers geometric properties of the latent graph manifold for GNNs and governs how information flows between pairs of instances, which lends us a mathematical tool to encapsulate geometric knowledge extracted from a GNN model and enables geometric knowledge transfer. This result alone might also be useful in broader contexts for understanding GNNs.
Flexible distillation framework with versatile instantiations. Based on the above insights, we treat NHK matrices as representations of the latent geometric knowledge, upon which we build a flexible and principled distillation framework dubbed Geometric Knowledge Distillation (GKD), which aims at encoding and transferring geometric knowledge by aligning the latent manifolds behind GNN models, as illustrated in Fig. 1. We also develop non-parametric and parametric versions of GKD, corresponding to different ways of approximating the NHK computation. Specifically, the former derives explicit NHKs via assumptions on the latent space, and the latter learns the NHK in a data-driven manner.
Applications for geometric knowledge transfer and conventional KD purposes. We verify the efficacy of GKD in terms of different geometric knowledge types (i.e., edge-aware and node-aware ones), and further show its effectiveness for conventional KD purposes (e.g., model compression, self-distillation, online distillation) for broader applicability. We highlight that our methods consistently exceed the teacher model and rival the oracle model that in principle gives the performance upper bound.
1.2 Links to Related Works
Geometric Deep Learning. The study of geometric deep learning [5; 38] provides fundamental principles and methodology to generalize deep learning methods to non-Euclidean domains (e.g., graphs and manifolds). From this perspective, architectures for off-the-shelf GNNs [28; 49; 53; 20] and graph Transformers [67; 56] have naturally incorporated geometric prior knowledge for graphs such as permutation invariance. Despite their remarkable success, they rely heavily on the graph topology. This work extends the idea of geometric deep learning by treating the global graph topology as a special type of prior knowledge and attempts to encode it into GNNs themselves, such that the trained model can leverage information from the global graph topology even without explicitly taking it as input.
Graph-Based Knowledge Distillation. Knowledge distillation (KD) [22; 1] uses the outputs of a teacher model as alternative supervision signals to teach a student model, with various new paradigms including feature-based [42; 66] and relation-based [64; 65] ones. While some prior arts [9; 61; 50; 69; 63; 60] attempted to combine KD and GNNs, i.e., graph-based KD, they are largely straightforward adaptations of KD without an in-depth investigation of the role of graph topology, and are also restricted to a specific choice of GNN architecture or application scenario. In contrast, we first formalize the problem of geometric knowledge transfer, theoretically answer the question "how to represent graph geometric knowledge and encode it into GNN models", and propose a geometric distillation approach based on the theoretical results. More discussions on related works are deferred to Appendix G.
2 Preliminaries
We commence with a brief detour to the heat equation on Riemannian manifolds and its connection with modern GNN architectures. Moreover, we bring forth the notion of the heat kernel to motivate this work.
Heat Equation on Manifolds. We are interested in the heat equation defined on a smooth $k$-dimensional Riemannian manifold $\mathcal{M}$. Suppose the manifold is associated with a scalar- or vector-valued function $x(u, t) : \mathcal{M} \times [0,\infty) \rightarrow \mathbb{R}^d$, quantifying a specific type of signal such as heat at a point $u \in \mathcal{M}$ and time $t$. Fourier's law of heat conductivity describes the flow of heat with respect to time and space via a partial differential equation (PDE) called the heat equation [6], i.e.,
$$\frac{\partial x(u, t)}{\partial t} = c\,\Delta x(u, t), \tag{2}$$
where $c > 0$ is the thermal conductivity coefficient, and $\Delta$ is the natural Laplace–Beltrami operator associated with $\mathcal{M}$. Rewriting $\Delta$ as the functional composition of the divergence operator $\nabla\cdot$ and the gradient operator $\nabla$, i.e., $\Delta = \nabla\cdot \circ \nabla$, we can interpret the heat equation as: the variation of temperature within an infinitesimal time interval at a point is equivalent to the divergence between its own temperature and the average temperature on an infinitesimal sphere around it.
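For intuition, here is a small sketch (not from the paper) of Eqn. (2) discretised on a one-dimensional grid with unit spacing: the Laplace–Beltrami operator reduces to a second difference, and one forward-Euler step moves each point toward the average of its neighbours.

```python
# One explicit (forward-Euler) step of the 1-D heat equation, Eqn. (2).
# Grid spacing is 1; the scheme is stable for c * dt <= 0.5. Illustrative only.
import numpy as np

def heat_step(x: np.ndarray, c: float = 1.0, dt: float = 0.1) -> np.ndarray:
    lap = np.zeros_like(x)
    lap[1:-1] = x[:-2] - 2.0 * x[1:-1] + x[2:]   # discrete divergence of the gradient
    return x + dt * c * lap                       # endpoints held fixed (Dirichlet boundary)
```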
Implications on Graphs. A spatial discretisation of a continuous manifold yields a graph $\mathcal{G}=\{\mathcal{V},\mathcal{E}\}$, whose nodes can be thought of as embedded on the base manifold. In fact, the heat equation along with variants thereof (e.g., the Schrödinger equation) have found widespread use in modeling graph dynamics [11; 25; 36]. More importantly, it has recently been revealed to be intimately related to the architectures of modern GNNs [51; 8; 7]: suppose $X(0) = \{x(u, 0)\}_{u\in\mathcal{V}} \in \mathbb{R}^{n\times d}$ denotes the initial condition for Eqn. (2) determined by input node features; then solving the heat equation under certain definitions of $\nabla$ and $\nabla\cdot$ (i.e., the definition of $\Delta$) amounts to different architectures of GNNs. For instance:
Example 1. [51] Define the discretised counterpart of $\Delta$ as the graph Laplacian matrix $L = \tilde{D}^{-\frac{1}{2}}(\tilde{D}-\tilde{A})\tilde{D}^{-\frac{1}{2}}$. Numerically solving Eqn. (2) using the forward Euler method with step size $\tau = 1$ yields the formulation of Simple Graph Convolution (SGC) [53], where $\Theta$ denotes a learnable transformation matrix:
$$\hat{X}(t) = \left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}\right)^{t} X(0), \qquad \hat{Y} = \mathrm{softmax}\left(\hat{X}(t)\,\Theta\right). \tag{3}$$
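As a concrete reading of Eqn. (3), the sketch below (plain NumPy with illustrative names; not the authors' implementation) runs $t$ forward-Euler steps with the symmetrically normalized, self-loop-augmented adjacency, which is SGC's propagation before the final linear layer and softmax.

```python
# Sketch of SGC propagation, Eqn. (3): X(t) = (D~^-1/2 A~ D~^-1/2)^t X(0).
import numpy as np

def sgc_propagate(adj: np.ndarray, feats: np.ndarray, n_steps: int) -> np.ndarray:
    a_tilde = adj + np.eye(adj.shape[0])               # A~ = A + I (self-loops)
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))    # diagonal of D~^-1/2
    prop = d_inv_sqrt[:, None] * a_tilde * d_inv_sqrt[None, :]
    x = feats
    for _ in range(n_steps):                           # each iteration = one Euler step (tau = 1)
        x = prop @ x
    return x

# A trained Theta then gives predictions: Y_hat = softmax(sgc_propagate(A, X, t) @ Theta).
```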
Example 2. [8] Define the gradient operator $(\nabla)_{ij}$ as the difference of source and target node features, and the divergence operator $(\nabla\cdot)_{i}$ as the sum of features of all edges incident to the node. Numerically solving Eqn. (2) using the explicit Euler scheme with step size $\tau$ yields the following recursive formulation
$$\hat{X}(t+\tau) = \tau\left(G - I\right)\hat{X}(t) + \hat{X}(t), \tag{4}$$
where $G$ is a diffusivity coefficient matrix in place of $c$.

Moreover, stacking a non-linear transformation layer after each step yields the formulation of Graph Convolutional Networks (GCN) [28] for Eqn. (3), Graph Attention Networks (GAT) [49] with residual connection for Eqn. (4), and even more GNN architectures by virtue of the flexibility in interpreting the heat equation on graphs.
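For Eqn. (4), one explicit-Euler step can be sketched as below. This is illustrative code: the diffusivity matrix $G$ is built here as a simple row-normalized attention over existing edges, which is one plausible choice rather than the exact parametrization of [8].

```python
# One explicit-Euler diffusion step, Eqn. (4): X(t + tau) = tau (G - I) X(t) + X(t).
import numpy as np

def diffusion_step(x: np.ndarray, adj: np.ndarray, tau: float = 0.5) -> np.ndarray:
    mask = (adj + np.eye(len(x))) > 0                   # allow self-interaction to avoid empty rows
    scores = x @ x.T                                    # pairwise feature similarity
    scores = np.where(mask, scores, -np.inf)            # restrict diffusion to existing edges
    g = np.exp(scores - scores.max(axis=1, keepdims=True))
    g = g / g.sum(axis=1, keepdims=True)                # row-stochastic diffusivity matrix G
    return tau * (g - np.eye(len(x))) @ x + x
```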
Heat Kernels. Intriguingly, it turns out that the initial value problem of the heat equation on any manifold $\mathcal{M}$ has a smallest positive fundamental solution depending on the Laplace operator $\Delta$, known as the heat kernel [2]. It is denoted as a kernel function $\kappa(x, y, t)$, such that
$$x(u_i, t) = e^{t\Delta} x(u_i, 0) = \int_{\mathcal{M}} \kappa(u_i, u_j, t)\, x(u_j, 0)\, d\mu(u_j), \tag{5}$$
where $\mu$ is a non-negative measure associated with $\mathcal{M}$. In physics, the heat kernel $\kappa(x, y, t)$ can be interpreted as a transition density that describes the asymptotic behavior of a natural Brownian motion on the manifold. Its formulation thus can be treated as a unique reflection or representation of the geometry of the underlying manifold. For example, if the manifold is a $k$-dimensional Euclidean space $\mathbb{R}^k$ or a hyperbolic space $\mathbb{H}^k$, the explicit formula of the heat kernel is respectively given by
$$\kappa(u_i, u_j, t) = \frac{1}{(4\pi t)^{k/2}} \exp\left(-\frac{\rho^2}{4t}\right) \quad \text{and} \quad \kappa(u_i, u_j, t) = \frac{(-1)^m}{2^m \pi^m} \frac{1}{(4\pi t)^{\frac{1}{2}}} \left(\frac{1}{\sinh \rho}\frac{\partial}{\partial \rho}\right)^{m} e^{-m^2 t - \frac{\rho^2}{4t}}, \tag{6}$$
where $\rho = d(u_i, u_j)$ denotes the geodesic distance. The heat kernel has also been adopted for graph-related applications such as community detection [31] and graph clustering [58].
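For reference, the Euclidean case of Eqn. (6) is simply a Gaussian in the geodesic distance; a small sketch (illustrative, not from the paper):

```python
# Euclidean heat kernel, left formula of Eqn. (6):
# kappa(u_i, u_j, t) = (4 pi t)^(-k/2) exp(-rho^2 / (4t)), with rho = ||u_i - u_j||.
import numpy as np

def euclidean_heat_kernel(u_i: np.ndarray, u_j: np.ndarray, t: float) -> float:
    k = u_i.shape[0]                           # dimension of R^k
    rho_sq = float(np.sum((u_i - u_j) ** 2))   # squared geodesic distance
    return (4.0 * np.pi * t) ** (-k / 2.0) * np.exp(-rho_sq / (4.0 * t))

# Larger t spreads the kernel out, matching the Brownian-motion transition-density reading above.
```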
3 Extending Heat Kernel to GNNs
The starting point of this work is the development of the neural heat kernel, built upon the previously mentioned connection between GNNs and the heat equation. As will be discussed later, this novel notion lends us a thermodynamic perspective on the intrinsic geometric properties of the latent graph manifold embodied in GNNs, and hence paves the way for distilling geometric knowledge.
3.1 Neural Heat Kernel
Consider the graph signal $X(t)$ at time $t$ and node features $H^{(l)}$ at layer $l$ as interchangeable notions. Consequently, feature propagation using one layer of a GNN amounts to heat diffusion on the base manifold $\mathcal{M}$ within a certain time interval $\tau$, leading to the equivalence of $X(t+\tau)$ and $H^{(l+1)}$:
$$H^{(l+1)} = f_\theta(H^{(l)}, \mathcal{G}), \qquad X(t+\tau) = e^{\tau\Delta(f_\theta,\mathcal{G})} X(t), \tag{7}$$
where $f_\theta$ denotes an arbitrary GNN model with parameters $\theta$, and $\Delta(f_\theta,\mathcal{G})$ denotes a generalization of the Laplace–Beltrami operator defined over the base manifold $\mathcal{M}$ associated with the graph $\mathcal{G}$ and the arbitrary backbone GNN model $f_\theta$.
Remark. The equivalence of the two equations in Eqn. (7) is based on the recently established connection between the heat equation and GNNs [8; 7; 51; 13; 12], which reveals that the formulation of a GNN layer can be thought of as a discretisation (corresponding to the left equation in Eqn. (7)) of the continuous diffusion process (corresponding to the right equation in Eqn. (7)) described by the heat equation. Furthermore, different definitions of the Laplace–Beltrami operator $\Delta$ and schemes for solving Eqn. (2) yield different GNNs (e.g., SGC [53], GAT [49], GRAND [8]). While it is unclear whether there exists such a definition of $\Delta$ for every GNN architecture, we write the operator as $\Delta(f_\theta,\mathcal{G})$ to associate it with the model $f_\theta$, and then use the analogy between GNNs and the heat equation as an analytical tool, in a similar manner to [3; 46; 51; 45; 8; 7], for studying the geometric properties of GNNs. See more detailed justifications in Appendix E.
In light of this connection, we consider a natural generalization of the heat kernel for GNNs, termed the neural heat kernel (NHK) to highlight its difference from the heat kernel in the thermodynamic context. In particular, a single-layer NHK is defined as a positive definite symmetric kernel function denoted as $\kappa^{(l)}_\theta(v_i, v_j)$, where the subscript $\theta$ implies that it is associated with the architecture and parameters of the backbone GNN, and the superscript $(l)$ implies that it is specific to each layer, analogous to the role of continuous time $t$ in Eqn. (5).
Theorem 1. (Existence of Single-Layer NHK) Suppose the two expressions in Eqn. (7) are equivalent (see Appendix E for more discussions); then for any graph $\mathcal{G}$ and GNN model $f_\theta$, there exists a unique single-layer NHK function $\kappa^{(l)}_\theta(\cdot)$ such that for any node $v_i \in \mathcal{V}$ and $l > 0$,
$$h^{(l)}_i = \sum_{v_j \in \mathcal{V}} \kappa^{(l)}_\theta(v_i, v_j) \cdot h^{(l-1)}_j \mu(v_j), \tag{8}$$
where $h^{(l)}_i \in \mathbb{R}^d$ denotes the feature of node $v_i$ at the $l$-th layer, and $\mu$ is a measure over vertices that could be specified as the inverse of node degree, $1/d_i$.
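To see the structure of Eqn. (8) concretely, consider the toy special case of a purely linear propagation layer $H^{(l)} = P H^{(l-1)}$: matching coefficients gives $\kappa^{(l)}_\theta(v_i, v_j)\,\mu(v_j) = P_{ij}$, so with $\mu(v_j) = 1/d_j$ the kernel matrix is $K_{ij} = P_{ij}\, d_j$. The sketch below is illustrative only (it ignores the symmetry and positive-definiteness required of a proper NHK) and is not the paper's general construction.

```python
# Toy reading of Eqn. (8) for a linear layer H^(l) = P H^(l-1).
import numpy as np

def nhk_from_linear_layer(prop: np.ndarray, degrees: np.ndarray) -> np.ndarray:
    """K_ij = P_ij / mu(v_j) with mu(v_j) = 1 / d_j."""
    return prop * degrees[None, :]

def apply_nhk(kernel: np.ndarray, h_prev: np.ndarray, degrees: np.ndarray) -> np.ndarray:
    """Eqn. (8): h_i^(l) = sum_j kappa(v_i, v_j) * h_j^(l-1) * mu(v_j)."""
    return (kernel / degrees[None, :]) @ h_prev

# Sanity check: apply_nhk(nhk_from_linear_layer(P, d), H, d) reproduces P @ H.
```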
To push further, we can generalize the NHK across multiple layers of a GNN, termed a cross-layer NHK $\kappa_\theta(v_i, v_j, l \mapsto l+k)$ (e.g., from the $l$-th layer to the $(l+k)$-th layer of the GNN). Its existence can be induced recursively by the semigroup identity property of the NHK concerning consecutive GNN layers.

Theorem 2. (Semigroup Identity Property of NHK) The NHK satisfies the semigroup identity property: $\forall v_i, v_j \in \mathcal{V}$ and $l > 0$, there exists a cross-layer NHK across two consecutive layers
$$\kappa_\theta(v_i, v_j, l \mapsto l+2) = \sum_{v_k \in \mathcal{V}} \kappa^{(l+1)}_\theta(v_i, v_k)\, \kappa^{(l+2)}_\theta(v_k, v_j)\, \mu(v_k). \tag{9}$$

This theorem indicates that stacks of multiple GNN layers also constitute a valid kernel, i.e.,
$$h^{(l+k)}_i = \sum_{v_j \in \mathcal{V}} \kappa_\theta(v_i, v_j, l \mapsto l+k) \cdot h^{(l)}_j \mu(v_j). \tag{10}$$
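In matrix form, the semigroup identity of Eqn. (9) is a weighted product of consecutive single-layer kernel matrices; a minimal sketch, assuming $\mu(v_k) = 1/d_k$:

```python
# Cross-layer NHK via Eqn. (9): K(l -> l+2) = K^(l+1) diag(mu) K^(l+2).
import numpy as np

def compose_nhk(k_first: np.ndarray, k_second: np.ndarray, degrees: np.ndarray) -> np.ndarray:
    mu = np.diag(1.0 / degrees)          # measure over vertices, mu(v_k) = 1 / d_k
    return k_first @ mu @ k_second       # sums kappa(i,k) * kappa(k,j) * mu(v_k) over v_k
```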
Analogous to the heat kernel as a unique characterization of the underlying space, the NHK characterizes the geometric properties of the latent graph manifold for GNNs. Additionally, the NHK depends on the GNN model through the definition of the associated Laplace–Beltrami operator $\Delta(f_\theta,\mathcal{G})$, inheriting the expressiveness of neural networks and varying through the course of training. Intuitively, the NHK can be thought of as a model-driven encoding of topological information, encapsulating the geometric knowledge learned by GNNs in a tractable functional form.
3.2 Application in Geometric Distillation
Consider the problem of distilling geometric knowledge, which involves an intelligent teacher model $f_{\theta^*}$ that is exposed to and pre-trained over the (relatively) complete graph $\tilde{\mathcal{G}} = (\tilde{\mathcal{V}}, \tilde{\mathcal{E}})$, and a student model $f_\theta$ that is exposed to the partial graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} \subseteq \tilde{\mathcal{V}}$ and $\mathcal{E} \subseteq \{\mathcal{V} \times \mathcal{V}\} \cap \tilde{\mathcal{E}}$. Our target is to train a student model (with the help of the teacher model) that operates on $\mathcal{G}$ yet is as competitive as models operating on $\tilde{\mathcal{G}}$ during inference. Since $\mathcal{G}$ is a sub-graph of $\tilde{\mathcal{G}}$, they should lie in the same space (i.e., latent manifold) governed by the underlying mechanism of data generation, and hence we expect the student and teacher models to capture the same geometric properties of this shared space. This leads to the principle of Geometric Knowledge Distillation (GKD): transfer the geometric knowledge of the intelligent teacher to the student such that the student can propagate features as if it were aware of the complete graph topology (see the example in Fig. 1).
To this end, we resort to NHK matrices on the teacher (resp. student) model over the complete (resp. partial) graph as instantiations of their geometric knowledge, denoted as
$$\text{(Teacher)}\quad \mathbf{K}_{\theta^*}(\tilde{\mathcal{G}}, l \mapsto l+k) = \{\kappa_{\theta^*}(v_i, v_j, l \mapsto l+k)\}_{|\tilde{\mathcal{V}}| \times |\tilde{\mathcal{V}}|},$$
$$\text{(Student)}\quad \mathbf{K}_{\theta}(\mathcal{G}, l \mapsto l+k) = \{\kappa_{\theta}(v_i, v_j, l \mapsto l+k)\}_{|\mathcal{V}| \times |\mathcal{V}|},$$
written compactly as $\mathbf{K}^{(l+1)}(\mathcal{G})$ when $k = 1$. The NHK matrix is a positive semi-definite symmetric matrix and, like $\kappa$, is dependent on the GNN model $f_\theta$ and graph $\mathcal{G}$. Denote $\mathbf{K}^{(l)}_{\theta^*,\mathcal{V}}(\tilde{\mathcal{G}}) \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}$ as the sub-matrix of $\mathbf{K}^{(l)}_{\theta^*}(\tilde{\mathcal{G}})$ with row and column indices in $\mathcal{V}$. The distillation loss for GKD is
$$\mathcal{L}_{dis}(\mathbf{K}_{\theta^*,\mathcal{V}}, \mathbf{K}_{\theta}, l \mapsto l+k) = d\left(\mathbf{K}_{\theta^*,\mathcal{V}}(\tilde{\mathcal{G}}, l \mapsto l+k),\ \mathbf{K}_{\theta}(\mathcal{G}, l \mapsto l+k)\right), \tag{11}$$
where $d(\cdot,\cdot)$ is a similarity measure, for which we choose the Frobenius distance as implementation, i.e.,
$$d(\mathbf{K}_{\theta^*,\mathcal{V}}, \mathbf{K}_{\theta}) = \left\|(\mathbf{K}_{\theta^*,\mathcal{V}} - \mathbf{K}_{\theta}) \odot \mathbf{W}\right\|^2_F, \qquad \mathbf{W}_{v_i,v_j} = \begin{cases} 1 & \text{if } (v_i, v_j) \in \mathcal{E}, \\ \delta & \text{if } (v_i, v_j) \notin \mathcal{E}, \end{cases} \tag{12}$$
where $\mathbf{W} \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}$ is a weighting matrix to trade off the distillation loss with respect to different node pairs depending on their connectivity. For $k = 1$, the loss can be re-written as $\mathcal{L}^{(l+1)}_{dis}(\mathbf{K}_{\theta^*,\mathcal{V}}, \mathbf{K}_{\theta})$. Note that one can also specify different $k$ for teacher and student models in Eqn. (11) in case the teacher model is deeper.
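A minimal sketch of Eqns. (11)–(12) in PyTorch follows (illustrative function and argument names; the Hadamard weighting by $\mathbf{W}$ and the default $\delta$ here are assumptions about one plausible implementation, not the authors' released code):

```python
# GKD distillation loss: weighted Frobenius distance between the teacher's NHK
# sub-matrix (restricted to the student's node set V) and the student's NHK.
import torch

def gkd_loss(k_teacher_sub: torch.Tensor,  # teacher NHK with rows/cols indexed by V, shape |V| x |V|
             k_student: torch.Tensor,      # student NHK on the partial graph, shape |V| x |V|
             edge_mask: torch.Tensor,      # boolean mask, True where (v_i, v_j) is in E
             delta: float = 0.1) -> torch.Tensor:
    w = torch.where(edge_mask,
                    torch.ones_like(k_student),
                    torch.full_like(k_student, delta))      # weighting matrix W of Eqn. (12)
    return (((k_teacher_sub - k_student) * w) ** 2).sum()   # squared Frobenius norm
```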