
2021; Hu et al., 2021), to take advantage of both GNNs' performance benefits and MLPs' speed benefits. In this way, the student MLP can potentially obtain the graph-context knowledge transferred from the GNN teacher via KD, and thus not only perform better in practice but also enjoy model latency benefits compared to GNNs, e.g.
in production inference settings. However, these works focus on node- or graph-level tasks. Given that KD for link prediction has not been explored, and that a massive range of recommendation-system applications are posed as link prediction problems, our work aims to bridge a critical gap. Specifically, we ask:
Can we effectively distill link prediction-relevant
knowledge from GNNs to MLPs?
In this work, we focus on exploring, building upon, and
proposing cross-model (GNN to MLP) distillation tech-
niques for link prediction settings. We begin by exploring two direct KD methods for aligning student and teacher: (i) logit-based matching of predicted link-existence probabilities (Hinton et al., 2015), and (ii) representation-based matching of the generated latent node representations (Gou et al., 2021). However, we empirically observe that neither logit-based matching nor representation-based matching is powerful enough to distill sufficient knowledge for the student model to perform well on link prediction. We hypothesize that these two KD approaches fall short because link prediction, unlike node classification, heavily relies on relational graph-topological information (Martínez et al., 2016; Zhang & Chen, 2018; Yun et al., 2021; Zhao et al., 2022b), which is not well captured by such direct matching.
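To make the two direct baselines concrete, the sketch below shows one common way to implement them in PyTorch; the tensor names, temperature, and the choice of soft-target BCE and MSE are illustrative assumptions rather than the exact objectives used in our experiments.

```python
import torch
import torch.nn.functional as F

def logit_matching_loss(student_logits, teacher_logits, tau=1.0):
    """Logit-based matching (Hinton et al., 2015): soften the teacher's
    predicted link-existence probabilities with temperature tau and use
    them as soft targets for the student's predictions."""
    soft_targets = torch.sigmoid(teacher_logits.detach() / tau)
    student_probs = torch.sigmoid(student_logits / tau)
    return F.binary_cross_entropy(student_probs, soft_targets)

def representation_matching_loss(student_h, teacher_h):
    """Representation-based matching (Gou et al., 2021): align the student
    MLP's node embeddings with the frozen GNN teacher's embeddings."""
    return F.mse_loss(student_h, teacher_h.detach())
```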
To address this issue, we propose a relational KD framework,
namely LLP: our key intuition is that instead of focusing
on matching individual node pairs or node representations,
we focus on matching the relationships of each (anchor) node with respect to other (context) nodes in the graph.
Given the relational knowledge centered at the anchor node,
i.e., the teacher model’s predicted link existence probabili-
ties between the anchor node and each context node, LLP
distills it to the student model via two matching methods: (i)
rank-based matching, and (ii) distribution-based matching.
More specifically, rank-based matching trains the student model with a ranking loss over the relative ranks of all context nodes w.r.t. the anchor node, preserving crucial ranking information that is directly relevant to downstream link prediction use-cases, e.g. user-contextual friend recommendation (Sankar et al., 2021; Tang et al., 2022) or item recommendation (Ying et al., 2018a; He et al., 2020). On the other hand, distribution-based matching equips the student model with the link probability distribution over context nodes, conditioned on the anchor node. Importantly, distribution-based matching is complementary to rank-based matching, as it provides auxiliary information about the relative values of the predicted probabilities and the magnitudes of their differences.
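As a concrete illustration, the sketch below gives one plausible PyTorch formulation of the two matching objectives for a single anchor node; the margin, temperature, and the construction of the context-node score vectors are expository assumptions, not the exact hyperparameters or sampling scheme used later.

```python
import torch
import torch.nn.functional as F

def rank_matching_loss(student_scores, teacher_scores, margin=0.1):
    # Rank-based matching: for every pair of context nodes (j, k) of the same
    # anchor, push the student to preserve the teacher's ordering of predicted
    # link probabilities. Scores have shape (num_context,) for one anchor.
    teacher_scores = teacher_scores.detach()
    diff_t = teacher_scores.unsqueeze(1) - teacher_scores.unsqueeze(0)
    diff_s = student_scores.unsqueeze(1) - student_scores.unsqueeze(0)
    # +1 / -1 where the teacher clearly prefers one context node, 0 for near-ties.
    target = torch.sign(diff_t) * (diff_t.abs() > margin).float()
    return F.margin_ranking_loss(diff_s.flatten(),
                                 torch.zeros_like(diff_s).flatten(),
                                 target.flatten(), margin=margin)

def distribution_matching_loss(student_scores, teacher_scores, tau=2.0):
    # Distribution-based matching: match the student's distribution of link
    # probabilities over the context nodes (conditioned on the anchor) to the
    # teacher's, via temperature-softened softmax and KL divergence.
    p_teacher = F.softmax(teacher_scores.detach() / tau, dim=-1)
    log_p_student = F.log_softmax(student_scores / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="sum") * tau ** 2
```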
To comprehensively evaluate the effectiveness of our proposed
LLP, we conduct experiments on 8 public benchmarks. In
addition to the standard transductive setting for graph tasks,
we also design a more realistic setting that mimics practical (online) use-cases for link prediction, which we call the production setting. LLP consistently outperforms stand-alone
MLPs by 18.18 points on average under the transductive
setting and 12.01 points under the production setting on all
the datasets, and matches or outperforms teacher GNNs on
7/8 datasets under the transductive setting. Promisingly, for
cold-start nodes, LLP outperforms teacher GNNs and stand-
alone MLPs by 25.29 and 9.42 Hits@20 on average, respec-
tively. Finally, LLP infers drastically faster than GNNs, e.g.
70.68× faster on the large-scale Collab dataset.
2. Related Work and Preliminaries
We briefly discuss related work and preliminaries relevant
to contextualizing our methods and contributions. Due to
space limits, we defer more related work to Appendix A.
Notation. Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ denote an undirected graph, where $\mathcal{V}$ denotes the set of $N$ nodes and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ denotes the set of observed links. $\mathbf{A} \in \{0,1\}^{N \times N}$ denotes the adjacency matrix, where $A_{i,j} = 1$ if there exists an edge $e_{i,j}$ in $\mathcal{E}$ and $0$ otherwise. Let the matrix of node features be denoted by $\mathbf{X} \in \mathbb{R}^{N \times F}$, where each row $x_i$ is the $F$-dimensional raw feature vector of node $i$. Given that both $\mathcal{E}$ and $\mathbf{A}$ have the validation and test links masked off for link prediction, we use $a_{i,j}$ (different from $A_{i,j}$) to denote the true label of link existence between nodes $i$ and $j$, which may or may not be visible during training depending on the setting. We use $\mathcal{E}^{-} = (\mathcal{V} \times \mathcal{V}) \setminus \mathcal{E}$ to denote the no-edge node pairs that are used as negative samples during model training. We denote node representations by $\mathbf{H} \in \mathbb{R}^{N \times D}$, where $D$ is the hidden dimension. In the KD context with multiple models, we use $h_i$ and $\hat{h}_i$ to denote node $i$'s representations learned by the teacher and student models, respectively. Similarly, we use $y_{i,j}$ and $\hat{y}_{i,j}$ to denote the predictions for $a_{i,j}$ by the teacher and the student models, respectively.
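For concreteness, the following sketch shows one way this notation can map onto simple data structures for training; the negative-sampling loop and the one-negative-per-positive ratio are illustrative assumptions, not the protocol used in our experiments.

```python
import numpy as np

def build_training_data(num_nodes, train_edges, neg_per_pos=1, seed=0):
    # Adjacency matrix A built from observed (training) links only;
    # validation/test links are assumed to have been masked off already.
    rng = np.random.default_rng(seed)
    A = np.zeros((num_nodes, num_nodes), dtype=np.int8)
    for i, j in train_edges:
        A[i, j] = A[j, i] = 1  # undirected graph

    # Sample no-edge pairs from E^- = (V x V) \ E as negative examples.
    negatives = []
    while len(negatives) < neg_per_pos * len(train_edges):
        i, j = rng.integers(num_nodes, size=2)
        if i != j and A[i, j] == 0:
            negatives.append((int(i), int(j)))
    return A, negatives
```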
Link Prediction with GNNs. In this work, we take the commonly used encoder-decoder framework for the link prediction task (Kipf & Welling, 2016b; Berg et al., 2017; Schlichtkrull et al., 2018; Ying et al., 2018a; Davidson et al., 2018; Zhu et al., 2021; Yun et al., 2021; Zhao et al., 2022b), where the GNN-based encoder learns node representations and the decoder predicts link existence probabilities. Most GNNs operate under the message-passing framework, where the model iteratively updates each node $i$'s representation $h_i$ by fetching its neighbors' information. That is, the node's representation in the $l$-th layer is learned with an aggregation