
2021; Hu et al., 2021), to take advantage of both GNNs' performance benefits and MLPs' speed benefits. In this way, the student MLP can potentially obtain the graph-context knowledge transferred from the GNN teacher via KD, and thus not only perform better in practice but also enjoy model latency benefits compared to GNNs, e.g.
in production inference settings. However, these works focus on node- or graph-level tasks. Given that KD for link prediction has not been explored, and that a massive range of recommendation-system applications are posed as link prediction problems, our work aims to bridge a critical gap. Specifically, we ask:
Can we effectively distill link prediction-relevant
knowledge from GNNs to MLPs?
In this work, we focus on exploring, building upon, and
proposing cross-model (GNN to MLP) distillation tech-
niques for link prediction settings. We begin by exploring two direct KD methods for aligning student and teacher: (i) logit-based matching of predicted link-existence probabilities (Hinton et al., 2015), and (ii) representation-based matching of the generated latent node representations (Gou et al., 2021). However, we empirically observe that neither logit-based matching nor representation-based matching is powerful enough to distill sufficient knowledge for the student model to perform well on link prediction. We hypothesize that these two KD approaches fall short because link prediction, unlike node classification, heavily relies on relational graph-topological information (Martínez et al., 2016; Zhang & Chen, 2018; Yun et al., 2021; Zhao et al., 2022b), which is not well captured by such direct matching.
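To make the two direct baselines concrete, the sketch below shows one common way to implement them in PyTorch; the tensor names, temperature, and the choice of soft-target BCE and MSE are illustrative assumptions rather than the exact objectives used in our experiments.

```python
import torch
import torch.nn.functional as F

def logit_matching_loss(student_logits, teacher_logits, tau=1.0):
    """Logit-based matching (Hinton et al., 2015): soften the teacher's
    predicted link-existence probabilities with temperature tau and use
    them as soft targets for the student's predictions."""
    soft_targets = torch.sigmoid(teacher_logits.detach() / tau)
    student_probs = torch.sigmoid(student_logits / tau)
    return F.binary_cross_entropy(student_probs, soft_targets)

def representation_matching_loss(student_h, teacher_h):
    """Representation-based matching (Gou et al., 2021): align the student
    MLP's node embeddings with the frozen GNN teacher's embeddings."""
    return F.mse_loss(student_h, teacher_h.detach())
```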
To address this issue, we propose a relational KD framework,
namely LLP: our key intuition is that instead of focusing
on matching individual node pairs or node representations,
we focus on matching the relationships of each (anchor) node with respect to other (context) nodes in the graph.
Given the relational knowledge centered at the anchor node,
i.e., the teacher model’s predicted link existence probabili-
ties between the anchor node and each context node, LLP
distills it to the student model via two matching methods: (i)
rank-based matching, and (ii) distribution-based matching.
More specifically, rank-based matching trains the student model with a ranking loss over the relative ranks of all context nodes w.r.t. the anchor node, preserving crucial ranking information that is directly relevant to downstream link prediction use-cases, e.g. user-contextual friend recommendation (Sankar et al., 2021; Tang et al., 2022) or item recommendation (Ying et al., 2018a; He et al., 2020). On the other hand, distribution-based matching equips the student model with the link probability distribution over context nodes, conditioned on the anchor node. Importantly, distribution-based matching is complementary to rank-based matching, as it provides auxiliary information about the relative values of the predicted probabilities and the magnitudes of their differences.
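As a concrete illustration, the sketch below gives one plausible PyTorch formulation of the two matching objectives for a single anchor node; the margin, temperature, and the construction of the context-node score vectors are expository assumptions, not the exact hyperparameters or sampling scheme used later.

```python
import torch
import torch.nn.functional as F

def rank_matching_loss(student_scores, teacher_scores, margin=0.1):
    # Rank-based matching: for every pair of context nodes (j, k) of the same
    # anchor, push the student to preserve the teacher's ordering of predicted
    # link probabilities. Scores have shape (num_context,) for one anchor.
    teacher_scores = teacher_scores.detach()
    diff_t = teacher_scores.unsqueeze(1) - teacher_scores.unsqueeze(0)
    diff_s = student_scores.unsqueeze(1) - student_scores.unsqueeze(0)
    # +1 / -1 where the teacher clearly prefers one context node, 0 for near-ties.
    target = torch.sign(diff_t) * (diff_t.abs() > margin).float()
    return F.margin_ranking_loss(diff_s.flatten(),
                                 torch.zeros_like(diff_s).flatten(),
                                 target.flatten(), margin=margin)

def distribution_matching_loss(student_scores, teacher_scores, tau=2.0):
    # Distribution-based matching: match the student's distribution of link
    # probabilities over the context nodes (conditioned on the anchor) to the
    # teacher's, via temperature-softened softmax and KL divergence.
    p_teacher = F.softmax(teacher_scores.detach() / tau, dim=-1)
    log_p_student = F.log_softmax(student_scores / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="sum") * tau ** 2
```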
To comprehensively evaluate the effectiveness of our proposed
LLP, we conduct experiments on 8 public benchmarks. In
addition to the standard transductive setting for graph tasks,
we also design a more realistic setting that mimics practical (online) use-cases for link prediction, which we call the production setting. LLP consistently outperforms stand-alone
MLPs by 18.18 points on average under the transductive
setting and 12.01 points under the production setting on all
the datasets, and matches or outperforms teacher GNNs on
7/8 datasets under the transductive setting. Promisingly, for
cold-start nodes, LLP outperforms teacher GNNs and stand-
alone MLPs by 25.29 and 9.42 Hits@20 on average, respec-
tively. Finally, LLP infers drastically faster than GNNs, e.g.
70.68× faster on the large-scale Collab dataset.
2. Related Work and Preliminaries
We briefly discuss related work and preliminaries relevant
to contextualizing our methods and contributions. Due to
space limits, we defer more related work to Appendix A.
Notation. Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ denote an undirected graph, where $\mathcal{V}$ denotes the set of $N$ nodes and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ denotes the set of observed links. $\mathbf{A} \in \{0,1\}^{N \times N}$ denotes the adjacency matrix, where $A_{i,j} = 1$ if there exists an edge $e_{i,j}$ in $\mathcal{E}$ and $0$ otherwise. Let the matrix of node features be denoted by $\mathbf{X} \in \mathbb{R}^{N \times F}$, where each row $x_i$ is the $F$-dimensional raw feature vector of node $i$. Given that both $\mathcal{E}$ and $\mathbf{A}$ have the validation and test links masked off for link prediction, we use $a_{i,j}$ (different from $A_{i,j}$) to denote the true label of link existence between nodes $i$ and $j$, which may or may not be visible during training depending on the setting. We use $\mathcal{E}^{-} = (\mathcal{V} \times \mathcal{V}) \setminus \mathcal{E}$ to denote the no-edge node pairs that are used as negative samples during model training. We denote node representations by $\mathbf{H} \in \mathbb{R}^{N \times D}$, where $D$ is the hidden dimension. In the KD context with multiple models, we use $h_i$ and $\hat{h}_i$ to denote node $i$'s representations learned by the teacher and student models, respectively. Similarly, we use $y_{i,j}$ and $\hat{y}_{i,j}$ to denote the predictions for $a_{i,j}$ by the teacher and the student models, respectively.
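For concreteness, the following sketch shows one way this notation can map onto simple data structures for training; the negative-sampling loop and the one-negative-per-positive ratio are illustrative assumptions, not the protocol used in our experiments.

```python
import numpy as np

def build_training_data(num_nodes, train_edges, neg_per_pos=1, seed=0):
    # Adjacency matrix A built from observed (training) links only;
    # validation/test links are assumed to have been masked off already.
    rng = np.random.default_rng(seed)
    A = np.zeros((num_nodes, num_nodes), dtype=np.int8)
    for i, j in train_edges:
        A[i, j] = A[j, i] = 1  # undirected graph

    # Sample no-edge pairs from E^- = (V x V) \ E as negative examples.
    negatives = []
    while len(negatives) < neg_per_pos * len(train_edges):
        i, j = rng.integers(num_nodes, size=2)
        if i != j and A[i, j] == 0:
            negatives.append((int(i), int(j)))
    return A, negatives
```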
Link Prediction with GNNs. In this work, we take the commonly used encoder-decoder framework for the link prediction task (Kipf & Welling, 2016b; Berg et al., 2017; Schlichtkrull et al., 2018; Ying et al., 2018a; Davidson et al., 2018; Zhu et al., 2021; Yun et al., 2021; Zhao et al., 2022b), where the GNN-based encoder learns node representations and the decoder predicts link existence probabilities. Most GNNs operate under the message-passing framework, where the model iteratively updates each node $i$'s representation $h_i$ by fetching its neighbors' information. That is, the node's representation in the $l$-th layer is learned with an aggregation