knowledge from high-performance but resource-intensive teacher models to resource-efficient students [13]. Motivated by the promising results of MLP-like models in computer vision [23, 24, 26],
graph-less neural networks (GLNN) [41] were proposed to transfer
the graph knowledge from GNNs to standard MLP students via KD.
This idea improves the performance of MLPs on node classification while remaining easy to deploy in production systems, since message passing is discarded.
However, traditional MLPs cannot fully capture graph knowledge due to the lack of structure inputs. As shown in GLNN [41], distillation may fail when node labels are highly correlated with structure information, e.g., on heterophilous datasets. Hence, the improvement from distillation is mainly attributed to the strong memorization ability of MLPs [31].
To better understand the limitations of GLNN, we consider two scenarios, i.e., transductive (trans) and inductive (ind), according to the graph information available during the training phase. In the trans setting [20], all node features and graph structures are given at training time, so student MLPs can overfit (memorize) the teacher's outputs on all nodes, which leads to superior performance [41]. Nevertheless, in the more challenging ind scenario [10], where test node information is unavailable during training, an MLP without graph dependency has limited generalizability on these test nodes. In this scenario, the structure information is a key clue that binds the training and test nodes to improve generalization. Furthermore, the standard logit-based KD [13], which merely considers label information on existing nodes, cannot fully transfer the graph knowledge due to the sparsity of the graph structure [21].
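For reference, the standard logit-based KD objective [13] only matches each student prediction to the corresponding teacher prediction, softened by a temperature $\tau$; a common form (up to a task-loss term and weighting, notation ours) is
$$\mathcal{L}_{KD} = \tau^2 \sum_{i \in \mathcal{V}} \mathrm{KL}\big(\mathrm{softmax}(\mathbf{z}_i^t / \tau) \,\|\, \mathrm{softmax}(\mathbf{z}_i^s / \tau)\big),$$
where $\mathbf{z}_i^t$ and $\mathbf{z}_i^s$ are the teacher and student logits of node $v_i$. Because this loss is defined per node, it carries no explicit signal about the sparse structure around each node.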
To address the above problems, we intend to inject the structure
information into MLPs in a low-latency and interpretable way. To
this end, as shown in Figure 1, we first design a simple yet effective
and interpretable Structure-Aware MLP (SA-MLP) student model
(Section 4.1) to encode both node features and structure informa-
tion. Specifically, SA-MLP decouples the feature and structure information of each node with two encoders, and utilizes an adaptive fusion decoder to generate the final prediction. All modules of SA-MLP are implemented by MLPs, and the structure inputs are batches of rows of the sparse adjacency matrix. Hence, it benefits from both mini-batch training and faster inference. Second,
we propose a novel structure-mixing knowledge distillation strategy (Section 4.2) based on the mixup [40] technique to improve SA-MLP's ability to learn structure information from GNNs. It generates virtual samples by mixing structure inputs and the corresponding teacher outputs, which increases the distillation density for structure knowledge. Compared with standard logit-based distillation, our strategy is more suitable for SA-MLP because it reduces structure sparsity. Third, for the ind scenario without connection, e.g., newly registered users on Twitter who have not yet interacted with anyone, we propose an implicit structure embedding approximation technique with a two-stage distillation procedure to enhance the generalization ability (Section 4.3).
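To make the design concrete, the following is a minimal PyTorch-style sketch of an SA-MLP forward pass and a structure-mixing distillation step as described above; all names (SAMLP, the gating-based fusion, structure_mixing_kd_loss, and the Beta-mixup parameters) are illustrative assumptions rather than the exact implementation in Section 4.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SAMLP(nn.Module):
    """Minimal sketch of SA-MLP; layer sizes and fusion details are illustrative."""
    def __init__(self, feat_dim, num_nodes, hidden, num_classes):
        super().__init__()
        # Two MLP encoders: one for node features, one for adjacency rows.
        self.feat_enc = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden))
        self.struct_enc = nn.Sequential(nn.Linear(num_nodes, hidden), nn.ReLU(),
                                        nn.Linear(hidden, hidden))
        # Adaptive fusion decoder: gate the two embeddings, then classify.
        self.gate = nn.Linear(2 * hidden, hidden)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, x, a_row):
        # x: [B, feat_dim] node features; a_row: [B, num_nodes] adjacency rows.
        h_f = self.feat_enc(x)
        h_s = self.struct_enc(a_row)
        alpha = torch.sigmoid(self.gate(torch.cat([h_f, h_s], dim=-1)))
        h = alpha * h_f + (1.0 - alpha) * h_s   # adaptive fusion of the two views
        return self.classifier(h)

def structure_mixing_kd_loss(student, x, a_row, teacher_logits, tau=1.0, beta=1.0):
    """Mix pairs of adjacency rows and the matching teacher outputs (sketch).

    Only the structure inputs and teacher outputs are mixed here; whether node
    features are also mixed is an implementation detail not fixed by this sketch.
    """
    lam = torch.distributions.Beta(beta, beta).sample().item()
    perm = torch.randperm(a_row.size(0))
    a_mix = lam * a_row + (1.0 - lam) * a_row[perm]                 # virtual structure
    t_mix = lam * teacher_logits + (1.0 - lam) * teacher_logits[perm]
    s_logits = student(x, a_mix)
    # Soft-label distillation on the mixed (virtual) samples.
    return F.kl_div(F.log_softmax(s_logits / tau, dim=-1),
                    F.softmax(t_mix / tau, dim=-1),
                    reduction="batchmean") * tau * tau

In this sketch the structure encoder consumes a dense copy of an adjacency row of length N; the exact input formatting, fusion mechanism, and loss weighting of SA-MLP may differ.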
We conduct extensive experiments on eight public benchmark
datasets under trans and ind scenarios. The results show that the
learned SA-MLP can achieve similar or even better performance than
teacher GNNs in all scenarios, with substantially faster inference.
Furthermore, we also conduct an in-depth analysis to investigate the
compatibility, interpretability, and efficiency of the learned SA-MLP.
To summarize, this work has the following main contributions:
• We propose a message-passing-free model, SA-MLP, which has low-latency inference and a more interpretable prediction process, and naturally preserves the feature/structure information after distillation from GNNs.
• We design a mixup-based structure-mixing knowledge distillation strategy that improves the performance and structure awareness of SA-MLP via KD.
• For the missing-structure scenario, we propose a latent structure embedding approximation technique and a two-stage distillation to enhance the generalization ability.
2 PRELIMINARY
2.1 Notation and Problem Setting
Consider a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with $N$ nodes and $E$ edges. Let $\mathbf{A} \in \mathbb{R}^{N \times N}$ be the adjacency matrix, with $\mathbf{A}_{i,j} = 1$ if edge $(i, j) \in \mathcal{E}$, and 0 otherwise. Let $\mathbf{D} \in \mathbb{R}^{N \times N}$ be the diagonal degree matrix. Each node $v_i$ is given a $d$-dimensional feature representation $\mathbf{x}_i$ and a $c$-dimensional one-hot class label $\mathbf{y}_i$. The feature inputs are then formed by $\mathbf{X} \in \mathbb{R}^{N \times d}$, and the labels are represented by $\mathbf{Y} \in \mathbb{R}^{N \times c}$. The labeled and unlabeled node sets are denoted as $\mathcal{V}_L$ and $\mathcal{V}_U$, and we have $\mathcal{V} = \mathcal{V}_L \cup \mathcal{V}_U$.
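As a small illustration of these objects (a sketch under assumed NumPy/SciPy conventions, not taken from the paper), $\mathbf{A}$ and $\mathbf{D}$ can be built from an edge list, and a mini-batch of structure inputs then corresponds to a set of rows of $\mathbf{A}$:

import numpy as np
import scipy.sparse as sp

# Toy edge list for an undirected graph with N = 4 nodes (illustrative only).
edges = np.array([[0, 1], [1, 2], [2, 3]])
N = 4

# Sparse adjacency matrix A (symmetrized) and diagonal degree matrix D.
A = sp.coo_matrix((np.ones(len(edges)), (edges[:, 0], edges[:, 1])), shape=(N, N))
A = A + A.T
D = sp.diags(np.asarray(A.sum(axis=1)).ravel())

# A mini-batch of structure inputs: dense copies of the selected rows of A.
batch_idx = [0, 2]
a_rows = A.tocsr()[batch_idx].toarray()   # shape [batch_size, N]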
The task of node classification is to predict the labels $\mathbf{Y}$ by exploiting the node features $\mathbf{X}$ and the graph structure $\mathbf{A}$. The goal of this paper is to learn an MLP-like student such that the learned MLP can achieve similar or even better performance than a GNN trained on the same training set, with a much lower computational cost at inference time.
2.2 Graph Neural Networks
Most existing GNNs follow the message-passing paradigm, which contains node feature transformation and information aggregation from connected neighbors on the graph structure [9]. The general $k$-th layer graph convolution for a node $v_i$ can be formulated as
$$\mathbf{h}_i^{(k)} = f\left(\mathbf{h}_i^{(k-1)}, \big\{\mathbf{h}_j^{(k-1)} : j \in \mathcal{N}(v_i)\big\}\right), \quad (1)$$
where the representation $\mathbf{h}_i$ is updated iteratively in each layer by collecting messages from the neighbors of $v_i$, denoted as $\mathcal{N}(v_i)$. The graph convolution operator $f$ is usually implemented as a weighted sum of node representations according to the adjacency matrix $\mathbf{A}$, as in GCN [20] and GraphSAGE [10], or via the attention mechanism in GAT [33]. However, this recursive expansion and aggregation of neighbors causes inference latency, because the number of fetched neighbors grows exponentially with the number of layers [37, 41].
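For example, the graph convolution of GCN [20] can be written in matrix form as
$$\mathbf{H}^{(k)} = \sigma\big(\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}} \mathbf{H}^{(k-1)} \mathbf{W}^{(k)}\big),$$
where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ is the adjacency matrix with self-loops, $\tilde{\mathbf{D}}$ is its degree matrix, $\mathbf{W}^{(k)}$ is a learnable weight matrix, and $\sigma$ is a nonlinearity; computing it for a test node still requires fetching the representations of all its (multi-hop) neighbors.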
The objective function for training GNNs is the cross-entropy between the ground-truth labels $\mathbf{Y}$ and the output of the network $\hat{\mathbf{Y}} \in \mathbb{R}^{N \times c}$:
$$\mathcal{L}_{CE}(\hat{\mathbf{Y}}_L, \mathbf{Y}_L) = -\sum_{i \in \mathcal{V}_L} \sum_{j=1}^{c} \mathbf{Y}_{ij} \ln \hat{\mathbf{Y}}_{ij}. \quad (2)$$
2.3 Transductive and Inductive Scenarios
Although the idealized transductive setting is the most commonly studied for node classification, it is incompatible with the unseen nodes that arise in real applications. We therefore consider node classification under three settings to give a broad evaluation of models: transductive (trans), inductive with connection (ind w/c), and inductive without connection (ind w/o c), as shown in Figure 2. For trans, models can utilize all node