
Zou et al., 2019, Huang et al., 2018, Dong et al., 2021], where the sampling for the whole layer is done collectively. On the other hand, subgraph sampling methods [Chiang et al., 2019, Zeng et al., 2020, Hu et al., 2020b, Zeng et al., 2021, Fey et al., 2021, Shi et al., 2023] do not use the recursive layer-by-layer sampling scheme of the node- and layer-based sampling methods and instead tend to use the same subgraph for all of the layers. Some of these sampling methods take the magnitudes of embeddings into account [Liu et al., 2020, Zhang et al., 2021, Huang et al., 2018], while others, such as Chen et al. [2018a], Cong et al. [2021], Fey et al. [2021], Shi et al. [2023], cache historical embeddings to reduce the variance of the computed approximate embeddings. There are also methods that sample from a vertex cache [Dong et al., 2021] filled with popular vertices. Most of these approaches are orthogonal to each other and can be incorporated into other sampling algorithms.
Node-based sampling methods suffer the most from the NEP, but they guarantee a good approximation for each embedding by ensuring that each vertex gets $k$ neighbors, where $k$ is the only hyperparameter of the sampling algorithm. Layer-based sampling methods do not suffer as much from the NEP because the number of vertices sampled at each layer is a hyperparameter, but they cannot guarantee that the approximation of each vertex is good enough, and their hyperparameters are hard to reason about: the number of nodes to sample at each layer depends highly on the graph structure (as the numbers in Table 2 show). Subgraph sampling methods usually sample sparser subgraphs compared to their node- and layer-based counterparts. Hence, in this paper, we focus on the node- and layer-based sampling methods and combine their advantages. The major contributions of this work can be listed as follows:
• We propose the use of Poisson sampling for GNNs, taking advantage of its lower variance and computational efficiency compared to sampling without replacement (see the sketch following this list). Applying it to the existing layer sampling method LADIES, we obtain the superior PLADIES method, outperforming the former by up to 2% in terms of F1-score.
• We propose a new sampling algorithm called LABOR, combining the advantages of neighbor and layer sampling approaches using Poisson sampling. LABOR correlates the sampling procedures of the given set of seed nodes so that the vertices sampled from different seeds have a large overlap, resulting in a 7× and 4× reduction in the number of vertices and edges sampled compared to NS, respectively. Furthermore, LABOR can sample up to 13× fewer edges compared to LADIES.
• We experimentally verify our findings and show that our proposed sampling algorithm LABOR outperforms both neighbor sampling and layer sampling approaches. LABOR can enjoy a batch size up to 112× larger than NS while sampling the same number of vertices.
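To illustrate the Poisson sampling primitive underlying PLADIES and LABOR, the following sketch includes each candidate vertex independently with its own inclusion probability. It is a simplified illustration rather than the exact procedure of either method, and the probabilities `pi` are assumed to be given.

```python
import numpy as np

def poisson_sample(candidates, pi, rng=None):
    """Include each candidate vertex t independently with probability pi[t].

    Unlike sampling a fixed number of vertices without replacement, each
    inclusion is an independent Bernoulli trial, so the sample size is random
    with expectation sum(pi[t] for t in candidates).
    """
    rng = np.random.default_rng() if rng is None else rng
    return [t for t in candidates if rng.random() < pi[t]]

# Hypothetical per-vertex inclusion probabilities for the neighbors of a seed.
pi = {10: 0.9, 11: 0.5, 12: 0.5, 13: 0.1}
sampled = poisson_sample([10, 11, 12, 13], pi)
```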
2 Background
Graph Neural Networks:
Given a directed graph $G = (V, E)$, where $V$ and $E \subset V \times V$ are the vertex and edge sets respectively, $(t \to s) \in E$ denotes an edge from a source vertex $t \in V$ to a destination vertex $s \in V$, and $A_{ts}$ denotes the corresponding edge weight if provided. If we have a batch of seed vertices $S \subset V$, let us define the $l$-hop neighborhood $N^l(S)$ for the incoming edges as follows:

$$N(s) = \{t \mid (t \to s) \in E\}, \qquad N^1(S) = N(S) = \bigcup_{s \in S} N(s), \qquad N^l(S) = N(N^{l-1}(S)) \tag{1}$$
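As a concrete reading of Equation (1), a minimal sketch is shown below; it assumes the graph is given as a plain Python dictionary mapping each vertex to the list of its in-neighbors, which is an illustrative format rather than anything prescribed here.

```python
def in_neighbors(graph, s):
    """N(s): all vertices t with an edge (t -> s); graph maps s to its in-neighbors."""
    return set(graph.get(s, ()))

def l_hop_neighborhood(graph, seeds, l):
    """N^l(S) as in Equation (1): apply N(.) to the previous frontier l times."""
    frontier = set(seeds)
    for _ in range(l):
        frontier = set().union(*(in_neighbors(graph, s) for s in frontier))
    return frontier

# Example: seed set {0} expanded for 2 hops.
graph = {0: [1, 2], 1: [3], 2: [3, 4]}
two_hop = l_hop_neighborhood(graph, {0}, 2)  # {3, 4}
```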
Let us also define the degree $d_s$ of vertex $s$ as $d_s = |N(s)|$. To simplify, let us assume uniform edge weights, $A_{ts} = 1, \forall (t \to s) \in E$. Then, our goal is to estimate the following for each vertex $s \in S$, where $H^{(l-1)}_t$ is defined as the embedding of the vertex $t$ at layer $l - 1$, $W^{(l-1)}$ is the trainable weight matrix at layer $l - 1$, and $\sigma$ is the nonlinear activation function [Hamilton et al., 2017]:

$$Z^l_s = \frac{1}{d_s} \sum_{t \to s} H^{(l-1)}_t W^{(l-1)}, \qquad H^l_s = \sigma(Z^l_s) \tag{2}$$
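A dense, single-layer sketch of Equation (2) follows; the dictionary graph format, integer vertex ids indexing the rows of the embedding matrix, and the choice of ReLU for $\sigma$ are assumptions made only for illustration.

```python
import numpy as np

def gnn_layer(graph, H_prev, W):
    """One layer of Equation (2): Z_s = (1/d_s) * sum_{t -> s} H_prev[t] @ W, H_s = sigma(Z_s)."""
    Z = np.zeros((H_prev.shape[0], W.shape[1]))
    for s, in_nbrs in graph.items():
        if in_nbrs:  # mean over in-neighbors equals (1/d_s) times the sum
            Z[s] = H_prev[list(in_nbrs)].mean(axis=0) @ W
    return np.maximum(Z, 0.0)  # sigma chosen as ReLU purely for illustration
```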
Exact Stochastic Gradient Descent:
If we have a node prediction task, $V_t \subseteq V$ is the set of training vertices, $y_s, s \in V_t$ are the labels of the prediction task, and $\ell$ is the loss function for the prediction task, then our goal is to minimize the following loss function: $\frac{1}{|V_t|} \sum_{s \in V_t} \ell(y_s, Z^l_s)$. Replacing $V_t$ in the loss function with $S \subset V_t$ for each iteration of gradient descent, we get stochastic gradient descent for GNNs. However, with $l$ layers, the computation dependency is on $N^l(S)$, which reaches a large portion of real-world graphs, i.e., $|N^l(S)| \approx |V|$, making each iteration costly both in terms of computation and memory.
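This cost can be made explicit with a small helper, reusing the dictionary graph format from the earlier sketches: computing $Z^l_s$ for the seeds $S$ requires the input features of every vertex reachable within $l$ hops along incoming edges, and this set quickly approaches $V$ on real-world graphs.

```python
def required_vertices(graph, seeds, l):
    """Vertices whose features are needed to compute Z^l_s for every s in seeds.

    Each layer of Equation (2) pulls in the in-neighbors of the previous set,
    so after l layers the dependency set covers S, N(S), ..., up to N^l(S);
    its growth with l is the neighborhood explosion phenomenon (NEP).
    """
    needed, frontier = set(seeds), set(seeds)
    for _ in range(l):
        frontier = set().union(*(set(graph.get(s, ())) for s in frontier))
        needed |= frontier
    return needed
```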