Layer-Neighbor Sampling — Defusing Neighborhood
Explosion in GNNs
Muhammed Fatih Balın
balin@gatech.edu
Ümit V. Çatalyürek
umit@gatech.edu

School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA
Part of this work was done during an internship at NVIDIA.
Amazon Web Services. This publication describes work performed at the Georgia Institute of Technology and is not associated with AWS.
Abstract
Graph Neural Networks (GNNs) have received significant attention recently, but
training them at a large scale remains a challenge. Mini-batch training coupled
with sampling is used to alleviate this challenge. However, existing approaches
either suffer from the neighborhood explosion phenomenon or have poor perfor-
mance. To address these issues, we propose a new sampling algorithm called
LAyer-neighBOR sampling (LABOR). It is designed to be a direct replacement for
Neighbor Sampling (NS) with the same fanout hyperparameter while sampling up
to 7 times fewer vertices, without sacrificing quality. By design, the variance of
the estimator of each vertex matches NS from the point of view of a single vertex.
Moreover, under the same vertex sampling budget constraints, LABOR converges
faster than existing layer sampling approaches and can use up to 112 times larger
batch sizes compared to NS.
1 Introduction
Graph Neural Networks (GNN) Hamilton et al. [2017], Kipf and Welling [2017] have become de
facto models for representation learning on graph structured data. Hence they have started being
deployed in production systems Ying et al. [2018], Niu et al. [2020]. These models iteratively update
the node embeddings by passing messages along the direction of the edges in the given graph with
nonlinearities in between different layers. With $l$ layers, the computed node embeddings contain information from the $l$-hop neighborhood of the seed vertex.
In the production setting, the GNN models need to be trained on billion-scale graphs [Ching et al.,
2015, Ying et al., 2018]. The training of these models takes hours to days even on distributed
systems Zheng et al. [2022b,a]. As in general Deep Neural Networks (DNN), it is more efficient to
use mini-batch training [Bertsekas, 1994] on GNNs, even though it is a bit trickier in this case. The
node embeddings in GNNs depend recursively on their set of neighbors' embeddings, so when there are $l$ layers, this dependency spans the $l$-hop neighborhood of the node. Real-world graphs usually have a very small diameter, and if $l$ is large, the $l$-hop neighborhood may very well span the entire graph; this is known as the Neighborhood Explosion Phenomenon (NEP) [Zeng et al., 2020].
To solve these issues, researchers proposed sampling a subgraph of the $l$-hop neighborhood of the
nodes in the batch. There are mainly three different approaches: Node-based, Layer-based and
Subgraph-based methods. Node-based sampling methods [Hamilton et al., 2017, Chen et al., 2018a,
Liu et al., 2020, Zhang et al., 2021] sample independently and recursively for each node. It was
noticed that node-based methods sample subgraphs that are too shallow, i.e., with a low ratio of
number of edges to nodes. Thus layer-based sampling methods were proposed [Chen et al., 2018b, Zou et al., 2019, Huang et al., 2018, Dong et al., 2021], where the sampling for the whole layer is done collectively. On the other hand, subgraph sampling methods [Chiang et al., 2019, Zeng et al., 2020, Hu et al., 2020b, Zeng et al., 2021, Fey et al., 2021, Shi et al., 2023] do not use the recursive layer-by-layer sampling scheme used in the node- and layer-based sampling methods and instead tend to use the same subgraph for all of the layers. Some of these sampling methods take the
magnitudes of embeddings into account [Liu et al., 2020, Zhang et al., 2021, Huang et al., 2018],
while others, such as Chen et al. [2018a], Cong et al. [2021], Fey et al. [2021], Shi et al. [2023], cache
the historical embeddings to reduce the variance of the computed approximate embeddings. There are also methods that sample from a vertex cache filled with popular vertices [Dong et al., 2021]. Most of these
approaches are orthogonal to each other and they can be incorporated into other sampling algorithms.
Node-based sampling methods suffer the most from the NEP but they guarantee a good approximation
for each embedding by ensuring each vertex gets $k$ neighbors, where $k$ is the only hyperparameter of the sampling algorithm. Layer-based sampling methods do not suffer as much from the NEP because the number of vertices sampled is a hyperparameter, but they cannot guarantee that each vertex's approximation is good enough, and their hyperparameters are also hard to reason about: the number of nodes to sample at each layer depends highly on the graph structure (as the numbers in Table 2 show). Subgraph sampling methods usually sample sparser subgraphs compared to their node- and
layer-based counterparts. Hence, in this paper, we focus on the node- and layer-based sampling
methods and combine their advantages. The major contributions of this work can be listed as follows:
- We propose the use of Poisson Sampling for GNNs, taking advantage of its lower variance and computational efficiency compared to sampling without replacement. Applying it to the existing layer sampling method LADIES, we get the superior PLADIES method, outperforming the former by up to 2% in terms of F1-score.
- We propose a new sampling algorithm called LABOR, combining the advantages of neighbor and layer sampling approaches using Poisson Sampling. LABOR correlates the sampling procedures of the given set of seed nodes so that the sampled vertices from different seeds have a lot of overlap, resulting in a 7× and 4× reduction in the number of vertices and edges sampled compared to NS, respectively. Furthermore, LABOR can sample up to 13× fewer edges compared to LADIES.
- We experimentally verify our findings and show that our proposed sampling algorithm LABOR outperforms both neighbor sampling and layer sampling approaches. LABOR can enjoy a batch size up to 112× larger than NS while sampling the same number of vertices.
2 Background
Graph Neural Networks: Given a directed graph $G = (V, E)$, where $V$ and $E \subseteq V \times V$ are the vertex and edge sets respectively, $(t \to s) \in E$ denotes an edge from a source vertex $t \in V$ to a destination vertex $s \in V$, and $A_{ts}$ denotes the corresponding edge weight if provided. If we have a batch of seed vertices $S \subseteq V$, let us define the $l$-hop neighborhood $N^l(S)$ for the incoming edges as follows:
\[
N(s) = \{ t \mid (t \to s) \in E \}, \qquad N^1(S) = N(S) = \bigcup_{s \in S} N(s), \qquad N^l(S) = N(N^{(l-1)}(S)) \tag{1}
\]
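For concreteness, here is a minimal Python sketch (an illustration, not code from the paper) of expanding the incoming $l$-hop neighborhood $N^l(S)$ of Eq. (1) hop by hop from an edge list; the toy graph is a made-up example.

```python
def l_hop_neighborhood(edges, seeds, l):
    """N^l(S) for incoming edges, expanded hop by hop as in Eq. (1)."""
    frontier = set(seeds)
    for _ in range(l):
        # N of the current frontier: sources t of all edges (t -> s) with s in the frontier.
        frontier = {t for t, s in edges if s in frontier}
    return frontier

# Toy graph (hypothetical): edges are (source t, destination s).
edges = [(0, 1), (2, 1), (3, 2), (4, 2), (1, 3)]
print(l_hop_neighborhood(edges, seeds={1}, l=1))  # N(S)   = {0, 2}
print(l_hop_neighborhood(edges, seeds={1}, l=2))  # N^2(S) = {3, 4}
```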
Let us also define the degree $d_s$ of vertex $s$ as $d_s = |N(s)|$. To simplify, let's assume uniform edge weights, $A_{ts} = 1, \forall (t \to s) \in E$. Then, our goal is to estimate the following for each vertex $s \in S$, where $H^{(l-1)}_t$ is defined as the embedding of the vertex $t$ at layer $l-1$, $W^{(l-1)}$ is the trainable weight matrix at layer $l-1$, and $\sigma$ is the nonlinear activation function [Hamilton et al., 2017]:
\[
Z^l_s = \frac{1}{d_s} \sum_{t \to s} H^{(l-1)}_t W^{(l-1)}, \qquad H^l_s = \sigma(Z^l_s) \tag{2}
\]
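The NumPy sketch below spells out Eq. (2) for a single layer, averaging the transformed embeddings of each seed's incoming neighbors; the shapes, the toy graph, and the choice of ReLU for $\sigma$ are illustrative assumptions rather than the paper's setup.

```python
import numpy as np

def gnn_layer(edges, H_prev, W, seeds):
    """Eq. (2): Z_s = (1/d_s) * sum_{t->s} H_t W, then H_s = sigma(Z_s)."""
    M = H_prev @ W                                   # M_t = H_t W for every vertex t
    Z = np.zeros((len(seeds), W.shape[1]))
    for i, s in enumerate(seeds):
        nbrs = [t for t, d in edges if d == s]       # N(s), incoming neighbors of s
        if nbrs:
            Z[i] = M[nbrs].mean(axis=0)              # (1/d_s) * sum over t -> s
    return np.maximum(Z, 0.0)                        # sigma = ReLU (an assumption)

# Toy usage with random features (hypothetical sizes).
rng = np.random.default_rng(0)
edges = [(0, 1), (2, 1), (3, 2), (4, 2), (1, 3)]
H = rng.normal(size=(5, 8))   # previous-layer embeddings H^{(l-1)}
W = rng.normal(size=(8, 4))   # trainable weights W^{(l-1)}
print(gnn_layer(edges, H, W, seeds=[1, 2, 3]).shape)  # (3, 4)
```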
Exact Stochastic Gradient Descent: If we have a node prediction task and $V_t \subseteq V$ is the set of training vertices, $y_s, \forall s \in V_t$ are the labels of the prediction task, and $\ell$ is the loss function for the prediction task, then our goal is to minimize the following loss function: $\frac{1}{|V_t|} \sum_{s \in V_t} \ell(y_s, Z^l_s)$. Replacing $V_t$ in the loss function with $S \subseteq V_t$ for each iteration of gradient descent, we get stochastic gradient descent for GNNs. However, with $l$ layers, the computation dependency is on $N^l(S)$, which reaches a large portion of real-world graphs, i.e., $|N^l(S)| \approx |V|$, making each iteration costly both in terms of computation and memory.
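As a small worked illustration of the objective above (with a stand-in squared-error loss and made-up sizes, not the paper's task), one SGD iteration evaluates the loss only on a random batch $S \subseteq V_t$:

```python
import numpy as np

rng = np.random.default_rng(0)

def full_loss(Z, y, train_ids):
    """Exact objective: (1/|V_t|) * sum over all training vertices of ell(y_s, Z^l_s)."""
    return np.mean((Z[train_ids] - y[train_ids]) ** 2)   # ell = squared error (stand-in)

def sgd_batch_loss(Z, y, train_ids, batch_size):
    """One SGD iteration's objective: the same loss over a random batch S subset of V_t."""
    S = rng.choice(train_ids, size=batch_size, replace=False)
    return np.mean((Z[S] - y[S]) ** 2)

# Hypothetical scalar predictions Z_s and labels y_s for |V| = 1000, |V_t| = 600.
Z, y = rng.normal(size=1000), rng.normal(size=1000)
train_ids = np.arange(600)
print(full_loss(Z, y, train_ids), sgd_batch_loss(Z, y, train_ids, batch_size=32))
```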
Neighbor Sampling: The neighbor sampling approach was proposed by Hamilton et al. [2017] to approximate $Z^{(l)}_s$ for each $s \in S$ with a subset of $N^l(S)$. Given a fanout hyperparameter $k$, this subset is computed recursively by randomly picking $k$ neighbors for each $s \in S$ from $N(s)$ to form the next layer $S^1$, which is a subset of $N^1(S)$. If $d_s \leq k$, then the exact neighborhood $N(s)$ is used. For the next layer, $S^1$ is treated as the new set of seed vertices and this procedure is applied recursively.
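The following sketch of the recursive fanout-$k$ procedure is a simplified illustration (not the DGL or PyG implementation): each seed draws at most $k$ incoming neighbors, and the union of drawn neighbors becomes the seed set of the next layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def neighbor_sampling(in_nbrs, seeds, fanouts):
    """Recursive neighbor sampling; in_nbrs[s] = list of t with (t -> s) in E."""
    layers, S = [], list(seeds)
    for k in fanouts:                                 # one fanout per GNN layer
        sampled_edges = []
        for s in S:
            nbrs = in_nbrs.get(s, [])
            if len(nbrs) > k:                         # d_s > k: pick k neighbors at random
                nbrs = rng.choice(nbrs, size=k, replace=False).tolist()
            sampled_edges += [(t, s) for t in nbrs]   # else keep the exact N(s)
        layers.append(sampled_edges)
        S = sorted({t for t, _ in sampled_edges})     # next layer's seed set S^1
    return layers

# Toy adjacency (hypothetical): vertex -> incoming neighbors.
in_nbrs = {0: [1, 2, 3, 4], 1: [2, 5], 2: [3, 4, 5, 6], 5: [0, 6]}
print(neighbor_sampling(in_nbrs, seeds=[0, 1], fanouts=[2, 2]))
```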
Revisiting LADIES, Dependent Layer-based Sampling: From now on, we will drop the layer notation, focus on a single layer, and ignore the nonlinearities. Let us define $M_t = H_t W$ as a shorthand notation. Then our goal is to approximate:
\[
H_s = \frac{1}{d_s} \sum_{t \to s} M_t \tag{3}
\]
If we assign probabilities $\pi_t > 0, \forall t \in N(S)$, normalize them so that $\sum_{t \in N(S)} \pi_t = 1$, then use sampling with replacement to sample $T \subseteq N(S)$ with $|T| = n$, where $n$ is the number of vertices to sample given as input to the LADIES algorithm and $T$ is a multi-set possibly with multiple copies of the same vertices, and let $\tilde{d}_s = |T \cap N(s)|$ be the number of sampled vertices for a given vertex $s$, we get the following two possible estimators for each vertex $s \in S$:
\[
H'_s = \frac{1}{n d_s} \sum_{t \in T \cap N(s)} \frac{M_t}{\pi_t} \tag{4a}
\qquad
H''_s = \frac{\sum_{t \in T \cap N(s)} M_t / \pi_t}{\sum_{t \in T \cap N(s)} 1 / \pi_t} \tag{4b}
\]
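To make (4a) and (4b) concrete, here is a hedged NumPy sketch of with-replacement layer sampling and both estimators for a single seed $s$; the candidate set, the probabilities $\pi_t$, and the values $M_t$ are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_sample_with_replacement(pi, n):
    """Draw a multiset T of n vertex ids from N(S) with probabilities pi (summing to 1)."""
    return rng.choice(len(pi), size=n, replace=True, p=pi)

def estimators(T, pi, M, nbrs_s, n):
    """Horvitz-Thompson (4a) and Hajek (4b) estimates of H_s for one seed s."""
    d_s = len(nbrs_s)
    hits = [t for t in T if t in nbrs_s]                      # T ∩ N(s), with multiplicity
    ht = sum(M[t] / pi[t] for t in hits) / (n * d_s)          # Eq. (4a)
    hajek = (sum(M[t] / pi[t] for t in hits) /
             sum(1.0 / pi[t] for t in hits)) if hits else 0.0  # Eq. (4b)
    return ht, hajek

# Synthetic N(S) of 10 candidate vertices; seed s is adjacent to the first 4 of them.
M = rng.normal(size=10)              # stand-in for M_t = H_t W (scalars here)
pi = np.full(10, 0.1)                # uniform probabilities summing to 1
nbrs_s = set(range(4))
T = layer_sample_with_replacement(pi, n=6)
print(estimators(T, pi, M, nbrs_s, n=6))
```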
Note that $H'_s$ in (4a) is the Horvitz-Thompson estimator and $H''_s$ in (4b) is the Hajek estimator. For a comparison between the two and how to get an even better estimator by combining them, see Khan and Ugander [2021]. The formulation in the LADIES paper uses $H'_s$, but it proposes to row-normalize the sampled adjacency matrix, meaning they use $H''_s$ in their implementation. However, analyzing the variance of the Horvitz-Thompson estimator is simpler, and its variance serves as an upper bound for the variance of the Hajek estimator when $|M_t|$ and $\pi_t$ are uncorrelated [Khan and Ugander, 2021, Dorfman, 1997], which we assume to be true in our case. Note that the variance analysis is simplified to be element-wise for all vectors involved.
\[
\mathrm{Var}(H''_s) \lessapprox \mathrm{Var}(H'_s) = \frac{1}{\tilde{d}_s d_s^2} \sum_{t \to s} \pi_t \sum_{t' \to s} \frac{\mathrm{Var}(M_{t'})}{\pi_{t'}} \tag{5}
\]
Since we do not have access to the computed embeddings and to simplify the analysis, we assume that $\mathrm{Var}(M_t) = 1$ from now on. One can see that $\mathrm{Var}(H'_s)$ is minimized when $\pi_t = p, \forall t \to s$ under the constraint $\sum_{t \to s} \pi_t \leq p\, d_s$ for some constant $p \in [0, 1]$, hence any deviation from uniformity increases the variance. The variance is also smaller the larger $\tilde{d}_s$ is. However, in theory and in practice, there is no guarantee that each vertex $s \in S$ will get any neighbors in $T$, not to mention equal numbers of neighbors. Some vertices will have pretty good estimators with thousands of samples and very low variances, while others might not even get a single neighbor sampled. For this reason, we designed LABOR so that every vertex in $S$ will sample enough neighbors in expectation.
While LADIES is optimal from an approximate matrix multiplication perspective [Chen et al., 2022], it is far from optimal in the presence of nonlinearities and multiple layers. Even with a single layer, the loss functions used are nonlinear. Moreover, the existence of nonlinearities in between layers and the fact that there are multiple layers exacerbate this issue and necessitate that each vertex gets a good enough estimator with low enough variance. Also, LADIES gives a formulation using sampling with replacement instead of without replacement, which is sub-optimal from the variance perspective, while its implementation uses sampling without replacement without accounting for the bias created thereby. In the next section, we will show how all of these problems are addressed by our newly proposed Poisson sampling framework and LABOR sampling.
3 Proposed Layer Sampling Methods
Node-based sampling methods suffer from sampling too shallow subgraphs, leading to NEP in just a few hops (e.g., see Table 2). Layer sampling methods [Zou et al., 2019] attempt to fix this by sampling a fixed number of vertices in each layer; however, they cannot ensure that the estimators for the vertices are of high quality, and it is hard to reason about how to choose the number of vertices to sample in each layer. LADIES [Zou et al., 2019] proposes using the same number for each layer, while papers evaluating it found it is better to sample an increasing number of vertices in each layer [Liu et al., 2020, Chen et al., 2022]. There is no systematic way to choose how many vertices to sample in each layer for the LADIES method, and since each graph has a different density and connectivity structure, this choice highly depends on the graph in question. Therefore, due to its simplicity and high-quality results, Neighbor Sampling currently seems to be the most popular sampling approach, and there exist high-quality implementations on both CPUs and GPUs in the popular GNN frameworks [Wang et al., 2019, Fey and Lenssen, 2019].
We propose a new approach that combines the advantages of layer and neighbor sampling approaches
using a vertex-centric variance based framework, reducing the number of sampled vertices drastically
while ensuring the training quality does not suffer and matches the quality of neighbor sampling.
Another advantage of our method is that the user only needs to choose the batch size and the fanout hyperparameters, as in the Neighbor Sampling approach; the algorithm itself then samples the minimum number of vertices in the later layers in an unbiased way while ensuring each vertex gets enough neighbors and a good approximation.
We achieve all the previously mentioned good properties with the help of Poisson Sampling. So, the next section will demonstrate applying Poisson Sampling to Layer Sampling, and then we will show how the advantages of Layer and Neighbor Sampling methods can be combined into LABOR while getting rid of their cons altogether.
3.1 Poisson Layer Sampling (PLADIES)
In layer sampling, the main idea can be summarized as individual vertices making correlated decisions while sampling their neighbors, because in the end, if a vertex $t$ is sampled, all edges into the seed vertices $S$, i.e., $t \to s, s \in S$, are added to the sampled subgraph. This can be interpreted as the vertices in $S$ making a collective decision on whether to sample $t$ or not.
The other thing to keep in mind is that the existing layer sampling methods use sampling with replacement when doing importance sampling with unequal probabilities, because it is nontrivial to compute the inclusion probabilities in the without-replacement case. The Hajek estimator in the without-replacement case with equal probabilities becomes:
\[
H''_s = \frac{\sum_{t \in T \cap N(s)} M_t / \bar{\pi}_t}{\sum_{t \in T \cap N(s)} 1 / \bar{\pi}_t} = \frac{\sum_{t \in T \cap N(s)} M_t |N(S)|}{\sum_{t \in T \cap N(s)} |N(S)|} = \frac{1}{\tilde{d}_s} \sum_{t \in T \cap N(s)} M_t \tag{6}
\]
and it has the variance:
\[
\mathrm{Var}(H''_s) = \frac{d_s - \tilde{d}_s}{d_s - 1} \cdot \frac{1}{\tilde{d}_s} \tag{7}
\]
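Equation (7) has the form of the familiar finite-population correction for the variance of a without-replacement sample mean; the short Monte Carlo check below (a sanity check under the unit-variance assumption on $M_t$, not from the paper) compares it against the empirical variance of the estimator in Eq. (6).

```python
import numpy as np

rng = np.random.default_rng(0)

d_s, d_tilde, trials = 20, 5, 100_000
M = rng.normal(size=d_s)                 # neighbor values M_t
M = (M - M.mean()) / M.std()             # force unit (population) variance for the check

# Empirical variance of the equal-probability, without-replacement Hajek estimator (6),
# i.e. the plain sample mean over d_tilde sampled neighbors of s.
est = np.array([rng.choice(M, size=d_tilde, replace=False).mean() for _ in range(trials)])
print("empirical:", est.var())
print("eq. (7):  ", (d_s - d_tilde) / (d_s - 1) / d_tilde)
```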
Let us show how one can do layer sampling using Poisson sampling (PLADIES). Given probabilities $\pi_t \in [0, 1], \forall t \in N(S)$ so that $\sum_{t \in N(S)} \pi_t = n$, we include $t \in N(S)$ in our sample $T$ with probability $\pi_t$ by flipping a coin for it, i.e., we sample $r_t \sim U(0, 1)$ and include $t$ in $T$ if $r_t \leq \pi_t$. In the end, $E[|T|] = n$ and we can still use the Hajek estimator $H''_s$ or the Horvitz-Thompson estimator $H'_s$ to estimate $H_s$. Doing layer sampling this way is unbiased by construction and achieves the same goal in linear time, in contrast to the quadratic time debiasing approach explained in Chen et al. [2022]. The variance then approximately becomes [Williams et al., 1998] (see Appendix A.1 for a derivation):
\[
\mathrm{Var}(H''_s) \lessapprox \mathrm{Var}(H'_s) = \frac{1}{d_s^2} \sum_{t \to s} \frac{1}{\pi_t} - \frac{1}{d_s} \tag{8}
\]
One can notice that the minus term $\frac{1}{d_s}$ enables the variance to converge to $0$: if all $\pi_t = 1$, we get the exact result. However, in the sampling with replacement case, the variance goes to $0$ only as the sample size goes to infinity.
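A minimal sketch of the Poisson sampling step just described, with synthetic data and uniform $\pi_t$ chosen only for brevity (PLADIES would generally use non-uniform, importance-based probabilities): each candidate $t \in N(S)$ flips its own coin, and the Hajek estimator is then formed per seed from whichever candidates were included.

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson_layer_sample(pi):
    """Include t in T with probability pi[t]: sample r_t ~ U(0,1), keep t if r_t <= pi[t]."""
    r = rng.uniform(size=len(pi))
    return np.flatnonzero(r <= pi)

def hajek_per_seed(T, pi, M, nbrs_s):
    """Hajek estimate of H_s from the Poisson sample T for one seed s."""
    hits = [t for t in T if t in nbrs_s]              # T ∩ N(s)
    if not hits:
        return 0.0
    return sum(M[t] / pi[t] for t in hits) / sum(1.0 / pi[t] for t in hits)

# Synthetic candidate set N(S) of 10 vertices; target E[|T|] = n = 4.
n, M = 4, rng.normal(size=10)
pi = np.full(10, n / 10)                              # uniform pi_t, summing to n
T = poisson_layer_sample(pi)
print(T, hajek_per_seed(T, pi, M, nbrs_s=set(range(5))))
```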
3.2 LABOR: Layer Neighbor Sampling
The design philosophy of LABOR Sampling is to create a direct alternative to Neighbor Sampling
while incorporating the advantages of layer sampling. Mimicking Layer Sampling with Poisson