Layer-Neighbor Sampling — Defusing Neighborhood
Explosion in GNNs
Muhammed Fatih Balın
balin@gatech.edu
Ümit V. Çatalyürek
umit@gatech.edu

School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA
Part of this work was done during an internship at NVIDIA.
Amazon Web Services. This publication describes work performed at the Georgia Institute of Technology and is not associated with AWS.
Abstract
Graph Neural Networks (GNNs) have received significant attention recently, but
training them at a large scale remains a challenge. Mini-batch training coupled
with sampling is used to alleviate this challenge. However, existing approaches
either suffer from the neighborhood explosion phenomenon or have poor perfor-
mance. To address these issues, we propose a new sampling algorithm called
LAyer-neighBOR sampling (LABOR). It is designed to be a direct replacement for
Neighbor Sampling (NS) with the same fanout hyperparameter while sampling up
to 7 times fewer vertices, without sacrificing quality. By design, the variance of
the estimator of each vertex matches NS from the point of view of a single vertex.
Moreover, under the same vertex sampling budget constraints, LABOR converges
faster than existing layer sampling approaches and can use up to 112 times larger
batch sizes compared to NS.
1 Introduction
Graph Neural Networks (GNN) Hamilton et al. [2017], Kipf and Welling [2017] have become de
facto models for representation learning on graph structured data. Hence they have started being
deployed in production systems Ying et al. [2018], Niu et al. [2020]. These models iteratively update
the node embeddings by passing messages along the direction of the edges in the given graph with
nonlinearities in between different layers. With $l$ layers, the computed node embeddings contain information from the $l$-hop neighborhood of the seed vertex.
In the production setting, the GNN models need to be trained on billion-scale graphs [Ching et al.,
2015, Ying et al., 2018]. The training of these models takes hours to days even on distributed
systems Zheng et al. [2022b,a]. As in general Deep Neural Networks (DNN), it is more efficient to
use mini-batch training [Bertsekas, 1994] on GNNs, even though it is a bit trickier in this case. The
node embeddings in GNNs depend recursively on their set of neighbors' embeddings, so when there are $l$ layers, this dependency spans the $l$-hop neighborhood of the node. Real-world graphs usually have a very small diameter, and if $l$ is large, the $l$-hop neighborhood may very well span the entire graph; this is known as the Neighborhood Explosion Phenomenon (NEP) [Zeng et al., 2020].
To solve these issues, researchers proposed sampling a subgraph of the $l$-hop neighborhood of the
nodes in the batch. There are mainly three different approaches: Node-based, Layer-based and
Subgraph-based methods. Node-based sampling methods [Hamilton et al., 2017, Chen et al., 2018a,
Liu et al., 2020, Zhang et al., 2021] sample independently and recursively for each node. It was
noticed that node-based methods sample subgraphs that are too shallow, i.e., with a low ratio of
number of edges to nodes. Thus layer-based sampling methods were proposed [Chen et al., 2018b, Zou et al., 2019, Huang et al., 2018, Dong et al., 2021], where the sampling for the whole layer is done collectively. On the other hand, subgraph sampling methods [Chiang et al., 2019, Zeng et al., 2020, Hu et al., 2020b, Zeng et al., 2021, Fey et al., 2021, Shi et al., 2023] do not use the recursive layer-by-layer sampling scheme used in the node- and layer-based sampling methods and instead tend to use the same subgraph for all of the layers. Some of these sampling methods take the
magnitudes of embeddings into account [Liu et al., 2020, Zhang et al., 2021, Huang et al., 2018],
while others, such as Chen et al. [2018a], Cong et al. [2021], Fey et al. [2021], Shi et al. [2023], cache
the historical embeddings to reduce the variance of the computed approximate embeddings. There are also methods that sample from a vertex cache filled with popular vertices [Dong et al., 2021]. Most of these
approaches are orthogonal to each other and they can be incorporated into other sampling algorithms.
Node-based sampling methods suffer the most from the NEP but they guarantee a good approximation
for each embedding by ensuring each vertex gets $k$ neighbors, where $k$ is the only hyperparameter of the sampling algorithm. Layer-based sampling methods do not suffer as much from the NEP because the number of vertices sampled is a hyperparameter, but they cannot guarantee that each vertex's approximation is good enough, and their hyperparameters are also hard to reason about: the number of nodes to sample at each layer depends highly on the graph structure (as the numbers in Table 2 show). Subgraph sampling methods usually sample sparser subgraphs compared to their node- and
layer-based counterparts. Hence, in this paper, we focus on the node- and layer-based sampling
methods and combine their advantages. The major contributions of this work can be listed as follows:
- We propose the use of Poisson Sampling for GNNs, taking advantage of its lower variance and computational efficiency compared to sampling without replacement. Applying it to the existing layer sampling method LADIES, we get the superior PLADIES method, outperforming the former by up to 2% in terms of F1-score.
- We propose a new sampling algorithm called LABOR, combining the advantages of neighbor and layer sampling approaches using Poisson Sampling. LABOR correlates the sampling procedures of the given set of seed nodes so that the sampled vertices from different seeds have a lot of overlap, resulting in a 7× and 4× reduction in the number of vertices and edges sampled compared to NS, respectively. Furthermore, LABOR can sample up to 13× fewer edges compared to LADIES.
- We experimentally verify our findings and show that our proposed sampling algorithm LABOR outperforms both neighbor sampling and layer sampling approaches. LABOR can enjoy a batch size up to 112× larger than NS while sampling the same number of vertices.
2 Background
Graph Neural Networks: Given a directed graph $G = (V, E)$, where $V$ and $E \subseteq V \times V$ are the vertex and edge sets respectively, $(t \to s) \in E$ denotes an edge from a source vertex $t \in V$ to a destination vertex $s \in V$, and $A_{ts}$ denotes the corresponding edge weight if provided. If we have a batch of seed vertices $S \subseteq V$, let us define the $l$-hop neighborhood $N^l(S)$ for the incoming edges as follows:
\[
N(s) = \{ t \mid (t \to s) \in E \}, \qquad N^1(S) = N(S) = \bigcup_{s \in S} N(s), \qquad N^l(S) = N(N^{(l-1)}(S)) \tag{1}
\]
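For concreteness, here is a minimal Python sketch (an illustration, not code from the paper) of expanding the incoming $l$-hop neighborhood $N^l(S)$ of Eq. (1) hop by hop from an edge list; the toy graph is a made-up example.

```python
def l_hop_neighborhood(edges, seeds, l):
    """N^l(S) for incoming edges, expanded hop by hop as in Eq. (1)."""
    frontier = set(seeds)
    for _ in range(l):
        # N of the current frontier: sources t of all edges (t -> s) with s in the frontier.
        frontier = {t for t, s in edges if s in frontier}
    return frontier

# Toy graph (hypothetical): edges are (source t, destination s).
edges = [(0, 1), (2, 1), (3, 2), (4, 2), (1, 3)]
print(l_hop_neighborhood(edges, seeds={1}, l=1))  # N(S)   = {0, 2}
print(l_hop_neighborhood(edges, seeds={1}, l=2))  # N^2(S) = {3, 4}
```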
Let us also define the degree $d_s$ of vertex $s$ as $d_s = |N(s)|$. To simplify, let's assume uniform edge weights, $A_{ts} = 1, \forall (t \to s) \in E$. Then, our goal is to estimate the following for each vertex $s \in S$, where $H^{(l-1)}_t$ is defined as the embedding of the vertex $t$ at layer $l-1$, $W^{(l-1)}$ is the trainable weight matrix at layer $l-1$, and $\sigma$ is the nonlinear activation function [Hamilton et al., 2017]:
\[
Z^l_s = \frac{1}{d_s} \sum_{t \to s} H^{(l-1)}_t W^{(l-1)}, \qquad H^l_s = \sigma(Z^l_s) \tag{2}
\]
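The NumPy sketch below spells out Eq. (2) for a single layer, averaging the transformed embeddings of each seed's incoming neighbors; the shapes, the toy graph, and the choice of ReLU for $\sigma$ are illustrative assumptions rather than the paper's setup.

```python
import numpy as np

def gnn_layer(edges, H_prev, W, seeds):
    """Eq. (2): Z_s = (1/d_s) * sum_{t->s} H_t W, then H_s = sigma(Z_s)."""
    M = H_prev @ W                                   # M_t = H_t W for every vertex t
    Z = np.zeros((len(seeds), W.shape[1]))
    for i, s in enumerate(seeds):
        nbrs = [t for t, d in edges if d == s]       # N(s), incoming neighbors of s
        if nbrs:
            Z[i] = M[nbrs].mean(axis=0)              # (1/d_s) * sum over t -> s
    return np.maximum(Z, 0.0)                        # sigma = ReLU (an assumption)

# Toy usage with random features (hypothetical sizes).
rng = np.random.default_rng(0)
edges = [(0, 1), (2, 1), (3, 2), (4, 2), (1, 3)]
H = rng.normal(size=(5, 8))   # previous-layer embeddings H^{(l-1)}
W = rng.normal(size=(8, 4))   # trainable weights W^{(l-1)}
print(gnn_layer(edges, H, W, seeds=[1, 2, 3]).shape)  # (3, 4)
```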
Exact Stochastic Gradient Descent: If we have a node prediction task and $V_t \subseteq V$ is the set of training vertices, $y_s, \forall s \in V_t$ are the labels of the prediction task, and $\ell$ is the loss function for the prediction task, then our goal is to minimize the following loss function: $\frac{1}{|V_t|} \sum_{s \in V_t} \ell(y_s, Z^l_s)$. Replacing $V_t$ in the loss function with $S \subseteq V_t$ for each iteration of gradient descent, we get stochastic gradient descent for GNNs. However, with $l$ layers, the computation dependency is on $N^l(S)$, which reaches a large portion of real-world graphs, i.e., $|N^l(S)| \approx |V|$, making each iteration costly both in terms of computation and memory.
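As a small worked illustration of the objective above (with a stand-in squared-error loss and made-up sizes, not the paper's task), one SGD iteration evaluates the loss only on a random batch $S \subseteq V_t$:

```python
import numpy as np

rng = np.random.default_rng(0)

def full_loss(Z, y, train_ids):
    """Exact objective: (1/|V_t|) * sum over all training vertices of ell(y_s, Z^l_s)."""
    return np.mean((Z[train_ids] - y[train_ids]) ** 2)   # ell = squared error (stand-in)

def sgd_batch_loss(Z, y, train_ids, batch_size):
    """One SGD iteration's objective: the same loss over a random batch S subset of V_t."""
    S = rng.choice(train_ids, size=batch_size, replace=False)
    return np.mean((Z[S] - y[S]) ** 2)

# Hypothetical scalar predictions Z_s and labels y_s for |V| = 1000, |V_t| = 600.
Z, y = rng.normal(size=1000), rng.normal(size=1000)
train_ids = np.arange(600)
print(full_loss(Z, y, train_ids), sgd_batch_loss(Z, y, train_ids, batch_size=32))
```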
Neighbor Sampling: The neighbor sampling approach was proposed by Hamilton et al. [2017] to approximate $Z^{(l)}_s$ for each $s \in S$ with a subset of $N^l(S)$. Given a fanout hyperparameter $k$, this subset is computed recursively by randomly picking $k$ neighbors for each $s \in S$ from $N(s)$ to form the next layer $S^1$, which is a subset of $N^1(S)$. If $d_s \leq k$, then the exact neighborhood $N(s)$ is used. For the next layer, $S^1$ is treated as the new set of seed vertices and this procedure is applied recursively.
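The following sketch of the recursive fanout-$k$ procedure is a simplified illustration (not the DGL or PyG implementation): each seed draws at most $k$ incoming neighbors, and the union of drawn neighbors becomes the seed set of the next layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def neighbor_sampling(in_nbrs, seeds, fanouts):
    """Recursive neighbor sampling; in_nbrs[s] = list of t with (t -> s) in E."""
    layers, S = [], list(seeds)
    for k in fanouts:                                 # one fanout per GNN layer
        sampled_edges = []
        for s in S:
            nbrs = in_nbrs.get(s, [])
            if len(nbrs) > k:                         # d_s > k: pick k neighbors at random
                nbrs = rng.choice(nbrs, size=k, replace=False).tolist()
            sampled_edges += [(t, s) for t in nbrs]   # else keep the exact N(s)
        layers.append(sampled_edges)
        S = sorted({t for t, _ in sampled_edges})     # next layer's seed set S^1
    return layers

# Toy adjacency (hypothetical): vertex -> incoming neighbors.
in_nbrs = {0: [1, 2, 3, 4], 1: [2, 5], 2: [3, 4, 5, 6], 5: [0, 6]}
print(neighbor_sampling(in_nbrs, seeds=[0, 1], fanouts=[2, 2]))
```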
Revisiting LADIES, Dependent Layer-based Sampling: From now on, we will drop the layer notation, focus on a single layer, and ignore the nonlinearities. Let us define $M_t = H_t W$ as a shorthand notation. Then our goal is to approximate:
\[
H_s = \frac{1}{d_s} \sum_{t \to s} M_t \tag{3}
\]
If we assign probabilities $\pi_t > 0, \forall t \in N(S)$, normalize them so that $\sum_{t \in N(S)} \pi_t = 1$, then use sampling with replacement to sample $T \subseteq N(S)$ with $|T| = n$, where $n$ is the number of vertices to sample given as input to the LADIES algorithm and $T$ is a multi-set possibly with multiple copies of the same vertices, and let $\tilde{d}_s = |T \cap N(s)|$ be the number of sampled vertices for a given vertex $s$, we get the following two possible estimators for each vertex $s \in S$:
\[
H'_s = \frac{1}{n d_s} \sum_{t \in T \cap N(s)} \frac{M_t}{\pi_t} \tag{4a}
\qquad
H''_s = \frac{\sum_{t \in T \cap N(s)} M_t / \pi_t}{\sum_{t \in T \cap N(s)} 1 / \pi_t} \tag{4b}
\]
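To make (4a) and (4b) concrete, here is a hedged NumPy sketch of with-replacement layer sampling and both estimators for a single seed $s$; the candidate set, the probabilities $\pi_t$, and the values $M_t$ are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_sample_with_replacement(pi, n):
    """Draw a multiset T of n vertex ids from N(S) with probabilities pi (summing to 1)."""
    return rng.choice(len(pi), size=n, replace=True, p=pi)

def estimators(T, pi, M, nbrs_s, n):
    """Horvitz-Thompson (4a) and Hajek (4b) estimates of H_s for one seed s."""
    d_s = len(nbrs_s)
    hits = [t for t in T if t in nbrs_s]                      # T ∩ N(s), with multiplicity
    ht = sum(M[t] / pi[t] for t in hits) / (n * d_s)          # Eq. (4a)
    hajek = (sum(M[t] / pi[t] for t in hits) /
             sum(1.0 / pi[t] for t in hits)) if hits else 0.0  # Eq. (4b)
    return ht, hajek

# Synthetic N(S) of 10 candidate vertices; seed s is adjacent to the first 4 of them.
M = rng.normal(size=10)              # stand-in for M_t = H_t W (scalars here)
pi = np.full(10, 0.1)                # uniform probabilities summing to 1
nbrs_s = set(range(4))
T = layer_sample_with_replacement(pi, n=6)
print(estimators(T, pi, M, nbrs_s, n=6))
```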
Note that $H'_s$ in (4a) is the Horvitz-Thompson estimator and $H''_s$ in (4b) is the Hajek estimator. For a comparison between the two and how to get an even better estimator by combining them, see Khan and Ugander [2021]. The formulation in the LADIES paper uses $H'_s$, but it proposes to row-normalize the sampled adjacency matrix, meaning they use $H''_s$ in their implementation. However, analyzing the variance of the Horvitz-Thompson estimator is simpler, and its variance serves as an upper bound for the variance of the Hajek estimator when $|M_t|$ and $\pi_t$ are uncorrelated [Khan and Ugander, 2021, Dorfman, 1997], which we assume to be true in our case. Note that the variance analysis is simplified to be element-wise for all vectors involved.
\[
\mathrm{Var}(H''_s) \lessapprox \mathrm{Var}(H'_s) = \frac{1}{\tilde{d}_s d_s^2} \sum_{t \to s} \pi_t \sum_{t' \to s} \frac{\mathrm{Var}(M_{t'})}{\pi_{t'}} \tag{5}
\]
Since we do not have access to the computed embeddings and to simplify the analysis, we assume that $\mathrm{Var}(M_t) = 1$ from now on. One can see that $\mathrm{Var}(H'_s)$ is minimized when $\pi_t = p, \forall t \to s$ under the constraint $\sum_{t \to s} \pi_t \leq p\, d_s$ for some constant $p \in [0, 1]$, hence any deviation from uniformity increases the variance. The variance is also smaller the larger $\tilde{d}_s$ is. However, in theory and in practice, there is no guarantee that each vertex $s \in S$ will get any neighbors in $T$, not to mention equal numbers of neighbors. Some vertices will have pretty good estimators with thousands of samples and very low variances, while others might not even get a single neighbor sampled. For this reason, we designed LABOR so that every vertex in $S$ will sample enough neighbors in expectation.
While LADIES is optimal from an approximate matrix multiplication perspective [Chen et al., 2022], it is far from optimal in the presence of nonlinearities and multiple layers. Even with a single layer, the loss functions used are nonlinear. Moreover, the existence of nonlinearities in between layers and the fact that there are multiple layers exacerbate this issue and necessitate that each vertex gets a good enough estimator with low enough variance. Also, LADIES gives a formulation using sampling with replacement instead of without replacement, which is sub-optimal from the variance perspective, while its implementation uses sampling without replacement without accounting for the bias created thereby. In the next section, we will show how all of these problems are addressed by our newly proposed Poisson sampling framework and LABOR sampling.
3 Proposed Layer Sampling Methods
Node-based sampling methods suffer from sampling too shallow subgraphs, leading to NEP in just a few hops (e.g., see Table 2). Layer sampling methods [Zou et al., 2019] attempt to fix this by sampling a fixed number of vertices in each layer; however, they cannot ensure that the estimators for the vertices are of high quality, and it is hard to reason about how to choose the number of vertices to sample in each layer. LADIES [Zou et al., 2019] proposes using the same number for each layer, while papers evaluating it found it is better to sample an increasing number of vertices in each layer [Liu et al., 2020, Chen et al., 2022]. There is no systematic way to choose how many vertices to sample in each layer for the LADIES method, and since each graph has a different density and connectivity structure, this choice highly depends on the graph in question. Therefore, due to its simplicity and high-quality results, Neighbor Sampling currently seems to be the most popular sampling approach, and there exist high-quality implementations on both CPUs and GPUs in the popular GNN frameworks [Wang et al., 2019, Fey and Lenssen, 2019].
We propose a new approach that combines the advantages of layer and neighbor sampling approaches
using a vertex-centric variance based framework, reducing the number of sampled vertices drastically
while ensuring the training quality does not suffer and matches the quality of neighbor sampling.
Another advantage of our method is that the user only needs to choose the batch size and the fanout hyperparameters, as in the Neighbor Sampling approach; the algorithm itself then samples the minimum number of vertices in the later layers in an unbiased way while ensuring each vertex gets enough neighbors and a good approximation.
We achieve all the previously mentioned good properties with the help of Poisson Sampling. So, the next section will demonstrate applying Poisson Sampling to Layer Sampling, and then we will show how the advantages of Layer and Neighbor Sampling methods can be combined into LABOR while getting rid of their cons altogether.
3.1 Poisson Layer Sampling (PLADIES)
In layer sampling, the main idea can be summarized as individual vertices making correlated decisions while sampling their neighbors, because in the end, if a vertex $t$ is sampled, all edges into the seed vertices $S$, i.e., $t \to s, s \in S$, are added to the sampled subgraph. This can be interpreted as the vertices in $S$ making a collective decision on whether to sample $t$ or not.
The other thing to keep in mind is that the existing layer sampling methods use sampling with replacement when doing importance sampling with unequal probabilities, because it is nontrivial to compute the inclusion probabilities in the without-replacement case. The Hajek estimator in the without-replacement case with equal probabilities becomes:
\[
H''_s = \frac{\sum_{t \in T \cap N(s)} M_t / \bar{\pi}_t}{\sum_{t \in T \cap N(s)} 1 / \bar{\pi}_t} = \frac{\sum_{t \in T \cap N(s)} M_t |N(S)|}{\sum_{t \in T \cap N(s)} |N(S)|} = \frac{1}{\tilde{d}_s} \sum_{t \in T \cap N(s)} M_t \tag{6}
\]
and it has the variance:
\[
\mathrm{Var}(H''_s) = \frac{d_s - \tilde{d}_s}{d_s - 1} \cdot \frac{1}{\tilde{d}_s} \tag{7}
\]
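Equation (7) has the form of the familiar finite-population correction for the variance of a without-replacement sample mean; the short Monte Carlo check below (a sanity check under the unit-variance assumption on $M_t$, not from the paper) compares it against the empirical variance of the estimator in Eq. (6).

```python
import numpy as np

rng = np.random.default_rng(0)

d_s, d_tilde, trials = 20, 5, 100_000
M = rng.normal(size=d_s)                 # neighbor values M_t
M = (M - M.mean()) / M.std()             # force unit (population) variance for the check

# Empirical variance of the equal-probability, without-replacement Hajek estimator (6),
# i.e. the plain sample mean over d_tilde sampled neighbors of s.
est = np.array([rng.choice(M, size=d_tilde, replace=False).mean() for _ in range(trials)])
print("empirical:", est.var())
print("eq. (7):  ", (d_s - d_tilde) / (d_s - 1) / d_tilde)
```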
Let us show how one can do layer sampling using Poisson sampling (PLADIES). Given probabilities $\pi_t \in [0, 1], \forall t \in N(S)$ so that $\sum_{t \in N(S)} \pi_t = n$, we include $t \in N(S)$ in our sample $T$ with probability $\pi_t$ by flipping a coin for it, i.e., we sample $r_t \sim U(0, 1)$ and include $t$ in $T$ if $r_t \leq \pi_t$. In the end, $E[|T|] = n$ and we can still use the Hajek estimator $H''_s$ or the Horvitz-Thompson estimator $H'_s$ to estimate $H_s$. Doing layer sampling this way is unbiased by construction and achieves the same goal in linear time, in contrast to the quadratic time debiasing approach explained in Chen et al. [2022]. The variance then approximately becomes [Williams et al., 1998] (see Appendix A.1 for a derivation):
\[
\mathrm{Var}(H''_s) \lessapprox \mathrm{Var}(H'_s) = \frac{1}{d_s^2} \sum_{t \to s} \frac{1}{\pi_t} - \frac{1}{d_s} \tag{8}
\]
One can notice that the minus term $\frac{1}{d_s}$ enables the variance to converge to $0$: if all $\pi_t = 1$, we get the exact result. However, in the sampling with replacement case, the variance goes to $0$ only as the sample size goes to infinity.
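A minimal sketch of the Poisson sampling step just described, with synthetic data and uniform $\pi_t$ chosen only for brevity (PLADIES would generally use non-uniform, importance-based probabilities): each candidate $t \in N(S)$ flips its own coin, and the Hajek estimator is then formed per seed from whichever candidates were included.

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson_layer_sample(pi):
    """Include t in T with probability pi[t]: sample r_t ~ U(0,1), keep t if r_t <= pi[t]."""
    r = rng.uniform(size=len(pi))
    return np.flatnonzero(r <= pi)

def hajek_per_seed(T, pi, M, nbrs_s):
    """Hajek estimate of H_s from the Poisson sample T for one seed s."""
    hits = [t for t in T if t in nbrs_s]              # T ∩ N(s)
    if not hits:
        return 0.0
    return sum(M[t] / pi[t] for t in hits) / sum(1.0 / pi[t] for t in hits)

# Synthetic candidate set N(S) of 10 vertices; target E[|T|] = n = 4.
n, M = 4, rng.normal(size=10)
pi = np.full(10, n / 10)                              # uniform pi_t, summing to n
T = poisson_layer_sample(pi)
print(T, hajek_per_seed(T, pi, M, nbrs_s=set(range(5))))
```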
3.2 LABOR: Layer Neighbor Sampling
The design philosophy of LABOR Sampling is to create a direct alternative to Neighbor Sampling
while incorporating the advantages of layer sampling. Mimicking Layer Sampling with Poisson