Hierarchical Graph Transformer with Adaptive Node Sampling

Zaixi Zhang1,2, Qi Liu1,2, Qingyong Hu3, Chee-Kong Lee4
1: Anhui Province Key Lab of Big Data Analysis and Application,
School of Computer Science and Technology, University of Science and Technology of China
2: State Key Laboratory of Cognitive Intelligence, Hefei, Anhui, China
3: Hong Kong University of Science and Technology, 4: Tencent America
zaixi@mail.ustc.edu.cn, qiliuql@ustc.edu.cn
qhuag@cse.ust.hk, cheekonglee@tencent.com
Abstract
The Transformer architecture has achieved remarkable success in a number of
domains including natural language processing and computer vision. However,
when it comes to graph-structured data, transformers have not achieved competitive
performance, especially on large graphs. In this paper, we identify the main
deficiencies of current graph transformers: (1) Existing node sampling strategies
in Graph Transformers are agnostic to the graph characteristics and the training
process. (2) Most sampling strategies only focus on local neighbors and neglect
the long-range dependencies in the graph. We conduct experimental investigations
on synthetic datasets to show that existing sampling strategies are sub-optimal. To
tackle the aforementioned problems, we formulate the optimization strategies of
node sampling in Graph Transformer as an adversarial bandit problem, where the
rewards are related to the attention weights and can vary in the training procedure.
Meanwhile, we propose a hierarchical attention scheme with graph coarsening
to capture the long-range interactions while reducing computational complexity.
Finally, we conduct extensive experiments on real-world datasets to demonstrate
the superiority of our method over existing graph transformers and popular GNNs.
Our code is open-sourced at https://github.com/zaixizhang/ANS-GT.
1 Introduction
In recent years, the Transformer architecture [33] and its variants (e.g., Bert [7] and ViT [8]) have
achieved unprecedented successes in natural language processing (NLP) and computer vision (CV).
In light of the superior performance of the Transformer, some recent works [21, 38] attempt to generalize
the Transformer to graph data by treating each node as a token and designing dedicated positional
encodings. However, most of these works only focus on small graphs such as molecular graphs
with tens of atoms [38]. For instance, Graphormer [38] achieves state-of-the-art performance on
molecular property prediction tasks. When it comes to large graphs, the computational and storage
complexity of the vanilla Transformer, which is quadratic in the number of nodes, inhibits practical
application. Although some Sparse Transformer methods [30, 2, 19] can improve the efficiency of
the vanilla Transformer, they do not exploit the unique characteristics of graph data and still require
quadratic or at least sub-quadratic space complexity, which is unaffordable in most practical
cases. Moreover, the full-attention mechanism potentially introduces noise from numerous irrelevant
nodes in the full graph.
Qi Liu is the corresponding author.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.03930v1 [cs.LG] 8 Oct 2022
To generalize the Transformer to large graphs, existing Transformer-based methods on graphs [9, 46, 7]
explicitly or implicitly restrict each node's receptive field to reduce the computational and storage
complexity. For example, Graph-Bert [41] restricts the receptive field of each node to the nodes
with top-k intimacy scores such as Personalized PageRank (PPR). GT-Sparse [9] only considers
1-hop neighboring nodes. We argue that existing Graph Transformers have the following deficiencies:
(1) The fixed node sampling strategies in existing Graph Transformers are ignorant of the graph
properties, which may sample uninformative nodes for attention. Therefore, an adaptive node
sampling strategy aware of the graph properties is needed. We conduct case studies in Section 4 to
support our arguments. (2) Though node sampling enables scalability, most node sampling
strategies focus on local neighbors and neglect the long-range dependencies and global contexts of
graphs. Hence, incorporating complementary global information is necessary for Graph Transformers.
To solve challenge (1), we propose Adaptive Node Sampling for Graph Transformer (ANS-GT)
and formulate the optimization of node sampling in Graph Transformer as an adversarial
bandit problem. Specifically, in ANS-GT, we modify the Exp4.P method [3] to adaptively assign weights
to several chosen sampling heuristics (e.g., 1-hop neighbors, 2-hop neighbors, PPR) and combine
these sampling strategies to sample informative nodes. The reward is proportional to the attention
weights and the sampling probabilities of nodes, i.e., the reward to a certain sampling heuristic is
higher if its sampling probability distribution and the node attention weight distribution are more
similar. Then, during the training of the Graph Transformer, the node sampling strategy is updated
simultaneously to sample more informative nodes. With more informative nodes fed into the
self-attention module, ANS-GT achieves better performance.
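To make the weighting idea concrete, the sketch below shows a simplified, Exp4.P-flavoured update in which each sampling heuristic is rewarded according to how much attention mass falls on the nodes it proposes. The function name, constants, and exact reward normalization are illustrative assumptions, not the paper's precise update rule.

```python
import numpy as np

def update_heuristic_weights(weights, heuristic_probs, attention_weights,
                             gamma=0.1, eta=0.05):
    """One adaptive-weight update for combining node-sampling heuristics.

    A simplified, Exp4.P-flavoured sketch (not ANS-GT's exact update):
    heuristics whose sampling distribution aligns with the observed
    attention weights receive larger rewards and thus larger weights.

    weights           : (K,) current weight of each sampling heuristic
    heuristic_probs   : (K, N) sampling distribution of each heuristic over nodes
    attention_weights : (N,) attention mass the Transformer assigned to each node
    """
    K, N = heuristic_probs.shape
    # Mix the heuristics into a single sampling distribution (with exploration).
    mix = weights / weights.sum()
    sampling_dist = (1 - gamma) * mix @ heuristic_probs + gamma / N

    # Reward each heuristic by how much attention fell on the nodes it favours,
    # importance-weighted by the actual sampling probability.
    attn = attention_weights / attention_weights.sum()
    rewards = (heuristic_probs * attn / (sampling_dist + 1e-12)).sum(axis=1)

    # Exponential-weights update, then renormalize.
    new_weights = weights * np.exp(eta * rewards / K)
    return new_weights / new_weights.sum()
```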
To tackle challenge (2), we propose a hierarchical attention scheme for Graph Transformer to
encode both local and global information for each node. The hierarchical attention scheme consists
of fine-grained local attention and coarse-grained global attention. In the local attention, we use the
aforementioned adaptive node sampling strategy to select informative local nodes for attention. As
for global attention, we first use graph coarsening algorithms [26] to pre-process the input graph and
generate a coarse graph. Such algorithms mimic a down-sampling of the original graph by grouping
the nodes into super-nodes while preserving global graph information as much as possible. The
center nodes then interact with the sampled super-nodes. Such coarse-grained global attention helps
each node capture long-range dependencies while reducing the computational complexity of the
vanilla Graph Transformer.
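The following sketch illustrates how a center node's input sequence could combine adaptively sampled local nodes with super-nodes from the coarse graph. Here `local_sampler`, the token counts, and the uniform super-node sampling are illustrative placeholders rather than ANS-GT's actual batching procedure.

```python
import numpy as np

def build_attention_tokens(center, X, X_coarse, local_sampler,
                           n_local=16, n_global=8, seed=0):
    """Assemble the token sequence a center node attends to (illustrative sketch).

    X            : (n, d) node features of the original graph
    X_coarse     : (n', d) super-node features from a pre-computed coarse graph
    local_sampler: any callable returning local node indices for `center`,
                   e.g., a mixture of 1-hop / 2-hop / PPR heuristics
    """
    rng = np.random.default_rng(seed)
    local_idx = local_sampler(center, n_local)                    # fine-grained local tokens
    global_idx = rng.choice(len(X_coarse), n_global, replace=False)  # coarse-grained tokens
    tokens = np.concatenate([X[[center]],          # the center node itself
                             X[local_idx],         # sampled local neighbors
                             X_coarse[global_idx]], axis=0)
    return tokens                                  # shape: (1 + n_local + n_global, d)
```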
We conduct extensive experiments on real-world datasets to show the effectiveness of ANS-GT. Our
method outperforms all the existing Graph Transformer architectures and obtains state-of-the-art
results on 6 benchmark datasets. Detailed analysis and ablation studies further show the superiority
of the adaptive node sampling module and the hierarchical attention scheme.
In summary, we make the following contributions:
• We propose Adaptive Node Sampling for Graph Transformer (ANS-GT), which modifies a
multi-armed bandit algorithm to adaptively sample nodes for attention.
• In the hierarchical attention scheme, we introduce coarse-grained global attention with
graph coarsening, which helps the Graph Transformer capture long-range dependencies while
increasing efficiency.
• We empirically evaluate our method on six benchmark datasets to show its advantage over
existing Graph Transformers and popular GNNs.
2 Related Work
2.1 Transformers for Graphs
Recently, the Transformer [33] has shown its superiority in an increasing number of domains [7, 8, 40],
e.g., Bert [7] in NLP and ViT [8] in CV. Existing works attempting to generalize the Transformer to graph
data mainly focus on two problems: (1) how to design dedicated positional encodings for the nodes;
(2) how to alleviate the quadratic computational complexity of the vanilla Transformer and scale
the Graph Transformer to large graphs. As for the positional encoding, GT [9] first uses Laplacian
eigenvectors to enhance node features. Graph-Bert [41] studies employing the Weisfeiler-Lehman code to
encode structural information. Graphormer [38] utilizes centrality encoding to enhance node features
while incorporating edge information with spatial (SPD-indexed attention bias) and edge encodings.
SAN [21] further replaces the static Laplacian eigenvectors with learnable positional encodings and
designs an attention mechanism that distinguishes local connectivity. For the scalability issue, one
immediate idea is to restrict the number of attending nodes. For example, GAT [34] and GT-Sparse
[9] only consider the 1-hop neighboring nodes; Gophormer [46] uses GraphSAGE [11] sampling
to uniformly sample ego-graphs with a pre-defined maximum depth; Graph-Bert [41] restricts the
receptive field of each node to the nodes with top-k intimacy scores (e.g., Katz and PPR). However,
these fixed node sampling strategies sacrifice the advantage of the Transformer architecture. SAC
[22] tries to use an LSTM edge predictor to predict edges for self-attention operations. However, the
fact that an LSTM can hardly be parallelized reduces the computational efficiency of the Transformer.
2.2 Sparse Transformers
In parallel, many efforts have been devoted to reducing the computational complexity of the Transformer
in the fields of NLP [23] and CV [32]. In the domain of NLP, Longformer [2] applies block-wise
or strided attention patterns that attend only to a fixed set of neighbors. Reformer [19] replaces dot-product attention
with approximate attention computation based on locality-sensitive hashing. Routing Transformer
[30] employs online k-means clustering on the tokens. Linformer [35] demonstrates that the
self-attention mechanism can be approximated by a low-rank matrix and reduces the complexity from
$O(n^2)$ to $O(n)$. As for vision transformers, Swin Transformer [24] proposes a shifted windowing
scheme which brings greater efficiency by limiting self-attention computation to non-overlapping
local windows while also allowing for cross-window connections. Focal Transformer [37] presents a
new mechanism incorporating both fine-grained local and coarse-grained global attention to capture
short- and long-range visual dependencies efficiently. However, these sparse transformers do not take
the unique properties of graphs into consideration.
2.3 Graph Neural Networks and Node Sampling
Graph neural networks (GNNs) [18, 11, 12, 44, 43, 31, 45, 42] follow a message-passing schema
that iteratively updates the representation of a node by aggregating representations from neighboring
nodes. When generalized to large graphs, GNNs face a similar scalability issue.
This is mainly due to the uncontrollable neighborhood expansion in the aggregation stage of GNNs.
Several node sampling algorithms have been proposed to limit the neighborhood expansion; they
mainly fall into node-wise sampling methods and layer-wise sampling methods. In node-wise
sampling, each node samples $k$ neighbors from its sampling distribution, so the total number of
nodes in the $l$-th layer becomes $O(k^l)$. GraphSAGE [11] is one of the most well-known node-wise
sampling methods, with a uniform sampling distribution. GCN-BS [25] introduces a variance-reduced
sampler based on multi-armed bandits. To alleviate the exponential neighbor expansion
$O(k^l)$ of node-wise samplers, layer-wise samplers define the sampling distribution as the probability
of sampling nodes given the set of nodes in the upper layer [4, 16, 49]. From another perspective, these
sampling methods can also be categorized into fixed sampling strategies [11, 4, 49] and adaptive
strategies [25, 39]. However, none of the above sampling methods for GNNs can be directly applied
to Graph Transformers, as Graph Transformers do not follow the message-passing schema.
3 Preliminaries
3.1 Problem Definition
Let $G = (A, X)$ denote an unweighted graph, where $A \in \mathbb{R}^{n \times n}$ is the symmetric adjacency
matrix over $n$ nodes and $X \in \mathbb{R}^{n \times p}$ is the attribute matrix with $p$ attributes per node. The element
$A_{ij}$ of the adjacency matrix equals 1 if there exists an edge between node $v_i$ and node $v_j$, and
$A_{ij} = 0$ otherwise. The label of node $v_i$ is $y_i$. In the node classification problem, the classifier has
knowledge of the labels of a subset of nodes $V_L$. The goal of semi-supervised node classification is
to infer the labels of the nodes in $V \setminus V_L$ by learning a classification function.
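A toy instance of this setup, with arbitrary sizes chosen purely for illustration:

```python
import numpy as np

# Toy instance of the semi-supervised node classification setup above.
n, p = 5, 3
A = np.zeros((n, n), dtype=int)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:   # a small path graph
    A[i, j] = A[j, i] = 1                       # symmetric, unweighted adjacency
X = np.random.rand(n, p)                        # node attribute matrix
y = np.array([0, 0, 1, 1, 1])                   # node labels y_i
V_L = np.array([0, 2])                          # labeled nodes
V_U = np.setdiff1d(np.arange(n), V_L)           # nodes whose labels must be inferred
```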
3.2 Transformer Architecture
The Transformer architecture consists of a series of Transformer layers [33]. Each Transformer layer
has two parts: a multi-head self-attention (MHA) module and a position-wise feed-forward network
(FFN). Let $H = [h_1, \cdots, h_m]^\top \in \mathbb{R}^{m \times d}$ denote the input to the self-attention module, where $d$ is
the hidden dimension, $h_i \in \mathbb{R}^{d \times 1}$ is the hidden representation at position $i$, and $m$ is the number of
positions. The MHA module first projects the input $H$ to query-, key-, and value-spaces, denoted as
$Q, K, V$, using three matrices $W_Q \in \mathbb{R}^{d \times d_K}$, $W_K \in \mathbb{R}^{d \times d_K}$, and $W_V \in \mathbb{R}^{d \times d_V}$:
$$Q = HW_Q, \quad K = HW_K, \quad V = HW_V. \tag{1}$$
Then, in each head $i \in \{1, 2, \ldots, B\}$ ($B$ is the total number of heads), the scaled dot-product
attention mechanism is applied to the corresponding $\{Q_i, K_i, V_i\}$:
$$\mathrm{head}_i = \mathrm{Softmax}\left(\frac{Q_i K_i^\top}{\sqrt{d_K}}\right) V_i. \tag{2}$$
Finally, the outputs from the different heads are concatenated and transformed to obtain the final
output of MHA:
$$\mathrm{MHA}(H) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_B)\, W_O, \tag{3}$$
where $W_O \in \mathbb{R}^{d \times d}$. In this work, we employ $d_K = d_V = d/B$.
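For reference, the PyTorch sketch below implements Eqs. (1)-(3) directly; it is a plain MHA module and omits the node sampling and positional-encoding components discussed elsewhere in the paper.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Straightforward implementation of Eqs. (1)-(3); a reference sketch only."""

    def __init__(self, d, num_heads):
        super().__init__()
        assert d % num_heads == 0
        self.B, self.d_k = num_heads, d // num_heads   # d_K = d_V = d / B
        self.W_Q = nn.Linear(d, d, bias=False)         # stacks all heads' W_Q
        self.W_K = nn.Linear(d, d, bias=False)
        self.W_V = nn.Linear(d, d, bias=False)
        self.W_O = nn.Linear(d, d, bias=False)

    def forward(self, H):                              # H: (m, d)
        m, d = H.shape
        # Eq. (1): project to query/key/value spaces and split into B heads.
        Q = self.W_Q(H).view(m, self.B, self.d_k).transpose(0, 1)  # (B, m, d_k)
        K = self.W_K(H).view(m, self.B, self.d_k).transpose(0, 1)
        V = self.W_V(H).view(m, self.B, self.d_k).transpose(0, 1)
        # Eq. (2): scaled dot-product attention per head.
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)     # (B, m, m)
        heads = torch.softmax(scores, dim=-1) @ V                  # (B, m, d_k)
        # Eq. (3): concatenate heads and apply the output projection W_O.
        return self.W_O(heads.transpose(0, 1).reshape(m, d))       # (m, d)
```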
3.3 Graph Coarsening
The goal of graph coarsening [29, 26, 17] is to reduce the number of nodes in a graph by clustering
them into super-nodes while preserving the global information of the graph as much as possible.
Given a graph $G = (V, E)$ ($V$ is the node set and $E$ is the edge set), the coarse graph is a smaller
weighted graph $G' = (V', E')$. $G'$ is obtained from the original graph by first computing a partition
$\{C_1, C_2, \cdots, C_{|V'|}\}$ of $V$, i.e., the clusters $C_1, \cdots, C_{|V'|}$ are disjoint and cover all the nodes in $V$.
Each cluster $C_i$ corresponds to a super-node in $G'$. The partition can also be characterized by a matrix
$\hat{P} \in \{0, 1\}^{|V| \times |V'|}$, with $\hat{P}_{ij} = 1$ if and only if node $v_i$ in $G$ belongs to cluster $C_j$. Its normalized
version is defined by $P \triangleq \hat{P} D^{-\frac{1}{2}}$, where $D$ is a $|V'| \times |V'|$ diagonal matrix with $|C_i|$ as its $i$-th
diagonal entry. The feature matrix and weighted adjacency matrix of $G'$ are defined by $X' \triangleq P^\top X$
and $A' \triangleq P^\top A P$. After graph coarsening, the number of nodes/edges in $G'$ is significantly smaller
than that of $G$. The coarsening rate is defined as $c = \frac{|V'|}{|V|}$.
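Given a hard partition produced by any off-the-shelf coarsening algorithm, the coarse-graph quantities above can be computed as in the following NumPy sketch (which assumes every cluster is non-empty):

```python
import numpy as np

def coarsen(A, X, clusters):
    """Construct the coarse graph from a node partition (sketch of Section 3.3).

    A        : (n, n) adjacency matrix of G
    X        : (n, p) feature matrix of G
    clusters : length-n integer array, clusters[i] = super-node containing v_i
    """
    n, n_super = A.shape[0], clusters.max() + 1
    P_hat = np.zeros((n, n_super))
    P_hat[np.arange(n), clusters] = 1.0          # hard assignment matrix P-hat
    D = np.diag(P_hat.sum(axis=0))               # |C_i| on the diagonal
    P = P_hat @ np.linalg.inv(np.sqrt(D))        # P = P-hat D^{-1/2}
    X_coarse = P.T @ X                           # X' = P^T X
    A_coarse = P.T @ A @ P                       # A' = P^T A P
    c = n_super / n                              # coarsening rate
    return A_coarse, X_coarse, c
```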
4 Motivating Observations
To generalize Transformer to large graphs, existing Graph Transformer models typically choose to
sample a batch of nodes for attention. However, real-world graph datasets exhibit different properties,
which makes a fixed node sampling strategy unsuitable for all kinds of graphs. Here, we present a
simple yet intuitive case study to illustrate how the performance of Graph Transformer changes with
different node sampling strategies. The main idea is to use four popular node sampling strategies
for node classification: 1-hop neighbors, 2-hop neighbors, PPR, and KNN. Then, we will check the
performance of Graph Transformer on graphs with different properties. If the performance drops
sharply when the property varies, it will indicate that graphs with different properties may require
different node sampling strategies.
Figure 1: Performance of Graph Transformer using different node sampling mechanisms (1-hop, 2-hop, PPR, and kNN, respectively) on Newman networks.
We conduct experiments on the Newman artificial networks [10] since they enable us to obtain networks
with different properties easily. More detailed settings can be found in Appendix A. Here, we consider
the property of homophily/heterophily as one example. Following previous works [47], the degree of
homophily $\alpha$ is defined as the fraction of edges in a network connecting nodes with the same class
label, i.e., $\alpha \triangleq \frac{|\{(v_i, v_j) \in E : y_i = y_j\}|}{|E|}$. Graphs with $\alpha$ closer to 1 tend
to have more edges connecting nodes within the same class, i.e., strong homophily, whereas networks
with $\alpha$ closer to 0 have more edges connecting nodes in different classes, i.e., strong heterophily.
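For an adjacency matrix and label vector, this quantity can be computed as in the short sketch below (each undirected edge counted once):

```python
import numpy as np

def homophily(A, y):
    """Edge homophily alpha: fraction of edges joining nodes with the same label."""
    src, dst = np.nonzero(np.triu(A, k=1))   # upper triangle: each edge once
    return np.mean(y[src] == y[dst])
```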
As shown in Figure 1, no single sampling strategy dominates the others across the spectrum of
homophily degrees.