Hierarchical Graph Transformer with Adaptive Node Sampling

Zaixi Zhang1,2, Qi Liu1,2, Qingyong Hu3, Chee-Kong Lee4
1: Anhui Province Key Lab of Big Data Analysis and Application,
School of Computer Science and Technology, University of Science and Technology of China
2: State Key Laboratory of Cognitive Intelligence, Hefei, Anhui, China
3: Hong Kong University of Science and Technology, 4: Tencent America
zaixi@mail.ustc.edu.cn, qiliuql@ustc.edu.cn
qhuag@cse.ust.hk, cheekonglee@tencent.com
Abstract
The Transformer architecture has achieved remarkable success in a number of
domains including natural language processing and computer vision. However,
when it comes to graph-structured data, transformers have not achieved competitive
performance, especially on large graphs. In this paper, we identify the main
deficiencies of current graph transformers: (1) Existing node sampling strategies
in Graph Transformers are agnostic to the graph characteristics and the training
process. (2) Most sampling strategies only focus on local neighbors and neglect
the long-range dependencies in the graph. We conduct experimental investigations
on synthetic datasets to show that existing sampling strategies are sub-optimal. To
tackle the aforementioned problems, we formulate the optimization strategies of
node sampling in Graph Transformer as an adversarial bandit problem, where the
rewards are related to the attention weights and can vary in the training procedure.
Meanwhile, we propose a hierarchical attention scheme with graph coarsening
to capture the long-range interactions while reducing computational complexity.
Finally, we conduct extensive experiments on real-world datasets to demonstrate
the superiority of our method over existing graph transformers and popular GNNs.
Our code is open-sourced at https://github.com/zaixizhang/ANS-GT.
1 Introduction
In recent years, the Transformer architecture [33] and its variants (e.g., Bert [7] and ViT [8]) have
achieved unprecedented successes in natural language processing (NLP) and computer vision (CV).
In light of the superior performance of the Transformer, some recent works [21, 38] attempt to generalize
the Transformer to graph data by treating each node as a token and designing dedicated positional
encodings. However, most of these works only focus on small graphs such as molecular graphs
with tens of atoms [38]. For instance, Graphormer [38] achieves state-of-the-art performance on
molecular property prediction tasks. When it comes to large graphs, the computational and storage
complexity of the vanilla Transformer, which is quadratic in the number of nodes, inhibits practical
application. Although some Sparse Transformer methods [30, 2, 19] can improve the efficiency of
the vanilla Transformer, they do not exploit the unique characteristics of graph data and still require
quadratic or at least sub-quadratic space complexity, which is unaffordable in most practical
cases. Moreover, the full-attention mechanism potentially introduces noise from numerous irrelevant
nodes in the full graph.
Qi Liu is the corresponding author.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.03930v1 [cs.LG] 8 Oct 2022
To generalize the Transformer to large graphs, existing Transformer-based methods on graphs [9, 46, 7]
explicitly or implicitly restrict each node's receptive field to reduce the computational and storage
complexity. For example, Graph-Bert [41] restricts the receptive field of each node to the nodes
with top-k intimacy scores such as Personalized PageRank (PPR). GT-Sparse [9] only considers
1-hop neighboring nodes. We argue that existing Graph Transformers have the following deficiencies:
(1) The fixed node sampling strategies in existing Graph Transformers are ignorant of the graph
properties, which may sample uninformative nodes for attention. Therefore, an adaptive node
sampling strategy aware of the graph properties is needed. We conduct case studies in Section 4 to
support our arguments. (2) Though node sampling enables scalability, most node sampling
strategies focus on local neighbors and neglect the long-range dependencies and global contexts of
graphs. Hence, incorporating complementary global information is necessary for Graph Transformers.
To solve challenge (1), we propose Adaptive Node Sampling for Graph Transformer (ANS-GT)
and formulate the optimization of node sampling in Graph Transformer as an adversarial
bandit problem. Specifically, in ANS-GT, we modify the Exp4.P method [3] to adaptively assign weights
to several chosen sampling heuristics (e.g., 1-hop neighbors, 2-hop neighbors, PPR) and combine
these sampling strategies to sample informative nodes. The reward is proportional to the attention
weights and the sampling probabilities of nodes, i.e., the reward to a certain sampling heuristic is
higher if its sampling probability distribution and the node attention weight distribution are more
similar. Then, during the training of the Graph Transformer, the node sampling strategy is updated
simultaneously to sample more informative nodes. With more informative nodes fed into the
self-attention module, ANS-GT achieves better performance.
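To make the weighting idea concrete, the sketch below shows a simplified, Exp4.P-flavoured update in which each sampling heuristic is rewarded according to how much attention mass falls on the nodes it proposes. The function name, constants, and exact reward normalization are illustrative assumptions, not the paper's precise update rule.

```python
import numpy as np

def update_heuristic_weights(weights, heuristic_probs, attention_weights,
                             gamma=0.1, eta=0.05):
    """One adaptive-weight update for combining node-sampling heuristics.

    A simplified, Exp4.P-flavoured sketch (not ANS-GT's exact update):
    heuristics whose sampling distribution aligns with the observed
    attention weights receive larger rewards and thus larger weights.

    weights           : (K,) current weight of each sampling heuristic
    heuristic_probs   : (K, N) sampling distribution of each heuristic over nodes
    attention_weights : (N,) attention mass the Transformer assigned to each node
    """
    K, N = heuristic_probs.shape
    # Mix the heuristics into a single sampling distribution (with exploration).
    mix = weights / weights.sum()
    sampling_dist = (1 - gamma) * mix @ heuristic_probs + gamma / N

    # Reward each heuristic by how much attention fell on the nodes it favours,
    # importance-weighted by the actual sampling probability.
    attn = attention_weights / attention_weights.sum()
    rewards = (heuristic_probs * attn / (sampling_dist + 1e-12)).sum(axis=1)

    # Exponential-weights update, then renormalize.
    new_weights = weights * np.exp(eta * rewards / K)
    return new_weights / new_weights.sum()
```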
To tackle challenge (2), we propose a hierarchical attention scheme for Graph Transformer to
encode both local and global information for each node. The hierarchical attention scheme consists
of fine-grained local attention and coarse-grained global attention. In the local attention, we use the
aforementioned adaptive node sampling strategy to select informative local nodes for attention. As
for global attention, we first use graph coarsening algorithms [26] to pre-process the input graph and
generate a coarse graph. Such algorithms mimic a down-sampling of the original graph by grouping
the nodes into super-nodes while preserving global graph information as much as possible. The
center nodes then interact with the sampled super-nodes. Such coarse-grained global attention helps
each node capture long-range dependencies while reducing the computational complexity of the
vanilla Graph Transformer.
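The following sketch illustrates how a center node's input sequence could combine adaptively sampled local nodes with super-nodes from the coarse graph. Here `local_sampler`, the token counts, and the uniform super-node sampling are illustrative placeholders rather than ANS-GT's actual batching procedure.

```python
import numpy as np

def build_attention_tokens(center, X, X_coarse, local_sampler,
                           n_local=16, n_global=8, seed=0):
    """Assemble the token sequence a center node attends to (illustrative sketch).

    X            : (n, d) node features of the original graph
    X_coarse     : (n', d) super-node features from a pre-computed coarse graph
    local_sampler: any callable returning local node indices for `center`,
                   e.g., a mixture of 1-hop / 2-hop / PPR heuristics
    """
    rng = np.random.default_rng(seed)
    local_idx = local_sampler(center, n_local)                    # fine-grained local tokens
    global_idx = rng.choice(len(X_coarse), n_global, replace=False)  # coarse-grained tokens
    tokens = np.concatenate([X[[center]],          # the center node itself
                             X[local_idx],         # sampled local neighbors
                             X_coarse[global_idx]], axis=0)
    return tokens                                  # shape: (1 + n_local + n_global, d)
```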
We conduct extensive experiments on real-world datasets to show the effectiveness of ANS-GT. Our
method outperforms all the existing Graph Transformer architectures and obtains state-of-the-art
results on 6 benchmark datasets. Detailed analysis and ablation studies further show the superiority
of the adaptive node sampling module and the hierarchical attention scheme.
In summary, we make the following contributions:
• We propose Adaptive Node Sampling for Graph Transformer (ANS-GT), which modifies a
multi-armed bandit algorithm to adaptively sample nodes for attention.
• In the hierarchical attention scheme, we introduce coarse-grained global attention with
graph coarsening, which helps the Graph Transformer capture long-range dependencies while
increasing efficiency.
• We empirically evaluate our method on six benchmark datasets to show its advantage over
existing Graph Transformers and popular GNNs.
2 Related Work
2.1 Transformers for Graphs
Recently, the Transformer [33] has shown its superiority in an increasing number of domains [7, 8, 40],
e.g., Bert [7] in NLP and ViT [8] in CV. Existing works attempting to generalize the Transformer to graph
data mainly focus on two problems: (1) how to design dedicated positional encodings for the nodes;
(2) how to alleviate the quadratic computational complexity of the vanilla Transformer and scale
the Graph Transformer to large graphs. As for the positional encoding, GT [9] first uses Laplacian
eigenvectors to enhance node features. Graph-Bert [41] studies employing the Weisfeiler-Lehman code to
encode structural information. Graphormer [38] utilizes centrality encoding to enhance node features
while incorporating edge information with spatial (SPD-indexed attention bias) and edge encodings.
SAN [21] further replaces the static Laplacian eigenvectors with learnable positional encodings and
designs an attention mechanism that distinguishes local connectivity. For the scalability issue, one
immediate idea is to restrict the number of attending nodes. For example, GAT [34] and GT-Sparse
[9] only consider the 1-hop neighboring nodes; Gophormer [46] uses GraphSAGE [11] sampling
to uniformly sample ego-graphs with a pre-defined maximum depth; Graph-Bert [41] restricts the
receptive field of each node to the nodes with top-k intimacy scores (e.g., Katz and PPR). However,
these fixed node sampling strategies sacrifice the advantage of the Transformer architecture. SAC
[22] tries to use an LSTM edge predictor to predict edges for self-attention operations. However, the
fact that an LSTM can hardly be parallelized reduces the computational efficiency of the Transformer.
2.2 Sparse Transformers
In parallel, many efforts have been devoted to reducing the computational complexity of the Transformer
in the fields of NLP [23] and CV [32]. In the domain of NLP, Longformer [2] applies block-wise
or strided attention patterns that attend only to a fixed set of neighbors. Reformer [19] replaces dot-product attention
with approximate attention computation based on locality-sensitive hashing. Routing Transformer
[30] employs online k-means clustering on the tokens. Linformer [35] demonstrates that the
self-attention mechanism can be approximated by a low-rank matrix and reduces the complexity from
$O(n^2)$ to $O(n)$. As for vision transformers, Swin Transformer [24] proposes a shifted windowing
scheme which brings greater efficiency by limiting self-attention computation to non-overlapping
local windows while also allowing for cross-window connections. Focal Transformer [37] presents a
new mechanism incorporating both fine-grained local and coarse-grained global attention to capture
short- and long-range visual dependencies efficiently. However, these sparse transformers do not take
the unique properties of graphs into consideration.
2.3 Graph Neural Networks and Node Sampling
Graph neural networks (GNNs) [18, 11, 12, 44, 43, 31, 45, 42] follow a message-passing schema
that iteratively updates the representation of a node by aggregating representations from neighboring
nodes. When generalized to large graphs, GNNs face a similar scalability issue.
This is mainly due to the uncontrollable neighborhood expansion in the aggregation stage of GNNs.
Several node sampling algorithms have been proposed to limit the neighborhood expansion; they
mainly fall into node-wise sampling methods and layer-wise sampling methods. In node-wise
sampling, each node samples $k$ neighbors from its sampling distribution, so the total number of
nodes in the $l$-th layer becomes $O(k^l)$. GraphSAGE [11] is one of the most well-known node-wise
sampling methods, with a uniform sampling distribution. GCN-BS [25] introduces a variance-reduced
sampler based on multi-armed bandits. To alleviate the exponential neighbor expansion
$O(k^l)$ of node-wise samplers, layer-wise samplers define the sampling distribution as the probability
of sampling nodes given the set of nodes in the upper layer [4, 16, 49]. From another perspective, these
sampling methods can also be categorized into fixed sampling strategies [11, 4, 49] and adaptive
strategies [25, 39]. However, none of the above sampling methods for GNNs can be directly applied
to Graph Transformers, as Graph Transformers do not follow the message-passing schema.
3 Preliminaries
3.1 Problem Definition
Let $G = (A, X)$ denote an unweighted graph, where $A \in \mathbb{R}^{n \times n}$ is the symmetric adjacency
matrix over $n$ nodes and $X \in \mathbb{R}^{n \times p}$ is the attribute matrix with $p$ attributes per node. The element
$A_{ij}$ of the adjacency matrix equals 1 if there exists an edge between node $v_i$ and node $v_j$, and
$A_{ij} = 0$ otherwise. The label of node $v_i$ is $y_i$. In the node classification problem, the classifier has
knowledge of the labels of a subset of nodes $V_L$. The goal of semi-supervised node classification is
to infer the labels of the nodes in $V \setminus V_L$ by learning a classification function.
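A toy instance of this setup, with arbitrary sizes chosen purely for illustration:

```python
import numpy as np

# Toy instance of the semi-supervised node classification setup above.
n, p = 5, 3
A = np.zeros((n, n), dtype=int)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:   # a small path graph
    A[i, j] = A[j, i] = 1                       # symmetric, unweighted adjacency
X = np.random.rand(n, p)                        # node attribute matrix
y = np.array([0, 0, 1, 1, 1])                   # node labels y_i
V_L = np.array([0, 2])                          # labeled nodes
V_U = np.setdiff1d(np.arange(n), V_L)           # nodes whose labels must be inferred
```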
3.2 Transformer Architecture
The Transformer architecture consists of a series of Transformer layers [33]. Each Transformer layer
has two parts: a multi-head self-attention (MHA) module and a position-wise feed-forward network
(FFN). Let $H = [h_1, \cdots, h_m]^\top \in \mathbb{R}^{m \times d}$ denote the input to the self-attention module, where $d$ is
the hidden dimension, $h_i \in \mathbb{R}^{d \times 1}$ is the hidden representation at position $i$, and $m$ is the number of
positions. The MHA module first projects the input $H$ to query-, key-, and value-spaces, denoted as
$Q, K, V$, using three matrices $W_Q \in \mathbb{R}^{d \times d_K}$, $W_K \in \mathbb{R}^{d \times d_K}$, and $W_V \in \mathbb{R}^{d \times d_V}$:
$$Q = HW_Q, \quad K = HW_K, \quad V = HW_V. \tag{1}$$
Then, in each head $i \in \{1, 2, \ldots, B\}$ ($B$ is the total number of heads), the scaled dot-product
attention mechanism is applied to the corresponding $\{Q_i, K_i, V_i\}$:
$$\mathrm{head}_i = \mathrm{Softmax}\left(\frac{Q_i K_i^\top}{\sqrt{d_K}}\right) V_i. \tag{2}$$
Finally, the outputs from the different heads are concatenated and transformed to obtain the final
output of MHA:
$$\mathrm{MHA}(H) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_B)\, W_O, \tag{3}$$
where $W_O \in \mathbb{R}^{d \times d}$. In this work, we employ $d_K = d_V = d/B$.
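For reference, the PyTorch sketch below implements Eqs. (1)-(3) directly; it is a plain MHA module and omits the node sampling and positional-encoding components discussed elsewhere in the paper.

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Straightforward implementation of Eqs. (1)-(3); a reference sketch only."""

    def __init__(self, d, num_heads):
        super().__init__()
        assert d % num_heads == 0
        self.B, self.d_k = num_heads, d // num_heads   # d_K = d_V = d / B
        self.W_Q = nn.Linear(d, d, bias=False)         # stacks all heads' W_Q
        self.W_K = nn.Linear(d, d, bias=False)
        self.W_V = nn.Linear(d, d, bias=False)
        self.W_O = nn.Linear(d, d, bias=False)

    def forward(self, H):                              # H: (m, d)
        m, d = H.shape
        # Eq. (1): project to query/key/value spaces and split into B heads.
        Q = self.W_Q(H).view(m, self.B, self.d_k).transpose(0, 1)  # (B, m, d_k)
        K = self.W_K(H).view(m, self.B, self.d_k).transpose(0, 1)
        V = self.W_V(H).view(m, self.B, self.d_k).transpose(0, 1)
        # Eq. (2): scaled dot-product attention per head.
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)     # (B, m, m)
        heads = torch.softmax(scores, dim=-1) @ V                  # (B, m, d_k)
        # Eq. (3): concatenate heads and apply the output projection W_O.
        return self.W_O(heads.transpose(0, 1).reshape(m, d))       # (m, d)
```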
3.3 Graph Coarsening
The goal of graph coarsening [29, 26, 17] is to reduce the number of nodes in a graph by clustering
them into super-nodes while preserving the global information of the graph as much as possible.
Given a graph $G = (V, E)$ ($V$ is the node set and $E$ is the edge set), the coarse graph is a smaller
weighted graph $G' = (V', E')$. $G'$ is obtained from the original graph by first computing a partition
$\{C_1, C_2, \cdots, C_{|V'|}\}$ of $V$, i.e., the clusters $C_1, \cdots, C_{|V'|}$ are disjoint and cover all the nodes in $V$.
Each cluster $C_i$ corresponds to a super-node in $G'$. The partition can also be characterized by a matrix
$\hat{P} \in \{0, 1\}^{|V| \times |V'|}$, with $\hat{P}_{ij} = 1$ if and only if node $v_i$ in $G$ belongs to cluster $C_j$. Its normalized
version is defined by $P \triangleq \hat{P} D^{-\frac{1}{2}}$, where $D$ is a $|V'| \times |V'|$ diagonal matrix with $|C_i|$ as its $i$-th
diagonal entry. The feature matrix and weighted adjacency matrix of $G'$ are defined by $X' \triangleq P^\top X$
and $A' \triangleq P^\top A P$. After graph coarsening, the number of nodes/edges in $G'$ is significantly smaller
than that of $G$. The coarsening rate is defined as $c = \frac{|V'|}{|V|}$.
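Given a hard partition produced by any off-the-shelf coarsening algorithm, the coarse-graph quantities above can be computed as in the following NumPy sketch (which assumes every cluster is non-empty):

```python
import numpy as np

def coarsen(A, X, clusters):
    """Construct the coarse graph from a node partition (sketch of Section 3.3).

    A        : (n, n) adjacency matrix of G
    X        : (n, p) feature matrix of G
    clusters : length-n integer array, clusters[i] = super-node containing v_i
    """
    n, n_super = A.shape[0], clusters.max() + 1
    P_hat = np.zeros((n, n_super))
    P_hat[np.arange(n), clusters] = 1.0          # hard assignment matrix P-hat
    D = np.diag(P_hat.sum(axis=0))               # |C_i| on the diagonal
    P = P_hat @ np.linalg.inv(np.sqrt(D))        # P = P-hat D^{-1/2}
    X_coarse = P.T @ X                           # X' = P^T X
    A_coarse = P.T @ A @ P                       # A' = P^T A P
    c = n_super / n                              # coarsening rate
    return A_coarse, X_coarse, c
```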
4 Motivating Observations
To generalize Transformer to large graphs, existing Graph Transformer models typically choose to
sample a batch of nodes for attention. However, real-world graph datasets exhibit different properties,
which makes a fixed node sampling strategy unsuitable for all kinds of graphs. Here, we present a
simple yet intuitive case study to illustrate how the performance of Graph Transformer changes with
different node sampling strategies. The main idea is to use four popular node sampling strategies
for node classification: 1-hop neighbors, 2-hop neighbors, PPR, and KNN. Then, we will check the
performance of Graph Transformer on graphs with different properties. If the performance drops
sharply when the property varies, it will indicate that graphs with different properties may require
different node sampling strategies.
Figure 1: Performance of Graph Transformer using different node sampling mechanisms (1-hop, 2-hop, PPR, and kNN, respectively) on Newman networks.
We conduct experiments on the Newman artificial networks [10] since they enable us to obtain networks
with different properties easily. More detailed settings can be found in Appendix A. Here, we consider
the property of homophily/heterophily as one example. Following previous works [47], the degree of
homophily $\alpha$ is defined as the fraction of edges in a network connecting nodes with the same class
label, i.e., $\alpha \triangleq \frac{|\{(v_i, v_j) \in E : y_i = y_j\}|}{|E|}$. Graphs with $\alpha$ closer to 1 tend
to have more edges connecting nodes within the same class, i.e., strong homophily, whereas networks
with $\alpha$ closer to 0 have more edges connecting nodes in different classes, i.e., strong heterophily.
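For an adjacency matrix and label vector, this quantity can be computed as in the short sketch below (each undirected edge counted once):

```python
import numpy as np

def homophily(A, y):
    """Edge homophily alpha: fraction of edges joining nodes with the same label."""
    src, dst = np.nonzero(np.triu(A, k=1))   # upper triangle: each edge once
    return np.mean(y[src] == y[dst])
```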
As shown in Figure 1, no single sampling strategy dominates the others across the spectrum of
homophily degrees.