To generalize the Transformer to large graphs, existing Transformer-based methods on graphs [9, 46, 7]
explicitly or implicitly restrict each node's receptive field to reduce the computational and storage
complexity. For example, Graph-Bert [41] restricts the receptive field of each node to the nodes
with the top-k intimacy scores, such as Personalized PageRank (PPR). GT-Sparse [9] only considers
1-hop neighboring nodes. We argue that existing Graph Transformers have the following deficiencies:
(1) The fixed node sampling strategies in existing Graph Transformers are agnostic to graph
properties and may therefore sample uninformative nodes for attention. An adaptive node
sampling strategy that is aware of graph properties is thus needed; we present case studies in Section 4 to
support this argument. (2) Although node sampling enables scalability, most node sampling
strategies focus on local neighbors and neglect the long-range dependencies and global context of
graphs. Hence, incorporating complementary global information is necessary for the Graph Transformer.
To solve challenge (1), we propose Adaptive Node Sampling for Graph Transformer (ANS-GT)
and formulate the optimization of node sampling in the Graph Transformer as an adversarial
bandit problem. Specifically, ANS-GT modifies the Exp4.P method [3] to adaptively assign weights
to several chosen sampling heuristics (e.g., 1-hop neighbors, 2-hop neighbors, PPR) and combines
these sampling strategies to sample informative nodes. The reward is proportional to the attention
weights and the sampling probabilities of nodes, i.e., the reward for a sampling heuristic is
higher when its sampling probability distribution is more similar to the distribution of node attention
weights. During training of the Graph Transformer, the node sampling strategy is then updated
simultaneously to sample more informative nodes. With more informative nodes fed into the
self-attention module, ANS-GT achieves better performance.
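To make the adversarial-bandit view concrete, the sketch below illustrates a simplified, Exp4.P-inspired exponential-weights update over a few sampling heuristics, where each heuristic is rewarded according to how well its node-sampling distribution matches the attention weights produced by the model. The class name, reward definition, and hyperparameters are illustrative assumptions rather than the paper's exact algorithm (which additionally uses importance-weighted reward estimates and confidence bounds).

```python
import numpy as np

class AdaptiveSamplerWeights:
    """A minimal sketch, assuming a simplified exponential-weights update."""

    def __init__(self, num_heuristics, eta=0.1, gamma=0.1):
        self.w = np.ones(num_heuristics)   # one weight per sampling heuristic
        self.eta = eta                      # learning rate for the weight update (assumed)
        self.gamma = gamma                  # uniform-exploration mixing coefficient (assumed)

    def mixture(self, heuristic_probs):
        """Combine per-heuristic node distributions (num_heuristics x num_nodes)
        into a single sampling distribution over nodes."""
        p = self.w / self.w.sum()
        mix = p @ heuristic_probs
        # mix in uniform exploration so every node stays reachable
        return (1 - self.gamma) * mix + self.gamma / heuristic_probs.shape[1]

    def update(self, heuristic_probs, attention_weights):
        """Reward each heuristic by the overlap between its sampling distribution
        and the node attention weights from the Graph Transformer."""
        attn = attention_weights / attention_weights.sum()
        rewards = heuristic_probs @ attn          # higher if the two distributions agree
        self.w *= np.exp(self.eta * rewards)      # exponential-weights update


# Toy usage: 3 heuristics (1-hop, 2-hop, PPR) over 5 candidate nodes.
heuristics = np.array([
    [0.4, 0.4, 0.2, 0.0, 0.0],   # 1-hop neighbors
    [0.1, 0.2, 0.3, 0.2, 0.2],   # 2-hop neighbors
    [0.5, 0.1, 0.1, 0.2, 0.1],   # PPR scores
])
sampler = AdaptiveSamplerWeights(num_heuristics=3)
attn = np.array([0.5, 0.2, 0.1, 0.1, 0.1])       # attention weights from one training step
sampler.update(heuristics, attn)
print(sampler.mixture(heuristics))                # updated node-sampling distribution
```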
To tackle challenge (2), we propose a hierarchical attention scheme for the Graph Transformer that
encodes both local and global information for each node. The hierarchical attention scheme consists
of fine-grained local attention and coarse-grained global attention. In the local attention, we use the
aforementioned adaptive node sampling strategy to select informative local nodes for attention. For
the global attention, we first use a graph coarsening algorithm [26] to pre-process the input graph and
generate a coarse graph. Such algorithms mimic a down-sampling of the original graph by grouping
nodes into super-nodes while preserving global graph information as much as possible. The
center nodes then attend to the sampled super-nodes. This coarse-grained global attention helps
each node capture long-distance dependencies while reducing the computational complexity of the
vanilla Graph Transformer.
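The following sketch shows one way a center node could jointly attend over sampled local nodes and coarsened super-nodes in a single attention step. The function names, the single-head unbatched formulation, and the random toy features are simplifying assumptions for illustration; the paper's attention module uses learned projections, multiple heads, and super-node features produced by the coarsening algorithm.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hierarchical_attention(center, local_nodes, super_nodes, Wq, Wk, Wv):
    """A minimal sketch: center (d,), local_nodes (n_l, d), super_nodes (n_g, d)."""
    # Coarse-grained super-node features complement the fine-grained local ones.
    context = np.vstack([local_nodes, super_nodes])       # (n_l + n_g, d)
    q = center @ Wq                                        # query from the center node
    K, V = context @ Wk, context @ Wv
    scores = K @ q / np.sqrt(q.shape[0])                   # scaled dot-product attention
    return softmax(scores) @ V                             # updated center representation

# Toy usage with random features: 4 sampled local nodes, 2 super-nodes, d = 8.
rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = hierarchical_attention(rng.normal(size=d),
                             rng.normal(size=(4, d)),
                             rng.normal(size=(2, d)),
                             Wq, Wk, Wv)
print(out.shape)   # (8,)
```

Because the super-nodes summarize whole regions of the original graph, the number of keys grows with the (much smaller) coarse graph rather than with the full node set, which is what keeps the global attention inexpensive.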
We conduct extensive experiments on real-world datasets to show the effectiveness of ANS-GT. Our
method outperforms all existing Graph Transformer architectures and obtains state-of-the-art
results on six benchmark datasets. Detailed analysis and ablation studies further demonstrate the superiority
of the adaptive node sampling module and the hierarchical attention scheme.
In summary, we make the following contributions:
• We propose Adaptive Node Sampling for Graph Transformer (ANS-GT), which modifies a multi-armed bandit algorithm to adaptively sample nodes for attention.
• In the hierarchical attention scheme, we introduce coarse-grained global attention with graph coarsening, which helps the Graph Transformer capture long-range dependencies while improving efficiency.
• We empirically evaluate our method on six benchmark datasets to show its advantages over existing Graph Transformers and popular GNNs.
2 Related Work
2.1 Transformers for Graphs
Recently, the Transformer [33] has shown its superiority in an increasing number of domains [7, 8, 40],
e.g., BERT [7] in NLP and ViT [8] in CV. Existing works attempting to generalize the Transformer to graph
data mainly focus on two problems: (1) How to design dedicated positional encoding for the nodes;
(2) How to alleviate the quadratic computational complexity of the vanilla Transformer and scale
the Graph Transformer to large graphs. For positional encoding, GT [9] first uses Laplacian
eigenvectors to enhance node features. Graph-Bert [41] employs the Weisfeiler-Lehman code to
encode structural information. Graphormer [38] utilizes centrality encoding to enhance node features