Transformers over Directed Acyclic Graphs
Yuankai Luo
Beihang University
luoyk@buaa.edu.cn
Veronika Thost
MIT-IBM Watson AI Lab, IBM Research
veronika.thost@ibm.com
Lei Shi
Beihang University
leishi@buaa.edu.cn
Abstract
Transformer models have recently gained popularity in graph representation learn-
ing as they have the potential to learn complex relationships beyond the ones
captured by regular graph neural networks. The main research question is how to
inject the structural bias of graphs into the transformer architecture, and several
proposals have been made for undirected molecular graphs and, recently, also for
larger network graphs. In this paper, we study transformers over directed acyclic
graphs (DAGs) and propose architecture adaptations tailored to DAGs: (1) An
attention mechanism that is considerably more efficient than the regular quadratic
complexity of transformers and at the same time faithfully captures the DAG struc-
ture, and (2) a positional encoding of the DAG’s partial order, complementing the
former. We rigorously evaluate our approach over various types of tasks, ranging
from classifying source code graphs to nodes in citation networks, and show that
it is effective in two important aspects: in making graph transformers generally
outperform graph neural networks tailored to DAGs and in improving SOTA graph
transformer performance in terms of both quality and efficiency.
1 Introduction
Graph-structured data is ubiquitous in various disciplines [Gilmer et al., 2017, Zitnik et al., 2018,
Sanchez-Gonzalez et al., 2020], and hence graph representation learning has the potential for huge impact. There are various types of graph neural networks (GNNs), the majority of which are
based on a so-called message-passing scheme where node representations are computed iteratively
by aggregating the embeddings of neighbor nodes [Gilmer et al., 2017]. Yet, this mechanism in its
basic form has limited expressivity [Xu et al., 2018] and research is focusing on extensions.
Transformer models have recently gained popularity in graph learning as they have the potential to
learn complex relationships beyond the ones captured by regular GNNs, in a different way [Dwivedi
and Bresson, 2020, Kreuzer et al., 2021]. Technically, they can be considered as graph neural
networks operating on fully-connected computational graphs, decoupled from the input graphs. The
main research question in this context is how to inject the structural bias of the given input graphs
(i.e., which nodes are actually connected) into the transformer architecture by adapting their attention
and positional encoding appropriately. Several promising proposals have been made to encode
undirected molecular graphs [Ying et al., 2021, Ross et al., 2022] and recent works take into account
the scalability challenge [Dwivedi et al., 2022], also over larger network graphs [Rampášek et al.,
2022, Chen et al., 2022b].
Corresponding author.
37th Conference on Neural Information Processing Systems (NeurIPS 2023).
[Figure 1: left, the attention mask over node features x_1, ..., x_12, distinguishing query–key pairs for which attention is and is not calculated; right, the attention module with Linear projections Q, K, V, MatMul, Scale, and SoftMax, plus the DAG positional encoding added to the node features.]
Figure 1: Overview of our DAG attention. A mask matrix restricts the receptive field of the node under
consideration to nodes that are directly reachable in the DAG. Our positional encoding (right) additionally captures a node's position as its depth in the DAG.
We focus on directed acyclic graphs (DAGs), which are of special interest across various domains;
examples include parsing results of source code [Allamanis et al., 2018], logical formulas [Crouse
et al., 2019], and conversational emotion recognition [Shen et al., 2021], as well as probabilistic
graphical models and neural architectures [Zhang et al., 2018, 2019, Knyazev et al., 2021]. In a DAG,
the edges define a partial order over the nodes. This partial order represents an additional strong
inductive bias, which offers itself to be integrated into the models.
In fact, various kinds of neural networks tailored to DAG structures have been proposed over the
years; from early descriptions of recursive neural networks over DAGs [Sperduti and Starita, 1997,
Frasconi et al., 1998], which have pointed out a similarity to sequence learning, to more recent
works, such as DAG-RNN [Shuai et al., 2016], DAG-LSTM [Crouse et al., 2019], D-VAE [Zhang
et al., 2019], and DAGNN [Thost and Chen, 2021]. The latter models focus on encoding the entire
DAG; in a nutshell, they compute the embedding in a message-passing fashion by iterating over the
DAG nodes in an asynchronous way, given by the partial order, and thereafter aggregating the final
node representations into one for the DAG. This procedure yields state-of-the-art performance in
terms of quality, but its asynchronous nature leads to comparatively (to regular GNNs) very slow
runtime performance. The more parallel computation of transformers would seem to make them
more suitable models and [Dong et al., 2022] have recently proposed one such model, PACE, which
shows convincing performance in terms of both quality and efficiency.
In this paper, we focus on transformers over DAGs more generally, motivated by the highly flexible,
parallel, and expressive nature of their computation. In particular, their success in sequence learning
opens up the question of how they can be tailored to DAGs, which are essentially sequential graphs.
Based on the above-mentioned insights in DAG learning, we propose a straightforward and efficient
DAG attention framework, which effectively biases any transformer towards DAGs.
As outlined in Figure 1, we (1) adapt the attention mechanism by restricting the receptive field of
each node to its predecessor and successor nodes, so that it faithfully captures the DAG structure in terms of its reachability relation and at the same time becomes considerably more efficient than the
regular quadratic complexity; and (2) employ the positional encoding in a complementary way, to
further bias the computation towards the DAG topology, by explicitly encoding the node depth.
We show that the attention is effectively not restricted, even if it may technically seem so. Moreover, we draw a connection to random walk theory.
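As a concrete illustration of this design (not the authors' reference implementation), the following minimal PyTorch sketch applies a dense reachability mask and an additive depth encoding on top of standard scaled dot-product attention; the names `reach`, `depth`, and `depth_emb` are hypothetical inputs assumed to be precomputed.

```python
import math
import torch
import torch.nn.functional as F

def dag_attention(X, W_q, W_k, W_v, reach, depth, depth_emb):
    """Illustrative DAG attention (dense sketch, not the reference implementation).

    X:         (n, d) node features
    W_q/k/v:   (d, d_k) projection matrices
    reach:     (n, n) boolean mask; reach[v, u] is True iff u is a predecessor
               or successor of v (or u == v), i.e. the pair lies on a directed path
    depth:     (n,) integer node depths in the DAG
    depth_emb: embedding table (e.g. torch.nn.Embedding) mapping depths to R^d
    """
    H = X + depth_emb(depth)                             # DAG positional encoding
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    scores = (Q @ K.T) / math.sqrt(K.size(-1))
    scores = scores.masked_fill(~reach, float("-inf"))   # attend only to reachable nodes
    return F.softmax(scores, dim=-1) @ V
```

The dense mask is shown only for clarity; computing scores only for the reachable pairs, which are typically far fewer than n² in a DAG, is what yields the efficiency gain over full attention.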
We rigorously evaluate our proposal in ablation studies and show that it successfully improves
different kinds of baseline transformers, from vanilla transformers [Vaswani et al., 2017] to state-
of-the-art graph transformers [Wu et al., 2021, Chen et al., 2022a, Rampášek et al., 2022, Wu
et al., 2022], over various types of DAG data. Our experiments range from classifying source
code graphs to nodes in citation networks, and even go far beyond related works’ problem scope.
Most importantly, our proposal is proven effective in two important aspects: in making graph
transformers generally outperform graph neural networks tailored to DAGs and in improving SOTA
graph transformer performance in terms of both quality and efficiency.
Finally, our DAG attention framework can be implemented in two possible and rather straightfor-
ward ways, either on top of existing transformer models, or based on message-passing GNNs. Our
implementation is available at https://github.com/LUOyk1999/DAGformer.
2 Background and Related Works
Directed Acyclic Graphs. We refer to a graph as a tuple $G = (V, E, X)$, with node set $V$, edge set $E \subseteq V \times V$, and node features $X \in \mathbb{R}^{n \times d}$, where each row represents the feature vector of one node, $n$ is the number of nodes, and $d$ is the feature dimension; the features of node $v$ are denoted by $x_v \in \mathbb{R}^d$.

A directed acyclic graph (DAG) is a directed graph without directed cycles. For a DAG $G$, we can define a unique strong partial order $\leq$ on the node set $V$, such that, for all pairs of nodes $u, v \in V$, $u \leq v$ if and only if there is a directed path from $u$ to $v$. We define the reachability relation over a DAG based on that partial order. That is, $u \leq v$ if and only if $v$ is reachable from $u$; further, if $u \leq v$, then $u$ is called a predecessor of $v$, and $v$ is a successor of $u$. All nodes without predecessors are source nodes, and all nodes without successors are sink nodes. We further define a restricted form of reachability $\leq_k$ with $u \leq_k v$ if and only if there is a directed path of length at most $k$ from $u$ to $v$.

The depth of a node $v \in V$ is defined below. The depth of a DAG is the maximum depth over its nodes.
$$\mathrm{depth}(v) = \begin{cases} 0 & \text{if } v \text{ is a source node} \\ 1 + \max_{(u,v) \in E} \mathrm{depth}(u) & \text{otherwise} \end{cases}$$
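Node depths can be computed in $O(|V| + |E|)$ time with a single pass in topological order (Kahn's algorithm); a small self-contained sketch of this standard procedure:

```python
from collections import defaultdict, deque

def node_depths(num_nodes, edges):
    """Compute depth(v) for all nodes of a DAG given as a list of edges (u, v)."""
    succ = defaultdict(list)
    indegree = [0] * num_nodes
    for u, v in edges:
        succ[u].append(v)
        indegree[v] += 1

    depth = [0] * num_nodes
    queue = deque(v for v in range(num_nodes) if indegree[v] == 0)  # source nodes
    while queue:
        u = queue.popleft()
        for v in succ[u]:
            depth[v] = max(depth[v], depth[u] + 1)  # 1 + max over predecessors
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return depth

# Example: a diamond-shaped DAG 0 -> {1, 2} -> 3 yields depths [0, 1, 1, 2].
print(node_depths(4, [(0, 1), (0, 2), (1, 3), (2, 3)]))
```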
Message-Passing Graph Neural Networks. Most modern graph neural networks (GNNs) are or
can be formulated in terms of the message-passing architecture, a framework proposed in [Gilmer
et al., 2017]. In that architecture, node representations are computed iteratively by aggregating the
embeddings of their neighbor nodes (i.e., the messages), and a final graph representation can be
obtained by aggregating the node embeddings. Yet, in its basic form, this mechanism has limited
expressivity [Xu et al., 2018] and graph neural networks remain an active research area. Notable
works we compare to include the graph convolutional network (GCN) [Kipf and Welling, 2017],
a very early, basic architecture; the graph attention network (GAT) [Veličković et al., 2018],
aggregating the neighbor node embeddings using attention; the graph isomorphism network (GIN)
[Xu et al., 2018], which integrates extra MLP layers into message passing for increased expressivity;
and the principal neighbourhood aggregation (PNA) model [Corso et al., 2020], a recent proposal
focusing on adapting the aggregation mechanism to the node under consideration (e.g., scaling based
on its degree). For a broader overview of the field, we refer the reader to [Wu et al., 2020].
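For reference, a single synchronous message-passing layer can be sketched as below; this is a generic mean-aggregation illustration rather than any specific model from the works above, and `edge_index` is assumed to list each edge in both directions for an undirected graph.

```python
import torch

def message_passing_layer(X, edge_index, W_self, W_neigh):
    """One generic message-passing layer with mean aggregation.

    X:          (n, d) node features
    edge_index: (2, E) long tensor of (source, target) indices
    """
    src, dst = edge_index
    agg = torch.zeros_like(X)
    agg.index_add_(0, dst, X[src])                            # sum neighbor embeddings
    deg = torch.zeros(X.size(0), 1).index_add_(0, dst, torch.ones(dst.size(0), 1))
    agg = agg / deg.clamp(min=1)                              # mean over neighbors
    return torch.relu(X @ W_self + agg @ W_neigh)             # combine and update
```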
Transformers on Graphs. Transformer models [Vaswani et al., 2017] have gained popularity in
graph learning as they have the potential to learn complex relationships beyond the ones captured
by regular GNNs and in a different way. The architecture is composed of two main parts: a self-
attention module and a feed-forward neural network, each followed by a residual connection with
a normalization layer. Formally, the self-attention projects the input node features
X
using three
matrices
WQRd×dK
,
WKRd×dK
and
WVRd×dK
to the corresponding representations for
query (Q), key (K), and value (V), and is described as follows :
Q=XWQ,K=XWK,V=XWV,
Attention (X) = softmax QKT
dK!V.(1)
Over graphs, we focus on computing node instead of token embeddings (recall Figure 1). Technically,
transformers can be considered as message-passing GNNs operating on fully-connected computational
graphs, decoupled from the input graphs. The main research question in the context of graphs is
how to inject the structural bias of the given input graphs by adapting their attention and by adding
extensions to the original features, via so-called positional encodings (PEs). Graph Transformer
[Dwivedi and Bresson, 2020] represents an early work using Laplacian eigenvectors as positional
encodings, and various extensions and other models have been proposed since then [Min et al.,
2022]. For instance, [Mialon et al., 2021] proposed a relative PE [Shaw et al., 2018] by means of
kernels on graphs to bias the self-attention calculation. Notably, GraphTrans [Wu et al., 2021]
was the first hybrid architecture, using a stack of message-passing GNN layers before the regular
transformer layers. Finally, [Chen et al., 2022a] have reformulated the self-attention mechanism as a
kernel smoother as shown below and incorporated structure information into their structure-aware
transformer (SAT) architecture by extracting a subgraph representation rooted at each node before
attention computation:
$$\mathrm{Attention}(x_v) = \sum_{u \in V} \frac{\kappa(x_v, x_u)}{\sum_{w \in V} \kappa(x_v, x_w)} \, f(x_u), \quad \forall v \in V, \qquad (2)$$
with $f(x) = W_V x$ and the non-symmetric exponential kernel $\kappa$:
$$\kappa(x, x') = \exp\left(\frac{\langle \varphi(x) W_Q, \, \varphi(x') W_K \rangle}{\sqrt{d_K}}\right), \qquad \varphi(x) = \begin{cases} x & \text{in the vanilla transformer} \\ \mathrm{GNN}_G(x) & \text{in SAT} \end{cases}$$
where $\langle \cdot, \cdot \rangle$ is the dot product on $\mathbb{R}^{d}$ and $\mathrm{GNN}_G(x)$ is an arbitrary GNN model. Most of the
aforementioned works focus on the classification of smaller graphs, such as molecules; yet, recently,
GraphGPS [Rampášek et al., 2022] also considers larger graphs, and the area is focusing on the
development of scalable models; for instance, Nodeformer [Wu et al., 2022] is designed to address
the issue of scalability and expressivity in node classification. Altogether, the transformer architecture
opens new and promising avenues for graph representation learning, beyond message-passing GNNs.
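To make the kernel-smoother view of Eq. (2) concrete, the sketch below takes the structure extractor φ as an argument; passing the identity recovers vanilla self-attention, while passing a GNN-based map corresponds in spirit to SAT (which, more precisely, encodes a subgraph rooted at each node). This is an illustrative simplification, not SAT's actual implementation.

```python
import math
import torch

def kernel_smoother_attention(X, W_q, W_k, W_v, phi):
    """Attention written as the kernel smoother of Eq. (2).

    phi is the structure extractor: the identity for the vanilla transformer,
    or a GNN-based map for structure-aware variants.
    """
    Z = phi(X)
    logits = (Z @ W_q) @ (Z @ W_k).T / math.sqrt(W_k.size(-1))
    weights = torch.softmax(logits, dim=-1)   # row v: kappa(x_v, .) normalized over V
    return weights @ (X @ W_v)                # sum_u weights[v, u] * f(x_u), with f(x) = W_V x

# phi = lambda x: x recovers vanilla self-attention.
```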
Neural Networks over DAGs. The additional strong inductive bias present in DAGs has motivated
researchers to formulate special neural architectures tailored to this kind of graphs [Sperduti and
Starita, 1997, Frasconi et al., 1998]. Various types of models have been proposed over the years in
many different application areas. There are works in the context of syntactic graphs [Tai et al., 2015, Shuai et al., 2016], logical formulas [Crouse et al., 2019], source code representations [Allamanis et al., 2018], and neural architectures [Zhang et al., 2018, 2019, Knyazev et al., 2021]. We particularly
compare to the most recent proposals S-VAE and D-VAE [Zhang et al., 2019]; and to DAGNN from
[Thost and Chen, 2021], who also showed that using attention for neighbor node aggregation is
beneficial in several DAG tasks.
While the models can be formulated in terms of the message-passing framework [Thost and Chen,
2021], their processing is special for GNNs: they compute a graph embedding by iterating over the
DAG nodes in an asynchronous way, given by the partial order, and starting from the source nodes
(or the other way around, from the sink nodes). That is, the messages are passed along the partial
order of the DAG and, at any point during computation, capture the information of the subgraph
consisting of all the predecessors of the node currently under consideration. Thereafter, a final graph
representation can be obtained by aggregating the embeddings of all sink nodes. In contrast to
regular message-passing GNNs, these customized works for DAGs usually focus on encoding graphs
(i.e., instead of nodes or edges, which are other common GNN targets) and pass messages over
the entire graph, while regular GNNs consider a fixed radius - usually rather small - around each
node and essentially obtain a neighborhood embedding. Furthermore, the nature of the proposed
architectures shows that recurrent techniques have proven especially useful, in analogy to learning over sequences. In summary, these DAG models are elegant and effective, yielding state-of-the-art results, but their asynchronous nature leads to very slow and practically prohibitive runtime performance compared to
regular GNNs.
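This processing pattern can be summarized schematically as follows; `combine` and `readout` are placeholders for model-specific update and pooling functions, so the sketch describes the general pattern rather than any particular model.

```python
def encode_dag(features, topo_order, predecessors, sinks, combine, readout):
    """Schematic asynchronous DAG encoding (pattern only, not a specific model).

    Nodes are processed in topological order; each node's update sees the already
    final embeddings of all its predecessors, and the graph embedding is read out
    from the sink nodes.
    """
    h = {}
    for v in topo_order:                          # respects the partial order
        msgs = [h[u] for u in predecessors[v]]    # predecessor embeddings are final
        h[v] = combine(features[v], msgs)         # model-specific update (e.g. GRU cell, attention)
    return readout([h[v] for v in sinks])         # aggregate the sink-node embeddings
```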
Transformers over DAGs. We are aware of only a few proposals for transformers tailored to DAGs. [Huang et al., 2022] developed a Directed Acyclic Transformer in the context of machine translation
in order to simultaneously capture multiple translations within the decoded DAG, but the model’s
encoder is a vanilla transformer. [Kotnis et al., 2021] proposed BIQE, which leverages the depth of
nodes in DAG paths as positional encodings, with a specific focus on answering complex queries in
knowledge graphs. [Gagrani et al., 2022] proposed Topoformer, which is also an encoder-decoder
model but has been developed for finding optimal topological orders in DAGs. Since the study entirely
focuses on the latter goal (e.g., in terms of training objective and evaluation), it is very specific and
different from our more general study. The DAG encoder itself is also rather different in that it uses a
Laplacian PE, as it is used with regular graph transformers, and a more complex attention mechanism, which models not only the reachability but also several other relations over the graph. Closest to
our method is PACE [Dong et al., 2022], which similarly focuses on modeling the sequential nature
of DAGs inside a transformer model, independently of a specific application. However, (1) it applies a rather complex, node-specific positional encoding, whereas our PE only distinguishes nodes by their depth in the DAG.