
was the first hybrid architecture, using a stack of message-passing GNN layers before the regular
transformer layers. Finally, [Chen et al., 2022a] have reformulated the self-attention mechanism as a
kernel smoother, as shown below, and incorporated structural information into their structure-aware
transformer (SAT) architecture by extracting a subgraph representation rooted at each node before
attention computation:
\[
\mathrm{Attention}(x_v) = \sum_{u \in V} \frac{\kappa(x_v, x_u)}{\sum_{w \in V} \kappa(x_v, x_w)} \, f(x_u), \quad \forall v \in V, \tag{2}
\]
with $f(x) = W_V x$, and the non-symmetric exponential kernel $\kappa$:
\[
\kappa(x, x') = \exp\!\left( \frac{\langle \phi(x) W_Q, \phi(x') W_K \rangle}{\sqrt{d_K}} \right), \qquad
\phi(x) = \begin{cases} x & \text{in the vanilla transformer,} \\ \mathrm{GNN}_G(x) & \text{in SAT,} \end{cases}
\]
where $\langle \cdot, \cdot \rangle$ is the dot product on $\mathbb{R}^d$ and $\mathrm{GNN}_G(x)$ is an arbitrary GNN model. Most of the
aforementioned works focus on the classification of smaller graphs, such as molecules; yet, recently,
GraphGPS [Rampášek et al., 2022] also targets larger graphs, and the area is increasingly focused on the
development of scalable models; for instance, Nodeformer [Wu et al., 2022] is designed to address
scalability and expressivity in node classification. Altogether, the transformer architecture
opens new and promising avenues for graph representation learning, beyond message-passing GNNs.
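To make the kernel-smoother view of Eq. (2) concrete, the following is a minimal sketch of one such attention layer in plain NumPy, not the authors' implementation: the weight matrices and the structure extractor `phi` are placeholders, with `phi` being the identity in the vanilla transformer and, in SAT, a GNN applied to the subgraph rooted at each node.

```python
# Minimal sketch of the kernel-smoother form of self-attention in Eq. (2).
# W_Q, W_K, W_V and phi are placeholders: phi is the identity in the vanilla
# transformer; in SAT it would be a GNN over the subgraph rooted at each node.
import numpy as np

def kernel_smoother_attention(X, W_Q, W_K, W_V, phi=lambda x: x):
    """X: (n, d) node features; returns the (n, d_V) attended features."""
    d_K = W_K.shape[1]
    H = phi(X)                                                  # phi(x_v) for all nodes
    scores = (H @ W_Q) @ (H @ W_K).T / np.sqrt(d_K)             # <phi(x_v) W_Q, phi(x_u) W_K> / sqrt(d_K)
    kappa = np.exp(scores - scores.max(axis=1, keepdims=True))  # exp. kernel (shifted for numerical stability)
    weights = kappa / kappa.sum(axis=1, keepdims=True)          # kappa(x_v, x_u) / sum_w kappa(x_v, x_w)
    return weights @ (X @ W_V)                                  # sum_u weight(v, u) * f(x_u), with f(x) = W_V x

# Toy usage with random weights: 5 nodes, feature dim 8, key/value dim 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))
out = kernel_smoother_attention(X, W_Q, W_K, W_V)               # vanilla case: phi = identity
```

When `phi` is the identity, the row-normalised exponential kernel reduces exactly to standard softmax attention.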
Neural Networks over DAGs. The additional strong inductive bias present in DAGs has motivated
researchers to formulate special neural architectures tailored to this kind of graphs [Sperduti and
Starita, 1997, Frasconi et al., 1998]. Various types of models have been proposed over the years in
many different application areas. There are works in the context of syntactic graphs [Tai et al., 2015,
Shuai et al., 2016], logical formulas [Crouse et al., 2019], source code representations [Allamanis
et al., 2018] and neural architectures [Zhang et al., 2018, 2019, Knyazev et al., 2021]. We particularly
compare to the most recent proposals, S-VAE and D-VAE [Zhang et al., 2019], and to DAGNN
[Thost and Chen, 2021], whose authors also showed that using attention for neighbor node aggregation is
beneficial in several DAG tasks.
While the models can be formulated in terms of the message-passing framework [Thost and Chen,
2021], their processing is unusual for GNNs: they compute a graph embedding by iterating over the
DAG nodes in an asynchronous way, given by the partial order, and starting from the source nodes
(or the other way around, from the sink nodes). That is, the messages are passed along the partial
order of the DAG and, at any point during computation, capture the information of the subgraph
consisting of all the predecessors of the node currently under consideration. Thereafter, a final graph
representation can be obtained by aggregating the embeddings of all sink nodes. In contrast to
regular message-passing GNNs, these customized works for DAGs usually focus on encoding graphs
(i.e., instead of nodes or edges, which are other common GNN targets) and pass messages over
the entire graph, while regular GNNs consider a fixed, usually rather small, radius around each
node and essentially obtain a neighborhood embedding. Furthermore, the nature of the proposed
architectures shows that recurrent techniques have proven especially useful, in analogy to learning over
sequences. In summary, these DAG models are elegant and effective, yielding state-of-the-art results,
but their asynchronous nature leads to very slow and practically prohibitive runtimes compared to
regular GNNs.
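To illustrate the asynchronous processing described above, here is a minimal sketch of topological-order message passing over a DAG; the concrete update (a tanh of linear maps) and the mean aggregations over predecessors and sinks are placeholders, since models such as D-VAE and DAGNN use their own GRU- and attention-based functions.

```python
# Minimal sketch of asynchronous, topological-order message passing over a DAG.
# The update (tanh of linear maps) and mean aggregations are illustrative only;
# models such as D-VAE and DAGNN use GRU- and attention-based functions instead.
import numpy as np
from graphlib import TopologicalSorter  # Python 3.9+

def dag_encode(node_feats, predecessors, W, U):
    """node_feats: dict node -> (d_in,) array; predecessors: dict node -> list of parents."""
    order = list(TopologicalSorter(predecessors).static_order())  # source nodes come first
    h = {}
    for v in order:
        preds = predecessors.get(v, [])
        # Every predecessor state is already computed, so h[v] summarises
        # the entire subgraph of predecessors of v.
        msg = np.mean([h[u] for u in preds], axis=0) if preds else np.zeros(W.shape[0])
        h[v] = np.tanh(W @ node_feats[v] + U @ msg)
    # Sink nodes (no successors) are aggregated into the graph embedding.
    sinks = [v for v in order if not any(v in p for p in predecessors.values())]
    return np.mean([h[v] for v in sinks], axis=0)

# Toy usage: diamond DAG a -> {b, c} -> d, feature dim 3, hidden dim 4.
rng = np.random.default_rng(0)
feats = {v: rng.normal(size=3) for v in "abcd"}
preds = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
W, U = rng.normal(size=(4, 3)), rng.normal(size=(4, 4))
graph_embedding = dag_encode(feats, preds, W, U)                  # shape (4,)
```

The key point is the sequential dependency: a node cannot be processed before all of its predecessors, which is what makes these models slow compared to the fully parallel, layer-wise updates of regular GNNs.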
Transformers over DAGs. We are aware of only a few proposals for transformers tailored to DAGs.
[Huang et al., 2022] developed a Directed Acyclic Transformer in the context of machine translation
in order to simultaneously capture multiple translations within the decoded DAG, but the model’s
encoder is a vanilla transformer. [Kotnis et al., 2021] proposed BIQE, which leverages the depth of
nodes in DAG paths as positional encodings, with a specific focus on answering complex queries in
knowledge graphs. [Gagrani et al., 2022] proposed Topoformer, which is also an encoder-decoder
model but has been developed for finding optimal topological orders in DAGs. Since that study focuses
entirely on the latter goal (e.g., in terms of training objective and evaluation), it is very specific and
different from our more general study. The DAG encoder itself is also rather different in that it uses a
Laplacian PE, as used with regular graph transformers, and a more complex attention mechanism,
which models not only reachability but also several other relations over the graph. Closest to
our method is PACE [Dong et al., 2022], which similarly focuses on modeling the sequential nature
of DAGs inside a transformer model, independently of a specific application. However,
(1) it applies a rather complex, node-specific positional encoding, whereas our PE only distinguishes