
was the first hybrid architecture, using a stack of message-passing GNN layers before the regular
transformer layers. Finally, [Chen et al., 2022a] have reformulated the self-attention mechanism as a
kernel smoother, as shown below, and incorporated structural information into their structure-aware
transformer (SAT) architecture by extracting a subgraph representation rooted at each node before
attention computation:
\[
\mathrm{Attention}(x_v) = \sum_{u \in V} \frac{\kappa(x_v, x_u)}{\sum_{w \in V} \kappa(x_v, x_w)} \, f(x_u), \quad \forall v \in V, \tag{2}
\]
with $f(x) = W_V x$, and the non-symmetric exponential kernel $\kappa$:
\[
\kappa(x, x') = \exp\!\left( \frac{\langle \phi(x) W_Q, \phi(x') W_K \rangle}{\sqrt{d_K}} \right), \qquad
\phi(x) = \begin{cases} x & \text{in the vanilla transformer,} \\ \mathrm{GNN}_G(x) & \text{in SAT,} \end{cases}
\]
where $\langle \cdot, \cdot \rangle$ is the dot product on $\mathbb{R}^d$ and $\mathrm{GNN}_G(x)$ is an arbitrary GNN model. Most of the
aforementioned works focus on the classification of smaller graphs, such as molecules; yet, recently,
GraphGPS [Rampášek et al., 2022] also targets larger graphs, and the area is increasingly focused on the
development of scalable models; for instance, Nodeformer [Wu et al., 2022] is designed to address
scalability and expressivity in node classification. Altogether, the transformer architecture
opens new and promising avenues for graph representation learning, beyond message-passing GNNs.
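To make the kernel-smoother view of Eq. (2) concrete, the following is a minimal sketch of one such attention layer in plain NumPy, not the authors' implementation: the weight matrices and the structure extractor `phi` are placeholders, with `phi` being the identity in the vanilla transformer and, in SAT, a GNN applied to the subgraph rooted at each node.

```python
# Minimal sketch of the kernel-smoother form of self-attention in Eq. (2).
# W_Q, W_K, W_V and phi are placeholders: phi is the identity in the vanilla
# transformer; in SAT it would be a GNN over the subgraph rooted at each node.
import numpy as np

def kernel_smoother_attention(X, W_Q, W_K, W_V, phi=lambda x: x):
    """X: (n, d) node features; returns the (n, d_V) attended features."""
    d_K = W_K.shape[1]
    H = phi(X)                                                  # phi(x_v) for all nodes
    scores = (H @ W_Q) @ (H @ W_K).T / np.sqrt(d_K)             # <phi(x_v) W_Q, phi(x_u) W_K> / sqrt(d_K)
    kappa = np.exp(scores - scores.max(axis=1, keepdims=True))  # exp. kernel (shifted for numerical stability)
    weights = kappa / kappa.sum(axis=1, keepdims=True)          # kappa(x_v, x_u) / sum_w kappa(x_v, x_w)
    return weights @ (X @ W_V)                                  # sum_u weight(v, u) * f(x_u), with f(x) = W_V x

# Toy usage with random weights: 5 nodes, feature dim 8, key/value dim 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))
out = kernel_smoother_attention(X, W_Q, W_K, W_V)               # vanilla case: phi = identity
```

When `phi` is the identity, the row-normalised exponential kernel reduces exactly to standard softmax attention.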
Neural Networks over DAGs. The additional strong inductive bias present in DAGs has motivated
researchers to formulate special neural architectures tailored to this kind of graphs [Sperduti and
Starita, 1997, Frasconi et al., 1998]. Various types of models have been proposed over the years in
many different application areas. There are works in the context of syntactic graphs [Tai et al., 2015,
Shuai et al., 2016], logical formulas [Crouse et al., 2019], source code representations [Allamanis
et al., 2018] and neural architectures [Zhang et al., 2018, 2019, Knyazev et al., 2021]. We particularly
compare to the most recent proposals, S-VAE and D-VAE [Zhang et al., 2019], and to DAGNN
[Thost and Chen, 2021], whose authors also showed that using attention for neighbor node aggregation is
beneficial in several DAG tasks.
While the models can be formulated in terms of the message-passing framework [Thost and Chen,
2021], their processing is unusual for GNNs: they compute a graph embedding by iterating over the
DAG nodes in an asynchronous way, given by the partial order, and starting from the source nodes
(or the other way around, from the sink nodes). That is, the messages are passed along the partial
order of the DAG and, at any point during computation, capture the information of the subgraph
consisting of all the predecessors of the node currently under consideration. Thereafter, a final graph
representation can be obtained by aggregating the embeddings of all sink nodes. In contrast to
regular message-passing GNNs, these customized works for DAGs usually focus on encoding graphs
(i.e., instead of nodes or edges, which are other common GNN targets) and pass messages over
the entire graph, while regular GNNs consider a fixed, usually rather small, radius around each
node and essentially obtain a neighborhood embedding. Furthermore, the nature of the proposed
architectures shows that recurrent techniques have proven especially useful, in analogy to learning over
sequences. In summary, these DAG models are elegant and effective, yielding state-of-the-art results,
but their asynchronous nature leads to very slow and practically prohibitive runtimes compared to
regular GNNs.
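To illustrate the asynchronous processing described above, here is a minimal sketch of topological-order message passing over a DAG; the concrete update (a tanh of linear maps) and the mean aggregations over predecessors and sinks are placeholders, since models such as D-VAE and DAGNN use their own GRU- and attention-based functions.

```python
# Minimal sketch of asynchronous, topological-order message passing over a DAG.
# The update (tanh of linear maps) and mean aggregations are illustrative only;
# models such as D-VAE and DAGNN use GRU- and attention-based functions instead.
import numpy as np
from graphlib import TopologicalSorter  # Python 3.9+

def dag_encode(node_feats, predecessors, W, U):
    """node_feats: dict node -> (d_in,) array; predecessors: dict node -> list of parents."""
    order = list(TopologicalSorter(predecessors).static_order())  # source nodes come first
    h = {}
    for v in order:
        preds = predecessors.get(v, [])
        # Every predecessor state is already computed, so h[v] summarises
        # the entire subgraph of predecessors of v.
        msg = np.mean([h[u] for u in preds], axis=0) if preds else np.zeros(W.shape[0])
        h[v] = np.tanh(W @ node_feats[v] + U @ msg)
    # Sink nodes (no successors) are aggregated into the graph embedding.
    sinks = [v for v in order if not any(v in p for p in predecessors.values())]
    return np.mean([h[v] for v in sinks], axis=0)

# Toy usage: diamond DAG a -> {b, c} -> d, feature dim 3, hidden dim 4.
rng = np.random.default_rng(0)
feats = {v: rng.normal(size=3) for v in "abcd"}
preds = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
W, U = rng.normal(size=(4, 3)), rng.normal(size=(4, 4))
graph_embedding = dag_encode(feats, preds, W, U)                  # shape (4,)
```

The key point is the sequential dependency: a node cannot be processed before all of its predecessors, which is what makes these models slow compared to the fully parallel, layer-wise updates of regular GNNs.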
Transformers over DAGs. We are aware of only a few proposals for transformers tailored to DAGs.
[Huang et al., 2022] developed a Directed Acyclic Transformer in the context of machine translation
in order to simultaneously capture multiple translations within the decoded DAG, but the model’s
encoder is a vanilla transformer. [Kotnis et al., 2021] proposed BIQE, which leverages the depth of
nodes in DAG paths as positional encodings, with a specific focus on answering complex queries in
knowledge graphs. [Gagrani et al., 2022] proposed Topoformer, which is also an encoder-decoder
model but has been developed for finding optimal topological orders in DAGs. Since that study focuses
entirely on the latter goal (e.g., in terms of training objective and evaluation), it is very specific and
different from our more general study. The DAG encoder itself is also rather different in that it uses a
Laplacian PE, as used with regular graph transformers, and a more complex attention mechanism,
which models not only reachability but also several other relations over the graph. Closest to
our method is PACE [Dong et al., 2022], which similarly focuses on modeling the sequential nature
of DAGs inside a transformer model, independently of a specific application. However,
(1) it applies a rather complex, node-specific positional encoding, whereas our PE only distinguishes