
data. Previous work has focused on $L_0$ regularization of attention coefficients to enforce sparsity (Ye
& Ji, 2021) or has co-optimized a link prediction task using attention (Kim & Oh, 2021). Since these
regularization strategies are formulated independently of the primary prediction task, they align the
attention mechanism with some intrinsic property of the input graph without regard for the training
objective.
We take a different approach and consider the question: “What is the importance of a specific edge
to the prediction task?” Our answer comes from the perspective of regularization: we introduce
CAR, a causal attention regularization framework that is broadly suitable for graph attention net-
work architectures (Figure 1). Intuitively, an edge in the input graph is important to a prediction
task if removing it leads to substantial degradation in the prediction performance of the GNN. The
key conceptual advance of this work is to scalably leverage active interventions on node neighbor-
hoods (i.e., deletion of specific edges) to align graph attention training with the causal impact of
these interventions on task performance. Theoretically, our approach is motivated by the invariant
prediction framework for causal inference (Peters et al., 2016; Wu et al., 2022). While some efforts
have previously been made to infuse notions of causality into GNNs, these causal approaches have
been largely limited to using causal effects from pre-trained models as features for a separate model
(Feng et al., 2021; Knyazev et al., 2019) or decoupling causal from non-causal effects (Sui et al.,
2021).
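To make the intervention-based intuition concrete, the sketch below estimates the importance of a single edge as the change in task loss when that edge is deleted. This is a minimal illustration, not CAR's actual regularizer; `causal_edge_effect` and its arguments (`model`, `x`, `edge_index`, `labels`) are hypothetical placeholders for a PyTorch Geometric-style node classifier and its inputs.

```python
import torch
import torch.nn.functional as F

def causal_edge_effect(model, x, edge_index, labels, edge_id):
    """Estimate one edge's task relevance by intervening on the graph.

    Minimal sketch (not CAR's actual regularizer): `model(x, edge_index)`
    is assumed to return per-node logits, PyTorch Geometric-style.
    """
    model.eval()
    with torch.no_grad():
        base_loss = F.cross_entropy(model(x, edge_index), labels)
        # Intervention: delete the chosen edge from the edge list.
        keep = torch.ones(edge_index.size(1), dtype=torch.bool)
        keep[edge_id] = False
        ablated_loss = F.cross_entropy(model(x, edge_index[:, keep]), labels)
    # Positive effect: removing the edge degrades prediction, so the edge
    # is important to the task and should receive high attention.
    return (ablated_loss - base_loss).item()
```

Rather than computing such effects exhaustively for every edge, CAR scalably aligns attention training with this interventional signal, as described in Section 2.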
We apply CAR to three graph attention architectures across eight node classification tasks, finding that it consistently reduces test loss and improves accuracy. CAR fine-tunes graph attention by improving its alignment with task-specific homophily; correspondingly, we find that as graph heterophily increases, CAR's margin of outperformance widens. In contrast, a non-causal approach that directly regularizes with respect to label similarity generalizes less well. On the ogbn-arxiv network, we investigate the citations up- and down-weighted by CAR and find that they broadly group into three intuitive themes. Our causal approach can thus enhance the interpretability of attention coefficients, and we provide a qualitative analysis of this improvement. We also present preliminary results demonstrating the applicability of CAR to graph pruning. Due to the size of industrially relevant graphs, it is common to use GCNs or sampling-based approaches on them. In such settings, attention coefficients learned by CAR on sampled subnetworks could guide rewiring of the full network, improving the results obtained with convolutional techniques (see the sketch below).
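As a hypothetical illustration of this pruning idea, the snippet below keeps only the highest-scoring edges according to per-edge attention coefficients (e.g., those learned by a CAR-regularized model on sampled subnetworks). The function name and the global top-k heuristic are our assumptions, not a procedure from this work.

```python
import torch

def prune_edges_by_attention(edge_index, alpha, keep_fraction=0.5):
    """Keep the globally highest-attention edges; drop the rest.

    Hypothetical sketch: `alpha` holds one (CAR-regularized) attention
    coefficient per edge, aligned with the columns of `edge_index`.
    """
    k = max(1, int(keep_fraction * alpha.numel()))
    keep = torch.topk(alpha, k).indices
    return edge_index[:, keep]
```

A GCN or sampling-based model could then be trained on the pruned or rewired graph.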
2 METHODS
2.1 GRAPH ATTENTION NETWORKS
Attention mechanisms have been effectively used in many domains by enabling models to dynam-
ically attend to the specific parts of an input that are relevant to a prediction task (Chaudhari et al.,
2021). In graph settings, attention mechanisms compute the relevance of edges in the graph for a
prediction task. A neighbor aggregation operator then uses this information to weight the contribu-
tion of each edge (Lee et al., 2019a; Li et al., 2016; Lee et al., 2019b).
The approach for computing attention is similar in many graph attention mechanisms. A graph attention layer takes as input a set of node features $\mathbf{h} = \{h_1, \ldots, h_N\}$, $h_i \in \mathbb{R}^F$, where $N$ is the number of nodes. The graph attention layer uses these node features to compute attention coefficients for each edge:
$$\alpha_{ij} = a(h_i, h_j),$$
where $a : \mathbb{R}^{F} \times \mathbb{R}^{F} \to (0, 1)$ is the attention mechanism function, and the attention coefficient $\alpha_{ij}$ for an edge indicates the importance of node $i$'s input features to node $j$.
For a node $j$, these attention coefficients are then used to compute a linear combination of its neighbors' features:
$$h'_j = \sum_{i \in \mathcal{N}(j)} \alpha_{ij} W h_i, \quad \text{s.t.} \quad \sum_{i \in \mathcal{N}(j)} \alpha_{ij} = 1.$$
For multi-headed attention, each of the $K$ heads first independently calculates its own attention coefficients $\alpha^{(k)}_{ij}$ with its head-specific attention mechanism $a^{(k)}(\cdot, \cdot)$, after which the head-specific outputs are averaged.
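As a concrete reference for the equations above, here is a minimal single-head graph attention layer in PyTorch. It is a generic sketch: the score function (a single linear layer over concatenated endpoint features) stands in for $a(\cdot, \cdot)$ and is our assumption, not the exact form used by GAT, GATv2, or the Graph Transformer (their equations are given in Appendix A.1).

```python
import torch
import torch.nn as nn

class SimpleGraphAttentionLayer(nn.Module):
    """Minimal single-head graph attention layer (illustrative sketch).

    The score function `attn` is a placeholder for a(h_i, h_j); GAT,
    GATv2, and the Graph Transformer each parameterize it differently.
    """

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # shared weights W
        self.attn = nn.Linear(2 * in_dim, 1)             # raw score per edge

    def forward(self, h, edge_index):
        # h: [N, F] node features; edge_index: [2, E], rows = (source i, target j)
        src, dst = edge_index[0], edge_index[1]
        # Unnormalized scores e_ij from the two endpoints' features.
        e = self.attn(torch.cat([h[src], h[dst]], dim=-1)).squeeze(-1)
        # Softmax over each target node's in-neighborhood, so that
        # sum_{i in N(j)} alpha_ij = 1 for every node j.
        e = e - e.max()                                  # numerical stability
        num = torch.exp(e)
        denom = torch.zeros(h.size(0)).scatter_add_(0, dst, num)
        alpha = num / denom[dst]
        # h'_j = sum_{i in N(j)} alpha_ij * W h_i
        out = torch.zeros(h.size(0), self.W.out_features)
        out.index_add_(0, dst, alpha.unsqueeze(-1) * self.W(h)[src])
        return out, alpha                                # alpha: one coefficient per edge
```

For $K$-headed attention, one would instantiate $K$ such layers with independent parameters and average their outputs, matching the description above.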
In this paper, we focus on three widely used graph attention architectures: the original graph atten-
tion network (GAT) (Velickovic et al., 2018), a modified version of this original network (GATv2)
(Brody et al., 2022), and the Graph Transformer network (Shi et al., 2021). The three architectures
and their equations for computing attention are presented in Appendix A.1.