
standard deviation and I is the identity matrix. Let E be the set of edges, which consists of the pairs (i, j) such that a_{ij} = 1. Consider E ∈ R^{|E|×h} to be the edge feature matrix such that each row E_{(i,j)} is an independent h-dimensional Gaussian random vector with E_{(i,j)} ∼ N(ν, ζI) if (i, j) is an intra-edge, i.e., i, j ∈ C_0 or i, j ∈ C_1, and E_{(i,j)} ∼ N(−ν, ζI) if (i, j) is an inter-edge, i.e., i ∈ C_0, j ∈ C_1 or i ∈ C_1, j ∈ C_0. Here ν ∈ R^h is the mean and ζ ≥ 0 is the standard deviation.
Denote by CSBM(n, p, q, µ, ν, σ, ζ) the coupling of a stochastic block model with the Gaussian mixture models for the nodes and the edges, with means µ, ν and standard deviations σ, ζ, respectively, as described above. We denote a sample by (A, X, E) ∼ CSBM(n, p, q, µ, ν, σ, ζ).
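As a concrete illustration only (not part of the formal definition), the following NumPy sketch shows one way a sample (A, X, E) ∼ CSBM(n, p, q, µ, ν, σ, ζ) could be generated. The function name, the even split of the nodes into C_0 and C_1, and the convention that node features of C_0 and C_1 are centered at −µ and +µ are assumptions made for the sketch.

```python
import numpy as np

def sample_csbm(n, p, q, mu, nu, sigma, zeta, rng=None):
    """Sketch of a sampler for (A, X, E) ~ CSBM(n, p, q, mu, nu, sigma, zeta).

    Illustrative assumptions: the nodes are split evenly into C0 and C1, and
    the node features of C0 / C1 are centered at -mu / +mu respectively.
    """
    rng = np.random.default_rng() if rng is None else rng
    d, h = len(mu), len(nu)
    labels = np.repeat([0, 1], [n // 2, n - n // 2])        # class memberships C0, C1

    # Adjacency: intra-class edges with probability p, inter-class edges with probability q.
    same = labels[:, None] == labels[None, :]
    prob = np.where(same, p, q)
    upper = np.triu(rng.random((n, n)) < prob, k=1)
    A = (upper | upper.T).astype(int)

    # Node features: Gaussian with mean +/-mu and standard deviation sigma, one row per node.
    node_sign = np.where(labels == 0, -1.0, 1.0)
    X = node_sign[:, None] * mu + sigma * rng.standard_normal((n, d))

    # Edge features: mean +nu on intra-edges, -nu on inter-edges, standard deviation zeta,
    # one h-dimensional row per pair (i, j) with a_ij = 1.
    edges = np.argwhere(A == 1)
    edge_sign = np.where(labels[edges[:, 0]] == labels[edges[:, 1]], 1.0, -1.0)
    E = edge_sign[:, None] * nu + zeta * rng.standard_normal((len(edges), h))
    return A, X, E, edges
```

The list of edges is returned alongside E so that each row of E can be matched to its pair (i, j).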
3.2 Assumptions
We now state two standard assumptions on the CSBM that we will use in our analysis. The first assumption is that p, q ≥ Ω(log^2 n / n); it guarantees that the expected degrees of the graph are Ω(log^2 n) and that the degrees concentrate around their expectations. The second assumption is that the standard deviation ζ of the edge features is constant. This assumption is without loss of generality, since all that really matters is the ratio of the distance between the means of the edge features to their standard deviation. As long as we allow the distance between the means to grow while ζ is fixed, the results are not restricted and the analysis is simplified.
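As a rough numerical illustration of the first assumption (the leading constant below is arbitrary and chosen only for this example):

```python
import numpy as np

# With p of order log^2(n)/n, the expected degree n*p is of order log^2(n).
n = 10_000
p = 2 * np.log(n) ** 2 / n   # hypothetical choice with leading constant 2
print(n * p)                 # about 2 * log^2(10_000), roughly 170
```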
3.3 Graph attention
The graph attention convolution is defined as x̂_i = Σ_{j∈[n]} Ã_{ij} γ_{ij} x_j for all i ∈ [n], where γ_{ij} is the attention coefficient of the edge (i, j). We focus on a single graph attention layer since this architecture is enough for the simple CSBM that we consider.
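In matrix form, a single such layer can be sketched as follows, taking Ã and the attention coefficients γ (defined next) as given:

```python
import numpy as np

def graph_attention_convolution(A_tilde, gamma, X):
    """x_hat_i = sum_{j in [n]} A_tilde[i, j] * gamma[i, j] * x_j for all i in [n].

    A_tilde: (n, n) matrix corresponding to Ã in the text (taken as given here),
    gamma:   (n, n) attention coefficients,
    X:       (n, d) node feature matrix.
    """
    return (A_tilde * gamma) @ X   # elementwise product, then applied to the node features
```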
There are many ways to set the attention coefficients γ. We discuss the setting used in this paper and how it relates to the original definition in [32] and newer ones [8]. We define the attention function Ψ(E_{(i,j)}), which takes as input the features E_{(i,j)} of the edge (i, j) and outputs a scalar value. The function Ψ is often parameterized by learnable variables, and it is used to define the attention coefficients

γ_{ij} := exp(Ψ(E_{(i,j)})) / Σ_{ℓ∈N_i} exp(Ψ(E_{(i,ℓ)})),

where N_i is the set of neighbors of node i.
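The following is a minimal sketch of how these coefficients could be computed, assuming the scalar scores Ψ(E_{(i,j)}) have already been arranged into an n × n array and that every node has at least one neighbor (which holds with high probability under the degree assumption above):

```python
import numpy as np

def attention_coefficients(A, psi_scores):
    """gamma[i, j] = exp(Psi(E_(i,j))) / sum over l in N_i of exp(Psi(E_(i,l))).

    psi_scores[i, j] holds Psi(E_(i,j)) for every edge (i, j) with A[i, j] = 1;
    entries on non-edges are ignored.  Assumes every node has a neighbor.
    """
    masked = np.where(A == 1, psi_scores, -np.inf)        # restrict the softmax to N_i
    masked = masked - masked.max(axis=1, keepdims=True)   # numerical stability
    exps = np.exp(masked)
    return exps / exps.sum(axis=1, keepdims=True)
```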
In the original paper [32] the function Ψ is a linear function of the two-dimensional vector [w^T x_i, w^T x_j] passed through a LeakyReLU, where the coefficients of the linear function are learnable parameters; w is learnable as well and is shared with the parameters outside the attention mechanism. In this paper we consider independent edge features as input to the attention mechanism. Although edge features are mentioned in the original paper [32] as an input to the model, this seems to be an important departure from what was extensively studied in [32]. However, using edge features captures the effect of dominating noise in graph attention, which is what we are interested in for understanding the performance of graph attention in this paper. Finally, we consider Ψ functions that are a composition of a Lipschitz function and a linear function. This is enough to prove that graph attention is able to distinguish intra- from inter-edges and consequently leads to better performance than graph convolution when the edge features are clean. Given that the edge features in our data model are independent from the node features, this setting avoids the issues discussed in [8].
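As one hedged example of this class of Ψ functions, with w and b as hypothetical learnable parameters and tanh as the Lipschitz part:

```python
import numpy as np

def psi_edge(E_edge, w, b):
    """Psi(E_(i,j)) = tanh(w^T E_(i,j) + b): a 1-Lipschitz function (tanh)
    composed with a linear function of the edge features.

    E_edge: (h,) feature vector of one edge, or an (m, h) stack of edge features;
    w, b:   hypothetical learnable parameters (w of length h, b a scalar).
    """
    return np.tanh(E_edge @ w + b)
```

When the edge features are clean (ζ small) and w is, say, aligned with ν, the linear part w^T E_{(i,j)} concentrates near +‖ν‖^2 on intra-edges and −‖ν‖^2 on inter-edges, which is how such a Ψ can distinguish the two edge types.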