On Classification Thresholds for Graph Attention with Edge
Features
Kimon Fountoulakis, Dake He, Silvio Lattanzi, Bryan Perozzi, Anton Tsitsulin, Shenghao Yang

Kimon Fountoulakis: School of Computer Science, University of Waterloo, and Google Waterloo, Waterloo, ON, Canada. E-mail: kfountou@uwaterloo.ca.
Dake He: Google Waterloo, Waterloo, ON, Canada. E-mail: dkhe@google.com.
Silvio Lattanzi: Google Research, Zurich, Switzerland. E-mail: silviol@google.com.
Bryan Perozzi: Google Research, New York, USA. E-mail: bperozzi@acm.org.
Anton Tsitsulin: Google Research, New York, USA. E-mail: tsitsulin@google.com.
Shenghao Yang: School of Computer Science, University of Waterloo, Waterloo, ON, Canada. E-mail: s286yang@uwaterloo.ca.
October 19, 2022
Abstract
In recent years we have seen the rise of graph neural networks for prediction tasks on graphs.
One of the dominant architectures is graph attention due to its ability to make predictions using
weighted edge features and not only node features. In this paper we analyze, theoretically and
empirically, graph attention networks and their ability to correctly label nodes in a classic
classification task. More specifically, we study the performance of graph attention on the classic
contextual stochastic block model (CSBM). In the CSBM the node and edge features are obtained
from a mixture of Gaussians and the edges from a stochastic block model. We consider a general
graph attention mechanism that takes random edge features as input to determine the attention
coefficients. We study two cases. In the first, when the edge features are noisy, we prove
that the majority of the attention coefficients are, up to a constant, uniform. This allows us to
prove that graph attention with edge features is not better than simple graph convolution for
achieving perfect node classification. In the second, we prove that when the edge features are clean,
graph attention can distinguish intra- from inter-edges, and this makes graph attention better
than classic graph convolution.
1 Introduction
Learning from multi-modal datasets is currently one of the most prominent topics in artificial
intelligence. The reason behind this trend is that many applications, such as recommendation
systems, fraud detection and vision, require some combination of different types of data. In this
paper we are interested in multi-modal data which combine a graph, i.e., a set of nodes and edges,
with attributes for each node and edge. The attributes of the nodes/edges capture information
about the nodes/edges themselves, while the edges among the nodes capture relations among the
nodes. Capturing relations is particularly helpful for applications where we are trying to make
predictions for nodes given neighborhood data.
One of the most prominent ways of handling multi-modal data for downstream tasks such as
node classification is graph neural networks [19,31,9,14,22,4,12,21,26]. Graph neural network
models can mix hand-crafted or automatically learned attributes about the nodes while taking into
account relational information among the nodes. Their output vector representation contains both
global and local information for the nodes. This contrasts with neural networks that only learn
from the attributes of entities.
1.1 Motivation and goals
Graph neural networks have found a plethora of uses in chemistry [18,31], biology, and in various
industrial applications. Some representative examples include fighting spam and abusive behaviors,
providing personalization for users [37], and predicting the states of physical objects [6]. Given the wide
applicability and exploding popularity of GNNs, theoretically understanding the regimes in which they
work best is of paramount importance.
One of the most popular graph neural network architectures is the Graph Attention Network
(GAT). Graph attention [32] is usually defined as averaging the features of a node with the features
of its neighbors by appropriately weighting the edges of a graph before spatially convolving the node
features. It is generally expected by practitioners that GAT is able to downweight unimportant
edges and set a large weight for important edges, depending on the downstream task. In this paper
we analyze the graph attention mechanism.
We focus on node classification, which is one of the most popular tasks for graph learning. We
perform our analysis using the contextual stochastic block model (CSBM) [7,13]. The CSBM is a
coupling of the stochastic block model (SBM) with a Gaussian mixture model. We focus on the
two-class setting, where the answer to the question of classification performance is sufficiently precise
to understand the performance of graph attention and to build useful intuition about it.
We study perfect classification because it is one of the three classical questions asked for
community detection in the SBM without node features [1]. We leave results on other types of
classification guarantees for future work. Our goal is to study the performance of graph attention on
a well-studied synthetic data model. We see our paper as a small step in the direction of building
theoretically justified intuition about graph attention and better attention mechanisms.
1.2 Contributions
We study the performance of graph attention with edge and node features for the CSBM. The
edge features follow a Gaussian mixture model with two means, one for intra-edges and one for
inter-edges. We call the edge features clean when the distance between the means is larger than the
standard deviation. We call the edge features noisy when the distance between the means is smaller
than the standard deviation. We split our results into two parts. In the first part we consider the
case where the edge features that are passed to the attention mechanism are clean. In the second
part we consider the case where the edge features are noisy. We describe our contributions below.
1. Separation of intra and inter attention coefficients for clean edge features. There exists an
attention architecture which can distinguish intra- from inter-edges. This attention archi-
tecture allows us to prove that the means of the convolved data do not move closer, while
achieving large variance reduction. It also allows us to prove that the threshold of perfect
node classification for graph attention is better than that of graph convolution.
2. Perfect node classification for clean edge features. Let $\sigma$ be the standard deviation of the
node features, $n$ the number of nodes and $p, q$ the intra- and inter-edge probabilities. If
the distance between the means of the node features is $\omega\!\left(\sigma\sqrt{\frac{\log n}{n\max(p,q)}}\right)$, then with high
probability graph attention classifies the data perfectly.
3. Failure of perfect node classification for clean edge features. If the distance between the
means of the node features is small, that is, smaller than $K\sigma\sqrt{\frac{\log n}{n\max(p,q)(1-\max(p,q))}}$ for
some constant $K$, then graph attention cannot classify the nodes perfectly with probability at
least $1 - 1/n^{1-\max(p,q)}$.
4. Uniform intra and inter attention coefficients for noisy edge features. We prove that for
$n - o(n)$ nodes, at least 90% of their attention coefficients are, up to a constant, uniform. This
means that most attention coefficients are, up to a constant, the same as those of graph
convolution. This property allows us to show that in this regime graph attention is not better
than graph convolution.
5. Perfect node classification for noisy edge features. If the distance is $\omega\!\left(\sigma\frac{p+q}{|p-q|}\sqrt{\frac{\log n}{n\max(p,q)}}\right)$,
then with high probability graph attention classifies all nodes correctly.
6. Failure of perfect node classification for noisy edge features. If the distance is less than
$K\sigma\frac{p+q}{|p-q|}\sqrt{\frac{\log n}{n\max(p,q)(1-\max(p,q))}}$ for some constant $K$, then graph attention cannot classify
the nodes perfectly with probability at least $1 - 1/n^{1-\max(p,q)}$. A numerical comparison of the
clean and noisy threshold scales is sketched right after this list.
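To make the gap between the two regimes concrete, the following sketch evaluates the clean-regime scale $\sigma\sqrt{\log n/(n\max(p,q))}$ and the noisy-regime scale, which is inflated by the factor $\frac{p+q}{|p-q|}$, for one hypothetical choice of $n, p, q, \sigma$ (the values below are ours, picked only for illustration):

```python
import numpy as np

# Hypothetical parameters, chosen only for illustration; they satisfy p, q = Omega(log^2 n / n).
n, sigma = 10_000, 1.0
p, q = 0.5, 0.1

# Scale of the perfect-classification threshold when the edge features are clean.
clean_scale = sigma * np.sqrt(np.log(n) / (n * max(p, q)))

# Scale when the edge features are noisy: inflated by the factor (p + q) / |p - q|.
noisy_scale = (p + q) / abs(p - q) * clean_scale

print(f"clean-regime scale: {clean_scale:.4f}")
print(f"noisy-regime scale: {noisy_scale:.4f}")
print(f"inflation factor (p+q)/|p-q|: {(p + q) / abs(p - q):.2f}")
```

For this particular choice the inflation factor is 1.5, so in the noisy regime the means of the node features must be 1.5 times further apart for the same guarantee.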
Finally we complement our theoretical results with an empirical analysis confirming our main
findings.
2 Relevant Work
So many papers have proposed new graph neural network architectures that it is impossible
to acknowledge all of them in one paper. We refer the reader to relevant books and survey
papers on graph neural networks, for example [20,35]. From a theoretical perspective, a few
authors have analyzed graph neural networks using traditional machine learning frameworks or
from a signal processing perspective [10,11,38,36,17,28,29]. For a survey in this direction
see [24], which focuses on three main categories: representation, generalization
and extrapolation. In our paper we analyze graph attention from a statistical perspective that
allows us to formally understand claims about graph attention.
In the past, researchers have put significant effort into understanding community detection for
the SBM [1]. Usually the results for community detection are divided into three parameter regimes
for the SBM. The first type of guarantee that was investigated was that of exact recovery or
perfect classification. We are also interested in perfect node classification, but our work is on
graph attention for the CSBM. The analysis of exact recovery in SBM and perfect classification in
CSBM for graph attention are significantly different. In fact, our focus is not on designing the best
algorithm for the exact classification task but on understanding the advantages and limitations
of Graph Attention over other standard architectures. As a consequence, the model we analyze is
a non-linear function of the input data since we have to deal with the coupling of highly nonlinear
attention coefficients, the node features and the graph structure.
A closely related work is [5], which studies the performance of graph convolution [26] on CSBM
as a semi-supervised learning problem. In our paper we work with graph attention and we compare
it to graph convolution. Another relevant work is [16]. In this paper the authors also study the
performance of graph attention for CSBM. However, in [16] edge features are not used and there
is no result provided about when graph attention fails to achieve perfect node classification; only
a conjecture is provided. In this paper we provide a complete treatment regarding the question of
perfect classification when edge features are given. Another paper that studies the performance of graph
attention on the CSBM is [3]. In this paper the attention architecture is constructed using ground-truth
labels and it is fixed. The authors also consider an attention architecture which is constructed using
an eigenvector of the adjacency matrix when the community structure in the graph can be exactly
recovered. Thus in [3] only a rather optimistic scenario is studied, that is, when we are given a good
attention architecture. In our paper, we consider the case where additional edge features are given
that follow a Gaussian mixture model and we analyze the performance of graph attention when
these features are clean or noisy. We provide a complete analysis of the attention coefficients,
instead of assuming them, and we show how they affect perfect node classification when the edge
features are clean or noisy.
Within the context of random graphs, other relevant works are [25,30]. In the former paper,
the authors study universality of graph neural networks on random graphs. In the latter paper the
authors go a step further and prove that the generalization error of graph neural networks between
the training set and the true distribution is small, and the error decreases with respect to the
number of training samples and the average number of nodes in the graphs. In our paper we are
interested in understanding the parameter regimes of the CSBM in which graph attention classifies
or fails to classify the data perfectly. This allows us to compare the performance of graph attention
to other basic approaches such as graph convolution.
Other papers that have studied the performance of graph attention are [8,27,23]. In [8] the
authors show that graph attention fails due to a global ranking of nodes that is generated by
the attention mechanism in [32]. They propose a deeper attention mechanism as a solution. In our
analysis a deeper attention mechanism is not required, since we consider independently distributed
edge features and the issue mentioned in [8] is avoided.
The work in [27] is an empirical study of the ability of graph attention to generalize on larger,
complex, and noisy graphs. Finally, in [23] the authors propose a different metric to generate
the attention coefficients and show empirically that it has an advantage over the original GAT
architecture. In our paper we consider the original and most popular attention mechanism [32] and
its deeper variation as well.
3 Preliminaries
In this section we describe the data model that we use and the graph attention architecture.
3.1 The contextual stochastic block model with random edge features
In this section we describe the CSBM [13], which is a simple coupling of a stochastic block model
with a Gaussian mixture model. Let $(\varepsilon_k)_{k \in [n]}$ be i.i.d. Bernoulli random variables. These variables
define the class membership of the nodes. In particular, consider a stochastic block model consisting
of two classes $C_0 = \{i \in [n] : \varepsilon_i = 0\}$ and $C_1 = C_0^c$, with inter-class edge probability $q$, intra-class
edge probability $p$, and no self-loops. (In practice, self-loops are often added to the graph and the
adjacency matrix $\tilde{A} = (\tilde{a}_{ij})$, defined as $\tilde{A} = A + I$, is used instead, with $D$ the diagonal degree matrix
of $\tilde{A}$, so that $D_{ii} = \sum_{j \in [n]} \tilde{a}_{ij}$ for all $i \in [n]$; our results can be extended to this case with minor
changes.) Given $(\varepsilon_k)$, the adjacency matrix $A = (a_{ij})$ follows a Bernoulli distribution where
$a_{ij} \sim \mathrm{Ber}(p)$ if $i, j$ are in the same class and $a_{ij} \sim \mathrm{Ber}(q)$ if they are in distinct classes. This
completes the distributions for the class membership and the graph. Let us now describe the
distributions for the node and edge features.
Consider the node features $x_i$ to be independent $d$-dimensional Gaussian random vectors with
$x_i \sim N(\mu, \sigma I)$ if $i \in C_0$ and $x_i \sim N(-\mu, \sigma I)$ if $i \in C_1$. Here $\mu \in \mathbb{R}^d$ is the mean, $\sigma \geq 0$ is the
standard deviation and $I$ is the identity matrix. Let $E$ be the set of edges, which consists of the pairs
$(i, j)$ such that $a_{ij} = 1$. Consider $E \in \mathbb{R}^{|E| \times h}$ to be the edge feature matrix such that each row
$E_{(i,j)}$ is an independent $h$-dimensional Gaussian random vector with $E_{(i,j)} \sim N(\nu, \zeta I)$ if $(i, j)$ is an
intra-edge, i.e., $i, j \in C_0$ or $i, j \in C_1$, and $E_{(i,j)} \sim N(-\nu, \zeta I)$ if $(i, j)$ is an inter-edge, i.e.,
$i \in C_0, j \in C_1$ or $i \in C_1, j \in C_0$. Here $\nu \in \mathbb{R}^h$ is the mean and $\zeta \geq 0$ is the standard deviation.
Denote by $\mathrm{CSBM}(n, p, q, \mu, \nu, \sigma, \zeta)$ the coupling of a stochastic block model with the Gaussian
mixture models for the nodes and the edges, with means $\mu, \nu$ and standard deviations $\sigma, \zeta$,
respectively, as described above. We denote a sample by $(A, X, E) \sim \mathrm{CSBM}(n, p, q, \mu, \nu, \sigma, \zeta)$.
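For concreteness, the following sketch draws one sample $(A, X, E)$ from the model above; the function name, the dense adjacency representation, and the convention of drawing an independent feature row for each ordered edge are our own choices and not prescribed by the model.

```python
import numpy as np

def sample_csbm(n, p, q, mu, nu, sigma, zeta, seed=None):
    """Draw one sample (A, X, edges, E, eps) from CSBM(n, p, q, mu, nu, sigma, zeta)."""
    rng = np.random.default_rng(seed)
    d, h = mu.shape[0], nu.shape[0]

    # Class memberships: i.i.d. Bernoulli, C0 = {i : eps_i = 0}, C1 = complement.
    eps = rng.integers(0, 2, size=n)

    # Adjacency: Ber(p) for intra-class pairs, Ber(q) for inter-class pairs, no self-loops.
    same_class = (eps[:, None] == eps[None, :])
    probs = np.where(same_class, p, q)
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    A = (upper | upper.T).astype(int)

    # Node features: mean +mu on C0, mean -mu on C1, per-coordinate standard deviation sigma.
    node_signs = np.where(eps == 0, 1.0, -1.0)
    X = node_signs[:, None] * mu[None, :] + sigma * rng.standard_normal((n, d))

    # Edge features: mean +nu on intra-edges, mean -nu on inter-edges, standard deviation zeta.
    edges = np.argwhere(A == 1)                       # ordered pairs (i, j) with a_ij = 1
    edge_signs = np.where(eps[edges[:, 0]] == eps[edges[:, 1]], 1.0, -1.0)
    E = edge_signs[:, None] * nu[None, :] + zeta * rng.standard_normal((len(edges), h))
    return A, X, edges, E, eps
```

For example, `A, X, edges, E, eps = sample_csbm(1000, 0.5, 0.1, np.ones(2), np.ones(2), 1.0, 1.0)` draws a small sample with two-dimensional node and edge features (again, purely illustrative values).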
3.2 Assumptions
We now state two standard assumptions on the CSBM that we will use in our analysis. The
first assumption is $p, q = \Omega(\log^2 n / n)$, which guarantees that the expected degrees of the graph are
$\Omega(\log^2 n)$ and also guarantees degree concentration. The second assumption is that the standard
deviation $\zeta$ of the edge features is constant. This assumption is without loss of generality, since all
that really matters is the ratio of the distance between the means of the edge features to their
standard deviation. As long as we allow the distance between the means to grow while $\zeta$ is fixed,
the results are not restricted and the analysis is simplified.
3.3 Graph attention
The graph attention convolution is defined as $\hat{x}_i = \sum_{j \in [n]} \tilde{A}_{ij} \gamma_{ij} x_j$ for all $i \in [n]$, where $\gamma_{ij}$ is the
attention coefficient of the edge $(i, j)$. We focus on a single layer of graph attention, since this
architecture is enough for the simple CSBM that we consider.
There are many ways to set the attention coefficients γ. We discuss the setting in our paper
and how it relates to the original definition in [32] and newer ones [8]. We define the attention
function $\Psi(E_{(i,j)})$, which takes as input the features $E_{(i,j)}$ of the edge $(i, j)$ and outputs a scalar
value. The function $\Psi$ is often parameterized by learnable variables, and it is used to define the
attention coefficients
$$\gamma_{ij} := \frac{\exp(\Psi(E_{(i,j)}))}{\sum_{\ell \in N_i} \exp(\Psi(E_{(i,\ell)}))},$$
where $N_i$ is the set of neighbors of node $i$.
In the original paper [32] the function $\Psi$ is a linear function of the two-dimensional vector
$[w^T x_i, w^T x_j]$ passed through a LeakyReLU, where the coefficients of the linear function are learnable
parameters; $w$ consists of learnable parameters as well and is shared with the parameters outside the
attention mechanism. In this paper we consider independent edge features as input to the attention mechanism.
Although edge features are mentioned in the original paper [32] as a possible input to the model, this is
an important departure from the setting that was extensively studied in [32]. However, using edge features
captures the effect of dominating noise in graph attention, which is what we are interested in in this
paper for understanding the performance of graph attention. Finally, we consider $\Psi$ functions that are
a composition of a Lipschitz function and a linear function. This is enough to prove that graph attention is
able to distinguish intra- from inter-edges and consequently leads to better performance than graph
convolution when the edge features are clean. Given that the edge features in our data model are
independent from the node features, this setting avoids the issues discussed in [8].
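A minimal sketch of the single-layer graph attention convolution defined above, using the edge list and edge feature matrix from the CSBM sketch in Section 3.1 and a hypothetical linear score in place of a learned $\Psi$ (self-loops are ignored for simplicity):

```python
import numpy as np

def graph_attention_convolution(X, edges, E, psi_weights):
    """Compute x_hat_i = sum_j A_ij * gamma_ij * x_j with
    gamma_ij = exp(Psi(E_(i,j))) / sum_{l in N_i} exp(Psi(E_(i,l)))."""
    n = X.shape[0]

    # Psi: a linear score on the edge features, standing in for the
    # Lipschitz-composed-with-linear attention functions considered in the paper.
    scores = E @ psi_weights                      # one scalar per listed edge (i, j)

    # Attention coefficients: softmax of the scores over each node's neighborhood.
    gamma = np.zeros((n, n))
    for (i, j), s in zip(edges, scores):
        gamma[i, j] = np.exp(s)
    row_sums = gamma.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0                 # guard isolated nodes against division by zero
    gamma /= row_sums

    # Since gamma_ij = 0 whenever (i, j) is not an edge, summing gamma_ij * x_j over
    # all j implements the convolution sum_j A_ij * gamma_ij * x_j.
    return gamma @ X
```

With `psi_weights` aligned with the edge-feature mean (e.g. `psi_weights = nu / np.linalg.norm(nu)`, an illustrative choice), one would expect intra-edges to receive larger coefficients than inter-edges when the edge features are clean, in line with the separation described in our contributions.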