
standard deviation and I is the identity matrix. Let E be the set of edges, which consists of the pairs (i, j) such that a_{ij} = 1. Consider E ∈ R^{|E|×h} to be the edge feature matrix such that each row E_{(i,j)} is an independent h-dimensional Gaussian random vector with E_{(i,j)} ∼ N(ν, ζI) if (i, j) is an intra-edge, i.e., i, j ∈ C_0 or i, j ∈ C_1, and E_{(i,j)} ∼ N(−ν, ζI) if (i, j) is an inter-edge, i.e., i ∈ C_0, j ∈ C_1 or i ∈ C_1, j ∈ C_0. Here ν ∈ R^h is the mean and ζ ≥ 0 is the standard deviation.
Denote by CSBM(n, p, q, µ, ν, σ, ζ) the coupling of a stochastic block model with the Gaussian mixture models for the nodes and the edges, with means µ, ν and standard deviations σ, ζ, respectively, as described above. We denote a sample by (A, X, E) ∼ CSBM(n, p, q, µ, ν, σ, ζ).
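As a concrete illustration only (not part of the formal definition), the following NumPy sketch shows one way a sample (A, X, E) ∼ CSBM(n, p, q, µ, ν, σ, ζ) could be generated. The function name, the even split of the nodes into C_0 and C_1, and the convention that node features of C_0 and C_1 are centered at −µ and +µ are assumptions made for the sketch.

```python
import numpy as np

def sample_csbm(n, p, q, mu, nu, sigma, zeta, rng=None):
    """Sketch of a sampler for (A, X, E) ~ CSBM(n, p, q, mu, nu, sigma, zeta).

    Illustrative assumptions: the nodes are split evenly into C0 and C1, and
    the node features of C0 / C1 are centered at -mu / +mu respectively.
    """
    rng = np.random.default_rng() if rng is None else rng
    d, h = len(mu), len(nu)
    labels = np.repeat([0, 1], [n // 2, n - n // 2])        # class memberships C0, C1

    # Adjacency: intra-class edges with probability p, inter-class edges with probability q.
    same = labels[:, None] == labels[None, :]
    prob = np.where(same, p, q)
    upper = np.triu(rng.random((n, n)) < prob, k=1)
    A = (upper | upper.T).astype(int)

    # Node features: Gaussian with mean +/-mu and standard deviation sigma, one row per node.
    node_sign = np.where(labels == 0, -1.0, 1.0)
    X = node_sign[:, None] * mu + sigma * rng.standard_normal((n, d))

    # Edge features: mean +nu on intra-edges, -nu on inter-edges, standard deviation zeta,
    # one h-dimensional row per pair (i, j) with a_ij = 1.
    edges = np.argwhere(A == 1)
    edge_sign = np.where(labels[edges[:, 0]] == labels[edges[:, 1]], 1.0, -1.0)
    E = edge_sign[:, None] * nu + zeta * rng.standard_normal((len(edges), h))
    return A, X, E, edges
```

The list of edges is returned alongside E so that each row of E can be matched to its pair (i, j).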
3.2 Assumptions
We now state two standard assumptions on the CSBM that we will use in our analysis. The first assumption is that p, q ≥ Ω(log^2 n / n); it guarantees that the expected degrees of the graph are Ω(log^2 n) and that the degrees concentrate around their expectations. The second assumption is that the standard deviation ζ of the edge features is constant. This assumption is without loss of generality, since all that really matters is the ratio of the distance between the means of the edge features to their standard deviation. As long as we allow the distance between the means to grow while ζ is fixed, the results are not restricted and the analysis is simplified.
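As a rough numerical illustration of the first assumption (the leading constant below is arbitrary and chosen only for this example):

```python
import numpy as np

# With p of order log^2(n)/n, the expected degree n*p is of order log^2(n).
n = 10_000
p = 2 * np.log(n) ** 2 / n   # hypothetical choice with leading constant 2
print(n * p)                 # about 2 * log^2(10_000), roughly 170
```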
3.3 Graph attention
The graph attention convolution is defined as x̂_i = Σ_{j∈[n]} Ã_{ij} γ_{ij} x_j for all i ∈ [n], where γ_{ij} is the attention coefficient of the edge (i, j). We focus on a single graph attention layer since this architecture is enough for the simple CSBM that we consider.
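In matrix form, a single such layer can be sketched as follows, taking Ã and the attention coefficients γ (defined next) as given:

```python
import numpy as np

def graph_attention_convolution(A_tilde, gamma, X):
    """x_hat_i = sum_{j in [n]} A_tilde[i, j] * gamma[i, j] * x_j for all i in [n].

    A_tilde: (n, n) matrix corresponding to Ã in the text (taken as given here),
    gamma:   (n, n) attention coefficients,
    X:       (n, d) node feature matrix.
    """
    return (A_tilde * gamma) @ X   # elementwise product, then applied to the node features
```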
There are many ways to set the attention coefficients γ. We discuss the setting used in this paper and how it relates to the original definition in [32] and newer ones [8]. We define the attention function Ψ(E_{(i,j)}), which takes as input the features E_{(i,j)} of the edge (i, j) and outputs a scalar value. The function Ψ is often parameterized by learnable variables, and it is used to define the attention coefficients

γ_{ij} := exp(Ψ(E_{(i,j)})) / Σ_{ℓ∈N_i} exp(Ψ(E_{(i,ℓ)})),

where N_i is the set of neighbors of node i.
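The following is a minimal sketch of how these coefficients could be computed, assuming the scalar scores Ψ(E_{(i,j)}) have already been arranged into an n × n array and that every node has at least one neighbor (which holds with high probability under the degree assumption above):

```python
import numpy as np

def attention_coefficients(A, psi_scores):
    """gamma[i, j] = exp(Psi(E_(i,j))) / sum over l in N_i of exp(Psi(E_(i,l))).

    psi_scores[i, j] holds Psi(E_(i,j)) for every edge (i, j) with A[i, j] = 1;
    entries on non-edges are ignored.  Assumes every node has a neighbor.
    """
    masked = np.where(A == 1, psi_scores, -np.inf)        # restrict the softmax to N_i
    masked = masked - masked.max(axis=1, keepdims=True)   # numerical stability
    exps = np.exp(masked)
    return exps / exps.sum(axis=1, keepdims=True)
```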
In the original paper [32] the function Ψ is a linear function of the two-dimensional vector [w^T x_i, w^T x_j] passed through a LeakyReLU, where the coefficients of the linear function are learnable parameters; w is learnable as well and is shared with the parameters outside the attention mechanism. In this paper we consider independent edge features as input to the attention mechanism. Although edge features are mentioned in the original paper [32] as an input to the model, this seems to be an important departure from what was extensively studied in [32]. However, using edge features captures the effect of dominating noise in graph attention, which is what we are interested in for understanding the performance of graph attention in this paper. Finally, we consider Ψ functions that are a composition of a Lipschitz function and a linear function. This is enough to prove that graph attention is able to distinguish intra- from inter-edges and consequently leads to better performance than graph convolution when the edge features are clean. Given that the edge features in our data model are independent from the node features, this setting avoids the issues discussed in [8].
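As one hedged example of this class of Ψ functions, with w and b as hypothetical learnable parameters and tanh as the Lipschitz part:

```python
import numpy as np

def psi_edge(E_edge, w, b):
    """Psi(E_(i,j)) = tanh(w^T E_(i,j) + b): a 1-Lipschitz function (tanh)
    composed with a linear function of the edge features.

    E_edge: (h,) feature vector of one edge, or an (m, h) stack of edge features;
    w, b:   hypothetical learnable parameters (w of length h, b a scalar).
    """
    return np.tanh(E_edge @ w + b)
```

When the edge features are clean (ζ small) and w is, say, aligned with ν, the linear part w^T E_{(i,j)} concentrates near +‖ν‖^2 on intra-edges and −‖ν‖^2 on inter-edges, which is how such a Ψ can distinguish the two edge types.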