Explaining the Explainers in Graph Neural Networks: a Comparative
Study
Antonio Longa1,2, Steve Azzolin1, Gabriele Santin3, Giulia Cencetti4, Pietro Liò5, Bruno Lepri2,
and Andrea Passerini1
1University of Trento
2Fondazione Bruno Kessler
3University of Venice
4CNRS, Centre de Physique Théorique
5University of Cambridge
July 2, 2024
Abstract
Following a fast initial breakthrough in graph-based learning, Graph Neural Networks (GNNs) have reached
widespread application in many science and engineering fields, prompting the need for methods to understand
their decision process. GNN explainers have started to emerge in recent years, with a multitude of methods
both novel and adapted from other domains. To sort out this plethora of alternative approaches, several studies
have benchmarked the performance of different explainers in terms of various explainability metrics. However,
these earlier works make no attempt at providing insights into why different GNN architectures are more or
less explainable, or into which explainer should be preferred in a given setting. In this survey we fill these gaps by
devising a systematic experimental study, which tests twelve explainers on eight representative message-passing
architectures trained on six carefully designed graph and node classification datasets. With our results we
provide key insights on the choice and applicability of GNN explainers, we isolate the key components that make
them usable and successful, and we provide recommendations on how to avoid common interpretation pitfalls. We
conclude by highlighting open questions and directions for possible future research.
1 Introduction
Graph Neural Networks (GNNs) have emerged as the de-facto standard for graph-based learning tasks. Despite
their apparent simplicity, which allows most GNN architectures to be expressed as variants of Message Passing [31],
i.e., the exchange of messages between nodes, GNNs have proved extremely effective in preserving the natural symmetries
present in many real-world physical systems [108, 43, 77, 25, 10]. The versatility of GNNs has also allowed them to be
applied to emulate classical algorithms [14], addressing tasks like bipartite matching [30], graph coloring [56] or the
Traveling Salesperson Problem [76], and to approximate symbolic reasoning tasks like propositional satisfiability [92,
91, 109] and probabilistic logic reasoning [130]. Despite recent works trying to adapt the Transformer architecture,
made popular by a wide success first in language [104, 84, 78] and then in vision applications [80, 3, 61], to the
graph domain [48, 81, 125, 22, 132], the natural inductive bias of GNNs remains at the basis of their current success.
A major drawback of GNNs with respect to alternative graph processing approaches [39, 74, 97, 34] is the
opacity of their predictive mechanism, which they share with most deep-learning-based architectures. This severely
limits the applicability of these technologies to safety-critical scenarios.
The need to provide insights into the decision process of the network, and the need to provide explanations for
automatic decisions affecting human lives [18, 47], have stimulated research in techniques for shedding light into the
black-box nature of deep architectures [83, 93, 114, 126, 103, 100, 102, 75, 35]. These approaches have also been
adapted to generate explanations for GNN models [100, 102, 75, 8]. However, networked data have peculiarities that
pose specific challenges that explainers developed for tensor data struggle to address. The main challenge comes
from the lack of a regular structure, as nodes have a variable number of edges, which requires ad-hoc strategies
to be properly addressed. Indeed, several approaches have been recently developed that are specifically tailored to
explaining GNN architectures, and Section 3 will summarize the main contributions.
It is often the case, however, that each work proposes a new set of benchmarks or metrics, making the comparison
across works complicated. We thereby stress the need for a comprehensive evaluation that can fairly benchmark
the explainers under a unified lens. One of the first attempts to provide such a comparative analysis is the
work by Yuan et al. [122], where a taxonomy of the available explainers was proposed. In addition to
this, the authors reported a detailed overview of the most common datasets used to benchmark explainers, along
with the adopted evaluation metrics.
However, despite the wide coverage of explainers, datasets, and evaluation metrics, only a single GNN archi-
tecture, namely a simple Graph Convolutional Network [49], was evaluated, so that nothing can be said about the
impact of different architectures on the resulting explanations. A similar limitation affects the works of Zhao et
al. [133] and Agarwal et al. [1, 2] which, despite presenting interesting insights in terms of consistency of explainers
and desired properties of explanation metrics, and even introducing a generator for synthetic graph benchmarks, restrict
their analysis to a single GNN architecture. Li et al. [55] conducted the first empirical study comparing different
GNN architectures. However, their study is limited to node classification, and the three explainers under analy-
sis [117, 64, 88] are not well representative of the diversity of explanation strategies that have been proposed, as
summarized in the aforementioned taxonomy [122]. The most comprehensive study to date is the recent work by
Rathee et al. [82], which evaluated four GNN architectures over nine explainers for both node and graph classification.
However, the main goal of that study is to propose a benchmarking suite for quantitatively evaluating explainers, with
no attempt at providing insights into why different GNN architectures behave differently in terms of explainability,
or into which explainer should be preferred in a given setting.
Despite the aforementioned recent studies benchmarking explainability methods for GNNs, no investigation
has characterized the typical explanation patterns associated with the topological concepts learned
by the network, nor how different architectures affect the explanation. In our work, we address these issues by
answering the following research questions:
RQ1: How does the architecture affect the explanations?
RQ2: How do explainers affect the explanations?
RQ3: How do different types of problems affect the explanations?
Overall, our work aims to go beyond a merely quantitative evaluation of the performance of explainer-GNN pairs
and to make a significant step towards explaining explainability. We run an unprecedented number of experiments
involving eight GNN architectures, twelve instance-based explainers, and six datasets (divided into node and graph
classification), and we enrich the quantitative results by providing a deep understanding of the reasons behind
the observed behaviors, together with a set of recommendations on how to select and best use the most appropriate
explainer for the task under investigation while avoiding common pitfalls, as well as a number of open problems in
GNN explainability that we believe deserve further investigation. Following previous analyses [123, 55, 58, 1, 82],
our study investigates message-passing GNNs applied to static graphs. Exploring explainability in non-message-passing
architectures (e.g., Transformers) or different graph types (e.g., temporal graphs) is an interesting direction
for future research, which may however necessitate distinct benchmarking approaches.
The remainder of the paper is structured as follows: Section 2 presents an overview of GNN architectures, with
greater detail on the models that we adopted in our study. In the same vein, Section 3 introduces the explainers
for graph models, while Section 4 describes the benchmark datasets. Section 5 presents the evaluation metrics
employed to assess the explanation quality. Section 6 summarizes how we trained the tested architectures, and
Section 7 presents the results expressed with respect to the research questions defined above. In Section 8 we discuss
the results. Finally, in Section 9 we propose future research directions and in Section 10 we draw some conclusions.
2 Graph Neural Networks
In this section we first introduce the notation to deal with the GNN formalism, then we review the GNN architectures
explicitly used in our study. To facilitate the reading of the paper, recurring mathematical notation is summarized
in Table 2.
We consider a graph $G := (V, E, X_V, X_E)$, with $n_V \in \mathbb{N}$ nodes $V := \{1, \dots, n_V\}$, $n_E \in \mathbb{N}$ edges $E \subseteq V \times V$, a matrix of $d$-dimensional node features $X_V \in \mathbb{R}^{n_V \times d}$, where the $i$-th row of $X$ is the vector of $d \in \mathbb{N}$ features of the $i$-th node, and similarly a matrix of $d$-dimensional edge features $X_E \in \mathbb{R}^{n_E \times d}$. Most graphs considered in this paper do not have edge features, and we will simply write $G := (V, E, X)$ in order to denote a graph with node features only. We use the matrices $A, L, I, \tilde{A}, \tilde{D}, \tilde{L} \in \mathbb{R}^{n_V \times n_V}$, where $A$ and $L$ are the adjacency and Laplacian matrices of $G$, $I$ is the $n_V$-dimensional identity matrix, $\tilde{A} := A + I$, $\tilde{D}$ is its diagonal degree matrix, and $\tilde{L} := \frac{2}{\lambda_{\max}(L)} L - I$ is the scaled and normalized Laplacian, where $\lambda_{\max}(L)$ is the largest eigenvalue of $L$. Furthermore, $N(i) := \{j \in V : (i, j) \in E\}$ is the first-order neighborhood of the node $i \in V$.

Symbol | Description
$G := (V, E, X)$ | A graph.
$V := \{1, \dots, n_V\}$ | The set of $n_V \in \mathbb{N}$ nodes of a graph.
$E \subseteq V \times V$ | The set of $n_E \in \mathbb{N}$ edges of a graph.
$X \in \mathbb{R}^{n_V \times d}$, $X_i \in \mathbb{R}^d$ | The matrix of $d$-dimensional node features, and the feature vector of the node $i \in V$.
$I \in \mathbb{R}^{n_V \times n_V}$ | The identity matrix.
$A, \tilde{A} \in \mathbb{R}^{n_V \times n_V}$ | The adjacency and normalized adjacency matrix of a graph.
$D, \tilde{D} \in \mathbb{R}^{n_V \times n_V}$ | The degree and normalized degree matrix of a graph.
$L, \tilde{L} \in \mathbb{R}^{n_V \times n_V}$ | The Laplacian and normalized Laplacian matrix of a graph.
$N(i) \subseteq V$ | The first-order neighborhood of the node $i \in V$.
$W \in \mathbb{R}^{d \times d'}$ | The trainable weights of a layer.

Table 2: List of the mathematical symbols used throughout the paper, and their meaning. A few symbols used
only in specific settings are omitted and defined in the text.
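To make the notation concrete, the following minimal NumPy sketch builds the matrices listed in Table 2 for a small toy graph; the graph itself (4 nodes, 5 edges) is an illustrative assumption, not one of the datasets used in the paper.

```python
import numpy as np

# Toy undirected graph with n_V = 4 nodes; the edge list is purely illustrative.
n_V = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]

A = np.zeros((n_V, n_V))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0                      # adjacency matrix A

D = np.diag(A.sum(axis=1))                       # degree matrix D
L = D - A                                        # Laplacian matrix L

A_tilde = A + np.eye(n_V)                        # A~ := A + I (self-loops added)
D_tilde = np.diag(A_tilde.sum(axis=1))           # D~: diagonal degree matrix of A~

lam_max = np.linalg.eigvalsh(L).max()            # largest eigenvalue of L
L_tilde = (2.0 / lam_max) * L - np.eye(n_V)      # L~ := (2 / lambda_max(L)) L - I

N_0 = np.flatnonzero(A[0])                       # first-order neighborhood N(0)
```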
Each GNN layer takes as input the graph $G$, and maps the node features $X \in \mathbb{R}^{n_V \times d}$ to updated node features $X' \in \mathbb{R}^{n_V \times d'}$ for a given $d' \in \mathbb{N}$. Some specific GNN layers, like Hierarchical Pooling layers [135, 118, 52, 13], instead of refining the embedding of each input node, aggregate nodes in order to coarsen the graph, in a similar way as done by pooling methods for vision models [16, 102, 50], thus resulting in a node feature matrix $X' \in \mathbb{R}^{n' \times d'}$ where $n' < n_V$. Overall, this new feature matrix $X'$ is the embedding or representation of the nodes after the application of one layer of the network. When needed, we denote as $X_i, X'_i$ the original and transformed feature vectors of the $i$-th node, i.e., the transposes of the $i$-th rows of the matrices $X, X'$. Specifying the map $X \mapsto X'$ is thus sufficient to provide a full definition of the different layers. These transformations are parametric, and they depend on trainable weights that are learned during the optimization of the network. We represent these weights as matrices $W$. Additional terms specific to single layers are defined in the following.
After an arbitrary number $t$ of GNN layers stacked in sequence, the node embedding matrix $X^{(t)}$ is further processed in a way that depends on the task to perform. In node classification settings [49, 41], where the aim is predicting one or more node properties, a Multi-Layer Perceptron (MLP) [38] (with shared parameters across nodes) is applied to each node's embedding independently in order to output its predicted class. In graph classification settings [41] instead, where the goal is predicting a label for the entire graph, a permutation-invariant aggregation function (like mean, max, or sum) is applied over the node embeddings to compress $X^{(t)}$ into a single vector, which is then mapped to the final prediction via a standard MLP.
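As an illustration of the two readout strategies, the sketch below contrasts node-level and graph-level predictions in plain NumPy; the random embeddings and the single linear layer standing in for a trained MLP are assumptions made purely for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_V, d, n_classes = 5, 8, 3
X_t = rng.normal(size=(n_V, d))          # node embeddings after t GNN layers
W_out = rng.normal(size=(d, n_classes))  # stand-in for a trained MLP (one linear layer)

# Node classification: the same MLP is applied to every node embedding independently.
node_logits = X_t @ W_out                # shape (n_V, n_classes), one prediction per node

# Graph classification: permutation-invariant readout (here the mean), then an MLP.
graph_embedding = X_t.mean(axis=0)       # shape (d,), a single vector for the whole graph
graph_logits = graph_embedding @ W_out   # shape (n_classes,), one prediction per graph
```

In a mini-batch setting the readout would be computed per graph, but the single-graph view above suffices to show the difference between the two tasks.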
With this notation settled, we can now fully define the architectures that we are going to consider. In selecting
the architectures to be included in our study, we relied on the comprehensive taxonomy of GNN methods published
by Zhou et al. [135]. Since our goal is to provide an extensive overview of explainability methods for GNNs, we
selected the models to benchmark aiming to cover the different categories of the taxonomy as broadly as possible.
The specific methods are also selected depending on their popularity, their ease of training, their performance on our
benchmark datasets, their code availability and their compatibility with the explainers being investigated. Overall,
we analyzed the following categories: Convolutional, whose computation can be roughly understood as a generalization
of the convolution operation from the image domain. Such convolution can either be Spectral [19, 49], theoretically
grounded in graph signal processing [99], or Spatial [105, 36, 116, 32], where the operations are usually defined in
terms of the graph topology. The Pooling category contains all approaches that aggregate node representations in order
to perform graph-level tasks. They can be further differentiated into Direct [106, 127], where nodes can be aggregated
with different aggregation strategies, often called readout functions, and Hierarchical [118, 13, 120, 52, 11], where
nodes are progressively aggregated in a hierarchical fashion based on their similarity. The latter methods often allow one to
cluster nodes based both on their features and on their topological neighborhood [118, 11]. Despite covering the major
aspects of GNN architectures, the aforementioned taxonomy lacks some of the fundamental works that we
analyze in our study. To compensate for that, we decided to include the Graph Isomorphism
Network (Gin) [116] and the GraphConv Higher Order Network (GraphConv) [71] as Spatial Convolution and
Higher Order respectively, the latter being a new category added to the taxonomy. A summary of such categorization can be
found in Figure 1. In the following, we broadly describe each GNN model used in this work, reporting for each one the category as identified by Zhou et al. [135].
Figure 1: An overview of the adopted GNN architectures structured in a taxonomy as defined by Zhou et al. [135].
In particular, blue boxes represent the categories as defined by the aforementioned work, while the purple box
corresponds to the newly introduced Higher Order category.
Although the following definitions are often given as global functions
of the graph for notational convenience, we remark that all the layers of a GNN can be efficiently implemented as
local operations by means of message passing or sparse matrix multiplications.
Chebyshev Spectral Graph Convolution (Cheb, Spectral Convolution) [19]: The Chebyshev spectral graph
convolutional operator [19] aims to generalize the convolution operation from the image to the graph domain.
In doing so, it approximates the convolution of the node features with a trainable filter, where such
approximation is defined in the Fourier domain by means of a Chebyshev polynomial. Since explicitly computing
the convolution in the Fourier domain is computationally very expensive, Cheb adopts the truncated recursive
Chebyshev expansion [37], where the Chebyshev polynomial $T_k(x)$ of order $k$ can be computed by the recurrence
$T_k(x) = 2x\,T_{k-1}(x) - T_{k-2}(x)$ with $T_1 = 1$, $T_2 = x$. Given a degree $K \in \mathbb{N}$, we thus have:
$$X' = \sum_{k=1}^{K} Z^{(k)} \cdot W^{(k)}, \qquad (1)$$

where $Z^{(k)}$ is computed recursively as:

$$Z^{(1)} = X, \quad Z^{(2)} = \tilde{L} \cdot X, \quad Z^{(k)} = 2\,\tilde{L} \cdot Z^{(k-1)} - Z^{(k-2)}.$$

In our analysis we kept $K = 5$.
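A minimal sketch of the recursion and of Eq. (1), assuming a toy adjacency matrix and random placeholder weights in place of trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)        # toy adjacency matrix
n_V, d, d_out, K = A.shape[0], 8, 16, 5
X = rng.normal(size=(n_V, d))

L = np.diag(A.sum(axis=1)) - A                   # Laplacian
L_tilde = (2.0 / np.linalg.eigvalsh(L).max()) * L - np.eye(n_V)  # scaled Laplacian L~

# Recursion: Z(1) = X, Z(2) = L~ X, Z(k) = 2 L~ Z(k-1) - Z(k-2).
Z = [X, L_tilde @ X]
while len(Z) < K:
    Z.append(2 * L_tilde @ Z[-1] - Z[-2])

# Eq. (1): X' = sum_k Z(k) W(k), with one trainable weight matrix per order k.
W = [rng.normal(size=(d, d_out)) for _ in range(K)]
X_new = sum(Zk @ Wk for Zk, Wk in zip(Z, W))     # shape (n_V, d_out)
```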
Graph Convolutional Network (Gcn, Spectral Convolution) [49]: Gcn [49] represents a first-order approximation of localized spectral filters on graphs [19]. For a single layer, the node representation is computed as:

$$X' = \sigma\big(\tilde{D}^{-1/2} \cdot \tilde{A} \cdot \tilde{D}^{-1/2} \cdot X \cdot W\big), \qquad (2)$$

with $W \in \mathbb{R}^{d \times d'}$, and where $\sigma: \mathbb{R} \to \mathbb{R}$ is a nonlinear activation function that is applied entry-wise. The
normalization applied to $\tilde{A}$ is in place to avoid numerical instabilities after successive applications of the above
propagation rule. It effectively normalizes the node-wise aggregation by a weighted sum where each neighbor is
weighted by the incident edge weight normalized by the degree of the two nodes. The connection with Cheb
introduced above can be seen by keeping the number of convolutions per layer to 1, thus by setting $K = 1$. After
some further simplifications detailed in [49], and here omitted for brevity, we can rewrite the spectral convolution
as $W^{(k)} \cdot (I + \tilde{D}^{-1/2} \cdot \tilde{A} \cdot \tilde{D}^{-1/2}) \cdot x$. After applying the renormalization trick introduced in [49] and by generalizing
the formulation to a node feature matrix $X \in \mathbb{R}^{n_V \times d}$, we get the formulation of Eq. (2).
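The propagation rule of Eq. (2) can be sketched in a few lines of dense NumPy; the toy graph, random features and weights, and the choice of ReLU as the activation are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)             # toy adjacency matrix
n_V, d, d_out = A.shape[0], 8, 16
X = rng.normal(size=(n_V, d))
W = rng.normal(size=(d, d_out))

A_tilde = A + np.eye(n_V)                          # A~ := A + I (self-loops)
D_inv_sqrt = np.diag(A_tilde.sum(axis=1) ** -0.5)  # D~^{-1/2}

# Eq. (2): X' = sigma(D~^{-1/2} A~ D~^{-1/2} X W), with sigma = ReLU here.
X_new = np.maximum(0, D_inv_sqrt @ A_tilde @ D_inv_sqrt @ X @ W)
```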
Graph SAmple and aggreGatE (GraphSage, Spatial Convolution) [36]: This work was proposed as an ex-
tension of Gcn. Contrary to Gcn, which inherently implements a weighted mean neighborhood aggregation,
GraphSAGE generalizes to different kinds of aggregations like mean, max or Long Short-Term Memory (LSTM)
aggregations [36, 40].
$$X'_i = W_1 X_i + W_2 \cdot \mathrm{aggregation}_{j \in N(i)}\, X_j, \qquad (3)$$

where $W_1, W_2 \in \mathbb{R}^{d' \times d}$ are trainable weight matrices and the function aggregation can be any permutation-invariant
function. Note that, in the case of the LSTM aggregation, the authors adapted LSTMs to operate on sets by simply
applying the LSTMs to a random permutation of the nodes. In our study we used the mean of node features as
aggregator.
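A sketch of Eq. (3) with the mean aggregator used in our study; the toy graph and random weights are placeholders, and the explicit loop over nodes is kept for readability:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)           # toy adjacency matrix
n_V, d, d_out = A.shape[0], 8, 16
X = rng.normal(size=(n_V, d))
W1 = rng.normal(size=(d, d_out))
W2 = rng.normal(size=(d, d_out))

X_new = np.zeros((n_V, d_out))
for i in range(n_V):
    neighbors = np.flatnonzero(A[i])             # N(i)
    aggregated = X[neighbors].mean(axis=0)       # mean aggregation over the neighborhood
    # Eq. (3): X'_i = W1 X_i + W2 * aggregation_{j in N(i)} X_j  (row-vector convention)
    X_new[i] = X[i] @ W1 + aggregated @ W2
```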
Graph Isomorphism Network (Gin, Spatial Convolution) [116]: The work of Xu et al. [116] was among the first
to study the expressive power of GNNs in relation to the Weisfeiler-Lehman (WL) test of isomorphism [113].
The key insight is that a GNN can have discriminative power as large as that of the WL test if the GNN's aggregation
scheme is highly expressive and can model injective functions, like the aggregation of the WL test. They studied the
conditions under which a GNN has the same discriminative power as the WL test, and then they proposed the Graph
Isomorphism Network (Gin), which provably satisfies such conditions [116]. The Gin computes node representations
as:

$$X' = h_W\big((A + (1 + \epsilon) \cdot I) \cdot X\big), \qquad (4)$$

where $\epsilon \in \mathbb{R}$ is a small number, and $h_W$ is a neural network whose weights $W$ are the trainable part of the Gin
layer. This model generalizes the WL test and hence achieves maximum discriminative power with respect to it [116]. In our work, to limit the complexity of the network, we restricted $h_W$ to a single layer.
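A dense sketch of the Gin update of Eq. (4), with $h_W$ restricted to a single linear layer followed by a ReLU, mirroring the single-layer choice mentioned above; all concrete values are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)           # toy adjacency matrix
n_V, d, d_out = A.shape[0], 8, 16
X = rng.normal(size=(n_V, d))
W = rng.normal(size=(d, d_out))                  # weights of the one-layer h_W
eps = 0.1                                        # the (small) epsilon of Eq. (4)

# Eq. (4): X' = h_W((A + (1 + eps) I) X), with h_W a single linear layer + ReLU.
X_new = np.maximum(0, (A + (1 + eps) * np.eye(n_V)) @ X @ W)
```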
Graph Attention Network (Gat, Spatial Attentional Convolution) [105]: Following the successes of the attention
mechanism in natural language processing [104, 78, 84] and in computer vision [115, 80], Veličković et al. [105]
proposed a GNN layer which computes each node's representation as a weighted sum of neighborhood features,
where the weights are computed via an attention mechanism. Such a mechanism helps the layer to focus
on the neighbors that are considered important for the current node, instead of treating all neighbors as equally
important [7]. Specifically, the propagation rule for each node can be defined by:
$$X'_i = \sum_{j \in N(i) \cup \{i\}} \alpha_{i,j}\, W \cdot X_j, \qquad (5)$$

where the $\alpha_{i,j}$ are attention coefficients, which are the output of a single-layer feed-forward neural network applied
to $W \cdot X_i$, $W \cdot X_j$. Namely, if $[x \,\|\, y]$ denotes the concatenation of the vectors $x, y$, and if $W' \in \mathbb{R}^{2d' \times 1}$ is a vector
of trainable weights, then $\alpha_{i,j}$ is defined by

$$\alpha_{i,j} := \frac{\exp\big(\mathrm{LeakyReLU}\big((W')^T [W X_i \,\|\, W X_j]\big)\big)}{\sum_{k \in N(i) \cup \{i\}} \exp\big(\mathrm{LeakyReLU}\big((W')^T [W X_i \,\|\, W X_k]\big)\big)}, \qquad (6)$$

where the LeakyReLU usually has a slope of $0.2$ for $x \le 0$. Similarly to previous works on attention [104, 17, 78, 84, 80],
Gat can implement multi-head attention in order to increase the attention's expressive power in modelling different
aspects of node features.
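The single-head attention of Eqs. (5) and (6) can be sketched as follows; W_att plays the role of the trainable vector $W'$, and the toy graph and random weights are assumptions made for the example:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x >= 0, x, slope * x)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)           # toy adjacency matrix
n_V, d, d_out = A.shape[0], 8, 16
X = rng.normal(size=(n_V, d))
W = rng.normal(size=(d, d_out))                  # shared linear transformation W
W_att = rng.normal(size=(2 * d_out,))            # attention weight vector W'

H = X @ W                                        # W X_j for every node j
X_new = np.zeros((n_V, d_out))
for i in range(n_V):
    nbrs = np.concatenate(([i], np.flatnonzero(A[i])))  # N(i) union {i}
    # Eq. (6): scores e_ij = LeakyReLU(W'^T [W X_i || W X_j]), softmax-normalized.
    scores = np.array([leaky_relu(W_att @ np.concatenate([H[i], H[j]])) for j in nbrs])
    alpha = np.exp(scores) / np.exp(scores).sum()
    # Eq. (5): X'_i = sum_j alpha_ij W X_j over the extended neighborhood.
    X_new[i] = (alpha[:, None] * H[nbrs]).sum(axis=0)
```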
MinCutPooling (MinCutPool, Hierarchical Pooling) [11]: Similarly to other graph pooling mechanisms [118,
13, 120, 52], MinCutPool is able to hierarchically aggregate nodes into coarser representations to summarize
local components and remove redundant information. Graph pooling follows a similar idea as pooling in standard
architectures for vision applications [16, 102, 50], where it helps to reduce the memory and computation footprint
of the model. However, the structured graph domain poses new challenges which require ad-hoc techniques to
obtain well-performing pooling methods. MinCutPool [11] achieves this by formulating a relaxation of the
MinCut problem and by training a GNN to compute cluster assignments, jointly optimizing the downstream
supervised loss and an additional unsupervised mincut loss. In particular, a soft cluster assignment matrix $S \in \mathbb{R}^{n_V \times n'_V}$ is computed with an MLP with softmax activation over a refined set of node features (e.g., after one or more
layers of message passing). Then, both the adjacency matrix and the node feature matrix are updated accordingly:

$$X' = S^T \cdot X, \qquad \tilde{A} = S^T \cdot A \cdot S - I_{n'_V}\, \mathrm{diag}(S^T \cdot A \cdot S), \qquad A' = \tilde{D}^{-\frac{1}{2}} \cdot \tilde{A} \cdot \tilde{D}^{-\frac{1}{2}}.$$
We refer the interested reader to [11] for the details of the mathematical formulation of the differentiable MinCut
problem. Since pooling methods modify the graph structure, they are typically not suitable for node classification
and link prediction tasks [118, 13, 120, 52]. We will henceforth use MinCutPool to refer to an architecture
implemented with Gcn layers interleaved with the MinCutPool operator.
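The coarsening step can be sketched as follows: a softmax-activated assignment matrix S pools both node features and adjacency into a smaller graph. The cluster count, the one-layer stand-in for the MLP, and all matrices are illustrative assumptions, and the unsupervised mincut and orthogonality losses of [11] are omitted:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)        # toy adjacency matrix
n_V, d, n_clusters = A.shape[0], 8, 2
X = rng.normal(size=(n_V, d))                    # refined node features

# Soft cluster assignment S (n_V x n'_V): one linear layer + softmax as a stand-in MLP.
W_s = rng.normal(size=(d, n_clusters))
S = softmax(X @ W_s, axis=1)

# Coarsening: pooled features and pooled, self-loop-free, degree-normalized adjacency.
X_pool = S.T @ X                                 # X' = S^T X
A_pool = S.T @ A @ S                             # S^T A S
A_tilde = A_pool - np.eye(n_clusters) * np.diag(A_pool)        # remove self-loops
deg = A_tilde.sum(axis=1)
A_new = np.diag(deg ** -0.5) @ A_tilde @ np.diag(deg ** -0.5)  # degree normalization
```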