Explaining the Explainers in Graph Neural Networks: a Comparative
Study
Antonio Longa1,2, Steve Azzolin1, Gabriele Santin3, Giulia Cencetti4, Pietro Liò5, Bruno Lepri2,
and Andrea Passerini1
1University of Trento
2Fondazione Bruno Kessler
3University of Venice
4CNRS, Centre de Physique Théorique
5University of Cambridge
July 2, 2024
Abstract
Following a fast initial breakthrough in graph-based learning, Graph Neural Networks (GNNs) have reached
widespread application in many science and engineering fields, prompting the need for methods to understand
their decision process. GNN explainers have started to emerge in recent years, with a multitude of methods
both novel and adapted from other domains. To sort out this plethora of alternative approaches, several studies
have benchmarked the performance of different explainers in terms of various explainability metrics. However,
these earlier works make no attempt at providing insights into why different GNN architectures are more or
less explainable, or into which explainer should be preferred in a given setting. In this survey we fill these gaps by
devising a systematic experimental study, which tests twelve explainers on eight representative message-passing
architectures trained on six carefully designed graph and node classification datasets. With our results we
provide key insights on the choice and applicability of GNN explainers, we isolate the key components that make
them usable and successful, and we provide recommendations on how to avoid common interpretation pitfalls. We
conclude by highlighting open questions and directions for possible future research.
1 Introduction
Graph Neural Networks (GNNs) have emerged as the de-facto standard for graph-based learning tasks. Despite
their apparent simplicity, which allows most GNN architectures to be expressed as variants of Message Passing [31],
i.e., the exchange of messages between nodes, GNNs have proved extremely effective in preserving the natural symmetries
present in many real-world physical systems [108, 43, 77, 25, 10]. The versatility of GNNs has also allowed them to be
applied to emulate classical algorithms [14], addressing tasks like bipartite matching [30], graph coloring [56] or the
Traveling Salesperson Problem [76], and to approximate symbolic reasoning tasks like propositional satisfiability [92,
91, 109] and probabilistic logic reasoning [130]. Despite recent works trying to adapt the Transformer architecture,
made popular by a wide success first in language [104, 84, 78] and then in vision applications [80, 3, 61], to the
graph domain [48, 81, 125, 22, 132], the natural inductive bias of GNNs remains at the basis of their current success.
A major drawback of GNNs with respect to alternative graph processing approaches [39, 74, 97, 34] is the
opacity of their predictive mechanism, which they share with most deep-learning-based architectures. This severely
limits the applicability of these technologies to safety-critical scenarios.
The need to provide insights into the decision process of the network, and the need to provide explanations for
automatic decisions affecting human lives [18, 47], have stimulated research in techniques for shedding light into the
black-box nature of deep architectures [83, 93, 114, 126, 103, 100, 102, 75, 35]. These approaches have also been
adapted to generate explanations for GNN models [100, 102, 75, 8]. However, networked data have peculiarities that
pose specific challenges that explainers developed for tensor data struggle to address. The main challenge comes
from the lack of a regular structure, as nodes have a variable number of edges, which requires ad-hoc strategies
to be properly addressed. Indeed, several approaches have been recently developed that are specifically tailored to
explaining GNN architectures, and Section 3 will summarize the main contributions.
It is often the case, however, that each work proposes a new set of benchmarks or metrics, making the comparison
across works complicated. We thereby stress the need for a comprehensive evaluation that can fairly benchmark
the explainers under a unified lens. One of the first attempts to provide such a comparative analysis is the
work by Yuan et al. [122], where a taxonomy of the available explainers was proposed. In addition to
this, the authors reported a detailed overview of the most common datasets used to benchmark explainers, along
with the adopted evaluation metrics.
However, despite the wide coverage of explainers, datasets, and evaluation metrics, only a single GNN archi-
tecture, namely a simple Graph Convolutional Network [49], was evaluated, so that nothing can be said about the
impact of different architectures on the resulting explanations. A similar limitation affects the works of Zhao et
al. [133] and Agarwal et al. [1, 2] which, despite presenting interesting insights in terms of consistency of explainers
and desired properties of explanation metrics, and even introducing a generator for synthetic graph benchmarks, restrict
their analysis to a single GNN architecture. Li et al. [55] conducted the first empirical study comparing different
GNN architectures. However, their study is limited to node classification, and the three explainers under analy-
sis [117, 64, 88] are not well representative of the diversity of explanation strategies that have been proposed, as
summarized in the aforementioned taxonomy [122]. The most comprehensive study to date is the recent work by
Rathee et al. [82], which evaluated four GNN architectures over nine explainers for both node and graph classification.
However, the main goal of that study is to propose a benchmarking suite for quantitatively evaluating explainers, with
no attempt at providing insights into why different GNN architectures behave differently in terms of explainability,
or into which explainer should be preferred in a given setting.
Despite the aforementioned recent studies benchmarking explainability methods for GNNs, no investigation
has characterized the typical explanation patterns associated with the topological concepts learned
by the network, nor how different architectures affect the explanation. In our work, we address these issues by
answering the following research questions:
RQ1: How does the architecture affect the explanations?
RQ2: How do explainers affect the explanations?
RQ3: How do different types of problems affect the explanations?
Overall, our work aims to go beyond a merely quantitative evaluation of the performance of explainer-GNN pairs
and to make a significant step towards explaining explainability. We run an unprecedented number of experiments
involving eight GNN architectures, twelve instance-based explainers, and six datasets (divided into node and graph
classification), and we enrich the quantitative results by providing a deep understanding of the reasons behind
the observed behaviors, together with a set of recommendations on how to select and best use the most appropriate
explainer for the task under investigation while avoiding common pitfalls, as well as a number of open problems in
GNN explainability that we believe deserve further investigation. Following previous analyses [123, 55, 58, 1, 82],
our study investigates message-passing GNNs applied to static graphs. Exploring explainability in non-message-passing
architectures (e.g., Transformers) or different graph types (e.g., temporal graphs) is an interesting direction
for future research, which may however necessitate distinct benchmarking approaches.
The remainder of the paper is structured as follows: Section 2 presents an overview of GNN architectures, with
greater detail on the models that we adopted in our study. In the same vein, Section 3 introduces the explainers
for graph models, while Section 4 describes the benchmark datasets. Section 5 presents the evaluation metrics
employed to assess the explanation quality. Section 6 summarizes how we trained the tested architectures, and
Section 7 presents the results expressed with respect to the research questions defined above. In Section 8 we discuss
the results. Finally, in Section 9 we propose future research directions and in Section 10 we draw some conclusions.
2 Graph Neural Networks
In this section we first introduce the notation to deal with the GNN formalism, then we review the GNN architectures
explicitly used in our study. To facilitate the reading of the paper, recurring mathematical notation is summarized
in Table 2.
We consider a graph $G := (V, E, X_V, X_E)$, with $n_V \in \mathbb{N}$ nodes $V := \{1, \dots, n_V\}$, $n_E \in \mathbb{N}$ edges $E \subseteq V \times V$, a matrix of $d$-dimensional node features $X_V \in \mathbb{R}^{n_V \times d}$, where the $i$-th row of $X$ is the vector of $d \in \mathbb{N}$ features of the $i$-th node, and similarly a matrix of $d$-dimensional edge features $X_E \in \mathbb{R}^{n_E \times d}$. Most graphs considered in this paper do not have edge features, and we will simply write $G := (V, E, X)$ in order to denote a graph with node features only. We use the matrices $A, L, I, \tilde{A}, \tilde{D}, \tilde{L} \in \mathbb{R}^{n_V \times n_V}$, where $A$ and $L$ are the adjacency and Laplacian matrices of $G$, $I$ is the $n_V$-dimensional identity matrix, $\tilde{A} := A + I$, $\tilde{D}$ is its diagonal degree matrix, and $\tilde{L} := \frac{2}{\lambda_{\max}(L)} L - I$ is the scaled and normalized Laplacian, where $\lambda_{\max}(L)$ is the largest eigenvalue of $L$. Furthermore, $N(i) := \{j \in V : (i, j) \in E\}$ is the first-order neighborhood of the node $i \in V$.

Symbol | Description
$G := (V, E, X)$ | A graph.
$V := \{1, \dots, n_V\}$ | The set of $n_V \in \mathbb{N}$ nodes of a graph.
$E \subseteq V \times V$ | The set of $n_E \in \mathbb{N}$ edges of a graph.
$X \in \mathbb{R}^{n_V \times d}$, $X_i \in \mathbb{R}^d$ | The matrix of $d$-dimensional node features, and the feature vector of the node $i \in V$.
$I \in \mathbb{R}^{n_V \times n_V}$ | The identity matrix.
$A, \tilde{A} \in \mathbb{R}^{n_V \times n_V}$ | The adjacency and normalized adjacency matrix of a graph.
$D, \tilde{D} \in \mathbb{R}^{n_V \times n_V}$ | The degree and normalized degree matrix of a graph.
$L, \tilde{L} \in \mathbb{R}^{n_V \times n_V}$ | The Laplacian and normalized Laplacian matrix of a graph.
$N(i) \subseteq V$ | The first-order neighborhood of the node $i \in V$.
$W \in \mathbb{R}^{d \times d'}$ | The trainable weights of a layer.

Table 2: List of the mathematical symbols used throughout the paper, and their meaning. A few symbols used
only in specific settings are omitted and defined in the text.
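To make the notation concrete, the following minimal NumPy sketch builds the matrices listed in Table 2 for a small toy graph; the graph itself (4 nodes, 5 edges) is an illustrative assumption, not one of the datasets used in the paper.

```python
import numpy as np

# Toy undirected graph with n_V = 4 nodes; the edge list is purely illustrative.
n_V = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]

A = np.zeros((n_V, n_V))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0                      # adjacency matrix A

D = np.diag(A.sum(axis=1))                       # degree matrix D
L = D - A                                        # Laplacian matrix L

A_tilde = A + np.eye(n_V)                        # A~ := A + I (self-loops added)
D_tilde = np.diag(A_tilde.sum(axis=1))           # D~: diagonal degree matrix of A~

lam_max = np.linalg.eigvalsh(L).max()            # largest eigenvalue of L
L_tilde = (2.0 / lam_max) * L - np.eye(n_V)      # L~ := (2 / lambda_max(L)) L - I

N_0 = np.flatnonzero(A[0])                       # first-order neighborhood N(0)
```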
Each GNN layer takes as input the graph $G$, and maps the node features $X \in \mathbb{R}^{n_V \times d}$ to updated node features $X' \in \mathbb{R}^{n_V \times d'}$ for a given $d' \in \mathbb{N}$. Some specific GNN layers, like Hierarchical Pooling layers [135, 118, 52, 13], instead of refining the embedding of each input node, aggregate nodes in order to coarsen the graph, in a similar way as done by pooling methods for vision models [16, 102, 50], thus resulting in a node feature matrix $X' \in \mathbb{R}^{n' \times d'}$ where $n' < n_V$. Overall, this new feature matrix $X'$ is the embedding or representation of the nodes after the application of one layer of the network. When needed, we denote as $X_i, X'_i$ the original and transformed feature vectors of the $i$-th node, i.e., the transposes of the $i$-th rows of the matrices $X, X'$. Specifying the map $X \mapsto X'$ is thus sufficient to provide a full definition of the different layers. These transformations are parametric, and they depend on trainable weights that are learned during the optimization of the network. We represent these weights as matrices $W$. Additional terms specific to single layers are defined in the following.
After an arbitrary number $t$ of GNN layers stacked in sequence, the node embedding matrix $X^{(t)}$ is further processed in a way that depends on the task to perform. In node classification settings [49, 41], where the aim is predicting one or more node properties, a Multi-Layer Perceptron (MLP) [38] (with shared parameters across nodes) is applied to each node's embedding independently in order to output its predicted class. In graph classification settings [41] instead, where the goal is predicting a label for the entire graph, a permutation-invariant aggregation function (like mean, max, or sum) is applied over the node embeddings to compress $X^{(t)}$ into a single vector, which is then mapped to the final prediction via a standard MLP.
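As an illustration of the two readout strategies, the sketch below contrasts node-level and graph-level predictions in plain NumPy; the random embeddings and the single linear layer standing in for a trained MLP are assumptions made purely for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_V, d, n_classes = 5, 8, 3
X_t = rng.normal(size=(n_V, d))          # node embeddings after t GNN layers
W_out = rng.normal(size=(d, n_classes))  # stand-in for a trained MLP (one linear layer)

# Node classification: the same MLP is applied to every node embedding independently.
node_logits = X_t @ W_out                # shape (n_V, n_classes), one prediction per node

# Graph classification: permutation-invariant readout (here the mean), then an MLP.
graph_embedding = X_t.mean(axis=0)       # shape (d,), a single vector for the whole graph
graph_logits = graph_embedding @ W_out   # shape (n_classes,), one prediction per graph
```

In a mini-batch setting the readout would be computed per graph, but the single-graph view above suffices to show the difference between the two tasks.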
With this notation settled, we can now fully define the architectures that we are going to consider. In selecting
the architectures to be included in our study, we relied on the comprehensive taxonomy of GNN methods published
by Zhou et al. [135]. Since our goal is to provide an extensive overview of explainability methods for GNNs, we
selected the models to benchmark aiming to cover the different categories of the taxonomy as broadly as possible.
The specific methods are also selected depending on their popularity, their ease of training, their performance on our
benchmark datasets, their code availability and their compatibility with the explainers being investigated. Overall,
we analyzed the following categories: Convolutional, whose computation can be roughly understood as a generalization
of the convolution operation from the image domain. Such convolution can either be Spectral [19, 49], theoretically
grounded in graph signal processing [99], or Spatial [105, 36, 116, 32], where the operations are usually defined in
terms of the graph topology. The Pooling category contains all approaches that aggregate node representations in order
to perform graph-level tasks. They can be further differentiated into Direct [106, 127], where nodes can be aggregated
with different aggregation strategies, often called readout functions, and Hierarchical [118, 13, 120, 52, 11], where
nodes are progressively aggregated in a hierarchical fashion based on their similarity. The latter methods often allow one to
cluster nodes based both on their features and on their topological neighborhood [118, 11]. Despite covering the major
aspects of GNN architectures, the aforementioned taxonomy lacks some of the fundamental works that we
analyze in our study. To compensate for that, we decided to include the Graph Isomorphism
Network (Gin) [116] and the GraphConv Higher Order Network (GraphConv) [71] as Spatial Convolution and
Higher Order respectively, the latter being a new category added to the taxonomy. A summary of such categorization can be
found in Figure 1. In the following, we broadly describe each GNN model used in this work, reporting for each one the category as identified by Zhou et al. [135].
Figure 1: An overview of the adopted GNN architectures structured in a taxonomy as defined by Zhou et al. [135].
In particular, blue boxes represent the categories as defined by the aforementioned work, while the purple box
corresponds to the newly introduced Higher Order category.
Although the following definitions are often given as global functions
of the graph for notational convenience, we remark that all the layers of a GNN can be efficiently implemented as
local operations by means of message passing or sparse matrix multiplications.
Chebyshev Spectral Graph Convolution (Cheb, Spectral Convolution) [19]: The Chebyshev spectral graph
convolutional operator [19] aims to generalize the convolution operation from the image to the graph domain.
In doing so, it approximates the convolution of the node features with a trainable filter, where such
approximation is defined in the Fourier domain by means of a Chebyshev polynomial. Since explicitly computing
the convolution in the Fourier domain is computationally very expensive, Cheb adopts the truncated recursive
Chebyshev expansion [37], where the Chebyshev polynomial $T_k(x)$ of order $k$ can be computed by the recurrence
$T_k(x) = 2x\,T_{k-1}(x) - T_{k-2}(x)$ with $T_1 = 1$, $T_2 = x$. Given a degree $K \in \mathbb{N}$, we thus have:
$$X' = \sum_{k=1}^{K} Z^{(k)} \cdot W^{(k)}, \qquad (1)$$

where $Z^{(k)}$ is computed recursively as:

$$Z^{(1)} = X, \quad Z^{(2)} = \tilde{L} \cdot X, \quad Z^{(k)} = 2\,\tilde{L} \cdot Z^{(k-1)} - Z^{(k-2)}.$$

In our analysis we kept $K = 5$.
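A minimal sketch of the recursion and of Eq. (1), assuming a toy adjacency matrix and random placeholder weights in place of trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)        # toy adjacency matrix
n_V, d, d_out, K = A.shape[0], 8, 16, 5
X = rng.normal(size=(n_V, d))

L = np.diag(A.sum(axis=1)) - A                   # Laplacian
L_tilde = (2.0 / np.linalg.eigvalsh(L).max()) * L - np.eye(n_V)  # scaled Laplacian L~

# Recursion: Z(1) = X, Z(2) = L~ X, Z(k) = 2 L~ Z(k-1) - Z(k-2).
Z = [X, L_tilde @ X]
while len(Z) < K:
    Z.append(2 * L_tilde @ Z[-1] - Z[-2])

# Eq. (1): X' = sum_k Z(k) W(k), with one trainable weight matrix per order k.
W = [rng.normal(size=(d, d_out)) for _ in range(K)]
X_new = sum(Zk @ Wk for Zk, Wk in zip(Z, W))     # shape (n_V, d_out)
```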
Graph Convolutional Network (Gcn, Spectral Convolution) [49]: Gcn [49] represents a first-order approximation of localized spectral filters on graphs [19]. For a single layer, the node representation is computed as:

$$X' = \sigma\big(\tilde{D}^{-1/2} \cdot \tilde{A} \cdot \tilde{D}^{-1/2} \cdot X \cdot W\big), \qquad (2)$$

with $W \in \mathbb{R}^{d \times d'}$, and where $\sigma: \mathbb{R} \to \mathbb{R}$ is a nonlinear activation function that is applied entry-wise. The
normalization applied to $\tilde{A}$ is in place to avoid numerical instabilities after successive applications of the above
propagation rule. It effectively normalizes the node-wise aggregation by a weighted sum where each neighbor is
weighted by the incident edge weight normalized by the degree of the two nodes. The connection with Cheb
introduced above can be seen by keeping the number of convolutions per layer to 1, thus by setting $K = 1$. After
some further simplifications detailed in [49], and here omitted for brevity, we can rewrite the spectral convolution
as $W^{(k)} \cdot (I + \tilde{D}^{-1/2} \cdot \tilde{A} \cdot \tilde{D}^{-1/2}) \cdot x$. After applying the renormalization trick introduced in [49] and by generalizing
the formulation to a node feature matrix $X \in \mathbb{R}^{n_V \times d}$, we get the formulation of Eq. (2).
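The propagation rule of Eq. (2) can be sketched in a few lines of dense NumPy; the toy graph, random features and weights, and the choice of ReLU as the activation are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)             # toy adjacency matrix
n_V, d, d_out = A.shape[0], 8, 16
X = rng.normal(size=(n_V, d))
W = rng.normal(size=(d, d_out))

A_tilde = A + np.eye(n_V)                          # A~ := A + I (self-loops)
D_inv_sqrt = np.diag(A_tilde.sum(axis=1) ** -0.5)  # D~^{-1/2}

# Eq. (2): X' = sigma(D~^{-1/2} A~ D~^{-1/2} X W), with sigma = ReLU here.
X_new = np.maximum(0, D_inv_sqrt @ A_tilde @ D_inv_sqrt @ X @ W)
```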
Graph SAmple and aggreGatE (GraphSage, Spatial Convolution) [36]: This work was proposed as an ex-
tension of Gcn. Contrary to Gcn, which inherently implements a weighted mean neighborhood aggregation,
GraphSAGE generalizes to different kinds of aggregations like mean, max or Long Short-Term Memory (LSTM)
aggregations [36, 40].
$$X'_i = W_1 X_i + W_2 \cdot \mathrm{aggregation}_{j \in N(i)}\, X_j, \qquad (3)$$

where $W_1, W_2 \in \mathbb{R}^{d' \times d}$ are trainable weight matrices and the function aggregation can be any permutation-invariant
function. Note that, in the case of the LSTM aggregation, the authors adapted LSTMs to operate on sets by simply
applying the LSTMs to a random permutation of the nodes. In our study we used the mean of node features as
aggregator.
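A sketch of Eq. (3) with the mean aggregator used in our study; the toy graph and random weights are placeholders, and the explicit loop over nodes is kept for readability:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)           # toy adjacency matrix
n_V, d, d_out = A.shape[0], 8, 16
X = rng.normal(size=(n_V, d))
W1 = rng.normal(size=(d, d_out))
W2 = rng.normal(size=(d, d_out))

X_new = np.zeros((n_V, d_out))
for i in range(n_V):
    neighbors = np.flatnonzero(A[i])             # N(i)
    aggregated = X[neighbors].mean(axis=0)       # mean aggregation over the neighborhood
    # Eq. (3): X'_i = W1 X_i + W2 * aggregation_{j in N(i)} X_j  (row-vector convention)
    X_new[i] = X[i] @ W1 + aggregated @ W2
```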
Graph Isomorphism Network (Gin, Spatial Convolution) [116]: The work of Xu et al. [116] was among the first
to study the expressive power of GNNs in relation to the Weisfeiler-Lehman (WL) test of isomorphism [113].
The key insight is that a GNN can have discriminative power as large as that of the WL test if the GNN's aggregation
scheme is highly expressive and can model injective functions, like the aggregation of the WL test. They studied the
conditions under which a GNN has the same discriminative power as the WL test, and then they proposed the Graph
Isomorphism Network (Gin), which provably satisfies such conditions [116]. The Gin computes node representations
as:

$$X' = h_W\big((A + (1 + \epsilon) \cdot I) \cdot X\big), \qquad (4)$$

where $\epsilon \in \mathbb{R}$ is a small number, and $h_W$ is a neural network whose weights $W$ are the trainable part of the Gin
layer. This model generalizes the WL test and hence achieves maximum discriminative power with respect to it [116]. In our work, to limit the complexity of the network, we restricted $h_W$ to a single layer.
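A dense sketch of the Gin update of Eq. (4), with $h_W$ restricted to a single linear layer followed by a ReLU, mirroring the single-layer choice mentioned above; all concrete values are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)           # toy adjacency matrix
n_V, d, d_out = A.shape[0], 8, 16
X = rng.normal(size=(n_V, d))
W = rng.normal(size=(d, d_out))                  # weights of the one-layer h_W
eps = 0.1                                        # the (small) epsilon of Eq. (4)

# Eq. (4): X' = h_W((A + (1 + eps) I) X), with h_W a single linear layer + ReLU.
X_new = np.maximum(0, (A + (1 + eps) * np.eye(n_V)) @ X @ W)
```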
Graph Attention Network (Gat, Spatial Attentional Convolution) [105]: Following the successes of the attention
mechanism in natural language processing [104, 78, 84] and in computer vision [115, 80], Veličković et al. [105]
proposed a GNN layer which computes each node's representation as a weighted sum of neighborhood features,
where the weights are computed via an attention mechanism. Such a mechanism helps the layer to focus
on the neighbors that are considered important for the current node, instead of treating all neighbors as equally
important [7]. Specifically, the propagation rule for each node can be defined by:
$$X'_i = \sum_{j \in N(i) \cup \{i\}} \alpha_{i,j}\, W \cdot X_j, \qquad (5)$$

where the $\alpha_{i,j}$ are attention coefficients, which are the output of a single-layer feed-forward neural network applied
to $W \cdot X_i$, $W \cdot X_j$. Namely, if $[x \,\|\, y]$ denotes the concatenation of the vectors $x, y$, and if $W' \in \mathbb{R}^{2d' \times 1}$ is a vector
of trainable weights, then $\alpha_{i,j}$ is defined by

$$\alpha_{i,j} := \frac{\exp\big(\mathrm{LeakyReLU}\big((W')^T [W X_i \,\|\, W X_j]\big)\big)}{\sum_{k \in N(i) \cup \{i\}} \exp\big(\mathrm{LeakyReLU}\big((W')^T [W X_i \,\|\, W X_k]\big)\big)}, \qquad (6)$$

where the LeakyReLU usually has a slope of $0.2$ for $x \le 0$. Similarly to previous works on attention [104, 17, 78, 84, 80],
Gat can implement multi-head attention in order to increase the attention's expressive power in modelling different
aspects of node features.
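The single-head attention of Eqs. (5) and (6) can be sketched as follows; W_att plays the role of the trainable vector $W'$, and the toy graph and random weights are assumptions made for the example:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x >= 0, x, slope * x)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)           # toy adjacency matrix
n_V, d, d_out = A.shape[0], 8, 16
X = rng.normal(size=(n_V, d))
W = rng.normal(size=(d, d_out))                  # shared linear transformation W
W_att = rng.normal(size=(2 * d_out,))            # attention weight vector W'

H = X @ W                                        # W X_j for every node j
X_new = np.zeros((n_V, d_out))
for i in range(n_V):
    nbrs = np.concatenate(([i], np.flatnonzero(A[i])))  # N(i) union {i}
    # Eq. (6): scores e_ij = LeakyReLU(W'^T [W X_i || W X_j]), softmax-normalized.
    scores = np.array([leaky_relu(W_att @ np.concatenate([H[i], H[j]])) for j in nbrs])
    alpha = np.exp(scores) / np.exp(scores).sum()
    # Eq. (5): X'_i = sum_j alpha_ij W X_j over the extended neighborhood.
    X_new[i] = (alpha[:, None] * H[nbrs]).sum(axis=0)
```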
MinCutPooling (MinCutPool, Hierarchical Pooling) [11]: Similarly to other graph pooling mechanisms [118,
13, 120, 52], MinCutPool is able to hierarchically aggregate nodes into coarser representations to summarize
local components and remove redundant information. Graph pooling follows a similar idea as pooling in standard
architectures for vision applications [16, 102, 50], where it helps to reduce the memory and computation footprint
of the model. However, the structured graph domain poses new challenges which require ad-hoc techniques to
obtain well-performing pooling methods. MinCutPool [11] achieves this by formulating a relaxation of the
MinCut problem and by training a GNN to compute cluster assignments, jointly optimizing the downstream
supervised loss and an additional unsupervised mincut loss. In particular, a soft cluster assignment matrix $S \in \mathbb{R}^{n_V \times n'_V}$ is computed with an MLP with softmax activation over a refined set of node features (e.g., after one or more
layers of message passing). Then, both the adjacency matrix and the node feature matrix are updated accordingly:

$$X' = S^T \cdot X, \qquad \tilde{A} = S^T \cdot A \cdot S - I_{n'_V}\, \mathrm{diag}(S^T \cdot A \cdot S), \qquad A' = \tilde{D}^{-\frac{1}{2}} \cdot \tilde{A} \cdot \tilde{D}^{-\frac{1}{2}}.$$
We refer the interested reader to [11] for the details of the mathematical formulation of the differentiable MinCut
problem. Since pooling methods modify the graph structure, they are typically not suitable for node classification
and link prediction tasks [118, 13, 120, 52]. We will henceforth use MinCutPool to refer to an architecture
implemented with Gcn layers interleaved with the MinCutPool operator.
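The coarsening step can be sketched as follows: a softmax-activated assignment matrix S pools both node features and adjacency into a smaller graph. The cluster count, the one-layer stand-in for the MLP, and all matrices are illustrative assumptions, and the unsupervised mincut and orthogonality losses of [11] are omitted:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)        # toy adjacency matrix
n_V, d, n_clusters = A.shape[0], 8, 2
X = rng.normal(size=(n_V, d))                    # refined node features

# Soft cluster assignment S (n_V x n'_V): one linear layer + softmax as a stand-in MLP.
W_s = rng.normal(size=(d, n_clusters))
S = softmax(X @ W_s, axis=1)

# Coarsening: pooled features and pooled, self-loop-free, degree-normalized adjacency.
X_pool = S.T @ X                                 # X' = S^T X
A_pool = S.T @ A @ S                             # S^T A S
A_tilde = A_pool - np.eye(n_clusters) * np.diag(A_pool)        # remove self-loops
deg = A_tilde.sum(axis=1)
A_new = np.diag(deg ** -0.5) @ A_tilde @ np.diag(deg ** -0.5)  # degree normalization
```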