
Symbol | Description
$G := (V, E, X)$ | A graph.
$V := \{1, \dots, n_V\}$ | The set of $n_V \in \mathbb{N}$ nodes of a graph.
$E \subset V \times V$ | The set of $n_E \in \mathbb{N}$ edges of a graph.
$X \in \mathbb{R}^{n_V \times d}$, $X_i \in \mathbb{R}^d$ | The matrix of $d$-dimensional node features, and the feature vector of the node $i \in V$.
$I \in \mathbb{R}^{n_V \times n_V}$ | The identity matrix.
$A, \tilde{A} \in \mathbb{R}^{n_V \times n_V}$ | The adjacency and normalized adjacency matrix of a graph.
$D, \tilde{D} \in \mathbb{R}^{n_V \times n_V}$ | The degree and normalized degree matrix of a graph.
$L, \tilde{L} \in \mathbb{R}^{n_V \times n_V}$ | The Laplacian and normalized Laplacian matrix of a graph.
$N(i) \subset V$ | The first order neighborhood of the node $i \in V$.
$W \in \mathbb{R}^{d \times d'}$ | The trainable weights of a layer.

Table 2: List of the mathematical symbols used throughout the paper, and their meaning. A few symbols used only in specific settings are omitted and defined in the text.
$G$, $I$ is the $n_V$-dimensional identity matrix, $\tilde{A} := A + I$, $\tilde{D}$ is its diagonal degree matrix, and $\tilde{L} := \frac{2}{\lambda_{\max}(L)} L - I$ is the scaled and normalized Laplacian, where $\lambda_{\max}(L)$ is the largest eigenvalue of $L$. Furthermore, $N(i) := \{j \in V : (i, j) \in E\}$ is the first order neighborhood of the node $i \in V$.
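To make the notation concrete, the following is a minimal sketch (assuming NumPy; the edge list and node count are arbitrary illustrative choices) that builds the matrices defined above for a small undirected graph.

```python
import numpy as np

# Toy undirected graph with n_V = 4 nodes; the edge list E is illustrative.
n_v = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]

A = np.zeros((n_v, n_v))                  # adjacency matrix A
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

I = np.eye(n_v)                           # identity matrix I
D = np.diag(A.sum(axis=1))                # degree matrix D
L = D - A                                 # Laplacian matrix L

A_tilde = A + I                           # adjacency with self-loops, ~A := A + I
D_tilde = np.diag(A_tilde.sum(axis=1))    # its diagonal degree matrix ~D

# Scaled and normalized Laplacian ~L := (2 / lambda_max(L)) L - I.
lambda_max = np.linalg.eigvalsh(L).max()
L_tilde = (2.0 / lambda_max) * L - I

# First order neighborhood N(i) of the node i.
def neighborhood(i):
    return {j for j in range(n_v) if A[i, j] > 0}
```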
Each GNN layer takes as input the graph $G$, and maps the node features $X \in \mathbb{R}^{n_V \times d}$ to updated node features $X' \in \mathbb{R}^{n_V \times d'}$ for a given $d' \in \mathbb{N}$. Some specific GNN layers, like Hierarchical Pooling layers [135, 118, 52, 13], instead of refining the embedding of each input node, aggregate nodes in order to coarsen the graph, similarly to the pooling methods used in vision models [16, 102, 50], thus resulting in a node feature matrix $X' \in \mathbb{R}^{n' \times d'}$ where $n' < n_V$. Overall, this new feature matrix $X'$ is the embedding or representation of the nodes after the application of one layer of the network. When needed, we denote by $X_i$, $X'_i$ the original and transformed feature vectors of the $i$-th node, i.e., the transpose of the $i$-th row of the matrices $X$, $X'$. Specifying the map $X \to X'$ is thus sufficient to provide a full definition of the different layers. These transformations are parametric, and they depend on trainable weights that are learned during the optimization of the network. We represent these weights as matrices $W$. Additional terms specific to single layers are defined in the following.
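To illustrate the $X \to X'$ mapping performed by a single layer, the sketch below (assuming PyTorch) implements a generic message-passing layer; the mean aggregation and the ReLU nonlinearity are illustrative choices and do not correspond to any specific model analyzed in this work.

```python
import torch
import torch.nn as nn

class SimpleGNNLayer(nn.Module):
    """Maps X in R^{n_V x d} to X' in R^{n_V x d'} given the adjacency matrix."""

    def __init__(self, d: int, d_prime: int):
        super().__init__()
        self.W = nn.Linear(d, d_prime, bias=False)  # trainable weights W in R^{d x d'}

    def forward(self, X: torch.Tensor, A_tilde: torch.Tensor) -> torch.Tensor:
        # A_tilde: (n_V, n_V) adjacency with self-loops, X: (n_V, d).
        deg = A_tilde.sum(dim=1, keepdim=True).clamp(min=1.0)
        X_agg = (A_tilde @ X) / deg                 # mean over each node's neighborhood
        return torch.relu(self.W(X_agg))            # X': (n_V, d')
```

Stacking such layers, and choosing how the neighborhood information is aggregated and transformed, is what differentiates the architectures discussed below.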
After an arbitrary number $t$ of GNN layers stacked in sequence, the node embedding matrix $X^{(t)}$ is further processed in a way that depends on the task to perform. In node classification settings [49, 41], where the aim is predicting one or more node properties, a Multi-Layer Perceptron (MLP) [38] (with shared parameters across nodes) is applied to each node's embedding independently in order to output its predicted class. In graph classification settings [41] instead, where the goal is predicting a label for the entire graph, a permutation invariant aggregation function (like mean, max, or sum) is applied over the node embeddings to compress $X^{(t)}$ into a single vector, which is then mapped to the final prediction via a standard MLP.
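The two task-dependent heads can be sketched as follows (assuming PyTorch; the embedding size, hidden size, and number of classes are illustrative):

```python
import torch
import torch.nn as nn

n_v, d_t, n_classes = 10, 64, 3
X_t = torch.randn(n_v, d_t)   # node embedding matrix X^(t) after t GNN layers

# Node classification: an MLP with parameters shared across nodes is applied
# to each node's embedding independently.
node_head = nn.Sequential(nn.Linear(d_t, 32), nn.ReLU(), nn.Linear(32, n_classes))
node_logits = node_head(X_t)                  # (n_V, n_classes): one prediction per node

# Graph classification: a permutation invariant readout (here: mean) compresses
# X^(t) into a single vector, which a standard MLP maps to the graph label.
graph_head = nn.Sequential(nn.Linear(d_t, 32), nn.ReLU(), nn.Linear(32, n_classes))
graph_logits = graph_head(X_t.mean(dim=0))    # (n_classes,): one prediction per graph
```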
With this notation settled, we can now fully define the architectures that we are going to consider. In selecting
the architectures to be included in our study, we relied on the comprehensive taxonomy of GNN methods published
by Zhou et al. [135]. Since our goal is to provide an extensive overview of explainability methods for GNNs, we selected the models to benchmark aiming to cover the different categories of the taxonomy as broadly as possible. The specific methods were also selected based on their popularity, their ease of training, their performance on our benchmark datasets, their code availability, and their compatibility with the explainers being investigated. Overall, we analyzed the following categories: Convolutional, whose computation can be roughly understood as a generalization of the convolution operation on the image domain. Such a convolution can either be Spectral [19, 49], theoretically grounded in graph signal processing [99], or Spatial [105, 36, 116, 32], where the operations are usually defined in terms of the graph topology; the Pooling category contains all approaches that aggregate node representations in order to perform graph-level tasks. They can be further differentiated into Direct [106, 127], where nodes can be aggregated with different aggregation strategies, often called readout functions, and Hierarchical [118, 13, 120, 52, 11], where nodes are progressively and hierarchically aggregated based on their similarity. The latter methods often allow one to cluster nodes based both on their features and on their topological neighborhood [118, 11]. Despite covering the major aspects of GNN architectures, the aforementioned taxonomy lacks some of the fundamental works that we will analyze in our study. In particular, to compensate for that, we decided to include the Graph Isomorphism Network (Gin) [116] and the GraphConv Higher Order Network (GraphConv) [71] as Spatial Convolution and Higher Order methods, respectively, the latter being a new category added to the taxonomy. A summary of this categorization can be found in Figure 1. In the following, we broadly describe each GNN model used in this work, reporting for each one