Hence, we propose the Tree Mover’s Distance (TMD), a pseudometric on attributed graphs that
considers both the tree structure and local distribution of attributes. It achieves this via a hierarchical
optimal transport problem that defines distances between trees. First, we observe that TMD
captures properties that relate graphs to their labels: a simple SVM
based on TMD performs competitively with standard GNNs and graph kernels on graph classification
benchmarks. Second, we relate TMD to the performance of GNNs under input perturbations. We
determine a Lipschitz constant of GNNs with respect to TMD, which enables a bound on their target
risk under domain shifts, i.e., distribution shifts between training and test data. This bound uses the
metric in two ways: to measure the distribution shift, and to measure perturbation robustness via the
Lipschitz constant. Empirically, we observe that the TMD correlates well with the performance of
GNNs under distribution shifts, including in comparison to other distances. We hence hope that this work
inspires further empirical and theoretical study of the performance of
GNNs under distribution shifts.
In short, this work makes the following contributions:
• We propose a new graph metric via hierarchical optimal transport between computation trees of graphs, which, used in an SVM, yields a competitive graph learning method;
• We bound the Lipschitz constant of message-passing GNNs with respect to TMD, which allows us to quantify stability and generalization;
• We develop a generalization bound for GNNs under distribution shifts that correlates well with empirical behavior.
2 Related Works
Graph Metrics and Kernels
Measuring distances between graphs has been a long-standing goal
in data analysis. However, proper metrics that distinguish non-isomorphic graphs in polynomial time
are not known. For instance, Vayer et al. [49] propose a graph metric by fusing the Wasserstein distance [50] and the Gromov-Wasserstein distance [32]. Similar to classic graph metrics [9, 40], the proposed metric requires approximation. Closely related, graph kernels [52, 7] have gained attention. Most graph kernels lie in the framework of R-convolution [21] and measure similarity by comparing substructures. Many R-convolution kernels have limited expressive power and sometimes struggle to handle continuously attributed graphs [46]. In comparison, TMD is as powerful as the WL graph isomorphism test [53] while accommodating graphs with high-dimensional node attributes. Importantly, TMD captures the stability and generalization of message-passing GNNs [27, 55].
Stability and Generalization of Graph Neural Networks
A number of existing works study stability of GNNs to input perturbations. Spectral GNNs are known to be stable to certain perturbations, e.g., in graph size, if the overall structure is preserved [19, 25, 26]. For message-passing GNNs, Yehudai et al. [57] study perturbations of graph size and demonstrate the importance of local computation trees. Xu et al. [54] study how the out-of-distribution behavior of aggregation functions may affect the GNN's prediction. These studies motivate including both computation trees and inputs to aggregation functions in the TMD. Finally, a number of works study within-distribution generalization [15, 20, 28, 42].
3 Background on Optimal Transport
We begin with a brief introduction to Optimal Transport (OT) and earth mover’s distance. The
earth mover’s distance, also known as Wasserstein distance, is a distance function defined via the
transportation cost between two distributions. Let $X = \{x_i\}_{i=1}^{m}$ and $Y = \{y_j\}_{j=1}^{m}$ be two multisets of $m$ elements each. Let $C \in \mathbb{R}^{m \times m}$ be the transportation cost for each pair: $C_{ij} = d(x_i, y_j)$, where $d$ is the distance between $x_i$ and $y_j$. The earth mover's distance solves the following OT problem:
$$\mathrm{OT}^*_d(X, Y) := \min_{\gamma \in \Gamma(X, Y)} \langle C, \gamma \rangle / m, \qquad \Gamma(X, Y) = \{\gamma \in \mathbb{R}^{m \times m}_{+} \mid \gamma \mathbf{1}_m = \gamma^\top \mathbf{1}_m = \mathbf{1}_m\}, \qquad (1)$$
where $\Gamma$ is the set of transportation plans that satisfy the flow constraint $\gamma \mathbf{1}_m = \gamma^\top \mathbf{1}_m = \mathbf{1}_m$. In this work, we adopt the unnormalized version of the earth mover's distance:
$$\mathrm{OT}_d(X, Y) := \min_{\gamma \in \Gamma(X, Y)} \langle C, \gamma \rangle = m \cdot \mathrm{OT}^*_d(X, Y). \qquad (2)$$
Compared to classic OT, unnormalized OT preserves the size information of the multisets $X$ and $Y$.
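To make Eqs. (1) and (2) concrete, the following is a minimal sketch (not the authors' implementation) of the unnormalized earth mover's distance for two equal-size multisets of vectors, assuming the Euclidean distance as the ground metric $d$. Because the OT linear program over doubly stochastic plans attains its minimum at a permutation matrix (Birkhoff's theorem), the Hungarian algorithm recovers the optimum exactly here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def unnormalized_ot(X, Y):
    """OT_d(X, Y) of Eq. (2): min_gamma <C, gamma>, with d Euclidean.

    X, Y: arrays of shape (m, k), the two multisets of m vectors each.
    """
    X, Y = np.atleast_2d(X), np.atleast_2d(Y)
    assert len(X) == len(Y), "multisets must have equal size m"
    # Cost matrix C_ij = d(x_i, y_j), as in Eq. (1).
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    # The LP over doubly stochastic plans is minimized at a
    # permutation matrix, so an optimal assignment suffices.
    rows, cols = linear_sum_assignment(C)
    return C[rows, cols].sum()  # equals m * OT*_d(X, Y)

X = np.array([[0.0], [2.0]])
Y = np.array([[1.0], [5.0]])
print(unnormalized_ot(X, Y))  # matches 0->1 and 2->5: cost 1 + 3 = 4.0
```

Dividing the result by $m$ recovers the normalized distance $\mathrm{OT}^*_d$ of Eq. (1).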