Tree Mover’s Distance: Bridging Graph Metrics and
Stability of Graph Neural Networks
Ching-Yao Chuang
MIT CSAIL
cychuang@mit.edu
Stefanie Jegelka
MIT CSAIL
stefje@mit.edu
Abstract
Understanding generalization and robustness of machine learning models fundamentally relies on assuming an appropriate metric on the data space. Identifying such a metric is particularly challenging for non-Euclidean data such as graphs. Here, we propose a pseudometric for attributed graphs, the Tree Mover's Distance (TMD), and study its relation to generalization. Via a hierarchical optimal transport problem, TMD reflects the local distribution of node attributes as well as the distribution of local computation trees, which are known to be decisive for the learning behavior of graph neural networks (GNNs). First, we show that TMD captures properties relevant to graph classification: a simple TMD-SVM performs competitively with standard GNNs. Second, we relate TMD to generalization of GNNs under distribution shifts, and show that it correlates well with performance drop under such shifts. The code is available at https://github.com/chingyaoc/TMD.
1 Introduction
Understanding generalization under distribution shifts – theoretically and empirically – relies on an appropriate measure of divergence between data distributions. This, in turn, typically demands a metric on the data space that indicates what kinds of data points are "close" to the training data. While such metrics may be readily available for Euclidean spaces, they are more challenging to determine for non-Euclidean data spaces such as graphs with attributes in $\mathbb{R}^p$, which underlie graph learning methods. In this work, we study the question of a suitable metric for message passing graph neural networks (GNNs).
An "ideal" metric for studying input perturbations captures the invariances and inductive biases of the model we are examining. For instance, since graph isomorphism is a difficult problem [18], this metric is expected to be a pseudometric, i.e., it will fail to distinguish certain graphs. These "failures" should be aligned with the GNNs' invariances. Moreover, several recent works highlight the importance of local structures – computation trees resulting from unrolling the message passing process – for the approximation power of GNNs and their inability to distinguish certain pairs of graphs [2, 20, 34, 55]. Works on out-of-distribution generalization of GNNs mostly focus on specific types of instances where a trained model may fail miserably [58, 54], without specifying the behavior for gradual shifts in distributions. Yehudai et al. [58] show that a sufficiently large, unrestricted GNN may predict arbitrarily on previously unseen computation trees. In practice, we may observe a more gradual change, depending on the magnitude of change of the tree, in terms of both structure and node attributes, and the capacity of the aggregation function. In summary, we desire a metric that reflects the structure of computation trees and the distribution of node attributes within trees. Many distances and kernels between graphs have been proposed [7], several based on local structures [21], and some using structure and attributes [44]. Closest to our ideal is the Wasserstein Weisfeiler-Leman pseudometric [46], which computes an optimal transport distance between node embeddings of two graphs. The embeddings are computed via message passing, which aligns with the computation of GNNs, but loses structural information within trees.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.01906v1 [cs.LG] 4 Oct 2022
Hence, we propose the Tree Mover’s Distance (TMD), a pseudometric on attributed graphs that
considers both the tree structure and local distribution of attributes. It achieves this via a hierarchical
optimal transport problem that defines distances between trees. First, we observe that TMD captures properties that relate graphs to their labels: a simple SVM based on TMD performs competitively with standard GNNs and graph kernels on graph classification benchmarks. Second, we relate TMD to the performance of GNNs under input perturbations. We
determine a Lipschitz constant of GNNs with respect to TMD, which enables a bound on their target
risk under domain shifts, i.e., distribution shifts between training and test data. This bound uses the
metric in two ways: to measure the distribution shift, and to measure perturbation robustness via the
Lipschitz constant. Empirically, we observe that TMD correlates well with the performance of GNNs under distribution shifts, also in comparison to other distances. We hence hope that this work inspires further empirical and theoretical work toward a tighter understanding of the performance of GNNs under distribution shifts.
In short, this work makes the following contributions:
- We propose a new graph metric via hierarchical optimal transport between computation trees of graphs, which, in an SVM, leads to a competitive graph learning method;
- We bound the Lipschitz constant of message-passing GNNs with respect to TMD, which allows us to quantify stability and generalization;
- We develop a generalization bound for GNNs under distribution shifts that correlates well with empirical behavior.
2 Related Works
Graph Metrics and Kernels. Measuring distances between graphs has been a long-standing goal in data analysis. However, proper metrics that distinguish non-isomorphic graphs in polynomial time are not known. For instance, Vayer et al. [49] propose a graph metric by fusing Wasserstein distance [50] and Gromov-Wasserstein distance [32]. Similar to classic graph metrics [9, 40], the proposed metric requires approximation. Closely related, graph kernels [52, 7] have gained attention. Most graph kernels lie in the framework of $\mathcal{R}$-convolution [21], and measure similarity by comparing substructures. Many $\mathcal{R}$-convolution kernels have limited expressive power and sometimes struggle to handle continuously attributed graphs [46]. In comparison, TMD is as powerful as the WL graph isomorphism test [53] while accommodating graphs with high dimensional node attributes. Importantly, TMD captures the stability and generalization of message-passing GNNs [27, 55].
Stability and Generalization of Graph Neural Networks. A number of existing works study stability of GNNs to input perturbations. Spectral GNNs are known to be stable to certain perturbations, e.g., size, if the overall structure is preserved [19, 25, 26]. For message passing GNNs, Yehudai et al. [57] study perturbations of graph size, and demonstrate the importance of local computation trees. Xu et al. [54] study how the out-of-distribution behavior of aggregation functions may affect the GNN's prediction. These studies motivate including both computation trees and inputs to aggregation functions in the TMD. Finally, a number of works study within-distribution generalization [15, 20, 28, 42].
3 Background on Optimal Transport
We begin with a brief introduction to optimal transport (OT) and the earth mover's distance. The earth mover's distance, also known as the Wasserstein distance, is a distance function defined via the transportation cost between two distributions. Let $X = \{x_i\}_{i=1}^m$ and $Y = \{y_j\}_{j=1}^m$ be two multisets of $m$ elements each. Let $C \in \mathbb{R}^{m \times m}$ be the transportation cost for each pair: $C_{ij} = d(x_i, y_j)$, where $d$ is the distance between $x_i$ and $y_j$. The earth mover's distance solves the following OT problem:

$$\overline{\mathrm{OT}}_d(X, Y) := \min_{\gamma \in \Gamma(X, Y)} \langle C, \gamma \rangle / m, \qquad \Gamma(X, Y) = \{\gamma \in \mathbb{R}^{m \times m}_+ \mid \gamma 1_m = \gamma^\top 1_m = 1_m\}, \tag{1}$$

where $\Gamma$ is the set of transportation plans that satisfy the flow constraint $\gamma 1_m = \gamma^\top 1_m = 1_m$. In this work, we adopt the unnormalized version of the earth mover's distance:

$$\mathrm{OT}_d(X, Y) := \min_{\gamma \in \Gamma(X, Y)} \langle C, \gamma \rangle = m \cdot \overline{\mathrm{OT}}_d(X, Y). \tag{2}$$

Compared to classic OT, unnormalized OT preserves the size information of the multisets $X$ and $Y$.
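For uniform marginals over equal-size multisets, the vertices of the transportation polytope $\Gamma(X, Y)$ are permutation matrices (Birkhoff–von Neumann), so the unnormalized cost in Eq. (2) can be computed as a minimum-cost assignment. A minimal sketch of this reduction (our own helper for illustration; it is not the paper's implementation, whose experiments use the POT library):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def unnormalized_ot(X, Y, d=lambda x, y: np.linalg.norm(x - y)):
    """Unnormalized OT cost of Eq. (2) between two equal-size multisets.

    With uniform marginals 1_m on both sides, an optimal plan is a
    permutation matrix, so the linear program reduces to a
    minimum-cost assignment.
    """
    m = len(X)
    assert len(Y) == m, "multisets must have equal cardinality"
    C = np.array([[d(x, y) for y in Y] for x in X])  # cost matrix C_ij
    rows, cols = linear_sum_assignment(C)            # optimal permutation
    return C[rows, cols].sum()                       # <C, gamma> at the optimum

# identical multisets transport at zero cost
X = [np.array([0.0]), np.array([1.0])]
print(unnormalized_ot(X, X))          # 0.0
# shifting both points by 1 costs 2 in total (mass is not normalized)
Y = [np.array([1.0]), np.array([2.0])]
print(unnormalized_ot(X, Y))          # 2.0
```

Dividing the result by `m` recovers the normalized cost of Eq. (1).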
Figure 1: Tree Distance via Hierarchical OT. The weight $w(\cdot)$ is set to 1 to simplify visualization. The distance between two trees can be decomposed into (a) the distance between roots and (b) the OT cost between subtrees, where the cost function of OT is again the distance between two trees. This formulates a hierarchical OT problem, where solving the OT between trees requires solving the OT between subtrees.
4 Tree Mover’s Distance: Optimal Transport on Graphs
Next, we introduce the tree mover's distance, a new distance for graphs. Let $G = (V, E)$ denote a graph with a set of nodes $V$ and edges $E$. The graph may have node features $x_v \in \mathbb{R}^p$ for $v \in V$. If not, we simply set the node feature to a scalar $x_v = 1$ for all $v \in V$.
Local structures of graphs are characterized by computation trees [23]. In particular, computation trees are constructed by connecting adjacent nodes recursively.

Definition 1 (Computation Trees). Given a graph $G = (V, E)$, let $T^1_v = v$, and let $T^L_v$ be the depth-$L$ computation tree of node $v$, constructed by connecting the neighbors of the leaf nodes of $T^{L-1}_v$ to the tree. The multiset of depth-$L$ computation trees defined by $G$ is denoted by $\mathcal{T}^L_G := \{T^L_v\}_{v \in V}$.
Figure 2 illustrates an example of constructing computation trees. Computation trees, also referred to as subtree patterns, have been central in graph analysis [38, 53] and graph kernels [39, 44]. Intuitively, the computation tree of a node encodes local structure by appending the neighbors to the tree in each level. If the depth-$L$ computation trees are the same for two nodes, they share similar neighborhoods up to $L$ steps away. Therefore, an intuitive way to compare two graphs is by measuring the difference of their nodes' computation trees [44, 53]. In this work, we adopt optimal transport, a natural way to compute the distance between two sets of objects with, importantly, an underlying geometry. We will begin by defining the transportation cost between two computation trees. The cost then gives rise to the tree mover's distance, an extension of the earth mover's distance to multisets of trees.
4.1 Distance between Trees via Hierarchical OT
Let $T = (V, E, r)$ denote a rooted tree. We further let $\mathcal{T}_v$ be the multiset of computation trees of the node $v$, which consists of the trees rooted at the descendants of $v$. Determining whether two trees are similar requires iteratively examining whether the subtrees in each level are similar. For instance, two trees $T_a$ and $T_b$ with roots $r_a$ and $r_b$ are the same if $x_{r_a} = x_{r_b}$ and $\mathcal{T}_{r_a} = \mathcal{T}_{r_b}$. This motivates us to define the distance between two trees by recursively computing the optimal transportation cost between their subtrees. Nevertheless, the number of subtrees could be different for $r_a$ and $r_b$, i.e., $|\mathcal{T}_{r_a}| \neq |\mathcal{T}_{r_b}|$ (see Figure 1, left). To compute the OT between sets with different sizes, unbalanced OT [11, 41] or partial OT [10] are usually adopted. Inspired by [10], we augment the smaller set with blank trees.

Definition 2 (Blank Tree). A blank tree $T_0$ is a tree (graph) that contains a single node and no edge, where the node feature is the zero vector $0_p \in \mathbb{R}^p$, and $\mathcal{T}^n_0$ denotes a multiset of $n$ blank trees.

Definition 3 (Blank Tree Augmentation). Given two multisets of trees $\mathcal{T}_u, \mathcal{T}_v$, define $\rho$ to be a function that augments a pair of multisets with blank trees as follows:

$$\rho : (\mathcal{T}_v, \mathcal{T}_u) \mapsto \left( \mathcal{T}_v \cup \mathcal{T}^{\max(|\mathcal{T}_u| - |\mathcal{T}_v|,\, 0)}_0,\; \mathcal{T}_u \cup \mathcal{T}^{\max(|\mathcal{T}_v| - |\mathcal{T}_u|,\, 0)}_0 \right).$$

If $|\mathcal{T}_v| < |\mathcal{T}_u|$, $\rho$ augments $\mathcal{T}_v$ with $|\mathcal{T}_u| - |\mathcal{T}_v|$ blank trees to make the two multisets contain the same number of trees, and hence allows us to define a transportation cost between two multisets of trees with different sizes. In particular, the transportation costs of the additional trees are simply the distances to the blank trees. The distance between a tree and a blank tree can be interpreted as the norm of the tree. In our case, the blank tree can be viewed as the origin, as it is a tree with the simplest structure and a zero feature vector. Equipped with $\rho$, we define the distance between two rooted trees as follows.
Definition 4 (Tree Distance). The distance between two trees $T_a, T_b$ is defined recursively as

$$\mathrm{TD}_w(T_a, T_b) := \begin{cases} \|x_{r_a} - x_{r_b}\| + w(L) \cdot \mathrm{OT}_{\mathrm{TD}_w}(\rho(\mathcal{T}_{r_a}, \mathcal{T}_{r_b})) & \text{if } L > 1, \\ \|x_{r_a} - x_{r_b}\| & \text{otherwise}, \end{cases}$$

where $L = \max(\mathrm{Depth}(T_a), \mathrm{Depth}(T_b))$ and $w : \mathbb{N} \to \mathbb{R}_+$ is a depth-dependent weighting function. Here, $\mathrm{OT}_{\mathrm{TD}_w}$ is the OT distance defined in (2) with $\mathrm{TD}_w$ as the metric. Figure 1 gives an illustration of computing tree distances. The tree distance $\mathrm{TD}_w(T_a, T_b)$ aims to optimally align two trees $T_a$ and $T_b$ by recursively comparing their roots and subtrees. Calculating $\mathrm{TD}_w(T_a, T_b)$ requires calculating the OT between augmented subtrees $\rho(\mathcal{T}_{r_a}, \mathcal{T}_{r_b})$, where the cost function of OT is again $\mathrm{TD}_w$. This formulates a hierarchical optimal transport problem: the distance between two trees is defined via the distances of subtrees, where the importance of each level is determined by the weight $w(\cdot)$. Increasing $w(\cdot)$ upweights the effect of nodes in the subtrees. While the weights may be arbitrary, we found that for many applications, using a single weight for all depths yields good empirical performance, as Section 4.3 shows. The role of the weights becomes more significant when we use TMD to bound the stability of GNNs.
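A minimal recursive sketch of $\mathrm{TD}_w$, assuming a single weight $w$ for all depths and our own `(root_feature, [subtrees])` tuple representation. Since $\rho$ equalizes the multiset sizes and the marginals are uniform, the inner exact OT reduces to a minimum-cost assignment:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# a tree is (root_feature, [subtrees]); the blank tree T_0 has a zero
# feature and no children (the representation is ours, not the paper's)
def blank_tree(p):
    return (np.zeros(p), [])

def tree_depth(T):
    return 1 + max((tree_depth(c) for c in T[1]), default=0)

def tree_distance(Ta, Tb, w=0.5):
    """TD_w of Definition 4 via hierarchical OT (single weight w)."""
    xa, ca = Ta
    xb, cb = Tb
    root = np.linalg.norm(xa - xb)
    if max(tree_depth(Ta), tree_depth(Tb)) <= 1:
        return root
    # blank-tree augmentation rho: pad the smaller multiset of subtrees
    p = len(xa)
    m = max(len(ca), len(cb))
    ca = ca + [blank_tree(p)] * (m - len(ca))
    cb = cb + [blank_tree(p)] * (m - len(cb))
    # unnormalized OT between subtrees; the cost is TD_w itself (recursion)
    C = np.array([[tree_distance(s, t, w) for t in cb] for s in ca])
    r, c = linear_sum_assignment(C)
    return root + w * C[r, c].sum()

# two single-edge trees whose leaf features differ by 1
Ta = (np.array([0.0]), [(np.array([1.0]), [])])
Tb = (np.array([0.0]), [(np.array([2.0]), [])])
print(tree_distance(Ta, Tb, w=0.5))   # 0 + 0.5 * |1 - 2| = 0.5
```

A depth-dependent $w(L)$ would simply index a weight list by the current level instead of reusing a scalar.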
4.2 From Tree Distance to Graph Distance
Next, we extend the distance between trees to a distance between graphs. By leveraging the tree distance $\mathrm{TD}_w(\cdot, \cdot)$, we introduce the tree mover's distance (TMD), a distance for graphs, by calculating the optimal transportation cost between the graphs' computation trees.

Definition 5 (Tree Mover's Distance). Given two graphs $G_a, G_b$ and $w, L \geq 0$, the tree mover's distance between $G_a$ and $G_b$ is defined as

$$\mathrm{TMD}^L_w(G_a, G_b) = \mathrm{OT}_{\mathrm{TD}_w}(\rho(\mathcal{T}^L_{G_a}, \mathcal{T}^L_{G_b})),$$

where $\mathcal{T}^L_{G_a}$ and $\mathcal{T}^L_{G_b}$ are the multisets of depth-$L$ computation trees of graphs $G_a$ and $G_b$, respectively.

Figure 2 illustrates the computation of TMD. Intuitively, TMD is the minimum cost required to transport node-wise computation trees from one graph to another. The blank tree augmentation $\rho$ is again adopted to handle graphs with different numbers of nodes. The next theorem shows that TMD is a pseudometric on attributed graphs.

Theorem 6 (Pseudometric). The tree mover's distance $\mathrm{TMD}^L_w$ is a pseudometric for finite $L > 0$.
In particular, the tree mover's distance satisfies (1) $\mathrm{TMD}^L_w(G_a, G_a) = 0$, (2) $\mathrm{TMD}^L_w(G_a, G_b) = \mathrm{TMD}^L_w(G_b, G_a)$, and (3) $\mathrm{TMD}^L_w(G_a, G_b) \leq \mathrm{TMD}^L_w(G_a, G_c) + \mathrm{TMD}^L_w(G_c, G_b)$ for any graphs $G_a, G_b, G_c$. However, in some cases, the distance $\mathrm{TMD}^L_w(G_a, G_b)$ can be zero even if $G_a \neq G_b$. This is reasonable, as graph isomorphism is not known to be decidable in polynomial time. Nevertheless, TMD can provably distinguish graphs that are identifiable by the (1-dimensional) Weisfeiler-Leman graph isomorphism test [53].
Theorem 7 (Discriminative Power of TMD). If two graphs $G_a$, $G_b$ are determined to be non-isomorphic in WL iteration $L$ and $w(l) > 0$ for all $0 < l \leq L + 1$, then $\mathrm{TMD}^{L+1}_w(G_a, G_b) > 0$.
The unnormalized OT and blank tree augmentation are essential to prove Theorem 7. The tree mover's distance can be computed exactly by solving optimal transport. In addition, TMD remains highly expressive on graphs with high dimensional continuous attributes, where most $\mathcal{R}$-convolution graph kernels struggle [46]. The discriminative power of TMD can be further strengthened by augmenting node attributes, e.g., with positional encodings [16, 29].

The OT cost between node representations in different graphs is reminiscent of the recently proposed Wasserstein WL (WWL) [46]. WWL uses the distance of node embeddings as a ground metric, where the node embeddings are computed via $L$ iterations of message passing with average-aggregation; in contrast, TMD computes a distance that aligns the trees, retaining more structural information. Even though an aggregation with nonlinearities can retain tree isomorphism information [55], similar to the hashing applied in the WL test (WWL uses only linear aggregations), the hierarchical OT is a more direct, graded distance measure on trees. Vayer et al. [49] define a metric that uses a Gromov-Wasserstein distance [32] between nodes, but need to approximate the GW computation.
Figure 2: Illustration of Computation Trees and Tree Mover's Distance. The computation trees of nodes are constructed by iteratively connecting the neighbors to the trees, and each graph defines a multiset of node-wise computation trees. The tree mover's distance is then defined as the optimal transport cost between the computation trees of two graphs.
MUTAG PTC PROTEINS NCI1 NCI109 BZR COX2
TMD L=1 89.4±5.5 65.3±5.8 73.9±2.8 68.3±2.0 69.5±1.6 83.8±7.2 77.8±5.0
TMD L=2 90.0±5.7 67.4±7.7 74.8±2.8 80.8±1.8 78.9±2.3 84.5±6.9 79.1±5.2
TMD L=3 91.1±5.4 68.5±6.1 74.6±2.6 83.3±1.1 82.3±2.5 85.5±6.2 78.5±5.9
TMD L=4 92.2±6.0 66.5±7.1 75.2±2.3 84.8±1.2 82.8±2.1 84.5±6.4 76.1±6.1
WWL [46] 87.3±1.5 66.3±1.2 74.3±0.6 86.1±0.3 - 84.4±2.0 78.3±0.5
FGW [49] 88.4±5.6 65.3±7.9 74.5±2.7 86.4±1.6 - 85.1±4.2 77.2±4.9
WL [44] 90.4±5.7 59.9±4.3 75.0±3.1 86.0±1.8 82.46±0.2 N/A N/A
R&G [39] 85.7±0.4 58.5±0.9 70.7±0.4 61.9±0.3 61.7±0.2 N/A N/A
GIN [55] 89.4±5.6 64.6±7.0 76.2±2.8 82.7±1.7 82.2±1.6 83.5±6.0 79.0±5.3
GCN [27] 85.6±5.8 64.2±4.3 76.0±3.2 80.2±2.0 - 84.6±5.9 77.1±4.7
Table 1: Classification on TU Datasets. TMD outperforms or matches the state-of-the-art graph kernels and GNNs. Note that WL [44] and R&G [39] are not applicable to continuously attributed graphs such as BZR and COX2.
Numerical Computation with Dynamic Programming. One can numerically compute TMD with dynamic programming. Starting with pairwise distances ($\mathrm{TMD}^{L=1}_w$) between node features across the two graphs, we then iteratively compute $\mathrm{TMD}^{L=k}_w$ between depth-$k$ computation trees from $k = 2$ to $L$ according to Definition 4. Let $D$ be the maximum degree of a node in the two graphs and $\tau(m)$ be the complexity of computing OT between sets of cardinality $m$. At each level, we have to perform OT on sets containing at most $D$ elements for $N$ nodes. Including the final OT between the nodes of the graphs, the overall time complexity of computing $\mathrm{TMD}^L_w$ is $O(\tau(N) + LN\tau(D))$. The time complexity of exact OT via linear programming is $\tau(m) = O(m^3 \log m)$ [17]. One can use faster approximations of OT, e.g., with near-linear time complexity [1], but we use exact OT throughout all experiments, implemented with the POT library [17].
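The pipeline of Definitions 1, 4, and 5 can be sketched end to end. This is an illustrative implementation under our own tuple tree representation; for clarity it recomputes subtree distances recursively rather than memoizing them level by level as the dynamic program above describes, and it solves the exact OT as a minimum-cost assignment rather than a general linear program:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ctree(adj, feats, v, L):
    """Depth-L computation tree of node v as (feature, [subtrees])."""
    if L == 1:
        return (feats[v], [])
    return (feats[v], [ctree(adj, feats, u, L - 1) for u in adj[v]])

def depth(T):
    return 1 + max((depth(c) for c in T[1]), default=0)

def td(Ta, Tb, w):
    """Tree distance TD_w (Definition 4) with a single weight w."""
    root = np.linalg.norm(Ta[0] - Tb[0])
    if max(depth(Ta), depth(Tb)) <= 1:
        return root
    p = len(Ta[0])
    ca, cb = list(Ta[1]), list(Tb[1])
    m = max(len(ca), len(cb))
    ca += [(np.zeros(p), [])] * (m - len(ca))   # blank-tree padding (rho)
    cb += [(np.zeros(p), [])] * (m - len(cb))
    C = np.array([[td(s, t, w) for t in cb] for s in ca])
    r, c = linear_sum_assignment(C)             # exact unnormalized OT
    return root + w * C[r, c].sum()

def tmd(adj_a, feats_a, adj_b, feats_b, L=2, w=0.5):
    """TMD^L_w (Definition 5): OT between the graphs' computation trees,
    padded with blank trees when the graphs differ in size."""
    p = len(next(iter(feats_a.values())))
    Ta = [ctree(adj_a, feats_a, v, L) for v in adj_a]
    Tb = [ctree(adj_b, feats_b, v, L) for v in adj_b]
    n = max(len(Ta), len(Tb))
    Ta += [(np.zeros(p), [])] * (n - len(Ta))
    Tb += [(np.zeros(p), [])] * (n - len(Tb))
    C = np.array([[td(s, t, w) for t in Tb] for s in Ta])
    r, c = linear_sum_assignment(C)
    return C[r, c].sum()

# a 2-node path vs. a 3-node path, all with unit features
path2 = {0: [1], 1: [0]}
path3 = {0: [1], 1: [0, 2], 2: [1]}
f2 = {v: np.ones(1) for v in path2}
f3 = {v: np.ones(1) for v in path3}
print(tmd(path2, f2, path2, f2, L=2))       # 0.0: identical tree multisets
print(tmd(path2, f2, path3, f3, L=2) > 0)   # True: the extra node costs mass
```

Because the final OT is unnormalized, the size difference between the two paths contributes to the distance, which is exactly the size-awareness motivated in Section 3.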
4.3 Experiments
We verify whether the tree mover's distance aligns with the labels of graphs in graph classification tasks on the TUDatasets [35], which contain graphs with discrete node attributes (MUTAG, PTC-MR, PROTEINS, NCI1, NCI109) and graphs with continuous node attributes (BZR, COX2). Specifically, we run a support vector classifier ($C = 1$) with the indefinite kernel $e^{-\gamma \cdot \mathrm{TMD}(\cdot, \cdot)}$, which can be viewed as a noisy observation of the true positive semidefinite kernel [31]. The parameter $\gamma$ is selected via cross-validation from $\{0.01, 0.05, 0.1\}$ and the weights $w(\cdot)$ are set to $0.5$ for all depths. For comparison, we use graph kernels based on graph subtrees: the Ramon & Gärtner kernel [39] and the WL subtree kernel [44]; two widely adopted GNNs: the graph isomorphism network (GIN) [55] and graph convolutional networks (GCN) [27]; and the recently proposed graph metrics FGW [49] and WWL [46]. Table 1 reports the mean and standard deviation over 10 independent trials with a 90%/10% train-test split. The performances of the baselines are taken from the original papers. TMD outperforms or matches the performance of state-of-the-art GNNs, graph kernels, and metrics, implying that it captures meaningful structural properties of graphs. Appendix C shows further graph clustering and t-SNE visualization [48] results.