Tree Mover’s Distance: Bridging Graph Metrics and
Stability of Graph Neural Networks
Ching-Yao Chuang
MIT CSAIL
cychuang@mit.edu
Stefanie Jegelka
MIT CSAIL
stefje@mit.edu
Abstract
Understanding generalization and robustness of machine learning models fundamentally relies on assuming an appropriate metric on the data space. Identifying such a metric is particularly challenging for non-Euclidean data such as graphs. Here, we propose a pseudometric for attributed graphs, the Tree Mover's Distance (TMD), and study its relation to generalization. Via a hierarchical optimal transport problem, TMD reflects the local distribution of node attributes as well as the distribution of local computation trees, which are known to be decisive for the learning behavior of graph neural networks (GNNs). First, we show that TMD captures properties relevant to graph classification: a simple TMD-SVM performs competitively with standard GNNs. Second, we relate TMD to generalization of GNNs under distribution shifts, and show that it correlates well with performance drop under such shifts. The code is available at https://github.com/chingyaoc/TMD.
1 Introduction
Understanding generalization under distribution shifts – theoretically and empirically – relies on an appropriate measure of divergence between data distributions. This, in turn, typically demands a metric on the data space that indicates what kinds of data points are "close" to the training data. While such metrics may be readily available for Euclidean spaces, they are more challenging to determine for non-Euclidean data spaces such as graphs with attributes in $\mathbb{R}^p$, which underlie graph learning methods. In this work, we study the question of a suitable metric for message passing graph neural networks (GNNs).
An "ideal" metric for studying input perturbations captures the invariances and inductive biases of the model we are examining. For instance, since graph isomorphism is a difficult problem [18], this metric is expected to be a pseudometric, i.e., it will fail to distinguish certain graphs. These "failures" should be aligned with the GNNs' invariances. Moreover, several recent works highlight the importance of local structures – computation trees resulting from unrolling the message passing process – for the approximation power of GNNs and their inability to distinguish certain pairs of graphs [2, 20, 34, 55]. Works on out-of-distribution generalization of GNNs mostly focus on specific types of instances where a trained model may fail miserably [58, 54], without specifying the behavior for gradual shifts in distributions. Yehudai et al. [58] show that a sufficiently large, unrestricted GNN may predict arbitrarily on previously unseen computation trees. In practice, we may observe a more gradual change, depending on the magnitude of change of the tree, in terms of both structure and node attributes, and the capacity of the aggregation function. In summary, we desire a metric that reflects the structure of computation trees and the distribution of node attributes within trees. Many distances and kernels between graphs have been proposed [7], several based on local structures [21], and some using structure and attributes [44]. Closest to our ideal is the Wasserstein Weisfeiler-Leman pseudometric [46], which computes an optimal transport distance between node embeddings of two graphs. The embeddings are computed via message passing, which aligns with the computation of GNNs, but loses structural information within trees.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.01906v1 [cs.LG] 4 Oct 2022
Hence, we propose the Tree Mover’s Distance (TMD), a pseudometric on attributed graphs that
considers both the tree structure and local distribution of attributes. It achieves this via a hierarchical
optimal transport problem that defines distances between trees. First, we observe that TMD captures properties that relate graphs to their labels: a simple SVM based on TMD performs competitively with standard GNNs and graph kernels on graph classification benchmarks. Second, we relate TMD to the performance of GNNs under input perturbations. We
determine a Lipschitz constant of GNNs with respect to TMD, which enables a bound on their target
risk under domain shifts, i.e., distribution shifts between training and test data. This bound uses the
metric in two ways: to measure the distribution shift, and to measure perturbation robustness via the
Lipschitz constant. Empirically, we observe that TMD correlates well with the performance of GNNs under distribution shifts, also in comparison to other distances. We hence hope that this work inspires further empirical and theoretical work toward a tighter understanding of the performance of GNNs under distribution shifts.
In short, this work makes the following contributions:
- We propose a new graph metric via hierarchical optimal transport between computation trees of graphs, which, in an SVM, leads to a competitive graph learning method;
- We bound the Lipschitz constant of message-passing GNNs with respect to TMD, which allows us to quantify stability and generalization;
- We develop a generalization bound for GNNs under distribution shifts that correlates well with empirical behavior.
2 Related Works
Graph Metrics and Kernels. Measuring distances between graphs has been a long-standing goal in data analysis. However, proper metrics that distinguish non-isomorphic graphs in polynomial time are not known. For instance, Vayer et al. [49] propose a graph metric by fusing Wasserstein distance [50] and Gromov-Wasserstein distance [32]. Similar to classic graph metrics [9, 40], the proposed metric requires approximation. Closely related, graph kernels [52, 7] have gained attention. Most graph kernels lie in the framework of $\mathcal{R}$-convolution [21], and measure similarity by comparing substructures. Many $\mathcal{R}$-convolution kernels have limited expressive power and sometimes struggle to handle continuously attributed graphs [46]. In comparison, TMD is as powerful as the WL graph isomorphism test [53] while accommodating graphs with high dimensional node attributes. Importantly, TMD captures the stability and generalization of message-passing GNNs [27, 55].
Stability and Generalization of Graph Neural Networks. A number of existing works study stability of GNNs to input perturbations. Spectral GNNs are known to be stable to certain perturbations, e.g., size, if the overall structure is preserved [19, 25, 26]. For message passing GNNs, Yehudai et al. [57] study perturbations of graph size, and demonstrate the importance of local computation trees. Xu et al. [54] study how the out-of-distribution behavior of aggregation functions may affect the GNN's prediction. These studies motivate including both computation trees and inputs to aggregation functions in the TMD. Finally, a number of works study within-distribution generalization [15, 20, 28, 42].
3 Background on Optimal Transport
We begin with a brief introduction to optimal transport (OT) and the earth mover's distance. The earth mover's distance, also known as the Wasserstein distance, is a distance function defined via the transportation cost between two distributions. Let $X = \{x_i\}_{i=1}^m$ and $Y = \{y_j\}_{j=1}^m$ be two multisets of $m$ elements each. Let $C \in \mathbb{R}^{m \times m}$ be the transportation cost for each pair: $C_{ij} = d(x_i, y_j)$, where $d$ is the distance between $x_i$ and $y_j$. The earth mover's distance solves the following OT problem:

$$\overline{\mathrm{OT}}_d(X, Y) := \min_{\gamma \in \Gamma(X, Y)} \langle C, \gamma \rangle / m, \qquad \Gamma(X, Y) = \{\gamma \in \mathbb{R}^{m \times m}_+ \mid \gamma 1_m = \gamma^\top 1_m = 1_m\}, \tag{1}$$

where $\Gamma$ is the set of transportation plans that satisfy the flow constraint $\gamma 1_m = \gamma^\top 1_m = 1_m$. In this work, we adopt the unnormalized version of the earth mover's distance:

$$\mathrm{OT}_d(X, Y) := \min_{\gamma \in \Gamma(X, Y)} \langle C, \gamma \rangle = m \cdot \overline{\mathrm{OT}}_d(X, Y). \tag{2}$$

Compared to classic OT, unnormalized OT preserves the size information of the multisets $X$ and $Y$.
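For uniform marginals over equal-size multisets, the vertices of the transportation polytope $\Gamma(X, Y)$ are permutation matrices (Birkhoff–von Neumann), so the unnormalized cost in Eq. (2) can be computed as a minimum-cost assignment. A minimal sketch of this reduction (our own helper for illustration; it is not the paper's implementation, whose experiments use the POT library):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def unnormalized_ot(X, Y, d=lambda x, y: np.linalg.norm(x - y)):
    """Unnormalized OT cost of Eq. (2) between two equal-size multisets.

    With uniform marginals 1_m on both sides, an optimal plan is a
    permutation matrix, so the linear program reduces to a
    minimum-cost assignment.
    """
    m = len(X)
    assert len(Y) == m, "multisets must have equal cardinality"
    C = np.array([[d(x, y) for y in Y] for x in X])  # cost matrix C_ij
    rows, cols = linear_sum_assignment(C)            # optimal permutation
    return C[rows, cols].sum()                       # <C, gamma> at the optimum

# identical multisets transport at zero cost
X = [np.array([0.0]), np.array([1.0])]
print(unnormalized_ot(X, X))          # 0.0
# shifting both points by 1 costs 2 in total (mass is not normalized)
Y = [np.array([1.0]), np.array([2.0])]
print(unnormalized_ot(X, Y))          # 2.0
```

Dividing the result by `m` recovers the normalized cost of Eq. (1).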
Figure 1: Tree Distance via Hierarchical OT. The weight $w(\cdot)$ is set to 1 to simplify visualization. The distance between two trees can be decomposed into (a) the distance between roots and (b) the OT cost between subtrees, where the cost function of OT is again the distance between two trees. This formulates a hierarchical OT problem, where solving the OT between trees requires solving the OT between subtrees.
4 Tree Mover’s Distance: Optimal Transport on Graphs
Next, we introduce the tree mover's distance, a new distance for graphs. Let $G = (V, E)$ denote a graph with a set of nodes $V$ and edges $E$. The graph may have node features $x_v \in \mathbb{R}^p$ for $v \in V$. If not, we simply set the node feature to a scalar $x_v = 1$ for all $v \in V$.
Local structures of graphs are characterized by computation trees [23]. In particular, computation trees are constructed by connecting adjacent nodes recursively.

Definition 1 (Computation Trees). Given a graph $G = (V, E)$, let $T^1_v = v$, and let $T^L_v$ be the depth-$L$ computation tree of node $v$, constructed by connecting the neighbors of the leaf nodes of $T^{L-1}_v$ to the tree. The multiset of depth-$L$ computation trees defined by $G$ is denoted by $\mathcal{T}^L_G := \{T^L_v\}_{v \in V}$.
Figure 2 illustrates an example of constructing computation trees. Computation trees, also referred to as subtree patterns, have been central in graph analysis [38, 53] and graph kernels [39, 44]. Intuitively, the computation tree of a node encodes local structure by appending the neighbors to the tree in each level. If the depth-$L$ computation trees are the same for two nodes, they share similar neighborhoods up to $L$ steps away. Therefore, an intuitive way to compare two graphs is by measuring the difference of their nodes' computation trees [44, 53]. In this work, we adopt optimal transport, a natural way to compute the distance between two sets of objects with, importantly, an underlying geometry. We will begin by defining the transportation cost between two computation trees. The cost then gives rise to the tree mover's distance, an extension of the earth mover's distance to multisets of trees.
4.1 Distance between Trees via Hierarchical OT
Let $T = (V, E, r)$ denote a rooted tree. We further let $\mathcal{T}_v$ be the multiset of computation trees of the node $v$, which consists of the trees rooted at the descendants of $v$. Determining whether two trees are similar requires iteratively examining whether the subtrees in each level are similar. For instance, two trees $T_a$ and $T_b$ with roots $r_a$ and $r_b$ are the same if $x_{r_a} = x_{r_b}$ and $\mathcal{T}_{r_a} = \mathcal{T}_{r_b}$. This motivates us to define the distance between two trees by recursively computing the optimal transportation cost between their subtrees. Nevertheless, the number of subtrees could be different for $r_a$ and $r_b$, i.e., $|\mathcal{T}_{r_a}| \neq |\mathcal{T}_{r_b}|$ (see Figure 1, left). To compute the OT between sets with different sizes, unbalanced OT [11, 41] or partial OT [10] are usually adopted. Inspired by [10], we augment the smaller set with blank trees.

Definition 2 (Blank Tree). A blank tree $T_0$ is a tree (graph) that contains a single node and no edge, where the node feature is the zero vector $0_p \in \mathbb{R}^p$, and $\mathcal{T}^n_0$ denotes a multiset of $n$ blank trees.

Definition 3 (Blank Tree Augmentation). Given two multisets of trees $\mathcal{T}_u, \mathcal{T}_v$, define $\rho$ to be a function that augments a pair of multisets with blank trees as follows:

$$\rho : (\mathcal{T}_v, \mathcal{T}_u) \mapsto \left( \mathcal{T}_v \cup \mathcal{T}^{\max(|\mathcal{T}_u| - |\mathcal{T}_v|,\, 0)}_0,\; \mathcal{T}_u \cup \mathcal{T}^{\max(|\mathcal{T}_v| - |\mathcal{T}_u|,\, 0)}_0 \right).$$

If $|\mathcal{T}_v| < |\mathcal{T}_u|$, $\rho$ augments $\mathcal{T}_v$ with $|\mathcal{T}_u| - |\mathcal{T}_v|$ blank trees to make the two multisets contain the same number of trees, and hence allows us to define a transportation cost between two multisets of trees with different sizes. In particular, the transportation costs of the additional trees are simply the distances to the blank trees. The distance between a tree and a blank tree can be interpreted as the norm of the tree. In our case, the blank tree can be viewed as the origin, as it is a tree with the simplest structure and a zero feature vector. Equipped with $\rho$, we define the distance between two rooted trees as follows.
Definition 4 (Tree Distance). The distance between two trees $T_a, T_b$ is defined recursively as

$$\mathrm{TD}_w(T_a, T_b) := \begin{cases} \|x_{r_a} - x_{r_b}\| + w(L) \cdot \mathrm{OT}_{\mathrm{TD}_w}(\rho(\mathcal{T}_{r_a}, \mathcal{T}_{r_b})) & \text{if } L > 1, \\ \|x_{r_a} - x_{r_b}\| & \text{otherwise}, \end{cases}$$

where $L = \max(\mathrm{Depth}(T_a), \mathrm{Depth}(T_b))$ and $w : \mathbb{N} \to \mathbb{R}_+$ is a depth-dependent weighting function. Here, $\mathrm{OT}_{\mathrm{TD}_w}$ is the OT distance defined in (2) with $\mathrm{TD}_w$ as the metric. Figure 1 gives an illustration of computing tree distances. The tree distance $\mathrm{TD}_w(T_a, T_b)$ aims to optimally align two trees $T_a$ and $T_b$ by recursively comparing their roots and subtrees. Calculating $\mathrm{TD}_w(T_a, T_b)$ requires calculating the OT between augmented subtrees $\rho(\mathcal{T}_{r_a}, \mathcal{T}_{r_b})$, where the cost function of OT is again $\mathrm{TD}_w$. This formulates a hierarchical optimal transport problem: the distance between two trees is defined via the distances of subtrees, where the importance of each level is determined by the weight $w(\cdot)$. Increasing $w(\cdot)$ upweights the effect of nodes in the subtrees. While the weights may be arbitrary, we found that for many applications, using a single weight for all depths yields good empirical performance, as Section 4.3 shows. The role of the weights becomes more significant when we use TMD to bound the stability of GNNs.
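A minimal recursive sketch of $\mathrm{TD}_w$, assuming a single weight $w$ for all depths and our own `(root_feature, [subtrees])` tuple representation. Since $\rho$ equalizes the multiset sizes and the marginals are uniform, the inner exact OT reduces to a minimum-cost assignment:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# a tree is (root_feature, [subtrees]); the blank tree T_0 has a zero
# feature and no children (the representation is ours, not the paper's)
def blank_tree(p):
    return (np.zeros(p), [])

def tree_depth(T):
    return 1 + max((tree_depth(c) for c in T[1]), default=0)

def tree_distance(Ta, Tb, w=0.5):
    """TD_w of Definition 4 via hierarchical OT (single weight w)."""
    xa, ca = Ta
    xb, cb = Tb
    root = np.linalg.norm(xa - xb)
    if max(tree_depth(Ta), tree_depth(Tb)) <= 1:
        return root
    # blank-tree augmentation rho: pad the smaller multiset of subtrees
    p = len(xa)
    m = max(len(ca), len(cb))
    ca = ca + [blank_tree(p)] * (m - len(ca))
    cb = cb + [blank_tree(p)] * (m - len(cb))
    # unnormalized OT between subtrees; the cost is TD_w itself (recursion)
    C = np.array([[tree_distance(s, t, w) for t in cb] for s in ca])
    r, c = linear_sum_assignment(C)
    return root + w * C[r, c].sum()

# two single-edge trees whose leaf features differ by 1
Ta = (np.array([0.0]), [(np.array([1.0]), [])])
Tb = (np.array([0.0]), [(np.array([2.0]), [])])
print(tree_distance(Ta, Tb, w=0.5))   # 0 + 0.5 * |1 - 2| = 0.5
```

A depth-dependent $w(L)$ would simply index a weight list by the current level instead of reusing a scalar.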
4.2 From Tree Distance to Graph Distance
Next, we extend the distance between trees to a distance between graphs. By leveraging the tree distance $\mathrm{TD}_w(\cdot, \cdot)$, we introduce the tree mover's distance (TMD), a distance for graphs, by calculating the optimal transportation cost between the graphs' computation trees.

Definition 5 (Tree Mover's Distance). Given two graphs $G_a, G_b$ and $w, L \geq 0$, the tree mover's distance between $G_a$ and $G_b$ is defined as

$$\mathrm{TMD}^L_w(G_a, G_b) = \mathrm{OT}_{\mathrm{TD}_w}(\rho(\mathcal{T}^L_{G_a}, \mathcal{T}^L_{G_b})),$$

where $\mathcal{T}^L_{G_a}$ and $\mathcal{T}^L_{G_b}$ are the multisets of depth-$L$ computation trees of graphs $G_a$ and $G_b$, respectively.

Figure 2 illustrates the computation of TMD. Intuitively, TMD is the minimum cost required to transport node-wise computation trees from one graph to another. The blank tree augmentation $\rho$ is again adopted to handle graphs with different numbers of nodes. The next theorem shows that TMD is a pseudometric on attributed graphs.

Theorem 6 (Pseudometric). The tree mover's distance $\mathrm{TMD}^L_w$ is a pseudometric for finite $L > 0$.
In particular, the tree mover's distance satisfies (1) $\mathrm{TMD}^L_w(G_a, G_a) = 0$, (2) $\mathrm{TMD}^L_w(G_a, G_b) = \mathrm{TMD}^L_w(G_b, G_a)$, and (3) $\mathrm{TMD}^L_w(G_a, G_b) \leq \mathrm{TMD}^L_w(G_a, G_c) + \mathrm{TMD}^L_w(G_c, G_b)$ for any graphs $G_a, G_b, G_c$. However, in some cases, the distance $\mathrm{TMD}^L_w(G_a, G_b)$ can be zero even if $G_a \neq G_b$. This is reasonable, as graph isomorphism is not known to be decidable in polynomial time. Nevertheless, TMD can provably distinguish graphs that are identifiable by the (1-dimensional) Weisfeiler-Leman graph isomorphism test [53].
Theorem 7 (Discriminative Power of TMD). If two graphs $G_a$, $G_b$ are determined to be non-isomorphic in WL iteration $L$ and $w(l) > 0$ for all $0 < l \leq L + 1$, then $\mathrm{TMD}^{L+1}_w(G_a, G_b) > 0$.
The unnormalized OT and blank tree augmentation are essential to prove Theorem 7. The tree mover's distance can be computed exactly by solving optimal transport. In addition, TMD remains highly expressive on graphs with high dimensional continuous attributes, where most $\mathcal{R}$-convolution graph kernels struggle [46]. The discriminative power of TMD can be further strengthened by augmenting node attributes, e.g., with positional encodings [16, 29].

The OT cost between node representations in different graphs is reminiscent of the recently proposed Wasserstein WL (WWL) [46]. WWL uses the distance of node embeddings as a ground metric, where the node embeddings are computed via $L$ iterations of message passing with average-aggregation; in contrast, TMD computes a distance that aligns the trees, retaining more structural information. Even though an aggregation with nonlinearities can retain tree isomorphism information [55], similar to the hashing applied in the WL test (WWL uses only linear aggregations), the hierarchical OT is a more direct, graded distance measure on trees. Vayer et al. [49] define a metric that uses a Gromov-Wasserstein distance [32] between nodes, but need to approximate the GW computation.
Figure 2: Illustration of Computation Trees and Tree Mover's Distance. The computation trees of nodes are constructed by iteratively connecting the neighbors to the trees, and each graph defines a multiset of node-wise computation trees. The tree mover's distance is then defined as the optimal transport cost between the computation trees of two graphs.
MUTAG PTC PROTEINS NCI1 NCI109 BZR COX2
TMD L=1 89.4±5.5 65.3±5.8 73.9±2.8 68.3±2.0 69.5±1.6 83.8±7.2 77.8±5.0
TMD L=2 90.0±5.7 67.4±7.7 74.8±2.8 80.8±1.8 78.9±2.3 84.5±6.9 79.1±5.2
TMD L=3 91.1±5.4 68.5±6.1 74.6±2.6 83.3±1.1 82.3±2.5 85.5±6.2 78.5±5.9
TMD L=4 92.2±6.0 66.5±7.1 75.2±2.3 84.8±1.2 82.8±2.1 84.5±6.4 76.1±6.1
WWL [46] 87.3±1.5 66.3±1.2 74.3±0.6 86.1±0.3 - 84.4±2.0 78.3±0.5
FGW [49] 88.4±5.6 65.3±7.9 74.5±2.7 86.4±1.6 - 85.1±4.2 77.2±4.9
WL [44] 90.4±5.7 59.9±4.3 75.0±3.1 86.0±1.8 82.46±0.2 N/A N/A
R&G [39] 85.7±0.4 58.5±0.9 70.7±0.4 61.9±0.3 61.7±0.2 N/A N/A
GIN [55] 89.4±5.6 64.6±7.0 76.2±2.8 82.7±1.7 82.2±1.6 83.5±6.0 79.0±5.3
GCN [27] 85.6±5.8 64.2±4.3 76.0±3.2 80.2±2.0 - 84.6±5.9 77.1±4.7
Table 1: Classification on TU Datasets. TMD outperforms or matches the state-of-the-art graph kernels and GNNs. Note that WL [44] and R&G [39] are not applicable to continuously attributed graphs such as BZR and COX2.
Numerical Computation with Dynamic Programming. One can numerically compute TMD with dynamic programming. Starting with pairwise distances ($\mathrm{TMD}^{L=1}_w$) between node features across the two graphs, we then iteratively compute $\mathrm{TMD}^{L=k}_w$ between depth-$k$ computation trees from $k = 2$ to $L$ according to Definition 4. Let $D$ be the maximum degree of a node in the two graphs and $\tau(m)$ be the complexity of computing OT between sets of cardinality $m$. At each level, we have to perform OT on sets containing at most $D$ elements for $N$ nodes. Including the final OT between the nodes of the graphs, the overall time complexity of computing $\mathrm{TMD}^L_w$ is $O(\tau(N) + LN\tau(D))$. The time complexity of exact OT via linear programming is $\tau(m) = O(m^3 \log m)$ [17]. One can use faster approximations of OT, e.g., with near-linear time complexity [1], but we use exact OT throughout all experiments, implemented with the POT library [17].
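The pipeline of Definitions 1, 4, and 5 can be sketched end to end. This is an illustrative implementation under our own tuple tree representation; for clarity it recomputes subtree distances recursively rather than memoizing them level by level as the dynamic program above describes, and it solves the exact OT as a minimum-cost assignment rather than a general linear program:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ctree(adj, feats, v, L):
    """Depth-L computation tree of node v as (feature, [subtrees])."""
    if L == 1:
        return (feats[v], [])
    return (feats[v], [ctree(adj, feats, u, L - 1) for u in adj[v]])

def depth(T):
    return 1 + max((depth(c) for c in T[1]), default=0)

def td(Ta, Tb, w):
    """Tree distance TD_w (Definition 4) with a single weight w."""
    root = np.linalg.norm(Ta[0] - Tb[0])
    if max(depth(Ta), depth(Tb)) <= 1:
        return root
    p = len(Ta[0])
    ca, cb = list(Ta[1]), list(Tb[1])
    m = max(len(ca), len(cb))
    ca += [(np.zeros(p), [])] * (m - len(ca))   # blank-tree padding (rho)
    cb += [(np.zeros(p), [])] * (m - len(cb))
    C = np.array([[td(s, t, w) for t in cb] for s in ca])
    r, c = linear_sum_assignment(C)             # exact unnormalized OT
    return root + w * C[r, c].sum()

def tmd(adj_a, feats_a, adj_b, feats_b, L=2, w=0.5):
    """TMD^L_w (Definition 5): OT between the graphs' computation trees,
    padded with blank trees when the graphs differ in size."""
    p = len(next(iter(feats_a.values())))
    Ta = [ctree(adj_a, feats_a, v, L) for v in adj_a]
    Tb = [ctree(adj_b, feats_b, v, L) for v in adj_b]
    n = max(len(Ta), len(Tb))
    Ta += [(np.zeros(p), [])] * (n - len(Ta))
    Tb += [(np.zeros(p), [])] * (n - len(Tb))
    C = np.array([[td(s, t, w) for t in Tb] for s in Ta])
    r, c = linear_sum_assignment(C)
    return C[r, c].sum()

# a 2-node path vs. a 3-node path, all with unit features
path2 = {0: [1], 1: [0]}
path3 = {0: [1], 1: [0, 2], 2: [1]}
f2 = {v: np.ones(1) for v in path2}
f3 = {v: np.ones(1) for v in path3}
print(tmd(path2, f2, path2, f2, L=2))       # 0.0: identical tree multisets
print(tmd(path2, f2, path3, f3, L=2) > 0)   # True: the extra node costs mass
```

Because the final OT is unnormalized, the size difference between the two paths contributes to the distance, which is exactly the size-awareness motivated in Section 3.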
4.3 Experiments
We verify whether the tree mover's distance aligns with the labels of graphs in graph classification tasks on the TUDatasets [35], which contain graphs with discrete node attributes (MUTAG, PTC-MR, PROTEINS, NCI1, NCI109) and graphs with continuous node attributes (BZR, COX2). Specifically, we run a support vector classifier ($C = 1$) with the indefinite kernel $e^{-\gamma \cdot \mathrm{TMD}(\cdot, \cdot)}$, which can be viewed as a noisy observation of the true positive semidefinite kernel [31]. The parameter $\gamma$ is selected via cross-validation from $\{0.01, 0.05, 0.1\}$ and the weights $w(\cdot)$ are set to $0.5$ for all depths. For comparison, we use graph kernels based on graph subtrees: the Ramon & Gärtner kernel [39] and the WL subtree kernel [44]; two widely adopted GNNs: the graph isomorphism network (GIN) [55] and graph convolutional networks (GCN) [27]; and the recently proposed graph metrics FGW [49] and WWL [46]. Table 1 reports the mean and standard deviation over 10 independent trials with a 90%/10% train-test split. The performances of the baselines are taken from the original papers. TMD outperforms or matches the performance of state-of-the-art GNNs, graph kernels, and metrics, implying that it captures meaningful structural properties of graphs. Appendix C shows further graph clustering and t-SNE visualization [48] results.