What Makes Graph Neural Networks Miscalibrated?
Hans Hao-Hsun Hsu1Yuesong Shen1,2Christian Tomani 1,2Daniel Cremers 1,2
1Technical University of Munich, Germany
2Munich Center for Machine Learning, Germany
{hans.hsu, yuesong.shen, christian.tomani, cremers}@tum.de
Abstract
Given the importance of getting calibrated predictions and reliable uncertainty
estimations, various post-hoc calibration methods have been developed for neural
networks on standard multi-class classification tasks. However, these methods
are not well suited for calibrating graph neural networks (GNNs), which presents
unique challenges such as accounting for the graph structure and the graph-induced
correlations between the nodes. In this work, we conduct a systematic study on
the calibration qualities of GNN node predictions. In particular, we identify five
factors which influence the calibration of GNNs: general under-confident tendency,
diversity of nodewise predictive distributions, distance to training nodes, relative
confidence level, and neighborhood similarity. Furthermore, based on the insights
from this study, we design a novel calibration method named Graph Attention
Temperature Scaling (GATS), which is tailored for calibrating graph neural networks.
GATS incorporates designs that address all the identified influential factors
and produces nodewise temperature scaling using an attention-based architecture.
GATS is accuracy-preserving, data-efficient, and expressive at the same time. Our
experiments empirically verify the effectiveness of GATS, demonstrating that it can
consistently achieve state-of-the-art calibration results on various graph datasets
for different GNN backbones.
1 Introduction
Graph-structured data, such as social networks, knowledge graphs, and the internet of things, are
widespread, and learning on graphs using neural networks has been an active area of research. For
node classification on graphs, a wide range of graph neural network (GNN) models, including GCN
[14], GAT [40] and GraphSAGE [9], have been proposed to achieve high classification accuracy.
That said, high accuracy is not the only desideratum for a classifier. In particular, reliable uncertainty
estimation is crucial for applications like safety-critical tasks and active learning. Neural networks
are known to produce poorly calibrated predictions that are either overconfident or under-confident
[7, 41]. To mitigate this issue, a variety of post-hoc calibration methods [7, 19, 43, 38, 8] have been
introduced over the last few years for calibrating neural networks on standard multi-class classification
problems. However, calibration of GNNs, in the context of node classification on graphs, is currently
still an underexplored topic. While it is possible to apply existing calibration methods designed for
multi-class classification to GNNs in a nodewise manner, this does not address the specific challenges
of node classification on graphs. In particular, node predictions in a graph are not i.i.d. but correlated,
and we are tackling a structured prediction problem [25]. A uniform treatment when calibrating
node predictions would fail to account for the structural information from graphs and the non-i.i.d.
behavior of node predictions.
*Equal contribution
Source code available at https://github.com/hans66hsu/GATS
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.06391v1 [cs.LG] 12 Oct 2022
Our contribution. In this work, we focus on calibrating GNNs for the node classification task
[14, 40]. First, we aim at understanding the specific challenges posed by GNNs by conducting
a systematic study on the calibration qualities of GNN node predictions. Our study reveals five
factors that influence the calibration performance of GNNs: general under-confident tendency,
diversity of nodewise predictive distributions, distance to training nodes, relative confidence level,
and neighborhood similarity. Second, we develop the Graph Attention Temperature Scaling (GATS)
approach, which is designed in a way that accounts for the aforementioned influential factors. GATS
generates nodewise temperatures that calibrate GNN predictions based on the graph topology. Third,
we conduct a series of GNN calibration experiments and empirically verify the effectiveness of GATS
in terms of calibration, data-efficiency, and expressivity.
2 Related work
For standard multi-class classification tasks, a variety of post-hoc calibration methods have been
proposed in order to make neural networks uncertainty-aware: temperature scaling (TS) [7], ensemble
temperature scaling (ETS) [43], multi-class isotonic regression (IRM) [43], Dirichlet calibration [19],
spline calibration [8], etc. Additionally, calibration has been formulated for regression tasks [17].
More generally, instead of transforming logits after training a classifier, a plethora of methods exist
that modify either the model architecture or the training process itself. This includes methods based
on the Bayesian paradigm [12, 1, 6, 22, 42], evidential theory [33], adversarial calibration [37], and
model ensembling [20]. One common caveat of these methods is the trade-off between accuracy and
calibration, which oftentimes do not go hand in hand. Post-hoc methods like temperature scaling, on
the other hand, are accuracy-preserving: they ensure that the per-node logit rankings are unaltered.
Calibration of GNNs is currently a substantially less explored topic. Nodewise post-hoc calibration
of GNNs using methods developed for the multi-class setting has been empirically evaluated by
Teixeira et al. [36]. They show that these methods, which perform uniform calibration of nodewise
predictions, are unable to produce calibrated predictions for some harder tasks. Wang et al. [41]
observe that GNNs tend to be under-confident, in contrast to the majority of multi-class classifiers,
which are generally overconfident [7]. Based on their findings, Wang et al. [41] propose the CaGCN
approach, which attaches a GCN on top of the backbone GNN for calibration. Some approaches
improve the uncertainty estimation of GNNs by adjusting model training. This includes Bayesian
learning approaches [45, 10] and methods based on the evidential theory [46, 35].
3 Problem setup for GNN calibration
We consider the problem of calibrating GNNs for node classification tasks: given a graph
$G = (\mathcal{V}, \mathcal{E})$, the training data consist of nodewise input features
$\{x_i\}_{i \in \mathcal{V}} \in \mathcal{X}$ and ground-truth labels
$\{y_i\}_{i \in \mathcal{L}} \in \mathcal{Y} = \{1, \dots, K\}$ for a subset
$\mathcal{L} \subset \mathcal{V}$ of nodes, and the goal is to predict the labels
$\{y_i\}_{i \in \mathcal{U}} \in \mathcal{Y}$ for the rest of the nodes
$\mathcal{U} = \mathcal{V} \setminus \mathcal{L}$. A graph neural network tackles the problem
by producing nodewise probabilistic forecasts $\hat{p}_i$. These forecasts yield the corresponding label
predictions $\hat{y}_i := \operatorname{argmax}_y \hat{p}_i(y)$ and confidences
$\hat{c}_i := \max_y \hat{p}_i(y)$. The GNN is calibrated when its probabilistic forecasts are
reliable, e.g., for predictions with confidence 0.8, they should be correct 80% of the time.
Formally, a GNN is perfectly calibrated [41] if
$$\forall c \in [0, 1], \quad P(y_i = \hat{y}_i \mid \hat{c}_i = c) = c. \quad (1)$$
In practice, we quantify the calibration quality with the expected calibration error (ECE) [27, 7]. We
follow the commonly used definition from Guo et al. [7], which uses an equal-width binning scheme to
estimate the calibration error for any node subset $\mathcal{N} \subset \mathcal{V}$: the predictions
are regrouped according to $M$ equally spaced confidence intervals, i.e. $(B_1, \dots, B_M)$ with
$B_m = \{ j \in \mathcal{N} \mid \frac{m-1}{M} < \hat{c}_j \le \frac{m}{M} \}$, and
the expected calibration error of the GNN forecasts is defined as
$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{|\mathcal{N}|} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|, \quad \text{with} \quad (2)$$
$$\mathrm{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbf{1}(y_i = \hat{y}_i) \quad \text{and} \quad \mathrm{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{c}_i. \quad (3)$$
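For illustration, Eqs. (2)-(3) translate to a few lines of plain Python; the function name and list-based inputs are our own choices, not part of the paper.

```python
import math

def expected_calibration_error(confidences, correct, num_bins=15):
    """ECE with M equal-width bins, following Eqs. (2)-(3).

    confidences: predicted confidence per node; correct: 1/0 correctness flags.
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for c, ok in zip(confidences, correct):
        # bin m collects (m-1)/M < c <= m/M (1-based m); clamp c = 0 into the first bin
        m = min(num_bins - 1, max(0, math.ceil(c * num_bins) - 1))
        bins[m].append((c, ok))
    ece = 0.0
    for b in bins:
        if b:
            conf = sum(c for c, _ in b) / len(b)
            acc = sum(ok for _, ok in b) / len(b)
            ece += len(b) / n * abs(acc - conf)
    return ece
```

A perfectly calibrated forecast (e.g. confidence 0.8 with 80% accuracy) yields an ECE of zero, while systematic over- or under-confidence increases it.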
4 Factors that influence GNN calibration
To design calibration methods adapted to GNNs, we need to figure out the particular factors that
influence the calibration quality of GNN predictions. For this we train a series of GCN [14]
and GAT [40] models on seven graph datasets: Cora [32], Citeseer [32], Pubmed [23], Amazon
Computers [34], Amazon Photo [34], Coauthor CS [34], and Coauthor Physics [34]. We summarize
the dataset statistics in Appendix A.1. Details about model training are provided in Appendix A.2 for
reproducibility. To compare with the standard multi-class classification case, we additionally train
ResNet-20 [11] models on the CIFAR-10 image classification task [16] as a reference.
Our experiments uncover five decisive factors that affect the calibration quality of GNNs. In the
following we discuss them in detail.
4.1 General under-confident tendency
Figure 1: Reliability diagrams of GCN models trained on various graph datasets. We see a general
tendency of under-confident predictions (plots above the diagonal), except for the Coauthor Physics
dataset. This is in contrast to the overconfident behavior of multi-class image classification using
CNNs (in gray).
Starting with a global perspective, we notice that GNNs tend to produce under-confident predictions.
In Figure 1 we plot the reliability diagrams [24] for results on different graph datasets using GCN.
Similar to Wang et al. [41], we see a general trend of under-confident predictions for GNNs. This is
in contrast to the standard multi-class image classification case, which exhibits overconfident behavior.
It is also interesting to see that this under-confident trend can be more or less pronounced depending
on the dataset. For Coauthor Physics, the predictions are well calibrated and show no significant bias.
Results using GAT models lead to similar conclusions and are provided in Appendix B.1.
4.2 Diversity of nodewise predictive distributions
Figure 2: Entropy distributions of GCN predictions on graph datasets. Compared to the standard
classification case, GNN predictions tend to be more dispersed, reflecting their disparate behaviors.
Contrary to the standard multi-class case, GNN outputs can have varying roles depending on their
positions in the graph, which means that their output distributions can exhibit dissimilar behaviors.
This is empirically evident in Figure 2, where we visualize the entropy distributions of GCN output
predictions vs. the standard multi-class results (GAT results are available in Appendix B.2). We see
that the entropies of GNN outputs have more spread-out distributions, which indicates that they have
distinct roles and behaviors in graphs.
In terms of GNN calibration, this observation implies that uniform node-agnostic adjustments like
temperature scaling [7] might be insufficient for GNNs, whereas nodewise adaptive approaches could
be beneficial.
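The quantity visualized in Figure 2 is the Shannon entropy of each node's predictive distribution; a minimal sketch (function name ours):

```python
import math

def predictive_entropy(probs):
    """Shannon entropy -sum p log p of one node's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

A uniform distribution over K classes gives the maximum entropy log K, while a one-hot (fully confident) prediction gives 0; a spread-out histogram of these values across nodes is what indicates diverse nodewise behavior.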
4.3 Distance to training nodes
Figure 3: Nodewise calibration error of GCN results depending on the minimum distance to training
nodes. We observe that training nodes and their neighbors tend to be better calibrated.
A graph provides additional structural information for its nodes. One insightful feature is the minimum
distance to training nodes. We discover that nodes with shorter distances, especially the training
nodes themselves and their direct neighbors, tend to be better calibrated.
To evaluate the calibration quality nodewise, we propose the nodewise calibration error, which is
based on the binning scheme used to compute the global expected calibration error (ECE) [27, 7]: for
each node, we find its corresponding bin depending on its predicted confidence, and the calibration
error of this bin is assigned to be its nodewise calibration error.
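This nodewise metric can be sketched as follows, reusing the equal-width binning of the global ECE (function name and list-based inputs are our own choices):

```python
import math

def nodewise_calibration_error(confidences, correct, num_bins=15):
    """Assign to each node the |acc - conf| gap of the confidence bin it falls into."""
    n = len(confidences)
    # 0-based bin index m collects confidences in (m/M, (m+1)/M]
    bin_of = [min(num_bins - 1, max(0, math.ceil(c * num_bins) - 1)) for c in confidences]
    gap = [0.0] * num_bins
    for m in range(num_bins):
        members = [i for i in range(n) if bin_of[i] == m]
        if members:
            acc = sum(correct[i] for i in members) / len(members)
            conf = sum(confidences[i] for i in members) / len(members)
            gap[m] = abs(acc - conf)
    # every node inherits the error of its own bin
    return [gap[m] for m in bin_of]
```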
Using this nodewise metric, in Figure 3 we visualize the influence of minimum distance to training
nodes on the nodewise calibration quality (c.f. Appendix B.3 for GAT results). We see that nodes
close to training ones typically have lower nodewise calibration error. This suggests that minimum
distance to training nodes can be useful for GNN calibration.
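The minimum distance feature itself is the hop distance to the nearest training node, computable with a multi-source BFS; the adjacency-dict representation and function name below are illustrative choices:

```python
from collections import deque

def min_distance_to_training(adj, train_nodes):
    """Multi-source BFS: hop distance from each node to its nearest training node.

    adj: dict mapping node -> list of neighbors; unreachable nodes keep distance None.
    """
    dist = {v: None for v in adj}
    queue = deque()
    for t in train_nodes:
        dist[t] = 0
        queue.append(t)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if dist[v] is None:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist
```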
4.4 Relative confidence level
Figure 4: Nodewise calibration error of GCN results depending on the relative confidence level. We
observe that nodes which are less confident than their neighbors tend to have worse calibration.
Another important piece of structural information is the neighborhood relation. We find that the relative
confidence level $\delta\hat{c}_i$ of a node $i$, i.e., the difference between the nodewise confidence
$\hat{c}_i$ and the average confidence of its neighbors $n(i)$,
$$\delta\hat{c}_i = \hat{c}_i - \frac{1}{|n(i)|} \sum_{j \in n(i)} \hat{c}_j, \quad (4)$$
has an interesting correlation with the nodewise calibration quality. In Figure 4 we show the relation
between the relative confidence level of a node and its nodewise calibration error (c.f. Appendix B.4
for GAT results). In particular, we observe that nodes which are less confident than their neighbors
tend to have worse calibration, and it is in general desirable to have a comparable confidence level w.r.t.
the neighbors. For GNN calibration, the relative confidence level $\delta\hat{c}_i$ can be a useful node
feature to consider.
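Eq. (4) only needs each node's confidence and its neighbor list; a sketch with illustrative dict inputs (assigning zero to isolated nodes is our own convention, not specified in the paper):

```python
def relative_confidence(conf, adj):
    """delta c_i = c_i minus the mean confidence of i's neighbors, as in Eq. (4)."""
    delta = {}
    for i, neigh in adj.items():
        if neigh:
            delta[i] = conf[i] - sum(conf[j] for j in neigh) / len(neigh)
        else:
            delta[i] = 0.0  # convention for isolated nodes (our assumption)
    return delta
```

A negative value flags a node that is less confident than its neighborhood, which the study associates with worse calibration.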
4.5 Neighborhood similarity
Figure 5: Nodewise calibration error of GCN results depending on the node homophily. Nodes with
strongly agreeing neighbors tend to have significantly lower calibration errors.
Furthermore, we find that different neighbors tend to introduce distinct influences. For assortative
graphs, which are the focus of this work, we find that the calibration of nodes is affected by node
homophily, i.e., whether a node tends to have the same label prediction as its neighbors. For a node
with $n_a$ agreeing neighbors and $n_d$ disagreeing ones, we measure the node homophily as
$$\text{Node homophily} = \log \frac{n_a + 1}{n_d + 1}, \quad (5)$$
where positive values indicate a greater ratio of agreeing neighbors and vice versa.
Figure 5 summarizes the variation of nodewise calibration error w.r.t. the node homophily for different
graph datasets (c.f. Appendix B.5 for GAT results). We find that nodewise calibration errors tend
to decrease significantly for nodes with strongly agreeing neighbors. This suggests that neighborhood
predictive similarity should be considered when doing GNN calibration.
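Eq. (5) can be computed directly from the label predictions and neighbor lists; a short sketch (names ours):

```python
import math

def node_homophily(pred, adj, i):
    """log((n_a + 1) / (n_d + 1)) with n_a agreeing and n_d disagreeing neighbors, Eq. (5)."""
    n_a = sum(1 for j in adj[i] if pred[j] == pred[i])
    n_d = len(adj[i]) - n_a
    return math.log((n_a + 1) / (n_d + 1))
```

The +1 smoothing keeps the measure finite for nodes whose neighbors all agree or all disagree, and a balanced neighborhood yields exactly zero.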
5 Graph attention temperature scaling (GATS)
Based on the findings in Section 4, we design a new post-hoc calibration method, named Graph
Attention Temperature Scaling (GATS), which is tailored for GNNs.
5.1 Formulation and design of GATS
To obtain a calibration method that is adapted to the graph structure $G = (\mathcal{V}, \mathcal{E})$
and reflects the influential factors observed in Section 4, the graph attention temperature scaling
approach extends the temperature scaling [7] method to produce a distinct temperature $T_i$ for each
node $i \in \mathcal{V}$. $T_i$ is then used to scale the uncalibrated nodewise output logits $z_i$ and
produce calibrated node predictions $\hat{p}_i$:
$$\forall i \in \mathcal{V}, \quad \hat{p}_i = \operatorname{softmax}\!\left(\frac{z_i}{T_i}\right). \quad (6)$$
Formulation of $T_i$. The nodewise temperature $T_i$ should address the five factors discussed in
Section 4. We achieve this via the following considerations:
- We introduce a global bias parameter $T_0$ to account for the general under-confident tendency;
- To tackle the diverse behavior of node predictions, we learn a nodewise temperature contribution $\tau_i$ based on the predicted nodewise logits $z_i$;
- To incorporate the relative confidence w.r.t. neighbors, we introduce $\delta\hat{c}_i$ from Eq. 4 as an additional contribution term scaled by a learnable coefficient $\omega$;
- To model the influence of neighborhood similarity, we use an attention mechanism [39] to aggregate neighboring contributions $\tau_j$ with attention coefficients $\alpha_{i,j}$ depending on the output similarities between the neighbors $i$ and $j$;
- Distance to training nodes is used to introduce a nodewise scaling factor $\gamma_i$ to adjust the node contribution and the aggregation process. It is learnable for training nodes and their direct neighbors and fixed to 1 for the rest:
$$\gamma_i = \begin{cases} \gamma_t, & \text{if } i \text{ is a training node} \\ \gamma_n, & \text{if } i \text{ is a neighbor of a training node} \\ 1, & \text{otherwise,} \end{cases} \qquad \gamma_t, \gamma_n \text{ learnable parameters.} \quad (7)$$
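The case distinction of Eq. (7) is straightforward to express in code; here $\gamma_t$ and $\gamma_n$ are passed as plain arguments for illustration, whereas in GATS they are learnable parameters:

```python
def distance_scaling_factor(i, train_nodes, adj, gamma_t, gamma_n):
    """gamma_i from Eq. (7): a learned value for training nodes and their direct
    neighbors, fixed to 1 for all other nodes."""
    if i in train_nodes:
        return gamma_t
    if any(j in train_nodes for j in adj[i]):
        return gamma_n
    return 1.0
```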