What Makes Graph Neural Networks Miscalibrated?
Hans Hao-Hsun Hsu1Yuesong Shen1,2Christian Tomani 1,2Daniel Cremers 1,2
1Technical University of Munich, Germany
2Munich Center for Machine Learning, Germany
{hans.hsu, yuesong.shen, christian.tomani, cremers}@tum.de
Abstract
Given the importance of getting calibrated predictions and reliable uncertainty
estimations, various post-hoc calibration methods have been developed for neural
networks on standard multi-class classification tasks. However, these methods
are not well suited for calibrating graph neural networks (GNNs), which presents
unique challenges such as accounting for the graph structure and the graph-induced
correlations between the nodes. In this work, we conduct a systematic study on
the calibration qualities of GNN node predictions. In particular, we identify five
factors which influence the calibration of GNNs: general under-confident tendency,
diversity of nodewise predictive distributions, distance to training nodes, relative
confidence level, and neighborhood similarity. Furthermore, based on the insights
from this study, we design a novel calibration method named Graph Attention
Temperature Scaling (GATS), which is tailored for calibrating graph neural networks.
GATS incorporates designs that address all the identified influential factors
and produces nodewise temperature scaling using an attention-based architecture.
GATS is accuracy-preserving, data-efficient, and expressive at the same time. Our
experiments empirically verify the effectiveness of GATS, demonstrating that it can
consistently achieve state-of-the-art calibration results on various graph datasets
for different GNN backbones.
1 Introduction
Graph-structured data, such as social networks, knowledge graphs, and the internet of things, are
widespread, and learning on graphs using neural networks has been an active area of research. For
node classification on graphs, a wide range of graph neural network (GNN) models, including GCN
[14], GAT [40] and GraphSAGE [9], have been proposed to achieve high classification accuracy.
That said, high accuracy is not the only desideratum for a classifier. In particular, reliable uncertainty
estimation is crucial for applications like safety-critical tasks and active learning. Neural networks
are known to produce poorly calibrated predictions that are either overconfident or under-confident
[7, 41]. To mitigate this issue, a variety of post-hoc calibration methods [7, 19, 43, 38, 8] have been
introduced over the last few years for calibrating neural networks on standard multi-class classification
problems. However, calibration of GNNs, in the context of node classification on graphs, is currently
still an underexplored topic. While it is possible to apply existing calibration methods designed for
multi-class classification to GNNs in a nodewise manner, this does not address the specific challenges
of node classification on graphs. In particular, node predictions in a graph are not i.i.d. but correlated,
and we are tackling a structured prediction problem [25]. A uniform treatment when calibrating
node predictions would fail to account for the structural information from graphs and the non-i.i.d.
behavior of node predictions.
*Equal contribution
Source code available at https://github.com/hans66hsu/GATS
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.06391v1 [cs.LG] 12 Oct 2022
Our contribution. In this work, we focus on calibrating GNNs for the node classification task
[14, 40]. First, we aim at understanding the specific challenges posed by GNNs by conducting
a systematic study on the calibration qualities of GNN node predictions. Our study reveals five
factors that influence the calibration performance of GNNs: general under-confident tendency,
diversity of nodewise predictive distributions, distance to training nodes, relative confidence level,
and neighborhood similarity. Second, we develop the Graph Attention Temperature Scaling (GATS)
approach, which is designed in a way that accounts for the aforementioned influential factors. GATS
generates nodewise temperatures that calibrate GNN predictions based on the graph topology. Third,
we conduct a series of GNN calibration experiments and empirically verify the effectiveness of GATS
in terms of calibration, data-efficiency, and expressivity.
2 Related work
For standard multi-class classification tasks, a variety of post-hoc calibration methods have been
proposed in order to make neural networks uncertainty-aware: temperature scaling (TS) [7], ensemble
temperature scaling (ETS) [43], multi-class isotonic regression (IRM) [43], Dirichlet calibration [19],
spline calibration [8], etc. Additionally, calibration has been formulated for regression tasks [17].
More generally, instead of transforming logits after training a classifier, a plethora of methods exist
that modify either the model architecture or the training process itself. This includes methods based
on the Bayesian paradigm [12, 1, 6, 22, 42], evidential theory [33], adversarial calibration [37], and
model ensembling [20]. One common caveat of these methods is the trade-off between accuracy and
calibration, which oftentimes do not go hand in hand. Post-hoc methods like temperature scaling, on
the other hand, are accuracy-preserving: they ensure that the per-node logit rankings are unaltered.
Calibration of GNNs is currently a substantially less explored topic. Nodewise post-hoc calibration
of GNNs using methods developed for the multi-class setting has been empirically evaluated by
Teixeira et al. [36]. They show that these methods, which perform uniform calibration of nodewise
predictions, are unable to produce calibrated predictions for some harder tasks. Wang et al. [41]
observe that GNNs tend to be under-confident, in contrast to the majority of multi-class classifiers,
which are generally overconfident [7]. Based on their findings, Wang et al. [41] propose the CaGCN
approach, which attaches a GCN on top of the backbone GNN for calibration. Some approaches
improve the uncertainty estimation of GNNs by adjusting model training. This includes Bayesian
learning approaches [45, 10] and methods based on the evidential theory [46, 35].
3 Problem setup for GNN calibration
We consider the problem of calibrating GNNs for node classification tasks: given a graph
$G = (\mathcal{V}, \mathcal{E})$, the training data consist of nodewise input features
$\{x_i\}_{i \in \mathcal{V}} \in \mathcal{X}$ and ground-truth labels
$\{y_i\}_{i \in \mathcal{L}} \in \mathcal{Y} = \{1, \dots, K\}$ for a subset
$\mathcal{L} \subset \mathcal{V}$ of nodes, and the goal is to predict the labels
$\{y_i\}_{i \in \mathcal{U}} \in \mathcal{Y}$ for the rest of the nodes
$\mathcal{U} = \mathcal{V} \setminus \mathcal{L}$. A graph neural network tackles the problem
by producing nodewise probabilistic forecasts $\hat{p}_i$. These forecasts yield the corresponding label
predictions $\hat{y}_i := \operatorname{argmax}_y \hat{p}_i(y)$ and confidences
$\hat{c}_i := \max_y \hat{p}_i(y)$. The GNN is calibrated when its probabilistic forecasts are
reliable, e.g., for predictions with confidence 0.8, they should be correct 80% of the time.
Formally, a GNN is perfectly calibrated [41] if
$$\forall c \in [0, 1], \quad P(y_i = \hat{y}_i \mid \hat{c}_i = c) = c. \quad (1)$$
In practice, we quantify the calibration quality with the expected calibration error (ECE) [27, 7]. We
follow the commonly used definition from Guo et al. [7], which uses an equal-width binning scheme to
estimate the calibration error for any node subset $\mathcal{N} \subset \mathcal{V}$: the predictions
are regrouped according to $M$ equally spaced confidence intervals, i.e. $(B_1, \dots, B_M)$ with
$B_m = \{ j \in \mathcal{N} \mid \frac{m-1}{M} < \hat{c}_j \le \frac{m}{M} \}$, and
the expected calibration error of the GNN forecasts is defined as
$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{|\mathcal{N}|} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|, \quad \text{with} \quad (2)$$
$$\mathrm{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbf{1}(y_i = \hat{y}_i) \quad \text{and} \quad \mathrm{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{c}_i. \quad (3)$$
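For illustration, Eqs. (2)-(3) translate to a few lines of plain Python; the function name and list-based inputs are our own choices, not part of the paper.

```python
import math

def expected_calibration_error(confidences, correct, num_bins=15):
    """ECE with M equal-width bins, following Eqs. (2)-(3).

    confidences: predicted confidence per node; correct: 1/0 correctness flags.
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for c, ok in zip(confidences, correct):
        # bin m collects (m-1)/M < c <= m/M (1-based m); clamp c = 0 into the first bin
        m = min(num_bins - 1, max(0, math.ceil(c * num_bins) - 1))
        bins[m].append((c, ok))
    ece = 0.0
    for b in bins:
        if b:
            conf = sum(c for c, _ in b) / len(b)
            acc = sum(ok for _, ok in b) / len(b)
            ece += len(b) / n * abs(acc - conf)
    return ece
```

A perfectly calibrated forecast (e.g. confidence 0.8 with 80% accuracy) yields an ECE of zero, while systematic over- or under-confidence increases it.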
4 Factors that influence GNN calibration
To design calibration methods adapted to GNNs, we need to figure out the particular factors that
influence the calibration quality of GNN predictions. For this we train a series of GCN [14]
and GAT [40] models on seven graph datasets: Cora [32], Citeseer [32], Pubmed [23], Amazon
Computers [34], Amazon Photo [34], Coauthor CS [34], and Coauthor Physics [34]. We summarize
the dataset statistics in Appendix A.1. Details about model training are provided in Appendix A.2 for
reproducibility. To compare with the standard multi-class classification case, we additionally train
ResNet-20 [11] models on the CIFAR-10 image classification task [16] as a reference.
Our experiments uncover five decisive factors that affect the calibration quality of GNNs. In the
following we discuss them in detail.
4.1 General under-confident tendency
Figure 1: Reliability diagrams of GCN models trained on various graph datasets. We see a general
tendency of under-confident predictions (plots above the diagonal), except for the Coauthor Physics
dataset. This is in contrast to the overconfident behavior of multi-class image classification using
CNNs (in gray).
Starting with a global perspective, we notice that GNNs tend to produce under-confident predictions.
In Figure 1 we plot the reliability diagrams [24] for results on different graph datasets using GCN.
Similar to Wang et al. [41], we see a general trend of under-confident predictions for GNNs. This is
in contrast to the standard multi-class image classification case, which exhibits overconfident behavior.
It is also interesting to see that this under-confident trend can be more or less pronounced depending
on the dataset. For Coauthor Physics, the predictions are well calibrated and show no significant bias.
Results using GAT models lead to similar conclusions and are provided in Appendix B.1.
4.2 Diversity of nodewise predictive distributions
Figure 2: Entropy distributions of GCN predictions on graph datasets. Compared to the standard
classification case, GNN predictions tend to be more dispersed, reflecting their disparate behaviors.
Contrary to the standard multi-class case, GNN outputs can have varying roles depending on their
positions in the graph, which means that their output distributions can exhibit dissimilar behaviors.
This is empirically evident in Figure 2, where we visualize the entropy distributions of GCN output
predictions vs. the standard multi-class results (GAT results are available in Appendix B.2). We see
that the entropies of GNN outputs have more spread-out distributions, which indicates that they have
distinct roles and behaviors in graphs.
In terms of GNN calibration, this observation implies that uniform node-agnostic adjustments like
temperature scaling [7] might be insufficient for GNNs, whereas nodewise adaptive approaches could
be beneficial.
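The quantity visualized in Figure 2 is the Shannon entropy of each node's predictive distribution; a minimal sketch (function name ours):

```python
import math

def predictive_entropy(probs):
    """Shannon entropy -sum p log p of one node's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

A uniform distribution over K classes gives the maximum entropy log K, while a one-hot (fully confident) prediction gives 0; a spread-out histogram of these values across nodes is what indicates diverse nodewise behavior.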
4.3 Distance to training nodes
Figure 3: Nodewise calibration error of GCN results depending on the minimum distance to training
nodes. We observe that training nodes and their neighbors tend to be better calibrated.
A graph provides additional structural information for its nodes. One insightful feature is the minimum
distance to training nodes. We discover that nodes with shorter distances, especially the training
nodes themselves and their direct neighbors, tend to be better calibrated.
To evaluate the calibration quality nodewise, we propose the nodewise calibration error, which is
based on the binning scheme used to compute the global expected calibration error (ECE) [27, 7]: for
each node, we find its corresponding bin depending on its predicted confidence, and the calibration
error of this bin is assigned to be its nodewise calibration error.
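This nodewise metric can be sketched as follows, reusing the equal-width binning of the global ECE (function name and list-based inputs are our own choices):

```python
import math

def nodewise_calibration_error(confidences, correct, num_bins=15):
    """Assign to each node the |acc - conf| gap of the confidence bin it falls into."""
    n = len(confidences)
    # 0-based bin index m collects confidences in (m/M, (m+1)/M]
    bin_of = [min(num_bins - 1, max(0, math.ceil(c * num_bins) - 1)) for c in confidences]
    gap = [0.0] * num_bins
    for m in range(num_bins):
        members = [i for i in range(n) if bin_of[i] == m]
        if members:
            acc = sum(correct[i] for i in members) / len(members)
            conf = sum(confidences[i] for i in members) / len(members)
            gap[m] = abs(acc - conf)
    # every node inherits the error of its own bin
    return [gap[m] for m in bin_of]
```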
Using this nodewise metric, in Figure 3 we visualize the influence of minimum distance to training
nodes on the nodewise calibration quality (c.f. Appendix B.3 for GAT results). We see that nodes
close to training ones typically have lower nodewise calibration error. This suggests that minimum
distance to training nodes can be useful for GNN calibration.
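The minimum distance feature itself is the hop distance to the nearest training node, computable with a multi-source BFS; the adjacency-dict representation and function name below are illustrative choices:

```python
from collections import deque

def min_distance_to_training(adj, train_nodes):
    """Multi-source BFS: hop distance from each node to its nearest training node.

    adj: dict mapping node -> list of neighbors; unreachable nodes keep distance None.
    """
    dist = {v: None for v in adj}
    queue = deque()
    for t in train_nodes:
        dist[t] = 0
        queue.append(t)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if dist[v] is None:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist
```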
4.4 Relative confidence level
Figure 4: Nodewise calibration error of GCN results depending on the relative confidence level. We
observe that nodes which are less confident than their neighbors tend to have worse calibration.
Another important piece of structural information is the neighborhood relation. We find that the relative
confidence level $\delta\hat{c}_i$ of a node $i$, i.e., the difference between the nodewise confidence
$\hat{c}_i$ and the average confidence of its neighbors $n(i)$,
$$\delta\hat{c}_i = \hat{c}_i - \frac{1}{|n(i)|} \sum_{j \in n(i)} \hat{c}_j, \quad (4)$$
has an interesting correlation with the nodewise calibration quality. In Figure 4 we show the relation
between the relative confidence level of a node and its nodewise calibration error (c.f. Appendix B.4
for GAT results). In particular, we observe that nodes which are less confident than their neighbors
tend to have worse calibration, and it is in general desirable to have a comparable confidence level w.r.t.
the neighbors. For GNN calibration, the relative confidence level $\delta\hat{c}_i$ can be a useful node
feature to consider.
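Eq. (4) only needs each node's confidence and its neighbor list; a sketch with illustrative dict inputs (assigning zero to isolated nodes is our own convention, not specified in the paper):

```python
def relative_confidence(conf, adj):
    """delta c_i = c_i minus the mean confidence of i's neighbors, as in Eq. (4)."""
    delta = {}
    for i, neigh in adj.items():
        if neigh:
            delta[i] = conf[i] - sum(conf[j] for j in neigh) / len(neigh)
        else:
            delta[i] = 0.0  # convention for isolated nodes (our assumption)
    return delta
```

A negative value flags a node that is less confident than its neighborhood, which the study associates with worse calibration.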
4.5 Neighborhood similarity
Figure 5: Nodewise calibration error of GCN results depending on the node homophily. Nodes with
strongly agreeing neighbors tend to have significantly lower calibration errors.
Furthermore, we find that different neighbors tend to introduce distinct influences. For assortative
graphs, which are the focus of this work, we find that the calibration of nodes is affected by node
homophily, i.e., whether a node tends to have the same label prediction as its neighbors. For a node
with $n_a$ agreeing neighbors and $n_d$ disagreeing ones, we measure the node homophily as
$$\text{Node homophily} = \log \frac{n_a + 1}{n_d + 1}, \quad (5)$$
where positive values indicate a greater ratio of agreeing neighbors and vice versa.
Figure 5 summarizes the variation of nodewise calibration error w.r.t. the node homophily for different
graph datasets (c.f. Appendix B.5 for GAT results). We find that nodewise calibration errors tend
to decrease significantly for nodes with strongly agreeing neighbors. This suggests that neighborhood
predictive similarity should be considered when doing GNN calibration.
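Eq. (5) can be computed directly from the label predictions and neighbor lists; a short sketch (names ours):

```python
import math

def node_homophily(pred, adj, i):
    """log((n_a + 1) / (n_d + 1)) with n_a agreeing and n_d disagreeing neighbors, Eq. (5)."""
    n_a = sum(1 for j in adj[i] if pred[j] == pred[i])
    n_d = len(adj[i]) - n_a
    return math.log((n_a + 1) / (n_d + 1))
```

The +1 smoothing keeps the measure finite for nodes whose neighbors all agree or all disagree, and a balanced neighborhood yields exactly zero.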
5 Graph attention temperature scaling (GATS)
Based on the findings in Section 4, we design a new post-hoc calibration method, named Graph
Attention Temperature Scaling (GATS), which is tailored for GNNs.
5.1 Formulation and design of GATS
To obtain a calibration method that is adapted to the graph structure $G = (\mathcal{V}, \mathcal{E})$
and reflects the influential factors observed in Section 4, the graph attention temperature scaling
approach extends the temperature scaling [7] method to produce a distinct temperature $T_i$ for each
node $i \in \mathcal{V}$. $T_i$ is then used to scale the uncalibrated nodewise output logits $z_i$ and
produce calibrated node predictions $\hat{p}_i$:
$$\forall i \in \mathcal{V}, \quad \hat{p}_i = \operatorname{softmax}\!\left(\frac{z_i}{T_i}\right). \quad (6)$$
Formulation of $T_i$. The nodewise temperature $T_i$ should address the five factors discussed in
Section 4. We achieve this via the following considerations:
- We introduce a global bias parameter $T_0$ to account for the general under-confident tendency;
- To tackle the diverse behavior of node predictions, we learn a nodewise temperature contribution $\tau_i$ based on the predicted nodewise logits $z_i$;
- To incorporate the relative confidence w.r.t. neighbors, we introduce $\delta\hat{c}_i$ from Eq. 4 as an additional contribution term scaled by a learnable coefficient $\omega$;
- To model the influence of neighborhood similarity, we use an attention mechanism [39] to aggregate neighboring contributions $\tau_j$ with attention coefficients $\alpha_{i,j}$ depending on the output similarities between the neighbors $i$ and $j$;
- Distance to training nodes is used to introduce a nodewise scaling factor $\gamma_i$ to adjust the node contribution and the aggregation process. It is learnable for training nodes and their direct neighbors and fixed to 1 for the rest:
$$\gamma_i = \begin{cases} \gamma_t, & \text{if } i \text{ is a training node} \\ \gamma_n, & \text{if } i \text{ is a neighbor of a training node} \\ 1, & \text{otherwise,} \end{cases} \qquad \gamma_t, \gamma_n \text{ learnable parameters.} \quad (7)$$
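The case distinction of Eq. (7) is straightforward to express in code; here $\gamma_t$ and $\gamma_n$ are passed as plain arguments for illustration, whereas in GATS they are learnable parameters:

```python
def distance_scaling_factor(i, train_nodes, adj, gamma_t, gamma_n):
    """gamma_i from Eq. (7): a learned value for training nodes and their direct
    neighbors, fixed to 1 for all other nodes."""
    if i in train_nodes:
        return gamma_t
    if any(j in train_nodes for j in adj[i]):
        return gamma_n
    return 1.0
```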