Robust Models are less Over-Confident
Julia Grabinski
Fraunhofer ITWM, Kaiserslautern
Visual Computing, University of Siegen
julia.grabinski@itwm.fraunhofer.de
Paul Gavrikov
IMLA, Offenburg University
Janis Keuper
Fraunhofer ITWM, Kaiserslautern
IMLA, Offenburg University
Margret Keuper
University of Siegen
Max Planck Institute for Informatics
Saarland Informatics Campus Saarbrücken
Abstract
Despite the success of convolutional neural networks (CNNs) in many academic
benchmarks for computer vision tasks, their application in the real-world is still
facing fundamental challenges. One of these open problems is the inherent lack of
robustness, unveiled by the striking effectiveness of adversarial attacks. Current
attack methods are able to manipulate the network’s prediction by adding specific
but small amounts of noise to the input. In turn, adversarial training (AT) aims to
achieve robustness against such attacks and ideally a better model generalization
ability by including adversarial samples in the training set. However, an in-depth
analysis of the resulting robust models beyond adversarial robustness is still pending.
In this paper, we empirically analyze a variety of adversarially trained models
that achieve high robust accuracies when facing state-of-the-art attacks, and we
show that AT has an interesting side-effect: it leads to models that are significantly
less overconfident in their decisions than non-robust models, even on clean data.
Further, our analysis of robust models shows that not only AT but also the model’s
building blocks (like activation functions and pooling) have a strong influence on
the models’ prediction confidences.
Data & Project website: https://github.com/GeJulia/robustness_confidences_evaluation
1 Introduction
Convolutional Neural Networks (CNNs) have been shown to successfully solve problems across
various tasks and domains. However, distribution shifts in the input data can have a severe impact on
the prediction performance. In real-world applications, these shifts may be caused by a multitude
of reasons including corruption due to weather conditions, camera settings, noise, and maliciously
crafted perturbations to the input data intended to fool the network (adversarial attacks). In recent
years, a vast line of research (e.g. [25, 36, 44]) has been devoted to solving robustness issues,
highlighting a multitude of causes for the limited generalization ability of networks and potential
solutions to facilitate the training of better models.
A second, yet equally important issue that hampers the deployment of deep learning based models
in practical applications is the lack of calibration concerning prediction confidences. In fact, most
models are overly confident in their predictions, even if they are wrong [31, 45, 57]. Specifically,
most conventionally trained models are unaware of their own lack of expertise, i.e. they are trained to
make confident predictions in any scenario, even if the test data is sampled from a previously unseen
domain. Adversarial examples seem to leverage this weakness, as they are known to not only fool the
network but also to cause very confident wrong predictions [46]. In turn, adversarial training (AT)
has been shown to improve the prediction accuracy under adversarial attacks [22, 25, 65, 87]. However,
only few works so far have investigated the links between calibration and robustness [45, 60],
leaving a systematic synopsis of adversarial robustness and prediction confidence still pending.
In this work, we provide an extensive empirical analysis of diverse adversarially robust models with
regard to their prediction confidences. To this end, we evaluate more than 70 adversarially robust
models and their conventionally trained counterparts, which show low robustness when exposed to
adversarial examples. By measuring their output distributions on benign and adversarial examples for
correct and erroneous predictions, we show that adversarially trained models have benefits beyond
adversarial robustness and are less over-confident.
To cope with the lack of calibration in conventionally trained models, Corbière et al. [13] propose to
use the true class probability rather than the standard confidence obtained after the softmax layer,
so as to circumvent the overlapping confidence values for wrong and correct predictions. However,
we observe that exactly these overlaps are an indicator of insufficiently calibrated models and can
be mitigated by improved CNN building blocks, namely downsampling and activation functions,
that have been proposed in the context of adversarial robustness [17, 28].
Our work analyzes the relationship between robust models and model confidences. Our experiments
for 71 robust and non-robust model pairs on the datasets CIFAR10 [43], CIFAR100 and ImageNet
[19] confirm that non-robust models are overconfident with their false predictions. This highlights
the challenges for usage in real-world applications. In contrast, we show that robust models are
generally less confident in their predictions and, especially, CNNs which include improved building
blocks (downsampling and activation) turn out to be better calibrated, manifesting low confidence in
wrong predictions and high confidence in correct predictions. Further, we show that the prediction
confidence of robust models can be used as an indicator for erroneous decisions. However, we also
see that adversarially trained networks (robust models) overfit to adversaries similar to those seen
during training and show similar performance on unseen attacks as non-robust models. Our
contributions can be summarized as follows:
- We provide an extensive analysis of the prediction confidence of 71 adversarially trained models (robust models) and their conventionally trained counterparts (non-robust models). We observe that most non-robust models are exceedingly over-confident while robust models exhibit less confidence and are better calibrated for slight domain shifts.
- We observe that specific layers that are considered to improve model robustness also impact the models' confidences. In detail, improved downsampling layers and activation functions can lead to an even better calibration of the learned model.
- We investigate the detection of erroneous decisions by using the prediction confidence. We observe that robust models are able to detect wrong predictions based on their confidences. However, when faced with unseen adversaries, they exhibit a similarly weak performance as non-robust models.
Our analysis provides a first synopsis of adversarial robustness and model calibration and aims to
foster research that addresses both challenges jointly rather than considering them as two separate
research fields. To further promote this research, we released our modelzoo1.
2 Related Work
In the following, we first briefly review the related work on model calibration which motivates our
empirical analysis. Then, we review the related work on adversarial attacks and model hardening.
Confidence Calibration.
For many models that perform well with respect to standard benchmarks,
it has been argued that the robust or regular model accuracy may be an insufficient metric [2, 13, 18, 79],
in particular when real-world applications with potentially open-world scenarios are considered.
In these settings, reliability must be established, which can be quantified by the prediction confidence
[58]. Ideally, a reliable model would provide high confidence predictions on correct classifications,
and low confidence predictions on false ones [13, 57]. However, most networks are not able to
instantly provide a sufficient calibration. Hence, confidence calibration is a vivid field of research
and proposed methods are based on additional loss functions [32, 35, 45, 48, 52], on adaptions of the
training input by label smoothing [54, 60, 63, 75] or on data augmentation [20, 45, 76, 88]. Further,
[58] present a benchmark on classification models regarding model accuracy and confidence under
dataset shift. Various evaluation methods have been provided to distinguish between correct and
incorrect predictions [13, 56]. Naeini et al. [56] defined the network's expected calibration error
(ECE) for a model $f$ and $0 < p \leq \infty$ as
$$\mathrm{ECE}_p = \mathbb{E}\big[\,|\hat{z} - \mathbb{E}[\mathbb{1}_{\hat{y}=y} \mid \hat{z}]|^p\,\big]^{1/p}, \qquad (1)$$
where the model $f$ predicts $\hat{y} = y$ with the confidence $\hat{z}$. This can be directly related to the
over-confidence $o(f)$ and under-confidence $u(f)$ of a network as follows [81]:
$$\big|o(f)\,P(\hat{y} \neq y) - u(f)\,P(\hat{y} = y)\big| \leq \mathrm{ECE}_p, \qquad (2)$$
where [55]
$$o(f) = \mathbb{E}[\hat{z} \mid \hat{y} \neq y], \qquad u(f) = \mathbb{E}[1 - \hat{z} \mid \hat{y} = y], \qquad (3)$$
i.e. the over-confidence measures the expectation of $\hat{z}$ on wrong predictions, the under-confidence
measures the expectation of $1 - \hat{z}$ on correct predictions, and ideally both should be zero. The
ECE provides an upper bound for the difference between the probability of the prediction being
wrong weighted by the network's over-confidence and the probability of the prediction being correct
weighted by the network's under-confidence, and converges to this value for $p \to 0$ (in
Eq. (1)). We also recur to this metric as an aggregate measure to evaluate model confidence. Yet, it
should be noted that the ECE metric is based on the assumption that networks make correct as well
as incorrect predictions. A model that always makes incorrect predictions and is less confident in its
few correct decisions than it is in its many erroneous decisions can end up with a comparably low
ECE. Therefore, ECE values for models with an accuracy below 50% are hard to interpret.
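The quantities in Eqs. (1)–(3) can be estimated directly from a model's validation predictions. The following sketch is our illustration, not the authors' evaluation code: it computes the over-confidence $o(f)$ and under-confidence $u(f)$ from Eq. (3) and a binned approximation of the ECE from arrays of confidences and correctness indicators; the equal-width binning is a common practical estimator for $p = 1$ and is an assumption on our part.

```python
import numpy as np

def over_under_confidence(conf, correct):
    """o(f) = E[conf | wrong] and u(f) = E[1 - conf | correct], cf. Eq. (3)."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    o = conf[~correct].mean() if (~correct).any() else 0.0
    u = (1.0 - conf[correct]).mean() if correct.any() else 0.0
    return o, u

def binned_ece(conf, correct, n_bins=15):
    """Equal-width binned estimate of the expected calibration error (p = 1)."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            # |mean confidence - accuracy| within the bin, weighted by bin frequency
            ece += in_bin.mean() * abs(conf[in_bin].mean() - correct[in_bin].mean())
    return ece

# toy usage: maximum softmax confidences and correctness of five predictions
conf = [0.99, 0.95, 0.80, 0.97, 0.60]
correct = [True, False, True, True, False]
print(over_under_confidence(conf, correct), binned_ece(conf, correct))
```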
Most common CNNs are over-confident [31, 45, 57]. Moreover, the most dominantly used activation
in modern CNNs [34, 39, 69, 73] remains the ReLU function, while it has been pointed out by Hein
et al. [35] that ReLUs cause a general increase in the models' prediction confidences, regardless of
the prediction validity. This is also the case for the vast majority of the adversarially trained models
we consider, except for the model by [17], to which we devote particular attention.
Adversarial Attacks.
Adversarial attacks intentionally add perturbations to the input samples that
are almost imperceptible to the human eye, yet lead to (high-confidence) false predictions of the
attacked model [25, 53, 74]. These attacks can be classified into two categories: white-box and
black-box attacks. In black-box attacks, the adversary has no knowledge of the model intrinsics [4]
and can only query its output. These attacks are often developed on surrogate models [10, 42, 78] to
reduce interaction with the attacked model in order to prevent threat detection. In general, though,
these attacks are less powerful due to their limited access to the target networks. In contrast, in
white-box attacks, the adversary has access to the full model, namely the architecture, weights,
and gradient information [25, 44]. This enables the attacker to perform extremely powerful attacks
customized to the model. One of the earliest approaches, the Fast Gradient Sign Method (FGSM)
by [25], uses the sign of the prediction gradient to perturb input samples in the direction of the
gradient, thereby increasing the loss and causing false predictions. This method was further adapted
and improved by Projected Gradient Descent (PGD) [44], DeepFool (DF) [53], Carlini and Wagner
(CW) [5] or Decoupling Direction and Norm (DDN) [65]. While FGSM is a single-step attack,
meaning that the perturbation is computed in one single gradient ascent step limited by some ε-bound,
multi-step attacks such as PGD iteratively search perturbations within the ε-bound to change the
models' prediction. These attacks generally perform better but come at an increased cost of the
attack. AutoAttack [14] is an ensemble of different attacks including an adaptive version of PGD,
and has been proposed as a baseline for adversarial robustness. In particular, it is used in robustness
benchmarks such as RobustBench [15].
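To make the attack principle concrete, the following PyTorch sketch implements single-step FGSM and an iterative ℓ∞ PGD attack with a random start and projection back into the ε-ball. It is a minimal illustration under our own assumptions (inputs in [0, 1]; the step size alpha and number of steps are hypothetical choices), not the exact configuration of any cited work.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """Single gradient-ascent step in the direction of sign(grad) (FGSM)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    return (x + eps * grad.sign()).clamp(0, 1).detach()

def pgd(model, x, y, eps, alpha=None, steps=10):
    """Iterative FGSM with projection onto the eps-ball around x (PGD)."""
    alpha = alpha if alpha is not None else 2.5 * eps / steps
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)  # random start
    for _ in range(steps):
        x_adv = x_adv.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)  # project to eps-ball
    return x_adv.detach()
```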
Adversarial Training and Robustness.
To improve robustness, adversarial training (AT) has
proven to be quite successful on common robustness benchmarks. Some attacks can simply be
defended against by including their adversarial examples in the training set [25, 65] or through an
additional loss [22, 87]. Furthermore, the addition of more training data, by using external data or data
augmentation techniques such as the generation of synthetic data, has been shown to be promising
for more robust models [6, 26, 27, 62, 68, 80]. RobustBench [15] provides a leaderboard to study the
improvements made by the aforementioned approaches in a comparable manner in terms of their robust accuracy.
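A minimal PGD-based adversarial training loop, sketched under our own assumptions and reusing the hypothetical pgd helper from the attack sketch above, illustrates the general recipe (inner maximization crafts adversarial examples, outer minimization trains on them); it is not the training code of any leaderboard entry.

```python
import torch
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, eps=8/255, steps=7, device="cuda"):
    """One epoch of PGD adversarial training: update the model on perturbed batches."""
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        # inner maximization: craft adversarial examples against the current model
        x_adv = pgd(model, x, y, eps=eps, steps=steps)
        # outer minimization: standard cross-entropy update on the adversarial batch
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
```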
Madry et al. [50] observed that the performance of adversarial training depends on the models'
capacity. High-capacity models are able to fit the (adversarial) training data better, leading to
increased robust accuracy. Later research investigated the influence of increased model width and
depth [26, 85], and the quality of convolution filters [24]. Consequently, the best-performing entries
on RobustBench [15] often use Wide-ResNet-70-16 or even larger architectures. Besides
this trend, concurrent works have also started to additionally modify specific building blocks of CNNs
[17, 29]. Grabinski et al. [28] showed that weaknesses of simple AT, like FGSM, can be overcome by
improving the network's downsampling operation.
Adversarial Training and Calibration.
Only a few but notable prior works such as [45, 60] have
investigated adversarial training with respect to model calibration. Without providing a systematic
overview, [45] show that AT can help to smoothen the prediction distributions of CNN models.
Qin et al. [60] investigate adversarial data points generated using [5] with respect to non-robust
models and find that easily attackable data points are badly calibrated, while adversarial models
have better calibration properties. In contrast, we analyze the robustness and calibration of pairs of
robust and non-robust versions of the same models rather than investigating individual data points.
[77] introduce an adversarial calibration loss to reduce the calibration error. Further, [72] propose
confidence-calibrated adversarial training to force adversarial samples to show uniform confidence,
while clean samples should be one-hot encoded. Complementary to [15], we provide an analysis of
the predictive confidences of adversarially trained, robust models and release conventionally trained
counterparts of the models from [15] to facilitate future research on the analysis of the impact of
training schemes versus architectural choices. Importantly, our proposed large-scale study allows
a differentiated view on the relationship between adversarial training and model calibration, as
discussed in Section 3. In particular, we find that adversarially trained models are not always better
calibrated than vanilla models, especially on clean data, while they are consistently less over-confident.
Adversarial Attack Detection.
A practical defense, besides adversarial training, can also be
established by the detection and rejection of malicious input. Most detection methods are based on
input sample statistics [23, 30, 33, 37, 47, 49], while others attempt to detect adversarial samples via
inference on surrogate models, yet these models themselves might be vulnerable to attacks [12, 51].
While all of these approaches perform additional operations on top of the models' prediction, we show
that simply taking the models' prediction confidence can be used as a heuristic to reject erroneous
samples.
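The sketch below illustrates this heuristic under our own assumptions: predictions whose maximum softmax confidence falls below a threshold are rejected, and the ROC-AUC quantifies how well the confidence separates correct from incorrect predictions (cf. Section 3); the threshold of 0.9 is an arbitrary example value.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def reject_low_confidence(probs, threshold=0.9):
    """Return predicted classes and a boolean mask of accepted (confident) samples."""
    probs = np.asarray(probs)            # shape (n_samples, n_classes), softmax outputs
    conf = probs.max(axis=1)             # maximum softmax confidence per sample
    return probs.argmax(axis=1), conf >= threshold

def error_detection_auc(probs, labels):
    """ROC-AUC of separating correct from incorrect predictions by their confidence."""
    probs, labels = np.asarray(probs), np.asarray(labels)
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(int)
    return roc_auc_score(correct, conf)  # 0.5 = chance level, 1.0 = perfect separation
```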
3 Analysis
In the following, we first describe our experimental setting, in which we then conduct an extensive
analysis on the two CIFAR datasets with respect to robust and non-robust model² confidence on
clean and perturbed samples as well as their ECE. Further, by computing the ROC curves of these
models, we observe that robust models are best suited to distinguish between correct and incorrect
predictions based on their confidence. In addition, we point out that improved pooling operations
or activation functions within the network can enhance the models' calibration further. Last, we
also investigate ImageNet as a high-resolution dataset and observe that the model with the highest
capacity and AT achieves the best performance and calibration.
3.1 Experimental Setup
We have collected 71 checkpoints of robust models [1, 3, 7–9, 11, 16, 17, 21, 22, 26, 27, 38, 40, 41,
59, 61, 62, 64, 67, 68, 70, 71, 80, 83, 84, 86, 87, 89, 90] listed on the ℓ∞-RobustBench leaderboard
[15]. Additionally, we compare each appearing architecture to a second model trained without AT or
any specific robustness regularization, and without any external data (even if the robust counterpart
relied on it). Training details can be found in Appendix A.
Then we collect the predictions, alongside their respective confidences, of robust and non-robust
models on clean validation samples, as well as on samples attacked by a white-box attack (PGD)
and a black-box attack (Squares). PGD (and its adaptive variant APGD [14]) is the most widely
used white-box attack, and adversarial training schemes explicitly (when using PGD samples for
training) or implicitly (when using the faster but strongly related FGSM attack samples for training)
optimize for PGD robustness. In contrast, the Squares attack alters the data at random within an allowed
budget until the label flips. Such samples are rather to be considered out-of-domain samples even for
adversarially trained models and provide a proxy for a model's generalization ability. Thus, Squares
can be seen as an unseen attack for all models, while PGD might not be for some adversarially trained,
robust models.

² The classification into robust and non-robust models is based on the models' robustness against adversarial attacks. We consider a model to be robust when it achieves considerably high accuracy under AutoAttack [14].
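A simplified sketch of this collection step is shown below: for each model and input condition (clean, PGD, Squares) we record the maximum softmax confidence and correctness of every validation sample and reduce them to the mean confidence on correct and on incorrect predictions, i.e. the two coordinates plotted in Figure 1. The attack_fn argument is a placeholder for any attack (None for clean data); the loop is our illustration, not the authors' evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def confidence_stats(model, loader, attack_fn=None, device="cuda"):
    """Mean confidence on correct and on incorrect predictions for one model and condition."""
    model.eval()
    confs, corrects = [], []
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        if attack_fn is not None:                  # e.g. a PGD or Squares attack; None = clean
            with torch.enable_grad():
                x = attack_fn(model, x, y)
        probs = F.softmax(model(x), dim=1)
        conf, pred = probs.max(dim=1)
        confs.append(conf.cpu())
        corrects.append((pred == y).cpu())
    conf, correct = torch.cat(confs), torch.cat(corrects)
    return conf[correct].mean().item(), conf[~correct].mean().item()
```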
Figure 1: Mean model confidences on their correct (x-axis) and incorrect (y-axis) predictions over the
full CIFAR10 dataset (top) and CIFAR100 dataset (bottom), clean (left) and perturbed with the attacks
PGD (middle) and Squares (right). Each point represents a model. Circular points (purple color map)
represent non-robust models and diamond-shaped points (green color map) represent robust models.
The color of each point represents the model's accuracy; darker signifies higher accuracy (better) on
the given data samples. The star in the bottom right corner indicates the optimal model calibration and
the gray area marks the region where the confidence distribution of the network is worse than random,
i.e. the model is more confident in incorrect predictions than in correct ones.
3.2 CIFAR Models
CIFAR10 [43] is a simple ten-class dataset consisting of 50,000 training and 10,000 validation images
with a resolution of 32 × 32. Since it is significantly cheaper to train on CIFAR10 in comparison to
e.g. ImageNet, and its low resolution keeps the additional cost of adversarial training manageable, most
entries on RobustBench [15] focus on CIFAR10.
Figure 2: Over-confidence (lower is better) bar plots of robust models and their non-robust counterparts
trained on CIFAR10, evaluated on clean samples, PGD samples, and Squares samples. Non-robust
models are highly over-confident; in contrast, their robust counterparts are less over-confident.
Figure 1 shows an overview of all robust and non-robust models trained on CIFAR10 in terms
of their accuracy as well as their confidence in their correct and incorrect predictions. Along the
isolines, the ratio between confidence in correct and incorrect predictions is constant. The gray
area marks the region in which models are more confident in their incorrect predictions than in their correct ones.