
instantly provide a sufficient calibration. Hence, confidence calibration is an active field of research,
and proposed methods are based on additional loss functions [32, 35, 45, 48, 52], on adaptations of the
training input by label smoothing [54, 60, 63, 75], or on data augmentation [20, 45, 76, 88]. Further,
[58] present a benchmark on classification models regarding model accuracy and confidence under
dataset shift. Various evaluation methods have been provided to distinguish between correct and
incorrect predictions [13, 56]. Naeini et al. [56] defined the network's expected calibration error
(ECE) for a model $f$ with $0 \le p \le \infty$ as
\[
\mathrm{ECE}_p = \mathbb{E}\big[\,\big|\hat{z} - \mathbb{E}[\mathbb{1}_{\hat{y}=y} \mid \hat{z}]\big|^p\,\big]^{1/p}, \quad (1)
\]
where the model $f$ predicts $\hat{y} = y$ with the confidence $\hat{z}$. This can be directly related to the
over-confidence $o(f)$ and under-confidence $u(f)$ of a network as follows [81]:
\[
\big|\, o(f)\, P(\hat{y} \neq y) - u(f)\, P(\hat{y} = y) \,\big| \le \mathrm{ECE}_p, \quad (2)
\]
where [55]
\[
o(f) = \mathbb{E}[\hat{z} \mid \hat{y} \neq y], \qquad u(f) = \mathbb{E}[1 - \hat{z} \mid \hat{y} = y], \quad (3)
\]
i.e., the over-confidence measures the expectation of $\hat{z}$ on wrong predictions, the under-confidence
measures the expectation of $1 - \hat{z}$ on correct predictions, and ideally both should be zero. The
ECE provides an upper bound for the difference between the probability of the prediction being
wrong weighted by the network's over-confidence and the probability of the prediction being correct
weighted by the network's under-confidence, and it converges to this value for the parameter $p \to 0$
in eq. (1). We also rely on this metric as an aggregate measure to evaluate model confidence. Yet, it
should be noted that the ECE metric is based on the assumption that networks make correct as well
as incorrect predictions. A model that makes mostly incorrect predictions and is less confident in its
few correct decisions than it is in its many erroneous decisions can end up with a comparably low
ECE. Therefore, ECE values for models with an accuracy below 50% are hard to interpret.
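For concreteness, the following is a minimal numpy sketch (not the authors' code) of the standard binned
estimate of $\mathrm{ECE}_p$ from eq. (1) together with the over- and under-confidence terms of eq. (3);
the bin count and the equal-width binning scheme are illustrative choices, not prescribed by the definition.

```python
import numpy as np

def ece_p(confidences, correct, p=1, n_bins=15):
    """Binned estimate of E[|z_hat - E[1_{y_hat=y} | z_hat]|^p]^(1/p) (eq. (1))."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        # |mean confidence - accuracy| in this bin, weighted by the bin's mass
        gap = abs(confidences[mask].mean() - correct[mask].mean())
        total += mask.mean() * gap ** p
    return total ** (1.0 / p)

def over_under_confidence(confidences, correct):
    """o(f) = E[z_hat | wrong prediction], u(f) = E[1 - z_hat | correct prediction] (eq. (3))."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    o = confidences[~correct].mean() if (~correct).any() else 0.0
    u = (1.0 - confidences[correct]).mean() if correct.any() else 0.0
    return o, u
```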
Most common CNNs are over-confident [31, 45, 57]. Moreover, the most commonly used activation
in modern CNNs [34, 39, 69, 73] remains the ReLU function, while it has been pointed out by Hein
et al. [35] that ReLUs cause a general increase in the models’ prediction confidences, regardless of
the prediction validity. This is also the case for the vast majority of the adversarially trained models
we consider, except for the model by [17], to which we devote particular attention.
Adversarial Attacks.
Adversarial attacks intentionally add perturbations to the input samples that are almost imperceptible
to the human eye, yet lead to (high-confidence) false predictions of the attacked model [25, 53, 74].
These attacks can be classified into two categories: white-box and black-box attacks. In black-box
attacks, the adversary has no knowledge of the model intrinsics [4] and can only query its output.
These attacks are often developed on surrogate models [10, 42, 78] to reduce interaction with the
attacked model in order to prevent threat detection. In general, though, these attacks are less powerful
due to their limited access to the target networks. In contrast, in white-box attacks, the adversary has
access to the full model, namely the architecture, weights, and gradient information [25, 44]. This
enables the attacker to perform extremely powerful attacks customized to the model. One of the
earliest approaches, the Fast Gradient Sign Method (FGSM) by [25], uses the sign of the prediction
gradient to perturb input samples in the direction of the gradient, thereby increasing the loss and
causing false predictions. This method was further adapted and improved by Projected Gradient
Descent (PGD) [44], DeepFool (DF) [53], Carlini and Wagner (CW) [5], and Decoupling Direction
and Norm (DDN) [65]. While FGSM is a single-step attack, meaning that the perturbation is computed
in one single gradient ascent step limited by some $\epsilon$-bound, multi-step attacks such as PGD
iteratively search for perturbations within the $\epsilon$-bound to change the model's prediction. These
attacks generally perform better but come at an increased computational cost. AutoAttack [14] is an
ensemble of different attacks, including an adaptive version of PGD, and has been proposed as a
baseline for adversarial robustness. In particular, it is used in robustness benchmarks such as
RobustBench [15].
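As an illustration of the single-step nature of FGSM described above, the following PyTorch sketch
implements the basic update $x_{\mathrm{adv}} = x + \epsilon \cdot \mathrm{sign}(\nabla_x \mathcal{L})$;
it is a simplified example, not the implementation of any cited work, and the value of $\epsilon$ is an
arbitrary placeholder.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8 / 255):
    """Single-step FGSM: perturb x in the direction of the sign of the loss gradient."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    # One gradient-ascent step bounded by eps, then clip back to the valid pixel range.
    x_adv = x_adv.detach() + eps * grad.sign()
    return x_adv.clamp(0.0, 1.0)
```

Multi-step attacks such as PGD essentially repeat a smaller version of this step several times while
projecting the perturbation back into the $\epsilon$-ball after each iteration.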
Adversarial Training and Robustness.
To improve robustness, adversarial training (AT) has proven to be quite successful on common
robustness benchmarks. Some attacks can be defended against simply by including their adversarial
examples in the training set [25, 65] or through an additional loss [22, 87]. Furthermore, the addition
of more training data, by using external data or data augmentation techniques such as the generation
of synthetic data, has been shown to be promising for more robust models [6, 26, 27, 62, 68, 80].
RobustBench [15] provides a leaderboard to study the improvements made by the aforementioned
approaches in a comparable manner in terms of their robust accuracy.
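To make the idea of adversarial training concrete, the following schematic PyTorch loop replaces each
clean mini-batch by adversarial examples crafted on the fly, here using the FGSM sketch from above
(multi-step PGD is the more common choice in the cited works). The names `model`, `loader`, and
`optimizer` are assumed to be defined elsewhere; this is a sketch, not a specific published training recipe.

```python
import torch
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, eps=8 / 255, device="cuda"):
    """One epoch of adversarial training: train on perturbed inputs instead of clean ones."""
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = fgsm(model, x, y, eps=eps)       # craft adversarial examples on the fly
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)  # standard loss on the adversarial batch
        loss.backward()
        optimizer.step()
```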