
instantly provide a sufficient calibration. Hence, confidence calibration is an active field of research,
and proposed methods are based on additional loss functions [32, 35, 45, 48, 52], on adaptations of the
training input by label smoothing [54, 60, 63, 75], or on data augmentation [20, 45, 76, 88]. Further,
[58] present a benchmark on classification models regarding model accuracy and confidence under
dataset shift. Various evaluation methods have been provided to distinguish between correct and
incorrect predictions [13, 56]. Naeini et al. [56] defined the network's expected calibration error
(ECE) for a model $f$ with $0 \le p \le \infty$ as
\[
\mathrm{ECE}_p = \mathbb{E}\big[\,\big|\hat{z} - \mathbb{E}[\mathbb{1}_{\hat{y}=y} \mid \hat{z}]\big|^p\,\big]^{1/p}, \quad (1)
\]
where the model $f$ predicts $\hat{y} = y$ with the confidence $\hat{z}$. This can be directly related to the
over-confidence $o(f)$ and under-confidence $u(f)$ of a network as follows [81]:
\[
\big|\, o(f)\, P(\hat{y} \neq y) - u(f)\, P(\hat{y} = y) \,\big| \le \mathrm{ECE}_p, \quad (2)
\]
where [55]
\[
o(f) = \mathbb{E}[\hat{z} \mid \hat{y} \neq y], \qquad u(f) = \mathbb{E}[1 - \hat{z} \mid \hat{y} = y], \quad (3)
\]
i.e., the over-confidence measures the expectation of $\hat{z}$ on wrong predictions, the under-confidence
measures the expectation of $1 - \hat{z}$ on correct predictions, and ideally both should be zero. The
ECE provides an upper bound for the difference between the probability of the prediction being
wrong weighted by the network's over-confidence and the probability of the prediction being correct
weighted by the network's under-confidence, and it converges to this value for the parameter $p \to 0$
in eq. (1). We also rely on this metric as an aggregate measure to evaluate model confidence. Yet, it
should be noted that the ECE metric is based on the assumption that networks make correct as well
as incorrect predictions. A model that makes mostly incorrect predictions and is less confident in its
few correct decisions than it is in its many erroneous decisions can end up with a comparably low
ECE. Therefore, ECE values for models with an accuracy below 50% are hard to interpret.
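For concreteness, the following is a minimal numpy sketch (not the authors' code) of the standard binned
estimate of $\mathrm{ECE}_p$ from eq. (1) together with the over- and under-confidence terms of eq. (3);
the bin count and the equal-width binning scheme are illustrative choices, not prescribed by the definition.

```python
import numpy as np

def ece_p(confidences, correct, p=1, n_bins=15):
    """Binned estimate of E[|z_hat - E[1_{y_hat=y} | z_hat]|^p]^(1/p) (eq. (1))."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        # |mean confidence - accuracy| in this bin, weighted by the bin's mass
        gap = abs(confidences[mask].mean() - correct[mask].mean())
        total += mask.mean() * gap ** p
    return total ** (1.0 / p)

def over_under_confidence(confidences, correct):
    """o(f) = E[z_hat | wrong prediction], u(f) = E[1 - z_hat | correct prediction] (eq. (3))."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    o = confidences[~correct].mean() if (~correct).any() else 0.0
    u = (1.0 - confidences[correct]).mean() if correct.any() else 0.0
    return o, u
```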
Most common CNNs are over-confident [31, 45, 57]. Moreover, the most commonly used activation
in modern CNNs [34, 39, 69, 73] remains the ReLU function, while it has been pointed out by Hein
et al. [35] that ReLUs cause a general increase in the models’ prediction confidences, regardless of
the prediction validity. This is also the case for the vast majority of the adversarially trained models
we consider, except for the model by [17], to which we devote particular attention.
Adversarial Attacks.
Adversarial attacks intentionally add perturbations to the input samples that are almost imperceptible
to the human eye, yet lead to (high-confidence) false predictions of the attacked model [25, 53, 74].
These attacks can be classified into two categories: white-box and black-box attacks. In black-box
attacks, the adversary has no knowledge of the model intrinsics [4] and can only query its output.
These attacks are often developed on surrogate models [10, 42, 78] to reduce interaction with the
attacked model in order to prevent threat detection. In general, though, these attacks are less powerful
due to their limited access to the target networks. In contrast, in white-box attacks, the adversary has
access to the full model, namely the architecture, weights, and gradient information [25, 44]. This
enables the attacker to perform extremely powerful attacks customized to the model. One of the
earliest approaches, the Fast Gradient Sign Method (FGSM) by [25], uses the sign of the prediction
gradient to perturb input samples in the direction of the gradient, thereby increasing the loss and
causing false predictions. This method was further adapted and improved by Projected Gradient
Descent (PGD) [44], DeepFool (DF) [53], Carlini and Wagner (CW) [5], and Decoupling Direction
and Norm (DDN) [65]. While FGSM is a single-step attack, meaning that the perturbation is computed
in one single gradient ascent step limited by some $\epsilon$-bound, multi-step attacks such as PGD
iteratively search for perturbations within the $\epsilon$-bound to change the model's prediction. These
attacks generally perform better but come at an increased computational cost. AutoAttack [14] is an
ensemble of different attacks, including an adaptive version of PGD, and has been proposed as a
baseline for adversarial robustness. In particular, it is used in robustness benchmarks such as
RobustBench [15].
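As an illustration of the single-step nature of FGSM described above, the following PyTorch sketch
implements the basic update $x_{\mathrm{adv}} = x + \epsilon \cdot \mathrm{sign}(\nabla_x \mathcal{L})$;
it is a simplified example, not the implementation of any cited work, and the value of $\epsilon$ is an
arbitrary placeholder.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8 / 255):
    """Single-step FGSM: perturb x in the direction of the sign of the loss gradient."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    # One gradient-ascent step bounded by eps, then clip back to the valid pixel range.
    x_adv = x_adv.detach() + eps * grad.sign()
    return x_adv.clamp(0.0, 1.0)
```

Multi-step attacks such as PGD essentially repeat a smaller version of this step several times while
projecting the perturbation back into the $\epsilon$-ball after each iteration.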
Adversarial Training and Robustness.
To improve robustness, adversarial training (AT) has proven to be quite successful on common
robustness benchmarks. Some attacks can be defended against simply by including their adversarial
examples in the training set [25, 65] or through an additional loss [22, 87]. Furthermore, the addition
of more training data, by using external data or data augmentation techniques such as the generation
of synthetic data, has been shown to be promising for more robust models [6, 26, 27, 62, 68, 80].
RobustBench [15] provides a leaderboard to study the improvements made by the aforementioned
approaches in a comparable manner in terms of their robust accuracy.
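To make the idea of adversarial training concrete, the following schematic PyTorch loop replaces each
clean mini-batch by adversarial examples crafted on the fly, here using the FGSM sketch from above
(multi-step PGD is the more common choice in the cited works). The names `model`, `loader`, and
`optimizer` are assumed to be defined elsewhere; this is a sketch, not a specific published training recipe.

```python
import torch
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, eps=8 / 255, device="cuda"):
    """One epoch of adversarial training: train on perturbed inputs instead of clean ones."""
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = fgsm(model, x, y, eps=eps)       # craft adversarial examples on the fly
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)  # standard loss on the adversarial batch
        loss.backward()
        optimizer.step()
```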