
computationally expensive branch-and-bound search. One can also adopt a composition of certified architectures to improve both the natural and the adversarial accuracy of the resulting model (Müller et al., 2021; Horváth et al., 2022).
Another line of work for enhancing the performance of
certifiably robust neural networks relies on the idea of learn-
ing a detector alongside the classifier to capture adversar-
ial samples. Instead of trying to classify adversarial im-
ages correctly, these works design a detector to determine
whether a given sample is natural/in-distribution or a crafted
attack/out-of-distribution. Chen et al. (2020) train a detector on both in-distribution and out-of-distribution samples to distinguish between the two. Hendrycks and Gimpel (2016) develop a method based on the simple observation that, for natural samples, the softmax output entries are closer to $0$ or $1$, whereas for out-of-distribution and adversarial examples they are distributed more uniformly. DeVries and Taylor (2018); Sheikholeslami et al. (2020); Stutz et al. (2020) learn uncertainty regions around natural samples within which the network prediction remains unchanged; interestingly, this approach does not require out-of-distribution samples during training. Other approaches, such as deep generative models (Ren et al., 2019) and self-supervised and ensemble methods (Vyas et al., 2018; Chen et al., 2021b), have also been used to detect out-of-distribution samples. However, these methods are typically vulnerable to adversarial attacks and can be easily fooled by carefully designed out-of-distribution images (Fort, 2022), as discussed in Tramer (2022). A more resilient approach
is to jointly learn the detector and the classifier (Laidlaw
and Feizi, 2019; Sheikholeslami et al., 2021; Chen et al.,
2021a) by adding an auxiliary abstain output class capturing
adversarial samples.
Building on these prior works, this paper develops a frame-
work for detecting adversarial examples using multiple ab-
stain classes. We observe that naïvely adding multiple ab-
stain classes (in the existing framework of Sheikholeslami
et al. (2021)) results in a model degeneracy phenomenon
where all adversarial examples are assigned to a small frac-
tion of abstain classes (while other abstain classes are not
utilized). To resolve this issue, we propose a novel regular-
izer and a training procedure to balance the assignment of
adversarial examples to abstain classes. Our experiments
demonstrate that utilizing multiple abstain classes in con-
junction with the proper regularization enhances the robust
verified accuracy on adversarial examples while maintaining
the standard accuracy of the classifier.
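To illustrate the kind of balancing penalty one could use, the sketch below maximizes the entropy of the batch-averaged abstain-class distribution on adversarial inputs. This is a generic illustration only, not necessarily the regularizer proposed in this paper; the function name, the PyTorch setting, and the convention that abstain classes occupy the last output slots are all assumptions.

```python
import torch
import torch.nn.functional as F

def abstain_balance_penalty(logits: torch.Tensor, num_abstain: int) -> torch.Tensor:
    """Illustrative balancing penalty (not necessarily the paper's regularizer).

    Encourages adversarial examples to spread over all abstain classes by
    maximizing the entropy of the batch-averaged abstain-class distribution.
    `logits` has shape (batch, num_real_classes + num_abstain); abstain
    classes are assumed to occupy the last `num_abstain` output slots.
    """
    probs = F.softmax(logits, dim=-1)
    abstain = probs[:, -num_abstain:]                        # (batch, num_abstain)
    abstain = abstain / abstain.sum(dim=-1, keepdim=True).clamp_min(1e-12)
    avg = abstain.mean(dim=0)                                # (num_abstain,)
    entropy = -(avg * avg.clamp_min(1e-12).log()).sum()
    return -entropy   # adding this to the loss pushes abstain usage toward balance
```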
Challenges and Contribution.
We propose a framework for training and verifying robust neural networks with multiple detection classes. The resulting optimization problems for training and verifying such networks are constrained min-max problems over a probability simplex, which are more challenging from an optimization perspective than those associated with networks having no or a single detection class. We devise an efficient algorithm for this problem. Furthermore, having multiple detectors leads to the “model degeneracy” phenomenon, where not all detection classes are utilized. To prevent model degeneracy and to avoid tuning the number of network detectors, we introduce a regularization mechanism guaranteeing that all detectors contribute to detecting adversarial examples to the extent possible. We propose convergent algorithms for the verification (and training) problems using proximal gradient descent with Bregman divergence. Compared to networks with a single detection class, our experiments show that we enhance the robust verified accuracy by more than $5\%$ and $2\%$ on the CIFAR-10 and MNIST datasets, respectively, for various perturbation sizes.
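To give intuition for the proximal/Bregman approach on the simplex: when the Bregman divergence is chosen as the KL divergence, each proximal step has a closed-form multiplicative (exponentiated-gradient) update followed by renormalization. The sketch below shows that generic update, not our exact verification algorithm; `grad_fn`, the fixed step size, and the toy objective are illustrative assumptions.

```python
import numpy as np

def entropic_mirror_ascent(grad_fn, dim, step_size=0.1, iters=200):
    """Mirror ascent over the probability simplex with the KL (entropy)
    Bregman divergence: each proximal step reduces to a closed-form
    multiplicative update followed by renormalization."""
    lam = np.full(dim, 1.0 / dim)              # start from the uniform point
    for _ in range(iters):
        g = grad_fn(lam)                       # gradient of the inner objective
        lam = lam * np.exp(step_size * g)      # multiplicative (mirror) step
        lam /= lam.sum()                       # Bregman projection = renormalization
    return lam

# Toy usage: maximize <c, lam> - ||lam||^2 over the 4-dimensional simplex.
c = np.array([1.0, 2.0, 0.5, 1.5])
lam_star = entropic_mirror_ascent(lambda lam: c - 2.0 * lam, dim=4)
```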
Roadmap.
In Section 2, we review interval bound propagation (IBP) and $\beta$-crown as two existing efficient methods for verifying the performance of multi-layer neural networks against adversarial attacks. We discuss how to train and verify joint classifier and detector networks (with a single abstain class) based on these two approaches. Section 3 is dedicated to the motivation and procedure of joint verification and classification of neural networks with multiple abstain classes. In particular, we extend the IBP and $\beta$-crown verification procedures to networks with multiple detection classes. In Section 4, we show how to train neural networks with multiple detection classes via the IBP procedure. However, we show that the performance of the trained network cannot be improved by merely increasing the number of detection classes, due to “model degeneracy” (a phenomenon that occurs when multiple detectors behave very similarly and identify the same adversarial examples). To avoid model degeneracy and to automatically/implicitly tune the number of detection classes, we introduce a regularization mechanism that ensures all detection classes are used in a balanced manner.
2 Background
2.1 Verification of feedforward neural networks
Consider an $L$-layer feedforward neural network with $\{W_i, b_i\}$ denoting the weight and bias parameters associated with layer $i$, and let $\sigma_i(\cdot)$ denote the activation function applied at layer $i$. Throughout the paper, we assume the activation function is the same for all hidden layers, i.e., $\sigma_i(\cdot) = \sigma(\cdot) = \mathrm{ReLU}(\cdot)$ for all $i = 1, \dots, L-1$. Thus, our neural network can be described as
$$z_i = \sigma(W_i z_{i-1} + b_i) \quad \forall i \in [L-1], \qquad z_L = W_L z_{L-1} + b_L,$$
where $z_0 = x$ is the input to the neural network, $z_i$ is the output of layer $i$, and $[N]$ denotes the set $\{1, \dots, N\}$. Note that the activation function is not applied at the last layer. Further, we use $[z]_i$ to denote the $i$-th element of the vector $z$. We consider a supervised classification task where $z_L$ represents the logits. To explicitly show the dependence of $z_L$ on the input data, we use the notation $z_L(x)$ to denote the logit values when $x$ is used as the input data point.
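For concreteness, the forward computation of $z_L(x)$ defined above can be sketched as follows; this is a minimal NumPy illustration, and the function name, parameter layout, and toy shapes are ours rather than part of the formal development.

```python
import numpy as np

def forward_logits(x, weights, biases):
    """Compute z_L(x): z_i = ReLU(W_i z_{i-1} + b_i) for i in [L-1],
    and z_L = W_L z_{L-1} + b_L (no activation at the output layer)."""
    z = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = np.maximum(W @ z + b, 0.0)         # hidden layers: affine + ReLU
    return weights[-1] @ z + biases[-1]        # output layer: raw logits

# Toy usage: input dimension 4, one hidden layer of width 8, 3 logits.
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((8, 4)), rng.standard_normal((3, 8))]
bs = [rng.standard_normal(8), rng.standard_normal(3)]
z_L_of_x = forward_logits(rng.standard_normal(4), Ws, bs)
```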