We propose ADDMU, an uncertainty-estimation-based AED method. The key intuition is that adversarial examples lie off the manifold of the training data, so models are typically uncertain about their predictions on them. Thus, although the prediction probability is no longer a good uncertainty measure when adversarial examples are far from the model decision boundary, other statistical clues still reveal the 'uncertainty' in predictions and can be used to identify adversarial data. In this paper, we introduce two of them:
data uncertainty and model uncertainty. Data uncertainty is defined as the uncertainty of model predictions over neighbors of the input. Model uncertainty is defined as the prediction variance on the original input when applying Monte Carlo Dropout (MCD) (Gal and Ghahramani, 2016) to the target model during inference time. Previous work has shown that models trained with dropout regularization (Srivastava et al., 2014) approximate the inference in Bayesian neural networks with MCD, where model uncertainty is easy to obtain (Gal and Ghahramani, 2016; Smith and Gal, 2018).
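As a concrete illustration, the sketch below shows one way these two quantities could be computed for a Transformer classifier in PyTorch. The helper structure, the variance of the predicted-class probability as the model-uncertainty summary, and the entropy over neighbor predictions as the data-uncertainty summary are illustrative assumptions, not necessarily the exact statistics used in ADDMU.

```python
import torch

def model_uncertainty_mcd(model, inputs, n_samples=20):
    """Prediction variance under Monte Carlo Dropout (Gal and Ghahramani, 2016).

    Assumes `model` is a torch classifier whose dropout layers are activated by
    `model.train()` and whose output has a `.logits` field (Hugging Face style).
    The variance of the predicted-class probability across dropout samples is an
    illustrative uncertainty summary.
    """
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(**inputs).logits, dim=-1) for _ in range(n_samples)]
        )  # (n_samples, batch, n_classes)
    model.eval()
    pred = probs.mean(0).argmax(-1)  # MC-averaged prediction per example
    idx = torch.arange(probs.size(1), device=probs.device)
    return probs[:, idx, pred].var(0)  # variance over dropout samples


def data_uncertainty(model, neighbor_inputs):
    """Uncertainty of model predictions over perturbed neighbors of one input.

    `neighbor_inputs` is a batch of neighbors of a single example (e.g. produced
    by a hypothetical word-dropout or synonym-substitution helper); the entropy
    of the averaged prediction is one simple way to summarize disagreement.
    """
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(**neighbor_inputs).logits, dim=-1)
    mean_p = probs.mean(0)
    return -(mean_p * mean_p.clamp_min(1e-12).log()).sum()
```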
Given the statistics of the two uncertainties, we apply p-value normalization (Raghuram et al., 2021) and combine them with Fisher's method (Fisher, 1992) to produce a stronger test statistic for AED. To the best of our knowledge, ours is the first work to estimate the uncertainty of Transformer-based models (Shelmanov et al., 2021) for AED.
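The combination step can be sketched as follows, assuming each uncertainty score has first been converted into an empirical p-value against scores computed on clean auxiliary data; the empirical estimator below is a standard choice and may differ in detail from the procedure of Raghuram et al. (2021).

```python
import numpy as np
from scipy import stats

def empirical_p_value(score, clean_scores):
    """Fraction of clean (auxiliary) scores at least as extreme as `score`.

    Small p-values indicate the example is more uncertain than almost all
    clean data; the +1 terms avoid exact zeros.
    """
    clean_scores = np.asarray(clean_scores)
    return (np.sum(clean_scores >= score) + 1) / (len(clean_scores) + 1)

def fisher_combine(p_values):
    """Fisher's method: -2 * sum(log p_i) follows a chi-squared distribution
    with 2k degrees of freedom under the null hypothesis of clean data."""
    p_values = np.asarray(p_values)
    statistic = -2.0 * np.log(p_values).sum()
    combined_p = stats.chi2.sf(statistic, df=2 * len(p_values))
    return statistic, combined_p

# Usage sketch: flag an example as adversarial when the combined p-value is small.
# p_model = empirical_p_value(model_unc, clean_model_uncs)
# p_data = empirical_p_value(data_unc, clean_data_uncs)
# test_statistic, p = fisher_combine([p_model, p_data])
```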
The advantages of our proposed AED method include: 1) it only operates on the output level of the model; 2) it requires little to no modification to adapt to different architectures; 3) it provides a unified way to combine different types of uncertainties. Experimental results on four datasets, four attacks, and two models demonstrate that our method outperforms existing methods by 3.6 and 6.0 AUC points on the regular and far-boundary (FB) cases, respectively. We also show that the two uncertainty statistics can be used to characterize adversarial data and to select useful data for another defense technique, adversarial data augmentation (ADA).
The code for this paper can be found at https://github.com/uclanlp/AdvExDetection-ADDMU.
2 A Diagnostic Study on AED Methods
In this section, we first describe the formulation of adversarial examples and AED. Then, we show that current AED methods perform well mainly at detecting adversarial examples near the decision boundary, but are confused by FB adversarial examples.
2.1 Formulation
Adversarial Examples.
Given an NLP model $f: \mathcal{X} \rightarrow \mathcal{Y}$, a textual input $x \in \mathcal{X}$, a predicted class $y \in \mathcal{Y}$ among the candidate classes, and a set of Boolean indicator functions of constraints $C_i: \mathcal{X} \times \mathcal{X} \rightarrow \{0, 1\}$, $i = 1, 2, \ldots, n$, an (untargeted) adversarial example $x^* \in \mathcal{X}$ satisfies
$$f(x^*) \neq f(x), \qquad C_i(x, x^*) = 1, \; i = 1, 2, \ldots, n.$$
Constraints typically enforce grammaticality or semantic similarity between the original and adversarial data. For example, Jin et al. (2020) conduct part-of-speech checks and use the Universal Sentence Encoder (Cer et al., 2018) to ensure semantic similarity between the two sentences.
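For concreteness, a minimal sketch of this definition follows; `predict` and the constraint callables (e.g., a part-of-speech check or a Universal Sentence Encoder similarity threshold) are hypothetical placeholders rather than the exact implementation of Jin et al. (2020).

```python
from typing import Callable, List

# Hypothetical constraint type: returns True when (x, x_adv) satisfies the constraint.
Constraint = Callable[[str, str], bool]

def is_adversarial(predict: Callable[[str], int],
                   x: str, x_adv: str,
                   constraints: List[Constraint]) -> bool:
    """Untargeted adversarial example: the predicted label flips and every constraint holds."""
    label_flipped = predict(x_adv) != predict(x)
    constraints_hold = all(c(x, x_adv) for c in constraints)
    return label_flipped and constraints_hold

# e.g. constraints = [pos_preserved, lambda a, b: use_similarity(a, b) > similarity_threshold]
# where `pos_preserved` and `use_similarity` are hypothetical helpers standing in for the
# part-of-speech check and the Universal Sentence Encoder similarity mentioned above.
```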
Adversarial Examples Detection (AED).
The task of AED is to distinguish adversarial examples from natural ones, based on certain characteristics of adversarial data. We assume access to 1) the victim model $f$, trained and tested on clean datasets $D_{\text{train}}$ and $D_{\text{test}}$; 2) an evaluation set $D_{\text{eval}}$; and 3) an auxiliary dataset $D_{\text{aux}}$ containing only clean data. $D_{\text{eval}}$ contains an equal number of adversarial examples $D_{\text{eval-adv}}$ and natural examples $D_{\text{eval-nat}}$. $D_{\text{eval-nat}}$ is randomly sampled from $D_{\text{test}}$, and $D_{\text{eval-adv}}$ is generated by attacking a set of samples from $D_{\text{test}}$ that is disjoint from $D_{\text{eval-nat}}$. See Scenario 1 in Yoo et al. (2022) for details. We use a subset of $D_{\text{train}}$ as $D_{\text{aux}}$. We adopt an unsupervised setting, i.e., the AED method is not trained on any dataset that contains adversarial examples.
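Detection quality on $D_{\text{eval}}$ is summarized by the AUC of the detector score in our experiments; below is a minimal sketch of this evaluation, assuming the detector assigns higher scores to examples it believes are adversarial.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_detector(scores_nat, scores_adv):
    """AUC of a detector over the evaluation set.

    `scores_nat` and `scores_adv` are the detector scores on D_eval-nat and
    D_eval-adv; the equal-sized split mirrors the setting described above.
    """
    labels = np.concatenate([np.zeros(len(scores_nat)), np.ones(len(scores_adv))])
    scores = np.concatenate([scores_nat, scores_adv])
    return roc_auc_score(labels, scores)
```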
2.2 Diagnose AED Methods
We define examples near model decision boundaries to be those whose output probabilities for the predicted class and the second-largest class are close. Regular iterative adversarial attacks stop once the prediction is changed. Therefore, we suspect that regular attacks mostly generate adversarial examples near the boundaries, and that existing AED methods could rely on this property to detect adversarial examples.
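A small sketch of this notion of margin follows; the threshold for calling an example "near the boundary" is arbitrary and shown only for illustration.

```python
import torch

def boundary_margin(probs: torch.Tensor) -> torch.Tensor:
    """Gap between the predicted-class probability and the runner-up.

    `probs` has shape (batch, n_classes); small margins indicate examples near
    the model decision boundary in the sense defined above.
    """
    top2 = probs.topk(2, dim=-1).values
    return top2[:, 0] - top2[:, 1]

# e.g. near_boundary = boundary_margin(probs) < 0.1  # illustrative threshold only
```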
Figure 1 verifies this for the state-of-the-art unsupervised AED method (Yoo et al., 2022) in NLP, denoted as RDE. Similar trends are observed for another baseline. The X-axis shows two attack methods: TextFooler (Jin et al., 2020) and Pruthi (Pruthi et al., 2019). The Y-axis represents the probability