Goodfellow, Shlens, and Szegedy 2015; Madry et al. 2018). Such methods can degrade classification accuracy on clean data (Tsipras et al. 2019; Pang et al. 2022). To reduce the degradation in clean classification accuracy, TRADES (Zhang et al. 2019) was proposed to balance the trade-off between clean and robust accuracy. Recent works also study the effects of different hyperparameters (Pang et al. 2021; Huang et al. 2022a) and data augmentation (Rebuffi et al. 2021; Sehwag et al. 2022; Zhao et al. 2020) to reduce robust overfitting and prevent a drop in the model's robust accuracy. Beyond standard adversarial training, many works also study the impact of adversarial training on manifolds (Stutz, Hein, and Schiele 2019; Lin et al. 2020; Zhou, Liang, and Chen 2020; Patel et al. 2020). In contrast to these works, we introduce a novel defense that does not require adversarial training.
Adversarial Purification and Test-time Defense. As an
alternative to adversarial training, adversarial purification
aims to shift adversarial examples back to the represen-
tations of clean examples. Some efforts perform adversarial purification using GAN-based models (Samangouei, Kabkab, and Chellappa 2018), energy-based models (Grathwohl et al. 2020; Yoon, Hwang, and Lee 2021; Hill, Mitchell, and Zhu 2021), autoencoders (Hwang et al. 2019; Yin, Zhang, and Zuo 2022; Willetts et al. 2021; Gong et al. 2022; Meng and Chen 2017), and augmentations (Pérez et al. 2021; Shi, Holtz, and Mishne 2021; Mao et al. 2021), among others.
Song et al. (2018) discover that adversarial examples lie in
low-probability regions and they use PixelCNN to restore
the adversarial examples by shifting them back to high-
probability regions. Shi, Holtz, and Mishne (2021) and Mao et al. (2021) discover that adversarial attacks increase the self-supervised learning loss, and they define reverse vectors to purify the adversarial examples. Prior efforts (Athalye, Carlini, and Wagner 2018; Croce et al. 2022) have shown that methods such as Defense-GAN, PixelDefend (PixelCNN), SOAP (self-supervised), and autoencoder-based purification are vulnerable to Backward Pass Differentiable Approximation (BPDA) attacks (Athalye, Carlini, and
Wagner 2018). Recently, diffusion-based adversarial purifi-
cation methods have been studied (Nie et al. 2022; Xiao
et al. 2022) and show adversarial robustness against adap-
tive attacks such as BPDA. Lee and Kim (2023), however,
observe that the robustness of diffusion-based purification
drops significantly when evaluated with the surrogate gra-
dient designed for diffusion models. Similar to adversarial
purification, existing test-time defense techniques (Nayak,
Rawal, and Chakraborty 2022; Huang et al. 2022b) are also
vulnerable to adaptive white-box attacks. In this work, we
present a novel defense method combining manifold learning and variational inference, which achieves better performance than prior works and greater robustness against adaptive white-box attacks.
Methodology
In this section, we introduce an adversarial purification method based on the manifold hypothesis. We first define theoretical sufficient conditions that quantify the robustness of predictions. Then, we use variational inference to approximate these conditions in practice and achieve adversarial robustness without adversarial training.
Let D_XY be a set of clean images and their labels, where each image-label pair is denoted (x, y). The manifold hypothesis states that many real-world high-dimensional data x ∈ R^n lie on a low-dimensional manifold M diffeomorphic to R^m with m ≪ n. We define an encoder function f: R^n → R^m and a decoder function f†: R^m → R^n to form an autoencoder, where f maps a data point x ∈ R^n to a point f(x) ∈ R^m. For x ∈ M, f† and f are approximate inverses (see Appendix A for notation details).
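To make the roles of f and f† concrete, below is a minimal PyTorch-style sketch of such an autoencoder. The layer sizes, dimensions n and m, and the reconstruction loss are illustrative assumptions for exposition, not the exact architecture or training objective used in this paper.

```python
# A minimal sketch (illustrative only) of an autoencoder realizing
# f: R^n -> R^m and f_dagger: R^m -> R^n.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n: int = 784, m: int = 32):
        super().__init__()
        # Encoder f maps an image x in R^n to a latent code z in R^m.
        self.f = nn.Sequential(nn.Linear(n, 256), nn.ReLU(), nn.Linear(256, m))
        # Decoder f_dagger maps z back to R^n; on the data manifold M,
        # f_dagger(f(x)) should approximately recover x.
        self.f_dagger = nn.Sequential(nn.Linear(m, 256), nn.ReLU(), nn.Linear(256, n))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.f_dagger(self.f(x))

# A reconstruction loss encourages f and f_dagger to be approximate inverses on M.
x = torch.rand(8, 784)                 # a batch of flattened images
model = Autoencoder()
loss = ((model(x) - x) ** 2).mean()    # mean squared reconstruction error
```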
Problem Formulation
Let L = {1, ..., c} be a discrete label set of c classes and h: R^m → L be a classifier on the latent space. Given an image-label pair (x, y) ∈ D_XY, the encoder maps the image x to a lower-dimensional vector z = f(x) ∈ R^m, and the functions f and h form a classifier of the image space, y_pred = h(z) = (h ∘ f)(x). Generally, the classifier predicts labels consistent with the ground-truth labels such that y_pred = y. However, under adversarial attack, the adversary can generate a small adversarial perturbation δ_adv such that (h ∘ f)(x) ≠ (h ∘ f)(x + δ_adv). Thus, our purification framework aims to find a purified signal ϵ_pfy ∈ R^n such that (h ∘ f)(x) = (h ∘ f)(x + δ_adv + ϵ_pfy) = y. However, it is challenging to achieve ϵ_pfy = −δ_adv because δ_adv is unknown. Thus, we seek an alternative approach to estimate the purified signal ϵ_pfy and defend against attacks.
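As a concrete illustration of this notation, the sketch below composes a latent classifier h with the encoder f and shows where the purified signal ϵ_pfy enters the prediction. The `purify` routine is only a placeholder here; the actual estimation of ϵ_pfy via variational inference is developed in the following subsections, and the network shapes are assumptions for illustration.

```python
# Illustrative sketch of the composed classifier y_pred = (h ∘ f)(x) and of
# where a purified signal eps_pfy would be applied; `purify` is a placeholder.
import torch
import torch.nn as nn

n, m, c = 784, 32, 10
f = nn.Sequential(nn.Linear(n, 256), nn.ReLU(), nn.Linear(256, m))  # encoder f
h = nn.Linear(m, c)                                                  # latent classifier h

def classify(x: torch.Tensor) -> torch.Tensor:
    """y_pred = argmax over classes of (h ∘ f)(x)."""
    return h(f(x)).argmax(dim=-1)

def purify(x_adv: torch.Tensor) -> torch.Tensor:
    """Placeholder estimate of eps_pfy; the estimator derived later in this
    paper is based on variational inference, not implemented here."""
    return torch.zeros_like(x_adv)

x_adv = torch.rand(1, n)                  # x + delta_adv (delta_adv is unknown)
y_pred = classify(x_adv + purify(x_adv))  # prediction on the purified input
```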
Theoretical Foundation for Adversarial Robustness
Adversarial perturbations are usually ℓ_p-bounded, where p ∈ {0, 2, ∞}. We define the ℓ_p-norm of a vector a = [a_1, ..., a_n]⊺ as ∥a∥_p and a classifier of the image space as G: R^n → L. We follow Bastani et al. (2016) and Leino, Wang, and Fredrikson (2021) to define local robustness.
Definition 1. (Locally robust image classifier) Given an image-label pair (x, y) ∈ D_XY, a classifier G is (x, y, τ)-robust with respect to the ℓ_p-norm if, for every η ∈ R^n with ∥η∥_p ≤ τ, y = G(x) = G(x + η).
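For intuition, the hedged sketch below probes Definition 1 empirically for the ℓ_∞ case by sampling random perturbations with ∥η∥_∞ ≤ τ. Passing such a sampling check is only a necessary condition for (x, y, τ)-robustness, not a certificate, and the function name is ours rather than part of the paper's method.

```python
# Empirical (necessary, not sufficient) probe of Definition 1 for the l_inf norm:
# sample eta with ||eta||_inf <= tau and verify that G(x + eta) never changes.
import torch

def is_locally_robust_sample(G, x: torch.Tensor, y: int, tau: float,
                             num_samples: int = 100) -> bool:
    """G is assumed to map an input tensor to a predicted class index."""
    if int(G(x)) != y:                    # Definition 1 also requires y = G(x)
        return False
    for _ in range(num_samples):
        eta = torch.empty_like(x).uniform_(-tau, tau)   # ||eta||_inf <= tau
        if int(G(x + eta)) != y:
            return False
    return True
```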
Human vision is robust up to a certain perturbation bud-
get. For example, given a clean MNIST image x with pixel values in [0, 1], if ∥η∥_∞ ≤ 85/255, human vision will assign (x + η) and x to the same class (Madry et al. 2018). We use ρ_H(x, y) to represent the maximum perturbation budget for static human vision interpretations given an image-label pair (x, y). The exact ρ_H(x, y) is often a large value but difficult to estimate. We use it to represent the upper bound of achievable robustness.
Definition 2. (Human-level image classifier) For every image-label pair (x_i, y_i) ∈ D_XY, if a classifier G_R is (x_i, y_i, τ_i)-robust and τ_i ≜ ρ_H(x_i, y_i), where ρ_H(·, ·) represents the maximum perturbation budget for static human vision interpretations, we define such a classifier as a human-level image classifier.
The human-level image classifier G_R is an ideal classifier that is comparable to human vision. To construct such a G_R, we need a robust encoder f_R of the image space and a robust