Adversarial Purification with the Manifold Hypothesis
Zhaoyuan Yang1, Zhiwei Xu2, Jing Zhang2, Richard Hartley2, Peter Tu1
1GE Research 2Australian National University
{zhaoyuan.yang,tu}@ge.com, {zhiwei.xu,jing.zhang,richard.hartley}@anu.edu.au
Abstract
In this work, we formulate a novel framework for adversarial robustness using the manifold hypothesis. This framework provides sufficient conditions for defending against adversarial examples. We develop an adversarial purification method with this framework. Our method combines manifold learning with variational inference to provide adversarial robustness without the need for expensive adversarial training. Experimentally, our approach can provide adversarial robustness even if attackers are aware of the existence of the defense. In addition, our method can also serve as a test-time defense mechanism for variational autoencoders.
Introduction
State-of-the-art neural network models are known to be vulnerable to adversarial examples. With small perturbations, adversarial examples can completely change the predictions of neural networks (Szegedy et al. 2014). Defense methods are designed to produce models that are robust to adversarial attacks. Common defense methods include adversarial training (Madry et al. 2018), certified robustness (Wong and Kolter 2018), etc. Recently, adversarial purification has drawn increasing attention (Croce et al. 2022); it purifies adversarial examples at test time and thus requires fewer training resources.

Existing adversarial purification methods achieve superior performance when attackers are not aware of the existence of the defense; however, their performance drops significantly when attackers mount defense-aware or adaptive attacks (Croce et al. 2022). Moreover, most of them are empirical, with limited theoretical justification. In contrast, we adopt ideas from certified robustness and build an adversarial purification method with a theoretical foundation.
Specifically, our adversarial purification method is based on the assumption that high-dimensional images lie on low-dimensional manifolds (the manifold hypothesis). Compared with low-dimensional data, high-dimensional data are more vulnerable to adversarial examples (Goodfellow, Shlens, and Szegedy 2015). We therefore transform the adversarial robustness problem from the high-dimensional image domain to a low-dimensional image manifold domain.
Figure 1: Adversarial purification against adversarial attacks. (a) Clean, adversarial (adv.), and purified images. (b) Joint learning of the variational autoencoder and the classifier to achieve semantic consistency. (c) Applying semantic consistency between predictions and reconstructions to defend against attacks.
We then present a novel adversarial purification method for non-adversarially trained models via manifold learning and variational inference (see Figure 1 for the pipeline). With our method, non-adversarially trained models can achieve performance on par with that of adversarially trained models. Even if attackers are aware of the existence of the defense, our approach can still provide adversarial robustness against attacks.
Our method is significant in that it introduces the manifold hypothesis into the adversarial defense framework. We improve a model's adversarial robustness on a more interpretable low-dimensional image manifold rather than in the complex high-dimensional image space. Meanwhile, we provide theoretical conditions that quantify the robustness of the predictions. We also present an effective adversarial purification approach combining manifold learning and variational inference, which achieves reliable performance against adaptive attacks without adversarial training. Finally, we demonstrate that our method can also improve the robustness of adversarially trained models.
Related Work
Adversarial Training. Adversarial training is one of the most effective adversarial defense methods; it incorporates adversarial examples into the training set (Goodfellow, Shlens, and Szegedy 2015; Madry et al. 2018). Such a method can degrade classification accuracy on clean data (Tsipras et al. 2019; Pang et al. 2022). To reduce this degradation, TRADES (Zhang et al. 2019) balances the trade-off between clean and robust accuracy. Recent works also study the effects of different hyperparameters (Pang et al. 2021; Huang et al. 2022a) and data augmentation (Rebuffi et al. 2021; Sehwag et al. 2022; Zhao et al. 2020) to reduce robust overfitting and avoid decreases in robust accuracy. Besides standard adversarial training, many works also study the impact of adversarial training on manifolds (Stutz, Hein, and Schiele 2019; Lin et al. 2020; Zhou, Liang, and Chen 2020; Patel et al. 2020). Different from these works, we introduce a novel defense that requires no adversarial training.
Adversarial Purification and Test-time Defense. As an alternative to adversarial training, adversarial purification aims to shift adversarial examples back to the representations of clean examples. Some efforts perform adversarial purification using GAN-based models (Samangouei, Kabkab, and Chellappa 2018), energy-based models (Grathwohl et al. 2020; Yoon, Hwang, and Lee 2021; Hill, Mitchell, and Zhu 2021), autoencoders (Hwang et al. 2019; Yin, Zhang, and Zuo 2022; Willetts et al. 2021; Gong et al. 2022; Meng and Chen 2017), augmentations (Pérez et al. 2021; Shi, Holtz, and Mishne 2021; Mao et al. 2021), etc. Song et al. (2018) discover that adversarial examples lie in low-probability regions and use PixelCNN to restore adversarial examples by shifting them back to high-probability regions. Shi, Holtz, and Mishne (2021) and Mao et al. (2021) discover that adversarial attacks increase the loss of self-supervised learning and define reverse vectors to purify the adversarial examples. Prior efforts (Athalye, Carlini, and Wagner 2018; Croce et al. 2022) have shown that methods such as Defense-GAN, PixelDefend (PixelCNN), SOAP (self-supervised), and autoencoder-based purification are vulnerable to Backward Pass Differentiable Approximation (BPDA) attacks (Athalye, Carlini, and Wagner 2018). Recently, diffusion-based adversarial purification methods have been studied (Nie et al. 2022; Xiao et al. 2022) and show adversarial robustness against adaptive attacks such as BPDA. Lee and Kim (2023), however, observe that the robustness of diffusion-based purification drops significantly when evaluated with a surrogate gradient designed for diffusion models. Similar to adversarial purification, existing test-time defense techniques (Nayak, Rawal, and Chakraborty 2022; Huang et al. 2022b) are also vulnerable to adaptive white-box attacks. In this work, we present a novel defense method combining manifold learning and variational inference, which achieves better performance than prior works and greater robustness against adaptive white-box attacks.
Methodology
In this section, we introduce an adversarial purification method with the manifold hypothesis. We first define sufficient conditions (in theory) to quantify the robustness of predictions. Then, we use variational inference to approximate such conditions in implementation and achieve adversarial robustness without adversarial training.
Let $\mathcal{D}_{XY}$ be a set of clean images and their labels, where each image-label pair is denoted $(\mathbf{x}, y)$. The manifold hypothesis states that many real-world high-dimensional data $\mathbf{x} \in \mathbb{R}^n$ lie on a low-dimensional manifold $\mathcal{M}$ diffeomorphic to $\mathbb{R}^m$ with $m \ll n$. We define an encoder function $f: \mathbb{R}^n \to \mathbb{R}^m$ and a decoder function $f^{\dagger}: \mathbb{R}^m \to \mathbb{R}^n$ to form an autoencoder, where $f$ maps a data point $\mathbf{x} \in \mathbb{R}^n$ to a point $f(\mathbf{x}) \in \mathbb{R}^m$. For $\mathbf{x} \in \mathcal{M}$, $f$ and $f^{\dagger}$ are approximate inverses (see Appendix A for notation details).
Problem Formulation
Let $\mathcal{L} = \{1, \dots, c\}$ be a discrete label set of $c$ classes and $h: \mathbb{R}^m \to \mathcal{L}$ be a classifier on the latent space. Given an image-label pair $(\mathbf{x}, y) \in \mathcal{D}_{XY}$, the encoder maps the image $\mathbf{x}$ to a lower-dimensional vector $\mathbf{z} = f(\mathbf{x}) \in \mathbb{R}^m$, and the functions $f$ and $h$ form a classifier of the image space, $y_{\mathrm{pred}} = h(\mathbf{z}) = (h \circ f)(\mathbf{x})$. Generally, the classifier predicts labels consistent with the ground-truth labels such that $y_{\mathrm{pred}} = y$. However, during adversarial attacks, the adversary can generate a small adversarial perturbation $\boldsymbol{\delta}_{\mathrm{adv}}$ such that $(h \circ f)(\mathbf{x}) \neq (h \circ f)(\mathbf{x} + \boldsymbol{\delta}_{\mathrm{adv}})$. Thus, our purification framework aims to find a purified signal $\boldsymbol{\epsilon}_{\mathrm{pfy}} \in \mathbb{R}^n$ such that $(h \circ f)(\mathbf{x}) = (h \circ f)(\mathbf{x} + \boldsymbol{\delta}_{\mathrm{adv}} + \boldsymbol{\epsilon}_{\mathrm{pfy}}) = y$. However, it is challenging to achieve $\boldsymbol{\epsilon}_{\mathrm{pfy}} = -\boldsymbol{\delta}_{\mathrm{adv}}$ because $\boldsymbol{\delta}_{\mathrm{adv}}$ is unknown. Thus, we seek an alternative approach to estimate the purified signal $\boldsymbol{\epsilon}_{\mathrm{pfy}}$ and defend against attacks.
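To make the notation concrete, the sketch below instantiates $f$, $f^{\dagger}$, and $h$ as small PyTorch modules and states the purification objective in code. The architectures, dimensions, and module names here are illustrative assumptions for exposition, not the models used in the paper.

```python
# Minimal sketch of the notation: encoder f, decoder f_dag (for f†), latent
# classifier h, and the composed image classifier h ∘ f. Architectures and
# dimensions (MNIST-like) are illustrative only.
import torch
import torch.nn as nn

n, m, c = 28 * 28, 16, 10  # image dimension n, latent dimension m (m << n), c classes

f = nn.Sequential(nn.Flatten(), nn.Linear(n, 256), nn.ReLU(), nn.Linear(256, m))      # encoder f: R^n -> R^m
f_dag = nn.Sequential(nn.Linear(m, 256), nn.ReLU(), nn.Linear(256, n), nn.Sigmoid())  # decoder f†: R^m -> R^n
h = nn.Sequential(nn.Linear(m, 64), nn.ReLU(), nn.Linear(64, c))                      # latent classifier h: R^m -> L

def predict(x):
    """y_pred = (h ∘ f)(x); returns predicted class indices."""
    return h(f(x)).argmax(dim=-1)

def reconstruction_error(x, p=2):
    """||x - (f† ∘ f)(x)||_p, computed per sample."""
    return (x.flatten(start_dim=1) - f_dag(f(x))).norm(p=p, dim=-1)

# Purification objective (conceptually): given x_adv = x + delta_adv, find a
# small eps_pfy such that (h ∘ f)(x_adv + eps_pfy) recovers the clean prediction.
# Since delta_adv is unknown, the method instead searches for an eps_pfy that
# drives the reconstruction error of x_adv + eps_pfy down to the clean-data level.
```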
Theoretical Foundation for Adversarial Robustness
The adversarial perturbation is usually $\ell_p$-bounded, where $p \in \{0, 2, \infty\}$. We define the $\ell_p$-norm of a vector $\mathbf{a} = [a_1, \dots, a_n]$ as $\|\mathbf{a}\|_p$ and a classifier of the image space as $G: \mathbb{R}^n \to \mathcal{L}$. We follow Bastani et al. (2016) and Leino, Wang, and Fredrikson (2021) to define local robustness.

Definition 1. (Locally robust image classifier) Given an image-label pair $(\mathbf{x}, y) \in \mathcal{D}_{XY}$, a classifier $G$ is $(\mathbf{x}, y, \tau)$-robust with respect to the $\ell_p$-norm if for every $\boldsymbol{\eta} \in \mathbb{R}^n$ with $\|\boldsymbol{\eta}\|_p \leq \tau$, $y = G(\mathbf{x}) = G(\mathbf{x} + \boldsymbol{\eta})$.
Human vision is robust up to a certain perturbation budget. For example, given a clean MNIST image $\mathbf{x}$ with pixel values in $[0, 1]$, if $\|\boldsymbol{\eta}\|_\infty \leq 85/255$, human vision will assign $(\mathbf{x} + \boldsymbol{\eta})$ and $\mathbf{x}$ to the same class (Madry et al. 2018). We use $\rho_H(\mathbf{x}, y)$ to represent the maximum perturbation budget for static human vision interpretations given an image-label pair $(\mathbf{x}, y)$. The exact $\rho_H(\mathbf{x}, y)$ is often a large value but difficult to estimate. We use it to represent the upper bound of achievable robustness.
Definition 2. (Human-level image classifier) For every image-label pair $(\mathbf{x}_i, y_i) \in \mathcal{D}_{XY}$, if a classifier $G_R$ is $(\mathbf{x}_i, y_i, \tau_i)$-robust and $\tau_i \geq \rho_H(\mathbf{x}_i, y_i)$, where $\rho_H(\cdot,\cdot)$ represents the maximum perturbation budget for static human vision interpretations, we define such a classifier as a human-level image classifier.
The human-level image classifier $G_R$ is an ideal classifier that is comparable to human vision. To construct such a $G_R$, we need a robust encoder $f_R$ of the image space and a robust classifier $h_R$ on the manifold to form $G_R = h_R \circ f_R$. However, it is challenging to construct a robust encoder $f_R$ due to the high-dimensional image space. Therefore, we aim to find an alternative solution that enhances the robustness of $h \circ f$ against adversarial attacks by enforcing semantic consistency between the decoder $f^{\dagger}$ and the classifier $h$. Both functions ($f^{\dagger}$ and $h$) take inputs from a lower-dimensional space (compared with the encoder); thus, they are more reliable (Goodfellow, Shlens, and Szegedy 2015).
We define a semantically consistent classifier on the manifold as $h_S: \mathbb{R}^m \to \mathcal{L}$, which yields a class prediction $h_S(\mathbf{z})$ given a latent vector $\mathbf{z} \in \mathbb{R}^m$.

Definition 3. (Semantically consistent classifier on the manifold $\mathcal{M}$) A semantically consistent classifier $h_S$ on the manifold $\mathcal{M}$ satisfies the following condition: for all $\mathbf{z} \in \mathbb{R}^m$, $h_S(\mathbf{z}) = (G_R \circ f^{\dagger})(\mathbf{z})$.
A classifier (on the manifold) is a semantically consistent classifier if its predictions are consistent with the semantic interpretations of the images reconstructed by the decoder. While this definition uses the human-level image classifier $G_R$, we can use a Bayesian method to approximate $h_S$ without using $G_R$ in experiments. Below, we provide sufficient conditions for the adversarial robustness of $h_S \circ f$ given an input $\mathbf{x}$, where the encoder $f$ is not adversarially robust.
Proposition 1. Let $(\mathbf{x}, y)$ be an image-label pair from $\mathcal{D}_{XY}$ and the human-level image classifier $G_R$ be $(\mathbf{x}, y, \tau)$-robust. If the encoder $f$ and the decoder $f^{\dagger}$ are approximately invertible for the given $\mathbf{x}$ such that the reconstruction error $\|\mathbf{x} - (f^{\dagger} \circ f)(\mathbf{x})\|_p \leq \kappa \leq \tau$ (sufficient condition), then there exists a function $F: \mathbb{R}^n \to \mathbb{R}^n$ such that $(h_S \circ f \circ F)$ is $(\mathbf{x}, y, \frac{\tau - \kappa}{2})$-robust. (See Appendix B)
The function $F$ is considered to be the purifier against adversarial attacks. We construct such a function based on reconstruction errors. We assume the sufficient condition holds (bounded reconstruction error $\kappa$ for clean inputs). Lemma 1 states that adversarial attacks on a semantically consistent classifier lead to reconstruction errors larger than $\kappa$ (abnormal reconstructions on adversarial examples).

Lemma 1. If an adversarial example $\mathbf{x}_{\mathrm{adv}} = \mathbf{x} + \boldsymbol{\delta}_{\mathrm{adv}}$ with $\|\boldsymbol{\delta}_{\mathrm{adv}}\|_p \leq \frac{\tau - \kappa}{2}$ causes $(h_S \circ f)(\mathbf{x}_{\mathrm{adv}}) \neq G_R(\mathbf{x}_{\mathrm{adv}})$, then $\|\mathbf{x}_{\mathrm{adv}} - (f^{\dagger} \circ f)(\mathbf{x}_{\mathrm{adv}})\|_p > \frac{\tau + \kappa}{2} \geq \kappa$. (See Appendix B)
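Lemma 1 suggests a simple diagnostic: under the stated assumptions, an attack that fools the semantically consistent classifier must inflate the reconstruction error above $\kappa$, so thresholding that error flags such inputs. The sketch below illustrates the check; calibrating the threshold from a high quantile of clean reconstruction errors is our illustrative stand-in for $\kappa$, not a procedure prescribed by the paper.

```python
# Illustrative check based on Lemma 1: inputs whose reconstruction error
# exceeds the clean-data bound kappa are flagged as abnormal.
import torch

@torch.no_grad()
def calibrate_kappa(encoder, decoder, clean_loader, p=2, quantile=0.99):
    """Estimate kappa as a high quantile of reconstruction errors on clean data."""
    errs = []
    for x, _ in clean_loader:
        errs.append((x.flatten(1) - decoder(encoder(x))).norm(p=p, dim=-1))
    return torch.cat(errs).quantile(quantile).item()

@torch.no_grad()
def flag_abnormal(encoder, decoder, x, kappa, p=2):
    """Return a boolean mask: True where the reconstruction error exceeds kappa."""
    err = (x.flatten(1) - decoder(encoder(x))).norm(p=p, dim=-1)
    return err > kappa
```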
To defend against the attacks, we need to reduce the reconstruction error. Theorem 1 states that if a purified sample $\mathbf{x}_{\mathrm{pfy}} = \mathbf{x}_{\mathrm{adv}} + \boldsymbol{\epsilon}_{\mathrm{pfy}}$ has a reconstruction error no larger than $\kappa$, the prediction $(h_S \circ f)(\mathbf{x}_{\mathrm{pfy}})$ will be the same as the prediction $G_R(\mathbf{x})$.

Theorem 1. If a purified signal $\boldsymbol{\epsilon}_{\mathrm{pfy}} \in \mathbb{R}^n$ with $\|\boldsymbol{\epsilon}_{\mathrm{pfy}}\|_p \leq \frac{\tau - \kappa}{2}$ ensures that $\|(\mathbf{x}_{\mathrm{adv}} + \boldsymbol{\epsilon}_{\mathrm{pfy}}) - (f^{\dagger} \circ f)(\mathbf{x}_{\mathrm{adv}} + \boldsymbol{\epsilon}_{\mathrm{pfy}})\|_p \leq \kappa$, then $(h_S \circ f)(\mathbf{x}_{\mathrm{adv}} + \boldsymbol{\epsilon}_{\mathrm{pfy}}) = G_R(\mathbf{x})$. (See Appendix B)
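The full proof is in Appendix B; our reading of the triangle-inequality argument behind Theorem 1 is sketched below, assuming (as in Lemma 1) that $\|\boldsymbol{\delta}_{\mathrm{adv}}\|_p \leq \frac{\tau - \kappa}{2}$ and writing $\mathbf{e} = (f^{\dagger} \circ f)(\mathbf{x}_{\mathrm{pfy}}) - \mathbf{x}_{\mathrm{pfy}}$ for the residual of the purified sample, with $\|\mathbf{e}\|_p \leq \kappa$ by the premise of the theorem.

```latex
% Sketch of the argument (our reading; the official proof is in Appendix B).
\begin{align*}
(h_S \circ f)(\mathbf{x}_{\mathrm{pfy}})
  &= G_R\!\big((f^{\dagger} \circ f)(\mathbf{x}_{\mathrm{pfy}})\big)
   = G_R\!\big(\mathbf{x} + \boldsymbol{\delta}_{\mathrm{adv}} + \boldsymbol{\epsilon}_{\mathrm{pfy}} + \mathbf{e}\big), \\
\|\boldsymbol{\delta}_{\mathrm{adv}} + \boldsymbol{\epsilon}_{\mathrm{pfy}} + \mathbf{e}\|_p
  &\le \|\boldsymbol{\delta}_{\mathrm{adv}}\|_p + \|\boldsymbol{\epsilon}_{\mathrm{pfy}}\|_p + \|\mathbf{e}\|_p
   \le \tfrac{\tau-\kappa}{2} + \tfrac{\tau-\kappa}{2} + \kappa = \tau,
\end{align*}
% so the (x, y, tau)-robustness of G_R gives
% (h_S o f)(x_pfy) = y = G_R(x).
```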
If $\boldsymbol{\epsilon}_{\mathrm{pfy}} = -\boldsymbol{\delta}_{\mathrm{adv}}$, then $\|\mathbf{x}_{\mathrm{pfy}} - (f^{\dagger} \circ f)(\mathbf{x}_{\mathrm{pfy}})\|_p \leq \kappa$. Thus, the feasible region for $\boldsymbol{\epsilon}_{\mathrm{pfy}}$ is non-empty. Let $S: \mathbb{R}^n \to \mathbb{R}^n$ be a function that takes an input $\mathbf{x}$ and outputs a purified signal $\boldsymbol{\epsilon}_{\mathrm{pfy}} = S(\mathbf{x})$ by minimizing the reconstruction error; then $F(\mathbf{x}) = \mathbf{x} + S(\mathbf{x})$ and $h_S \circ f \circ F$ is $(\mathbf{x}, y, \frac{\tau - \kappa}{2})$-robust.
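As one way to realize $S$ in practice, the sketch below performs projected gradient descent on the reconstruction error, keeping $\boldsymbol{\epsilon}_{\mathrm{pfy}}$ inside an $\ell_\infty$ ball. The step size, iteration count, budget, and the choice of $\ell_\infty$ are illustrative assumptions rather than the paper's exact settings, which additionally rely on the variational formulation described later.

```python
# Illustrative purifier S: iteratively update eps_pfy to reduce the
# reconstruction error of x_adv + eps_pfy while keeping ||eps_pfy||_inf small.
# Hyperparameters and the l_inf projection are assumptions for this sketch.
import torch

def purify(encoder, decoder, x_adv, budget=8 / 255, step=1 / 255, n_iter=50):
    eps = torch.zeros_like(x_adv, requires_grad=True)
    for _ in range(n_iter):
        x_pfy = x_adv + eps
        recon = decoder(encoder(x_pfy)).view_as(x_adv)                 # (f† ∘ f)(x_pfy)
        loss = (x_pfy - recon).flatten(1).norm(p=2, dim=-1).sum()      # reconstruction error
        loss.backward()
        with torch.no_grad():
            eps -= step * eps.grad.sign()                              # descend on eps_pfy
            eps.clamp_(-budget, budget)                                # project onto the l_inf ball
            eps.copy_((x_adv + eps).clamp(0.0, 1.0) - x_adv)           # keep x_adv + eps in [0, 1]
        eps.grad = None
    return (x_adv + eps).detach()                                      # purified sample x_pfy
```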
Remark 1. For every perturbation $\boldsymbol{\delta} \in \mathbb{R}^n$ with $\|\boldsymbol{\delta}\|_p \leq \nu$, if $S(\mathbf{x} + \boldsymbol{\delta}) = -\boldsymbol{\delta}$, then the function $S: \mathbb{R}^n \to \mathbb{R}^n$ is locally Lipschitz continuous on $B_\nu = \{\hat{\mathbf{x}} \in \mathbb{R}^n \mid \|\hat{\mathbf{x}} - \mathbf{x}\|_p < \nu\}$ with a Lipschitz constant of 1. (See Appendix B)

Figure 2: Two-stage pipeline: (a) joint training for semantic consistency between the decoder and the classifier and (b) iterative updates of $\boldsymbol{\epsilon}_{\mathrm{pfy}}$ to purify $\mathbf{x}_{\mathrm{adv}}$ at inference.
Insights From the Theory. Our framework transforms a high-dimensional adversarial robustness problem into a low-dimensional semantic consistency problem. Since we only provide sufficient conditions for adversarial robustness, violating the conditions does not necessarily imply adversarial vulnerability. Our conditions indicate that higher reconstruction quality could lead to stronger robustness. Meanwhile, our method can certify robustness up to $\frac{\tau}{2}$ (reconstruction error $\kappa = 0$) when a human-level image classifier can certify robustness up to $\tau$. The insight is that adding a purified signal $\boldsymbol{\epsilon}_{\mathrm{pfy}}$ on top of an adversarial example $\mathbf{x}_{\mathrm{adv}}$ could change the image semantics; see Figure 5(c). In our formulation, we use $\tau$ to quantify the local robustness of the human-level image classifier $G_R$ given an image-label pair $(\mathbf{x}, y)$. Our objective is not to estimate the value of $\tau$ but to use it to represent the upper bound of the achievable local robustness. Since $G_R$ can be considered a human vision model, the value of $\tau$ is often large. Our framework is based on the triangle inequality of the $\ell_p$ metrics; thus, it can be extended to other distance metrics.
Relaxation. Our framework requires semantic consistency between the classifier on the manifold and the decoder on the manifold. Although the classifiers and decoders (on the manifold) have a low input dimension, it is still difficult to achieve high semantic consistency between them. Meanwhile, the human-level image classifier $G_R$ is not available. Thus, we assume that predictions and reconstructions from high data-density regions of $p(\mathbf{z}|\mathbf{x})$ are more likely to be semantically consistent (Zhou 2022). In the following sections, we introduce a practical implementation of adversarial purification based on our theoretical foundation. The implementation includes two stages: (1) enforcing semantic consistency during training and (2) test-time purification of adversarial examples; see Figure 2.
Semantic Consistency with the ELBO
Exact inference of $p(\mathbf{z}|\mathbf{x})$ is often intractable; we therefore use variational inference to approximate the posterior $p(\mathbf{z}|\mathbf{x})$ with a different distribution $q(\mathbf{z}|\mathbf{x})$. We define two parameters, $\theta$ and $\phi$, which parameterize the distributions $p_\theta(\mathbf{x}|\mathbf{z})$ and $q_\phi(\mathbf{z}|\mathbf{x})$. When the evidence lower bound (ELBO) is maximized, $q_\phi(\mathbf{z}|\mathbf{x})$ is considered a reasonable approximation of $p(\mathbf{z}|\mathbf{x})$. To enforce the semantic consistency between the decoder and the classifier, we train them jointly with the variational autoencoder, as illustrated in Figure 2(a).
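As a concrete (and assumed) instantiation of this joint training stage, the sketch below maximizes a standard Gaussian-prior ELBO while minimizing a cross-entropy loss for the latent classifier on the same sampled latent vector. The loss form, weighting `lam`, and the module names `enc_mu`, `enc_logvar`, `decoder`, and `classifier` are illustrative assumptions; the paper's exact objective may differ.

```python
# Assumed joint objective: negative ELBO of a VAE with Gaussian q_phi(z|x),
# Bernoulli p_theta(x|z), and standard normal prior, plus a cross-entropy term
# for the latent classifier h (semantic-consistency surrogate).
import torch
import torch.nn.functional as F_nn

def joint_loss(enc_mu, enc_logvar, decoder, classifier, x, y, lam=1.0):
    mu, logvar = enc_mu(x), enc_logvar(x)                  # q_phi(z|x) = N(mu, diag(exp(logvar)))
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
    x_rec = decoder(z)                                     # decoder outputs pixel probabilities in [0, 1]

    # Negative ELBO = reconstruction NLL + KL(q_phi(z|x) || N(0, I)), averaged over the batch.
    rec_nll = F_nn.binary_cross_entropy(x_rec, x.flatten(1), reduction="sum") / x.size(0)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)

    # Semantic-consistency surrogate: the latent classifier must predict the label
    # from the same z that the decoder reconstructs from.
    cls = F_nn.cross_entropy(classifier(z), y)

    return rec_nll + kl + lam * cls
```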