Goodfellow, Shlens, and Szegedy 2015; Madry et al. 2018). Such methods can degrade classification accuracy on clean data (Tsipras et al. 2019; Pang et al. 2022). To reduce the degradation in clean classification accuracy, TRADES (Zhang et al. 2019) was proposed to balance the trade-off between clean and robust accuracy. Recent works also study the effects of different hyperparameters (Pang et al. 2021; Huang et al. 2022a) and data augmentation (Rebuffi et al. 2021; Sehwag et al. 2022; Zhao et al. 2020) to reduce robust overfitting and prevent a drop in the model's robust accuracy. Beyond standard adversarial training, many works also study the impact of adversarial training on manifolds (Stutz, Hein, and Schiele 2019; Lin et al. 2020; Zhou, Liang, and Chen 2020; Patel et al. 2020). In contrast to these works, we introduce a novel defense that does not require adversarial training.
Adversarial Purification and Test-time Defense. As an
alternative to adversarial training, adversarial purification
aims to shift adversarial examples back to the represen-
tations of clean examples. Some efforts perform adversarial purification using GAN-based models (Samangouei, Kabkab, and Chellappa 2018), energy-based models (Grathwohl et al. 2020; Yoon, Hwang, and Lee 2021; Hill, Mitchell, and Zhu 2021), autoencoders (Hwang et al. 2019; Yin, Zhang, and Zuo 2022; Willetts et al. 2021; Gong et al. 2022; Meng and Chen 2017), and augmentations (Pérez et al. 2021; Shi, Holtz, and Mishne 2021; Mao et al. 2021), among others.
Song et al. (2018) discover that adversarial examples lie in
low-probability regions and they use PixelCNN to restore
the adversarial examples by shifting them back to high-
probability regions. Shi, Holtz, and Mishne (2021) and Mao et al. (2021) discover that adversarial attacks increase the self-supervised learning loss, and they define reverse vectors to purify the adversarial examples. Prior efforts (Athalye, Carlini, and Wagner 2018; Croce et al. 2022) have shown that methods such as Defense-GAN, PixelDefend (PixelCNN), SOAP (self-supervised), and autoencoder-based purification are vulnerable to Backward Pass Differentiable Approximation (BPDA) attacks (Athalye, Carlini, and
Wagner 2018). Recently, diffusion-based adversarial purifi-
cation methods have been studied (Nie et al. 2022; Xiao
et al. 2022) and show adversarial robustness against adap-
tive attacks such as BPDA. Lee and Kim (2023), however,
observe that the robustness of diffusion-based purification
drops significantly when evaluated with the surrogate gra-
dient designed for diffusion models. Similar to adversarial
purification, existing test-time defense techniques (Nayak,
Rawal, and Chakraborty 2022; Huang et al. 2022b) are also
vulnerable to adaptive white-box attacks. In this work, we
present a novel defense method combining manifold learning and variational inference, which achieves better performance than prior works and greater robustness against adaptive white-box attacks.
Methodology
In this section, we introduce an adversarial purification method based on the manifold hypothesis. We first define theoretical sufficient conditions that quantify the robustness of predictions. Then, we use variational inference to approximate these conditions in practice and achieve adversarial robustness without adversarial training.
Let D_XY be a set of clean images and their labels, where each image-label pair is denoted (x, y). The manifold hypothesis states that many real-world high-dimensional data x ∈ R^n lie on a low-dimensional manifold M diffeomorphic to R^m with m ≪ n. We define an encoder function f: R^n → R^m and a decoder function f†: R^m → R^n to form an autoencoder, where f maps a data point x ∈ R^n to a point f(x) ∈ R^m. For x ∈ M, f† and f are approximate inverses (see Appendix A for notation details).
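To make the roles of f and f† concrete, below is a minimal PyTorch-style sketch of such an autoencoder. The layer sizes, dimensions n and m, and the reconstruction loss are illustrative assumptions for exposition, not the exact architecture or training objective used in this paper.

```python
# A minimal sketch (illustrative only) of an autoencoder realizing
# f: R^n -> R^m and f_dagger: R^m -> R^n.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n: int = 784, m: int = 32):
        super().__init__()
        # Encoder f maps an image x in R^n to a latent code z in R^m.
        self.f = nn.Sequential(nn.Linear(n, 256), nn.ReLU(), nn.Linear(256, m))
        # Decoder f_dagger maps z back to R^n; on the data manifold M,
        # f_dagger(f(x)) should approximately recover x.
        self.f_dagger = nn.Sequential(nn.Linear(m, 256), nn.ReLU(), nn.Linear(256, n))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.f_dagger(self.f(x))

# A reconstruction loss encourages f and f_dagger to be approximate inverses on M.
x = torch.rand(8, 784)                 # a batch of flattened images
model = Autoencoder()
loss = ((model(x) - x) ** 2).mean()    # mean squared reconstruction error
```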
Problem Formulation
Let L = {1, ..., c} be a discrete label set of c classes and h: R^m → L be a classifier on the latent space. Given an image-label pair (x, y) ∈ D_XY, the encoder maps the image x to a lower-dimensional vector z = f(x) ∈ R^m, and the functions f and h form a classifier of the image space, y_pred = h(z) = (h ∘ f)(x). Generally, the classifier predicts labels consistent with the ground-truth labels such that y_pred = y. However, under adversarial attack, the adversary can generate a small adversarial perturbation δ_adv such that (h ∘ f)(x) ≠ (h ∘ f)(x + δ_adv). Thus, our purification framework aims to find a purified signal ϵ_pfy ∈ R^n such that (h ∘ f)(x) = (h ∘ f)(x + δ_adv + ϵ_pfy) = y. However, it is challenging to achieve ϵ_pfy = −δ_adv because δ_adv is unknown. Thus, we seek an alternative approach to estimate the purified signal ϵ_pfy and defend against attacks.
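As a concrete illustration of this notation, the sketch below composes a latent classifier h with the encoder f and shows where the purified signal ϵ_pfy enters the prediction. The `purify` routine is only a placeholder here; the actual estimation of ϵ_pfy via variational inference is developed in the following subsections, and the network shapes are assumptions for illustration.

```python
# Illustrative sketch of the composed classifier y_pred = (h ∘ f)(x) and of
# where a purified signal eps_pfy would be applied; `purify` is a placeholder.
import torch
import torch.nn as nn

n, m, c = 784, 32, 10
f = nn.Sequential(nn.Linear(n, 256), nn.ReLU(), nn.Linear(256, m))  # encoder f
h = nn.Linear(m, c)                                                  # latent classifier h

def classify(x: torch.Tensor) -> torch.Tensor:
    """y_pred = argmax over classes of (h ∘ f)(x)."""
    return h(f(x)).argmax(dim=-1)

def purify(x_adv: torch.Tensor) -> torch.Tensor:
    """Placeholder estimate of eps_pfy; the estimator derived later in this
    paper is based on variational inference, not implemented here."""
    return torch.zeros_like(x_adv)

x_adv = torch.rand(1, n)                  # x + delta_adv (delta_adv is unknown)
y_pred = classify(x_adv + purify(x_adv))  # prediction on the purified input
```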
Theoretical Foundation for Adversarial Robustness
Adversarial perturbations are usually ℓ_p-bounded, where p ∈ {0, 2, ∞}. We define the ℓ_p-norm of a vector a = [a_1, ..., a_n]⊺ as ∥a∥_p and a classifier of the image space as G: R^n → L. We follow Bastani et al. (2016) and Leino, Wang, and Fredrikson (2021) to define local robustness.
Definition 1. (Locally robust image classifier) Given an image-label pair (x, y) ∈ D_XY, a classifier G is (x, y, τ)-robust with respect to the ℓ_p-norm if, for every η ∈ R^n with ∥η∥_p ≤ τ, y = G(x) = G(x + η).
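For intuition, the hedged sketch below probes Definition 1 empirically for the ℓ_∞ case by sampling random perturbations with ∥η∥_∞ ≤ τ. Passing such a sampling check is only a necessary condition for (x, y, τ)-robustness, not a certificate, and the function name is ours rather than part of the paper's method.

```python
# Empirical (necessary, not sufficient) probe of Definition 1 for the l_inf norm:
# sample eta with ||eta||_inf <= tau and verify that G(x + eta) never changes.
import torch

def is_locally_robust_sample(G, x: torch.Tensor, y: int, tau: float,
                             num_samples: int = 100) -> bool:
    """G is assumed to map an input tensor to a predicted class index."""
    if int(G(x)) != y:                    # Definition 1 also requires y = G(x)
        return False
    for _ in range(num_samples):
        eta = torch.empty_like(x).uniform_(-tau, tau)   # ||eta||_inf <= tau
        if int(G(x + eta)) != y:
            return False
    return True
```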
Human vision is robust up to a certain perturbation bud-
get. For example, given a clean MNIST image x with pixel values in [0, 1], if ∥η∥_∞ ≤ 85/255, human vision will assign (x + η) and x to the same class (Madry et al. 2018). We use ρ_H(x, y) to represent the maximum perturbation budget for static human vision interpretations given an image-label pair (x, y). The exact ρ_H(x, y) is often a large value but difficult to estimate. We use it to represent the upper bound of achievable robustness.
Definition 2. (Human-level image classifier) For every image-label pair (x_i, y_i) ∈ D_XY, if a classifier G_R is (x_i, y_i, τ_i)-robust and τ_i ≜ ρ_H(x_i, y_i), where ρ_H(·, ·) represents the maximum perturbation budget for static human vision interpretations, we define such a classifier as a human-level image classifier.
The human-level image classifier G_R is an ideal classifier that is comparable to human vision. To construct such a G_R, we need a robust encoder f_R of the image space and a robust