it allows attackers to attack target models using adversarial examples generated on surrogate models. Therefore, how to generate adversarial examples with high transferability has gained increasing attention in the literature [5, 10, 14, 23, 26, 38, 48].
Under the white-box setting, where complete information about the attacked model (e.g., architecture and parameters) is available, gradient-based attacks such as PGD [27] have demonstrated good attack performance. However, they often exhibit poor transferability [5, 48], i.e., the adversarial example $x^{adv}$ generated from the surrogate model $\mathcal{M}_S$ performs poorly against different target models $\mathcal{M}_T$.
Previous works attribute this to the overfitting of adversarial examples to the surrogate models [5, 24, 48]. Figure 1 (b) gives an illustration. The PGD attack aims to find an adversarial point $x^{pgd}$ with minimal attack loss, but does not consider the attack loss in the neighborhood region around $x^{pgd}$. Due to the highly non-convex loss landscape of deep models, when $x^{pgd}$ lies at a sharp local minimum, a slight change in the model parameters of $\mathcal{M}_S$ can cause a large increase in the attack loss, making $x^{pgd}$ fail to attack the perturbed model.
Many techniques have been proposed to mitigate this overfitting and improve transferability, including input transformation [6, 48], gradient calibration [14], feature-level attacks [17], and generative models [30]. However, there still exists a large gap in attack performance between the transfer setting and the ideal white-box setting, especially for targeted attacks, calling for further efforts to boost transferability.
In this work, we propose a novel attack method called reverse adversarial perturbation (RAP) to alleviate overfitting to the surrogate model and boost the transferability of adversarial examples. We encourage $x^{adv}$ not only to have a low attack loss but also to lie in a locally flat region, i.e., the points within the local neighborhood around $x^{adv}$ should also have low loss values. Figure 1 (b) illustrates the difference between a sharp local minimum and a flat local minimum. When the model parameters of $\mathcal{M}_S$ change slightly, the attack loss at the flat local minimum varies less than at the sharp one. Therefore, the flat local minimum is less sensitive to changes of the decision boundary. To achieve this goal, we formulate a min-max bi-level optimization problem. The inner maximization finds the worst-case perturbation (i.e., the one with the largest attack loss, which is why we call it the reverse adversarial perturbation) within the local region around the current adversarial example, and can be solved by projected gradient ascent.
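Concretely, writing $L$ for the attack loss on the surrogate model $\mathcal{M}_S$, $\epsilon$ for the attack budget, and $\epsilon_n$ for the radius of the reverse-perturbation neighborhood (the notation here is an illustrative sketch of the formulation described above), the objective can be written as
$$\min_{\|x^{adv}-x\|_\infty \le \epsilon}\ \max_{\|n^{rap}\|_\infty \le \epsilon_n}\ L\big(\mathcal{M}_S(x^{adv}+n^{rap}),\, y\big),$$
where $x$ is the benign input and $y$ is the label used by the attack loss (the true label for untargeted attacks, the target label for targeted attacks).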
Then, the outer minimization updates the adversarial example to find a new point that, when added with the provided reverse perturbation, leads to a lower attack loss. Figure 1 (a) provides an illustration of the optimization process. At the $t$-th iteration with the current iterate $x^t$, RAP first finds the point $x^t + n^{rap}$ with the maximal attack loss within the neighborhood of $x^t$. Then it updates $x^t$ with the gradient of the attack loss computed at $x^t + n^{rap}$. Compared to directly adopting the gradient at $x^t$, RAP helps escape sharp local minima and pursue a relatively flat local minimum.
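To make the update rule concrete, below is a minimal PyTorch-style sketch of one RAP iteration under an $\ell_\infty$ budget; the function name rap_step, the hyperparameter values, and the interfaces of surrogate and attack_loss are illustrative assumptions rather than the authors' implementation.

```python
import torch

def rap_step(x_t, x_clean, y, surrogate, attack_loss,
             eps=16/255, alpha=2/255,      # outer l_inf budget / step size (illustrative)
             eps_n=12/255, alpha_n=2/255,  # inner neighborhood radius / step size (illustrative)
             K=8):                         # number of inner ascent steps (illustrative)
    """One RAP iteration: inner ascent finds the reverse perturbation n_rap,
    then the outer step descends the attack loss at x_t + n_rap."""
    x_t = x_t.detach()

    # Inner maximization: projected gradient ascent on the attack loss around x_t.
    n_rap = torch.zeros_like(x_t)
    for _ in range(K):
        n_rap.requires_grad_(True)
        loss = attack_loss(surrogate(x_t + n_rap), y)
        grad = torch.autograd.grad(loss, n_rap)[0]
        with torch.no_grad():
            n_rap = (n_rap + alpha_n * grad.sign()).clamp(-eps_n, eps_n)

    # Outer minimization: take a descent step using the gradient at x_t + n_rap.
    x_req = x_t.clone().requires_grad_(True)
    loss = attack_loss(surrogate(x_req + n_rap), y)
    grad = torch.autograd.grad(loss, x_req)[0]
    with torch.no_grad():
        x_next = x_req - alpha * grad.sign()                     # lower the attack loss
        x_next = x_clean + (x_next - x_clean).clamp(-eps, eps)   # stay within the attack budget
        x_next = x_next.clamp(0, 1)                              # keep a valid image
    return x_next.detach()
```

Here attack_loss is whatever loss the attacker minimizes (e.g., the cross-entropy to the target class for a targeted attack), and the late-start variant introduced below would simply skip the inner loop (K = 0) during the first iterations.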
Besides, we design a late-start variant of RAP (RAP-LS) that omits the reverse perturbation during the early stage of optimization, further boosting attack effectiveness and efficiency. Moreover, from a technical perspective, since RAP only introduces one specially designed perturbation into the attack procedure, a notable advantage is that it can be naturally combined with many existing black-box attack techniques to further boost transferability. For example, when combined with different input transformations (e.g., the random resizing and padding of Diverse Input [48], sketched below), RAP consistently outperforms the corresponding baselines by a clear margin.
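As a sketch of such a combination, the snippet below implements a Diverse-Input-style random resize-and-pad transform and plugs it into the rap_step sketch above by wrapping the surrogate model; the resize range, padding size, and application probability are illustrative assumptions.

```python
import random
import torch.nn.functional as F

def di_transform(x, low=299, high=330, p=0.7):
    """Randomly resize the image and pad it back to a fixed size (Diverse-Input style)."""
    if random.random() > p:
        return x                                # apply the transform with probability p
    size = random.randint(low, high - 1)        # random target resolution
    resized = F.interpolate(x, size=(size, size), mode="nearest")
    pad = high - size
    pad_left, pad_top = random.randint(0, pad), random.randint(0, pad)
    return F.pad(resized, (pad_left, pad - pad_left, pad_top, pad - pad_top), value=0)

# Every forward pass during the attack then sees a freshly transformed input:
# surrogate_di = lambda x: surrogate(di_transform(x))
# x_next = rap_step(x_t, x_clean, y, surrogate_di, attack_loss)
```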
Our main contributions are three-fold: 1) based on a novel perspective, the flatness of the loss landscape around adversarial examples, we propose a novel adversarial attack method, RAP, which encourages both the adversarial example and its neighborhood region to have low loss values; 2) we present a rigorous experimental study showing that RAP significantly boosts adversarial transferability for both untargeted and targeted attacks across various networks, including defense models; 3) we demonstrate that RAP can be easily combined with existing transfer attack techniques and surpasses the state-of-the-art performance by a large margin.
2 Related Work
Black-box attacks fall into two categories: 1) query-based attacks, which conduct the attack based on feedback from iterative queries to the target models, and 2) transfer attacks, which use adversarial examples generated on surrogate models to attack the target models. In this work,