Boosting the Transferability of Adversarial Attacks
with Reverse Adversarial Perturbation
Zeyu Qin1*, Yanbo Fan2*, Yi Liu1, Li Shen3, Yong Zhang2, Jue Wang2, Baoyuan Wu1†
1School of Data Science, Shenzhen Research Institute of Big Data,
The Chinese University of Hong Kong, Shenzhen
2Tencent AI Lab
3JD Explore Academy
{zeyu6181136, fanyanbo0124, yiliuhk2000}@gmail.com
{mathshenli, zhangyong201303, arphid}@gmail.com
wubaoyuan@cuhk.edu.cn
Abstract
Deep neural networks (DNNs) have been shown to be vulnerable to adversarial examples, which can produce erroneous predictions by injecting imperceptible perturbations. In this work, we study the transferability of adversarial examples, which is significant due to its threat to real-world applications where model architecture or parameters are usually unknown. Many existing works reveal that the adversarial examples are likely to overfit the surrogate model that they are generated from, limiting their transfer attack performance against different target models. To mitigate the overfitting of the surrogate model, we propose a novel attack method, dubbed reverse adversarial perturbation (RAP). Specifically, instead of minimizing the loss of a single adversarial point, we advocate seeking an adversarial example located at a region with uniformly low loss values, by injecting the worst-case perturbation (i.e., the reverse adversarial perturbation) at each step of the optimization procedure. The adversarial attack with RAP is formulated as a min-max bi-level optimization problem. By integrating RAP into the iterative process for attacks, our method can find more stable adversarial examples which are less sensitive to changes of the decision boundary, mitigating the overfitting of the surrogate model. Comprehensive
experimental comparisons demonstrate that RAP can significantly boost adversarial
transferability. Furthermore, RAP can be naturally combined with many existing
black-box attack techniques, to further boost the transferability. When attacking
a real-world image recognition system, i.e., Google Cloud Vision API, we obtain
a 22% performance improvement of targeted attacks over the compared method. Our
codes are available at: https://github.com/SCLBD/Transfer_attack_RAP.
1 Introduction
Deep neural networks (DNNs) have been successfully applied in many safety-critical tasks, such
as autonomous driving, face recognition and verification, etc. However, it has been shown that
DNN models are vulnerable to adversarial examples [9, 12, 27, 32, 35, 44, 45, 50], which are indistinguishable from natural examples but make a model produce erroneous predictions. For real-world applications, the DNN models are often hidden from users. Therefore, the attackers need to generate the adversarial examples under the black-box setting, where they do not know any information about the target model [2, 3, 18, 32]. In the black-box setting, the adversarial transferability matters since it allows the attackers to attack target models by using adversarial examples generated on the surrogate models. Therefore, learning how to generate adversarial examples with high transferability has gained increasing attention in the literature [5, 10, 14, 23, 26, 38, 48].
*Equal contribution. †Corresponding author.
This work was done when Zeyu Qin was a research intern at Tencent AI Lab.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.05968v1 [cs.CV] 12 Oct 2022
Under the white-box setting where the complete information of the attacked model (e.g., architecture and parameters) is available, gradient-based attacks such as PGD [27] have demonstrated good attack performance. However, they often exhibit poor transferability [5, 48], i.e., the adversarial example $x^{adv}$ generated from the surrogate model $\mathcal{M}_S$ performs poorly against different target models $\mathcal{M}_T$. Previous works attribute this to the overfitting of adversarial examples to the surrogate models [5, 24, 48]. Figure 1 (b) gives an illustration. The PGD attack aims to find an adversarial point $x_{pgd}$ with minimal attack loss, while it does not consider the attack loss of the neighborhood region around $x_{pgd}$. Due to the highly non-convex nature of deep models, when $x_{pgd}$ locates at a sharp local minimum, a slight change in the model parameters of $\mathcal{M}_S$ could cause a large increase in the attack loss, making $x_{pgd}$ fail to attack the perturbed model.
Many techniques have been proposed to mitigate the overfitting and improve the transferability, including input transformation [6, 48], gradient calibration [14], feature-level attacks [17], and generative models [30], etc. However, there still exists a large gap in attack performance between the transfer setting and the ideal white-box setting, especially for targeted attacks, requiring more effort to boost the transferability.
In this work, we propose a novel attack method called reverse adversarial perturbation (RAP) to alleviate the overfitting to the surrogate model and boost the transferability of adversarial examples. We encourage that $x^{adv}$ not only has a low attack loss but also locates at a locally flat region, i.e., the points within the local neighborhood region around $x^{adv}$ should also have low loss values. Figure 1 (b) illustrates the difference between a sharp local minimum and a flat local minimum. When the model parameters of $\mathcal{M}_S$ have some slight changes, the variation of the attack loss w.r.t. the flat local minimum is smaller than that of the sharp one. Therefore, the flat local minimum is less sensitive to changes of the decision boundary. To achieve this goal, we formulate a min-max bi-level optimization problem. The inner maximization aims to find the worst-case perturbation (i.e., the one with the largest attack loss, which is why we call it reverse adversarial perturbation) within the local region around the current adversarial example, which can be solved by the projected gradient ascent algorithm. Then, the outer minimization updates the adversarial example to find a new point that, when added with the provided reverse perturbation, leads to a lower attack loss. Figure 1 (a) provides an illustration of the optimization process. At the $t$-th iteration with $x_t$, RAP first finds the point $x_t + n_{rap}$ with maximal attack loss within the neighborhood region of $x_t$. Then it updates $x_t$ with the gradient calculated by minimizing the attack loss w.r.t. $x_t + n_{rap}$. Compared to directly adopting the gradient at $x_t$, RAP could help escape from the sharp local minimum and pursue a relatively flat local minimum. Besides, we design a late-start variant of RAP (RAP-LS) to further boost the attack effectiveness and efficiency, which does not insert the reverse perturbation into the optimization procedure in the early stage. Moreover, from the technical perspective, since the proposed RAP method only introduces one specially designed perturbation into adversarial attacks, one notable advantage of RAP is that it can be naturally combined with many existing black-box attack techniques to further boost the transferability. For example, when combined with different input transformations (e.g., the random resizing and padding in Diverse Input [48]), our RAP method consistently outperforms the counterparts by a clear margin.
Our main contributions are three-fold: 1) Based on a novel perspective, the flatness of the loss landscape for adversarial examples, we propose a novel adversarial attack method, RAP, that encourages both the adversarial example and its neighborhood region to have low loss values; 2) we present a rigorous experimental study and show that RAP can significantly boost the adversarial transferability of both untargeted and targeted attacks against various networks, including defense models; 3) we demonstrate that RAP can be easily combined with existing transfer attack techniques and outperforms the state-of-the-art performance by a large margin.
2 Related Work
Black-box attacks can be divided into two categories: 1) query-based attacks that conduct the attack based on the feedback of iterative queries to the target models, and 2) transfer attacks that use the adversarial examples generated on some surrogate models to attack the target models. In this work, we focus on transfer attacks. On surrogate models, existing attack algorithms such as FGSM [12] and I-FGSM [21] can achieve good attack performance. However, they often overfit the surrogate models and thus exhibit poor transferability. Recently, many works have been proposed to generate more transferable adversarial examples [5, 6, 13, 14, 17, 19, 20, 24, 39, 40, 41, 43, 48], which we briefly summarize below.
Input transformation: Data augmentation, which has been shown to be effective in improving model generalization, has also been studied to boost the adversarial transferability, such as random resizing and padding [48], random scaling [24], and adversarial mixup [41]. In addition, the work of Dong et al. [6] uses a set of translated images to compute the gradient and achieves better performance against defense models. The Expectation over Transformation (EOT) method [1] synthesizes adversarial examples over a chosen distribution of transformations to enhance their adversarial transferability.
Gradient modification: Instead of I-FGSM, the work of Dong et al. [5] integrates momentum into the updating strategy, and Lin et al. [24] use the Nesterov accelerated gradient to boost the transferability. The work of Wang and He [40] aims to find a more stable gradient direction by tuning the variance of each gradient step. There are also some model-specific designs to boost the adversarial transferability. For example, Wu et al. [43] found that the gradient of skip connections is more crucial for generating more transferable attacks. The work of Guo et al. [14] proposed LinBP to utilize more gradients from skip connections during back-propagation. However, these methods tend to be specific to a particular model architecture, such as skip connections, and it is nontrivial to extend the findings to other architectures or modules.
Intermediate feature attack: Meanwhile, Huang et al. [17] and Inkawhich et al. [19, 20] proposed to exploit feature-space constraints to generate more transferable attacks. Yet they need to identify the best-performing intermediate layers or train one-vs-all binary classifiers for all attacked classes. Recently, Zhao et al. [49] found that iterative attacks with many more iterations and the logit loss can achieve relatively high targeted transferability and exceed the feature-based attacks.
Generative models: In addition, there have been some methods utilizing generative models to generate the adversarial perturbations [28, 30, 31]. For example, the work of Naseer et al. [30] proposed to train a generative model to match the distributions of the source and target classes, so as to increase the targeted transferability. However, the learning of the perturbation generator is nontrivial, especially on large-scale datasets.
In summary, the current performance of transfer attacks is still unsatisfactory, especially for targeted attacks. In this work, we study adversarial transferability from the perspective of the flatness of the loss landscape around adversarial examples. We find that adversarial examples located at a flat local minimum are more transferable than those at a sharp local minimum, and we propose a novel algorithm to find adversarial examples that locate at flat local minima.
3 Methodology
3.1 Preliminaries of Transfer Adversarial Attack
Given a benign sample $(x, y) \in (\mathcal{X}, \mathcal{Y})$, the procedure of a transfer adversarial attack first constructs the adversarial example $x^{adv}$ within the neighborhood region $\mathcal{B}(x) = \{x' : \|x' - x\|_p \leq \epsilon\}$ by attacking the white-box surrogate model $\mathcal{M}_s(x; \theta): \mathcal{X} \rightarrow \mathcal{Y}$, then transfers $x^{adv}$ to directly attack the black-box target model $\mathcal{M}_t(x; \phi): \mathcal{X} \rightarrow \mathcal{Y}$. The attack goal is to mislead the target model, i.e., $\mathcal{M}_t(x^{adv}; \phi) \neq y$ (untargeted attack), or $\mathcal{M}_t(x^{adv}; \phi) = y_t$ (targeted attack), with $y_t \in \mathcal{Y}$ indicating the target label. Taking the targeted attack as an example, the general formulation of many existing transfer attack methods can be written as follows:
$$\min_{x^{adv} \in \mathcal{B}(x)} \mathcal{L}(\mathcal{M}_s(G(x^{adv}); \theta), y_t). \quad (1)$$
The loss function $\mathcal{L}$ is often set as the cross-entropy (CE) loss [48] or the logit loss [49], which will be specified in later experiments. Besides, the formulation of the untargeted attack can be easily obtained by replacing the loss function $\mathcal{L}$ and $y_t$ by $-\mathcal{L}$ and $y$, respectively.
Since $\mathcal{M}_s$ is white-box, if $G(\cdot)$ is set as the identity function, then any off-the-shelf white-box adversarial attack method can be adopted to solve Problem (1), such as I-FGSM [21], MI-FGSM [5], etc. Meanwhile, existing works have designed different $G(\cdot)$ functions and developed the corresponding optimization algorithms to boost the adversarial transferability between surrogate and target models. For example, $G(\cdot)$ is specified as random resizing and padding (DI) [48], translation transformation (TI) [6], scale transformation (SI) [24], and adversarial mixup (Admix) [41].
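To make the notation above concrete, the following PyTorch code is a minimal sketch of solving Problem (1) with a momentum-based iterative update and a DI-style random resize-and-pad transformation as $G(\cdot)$. It is only an illustration of the general formulation, not the implementation in our released code; the helper name `di_transform` and the hyper-parameter values are chosen here for exposition.

```python
import torch
import torch.nn.functional as F

def di_transform(x, resize_low=0.9, prob=0.7):
    """A DI-style G(.): randomly shrink the image, then pad back to the original size."""
    if torch.rand(1).item() > prob:
        return x
    h = x.shape[-1]
    new_h = int(h * (resize_low + (1 - resize_low) * torch.rand(1).item()))
    x_small = F.interpolate(x, size=new_h, mode="nearest")
    pad = h - new_h
    top, left = torch.randint(0, pad + 1, (2,)).tolist()
    return F.pad(x_small, (left, pad - left, top, pad - top))

def targeted_transfer_attack(model, x, y_t, eps=16/255, alpha=2/255, K=300, mu=1.0):
    """Solve Problem (1): minimize CE(M_s(G(x_adv)), y_t) within the eps-ball around x."""
    x_adv, g = x.clone(), torch.zeros_like(x)
    for _ in range(K):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(di_transform(x_adv)), y_t)
        grad = torch.autograd.grad(loss, x_adv)[0]
        g = mu * g + grad / grad.abs().mean()             # momentum accumulation
        x_adv = x_adv.detach() - alpha * g.sign()         # descend the attack loss
        x_adv = x + (x_adv - x).clamp(-eps, eps)          # project into B(x)
        x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()
```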
Figure 1: These two plots are schematic diagrams in 1D space. The x-axis means the value of the input $x$. The y-axis means the value of the attack loss function $\mathcal{L}$. (a) Illustration of our attack method and the original PGD attack. (b) Illustration of the attack loss landscapes of $\mathcal{M}_S$ and $\mathcal{M}_{S'}$, where $\mathcal{M}_{S'}$ denotes a slight change of the model parameters of $\mathcal{M}_S$. The blue and yellow dots correspond to attacks located at different local minima on $\mathcal{M}_S$, respectively. The gray and red points are their counterparts on $\mathcal{M}_{S'}$.
3.2 Reverse Adversarial Perturbation
As discussed above, although having good performance in the white-box setting, the adversarial examples generated from $\mathcal{M}_S$ exhibit poor adversarial transferability to $\mathcal{M}_T$, especially for targeted attacks. Previous works attribute this issue to the overfitting of the adversarial attack to $\mathcal{M}_S$ [4, 5, 6, 38, 46]. As shown in Figure 1 (b), when $x_{pgd}$ locates at a sharp local minimum, it is not stable and is sensitive to changes of $\mathcal{M}_S$. When there are some changes in the model parameters, $x_{pgd}$ could result in a high attack loss against $\mathcal{M}_{S'}$ and lead to a failed attack.
To mitigate the overfitting to $\mathcal{M}_S$, we advocate finding $x^{adv}$ located at a flat local region. That means we encourage not only $x^{adv}$ itself to have a low loss value, but also the points in the vicinity of $x^{adv}$ to have similarly low loss values.
To this end, we propose to minimize the maximal loss value within a local neighborhood region around the adversarial example $x^{adv}$. The maximal loss is implemented by perturbing $x^{adv}$ to maximize the attack loss, named Reverse Adversarial Perturbation (RAP). By inserting the RAP into the formulation (1), we aim to solve the following problem,
$$\min_{x^{adv} \in \mathcal{B}(x)} \mathcal{L}(\mathcal{M}_s(G(x^{adv} + n_{rap}); \theta), y_t), \quad (2)$$
where
$$n_{rap} = \arg\max_{\|n_{rap}\|_\infty \leq \epsilon_n} \mathcal{L}(\mathcal{M}_s(x^{adv} + n_{rap}; \theta), y_t), \quad (3)$$
with $n_{rap}$ indicating the RAP, and $\epsilon_n$ defining its search region. The above formulations (2) and (3) correspond to the targeted attack, and the corresponding untargeted formulations can be easily obtained by replacing the loss function $\mathcal{L}$ and $y_t$ by $-\mathcal{L}$ and $y$, respectively.
This is a min-max bi-level optimization problem [25], and it can be solved by iteratively optimizing the inner maximization and the outer minimization problems. Specifically, in each iteration, given $x^{adv}$, the inner maximization w.r.t. $n_{rap}$ is solved by the projected gradient ascent algorithm:
$$n_{rap} \leftarrow n_{rap} + \alpha_n \cdot \mathrm{sign}(\nabla_{n_{rap}} \mathcal{L}(\mathcal{M}_s(x^{adv} + n_{rap}; \theta), y_t)). \quad (4)$$
The above update is conducted for $T$ steps, with $\alpha_n = \epsilon_n / T$. Then, given $n_{rap}$, the outer minimization w.r.t. $x^{adv}$ can be solved by any off-the-shelf algorithm developed for solving (1). For example, it can be updated by one step of projected gradient descent, as follows:
$$x^{adv} \leftarrow \mathrm{Clip}_{\mathcal{B}(x)}\left(x^{adv} - \alpha \cdot \mathrm{sign}(\nabla_{x^{adv}} \mathcal{L}(\mathcal{M}_s(G(x^{adv} + n_{rap}); \theta), y_t))\right), \quad (5)$$
with $\mathrm{Clip}_{\mathcal{B}(x)}(a)$ clipping $a$ into the neighborhood region $\mathcal{B}(x)$. The overall optimization procedure is summarized in Algorithm 1. Moreover, since the optimization w.r.t. $x^{adv}$ can be implemented by any off-the-shelf algorithm for solving Problem (1), one notable advantage of the proposed RAP is that it can be naturally combined with any one of them, such as the input transformation methods [6, 24, 41, 48].
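For concreteness, the following PyTorch sketch implements the inner maximization (4) and a single outer update (5). It is a simplified illustration assuming a cross-entropy attack loss and an $\ell_\infty$ ball for $n_{rap}$; the function names `inner_maximization` and `outer_step` are introduced here for exposition and are not taken from the released code.

```python
import torch
import torch.nn.functional as F

def inner_maximization(model, x_adv, y_t, eps_n, T):
    """Eq. (4): T steps of gradient ascent on n_rap to maximize the attack loss."""
    alpha_n = eps_n / T
    n_rap = torch.zeros_like(x_adv)
    for _ in range(T):
        n_rap.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv + n_rap), y_t)
        grad = torch.autograd.grad(loss, n_rap)[0]
        # ascend the loss, then keep n_rap inside its eps_n-ball
        n_rap = (n_rap.detach() + alpha_n * grad.sign()).clamp(-eps_n, eps_n)
    return n_rap.detach()

def outer_step(model, x, x_adv, n_rap, y_t, alpha, eps, G=lambda z: z):
    """Eq. (5): one projected gradient descent step on x_adv at the shifted point."""
    x_adv = x_adv.clone().requires_grad_(True)
    loss = F.cross_entropy(model(G(x_adv + n_rap)), y_t)
    grad = torch.autograd.grad(loss, x_adv)[0]
    x_adv = x_adv.detach() - alpha * grad.sign()
    x_adv = x + (x_adv - x).clamp(-eps, eps)   # Clip_B(x): project back into B(x)
    return x_adv.clamp(0, 1)
```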
Figure 2: Targeted attack success rate (%) on Dense-121 and VGG-16. We take Res-50 as the surrogate model and take MI and Admix as baseline methods.
Algorithm 1 Reverse Adversarial Perturbation (RAP) Algorithm
Input: Surrogate model $\mathcal{M}_s$, benign data $(x, y)$, target label $y_t$, loss function $\mathcal{L}$, transformation $G$, the global iteration number $K$, the late-start iteration number $K_{LS}$ of RAP, as well as hyper-parameters in optimization (specified in later experiments)
Output: the adversarial example $x^{adv}$
1: Initialize $x^{adv} \leftarrow x$, $n_{rap} \leftarrow 0$
2: for $k = 1, \ldots, K$ do
3:   if $k \geq K_{LS}$ then
4:     Initialize $n_{rap} \leftarrow 0$
5:     for $t = 1, \ldots, T$ do
6:       Update $n_{rap}$ using (4)
7:   Update $x^{adv}$ using (5)
A Late-Start (LS) Variant of RAP. In our preliminary experiments, we find that RAP requires more iterations to converge and its performance is slightly lower during the initial iterations, compared to its baseline attack methods. As shown in Figure 2, we combine MI [5] and Admix [41] with RAP, and adopt ResNet-50 as the surrogate model. We conduct the evaluation on 1000 images from ImageNet (see Sec. 4.1). It is observed that the method with RAP (see the orange curves) quickly surpasses its baseline method (see the blue curves) and finally achieves a much higher success rate with more iterations, which verifies the effect of RAP on enhancing the adversarial transferability. However, it is also observed that the performance of RAP is slightly lower than that of its baseline method in the early stage. The possible reason is that the early-stage attack has very weak attack performance against the surrogate model. In this case, it may be wasteful to pursue better transferable attacks by solving the min-max problem. A better strategy may be to only solve the minimization problem (1) in the early stage to quickly reach a region of relatively high adversarial attack performance, and then start RAP to further enhance the attack performance and transferability simultaneously. This strategy is denoted as RAP with late-start (RAP-LS), whose effect is preliminarily supported by the results shown in Figure 2 (see the green curve) and will be evaluated extensively in later experiments.
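Putting the pieces together, a minimal sketch of Algorithm 1 with the late-start variant could look as follows. It reuses the `inner_maximization` and `outer_step` helpers sketched above, and the default hyper-parameter values are placeholders rather than the settings used in our experiments.

```python
import torch

def rap_ls_attack(model, x, y_t, eps=16/255, alpha=2/255, K=400, K_LS=100,
                  eps_n=16/255, T=8, G=lambda z: z):
    """Algorithm 1 (RAP with late-start): run the plain iterative attack for the
    first K_LS iterations, then inject the reverse adversarial perturbation n_rap."""
    x_adv = x.clone()
    for k in range(1, K + 1):
        if k >= K_LS:
            # inner maximization (Eq. 4): find the worst-case n_rap around x_adv
            n_rap = inner_maximization(model, x_adv, y_t, eps_n, T)
        else:
            # late start: behave like the baseline attack (n_rap = 0)
            n_rap = torch.zeros_like(x_adv)
        # outer minimization (Eq. 5): one projected gradient descent step
        x_adv = outer_step(model, x, x_adv, n_rap, y_t, alpha, eps, G)
    return x_adv
```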
3.3 A Closer Look at RAP
To verify whether RAP can help us find an $x^{adv}$ located at a locally flat region, we use ResNet-50 as the surrogate model and conduct untargeted attacks. We visualize the loss landscape around $x^{adv}$ on $\mathcal{M}_S$ by plotting the loss variations when we move $x^{adv}$ along a random direction with different magnitudes $a$. The details of the calculation are provided in the Appendix. Figure 3 plots the visualizations. We take I-FGSM [21] (denoted as I), MI [5], DI [48], and MI-TI-DI (MTDI) as baseline attacks and combine them with RAP. We can see that, compared to the baselines, RAP could help find $x^{adv}$ located at a flat region.
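The flatness visualization itself is straightforward to reproduce; the sketch below is our own simplification of the procedure described above (the exact calculation is in the Appendix), with the cross-entropy loss used as a stand-in for the attack loss and the step grid and plotting details chosen arbitrarily.

```python
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

@torch.no_grad()
def plot_loss_flatness(model, x_adv, y, a_max=0.05, num_points=21):
    """Plot L(x_adv + a * d) - L(x_adv) along a random direction d for
    magnitudes a in [-a_max, a_max], as a proxy for local flatness."""
    d = torch.randn_like(x_adv)
    d = d / d.abs().max()                       # normalize the random direction
    base = F.cross_entropy(model(x_adv), y).item()
    a_vals = torch.linspace(-a_max, a_max, num_points)
    deltas = [F.cross_entropy(model((x_adv + a * d).clamp(0, 1)), y).item() - base
              for a in a_vals]
    plt.plot(a_vals.tolist(), deltas)
    plt.xlabel("perturbation magnitude a")
    plt.ylabel("loss change around x_adv")
    plt.show()
```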