it allows attackers to attack target models using adversarial examples generated on surrogate models. Therefore, how to generate adversarial examples with high transferability has gained increasing attention in the literature [5, 10, 14, 23, 26, 38, 48].
Under the white-box setting, where complete information about the attacked model (e.g., architecture and parameters) is available, gradient-based attacks such as PGD [27] have demonstrated good attack performance. However, they often exhibit poor transferability [5, 48], i.e., the adversarial example $x^{adv}$ generated from the surrogate model $\mathcal{M}_S$ performs poorly against different target models $\mathcal{M}_T$.
Previous works attribute this to the overfitting of adversarial examples to the surrogate models [5, 24, 48]. Figure 1 (b) gives an illustration. The PGD attack aims to find an adversarial point $x^{pgd}$ with minimal attack loss, but does not consider the attack loss in the neighborhood region around $x^{pgd}$. Due to the highly non-convex loss landscape of deep models, when $x^{pgd}$ lies at a sharp local minimum, a slight change in the model parameters of $\mathcal{M}_S$ can cause a large increase in the attack loss, making $x^{pgd}$ fail to attack the perturbed model.
Many techniques have been proposed to mitigate this overfitting and improve transferability, including input transformation [6, 48], gradient calibration [14], feature-level attacks [17], and generative models [30]. However, there still exists a large gap in attack performance between the transfer setting and the ideal white-box setting, especially for targeted attacks, calling for further efforts to boost transferability.
In this work, we propose a novel attack method called reverse adversarial perturbation (RAP) to alleviate overfitting to the surrogate model and boost the transferability of adversarial examples. We encourage $x^{adv}$ not only to have a low attack loss but also to lie in a locally flat region, i.e., the points within the local neighborhood around $x^{adv}$ should also have low loss values. Figure 1 (b) illustrates the difference between a sharp local minimum and a flat local minimum. When the model parameters of $\mathcal{M}_S$ change slightly, the attack loss at the flat local minimum varies less than at the sharp one. Therefore, the flat local minimum is less sensitive to changes of the decision boundary. To achieve this goal, we formulate a min-max bi-level optimization problem. The inner maximization finds the worst-case perturbation (i.e., the one with the largest attack loss, which is why we call it the reverse adversarial perturbation) within the local region around the current adversarial example, and can be solved by projected gradient ascent.
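Concretely, writing $L$ for the attack loss on the surrogate model $\mathcal{M}_S$, $\epsilon$ for the attack budget, and $\epsilon_n$ for the radius of the reverse-perturbation neighborhood (the notation here is an illustrative sketch of the formulation described above), the objective can be written as
$$\min_{\|x^{adv}-x\|_\infty \le \epsilon}\ \max_{\|n^{rap}\|_\infty \le \epsilon_n}\ L\big(\mathcal{M}_S(x^{adv}+n^{rap}),\, y\big),$$
where $x$ is the benign input and $y$ is the label used by the attack loss (the true label for untargeted attacks, the target label for targeted attacks).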
Then, the outer minimization updates the adversarial example to find a new point that, when added with the provided reverse perturbation, leads to a lower attack loss. Figure 1 (a) provides an illustration of the optimization process. At the $t$-th iteration with the current iterate $x^t$, RAP first finds the point $x^t + n^{rap}$ with the maximal attack loss within the neighborhood of $x^t$. Then it updates $x^t$ with the gradient of the attack loss computed at $x^t + n^{rap}$. Compared to directly adopting the gradient at $x^t$, RAP helps escape sharp local minima and pursue a relatively flat local minimum.
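To make the update rule concrete, below is a minimal PyTorch-style sketch of one RAP iteration under an $\ell_\infty$ budget; the function name rap_step, the hyperparameter values, and the interfaces of surrogate and attack_loss are illustrative assumptions rather than the authors' implementation.

```python
import torch

def rap_step(x_t, x_clean, y, surrogate, attack_loss,
             eps=16/255, alpha=2/255,      # outer l_inf budget / step size (illustrative)
             eps_n=12/255, alpha_n=2/255,  # inner neighborhood radius / step size (illustrative)
             K=8):                         # number of inner ascent steps (illustrative)
    """One RAP iteration: inner ascent finds the reverse perturbation n_rap,
    then the outer step descends the attack loss at x_t + n_rap."""
    x_t = x_t.detach()

    # Inner maximization: projected gradient ascent on the attack loss around x_t.
    n_rap = torch.zeros_like(x_t)
    for _ in range(K):
        n_rap.requires_grad_(True)
        loss = attack_loss(surrogate(x_t + n_rap), y)
        grad = torch.autograd.grad(loss, n_rap)[0]
        with torch.no_grad():
            n_rap = (n_rap + alpha_n * grad.sign()).clamp(-eps_n, eps_n)

    # Outer minimization: take a descent step using the gradient at x_t + n_rap.
    x_req = x_t.clone().requires_grad_(True)
    loss = attack_loss(surrogate(x_req + n_rap), y)
    grad = torch.autograd.grad(loss, x_req)[0]
    with torch.no_grad():
        x_next = x_req - alpha * grad.sign()                     # lower the attack loss
        x_next = x_clean + (x_next - x_clean).clamp(-eps, eps)   # stay within the attack budget
        x_next = x_next.clamp(0, 1)                              # keep a valid image
    return x_next.detach()
```

Here attack_loss is whatever loss the attacker minimizes (e.g., the cross-entropy to the target class for a targeted attack), and the late-start variant introduced below would simply skip the inner loop (K = 0) during the first iterations.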
Besides, we design a late-start variant of RAP (RAP-LS) that omits the reverse perturbation during the early stage of optimization, further boosting attack effectiveness and efficiency. Moreover, from a technical perspective, since RAP only introduces one specially designed perturbation into the attack procedure, a notable advantage is that it can be naturally combined with many existing black-box attack techniques to further boost transferability. For example, when combined with different input transformations (e.g., the random resizing and padding of Diverse Input [48], sketched below), RAP consistently outperforms the corresponding baselines by a clear margin.
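As a sketch of such a combination, the snippet below implements a Diverse-Input-style random resize-and-pad transform and plugs it into the rap_step sketch above by wrapping the surrogate model; the resize range, padding size, and application probability are illustrative assumptions.

```python
import random
import torch.nn.functional as F

def di_transform(x, low=299, high=330, p=0.7):
    """Randomly resize the image and pad it back to a fixed size (Diverse-Input style)."""
    if random.random() > p:
        return x                                # apply the transform with probability p
    size = random.randint(low, high - 1)        # random target resolution
    resized = F.interpolate(x, size=(size, size), mode="nearest")
    pad = high - size
    pad_left, pad_top = random.randint(0, pad), random.randint(0, pad)
    return F.pad(resized, (pad_left, pad - pad_left, pad_top, pad - pad_top), value=0)

# Every forward pass during the attack then sees a freshly transformed input:
# surrogate_di = lambda x: surrogate(di_transform(x))
# x_next = rap_step(x_t, x_clean, y, surrogate_di, attack_loss)
```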
Our main contributions are three-fold: 1) based on a novel perspective, the flatness of the loss landscape around adversarial examples, we propose a novel adversarial attack method, RAP, which encourages both the adversarial example and its neighborhood region to have low loss values; 2) we present a rigorous experimental study showing that RAP significantly boosts adversarial transferability for both untargeted and targeted attacks across various networks, including defense models; 3) we demonstrate that RAP can be easily combined with existing transfer attack techniques and surpasses the state-of-the-art performance by a large margin.
2 Related Work
Black-box attacks fall into two categories: 1) query-based attacks, which conduct the attack based on feedback from iterative queries to the target models, and 2) transfer attacks, which use adversarial examples generated on surrogate models to attack the target models. In this work,