Towards Out-of-Distribution Adversarial Robustness
Adam Ibrahim
Mila
Université de Montréal
first.last@mila.quebec
Charles Guille-Escuret
Mila
Université de Montréal
Ioannis Mitliagkas
Mila
Université de Montréal
Irina Rish
Mila
Université de Montréal
David Krueger
University of Cambridge
Pouya Bashivan
Mila
McGill University
Abstract
Adversarial robustness continues to be a major challenge for deep learning. A core
issue is that robustness to one type of attack often fails to transfer to other attacks.
While prior work establishes a theoretical trade-off in robustness against different $L_p$ norms, we show that there is potential for improvement against many commonly
used attacks by adopting a domain generalisation approach. Concretely, we treat
each type of attack as a domain, and apply the Risk Extrapolation method (REx),
which promotes similar levels of robustness against all training attacks. Compared
to existing methods, we obtain similar or superior worst-case adversarial robustness
on attacks seen during training. Moreover, we achieve superior performance on
families or tunings of attacks only encountered at test time. On ensembles of
attacks, our approach improves the accuracy from 3.4% with the best existing
baseline to 25.9% on MNIST, and from 16.9% to 23.5% on CIFAR10.
1 Introduction
Vulnerability to adversarial perturbations (Biggio et al., 2013; Szegedy et al., 2014; Goodfellow et al., 2015) is a major concern for real-world applications of machine learning such as healthcare (Qayyum et al., 2020) and autonomous driving (Deng et al., 2020). For example, Eykholt et al. (2018) show how
seemingly minor physical modifications to road signs may lead autonomous cars into misinterpreting
stop signs, while Li et al. (2020) achieve high success rates with over-the-air adversarial attacks on
speaker systems.
Much work has been done on defending against adversarial attacks (Goodfellow et al., 2015; Papernot et al., 2016). However, new attacks commonly overcome existing defenses (Athalye et al., 2018).
A defense that has so far passed the test of time against individual attacks is adversarial training.
Goodfellow et al. (2015) originally proposed training on examples perturbed with the Fast Gradient Sign Method (FGSM), which performs a step of sign gradient ascent on a sample $x$ to increase the chances of the model misclassifying it. Madry et al. (2018) further improved robustness by training on Projected Gradient Descent (PGD) adversaries, which perform multiple updates of (projected) gradient ascent to try to generate a maximally confusing perturbation within some $L_p$ ball of predetermined radius $\epsilon$ centred at the chosen data sample.
Unfortunately, adversarial training can fail to provide high robustness against several attacks, or
tunings of attacks, only encountered at test time. For instance, simply changing the norm constraining
the search for adversarial examples with PGD has been shown theoretically and empirically (Khoury & Hadfield-Menell, 2018; Tramèr & Boneh, 2019; Maini et al., 2020) to induce significant trade-offs
in performance against PGD of different norms. This issue highlights the importance of having a
well-defined notion of “robustness”: while the accuracy against individual attacks has often been used as a proxy for robustness, a better notion of robustness, as argued by Athalye et al. (2018),
is to consider the accuracy against an ensemble of attacks within a threat model (i.e. a predefined set
of allowed attacks). Indeed, in the example of autonomous driving, an attacker will not be constrained
to a single attack on stop signs, and is free to attempt several attacks to find one that succeeds.
In order to be robust against multiple attacks, we draw inspiration from domain generalisation. In
domain generalisation, we seek to achieve consistent performance even in the case of unknown distributional shifts in the inputs at test time. We interpret different attacks as distinct distributional shifts
in the data, and propose to leverage existing techniques from the out-of-distribution generalisation
literature.
We choose variance REx (Krueger et al., 2021), which uses as a loss penalty the variance of the empirical risk minimisation loss across the different training domains. We choose this method as it is conceptually simple, its iterations are no more costly than those of existing multi-perturbation baselines, it does not constrain the architecture, and it can be used on models pretrained with existing defenses.
We consider robustness against an adversary having access to both the model and multiple attacks.
However, there are multiple potential challenges: first, Gulrajani & Lopez-Paz (2020) show that
domain generalisation methods, such as REx, often fail to improve over empirical risk minimisation
(ERM) in many settings. Thus, it is possible that REx would fail to improve Tramèr & Boneh (2019)’s
defense, which uses ERM. Second, domain generalisation methods are usually designed for stationary
settings, whereas in adversarial machine learning, the distribution of adversarial perturbations is
non-stationary during training as the attacks adapt to the changes in the model parameters. Finally,
the state-of-the-art multi-perturbation defense proposed by Maini et al. (2020), which we intend to
improve with REx, does not explicitly train on multiple domains, which REx originally requires.
Therefore, we are interested in the two following research questions:
1. Can REx improve robustness against multiple attacks seen during training?
2. Can REx improve robustness against unseen attacks, that is, attacks only seen at test time?
Our results show that the answer to both questions is yes on the ensembles of attacks used in this
work. We show that REx consistently yields benefits across variations in: datasets, architectures,
multi-perturbation defenses, hyperparameter tuning, attacks seen during training, and attack types or
tunings only encountered at test time.
2 Related Work
2.1 Adversarial attacks and defenses
Since the discovery of adversarial examples against neural networks (Szegedy et al., 2014), numerous approaches for finding adversarial perturbations (i.e. adversarial attacks) have been proposed (Goodfellow et al., 2015; Madry et al., 2018; Moosavi-Dezfooli et al., 2016; Carlini & Wagner, 2017; Croce & Hein, 2020), with the common goal of finding perturbation vectors with constrained magnitude
that, when added to the network’s input, lead to (often highly confident) misclassification.
One of the earliest attacks, the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015), computes a perturbation on an input $x_0$ by performing a step of sign gradient ascent in the direction that increases the loss $\mathcal{L}$ the most, given the model's current parameters $\theta$. This yields an adversarial example $\tilde{x}$ that may be misclassified:
$$\tilde{x} = x_0 + \alpha\,\mathrm{sgn}\!\left(\nabla_x \mathcal{L}(\theta, x_0, y)\right). \tag{1}$$
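For concreteness, the following is a minimal PyTorch-style sketch of an FGSM step; the function name and signature are our own illustration (not code from the paper), and we assume a classifier `model` trained with cross-entropy.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, alpha):
    """One step of sign gradient ascent on the cross-entropy loss, as in Eq. (1)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    # Step in the direction that increases the loss the most, then stop tracking gradients.
    return (x_adv + alpha * grad.sign()).detach()
```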
This was later enhanced into the Projected Gradient Descent (PGD) attack (Kurakin et al., 2017; Madry et al., 2018) by iterating this operation multiple times and adding projections to constrain it to some neighbourhood of $x_0$, usually a ball of radius $\epsilon$ centred at $x_0$, denoted $B_\epsilon(x_0)$:
$$x_{t+1} = \Pi_{B_\epsilon(x_0)}\!\left(x_t + \alpha\,\mathrm{sgn}\!\left(\nabla_x \mathcal{L}(\theta, x_t, y)\right)\right). \tag{2}$$
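A corresponding sketch of an $L_\infty$ PGD attack in the same style, where the projection $\Pi_{B_\epsilon(x_0)}$ reduces to an elementwise clamp (inputs are assumed to lie in $[0, 1]$; names are again hypothetical):

```python
import torch
import torch.nn.functional as F

def pgd_linf_perturb(model, x0, y, eps, alpha, n_iters):
    """Iterated sign-gradient steps projected back onto the L-infinity ball B_eps(x0), as in Eq. (2)."""
    x_adv = x0.clone().detach()
    for _ in range(n_iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            # Project onto B_eps(x0) in the L-infinity norm, then onto the valid input range.
            x_adv = torch.max(torch.min(x_adv, x0 + eps), x0 - eps).clamp(0.0, 1.0)
    return x_adv.detach()
```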
With the advent of diverse algorithms to defend classifiers against such attacks, approaches for discovering adversarial examples have become increasingly complex over the years. Notably, it was found that a great number of adversarial defenses rely on gradient obfuscation (Athalye et al., 2018), which consists in learning to mask or distort the classifier's gradients to prevent attacks that iterate over gradients from making progress. However, it was later discovered that such approaches can be broken by other attacks (Athalye et al., 2018; Croce & Hein, 2020), some of which bypass these defenses by not relying on gradients (Brendel et al., 2019; Andriushchenko et al., 2020).
A defense that was shown to be robust to such countermeasures is Adversarial Training (Madry et al., 2018), which consists in training on adversarial examples. Adversarial training corresponds to solving a minimax optimisation problem where the inner loop executes an adversarial attack algorithm, usually PGD, to find perturbations of the inputs that maximise the classification loss, while the outer loop tunes the network parameters to minimise the loss on the adversarial examples. Despite the method's simplicity, robust classifiers trained with adversarial training achieve state-of-the-art levels of robustness against various newer attacks (Athalye et al., 2018; Croce & Hein, 2020). For this reason, adversarial training has become one of the most common defenses.
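To illustrate the minimax structure described above, here is a hedged sketch of one epoch of PGD adversarial training, reusing the hypothetical `pgd_linf_perturb` helper from the previous sketch (the data loader, model and attack hyperparameters are assumed to be defined elsewhere):

```python
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, eps, alpha, n_iters):
    """Inner loop: craft PGD adversarial examples; outer loop: minimise the loss on them."""
    model.train()
    for x, y in loader:
        # Inner maximisation: attack the current parameters.
        x_adv = pgd_linf_perturb(model, x, y, eps, alpha, n_iters)
        # Outer minimisation: update the parameters on the perturbed batch.
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
```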
Figure 1: Validation accuracy of a model adversarially trained on PGD $L_2$-perturbed CIFAR10 with a ResNet18, evaluated on PGD $L_2$ and Carlini & Wagner (CW) $L_2$ attacks. Curves are smoothed with exponential moving averaging (weight 0.7).
However, Khoury & Hadfield-Menell (2018) and Tramèr & Boneh (2019) show how training on PGD with a search region constrained by a $p$-norm may not yield robustness against PGD attacks using other $p$-norms. One reason is that different radii are typically chosen for different norms, so that the search spaces of PGD with respect to different norms may have some mutually exclusive regions. Another reason is that different attacks, such as PGD and the Carlini and Wagner (Carlini & Wagner, 2017) attacks, optimise different losses (note that this is also true for PGD of different norms). As an example, Fig. 1 illustrates how, when adversarially training a model on $L_2$-norm PGD, the accuracy against one attack may improve while it decreases against another attack, even if the attacks use the same $p$-norm.
Highlighting the need for methods specifically designed to defend against multiple types of perturbations, Tramèr & Boneh (2019) select a set of 3 attacks $\mathcal{A} = \{P_\infty, P_2, P_1\}$, where $P_p$ is PGD with a search region constrained by the $L_p$ norm. They attempt two strategies: the average (Avg) strategy consists in training over all attacks in $\mathcal{A}$ for each input $(x, y)$ in the dataset, while the max strategy trains on the attack with the highest loss for each sample:
$$\mathcal{L}_{\mathrm{Avg}}(\theta, \mathcal{A}) = \mathbb{E}\left[\frac{1}{|\mathcal{A}|} \sum_{A \in \mathcal{A}} \ell(\theta, A(x), y)\right] \tag{3}$$
$$\mathcal{L}_{\max}(\theta, \mathcal{A}) = \mathbb{E}\left[\max_{A \in \mathcal{A}} \ell(\theta, A(x), y)\right] \tag{4}$$
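As an illustration, the two losses could be computed on a batch as follows, given a list of attack callables that each map $(x, y)$ to a perturbed batch against the current model (a sketch with hypothetical names, not the authors' code):

```python
import torch
import torch.nn.functional as F

def avg_and_max_losses(model, attacks, x, y):
    """Per-sample losses under each attack, reduced as in Eq. (3) (Avg) and Eq. (4) (max)."""
    # Shape (num_attacks, batch_size): per-sample cross-entropy under each attack.
    per_attack = torch.stack(
        [F.cross_entropy(model(attack(x, y)), y, reduction="none") for attack in attacks]
    )
    loss_avg = per_attack.mean(dim=0).mean()        # average over attacks, then over the batch
    loss_max = per_attack.max(dim=0).values.mean()  # worst attack per sample, then average over the batch
    return loss_avg, loss_max
```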
Maini et al. (2020) propose a modification to the max method: instead of 3 different PGD adversaries that each iterate over a budget of iterations as in Eq. 2, they design an attack that chooses the worst perturbation among $L_\infty$, $L_2$ and $L_1$ PGD at every iteration, throughout the chosen number of iterations. This attack, Multi-Steepest Descent (MSD), differs from the max approach of Tramèr & Boneh (2019), where each attack is individually iterated through the budget of iterations first, and the one leading to the worst loss is chosen at the end. Note that this implies that technically, unlike Tramèr & Boneh (2019)'s Avg approach, MSD¹ only consists in training on a single attack. Maini et al. (2020) show that, in their experimental setup, MSD yields superior performance to both the Avg and Max approaches.
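To make the difference concrete, below is a loose sketch of an MSD-style inner loop: at every iteration, one candidate step is taken per norm and the most damaging candidate is kept. This is only an illustration of the idea (selection here is per batch rather than per sample, and the per-norm step-and-projection functions `step_fns` are hypothetical), not Maini et al. (2020)'s exact algorithm.

```python
import torch
import torch.nn.functional as F

def msd_style_perturb(model, x0, y, step_fns, n_iters):
    """At every iteration, try one step of each per-norm attack and keep the worst perturbation."""
    x_adv = x0.clone().detach()
    for _ in range(n_iters):
        # One candidate per norm, e.g. an L-infinity, an L2 and an L1 projected gradient step.
        candidates = [step_fn(model, x_adv, x0, y) for step_fn in step_fns]
        losses = torch.stack([F.cross_entropy(model(c), y) for c in candidates])
        x_adv = candidates[int(losses.argmax())].detach()
    return x_adv
```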
Nevertheless, there is still a very large gap between the performance of such approaches against data
perturbed by ensembles of attacks, and the accuracy on the unperturbed data. In order to help address
this large gap, we will be exploiting a connection between our goal and domain generalisation.
¹ In the rest of the paper, we will use MSD to refer to both the MSD attack, and to training on MSD as a defense.
2.2 Robustness as a domain generalisation problem
Domain generalisation – Out-of-Distribution generalisation (OoD) is an approach to dealing with
(typically non-adversarial) distributional shifts. In the domain generalisation setting, the training
data is assumed to come from several different domains, each with a different data distribution. The
goal is to use the variability across training (or seen) domains to learn a model that can generalise to
unseen domains while performing well on the seen domains. In other words, the goal is for the model
to have consistent performance by learning to be invariant under distributional shifts. Typically, we
also assume access to domain labels, i.e. we know which domain each data point belongs to. Many
methods for domain generalisation have been proposed – see (Wang et al., 2021) for a survey.
Our work views adversarial robustness as a domain generalisation problem, where the domains stem from different adversarial attacks. Because different attacks use different methods of searching for adversarial examples, and sometimes different search spaces, they may produce different distributions of adversarial examples². One might draw an analogy to Hendrycks & Dietterich (2019)'s work on natural perturbations, where the type and the strength of the perturbations play a similar role to varying the attacks or their tuning, respectively. There are several reasons why the domains we consider may be distributionally shifted with respect to one another (although the distributions may have some overlap). To non-exhaustively name a few: first, we already evoked how different $p$-norms affect the distributions of adversarial examples yielded by PGD (Khoury & Hadfield-Menell, 2018; Tramèr & Boneh, 2019). Second, different attacks may optimise different losses – for example when comparing $P_2$ and $L_2$ CW – which may yield different solutions. Third, the same attack tuned differently (e.g. with a different $\epsilon$ or iteration budget) may yield different distributions of adversarial examples, since they do not have the same support. Therefore, robustness to attacks unseen during training means robustness against the corresponding distributional shifts at test time. It is natural to frame adversarial robustness as a domain generalisation problem, as we seek a model that is robust to any method of generating adversarial distributional shifts within a threat model, including novel attacks.
In spite of this intuition, it is not obvious that such methods would work in the case of adversarial machine learning. First, recent work demonstrates that domain generalisation methods often fail to improve upon standard empirical risk minimisation (ERM), i.e. minimising the loss on the combined training domains without making use of domain labels (Gulrajani & Lopez-Paz, 2020). On the other hand, success may depend on choosing a method appropriate for the type of shifts at play. Second, a key difference from most work in domain generalisation is that when adversarially training, the training distribution shifts every epoch, as the attacks are computed from the continuously-updated values of the weights. In contrast, in domain generalisation, the training domains are usually fixed. Non-stationarity is known to cause generalisation failure in many areas of machine learning, notably reinforcement learning (Igl et al., 2020), thereby potentially affecting the success of domain generalisation methods in adversarial machine learning. Third, MSD does not generate multiple domains, which domain generalisation approaches would typically require.
We note that interestingly, the Avg approach of Tramèr & Boneh (2019) can be interpreted as doing
domain generalisation with ERM over the 3 PGD adversaries as training domains. Similarly, the
max approach consists in applying the Robust Optimisation approach on the same set of domains.
Furthermore, Song et al. (2018) and Bashivan et al. (2021) propose to treat the clean and PGD-
perturbed data as training and testing domains from which some samples are accessible during
training, and adopt domain adaptation approaches. Therefore, it is difficult to predict in advance how
much a domain generalisation approach can successfully improve adversarial defenses.
In this work, we apply the method of variance-based risk extrapolation (REx) (Krueger et al.,
2021), which simply adds as a loss penalty the variance of the ERM loss across different domains.
This encourages worst-case robustness over more extreme versions of the shifts (here, shifts are
between different attacks) observed between the training domains. This can be motivated in the
setting of adversarial robustness by the observation that adversaries might shift their distribution
of attacks to better exploit vulnerabilities in a model. In that light, REx is particularly appropriate
given our objective of mitigating trade-offs in performance between different attacks to achieve
a more consistent degree of robustness. We note that our implementation of REx has the same
computational complexity per epoch as the MSD, Avg and max approaches, requiring the computation
of 3 adversarial perturbations per sample.
² Another way to think about this is that if different attacks or tunings yielded identical distributions, then standard results from statistical learning theory would imply similar performance on the various attacks.
3 Methodology
Threat model – In this work, we consider white-box attacks, which are typically the strongest type
of attacks as they assume the attacker has access to the model and its parameters. Additionally, the
attacks considered in the evaluations are gradient-based, with the exception of AutoAttack, which is
composite and includes gradient-free perturbations (Croce & Hein, 2020). Because we assume that
the attacker has access to all of these attacks, we emphasise that, as argued by Athalye et al. (2018),
the robustness against the ensemble of the different attacks is a better metric for how the defenses
perform than the accuracy on each individual attack. Thus, using $\ell_{01}$ as the 0-1 loss, we evaluate the performance on an ensemble of domains $\mathcal{D}$ as:
$$R = 1 - \mathbb{E}\left[\max_{D \in \mathcal{D}} \ell_{01}(\theta, D(x), y)\right] \tag{5}$$
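A sketch of how this metric could be evaluated: a test point only counts as correct if the model classifies it correctly under every attack in the ensemble (loop structure and names are ours):

```python
import torch

def ensemble_robust_accuracy(model, attacks, loader):
    """Fraction of test samples classified correctly under all attacks in the ensemble, as in Eq. (5)."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        # A sample is robustly correct only if no attack in the ensemble fools the model on it.
        survives = torch.ones_like(y, dtype=torch.bool)
        for attack in attacks:
            x_adv = attack(x, y)  # e.g. a PGD, CW, DeepFool or AutoAttack adversary
            with torch.no_grad():
                survives &= model(x_adv).argmax(dim=1).eq(y)
        correct += survives.sum().item()
        total += y.numel()
    return correct / total
```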
REx – We propose to regularise the average loss over a set of training domains $\mathcal{D}$ by the variance of the losses on the different domains:
$$\mathcal{L}_{\mathrm{REx}}(\theta, \mathcal{D}) = \mathcal{L}_{\mathrm{Avg}}(\theta, \mathcal{D}) + \beta\,\mathrm{Var}_{D \in \mathcal{D}}\left[\mathbb{E}\,\ell(\theta, D(x), y)\right] \tag{6}$$
where $\ell$ is the cross-entropy loss. We start penalising by the variance over the training domains once the baseline's accuracies on the seen domains stabilise or peak.
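A minimal sketch of this loss on a single batch, given one perturbed copy of the batch per seen domain (names are illustrative; `torch.var` uses the unbiased sample variance, which is one of several reasonable choices here):

```python
import torch
import torch.nn.functional as F

def rex_loss(model, domain_batches, y, beta):
    """Average cross-entropy over domains plus beta times the variance of the per-domain losses (Eq. 6)."""
    # domain_batches: one input tensor per seen domain, e.g. the clean, P1, P2 and P_inf versions of x.
    domain_losses = torch.stack([F.cross_entropy(model(x_d), y) for x_d in domain_batches])
    loss_avg = domain_losses.mean()
    penalty = domain_losses.var()  # variance of the risks across the training domains
    return loss_avg + beta * penalty
```

In line with the schedule described above, the penalty would only be switched on (e.g. by keeping $\beta = 0$ until then) once the baseline's accuracies on the seen domains stabilise or peak.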
Datasets and architectures – We consider two datasets: MNIST (LeCun et al., 1998) and CIFAR10 (Krizhevsky et al., 2009). It is still an open problem to obtain high robustness against multiple attacks on MNIST (Tramèr & Boneh, 2019; Maini et al., 2020), even at standard tunings of some commonly used attacks. On MNIST, we use a 3-layer perceptron of size [512, 512, 10]. On CIFAR10, we use the ResNet18 architecture (He et al., 2016). We choose two significantly different architectures to illustrate that our approach may work agnostically to the choice of architecture. We always use batch sizes of 128 when training.
Optimiser – We use Stochastic Gradient Descent (SGD) with momentum 0.9. In subsections 4.2 and B.2, we do not perform hyperparameter optimisation, in order to isolate the effect of REx from interactions with hyperparameter tuning, which would differ for each defense. We use a fixed learning rate of 0.01 and no weight decay, and we fix the coefficient $\beta$ in the REx loss. In subsection 4.4, we optimise hyperparameters. Based on (Rice et al., 2020) and (Pang et al., 2020), we use in all cases a weight decay of $5 \cdot 10^{-4}$ and a piecewise learning rate decay. For every defense, we search for an optimal epoch to decay the learning rate, with particular attention to MSD and MSD+REx, where we observed a high sensitivity to the choice of learning rate decay milestone. Note that in the case of REx defenses, we always use checkpoints of the corresponding baselines from before the learning rate is decayed, as we observed this to lead to better performance.
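For reference, a sketch of the optimiser setup just described (the model is a placeholder, and the decay milestone and factor are hypothetical, since the actual decay epoch is searched per defense):

```python
import torch

model = torch.nn.Linear(3 * 32 * 32, 10)  # placeholder; the paper uses an MLP on MNIST and a ResNet18 on CIFAR10

# Subsections 4.2 and B.2: fixed learning rate, no weight decay, fixed REx coefficient beta.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0)

# Subsection 4.4: weight decay 5e-4 and a piecewise learning rate decay at a searched epoch.
tuned_optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(tuned_optimizer, milestones=[75], gamma=0.1)  # hypothetical values
```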
Domains – We consider several domains: unperturbed data; $L_1$, $L_2$ and $L_\infty$ PGD (denoted $P_1$, $P_2$, $P_\infty$); $L_2$ Carlini & Wagner (CW$_2$) (Carlini & Wagner, 2017); $L_\infty$ DeepFool (DF$_\infty$) (Moosavi-Dezfooli et al., 2016); and AutoAttack (AA) (Croce & Hein, 2020). We use the Advertorch implementation of these attacks (Ding et al., 2019). For $L_\infty$ PGD, CW and DF, we use two sets of tunings; see appendix A for details. A superscript on an attack indicates a harder tuning of that attack that no model was trained on. Those tunings are intentionally chosen to make the attacks stronger. The set of domains unseen by all models consists of $P_\infty$, DF$_\infty$ and CW$_2$ at their harder tunings, together with AutoAttack, with additionally AutoAttack$_2$ in subsection 4.4. The set of domains unseen by a specific model is the set of all domains except those seen by the model during training, and therefore varies between baselines. We perform 10 attack restarts per sample to reduce randomness in the test set evaluations.
Defenses – Aside from the adversarial training baselines on PGD of $L_1$, $L_2$ and $L_\infty$ norms, we define 3 sets of seen domains: $\mathcal{D} = \{\varnothing, P_\infty, \mathrm{DF}_\infty, \mathrm{CW}_2\}$, $\mathcal{D}_{\mathrm{PGDs}} = \{\varnothing, P_1, P_2, P_\infty\}$ and $\mathcal{D}_{\mathrm{MSD}} = \{\mathrm{MSD}\}$, where $\varnothing$ represents the unperturbed data. We train two Avg baselines: one on $\mathcal{D}$ and one on $\mathcal{D}_{\mathrm{PGDs}}$. We train the MSD baseline on $\mathcal{D}_{\mathrm{MSD}}$. We use REx on the Avg baselines with the corresponding sets of seen domains. However, when REx is used on the model trained with the MSD baseline, we revert to using the set of seen domains $\mathcal{D}_{\mathrm{PGDs}}$. While the MSD baseline does not exactly train over $P_1$, $P_2$ and $P_\infty$ but rather on a composition of these three attacks, we use these attacks when applying REx to the MSD baseline, as MSD would only generate one domain, which would not allow us to compute a variance over domains. Note that we chose different sets of seen domains, and different baselines (Avg and MSD), in order to show that REx yields benefits on several multi-perturbation baselines, or within the same baseline with different choices of seen domains. We use cross-entropy for all defenses.
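Putting the pieces together, a sketch of a single REx training step over a set of seen domains, reusing the hypothetical `rex_loss` helper sketched earlier; this is an illustration of the defense under our assumptions, not the authors' exact training code:

```python
def rex_training_step(model, optimizer, x, y, domain_attacks, beta):
    """One optimisation step of the REx defense on a batch, over the chosen set of seen domains."""
    # Each callable in domain_attacks maps (x, y) to a perturbed batch against the current model;
    # the unperturbed-data domain corresponds to an identity "attack" that returns x unchanged.
    domain_batches = [attack(x, y) for attack in domain_attacks]
    optimizer.zero_grad()
    loss = rex_loss(model, domain_batches, y, beta)
    loss.backward()
    optimizer.step()
    return loss.item()
```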