Towards Out-of-Distribution Adversarial Robustness
Adam Ibrahim
Mila
Université de Montréal
first.last@mila.quebec
Charles Guille-Escuret
Mila
Université de Montréal
Ioannis Mitliagkas
Mila
Université de Montréal
Irina Rish
Mila
Université de Montréal
David Krueger
University of Cambridge
Pouya Bashivan
Mila
McGill University
Abstract
Adversarial robustness continues to be a major challenge for deep learning. A core
issue is that robustness to one type of attack often fails to transfer to other attacks.
While prior work establishes a theoretical trade-off in robustness against different $L_p$ norms, we show that there is potential for improvement against many commonly
used attacks by adopting a domain generalisation approach. Concretely, we treat
each type of attack as a domain, and apply the Risk Extrapolation method (REx),
which promotes similar levels of robustness against all training attacks. Compared
to existing methods, we obtain similar or superior worst-case adversarial robustness
on attacks seen during training. Moreover, we achieve superior performance on
families or tunings of attacks only encountered at test time. On ensembles of
attacks, our approach improves the accuracy from 3.4% with the best existing
baseline to 25.9% on MNIST, and from 16.9% to 23.5% on CIFAR10.
1 Introduction
Vulnerability to adversarial perturbations (Biggio et al., 2013; Szegedy et al., 2014; Goodfellow et al., 2015) is a major concern for real-world applications of machine learning such as healthcare (Qayyum et al., 2020) and autonomous driving (Deng et al., 2020). For example, Eykholt et al. (2018) show how
seemingly minor physical modifications to road signs may lead autonomous cars into misinterpreting
stop signs, while Li et al. (2020) achieve high success rates with over-the-air adversarial attacks on
speaker systems.
Much work has been done on defending against adversarial attacks (Goodfellow et al., 2015; Papernot et al., 2016). However, new attacks commonly overcome existing defenses (Athalye et al., 2018).
A defense that has so far passed the test of time against individual attacks is adversarial training.
Goodfellow et al. (2015) originally proposed training on examples perturbed with the Fast Gradient Sign Method (FGSM), which performs a step of sign gradient ascent on a sample $x$ to increase the chances of the model misclassifying it. Madry et al. (2018) further improved robustness by training on Projected Gradient Descent (PGD) adversaries, which perform multiple updates of (projected) gradient ascent to try to generate a maximally confusing perturbation within some $L_p$ ball of predetermined radius $\epsilon$ centred at the chosen data sample.
Unfortunately, adversarial training can fail to provide high robustness against several attacks, or
tunings of attacks, only encountered at test time. For instance, simply changing the norm constraining
the search for adversarial examples with PGD has been shown theoretically and empirically (Khoury & Hadfield-Menell, 2018; Tramèr & Boneh, 2019; Maini et al., 2020) to induce significant trade-offs
in performance against PGD of different norms. This issue highlights the importance of having a
well-defined notion of “robustness”: while the accuracy against individual attacks has often been used as a proxy for robustness, a better notion of robustness, as argued by Athalye et al. (2018),
is to consider the accuracy against an ensemble of attacks within a threat model (i.e. a predefined set
of allowed attacks). Indeed, in the example of autonomous driving, an attacker will not be constrained
to a single attack on stop signs, and is free to attempt several attacks to find one that succeeds.
In order to be robust against multiple attacks, we draw inspiration from domain generalisation. In
domain generalisation, we seek to achieve consistent performance even in the case of unknown distributional shifts in the inputs at test time. We interpret different attacks as distinct distributional shifts
in the data, and propose to leverage existing techniques from the out-of-distribution generalisation
literature.
We choose variance REx (Krueger et al., 2021), which uses as a loss penalty the variance of the empirical risk minimisation loss across the different training domains. We choose this method as it is conceptually simple, its iterations are no more costly than those of existing multi-perturbation baselines, it does not constrain the architecture, and it can be used on models pretrained with existing defenses.
We consider robustness against an adversary having access to both the model and multiple attacks.
However, there are multiple potential challenges: first, Gulrajani & Lopez-Paz (2020) show that
domain generalisation methods, such as REx, often fail to improve over empirical risk minimisation
(ERM) in many settings. Thus, it is possible that REx would fail to improve Tramèr & Boneh (2019)’s
defense, which uses ERM. Second, domain generalisation methods are usually designed for stationary
settings, whereas in adversarial machine learning, the distribution of adversarial perturbations is
non-stationary during training as the attacks adapt to the changes in the model parameters. Finally,
the state-of-the-art multi-perturbation defense proposed by Maini et al. (2020), which we intend to
improve with REx, does not explicitly train on multiple domains, which REx originally requires.
Therefore, we are interested in the two following research questions:
1. Can REx improve robustness against multiple attacks seen during training?
2. Can REx improve robustness against unseen attacks, that is, attacks only seen at test time?
Our results show that the answer to both questions is yes on the ensembles of attacks used in this
work. We show that REx consistently yields benefits across variations in: datasets, architectures,
multi-perturbation defenses, hyperparameter tuning, attacks seen during training, and attack types or
tunings only encountered at test time.
2 Related Work
2.1 Adversarial attacks and defenses
Since the discovery of adversarial examples against neural networks (Szegedy et al., 2014), numerous approaches for finding adversarial perturbations (i.e. adversarial attacks) have been proposed (Goodfellow et al., 2015; Madry et al., 2018; Moosavi-Dezfooli et al., 2016; Carlini & Wagner, 2017; Croce & Hein, 2020), with the common goal of finding perturbation vectors with constrained magnitude
that, when added to the network’s input, lead to (often highly confident) misclassification.
One of the earliest attacks, the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015), computes a perturbation on an input $x_0$ by performing a step of sign gradient ascent in the direction that increases the loss $\mathcal{L}$ the most, given the model's current parameters $\theta$. This yields an adversarial example $\tilde{x}$ that may be misclassified:
$$\tilde{x} = x_0 + \alpha\,\mathrm{sgn}\!\left(\nabla_x \mathcal{L}(\theta, x_0, y)\right). \tag{1}$$
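For concreteness, the following is a minimal PyTorch-style sketch of an FGSM step; the function name and signature are our own illustration (not code from the paper), and we assume a classifier `model` trained with cross-entropy.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, alpha):
    """One step of sign gradient ascent on the cross-entropy loss, as in Eq. (1)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    # Step in the direction that increases the loss the most, then stop tracking gradients.
    return (x_adv + alpha * grad.sign()).detach()
```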
This was later enhanced into the Projected Gradient Descent (PGD) attack (Kurakin et al., 2017; Madry et al., 2018) by iterating this operation multiple times and adding projections to constrain it to some neighbourhood of $x_0$, usually a ball of radius $\epsilon$ centred at $x_0$, denoted $B_\epsilon(x_0)$:
$$x_{t+1} = \Pi_{B_\epsilon(x_0)}\!\left(x_t + \alpha\,\mathrm{sgn}\!\left(\nabla_x \mathcal{L}(\theta, x_t, y)\right)\right). \tag{2}$$
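A corresponding sketch of an $L_\infty$ PGD attack in the same style, where the projection $\Pi_{B_\epsilon(x_0)}$ reduces to an elementwise clamp (inputs are assumed to lie in $[0, 1]$; names are again hypothetical):

```python
import torch
import torch.nn.functional as F

def pgd_linf_perturb(model, x0, y, eps, alpha, n_iters):
    """Iterated sign-gradient steps projected back onto the L-infinity ball B_eps(x0), as in Eq. (2)."""
    x_adv = x0.clone().detach()
    for _ in range(n_iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            # Project onto B_eps(x0) in the L-infinity norm, then onto the valid input range.
            x_adv = torch.max(torch.min(x_adv, x0 + eps), x0 - eps).clamp(0.0, 1.0)
    return x_adv.detach()
```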
With the advent of diverse algorithms to defend classifiers against such attacks, approaches for discovering adversarial examples have become increasingly complex over the years. Notably, it was found that a great number of adversarial defenses rely on gradient obfuscation (Athalye et al., 2018), which consists in learning to mask or distort the classifier's gradients to prevent attacks that iterate over gradients from making progress. However, it was later discovered that such approaches can be broken by other attacks (Athalye et al., 2018; Croce & Hein, 2020), some of which bypass these defenses by not relying on gradients (Brendel et al., 2019; Andriushchenko et al., 2020).
A defense that was shown to be robust to such countermeasures is Adversarial Training (Madry et al., 2018), which consists in training on adversarial examples. Adversarial training corresponds to solving a minimax optimisation problem where the inner loop executes an adversarial attack algorithm, usually PGD, to find perturbations of the inputs that maximise the classification loss, while the outer loop tunes the network parameters to minimise the loss on the adversarial examples. Despite the method's simplicity, robust classifiers trained with adversarial training achieve state-of-the-art levels of robustness against various newer attacks (Athalye et al., 2018; Croce & Hein, 2020). For this reason, adversarial training has become one of the most common defenses.
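To illustrate the minimax structure described above, here is a hedged sketch of one epoch of PGD adversarial training, reusing the hypothetical `pgd_linf_perturb` helper from the previous sketch (the data loader, model and attack hyperparameters are assumed to be defined elsewhere):

```python
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, eps, alpha, n_iters):
    """Inner loop: craft PGD adversarial examples; outer loop: minimise the loss on them."""
    model.train()
    for x, y in loader:
        # Inner maximisation: attack the current parameters.
        x_adv = pgd_linf_perturb(model, x, y, eps, alpha, n_iters)
        # Outer minimisation: update the parameters on the perturbed batch.
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
```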
Figure 1: Validation accuracy of a model adversarially trained on PGD $L_2$-perturbed CIFAR10 with a ResNet18, evaluated on PGD $L_2$ and Carlini & Wagner (CW) $L_2$ attacks. Curves are smoothed with exponential moving averaging (weight 0.7).
However, Khoury & Hadfield-Menell (2018) and Tramèr & Boneh (2019) show how training on PGD with a search region constrained by a $p$-norm may not yield robustness against PGD attacks using other $p$-norms. One reason is that different radii are typically chosen for different norms, so that the search spaces of PGD with respect to different norms may have some mutually exclusive regions. Another reason is that different attacks, such as PGD and the Carlini and Wagner (Carlini & Wagner, 2017) attacks, optimise different losses (note that this is also true for PGD of different norms). As an example, Fig. 1 illustrates how, when adversarially training a model on $L_2$-norm PGD, the accuracy against one attack may improve while it decreases against another attack, even if the attacks use the same $p$-norm.
Highlighting the need for methods specifically designed to defend against multiple types of perturbations, Tramèr & Boneh (2019) select a set of 3 attacks $\mathcal{A} = \{P_\infty, P_2, P_1\}$, where $P_p$ is PGD with a search region constrained by the $L_p$ norm. They attempt two strategies: the average (Avg) strategy consists in training over all attacks in $\mathcal{A}$ for each input $(x, y)$ in the dataset, while the max strategy trains on the attack with the highest loss for each sample:
$$\mathcal{L}_{\mathrm{Avg}}(\theta, \mathcal{A}) = \mathbb{E}\left[\frac{1}{|\mathcal{A}|} \sum_{A \in \mathcal{A}} \ell(\theta, A(x), y)\right] \tag{3}$$
$$\mathcal{L}_{\max}(\theta, \mathcal{A}) = \mathbb{E}\left[\max_{A \in \mathcal{A}} \ell(\theta, A(x), y)\right] \tag{4}$$
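As an illustration, the two losses could be computed on a batch as follows, given a list of attack callables that each map $(x, y)$ to a perturbed batch against the current model (a sketch with hypothetical names, not the authors' code):

```python
import torch
import torch.nn.functional as F

def avg_and_max_losses(model, attacks, x, y):
    """Per-sample losses under each attack, reduced as in Eq. (3) (Avg) and Eq. (4) (max)."""
    # Shape (num_attacks, batch_size): per-sample cross-entropy under each attack.
    per_attack = torch.stack(
        [F.cross_entropy(model(attack(x, y)), y, reduction="none") for attack in attacks]
    )
    loss_avg = per_attack.mean(dim=0).mean()        # average over attacks, then over the batch
    loss_max = per_attack.max(dim=0).values.mean()  # worst attack per sample, then average over the batch
    return loss_avg, loss_max
```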
Maini et al. (2020) propose a modification to the max method: instead of 3 different PGD adversaries that each iterate over a budget of iterations as in Eq. 2, they design an attack that chooses the worst perturbation among $L_\infty$, $L_2$ and $L_1$ PGD at every iteration, throughout the chosen number of iterations. This attack, Multi-Steepest Descent (MSD), differs from the max approach of Tramèr & Boneh (2019), where each attack is individually iterated through the budget of iterations first, and the one leading to the worst loss is chosen at the end. Note that this implies that technically, unlike Tramèr & Boneh (2019)'s Avg approach, MSD¹ only consists in training on a single attack. Maini et al. (2020) show that, in their experimental setup, MSD yields superior performance to both the Avg and Max approaches.
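To make the difference concrete, below is a loose sketch of an MSD-style inner loop: at every iteration, one candidate step is taken per norm and the most damaging candidate is kept. This is only an illustration of the idea (selection here is per batch rather than per sample, and the per-norm step-and-projection functions `step_fns` are hypothetical), not Maini et al. (2020)'s exact algorithm.

```python
import torch
import torch.nn.functional as F

def msd_style_perturb(model, x0, y, step_fns, n_iters):
    """At every iteration, try one step of each per-norm attack and keep the worst perturbation."""
    x_adv = x0.clone().detach()
    for _ in range(n_iters):
        # One candidate per norm, e.g. an L-infinity, an L2 and an L1 projected gradient step.
        candidates = [step_fn(model, x_adv, x0, y) for step_fn in step_fns]
        losses = torch.stack([F.cross_entropy(model(c), y) for c in candidates])
        x_adv = candidates[int(losses.argmax())].detach()
    return x_adv
```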
Nevertheless, there is still a very large gap between the performance of such approaches against data
perturbed by ensembles of attacks, and the accuracy on the unperturbed data. In order to help address
this large gap, we will be exploiting a connection between our goal and domain generalisation.
¹ In the rest of the paper, we will use MSD to refer to both the MSD attack, and to training on MSD as a defense.
2.2 Robustness as a domain generalisation problem
Domain generalisation – Out-of-Distribution generalisation (OoD) is an approach to dealing with
(typically non-adversarial) distributional shifts. In the domain generalisation setting, the training
data is assumed to come from several different domains, each with a different data distribution. The
goal is to use the variability across training (or seen) domains to learn a model that can generalise to
unseen domains while performing well on the seen domains. In other words, the goal is for the model
to have consistent performance by learning to be invariant under distributional shifts. Typically, we
also assume access to domain labels, i.e. we know which domain each data point belongs to. Many
methods for domain generalisation have been proposed – see (Wang et al., 2021) for a survey.
Our work views adversarial robustness as a domain generalisation problem, where the domains stem from different adversarial attacks. Because different attacks use different methods of searching for adversarial examples, and sometimes different search spaces, they may produce different distributions of adversarial examples². One might draw an analogy to Hendrycks & Dietterich (2019)'s work on natural perturbations, where the type and the strength of the perturbations play a similar role to varying the attacks or their tuning, respectively. There are several reasons why the domains we consider may be distributionally shifted with respect to one another (although the distributions may have some overlap). To non-exhaustively name a few: first, we already evoked how different $p$-norms affect the distributions of adversarial examples yielded by PGD (Khoury & Hadfield-Menell, 2018; Tramèr & Boneh, 2019). Second, different attacks may optimise different losses – for example when comparing $P_2$ and $L_2$ CW – which may yield different solutions. Third, the same attack tuned differently (e.g. with a different $\epsilon$ or iteration budget) may yield different distributions of adversarial examples, since they do not have the same support. Therefore, robustness to attacks unseen during training means robustness against the corresponding distributional shifts at test time. It is natural to frame adversarial robustness as a domain generalisation problem, as we seek a model that is robust to any method of generating adversarial distributional shifts within a threat model, including novel attacks.
In spite of this intuition, it is not obvious that such methods would work in the case of adversarial machine learning. First, recent work demonstrates that domain generalisation methods often fail to improve upon standard empirical risk minimisation (ERM), i.e. minimising the loss on the combined training domains without making use of domain labels (Gulrajani & Lopez-Paz, 2020). On the other hand, success may depend on choosing a method appropriate for the type of shifts at play. Second, a key difference from most work in domain generalisation is that when adversarially training, the training distribution shifts every epoch, as the attacks are computed from the continuously-updated values of the weights. In contrast, in domain generalisation, the training domains are usually fixed. Non-stationarity is known to cause generalisation failure in many areas of machine learning, notably reinforcement learning (Igl et al., 2020), thereby potentially affecting the success of domain generalisation methods in adversarial machine learning. Third, MSD does not generate multiple domains, which domain generalisation approaches would typically require.
We note that interestingly, the Avg approach of Tramèr & Boneh (2019) can be interpreted as doing
domain generalisation with ERM over the 3 PGD adversaries as training domains. Similarly, the
max approach consists in applying the Robust Optimisation approach on the same set of domains.
Furthermore, Song et al. (2018) and Bashivan et al. (2021) propose to treat the clean and PGD-
perturbed data as training and testing domains from which some samples are accessible during
training, and adopt domain adaptation approaches. Therefore, it is difficult to predict in advance how
much a domain generalisation approach can successfully improve adversarial defenses.
In this work, we apply the method of variance-based risk extrapolation (REx) (Krueger et al.,
2021), which simply adds as a loss penalty the variance of the ERM loss across different domains.
This encourages worst-case robustness over more extreme versions of the shifts (here, shifts are
between different attacks) observed between the training domains. This can be motivated in the
setting of adversarial robustness by the observation that adversaries might shift their distribution
of attacks to better exploit vulnerabilities in a model. In that light, REx is particularly appropriate
given our objective of mitigating trade-offs in performance between different attacks to achieve
a more consistent degree of robustness. We note that our implementation of REx has the same
computational complexity per epoch as the MSD, Avg and max approaches, requiring the computation
of 3 adversarial perturbations per sample.
² Another way to think about this is that if different attacks or tunings yielded identical distributions, then standard results from statistical learning theory would imply similar performance on the various attacks.
3 Methodology
Threat model – In this work, we consider white-box attacks, which are typically the strongest type
of attacks as they assume the attacker has access to the model and its parameters. Additionally, the
attacks considered in the evaluations are gradient-based, with the exception of AutoAttack, which is
composite and includes gradient-free perturbations (Croce & Hein, 2020). Because we assume that
the attacker has access to all of these attacks, we emphasise that, as argued by Athalye et al. (2018),
the robustness against the ensemble of the different attacks is a better metric for how the defenses
perform than the accuracy on each individual attack. Thus, using $\ell_{01}$ as the 0-1 loss, we evaluate the performance on an ensemble of domains $\mathcal{D}$ as:
$$R = 1 - \mathbb{E}\left[\max_{D \in \mathcal{D}} \ell_{01}(\theta, D(x), y)\right] \tag{5}$$
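A sketch of how this metric could be evaluated: a test point only counts as correct if the model classifies it correctly under every attack in the ensemble (loop structure and names are ours):

```python
import torch

def ensemble_robust_accuracy(model, attacks, loader):
    """Fraction of test samples classified correctly under all attacks in the ensemble, as in Eq. (5)."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        # A sample is robustly correct only if no attack in the ensemble fools the model on it.
        survives = torch.ones_like(y, dtype=torch.bool)
        for attack in attacks:
            x_adv = attack(x, y)  # e.g. a PGD, CW, DeepFool or AutoAttack adversary
            with torch.no_grad():
                survives &= model(x_adv).argmax(dim=1).eq(y)
        correct += survives.sum().item()
        total += y.numel()
    return correct / total
```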
REx – We propose to regularise the average loss over a set of training domains $\mathcal{D}$ by the variance of the losses on the different domains:
$$\mathcal{L}_{\mathrm{REx}}(\theta, \mathcal{D}) = \mathcal{L}_{\mathrm{Avg}}(\theta, \mathcal{D}) + \beta\,\mathrm{Var}_{D \in \mathcal{D}}\left[\mathbb{E}\,\ell(\theta, D(x), y)\right] \tag{6}$$
where $\ell$ is the cross-entropy loss. We start penalising by the variance over the training domains once the baseline's accuracies on the seen domains stabilise or peak.
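A minimal sketch of this loss on a single batch, given one perturbed copy of the batch per seen domain (names are illustrative; `torch.var` uses the unbiased sample variance, which is one of several reasonable choices here):

```python
import torch
import torch.nn.functional as F

def rex_loss(model, domain_batches, y, beta):
    """Average cross-entropy over domains plus beta times the variance of the per-domain losses (Eq. 6)."""
    # domain_batches: one input tensor per seen domain, e.g. the clean, P1, P2 and P_inf versions of x.
    domain_losses = torch.stack([F.cross_entropy(model(x_d), y) for x_d in domain_batches])
    loss_avg = domain_losses.mean()
    penalty = domain_losses.var()  # variance of the risks across the training domains
    return loss_avg + beta * penalty
```

In line with the schedule described above, the penalty would only be switched on (e.g. by keeping $\beta = 0$ until then) once the baseline's accuracies on the seen domains stabilise or peak.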
Datasets and architectures – We consider two datasets: MNIST (LeCun et al., 1998) and CIFAR10 (Krizhevsky et al., 2009). It is still an open problem to obtain high robustness against multiple attacks on MNIST (Tramèr & Boneh, 2019; Maini et al., 2020), even at standard tunings of some commonly used attacks. On MNIST, we use a 3-layer perceptron of size [512, 512, 10]. On CIFAR10, we use the ResNet18 architecture (He et al., 2016). We choose two significantly different architectures to illustrate that our approach may work agnostically to the choice of architecture. We always use batch sizes of 128 when training.
Optimiser – We use Stochastic Gradient Descent (SGD) with momentum 0.9. In subsections 4.2 and B.2, we do not perform hyperparameter optimisation, in order to isolate the effect of REx from interactions with hyperparameter tuning, which would differ for each defense. We use a fixed learning rate of 0.01 and no weight decay, and we fix the coefficient $\beta$ in the REx loss. In subsection 4.4, we optimise hyperparameters. Based on (Rice et al., 2020) and (Pang et al., 2020), we use in all cases a weight decay of $5 \cdot 10^{-4}$ and a piecewise learning rate decay. For every defense, we search for an optimal epoch to decay the learning rate, with particular attention to MSD and MSD+REx, where we observed a high sensitivity to the choice of learning rate decay milestone. Note that in the case of REx defenses, we always use checkpoints of the corresponding baselines from before the learning rate is decayed, as we observed this to lead to better performance.
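For reference, a sketch of the optimiser setup just described (the model is a placeholder, and the decay milestone and factor are hypothetical, since the actual decay epoch is searched per defense):

```python
import torch

model = torch.nn.Linear(3 * 32 * 32, 10)  # placeholder; the paper uses an MLP on MNIST and a ResNet18 on CIFAR10

# Subsections 4.2 and B.2: fixed learning rate, no weight decay, fixed REx coefficient beta.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0)

# Subsection 4.4: weight decay 5e-4 and a piecewise learning rate decay at a searched epoch.
tuned_optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(tuned_optimizer, milestones=[75], gamma=0.1)  # hypothetical values
```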
Domains – We consider several domains: unperturbed data; $L_1$, $L_2$ and $L_\infty$ PGD (denoted $P_1$, $P_2$, $P_\infty$); $L_2$ Carlini & Wagner (CW$_2$) (Carlini & Wagner, 2017); $L_\infty$ DeepFool (DF$_\infty$) (Moosavi-Dezfooli et al., 2016); and AutoAttack (AA) (Croce & Hein, 2020). We use the Advertorch implementation of these attacks (Ding et al., 2019). For $L_\infty$ PGD, CW and DF, we use two sets of tunings; see appendix A for details. A superscript on an attack indicates a harder tuning of that attack that no model was trained on. Those tunings are intentionally chosen to make the attacks stronger. The set of domains unseen by all models consists of $P_\infty$, DF$_\infty$ and CW$_2$ at their harder tunings, together with AutoAttack, with additionally AutoAttack$_2$ in subsection 4.4. The set of domains unseen by a specific model is the set of all domains except those seen by the model during training, and therefore varies between baselines. We perform 10 attack restarts per sample to reduce randomness in the test set evaluations.
Defenses – Aside from the adversarial training baselines on PGD of $L_1$, $L_2$ and $L_\infty$ norms, we define 3 sets of seen domains: $\mathcal{D} = \{\varnothing, P_\infty, \mathrm{DF}_\infty, \mathrm{CW}_2\}$, $\mathcal{D}_{\mathrm{PGDs}} = \{\varnothing, P_1, P_2, P_\infty\}$ and $\mathcal{D}_{\mathrm{MSD}} = \{\mathrm{MSD}\}$, where $\varnothing$ represents the unperturbed data. We train two Avg baselines: one on $\mathcal{D}$ and one on $\mathcal{D}_{\mathrm{PGDs}}$. We train the MSD baseline on $\mathcal{D}_{\mathrm{MSD}}$. We use REx on the Avg baselines with the corresponding sets of seen domains. However, when REx is used on the model trained with the MSD baseline, we revert to using the set of seen domains $\mathcal{D}_{\mathrm{PGDs}}$. While the MSD baseline does not exactly train over $P_1$, $P_2$ and $P_\infty$ but rather on a composition of these three attacks, we use these attacks when applying REx to the MSD baseline, as MSD would only generate one domain, which would not allow us to compute a variance over domains. Note that we chose different sets of seen domains, and different baselines (Avg and MSD), in order to show that REx yields benefits on several multi-perturbation baselines, or within the same baseline with different choices of seen domains. We use cross-entropy for all defenses.
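Putting the pieces together, a sketch of a single REx training step over a set of seen domains, reusing the hypothetical `rex_loss` helper sketched earlier; this is an illustration of the defense under our assumptions, not the authors' exact training code:

```python
def rex_training_step(model, optimizer, x, y, domain_attacks, beta):
    """One optimisation step of the REx defense on a batch, over the chosen set of seen domains."""
    # Each callable in domain_attacks maps (x, y) to a perturbed batch against the current model;
    # the unperturbed-data domain corresponds to an identity "attack" that returns x unchanged.
    domain_batches = [attack(x, y) for attack in domain_attacks]
    optimizer.zero_grad()
    loss = rex_loss(model, domain_batches, y, beta)
    loss.backward()
    optimizer.step()
    return loss.item()
```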