
Nash Equilibria and Pitfalls of Adversarial Training
in Adversarial Robustness Games
Maria-Florina Balcan Rattana Pukdee Pradeep Ravikumar Hongyang Zhang
Carnegie Mellon University Carnegie Mellon University Carnegie Mellon University University of Waterloo
Abstract
Adversarial training is a standard technique for training adversarially robust models. In this paper, we study adversarial training as an alternating best-response strategy in a 2-player zero-sum game. We prove that even in a simple scenario of a linear classifier and a statistical model that abstracts robust vs. non-robust features, the alternating best-response strategy of such a game may not converge. On the other hand, a unique pure Nash equilibrium of the game exists and is provably robust. We support our theoretical results with experiments, showing the non-convergence of adversarial training and the robustness of the Nash equilibrium.
1 INTRODUCTION
Deep neural networks have been widely applied to various tasks (LeCun et al., 2015; Goodfellow et al., 2016). However, these models are vulnerable to human-imperceptible perturbations, which may lead to a significant performance drop (Goodfellow et al., 2015; Szegedy et al., 2014). A large body of work has improved the robustness of neural networks against such perturbations. For example, Adversarial Training (AT) (Madry et al., 2018) is a notable technique that trains a robust model by two alternating steps: 1) finding adversarial examples of the training data against the current model; 2) updating the model to correctly classify the adversarial examples and returning to step 1). This procedure has a strong connection with an alternating best-response strategy in a 2-player zero-sum game. In particular, we consider a game between an adversary (row player) and a defender (column player). At each time $t$, the row player outputs a perturbation function that maps each data point to a perturbation, and the column player selects a
model. Given a loss function, the utility of the row player is the expected loss of the model on the perturbed data, and the utility of the column player is the negative expected loss. Therefore, steps 1) and 2) correspond to both players' actions that maximize their utility against the latest action of the opponent.
In this work, we show that for the adversarial robustness game, even in a simple setting, the alternating best-response strategy may not converge. We consider a general symmetric independent distribution, beyond the symmetric Gaussian distribution typically assumed in prior works (Tsipras et al., 2018; Ilyas et al., 2019). We call this game the Symmetric Linear Adversarial Robustness (SLAR) game. The challenge is that SLAR is not a convex-concave game, so known results on convex-concave zero-sum games do not apply in our setting. One of our key contributions is to analyze the dynamics of adversarial training in the SLAR game, which sheds light on the behavior of adversarial training in general. On the other hand, we prove the existence of a pure Nash equilibrium and show that any Nash equilibrium provably leads to a robust classifier, i.e., a classifier that puts zero weight on the non-robust features. The Nash equilibrium is unique in the sense that any two Nash equilibria select the same classifier. This finding motivates training a model that achieves a Nash equilibrium.
For linear models, there is a closed-form solution for the adversarial example of each data point (Bose et al., 2020; Tsipras et al., 2018). Different from the alternating best-response strategy, we also study the procedure of substituting the closed-form adversarial examples into the inner maximization problem, reducing the problem to a standard minimization objective. We refer to this procedure as Optimal Adversarial Training (OAT). Tsipras et al. (2018) have shown that OAT leads to a robust classifier under symmetric Gaussian distributions. We extend their results by showing that the same conclusion also holds for the SLAR game. We support our theoretical results with experiments, demonstrating that standard adversarial training does not converge while a Nash equilibrium is robust.
2 RELATED WORK
2.1 Adversarial Robustness
Variants of adversarial training methods have been proposed to improve the adversarial robustness of neural networks (Zhang et al., 2019; Shafahi et al., 2019; Rice et al., 2020; Wong et al., 2019; Xie et al., 2019; Qin et al., 2019). Recent works utilize extra unlabeled data (Carmon et al., 2019; Zhai et al., 2019; Deng et al., 2021; Rebuffi et al., 2021) or synthetic data from a generative model (Gowal et al., 2021; Sehwag et al., 2021) to improve robust accuracy. Another line of work considers ensembling techniques (Tramèr et al., 2018; Sen et al., 2019; Pang et al., 2019; Zhang et al., 2022). A line of theoretical works analyzes adversarial robustness through linear models, from the trade-off between robustness and accuracy (Tsipras et al., 2018; Javanmard et al., 2020; Raghunathan et al., 2020) to generalization properties (Schmidt et al., 2018). Recent works further analyze more complex classes of models such as 2-layer neural networks (Allen-Zhu and Li, 2022; Bubeck et al., 2021; Bartlett et al., 2021; Bubeck and Sellke, 2021).
Specifically, prior works have considered adversarial robustness as a 2-player zero-sum game (Pal and Vidal, 2020; Meunier et al., 2021; Bose et al., 2020; Bulò et al., 2016; Perdomo and Singer, 2019; Pinot et al., 2020). For instance, Pal and Vidal (2020) proved that randomized smoothing (Cohen et al., 2019) and the FGSM attack (Goodfellow et al., 2015) form a Nash equilibrium. Bose et al. (2020) introduced a framework to find adversarial examples that transfer to an unseen model in the same hypothesis class. Pinot et al. (2020) showed the non-existence of a Nash equilibrium in the adversarial robustness game when the classifier and the adversary are both deterministic. However, our setting differs from these prior papers. The key assumption in previous work (Pinot et al., 2020) is that the adversary is regularized and would not attack if the adversarial example does not change the model prediction. We consider the case where the adversary attacks even when the adversarial example does not change the model prediction (as in standard adversarial training).

While most works focused on the existence of a Nash equilibrium and proposed algorithms that converge to the equilibrium, to the best of our knowledge, no prior work showed that a Nash equilibrium is robust.
2.2 Dynamics in Games
The dynamics of 2-player zero-sum games have been well studied, especially when each player takes an alternating best-response strategy in a finite action space. A classical question is whether players' actions converge to an equilibrium as the two players alternately play the game (Nash Jr, 1950; Nash, 1951). It is known that the alternating best-response strategy converges to a Nash equilibrium for many types of games, such as potential games (Monderer and Shapley, 1996b), weakly acyclic games (Fabrikant et al., 2010), aggregative games (Dindoš and Mezzetti, 2006), supermodular games (Milgrom and Roberts, 1990), and random games (Heinrich et al., 2021; Amiet et al., 2021). However, this general phenomenon may not apply to adversarial robustness games, since this natural learning algorithm may not converge even in simple games (Balcan Maria-Florina, 2012), as these results rely on specific properties of those games. In addition, there are also works on different strategies such as fictitious play (Brown, 1951; Robinson, 1951; Monderer and Shapley, 1996a; Benaïm and Hirsch, 1999) and its extensions to infinite action spaces (Oechssler and Riedel, 2001; Perkins and Leslie, 2014) or continuous time (Hopkins, 1999; Hofbauer and Sorin, 2006). Furthermore, there is a connection between 2-player zero-sum games and online learning, where it is possible to show that the average payoff of a player with a sub-linear regret algorithm (such as follow-the-regularized-leader or follow-the-perturbed-leader) converges to a Nash equilibrium (Cesa-Bianchi and Lugosi, 2006; Syrgkanis et al., 2015; Suggala and Netrapalli, 2020). However, Mertikopoulos et al. (2018) studied the dynamics of such no-regret algorithms and showed that when both players play follow-the-regularized-leader, the actions of each player do not converge to a Nash equilibrium and instead exhibit cycling behavior.
3 SETUP
We consider a binary classification problem where we want to learn a linear function $f: \mathbb{R}^d \to \mathbb{R}$, such that our prediction is given by $\mathrm{sign}(f(x))$. Let $\mathcal{D}$ be the underlying distribution of $(x, y)$ and let $x = [x_1, \dots, x_d]$. We assume that the distribution of each feature $x_i$ has a symmetric mean and is independent of the other features given the label.

Assumption 1 (Symmetric mean). The mean of each feature $x_i$ is symmetric over the classes $y = -1, 1$. That is,
$$\mathbb{E}[x_i \mid y] = y\mu_i,$$
where $\mu_i$ is a constant.

Assumption 2 (Independent features given the label). Each feature $x_i$ is independent of the other features given the label.
We study this problem on a soft-SVM objective
$$\min_{w} L(w), \tag{1}$$
where
$$L(w) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\max(0, 1 - y w^\top x)\big] + \frac{\lambda}{2}\|w\|_2^2,$$
and $\lambda$ is the regularization parameter. Assume that at test time, we have an adversarial perturbation function $\delta: \mathcal{D} \to \mathcal{B}(\varepsilon)$, where $\mathcal{B}(\varepsilon) = \{a : \|a\| \le \varepsilon\}$, that adds a perturbation $\delta(x, y)$ to the features of each point $(x, y)$. Our goal is to learn a function $f$ that makes correct predictions on the perturbed data points. We denote $\varepsilon$ as the perturbation budget.
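To make the objective concrete, below is a minimal finite-sample sketch of Eq. (1); the function name and the sample-based approximation of the expectation are our own illustration and not part of the paper.

```python
import numpy as np

def svm_loss(w, X, y, lam):
    """Empirical soft-SVM objective: mean hinge loss plus (lam / 2) * ||w||_2^2.

    X has shape (n, d); y has entries in {-1, +1}.
    """
    margins = y * (X @ w)                    # y_i * w^T x_i for every sample
    hinge = np.maximum(0.0, 1.0 - margins)   # per-sample hinge loss
    return hinge.mean() + 0.5 * lam * np.dot(w, w)
```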
For a given perturbation budget $\varepsilon$, we divide the features $x_i$ into robust features and non-robust features.

Definition 1 (Non-robust feature). A feature $x_i$ is non-robust when the perturbation budget is larger than or equal to the magnitude of the mean of that feature:
$$|\mu_i| \le \varepsilon.$$
Otherwise, $x_i$ is a robust feature.

We discuss non-robust features in more detail in Appendix B.
3.1 Symmetric Linear Adversarial Robustness Game
We formulate the problem of learning a robust function $f$ as a 2-player zero-sum game between an adversary (row player) and a defender (column player). The game is played repeatedly: at each time $t$, the row player outputs a perturbation function $\delta^{(t)}: \mathcal{D} \to \mathcal{B}(\varepsilon)$ that maps each data point in $\mathcal{D}$ to a perturbation, while the column player outputs a linear function $f^{(t)}(x) = (w^{(t)})^\top x$. The utility of the row player is given by
$$U_{\text{row}}(\delta^{(t)}, w^{(t)}) := \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(\delta^{(t)}, w^{(t)}, x, y)\big] + \frac{\lambda}{2}\|w^{(t)}\|_2^2,$$
where
$$\ell(\delta, w, x, y) = \max\big(0, 1 - y w^\top (x + \delta(x, y))\big).$$
The goal of the row player is to find a perturbation function that maximizes the expected loss of the perturbed data given a model from the column player. The utility of the column player is the negative expected loss:
$$U_{\text{col}}(\delta^{(t)}, w^{(t)}) = -U_{\text{row}}(\delta^{(t)}, w^{(t)}),$$
where the column player wants to output a model that minimizes the expected loss on the perturbed data.
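As a sketch, the two utilities can be estimated on a finite sample as follows; we take $\mathcal{B}(\varepsilon)$ to be the $\ell_\infty$ ball (consistent with the sign-based worst-case perturbation of Section 4), and the function names are hypothetical.

```python
import numpy as np

def u_row(delta_fn, w, X, y, lam, eps):
    """Finite-sample estimate of U_row: average hinge loss on perturbed points
    plus the regularizer. delta_fn(x, y) must return a perturbation inside
    B(eps), taken here as the infinity-norm ball."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        d = delta_fn(x_i, y_i)
        assert np.max(np.abs(d)) <= eps + 1e-12, "perturbation exceeds budget"
        total += max(0.0, 1.0 - y_i * np.dot(w, x_i + d))
    return total / len(y) + 0.5 * lam * np.dot(w, w)

def u_col(delta_fn, w, X, y, lam, eps):
    """The column player's utility is the negative of the row player's."""
    return -u_row(delta_fn, w, X, y, lam, eps)
```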
3.2 Adversarial Training as an Alternating Best-Response Strategy
Recall that in AT (Madry et al., 2018), we first find the "adversarial examples", i.e., the perturbed data points that maximize the loss. We then optimize our model on the given adversarial examples. In the game-theoretic setting, AT is an alternating best-response strategy:

1. The row player submits a perturbation function that maximizes its utility against the model from the last iteration:
$$\delta^{(t)} = \operatorname*{argmax}_{\delta: \mathcal{D}\to\mathcal{B}(\varepsilon)} U_{\text{row}}(\delta, w^{(t-1)}).$$

2. The column player chooses a model that maximizes its utility given the perturbation $\delta^{(t)}$:
$$w^{(t)} = \operatorname*{argmax}_{w\in\mathbb{R}^d} U_{\text{col}}(\delta^{(t)}, w).$$

In practice, we obtain an approximation of $w^{(t)}$ via stochastic gradient descent and an approximation of each instance $\delta^{(t)}(x, y)$ by projected gradient descent (Madry et al., 2018), as in the sketch below.
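The sketch below spells out this approximation for the linear SLAR setting: the row player's best response is approximated by projected gradient ascent on the hinge loss and the column player's by subgradient descent on the regularized objective. Step sizes, iteration counts, and function names are illustrative choices rather than prescriptions from the paper.

```python
import numpy as np

def pgd_perturb(w, x, y, eps, steps=10, lr=0.1):
    """Approximate the row player's best response for one point by projected
    gradient ascent on the hinge loss, projected onto the infinity-norm ball."""
    d = np.zeros_like(x)
    for _ in range(steps):
        if 1.0 - y * np.dot(w, x + d) > 0:   # hinge active: gradient w.r.t. d is -y * w
            d += lr * (-y * w)
        d = np.clip(d, -eps, eps)            # project back onto B(eps)
    return d

def adversarial_training(X, y, eps, lam, rounds=50, inner_steps=200, lr=0.05):
    """Alternating best-response sketch of AT on the linear soft-SVM objective."""
    n, dim = X.shape
    w = np.zeros(dim)
    for _ in range(rounds):
        # Step 1: the row player perturbs every point against the current model.
        X_adv = np.array([x_i + pgd_perturb(w, x_i, y_i, eps)
                          for x_i, y_i in zip(X, y)])
        # Step 2: the column player refits the model on the perturbed data
        # (subgradient steps approximate the best response).
        for _ in range(inner_steps):
            margins = y * (X_adv @ w)
            active = margins < 1.0
            grad = -(y[active, None] * X_adv[active]).sum(axis=0) / n + lam * w
            w -= lr * grad
    return w
```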
4 NON-CONVERGENCE OF ADVERSARIAL TRAINING
In this section, we start with the dynamics of AT in the SLAR game. We then provide an example of a class of data distributions on which AT does not converge. A key property of such distributions is that they have a large fraction of non-robust features.

It is known that there is a closed-form solution for the worst-case adversarial perturbation with respect to a linear model (Bose et al., 2020; Tsipras et al., 2018).
Lemma 1. For a fixed $w$ and any $(x, y) \in \mathcal{D}$, the perturbation $\delta(x, y) = -y\varepsilon\,\mathrm{sign}(w)$ maximizes the inner optimization objective
$$\max_{\delta\in\mathcal{B}(\varepsilon)} \max\big(0, 1 - y w^\top (x + \delta)\big),$$
where
$$\mathrm{sign}(x) = \begin{cases} 1, & \text{if } x > 0;\\ 0, & \text{if } x = 0;\\ -1, & \text{if } x < 0.\end{cases}$$
When $x$ is a vector, $\mathrm{sign}(x)$ is applied to each dimension. We denote this as the worst-case perturbation.
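In code, this best response is a one-liner (a sketch; the helper name is ours):

```python
import numpy as np

def worst_case_perturbation(w, y, eps):
    """Lemma 1 best response: delta(x, y) = -y * eps * sign(w).
    It depends only on the label and the current weights, not on x itself."""
    return -y * eps * np.sign(w)
```

Substituting this perturbation into the hinge loss gives $\max(0, 1 - y w^\top x + \varepsilon\|w\|_1)$, since $y^2 = 1$; this is the reduced objective that the OAT procedure mentioned in Section 1 minimizes directly.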
We note that the worst-case perturbation does not depend on the features $x$, which means any point in the same class receives the same perturbation. Intuitively, the worst-case perturbation shifts the distribution of each class toward the decision boundary. Since there is no other incentive for the adversary to choose another perturbation, we assume that AT always picks the worst-case perturbation $\delta(x, y) = -y\varepsilon\,\mathrm{sign}(w)$. This implies that at time $t$, the row player submits the perturbation function $\delta^{(t)}$ such that
$$\delta^{(t)}(x, y) = -y\varepsilon\,\mathrm{sign}(w^{(t-1)}).$$
However, later in this work, we do not restrict our action space to only the worst-case perturbations when analyzing
Figure 1: The space of two non-robust features, where the red and blue circles are for classes $y = -1$ and $y = 1$, respectively. Dashed lines represent the decision boundary of each model and background colors represent their predictions. The figure on the left is the original distribution. The adversary can always shift the mean of non-robust features across the decision boundary, so a model trained with adversarial examples has to flip the decision boundary at every iteration. For example, the adversary shifts the red and blue circles across the original decision boundary, leading to a decision boundary flip for the model trained on the perturbed data (second figure from the left).
a Nash equilibrium. We can now derive the dynamics of AT. We prove that for a non-robust feature $x_i$, if the column player puts positive (negative) weight on $x_i$ at time $t$, then the model at time $t+1$ will put non-positive (non-negative) weight on $x_i$.

Theorem 1 (Dynamics of AT). Consider applying AT to learn a linear model $f(x) = w^\top x$. Let $w^{(t)} = [w^{(t)}_1, w^{(t)}_2, \dots, w^{(t)}_d]$ be the parameters of the linear function at time $t$. For a non-robust feature $x_i$ and all $t > 0$:

1. If $w^{(t)}_i > 0$, then $w^{(t+1)}_i \le 0$;
2. If $w^{(t)}_i < 0$, then $w^{(t+1)}_i \ge 0$.
Proof. The key intuition is that the mean of a non-robust feature is smaller than the perturbation budget, so the adversary can always shift the mean of that feature across the decision boundary. Therefore, a model trained to fit the adversarial examples has to flip its decision boundary at every iteration (Figure 1). Formally, consider a non-robust feature $x_i$ with $w^{(t)}_i > 0$. The perturbation of feature $x_i$ at time $t+1$ is given by
$$\delta^{(t+1)}_i(x, y) = -y\varepsilon\,\mathrm{sign}(w^{(t)}_i) = -y\varepsilon.$$
The mean of feature $x_i$ over the adversarial examples of class $y = 1$ at time $t+1$ is
$$\mu^{(t+1)}_i = \mathbb{E}\big[x_i + \delta^{(t+1)}_i(x, y) \mid y = 1\big] = \mu_i - \varepsilon \le |\mu_i| - \varepsilon < 0.$$
The final inequality holds because $x_i$ is a non-robust feature. We note that for a linear classifier under the SVM objective, when the mean $\mu^{(t+1)}_i < 0$ we must have $w^{(t+1)}_i \le 0$ (see Lemma 3).
Theorem 1 implies that the difference between consecutive model weights is at least the magnitude of the weight on the non-robust features.

Corollary 1. Consider applying AT to learn a linear model $f(x) = w^\top x$. Let $w^{(t)} = [w^{(t)}_1, w^{(t)}_2, \dots, w^{(t)}_d]$ be the parameters of the linear function at time $t$. We have
$$\|w^{(t+1)} - w^{(t)}\|_2^2 \;\ge\; \sum_{i:\,|\mu_i| \le \varepsilon} (w^{(t)}_i)^2.$$
If a large fraction of the model weight is on the non-robust
features at each time t, then the model will not converge.
We now provide an example of a class of data distributions for which, when we train a model with AT, the model always relies on the non-robust features. We consider the following distribution:
Definition 2 (Data distribution with a large fraction of non-robust features). Let the data distribution be as follows:

1. $y \sim \mathrm{unif}\{-1, +1\}$;
2. $x_1 = \begin{cases} +y, & \text{w.p. } p;\\ -y, & \text{w.p. } 1-p;\end{cases}$
3. $x_j \mid y$ follows a distribution with mean $y\mu_j$ and variance $\sigma_j^2$, where $\varepsilon > \mu_j > 0$, for $j = 2, 3, \dots, d+1$.
Given a label $y$, the first feature $x_1$ takes 2 possible values: $y$ with probability $p$ and $-y$ with probability $1-p$. We note that $x_1$ equals the true label with probability $p$ and is robust to adversarial perturbations. On the other hand, for $j \ge 2$, feature $x_j$ is more flexible: it can follow any distribution, under the condition that it is weakly correlated with the true label in expectation. Each such feature might be less informative than $x_1$, but combining many of them can lead to a highly accurate model. We note that $x_j$ is non-robust since its mean is smaller than the perturbation budget.
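For experiments, one admissible instance of Definition 2 takes the non-robust features to be Gaussian; the theory only constrains their conditional mean and variance. Below is a sampling sketch in which, for simplicity, all non-robust features share the same (hypothetical) parameters $\mu$ and $\sigma$.

```python
import numpy as np

def sample_slar(n, d, p, mu, sigma, rng=None):
    """Draw n samples from the Definition 2 distribution with d non-robust
    features. Gaussian x_j is one concrete choice; any distribution with
    conditional mean y * mu and variance sigma^2 would satisfy Definition 2."""
    rng = np.random.default_rng(rng)
    y = rng.choice([-1, 1], size=n)
    x1 = np.where(rng.random(n) < p, y, -y)              # robust feature x_1
    xj = rng.normal(loc=np.outer(y, np.full(d, mu)),     # non-robust features
                    scale=sigma, size=(n, d))
    X = np.column_stack([x1, xj])
    return X, y
```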
This data distribution is an instance of our setup in Section 3 in which the number of non-robust features is significantly larger than the number of robust features. It is inspired by the distribution studied in Tsipras et al. (2018), who showed that standard training on this type of distribution (when features $j = 2, \dots, d+1$ are Gaussian) leads to a model that relies on non-robust features. We generalize their result to the scenario where $x_j$ can follow any distribution (see Appendix C). We note that the lack of distributional assumptions on each feature is a key technical challenge in our analysis.
Theorem 2 (Adversarial training uses non-robust features (simplified version)). Let the data follow the distribution in Definition 2. Consider applying AT to learn a linear model $f(x) = w^\top x$. Assume that $\varepsilon > 2\mu_j$ for $j = 2, \dots, d+1$. Let $w^{(t)} = [w^{(t)}_1, w^{(t)}_2, \dots, w^{(t)}_{d+1}]$ be the parameters of the linear function at time $t$. If
$$p < 1 - \left(\frac{1}{2}\left(\frac{\sigma_{\max}}{\|\mu_0\|_2} + \frac{\lambda}{2\|\mu_0\|_2^2}\right) + \frac{1}{2}\sqrt{\frac{2}{\lambda}}\,\sigma_{\max}\right), \tag{2}$$
where
$$\sigma_i \le \sigma_{\max}, \qquad \mu_0 = [0, \mu_2, \dots, \mu_{d+1}],$$
then
$$\sum_{j=2}^{d+1} (w^{(t)}_j)^2 \;\ge\; \|w^{(t)}\|_2^2 \cdot \frac{(1-\varepsilon)^2}{(1-\varepsilon)^2 + \sum_{j=2}^{d+1}(\mu_j+\varepsilon)^2}.$$
(For simplicity, we also assume that $\sigma_{\max} \le 1$. For a tighter bound, see Appendix D.)
Since features $j = 2, \dots, d+1$ are not robust, Theorem 2 implies that the model at time $t$ puts at least a
$$\frac{(1-\varepsilon)^2}{(1-\varepsilon)^2 + \sum_{j=2}^{d+1}(\mu_j+\varepsilon)^2}$$
fraction of its weight on non-robust features. We note that the term $\|\mu_0\|_2$ in the denominator of condition (2) grows with the number of non-robust features $d$. Therefore, if $d$ and the regularization parameter $\lambda$ are large enough, then condition (2) holds. We discuss the full version of this theorem in Appendix D. Next, we prove that the magnitude $\|w^{(t)}\|_2$ is bounded below by a constant (see Lemma 6), which implies that $\|w^{(t)}\|_2$ does not converge to zero. Therefore, we conclude that a model trained with AT puts a non-trivial amount of weight on non-robust features at each iteration, which implies that AT does not converge.
Theorem 3 (AT does not converge (simplified)). Let the data follow the distribution in Definition 2. Assume that the variances $\sigma_j^2$ are bounded above and $\varepsilon > 2\mu_j$ for $j = 2, \dots, d+1$. Consider applying AT to learn a linear model $f(x) = w^\top x$ on the SVM objective. Let $w^{(t)}$ be the parameters of the linear function at time $t$. If the number of non-robust features $d$ and the regularization parameter $\lambda$ are large enough, then $w^{(t)}$ does not converge as $t \to \infty$.
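The following small simulation is in the spirit of Theorems 1-3; it is illustrative only: it reuses the `sample_slar` and worst-case-perturbation sketches above, approximates the column player's best response by subgradient descent, and the particular constants have not been checked against condition (2).

```python
import numpy as np

def fit_svm(X, y, lam, steps=2000, lr=0.05):
    """Approximate the column player's best response by subgradient descent
    on the regularized hinge loss."""
    n, dim = X.shape
    w = np.zeros(dim)
    for _ in range(steps):
        margins = y * (X @ w)
        active = margins < 1.0
        grad = -(y[active, None] * X[active]).sum(axis=0) / n + lam * w
        w -= lr * grad
    return w

# d = 50 non-robust features; eps > 2 * mu_j; a moderately large lambda.
X, y = sample_slar(n=5000, d=50, p=0.8, mu=0.05, sigma=0.1, rng=0)
eps, lam = 0.2, 0.1
w = fit_svm(X, y, lam)
for t in range(10):
    # Row player's exact best response (Lemma 1), applied to every point.
    X_adv = X - (y[:, None] * eps) * np.sign(w)
    w_new = fit_svm(X_adv, y, lam)
    # Theorem 1 predicts the signs of the non-robust weights flip each round,
    # so the gap between consecutive iterates should not vanish.
    print(t, np.linalg.norm(w_new - w), np.sign(w_new[1:4]))
    w = w_new
```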
5 PROPERTIES OF THE NASH EQUILIBRIUM
Recall the definition of a Nash equilibrium:

Definition 3 (Nash Equilibrium). For a SLAR game, a pair of actions $(\delta^*, w^*)$ is called a pure-strategy Nash equilibrium if the following holds:
$$\sup_{\delta} U_{\text{row}}(\delta, w^*) \;\le\; U_{\text{row}}(\delta^*, w^*) \;\le\; \inf_{w} U_{\text{row}}(\delta^*, w).$$

From this definition, we note that for any fixed $w^*$, the inequality
$$\sup_{\delta} U_{\text{row}}(\delta, w^*) \le U_{\text{row}}(\delta^*, w^*)$$
holds if and only if $\delta^*$ is optimal. We know that the worst-case perturbation $\delta^*(x, y) = -y\varepsilon\,\mathrm{sign}(w^*)$ satisfies this condition. However, the optimal perturbation might not be unique because of the $\max(0,\cdot)$ operator in the hinge loss. For instance, consider a point that is so far away from the decision boundary that the worst-case perturbation leads to zero loss:
$$1 - y (w^*)^\top (x + \delta^*(x, y)) \le 0.$$
Since the worst-case perturbation maximizes the loss, any perturbation $\delta(x, y) \in \mathcal{B}(\varepsilon)$ also leads to zero loss:
$$1 - y (w^*)^\top (x + \delta(x, y)) \le 1 - y (w^*)^\top (x + \delta^*(x, y)) \le 0.$$
Therefore, any perturbation $\delta$ leads to the same utility for such a point (Figure 2). On the other hand, if the worst-case perturbation leads to a positive loss, we can show that the optimal perturbation must be the worst-case perturbation.
Figure 2: For points that are far enough from the decision
boundary, any perturbation is optimal.