
Nash Equilibria and Pitfalls of Adversarial Training
in Adversarial Robustness Games
Maria-Florina Balcan Rattana Pukdee Pradeep Ravikumar Hongyang Zhang
Carnegie Mellon University Carnegie Mellon University Carnegie Mellon University University of Waterloo
Abstract
Adversarial training is a standard technique for training adversarially robust models. In this paper, we study adversarial training as an alternating best-response strategy in a 2-player zero-sum game. We prove that even in a simple scenario of a linear classifier and a statistical model that abstracts robust vs. non-robust features, the alternating best-response strategy of such a game may not converge. On the other hand, a unique pure Nash equilibrium of the game exists and is provably robust. We support our theoretical results with experiments, showing the non-convergence of adversarial training and the robustness of the Nash equilibrium.
1 INTRODUCTION
Deep neural networks have been widely applied to various tasks (LeCun et al., 2015; Goodfellow et al., 2016). However, these models are vulnerable to human-imperceptible perturbations, which may lead to a significant performance drop (Goodfellow et al., 2015; Szegedy et al., 2014). A large body of work has improved the robustness of neural networks against such perturbations. For example, Adversarial Training (AT) (Madry et al., 2018) is a notable technique that trains a robust model by two alternating steps: 1) finding adversarial examples of the training data against the current model; 2) updating the model to correctly classify the adversarial examples and returning to step 1). This procedure has a strong connection with an alternating best-response strategy in a 2-player zero-sum game. In particular, we consider a game between an adversary (row player) and a defender (column player). At each time $t$, the row player outputs a perturbation function that maps each data point to a perturbation, and the column player selects a
model. Given a loss function, the utility of the row player is the expected loss of the model on the perturbed data, and the utility of the column player is the negative expected loss. Therefore, steps 1) and 2) correspond to both players' actions that maximize their utility against the latest action of the opponent.
In this work, we show that for the adversarial robustness game, even in a simple setting, the alternating best-response strategy may not converge. We consider a general symmetric independent distribution, beyond the symmetric Gaussian distribution typically assumed in prior works (Tsipras et al., 2018; Ilyas et al., 2019). We call this game the Symmetric Linear Adversarial Robustness (SLAR) game. The challenge is that SLAR is not a convex-concave game, so known results on convex-concave zero-sum games do not apply in our setting. One of our key contributions is to analyze the dynamics of adversarial training in the SLAR game, which sheds light on the behavior of adversarial training in general. On the other hand, we prove the existence of a pure Nash equilibrium and show that any Nash equilibrium provably leads to a robust classifier, i.e., a classifier that puts zero weight on the non-robust features. The Nash equilibrium is unique in the sense that any two Nash equilibria select the same classifier. This finding motivates training a model that achieves a Nash equilibrium.
For linear models, there is a closed-form solution for the adversarial example of each data point (Bose et al., 2020; Tsipras et al., 2018). Different from the alternating best-response strategy, we also study the procedure of substituting the closed-form adversarial examples into the inner maximization problem, reducing the problem to a standard minimization objective. We refer to this procedure as Optimal Adversarial Training (OAT). Tsipras et al. (2018) have shown that OAT leads to a robust classifier under symmetric Gaussian distributions. We extend their results by showing that the same conclusion also holds for the SLAR game. We support our theoretical results with experiments, demonstrating that standard adversarial training does not converge while a Nash equilibrium is robust.
2 RELATED WORK
2.1 Adversarial Robustness
Variants of adversarial training methods have been proposed to improve the adversarial robustness of neural networks (Zhang et al., 2019; Shafahi et al., 2019; Rice et al., 2020; Wong et al., 2019; Xie et al., 2019; Qin et al., 2019). Recent works utilize extra unlabeled data (Carmon et al., 2019; Zhai et al., 2019; Deng et al., 2021; Rebuffi et al., 2021) or synthetic data from a generative model (Gowal et al., 2021; Sehwag et al., 2021) to improve robust accuracy. Another line of work considers ensembling techniques (Tramèr et al., 2018; Sen et al., 2019; Pang et al., 2019; Zhang et al., 2022). A line of theoretical works analyzes adversarial robustness through linear models, from the trade-off between robustness and accuracy (Tsipras et al., 2018; Javanmard et al., 2020; Raghunathan et al., 2020) to generalization properties (Schmidt et al., 2018). Recent works further analyze more complex classes of models such as 2-layer neural networks (Allen-Zhu and Li, 2022; Bubeck et al., 2021; Bartlett et al., 2021; Bubeck and Sellke, 2021).
Specifically, prior works have considered adversarial robustness as a 2-player zero-sum game (Pal and Vidal, 2020; Meunier et al., 2021; Bose et al., 2020; Bulò et al., 2016; Perdomo and Singer, 2019; Pinot et al., 2020). For instance, Pal and Vidal (2020) proved that randomized smoothing (Cohen et al., 2019) and the FGSM attack (Goodfellow et al., 2015) form a Nash equilibrium. Bose et al. (2020) introduced a framework to find adversarial examples that transfer to an unseen model in the same hypothesis class. Pinot et al. (2020) showed the non-existence of a Nash equilibrium in the adversarial robustness game when the classifier and the adversary are both deterministic. However, our setting differs from these prior papers. The key assumption in previous work (Pinot et al., 2020) is that the adversary is regularized and would not attack if the adversarial example does not change the model prediction. We consider the case where the adversary attacks even when the adversarial example does not change the model prediction (as in standard adversarial training).

While most works focused on the existence of a Nash equilibrium and proposed algorithms that converge to the equilibrium, to the best of our knowledge, no prior work showed that a Nash equilibrium is robust.
2.2 Dynamics in Games
The dynamics of 2-player zero-sum games have been well studied, especially when each player takes an alternating best-response strategy in a finite action space. A classical question is whether players' actions converge to an equilibrium as the two players alternately play the game (Nash Jr, 1950; Nash, 1951). It is known that the alternating best-response strategy converges to a Nash equilibrium for many types of games, such as potential games (Monderer and Shapley, 1996b), weakly acyclic games (Fabrikant et al., 2010), aggregative games (Dindoš and Mezzetti, 2006), supermodular games (Milgrom and Roberts, 1990), and random games (Heinrich et al., 2021; Amiet et al., 2021). However, this general phenomenon may not apply to adversarial robustness games, since this natural learning algorithm may not converge even in simple games (Balcan Maria-Florina, 2012), as these results rely on specific properties of those games. In addition, there are also works on different strategies such as fictitious play (Brown, 1951; Robinson, 1951; Monderer and Shapley, 1996a; Benaïm and Hirsch, 1999) and its extensions to infinite action spaces (Oechssler and Riedel, 2001; Perkins and Leslie, 2014) or continuous time (Hopkins, 1999; Hofbauer and Sorin, 2006). Furthermore, there is a connection between 2-player zero-sum games and online learning, where it is possible to show that the average payoff of a player with a sub-linear regret algorithm (such as follow-the-regularized-leader or follow-the-perturbed-leader) converges to a Nash equilibrium (Cesa-Bianchi and Lugosi, 2006; Syrgkanis et al., 2015; Suggala and Netrapalli, 2020). However, Mertikopoulos et al. (2018) studied the dynamics of such no-regret algorithms and showed that when both players play follow-the-regularized-leader, the actions of each player do not converge to a Nash equilibrium and instead exhibit cycling behavior.
3 SETUP
We consider a binary classification problem where we want to learn a linear function $f: \mathbb{R}^d \to \mathbb{R}$, such that our prediction is given by $\mathrm{sign}(f(x))$. Let $\mathcal{D}$ be the underlying distribution of $(x, y)$ and let $x = [x_1, \dots, x_d]$. We assume that the distribution of each feature $x_i$ has a symmetric mean and is independent of the other features given the label.

Assumption 1 (Symmetric mean). The mean of each feature $x_i$ is symmetric over the classes $y = -1, 1$. That is,
$$\mathbb{E}[x_i \mid y] = y\mu_i,$$
where $\mu_i$ is a constant.

Assumption 2 (Independent features given the label). Each feature $x_i$ is independent of the other features given the label.
We study this problem on a soft-SVM objective
$$\min_{w} L(w), \tag{1}$$
where
$$L(w) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\max(0, 1 - y w^\top x)\big] + \frac{\lambda}{2}\|w\|_2^2,$$
and $\lambda$ is the regularization parameter. Assume that at test time, we have an adversarial perturbation function $\delta: \mathcal{D} \to \mathcal{B}(\varepsilon)$, where $\mathcal{B}(\varepsilon) = \{a : \|a\| \le \varepsilon\}$, that adds a perturbation $\delta(x, y)$ to the features of each point $(x, y)$. Our goal is to learn a function $f$ that makes correct predictions on the perturbed data points. We denote $\varepsilon$ as the perturbation budget.
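To make the objective concrete, below is a minimal finite-sample sketch of Eq. (1); the function name and the sample-based approximation of the expectation are our own illustration and not part of the paper.

```python
import numpy as np

def svm_loss(w, X, y, lam):
    """Empirical soft-SVM objective: mean hinge loss plus (lam / 2) * ||w||_2^2.

    X has shape (n, d); y has entries in {-1, +1}.
    """
    margins = y * (X @ w)                    # y_i * w^T x_i for every sample
    hinge = np.maximum(0.0, 1.0 - margins)   # per-sample hinge loss
    return hinge.mean() + 0.5 * lam * np.dot(w, w)
```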
For a given perturbation budget $\varepsilon$, we divide the features $x_i$ into robust features and non-robust features.

Definition 1 (Non-robust feature). A feature $x_i$ is non-robust when the perturbation budget is larger than or equal to the magnitude of the mean of that feature:
$$|\mu_i| \le \varepsilon.$$
Otherwise, $x_i$ is a robust feature.

We discuss non-robust features in more detail in Appendix B.
3.1 Symmetric Linear Adversarial Robustness Game
We formulate the problem of learning a robust function $f$ as a 2-player zero-sum game between an adversary (row player) and a defender (column player). The game is played repeatedly: at each time $t$, the row player outputs a perturbation function $\delta^{(t)}: \mathcal{D} \to \mathcal{B}(\varepsilon)$ that maps each data point in $\mathcal{D}$ to a perturbation, while the column player outputs a linear function $f^{(t)}(x) = (w^{(t)})^\top x$. The utility of the row player is given by
$$U_{\text{row}}(\delta^{(t)}, w^{(t)}) := \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(\delta^{(t)}, w^{(t)}, x, y)\big] + \frac{\lambda}{2}\|w^{(t)}\|_2^2,$$
where
$$\ell(\delta, w, x, y) = \max\big(0, 1 - y w^\top (x + \delta(x, y))\big).$$
The goal of the row player is to find a perturbation function that maximizes the expected loss of the perturbed data given a model from the column player. The utility of the column player is the negative expected loss:
$$U_{\text{col}}(\delta^{(t)}, w^{(t)}) = -U_{\text{row}}(\delta^{(t)}, w^{(t)}),$$
where the column player wants to output a model that minimizes the expected loss on the perturbed data.
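As a sketch, the two utilities can be estimated on a finite sample as follows; we take $\mathcal{B}(\varepsilon)$ to be the $\ell_\infty$ ball (consistent with the sign-based worst-case perturbation of Section 4), and the function names are hypothetical.

```python
import numpy as np

def u_row(delta_fn, w, X, y, lam, eps):
    """Finite-sample estimate of U_row: average hinge loss on perturbed points
    plus the regularizer. delta_fn(x, y) must return a perturbation inside
    B(eps), taken here as the infinity-norm ball."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        d = delta_fn(x_i, y_i)
        assert np.max(np.abs(d)) <= eps + 1e-12, "perturbation exceeds budget"
        total += max(0.0, 1.0 - y_i * np.dot(w, x_i + d))
    return total / len(y) + 0.5 * lam * np.dot(w, w)

def u_col(delta_fn, w, X, y, lam, eps):
    """The column player's utility is the negative of the row player's."""
    return -u_row(delta_fn, w, X, y, lam, eps)
```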
3.2 Adversarial Training as an Alternating Best-Response Strategy
Recall that in AT (Madry et al., 2018), we first find the "adversarial examples", i.e., the perturbed data points that maximize the loss. We then optimize our model on the given adversarial examples. In the game-theoretic setting, AT is an alternating best-response strategy:

1. The row player submits a perturbation function that maximizes its utility against the model from the last iteration:
$$\delta^{(t)} = \operatorname*{argmax}_{\delta: \mathcal{D}\to\mathcal{B}(\varepsilon)} U_{\text{row}}(\delta, w^{(t-1)}).$$

2. The column player chooses a model that maximizes its utility given the perturbation $\delta^{(t)}$:
$$w^{(t)} = \operatorname*{argmax}_{w\in\mathbb{R}^d} U_{\text{col}}(\delta^{(t)}, w).$$

In practice, we obtain an approximation of $w^{(t)}$ via stochastic gradient descent and an approximation of each instance $\delta^{(t)}(x, y)$ by projected gradient descent (Madry et al., 2018), as in the sketch below.
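The sketch below spells out this approximation for the linear SLAR setting: the row player's best response is approximated by projected gradient ascent on the hinge loss and the column player's by subgradient descent on the regularized objective. Step sizes, iteration counts, and function names are illustrative choices rather than prescriptions from the paper.

```python
import numpy as np

def pgd_perturb(w, x, y, eps, steps=10, lr=0.1):
    """Approximate the row player's best response for one point by projected
    gradient ascent on the hinge loss, projected onto the infinity-norm ball."""
    d = np.zeros_like(x)
    for _ in range(steps):
        if 1.0 - y * np.dot(w, x + d) > 0:   # hinge active: gradient w.r.t. d is -y * w
            d += lr * (-y * w)
        d = np.clip(d, -eps, eps)            # project back onto B(eps)
    return d

def adversarial_training(X, y, eps, lam, rounds=50, inner_steps=200, lr=0.05):
    """Alternating best-response sketch of AT on the linear soft-SVM objective."""
    n, dim = X.shape
    w = np.zeros(dim)
    for _ in range(rounds):
        # Step 1: the row player perturbs every point against the current model.
        X_adv = np.array([x_i + pgd_perturb(w, x_i, y_i, eps)
                          for x_i, y_i in zip(X, y)])
        # Step 2: the column player refits the model on the perturbed data
        # (subgradient steps approximate the best response).
        for _ in range(inner_steps):
            margins = y * (X_adv @ w)
            active = margins < 1.0
            grad = -(y[active, None] * X_adv[active]).sum(axis=0) / n + lam * w
            w -= lr * grad
    return w
```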
4 NON-CONVERGENCE OF ADVERSARIAL TRAINING
In this section, we start with the dynamics of AT in the SLAR game. We then provide an example of a class of data distributions on which AT does not converge. A key property of such distributions is that they have a large fraction of non-robust features.

It is known that there is a closed-form solution for the worst-case adversarial perturbation with respect to a linear model (Bose et al., 2020; Tsipras et al., 2018).
Lemma 1. For a fixed $w$ and any $(x, y) \in \mathcal{D}$, the perturbation $\delta(x, y) = -y\varepsilon\,\mathrm{sign}(w)$ maximizes the inner optimization objective
$$\max_{\delta\in\mathcal{B}(\varepsilon)} \max\big(0, 1 - y w^\top (x + \delta)\big),$$
where
$$\mathrm{sign}(x) = \begin{cases} 1, & \text{if } x > 0;\\ 0, & \text{if } x = 0;\\ -1, & \text{if } x < 0.\end{cases}$$
When $x$ is a vector, $\mathrm{sign}(x)$ is applied to each dimension. We denote this as the worst-case perturbation.
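In code, this best response is a one-liner (a sketch; the helper name is ours):

```python
import numpy as np

def worst_case_perturbation(w, y, eps):
    """Lemma 1 best response: delta(x, y) = -y * eps * sign(w).
    It depends only on the label and the current weights, not on x itself."""
    return -y * eps * np.sign(w)
```

Substituting this perturbation into the hinge loss gives $\max(0, 1 - y w^\top x + \varepsilon\|w\|_1)$, since $y^2 = 1$; this is the reduced objective that the OAT procedure mentioned in Section 1 minimizes directly.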
We note that the worst-case perturbation does not depend on the features $x$, which means any point in the same class receives the same perturbation. Intuitively, the worst-case perturbation shifts the distribution of each class toward the decision boundary. Since there is no other incentive for the adversary to choose another perturbation, we assume that AT always picks the worst-case perturbation $\delta(x, y) = -y\varepsilon\,\mathrm{sign}(w)$. This implies that at time $t$, the row player submits the perturbation function $\delta^{(t)}$ such that
$$\delta^{(t)}(x, y) = -y\varepsilon\,\mathrm{sign}(w^{(t-1)}).$$
However, later in this work, we do not restrict our action space to only the worst-case perturbations when analyzing
Figure 1: The space of two non-robust features, where the red and blue circles are for classes $y = -1$ and $y = 1$, respectively. Dashed lines represent the decision boundary of each model and background colors represent their predictions. The figure on the left is the original distribution. The adversary can always shift the mean of non-robust features across the decision boundary, so a model trained with adversarial examples has to flip the decision boundary at every iteration. For example, the adversary shifts the red and blue circles across the original decision boundary, leading to a decision boundary flip for the model trained on the perturbed data (second figure from the left).
a Nash equilibrium. We can now derive the dynamics of AT. We prove that for a non-robust feature $x_i$, if the column player puts positive (negative) weight on $x_i$ at time $t$, then the model at time $t+1$ will put non-positive (non-negative) weight on $x_i$.

Theorem 1 (Dynamics of AT). Consider applying AT to learn a linear model $f(x) = w^\top x$. Let $w^{(t)} = [w^{(t)}_1, w^{(t)}_2, \dots, w^{(t)}_d]$ be the parameters of the linear function at time $t$. For a non-robust feature $x_i$ and all $t > 0$:

1. If $w^{(t)}_i > 0$, then $w^{(t+1)}_i \le 0$;
2. If $w^{(t)}_i < 0$, then $w^{(t+1)}_i \ge 0$.
Proof. The key intuition is that the mean of a non-robust feature is smaller than the perturbation budget, so the adversary can always shift the mean of that feature across the decision boundary. Therefore, a model trained to fit the adversarial examples has to flip its decision boundary at every iteration (Figure 1). Formally, consider a non-robust feature $x_i$ with $w^{(t)}_i > 0$. The perturbation of feature $x_i$ at time $t+1$ is given by
$$\delta^{(t+1)}_i(x, y) = -y\varepsilon\,\mathrm{sign}(w^{(t)}_i) = -y\varepsilon.$$
The mean of feature $x_i$ over the adversarial examples of class $y = 1$ at time $t+1$ is
$$\mu^{(t+1)}_i = \mathbb{E}\big[x_i + \delta^{(t+1)}_i(x, y) \mid y = 1\big] = \mu_i - \varepsilon \le |\mu_i| - \varepsilon < 0.$$
The final inequality holds because $x_i$ is a non-robust feature. We note that for a linear classifier under the SVM objective, when the mean $\mu^{(t+1)}_i < 0$ we must have $w^{(t+1)}_i \le 0$ (see Lemma 3).
Theorem 1 implies that the difference between consecutive model weights is at least the magnitude of the weight on the non-robust features.

Corollary 1. Consider applying AT to learn a linear model $f(x) = w^\top x$. Let $w^{(t)} = [w^{(t)}_1, w^{(t)}_2, \dots, w^{(t)}_d]$ be the parameters of the linear function at time $t$. We have
$$\|w^{(t+1)} - w^{(t)}\|_2^2 \;\ge\; \sum_{i:\,|\mu_i| \le \varepsilon} (w^{(t)}_i)^2.$$
If a large fraction of the model weight is on the non-robust
features at each time t, then the model will not converge.
We now provide an example of a class of data distributions for which, when we train a model with AT, the model always relies on the non-robust features. We consider the following distribution:
Definition 2 (Data distribution with a large fraction of non-robust features). Let the data distribution be as follows:

1. $y \sim \mathrm{unif}\{-1, +1\}$;
2. $x_1 = \begin{cases} +y, & \text{w.p. } p;\\ -y, & \text{w.p. } 1-p;\end{cases}$
3. $x_j \mid y$ follows a distribution with mean $y\mu_j$ and variance $\sigma_j^2$, where $\varepsilon > \mu_j > 0$, for $j = 2, 3, \dots, d+1$.
Given a label $y$, the first feature $x_1$ takes 2 possible values: $y$ with probability $p$ and $-y$ with probability $1-p$. We note that $x_1$ equals the true label with probability $p$ and is robust to adversarial perturbations. On the other hand, for $j \ge 2$, feature $x_j$ is more flexible: it can follow any distribution, under the condition that it is weakly correlated with the true label in expectation. Each such feature might be less informative than $x_1$, but combining many of them can lead to a highly accurate model. We note that $x_j$ is non-robust since its mean is smaller than the perturbation budget.
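For experiments, one admissible instance of Definition 2 takes the non-robust features to be Gaussian; the theory only constrains their conditional mean and variance. Below is a sampling sketch in which, for simplicity, all non-robust features share the same (hypothetical) parameters $\mu$ and $\sigma$.

```python
import numpy as np

def sample_slar(n, d, p, mu, sigma, rng=None):
    """Draw n samples from the Definition 2 distribution with d non-robust
    features. Gaussian x_j is one concrete choice; any distribution with
    conditional mean y * mu and variance sigma^2 would satisfy Definition 2."""
    rng = np.random.default_rng(rng)
    y = rng.choice([-1, 1], size=n)
    x1 = np.where(rng.random(n) < p, y, -y)              # robust feature x_1
    xj = rng.normal(loc=np.outer(y, np.full(d, mu)),     # non-robust features
                    scale=sigma, size=(n, d))
    X = np.column_stack([x1, xj])
    return X, y
```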
This data distribution is an instance of our setup in Section 3 in which the number of non-robust features is significantly larger than the number of robust features. It is inspired by the distribution studied in Tsipras et al. (2018), who showed that standard training on this type of distribution (when features $j = 2, \dots, d+1$ are Gaussian) leads to a model that relies on non-robust features. We generalize their result to the scenario where $x_j$ can follow any distribution (see Appendix C). We note that the lack of distributional assumptions on each feature is a key technical challenge in our analysis.
Theorem 2 (Adversarial training uses non-robust features (simplified version)). Let the data follow the distribution in Definition 2. Consider applying AT to learn a linear model $f(x) = w^\top x$. Assume that $\varepsilon > 2\mu_j$ for $j = 2, \dots, d+1$. Let $w^{(t)} = [w^{(t)}_1, w^{(t)}_2, \dots, w^{(t)}_{d+1}]$ be the parameters of the linear function at time $t$. If
$$p < 1 - \left(\frac{1}{2}\left(\frac{\sigma_{\max}}{\|\mu_0\|_2} + \frac{\lambda}{2\|\mu_0\|_2^2}\right) + \frac{1}{2}\sqrt{\frac{2}{\lambda}}\,\sigma_{\max}\right), \tag{2}$$
where
$$\sigma_i \le \sigma_{\max}, \qquad \mu_0 = [0, \mu_2, \dots, \mu_{d+1}],$$
then
$$\sum_{j=2}^{d+1} (w^{(t)}_j)^2 \;\ge\; \|w^{(t)}\|_2^2 \cdot \frac{(1-\varepsilon)^2}{(1-\varepsilon)^2 + \sum_{j=2}^{d+1}(\mu_j+\varepsilon)^2}.$$
(For simplicity, we also assume that $\sigma_{\max} \le 1$. For a tighter bound, see Appendix D.)
Since features $j = 2, \dots, d+1$ are not robust, Theorem 2 implies that the model at time $t$ puts at least a
$$\frac{(1-\varepsilon)^2}{(1-\varepsilon)^2 + \sum_{j=2}^{d+1}(\mu_j+\varepsilon)^2}$$
fraction of its weight on non-robust features. We note that the term $\|\mu_0\|_2$ in the denominator of condition (2) grows with the number of non-robust features $d$. Therefore, if $d$ and the regularization parameter $\lambda$ are large enough, then condition (2) holds. We discuss the full version of this theorem in Appendix D. Next, we prove that the magnitude $\|w^{(t)}\|_2$ is bounded below by a constant (see Lemma 6), which implies that $\|w^{(t)}\|_2$ does not converge to zero. Therefore, we conclude that a model trained with AT puts a non-trivial amount of weight on non-robust features at each iteration, which implies that AT does not converge.
Theorem 3 (AT does not converge (simplified)). Let the data follow the distribution in Definition 2. Assume that the variances $\sigma_j^2$ are bounded above and $\varepsilon > 2\mu_j$ for $j = 2, \dots, d+1$. Consider applying AT to learn a linear model $f(x) = w^\top x$ on the SVM objective. Let $w^{(t)}$ be the parameters of the linear function at time $t$. If the number of non-robust features $d$ and the regularization parameter $\lambda$ are large enough, then $w^{(t)}$ does not converge as $t \to \infty$.
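The following small simulation is in the spirit of Theorems 1-3; it is illustrative only: it reuses the `sample_slar` and worst-case-perturbation sketches above, approximates the column player's best response by subgradient descent, and the particular constants have not been checked against condition (2).

```python
import numpy as np

def fit_svm(X, y, lam, steps=2000, lr=0.05):
    """Approximate the column player's best response by subgradient descent
    on the regularized hinge loss."""
    n, dim = X.shape
    w = np.zeros(dim)
    for _ in range(steps):
        margins = y * (X @ w)
        active = margins < 1.0
        grad = -(y[active, None] * X[active]).sum(axis=0) / n + lam * w
        w -= lr * grad
    return w

# d = 50 non-robust features; eps > 2 * mu_j; a moderately large lambda.
X, y = sample_slar(n=5000, d=50, p=0.8, mu=0.05, sigma=0.1, rng=0)
eps, lam = 0.2, 0.1
w = fit_svm(X, y, lam)
for t in range(10):
    # Row player's exact best response (Lemma 1), applied to every point.
    X_adv = X - (y[:, None] * eps) * np.sign(w)
    w_new = fit_svm(X_adv, y, lam)
    # Theorem 1 predicts the signs of the non-robust weights flip each round,
    # so the gap between consecutive iterates should not vanish.
    print(t, np.linalg.norm(w_new - w), np.sign(w_new[1:4]))
    w = w_new
```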
5 PROPERTIES OF THE NASH EQUILIBRIUM
Recall the definition of a Nash equilibrium:

Definition 3 (Nash Equilibrium). For a SLAR game, a pair of actions $(\delta^*, w^*)$ is called a pure-strategy Nash equilibrium if the following holds:
$$\sup_{\delta} U_{\text{row}}(\delta, w^*) \;\le\; U_{\text{row}}(\delta^*, w^*) \;\le\; \inf_{w} U_{\text{row}}(\delta^*, w).$$

From this definition, we note that for any fixed $w^*$, the inequality
$$\sup_{\delta} U_{\text{row}}(\delta, w^*) \le U_{\text{row}}(\delta^*, w^*)$$
holds if and only if $\delta^*$ is optimal. We know that the worst-case perturbation $\delta^*(x, y) = -y\varepsilon\,\mathrm{sign}(w^*)$ satisfies this condition. However, the optimal perturbation might not be unique because of the $\max(0,\cdot)$ operator in the hinge loss. For instance, consider a point that is so far away from the decision boundary that the worst-case perturbation leads to zero loss:
$$1 - y (w^*)^\top (x + \delta^*(x, y)) \le 0.$$
Since the worst-case perturbation maximizes the loss, any perturbation $\delta(x, y) \in \mathcal{B}(\varepsilon)$ also leads to zero loss:
$$1 - y (w^*)^\top (x + \delta(x, y)) \le 1 - y (w^*)^\top (x + \delta^*(x, y)) \le 0.$$
Therefore, any perturbation $\delta$ leads to the same utility for such a point (Figure 2). On the other hand, if the worst-case perturbation leads to a positive loss, we can show that the optimal perturbation must be the worst-case perturbation.
Figure 2: For points that are far enough from the decision
boundary, any perturbation is optimal.