Efficient Adversarial Training without Attacking:
Worst-Case-Aware Robust Reinforcement Learning
Yongyuan Liang†   Yanchao Sun‡   Ruijie Zheng‡   Furong Huang‡
†Shanghai AI Lab   ‡University of Maryland, College Park
cheryllLiang@outlook.com   {ycs,rzheng12,furongh}@umd.edu
Abstract
Recent studies reveal that a well-trained deep reinforcement learning (RL) policy
can be particularly vulnerable to adversarial perturbations on input observations.
Therefore, it is crucial to train RL agents that are robust against any attacks
with a bounded budget. Existing robust training methods in deep RL either treat
correlated steps separately, ignoring the robustness of long-term rewards, or train
the agents and RL-based attacker together, doubling the computational burden and
sample complexity of the training process. In this work, we propose a strong and
efficient robust training framework for RL, named Worst-case-aware Robust RL
(WocaR-RL), that directly estimates and optimizes the worst-case reward of a policy under bounded $\ell_p$ attacks without requiring extra samples for learning an attacker.
Experiments on multiple environments show that WocaR-RL achieves state-of-
the-art performance under various strong attacks, and obtains significantly higher
training efficiency than prior state-of-the-art robust training methods. The code of
this work is available at https://github.com/umd-huang-lab/WocaR-RL.
1 Introduction
Deep reinforcement learning (DRL) has achieved impressive results by using deep neural networks
(DNN) to learn complex policies in large-scale tasks. However, well-trained DNNs may drastically
fail under adversarial perturbations of the input [1, 6]. Therefore, before deploying DRL policies to real-life applications, it is crucial to improve the robustness of deep policies against adversarial attacks, especially worst-case attacks that maximally degrade the performance of trained agents [42].
Figure 1: Policies have different vulnerabilities.
A line of regularization-based robust methods [54, 33, 40] focuses on improving the robustness of the DNN itself and regularizes the policy network to output similar actions under bounded state perturbations. However, different from supervised learning problems, the vulnerability of a deep policy comes not only from the DNN approximator, but also from the dynamics of the RL environment [52]. These regularization-based methods neglect the intrinsic vulnerability of policies under the environment dynamics, and thus may still fail under strong attacks [42]. For example, in the go-home task shown in Figure 1, both the green policy and the red policy arrive home without rock collision when there is no attack. However, although regularization-based methods may ensure a minor action change under a state perturbation, the red policy may still suffer a low reward under attacks, as a very small divergence can lead it to the bomb. On the contrary, the green policy is
more robust to adversarial attacks since it stays away from the bomb. Therefore, besides promoting
the robustness of DNN approximators (such as the policy network), it is also important to learn a
policy with stronger intrinsic robustness.
Equal contribution.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.05927v1 [cs.LG] 12 Oct 2022
There is another line of work considering the long-term robustness of a deep policy under strong
adversarial attacks. In particular, it is theoretically proved [54, 42] that the strongest (worst-case) attacker against a policy can be learned as an RL problem, and training the agent under such a learned attacker can result in a robust policy. Zhang et al. [52] propose the Alternating Training with Learned Adversaries (ATLA) framework, which alternately trains an RL agent and an RL attacker. Sun et al. [42] further propose PA-ATLA, which alternately trains an agent and their more efficient PA-AD RL attacker, obtaining state-of-the-art robustness in many MuJoCo environments. However, training an RL attacker requires extra samples from the environment, and the attacker's RL problem may even be more difficult and sample-expensive to solve than the agent's original RL problem [52, 42], especially in large-scale environments such as Atari games with pixel observations. Therefore, although ATLA and PA-ATLA are able to achieve high long-term reward under attacks, they double the computational burden and sample complexity of training the robust agent.
The above analysis of existing literature suggests two main challenges in improving the adversarial
robustness of DRL agents: (1) correctly characterizing the long-term reward vulnerability of an
RL policy, and (2) efficiently training a robust agent without requiring much more effort than
vanilla training. To tackle these challenges, in this paper, we propose a generic and efficient robust
training framework named Worst-case-aware Robust RL (WocaR-RL) that estimates and improves the
long-term robustness of an RL agent.
WocaR-RL has 3 key mechanisms. First, WocaR-RL introduces a novel worst-attack Bellman operator which uses existing off-policy samples to estimate the lower bound of the policy value under the worst-case attack. Compared to prior works [52, 42] which attempt to learn the worst-case attack by RL methods, WocaR-RL does not require any extra interaction with the environment. Second, using the estimated worst-case policy value, WocaR-RL optimizes the policy to select actions that not only achieve high natural future reward, but also achieve high worst-case reward when there are adversarial attacks. Therefore, WocaR-RL learns a policy with less intrinsic vulnerability. Third, WocaR-RL regularizes the policy network with a carefully designed state importance weight. As a result, the DNN approximator tolerates state perturbations, especially for more important states where decisions are crucial for future reward. The above 3 mechanisms can also be interpreted from a geometric perspective of adversarial policy learning, as detailed in Appendix B.
Our contributions can be summarized as below. (1) We provide an approach to estimate the worst-case value of any policy under any bounded $\ell_p$ adversarial attacks. This helps evaluate the robustness of a policy without learning an attacker, which requires extra samples and exploration. (2) We propose a novel and principled robust training framework for RL, named Worst-case-aware Robust RL (WocaR-RL), which characterizes and improves the worst-case robustness of an agent. WocaR-RL can be used to robustify existing DRL algorithms (e.g., PPO [39], DQN [32]). (3) We show by experiments that WocaR-RL achieves improved robustness against various adversarial attacks as well as higher efficiency, compared with state-of-the-art (SOTA) robust RL methods in many MuJoCo and Atari games. For example, compared to the SOTA algorithm PA-ATLA-PPO [42] in the Walker environment, we obtain 20% more worst-case reward (under the strongest attack algorithm) with only about 50% of the training samples and 50% of the running time. Moreover, WocaR-RL learns more interpretable "robust behaviors" than PA-ATLA-PPO in Walker, as shown in Figure 2.
Previous robust agent (PA-ATLA-PPO): jumping with one leg.
Our robust agent: lowering its body.
Figure 2: The robust Walker agents trained with (top) the state-of-the-art method PA-ATLA-PPO [42] and (bottom) our WocaR-RL. Although the PA-ATLA-PPO agent also achieves high reward under attacks, it learns to jump with one leg, which is counter-intuitive and may indicate some level of overfitting to a specific attacker. In contrast, our WocaR-RL agent learns to lower its body, which is more intuitive and interpretable. The full agent trajectories in Walker and other environments are provided in the supplementary materials as GIF figures.
2 Related Work
Defending against Adversarial Perturbations on State Observations. (1) Regularization-based methods [54, 40, 33] enforce the policy to have similar outputs under similar inputs, which achieves certifiable performance for DQN in some Atari games. But in continuous control tasks, these methods may not reliably improve the worst-case performance. A recent work by Korkmaz [21] points out that these adversarially trained models may still be sensitive to new perturbations.
(2) Attack-driven methods train DRL agents with adversarial examples. Some early works [22, 4, 29, 34] apply weak or strong gradient-based attacks on state observations to train RL agents against adversarial perturbations. Zhang et al. [52] propose Alternating Training with Learned Adversaries (ATLA), which alternately trains an RL agent and an RL adversary and significantly improves policy robustness in continuous control games. Sun et al. [42] further extend this framework to PA-ATLA with their proposed more advanced RL attacker PA-AD. Although ATLA and PA-ATLA achieve strong empirical robustness, they require training an extra RL adversary, which can be computationally and sample expensive.
(3) There is another line of work studying certifiable robustness of RL policies. Several works [27, 33, 9] compute lower bounds of the action value network $Q^\pi$ to certify the robustness of action selection at every step. However, these bounds do not consider the distribution shifts caused by attacks, so some actions that appear safe for now can lead to extremely vulnerable future states and low long-term reward under future attacks. Moreover, these methods cannot apply to continuous action spaces.
Kumar et al. [23] and Wu et al. [49] both extend randomized smoothing [7] to derive robustness certificates for trained policies. But these works mostly focus on theoretical analysis and certification rather than effective robust training approaches.
Adversarial Defenses against Other Adversarial Attacks. Besides observation perturbations, attacks can happen in many other scenarios. For example, the agent's executed actions can be perturbed [50, 44, 45, 24]. Moreover, in a multi-agent game, an agent's behavior can create adversarial perturbations to a victim agent [13]. Pinto et al. [35] model the competition between the agent and the attacker as a zero-sum two-player game, and train the agent under a learned attacker to tolerate both environment shifts and adversarial disturbances. We point out that although we mainly consider state adversaries, our WocaR-RL can be extended to action attacks as formulated in Appendix C.5.
Note that we focus on robustness against test-time attacks, different from poisoning attacks which
alter the RL training process [3, 20, 41, 56, 36].
Safe RL and Risk-sensitive RL. There are several lines of work that study RL under safety/risk constraints [18, 11, 10, 2, 46] or under intrinsic uncertainty of environment dynamics [26, 30].
However, these works do not deal with adversarial attacks, which can be adaptive to the learned
policy. More comparison between these methods and our proposed method is discussed in Section 4.
3 Preliminaries and Background
Reinforcement Learning (RL). An RL environment is modeled by a Markov Decision Process (MDP), denoted by a tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$, where $\mathcal{S}$ is a state space, $\mathcal{A}$ is an action space, $P: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is a stochastic dynamics model (here $\Delta(X)$ denotes the space of probability distributions over $X$), $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a reward function, and $\gamma \in [0,1)$ is a discount factor. An agent takes actions based on a policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$. For any policy, its natural performance can be measured by the value function $V^\pi(s) := \mathbb{E}_{P,\pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s\right]$ and the action value function $Q^\pi(s, a) := \mathbb{E}_{P,\pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s, a_0 = a\right]$. We call $V^\pi$ the natural value and $Q^\pi$ the natural action value, in contrast to the values under attacks, as will be introduced in Section 4.
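To ground these definitions, the following is a minimal sketch (our illustration, not from the paper) of estimating $V^\pi(s_0)$ by Monte Carlo rollouts; it assumes a Gymnasium-style environment interface and a deterministic policy function.

```python
import numpy as np

def mc_value_estimate(env, policy, gamma=0.99, n_rollouts=100, horizon=1000):
    """Monte Carlo estimate of V^pi(s_0): average discounted return over rollouts."""
    returns = []
    for _ in range(n_rollouts):
        s, _ = env.reset()
        g, discount = 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)                                  # pi: S -> A
            s, r, terminated, truncated, _ = env.step(a)
            g += discount * r                              # accumulate gamma^t * R(s_t, a_t)
            discount *= gamma
            if terminated or truncated:
                break
        returns.append(g)
    return float(np.mean(returns))
```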
Deep Reinforcement Learning (DRL). In large-scale problems, a policy can be parameterized by a neural network. For example, value-based RL methods (e.g., DQN [32]) usually fit a Q network and take the greedy policy $\pi(s) = \arg\max_a Q(s, a)$. In actor-critic methods (e.g., PPO [39]), the learner directly learns a policy network and a critic network. In practice, an agent usually follows a stochastic policy during training that enables exploration, and executes a trained policy deterministically at test time, e.g., the greedy policy learned with DQN. Throughout this paper, we use $\pi_\theta$ to denote the training-time stochastic policy parameterized by $\theta$, while $\pi$ denotes the trained deterministic policy that maps a state to an action.
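To make the $\pi_\theta$ versus $\pi$ distinction concrete, here is a small PyTorch sketch (our illustration; the network and sizes are placeholders) of a discrete-action policy that samples stochastically during training and acts greedily at test time.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):
        return self.net(s)  # action logits

def act(policy, s, training=True):
    logits = policy(s)
    if training:  # stochastic pi_theta: sample to enable exploration
        return torch.distributions.Categorical(logits=logits).sample()
    return logits.argmax(dim=-1)  # deterministic pi executed at test time
```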
Test-time Adversarial Attacks. After training, the agent is deployed into the environment and executes a pre-trained fixed policy $\pi$. An attacker/adversary, during the deployment of the agent, may perturb the state observation of the agent/victim at every time step within a certain attack budget $\epsilon$. Note that the attacker only perturbs the inputs to the policy, and the underlying state in the environment does not change. This is a realistic setting because real-world observations can come from noisy sensors or be manipulated by malicious attacks. For example, an auto-driving car receives sensory observations; an attacker may add imperceptible noise to the camera, or perturb the GPS signal, although the underlying environment (the road) remains unchanged. In this paper, we consider the $\ell_p$ threat model which is widely used in the adversarial learning literature: at step $t$, the attacker alters the observation $s_t$ into $\tilde{s}_t \in B_\epsilon(s_t)$, where $B_\epsilon(s_t)$ is an $\ell_p$ norm ball centered at $s_t$ with radius $\epsilon$. The above setting ($\ell_p$-constrained observation attack) is the same as in many prior works [19, 34, 54, 52, 42].
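For intuition, the following sketch (our illustration; the one-step FGSM-style attack here is only one example of an $\ell_\infty$-bounded perturbation, not the worst-case attacker discussed later) perturbs an observation within $B_\epsilon(s_t)$ while the true environment state stays untouched.

```python
import torch

def linf_observation_attack(policy_logits_fn, s, eps):
    """Return a perturbed observation s_tilde inside the l_inf ball B_eps(s).

    Only the policy input is perturbed; the underlying environment state is unchanged.
    policy_logits_fn maps an observation tensor to action logits.
    """
    s_adv = s.clone().detach().requires_grad_(True)
    logits = policy_logits_fn(s_adv)
    # One-step FGSM-style heuristic: push the observation in the direction that
    # lowers the logit of the currently preferred action.
    loss = -logits.max(dim=-1).values.sum()
    loss.backward()
    s_tilde = s + eps * s_adv.grad.sign()
    # Keep s_tilde within B_eps(s) (already satisfied for a single step; kept for clarity).
    return torch.clamp(s_tilde, s - eps, s + eps).detach()
```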
4 Worst-case-aware Robust RL
In this section, we present Worst-case-aware Robust RL (WocaR-RL), a generic framework that can
be fused with any DRL approach to improve the adversarial robustness of an agent. We will introduce
the three key mechanisms in WocaR-RL: worst-attack value estimation, worst-case-aware policy
optimization, and value-enhanced state regularization, respectively. Then, we will illustrate how to
incorporate these mechanisms into existing DRL algorithms to improve their robustness.
Mechanism 1: Worst-attack Value Estimation
Traditional RL aims to learn a policy with the maximal value $V^\pi$. However, in a real-world problem where observations can be noisy or even adversarially perturbed, it is not enough to only consider the natural values $V^\pi$ and $Q^\pi$. As motivated in Figure 1, two policies with similar natural rewards can get totally different rewards under attacks. To comprehensively evaluate how good a policy is in an adversarial scenario and to improve its robustness, we should be aware of the lowest possible long-term reward of the policy when its observation is adversarially perturbed with a certain attack budget $\epsilon$ at every step (with an $\ell_p$ attack model introduced in Section 3).
The worst-case value of a policy is, by definition, the cumulative reward obtained under the optimal attacker. As justified by prior works [54, 42], for any given victim policy $\pi$ and attack budget $\epsilon > 0$, there exists an optimal attacker, and finding the optimal attacker is equivalent to learning the optimal policy in another MDP. We denote the optimal (deterministic) attacker's policy as $h$. However, learning such an optimal attacker by RL algorithms requires extra interaction samples from the environment, due to the unknown dynamics. Moreover, learning the attacker by RL can be hard and expensive, especially when the state observation space is high-dimensional.

Instead of explicitly learning the optimal attacker with a large amount of samples, we propose to directly estimate the worst-case cumulative reward of the policy by characterizing the vulnerability of the given policy. We first define the worst-attack action value of policy $\pi$ as $\underline{Q}^\pi(s, a) := \mathbb{E}_{P}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \pi(h(s_t))) \mid s_0 = s, a_0 = a\right]$. The worst-attack value $\underline{V}^\pi$ can be defined using $h$ in the same way, as shown in Definition A.1 in Appendix A. Then, we introduce a novel operator $\underline{T}^\pi$, namely the worst-attack Bellman operator, defined as below.
Tπ, namely the worst-attack Bellman operator, defined as below.
Definition 4.1
(Worst-attack Bellman Operator)
.
For MDP
M
, given a fixed policy
π
and attack
radius , define the worst-attack Bellman operator Tπas
(TπQ) (s, a) := Es0P(s,a)[R(s, a) + γmin
a0∈Aadv (s0)Q(s0, a0)],(1)
where s∈ S,Aadv(s, π)is defined as
Aadv(s, π) := {a∈ A :˜s∈ B(s)s.t. π(˜s) = a}.(2)
Here
Aadv(s0, π)
denotes the set of actions an adversary can mislead the victim
π
into selecting by
perturbing the state
s0
into a neighboring state
˜s∈ B(s0)
. This hypothetical perturbation to the future
state
s0
is the key for characterizing the worst-case long-term reward under attack. The following
theorem associates the worst-attack Bellman operator and the worst-attack action value.
Theorem 4.2 (Worst-attack Bellman Operator and Worst-attack Action Value). For any given policy $\pi$, $\underline{T}^\pi$ is a contraction whose fixed point is $\underline{Q}^\pi$, the worst-attack action value of $\pi$ under any $\ell_p$ observation attacks with radius $\epsilon$.

Theorem 4.2, proved in Appendix A, suggests that the lowest possible cumulative reward of a policy under bounded observation attacks can be computed by the worst-attack Bellman operator. The corresponding worst-attack value $\underline{V}^\pi$ can be obtained by $\underline{V}^\pi(s) = \min_{a \in \mathcal{A}_{adv}(s, \pi)} \underline{Q}^\pi(s, a)$.
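To make the operator concrete, here is a minimal tabular sketch (our illustration under simplifying assumptions: finite state and action spaces, known dynamics and rewards, and a precomputed adversarial action set per state) that iterates the worst-attack Bellman backup of Equation (1) to its fixed point.

```python
import numpy as np

def worst_attack_q(P, R, A_adv, gamma=0.99, n_iters=1000, tol=1e-8):
    """Iterate the worst-attack Bellman operator (Eq. 1) on a tabular MDP.

    P: [S, A, S] transition probabilities, R: [S, A] rewards.
    A_adv: list of length S; A_adv[s] is the set of actions an attacker can
           induce at state s by perturbing the observation (Eq. 2).
    Returns Q_low: [S, A], approximating the worst-attack action value.
    """
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        # At each next state s', the attacker induces the worst admissible action.
        worst_next = np.array(
            [min(Q[s_next, a] for a in A_adv[s_next]) for s_next in range(S)]
        )
        Q_new = R + gamma * P @ worst_next          # (T_pi Q)(s, a)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q
```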
How to Compute $\mathcal{A}_{adv}$. To obtain $\mathcal{A}_{adv}(s, \pi)$, we need to identify the actions that can be the outputs of the policy $\pi$ when the input state $s$ is perturbed within $B_\epsilon(s)$. This can be solved by commonly-used convex relaxation of neural networks [15, 55, 48, 53, 14], where layer-wise lower and upper bounds of the neural network are derived. That is, we calculate $\underline{\pi}$ and $\overline{\pi}$ such that $\underline{\pi}(s) \le \pi(\hat{s}) \le \overline{\pi}(s), \forall \hat{s} \in B_\epsilon(s)$. With such a relaxation, we can obtain a superset of $\mathcal{A}_{adv}$, namely $\hat{\mathcal{A}}_{adv}$. Then, the fixed point of Equation (1) with $\mathcal{A}_{adv}$ replaced by $\hat{\mathcal{A}}_{adv}$ becomes a lower bound of the worst-attack action value. For a continuous action space, $\hat{\mathcal{A}}_{adv}(s, \pi)$ contains actions bounded by $\underline{\pi}(s)$ and $\overline{\pi}(s)$. For a discrete action space, we can first compute the maximal and minimal probabilities of taking each action, and derive the set of actions that are likely to be selected. The computation of $\hat{\mathcal{A}}_{adv}$ is not expensive, as there are many efficient convex relaxation methods [31, 53] which compute $\underline{\pi}$ and $\overline{\pi}$ with only constant-factor more computation than directly computing $\pi(s)$. Experiments in Section 5 verify the efficiency of our approach, where we use the well-developed toolbox auto_LiRPA [51] to calculate the convex relaxation. More implementation details and explanations are provided in Appendix C.1.
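The following is a minimal sketch of this bound computation, assuming auto_LiRPA's BoundedModule/BoundedTensor interface; the network architecture, dimensions, and radius are placeholders, and the released WocaR-RL code should be consulted for the actual implementation.

```python
import torch
import torch.nn as nn
from auto_LiRPA import BoundedModule, BoundedTensor
from auto_LiRPA.perturbations import PerturbationLpNorm

# A placeholder deterministic policy network for a continuous action space.
policy = nn.Sequential(nn.Linear(17, 64), nn.Tanh(), nn.Linear(64, 6), nn.Tanh())

s = torch.randn(1, 17)      # current observation s
eps = 0.05                  # attack radius (l_inf)

# Wrap the policy so layer-wise relaxation can be applied.
bounded_policy = BoundedModule(policy, torch.empty_like(s))
ptb = PerturbationLpNorm(norm=float("inf"), eps=eps)
s_bounded = BoundedTensor(s, ptb)

# pi_lower(s) <= pi(s_hat) <= pi_upper(s) for all s_hat in B_eps(s).
pi_lower, pi_upper = bounded_policy.compute_bounds(x=(s_bounded,), method="backward")

# For a continuous action space, the relaxed set A_hat_adv(s, pi) is the box
# [pi_lower(s), pi_upper(s)], over which the worst-attack critic is minimized.
```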
Estimating Worst-attack Value. Note that the worst-attack Bellman operator $\underline{T}^\pi$ is similar to the optimal Bellman operator $T$, although it uses $\min_{a \in \mathcal{A}_{adv}}$ instead of $\max_{a \in \mathcal{A}}$. Therefore, once we identify $\mathcal{A}_{adv}$ as introduced above, it is straightforward to compute the worst-attack action value using Bellman backups. To model the worst-attack action value, we train a network named the worst-attack critic, denoted by $\underline{Q}^\pi_\phi$, where $\phi$ is the parameterization. Concretely, for any mini-batch $\{s_t, a_t, r_t, s_{t+1}\}_{t=1}^{N}$, $\underline{Q}^\pi_\phi$ is optimized by minimizing the following estimation loss:
$$\mathcal{L}_{est}(\underline{Q}^\pi_\phi) := \frac{1}{N}\sum_{t=1}^{N}\big(y_t - \underline{Q}^\pi_\phi(s_t, a_t)\big)^2, \quad \text{where } y_t = r_t + \gamma \min_{\hat{a} \in \mathcal{A}_{adv}(s_{t+1})} \underline{Q}^\pi_\phi(s_{t+1}, \hat{a}). \quad (3)$$
For a discrete action space, $\mathcal{A}_{adv}$ is a discrete set and solving for $y_t$ is straightforward. For a continuous action space, we use gradient descent to approximately find the minimizer $\hat{a}$. Since $\mathcal{A}_{adv}$ is in general small, this minimization is usually easy to solve. In MuJoCo, we find that 50-step gradient descent already converges to a good solution with little computational cost, as detailed in Appendix D.3.3.
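As an illustration, the sketch below (our own, with hypothetical function and argument names) computes the target $y_t$ and the estimation loss of Equation (3) for a continuous action space, using projected gradient descent over the relaxed action box to approximate the inner minimization.

```python
import torch
import torch.nn.functional as F

def worst_attack_critic_loss(q_wst, s, a, r, s_next, a_low, a_high,
                             gamma=0.99, n_steps=50, lr=0.01):
    """Estimation loss L_est (Eq. 3) for the worst-attack critic q_wst.

    a_low, a_high: per-sample bounds pi_lower(s_next), pi_upper(s_next) defining
    the relaxed adversarial action set A_hat_adv(s_next) as a box.
    """
    # Inner minimization: projected gradient descent over the action box.
    a_hat = ((a_low + a_high) / 2).clone().detach()
    for _ in range(n_steps):
        a_hat.requires_grad_(True)
        q_val = q_wst(s_next, a_hat).sum()
        grad, = torch.autograd.grad(q_val, a_hat)
        with torch.no_grad():
            # Descend on Q_wst and project back into [a_low, a_high].
            a_hat = (a_hat - lr * grad).clamp(a_low, a_high)
    with torch.no_grad():
        y = r + gamma * q_wst(s_next, a_hat)   # TD target with the worst admissible action
    return F.mse_loss(q_wst(s, a), y)
```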
Differences with Worst-case Value Estimation in Related Work. Our proposed worst-attack Bellman operator is different from the worst-case Bellman operator in the literature of risk-sensitive RL [18, 11, 43, 10, 2, 46], whose goal is to avoid unsafe trajectories under the intrinsic uncertainties of the MDP. These inherent uncertainties of the environment are independent of the learned policy. In contrast, our focus is to defend against adversarial perturbations created by malicious attackers that can be adaptive to the policy. The GWC reward proposed by [33] also estimates the worst-case reward of a policy under state perturbations. But their evaluation is based on a greedy strategy and requires interactions with the environment, which is different from our estimation.
Mechanism 2: Worst-case-aware Policy Optimization
So far we have introduced how to evaluate the worst-attack value of a policy by learning a worst-attack critic. Inspired by the actor-critic framework, where the actor policy network $\pi_\theta$ is optimized towards a direction in which the critic value increases the most, we can regard the worst-attack critic as a special critic that directs the actor to increase the worst-attack value. That is, we encourage the agent to select an action with a higher worst-attack action value by minimizing the worst-attack policy loss below:
$$\mathcal{L}_{wst}(\pi_\theta; \underline{Q}^\pi_\phi) := -\frac{1}{N}\sum_{t=1}^{N}\sum_{a \in \mathcal{A}} \pi_\theta(a \mid s_t)\, \underline{Q}^\pi_\phi(s_t, a), \quad (4)$$
where $\underline{Q}^\pi_\phi$ is the worst-attack critic learned via $\mathcal{L}_{est}$ introduced in Equation (3). Note that $\mathcal{L}_{wst}$ is a general form, while the detailed implementation of the worst-attack policy optimization can vary depending on the architecture of $\pi_\theta$ in the base RL algorithm (e.g., PPO has a policy network, while DQN acts using the greedy policy induced by a Q network). In Appendix C.2 and Appendix C.3, we illustrate how to implement $\mathcal{L}_{wst}$ for PPO and DQN as two examples.
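For concreteness, here is a rough sketch (our own, not the implementation in Appendix C.2) of how the worst-attack term could be added to a PPO-style policy loss for a continuous action space, where the sum over actions in Equation (4) is replaced by a Monte Carlo estimate with actions sampled from $\pi_\theta$; the weight name kappa_wst is a hypothetical hyperparameter.

```python
import torch

def wocar_ppo_policy_loss(ppo_clip_loss, policy, q_wst, states, kappa_wst=0.5):
    """Combine the usual PPO clipped surrogate loss with the worst-attack term.

    ppo_clip_loss: precomputed standard PPO loss (to be minimized).
    policy(states) is assumed to return a torch.distributions.Distribution.
    """
    dist = policy(states)
    actions = dist.rsample()                 # reparameterized sample keeps gradients to theta
    # Monte Carlo estimate of E_{a ~ pi_theta}[Q_wst(s, a)]; minimizing the negative
    # pushes the policy toward actions with higher worst-attack value (cf. Eq. 4).
    l_wst = -q_wst(states, actions).mean()
    return ppo_clip_loss + kappa_wst * l_wst
```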
The proposed worst-case-aware policy optimization has several merits compared to the prior ATLA [52] and PA-ATLA [42] methods, which alternately train the agent and an RL attacker. (1) Learning the optimal attacker $h$ requires collecting extra samples using the current policy (on-policy estimation). In contrast, $\underline{Q}^\pi_\phi$ can be learned using off-policy samples, e.g., historical samples in the replay buffer, and thus is more suitable for training where the policy changes over time. ($\underline{Q}^\pi_\phi$ depends on the current