
2 Related Work
Defending against Adversarial Perturbations on State Observations. (1) Regularization-based methods [54, 40, 33] enforce the policy to produce similar outputs under similar inputs, which achieves certifiable performance for DQN in some Atari games. In continuous control tasks, however, these methods may not reliably improve the worst-case performance. A recent work by Korkmaz [21] points out that these adversarially trained models may still be sensitive to new perturbations.
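As a rough illustration of the regularization idea (a minimal sketch, not the exact objective of [54, 40, 33]), one can penalize the divergence between the policy's action distributions on clean and perturbed states; the policy network policy_net, the perturbation radius eps, and the use of a randomly sampled perturbation instead of an inner maximization are all simplifying assumptions here.

```python
import torch
import torch.nn.functional as F

def smoothness_regularizer(policy_net, states, eps=0.05):
    """Penalize divergence between action distributions on clean vs. perturbed states.

    Simplified sketch: the perturbation is sampled uniformly from an l_inf ball
    of radius eps instead of being found by an inner maximization.
    """
    logits_clean = policy_net(states)                    # (batch, num_actions)
    noise = (torch.rand_like(states) * 2 - 1) * eps      # uniform in [-eps, eps]
    logits_pert = policy_net(states + noise)
    # KL(pi(.|s) || pi(.|s + delta)) averaged over the batch
    return F.kl_div(
        F.log_softmax(logits_pert, dim=-1),
        F.softmax(logits_clean, dim=-1),
        reduction="batchmean",
    )
```

In practice such a term would be added to the standard RL objective with a regularization weight.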
(2) Attack-driven methods train DRL agents with adversarial examples. Some early works [22, 4, 29, 34] apply weak or strong gradient-based attacks on state observations to train RL agents against adversarial perturbations. Zhang et al. [52] propose Alternating Training with Learned Adversaries (ATLA), which alternately trains an RL agent and an RL adversary and significantly improves policy robustness in continuous control games. Sun et al. [42] further extend this framework to PA-ATLA with their more advanced RL attacker PA-AD. Although ATLA and PA-ATLA achieve strong empirical robustness, they require training an extra RL adversary, which is computationally expensive and sample-inefficient.
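For concreteness, the alternating scheme behind ATLA-style training can be summarized as below. This is a high-level sketch rather than the authors' implementation; the train_agent_step and train_adversary_step callables and the iteration counts are hypothetical placeholders supplied by the user.

```python
def atla_style_training(agent, adversary, env,
                        train_agent_step, train_adversary_step,
                        num_rounds=100, n_agent_iters=10, n_adv_iters=10):
    """Alternating training with a learned adversary (high-level sketch).

    The adversary perturbs the agent's observations; the two are trained in
    alternation, each treating the other as a fixed part of the environment.
    train_agent_step and train_adversary_step are user-supplied update
    routines (hypothetical placeholders, e.g. one PPO update each).
    """
    for _ in range(num_rounds):
        # Phase 1: fix the adversary, update the agent to maximize its
        # return under the adversary's observation perturbations.
        for _ in range(n_agent_iters):
            train_agent_step(agent, env, perturb=adversary)
        # Phase 2: fix the agent, update the adversary to minimize the
        # agent's return (the adversary receives the negated reward).
        for _ in range(n_adv_iters):
            train_adversary_step(adversary, env, victim=agent)
    return agent, adversary
```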
(3) Another line of work studies certifiable robustness of RL policies. Several works [27, 33, 9] compute lower bounds of the action value network $Q^\pi$ to certify the robustness of action selection at every step. However, these bounds do not account for the distribution shift caused by attacks, so an action that appears safe at the current step can lead to highly vulnerable future states and low long-term reward under future attacks. Moreover, these methods do not apply to continuous action spaces. Kumar et al. [23] and Wu et al. [49] both extend randomized smoothing [7] to derive robustness certificates for trained policies, but these works mostly focus on theoretical analysis and certification rather than effective robust training approaches.
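To make the per-step certification idea concrete, the sketch below (our illustration, not the cited authors' code) selects an action whose certified lower bound on the Q value beats the certified upper bounds of all other actions under an $\ell_\infty$ perturbation of radius $\epsilon$; the bound oracles q_lower_bound and q_upper_bound are assumed to come from a network verification method such as interval bound propagation.

```python
import numpy as np

def certified_greedy_action(q_lower_bound, q_upper_bound, state, eps, num_actions):
    """Per-step certified action selection sketch (discrete actions).

    q_lower_bound(state, a, eps) / q_upper_bound(state, a, eps) are assumed to
    bound Q(s', a) over all s' with ||s' - state||_inf <= eps, e.g. obtained
    from a neural-network verification method (hypothetical interface).
    """
    lb = np.array([q_lower_bound(state, a, eps) for a in range(num_actions)])
    ub = np.array([q_upper_bound(state, a, eps) for a in range(num_actions)])
    a_star = int(np.argmax(lb))  # action with the best worst-case value
    # The choice is certified if its worst-case value is no smaller than the
    # best-case value of every other action under the same perturbation set.
    others = np.delete(ub, a_star)
    certified = bool(np.all(lb[a_star] >= others))
    return a_star, certified
```

As noted above, such a certificate is myopic: it says nothing about the states the action leads to, which may themselves be vulnerable under future attacks.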
Adversarial Defenses against Other Adversarial Attacks. Besides observation perturbations, attacks can happen in many other scenarios. For example, the agent's executed actions can be perturbed [50, 44, 45, 24]. Moreover, in a multi-agent game, an agent's behavior can create adversarial perturbations to a victim agent [13]. Pinto et al. [35] model the competition between the agent and the attacker as a zero-sum two-player game, and train the agent under a learned attacker to tolerate both environment shifts and adversarial disturbances. We point out that although we mainly consider state adversaries, our WocaR-RL can be extended to action attacks as formulated in Appendix C.5. Note that we focus on robustness against test-time attacks, different from poisoning attacks which alter the RL training process [3, 20, 41, 56, 36].
Safe RL and Risk-sensitive RL. There are several lines of work that study RL under safety/risk constraints [18, 11, 10, 2, 46] or under intrinsic uncertainty of environment dynamics [26, 30]. However, these works do not deal with adversarial attacks, which can be adaptive to the learned policy. A more detailed comparison between these methods and ours is given in Section 4.
3 Preliminaries and Background
Reinforcement Learning (RL). An RL environment is modeled by a Markov Decision Process (MDP), denoted by a tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$, where $\mathcal{S}$ is a state space, $\mathcal{A}$ is an action space, $P: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is a stochastic dynamics model², $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a reward function, and $\gamma \in [0, 1)$ is a discount factor. An agent takes actions based on a policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$. For any policy, its natural performance can be measured by the value function $V^\pi(s) := \mathbb{E}_{P,\pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s\right]$ and the action value function $Q^\pi(s, a) := \mathbb{E}_{P,\pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s, a_0 = a\right]$. We call $V^\pi$ the natural value and $Q^\pi$ the natural action value, in contrast to the values under attacks, as will be introduced in Section 4.
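As a quick illustration of these definitions, $V^\pi$ at an initial state can be approximated by Monte Carlo rollouts of the discounted return. The Gym-style reset/step interface below is an assumption of the sketch, not part of the paper.

```python
import numpy as np

def estimate_value(env, policy, gamma=0.99, num_episodes=100, horizon=1000):
    """Monte Carlo estimate of V^pi at the environment's initial state,
    truncating the infinite discounted sum at `horizon` steps.

    Assumes a Gym-style interface: env.reset() -> state,
    env.step(a) -> (next_state, reward, done, info), and policy(state) -> action.
    """
    returns = []
    for _ in range(num_episodes):
        state = env.reset()
        g, discount = 0.0, 1.0
        for _ in range(horizon):
            action = policy(state)
            state, reward, done, _ = env.step(action)
            g += discount * reward      # accumulate gamma^t * R(s_t, a_t)
            discount *= gamma
            if done:
                break
        returns.append(g)
    return float(np.mean(returns))
```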
Deep Reinforcement Learning (DRL). In large-scale problems, a policy can be parameterized by a neural network. For example, value-based RL methods (e.g., DQN [32]) usually fit a Q network and take the greedy policy $\pi(s) = \arg\max_a Q(s, a)$. In actor-critic methods (e.g., PPO [39]), the learner directly learns a policy network and a critic network. In practice, an agent usually follows a stochastic policy during training to enable exploration, and executes a trained policy deterministically at test time, e.g., the greedy policy learned with DQN. Throughout this paper, we use $\pi_\theta$ to denote the training-time stochastic policy parameterized by $\theta$, while $\pi$ denotes the trained deterministic policy that maps a state to an action.
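The distinction between the stochastic training-time policy $\pi_\theta$ and the deterministic test-time policy $\pi$ can be sketched as follows for a discrete action space; the architecture and sizes are illustrative assumptions, not the paper's.

```python
import torch
import torch.nn as nn

class DiscretePolicy(nn.Module):
    """Small categorical policy: stochastic during training, greedy at test time."""

    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):
        return self.net(state)  # action logits

    def act(self, state, deterministic=False):
        logits = self.forward(state)
        if deterministic:               # test time: pi(s) = argmax_a
            return torch.argmax(logits, dim=-1)
        dist = torch.distributions.Categorical(logits=logits)
        return dist.sample()            # training time: sample from pi_theta(.|s)
```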
Test-time Adversarial Attacks. After training, the agent is deployed into the environment and executes a pre-trained fixed policy $\pi$. An attacker/adversary, during the deployment of the agent, may
² $\Delta(X)$ denotes the space of probability distributions over $X$.