
2 Related Work
Defending against Adversarial Perturbations on State Observations. (1) Regularization-based methods [54, 40, 33] enforce the policy to produce similar outputs under similar inputs, which achieves certifiable performance for DQN in some Atari games. In continuous control tasks, however, these methods may not reliably improve the worst-case performance. A recent work by Korkmaz [21] points out that these adversarially trained models may still be sensitive to new perturbations.
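As a rough illustration of the regularization idea (a minimal sketch, not the exact objective of [54, 40, 33]), one can penalize the divergence between the policy's action distributions on clean and perturbed states; the policy network policy_net, the perturbation radius eps, and the use of a randomly sampled perturbation instead of an inner maximization are all simplifying assumptions here.

```python
import torch
import torch.nn.functional as F

def smoothness_regularizer(policy_net, states, eps=0.05):
    """Penalize divergence between action distributions on clean vs. perturbed states.

    Simplified sketch: the perturbation is sampled uniformly from an l_inf ball
    of radius eps instead of being found by an inner maximization.
    """
    logits_clean = policy_net(states)                    # (batch, num_actions)
    noise = (torch.rand_like(states) * 2 - 1) * eps      # uniform in [-eps, eps]
    logits_pert = policy_net(states + noise)
    # KL(pi(.|s) || pi(.|s + delta)) averaged over the batch
    return F.kl_div(
        F.log_softmax(logits_pert, dim=-1),
        F.softmax(logits_clean, dim=-1),
        reduction="batchmean",
    )
```

In practice such a term would be added to the standard RL objective with a regularization weight.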
(2) Attack-driven methods train DRL agents with adversarial examples. Some early works [22, 4, 29, 34] apply weak or strong gradient-based attacks on state observations to train RL agents against adversarial perturbations. Zhang et al. [52] propose Alternating Training with Learned Adversaries (ATLA), which alternately trains an RL agent and an RL adversary and significantly improves policy robustness in continuous control games. Sun et al. [42] further extend this framework to PA-ATLA with their more advanced RL attacker PA-AD. Although ATLA and PA-ATLA achieve strong empirical robustness, they require training an extra RL adversary, which is computationally expensive and sample-inefficient.
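For concreteness, the alternating scheme behind ATLA-style training can be summarized as below. This is a high-level sketch rather than the authors' implementation; the train_agent_step and train_adversary_step callables and the iteration counts are hypothetical placeholders supplied by the user.

```python
def atla_style_training(agent, adversary, env,
                        train_agent_step, train_adversary_step,
                        num_rounds=100, n_agent_iters=10, n_adv_iters=10):
    """Alternating training with a learned adversary (high-level sketch).

    The adversary perturbs the agent's observations; the two are trained in
    alternation, each treating the other as a fixed part of the environment.
    train_agent_step and train_adversary_step are user-supplied update
    routines (hypothetical placeholders, e.g. one PPO update each).
    """
    for _ in range(num_rounds):
        # Phase 1: fix the adversary, update the agent to maximize its
        # return under the adversary's observation perturbations.
        for _ in range(n_agent_iters):
            train_agent_step(agent, env, perturb=adversary)
        # Phase 2: fix the agent, update the adversary to minimize the
        # agent's return (the adversary receives the negated reward).
        for _ in range(n_adv_iters):
            train_adversary_step(adversary, env, victim=agent)
    return agent, adversary
```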
(3) Another line of work studies certifiable robustness of RL policies. Several works [27, 33, 9] compute lower bounds of the action value network $Q^\pi$ to certify the robustness of action selection at every step. However, these bounds do not account for the distribution shift caused by attacks, so an action that appears safe at the current step can lead to highly vulnerable future states and low long-term reward under future attacks. Moreover, these methods do not apply to continuous action spaces. Kumar et al. [23] and Wu et al. [49] both extend randomized smoothing [7] to derive robustness certificates for trained policies, but these works mostly focus on theoretical analysis and certification rather than effective robust training approaches.
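To make the per-step certification idea concrete, the sketch below (our illustration, not the cited authors' code) selects an action whose certified lower bound on the Q value beats the certified upper bounds of all other actions under an $\ell_\infty$ perturbation of radius $\epsilon$; the bound oracles q_lower_bound and q_upper_bound are assumed to come from a network verification method such as interval bound propagation.

```python
import numpy as np

def certified_greedy_action(q_lower_bound, q_upper_bound, state, eps, num_actions):
    """Per-step certified action selection sketch (discrete actions).

    q_lower_bound(state, a, eps) / q_upper_bound(state, a, eps) are assumed to
    bound Q(s', a) over all s' with ||s' - state||_inf <= eps, e.g. obtained
    from a neural-network verification method (hypothetical interface).
    """
    lb = np.array([q_lower_bound(state, a, eps) for a in range(num_actions)])
    ub = np.array([q_upper_bound(state, a, eps) for a in range(num_actions)])
    a_star = int(np.argmax(lb))  # action with the best worst-case value
    # The choice is certified if its worst-case value is no smaller than the
    # best-case value of every other action under the same perturbation set.
    others = np.delete(ub, a_star)
    certified = bool(np.all(lb[a_star] >= others))
    return a_star, certified
```

As noted above, such a certificate is myopic: it says nothing about the states the action leads to, which may themselves be vulnerable under future attacks.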
Adversarial Defenses against Other Adversarial Attacks. Besides observation perturbations, attacks can happen in many other scenarios. For example, the agent's executed actions can be perturbed [50, 44, 45, 24]. Moreover, in a multi-agent game, an agent's behavior can create adversarial perturbations to a victim agent [13]. Pinto et al. [35] model the competition between the agent and the attacker as a zero-sum two-player game, and train the agent under a learned attacker to tolerate both environment shifts and adversarial disturbances. We point out that although we mainly consider state adversaries, our WocaR-RL can be extended to action attacks as formulated in Appendix C.5. Note that we focus on robustness against test-time attacks, different from poisoning attacks which alter the RL training process [3, 20, 41, 56, 36].
Safe RL and Risk-sensitive RL. There are several lines of work that study RL under safety/risk constraints [18, 11, 10, 2, 46] or under intrinsic uncertainty of environment dynamics [26, 30]. However, these works do not deal with adversarial attacks, which can be adaptive to the learned policy. A more detailed comparison between these methods and ours is given in Section 4.
3 Preliminaries and Background
Reinforcement Learning (RL). An RL environment is modeled by a Markov Decision Process (MDP), denoted by a tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$, where $\mathcal{S}$ is a state space, $\mathcal{A}$ is an action space, $P: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is a stochastic dynamics model², $R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a reward function, and $\gamma \in [0, 1)$ is a discount factor. An agent takes actions based on a policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$. For any policy, its natural performance can be measured by the value function $V^\pi(s) := \mathbb{E}_{P,\pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s\right]$ and the action value function $Q^\pi(s, a) := \mathbb{E}_{P,\pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s, a_0 = a\right]$. We call $V^\pi$ the natural value and $Q^\pi$ the natural action value, in contrast to the values under attacks, as will be introduced in Section 4.
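As a quick illustration of these definitions, $V^\pi$ at an initial state can be approximated by Monte Carlo rollouts of the discounted return. The Gym-style reset/step interface below is an assumption of the sketch, not part of the paper.

```python
import numpy as np

def estimate_value(env, policy, gamma=0.99, num_episodes=100, horizon=1000):
    """Monte Carlo estimate of V^pi at the environment's initial state,
    truncating the infinite discounted sum at `horizon` steps.

    Assumes a Gym-style interface: env.reset() -> state,
    env.step(a) -> (next_state, reward, done, info), and policy(state) -> action.
    """
    returns = []
    for _ in range(num_episodes):
        state = env.reset()
        g, discount = 0.0, 1.0
        for _ in range(horizon):
            action = policy(state)
            state, reward, done, _ = env.step(action)
            g += discount * reward      # accumulate gamma^t * R(s_t, a_t)
            discount *= gamma
            if done:
                break
        returns.append(g)
    return float(np.mean(returns))
```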
Deep Reinforcement Learning (DRL). In large-scale problems, a policy can be parameterized by a neural network. For example, value-based RL methods (e.g., DQN [32]) usually fit a Q network and take the greedy policy $\pi(s) = \arg\max_a Q(s, a)$. In actor-critic methods (e.g., PPO [39]), the learner directly learns a policy network and a critic network. In practice, an agent usually follows a stochastic policy during training to enable exploration, and executes a trained policy deterministically at test time, e.g., the greedy policy learned with DQN. Throughout this paper, we use $\pi_\theta$ to denote the training-time stochastic policy parameterized by $\theta$, while $\pi$ denotes the trained deterministic policy that maps a state to an action.
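The distinction between the stochastic training-time policy $\pi_\theta$ and the deterministic test-time policy $\pi$ can be sketched as follows for a discrete action space; the architecture and sizes are illustrative assumptions, not the paper's.

```python
import torch
import torch.nn as nn

class DiscretePolicy(nn.Module):
    """Small categorical policy: stochastic during training, greedy at test time."""

    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):
        return self.net(state)  # action logits

    def act(self, state, deterministic=False):
        logits = self.forward(state)
        if deterministic:               # test time: pi(s) = argmax_a
            return torch.argmax(logits, dim=-1)
        dist = torch.distributions.Categorical(logits=logits)
        return dist.sample()            # training time: sample from pi_theta(.|s)
```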
Test-time Adversarial Attacks. After training, the agent is deployed into the environment and executes a pre-trained fixed policy $\pi$. An attacker/adversary, during the deployment of the agent, may
² $\Delta(X)$ denotes the space of probability distributions over $X$.