Policy Gradients for Probabilistic Constrained
Reinforcement Learning
Weiqin Chen
Department of Electrical,
Computer, and Systems Engineering
Rensselaer Polytechnic Institute
Troy, NY, USA
chenw18@rpi.edu
Dharmashankar Subramanian
IBM T. J. Watson Research Center
Yorktown Heights, NY, USA
dharmash@us.ibm.com
Santiago Paternain
Department of Electrical,
Computer, and Systems Engineering
Rensselaer Polytechnic Institute
Troy, NY, USA
paters@rpi.edu
Abstract—This paper considers the problem of learning safe
policies in the context of reinforcement learning (RL). In particu-
lar, we consider the notion of probabilistic safety. That is, we aim
to design policies that maintain the state of the system in a safe
set with high probability. This notion differs from cumulative
constraints often considered in the literature. The challenge of
working with probabilistic safety constraints is the lack of expressions for
their gradients. Indeed, policy optimization algorithms rely on
gradients of the objective function and the constraints. To the
best of our knowledge, this work is the first one providing such
explicit gradient expressions for probabilistic constraints. It is
worth noting that the gradient of this family of constraints can
be combined with various policy-based algorithms. We demonstrate
empirically that it is possible to handle probabilistic constraints
in a continuous navigation problem.
Index Terms—reinforcement learning, probabilistic constraint,
safe policy, policy gradient
I. INTRODUCTION
Reinforcement learning (RL) has gained traction as a
solution to the problem of computing policies to perform
challenging and high-dimensional tasks, e.g., playing video
games [1], mastering Go [2], robotic manipulation [3] and
locomotion [4], etc. However, in general, RL algorithms are
only concerned with maximizing a cumulative reward [5], [6],
which may lead to risky behaviors [7] in realistic domains
such as robot navigation [8].
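For concreteness, the objective that such algorithms maximize is typically an expected cumulative reward. A common finite-horizon form, written with the policy $\pi_\theta$, states $S_t$, actions $A_t$, reward $r$, and horizon $T$ that are formally introduced in Section II, is
\[
J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} r(S_t, A_t)\right],
\]
where safety enters only to the extent that it is folded into the reward.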
Taking into account the safety requirements motivates the
development of policy optimization under safety guaran-
tees [9]–[11]. Some approaches consider risk-aware objectives
or regularized solutions where the reward is modified to
take into account the safety requirements [12]–[14]. Other
formulations include Constrained Markov Decision Processes
(CMDPs) [15] where additional cumulative (or average) re-
wards need to be kept above a desired threshold. This approach
has been commonly used to induce safe behaviors [16]–
[24]. To solve these constrained problems, primal-dual algo-
rithms [16]–[20] combined with classical and state-of-the-art
policy-based algorithms, e.g., REINFORCE [25], DDPG [26],
TRPO [27], PPO [28] are generally used.
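As a sketch of this setting (with a generic cost signal $c$, threshold $b$, step sizes $\eta_\theta, \eta_\lambda$, and dual variable $\lambda \geq 0$ used here as placeholders rather than notation adopted later in this paper), a CMDP imposes a constraint on an additional cumulative quantity,
\[
\max_{\theta} \ \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} r(S_t, A_t)\right] \quad \text{s.t.} \quad \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} c(S_t, A_t)\right] \geq b,
\]
and a primal-dual scheme alternates gradient ascent on the Lagrangian $\mathcal{L}(\theta, \lambda) = \mathbb{E}_{\pi_\theta}[\sum_t r(S_t, A_t)] + \lambda\,(\mathbb{E}_{\pi_\theta}[\sum_t c(S_t, A_t)] - b)$ with respect to $\theta$ and projected gradient descent with respect to $\lambda$,
\[
\theta_{k+1} = \theta_k + \eta_\theta \nabla_\theta \mathcal{L}(\theta_k, \lambda_k), \qquad \lambda_{k+1} = \Big[\lambda_k - \eta_\lambda\Big(\mathbb{E}_{\pi_{\theta_k}}\Big[\textstyle\sum_t c(S_t, A_t)\Big] - b\Big)\Big]_{+},
\]
where the policy gradient of $\mathcal{L}$ is estimated with any of the policy-based methods above and $[\cdot]_+$ denotes projection onto the nonnegative reals.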
In this cumulative constraint setting, safety violations are
acceptable as long as the amount of violation does not exceed
the desired thresholds. This often makes them unsuitable
for safety-critical applications. For instance, in the context of
autonomous driving, even one collision is unacceptable.
This work was supported by the IBM Rensselaer Research Collaboration.
A more suitable notion of safety in this context is to guaran-
tee that the whole trajectory of the system remains within a set
that is deemed to be safe (see e.g., [29]). Ideally, one would
like to achieve this goal for every possible trajectory. This
being an ambitious goal, in this work we settle for solutions
that guarantee safety at all times with high probability. We
describe this setting in detail in Section II.
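Concretely, writing $\mathcal{S}_{\text{safe}} \subset \mathcal{S}$ for the set deemed safe and $\delta \in (0,1)$ for a prescribed tolerance (illustrative notation; the formal statement is given in Section II), the requirement is that
\[
P\left(S_t \in \mathcal{S}_{\text{safe}} \ \text{for all } t = 0, 1, \ldots, T\right) \geq 1 - \delta,
\]
i.e., the probability that the entire trajectory stays inside the safe set must be at least $1 - \delta$.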
The main challenge in solving problems under probabilistic
safety constraints is that explicit policy gradient-like expres-
sions for such constraints are not readily available. Indeed, the
works in [20], [30] replace this constraint with a suitable lower
bound, and [31] provides an approximation of the gradient. These
limitations prevent us from running the aforementioned
policy-based algorithms. In Section III, we present the main
contribution of this work: an expression for the gradient
that enables stochastic approximations. Besides concluding
remarks, the paper closes (Section IV) with numerical examples
that demonstrate the use of this gradient to train safe policies
in a navigation task.
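To make this gap concrete, recall the standard score-function (REINFORCE) identity for a cumulative objective, which is what makes the stochastic gradient estimates behind the algorithms above possible:
\[
\nabla_\theta\, \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} r(S_t, A_t)\right] = \mathbb{E}_{\pi_\theta}\left[\left(\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(A_t \mid S_t)\right)\left(\sum_{t=0}^{T} r(S_t, A_t)\right)\right].
\]
An analogous explicit identity for the probability that the trajectory remains in the safe set has not been available, and deriving one is the purpose of Section III.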
II. PROBLEM FORMULATION
In this work, we consider the problem of finding optimal
policies for Markov Decision Processes (MDPs) under prob-
abilistic safety guarantees. In particular, we are interested in
situations where the state transition distributions are unknown
and thus the policies need to be computed from data. An
MDP [32] is defined by a tuple $(\mathcal{S}, \mathcal{A}, r, P, \mu, T)$, where $\mathcal{S}$
is the state space, $\mathcal{A}$ is the action space, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$
is the reward function describing the quality of the decision,
$P^{a_t}_{s_t \to s_{t+1}}(\hat{\mathcal{S}}) := P(s_{t+1} \in \hat{\mathcal{S}} \mid s_t, a_t)$, with $\hat{\mathcal{S}} \subset \mathcal{S}$, $s_t \in \mathcal{S}$,
$a_t \in \mathcal{A}$, $t \in \mathbb{N}$, is the transition probability describing the
dynamics of the system, $\mu(\hat{\mathcal{S}}) := P(s \in \hat{\mathcal{S}})$ is the initial state
distribution, and $T$ is the time horizon. The state and action at
time $t$ are random variables denoted respectively by $S_t$ and $A_t$.
A policy is a conditional distribution $\pi_\theta(a \mid s)$ parameterized
by $\theta \in \mathbb{R}^d$ (for instance, the weights and biases of neural
networks), from which the agent draws action $a \in \mathcal{A}$ when
in the corresponding state $s \in \mathcal{S}$. In the context of MDPs the