Policy Gradients for Probabilistic Constrained
Reinforcement Learning
Weiqin Chen
Department of Electrical,
Computer, and Systems Engineering
Rensselaer Polytechnic Institute
Troy, NY, USA
chenw18@rpi.edu
Dharmashankar Subramanian
IBM T. J. Watson Research Center
Yorktown Heights, NY, USA
dharmash@us.ibm.com
Santiago Paternain
Department of Electrical,
Computer, and Systems Engineering
Rensselaer Polytechnic Institute
Troy, NY, USA
paters@rpi.edu
Abstract—This paper considers the problem of learning safe policies in the context of reinforcement learning (RL). In particular, we consider the notion of probabilistic safety, that is, we aim to design policies that keep the state of the system in a safe set with high probability. This notion differs from the cumulative constraints often considered in the literature. The challenge of working with probabilistic safety is the lack of expressions for its gradient. Indeed, policy optimization algorithms rely on gradients of the objective function and the constraints. To the best of our knowledge, this work is the first to provide explicit gradient expressions for probabilistic constraints. It is worth noting that the gradient of this family of constraints can be applied to various policy-based algorithms. We demonstrate empirically that it is possible to handle probabilistic constraints in a continuous navigation problem.
Index Terms—reinforcement learning, probabilistic constraint,
safe policy, policy gradient
I. INTRODUCTION
Reinforcement learning (RL) has gained traction as a
solution to the problem of computing policies to perform
challenging and high-dimensional tasks, e.g., playing video
games [1], mastering Go [2], robotic manipulation [3] and
locomotion [4], etc. However, in general, RL algorithms are
only concerned with maximizing a cumulative reward [5], [6],
which may lead to risky behaviors [7] in realistic domains
such as robot navigation [8].
Taking into account the safety requirements motivates the
development of policy optimization under safety guaran-
tees [9]–[11]. Some approaches consider risk-aware objectives
or regularized solutions where the reward is modified to
take into account the safety requirements [12]–[14]. Other
formulations include Constrained Markov Decision Processes
(CMDPs) [15] where additional cumulative (or average) re-
wards need to be kept above a desired threshold. This approach
has been commonly used to induce safe behaviors [16]–
[24]. To solve these constrained problems, primal-dual algo-
rithms [16]–[20] combined with classical and state-of-the-art
policy-based algorithms, e.g., REINFORCE [25], DDPG [26], TRPO [27], and PPO [28], are generally used.
In this cumulative constraint setting, safety violations are acceptable as long as the amount of violation does not exceed the desired threshold. This often makes such formulations unsuitable for safety-critical applications. For instance, in the context of autonomous driving, even one collision is unacceptable.
This work was supported by the IBM Rensselaer Research Collaboration.
A more suitable notion of safety in this context is to guaran-
tee that the whole trajectory of the system remains within a set
that is deemed to be safe (see e.g., [29]). Ideally, one would
like to achieve this goal for every possible trajectory. Since this is an ambitious goal, in this work we settle for solutions that guarantee safety at every time step with high probability. We describe this setting in detail in Section II.
The main challenge in solving problems under probabilistic safety constraints is that explicit policy-gradient-like expressions for such constraints are not readily available. Indeed, [20], [30] replace this constraint with a suitable lower bound, and [31] provides only an approximation of the gradient. These limitations prevent us from running the aforementioned policy-based algorithms. In Section III, we present the main contribution of this work: an expression for the gradient that enables stochastic approximations. Apart from concluding remarks, the paper finishes (Section IV) with numerical examples that demonstrate the use of the probabilistic safety gradient to train safe policies in a navigation task.
II. PROBLEM FORMULATION
In this work, we consider the problem of finding optimal
policies for Markov Decision Processes (MDPs) under prob-
abilistic safety guarantees. In particular, we are interested in
situations where the state transition distributions are unknown
and thus the policies need to be computed from data. An
MDP [32] is defined by a tuple $(\mathcal{S}, \mathcal{A}, r, P, \mu, T)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function describing the quality of the decision, $P^{a_t}_{s_t s_{t+1}}(\hat{\mathcal{S}}) := P(s_{t+1} \in \hat{\mathcal{S}} \mid s_t, a_t)$, with $\hat{\mathcal{S}} \subset \mathcal{S}$, $s_t \in \mathcal{S}$, $a_t \in \mathcal{A}$, $t \in \mathbb{N}$, is the transition probability describing the dynamics of the system, $\mu(\hat{\mathcal{S}}) := P(s \in \hat{\mathcal{S}})$ is the initial state distribution, and $T$ is the time horizon. The state and action at time $t$ are random variables denoted respectively by $S_t$ and $A_t$. A policy is a conditional distribution $\pi_\theta(a \mid s)$ parameterized by $\theta \in \mathbb{R}^d$ (for instance the weights and biases of neural networks), from which the agent draws action $a \in \mathcal{A}$ when in the corresponding state $s \in \mathcal{S}$. In the context of MDPs the
objective is to find a policy that maximizes the value function.
The latter is defined as
\[ V(\theta) = \mathbb{E}_{a \sim \pi_\theta(a \mid s),\, S_0 \sim \mu}\left[\sum_{t=0}^{T} r(S_t, A_t)\right], \quad (1) \]
where $a$ and $s$ denote the sequences of actions and states for the whole episode, that is, from time $t = 0$ to $t = T$. The subscripts of this expectation are omitted in the remainder of the paper for simplicity.
In this paper, we are concerned with imposing safety
requirements. In particular, we focus on the notion of proba-
bilistic safety which we formally define next.
Definition 1. A policy $\pi_\theta$ is $(1-\delta)$-safe for the set $\mathcal{S}_{\text{safe}} \subset \mathcal{S}$ if and only if $P\left(\bigcap_{t=0}^{T} \{S_t \in \mathcal{S}_{\text{safe}}\} \mid \pi_\theta\right) \geq 1 - \delta$.
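As an illustration only (not part of the original paper), the probability in Definition 1 can be estimated by Monte Carlo: sample episodes under $\pi_\theta$ and record the fraction in which every visited state lies in $\mathcal{S}_{\text{safe}}$. The rollout and safe-set functions below are hypothetical placeholders.

```python
def estimate_safety_probability(rollout_fn, is_safe_fn, num_episodes=1000):
    """Monte Carlo estimate of P( S_t in S_safe for all t = 0,...,T | pi_theta ).

    rollout_fn() is assumed to return the visited states [S_0, ..., S_T] of one
    episode generated under the current policy; is_safe_fn(s) is assumed to
    return True when the state s belongs to the safe set.
    """
    safe_count = 0
    for _ in range(num_episodes):
        states = rollout_fn()
        # An episode counts as safe only if *every* state of the trajectory is safe.
        if all(is_safe_fn(s) for s in states):
            safe_count += 1
    return safe_count / num_episodes

# A policy is then declared (1 - delta)-safe when the estimate is at least 1 - delta.
```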
With this definition, we formulate the following probabilis-
tic safe RL problem as a constrained optimization problem
\[ P^\star = \max_{\theta \in \mathbb{R}^d} V(\theta) \quad \text{s.t.} \quad P\left(\bigcap_{t=0}^{T} \{S_t \in \mathcal{S}_{\text{safe}}\} \mid \pi_\theta\right) \geq 1 - \delta. \quad (2) \]
Note that the previous problem differs from most of the safe RL literature, which works with the cumulative constraint setting (see e.g., [19]–[21], [24]). To solve problem (2), it is conceivable to employ gradient-based methods, e.g., regularization [33] or primal-dual methods [34], to obtain locally optimal solutions. For instance, consider the regularization method with a fixed penalty. That is, for $\lambda > 0$ we formulate the following unconstrained problem as an approximation to the constrained problem (2)
E"T
X
t=0
r(St, At)#+λ P T
\
t=0
{St∈ Ssafe}|πθ!(1 δ)
!.
(3)
Note that $\lambda$ trades off safety and task performance. Indeed, for large values of $\lambda$ solutions to (3) will prioritize safe behaviors, whereas for small values of $\lambda$ the solutions will focus on maximizing the expected cumulative reward.
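As a sketch under the same assumptions (hypothetical rollout data and safe-set test, not the authors' code), the regularized objective in (3) can be estimated from a batch of sampled episodes by combining the average return with the penalized safety-probability estimate.

```python
import numpy as np

def regularized_objective_estimate(episodes, is_safe_fn, lam, delta):
    """Sample estimate of (3): E[sum_t r(S_t, A_t)] + lam * (P(all states safe) - (1 - delta)).

    episodes is assumed to be a list of (states, rewards) pairs collected under pi_theta.
    """
    returns, safe_flags = [], []
    for states, rewards in episodes:
        returns.append(np.sum(rewards))                           # sum_{t=0}^T r(S_t, A_t)
        safe_flags.append(float(all(is_safe_fn(s) for s in states)))
    value_term = np.mean(returns)                                 # estimate of V(theta)
    safety_term = np.mean(safe_flags)                             # estimate of the constraint probability
    return value_term + lam * (safety_term - (1.0 - delta))
```

Larger values of `lam` push the maximizer of this surrogate toward safer behaviors, mirroring the trade-off discussed above.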
Then, to solve problem (3) locally, gradient ascent [35]
or its stochastic versions are generally employed. Note that
the gradient of the first term in (3) can be computed using
the Policy Gradient Theorem [6]. Nevertheless, the lack of
an expression for the gradient of the probabilistic safety
constraint, i.e., $\nabla_\theta P\left(\bigcap_{t=0}^{T} \{S_t \in \mathcal{S}_{\text{safe}}\} \mid \pi_\theta\right)$, prevents us from
applying the gradient ascent family of methods to solve (3).
In the next section, we provide such an expression for the
gradient that allows us to overcome this limitation.
III. THE GRADIENT OF PROBABILISTIC CONSTRAINTS
We start this section by defining a quantity that plays an important role in what follows. For any $t$ such that $0 \leq t \leq T$, define
\[ G_t = \prod_{u=t}^{T} \mathbb{1}(S_u \in \mathcal{S}_{\text{safe}}). \quad (4) \]
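For concreteness, the products of indicators in (4) can be computed for every $t$ in a single backward pass over a sampled trajectory; the code below is a small sketch with a hypothetical safe-set test.

```python
import numpy as np

def indicator_products(states, is_safe_fn):
    """Return the array (G_0, ..., G_T), where G_t = prod_{u=t}^{T} 1(S_u in S_safe)."""
    indicators = np.array([float(is_safe_fn(s)) for s in states])  # 1(S_u in S_safe), u = 0,...,T
    # Cumulative product from the end of the trajectory backwards: G_t aggregates
    # the indicators from time t up to the horizon T.
    return np.cumprod(indicators[::-1])[::-1]
```

In particular, $G_1$ equals one exactly when the trajectory remains in the safe set from time $1$ to $T$, which is the term that appears in Theorem 1 below.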
Having defined this quantity, we are now in a position to provide an expression for the gradient of the probabilistic constraint. This is the subject of the following theorem.
Theorem 1. Let $S_0 \in \mathcal{S}_{\text{safe}}$. Then the gradient of the probability of being safe for a given policy $\pi_\theta$ is given by
\[ \nabla_\theta P\left(\bigcap_{t=0}^{T} \{S_t \in \mathcal{S}_{\text{safe}}\} \mid \pi_\theta, S_0\right) = \mathbb{E}\left[\sum_{t=0}^{T-1} G_1 \nabla_\theta \log \pi_\theta(A_t \mid S_t) \,\Big|\, \pi_\theta, S_0\right]. \quad (5) \]
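Before proceeding to the proof, we note that the expectation in (5) can be approximated by a sample average over trajectories generated under $\pi_\theta$, in the spirit of REINFORCE. The sketch below assumes a differentiable PyTorch policy exposing a hypothetical log_prob(state, action) method; it is an illustration, not the authors' implementation.

```python
import torch

def probabilistic_safety_gradient(policy, trajectories, is_safe_fn):
    """Stochastic approximation of the gradient in (5).

    trajectories is assumed to be a list of (states, actions) pairs sampled under
    pi_theta, with states = [S_0, ..., S_T] and actions = [A_0, ..., A_{T-1}];
    policy.log_prob(s, a) is assumed to return a differentiable log pi_theta(a|s).
    """
    surrogate = 0.0
    for states, actions in trajectories:
        # G_1 = prod_{u=1}^T 1(S_u in S_safe): 1 iff the trajectory stays safe after S_0.
        g1 = float(all(is_safe_fn(s) for s in states[1:]))
        # sum_{t=0}^{T-1} log pi_theta(A_t | S_t); its gradient is the score term in (5).
        score = torch.stack([policy.log_prob(s, a)
                             for s, a in zip(states[:-1], actions)]).sum()
        surrogate = surrogate + g1 * score
    surrogate = surrogate / len(trajectories)
    # Differentiating the surrogate yields the sample-average estimate of (5).
    return torch.autograd.grad(surrogate, list(policy.parameters()))
```

This estimate can then be combined with the standard policy-gradient estimate of $\nabla_\theta V(\theta)$ to perform gradient ascent on the regularized objective (3).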
Proof. We proceed by presenting and proving the following
two technical lemmas (Lemma 1 and Lemma 2).
Lemma 1. Given $S_{t-1} \in \mathcal{S}_{\text{safe}}$ and $G_t$, $t = 1, 2, \cdots, T-1$, defined in (4), it holds that
\[ \nabla_\theta \mathbb{E}[G_t \mid S_{t-1}] = \mathbb{E}\left[\nabla_\theta \mathbb{E}[G_{t+1} \mid S_t] \, \mathbb{1}(S_t \in \mathcal{S}_{\text{safe}}) \mid S_{t-1}\right] + \mathbb{E}\left[G_t \nabla_\theta \log \pi_\theta(A_{t-1} \mid S_{t-1}) \mid S_{t-1}\right]. \quad (6) \]
Proof. We start the proof by rewriting the expectation of $G_1$ using the tower property of the expectation
\[ \mathbb{E}[G_1 \mid S_0] = \mathbb{E}\left[\mathbb{E}[G_1 \mid S_1] \mid S_0\right] = \mathbb{E}\left[\mathbb{E}[G_2 \, \mathbb{1}(S_1 \in \mathcal{S}_{\text{safe}}) \mid S_1] \mid S_0\right], \quad (7) \]
where the second equality follows from (4). Since $S_1$ is measurable with respect to the $\sigma$-algebra $\mathcal{F}_1$, it follows that
\[ \mathbb{E}[G_1 \mid S_0] = \mathbb{E}\left[\mathbb{E}[G_2 \mid S_1] \, \mathbb{1}(S_1 \in \mathcal{S}_{\text{safe}}) \mid S_0\right]. \quad (8) \]
Rewriting the outer expectation in terms of the probability distribution of $S_1$, the previous expression reduces to
\[ \mathbb{E}[G_1 \mid S_0] = \int_{\mathcal{S}} \mathbb{E}[G_2 \mid s_1] \, \mathbb{1}(s_1 \in \mathcal{S}_{\text{safe}}) \, p(s_1 \mid S_0) \, ds_1 = \int_{\mathcal{S} \times \mathcal{A}} \mathbb{E}[G_2 \mid s_1] \, \mathbb{1}(s_1 \in \mathcal{S}_{\text{safe}}) \, p(s_1 \mid S_0, a_0) \, \pi_\theta(a_0 \mid S_0) \, ds_1 \, da_0, \quad (9) \]
where the last equality follows from $p(s_1 \mid S_0) = \int_{\mathcal{A}} p(s_1 \mid S_0, a_0) \, \pi_\theta(a_0 \mid S_0) \, da_0$. Taking the gradient of the previous expression with respect to the policy parameters $\theta$ results in
\[ \nabla_\theta \mathbb{E}[G_1 \mid S_0] = \int_{\mathcal{S} \times \mathcal{A}} \nabla_\theta\left(\mathbb{E}[G_2 \mid s_1]\right) \mathbb{1}(s_1 \in \mathcal{S}_{\text{safe}}) \, p(s_1 \mid S_0, a_0) \, \pi_\theta(a_0 \mid S_0) \, ds_1 \, da_0 + \int_{\mathcal{S} \times \mathcal{A}} \mathbb{E}[G_2 \mid s_1] \, \mathbb{1}(s_1 \in \mathcal{S}_{\text{safe}}) \, p(s_1 \mid S_0, a_0) \, \nabla_\theta \pi_\theta(a_0 \mid S_0) \, ds_1 \, da_0. \quad (10) \]
Notice that the first term on the right-hand side of the previous expression can be written as
\[ \int_{\mathcal{S} \times \mathcal{A}} \nabla_\theta\left(\mathbb{E}[G_2 \mid s_1]\right) \mathbb{1}(s_1 \in \mathcal{S}_{\text{safe}}) \, p(s_1 \mid S_0, a_0) \, \pi_\theta(a_0 \mid S_0) \, ds_1 \, da_0 = \mathbb{E}\left[\nabla_\theta \mathbb{E}[G_2 \mid S_1] \, \mathbb{1}(S_1 \in \mathcal{S}_{\text{safe}}) \mid S_0\right]. \quad (11) \]