A Policy Gradient Approach for Finite Horizon
Constrained Markov Decision Processes
Soumyajit Guin¹ and Shalabh Bhatnagar¹
Abstract
The infinite horizon setting is widely adopted for problems of reinforcement learning (RL). In this setting, the optimal policies are invariably stationary. In many situations, however, finite horizon control problems are of interest, and for such problems the optimal policies are time-varying in general. Another setting that has become popular in recent times is that of Constrained Reinforcement Learning, where the agent maximizes its rewards while also aiming to satisfy given constraint criteria. However, this setting has so far been studied only in the context of infinite horizon MDPs, where stationary policies are optimal. We present an algorithm for constrained RL in the finite horizon setting, where the horizon terminates after a fixed (finite) time. Our algorithm uses function approximation, which is essential when the state and action spaces are large or continuous, and employs the policy gradient method to find the optimal policy. The optimal policy that we obtain depends on the stage and so is non-stationary in general. To the best of our knowledge, our paper presents the first policy gradient algorithm for the finite horizon setting with constraints. We show the convergence of our algorithm to a constrained optimal policy. We also compare and analyze the performance of our algorithm through experiments and show that it performs better than some other well known algorithms.
I. INTRODUCTION
The Constrained Markov Decision Process (C-MDP) setting has recently received significant attention
in the reinforcement learning (RL) literature due to its natural application in safe RL problems [1], [2].
A textbook treatment of C-MDP can be found in [3]. In the C-MDP framework, in addition to the
long-term objective specified via single-stage rewards (that are associated with state transitions), there
are also long-term constraint functions specified via additional single-stage rewards or costs. The goal
then is to find an optimal policy that maximizes a long-term reward objective while satisfying prescribed
constraints. RL algorithms for infinite horizon discounted C-MDP have been studied in [4]. For the long-
run average cost C-MDP, [5] has developed the first actor-critic RL algorithm in the full state setting.
RL algorithms with function approximation have also been developed for the infinite horizon discounted cost C-MDP [6] as well as the long-run average cost C-MDP [7].
In this paper, we present an RL algorithm for C-MDP in the finite horizon setting. Finite Horizon
problems [8], [9] deal with situations where the agent needs to choose a finite number of actions
depending on the states of the environment in order to maximize the expected sum of single-stage
rewards and the terminal reward. An optimal policy in this setting would in general be non-stationary as the
choice of an action at an instant would depend not just on the state at that instant but also on the
number of actions remaining to be chosen from then on so as to maximize a long-term objective.
RL techniques for Finite Horizon (regular) MDP in the full state case have been discussed in [10].
They give two algorithms for the tabular case, QH-Learning and RH-Learning, and provide a learning-rate
analysis comparing the two. Actor-critic type algorithms for Finite Horizon (regular) MDPs are discussed
in [11]. They give four algorithms for the finite horizon setting, of which three are for the tabular case
and one uses function approximation. Their tabular algorithms do not scale to large state spaces, and
their function approximation algorithm relies on a zeroth-order gradient estimate rather than the first-order
gradient that we use; moreover, it does not consider inequality constraints, which we do. Our RL
algorithm is devised for finite-horizon C-MDP, uses function approximation, and involves actor-critic
type updates.
*This work was supported by a J. C. Bose Fellowship, Project No. DFTM/02/3125/M/04/AIR-04 from DRDO under DIA-RCOE, a
project from DST-ICPS, and the RBCCPS, IISc.
¹The authors are with the Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560012, India.
Email: gsoumyajit@iisc.ac.in; shalabh@iisc.ac.in.
Temporal difference learning algorithms for a finite horizon setting have also recently
been studied in [12]. They show convergence of Q-Learning with linear function approximation, as well
as with some general function approximators, in the finite horizon case. Since their motivation is to make
infinite horizon Q-Learning stable, they do not consider time-varying transition probability and reward
functions, which we do; this is a more complicated setting.
We prove the convergence of our algorithm under standard assumptions. Our convergence guarantees
make it possible to build on our algorithm, for instance with neural network function approximators, to solve
complex tasks. One such task could be a portfolio management system [13], where a person wants to
invest in the stock market for a finite amount of time and needs to decide what fraction of the
money to invest in each stock. The stock values naturally change with time, and the decisions
made are time critical in nature. We also present empirical results on a two-dimensional grid
world problem, where we observe that our algorithm clearly meets the constraint cost criterion
while giving good reward performance, whereas the other algorithms in the literature do
not meet the constraint objective.
Our contributions:
1) We present and analyze, both theoretically and experimentally, the first policy gradient reinforcement
learning algorithm with function approximation for Finite Horizon Constrained Markov
Decision Processes.
2) This setting differs significantly from infinite horizon problems: in the latter, stationary policies
are optimal, whereas in the finite horizon setting the optimal policies are invariably non-stationary.
This is because knowledge of the time remaining until the horizon terminates often has a
profound bearing on action selection in the finite horizon case.
3) We prove that our proposed algorithm converges almost surely to a constrained optimum over a
tuple of parameters, one corresponding to each instant in the horizon.
4) We compare the empirical performance of our algorithm with that of well known
algorithms in the literature that were originally designed for infinite horizon problems.
5) Our key observation here is that our algorithm gives a good reward performance while strictly
meeting the constraint criterion at every time instant, unlike the other algorithms, which do not meet the
constraint criterion and are therefore unsuitable for Constrained Finite Horizon problems.
II. THE FRAMEWORK AND PROBLEM FORMULATION
By a Markov Decision Process (MDP), we mean a stochastic process $\{X_h\}$ taking values in a set $S$ (called the state space), governed by a control sequence $\{Z_h\}$ taking values in a set $A$ (called the action space), that satisfies the following controlled Markov property: $P(X_{h+1} = s' \mid X_m, Z_m, m \leq h) = p_h(X_h, Z_h, s')$ a.s. Also, associated with any state transition is a single-stage reward. We assume here that $S$ and $A$ are finite sets. Actions are chosen by the decision maker at instants $h = 0, 1, \ldots, H-1$, and the process terminates at instant $H$, where $H$ is a given positive integer. For simplicity, we consider all actions in $A$ to be feasible in every state in $S$. A non-stationary randomized policy (NRP) is a set of $H$ distributions defined by $\pi := \{\mu_0, \mu_1, \ldots, \mu_{H-1}\}$, where $\mu_h(s, \cdot)$ is a distribution over $A$, $s \in S$, $h = 0, 1, \ldots, H-1$. Here, the action $Z_h \sim \mu_h(X_h, \cdot)$, $\forall h$. It is easy to see that under an NRP, $\{X_h\}$ is a non-homogeneous Markov chain.
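To make the non-stationary control model concrete, the following is a minimal sketch (not from the paper) of sampling a length-$H$ trajectory under an NRP with time-varying transition kernels. The array names `P`, `mu` and `beta` are hypothetical: `P[h][s, a, :]` is assumed to hold $p_h(s, a, \cdot)$, `mu[h][s, :]` to hold $\mu_h(s, \cdot)$, and `beta` the initial state distribution.

```python
import numpy as np

def sample_trajectory(P, mu, beta, H, rng=None):
    """Sample (X_0, Z_0, X_1, ..., X_H) under a non-stationary randomized policy."""
    if rng is None:
        rng = np.random.default_rng()
    num_states = beta.shape[0]
    states, actions = [], []
    s = rng.choice(num_states, p=beta)                 # X_0 ~ beta(.)
    for h in range(H):
        a = rng.choice(mu[h].shape[1], p=mu[h][s])     # Z_h ~ mu_h(X_h, .)
        s_next = rng.choice(num_states, p=P[h][s, a])  # X_{h+1} ~ p_h(X_h, Z_h, .)
        states.append(s)
        actions.append(a)
        s = s_next
    states.append(s)                                   # terminal state X_H
    return states, actions
```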
Let $r_h(s, a, s')$ (resp. $g^{(1)}_h(s, a, s'), \ldots, g^{(M)}_h(s, a, s')$) be the single-stage reward (resp. the set of single-stage costs) at instant $h = 0, 1, \ldots, H-1$ when the state is $s \in S$, the action chosen is $a \in A$ and the next state is $s' \in S$. Let $r_H(s)$ (resp. $g^{(1)}_H(s), \ldots, g^{(M)}_H(s)$) likewise denote the terminal reward (resp. set of terminal costs) at instant $H$ when the terminal state is $s \in S$. Our aim is to find an NRP $\pi^*$ that maximizes the following over all NRP $\pi$:
$$J(\pi) = E\left[ r_H(X_H) + \sum_{h=0}^{H-1} r_h(X_h, Z_h, X_{h+1}) \right], \qquad (1)$$

subject to the constraints

$$S^{(k)}(\pi) = E\left[ g^{(k)}_H(X_H) + \sum_{h=0}^{H-1} g^{(k)}_h(X_h, Z_h, X_{h+1}) \right] \geq \alpha^{(k)}, \qquad (2)$$
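As an illustration of the objective (1) and constraints (2), the sketch below estimates $J(\pi)$ and $S^{(k)}(\pi)$ by plain Monte Carlo averaging over sampled trajectories; it is not the estimator used by the paper's algorithm. It reuses the hypothetical `sample_trajectory` helper from the earlier sketch, and assumes `r[h]`, `r_H`, `g[k][h]`, `g_H[k]` and `alpha[k]` hold $r_h$, $r_H$, $g^{(k)}_h$, $g^{(k)}_H$ and $\alpha^{(k)}$ as arrays.

```python
import numpy as np

def monte_carlo_objectives(P, mu, beta, r, r_H, g, g_H, alpha, H, n_traj=10000, seed=0):
    """Estimate J(pi) of (1) and S^(k)(pi) of (2) by averaging over sampled trajectories."""
    rng = np.random.default_rng(seed)
    M = len(alpha)
    J, S = 0.0, np.zeros(M)
    for _ in range(n_traj):
        states, actions = sample_trajectory(P, mu, beta, H, rng)
        J += r_H[states[H]]                            # terminal reward r_H(X_H)
        S += np.array([g_H[k][states[H]] for k in range(M)])
        for h in range(H):
            s, a, s2 = states[h], actions[h], states[h + 1]
            J += r[h][s, a, s2]                        # single-stage reward r_h
            S += np.array([g[k][h][s, a, s2] for k in range(M)])
    J, S = J / n_traj, S / n_traj
    feasible = np.all(S >= np.asarray(alpha))          # constraints (2): S^(k)(pi) >= alpha^(k)
    return J, S, feasible
```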
$k = 1, \ldots, M$. Here $\alpha^{(1)}, \ldots, \alpha^{(M)}$ are certain prescribed threshold values. We assume there exists at least one NRP $\pi$ for which all the inequality constraints in (2) are satisfied. Let $\lambda = (\lambda^{(1)}, \ldots, \lambda^{(M)})^T$ denote the vector of Lagrange multipliers $\lambda^{(1)}, \ldots, \lambda^{(M)} \in \mathbb{R}^+ \cup \{0\}$ and let $L(\pi, \lambda)$ denote the Lagrangian:
$$L(\pi, \lambda) = J(\pi) + \sum_{k=1}^{M} \lambda^{(k)} \left( S^{(k)}(\pi) - \alpha^{(k)} \right) = E\left[ c^{\lambda}_H(X_H) + \sum_{h=0}^{H-1} c^{\lambda}_h(X_h, Z_h, X_{h+1}) \right], \qquad (3)$$

where $c^{\lambda}_H(s) := r_H(s) + \sum_{k=1}^{M} \lambda^{(k)} \big( g^{(k)}_H(s) - \alpha^{(k)} \big)$ and $c^{\lambda}_h(s, a, s') := r_h(s, a, s') + \sum_{k=1}^{M} \lambda^{(k)} g^{(k)}_h(s, a, s')$, respectively.
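The decomposition in (3) turns the constrained problem into an unconstrained finite horizon problem with the relaxed costs $c^{\lambda}_h$ and $c^{\lambda}_H$. A minimal sketch of forming these relaxed costs from the reward and cost arrays (same hypothetical layout as above) is given below.

```python
def lagrangian_costs(r, r_H, g, g_H, alpha, lam):
    """Form the relaxed costs of (3):
       c_h(s,a,s') = r_h(s,a,s') + sum_k lam[k] * g^(k)_h(s,a,s'),   h = 0,...,H-1
       c_H(s)      = r_H(s)      + sum_k lam[k] * (g^(k)_H(s) - alpha^(k))
    """
    H, M = len(r), len(lam)
    c = [r[h] + sum(lam[k] * g[k][h] for k in range(M)) for h in range(H)]
    c_H = r_H + sum(lam[k] * (g_H[k] - alpha[k]) for k in range(M))
    return c, c_H
```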
Let $V^{\pi,\lambda}_h(s)$, $s \in S$, be the value function at instant $h = 0, 1, \ldots, H-1$ for the relaxed MDP problem:

$$V^{\pi}_h(s) = E\left[ c^{\lambda}_H(X_H) + \sum_{t=h}^{H-1} c^{\lambda}_t(X_t, Z_t, X_{t+1}) \,\Big|\, X_h = s \right] \qquad (4)$$

and similarly,

$$V^{\pi}_H(s) = c^{\lambda}_H(s). \qquad (5)$$

Note that the foregoing can be written in terms of the Q-value function for $s \in S$, $a \in A$ as

$$V^{\pi}_h(s) = \sum_{a \in A} \mu_h(s, a) Q^{\pi}_h(s, a), \qquad (6)$$

where

$$Q^{\pi}_h(s, a) = \sum_{s' \in S} p_h(s, a, s') \left[ c^{\lambda}_h(s, a, s') + V^{\pi}_{h+1}(s') \right].$$
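Since the horizon is finite, $V^{\pi}_h$ and $Q^{\pi}_h$ in (4)-(6) can be computed exactly (when the model is known) by a backward recursion from $h = H$ down to $h = 0$. The following is an illustrative sketch under the same hypothetical array layout; the paper's algorithm instead learns these quantities with function approximation.

```python
import numpy as np

def backward_value_functions(P, mu, c, c_H, H, nS, nA):
    """Compute V^pi_h and Q^pi_h of (4)-(6) by backward recursion."""
    V = np.zeros((H + 1, nS))
    Q = np.zeros((H, nS, nA))
    V[H] = c_H                                   # (5): V^pi_H(s) = c^lambda_H(s)
    for h in range(H - 1, -1, -1):
        # Q^pi_h(s,a) = sum_{s'} p_h(s,a,s') [ c^lambda_h(s,a,s') + V^pi_{h+1}(s') ]
        Q[h] = np.einsum('sap,sap->sa', P[h], c[h] + V[h + 1][None, None, :])
        # (6): V^pi_h(s) = sum_a mu_h(s,a) Q^pi_h(s,a)
        V[h] = np.einsum('sa,sa->s', mu[h], Q[h])
    return V, Q
```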
Similarly, we can define value functions for the constraints. The value function for $s \in S$ of the $k$th constraint, $k = 1, \ldots, M$, is defined as

$$W^{\pi,(k)}_h(s) = \sum_{a \in A} \mu_h(s, a) \sum_{s' \in S} p_h(s, a, s') \left[ g^{(k)}_h(s, a, s') + W^{\pi,(k)}_{h+1}(s') \right] \qquad (7)$$

and

$$W^{\pi,(k)}_H(s) = g^{(k)}_H(s) - \alpha^{(k)}. \qquad (8)$$
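The constraint value functions obey the analogous backward recursion (7)-(8), one per constraint. A sketch under the same assumed data layout:

```python
import numpy as np

def constraint_value_functions(P, mu, g, g_H, alpha, H, nS, nA):
    """Compute W^{pi,(k)}_h of (7)-(8) for each constraint k = 1,...,M."""
    M = len(alpha)
    W = np.zeros((M, H + 1, nS))
    for k in range(M):
        W[k, H] = g_H[k] - alpha[k]              # (8): terminal value g^(k)_H(s) - alpha^(k)
        for h in range(H - 1, -1, -1):
            # (7): average over actions and next states of g^(k)_h + W^{pi,(k)}_{h+1}
            inner = np.einsum('sap,sap->sa', P[h], g[k][h] + W[k, h + 1][None, None, :])
            W[k, h] = np.einsum('sa,sa->s', mu[h], inner)
    return W
```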
Let $\beta(s_0)$, $s_0 \in S$, be the initial state distribution. For an NRP $\pi$, let $Pr(s_0 \rightarrow s, h, \pi)$ be the probability of transitioning from initial state $s_0$ to state $s$ in $h$ steps under policy $\pi$. Also, under $\pi$, let $d^{\pi}_h(s)$ be the probability of reaching state $s$ in $h$ steps, $h = 0, 1, \ldots, H$:

$$d^{\pi}_h(s) := \sum_{s_0 \in S} \beta(s_0) Pr(s_0 \rightarrow s, h, \pi) = \sum_{s_0 \in S} \beta(s_0) \sum_{a_0 \in A} \mu_0(s_0, a_0) \sum_{s_1 \in S} p_0(s_0, a_0, s_1) \sum_{a_1 \in A} \mu_1(s_1, a_1) \sum_{s_2 \in S} p_1(s_1, a_1, s_2) \cdots \sum_{a_{h-1} \in A} \mu_{h-1}(s_{h-1}, a_{h-1}) \, p_{h-1}(s_{h-1}, a_{h-1}, s). \qquad (9)$$
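Rather than expanding the nested sums in (9) directly, $d^{\pi}_h$ can be computed by propagating the state distribution forward one step at a time. A minimal sketch using the same hypothetical arrays as before:

```python
import numpy as np

def state_occupancy(P, mu, beta, H):
    """Compute d^pi_h(s) of (9), h = 0,...,H, by forward propagation."""
    nS = beta.shape[0]
    d = np.zeros((H + 1, nS))
    d[0] = beta                                   # d^pi_0 = beta
    for h in range(H):
        # state-to-state kernel under mu_h: K[s, s'] = sum_a mu_h(s,a) p_h(s,a,s')
        K = np.einsum('sa,sap->sp', mu[h], P[h])
        d[h + 1] = d[h] @ K                       # d^pi_{h+1}(s') = sum_s d^pi_h(s) K[s, s']
    return d
```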