algorithm is devised for finite-horizon C-MDP, uses function approximation, and involves actor-critic
type updates. Temporal difference learning algorithms for the finite horizon setting have also recently
been studied in [12], where convergence is shown for finite horizon Q-Learning with linear function
approximation as well as with certain more general function approximators. Since their motivation is to
make infinite horizon Q-Learning stable, they do not consider time-varying transition probability and
reward functions, which we do; this makes our setting more involved.
We prove the convergence of our algorithm under standard assumptions. These convergence guarantees
allow one to build on our algorithm and design neural-network-based variants for solving complex tasks.
One such task is a portfolio management system [13], where an investor participates in the stock market
for a finite amount of time and must decide what fraction of the money to invest in each stock. The
stock values naturally change with time, and the decisions made are time-critical in nature. We also
present empirical results on a two-dimensional grid world problem, where we observe that our algorithms
clearly meet the constraint cost criterion while giving good reward performance, whereas the other
algorithms in the literature do not meet the constraint objective.
Our contributions:
1) We present, and analyze both theoretically and experimentally, the first policy gradient reinforcement
learning algorithm with function approximation for Finite Horizon Constrained Markov Decision
Processes.
2) This setting differs significantly from infinite horizon problems: in the latter, stationary policies are
optimal, whereas in the finite horizon setting the optimal policies are invariably non-stationary.
This is because knowledge of the time remaining until termination of the horizon often has a
profound bearing on action selection in the finite horizon case.
3) We prove that our proposed algorithm converges almost surely to a constrained optimum over a
tuple of parameters, one corresponding to each instant in the horizon (see the sketch following
this list of contributions).
4) We empirically compare the performance of our algorithm with well-known algorithms in the
literature that were originally designed for infinite horizon problems.
5) Our key observation here is that our algorithm gives good reward performance while strictly
meeting the constraint criterion at every time instant, unlike the other algorithms, which do not
meet the constraint criterion and are therefore unsuitable for Constrained Finite Horizon problems.
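To make the parameter tuple mentioned in contribution 3 concrete, the following is a minimal sketch of a non-stationary policy kept as one parameter block per time instant, here with a tabular softmax distribution over actions; the sizes, names, and the softmax form are illustrative assumptions and not the exact parameterization analyzed later.

```python
import numpy as np

H, num_states, num_actions = 5, 4, 3        # illustrative sizes only
rng = np.random.default_rng(0)

# One parameter block per instant h = 0, ..., H-1: the "tuple of parameters"
# over which convergence is established (the tabular softmax form is an assumption).
theta = [np.zeros((num_states, num_actions)) for _ in range(H)]

def action_distribution(h, s):
    """Distribution over actions at instant h when the state is s."""
    logits = theta[h][s]
    z = np.exp(logits - logits.max())       # numerically stable softmax
    return z / z.sum()

def sample_action(h, s):
    """Sample an action at instant h in state s."""
    return rng.choice(num_actions, p=action_distribution(h, s))
```

Because a separate parameter block is maintained for each instant, the induced policy is non-stationary by construction.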
II. THE FRAMEWORK AND PROBLEM FORMULATION
By a Markov Decision Process (MDP), we mean a stochastic process $\{X_h\}$ taking values in a set
$S$ (called the state space) and governed by a control sequence $\{Z_h\}$ taking values in a set $A$ (called the
action space), that satisfies the following controlled Markov property: $P(X_{h+1} = s' \mid X_m, Z_m, m \leq h) =
p_h(X_h, Z_h, s')$ a.s. Also, associated with any state transition is a single-stage reward. We assume here
that $S$ and $A$ are finite sets. Actions are chosen by the decision maker at instants $h = 0, 1, \ldots, H-1$,
and the process terminates at instant $H$, where $H$ is a given positive integer. For simplicity, we consider
all actions in $A$ to be feasible in every state in $S$. A non-stationary randomized policy (NRP) is a
set of $H$ distributions defined by $\pi := \{\mu_0, \mu_1, \ldots, \mu_{H-1}\}$, where $\mu_h(s, \cdot)$ is a distribution over $A$, $s \in
S$, $h = 0, 1, \ldots, H-1$. Here, the action $Z_h \sim \mu_h(X_h, \cdot)$, $\forall h$. It is easy to see that under a NRP, $\{X_h\}$ is a
non-homogeneous Markov chain.
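As a brief illustration of the controlled Markov property and of the role of a NRP, the following sketch generates a length-$H$ trajectory by sampling $Z_h \sim \mu_h(X_h, \cdot)$ and then $X_{h+1}$ according to $p_h(X_h, Z_h, \cdot)$; the kernels and policy distributions used here are random placeholders, not quantities from any particular problem.

```python
import numpy as np

rng = np.random.default_rng(1)
H, nS, nA = 4, 3, 2                           # illustrative horizon and set sizes

# Time-varying transition kernels p_h(s, a, s') and a NRP {mu_0, ..., mu_{H-1}};
# both are random placeholders, normalized so that each row is a distribution.
p = rng.random((H, nS, nA, nS))
p /= p.sum(axis=-1, keepdims=True)
mu = rng.random((H, nS, nA))
mu /= mu.sum(axis=-1, keepdims=True)

def simulate(x0):
    """Generate (X_0, Z_0), ..., (X_{H-1}, Z_{H-1}) and the terminal state X_H."""
    traj, x = [], x0
    for h in range(H):
        z = rng.choice(nA, p=mu[h, x])        # Z_h ~ mu_h(X_h, .)
        x_next = rng.choice(nS, p=p[h, x, z]) # X_{h+1} ~ p_h(X_h, Z_h, .)
        traj.append((x, z))
        x = x_next
    return traj, x

trajectory, terminal_state = simulate(x0=0)
```

Since both $\mu_h$ and $p_h$ vary with $h$, the simulated state process is indeed a non-homogeneous Markov chain.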
Let $r_h(s, a, s')$ (resp. $g^{(1)}_h(s, a, s'), \ldots, g^{(M)}_h(s, a, s')$) be the single-stage reward (resp. the set of single-stage
costs) at instant $h = 0, 1, \ldots, H-1$ when the state is $s \in S$, the action chosen is $a \in A$ and the
next state is $s' \in S$. Let $r_H(s)$ (resp. $g^{(1)}_H(s), \ldots, g^{(M)}_H(s)$) likewise denote the terminal reward (resp. set
of terminal costs) at instant $H$ when the terminal state is $s \in S$. Our aim is to find a NRP $\pi^*$ that
maximizes the following over all NRP $\pi$: