algorithm is devised for finite-horizon C-MDP, uses function approximation, and involves actor-critic
type updates. Temporal difference learning algorithms for the finite horizon setting have also recently
been studied in [12], where convergence is shown for finite horizon Q-Learning with linear function
approximation as well as with certain more general function approximators. Since their motivation is to
make infinite horizon Q-Learning stable, they do not consider time-varying transition probability and
reward functions, which we do; this makes our setting more involved.
We prove the convergence of our algorithm under standard assumptions. These convergence guarantees
allow one to build on our algorithm and design neural-network-based variants for solving complex tasks.
One such task is a portfolio management system [13], where an investor participates in the stock market
for a finite amount of time and must decide what fraction of the money to invest in each stock. The
stock values naturally change with time, and the decisions made are time-critical in nature. We also
present empirical results on a two-dimensional grid world problem, where we observe that our algorithms
clearly meet the constraint cost criterion while giving good reward performance, whereas the other
algorithms in the literature do not meet the constraint objective.
Our contributions:
1) We present, and analyze both theoretically and experimentally, the first policy gradient reinforcement
learning algorithm with function approximation for Finite Horizon Constrained Markov Decision
Processes.
2) This setting differs significantly from infinite horizon problems: in the latter, stationary policies are
optimal, whereas in the finite horizon setting the optimal policies are invariably non-stationary.
This is because knowledge of the time remaining until termination of the horizon often has a
profound bearing on action selection in the finite horizon case.
3) We prove that our proposed algorithm converges almost surely to a constrained optimum over a
tuple of parameters, one corresponding to each instant in the horizon (see the sketch following
this list of contributions).
4) We empirically compare the performance of our algorithm with well-known algorithms in the
literature that were originally designed for infinite horizon problems.
5) Our key observation here is that our algorithm gives good reward performance while strictly
meeting the constraint criterion at every time instant, unlike the other algorithms, which do not
meet the constraint criterion and are therefore unsuitable for Constrained Finite Horizon problems.
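To make the parameter tuple mentioned in contribution 3 concrete, the following is a minimal sketch of a non-stationary policy kept as one parameter block per time instant, here with a tabular softmax distribution over actions; the sizes, names, and the softmax form are illustrative assumptions and not the exact parameterization analyzed later.

```python
import numpy as np

H, num_states, num_actions = 5, 4, 3        # illustrative sizes only
rng = np.random.default_rng(0)

# One parameter block per instant h = 0, ..., H-1: the "tuple of parameters"
# over which convergence is established (the tabular softmax form is an assumption).
theta = [np.zeros((num_states, num_actions)) for _ in range(H)]

def action_distribution(h, s):
    """Distribution over actions at instant h when the state is s."""
    logits = theta[h][s]
    z = np.exp(logits - logits.max())       # numerically stable softmax
    return z / z.sum()

def sample_action(h, s):
    """Sample an action at instant h in state s."""
    return rng.choice(num_actions, p=action_distribution(h, s))
```

Because a separate parameter block is maintained for each instant, the induced policy is non-stationary by construction.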
II. THE FRAMEWORK AND PROBLEM FORMULATION
By a Markov Decision Process (MDP), we mean a stochastic process $\{X_h\}$ taking values in a set
$S$ (called the state space) and governed by a control sequence $\{Z_h\}$ taking values in a set $A$ (called the
action space), that satisfies the following controlled Markov property: $P(X_{h+1} = s' \mid X_m, Z_m, m \leq h) =
p_h(X_h, Z_h, s')$ a.s. Also, associated with any state transition is a single-stage reward. We assume here
that $S$ and $A$ are finite sets. Actions are chosen by the decision maker at instants $h = 0, 1, \ldots, H-1$,
and the process terminates at instant $H$, where $H$ is a given positive integer. For simplicity, we consider
all actions in $A$ to be feasible in every state in $S$. A non-stationary randomized policy (NRP) is a
set of $H$ distributions defined by $\pi := \{\mu_0, \mu_1, \ldots, \mu_{H-1}\}$, where $\mu_h(s, \cdot)$ is a distribution over $A$, $s \in
S$, $h = 0, 1, \ldots, H-1$. Here, the action $Z_h \sim \mu_h(X_h, \cdot)$, $\forall h$. It is easy to see that under a NRP, $\{X_h\}$ is a
non-homogeneous Markov chain.
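As a brief illustration of the controlled Markov property and of the role of a NRP, the following sketch generates a length-$H$ trajectory by sampling $Z_h \sim \mu_h(X_h, \cdot)$ and then $X_{h+1}$ according to $p_h(X_h, Z_h, \cdot)$; the kernels and policy distributions used here are random placeholders, not quantities from any particular problem.

```python
import numpy as np

rng = np.random.default_rng(1)
H, nS, nA = 4, 3, 2                           # illustrative horizon and set sizes

# Time-varying transition kernels p_h(s, a, s') and a NRP {mu_0, ..., mu_{H-1}};
# both are random placeholders, normalized so that each row is a distribution.
p = rng.random((H, nS, nA, nS))
p /= p.sum(axis=-1, keepdims=True)
mu = rng.random((H, nS, nA))
mu /= mu.sum(axis=-1, keepdims=True)

def simulate(x0):
    """Generate (X_0, Z_0), ..., (X_{H-1}, Z_{H-1}) and the terminal state X_H."""
    traj, x = [], x0
    for h in range(H):
        z = rng.choice(nA, p=mu[h, x])        # Z_h ~ mu_h(X_h, .)
        x_next = rng.choice(nS, p=p[h, x, z]) # X_{h+1} ~ p_h(X_h, Z_h, .)
        traj.append((x, z))
        x = x_next
    return traj, x

trajectory, terminal_state = simulate(x0=0)
```

Since both $\mu_h$ and $p_h$ vary with $h$, the simulated state process is indeed a non-homogeneous Markov chain.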
Let $r_h(s, a, s')$ (resp. $g^{(1)}_h(s, a, s'), \ldots, g^{(M)}_h(s, a, s')$) be the single-stage reward (resp. the set of single-stage
costs) at instant $h = 0, 1, \ldots, H-1$ when the state is $s \in S$, the action chosen is $a \in A$ and the
next state is $s' \in S$. Let $r_H(s)$ (resp. $g^{(1)}_H(s), \ldots, g^{(M)}_H(s)$) likewise denote the terminal reward (resp. set
of terminal costs) at instant $H$ when the terminal state is $s \in S$. Our aim is to find a NRP $\pi^*$ that
maximizes the following over all NRP $\pi$: