
II. RISK SENSITIVE MDPS
A risk-sensitive Markov decision process (Risk-MDP) is a sequential decision-making problem that aims to optimize the expected exponential of a combination of sequential rewards. In contrast to classical MDPs, it also accounts for the higher moments of the combined cost. As in a classical MDP, the risk-sensitive framework consists of a time horizon $1, \dots, T$, a finite state space $S_t$ at time $t$, a finite action space $A_{t,x}$ at time $t$ for each state $x \in S_t$, and a transition function that determines the probability $p_t(x'|x,a)$ of reaching state $x' \in S_{t+1}$ from the current state $x \in S_t$ under action $a \in A_{t,x}$. In this paper, we consider a finite-horizon problem, i.e., $T < \infty$.
A reward $\tilde r_t(x,a,x')$ is earned at time $t$ when action $a$ chosen in state $x$ results in (next) state $x'$. The aim is to find a policy that optimizes a given objective function constructed from the combined cost. Let $\Delta(A)$ be the set of probability distributions on a set $A$. Any policy consists of decision rules that prescribe the actions to be taken at each state and time epoch. It is represented as $\Pi = (d_1, \dots, d_{T-1})$, where $d_t : S_t \to \Delta(A_{t,\cdot})$ (to be more precise, $x \mapsto \Delta(A_{t,x})$) prescribes the action for the $t$-th time slot; in other words, in this work we restrict our attention to Markovian policies.
The risk-sensitive objective function under policy $\Pi$ and initial distribution $\alpha_r$ for the Risk-MDP problem is defined as
$$
J_r(\Pi, \alpha_r) = E^{\Pi}_{\alpha_r}\left[ e^{\sum_{t=1}^{T-1} r_t(X_t, A_t, X_{t+1}) + r_T(X_T)} \right], \ \text{with}
$$
$$
r_t(x,a,x') := \gamma \beta^t\, \tilde r_t(x,a,x') \ \text{and} \ r_T(x) = \gamma \beta^T\, \tilde r_T(x). \qquad (1)
$$
Here, $\gamma$ is the risk factor, $\beta \in (0,1)$ is the discount factor, $(X_t, A_t)_{t \le T-1}$ is the stochastic state-action trajectory that evolves under policy $\Pi$, and $E^{\Pi}_{\alpha_r}$ denotes the expectation under policy $\Pi$ and initial distribution $\alpha_r$. A higher $|\gamma|$ places more importance on the higher moments of the combined cost (indeed, $E[e^{\gamma C}] = 1 + \gamma E[C] + \tfrac{\gamma^2}{2} E[C^2] + \cdots$, where $C$ denotes the discounted combined reward, so larger $|\gamma|$ weights the higher moments more heavily), while as $\gamma \to 0$ one approaches the classical MDP problem. The aim in Risk-MDP problems is to find an optimal policy, i.e., a policy $\Pi^*$ that satisfies
$$
\Pi^* \in \arg\sup_{\Pi \in \Gamma} J_r(\Pi, \alpha_r), \qquad (2)
$$
where $\Gamma$ is the set of Markovian randomized policies.
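For intuition, the objective (1) can be estimated for any fixed Markovian policy by simulating trajectories and averaging the exponentiated, discounted rewards. Below is a minimal Monte Carlo sketch, not part of the formal development: the function name and the containers `policy`, `P`, `r_tilde`, `rT_tilde` are hypothetical, assuming finite sets encoded by indices and containers keyed by time $t$.

```python
import numpy as np

def estimate_Jr(policy, P, r_tilde, rT_tilde, alpha_r, gamma, beta, T,
                n_samples=10_000, seed=0):
    """Monte Carlo estimate of the risk-sensitive objective (1) for a fixed policy.

    policy[t][x]        : probability vector over actions at time t, state x
    P[t][x][a]          : probability vector over next states x'
    r_tilde[t][x][a][x']: immediate reward \tilde r_t(x, a, x')
    rT_tilde[x]         : terminal reward \tilde r_T(x)
    alpha_r             : initial state distribution (probability vector)
    Containers are assumed to be keyed/indexed by time t = 1, ..., T-1.
    """
    rng = np.random.default_rng(seed)
    acc = 0.0
    for _ in range(n_samples):
        x = rng.choice(len(alpha_r), p=alpha_r)          # X_1 ~ alpha_r
        s = 0.0
        for t in range(1, T):                            # t = 1, ..., T-1
            a = rng.choice(len(policy[t][x]), p=policy[t][x])
            x_next = rng.choice(len(P[t][x][a]), p=P[t][x][a])
            s += gamma * beta**t * r_tilde[t][x][a][x_next]  # r_t = gamma * beta^t * \tilde r_t
            x = x_next
        s += gamma * beta**T * rT_tilde[x]               # terminal term r_T(X_T)
        acc += np.exp(s)                                 # exponential of the combined reward
    return acc / n_samples
```

Since the estimator averages exponentials, its variance can be large for large $|\gamma|$; the exact recursion (4) below avoids this for the unconstrained problem.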
The value function $u_t(x)$ at time $t$ and state $x \in S_t$ is defined as (see [8], [9])
$$
u_t(x) = \sup_{\Pi} J_r(t, \Pi, x), \ \text{where} \qquad (3)
$$
$$
J_r(t, \Pi, x) := E^{\Pi}_{\alpha_r}\left[ e^{\sum_{\tau=t}^{T-1} r_\tau(X_\tau, A_\tau, X_{\tau+1}) + r_T(X_T)} \,\Big|\, X_t = x \right].
$$
By the strong Markov property, applicable under Markovian policies, the above quantity depends only on the sub-policy (of $\Pi$) from $t$ onwards. Observe that the optimal policy $\Pi^*$ is the one that achieves the value function $u_1(x)$ for all $x \in S_1$. The well-known dynamic programming (DP) equations to solve Risk-MDPs are as follows (for $x \in S_t$):
$$
u_T(x) = e^{r_T(x)} \ \text{and for } t \le T-1, \qquad (4)
$$
$$
u_t(x) = \max_a \left\{ \sum_{x'} e^{r_t(x,a,x')}\, p_t(x'|x,a)\, u_{t+1}(x') \right\}.
$$
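For concreteness, a minimal backward-recursion sketch of (4) is given below; it is only an illustration under an assumed encoding, not the algorithm developed in this paper. The function name `solve_risk_mdp` and the nested dictionaries `P`, `r`, `rT` (keyed by time, state, action, and next state) are hypothetical.

```python
import numpy as np

def solve_risk_mdp(P, r, rT, T):
    """Backward recursion (4) for the unconstrained Risk-MDP (a sketch).

    P[t][x][a]   : dict {x': probability} over next states in S_{t+1}
    r[t][x][a][x']: scaled reward r_t(x, a, x') = gamma * beta^t * \tilde r_t(x, a, x')
    rT[x]        : terminal reward r_T(x)
    Returns value functions u[t][x] and a deterministic decision rule d[t][x].
    """
    u = {T: {x: np.exp(rT[x]) for x in rT}}              # u_T(x) = e^{r_T(x)}
    d = {}
    for t in range(T - 1, 0, -1):                        # t = T-1, ..., 1
        u[t], d[t] = {}, {}
        for x in P[t]:
            # q[a] = sum_{x'} e^{r_t(x,a,x')} p_t(x'|x,a) u_{t+1}(x')
            q = {a: sum(np.exp(r[t][x][a][xn]) * P[t][x][a][xn] * u[t + 1][xn]
                        for xn in P[t][x][a])
                 for a in P[t][x]}
            d[t][x] = max(q, key=q.get)                  # maximizing action
            u[t][x] = q[d[t][x]]                         # u_t(x) = max_a {...}
    return u, d
```

The maximizing action at each $(t, x)$ yields a deterministic decision rule at each epoch, which suffices for the unconstrained problem.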
The fixed-point dynamic programming equations (4) facilitate the derivation of the optimal policy for unconstrained problems. However, such equations are not known for constrained problems, which we introduce next. One of the aims of this paper is to derive an appropriate fixed-point equation that solves the Risk-CMDP.
A. Risk-sensitive constrained MDP (Risk-CMDP)
We now consider a constraint in the Risk-MDP problem defined in (2). Here, at time $t$, an immediate constraint-cost $\tilde c_t(x,a,x')$ is incurred along with the reward $\tilde r_t(x,a,x')$ when action $a$ chosen in state $x$ results in next state $x'$. The aim is to keep the expected exponential of the combined constraint-cost below a certain bound. To keep the formulation general, let $\beta_c$ and $\gamma_c$ be the discount and risk factors corresponding to the constraint; the factors $\gamma_c, \beta_c$ can differ from the respective factors $\gamma, \beta$ corresponding to $\tilde r_t$. Thus a policy $\Pi$ is feasible if it satisfies the following constraint:
$$
J_c(\Pi, \alpha_c) = E^{\Pi}_{\alpha_c}\left[ e^{\sum_{t=1}^{T-1} c_t(X_t, A_t, X_{t+1}) + c_T(X_T)} \right] \le B, \ \text{with}
$$
$$
c_t(x,a,x') := \gamma_c \beta_c^t\, \tilde c_t(x,a,x'), \ \text{and} \ c_T(x) = \gamma_c \beta_c^T\, \tilde c_T(x), \qquad (5)
$$
where $\alpha_c$ is the initial distribution, which can differ from $\alpha_r$, the initial state distribution corresponding to $\tilde r_t$. Thus, the overall problem is
$$
\sup_{\Pi \in \Gamma} J_r(\Pi, \alpha_r), \quad \text{subject to } J_c(\Pi, \alpha_c) \le B. \qquad (6)
$$
Let $\Gamma_c := \{\Pi \in \Gamma : J_c(\Pi, \alpha_c) \le B\}$ represent the corresponding feasible region. Throughout, we assume the existence of a solution of (6), and the aim is to design an algorithm that obtains it. Towards this, as a first step, we derive in the next section a fixed-point equation whose solution optimizes (6).
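Note that the feasibility of any fixed Markovian policy can be checked by evaluating $J_c(\Pi, \alpha_c)$ in (5) with a backward recursion analogous to (4) but without the maximization. A minimal sketch is given below, assuming the same hypothetical nested-dictionary encoding as before; the forward factors introduced in Section III accumulate essentially the same quantity forward in time.

```python
import numpy as np

def constraint_value(policy, P, c, cT, alpha_c, T):
    """Evaluate J_c(Pi, alpha_c) in (5) for a fixed Markovian policy (a sketch).

    policy[t][x] : dict {a: probability} of actions at time t, state x
    P[t][x][a]   : dict {x': probability} of next states
    c[t][x][a][x']: scaled constraint-cost c_t(x, a, x') = gamma_c * beta_c^t * \tilde c_t(x, a, x')
    cT[x]        : terminal constraint-cost c_T(x)
    alpha_c[x]   : initial state distribution
    """
    v = {x: np.exp(cT[x]) for x in cT}                   # v_T(x) = e^{c_T(x)}
    for t in range(T - 1, 0, -1):                        # backward policy evaluation
        # v_t(x) = sum_a pi_t(a|x) sum_{x'} p_t(x'|x,a) e^{c_t(x,a,x')} v_{t+1}(x')
        v = {x: sum(pa * pxn * np.exp(c[t][x][a][xn]) * v[xn]
                    for a, pa in policy[t][x].items()
                    for xn, pxn in P[t][x][a].items())
             for x in P[t]}
    return sum(alpha_c[x] * v[x] for x in alpha_c)       # J_c(Pi, alpha_c)

# Feasibility check for a candidate policy: constraint_value(...) <= B
```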
III. FIXED-POINT EQUATION
We begin this section with a few definitions. For ease of notation, we let $m_t$ represent either the immediate reward function $r_t$ or the constraint-cost function $c_t$ at time $t$, depending upon the choice $m \in \{r, c\}$. Thus $m_t(x,a,x')$ is the reward/constraint-cost at time $t$ when action $a$ chosen in state $x$ results in the new state $x'$.
A. Forward factors
For any policy $\Pi$, time $t$, reward/constraint-cost $m$, and state $x_t$, define the forward factors $\theta^{\Pi}_{m,t}(x_t)$ as follows:
$$
\theta^{\Pi}_{m,t}(x_t) := E^{\Pi}_{\alpha_m}\left[ e^{\sum_{k=1}^{t-1} m_k(X_k, A_k, X_{k+1})}\, 1_{\{X_t = x_t\}} \right]
= \sum_{x_1^{t-1},\, a_1^{t-1}} \alpha_m(x_1) \prod_{k \le t-1} d_k(x_k, a_k)\, p_k(x_{k+1}|x_k, a_k)\, e^{m_k(x_k, a_k, x_{k+1})},
$$
where $x_1^{t-1} := (x_1, \dots, x_{t-1})$ is a vector with each $x_k \in S_k$ and $a_1^{t-1} := (a_1, \dots, a_{t-1})$ is a vector with each $a_k \in A_{k,x_k}$. These factors capture the exponentiated reward/constraint-cost accumulated under policy $\Pi$ up to time $t-1$, together with the probability that $X_t = x_t$. For any $t$, $m$, it is