Fixed-point equations solving Risk-sensitive MDP with constraint
Vartika Singh and Veeraruna Kavitha
IEOR, IIT Bombay, India

*The work of the first author is partially supported by the Prime Minister's Research Fellowship, India.
Abstract—There are no computationally feasible algorithms that provide solutions to the finite horizon Risk-sensitive Constrained Markov Decision Process (Risk-CMDP) problem, even for problems with moderate horizon. With the aim of designing such an algorithm, we derive a fixed-point equation such that the optimal policy of the Risk-CMDP is also a solution of this equation. We further provide two optimization problems equivalent to the Risk-CMDP. These formulations are instrumental in designing a global algorithm that converges to the optimal policy. The proposed algorithm is based on random restarts and a local improvement step, where the local improvement step utilizes the solution of the derived fixed-point equation; the random restarts ensure global optimization. We also provide numerical examples to illustrate the feasibility of our algorithm on an inventory control problem with risk-sensitive cost and constraint. The complexity of the algorithm grows only linearly with the time horizon.
I. INTRODUCTION
Classical Markov Decision Process (MDP) problems aim to derive an optimal policy that optimizes the expected combined cost accumulated over time (e.g., [1]). Both finite and infinite horizon problems are well studied. These problems can be solved with the help of the well-known Dynamic Programming (DP) equations. One can also solve MDP problems using a Linear Programming (LP) based approach (e.g., [1]). Further, the LP formulation facilitates the inclusion of constraints, which is not the case with the DP equations.
In many scenarios, it becomes important to consider the variations in the accumulated cost across different sample paths and not only its expected value (e.g., [3], [10]). In this case, risk-sensitive MDPs (Risk-MDPs) are useful, where higher moments of the combined cost are also considered (see [2], [4]). Risk-MDPs have a sensitivity parameter γ, called the risk-factor, which determines the importance given to the higher moments and the variance. It is well known that, as the risk-factor tends to zero, the value of the Risk-MDP approaches that of the classical MDP (e.g., [4], [7] for discounted cost and [5] for average cost problems); the same is also true for the optimal policies. However, the policies for higher γ can be drastically different.
The Risk-MDP problems can be solved using the corresponding dynamic programming approach (see [8]). Recently, the authors in [9] proposed an LP-based formulation to solve finite horizon Risk-MDP problems. However, the inclusion of constraints in these problems is not straightforward. The DP equations are again not satisfied for constrained problems; the LP approach in [9] can handle Risk-sensitive constrained MDPs (briefly referred to as Risk-CMDPs). However, in [9], the state space of the Risk-MDP is augmented to include the constraints, which causes the state space to grow exponentially over time. Such a formulation is impossible to implement for problems with a longer time horizon. To the best of our knowledge, there are no other computationally feasible algorithms in the literature to solve the Risk-CMDP problem. This calls for an implementable algorithm that can solve Risk-CMDP problems and that does not suffer from the curse of dimensionality.
For infinite horizon Risk-MDPs, in contrast to classical MDPs, stationary policies are not optimal in general (e.g., [4], [6], [7]). This makes the derivation of optimal policies even more difficult. In the literature, such problems are solved by approximating the infinite horizon problem with a finite horizon Risk-MDP; for example, [6] considers a tail cut-off policy, while [7] considers a tail replacement policy. These approximations are in line with the ultimately stationary policies discussed in [4].
Including constraints in infinite horizon Risk-MDPs is even more challenging. A recent work [10] considers the infinite horizon Risk-CMDP and provides approximate solutions via the solutions of an appropriate finite horizon Risk-CMDP. This approach provides ε-optimal policies, which improve as the terminal time of the finite horizon Risk-CMDP increases to infinity. Such approximations require the solution of finite-horizon problems with sufficiently large terminal times. This again calls for an implementable algorithm that does not suffer from the curse of dimensionality, as in [9].
One of the main goals of this paper is an algorithm that solves finite-horizon Risk-CMDP problems and whose complexity grows (only) linearly with the terminal time. The proposed algorithm is based on the derivation of a fixed-point equation which must be satisfied by any optimal policy of the constrained problem. Our contributions are threefold: i) we derive a fixed-point equation such that the optimal policy of the Risk-CMDP is also a solution of the fixed-point equation; ii) we derive two equivalent optimization problems that facilitate the derivation of the optimal policy; and iii) we provide a global iterative algorithm that converges to the optimal policy under certain conditions.
To illustrate the feasibility of our approach, we present numerical examples that solve some inventory control problems with a constraint. We could easily solve problems with time horizons as long as 1000 decision epochs.
Our approach can easily be extended to the case with multiple constraints, including risk-neutral constraints. We believe it can also be extended to the case with a combination of risk-sensitive and risk-neutral objective functions and/or constraint functions.
II. RISK-SENSITIVE MDPS
A Risk-sensitive Markov Decision Process (Risk-MDP) is a sequential decision-making problem that aims to optimize the exponential function of a combination of sequential rewards. In contrast to classical MDPs, it also considers higher moments of the combined cost. As in a classical MDP, the risk-sensitive framework consists of a time horizon $1, \dots, T$, a finite state space $\mathcal{S}_t$ at time $t$, a finite action space $\mathcal{A}_{t,x}$ at time $t$ for state $x \in \mathcal{S}_t$, and a transition function that determines the probability $p_t(x'|x,a)$ of reaching state $x' \in \mathcal{S}_{t+1}$ based on the current state $x \in \mathcal{S}_t$ and action $a \in \mathcal{A}_{t,x}$. In this paper, we consider a finite-horizon problem, i.e., $T < \infty$.

A reward $\tilde{r}_t(x,a,x')$ is achieved at time $t$ when action $a$ chosen in state $x$ results in (next) state $x'$. The aim is to find a policy that optimizes a given objective function constructed from the combined cost. Let $\Delta(\mathcal{A})$ be the set of probability distributions on a set $\mathcal{A}$. Any policy consists of decision rules that prescribe the actions to be taken for a given state and time epoch. It is represented as $\Pi = (d_1, \dots, d_{T-1})$, where $d_t : \mathcal{S}_t \to \Delta(\mathcal{A}_{t,\cdot})$ (to be more precise, $x \mapsto \Delta(\mathcal{A}_{t,x})$) prescribes the action for the $t$-th time slot; in other words, in this work we restrict our attention to Markovian policies.
The risk-sensitive objective function under policy $\Pi$ and initial distribution $\alpha_r$ for the Risk-MDP problem is defined as,
$$
J_r(\Pi, \alpha_r) = \mathbb{E}^{\Pi}_{\alpha_r}\Big[ e^{\sum_{t=1}^{T-1} r_t(X_t, A_t, X_{t+1}) + r_T(X_T)} \Big], \ \text{with}
$$
$$
r_t(x,a,x') := \gamma \beta^t \tilde{r}_t(x,a,x') \ \text{ and } \ r_T(x) = \gamma \beta^T \tilde{r}_T(x). \tag{1}
$$
Here, $\gamma$ is the risk-factor, $\beta \in (0,1)$ is the discount factor, $(X_t, A_t)_{t \le T-1}$ is the stochastic state-action trajectory that evolves under policy $\Pi$, and $\mathbb{E}^{\Pi}_{\alpha_r}$ represents the expectation under policy $\Pi$ and initial distribution $\alpha_r$. A higher $|\gamma|$ indicates more importance to the higher moments of the combined cost, while with $\gamma \to 0$, one can approach the classical MDP problem. The aim in Risk-MDP problems is to find an optimal policy, i.e., a policy $\Pi^*$ that satisfies,
$$
\Pi^* \in \arg\sup_{\Pi \in \Gamma} J_r(\Pi, \alpha_r), \tag{2}
$$
where $\Gamma$ is the set of Markovian randomized policies.
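For a fixed Markovian randomized policy, the expectation in (1) can be evaluated exactly by a backward multiplicative recursion, since the exponential of the sum factorizes along the trajectory. The sketch below only illustrates this structure (it is not the algorithm proposed in this paper), and the data layout, dictionaries indexed by time, state and action, is our own assumption.

```python
import math

def evaluate_risk_objective(T, states, p, r, r_T, policy, alpha, gamma, beta):
    """Exact evaluation of J_r(Pi, alpha_r) in (1) for a Markovian randomized
    policy, via v_t(x) = E[ exp(sum_{tau >= t} r_tau + r_T) | X_t = x ].

    Assumed (hypothetical) data layout:
      states[t]      -- list of states in S_t, t = 1, ..., T
      p[t][(x, a)]   -- dict x' -> p_t(x'|x, a)
      r[t][(x,a,x')] -- undiscounted reward  ~r_t(x, a, x')
      r_T[x]         -- undiscounted terminal reward ~r_T(x)
      policy[t][x]   -- dict a -> d_t(x, a)  (randomized decision rule)
      alpha[x]       -- initial distribution alpha_r
    """
    # terminal layer: v_T(x) = exp(gamma * beta^T * ~r_T(x))
    v = {x: math.exp(gamma * beta ** T * r_T[x]) for x in states[T]}
    for t in range(T - 1, 0, -1):                     # t = T-1, ..., 1
        v_new = {}
        for x in states[t]:
            v_new[x] = sum(
                pa * p[t][(x, a)][xn]
                * math.exp(gamma * beta ** t * r[t][(x, a, xn)])
                * v[xn]
                for a, pa in policy[t][x].items()
                for xn in p[t][(x, a)]
            )
        v = v_new
    # average the time-1 values over the initial distribution
    return sum(prob * v[x] for x, prob in alpha.items())
```

Note that this evaluation costs one pass per decision epoch, i.e., it is linear in the horizon $T$, the same scaling claimed for the proposed algorithm.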
The value function $u_t(x)$ at time $t$ and in state $x \in \mathcal{S}_t$ is defined to be (see [8], [9]),
$$
u_t(x) = \sup_{\Pi} J_r(t, \Pi, x), \ \text{ where} \tag{3}
$$
$$
J_r(t, \Pi, x) := \mathbb{E}^{\Pi}_{\alpha_r}\Big[ e^{\sum_{\tau=t}^{T-1} r_\tau(X_\tau, A_\tau, X_{\tau+1}) + r_T(X_T)} \,\Big|\, X_t = x \Big].
$$
By the strong Markov property, applicable under Markovian policies, the above quantity depends only on the sub-policy (of $\Pi$) from $t$ onwards. Observe here that the optimal policy $\Pi^*$ is the one that achieves the value function $u_1(x)$ for all $x \in \mathcal{S}_1$. The well-known DP equations to solve the Risk-MDPs are as follows (for $x \in \mathcal{S}_t$),
$$
u_T(x) = e^{r_T(x)}, \ \text{ and for } t \le T-1, \tag{4}
$$
$$
u_t(x) = \max_{a} \bigg\{ \sum_{x'} e^{r_t(x,a,x')} \, p_t(x'|x,a) \, u_{t+1}(x') \bigg\}.
$$
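A minimal sketch of the backward recursion (4) is given below; it returns the value functions $u_t$ together with a maximizing deterministic Markov decision rule at each epoch. The data layout is the same assumed layout as in the earlier evaluation sketch (not notation from this paper), and `actions[t][x]` listing $\mathcal{A}_{t,x}$ is likewise a hypothetical input.

```python
import math

def risk_dp(T, states, actions, p, r, r_T, gamma, beta):
    """Backward dynamic programming for the unconstrained Risk-MDP, eq. (4).
    Returns u[t][x] and a greedy deterministic policy (stored as a point-mass
    randomized decision rule, so it can be fed to the evaluation sketch)."""
    u = {T: {x: math.exp(gamma * beta ** T * r_T[x]) for x in states[T]}}
    policy = {}
    for t in range(T - 1, 0, -1):
        u[t], policy[t] = {}, {}
        for x in states[t]:
            best_val, best_a = -math.inf, None
            for a in actions[t][x]:
                val = sum(
                    math.exp(gamma * beta ** t * r[t][(x, a, xn)])
                    * p[t][(x, a)][xn] * u[t + 1][xn]
                    for xn in p[t][(x, a)]
                )
                if val > best_val:
                    best_val, best_a = val, a
            u[t][x] = best_val
            policy[t][x] = {best_a: 1.0}
    return u, policy
```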
The fixed-point dynamic programming equations (4) facilitate the derivation of the optimal policy for unconstrained problems. However, such equations are not known for constrained problems, which we introduce next. One of the aims of this paper is to derive an appropriate fixed-point equation that solves the Risk-CMDP.
A. Risk-sensitive constrained MDP (Risk-CMDP)
We now consider a constraint in the Risk-MDP problem defined in (2). Here, at time $t$, an immediate constraint-cost $\tilde{c}_t(x,a,x')$ is incurred along with the reward $\tilde{r}_t(x,a,x')$, when action $a$ chosen in state $x$ results in next state $x'$. The aim is to keep the expected exponential of the combined constraint-cost below a certain bound. To keep it general, let $\beta_c$ and $\gamma_c$ be the discount and risk-factors corresponding to the constraint; the factors $\gamma_c, \beta_c$ can be different from the respective factors $\gamma, \beta$ corresponding to $\tilde{r}_t$. Thus a policy $\Pi$ is feasible if it satisfies the following constraint,
$$
J_c(\Pi, \alpha_c) = \mathbb{E}^{\Pi}_{\alpha_c}\Big[ e^{\sum_{t=1}^{T-1} c_t(X_t, A_t, X_{t+1}) + c_T(X_T)} \Big] \le B, \ \text{with}
$$
$$
c_t(x,a,x') := \gamma_c \beta_c^t \tilde{c}_t(x,a,x'), \ \text{ and } \ c_T(x) = \gamma_c \beta_c^T \tilde{c}_T(x), \tag{5}
$$
where $\alpha_c$ is the initial distribution and can be different from $\alpha_r$, the initial distribution of the state corresponding to $\tilde{r}_t$. Thus, the overall problem is,
$$
\sup_{\Pi \in \Gamma} J_r(\Pi, \alpha_r), \quad \text{subject to } J_c(\Pi, \alpha_c) \le B. \tag{6}
$$
Let $\Gamma_c := \{\Pi \in \Gamma : J_c(\Pi, \alpha_c) \le B\}$ represent the corresponding feasible region. Throughout, we assume the existence of a solution of (6), and the aim is to design an algorithm that obtains the same. Towards this, as a first step, we derive in the next section a fixed-point equation whose solution optimizes (6).
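Checking membership of a candidate policy in $\Gamma_c$ only requires evaluating (5), which has exactly the same multiplicative structure as (1). For instance, reusing the hypothetical `evaluate_risk_objective` routine sketched in Section II, with the constraint data in place of the reward data:

```python
def is_feasible(policy, B, T, states, p, c, c_T, alpha_c, gamma_c, beta_c):
    """Membership test for Gamma_c, i.e., J_c(Pi, alpha_c) <= B.
    evaluate_risk_objective is the illustrative routine sketched earlier
    (our own helper, not an API from the paper); c, c_T hold the
    undiscounted constraint-costs ~c_t, ~c_T."""
    J_c = evaluate_risk_objective(T, states, p, c, c_T,
                                  policy, alpha_c, gamma_c, beta_c)
    return J_c <= B
```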
III. FIXED-POINT EQUATION
We begin this section with a few definitions. For ease of notation, we let $m_t$ represent the immediate reward $r_t$ or constraint-cost $c_t$ function at time $t$, depending upon the choice $m \in \{r, c\}$. Thus $m_t(x,a,x')$ is the reward/constraint-cost function at time $t$, when action $a$ chosen in state $x$ results in $x'$, the new state.
A. Forward factors
For any policy $\Pi$, time $t$, reward/constraint-cost $m$ and state $x_t$, define the forward factors $\theta^{\Pi}_{m,t}(x_t)$ as below:
$$
\theta^{\Pi}_{m,t}(x_t) := \mathbb{E}^{\Pi}_{\alpha_m}\Big[ e^{\sum_{k=1}^{t-1} m_k(X_k, A_k, X_{k+1})} \, 1_{\{X_t = x_t\}} \Big]
= \sum_{x_1^{t-1},\, a_1^{t-1}} \alpha_m(x_1) \prod_{k \le t-1} d_k(x_k, a_k) \, p_k(x_{k+1}|x_k, a_k) \, e^{m_k(x_k, a_k, x_{k+1})},
$$
where $x_1^{t-1} := (x_1, \dots, x_{t-1})$ is a vector with each $x_k \in \mathcal{S}_k$ and $a_1^{t-1} := (a_1, \dots, a_{t-1})$ is a vector with each $a_k \in \mathcal{A}_{k, x_k}$. These factors represent the expected (exponentiated) reward/constraint-cost accumulated under policy $\Pi$ till time $t-1$, together with the probability that $X_t = x_t$.
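Although the definition above sums over all partial trajectories, the forward factors satisfy a simple forward recursion, $\theta^{\Pi}_{m,1}(x) = \alpha_m(x)$ and $\theta^{\Pi}_{m,t+1}(x') = \sum_{x,a} \theta^{\Pi}_{m,t}(x)\, d_t(x,a)\, p_t(x'|x,a)\, e^{m_t(x,a,x')}$, which follows directly from the definition and lets them be computed in time linear in $T$. A minimal sketch, under the same assumed data layout as the earlier snippets:

```python
import math

def forward_factors(T, states, p, m, policy, alpha_m, gamma_m, beta_m):
    """Forward factors theta^Pi_{m,t}(x) = E[ exp(sum_{k<t} m_k) 1{X_t = x} ],
    computed by the forward recursion implied by the definition above.
    m[t][(x, a, x')] holds the undiscounted immediate ~m_t; the scaling
    gamma_m * beta_m**t matches (1) and (5).  One pass over time, so the
    cost is linear in T."""
    theta = {1: dict(alpha_m)}                       # theta_{m,1}(x) = alpha_m(x)
    for t in range(1, T):
        nxt = {x: 0.0 for x in states[t + 1]}
        for x, th in theta[t].items():
            for a, pa in policy[t][x].items():       # d_t(x, a)
                for xn, pr in p[t][(x, a)].items():  # p_t(x'|x, a)
                    nxt[xn] += th * pa * pr * math.exp(
                        gamma_m * beta_m ** t * m[t][(x, a, xn)]
                    )
        theta[t + 1] = nxt
    return theta
```

As a consistency check, one can verify that $\sum_x \theta^{\Pi}_{m,T}(x)\, e^{m_T(x)}$ recovers $J_m(\Pi, \alpha_m)$ from (1) and (5).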