
II. RISK SENSITIVE MDPS
A risk-sensitive Markov decision process (Risk-MDP) is a sequential decision-making problem that aims to optimize the expected exponential of a combination of sequential rewards. In contrast to classical MDPs, it also accounts for the higher moments of the combined cost. As in a classical MDP, the risk-sensitive framework consists of a time horizon $1, \dots, T$, a finite state space $S_t$ at time $t$, a finite action space $A_{t,x}$ at time $t$ for each state $x \in S_t$, and a transition function that determines the probability $p_t(x'|x,a)$ of reaching state $x' \in S_{t+1}$ from the current state $x \in S_t$ under action $a \in A_{t,x}$. In this paper, we consider a finite-horizon problem, i.e., $T < \infty$.
A reward $\tilde r_t(x,a,x')$ is earned at time $t$ when action $a$ chosen in state $x$ results in (next) state $x'$. The aim is to find a policy that optimizes a given objective function constructed from the combined cost. Let $\Delta(A)$ be the set of probability distributions on a set $A$. Any policy consists of decision rules that prescribe the actions to be taken at each state and time epoch. It is represented as $\Pi = (d_1, \dots, d_{T-1})$, where $d_t : S_t \to \Delta(A_{t,\cdot})$ (to be more precise, $x \mapsto \Delta(A_{t,x})$) prescribes the action for the $t$-th time slot; in other words, in this work we restrict our attention to Markovian policies.
The risk-sensitive objective function under policy $\Pi$ and initial distribution $\alpha_r$ for the Risk-MDP problem is defined as
$$
J_r(\Pi, \alpha_r) = E^{\Pi}_{\alpha_r}\left[ e^{\sum_{t=1}^{T-1} r_t(X_t, A_t, X_{t+1}) + r_T(X_T)} \right], \ \text{with}
$$
$$
r_t(x,a,x') := \gamma \beta^t\, \tilde r_t(x,a,x') \ \text{and} \ r_T(x) = \gamma \beta^T\, \tilde r_T(x). \qquad (1)
$$
Here, $\gamma$ is the risk factor, $\beta \in (0,1)$ is the discount factor, $(X_t, A_t)_{t \le T-1}$ is the stochastic state-action trajectory that evolves under policy $\Pi$, and $E^{\Pi}_{\alpha_r}$ denotes the expectation under policy $\Pi$ and initial distribution $\alpha_r$. A higher $|\gamma|$ places more importance on the higher moments of the combined cost (indeed, $E[e^{\gamma C}] = 1 + \gamma E[C] + \tfrac{\gamma^2}{2} E[C^2] + \cdots$, where $C$ denotes the discounted combined reward, so larger $|\gamma|$ weights the higher moments more heavily), while as $\gamma \to 0$ one approaches the classical MDP problem. The aim in Risk-MDP problems is to find an optimal policy, i.e., a policy $\Pi^*$ that satisfies
$$
\Pi^* \in \arg\sup_{\Pi \in \Gamma} J_r(\Pi, \alpha_r), \qquad (2)
$$
where $\Gamma$ is the set of Markovian randomized policies.
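For intuition, the objective (1) can be estimated for any fixed Markovian policy by simulating trajectories and averaging the exponentiated, discounted rewards. Below is a minimal Monte Carlo sketch, not part of the formal development: the function name and the containers `policy`, `P`, `r_tilde`, `rT_tilde` are hypothetical, assuming finite sets encoded by indices and containers keyed by time $t$.

```python
import numpy as np

def estimate_Jr(policy, P, r_tilde, rT_tilde, alpha_r, gamma, beta, T,
                n_samples=10_000, seed=0):
    """Monte Carlo estimate of the risk-sensitive objective (1) for a fixed policy.

    policy[t][x]        : probability vector over actions at time t, state x
    P[t][x][a]          : probability vector over next states x'
    r_tilde[t][x][a][x']: immediate reward \tilde r_t(x, a, x')
    rT_tilde[x]         : terminal reward \tilde r_T(x)
    alpha_r             : initial state distribution (probability vector)
    Containers are assumed to be keyed/indexed by time t = 1, ..., T-1.
    """
    rng = np.random.default_rng(seed)
    acc = 0.0
    for _ in range(n_samples):
        x = rng.choice(len(alpha_r), p=alpha_r)          # X_1 ~ alpha_r
        s = 0.0
        for t in range(1, T):                            # t = 1, ..., T-1
            a = rng.choice(len(policy[t][x]), p=policy[t][x])
            x_next = rng.choice(len(P[t][x][a]), p=P[t][x][a])
            s += gamma * beta**t * r_tilde[t][x][a][x_next]  # r_t = gamma * beta^t * \tilde r_t
            x = x_next
        s += gamma * beta**T * rT_tilde[x]               # terminal term r_T(X_T)
        acc += np.exp(s)                                 # exponential of the combined reward
    return acc / n_samples
```

Since the estimator averages exponentials, its variance can be large for large $|\gamma|$; the exact recursion (4) below avoids this for the unconstrained problem.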
The value function $u_t(x)$ at time $t$ and state $x \in S_t$ is defined as (see [8], [9])
$$
u_t(x) = \sup_{\Pi} J_r(t, \Pi, x), \ \text{where} \qquad (3)
$$
$$
J_r(t, \Pi, x) := E^{\Pi}_{\alpha_r}\left[ e^{\sum_{\tau=t}^{T-1} r_\tau(X_\tau, A_\tau, X_{\tau+1}) + r_T(X_T)} \,\Big|\, X_t = x \right].
$$
By the strong Markov property, applicable under Markovian policies, the above quantity depends only on the sub-policy (of $\Pi$) from $t$ onwards. Observe that the optimal policy $\Pi^*$ is the one that achieves the value function $u_1(x)$ for all $x \in S_1$. The well-known dynamic programming (DP) equations to solve Risk-MDPs are as follows (for $x \in S_t$):
$$
u_T(x) = e^{r_T(x)} \ \text{and for } t \le T-1, \qquad (4)
$$
$$
u_t(x) = \max_a \left\{ \sum_{x'} e^{r_t(x,a,x')}\, p_t(x'|x,a)\, u_{t+1}(x') \right\}.
$$
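For concreteness, a minimal backward-recursion sketch of (4) is given below; it is only an illustration under an assumed encoding, not the algorithm developed in this paper. The function name `solve_risk_mdp` and the nested dictionaries `P`, `r`, `rT` (keyed by time, state, action, and next state) are hypothetical.

```python
import numpy as np

def solve_risk_mdp(P, r, rT, T):
    """Backward recursion (4) for the unconstrained Risk-MDP (a sketch).

    P[t][x][a]   : dict {x': probability} over next states in S_{t+1}
    r[t][x][a][x']: scaled reward r_t(x, a, x') = gamma * beta^t * \tilde r_t(x, a, x')
    rT[x]        : terminal reward r_T(x)
    Returns value functions u[t][x] and a deterministic decision rule d[t][x].
    """
    u = {T: {x: np.exp(rT[x]) for x in rT}}              # u_T(x) = e^{r_T(x)}
    d = {}
    for t in range(T - 1, 0, -1):                        # t = T-1, ..., 1
        u[t], d[t] = {}, {}
        for x in P[t]:
            # q[a] = sum_{x'} e^{r_t(x,a,x')} p_t(x'|x,a) u_{t+1}(x')
            q = {a: sum(np.exp(r[t][x][a][xn]) * P[t][x][a][xn] * u[t + 1][xn]
                        for xn in P[t][x][a])
                 for a in P[t][x]}
            d[t][x] = max(q, key=q.get)                  # maximizing action
            u[t][x] = q[d[t][x]]                         # u_t(x) = max_a {...}
    return u, d
```

The maximizing action at each $(t, x)$ yields a deterministic decision rule at each epoch, which suffices for the unconstrained problem.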
The fixed-point dynamic programming equations (4) facilitate the derivation of the optimal policy for unconstrained problems. However, such equations are not known for constrained problems, which we introduce next. One of the aims of this paper is to derive an appropriate fixed-point equation that solves the Risk-CMDP.
A. Risk-sensitive constrained MDP (Risk-CMDP)
We now consider a constraint in the Risk-MDP problem defined in (2). Here, at time $t$, an immediate constraint-cost $\tilde c_t(x,a,x')$ is incurred along with the reward $\tilde r_t(x,a,x')$ when action $a$ chosen in state $x$ results in next state $x'$. The aim is to keep the expected exponential of the combined constraint-cost below a certain bound. To keep the formulation general, let $\beta_c$ and $\gamma_c$ be the discount and risk factors corresponding to the constraint; the factors $\gamma_c, \beta_c$ can differ from the respective factors $\gamma, \beta$ corresponding to $\tilde r_t$. Thus a policy $\Pi$ is feasible if it satisfies the following constraint:
$$
J_c(\Pi, \alpha_c) = E^{\Pi}_{\alpha_c}\left[ e^{\sum_{t=1}^{T-1} c_t(X_t, A_t, X_{t+1}) + c_T(X_T)} \right] \le B, \ \text{with}
$$
$$
c_t(x,a,x') := \gamma_c \beta_c^t\, \tilde c_t(x,a,x'), \ \text{and} \ c_T(x) = \gamma_c \beta_c^T\, \tilde c_T(x), \qquad (5)
$$
where $\alpha_c$ is the initial distribution, which can differ from $\alpha_r$, the initial state distribution corresponding to $\tilde r_t$. Thus, the overall problem is
$$
\sup_{\Pi \in \Gamma} J_r(\Pi, \alpha_r), \quad \text{subject to } J_c(\Pi, \alpha_c) \le B. \qquad (6)
$$
Let $\Gamma_c := \{\Pi \in \Gamma : J_c(\Pi, \alpha_c) \le B\}$ represent the corresponding feasible region. Throughout, we assume the existence of a solution of (6), and the aim is to design an algorithm that obtains it. Towards this, as a first step, we derive in the next section a fixed-point equation whose solution optimizes (6).
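Note that the feasibility of any fixed Markovian policy can be checked by evaluating $J_c(\Pi, \alpha_c)$ in (5) with a backward recursion analogous to (4) but without the maximization. A minimal sketch is given below, assuming the same hypothetical nested-dictionary encoding as before; the forward factors introduced in Section III accumulate essentially the same quantity forward in time.

```python
import numpy as np

def constraint_value(policy, P, c, cT, alpha_c, T):
    """Evaluate J_c(Pi, alpha_c) in (5) for a fixed Markovian policy (a sketch).

    policy[t][x] : dict {a: probability} of actions at time t, state x
    P[t][x][a]   : dict {x': probability} of next states
    c[t][x][a][x']: scaled constraint-cost c_t(x, a, x') = gamma_c * beta_c^t * \tilde c_t(x, a, x')
    cT[x]        : terminal constraint-cost c_T(x)
    alpha_c[x]   : initial state distribution
    """
    v = {x: np.exp(cT[x]) for x in cT}                   # v_T(x) = e^{c_T(x)}
    for t in range(T - 1, 0, -1):                        # backward policy evaluation
        # v_t(x) = sum_a pi_t(a|x) sum_{x'} p_t(x'|x,a) e^{c_t(x,a,x')} v_{t+1}(x')
        v = {x: sum(pa * pxn * np.exp(c[t][x][a][xn]) * v[xn]
                    for a, pa in policy[t][x].items()
                    for xn, pxn in P[t][x][a].items())
             for x in P[t]}
    return sum(alpha_c[x] * v[x] for x in alpha_c)       # J_c(Pi, alpha_c)

# Feasibility check for a candidate policy: constraint_value(...) <= B
```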
III. FIXED-POINT EQUATION
We begin this section with a few definitions. For ease of notation, we let $m_t$ represent either the immediate reward function $r_t$ or the constraint-cost function $c_t$ at time $t$, depending upon the choice $m \in \{r, c\}$. Thus $m_t(x,a,x')$ is the reward/constraint-cost at time $t$ when action $a$ chosen in state $x$ results in the new state $x'$.
A. Forward factors
For any policy $\Pi$, time $t$, reward/constraint-cost $m$, and state $x_t$, define the forward factors $\theta^{\Pi}_{m,t}(x_t)$ as follows:
$$
\theta^{\Pi}_{m,t}(x_t) := E^{\Pi}_{\alpha_m}\left[ e^{\sum_{k=1}^{t-1} m_k(X_k, A_k, X_{k+1})}\, 1_{\{X_t = x_t\}} \right]
= \sum_{x_1^{t-1},\, a_1^{t-1}} \alpha_m(x_1) \prod_{k \le t-1} d_k(x_k, a_k)\, p_k(x_{k+1}|x_k, a_k)\, e^{m_k(x_k, a_k, x_{k+1})},
$$
where $x_1^{t-1} := (x_1, \dots, x_{t-1})$ is a vector with each $x_k \in S_k$ and $a_1^{t-1} := (a_1, \dots, a_{t-1})$ is a vector with each $a_k \in A_{k,x_k}$. These factors capture the exponentiated reward/constraint-cost accumulated under policy $\Pi$ up to time $t-1$, together with the probability that $X_t = x_t$. For any $t$, $m$, it is