
Equivalence of Optimality Criteria for Markov Decision Process and
Model Predictive Control
Arash Bahari Kordabad, Mario Zanon, Sebastien Gros

Arash Bahari Kordabad and Sebastien Gros are with the Department of Engineering Cybernetics, Norwegian University of Science and Technology (NTNU), Trondheim, Norway. Mario Zanon is with the IMT School for Advanced Studies Lucca, Italy. E-mail: arash.b.kordabad@ntnu.no, mario.zanon@imtlucca.it, sebastien.gros@ntnu.no.
Abstract—This paper shows that the optimal policy and value
functions of a Markov Decision Process (MDP), either discounted
or not, can be captured by a finite-horizon undiscounted Optimal
Control Problem (OCP), even if based on an inexact model. This
can be achieved by selecting a proper stage cost and terminal
cost for the OCP. A very useful particular case of OCP is a Model
Predictive Control (MPC) scheme where a deterministic (possibly
nonlinear) model is used to reduce the computational complexity.
This observation leads us to parameterize an MPC scheme fully,
including the cost function. In practice, Reinforcement Learning al-
gorithms can then be used to tune the parameterized MPC scheme.
We verify the developed theorems analytically for a Linear Quadratic Regulator (LQR) case and investigate further nonlinear examples in simulation.
Index Terms—Markov Decision Process, Model Predic-
tive Control, Reinforcement Learning, Optimality
I. INTRODUCTION
Markov Decision Processes (MDPs) provide a standard framework for the optimal control of discrete-time stochastic processes, where the stage cost and transition probability depend only on the current state and input of the system [1]. A control system described by an MDP receives an input at each time instant, transitions to a new state according to a given probability density, and incurs a stage cost at each transition. For an MDP, a policy is a mapping from the state space into the input space that determines how to select the input based on the observation of the current state. A policy can either be a deterministic mapping from the state space [2] or a conditional probability distribution given the current state, i.e., a stochastic policy [3]. This paper focuses on deterministic policies. Solving an MDP refers to finding an optimal policy that minimizes the expected value of a total cumulative cost as a function of the current state. The cumulative cost can be either discounted or undiscounted with respect to the time instant, and different definitions of the cumulative cost therefore yield different optimality criteria for MDPs. Dynamic Programming (DP) techniques can be used to solve MDPs based on the Bellman equations. However, solving the Bellman equations is typically intractable unless the problem is of very low dimension [4], an issue known as the "curse of dimensionality" [5]. Moreover, DP requires the exact transition probability of the MDP, while in most engineering applications we do not have access to the exact transition probability of the real system.
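For illustration, and with generic notation that may differ from the symbols adopted later in the paper (state $s$, input $a$, stage cost $\ell$, discount factor $\gamma\in(0,1]$, policy $\pi$, successor state $s_+$), the discounted optimality criterion and the associated Bellman optimality equation read
\begin{align}
V^{\pi}(s) &= \mathbb{E}\left[\left.\sum_{k=0}^{\infty}\gamma^{k}\,\ell\big(s_k,\pi(s_k)\big)\,\right|\,s_0=s\right], &
V^{\star}(s) &= \min_{a}\,\ell(s,a)+\gamma\,\mathbb{E}\left[\,V^{\star}(s_+)\,\middle|\,s,a\,\right],
\end{align}
where the expectations are taken over the state transitions. The undiscounted criterion corresponds to $\gamma=1$, in which case additional assumptions are needed for the infinite sum to be well defined.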
Reinforcement Learning (RL) [6] and approximate DP [7] are two common techniques that tackle these difficulties. RL offers powerful tools for tackling MDPs without accurate knowledge of the probability distribution underlying the state transitions. In most cases, RL requires a function approximator to capture the optimal policy or the optimal value functions underlying the MDP. A common choice of function approximator in the RL community is a Deep Neural Network (DNN) [8]. DNNs can be used to capture either the optimal policy underlying the MDP directly, or the action-value function from which the optimal policy can be indirectly extracted.
However, the formal analysis of closed-loop stability and safety of
the policies provided by approximators such as DNNs is challenging.
Moreover, DNNs usually require a large number of tunable parameters, and pre-training is often needed to obtain reasonable initial parameter values.
Model Predictive Control (MPC) is a well-known control strategy that employs a (possibly inaccurate) model of the real system dynamics to produce an input-state sequence over a given finite horizon, such that the resulting predicted state trajectory minimizes a given cost function while explicitly enforcing the input-state constraints imposed on the system trajectories [9]. For computational reasons, simple models are usually preferred in the MPC scheme. Hence, the MPC model often does not have the structure required to correctly capture the real system dynamics and stochasticity. The idea of using MPC as a function approximator for RL techniques was first justified in [10], where it was shown that the optimal policy of a discounted MDP can be captured by a discounted MPC scheme even if the model is inexact. Recently, MPC has been used in various settings to deliver a structured function approximator for MDPs (see, e.g., [10]–[12]) and partially observable MDPs [13]. Stability of discounted MPC schemes is challenging: for a finite-horizon problem, it is shown in [14] that even if the provided stage cost, terminal cost and terminal set satisfy the stability requirements, the closed loop might be unstable for some discount factors. Indeed, the discount factor plays a critical role in the stability of the closed-loop system under the optimal policy of the discounted cost. Conditions for the asymptotic stability of discounted optimal control problems have recently been developed in [15] for deterministic systems with an exact model. Therefore, an undiscounted MPC scheme is more desirable, as its closed-loop stability analysis is straightforward and well developed [9].
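As a point of reference, a generic undiscounted finite-horizon MPC scheme based on a deterministic model can be sketched as follows; the symbols used here (model $f$, stage cost $L$, terminal cost $T$, constraint function $h$, horizon $N$) are illustrative and may differ from the notation adopted in the sequel:
\begin{align}
\min_{\boldsymbol{u},\,\boldsymbol{x}}\quad & T(x_N)+\sum_{k=0}^{N-1}L(x_k,u_k)\\
\mathrm{s.t.}\quad & x_{k+1}=f(x_k,u_k),\qquad x_0=s,\\
& h(x_k,u_k)\leq 0,\qquad k=0,\ldots,N-1,
\end{align}
where $s$ is the current state of the real system. The resulting MPC policy applies the first input $u_0^{\star}$ of the optimal solution at every time instant, in a receding-horizon fashion.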
The equivalence of MDP optimality criteria (discounted and undiscounted) has recently been discussed in [16] for the case where an exact model of the MDP is available. However, in practice, the exact transition probability of the MDP might not be available, and we usually have only a (possibly inaccurate) model of the real system. This work extends the results of [16] to account for model mismatch, and it also extends the results of [10] to the case where an undiscounted MPC scheme is used to capture a (possibly discounted) MDP. More specifically, we show that, under some conditions, an undiscounted finite-horizon Optimal Control Problem (OCP) can capture the optimal policy and the optimal value functions of a given MDP, either discounted or undiscounted, even if an inexact model is used in the undiscounted OCP. As a particular case of the theorem, we then propose to formulate the undiscounted OCP as a common MPC scheme based on a deterministic (possibly nonlinear) model. By parameterizing the MPC scheme and tuning the parameters via RL algorithms, one can achieve the best approximation of the optimal policy and the optimal value functions of the original MDP within the adopted MPC structure.
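Schematically, and with purely illustrative notation, the stage and terminal costs of the MPC scheme can be parameterized by a vector $\theta$, so that the scheme delivers the policy
\begin{align}
\pi_{\theta}(s)=u_0^{\star}(s,\theta),\qquad
\boldsymbol{u}^{\star}(s,\theta)\in\arg\min_{\boldsymbol{u},\,\boldsymbol{x}}\;T_{\theta}(x_N)+\sum_{k=0}^{N-1}L_{\theta}(x_k,u_k)
\end{align}
subject to the (possibly inexact) deterministic model and the constraints; RL algorithms then adjust $\theta$ from closed-loop data so as to drive $\pi_{\theta}$ towards the optimal policy of the underlying MDP.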
The paper is structured as follows. Section II provides the formulation of MDPs under discounted and undiscounted optimality criteria. Section III provides formal statements showing that, using cost modification in a finite-horizon undiscounted OCP, one is able to capture the optimal value functions and the optimal policy of the real system with discounted and undiscounted cost even with a wrong