Equivalence of Optimality Criteria for Markov Decision Process and
Model Predictive Control
Arash Bahari Kordabad, Mario Zanon, Sebastien Gros
Abstract: This paper shows that the optimal policy and value functions of a Markov Decision Process (MDP), either discounted or not, can be captured by a finite-horizon undiscounted Optimal Control Problem (OCP), even if based on an inexact model. This can be achieved by selecting a proper stage cost and terminal cost for the OCP. A very useful particular case of OCP is a Model Predictive Control (MPC) scheme where a deterministic (possibly nonlinear) model is used to reduce the computational complexity. This observation leads us to parameterize an MPC scheme fully, including the cost function. In practice, Reinforcement Learning algorithms can then be used to tune the parameterized MPC scheme. We verify the developed theorems analytically in an LQR case and we investigate some other nonlinear examples in simulations.
Index Terms: Markov Decision Process, Model Predictive Control, Reinforcement Learning, Optimality
I. INTRODUCTION
Markov Decision Processes (MDPs) provide a standard framework
for the optimal control of discrete-time stochastic processes, where
the stage cost and transition probability depend only on the current
state and the current input of the system [1]. A control system described by an MDP receives an input at each time instant, proceeds to a new state according to a given probability density, and incurs a stage cost at each transition. For an MDP, a
policy is a mapping from the state space into the input space and
determines how to select the input based on the observation of the
current state. This policy can either be a deterministic mapping from the state space [2] or a conditional probability distribution given the current state, describing a stochastic policy [3]. This paper focuses on deterministic
policies. Solving an MDP refers to finding an optimal policy that
minimizes the expected value of a total cumulative cost as a function
of the current state. The cumulative cost can be either discounted
or undiscounted with respect to the time instant. Therefore, different definitions of the cumulative cost yield different optimality criteria for the MDP. Dynamic Programming (DP) techniques can be used to solve MDPs based on the Bellman equations. However, solving the Bellman equations is typically intractable unless the problem is of very low dimension [4]. This issue is known as the "curse of dimensionality" in the literature [5]. Besides, DP requires the exact transition probability of the MDP, while in most engineering applications, we do not have access to the exact transition probability of the real system.
Reinforcement Learning (RL) [6] and approximate DP [7] are two
common techniques that tackle these difficulties. RL offers powerful tools for tackling MDPs without requiring accurate knowledge of the probability distribution underlying the state transition. In most
cases, RL requires a function approximator to capture the optimal
policy or the optimal value functions underlying the MDP. A common
choice of function approximator in the RL community is to use a
Deep Neural Network (DNN) [8]. DNNs can be used to capture either
the optimal policy underlying the MDP directly or the action-value function from which the optimal policy can be indirectly extracted.
(Arash Bahari Kordabad and Sebastien Gros are with the Department of Engineering Cybernetics, Norwegian University of Science and Technology (NTNU), Trondheim, Norway. Mario Zanon is with the IMT School for Advanced Studies Lucca, Italy. E-mail: arash.b.kordabad@ntnu.no, mario.zanon@imtlucca.it and sebastien.gros@ntnu.no.)
However, the formal analysis of closed-loop stability and safety of
the policies provided by approximators such as DNNs is challenging.
Moreover, DNNs usually need a large number of tunable parameters, and pre-training is often required so that the initial values of the parameters are reasonable.
Model Predictive Control (MPC) is a well-known control strategy
that employs a (possibly inaccurate) model of the real system dynamics
to produce an input-state sequence over a given finite-horizon such
that the resulting predicted state trajectory minimizes a given cost
function while explicitly enforcing the input-state constraints imposed
on the system trajectories [9]. For computational reasons, simple
models are usually preferred in the MPC scheme. Hence, the MPC
model often does not have the structure required to correctly capture
the real system dynamics and stochasticity. The idea of using MPC as
a function approximator for RL techniques was first justified in [10],
where it was shown that the optimal policy of a discounted MDP
can be captured by a discounted MPC scheme even if the model is
inexact. Recently, MPC has been used in different systems to deliver
a structured function approximator for MDPs (see e.g., [10]–[12])
and partially observable MDPs [13]. Stability analysis for discounted MPC schemes is challenging: for a finite-horizon problem, it is shown in [14] that even if the provided stage cost, terminal cost and terminal set satisfy the stability requirements, the closed-loop system might be unstable for some discount factors. Indeed, the discount factor plays a critical role in the stability of the closed-loop system under the optimal policy of the discounted cost. Conditions for the asymptotic stability of discounted optimal control problems have recently been developed in [15] for deterministic systems with an exact model. Therefore, an undiscounted MPC scheme is more desirable, as its closed-loop stability analysis is straightforward and well developed [9].
The equivalence of MDP optimality criteria (discounted and undiscounted) has been recently discussed in [16] for the case where an exact model of the MDP is available. However, in practice, the exact transition probability of the MDP might not be available, and we usually only have a (possibly inaccurate) model of the real system. This work extends the results of [16] to the case of model mismatch, and also extends the results of [10] to the case where an undiscounted MPC scheme is used to capture a (possibly discounted) MDP. More specifically, we show
that, under some conditions, an undiscounted finite-horizon Optimal
Control Problem (OCP) can capture the optimal policy and the optimal
value functions of a given MDP, either discounted or undiscounted,
even if an inexact model is used in the undiscounted OCP. We then
propose to use a deterministic (possibly nonlinear) MPC scheme as a
particular case of the theorem to formulate the undiscounted OCP as
a common MPC scheme. By parameterizing the MPC scheme and tuning the parameters via RL algorithms, one can achieve the best approximation of the optimal policy and the optimal value functions of the original MDP within the adopted MPC structure.
The paper is structured as follows. Section II provides the
formulation of MDPs under discounted and undiscounted optimality
criteria. Section III provides formal statements showing that, using a cost modification in a finite-horizon undiscounted OCP, one is able to capture the optimal value function and optimal policy of the real system with discounted or undiscounted cost, even with a wrong model. Section IV presents a parameterized MPC scheme as a special case of the undiscounted OCP, where the model is deterministic (i.e., the probability transition is a Dirac measure). The parameters can then be tuned using RL techniques. Section V provides an analytical LQR example. Section VI illustrates different numerical simulations. Finally, Section VII delivers the conclusions.
II. REAL SYSTEM

In this section, we formulate the real system as a Markov Decision Process (MDP). We consider an MDP on continuous state and input spaces over $\mathbb{R}^n$ and $\mathbb{R}^m$, respectively, with stochastic states $s_k \in \mathcal{X} \subseteq \mathbb{R}^n$ in the Lebesgue-measurable set $\mathcal{X}$ and inputs $a_k \in \mathcal{U} \subseteq \mathbb{R}^m$. The triple $(\Omega, \mathcal{F}, \rho)$ defines the probability space associated with the Markov chain, where $\Omega = \Pi_{k=0}^{\infty}\, \mathcal{X}$ with associated $\sigma$-field $\mathcal{F}$, and $\rho$ is the probability measure. We then consider stochastic dynamics defined by the following conditional probability measure:
$$\rho\left[s_{k+1} \,\middle|\, s_k, a_k\right], \qquad (1)$$
defining the conditional probability of observing a transition from a given state-action pair $s_k, a_k$ to a subsequent state $s_{k+1}$. The input $a$ applied to the system for a given state $s$ is selected by a deterministic policy $\pi : \mathcal{X} \rightarrow \mathcal{U}$. We denote $s^{\pi}_{0,1,\ldots}$ the (possibly stochastic) trajectories of the system (1) under policy $\pi$, i.e., $s^{\pi}_{k+1} \sim \rho\left[\cdot \,\middle|\, s^{\pi}_k, \pi(s^{\pi}_k)\right]$, starting from $s^{\pi}_0 = s$, $\forall \pi$. We further denote the measure associated with such trajectories as $\tau^{\pi}_k$, in the same space as $\rho$. More specifically, $\tau^{\pi}_0(\cdot) = \rho_0(\cdot)$, $\forall \pi$, where $\rho_0(\cdot)$ is the initial state distribution, and $\tau^{\pi}_{k+1}(\cdot) := \int_{\mathcal{X}} \rho\left[\cdot \,\middle|\, s, \pi(s)\right] \tau^{\pi}_k(\mathrm{d}s)$, $\forall k > 0$.
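To make the closed-loop measures concrete, the following sketch simulates a Markov chain of the form (1) under a deterministic policy and estimates $\tau^{\pi}_k$ by Monte Carlo. The scalar dynamics, noise level and linear policy are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Hypothetical scalar dynamics s_{k+1} = 0.9*s_k + a_k + w_k, w_k ~ N(0, 0.1^2),
# playing the role of the conditional measure rho[s_{k+1} | s_k, a_k] in (1).
def sample_next_state(s, a, rng):
    return 0.9 * s + a + 0.1 * rng.standard_normal()

# A deterministic policy pi : X -> U (an arbitrary linear feedback for illustration).
def pi(s):
    return -0.5 * s

rng = np.random.default_rng(0)
n_traj, horizon = 5000, 20
s0 = 1.0                                  # initial state s (tau_0 is a point mass here)

# Roll out closed-loop trajectories s^pi_k; the empirical distribution of the
# states at time k approximates the closed-loop measure tau^pi_k.
states = np.full(n_traj, s0)
for _ in range(horizon):
    states = np.array([sample_next_state(s, pi(s), rng) for s in states])

print("MC estimate of E_{tau^pi_20}[s]  :", states.mean())
print("MC estimate of Var_{tau^pi_20}[s]:", states.var())
```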
A. Discounted MDPs

In the discounted setting, we aim to find the optimal policy $\pi^{\star}$, solution of the following discounted infinite-horizon OCP:
$$V^{\star}(s) := \min_{\pi} V^{\pi}(s) := \mathbb{E}_{\tau^{\pi}}\left[\sum_{k=0}^{\infty} \gamma^{k}\, \ell\left(s^{\pi}_k, \pi(s^{\pi}_k)\right)\right], \qquad (2)$$
for all initial states $s^{\pi}_0 = s$, where $V^{\star} : \mathcal{X} \rightarrow \mathbb{R}$ is the optimal value function, $V^{\pi}$ is the value function of the Markov chain in closed loop with policy $\pi$, $\ell : \mathcal{X} \times \mathcal{U} \rightarrow \mathbb{R}$ is the stage cost function of the real system and $\gamma \in (0,1]$ is the discount factor. The expectation $\mathbb{E}_{\tau^{\pi}}$ is taken over the distribution underlying the Markov chain (1) in closed loop with policy $\pi$, i.e., $s_k \sim \tau^{\pi}_k(\cdot)$ for $k > 0$. The action-value function $Q^{\star}(s,a)$ and advantage function $A^{\star}(s,a)$ associated to (2) are defined as follows:
$$Q^{\star}(s,a) := \ell(s,a) + \gamma\, \mathbb{E}_{\rho}\left[V^{\star}(s_{+}) \,\middle|\, s, a\right], \qquad \text{(3a)}$$
$$A^{\star}(s,a) := Q^{\star}(s,a) - V^{\star}(s). \qquad \text{(3b)}$$
Then from the Bellman equation, we have the following identities:
$$V^{\star}(s) = Q^{\star}(s, \pi^{\star}(s)) = \min_{a} Q^{\star}(s,a), \quad \forall s \in \mathcal{X}, \qquad \text{(4a)}$$
$$0 = \min_{a} A^{\star}(s,a), \quad \pi^{\star}(s) \in \arg\min_{a} A^{\star}(s,a), \quad \forall s \in \mathcal{X}. \qquad \text{(4b)}$$
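As a sanity check of the definitions (2)-(4), the sketch below runs standard value iteration on a small randomly generated finite MDP (a hypothetical 3-state, 2-action example, not from the paper) and verifies that $\min_a A^{\star}(s,a) = 0$ for every state, as stated in (4b).

```python
import numpy as np

# Hypothetical finite MDP: 3 states, 2 actions (illustration only).
# P[a, s, s'] is the transition probability, ell[s, a] the stage cost, gamma < 1.
rng = np.random.default_rng(1)
n_s, n_a, gamma = 3, 2, 0.9
P = rng.random((n_a, n_s, n_s))
P /= P.sum(axis=2, keepdims=True)
ell = rng.random((n_s, n_a))

# Value iteration: V <- min_a [ ell(s,a) + gamma * E[V(s+) | s,a] ], cf. (2), (4a).
V = np.zeros(n_s)
for _ in range(2000):
    Q = ell + gamma * np.einsum("asx,x->sa", P, V)   # Q*(s,a) as in (3a)
    V_new = Q.min(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-12:
        V = V_new
        break
    V = V_new

A = Q - V[:, None]            # advantage A*(s,a) = Q*(s,a) - V*(s), cf. (3b)
pi_star = Q.argmin(axis=1)    # greedy policy, cf. (4a)

print("V*            :", V)
print("min_a A*(s,a) :", A.min(axis=1))   # ~0 for all s, cf. (4b)
print("pi*(s)        :", pi_star)
```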
B. Undiscounted MDPs

Undiscounted MDPs refer to MDPs when $\gamma = 1$. In this case $V^{\star}$ is in general unbounded and the MDP is ill-posed. In order to tackle this issue, alternative optimality criteria are needed. Gain optimality is one of the common criteria in the undiscounted setting. Gain optimality is defined based on the following average-cost problem:
$$\bar{V}^{\star}(s) := \min_{\pi} \lim_{N \rightarrow \infty} \frac{1}{N}\, \mathbb{E}_{\tau^{\pi}}\left[\sum_{k=0}^{N-1} \ell\left(s^{\pi}_k, \pi(s^{\pi}_k)\right)\right], \qquad (5)$$
for all initial states $s^{\pi}_0 = s$, $\forall \pi$, where $\bar{V}^{\star}$ is the optimal average cost. We denote the optimal policy solution of (5) as $\bar{\pi}^{\star}$. This optimal policy is called gain optimal. The gain-optimal policy $\bar{\pi}^{\star}$ may not be unique. Moreover, the optimal average cost $\bar{V}^{\star}$ is commonly assumed to be independent of the initial state $s$ [17]. This assumption holds, e.g., for unichain MDPs, in which under any policy any state can be reached in finite time from any other state. Unfortunately, the gain optimality criterion only considers the optimal steady-state distribution and it overlooks transients. As an alternative, bias optimality considers the optimality of the transients. Precisely, bias optimality can be formulated through the following OCP:
$$\tilde{V}^{\star}(s) = \min_{\pi}\, \mathbb{E}_{\tau^{\pi}}\left[\sum_{k=0}^{\infty} \left(\ell\left(s^{\pi}_k, \pi(s^{\pi}_k)\right) - \bar{V}^{\star}\right)\right], \qquad (6)$$
where $\tilde{V}^{\star}$ is the optimal value function associated to bias optimality. Note that (6) can be seen as a special case of the discounted setting in (2) when $\gamma = 1$ and the optimal average cost $\bar{V}^{\star}$ is subtracted from the stage cost in (2). Therefore, for the rest of the paper we will consider the discounted setting (2). Without loss of generality, we assume that $\bar{V}^{\star} = 0$ in the case $\gamma = 1$. This choice yields a well-posed optimal value function in the undiscounted setting. Clearly, if this does not hold, one can shift the stage cost to achieve $\bar{V}^{\star} = 0$.
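The gain and bias criteria can be illustrated numerically for a fixed policy on a small finite Markov chain. In the sketch below (a hypothetical 3-state chain, not from the paper), the average cost in (5) is the stationary expectation of the stage cost, and subtracting it from the stage cost as in (6) makes the truncated infinite sum converge.

```python
import numpy as np

# Hypothetical 3-state Markov chain in closed loop with a fixed policy pi:
# P_pi[s, s'] is the closed-loop transition matrix, c_pi[s] = ell(s, pi(s)).
rng = np.random.default_rng(2)
n_s = 3
P_pi = rng.random((n_s, n_s))
P_pi /= P_pi.sum(axis=1, keepdims=True)
c_pi = rng.random(n_s)

# Stationary distribution mu of the (unichain) Markov chain: mu = mu @ P_pi.
w, vecs = np.linalg.eig(P_pi.T)
mu = np.real(vecs[:, np.argmin(np.abs(w - 1.0))])
mu /= mu.sum()

# Gain (average cost), cf. (5): independent of the initial state for unichain MDPs.
V_bar = float(mu @ c_pi)

# Bias value, cf. (6): accumulate the shifted stage cost ell - V_bar along the chain,
# truncating the infinite sum at a large K as a numerical approximation.
K = 5000
V_tilde = np.zeros(n_s)
dist = np.eye(n_s)            # row s = distribution of s_k given s_0 = s
for _ in range(K):
    V_tilde += dist @ (c_pi - V_bar)
    dist = dist @ P_pi

print("gain V_bar  :", V_bar)
print("bias V_tilde:", V_tilde)
```

For this fixed policy the role of the shift is visible directly: without subtracting the average cost, the partial sums grow linearly in K, whereas the shifted sums converge to a finite bias value.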
III. MODEL OF THE SYSTEM

In general, we may not have full knowledge of the transition probability of the real MDP (1). One then typically considers an imperfect model of the real MDP (1), having the state transition
$$\hat{\rho}\left[s_{k+1} \,\middle|\, s_k, a_k\right], \qquad (7)$$
in the same space as $\rho$. In order to distinguish it from the real system trajectory, let us denote $\hat{s}^{\pi}_{0,1,\ldots}$ the (possibly stochastic) trajectories of the state transition model (7) under policy $\pi$, i.e., $\hat{s}^{\pi}_{k+1} \sim \hat{\rho}\left[\cdot \,\middle|\, \hat{s}^{\pi}_k, \pi(\hat{s}^{\pi}_k)\right]$, starting from $\hat{s}^{\pi}_0 = s$, $\forall \pi$. We further denote the measure associated with such trajectories as $\hat{\tau}^{\pi}$. In general, $\hat{\cdot}$ refers to the notation related to the imperfect model of the system in this paper. It has been shown in [18] that proving closed-loop stability of Markov chains under the optimal policy resulting from an undiscounted OCP is more straightforward than in the discounted setting [16]. This observation is well known in MPC of deterministic systems [19]. Therefore, in this paper, we are interested in using an undiscounted OCP for the model (7) in order to extract the optimal policy and optimal value functions of the real system (1), as this allows us to enforce stability guarantees.
A. Finite-horizon OCP

While MPC allows one to introduce stability and safety guarantees, it also requires a model of the real system which is bound to be imperfect, and it optimizes the cost over a finite horizon with a unitary discount factor. In other words, MPC is an MDP based on the imperfect system model (7), which we will formulate in (8). In this section we will prove that these differences between the MPC formulation and the original MDP formulation do not hinder the ability to obtain the optimal policy and the optimal value functions of the real system through MPC. Consider the following undiscounted finite-horizon OCP associated to model (7):
$$\hat{V}^{\star}_N(s) = \min_{\pi} \hat{V}^{\pi}_N(s) := \mathbb{E}_{\hat{\tau}^{\pi}}\left[\hat{T}\left(\hat{s}^{\pi}_N\right) + \sum_{k=0}^{N-1} \hat{L}\left(\hat{s}^{\pi}_k, \pi(\hat{s}^{\pi}_k)\right)\right], \qquad (8)$$
with initial state $\hat{s}^{\pi}_0 = s$, where $N \in \mathbb{N}$ is the horizon length, and $\hat{T}$, $\hat{L}$, $\hat{V}^{\star}_N$ and $\hat{V}^{\pi}_N$ are the terminal cost, the stage cost, the optimal value function and the value function under policy $\pi$ associated to the model (7), respectively.
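As a concrete, hypothetical instance of (8) with a deterministic model (the MPC setting developed in Section IV), the sketch below solves the undiscounted finite-horizon OCP for a scalar linear model with quadratic stage and terminal costs, using a generic nonlinear programming routine. The model f_hat, the horizon and the cost weights are illustrative assumptions, not the paper's choices.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical deterministic model s_{k+1} = f_hat(s_k, a_k), standing in for (7)
# when the transition measure rho_hat is a Dirac measure.
def f_hat(s, a):
    return 0.9 * s + a

N = 10                              # horizon length N
w_s, w_a, w_T = 1.0, 0.1, 5.0       # illustrative stage / terminal cost weights

def L_hat(s, a):                    # stage cost L_hat in (8)
    return w_s * s**2 + w_a * a**2

def T_hat(s):                       # terminal cost T_hat in (8)
    return w_T * s**2

def ocp_cost(a_seq, s0):
    """Undiscounted finite-horizon cost in (8) for an input sequence a_0..a_{N-1}."""
    s, J = s0, 0.0
    for a in a_seq:
        J += L_hat(s, a)
        s = f_hat(s, a)
    return J + T_hat(s)

def mpc_policy(s0):
    """Solve the OCP from state s0 and apply the first input (the MPC policy)."""
    res = minimize(ocp_cost, np.zeros(N), args=(s0,), method="L-BFGS-B")
    return res.x[0]

print("MPC input at s = 1.0:", mpc_policy(1.0))
```

In the spirit of the paper, the weights defining the stage and terminal costs (here w_s, w_a, w_T) are exactly the kind of parameters that can be exposed to an RL algorithm and tuned so that the scheme approximates the optimal policy and value functions of the real MDP.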