
Equivalence of Optimality Criteria for Markov Decision Process and
Model Predictive Control
Arash Bahari Kordabad, Mario Zanon, Sebastien Gros

Arash Bahari Kordabad and Sebastien Gros are with the Department of Engineering Cybernetics, Norwegian University of Science and Technology (NTNU), Trondheim, Norway. Mario Zanon is with the IMT School for Advanced Studies Lucca, Italy. E-mail: arash.b.kordabad@ntnu.no, mario.zanon@imtlucca.it, sebastien.gros@ntnu.no.
Abstract—This paper shows that the optimal policy and value
functions of a Markov Decision Process (MDP), either discounted
or not, can be captured by a finite-horizon undiscounted Optimal
Control Problem (OCP), even if based on an inexact model. This
can be achieved by selecting a proper stage cost and terminal
cost for the OCP. A very useful particular case of OCP is a Model
Predictive Control (MPC) scheme where a deterministic (possibly
nonlinear) model is used to reduce the computational complexity.
This observation leads us to parameterize an MPC scheme fully,
including the cost function. In practice, Reinforcement Learning al-
gorithms can then be used to tune the parameterized MPC scheme.
We verify the developed theorems analytically for a Linear Quadratic Regulator (LQR) case and investigate further nonlinear examples in simulation.
Index Terms—Markov Decision Process, Model Predic-
tive Control, Reinforcement Learning, Optimality
I. INTRODUCTION
Markov Decision Processes (MDPs) provide a standard framework for the optimal control of discrete-time stochastic processes, where the stage cost and transition probability depend only on the current state and input of the system [1]. A control system described by an MDP receives an input at each time instant, transitions to a new state according to a given probability density, and incurs a stage cost at each transition. For an MDP, a policy is a mapping from the state space into the input space that determines how to select the input based on the observation of the current state. A policy can either be a deterministic mapping from the state space [2] or a conditional probability distribution given the current state, i.e., a stochastic policy [3]. This paper focuses on deterministic policies. Solving an MDP refers to finding an optimal policy that minimizes the expected value of a total cumulative cost as a function of the current state. The cumulative cost can be either discounted or undiscounted with respect to the time instant, and different definitions of the cumulative cost therefore yield different optimality criteria for MDPs. Dynamic Programming (DP) techniques can be used to solve MDPs based on the Bellman equations. However, solving the Bellman equations is typically intractable unless the problem is of very low dimension [4], an issue known as the "curse of dimensionality" [5]. Moreover, DP requires the exact transition probability of the MDP, while in most engineering applications we do not have access to the exact transition probability of the real system.
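For illustration, and with generic notation that may differ from the symbols adopted later in the paper (state $s$, input $a$, stage cost $\ell$, discount factor $\gamma\in(0,1]$, policy $\pi$, successor state $s_+$), the discounted optimality criterion and the associated Bellman optimality equation read
\begin{align}
V^{\pi}(s) &= \mathbb{E}\left[\left.\sum_{k=0}^{\infty}\gamma^{k}\,\ell\big(s_k,\pi(s_k)\big)\,\right|\,s_0=s\right], &
V^{\star}(s) &= \min_{a}\,\ell(s,a)+\gamma\,\mathbb{E}\left[\,V^{\star}(s_+)\,\middle|\,s,a\,\right],
\end{align}
where the expectations are taken over the state transitions. The undiscounted criterion corresponds to $\gamma=1$, in which case additional assumptions are needed for the infinite sum to be well defined.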
Reinforcement Learning (RL) [6] and approximate DP [7] are two common techniques that tackle these difficulties. RL offers powerful tools for tackling MDPs without accurate knowledge of the probability distribution underlying the state transitions. In most cases, RL requires a function approximator to capture the optimal policy or the optimal value functions underlying the MDP. A common choice of function approximator in the RL community is a Deep Neural Network (DNN) [8]. DNNs can be used to capture either the optimal policy underlying the MDP directly, or the action-value function from which the optimal policy can be indirectly extracted.
However, the formal analysis of closed-loop stability and safety of
the policies provided by approximators such as DNNs is challenging.
Moreover, DNNs usually require a large number of tunable parameters, and pre-training is often needed to obtain reasonable initial parameter values.
Model Predictive Control (MPC) is a well-known control strategy that employs a (possibly inaccurate) model of the real system dynamics to produce an input-state sequence over a given finite horizon, such that the resulting predicted state trajectory minimizes a given cost function while explicitly enforcing the input-state constraints imposed on the system trajectories [9]. For computational reasons, simple models are usually preferred in the MPC scheme. Hence, the MPC model often does not have the structure required to correctly capture the real system dynamics and stochasticity. The idea of using MPC as a function approximator for RL techniques was first justified in [10], where it was shown that the optimal policy of a discounted MDP can be captured by a discounted MPC scheme even if the model is inexact. Recently, MPC has been used in various settings to deliver a structured function approximator for MDPs (see, e.g., [10]–[12]) and partially observable MDPs [13]. Stability of discounted MPC schemes is challenging: for a finite-horizon problem, it is shown in [14] that even if the provided stage cost, terminal cost and terminal set satisfy the stability requirements, the closed loop might be unstable for some discount factors. Indeed, the discount factor plays a critical role in the stability of the closed-loop system under the optimal policy of the discounted cost. Conditions for the asymptotic stability of discounted optimal control problems have recently been developed in [15] for deterministic systems with an exact model. Therefore, an undiscounted MPC scheme is more desirable, as its closed-loop stability analysis is straightforward and well developed [9].
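As a point of reference, a generic undiscounted finite-horizon MPC scheme based on a deterministic model can be sketched as follows; the symbols used here (model $f$, stage cost $L$, terminal cost $T$, constraint function $h$, horizon $N$) are illustrative and may differ from the notation adopted in the sequel:
\begin{align}
\min_{\boldsymbol{u},\,\boldsymbol{x}}\quad & T(x_N)+\sum_{k=0}^{N-1}L(x_k,u_k)\\
\mathrm{s.t.}\quad & x_{k+1}=f(x_k,u_k),\qquad x_0=s,\\
& h(x_k,u_k)\leq 0,\qquad k=0,\ldots,N-1,
\end{align}
where $s$ is the current state of the real system. The resulting MPC policy applies the first input $u_0^{\star}$ of the optimal solution at every time instant, in a receding-horizon fashion.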
The equivalence of MDP optimality criteria (discounted and undiscounted) has recently been discussed in [16] for the case where an exact model of the MDP is available. However, in practice, the exact transition probability of the MDP might not be available, and we usually have only a (possibly inaccurate) model of the real system. This work extends the results of [16] to account for model mismatch, and it also extends the results of [10] to the case where an undiscounted MPC scheme is used to capture a (possibly discounted) MDP. More specifically, we show that, under some conditions, an undiscounted finite-horizon Optimal Control Problem (OCP) can capture the optimal policy and the optimal value functions of a given MDP, either discounted or undiscounted, even if an inexact model is used in the undiscounted OCP. As a particular case of the theorem, we then propose to formulate the undiscounted OCP as a common MPC scheme based on a deterministic (possibly nonlinear) model. By parameterizing the MPC scheme and tuning the parameters via RL algorithms, one can achieve the best approximation of the optimal policy and the optimal value functions of the original MDP within the adopted MPC structure.
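Schematically, and with purely illustrative notation, the stage and terminal costs of the MPC scheme can be parameterized by a vector $\theta$, so that the scheme delivers the policy
\begin{align}
\pi_{\theta}(s)=u_0^{\star}(s,\theta),\qquad
\boldsymbol{u}^{\star}(s,\theta)\in\arg\min_{\boldsymbol{u},\,\boldsymbol{x}}\;T_{\theta}(x_N)+\sum_{k=0}^{N-1}L_{\theta}(x_k,u_k)
\end{align}
subject to the (possibly inexact) deterministic model and the constraints; RL algorithms then adjust $\theta$ from closed-loop data so as to drive $\pi_{\theta}$ towards the optimal policy of the underlying MDP.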
The paper is structured as follows. Section II provides the formulation of MDPs under discounted and undiscounted optimality criteria. Section III provides formal statements showing that, using cost modification in a finite-horizon undiscounted OCP, one is able to capture the optimal value functions and the optimal policy of the real system with discounted and undiscounted cost even with a wrong