
1
Policy iteration: for want of recursive feasibility, all is not lost
Mathieu Granzotto, Olivier Lindamulage De Silva, Romain Postoyan, Dragan Neši´
c, and Zhong-Ping Jiang
Abstract—This paper investigates recursive feasibility,
recursive robust stability and near-optimality properties
of policy iteration (PI). For this purpose, we consider de-
terministic nonlinear discrete-time systems whose inputs
are generated by PI for undiscounted cost functions. We
first assume that PI is recursively feasible, in the sense
that the optimization problems solved at each iteration
admit a solution. In this case, we provide novel condi-
tions to establish recursive robust stability properties for
a general attractor, meaning that the policies generated at
each iteration ensure a robust KL-stability property with
respect to a general state measure. We then derive novel
explicit bounds on the mismatch between the (suboptimal)
value function returned by PI at each iteration and the
optimal one. Afterwards, motivated by a counter-example
that shows that PI may fail to be recursively feasible, we
modify PI so that recursive feasibility is guaranteed a priori
under mild conditions. This modified algorithm, called PI+,
is shown to preserve the recursive robust stability when
the attractor is compact. Additionally, PI+enjoys the same
near-optimality properties as its PI counterpart under the
same assumptions. Therefore, PI+is an attractive tool for
generating near-optimal stabilizing control of deterministic
discrete-time nonlinear systems.
I. INTRODUCTION
Policy iteration (PI) is an optimization algorithm that forms
one of the pillars of dynamic programming [2]. PI iteratively
generates control laws, also called policies, that converge to
an optimal control law for general dynamical systems and cost
functions under mild conditions, see, e.g., [2,5,15,17]. Also,
PI may exhibit the attractive feature of converging faster to
the optimal value function than its counterpart value iteration
(VI) [15] at the price of more computations. For these reasons,
PI attracts a lot of attention both in terms of theoretical
investigations see, e.g., [3,5,7,15,23,24], and practical appli-
cations e.g., [14,22,29,30]. Nevertheless, several fundamental
questions remain largely open regarding the properties of PI
in a control context: (i) its recursive feasibility; (ii) general
conditions for recursive robust stability when the attractor is
Mathieu Granzotto and Dragan Neši´
c are with the Department of
Electrical and Electronic Engineering, University of Melbourne, Parkville,
VIC 3010, Australia (e-mail: {mgranzotto, dnesic}@unimelb.edu.au).
Their work was supported by the Australian Research Council under
the Discovery Project DP210102600.
Olivier Lindamulage De Silva and Romain Postoyan are with the
Université de Lorraine, CNRS, CRAN, F-54000 Nancy, France (e-mails:
{name.surname}@univ-lorraine.fr).
Zhong-Ping Jiang is with the Department of Electrical and Computer
Engineering, New York University, 370 Jay Street, Brooklyn, NY 11201,
USA. email: zjiang@nyu.edu. His work was supported partly by the
National Science Foundation under Grant EPCN-1903781.
This work was supported by the France-Australia collaboration project
IRP-ARS CNRS.
not necessarily a single point but a more general set; (iii) near-
optimality guarantees, in particular when the cost function is
not discounted. We explain each of these challenges next.
It is essential that PI is recursively feasible in the sense that
the optimization problem admits a solution at each iteration.
Surprisingly, we have not been able to find general conditions
for the recursive feasibility of PI in the literature when
dealing with deterministic nonlinear discrete-time systems
with general cost functions, whose state and inputs evolves
on a Euclidean space. The only results we came across
concentrate on special cases like when the input set is finite
[3] or the system is linear and the cost is quadratic [2]. The
dominant approach in the literature for nonlinear discrete-
time systems on Euclidean spaces is thus to assume that the
algorithm is recursively feasible, see, e.g., [15,23], or to rely
on conditions that are hard to verify a priori in general as they
employ feasibility tests at each iteration [3]. Model predictive
control literature recognised a long time ago the importance
of recursive feasibility. Hence we believe that the recursive
feasibility of PI is a property of major importance in view
of the burgeoning literature on dynamic programming and
reinforcement learning where PI plays a major role [4,6].
A second challenge for PI is related to its application in a
control context. In many applications, the closed-loop system
must exhibit stability guarantees as: (i) it provides analytical
guarantees on the behavior of the controlled system solutions
as time evolves; (ii) it endows the system with robustness prop-
erties and is thus associated to safety considerations, see, e.g.,
[1]. Available results on the stability of systems controlled by
PI concentrate on the case where the attractor is a single point,
as in, e.g., [5,7,23,24]. They exclude set stability, which is
inevitable for instance in presence of clock or toggle variables
[9, Examples 3.1-3.2], and more generally when the desired
operating behaviour of the closed-loop system is given by a set
and not a point. Moreover, the commonly used assumptions
imposed on the plant model and the stage cost are also subject
to some conservatism, like requiring the stage cost to satisfy
positive definiteness properties. In addition, it is essential to
ensure that these stability properties are robust, which is not
automatically guaranteed, as pointed out in [12,21], and this
matter is often eluded in the literature at the exception of
the recent work in [25] in the linear quadratic case. There is
therefore a need for general conditions allowing to conclude
robust set stability properties for systems controlled by PI. We
further would like these stability properties to be preserved
at each iteration; that is, we want to ensure recursive robust
stability.
Finally, it is important to understand when and how the
sequence of value functions generated by PI at each iteration
converges to the optimal value function; we talk of near-
arXiv:2210.14459v1 [math.OC] 26 Oct 2022