Policy iteration: for want of recursive feasibility, all is not lost

Mathieu Granzotto, Olivier Lindamulage De Silva, Romain Postoyan, Dragan Nešić, and Zhong-Ping Jiang
Abstract—This paper investigates recursive feasibility, recursive robust stability and near-optimality properties of policy iteration (PI). For this purpose, we consider deterministic nonlinear discrete-time systems whose inputs are generated by PI for undiscounted cost functions. We first assume that PI is recursively feasible, in the sense that the optimization problems solved at each iteration admit a solution. In this case, we provide novel conditions to establish recursive robust stability properties for a general attractor, meaning that the policies generated at each iteration ensure a robust KL-stability property with respect to a general state measure. We then derive novel explicit bounds on the mismatch between the (suboptimal) value function returned by PI at each iteration and the optimal one. Afterwards, motivated by a counter-example that shows that PI may fail to be recursively feasible, we modify PI so that recursive feasibility is guaranteed a priori under mild conditions. This modified algorithm, called PI+, is shown to preserve the recursive robust stability when the attractor is compact. Additionally, PI+ enjoys the same near-optimality properties as its PI counterpart under the same assumptions. Therefore, PI+ is an attractive tool for generating near-optimal stabilizing control of deterministic discrete-time nonlinear systems.
I. INTRODUCTION
Policy iteration (PI) is an optimization algorithm that forms
one of the pillars of dynamic programming [2]. PI iteratively
generates control laws, also called policies, that converge to
an optimal control law for general dynamical systems and cost
functions under mild conditions, see, e.g., [2,5,15,17]. Also,
PI may exhibit the attractive feature of converging faster to
the optimal value function than its counterpart value iteration
(VI) [15] at the price of more computations. For these reasons,
PI attracts a lot of attention both in terms of theoretical investigations, see, e.g., [3,5,7,15,23,24], and practical applications, see, e.g., [14,22,29,30]. Nevertheless, several fundamental questions remain largely open regarding the properties of PI in a control context: (i) its recursive feasibility; (ii) general conditions for recursive robust stability when the attractor is not necessarily a single point but a more general set; (iii) near-optimality guarantees, in particular when the cost function is not discounted. We explain each of these challenges next.

Mathieu Granzotto and Dragan Nešić are with the Department of Electrical and Electronic Engineering, University of Melbourne, Parkville, VIC 3010, Australia (e-mail: {mgranzotto, dnesic}@unimelb.edu.au). Their work was supported by the Australian Research Council under the Discovery Project DP210102600. Olivier Lindamulage De Silva and Romain Postoyan are with the Université de Lorraine, CNRS, CRAN, F-54000 Nancy, France (e-mails: {name.surname}@univ-lorraine.fr). Zhong-Ping Jiang is with the Department of Electrical and Computer Engineering, New York University, 370 Jay Street, Brooklyn, NY 11201, USA (e-mail: zjiang@nyu.edu). His work was supported partly by the National Science Foundation under Grant EPCN-1903781. This work was supported by the France-Australia collaboration project IRP-ARS CNRS.
It is essential that PI is recursively feasible in the sense that
the optimization problem admits a solution at each iteration.
Surprisingly, we have not been able to find general conditions
for the recursive feasibility of PI in the literature when
dealing with deterministic nonlinear discrete-time systems with general cost functions, whose states and inputs evolve on a Euclidean space. The only results we came across concentrate on special cases, such as when the input set is finite [3] or the system is linear and the cost is quadratic [2]. The dominant approach in the literature for nonlinear discrete-time systems on Euclidean spaces is thus to assume that the algorithm is recursively feasible, see, e.g., [15,23], or to rely on conditions that are hard to verify a priori in general as they employ feasibility tests at each iteration [3]. The model predictive control literature recognized the importance of recursive feasibility long ago. We believe that the recursive feasibility of PI is a property of similarly major importance in view of the burgeoning literature on dynamic programming and reinforcement learning, where PI plays a major role [4,6].
A second challenge for PI is related to its application in a
control context. In many applications, the closed-loop system
must exhibit stability guarantees as: (i) it provides analytical
guarantees on the behavior of the controlled system solutions
as time evolves; (ii) it endows the system with robustness prop-
erties and is thus associated to safety considerations, see, e.g.,
[1]. Available results on the stability of systems controlled by
PI concentrate on the case where the attractor is a single point,
as in, e.g., [5,7,23,24]. They exclude set stability, which is inevitable for instance in the presence of clock or toggle variables [9, Examples 3.1-3.2], and more generally when the desired operating behaviour of the closed-loop system is given by a set and not a point. Moreover, the commonly used assumptions imposed on the plant model and the stage cost are also subject to some conservatism, like requiring the stage cost to satisfy positive definiteness properties. In addition, it is essential to ensure that these stability properties are robust, which is not automatically guaranteed, as pointed out in [12,21]; this matter is often overlooked in the literature, with the exception of the recent work in [25] in the linear quadratic case. There is therefore a need for general conditions allowing one to conclude robust set stability properties for systems controlled by PI. We further would like these stability properties to be preserved at each iteration; that is, we want to ensure recursive robust stability.
Finally, it is important to understand when and how the
sequence of value functions generated by PI at each iteration
converges to the optimal value function; we talk of near-optimality guarantees.

arXiv:2210.14459v1 [math.OC] 26 Oct 2022

The literature stands in two ways on
this issue. On the one hand, the value functions are known
to converge monotonically and point-wisely to the optimal
value function under mild conditions on the model and the
cost [5,15]; uniform convergence properties are only ensured, as far as we know, for discounted cost functions [2]. On
the other hand, it is important to be able to evaluate the
mismatch between the returned value function at each iteration
and the optimal one. These computable or explicit bounds
on the mismatch of generated value functions are vital to
decide when to stop iterating the algorithm. Existing results
concentrate on discounted costs [2], which are not always natural in control applications. In the discounted setting, the provided near-optimality bounds blow up as the discount factor γ ∈ (0, 1) converges to one [2], while, by contrast, γ should be close enough to one to ensure stability in view of [11,26]. Hence, there is a need for stronger near-optimality
guarantees for PI. In particular, we seek to develop results that
provide computable near-optimality bounds, without relying
on a discount factor, and to provide conditions under which
the sequence of constructed value functions satisfies a uniform
monotonic convergence property towards the optimal one.
In this context, we consider deterministic nonlinear discrete-
time systems, whose inputs are generated by PI for an undis-
counted infinite-horizon cost function. We first assume that
PI is recursively feasible and we provide general conditions inspired by the model predictive control literature [13] to
ensure the recursive robust stability of the closed-loop system
at each iteration, where the attractor is a set. These conditions
relate to the detectability of the system with respect to the
stage cost and the stabilizing property of the initial policy.
We then exploit these stability properties to derive explicit
and computable bounds on the mismatch between the optimal
value function and the value function obtained by PI at each
iteration. We also show that the sequence of value functions
satisfies a uniform convergence property towards the optimal
value function by exploiting stability.
Afterwards, we show via a counter-example that PI may
actually fail to be recursively feasible under commonly used
assumptions. We thus propose to modify the original formu-
lation of PI so that we can guarantee the recursive feasibility
of the algorithm under mild conditions on the model, the
stage cost and the input set. We call this new algorithm PI plus (PI+). PI+ differs from PI in two aspects. First, an (outer semicontinuous) regularization is performed at the so-called improvement step, which is a common technique in the discontinuous/hybrid systems literature [9]. Second, instead of letting the algorithm select any policy that minimizes the (regularized) improvement step, we select any of those that generates the smallest value of the cost we aim to minimize, thereby requiring an extra layer of computation. In this paper, we do not address the question of the practical implementation of PI+, which is left for future work. Instead, we concentrate on the methodological challenges raised by the algorithm. We then prove that PI+ is indeed recursively feasible, and that it preserves the recursive robust stability when the attractor is compact, as well as the near-optimality properties established for PI. Compared to our preliminary work in [10], novel elements include the results for PI, the robust stability analysis, the fact that the admissible input set can be state-dependent, and new technical developments to derive less conservative near-optimality bounds. Moreover, the full proofs are provided.
The rest of the paper is organized as follows. Preliminaries
are given in Section II. The analysis of PI is carried out in
Section III. The new algorithm PI+ and its properties are presented in Section IV. We defer the robustness analysis of the stability properties ensured by PI and PI+ to Section V.
Concluding remarks are provided in Section VI. In order
to streamline the presentation, the proofs are given in the
appendices.
II. PRELIMINARIES
In this section, we define the notation, provide important
definitions and formalize the problem.
A. Notation
Let R := (−∞, ∞), R≥0 := [0, ∞), R̄ := [−∞, ∞], Z≥0 := {0, 1, 2, . . .} and Z>0 := {1, 2, . . .}. The notation (x, y) stands for [x⊤, y⊤]⊤, where x ∈ R^n, y ∈ R^m and n, m ∈ Z>0. The Euclidean norm of a vector x ∈ R^n with n ∈ Z>0 is denoted by |x|, and the distance of x ∈ R^n to a non-empty set A ⊆ R^n is denoted by |x|_A := inf{|x − y| : y ∈ A}. The closed unit ball of R^n for n ∈ Z>0 centered at the origin is denoted by B. We consider K, K∞ and KL functions as defined in [9, Section 3.5]. We write β ∈ exp-KL when β(s1, s2) = λ1 s1 e^{−λ2 s2} for some λ1 ∈ [1, ∞) and λ2 > 0, for any (s1, s2) ∈ R≥0 × R≥0. For any set A ⊆ R^n and x ∈ R^n, the indicator function δ_A : R^n → R̄ is defined as δ_A(x) = 0 when x ∈ A and δ_A(x) = ∞ when x ∉ A, as in [27]. Moreover, when A is closed, we say that σ : R^n → R≥0 is a proper indicator of the set A whenever σ is continuous and there exist α_σ, ᾱ_σ ∈ K∞ such that α_σ(|x|_A) ≤ σ(x) ≤ ᾱ_σ(|x|_A) for any x ∈ R^n. The identity map from R≥0 to R≥0 is denoted by I, and the zero map from R≥0 to {0} by 0. Let f : R≥0 → R≥0. We use f^(k) for the composition of the function f with itself k times, where k ∈ Z≥0 and f^(0) := I. Given a set-valued map S : R^n ⇉ R^m, a selection of S is a single-valued mapping s : dom S → R^m such that s(x) ∈ S(x) for any x ∈ dom S. For the sake of convenience, we write s ∈ S to denote a selection s of S. We also employ the following definition from [27, Def. 1.16].

Definition 1 (uniform level boundedness): A function f : R^n × R^m → R̄, where n, m ∈ Z>0, with values f(x, u), is level-bounded in u, locally uniformly in x, if for each x̄ ∈ R^n and α ∈ R there exist a neighborhood S of x̄ along with a bounded set B ⊂ R^m such that {u ∈ R^m : f(z, u) ≤ α} ⊂ B for any z ∈ S.
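To make Definition 1 concrete, here is a small numerical sketch (our own example, not from the paper): f(x, u) = (x − u)² + u² is level-bounded in u, locally uniformly in x, since the Young-type bound 2xu ≤ 2x² + u²/2 gives f(x, u) ≥ (3/2)u² − x², so every sublevel set in u lies in one ball whose radius depends only on α and a neighborhood of x̄:

```python
import numpy as np

def f(x, u):
    return (x - u) ** 2 + u ** 2

xbar, alpha = 1.0, 4.0
S = np.linspace(xbar - 0.5, xbar + 0.5, 51)   # neighborhood of xbar
u_grid = np.linspace(-100.0, 100.0, 200001)   # large scan range for u

# Young's inequality 2xu <= 2x^2 + u^2/2 gives f(x, u) >= (3/2)u^2 - x^2,
# so f(z, u) <= alpha forces u^2 <= (2/3)(alpha + z^2): a single bounded
# set B works for every z in S.
radius = np.sqrt((2.0 / 3.0) * (alpha + np.max(np.abs(S)) ** 2))

for z in S:
    sublevel = u_grid[f(z, u_grid) <= alpha]
    assert np.all(np.abs(sublevel) <= radius)
print("all sublevel sets in u lie in |u| <=", round(float(radius), 3))
```

The scan is of course no proof; the analytic bound in the comment is what certifies the property.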
B. Plant model and cost function
Consider the plant model
x(k + 1) = f(x(k), u(k)),     (1)

where x ∈ R^nx is the state, u ∈ U(x) ⊆ R^nu is the control input, the time step is k ∈ Z≥0, U(x) is a non-empty set of admissible inputs for the state x ∈ R^nx, and nx, nu ∈ Z>0.
We wish to find, for any given x ∈ R^nx, an infinite-length sequence of admissible inputs u = (u(0), u(1), . . .) that minimizes the infinite-horizon cost

J(x, u) := Σ_{k=0}^∞ ℓ(φ(k, x, u|_k), u(k)),     (2)

where ℓ : R^nx × R^nu → R≥0 is a non-negative stage cost and φ(k, x, u|_k) is the solution to (1) at the k-th step, initialized at x at time 0 with inputs u|_k := (u(0), . . . , u(k − 1)). The minimum of J(x, ·) is denoted as

V*(x) := min_u J(x, u)     (3)

for any x ∈ R^nx, where V* is the optimal value function associated with the minimization of (2).
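As a sanity check of (2) on a toy instance of our own (not from the paper), take the scalar system x(k+1) = 0.9 x(k) + u(k), stage cost ℓ(x, u) = x² + u² and the particular input sequence u(k) = −0.5 x(k): the closed loop is x(k+1) = 0.4 x(k) and the cost (2) reduces to a geometric series:

```python
def J_truncated(x0, steps=200):
    # Truncation of the infinite-horizon cost (2); the tail is negligible
    # since the stage cost decays geometrically with ratio 0.16.
    x, total = x0, 0.0
    for _ in range(steps):
        u = -0.5 * x
        total += x ** 2 + u ** 2          # stage cost l(x, u)
        x = 0.9 * x + u                   # dynamics (1): x+ = 0.9 x + u
    return total

x0 = 2.0
# Stage cost along the solution is 1.25 * (0.16)^k * x0^2, so (2) sums to:
closed_form = 1.25 * x0 ** 2 / (1.0 - 0.16)
assert abs(J_truncated(x0) - closed_form) < 1e-9
print(round(closed_form, 6))
```

This particular input sequence is merely admissible and stabilizing; it need not attain the minimum in (3).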
Standing Assumption 1 (SA1): For any x ∈ R^nx, there exists an optimal sequence of admissible inputs u*(x) such that V*(x) = J(x, u*(x)) < +∞ and, for any infinite-length sequence of admissible inputs u, V*(x) ≤ J(x, u).
Conditions to ensure SA1 can be found in, e.g., [19]. Given (3), we define the set of optimal inputs as

H*(x) := argmin_{u ∈ U(x)} {ℓ(x, u) + V*(f(x, u))}.     (4)

Computing H* in (4) for the general dynamics in (1) is notoriously hard. Dynamic programming provides algorithms to iteratively obtain feedback laws, which instead converge to H* [4]. A fundamental algorithm of dynamic programming is PI, which is presented and analysed in the next section. Before that, we introduce some notation which will be convenient in the sequel. Given a feedback law h : R^nx → R^nu that is admissible, i.e., h ∈ U, we denote the solution to system (1) in closed loop with feedback law h at time k ∈ Z≥0 with initial condition x at time 0 as φ(k, x, h). Likewise, J(x, h) is the cost induced by h at initial state x, i.e., J(x, h) = Σ_{k=0}^∞ ℓ(φ(k, x, h), h(φ(k, x, h))). In this way, we have that, for any selection h* ∈ H*, V*(·) = J(·, h*), as H*(x) is non-empty for any x ∈ R^nx by SA1.
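The identity V*(·) = J(·, h*) can be illustrated on a hypothetical scalar linear-quadratic instance (a = b = q = r = 1, all our own choices): there V*(x) = P x² with P solving the scalar discrete Riccati equation, and h*(x) = −K x. A minimal sketch:

```python
# Hypothetical scalar LQ instance (a = b = q = r = 1), used only to
# illustrate V*(.) = J(., h*) for a selection h* of H* in (4): here
# V*(x) = P x^2, with P solving the scalar discrete Riccati equation,
# and h*(x) = -K x.
a, b, q, r = 1.0, 1.0, 1.0, 1.0

P = 1.0
for _ in range(200):                      # fixed-point iteration on the Riccati equation
    P = q + a * a * P - (a * b * P) ** 2 / (r + b * b * P)
K = a * b * P / (r + b * b * P)           # optimal gain, h*(x) = -K x

def J(x, steps=400):                      # closed-loop cost (2) under h*
    total = 0.0
    for _ in range(steps):
        u = -K * x
        total += q * x * x + r * u * u    # stage cost
        x = a * x + b * u                 # dynamics (1)
    return total

x0 = 3.0
assert abs(J(x0) - P * x0 ** 2) < 1e-8    # J(x, h*) = V*(x) = P x^2
print(round(P, 6))                        # converges to (1 + sqrt(5))/2
```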
III. POLICY ITERATION
We recall in this section the original formulation of PI
and we assume the algorithm is feasible at any iteration.
We then establish novel recursive stability and near-optimality
guarantees, assuming a detectability property is satisfied and
the initial policy is stabilizing. Finally, we present an example
where PI is not recursively feasible despite supposedly favor-
able properties, thereby motivating the need to modify PI to
overcome this issue, which will be the topic of Section IV. As
mentioned in the introduction, the robustness of the stability
properties is analyzed in Section V.
A. The algorithm
PI is presented in Algorithm 1. Given an initial admissible policy h0, PI generates at each iteration i ∈ Z≥0 a policy h_{i+1} with cost V_{i+1}(x) := J(x, h_{i+1}), and it can be proved that V_{i+1}(x) ≤ V_i(x) for all x ∈ R^nx [28, Section 4.2]. This is done via the improvement step in (PI.2). The policy h_{i+1} obtained at iteration i + 1 is an arbitrary selection of H_{i+1} in (PI.2), where H_{i+1} may be set-valued. We then evaluate
Algorithm 1 Policy Iteration (PI)
Input: f in (1), ℓ in (2), initial policy h0 ∈ U
Output: Policy h∞, cost V∞
1: Initial evaluation step: for all x ∈ R^nx,
       V0(x) := J(x, h0).     (PI.1)
2: for i ∈ Z≥0 do
3:     Policy improvement step: for all x ∈ R^nx,
           H_{i+1}(x) := argmin_{u ∈ U(x)} {ℓ(x, u) + V_i(f(x, u))}.     (PI.2)
4:     Select h_{i+1} ∈ H_{i+1}.
5:     Policy evaluation step: for all x ∈ R^nx,
           V_{i+1}(x) := J(x, h_{i+1}).     (PI.3)
6: end for
7: return h∞ ∈ H∞ and V∞.
the cost induced by h_{i+1}, namely V_{i+1} = J(·, h_{i+1}); this is the evaluation step in (PI.3). By doing so repeatedly, V_i converges to the optimal value function V∞ = V* under mild conditions, see [2].
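As an illustration (our own toy instance, not from the paper), Algorithm 1 can be run in closed form on a scalar linear-quadratic problem, where the evaluation step (PI.1)/(PI.3) reduces to a scalar Lyapunov equation and the improvement step (PI.2) to a one-dimensional minimization:

```python
# Toy scalar LQ instance (our own): x+ = a x + b u, l(x, u) = q x^2 + r u^2,
# policies h_i(x) = -K_i x and value functions V_i(x) = P_i x^2.
a, b, q, r = 1.0, 1.0, 1.0, 1.0
K = 0.9                     # stabilizing initial policy h0, |a - bK| = 0.1 < 1

Ps = []
for i in range(8):
    # Evaluation step (PI.1)/(PI.3): P solves the scalar Lyapunov equation
    # P = q + r K^2 + (a - bK)^2 P, hence the closed form below.
    P = (q + r * K ** 2) / (1.0 - (a - b * K) ** 2)
    Ps.append(P)
    # Improvement step (PI.2): minimize l(x, u) + P (a x + b u)^2 over u.
    K = a * b * P / (r + b ** 2 * P)

# The values decrease monotonically (V_{i+1} <= V_i) ...
assert all(Ps[i + 1] <= Ps[i] + 1e-12 for i in range(len(Ps) - 1))
# ... and converge to the Riccati solution P* = (1 + sqrt(5))/2.
assert abs(Ps[-1] - (1 + 5 ** 0.5) / 2) < 1e-10
print([round(p, 6) for p in Ps])
```

On this instance PI coincides with the Newton-Kleinman iteration on the Riccati equation, which explains the very fast convergence of the P_i.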
Remark 1: In practice, PI is often stopped at some iteration, typically by looking at the difference between V_i and V_{i−1} for some i ∈ Z>0. We will return to this point in Remark 2 in Section III-D.2.
B. Desired properties
For the remainder of this section, we proceed as is often done in the literature and assume that Algorithm 1 is recursively feasible, see, e.g., [15,23], in the sense that the optimization problem in (PI.2) admits a solution for any x ∈ R^nx at any iteration i ∈ Z>0, which is formalized below.

Assumption 1: The set-valued map H_i(x) is non-empty for any i ∈ Z>0 and x ∈ R^nx.
We note that verifying Assumption 1 is hard in general. At a given iteration i + 1 with i ∈ Z≥0, a sufficient condition for H_{i+1}(x) to be non-empty for any x ∈ R^nx is the lower semicontinuity of the map (x, u) ↦ ℓ(x, u) + V_i(f(x, u)) + δ_{U(x)}(u) on R^nx × R^nu, see, e.g., [3]. However, this is hard to check in advance as V_i is not known a priori. We will return to the question of the recursive feasibility of PI in Section III-E.
Under Assumption 1, the goal of this section is to establish recursive stability and near-optimality bounds for PI. In particular, we aim at showing that PI is
  • recursively stabilizing, i.e., if h0 stabilizes system (1), then this property is also ensured by any h_i ∈ H_i for any i ∈ Z>0, in the sense that the difference inclusion

        x(k + 1) ∈ f(x(k), H_i(x(k))) =: F_i(x(k))     (5)

    exhibits desirable set stability properties;
  • near-optimal, in the sense that we have guaranteed bounds on V_i − V* for any i ∈ Z≥0, despite the fact that V* in (3) is typically unknown.
These results rely on assumptions given in the next section. For convenience, solutions to system (5) are denoted in the sequel as φ_i(·, x) when initialized at some x ∈ R^nx, for any i ∈ Z≥0.
C. Standing assumptions
To define stability, we use a continuous function σ : R^nx → R≥0 that serves as a state "measure", relating the distance of the state to a given attractor where σ vanishes. By stability, we mean that there exists β ∈ KL (independent of i) such that, for any x ∈ R^nx and i ∈ Z≥0, any solution φ_i to (5) verifies, for any k ∈ Z≥0,

    σ(φ_i(k, x)) ≤ β(σ(x), k).     (6)

Property (6) is a KL-stability property of system (5) with respect to σ. When σ is a proper indicator function of a closed set A ⊆ R^nx, the uniform global asymptotic stability of the set A = {x ∈ R^nx : σ(x) = 0} is guaranteed by (6). When σ(x) = |x|^a for any x ∈ R^nx with a ≥ 1, the uniform global asymptotic stability of the origin x = 0 is ensured by (6), for instance. The function σ is thus convenient to address stability properties for general attractors.
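A minimal numerical illustration of (6) with a set attractor (our own construction, not from the paper): for the dynamics x(k+1) = x(k)/2 and A = [−1, 1], the measure σ(x) = |x|_A = max(|x| − 1, 0) satisfies (6) with β(s, k) = s/2^k:

```python
def sigma(x):                      # proper indicator of A = [-1, 1]
    return max(abs(x) - 1.0, 0.0)

def beta(s, k):                    # candidate KL bound
    return s / 2.0 ** k

for x0 in (-7.0, -1.5, 0.3, 2.0, 40.0):
    x = x0
    for k in range(30):
        assert sigma(x) <= beta(sigma(x0), k) + 1e-12   # property (6)
        x = x / 2.0                                     # dynamics x+ = x/2
print("KL bound (6) holds along all sampled solutions")
```

Note that σ vanishes on all of A, so solutions starting inside the attractor trivially satisfy the bound; the decay matters only outside A.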
We make the next detectability assumption on system (1) and stage cost ℓ, consistently with, e.g., [11,13,18,26].

Standing Assumption 2 (SA2): There exist a continuous function W : R^nx → R≥0, α_W, χ_W ∈ K∞ and ᾱ_W : R≥0 → R≥0 continuous, nondecreasing and zero at zero, such that, for any (x, u) ∈ W := {(x, u) ∈ R^nx × R^nu : u ∈ U(x)},

    W(x) ≤ ᾱ_W(σ(x))
    W(f(x, u)) − W(x) ≤ −α_W(σ(x)) + χ_W(ℓ(x, u)).     (7)

SA2 is a detectability property of system (1) and stage cost ℓ, see [13,16] for more details. SA2 holds for instance with W = 0 when {x ∈ R^nx : σ(x) = 0} is compact, ℓ(x, u) = ℓ1(x) + ℓ2(x, u) with ℓ1 continuous, positive definite with respect to the set {x ∈ R^nx : σ(x) = 0}, ℓ1(x) → ∞ as |x| → ∞, and ℓ2(x, u) ≥ 0 for any (x, u) ∈ W. Note that SA2 relaxes the requirement that ℓ(·, u) be positive definite for any u ≠ 0, as found in, e.g., [5,7,23,24], and ℓ is not required to satisfy convexity properties.
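The W = 0 case can be checked numerically on a hypothetical instance of our own: take σ(x) = |x|, ℓ(x, u) = x² + u² (so ℓ1(x) = x² and ℓ2(x, u) = u²), α_W(s) = s² and χ_W the identity. The decrease inequality in (7) then reads 0 ≤ −σ(x)² + ℓ(x, u), which holds everywhere:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-10.0, 10.0, 10000)      # sampled states
U = rng.uniform(-10.0, 10.0, 10000)      # sampled inputs

alpha_W = lambda s: s ** 2               # alpha_W in SA2
chi_W = lambda s: s                      # chi_W = identity
l = lambda x, u: x ** 2 + u ** 2         # stage cost, l1(x) = x^2, l2 = u^2

# With W = 0, (7) reduces to 0 <= -alpha_W(sigma(x)) + chi_W(l(x, u)) for
# sigma(x) = |x|, which holds here since x^2 <= x^2 + u^2.
assert np.all(-alpha_W(np.abs(X)) + chi_W(l(X, U)) >= 0.0)
print("decrease inequality in (7) holds with W = 0 on all samples")
```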
Finally, as in, e.g., [3,15,23], we assume that we initialize the algorithm with a stabilizing feedback law h0. In particular, we make the next assumption.

Standing Assumption 3 (SA3): There exists ᾱ_V ∈ K∞ such that, for any x ∈ R^nx, V0(x) = J(x, h0) ≤ ᾱ_V(σ(x)).

SA2 and SA3 are related to the stability property of system (1) in closed loop with h0 as in (6). This is established below.
D. Results
We now establish the desired properties listed in Section III-B. The proofs of the forthcoming results follow very similar lines to the equivalent ones stated in Section IV for PI+. For this reason, the proofs are carried out in detail for the results of Section IV in Appendices I and II, and their application to the results of the present section is discussed in Appendix III.
1) Recursive stability: The next theorem establishes recursive stability.

Theorem 1: Suppose Assumption 1 holds. For any i ∈ Z≥0, any x ∈ R^nx, any solution φ_i to (5) and any k ∈ Z≥0,

    σ(φ_i(k, x)) ≤ β(σ(x), k),     (8)

where¹ β : (s, k) ↦ α_Y^{-1}((I − α̃_Y)^(k)(ᾱ_Y(s))) ∈ KL, with α_Y, ᾱ_Y, α̃_Y in Table I.

Theorem 1 ensures the desired KL-stability property of system (5) with respect to σ at any iteration i ∈ Z≥0, and thus at i = 0, which confirms that h0 is stabilizing as mentioned above. It is important to note that β in (8) is independent of the number of iterations i, which makes the stability property uniform with respect to i.
Under extra conditions, we can derive an exponential stability result.

Corollary 1: Suppose Assumption 1 holds and that there exist c_W, a_W, ā_V > 0 and ā_W ≥ 0 such that χ_W(s) ≤ c_W s, α_W(s) ≥ a_W s, ᾱ_V(s) ≤ ā_V s and ᾱ_W(s) ≤ ā_W s for any s ≥ 0, where χ_W, α_W, ᾱ_W come from SA2 and ᾱ_V comes from SA3. Then, Theorem 1 holds with β : (s, k) ↦ (ā_Y/a_Y)(1 − ã_Y)^k s ∈ exp-KL, with ã_Y, a_Y, ā_Y in Table I.
2) Near-optimality properties: Next, we establish near-optimality properties of PI.

Theorem 2: Suppose Assumption 1 holds. For any i ∈ Z≥0 and x ∈ R^nx,

    (V_i − V*)(x) ≤ (V0 − V*)(φ(i, x, h*)),     (9)

where h* ∈ H* from (4). Moreover, for any i ∈ Z≥0 and x ∈ R^nx,

    V_i(x) − V*(x) ≤ α̃(β(σ(x), i)),     (10)

where β ∈ KL comes from Theorem 1 and α̃ : s ↦ max_{ŝ ∈ [0, s]} α̂(ŝ) is a non-decreasing and positive definite function, with α̂ in Table I.
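On a toy scalar linear-quadratic instance of our own (a = b = q = r = 1, initial policy h0(x) = −0.9 x, not from the paper), bound (9) can be checked numerically: with V_i(x) = P_i x² and h*(x) = −K* x, (9) reads (P_i − P*) x² ≤ (P_0 − P*)(a − bK*)^{2i} x²:

```python
a, b, q, r = 1.0, 1.0, 1.0, 1.0
Pstar = (1 + 5 ** 0.5) / 2                # Riccati solution, V*(x) = P* x^2
Kstar = a * b * Pstar / (r + b ** 2 * Pstar)
rho = a - b * Kstar                       # contraction rate of the optimal closed loop

K, Ps = 0.9, []
for i in range(10):
    P = (q + r * K ** 2) / (1.0 - (a - b * K) ** 2)   # policy evaluation
    Ps.append(P)
    K = a * b * P / (r + b ** 2 * P)                  # policy improvement

# Bound (9): (P_i - P*) x^2 <= (P_0 - P*) rho^(2 i) x^2 for every x.
for i, P in enumerate(Ps):
    assert P - Pstar <= (Ps[0] - Pstar) * rho ** (2 * i) + 1e-12
print("bound (9) verified for i = 0, ..., 9")
```

On this instance the left-hand side actually decays much faster than the geometric right-hand side, consistent with (9) being an upper bound.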
Theorem 2 provides novel characterisations of the near-optimality properties of PI. In (9), (V_i − V*)(x) is the near-optimality error term at iteration i ∈ Z≥0 and state x ∈ R^nx. This error is upper-bounded in (9) by (V0 − V*)(φ(i, x, h*)), which is the "initial" near-optimality error term for i = 0, but evaluated at state φ(i, x, h*) instead of x. In its turn, state φ(i, x, h*) corresponds to the i-th time step of the solution of (1) initialized at x in closed loop with an optimal (and typically unknown) policy h* ∈ H*. This bound decreases to zero point-wisely as i → ∞ thanks to the stability property of system (1) in closed loop with (4), which follows similarly to Theorem 1. However, the upper bound in (9) is typically unknown. To overcome this possible issue, a conservative upper bound on (V0 − V*)(φ(i, x, h*)) of the form α̃(β(σ(x), i)) is given in (10), where β and α̃ come from Theorems 1 and 2 and can both be computed. The fact that the function β in (8) is independent of the number of iterations i is vital for (10). Indeed, as a result, the upper bound is ensured to converge to zero as i increases to infinity. We emphasize that the above near-optimality properties exploit stability to provide explicit bounds as in (10). This is in contrast to the literature, which
¹To simplify the presentation, we assume that I − α̃_Y is non-decreasing. If this is not the case, we can always replace (I − α̃_Y)^(k) by s ↦ max_{ŝ ∈ [0, s]} (I − α̃_Y)^(k)(ŝ) in the expression of β.