Policy iteration: for want of recursive feasibility, all is not lost

Mathieu Granzotto, Olivier Lindamulage De Silva, Romain Postoyan, Dragan Nešić, and Zhong-Ping Jiang
Abstract—This paper investigates recursive feasibility, recursive robust stability and near-optimality properties of policy iteration (PI). For this purpose, we consider deterministic nonlinear discrete-time systems whose inputs are generated by PI for undiscounted cost functions. We first assume that PI is recursively feasible, in the sense that the optimization problems solved at each iteration admit a solution. In this case, we provide novel conditions to establish recursive robust stability properties for a general attractor, meaning that the policies generated at each iteration ensure a robust KL-stability property with respect to a general state measure. We then derive novel explicit bounds on the mismatch between the (suboptimal) value function returned by PI at each iteration and the optimal one. Afterwards, motivated by a counter-example that shows that PI may fail to be recursively feasible, we modify PI so that recursive feasibility is guaranteed a priori under mild conditions. This modified algorithm, called PI+, is shown to preserve the recursive robust stability when the attractor is compact. Additionally, PI+ enjoys the same near-optimality properties as its PI counterpart under the same assumptions. Therefore, PI+ is an attractive tool for generating near-optimal stabilizing control of deterministic discrete-time nonlinear systems.
I. INTRODUCTION
Policy iteration (PI) is an optimization algorithm that forms
one of the pillars of dynamic programming [2]. PI iteratively
generates control laws, also called policies, that converge to
an optimal control law for general dynamical systems and cost
functions under mild conditions, see, e.g., [2,5,15,17]. Also,
PI may exhibit the attractive feature of converging faster to
the optimal value function than its counterpart value iteration
(VI) [15] at the price of more computations. For these reasons,
PI attracts a lot of attention both in terms of theoretical investigations, see, e.g., [3,5,7,15,23,24], and practical applications, see, e.g., [14,22,29,30]. Nevertheless, several fundamental questions remain largely open regarding the properties of PI in a control context: (i) its recursive feasibility; (ii) general conditions for recursive robust stability when the attractor is not necessarily a single point but a more general set; (iii) near-optimality guarantees, in particular when the cost function is not discounted. We explain each of these challenges next.

Mathieu Granzotto and Dragan Nešić are with the Department of Electrical and Electronic Engineering, University of Melbourne, Parkville, VIC 3010, Australia (e-mail: {mgranzotto, dnesic}@unimelb.edu.au). Their work was supported by the Australian Research Council under the Discovery Project DP210102600. Olivier Lindamulage De Silva and Romain Postoyan are with the Université de Lorraine, CNRS, CRAN, F-54000 Nancy, France (e-mails: {name.surname}@univ-lorraine.fr). Zhong-Ping Jiang is with the Department of Electrical and Computer Engineering, New York University, 370 Jay Street, Brooklyn, NY 11201, USA (e-mail: zjiang@nyu.edu). His work was supported partly by the National Science Foundation under Grant EPCN-1903781. This work was supported by the France-Australia collaboration project IRP-ARS CNRS.
It is essential that PI is recursively feasible in the sense that
the optimization problem admits a solution at each iteration.
Surprisingly, we have not been able to find general conditions
for the recursive feasibility of PI in the literature when
dealing with deterministic nonlinear discrete-time systems with general cost functions, whose states and inputs evolve on a Euclidean space. The only results we came across concentrate on special cases, such as when the input set is finite [3] or the system is linear and the cost is quadratic [2]. The dominant approach in the literature for nonlinear discrete-time systems on Euclidean spaces is thus to assume that the algorithm is recursively feasible, see, e.g., [15,23], or to rely on conditions that are hard to verify a priori in general as they employ feasibility tests at each iteration [3]. The model predictive control literature recognized the importance of recursive feasibility long ago. We believe that the recursive feasibility of PI is a property of similarly major importance in view of the burgeoning literature on dynamic programming and reinforcement learning, where PI plays a major role [4,6].
A second challenge for PI is related to its application in a
control context. In many applications, the closed-loop system
must exhibit stability guarantees as: (i) it provides analytical
guarantees on the behavior of the controlled system solutions
as time evolves; (ii) it endows the system with robustness prop-
erties and is thus associated to safety considerations, see, e.g.,
[1]. Available results on the stability of systems controlled by
PI concentrate on the case where the attractor is a single point,
as in, e.g., [5,7,23,24]. They exclude set stability, which is inevitable for instance in the presence of clock or toggle variables [9, Examples 3.1-3.2], and more generally when the desired operating behaviour of the closed-loop system is given by a set and not a point. Moreover, the commonly used assumptions imposed on the plant model and the stage cost are also subject to some conservatism, like requiring the stage cost to satisfy positive definiteness properties. In addition, it is essential to ensure that these stability properties are robust, which is not automatically guaranteed, as pointed out in [12,21]; this matter is often overlooked in the literature, with the exception of the recent work in [25] in the linear quadratic case. There is therefore a need for general conditions allowing one to conclude robust set stability properties for systems controlled by PI. We further would like these stability properties to be preserved at each iteration; that is, we want to ensure recursive robust stability.
Finally, it is important to understand when and how the
sequence of value functions generated by PI at each iteration
converges to the optimal value function; we talk of near-optimality guarantees.

arXiv:2210.14459v1 [math.OC] 26 Oct 2022

The literature stands in two ways on
this issue. On the one hand, the value functions are known
to converge monotonically and point-wisely to the optimal
value function under mild conditions on the model and the
cost [5,15]; uniform convergence properties are only ensured, as far as we know, for discounted cost functions [2]. On
the other hand, it is important to be able to evaluate the
mismatch between the returned value function at each iteration
and the optimal one. These computable or explicit bounds
on the mismatch of generated value functions are vital to
decide when to stop iterating the algorithm. Existing results
concentrate on discounted costs [2], which are not always natural in control applications. In the discounted setting, the provided near-optimality bounds blow up as the discount factor γ ∈ (0, 1) converges to one [2], while, by contrast, γ should be close enough to one to ensure stability in view of [11,26]. Hence, there is a need for stronger near-optimality
guarantees for PI. In particular, we seek to develop results that
provide computable near-optimality bounds, without relying
on a discount factor, and to provide conditions under which
the sequence of constructed value functions satisfies a uniform
monotonic convergence property towards the optimal one.
In this context, we consider deterministic nonlinear discrete-
time systems, whose inputs are generated by PI for an undis-
counted infinite-horizon cost function. We first assume that
PI is recursively feasible and we provide general conditions inspired by the model predictive control literature [13] to
ensure the recursive robust stability of the closed-loop system
at each iteration, where the attractor is a set. These conditions
relate to the detectability of the system with respect to the
stage cost and the stabilizing property of the initial policy.
We then exploit these stability properties to derive explicit
and computable bounds on the mismatch between the optimal
value function and the value function obtained by PI at each
iteration. We also show that the sequence of value functions
satisfies a uniform convergence property towards the optimal
value function by exploiting stability.
Afterwards, we show via a counter-example that PI may
actually fail to be recursively feasible under commonly used
assumptions. We thus propose to modify the original formu-
lation of PI so that we can guarantee the recursive feasibility
of the algorithm under mild conditions on the model, the
stage cost and the input set. We call this new algorithm PI plus (PI+). PI+ differs from PI in two aspects. First, an (outer semicontinuous) regularization is performed at the so-called improvement step, which is a common technique in the discontinuous/hybrid systems literature [9]. Second, instead of letting the algorithm select any policy that minimizes the (regularized) improvement step, we select any of those that generates the smallest value of the cost we aim to minimize, thereby requiring an extra layer of computation. In this paper, we do not address the question of the practical implementation of PI+, which is left for future work. Instead, we concentrate on the methodological challenges raised by the algorithm. We then prove that PI+ is indeed recursively feasible, and that it preserves the recursive robust stability when the attractor is compact, as well as the near-optimality properties established for PI. Compared to our preliminary work in [10], novel elements include the results for PI, the robust stability analysis, the fact that the admissible input set can be state-dependent, and new technical developments to derive less conservative near-optimality bounds. Moreover, the full proofs are provided.
The rest of the paper is organized as follows. Preliminaries
are given in Section II. The analysis of PI is carried out in
Section III. The new algorithm PI+ and its properties are presented in Section IV. We defer the robustness analysis of the stability properties ensured by PI and PI+ to Section V.
Concluding remarks are provided in Section VI. In order
to streamline the presentation, the proofs are given in the
appendices.
II. PRELIMINARIES
In this section, we define the notation, provide important
definitions and formalize the problem.
A. Notation
Let R := (−∞, ∞), R≥0 := [0, ∞), R̄ := [−∞, ∞], Z≥0 := {0, 1, 2, . . .} and Z>0 := {1, 2, . . .}. The notation (x, y) stands for [x⊤, y⊤]⊤, where x ∈ R^n, y ∈ R^m and n, m ∈ Z>0. The Euclidean norm of a vector x ∈ R^n with n ∈ Z>0 is denoted by |x|, and the distance of x ∈ R^n to a non-empty set A ⊆ R^n is denoted by |x|_A := inf{|x − y| : y ∈ A}. The closed unit ball of R^n for n ∈ Z>0 centered at the origin is denoted by B. We consider K, K∞ and KL functions as defined in [9, Section 3.5]. We write β ∈ exp-KL when β(s1, s2) = λ1 s1 e^{−λ2 s2} for some λ1 ∈ [1, ∞) and λ2 > 0, for any (s1, s2) ∈ R≥0 × R≥0. For any set A ⊆ R^n and x ∈ R^n, the indicator function δ_A : R^n → R̄ is defined as δ_A(x) = 0 when x ∈ A and δ_A(x) = ∞ when x ∉ A, as in [27]. Moreover, when A is closed, we say that σ : R^n → R≥0 is a proper indicator of the set A whenever σ is continuous and there exist α_σ, ᾱ_σ ∈ K∞ such that α_σ(|x|_A) ≤ σ(x) ≤ ᾱ_σ(|x|_A) for any x ∈ R^n. The identity map from R≥0 to R≥0 is denoted by I, and the zero map from R≥0 to {0} by 0. Let f : R≥0 → R≥0. We use f^(k) for the composition of the function f with itself k times, where k ∈ Z≥0 and f^(0) := I. Given a set-valued map S : R^n ⇉ R^m, a selection of S is a single-valued mapping s : dom S → R^m such that s(x) ∈ S(x) for any x ∈ dom S. For the sake of convenience, we write s ∈ S to denote a selection s of S. We also employ the following definition from [27, Def. 1.16].

Definition 1 (uniform level boundedness): A function f : R^n × R^m → R̄, where n, m ∈ Z>0, with values f(x, u), is level-bounded in u, locally uniformly in x, if for each x̄ ∈ R^n and α ∈ R there exist a neighborhood S of x̄ along with a bounded set B ⊂ R^m such that {u ∈ R^m : f(z, u) ≤ α} ⊂ B for any z ∈ S.
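To make Definition 1 concrete, here is a small numerical sketch (our own example, not from the paper): f(x, u) = (x − u)² + u² is level-bounded in u, locally uniformly in x, since the Young-type bound 2xu ≤ 2x² + u²/2 gives f(x, u) ≥ (3/2)u² − x², so every sublevel set in u lies in one ball whose radius depends only on α and a neighborhood of x̄:

```python
import numpy as np

def f(x, u):
    return (x - u) ** 2 + u ** 2

xbar, alpha = 1.0, 4.0
S = np.linspace(xbar - 0.5, xbar + 0.5, 51)   # neighborhood of xbar
u_grid = np.linspace(-100.0, 100.0, 200001)   # large scan range for u

# Young's inequality 2xu <= 2x^2 + u^2/2 gives f(x, u) >= (3/2)u^2 - x^2,
# so f(z, u) <= alpha forces u^2 <= (2/3)(alpha + z^2): a single bounded
# set B works for every z in S.
radius = np.sqrt((2.0 / 3.0) * (alpha + np.max(np.abs(S)) ** 2))

for z in S:
    sublevel = u_grid[f(z, u_grid) <= alpha]
    assert np.all(np.abs(sublevel) <= radius)
print("all sublevel sets in u lie in |u| <=", round(float(radius), 3))
```

The scan is of course no proof; the analytic bound in the comment is what certifies the property.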
B. Plant model and cost function
Consider the plant model
x(k + 1) = f(x(k), u(k)),     (1)

where x ∈ R^nx is the state, u ∈ U(x) ⊆ R^nu is the control input, the time step is k ∈ Z≥0, U(x) is a non-empty set of admissible inputs for the state x ∈ R^nx, and nx, nu ∈ Z>0.
We wish to find, for any given x ∈ R^nx, an infinite-length sequence of admissible inputs u = (u(0), u(1), . . .) that minimizes the infinite-horizon cost

J(x, u) := Σ_{k=0}^∞ ℓ(φ(k, x, u|_k), u(k)),     (2)

where ℓ : R^nx × R^nu → R≥0 is a non-negative stage cost and φ(k, x, u|_k) is the solution to (1) at the k-th step, initialized at x at time 0 with inputs u|_k := (u(0), . . . , u(k − 1)). The minimum of J(x, ·) is denoted as

V*(x) := min_u J(x, u)     (3)

for any x ∈ R^nx, where V* is the optimal value function associated with the minimization of (2).
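As a sanity check of (2) on a toy instance of our own (not from the paper), take the scalar system x(k+1) = 0.9 x(k) + u(k), stage cost ℓ(x, u) = x² + u² and the particular input sequence u(k) = −0.5 x(k): the closed loop is x(k+1) = 0.4 x(k) and the cost (2) reduces to a geometric series:

```python
def J_truncated(x0, steps=200):
    # Truncation of the infinite-horizon cost (2); the tail is negligible
    # since the stage cost decays geometrically with ratio 0.16.
    x, total = x0, 0.0
    for _ in range(steps):
        u = -0.5 * x
        total += x ** 2 + u ** 2          # stage cost l(x, u)
        x = 0.9 * x + u                   # dynamics (1): x+ = 0.9 x + u
    return total

x0 = 2.0
# Stage cost along the solution is 1.25 * (0.16)^k * x0^2, so (2) sums to:
closed_form = 1.25 * x0 ** 2 / (1.0 - 0.16)
assert abs(J_truncated(x0) - closed_form) < 1e-9
print(round(closed_form, 6))
```

This particular input sequence is merely admissible and stabilizing; it need not attain the minimum in (3).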
Standing Assumption 1 (SA1): For any x ∈ R^nx, there exists an optimal sequence of admissible inputs u*(x) such that V*(x) = J(x, u*(x)) < +∞ and, for any infinite-length sequence of admissible inputs u, V*(x) ≤ J(x, u).
Conditions to ensure SA1 can be found in, e.g., [19]. Given (3), we define the set of optimal inputs as

H*(x) := argmin_{u ∈ U(x)} {ℓ(x, u) + V*(f(x, u))}.     (4)

Computing H* in (4) for the general dynamics in (1) is notoriously hard. Dynamic programming provides algorithms to iteratively obtain feedback laws, which instead converge to H* [4]. A fundamental algorithm of dynamic programming is PI, which is presented and analysed in the next section. Before that, we introduce some notation which will be convenient in the sequel. Given a feedback law h : R^nx → R^nu that is admissible, i.e., h ∈ U, we denote the solution to system (1) in closed loop with feedback law h at time k ∈ Z≥0 with initial condition x at time 0 as φ(k, x, h). Likewise, J(x, h) is the cost induced by h at initial state x, i.e., J(x, h) = Σ_{k=0}^∞ ℓ(φ(k, x, h), h(φ(k, x, h))). In this way, we have that, for any selection h* ∈ H*, V*(·) = J(·, h*), as H*(x) is non-empty for any x ∈ R^nx by SA1.
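The identity V*(·) = J(·, h*) can be illustrated on a hypothetical scalar linear-quadratic instance (a = b = q = r = 1, all our own choices): there V*(x) = P x² with P solving the scalar discrete Riccati equation, and h*(x) = −K x. A minimal sketch:

```python
# Hypothetical scalar LQ instance (a = b = q = r = 1), used only to
# illustrate V*(.) = J(., h*) for a selection h* of H* in (4): here
# V*(x) = P x^2, with P solving the scalar discrete Riccati equation,
# and h*(x) = -K x.
a, b, q, r = 1.0, 1.0, 1.0, 1.0

P = 1.0
for _ in range(200):                      # fixed-point iteration on the Riccati equation
    P = q + a * a * P - (a * b * P) ** 2 / (r + b * b * P)
K = a * b * P / (r + b * b * P)           # optimal gain, h*(x) = -K x

def J(x, steps=400):                      # closed-loop cost (2) under h*
    total = 0.0
    for _ in range(steps):
        u = -K * x
        total += q * x * x + r * u * u    # stage cost
        x = a * x + b * u                 # dynamics (1)
    return total

x0 = 3.0
assert abs(J(x0) - P * x0 ** 2) < 1e-8    # J(x, h*) = V*(x) = P x^2
print(round(P, 6))                        # converges to (1 + sqrt(5))/2
```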
III. POLICY ITERATION
We recall in this section the original formulation of PI
and we assume the algorithm is feasible at any iteration.
We then establish novel recursive stability and near-optimality
guarantees, assuming a detectability property is satisfied and
the initial policy is stabilizing. Finally, we present an example
where PI is not recursively feasible despite supposedly favor-
able properties, thereby motivating the need to modify PI to
overcome this issue, which will be the topic of Section IV. As
mentioned in the introduction, the robustness of the stability
properties is analyzed in Section V.
A. The algorithm
PI is presented in Algorithm 1. Given an initial admissible policy h0, PI generates at each iteration i ∈ Z≥0 a policy h_{i+1} with cost V_{i+1}(x) := J(x, h_{i+1}), and it can be proved that V_{i+1}(x) ≤ V_i(x) for all x ∈ R^nx [28, Section 4.2]. This is done via the improvement step in (PI.2). The policy h_{i+1} obtained at iteration i + 1 is an arbitrary selection of H_{i+1} in (PI.2), where H_{i+1} may be set-valued. We then evaluate
Algorithm 1 Policy Iteration (PI)
Input: f in (1), ℓ in (2), initial policy h0 ∈ U
Output: Policy h∞, cost V∞
1: Initial evaluation step: for all x ∈ R^nx,
       V0(x) := J(x, h0).     (PI.1)
2: for i ∈ Z≥0 do
3:     Policy improvement step: for all x ∈ R^nx,
           H_{i+1}(x) := argmin_{u ∈ U(x)} {ℓ(x, u) + V_i(f(x, u))}.     (PI.2)
4:     Select h_{i+1} ∈ H_{i+1}.
5:     Policy evaluation step: for all x ∈ R^nx,
           V_{i+1}(x) := J(x, h_{i+1}).     (PI.3)
6: end for
7: return h∞ ∈ H∞ and V∞.
the cost induced by h_{i+1}, namely V_{i+1} = J(·, h_{i+1}); this is the evaluation step in (PI.3). By doing so repeatedly, V_i converges to the optimal value function V∞ = V* under mild conditions, see [2].
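As an illustration (our own toy instance, not from the paper), Algorithm 1 can be run in closed form on a scalar linear-quadratic problem, where the evaluation step (PI.1)/(PI.3) reduces to a scalar Lyapunov equation and the improvement step (PI.2) to a one-dimensional minimization:

```python
# Toy scalar LQ instance (our own): x+ = a x + b u, l(x, u) = q x^2 + r u^2,
# policies h_i(x) = -K_i x and value functions V_i(x) = P_i x^2.
a, b, q, r = 1.0, 1.0, 1.0, 1.0
K = 0.9                     # stabilizing initial policy h0, |a - bK| = 0.1 < 1

Ps = []
for i in range(8):
    # Evaluation step (PI.1)/(PI.3): P solves the scalar Lyapunov equation
    # P = q + r K^2 + (a - bK)^2 P, hence the closed form below.
    P = (q + r * K ** 2) / (1.0 - (a - b * K) ** 2)
    Ps.append(P)
    # Improvement step (PI.2): minimize l(x, u) + P (a x + b u)^2 over u.
    K = a * b * P / (r + b ** 2 * P)

# The values decrease monotonically (V_{i+1} <= V_i) ...
assert all(Ps[i + 1] <= Ps[i] + 1e-12 for i in range(len(Ps) - 1))
# ... and converge to the Riccati solution P* = (1 + sqrt(5))/2.
assert abs(Ps[-1] - (1 + 5 ** 0.5) / 2) < 1e-10
print([round(p, 6) for p in Ps])
```

On this instance PI coincides with the Newton-Kleinman iteration on the Riccati equation, which explains the very fast convergence of the P_i.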
Remark 1: In practice, PI is often stopped at some iteration, typically by looking at the difference between V_i and V_{i−1} for some i ∈ Z>0. We will return to this point in Remark 2 in Section III-D.2.
B. Desired properties
For the remainder of this section, we proceed as is often done in the literature and assume that Algorithm 1 is recursively feasible, see, e.g., [15,23], in the sense that the optimization problem in (PI.2) admits a solution for any x ∈ R^nx at any iteration i ∈ Z>0, which is formalized below.

Assumption 1: The set-valued map H_i(x) is non-empty for any i ∈ Z>0 and x ∈ R^nx.
We note that verifying Assumption 1 is hard in general. At a given iteration i + 1 with i ∈ Z≥0, a sufficient condition for H_{i+1}(x) to be non-empty for any x ∈ R^nx is the lower semicontinuity of the map (x, u) ↦ ℓ(x, u) + V_i(f(x, u)) + δ_{U(x)}(u) on R^nx × R^nu, see, e.g., [3]. However, this is hard to check in advance as V_i is not known a priori. We will return to the question of the recursive feasibility of PI in Section III-E.
Under Assumption 1, the goal of this section is to establish recursive stability and near-optimality bounds for PI. In particular, we aim at showing that PI is
  • recursively stabilizing, i.e., if h0 stabilizes system (1), then this property is also ensured by any h_i ∈ H_i for any i ∈ Z>0, in the sense that the difference inclusion

        x(k + 1) ∈ f(x(k), H_i(x(k))) =: F_i(x(k))     (5)

    exhibits desirable set stability properties;
  • near-optimal, in the sense that we have guaranteed bounds on V_i − V* for any i ∈ Z≥0, despite the fact that V* in (3) is typically unknown.
These results rely on assumptions given in the next section. For convenience, solutions to system (5) are denoted in the sequel as φ_i(·, x) when initialized at some x ∈ R^nx, for any i ∈ Z≥0.
C. Standing assumptions
To define stability, we use a continuous function σ : R^nx → R≥0 that serves as a state "measure", relating the distance of the state to a given attractor where σ vanishes. By stability, we mean that there exists β ∈ KL (independent of i) such that, for any x ∈ R^nx and i ∈ Z≥0, any solution φ_i to (5) verifies, for any k ∈ Z≥0,

    σ(φ_i(k, x)) ≤ β(σ(x), k).     (6)

Property (6) is a KL-stability property of system (5) with respect to σ. When σ is a proper indicator function of a closed set A ⊆ R^nx, the uniform global asymptotic stability of the set A = {x ∈ R^nx : σ(x) = 0} is guaranteed by (6). When σ(x) = |x|^a for any x ∈ R^nx with a ≥ 1, the uniform global asymptotic stability of the origin x = 0 is ensured by (6), for instance. The function σ is thus convenient to address stability properties for general attractors.
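A minimal numerical illustration of (6) with a set attractor (our own construction, not from the paper): for the dynamics x(k+1) = x(k)/2 and A = [−1, 1], the measure σ(x) = |x|_A = max(|x| − 1, 0) satisfies (6) with β(s, k) = s/2^k:

```python
def sigma(x):                      # proper indicator of A = [-1, 1]
    return max(abs(x) - 1.0, 0.0)

def beta(s, k):                    # candidate KL bound
    return s / 2.0 ** k

for x0 in (-7.0, -1.5, 0.3, 2.0, 40.0):
    x = x0
    for k in range(30):
        assert sigma(x) <= beta(sigma(x0), k) + 1e-12   # property (6)
        x = x / 2.0                                     # dynamics x+ = x/2
print("KL bound (6) holds along all sampled solutions")
```

Note that σ vanishes on all of A, so solutions starting inside the attractor trivially satisfy the bound; the decay matters only outside A.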
We make the next detectability assumption on system (1) and stage cost ℓ, consistently with, e.g., [11,13,18,26].

Standing Assumption 2 (SA2): There exist a continuous function W : R^nx → R≥0, α_W, χ_W ∈ K∞ and ᾱ_W : R≥0 → R≥0 continuous, nondecreasing and zero at zero, such that, for any (x, u) ∈ W := {(x, u) ∈ R^nx × R^nu : u ∈ U(x)},

    W(x) ≤ ᾱ_W(σ(x))
    W(f(x, u)) − W(x) ≤ −α_W(σ(x)) + χ_W(ℓ(x, u)).     (7)

SA2 is a detectability property of system (1) and stage cost ℓ, see [13,16] for more details. SA2 holds for instance with W = 0 when {x ∈ R^nx : σ(x) = 0} is compact, ℓ(x, u) = ℓ1(x) + ℓ2(x, u) with ℓ1 continuous, positive definite with respect to the set {x ∈ R^nx : σ(x) = 0}, ℓ1(x) → ∞ as |x| → ∞, and ℓ2(x, u) ≥ 0 for any (x, u) ∈ W. Note that SA2 relaxes the requirement that ℓ(·, u) be positive definite for any u ≠ 0, as found in, e.g., [5,7,23,24], and ℓ is not required to satisfy convexity properties.
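The W = 0 case can be checked numerically on a hypothetical instance of our own: take σ(x) = |x|, ℓ(x, u) = x² + u² (so ℓ1(x) = x² and ℓ2(x, u) = u²), α_W(s) = s² and χ_W the identity. The decrease inequality in (7) then reads 0 ≤ −σ(x)² + ℓ(x, u), which holds everywhere:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-10.0, 10.0, 10000)      # sampled states
U = rng.uniform(-10.0, 10.0, 10000)      # sampled inputs

alpha_W = lambda s: s ** 2               # alpha_W in SA2
chi_W = lambda s: s                      # chi_W = identity
l = lambda x, u: x ** 2 + u ** 2         # stage cost, l1(x) = x^2, l2 = u^2

# With W = 0, (7) reduces to 0 <= -alpha_W(sigma(x)) + chi_W(l(x, u)) for
# sigma(x) = |x|, which holds here since x^2 <= x^2 + u^2.
assert np.all(-alpha_W(np.abs(X)) + chi_W(l(X, U)) >= 0.0)
print("decrease inequality in (7) holds with W = 0 on all samples")
```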
Finally, as in, e.g., [3,15,23], we assume that we initialize the algorithm with a stabilizing feedback law h0. In particular, we make the next assumption.

Standing Assumption 3 (SA3): There exists ᾱ_V ∈ K∞ such that, for any x ∈ R^nx, V0(x) = J(x, h0) ≤ ᾱ_V(σ(x)).

SA2 and SA3 are related to the stability property of system (1) in closed loop with h0 as in (6). This is established below.
D. Results
We now establish the desired properties listed in Section III-B. The proofs of the forthcoming results follow very similar lines to the equivalent ones stated in Section IV for PI+. For this reason, the proofs are carried out in detail for the results of Section IV in Appendices I and II, and their application to the results of the present section is discussed in Appendix III.
1) Recursive stability: The next theorem establishes recursive stability.

Theorem 1: Suppose Assumption 1 holds. For any i ∈ Z≥0, any x ∈ R^nx, any solution φ_i to (5) and any k ∈ Z≥0,

    σ(φ_i(k, x)) ≤ β(σ(x), k),     (8)

where¹ β : (s, k) ↦ α_Y^{-1}((I − α̃_Y)^(k)(ᾱ_Y(s))) ∈ KL, with α_Y, ᾱ_Y, α̃_Y in Table I.

Theorem 1 ensures the desired KL-stability property of system (5) with respect to σ at any iteration i ∈ Z≥0, and thus at i = 0, which confirms that h0 is stabilizing as mentioned above. It is important to note that β in (8) is independent of the number of iterations i, which makes the stability property uniform with respect to i.
Under extra conditions, we can derive an exponential stability result.

Corollary 1: Suppose Assumption 1 holds and that there exist c_W, a_W, ā_V > 0 and ā_W ≥ 0 such that χ_W(s) ≤ c_W s, α_W(s) ≥ a_W s, ᾱ_V(s) ≤ ā_V s and ᾱ_W(s) ≤ ā_W s for any s ≥ 0, where χ_W, α_W, ᾱ_W come from SA2 and ᾱ_V comes from SA3. Then, Theorem 1 holds with β : (s, k) ↦ (ā_Y/a_Y)(1 − ã_Y)^k s ∈ exp-KL, with ã_Y, a_Y, ā_Y in Table I.
2) Near-optimality properties: Next, we establish near-optimality properties of PI.

Theorem 2: Suppose Assumption 1 holds. For any i ∈ Z≥0 and x ∈ R^nx,

    (V_i − V*)(x) ≤ (V0 − V*)(φ(i, x, h*)),     (9)

where h* ∈ H* from (4). Moreover, for any i ∈ Z≥0 and x ∈ R^nx,

    V_i(x) − V*(x) ≤ α̃(β(σ(x), i)),     (10)

where β ∈ KL comes from Theorem 1 and α̃ : s ↦ max_{ŝ ∈ [0, s]} α̂(ŝ) is a non-decreasing and positive definite function, with α̂ in Table I.
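On a toy scalar linear-quadratic instance of our own (a = b = q = r = 1, initial policy h0(x) = −0.9 x, not from the paper), bound (9) can be checked numerically: with V_i(x) = P_i x² and h*(x) = −K* x, (9) reads (P_i − P*) x² ≤ (P_0 − P*)(a − bK*)^{2i} x²:

```python
a, b, q, r = 1.0, 1.0, 1.0, 1.0
Pstar = (1 + 5 ** 0.5) / 2                # Riccati solution, V*(x) = P* x^2
Kstar = a * b * Pstar / (r + b ** 2 * Pstar)
rho = a - b * Kstar                       # contraction rate of the optimal closed loop

K, Ps = 0.9, []
for i in range(10):
    P = (q + r * K ** 2) / (1.0 - (a - b * K) ** 2)   # policy evaluation
    Ps.append(P)
    K = a * b * P / (r + b ** 2 * P)                  # policy improvement

# Bound (9): (P_i - P*) x^2 <= (P_0 - P*) rho^(2 i) x^2 for every x.
for i, P in enumerate(Ps):
    assert P - Pstar <= (Ps[0] - Pstar) * rho ** (2 * i) + 1e-12
print("bound (9) verified for i = 0, ..., 9")
```

On this instance the left-hand side actually decays much faster than the geometric right-hand side, consistent with (9) being an upper bound.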
Theorem 2 provides novel characterisations of the near-optimality properties of PI. In (9), (V_i − V*)(x) is the near-optimality error term at iteration i ∈ Z≥0 and state x ∈ R^nx. This error is upper-bounded in (9) by (V0 − V*)(φ(i, x, h*)), which is the "initial" near-optimality error term for i = 0, but evaluated at state φ(i, x, h*) instead of x. In its turn, state φ(i, x, h*) corresponds to the i-th time step of the solution of (1) initialized at x in closed loop with an optimal (and typically unknown) policy h* ∈ H*. This bound decreases to zero point-wisely as i → ∞ thanks to the stability property of system (1) in closed loop with (4), which follows similarly to Theorem 1. However, the upper bound in (9) is typically unknown. To overcome this possible issue, a conservative upper bound on (V0 − V*)(φ(i, x, h*)) of the form α̃(β(σ(x), i)) is given in (10), where β and α̃ come from Theorems 1 and 2 and can both be computed. The fact that the function β in (8) is independent of the number of iterations i is vital for (10). Indeed, as a result, the upper bound is ensured to converge to zero as i increases to infinity. We emphasize that the above near-optimality properties exploit stability to provide explicit bounds as in (10). This is in contrast to the literature, which
¹To simplify the presentation, we assume that I − α̃_Y is non-decreasing. If this is not the case, we can always replace (I − α̃_Y)^(k) by s ↦ max_{ŝ ∈ [0, s]} (I − α̃_Y)^(k)(ŝ) in the expression of β.