Global Convergence of Direct Policy Search for State-Feedback $\mathcal{H}_\infty$ Robust Control: A Revisit of Nonsmooth Synthesis with Goldstein Subdifferential
Xingang Guo, Bin Hu
Department of Electrical and Computer Engineering
Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
{xingang2,binhu7}@illinois.edu
Abstract
Direct policy search has been widely applied in modern reinforcement learning and continuous control. However, the theoretical properties of direct policy search on nonsmooth robust control synthesis have not been fully understood. The optimal $\mathcal{H}_\infty$ control framework aims at designing a policy to minimize the closed-loop $\mathcal{H}_\infty$ norm, and is arguably the most fundamental robust control paradigm. In this work, we show that direct policy search is guaranteed to find the global solution of the robust $\mathcal{H}_\infty$ state-feedback control design problem. Notice that policy search for optimal $\mathcal{H}_\infty$ control leads to a constrained nonconvex nonsmooth optimization problem, where the nonconvex feasible set consists of all the policies stabilizing the closed-loop dynamics. We show that for this nonsmooth optimization problem, all Clarke stationary points are global minima. Next, we identify the coerciveness of the closed-loop $\mathcal{H}_\infty$ objective function, and prove that all the sublevel sets of the resultant policy search problem are compact. Based on these properties, we show that Goldstein's subgradient method and its implementable variants can be guaranteed to stay in the nonconvex feasible set and eventually find the global optimal solution of the $\mathcal{H}_\infty$ state-feedback synthesis problem. Our work builds a new connection between nonconvex nonsmooth optimization theory and robust control, leading to an interesting global convergence result for direct policy search on optimal $\mathcal{H}_\infty$ synthesis.
1 Introduction
Reinforcement learning (RL) has achieved impressive performance on many continuous control tasks [59, 40], and policy optimization is one of the main workhorses for such applications [18, 65, 58, 60]. Recently, there have been extensive research efforts studying the global convergence properties of policy optimization methods on benchmark control problems including the linear quadratic regulator (LQR) [21, 7, 41, 70, 44, 22, 29], stabilization [52, 51], linear robust/risk-sensitive control [73, 72, 26, 74, 75, 12], Markov jump linear quadratic control [32, 31, 33, 55], Lur'e system control [53], output feedback control [20, 77, 39, 17, 16, 43, 76], and dynamic filtering [68]. For all these benchmark problems, the objective function in the policy optimization formulation is always differentiable over the entire feasible set, and the existing convergence theory heavily relies on this fact. Consequently, an important open question remains whether direct policy search can enjoy similar global convergence properties when applied to the famous $\mathcal{H}_\infty$ control problem, whose objective function can be non-differentiable over certain points in the policy space [1–3, 28, 9, 13, 48]. Different from LQR, which considers stochastic disturbance sequences, $\mathcal{H}_\infty$ control directly addresses the worst-case disturbance, and provides arguably the most fundamental robust control paradigm [78, 19, 62, 4, 15, 23]. Regarding
the connection with RL, it has also been shown that $\mathcal{H}_\infty$ control can be applied to stabilize the training of adversarial RL schemes in the linear quadratic setup [72, Section 5]. Given the fundamental importance of $\mathcal{H}_\infty$ control, we view it as an important benchmark for understanding the theoretical properties of direct policy search in the context of robust control and adversarial RL. In this work, we study and prove the global convergence properties of direct policy search on the $\mathcal{H}_\infty$ state-feedback synthesis problem.
The objective of $\mathcal{H}_\infty$ state-feedback synthesis is to design a linear state-feedback policy that stabilizes the closed-loop system and, at the same time, minimizes the $\mathcal{H}_\infty$ norm from the disturbance to a performance signal. The design goal is also equivalent to synthesizing a state-feedback policy that minimizes a quadratic cost subject to the worst-case disturbance. We will present the problem formulation for $\mathcal{H}_\infty$ state-feedback synthesis and discuss such connections in Section 2. Essentially, $\mathcal{H}_\infty$ state-feedback synthesis can be formulated as a constrained policy optimization problem $\min_{K \in \mathcal{K}} J(K)$, where the decision variable $K$ is a matrix parameterizing the linear state-feedback policy, the objective function $J(K)$ is the closed-loop $\mathcal{H}_\infty$ norm for given $K$, and the feasible set $\mathcal{K}$ consists of all the linear state-feedback policies stabilizing the closed-loop dynamics. Notice that the feasible set for the $\mathcal{H}_\infty$ state-feedback control problem is the same as the nonconvex feasible set for the LQR policy search problem [21, 7]. However, the objective function $J(K)$ for the $\mathcal{H}_\infty$ control problem can be non-differentiable over certain feasible points, introducing new difficulty to direct policy search. A large family of nonsmooth $\mathcal{H}_\infty$ policy search algorithms has been developed based on the concept of the Clarke subdifferential [1–3, 28, 9, 13]. However, a satisfying global convergence theory is still missing from the literature. Our paper bridges this gap by making the following two contributions.
1. We show that all Clarke stationary points for the $\mathcal{H}_\infty$ state-feedback policy search problem are also global minima.
2. We identify the coerciveness of the $\mathcal{H}_\infty$ cost function and use this property to show that Goldstein's subgradient method [25] and its implementable variants [71, 14, 9, 10, 37, 38] can be guaranteed to stay in the nonconvex feasible set of stabilizing policies during the optimization process and eventually find the global optimal solution of the $\mathcal{H}_\infty$ state-feedback control problem. Finite-time complexity bounds for finding $(\delta, \epsilon)$-stationary points are also provided.
Our work sheds new light on the theoretical properties of policy optimization methods on $\mathcal{H}_\infty$ control problems, and serves as a meaningful initial step towards a general global convergence theory of direct policy search on nonsmooth robust control synthesis.
Finally, it is worth clarifying the differences between $\mathcal{H}_\infty$ control and mixed $\mathcal{H}_2/\mathcal{H}_\infty$ design. For mixed $\mathcal{H}_2/\mathcal{H}_\infty$ control, the objective is to design a stabilizing policy that minimizes an $\mathcal{H}_2$ performance bound while satisfying an $\mathcal{H}_\infty$ constraint at the same time [24, 36, 34, 47]. In other words, mixed $\mathcal{H}_2/\mathcal{H}_\infty$ control aims at improving the average $\mathcal{H}_2$ performance while "maintaining" a certain level of robustness by keeping the closed-loop $\mathcal{H}_\infty$ norm smaller than a pre-specified number. In contrast, $\mathcal{H}_\infty$ control aims at "improving" the system robustness and the worst-case performance by achieving the smallest closed-loop $\mathcal{H}_\infty$ norm. In [73], it has been shown that the natural policy gradient method, initialized from a policy satisfying the $\mathcal{H}_\infty$ constraint, can be guaranteed to maintain the $\mathcal{H}_\infty$ requirement during the optimization process and eventually converge to the optimal solution of the mixed design problem. However, notice that the objective function for the mixed $\mathcal{H}_2/\mathcal{H}_\infty$ control problem is still differentiable over all the feasible points, and hence the analysis technique in [73] cannot be applied to our $\mathcal{H}_\infty$ control setting. More discussions on the connections and differences between these two problems will be given in the supplementary material.
2 Problem Formulation and Preliminaries
2.1 Notation
The set of $p$-dimensional real vectors is denoted as $\mathbb{R}^p$. For a matrix $A$, we use the notation $A^T$, $\|A\|$, $\operatorname{tr} A$, $\sigma_{\min}(A)$, $\|A\|_2$, and $\rho(A)$ to denote its transpose, largest singular value, trace, smallest singular value, Frobenius norm, and spectral radius, respectively. When a matrix $P$ is negative semidefinite (definite), we use the notation $P \preceq (\prec)\, 0$. When $P$ is positive semidefinite (definite), we use the notation $P \succeq (\succ)\, 0$. Consider a (real) sequence $u := \{u_0, u_1, \dots\}$ where $u_t \in \mathbb{R}^{n_u}$ for all $t$. This sequence is said to be in $\ell_2^{n_u}$ if $\sum_{t=0}^{\infty} \|u_t\|^2 < \infty$, where $\|u_t\|$ denotes the standard (vector) 2-norm of $u_t$. In addition, the $\ell_2$ norm of $u \in \ell_2^{n_u}$ is defined as $\|u\|^2 := \sum_{t=0}^{\infty} \|u_t\|^2$.
2.2 Problem statement: $\mathcal{H}_\infty$ state-feedback synthesis and a policy optimization formulation
We consider the following linear time-invariant (LTI) system
$$x_{t+1} = A x_t + B u_t + w_t, \quad x_0 = 0, \tag{1}$$
where $x_t \in \mathbb{R}^{n_x}$ is the state, $u_t \in \mathbb{R}^{n_u}$ is the control action, and $w_t \in \mathbb{R}^{n_w}$ is the disturbance. We have $A \in \mathbb{R}^{n_x \times n_x}$, $B \in \mathbb{R}^{n_x \times n_u}$, and $n_w = n_x$. We denote $x := \{x_0, x_1, \dots\}$, $u := \{u_0, u_1, \dots\}$, and $w := \{w_0, w_1, \dots\}$. The initial condition is fixed as $x_0 = 0$. The objective of $\mathcal{H}_\infty$ control is to choose $\{u_t\}$ to minimize the quadratic cost $\sum_{t=0}^{\infty}(x_t^T Q x_t + u_t^T R u_t)$ in the presence of the worst-case $\ell_2$ disturbance satisfying $\|w\| \le 1$. In this paper, the following assumption is adopted.
Assumption 1. The matrices $Q$ and $R$ are positive definite. The matrix pair $(A, B)$ is stabilizable.
In $\mathcal{H}_\infty$ control, $\{w_t\}$ is considered to be the worst-case disturbance satisfying the $\ell_2$ norm bound $\|w\| \le 1$, and can be chosen in an adversarial manner. This is different from LQR, which makes stochastic assumptions on $\{w_t\}$. Without loss of generality, we have chosen the $\ell_2$ upper bound on $w$ to be $1$. In principle, we can formulate the $\mathcal{H}_\infty$ control problem with any arbitrary $\ell_2$ upper bound on $w$, and there is no technical difference. We will provide more explanations on this fact in the supplementary material. Therefore, $\mathcal{H}_\infty$ control can be formulated as the following minimax problem:
$$\min_{u} \max_{w: \|w\| \le 1} \sum_{t=0}^{\infty} (x_t^T Q x_t + u_t^T R u_t). \tag{2}$$
Under Assumption 1, it is well known that the optimal solution for (2) can be achieved using a linear state-feedback policy $u_t = -K x_t$ (see [4]). Given any $K$, the LTI system (1) can be rewritten as
$$x_{t+1} = (A - BK) x_t + w_t, \quad x_0 = 0. \tag{3}$$
Now we define $z_t = (Q + K^T R K)^{1/2} x_t$. We have $\|z_t\|^2 = x_t^T (Q + K^T R K) x_t = x_t^T Q x_t + u_t^T R u_t$. We denote $z := \{z_0, z_1, \dots\}$. If $x \in \ell_2^{n_x}$, then we have $\|z\|^2 = \sum_{t=0}^{\infty} (x_t^T Q x_t + u_t^T R u_t) < +\infty$. Therefore, the closed-loop LTI system (3) can be viewed as a linear operator mapping any disturbance sequence $\{w_t\}$ to another sequence $\{z_t\}$. We denote this operator as $G_K$, where the subscript highlights the dependence of this operator on $K$. If $K$ is stabilizing, i.e. $\rho(A - BK) < 1$, then $G_K$ is bounded in the sense that it maps any $\ell_2$ sequence $w$ to another sequence $z$ in $\ell_2^{n_x}$. For any stabilizing $K$, the $\ell_2 \to \ell_2$ induced norm of $G_K$ can be defined as:
$$\|G_K\|_{2\to2} := \sup_{0 \neq \|w\| \le 1} \frac{\|z\|}{\|w\|}. \tag{4}$$
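For intuition about this operator view, the following minimal time-domain sketch (our own illustration; `rollout_gain` is a hypothetical helper, not code from this paper) applies $G_K$ to a finite disturbance sequence by rolling out (3); the ratio $\|z\|/\|w\|$ produced by any particular $w$ lower-bounds the induced norm (4).

```python
import numpy as np

def rollout_gain(A, B, Q, R, K, w_seq, extra_steps=500):
    """Roll out x_{t+1} = (A - BK) x_t + w_t from x_0 = 0 and accumulate
    ||z||^2 via ||z_t||^2 = x_t^T (Q + K^T R K) x_t. For stabilizing K, the
    returned ||z|| / ||w|| is a lower bound on ||G_K||_{2->2} in (4)."""
    W = Q + K.T @ R @ K
    Acl = A - B @ K
    x = np.zeros(A.shape[0])
    z_sq = 0.0
    for t in range(len(w_seq) + extra_steps):   # keep running to capture the tail
        z_sq += x @ W @ x
        w_t = w_seq[t] if t < len(w_seq) else np.zeros_like(x)
        x = Acl @ x + w_t
    w_sq = sum(float(w @ w) for w in w_seq)
    return np.sqrt(z_sq / w_sq)
```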
Since $G_K$ is a linear operator, it is straightforward to show
$$\|G_K\|_{2\to2}^2 = \max_{w: \|w\| \le 1} \sum_{t=0}^{\infty} x_t^T (Q + K^T R K) x_t = \max_{w: \|w\| \le 1} \sum_{t=0}^{\infty} (x_t^T Q x_t + u_t^T R u_t).$$
Therefore, the minimax optimization problem (2) can be rewritten as the policy optimization problem $\min_{K \in \mathcal{K}} \|G_K\|_{2\to2}^2$, where $\mathcal{K}$ is the set of all linear state-feedback stabilizing policies, i.e. $\mathcal{K} = \{K \in \mathbb{R}^{n_u \times n_x} : \rho(A - BK) < 1\}$. In the robust control literature [1–3, 28, 9, 13], it is standard to drop the square in the cost function and reformulate (2) as $\min_{K \in \mathcal{K}} \|G_K\|_{2\to2}$. This is exactly the policy optimization formulation for $\mathcal{H}_\infty$ state-feedback control. The main reason why this problem is termed $\mathcal{H}_\infty$ state-feedback control is that in the frequency domain, $G_K$ can be viewed as a transfer function which lives in the Hardy $\mathcal{H}_\infty$ space and whose $\mathcal{H}_\infty$ norm is exactly equal to $\|G_K\|_{2\to2}$.
Applying the frequency-domain formula for the $\mathcal{H}_\infty$ norm, we can calculate $\|G_K\|_{2\to2}$ as
$$\|G_K\|_{2\to2} = \sup_{\omega \in [0, 2\pi]} \lambda_{\max}^{1/2}\Big( (e^{j\omega} I - A + BK)^{-\mathsf{H}} (Q + K^T R K) (e^{j\omega} I - A + BK)^{-1} \Big), \tag{5}$$
where $I$ is the identity matrix, $(\cdot)^{-\mathsf{H}}$ denotes the inverse conjugate transpose, and $\lambda_{\max}$ denotes the largest eigenvalue of a given Hermitian matrix. Therefore, eventually the $\mathcal{H}_\infty$ state-feedback control problem can be formulated as
$$\min_{K \in \mathcal{K}} J(K), \tag{6}$$
where $J(K)$ is equal to the $\mathcal{H}_\infty$ norm specified by (5). Classical $\mathcal{H}_\infty$ control theory typically solves (6) by introducing extra Lyapunov variables and reparameterizing the problem into a higher-dimensional convex domain over which convex optimization algorithms can be applied [78, 19, 6]. In this paper, we revisit (6) as a benchmark for direct policy search, and discuss how to search for the optimal solution of (6) in the policy space directly. Applying direct policy search to (6) leads to a nonconvex nonsmooth optimization problem. A main technical challenge is that the objective function (5) can be non-differentiable over some important feasible points [1–3, 28, 9, 13].
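To make the policy optimization objective concrete, the following minimal sketch evaluates (5) on a finite frequency grid. It is our own illustration rather than code from this paper: the function name `closed_loop_hinf_norm` is hypothetical, and a finite grid only approximates the supremum over $\omega$ (dedicated $\mathcal{H}_\infty$ solvers use bisection instead).

```python
import numpy as np

def closed_loop_hinf_norm(A, B, Q, R, K, n_grid=2000):
    """Approximate J(K) = ||G_K||_{2->2} via the frequency-domain formula (5).

    Returns +inf for non-stabilizing K, consistent with J blowing up on the
    boundary of the feasible set.
    """
    Acl = A - B @ K
    if np.max(np.abs(np.linalg.eigvals(Acl))) >= 1.0:    # rho(A - BK) >= 1
        return np.inf
    W = Q + K.T @ R @ K                                  # the weight Q + K^T R K
    n, J = A.shape[0], 0.0
    for omega in np.linspace(0.0, 2.0 * np.pi, n_grid):
        M = np.linalg.inv(np.exp(1j * omega) * np.eye(n) - Acl)
        H = M.conj().T @ W @ M                           # Hermitian matrix in (5)
        J = max(J, float(np.sqrt(np.linalg.eigvalsh(H)[-1])))
    return J
```

Note that the two sources of nonsmoothness discussed below (the largest eigenvalue and the supremum over $\omega$) are both visible in this evaluator.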
2.3 Direct policy search: A nonsmooth optimization perspective
Now we briefly review several key facts known for the $\mathcal{H}_\infty$ policy optimization problem (6).
Proposition 1. The set $\mathcal{K} = \{K : \rho(A - BK) < 1\}$ is open. In general, it can be unbounded and nonconvex. The cost function (5) is continuous and nonconvex in $K$.
See [21, 8] for some related proofs. We have also included more explanations in the supplementary material. An immediate consequence is that (6) becomes a nonconvex optimization problem. Another important fact is that the objective function (5) is also nonsmooth. Specifically, (5) is subject to two sources of nonsmoothness: the largest eigenvalue for a fixed frequency $\omega$ is a nonsmooth function, and the supremum over $\omega \in [0, 2\pi]$ is also a nonsmooth operation. As a matter of fact, the $\mathcal{H}_\infty$ objective function (5) can be non-differentiable over important feasible points, e.g. optimal points. Fortunately, it is well known¹ that the $\mathcal{H}_\infty$ objective function (5) has the following desired property, so it is Clarke subdifferentiable.

Proposition 2. The $\mathcal{H}_\infty$ objective function (5) is locally Lipschitz and subdifferentially regular over the stabilizing feasible set $\mathcal{K}$.
Recall that $J : \mathcal{K} \to \mathbb{R}$ is locally Lipschitz if for any bounded $S \subset \mathcal{K}$ there exists a constant $L > 0$ such that $|J(K) - J(K')| \le L \|K - K'\|_2$ for all $K, K' \in S$. Based on Rademacher's theorem, a locally Lipschitz function is differentiable almost everywhere, and the Clarke subdifferential is well defined for all feasible points. Formally, the Clarke subdifferential is defined as
$$\partial_C J(K) := \operatorname{conv}\Big\{ \lim_{i \to \infty} \nabla J(K_i) : K_i \to K,\ K_i \in \operatorname{dom}(\nabla J) \subset \mathcal{K} \Big\}, \tag{7}$$
where $\operatorname{conv}$ denotes the convex hull. Then we know that the Clarke subdifferential for the $\mathcal{H}_\infty$ objective function (5) is well defined for all $K \in \mathcal{K}$. We say that $K$ is a Clarke stationary point if $0 \in \partial_C J(K)$. The following fact is also well known.

Proposition 3. If $K$ is a local minimum of $J$, then $0 \in \partial_C J(K)$ and $K$ is a Clarke stationary point.
Under Assumption 1, it is well known that there exists $K^* \in \mathcal{K}$ achieving the minimum of (6). Since $\mathcal{K}$ is an open set, $K^*$ has to be an interior point of $\mathcal{K}$, and hence $K^*$ has to be a Clarke stationary point. In Section 3, we will prove that any Clarke stationary point for (6) is actually a global minimum.
Now we briefly elaborate on the subdifferential regularity stated in Proposition 2. For any given direction $d$ (which has the same dimension as $K$), the generalized Clarke directional derivative of $J$ is defined as
$$J^\circ(K, d) := \limsup_{K' \to K,\; t \searrow 0} \frac{J(K' + t d) - J(K')}{t}. \tag{8}$$
In contrast, the (ordinary) directional derivative is defined as follows (when it exists):
$$J'(K, d) := \lim_{t \searrow 0} \frac{J(K + t d) - J(K)}{t}. \tag{9}$$
¹We cannot find a formal statement of Proposition 2 in the literature. However, based on our discussions with other researchers who have worked on nonsmooth $\mathcal{H}_\infty$ synthesis for a long time, this fact is well known, and hence we do not claim any credit for deriving this result. As a matter of fact, although not explicitly stated, the proof of Proposition 2 is hinted at in the last paragraph of [2, Section III], given the facts that the $\mathcal{H}_\infty$ norm is a convex function over the Hardy $\mathcal{H}_\infty$ space (which is a Banach space) and the mapping from $K \in \mathcal{K}$ to the (infinite-dimensional) Hardy $\mathcal{H}_\infty$ space is strictly differentiable. For completeness, a simple proof of Proposition 2 based on Clarke's chain rule [11] is included in the supplementary material.
In general, the Clarke directional derivative can be different from the (ordinary) directional derivative. Sometimes the ordinary directional derivative may not even exist. The objective function $J(K)$ is subdifferentially regular if for every $K \in \mathcal{K}$ the ordinary directional derivative always exists and coincides with the generalized one for every direction, i.e. $J'(K, d) = J^\circ(K, d)$. The most important consequence of subdifferential regularity is given as follows.

Corollary 1. Suppose $K \in \mathcal{K}$ is a Clarke stationary point for $J$. If $J$ is subdifferentially regular, then the directional derivatives $J'(K, d)$ are non-negative for all $d$.
See [56, Theorem 10.1] for related proofs and more discussions. Notice that having non-negative directional derivatives does not by itself mean that the point $K$ is a local minimum. Nevertheless, the above fact will be used in our main theoretical developments. We now summarize two key difficulties in establishing a global convergence theory for direct policy search on the $\mathcal{H}_\infty$ state-feedback control problem (6). First, it is unclear whether direct policy search will get stuck at some local minimum. Second, it is challenging to guarantee that direct policy search stays in the nonconvex feasible set $\mathcal{K}$ during the optimization process. Since $\mathcal{K}$ is nonconvex, we cannot use a projection step to maintain feasibility. Our main results will address these two issues.
2.4 Goldstein subdifferential
Generating a good descent direction for nonsmooth optimization is not trivial. Many nonsmooth optimization algorithms are based on the concept of the Goldstein subdifferential [25]. Before proceeding to our main result, we briefly review this concept here.
Definition 1 (Goldstein subdifferential). Suppose $J$ is locally Lipschitz. Given a point $K \in \mathcal{K}$ and a parameter $\delta > 0$, the Goldstein subdifferential of $J$ at $K$ is defined to be the following set:
$$\partial_\delta J(K) := \operatorname{conv}\Big\{ \bigcup_{K' \in \mathbb{B}_\delta(K)} \partial_C J(K') \Big\}, \tag{10}$$
where $\mathbb{B}_\delta(K)$ denotes the $\delta$-ball around $K$. The above definition implicitly requires $\mathbb{B}_\delta(K) \subset \mathcal{K}$.

Based on the above definition, one can further define the notion of $(\delta, \epsilon)$-stationarity. A point $K$ is said to be $(\delta, \epsilon)$-stationary if $\operatorname{dist}(0, \partial_\delta J(K)) \le \epsilon$. It is well known that the minimal norm element of the Goldstein subdifferential generates a good descent direction. This fact is stated as follows.
Proposition 4 ([25]). Let $F$ be the minimal norm element in $\partial_\delta J(K)$. Suppose $K - \alpha F / \|F\|_2 \in \mathcal{K}$ for any $0 \le \alpha \le \delta$. Then we have
$$J(K - \delta F / \|F\|_2) \le J(K) - \delta \|F\|_2. \tag{11}$$
The idea of the Goldstein subdifferential has been used in designing algorithms for nonsmooth $\mathcal{H}_\infty$ control [3, 28, 9, 13]. We will show that such policy search algorithms can be guaranteed to find the global minimum of (6). It is worth mentioning that there are other notions of enlarged subdifferential [2] which can also lead to good descent directions for nonsmooth $\mathcal{H}_\infty$ synthesis. In this paper, we focus on the notion of the Goldstein subdifferential and related policy search algorithms.
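To illustrate how implementable variants of Goldstein's method [71, 14, 9, 10, 37, 38] can operate, the following sketch performs one basic gradient-sampling step: it estimates gradients at random points in the $\delta$-ball (by Rademacher's theorem, $J$ is differentiable almost everywhere; we use finite differences purely for simplicity), approximates the minimal norm element of (10) over the convex hull of the samples, and takes the normalized step of Proposition 4. The function names, the sampling rule, and the stopping heuristics are our own simplifications rather than the exact algorithms analyzed in this paper, and we assume $J$ is finite on $\mathbb{B}_\delta(K)$, as Definition 1 requires.

```python
import numpy as np

def min_norm_element(G, iters=200):
    """Frank-Wolfe (Gilbert's algorithm) for the minimal-norm point in the
    convex hull of the rows of G."""
    lam = np.full(G.shape[0], 1.0 / G.shape[0])          # simplex weights
    for _ in range(iters):
        v = lam @ G                                      # current point in the hull
        i = int(np.argmin(G @ v))                        # vertex minimizing <v, g_i>
        d = G[i] - v
        if d @ d < 1e-14:
            break
        gamma = min(max(-(v @ d) / (d @ d), 0.0), 1.0)   # exact line search
        lam *= 1.0 - gamma
        lam[i] += gamma
    return lam @ G

def goldstein_step(J, K, delta, n_samples=20, fd_eps=1e-6, rng=None):
    """One descent step K - delta * F / ||F||_2, with F an estimate of the
    minimal norm element of the Goldstein subdifferential (10)."""
    rng = np.random.default_rng(0) if rng is None else rng
    grads = []
    for _ in range(n_samples):
        P = rng.normal(size=K.shape)
        Kp = K + delta * rng.uniform() * P / np.linalg.norm(P)  # K' in the delta-ball
        base, g = J(Kp), np.zeros_like(K)
        for idx in np.ndindex(K.shape):                  # forward differences at K'
            E = np.zeros_like(K)
            E[idx] = fd_eps
            g[idx] = (J(Kp + E) - base) / fd_eps
        grads.append(g.ravel())
    F = min_norm_element(np.vstack(grads))
    nF = np.linalg.norm(F)      # a small value suggests (delta, eps)-stationarity
    return K if nF < 1e-9 else K - delta * F.reshape(K.shape) / nF
```

With $J$ taken to be a pointwise evaluator such as the frequency-sweep sketch in Section 2.2, iterating `goldstein_step` while shrinking $\delta$ mimics the descent guarantee (11).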
3 Optimization Landscape for $\mathcal{H}_\infty$ State-Feedback Control
In this section, we investigate the optimization landscape of the $\mathcal{H}_\infty$ state-feedback policy search problem, and show that any Clarke stationary point of (6) is also a global minimum. We start by showing the coerciveness of the $\mathcal{H}_\infty$ objective function (5).
Lemma 1. The $\mathcal{H}_\infty$ objective function $J(K)$ defined by (5) is coercive over the set $\mathcal{K}$, in the sense that for any sequence $\{K^l\}_{l=1}^{\infty} \subset \mathcal{K}$ we have $J(K^l) \to +\infty$ if either $\|K^l\|_2 \to +\infty$, or $K^l$ converges to an element on the boundary $\partial\mathcal{K}$.
Proof. We will only provide a proof sketch here; a detailed proof is presented in the supplementary material. Suppose we have a sequence $\{K^l\}$ satisfying $\|K^l\|_2 \to +\infty$. We can choose $w = \{w_0, 0, 0, \dots\}$ with $\|w_0\| = 1$ and show
$$J(K^l) \ge \big( w_0^T (Q + (K^l)^T R K^l) w_0 \big)^{1/2} \ge \big( \lambda_{\min}(R)\, \|K^l w_0\|^2 \big)^{1/2}.$$
Clearly, we have used the positive definiteness of $R$ in the above derivation. Then by carefully choosing $w_0$, we can ensure $J(K^l) \to +\infty$ as $\|K^l\|_2 \to +\infty$. Next, we assume $K^l \to K \in \partial\mathcal{K}$. We have $\rho(A - BK) = 1$, and hence there exists some $\omega_0$ such that $(e^{j\omega_0} I - A + BK)$ becomes singular, which forces the supremum in (5) to blow up along the sequence and gives $J(K^l) \to +\infty$.
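To make the coerciveness tangible, here is a toy numerical illustration (our own example, reusing the hypothetical `closed_loop_hinf_norm` sketch from Section 2.2) on a scalar system whose stabilizing set is an open interval:

```python
import numpy as np

# Scalar system: A = 1.2, B = 1, Q = R = 1, so rho(A - BK) = |1.2 - K| and the
# stabilizing set is the open interval K in (0.2, 2.2).
A, B = np.array([[1.2]]), np.array([[1.0]])
Q, R = np.array([[1.0]]), np.array([[1.0]])
for k in [1.2, 0.5, 0.25, 0.21, 0.201]:      # march toward the boundary K = 0.2
    J = closed_loop_hinf_norm(A, B, Q, R, np.array([[k]]))
    print(f"K = {k:.3f}, J(K) = {J:.1f}")    # J(K) grows without bound, per Lemma 1
```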