Global Convergence of Direct Policy Search for State-Feedback $\mathcal{H}_\infty$ Robust Control: A Revisit of Nonsmooth Synthesis with Goldstein Subdifferential
Xingang Guo, Bin Hu
Department of Electrical and Computer Engineering
Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
{xingang2,binhu7}@illinois.edu
Abstract
Direct policy search has been widely applied in modern reinforcement learning and continuous control. However, the theoretical properties of direct policy search on nonsmooth robust control synthesis have not been fully understood. The optimal $\mathcal{H}_\infty$ control framework aims at designing a policy to minimize the closed-loop $\mathcal{H}_\infty$ norm, and is arguably the most fundamental robust control paradigm. In this work, we show that direct policy search is guaranteed to find the global solution of the robust $\mathcal{H}_\infty$ state-feedback control design problem. Notice that policy search for optimal $\mathcal{H}_\infty$ control leads to a constrained nonconvex nonsmooth optimization problem, where the nonconvex feasible set consists of all the policies stabilizing the closed-loop dynamics. We show that for this nonsmooth optimization problem, all Clarke stationary points are global minima. Next, we identify the coerciveness of the closed-loop $\mathcal{H}_\infty$ objective function, and prove that all the sublevel sets of the resultant policy search problem are compact. Based on these properties, we show that Goldstein's subgradient method and its implementable variants can be guaranteed to stay in the nonconvex feasible set and eventually find the global optimal solution of the $\mathcal{H}_\infty$ state-feedback synthesis problem. Our work builds a new connection between nonconvex nonsmooth optimization theory and robust control, leading to an interesting global convergence result for direct policy search on optimal $\mathcal{H}_\infty$ synthesis.
1 Introduction
Reinforcement learning (RL) has achieved impressive performance on many continuous control tasks [59, 40], and policy optimization is one of the main workhorses for such applications [18, 65, 58, 60]. Recently, there have been extensive research efforts studying the global convergence properties of policy optimization methods on benchmark control problems including the linear quadratic regulator (LQR) [21, 7, 41, 70, 44, 22, 29], stabilization [52, 51], linear robust/risk-sensitive control [73, 72, 26, 74, 75, 12], Markov jump linear quadratic control [32, 31, 33, 55], Lur'e system control [53], output feedback control [20, 77, 39, 17, 16, 43, 76], and dynamic filtering [68]. For all these benchmark problems, the objective function in the policy optimization formulation is always differentiable over the entire feasible set, and the existing convergence theory heavily relies on this fact. Consequently, an important open question remains whether direct policy search can enjoy similar global convergence properties when applied to the famous $\mathcal{H}_\infty$ control problem, whose objective function can be non-differentiable over certain points in the policy space [1–3, 28, 9, 13, 48]. Different from LQR, which considers stochastic disturbance sequences, $\mathcal{H}_\infty$ control directly addresses the worst-case disturbance, and provides arguably the most fundamental robust control paradigm [78, 19, 62, 4, 15, 23]. Regarding
the connection with RL, it has also been shown that $\mathcal{H}_\infty$ control can be applied to stabilize the training of adversarial RL schemes in the linear quadratic setup [72, Section 5]. Given the fundamental importance of $\mathcal{H}_\infty$ control, we view it as an important benchmark for understanding the theoretical properties of direct policy search in the context of robust control and adversarial RL. In this work, we study and prove the global convergence properties of direct policy search on the $\mathcal{H}_\infty$ state-feedback synthesis problem.
The objective of $\mathcal{H}_\infty$ state-feedback synthesis is to design a linear state-feedback policy that stabilizes the closed-loop system and, at the same time, minimizes the $\mathcal{H}_\infty$ norm from the disturbance to a performance signal. The design goal is also equivalent to synthesizing a state-feedback policy that minimizes a quadratic cost subject to the worst-case disturbance. We will present the problem formulation for $\mathcal{H}_\infty$ state-feedback synthesis and discuss such connections in Section 2. Essentially, $\mathcal{H}_\infty$ state-feedback synthesis can be formulated as a constrained policy optimization problem $\min_{K \in \mathcal{K}} J(K)$, where the decision variable $K$ is a matrix parameterizing the linear state-feedback policy, the objective function $J(K)$ is the closed-loop $\mathcal{H}_\infty$ norm for given $K$, and the feasible set $\mathcal{K}$ consists of all the linear state-feedback policies stabilizing the closed-loop dynamics. Notice that the feasible set for the $\mathcal{H}_\infty$ state-feedback control problem is the same as the nonconvex feasible set for the LQR policy search problem [21, 7]. However, the objective function $J(K)$ for the $\mathcal{H}_\infty$ control problem can be non-differentiable over certain feasible points, introducing new difficulty to direct policy search. A large family of nonsmooth $\mathcal{H}_\infty$ policy search algorithms has been developed based on the concept of the Clarke subdifferential [1–3, 28, 9, 13]. However, a satisfying global convergence theory is still missing from the literature. Our paper bridges this gap by making the following two contributions.
1. We show that all Clarke stationary points for the $\mathcal{H}_\infty$ state-feedback policy search problem are also global minima.
2. We identify the coerciveness of the $\mathcal{H}_\infty$ cost function and use this property to show that Goldstein's subgradient method [25] and its implementable variants [71, 14, 9, 10, 37, 38] can be guaranteed to stay in the nonconvex feasible set of stabilizing policies during the optimization process and eventually find the global optimal solution of the $\mathcal{H}_\infty$ state-feedback control problem. Finite-time complexity bounds for finding $(\delta, \epsilon)$-stationary points are also provided.
Our work sheds new light on the theoretical properties of policy optimization methods on $\mathcal{H}_\infty$ control problems, and serves as a meaningful initial step towards a general global convergence theory of direct policy search on nonsmooth robust control synthesis.
Finally, it is worth clarifying the differences between $\mathcal{H}_\infty$ control and mixed $\mathcal{H}_2/\mathcal{H}_\infty$ design. For mixed $\mathcal{H}_2/\mathcal{H}_\infty$ control, the objective is to design a stabilizing policy that minimizes an $\mathcal{H}_2$ performance bound while satisfying an $\mathcal{H}_\infty$ constraint at the same time [24, 36, 34, 47]. In other words, mixed $\mathcal{H}_2/\mathcal{H}_\infty$ control aims at improving the average $\mathcal{H}_2$ performance while "maintaining" a certain level of robustness by keeping the closed-loop $\mathcal{H}_\infty$ norm smaller than a pre-specified number. In contrast, $\mathcal{H}_\infty$ control aims at "improving" the system robustness and the worst-case performance by achieving the smallest closed-loop $\mathcal{H}_\infty$ norm. In [73], it has been shown that the natural policy gradient method, initialized from a policy satisfying the $\mathcal{H}_\infty$ constraint, can be guaranteed to maintain the $\mathcal{H}_\infty$ requirement during the optimization process and eventually converge to the optimal solution of the mixed design problem. However, notice that the objective function for the mixed $\mathcal{H}_2/\mathcal{H}_\infty$ control problem is still differentiable over all the feasible points, and hence the analysis technique in [73] cannot be applied to our $\mathcal{H}_\infty$ control setting. More discussions on the connections and differences between these two problems will be given in the supplementary material.
2 Problem Formulation and Preliminaries
2.1 Notation
The set of $p$-dimensional real vectors is denoted as $\mathbb{R}^p$. For a matrix $A$, we use the notation $A^T$, $\|A\|$, $\operatorname{tr} A$, $\sigma_{\min}(A)$, $\|A\|_2$, and $\rho(A)$ to denote its transpose, largest singular value, trace, smallest singular value, Frobenius norm, and spectral radius, respectively. When a matrix $P$ is negative semidefinite (definite), we use the notation $P \preceq (\prec)\, 0$. When $P$ is positive semidefinite (definite), we use the notation $P \succeq (\succ)\, 0$. Consider a (real) sequence $u := \{u_0, u_1, \dots\}$ where $u_t \in \mathbb{R}^{n_u}$ for all $t$. This sequence is said to be in $\ell_2^{n_u}$ if $\sum_{t=0}^{\infty} \|u_t\|^2 < \infty$, where $\|u_t\|$ denotes the standard (vector) 2-norm of $u_t$. In addition, the $\ell_2$ norm of $u \in \ell_2^{n_u}$ is defined as $\|u\|^2 := \sum_{t=0}^{\infty} \|u_t\|^2$.
2.2 Problem statement: $\mathcal{H}_\infty$ state-feedback synthesis and a policy optimization formulation
We consider the following linear time-invariant (LTI) system
$$x_{t+1} = A x_t + B u_t + w_t, \quad x_0 = 0, \tag{1}$$
where $x_t \in \mathbb{R}^{n_x}$ is the state, $u_t \in \mathbb{R}^{n_u}$ is the control action, and $w_t \in \mathbb{R}^{n_w}$ is the disturbance. We have $A \in \mathbb{R}^{n_x \times n_x}$, $B \in \mathbb{R}^{n_x \times n_u}$, and $n_w = n_x$. We denote $x := \{x_0, x_1, \dots\}$, $u := \{u_0, u_1, \dots\}$, and $w := \{w_0, w_1, \dots\}$. The initial condition is fixed as $x_0 = 0$. The objective of $\mathcal{H}_\infty$ control is to choose $\{u_t\}$ to minimize the quadratic cost $\sum_{t=0}^{\infty}(x_t^T Q x_t + u_t^T R u_t)$ in the presence of the worst-case $\ell_2$ disturbance satisfying $\|w\| \le 1$. In this paper, the following assumption is adopted.
Assumption 1. The matrices $Q$ and $R$ are positive definite. The matrix pair $(A, B)$ is stabilizable.
In $\mathcal{H}_\infty$ control, $\{w_t\}$ is considered to be the worst-case disturbance satisfying the $\ell_2$ norm bound $\|w\| \le 1$, and can be chosen in an adversarial manner. This is different from LQR, which makes stochastic assumptions on $\{w_t\}$. Without loss of generality, we have chosen the $\ell_2$ upper bound on $w$ to be $1$. In principle, we can formulate the $\mathcal{H}_\infty$ control problem with any arbitrary $\ell_2$ upper bound on $w$, and there is no technical difference. We will provide more explanations on this fact in the supplementary material. Therefore, $\mathcal{H}_\infty$ control can be formulated as the following minimax problem:
$$\min_{u} \max_{w: \|w\| \le 1} \sum_{t=0}^{\infty} (x_t^T Q x_t + u_t^T R u_t). \tag{2}$$
Under Assumption 1, it is well known that the optimal solution for (2) can be achieved using a linear state-feedback policy $u_t = -K x_t$ (see [4]). Given any $K$, the LTI system (1) can be rewritten as
$$x_{t+1} = (A - BK) x_t + w_t, \quad x_0 = 0. \tag{3}$$
Now we define $z_t = (Q + K^T R K)^{1/2} x_t$. We have $\|z_t\|^2 = x_t^T (Q + K^T R K) x_t = x_t^T Q x_t + u_t^T R u_t$. We denote $z := \{z_0, z_1, \dots\}$. If $x \in \ell_2^{n_x}$, then we have $\|z\|^2 = \sum_{t=0}^{\infty} (x_t^T Q x_t + u_t^T R u_t) < +\infty$. Therefore, the closed-loop LTI system (3) can be viewed as a linear operator mapping any disturbance sequence $\{w_t\}$ to another sequence $\{z_t\}$. We denote this operator as $G_K$, where the subscript highlights the dependence of this operator on $K$. If $K$ is stabilizing, i.e. $\rho(A - BK) < 1$, then $G_K$ is bounded in the sense that it maps any $\ell_2$ sequence $w$ to another sequence $z$ in $\ell_2^{n_x}$. For any stabilizing $K$, the $\ell_2 \to \ell_2$ induced norm of $G_K$ can be defined as:
$$\|G_K\|_{2\to2} := \sup_{0 \neq \|w\| \le 1} \frac{\|z\|}{\|w\|}. \tag{4}$$
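For intuition about this operator view, the following minimal time-domain sketch (our own illustration; `rollout_gain` is a hypothetical helper, not code from this paper) applies $G_K$ to a finite disturbance sequence by rolling out (3); the ratio $\|z\|/\|w\|$ produced by any particular $w$ lower-bounds the induced norm (4).

```python
import numpy as np

def rollout_gain(A, B, Q, R, K, w_seq, extra_steps=500):
    """Roll out x_{t+1} = (A - BK) x_t + w_t from x_0 = 0 and accumulate
    ||z||^2 via ||z_t||^2 = x_t^T (Q + K^T R K) x_t. For stabilizing K, the
    returned ||z|| / ||w|| is a lower bound on ||G_K||_{2->2} in (4)."""
    W = Q + K.T @ R @ K
    Acl = A - B @ K
    x = np.zeros(A.shape[0])
    z_sq = 0.0
    for t in range(len(w_seq) + extra_steps):   # keep running to capture the tail
        z_sq += x @ W @ x
        w_t = w_seq[t] if t < len(w_seq) else np.zeros_like(x)
        x = Acl @ x + w_t
    w_sq = sum(float(w @ w) for w in w_seq)
    return np.sqrt(z_sq / w_sq)
```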
Since $G_K$ is a linear operator, it is straightforward to show
$$\|G_K\|_{2\to2}^2 = \max_{w: \|w\| \le 1} \sum_{t=0}^{\infty} x_t^T (Q + K^T R K) x_t = \max_{w: \|w\| \le 1} \sum_{t=0}^{\infty} (x_t^T Q x_t + u_t^T R u_t).$$
Therefore, the minimax optimization problem (2) can be rewritten as the policy optimization problem $\min_{K \in \mathcal{K}} \|G_K\|_{2\to2}^2$, where $\mathcal{K}$ is the set of all linear state-feedback stabilizing policies, i.e. $\mathcal{K} = \{K \in \mathbb{R}^{n_u \times n_x} : \rho(A - BK) < 1\}$. In the robust control literature [1–3, 28, 9, 13], it is standard to drop the square in the cost function and reformulate (2) as $\min_{K \in \mathcal{K}} \|G_K\|_{2\to2}$. This is exactly the policy optimization formulation for $\mathcal{H}_\infty$ state-feedback control. The main reason why this problem is termed $\mathcal{H}_\infty$ state-feedback control is that in the frequency domain, $G_K$ can be viewed as a transfer function which lives in the Hardy $\mathcal{H}_\infty$ space and whose $\mathcal{H}_\infty$ norm is exactly equal to $\|G_K\|_{2\to2}$.
Applying the frequency-domain formula for the $\mathcal{H}_\infty$ norm, we can calculate $\|G_K\|_{2\to2}$ as
$$\|G_K\|_{2\to2} = \sup_{\omega \in [0, 2\pi]} \lambda_{\max}^{1/2}\Big( (e^{j\omega} I - A + BK)^{-\mathsf{H}} (Q + K^T R K) (e^{j\omega} I - A + BK)^{-1} \Big), \tag{5}$$
where $I$ is the identity matrix, $(\cdot)^{-\mathsf{H}}$ denotes the inverse conjugate transpose, and $\lambda_{\max}$ denotes the largest eigenvalue of a given Hermitian matrix. Therefore, eventually the $\mathcal{H}_\infty$ state-feedback control problem can be formulated as
$$\min_{K \in \mathcal{K}} J(K), \tag{6}$$
where $J(K)$ is equal to the $\mathcal{H}_\infty$ norm specified by (5). Classical $\mathcal{H}_\infty$ control theory typically solves (6) by introducing extra Lyapunov variables and reparameterizing the problem into a higher-dimensional convex domain over which convex optimization algorithms can be applied [78, 19, 6]. In this paper, we revisit (6) as a benchmark for direct policy search, and discuss how to search for the optimal solution of (6) in the policy space directly. Applying direct policy search to (6) leads to a nonconvex nonsmooth optimization problem. A main technical challenge is that the objective function (5) can be non-differentiable over some important feasible points [1–3, 28, 9, 13].
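To make the policy optimization objective concrete, the following minimal sketch evaluates (5) on a finite frequency grid. It is our own illustration rather than code from this paper: the function name `closed_loop_hinf_norm` is hypothetical, and a finite grid only approximates the supremum over $\omega$ (dedicated $\mathcal{H}_\infty$ solvers use bisection instead).

```python
import numpy as np

def closed_loop_hinf_norm(A, B, Q, R, K, n_grid=2000):
    """Approximate J(K) = ||G_K||_{2->2} via the frequency-domain formula (5).

    Returns +inf for non-stabilizing K, consistent with J blowing up on the
    boundary of the feasible set.
    """
    Acl = A - B @ K
    if np.max(np.abs(np.linalg.eigvals(Acl))) >= 1.0:    # rho(A - BK) >= 1
        return np.inf
    W = Q + K.T @ R @ K                                  # the weight Q + K^T R K
    n, J = A.shape[0], 0.0
    for omega in np.linspace(0.0, 2.0 * np.pi, n_grid):
        M = np.linalg.inv(np.exp(1j * omega) * np.eye(n) - Acl)
        H = M.conj().T @ W @ M                           # Hermitian matrix in (5)
        J = max(J, float(np.sqrt(np.linalg.eigvalsh(H)[-1])))
    return J
```

Note that the two sources of nonsmoothness discussed below (the largest eigenvalue and the supremum over $\omega$) are both visible in this evaluator.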
2.3 Direct policy search: A nonsmooth optimization perspective
Now we briefly review several key facts known for the $\mathcal{H}_\infty$ policy optimization problem (6).
Proposition 1. The set $\mathcal{K} = \{K : \rho(A - BK) < 1\}$ is open. In general, it can be unbounded and nonconvex. The cost function (5) is continuous and nonconvex in $K$.
See [21, 8] for some related proofs. We have also included more explanations in the supplementary material. An immediate consequence is that (6) becomes a nonconvex optimization problem. Another important fact is that the objective function (5) is also nonsmooth. Specifically, (5) is subject to two sources of nonsmoothness: the largest eigenvalue for a fixed frequency $\omega$ is a nonsmooth function, and the supremum over $\omega \in [0, 2\pi]$ is also a nonsmooth operation. As a matter of fact, the $\mathcal{H}_\infty$ objective function (5) can be non-differentiable over important feasible points, e.g. optimal points. Fortunately, it is well known¹ that the $\mathcal{H}_\infty$ objective function (5) has the following desired property, so it is Clarke subdifferentiable.

Proposition 2. The $\mathcal{H}_\infty$ objective function (5) is locally Lipschitz and subdifferentially regular over the stabilizing feasible set $\mathcal{K}$.
Recall that $J : \mathcal{K} \to \mathbb{R}$ is locally Lipschitz if for any bounded $S \subset \mathcal{K}$ there exists a constant $L > 0$ such that $|J(K) - J(K')| \le L \|K - K'\|_2$ for all $K, K' \in S$. Based on Rademacher's theorem, a locally Lipschitz function is differentiable almost everywhere, and the Clarke subdifferential is well defined for all feasible points. Formally, the Clarke subdifferential is defined as
$$\partial_C J(K) := \operatorname{conv}\Big\{ \lim_{i \to \infty} \nabla J(K_i) : K_i \to K,\ K_i \in \operatorname{dom}(\nabla J) \subset \mathcal{K} \Big\}, \tag{7}$$
where $\operatorname{conv}$ denotes the convex hull. Then we know that the Clarke subdifferential for the $\mathcal{H}_\infty$ objective function (5) is well defined for all $K \in \mathcal{K}$. We say that $K$ is a Clarke stationary point if $0 \in \partial_C J(K)$. The following fact is also well known.

Proposition 3. If $K$ is a local minimum of $J$, then $0 \in \partial_C J(K)$ and $K$ is a Clarke stationary point.
Under Assumption 1, it is well known that there exists $K^* \in \mathcal{K}$ achieving the minimum of (6). Since $\mathcal{K}$ is an open set, $K^*$ has to be an interior point of $\mathcal{K}$, and hence $K^*$ has to be a Clarke stationary point. In Section 3, we will prove that any Clarke stationary point for (6) is actually a global minimum.
Now we briefly elaborate on the subdifferential regularity stated in Proposition 2. For any given direction $d$ (which has the same dimension as $K$), the generalized Clarke directional derivative of $J$ is defined as
$$J^\circ(K, d) := \limsup_{K' \to K,\; t \searrow 0} \frac{J(K' + t d) - J(K')}{t}. \tag{8}$$
In contrast, the (ordinary) directional derivative is defined as follows (when it exists):
$$J'(K, d) := \lim_{t \searrow 0} \frac{J(K + t d) - J(K)}{t}. \tag{9}$$
¹We cannot find a formal statement of Proposition 2 in the literature. However, based on our discussions with other researchers who have worked on nonsmooth $\mathcal{H}_\infty$ synthesis for a long time, this fact is well known, and hence we do not claim any credit for deriving this result. As a matter of fact, although not explicitly stated, the proof of Proposition 2 is hinted at in the last paragraph of [2, Section III], given the facts that the $\mathcal{H}_\infty$ norm is a convex function over the Hardy $\mathcal{H}_\infty$ space (which is a Banach space) and the mapping from $K \in \mathcal{K}$ to the (infinite-dimensional) Hardy $\mathcal{H}_\infty$ space is strictly differentiable. For completeness, a simple proof of Proposition 2 based on Clarke's chain rule [11] is included in the supplementary material.
In general, the Clarke directional derivative can be different from the (ordinary) directional derivative. Sometimes the ordinary directional derivative may not even exist. The objective function $J(K)$ is subdifferentially regular if for every $K \in \mathcal{K}$ the ordinary directional derivative always exists and coincides with the generalized one for every direction, i.e. $J'(K, d) = J^\circ(K, d)$. The most important consequence of subdifferential regularity is given as follows.

Corollary 1. Suppose $K \in \mathcal{K}$ is a Clarke stationary point for $J$. If $J$ is subdifferentially regular, then the directional derivatives $J'(K, d)$ are non-negative for all $d$.
See [56, Theorem 10.1] for related proofs and more discussions. Notice that having non-negative directional derivatives does not by itself mean that the point $K$ is a local minimum. Nevertheless, the above fact will be used in our main theoretical developments. We now summarize two key difficulties in establishing a global convergence theory for direct policy search on the $\mathcal{H}_\infty$ state-feedback control problem (6). First, it is unclear whether direct policy search will get stuck at some local minimum. Second, it is challenging to guarantee that direct policy search stays in the nonconvex feasible set $\mathcal{K}$ during the optimization process. Since $\mathcal{K}$ is nonconvex, we cannot use a projection step to maintain feasibility. Our main results will address these two issues.
2.4 Goldstein subdifferential
Generating a good descent direction for nonsmooth optimization is not trivial. Many nonsmooth optimization algorithms are based on the concept of the Goldstein subdifferential [25]. Before proceeding to our main result, we briefly review this concept here.
Definition 1 (Goldstein subdifferential). Suppose $J$ is locally Lipschitz. Given a point $K \in \mathcal{K}$ and a parameter $\delta > 0$, the Goldstein subdifferential of $J$ at $K$ is defined to be the following set:
$$\partial_\delta J(K) := \operatorname{conv}\Big\{ \bigcup_{K' \in \mathbb{B}_\delta(K)} \partial_C J(K') \Big\}, \tag{10}$$
where $\mathbb{B}_\delta(K)$ denotes the $\delta$-ball around $K$. The above definition implicitly requires $\mathbb{B}_\delta(K) \subset \mathcal{K}$.

Based on the above definition, one can further define the notion of $(\delta, \epsilon)$-stationarity. A point $K$ is said to be $(\delta, \epsilon)$-stationary if $\operatorname{dist}(0, \partial_\delta J(K)) \le \epsilon$. It is well known that the minimal norm element of the Goldstein subdifferential generates a good descent direction. This fact is stated as follows.
Proposition 4 ([25]). Let $F$ be the minimal norm element in $\partial_\delta J(K)$. Suppose $K - \alpha F / \|F\|_2 \in \mathcal{K}$ for any $0 \le \alpha \le \delta$. Then we have
$$J(K - \delta F / \|F\|_2) \le J(K) - \delta \|F\|_2. \tag{11}$$
The idea of the Goldstein subdifferential has been used in designing algorithms for nonsmooth $\mathcal{H}_\infty$ control [3, 28, 9, 13]. We will show that such policy search algorithms can be guaranteed to find the global minimum of (6). It is worth mentioning that there are other notions of enlarged subdifferential [2] which can also lead to good descent directions for nonsmooth $\mathcal{H}_\infty$ synthesis. In this paper, we focus on the notion of the Goldstein subdifferential and related policy search algorithms.
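To illustrate how implementable variants of Goldstein's method [71, 14, 9, 10, 37, 38] can operate, the following sketch performs one basic gradient-sampling step: it estimates gradients at random points in the $\delta$-ball (by Rademacher's theorem, $J$ is differentiable almost everywhere; we use finite differences purely for simplicity), approximates the minimal norm element of (10) over the convex hull of the samples, and takes the normalized step of Proposition 4. The function names, the sampling rule, and the stopping heuristics are our own simplifications rather than the exact algorithms analyzed in this paper, and we assume $J$ is finite on $\mathbb{B}_\delta(K)$, as Definition 1 requires.

```python
import numpy as np

def min_norm_element(G, iters=200):
    """Frank-Wolfe (Gilbert's algorithm) for the minimal-norm point in the
    convex hull of the rows of G."""
    lam = np.full(G.shape[0], 1.0 / G.shape[0])          # simplex weights
    for _ in range(iters):
        v = lam @ G                                      # current point in the hull
        i = int(np.argmin(G @ v))                        # vertex minimizing <v, g_i>
        d = G[i] - v
        if d @ d < 1e-14:
            break
        gamma = min(max(-(v @ d) / (d @ d), 0.0), 1.0)   # exact line search
        lam *= 1.0 - gamma
        lam[i] += gamma
    return lam @ G

def goldstein_step(J, K, delta, n_samples=20, fd_eps=1e-6, rng=None):
    """One descent step K - delta * F / ||F||_2, with F an estimate of the
    minimal norm element of the Goldstein subdifferential (10)."""
    rng = np.random.default_rng(0) if rng is None else rng
    grads = []
    for _ in range(n_samples):
        P = rng.normal(size=K.shape)
        Kp = K + delta * rng.uniform() * P / np.linalg.norm(P)  # K' in the delta-ball
        base, g = J(Kp), np.zeros_like(K)
        for idx in np.ndindex(K.shape):                  # forward differences at K'
            E = np.zeros_like(K)
            E[idx] = fd_eps
            g[idx] = (J(Kp + E) - base) / fd_eps
        grads.append(g.ravel())
    F = min_norm_element(np.vstack(grads))
    nF = np.linalg.norm(F)      # a small value suggests (delta, eps)-stationarity
    return K if nF < 1e-9 else K - delta * F.reshape(K.shape) / nF
```

With $J$ taken to be a pointwise evaluator such as the frequency-sweep sketch in Section 2.2, iterating `goldstein_step` while shrinking $\delta$ mimics the descent guarantee (11).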
3 Optimization Landscape for $\mathcal{H}_\infty$ State-Feedback Control
In this section, we investigate the optimization landscape of the $\mathcal{H}_\infty$ state-feedback policy search problem, and show that any Clarke stationary point of (6) is also a global minimum. We start by showing the coerciveness of the $\mathcal{H}_\infty$ objective function (5).
Lemma 1. The $\mathcal{H}_\infty$ objective function $J(K)$ defined by (5) is coercive over the set $\mathcal{K}$, in the sense that for any sequence $\{K^l\}_{l=1}^{\infty} \subset \mathcal{K}$ we have $J(K^l) \to +\infty$ if either $\|K^l\|_2 \to +\infty$, or $K^l$ converges to an element on the boundary $\partial\mathcal{K}$.
Proof. We will only provide a proof sketch here; a detailed proof is presented in the supplementary material. Suppose we have a sequence $\{K^l\}$ satisfying $\|K^l\|_2 \to +\infty$. We can choose $w = \{w_0, 0, 0, \dots\}$ with $\|w_0\| = 1$ and show
$$J(K^l) \ge \big( w_0^T (Q + (K^l)^T R K^l) w_0 \big)^{1/2} \ge \big( \lambda_{\min}(R)\, \|K^l w_0\|^2 \big)^{1/2}.$$
Clearly, we have used the positive definiteness of $R$ in the above derivation. Then by carefully choosing $w_0$, we can ensure $J(K^l) \to +\infty$ as $\|K^l\|_2 \to +\infty$. Next, we assume $K^l \to K \in \partial\mathcal{K}$. We have $\rho(A - BK) = 1$, and hence there exists some $\omega_0$ such that $(e^{j\omega_0} I - A + BK)$ becomes singular, which forces the supremum in (5) to blow up along the sequence and gives $J(K^l) \to +\infty$.
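To make the coerciveness tangible, here is a toy numerical illustration (our own example, reusing the hypothetical `closed_loop_hinf_norm` sketch from Section 2.2) on a scalar system whose stabilizing set is an open interval:

```python
import numpy as np

# Scalar system: A = 1.2, B = 1, Q = R = 1, so rho(A - BK) = |1.2 - K| and the
# stabilizing set is the open interval K in (0.2, 2.2).
A, B = np.array([[1.2]]), np.array([[1.0]])
Q, R = np.array([[1.0]]), np.array([[1.0]])
for k in [1.2, 0.5, 0.25, 0.21, 0.201]:      # march toward the boundary K = 0.2
    J = closed_loop_hinf_norm(A, B, Q, R, np.array([[k]]))
    print(f"K = {k:.3f}, J(K) = {J:.1f}")    # J(K) grows without bound, per Lemma 1
```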