the connection with RL, it has also been shown that $\mathcal{H}_\infty$ control can be applied to stabilize the training of adversarial RL schemes in the linear quadratic setup [72, Section 5]. Given the fundamental importance of $\mathcal{H}_\infty$ control, we view it as an important benchmark for understanding the theoretical properties of direct policy search in the context of robust control and adversarial RL. In this work, we study and prove the global convergence properties of direct policy search on the $\mathcal{H}_\infty$ state-feedback synthesis problem.
The objective of the $\mathcal{H}_\infty$ state-feedback synthesis is to design a linear state-feedback policy that stabilizes the closed-loop system and minimizes the $\mathcal{H}_\infty$ norm from the disturbance to a performance signal at the same time. The design goal is also equivalent to synthesizing a state-feedback policy that minimizes a quadratic cost subject to the worst-case disturbance. We will present the problem formulation for the $\mathcal{H}_\infty$ state-feedback synthesis and discuss such connections in Section 2.
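To make the objective concrete, the following minimal sketch estimates the closed-loop $\mathcal{H}_\infty$ norm $J(K)$ for a given state-feedback gain $K$ by gridding the unit circle. The discrete-time dynamics $x_{t+1} = Ax_t + Bu_t + w_t$, the performance signal $z_t = (Q^{1/2}x_t,\, R^{1/2}u_t)$, and all function names are illustrative assumptions rather than the exact setup introduced in Section 2.

```python
import numpy as np

def hinf_cost(A, B, K, Q, R, n_freq=2000):
    """Estimate J(K): the closed-loop H-infinity norm under u_t = -K x_t.

    Assumes x_{t+1} = A x_t + B u_t + w_t with performance signal
    z_t = [Q^{1/2} x_t; R^{1/2} u_t]; these modeling choices are illustrative.
    The norm is approximated by gridding the unit circle, so the returned
    value is a (lower) estimate of the true supremum over frequencies.
    """
    n = A.shape[0]
    A_cl = A - B @ K                                     # closed-loop state matrix
    if np.max(np.abs(np.linalg.eigvals(A_cl))) >= 1.0:   # K is not stabilizing
        return np.inf                                    # convention: J(K) = +inf
    # Output matrix mapping the state to z_t (the disturbance enters through the state).
    C = np.vstack([np.linalg.cholesky(Q).T, -np.linalg.cholesky(R).T @ K])
    worst_gain = 0.0
    for w in np.linspace(0.0, np.pi, n_freq):
        G = C @ np.linalg.solve(np.exp(1j * w) * np.eye(n) - A_cl, np.eye(n))
        worst_gain = max(worst_gain, np.linalg.svd(G, compute_uv=False)[0])
    return worst_gain
```

Assigning an infinite cost to non-stabilizing gains is one convenient way for the sketch to respect the feasibility constraint $K \in \mathcal{K}$ discussed next.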
Essentially, $\mathcal{H}_\infty$ state-feedback synthesis can be formulated as a constrained policy optimization problem $\min_{K \in \mathcal{K}} J(K)$, where the decision variable $K$ is a matrix parameterizing the linear state-feedback policy, the objective function $J(K)$ is the closed-loop $\mathcal{H}_\infty$ norm for given $K$, and the feasible set $\mathcal{K}$ consists of all the linear state-feedback policies stabilizing the closed-loop dynamics. Notice that the feasible set for the $\mathcal{H}_\infty$ state-feedback control problem is the same as the nonconvex feasible set for the LQR policy search problem [21, 7]. However, the objective function $J(K)$ for the $\mathcal{H}_\infty$ control problem can be non-differentiable at certain feasible points, introducing new difficulty to direct policy search. A large family of nonsmooth $\mathcal{H}_\infty$ policy search algorithms has been developed based on the concept of Clarke subdifferential [1–3, 28, 9, 13]. However, a satisfying global convergence theory is still missing from the literature. Our paper bridges this gap by making the following two contributions.
1. We show that all Clarke stationary points for the $\mathcal{H}_\infty$ state-feedback policy search problem are also global minima.
2. We identify the coerciveness of the $\mathcal{H}_\infty$ cost function and use this property to show that Goldstein's subgradient method [25] and its implementable variants [71, 14, 9, 10, 37, 38] can be guaranteed to stay in the nonconvex feasible set of stabilizing policies during the optimization process and eventually find the globally optimal solution of the $\mathcal{H}_\infty$ state-feedback control problem (a minimal sketch of one such subgradient step is given after this list). Finite-time complexity bounds for finding $(\delta, \epsilon)$-stationary points are also provided.
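The sketch below illustrates one implementable Goldstein-type step of the kind referenced in the second contribution: it approximates the Goldstein $\delta$-subdifferential at a gain $K$ by sampling gradients in a $\delta$-ball and takes a normalized step along (approximately) the minimal-norm element of their convex hull. The gradient oracle `J_grad`, the sampling scheme, and the solver choice are assumptions for illustration, not the exact variants analyzed in this paper.

```python
import numpy as np
from scipy.optimize import minimize

def goldstein_step(J_grad, K, delta, eps, n_samples=10, rng=None):
    """One sampled Goldstein-style subgradient step (illustrative sketch).

    J_grad(K) is assumed to return the gradient of the closed-loop
    H-infinity cost at a stabilizing gain where the cost is differentiable.
    Returns the updated gain and a flag indicating approximate
    (delta, eps)-stationarity of the current gain.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Approximate the Goldstein delta-subdifferential by sampling gradients
    # at random points inside the delta-ball around K.
    grads = []
    for _ in range(n_samples):
        step = rng.standard_normal(K.shape)
        step *= rng.uniform(0.0, delta) / max(np.linalg.norm(step), 1e-12)
        grads.append(J_grad(K + step).ravel())
    G = np.stack(grads)                                   # n_samples x dim(K)

    # Minimal-norm element of the convex hull: min ||G^T lam||^2 over the simplex.
    res = minimize(lambda lam: float(np.sum((G.T @ lam) ** 2)),
                   np.full(n_samples, 1.0 / n_samples), method="SLSQP",
                   bounds=[(0.0, 1.0)] * n_samples,
                   constraints={"type": "eq", "fun": lambda lam: np.sum(lam) - 1.0})
    g = (G.T @ res.x).reshape(K.shape)                     # approximate subgradient

    g_norm = np.linalg.norm(g)
    if g_norm <= eps:                                      # stop at this scale
        return K, True
    return K - delta * g / g_norm, False                   # normalized step of length delta
```

The normalized step of length $\delta$ mirrors Goldstein's update rule; coerciveness of the cost is what guarantees, in our analysis, that such iterates remain stabilizing, a property the sketch does not check explicitly.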
Our work sheds new light on the theoretical properties of policy optimization methods on $\mathcal{H}_\infty$ control problems, and serves as a meaningful initial step towards a general global convergence theory of direct policy search on nonsmooth robust control synthesis.
Finally, it is worth clarifying the differences between $\mathcal{H}_\infty$ control and mixed $\mathcal{H}_2/\mathcal{H}_\infty$ design. For mixed $\mathcal{H}_2/\mathcal{H}_\infty$ control, the objective is to design a stabilizing policy that minimizes an $\mathcal{H}_2$ performance bound and satisfies an $\mathcal{H}_\infty$ constraint at the same time [24, 36, 34, 47]. In other words, mixed $\mathcal{H}_2/\mathcal{H}_\infty$ control aims at improving the average $\mathcal{H}_2$ performance while “maintaining” a certain level of robustness by keeping the closed-loop $\mathcal{H}_\infty$ norm smaller than a pre-specified number. In contrast, $\mathcal{H}_\infty$ control aims at “improving” the system robustness and the worst-case performance via achieving the smallest closed-loop $\mathcal{H}_\infty$ norm. In [73], it has been shown that the natural policy gradient method initialized from a policy satisfying the $\mathcal{H}_\infty$ constraint can be guaranteed to maintain the $\mathcal{H}_\infty$ requirement during the optimization process and eventually converge to the optimal solution of the mixed design problem. However, notice that the objective function for the mixed $\mathcal{H}_2/\mathcal{H}_\infty$ control problem is still differentiable over all the feasible points, and hence the analysis technique in [73] cannot be applied to our $\mathcal{H}_\infty$ control setting. More discussion on the connections and differences between these two problems will be given in the supplementary material.
2 Problem Formulation and Preliminaries
2.1 Notation
The set of $p$-dimensional real vectors is denoted as $\mathbb{R}^p$. For a matrix $A$, we use the notation $A^T$, $\|A\|$, $\mathrm{tr}\,A$, $\sigma_{\min}(A)$, $\|A\|_2$, and $\rho(A)$ to denote its transpose, largest singular value, trace, smallest singular value, Frobenius norm, and spectral radius, respectively. When a matrix $P$ is negative semidefinite (definite), we will use the notation $P \preceq (\prec)\, 0$. When $P$ is positive semidefinite (definite), we use the notation $P \succeq (\succ)\, 0$. Consider a (real) sequence $u := \{u_0, u_1, \cdots\}$ where $u_t \in \mathbb{R}^{n_u}$ for all $t$. This