such scenarios include the static output feedback problem (27), structured H∞ synthesis (28–
33), and distributed control (34), all of significant importance in applications. The PO
framework is more flexible, as evidenced by the recent advances in deep RL. PO is also more
scalable for high-dimensional problems, as it does not generally introduce extra variables into the optimization problem and admits a broader range of optimization methods as compared
with the SDP or SOS formulations. However, as pointed out previously, nonconvexity of
the PO formulation, even for relatively simple linear control problems, has made deriving
theoretical guarantees for direct policy optimization challenging, preventing the acceptance
of PO as a mainstream control design tool.
In this survey, our aim is to revisit these issues from a modern optimization perspective,
and provide a unified treatment of the recently developed global convergence/complexity
theory for PO in the context of control synthesis. Recent theoretical results on PO for par-
ticular classes of control synthesis problems, some of which are discussed in this survey, are
not only exciting, but have also led to a new research thrust at the interface of control theory
and machine learning. This survey includes control synthesis related to linear quadratic
regulator theory (35–44), stabilization (45–47), linear robust/risk-sensitive control (48–55),
Markov jump linear quadratic control (56–59), Lur’e system control (60), output feedback
control (61–67), and dynamic filtering (68). Surprisingly, some of these strong global conver-
gence results for PO have been obtained in the absence of convexity in the design objective
and/or the underlying feasible set.
These global convergence guarantees have a number of implications for learning and
control. Firstly, these results facilitate examining other classes of synthesis problems in
the same general framework. As will be pointed out in this survey, there is an elegant
geometry at play between certificates and controllers in the synthesis process, with imme-
diate algorithmic implications. Secondly, the theoretical developments in PO have created
a renewed interest in the control community to examine synthesis of dynamic systems from
a complementary perspective that, in our view, is more integrated with learning in general, and with RL in particular. This will complement and strengthen the existing connections
between RL and control (69–71). Lastly, the geometric analysis of PO-inspired algorithms
may shed light on issues in state-of-the-art policy-based RL, critical for deriving guarantees
for any subsequent RL-based synthesis procedure for dynamic systems.
This survey is organized to reflect our perspective – and our excitement – on how PO
(and in particular PG) methods provide a streamlined approach for system synthesis, and
build a bridge between control and learning. First, we provide the PO formulations for
various control problems in §2. Then we delve into the PO convergence theory for the classic linear quadratic regulator (LQR) problem in §3. As it turns out, the analysis of LQR PO hinges on the coerciveness of the cost function and its gradient dominance property (see §3.2). These properties can then be utilized to devise gradient updates that yield stabilizing feedback policies at each iteration and converge to the globally optimal policy; a schematic sketch is given below.
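To fix ideas, the following is a minimal sketch of the discrete-time LQR PO formulation and of the gradient dominance property underlying its analysis; the notation ($\mathcal{K}$, $\mathcal{D}$, $\mu$, $\eta$) is introduced here only for illustration, and the precise constants and conditions are deferred to §3:
\[
\min_{K \in \mathcal{K}} \; J(K) \;=\; \mathbb{E}_{x_0 \sim \mathcal{D}} \sum_{t=0}^{\infty} \bigl( x_t^\top Q x_t + u_t^\top R u_t \bigr),
\qquad x_{t+1} = A x_t + B u_t, \quad u_t = -K x_t,
\]
where $\mathcal{K}$ denotes the set of stabilizing feedback gains. On any sublevel set of $J$ over $\mathcal{K}$, the cost satisfies a gradient dominance (Polyak–Łojasiewicz-type) inequality
\[
J(K) - J(K^\star) \;\le\; \mu \,\|\nabla J(K)\|_F^2,
\]
so that the gradient update $K^{+} = K - \eta\,\nabla J(K)$, with a sufficiently small stepsize $\eta$, remains stabilizing and converges linearly to the globally optimal gain $K^\star$.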
In §3.3 we highlight some of the challenges in extending the LQR PO theory to other classes of problems, including the role of coerciveness, gradient dominance, smoothness, and the landscape of the optimization problem. The PO perspective is then extended to more elaborate synthesis problems such as linear robust/risk-sensitive control, dynamic games, and nonsmooth H∞ state-feedback synthesis in §4. Through these extensions, we highlight
how variations on the general theme set by the LQR PO theory can be adopted to address
lack of coerciveness or nonsmoothness of the objective in these problems while ensuring
the convergence of the iterates to solutions of interest. This is then followed by examining