Towards a Theoretical
Foundation of Policy
Optimization for Learning
Control Policies
Bin Hu1, Kaiqing Zhang2, Na Li3, Mehran
Mesbahi4, Maryam Fazel5, and Tamer Başar6
1CSL & ECE, University of Illinois at Urbana-Champaign, IL, USA, 61801;
email: binhu7@illinois.edu
2LIDS & CSAIL, Massachusetts Institute of Technology, Cambridge, MA, USA,
02139; ECE & ISR, University of Maryland, College Park, MD, 20740;
kaiqing@{mit,umd}.edu
3SEAS, Harvard University, Cambridge, MA, USA, 02138; nali@seas.harvard.edu
4AA, University of Washington, Seattle, WA, USA, 98195; mesbahi@uw.edu
5ECE, University of Washington, Seattle, WA, USA, 98195; mfazel@uw.edu
6CSL & ECE, University of Illinois at Urbana-Champaign, IL, USA, 61801;
email: basar1@illinois.edu
Xxxx. Xxx. Xxx. Xxx. 2022. AA:1–35
https://doi.org/10.1146/((please add
article doi))
Copyright ©2022 by Annual Reviews.
All rights reserved
Keywords
Policy Optimization, Reinforcement Learning, Feedback Control
Synthesis
Abstract
Gradient-based methods have been widely used for system design and
optimization in diverse application domains. Recently, there has been a
renewed interest in studying theoretical properties of these methods in
the context of control and reinforcement learning. This article surveys
some of the recent developments on policy optimization, a gradient-
based iterative approach for feedback control synthesis, popularized
by successes of reinforcement learning. We take an interdisciplinary
perspective in our exposition that connects control theory, reinforce-
ment learning, and large-scale optimization. We review a number of
recently-developed theoretical results on the optimization landscape,
global convergence, and sample complexity of gradient-based methods
for various continuous control problems such as the linear quadratic
regulator (LQR), $\mathcal{H}_\infty$ control, risk-sensitive control, linear quadratic
Gaussian (LQG) control, and output feedback synthesis. In conjunction
with these optimization results, we also discuss how direct policy op-
timization handles stability and robustness concerns in learning-based
control, two main desiderata in control engineering. We conclude the
survey by pointing out several challenges and opportunities at the in-
tersection of learning and control.
1. Introduction
Reinforcement learning (RL) has recently shown impressive performance on a wide range
of applications, from playing Atari (1, 2) and mastering the game of Go (3, 4), to complex
robotic manipulations (5–7). Key to RL success is the algorithmic framework of policy
optimization (PO), where the policy, mapping observations to actions, is parameterized
and directly optimized upon to improve system-level performance. Mastering Go using PO
(combined with techniques such as efficient tree-search) is particularly encouraging,1 as the
main idea behind the latter is rather straightforward – when learning has been formalized
as minimizing a certain cost as a function of the policy, devise an iterative procedure on the
policy to improve the objective. For example, in the policy gradient (PG) variant of PO,
when learning is represented as minimizing a (differentiable) cost J(K) over the policy K,
the policy is improved upon via a gradient update of the form $K^{n+1} = K^n - \alpha \nabla J(K^n)$,
for some step size $\alpha$ (also referred to as the learning rate) and a data-driven evaluation of the
cost gradient $\nabla J$ at each iterate $n$. In fact, PO provides an umbrella formalism not only
for policy gradient (PG) methods (8), but also for actor-critic (9), trust-region (10), and
proximal PO (11) methods.
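To fix ideas, the following is a minimal sketch of such a PG iteration in which the data-driven
gradient evaluation is carried out with a standard two-point zeroth-order (random-perturbation)
estimator. The black-box cost oracle `cost(K)`, the smoothing radius, and all step sizes are
illustrative assumptions, not a prescription from the methods surveyed here; perturbation-based
estimators of this kind do appear in the model-free analyses discussed later, but with carefully
quantified choices of radius, step size, and sample size.

```python
import numpy as np

def policy_gradient_descent(cost, K0, alpha=1e-3, radius=1e-2,
                            num_samples=50, num_iters=200, seed=0):
    """Iterate K^{n+1} = K^n - alpha * g^n, where g^n is a two-point
    zeroth-order estimate of grad J(K^n) built from evaluations of the
    black-box cost oracle `cost` at randomly perturbed policies."""
    rng = np.random.default_rng(seed)
    K = np.array(K0, dtype=float)
    d = K.size
    for _ in range(num_iters):
        grad_est = np.zeros_like(K)
        for _ in range(num_samples):
            U = rng.standard_normal(K.shape)
            U *= radius / np.linalg.norm(U)      # perturbation of norm `radius`
            # Two-point estimate of the smoothed gradient along direction U.
            grad_est += d * (cost(K + U) - cost(K - U)) / (2 * radius**2) * U
        K -= alpha * grad_est / num_samples
    return K
```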
More generally, PO provides a streamlined approach to learning-based system design.
For example, PO gives a general-purpose paradigm for addressing complex nonlinear dy-
namics with user-specified cost functions: for tasks involving nonlinear dynamics and com-
plex design objectives, one can parameterize the policy as a neural network to be “trained”
using gradient-based methods to obtain a reasonable solution. The PO perspective can
also be adopted for other insufficiently parameterized decision problems such as end-to-end
perception-based control (12–14). In this setting, it might be desired to synthesize a policy
directly on images. As such, one can envision parameterizing a mapping from pixels (ob-
servation) to actions (decisions) as a neural network, and learn the corresponding policy
using the PO formalism. Lastly, we mention the use of scalable gradient-based algorithms
to efficiently train nonlinear policies with many parameters, making PO suitable for high-
dimensional tasks. Computational flexibility and conceptual accessibility of PO have made
it a main workhorse for modern RL.
In control theory, yet another decision-theoretic science, PO has a long history (15–
20); in fact, it has been popular among control practitioners when the system model is
poorly understood or parameterized. Nevertheless, despite its generality and flexibility, the PO
formulation of control synthesis is typically nonconvex and, as such, makes obtaining
strong performance certificates challenging, rendering it unpopular amongst system theorists. Since the
1980s, convex reformulations or relaxations of control problems have become popular due
to the development of convex programming and related global convergence theory (21). It
has been realized that many problems in optimal and robust control can be reformulated
as convex programs, namely, semidefinite programs (SDP) (22–24), or relaxed via sum-
of-squares (SOS) (25, 26), expressed in terms of “certificates,” e.g., matrix inequalities
that represent Lyapunov or dissipativity conditions. However, these formulations have
limitations when there is deviation from the canonical synthesis problems, e.g., when there
are constraints on the structure of the desired control/policy. When convex reformulations
are not available, PO assumes an important role as the main viable option. Examples of
1Go is considered a challenging game to master, partially as the number of its legal board
positions is significantly larger than the number of atoms in the observable universe.
such scenarios include the static output feedback problem (27), structured $\mathcal{H}_\infty$ synthesis (28–
33), and distributed control (34), all of significant importance in applications. The PO
framework is more flexible, as evidenced by the recent advances in deep RL. PO is also more
scalable for high-dimensional problems as it does not generally introduce extra variables in
the optimization problems and enjoys a broader range of optimization methods as compared
with the SDP or SOS formulations. However, as pointed out previously, nonconvexity of
the PO formulation, even for relatively simple linear control problems, has made deriving
theoretical guarantees for direct policy optimization challenging, preventing the acceptance
of PO as a mainstream control design tool.
In this survey, our aim is to revisit these issues from a modern optimization perspective,
and provide a unified account of the recently-developed global convergence/complexity
theory for PO in the context of control synthesis. Recent theoretical results on PO for par-
ticular classes of control synthesis problems, some of which are discussed in this survey, are
not only exciting, but also lead to a new research thrust at the interface of control theory
and machine learning. This survey includes control synthesis related to linear quadratic
regulator theory (35–44), stabilization (45–47), linear robust/risk-sensitive control (48–55),
Markov jump linear quadratic control (56–59), Lur’e system control (60), output feedback
control (61–67), and dynamic filtering (68). Surprisingly, some of these strong global conver-
gence results for PO have been obtained in the absence of convexity in the design objective
and/or the underlying feasible set.
These global convergence guarantees have a number of implications for learning and
control. Firstly, these results facilitate examining other classes of synthesis problems in
the same general framework. As it will be pointed out in this survey, there is an elegant
geometry at play between certificates and controllers in the synthesis process, with imme-
diate algorithmic implications. Secondly, the theoretical developments in PO have created
a renewed interest in the control community to examine synthesis of dynamic systems from
a complementary perspective, that in our view, is more integrated with learning in gen-
eral, and RL in particular. This will complement and strengthen the existing connections
between RL and control (69–71). Lastly, the geometric analysis of PO-inspired algorithms
may shed light on issues in state-of-the-art policy-based RL, critical for deriving guarantees
for any subsequent RL-based synthesis procedure for dynamic systems.
This survey is organized to reflect our perspective – and our excitement – on how PO
(and in particular PG) methods provide a streamlined approach for system synthesis, and
build a bridge between control and learning. First, we provide the PO formulations for
various control problems in §2. Then we delve into the PO convergence theory on the clas-
sic linear quadratic regulator (LQR) problem in §3. As it turns out, a key ingredient for
analyzing LQR PO hinges on coerciveness of the cost function and its gradient dominance
property (see §3.2). These properties can then be utilized to devise gradient updates ensur-
ing stabilizing feedback policies at each iteration, and convergence to the globally optimal
policy. In §3.3 we highlight some of the challenges in extending the LQR PO theory to other
classes of problems, including the role of coerciveness, gradient dominance, smoothness, and
the landscape of the optimization problem. The PO perspective is then extended to more
elaborate synthesis problems such as linear robust/risk-sensitive control, dynamic games,
and nonsmooth $\mathcal{H}_\infty$ state-feedback synthesis in §4. Through these extensions, we highlight
how variations on the general theme set by the LQR PO theory can be adopted to address
lack of coerciveness or nonsmoothness of the objective in these problems while ensuring
the convergence of the iterates to solutions of interest. This is then followed by examining
PO for control synthesis with partial observations, and in particular, PO theory for linear
quadratic Gaussian and output feedback in §5. Our discussion in §5 underscores the im-
portance of the underlying geometry of the policy landscape in developing any PO-based
algorithms. Fundamental connections between PO theory and convex parameterization in
control are discussed in §6. In particular, it is shown how the geometry of policies and cer-
tificates is intertwined through appropriately constructed maps between the nonconvex PO
formulation of the synthesis problems and the (convex) semidefinite programming parame-
terizations. This provides a unified approach for analyzing PO in various control problems
studied on a case-by-case basis so far. In §7, we present current challenges and our outlook
for a comprehensive PO theory for synthesizing dynamical systems that ensures stability,
robustness, safety, and optimality; and underscore the challenges in addressing synthesis
problems in the face of partial observations, nonlinearities, and multiagent settings. §7
also examines further connections between PO theory and machine learning, and highlights
the possibility of integrating model-based (70) and model-free methods to achieve the best
of both worlds, illustrating how the main theme of this survey fits within the big picture of
learning-based control.
2. Policy Optimization for Linear Control: Formulation
Control design can generally be formulated as a policy optimization problem of the form,
$$\min_{K \in \mathcal{K}}\; J(K), \tag{1}$$
where the decision variable $K$ is determined by the controller parameterization (e.g., linear
mapping, polynomials, kernels, neural networks, etc.), the cost function $J(K)$ is some task-
dependent control performance measure (e.g., tracking errors, closed-loop $\mathcal{H}_2$ or $\mathcal{H}_\infty$ norm,
etc.), and the feasible set $\mathcal{K}$ represents the class of controllers of interest, for example, en-
suring closed-loop stability/robustness requirements. Such a PO formulation is general, and
enables flexible policy parameterizations. For example, consider a modern deep RL setting
where one wants to design a policy maximizing some task-dependent reward function for a
complicated nonlinear system $x_{t+1} = f(x_t, u_t, w_t)$, with $(x_t, u_t, w_t)$ being the state, action,
and disturbance triplet. PO has served as the main workhorse for addressing such tasks.
Specifically, one just needs to parameterize the policy $K$ as a (deep) neural network and
then apply iterative PO algorithms such as trust-region policy optimization (TRPO) (10)
and proximal policy optimization (PPO) (11) to learn the optimal weights.
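As a rough illustration of what such an iterative PO algorithm computes, the sketch below
implements a plain score-function (REINFORCE-style) policy-gradient update for a Gaussian policy
whose mean is linear in a feature map; TRPO and PPO add trust-region and clipping machinery on
top of this basic estimator, and a deep RL implementation would replace the feature map and
linear weights by a neural network. The simulator interface `rollout_fn` and the feature map
`feature_fn` are hypothetical placeholders, not part of any specific library.

```python
import numpy as np

def reinforce_update(theta, rollout_fn, feature_fn, sigma=0.1,
                     num_rollouts=16, horizon=100, lr=1e-3, seed=0):
    """One vanilla policy-gradient (score-function) ascent step for the
    Gaussian policy u_t ~ N(theta @ phi(x_t), sigma^2 I).  `rollout_fn(policy,
    horizon)` must return a trajectory as a list of (x_t, u_t, r_t) tuples."""
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(theta)
    for _ in range(num_rollouts):
        policy = lambda x: theta @ feature_fn(x) + sigma * rng.standard_normal(theta.shape[0])
        traj = rollout_fn(policy, horizon)
        ret = sum(r for (_, _, r) in traj)          # total return of the rollout
        for (x, u, _) in traj:
            # grad_theta log N(u; theta @ phi(x), sigma^2 I) = (u - mean) phi(x)^T / sigma^2
            grad += ret * np.outer(u - theta @ feature_fn(x), feature_fn(x)) / sigma**2
    return theta + lr * grad / num_rollouts         # ascend the expected return
```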
The focus of this survey article is the recently-developed (global) convergence, complex-
ity, and landscape theory of PO on classic control tasks including LQR, risk-sensitive/robust
control, and output feedback control. In this section, we formulate these linear control prob-
lems as PO via properly selecting $(K, J, \mathcal{K})$ in Equation 1.
Case I: Linear quadratic regulator (LQR). There are several ways to formulate the
LQR problem. For simplicity, we start by considering a discrete-time linear time-invariant
(LTI) system $x_{t+1} = Ax_t + Bu_t$, where $x_t$ is the state and $u_t$ is the control action. The
design objective is to choose the control actions $\{u_t\}$ to minimize the quadratic cost function
$J := \mathbb{E}_{x_0 \sim \mathcal{D}} \sum_{t=0}^{\infty} \left( x_t^T Q x_t + u_t^T R u_t \right)$, with $Q \succeq 0$ and $R \succ 0$ being pre-selected cost weighting
matrices. In this setting, the only randomness stems from the initial condition $x_0$, which is
sampled from a certain distribution $\mathcal{D}$ with a full-rank covariance matrix. It is well known
that, under some standard stabilizability and detectability assumptions, the optimal cost is
finite and can be achieved by a linear state-feedback controller of the form $u_t = -Kx_t$.
Therefore, we can formulate the LQR problem as a special case of the PO problem as in
Equation 1. Specifically, the decision variable $K$ is simply the feedback gain matrix. Under
a fixed policy $K$, we have $u_t = -Kx_t$ for all $t$, and the LQR cost can be rewritten as
$J(K) = \mathbb{E}_{x_0 \sim \mathcal{D}} \sum_{t=0}^{\infty} x_0^T \left((A-BK)^T\right)^t (Q + K^T R K)(A-BK)^t x_0$, which is a function of
$K$. This cost can also be computed as $J(K) = \mathrm{Tr}(P_K \Sigma_0)$, where $\Sigma_0 = \mathbb{E}[x_0 x_0^T]$ is the (full-
rank) covariance matrix of $x_0$, and $P_K$ is the solution of the following Lyapunov equation:
$$(A-BK)^T P_K (A-BK) + Q + K^T R K = P_K. \tag{2}$$
The above cost $J(K)$ is only well defined when the closed-loop system matrix $(A-BK)$
is Schur stable, i.e., when the spectral radius satisfies $\rho(A-BK) < 1$. Therefore, one can
define the feasible set $\mathcal{K}$ as
$$\mathcal{K} = \{K : \rho(A-BK) < 1\}. \tag{3}$$
Now we can see that the LQR problem is a special case of the PO problem in Equation 1.
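As a numerical illustration of this formulation, the cost $J(K) = \mathrm{Tr}(P_K \Sigma_0)$ can be
evaluated by solving the Lyapunov equation 2 directly. The sketch below uses SciPy's discrete
Lyapunov solver and returns $+\infty$ outside the feasible set in Equation 3, one common way of
encoding the stability constraint; it is only a sanity-check utility, not part of any algorithm
surveyed here.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def lqr_cost(A, B, K, Q, R, Sigma0):
    """Evaluate J(K) = Tr(P_K Sigma0) for the policy u_t = -K x_t, where P_K
    solves (A - BK)^T P_K (A - BK) + Q + K^T R K = P_K (Equation 2).
    Returns +inf when rho(A - BK) >= 1, i.e., when K is outside the set in
    Equation 3."""
    A_cl = A - B @ K
    if np.max(np.abs(np.linalg.eigvals(A_cl))) >= 1.0:
        return np.inf
    P_K = solve_discrete_lyapunov(A_cl.T, Q + K.T @ R @ K)
    return float(np.trace(P_K @ Sigma0))
```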
There are several other slightly different ways to formulate the LQR problem. In an alter-
native formulation, we can add stochastic process noise and consider the LTI system,
$$x_{t+1} = Ax_t + Bu_t + w_t, \tag{4}$$
where the disturbance $\{w_t\}$ is a zero-mean i.i.d. process with a full-rank covariance matrix
$W$. The design objective is then to choose $\{u_t\}$ to minimize the time-average cost
$$J := \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\!\left[\sum_{t=0}^{T-1}\left(x_t^T Q x_t + u_t^T R u_t\right)\right], \tag{5}$$
where $Q \succeq 0$ and $R \succ 0$ are pre-selected weighting matrices. Again, it suffices to param-
eterize the policy as $u_t = -Kx_t$. For a fixed policy $K$, the cost in Equation 5 can be
computed as $J(K) = \mathrm{Tr}(P_K W)$, where $P_K$ is the solution of Equation 2. Again, the cost
is well defined only for $K$ satisfying $\rho(A-BK) < 1$. This setting leads to almost the same
PO formulation as before. Similarly, the discounted LQR problem can be formulated as PO.
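For this time-average formulation, the gradient of $J(K) = \mathrm{Tr}(P_K W)$ admits a closed form
that is standard in the LQR policy-optimization literature,
$\nabla J(K) = 2\big[(R + B^T P_K B)K - B^T P_K A\big]\Sigma_K$, where $\Sigma_K$ is the stationary state
covariance solving $\Sigma_K = (A-BK)\Sigma_K(A-BK)^T + W$ (the initial-state formulation has the same
form with $W$ replaced by $\Sigma_0$). Assuming known system matrices, one exact policy-gradient step
can be sketched as follows; the step size is an illustrative placeholder. Gradient updates of this
form, with suitably chosen step sizes, are precisely the object of the convergence theory reviewed
in §3.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def lqr_exact_gradient_step(A, B, K, Q, R, W, alpha=1e-3):
    """One model-based step K <- K - alpha * grad J(K) for the time-average
    LQR cost J(K) = Tr(P_K W), using
        grad J(K) = 2 [(R + B^T P_K B) K - B^T P_K A] Sigma_K,
    where Sigma_K is the stationary state covariance under u_t = -K x_t."""
    A_cl = A - B @ K
    assert np.max(np.abs(np.linalg.eigvals(A_cl))) < 1.0, "K must be stabilizing"
    P_K = solve_discrete_lyapunov(A_cl.T, Q + K.T @ R @ K)   # Equation 2
    Sigma_K = solve_discrete_lyapunov(A_cl, W)               # Sigma = A_cl Sigma A_cl^T + W
    grad = 2.0 * ((R + B.T @ P_K @ B) @ K - B.T @ P_K @ A) @ Sigma_K
    return K - alpha * grad
```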
Case II: Linear risk-sensitive/robust control. One can enforce risk-sensitivity and
robustness via the formulation of linear exponential quadratic Gaussian (LEQG) (72) or
$\mathcal{H}_\infty$ control (73), respectively. For linear risk-sensitive control, we still consider the LTI
system as in Equation 4 with $w_t \sim \mathcal{N}(0, W)$ being i.i.d. Gaussian noise, and the design
objective is to choose control actions $\{u_t\}$ to minimize the exponentiated quadratic cost
$$J := \limsup_{T \to \infty} \frac{1}{T}\cdot\frac{2}{\beta}\,\log \mathbb{E}\exp\!\left[\frac{\beta}{2}\sum_{t=0}^{T-1}\left(x_t^T Q x_t + u_t^T R u_t\right)\right], \tag{6}$$
where $\beta$ is the parameter quantifying the intensity of risk-sensitivity, and the expectation is
taken over the distribution of $x_0$ and $w_t$ for all $t \geq 0$. One typically chooses $\beta > 0$ to make
the control “risk-averse.” As $\beta \to 0$, the objective in Equation 6 reduces to the LQR cost.
The above LEQG problem is also a special case of the PO problem as in Equation 1. It is
known that the optimal cost can be achieved by a linear state-feedback controller. Again,
one can just parameterize the controller as $u_t = -Kx_t$, where the gain matrix $K$ is the
decision variable.
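For a fixed gain $K$, the LEQG objective in Equation 6 can be approximated by truncating the
horizon and replacing the expectation with a sample average over simulated rollouts; the sketch
below does exactly this, with the initial state set to zero for simplicity. The horizon, rollout
count, and initialization are illustrative assumptions, and the exponentiation makes such a naive
Monte Carlo estimate numerically delicate (indeed, the objective itself is finite only for
sufficiently small $\beta$).

```python
import numpy as np

def leqg_cost_estimate(A, B, K, Q, R, W, beta, horizon=200, num_rollouts=1000, seed=0):
    """Finite-horizon Monte Carlo estimate of the LEQG objective (Equation 6)
    under u_t = -K x_t:  (1/T) * (2/beta) * log E exp((beta/2) * sum_t c_t)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    W_sqrt = np.linalg.cholesky(W)
    exponents = np.empty(num_rollouts)
    for i in range(num_rollouts):
        x = np.zeros(n)
        accumulated_cost = 0.0
        for _ in range(horizon):
            u = -K @ x
            accumulated_cost += x @ Q @ x + u @ R @ u
            x = A @ x + B @ u + W_sqrt @ rng.standard_normal(n)   # w_t ~ N(0, W)
        exponents[i] = 0.5 * beta * accumulated_cost
    # Stable log-mean-exp; the variance can still be large when beta * T is large.
    m = exponents.max()
    log_mean_exp = m + np.log(np.mean(np.exp(exponents - m)))
    return (2.0 / (beta * horizon)) * log_mean_exp
```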