Towards a Theoretical
Foundation of Policy
Optimization for Learning
Control Policies
Bin Hu1, Kaiqing Zhang2, Na Li3, Mehran
Mesbahi4, Maryam Fazel5, and Tamer Başar6
1CSL & ECE, University of Illinois at Urbana-Champaign, IL, USA, 61801;
email: binhu7@illinois.edu
2LIDS & CSAIL, Massachusetts Institute of Technology, Cambridge, MA, USA,
02139; ECE & ISR, University of Maryland, College Park, MD, 20740;
kaiqing@{mit,umd}.edu
3SEAS, Harvard University, Cambridge, MA, USA, 02138; nali@seas.harvard.edu
4AA, University of Washington, Seattle, WA, USA, 98195; mesbahi@uw.edu
5ECE, University of Washington, Seattle, WA, USA, 98195; mfazel@uw.edu
6CSL & ECE, University of Illinois at Urbana-Champaign, IL, USA, 61801;
email: basar1@illinois.edu
Xxxx. Xxx. Xxx. Xxx. 2022. AA:1–35
https://doi.org/10.1146/((please add
article doi))
Copyright ©2022 by Annual Reviews.
All rights reserved
Keywords
Policy Optimization, Reinforcement Learning, Feedback Control
Synthesis
Abstract
Gradient-based methods have been widely used for system design and
optimization in diverse application domains. Recently, there has been a
renewed interest in studying theoretical properties of these methods in
the context of control and reinforcement learning. This article surveys
some of the recent developments on policy optimization, a gradient-
based iterative approach for feedback control synthesis, popularized
by successes of reinforcement learning. We take an interdisciplinary
perspective in our exposition that connects control theory, reinforce-
ment learning, and large-scale optimization. We review a number of
recently-developed theoretical results on the optimization landscape,
global convergence, and sample complexity of gradient-based methods
for various continuous control problems such as the linear quadratic
regulator (LQR), $\mathcal{H}_\infty$ control, risk-sensitive control, linear quadratic
Gaussian (LQG) control, and output feedback synthesis. In conjunction
with these optimization results, we also discuss how direct policy op-
timization handles stability and robustness concerns in learning-based
control, two main desiderata in control engineering. We conclude the
survey by pointing out several challenges and opportunities at the in-
tersection of learning and control.
1. Introduction
Reinforcement learning (RL) has recently shown impressive performance on a wide range
of applications, from playing Atari (1, 2) and mastering the game of Go (3, 4), to complex
robotic manipulations (5–7). Key to RL success is the algorithmic framework of policy
optimization (PO), where the policy, mapping observations to actions, is parameterized
and directly optimized upon to improve system-level performance. Mastering Go using PO
(combined with techniques such as efficient tree-search) is particularly encouraging,1 as the
main idea behind the latter is rather straightforward – when learning has been formalized
as minimizing a certain cost as a function of the policy, devise an iterative procedure on the
policy to improve the objective. For example, in the policy gradient (PG) variant of PO,
when learning is represented as minimizing a (differentiable) cost J(K) over the policy K,
the policy is improved upon via a gradient update of the form $K^{n+1} = K^n - \alpha \nabla J(K^n)$,
for some step size $\alpha$ (also referred to as the learning rate) and a data-driven evaluation of the
cost gradient $\nabla J$ at each iterate $n$. In fact, PO provides an umbrella formalism not only
for policy gradient (PG) methods (8), but also for actor-critic (9), trust-region (10), and
proximal PO (11) methods.
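To fix ideas, the following is a minimal sketch of such a PG iteration in which the data-driven
gradient evaluation is carried out with a standard two-point zeroth-order (random-perturbation)
estimator. The black-box cost oracle `cost(K)`, the smoothing radius, and all step sizes are
illustrative assumptions, not a prescription from the methods surveyed here; perturbation-based
estimators of this kind do appear in the model-free analyses discussed later, but with carefully
quantified choices of radius, step size, and sample size.

```python
import numpy as np

def policy_gradient_descent(cost, K0, alpha=1e-3, radius=1e-2,
                            num_samples=50, num_iters=200, seed=0):
    """Iterate K^{n+1} = K^n - alpha * g^n, where g^n is a two-point
    zeroth-order estimate of grad J(K^n) built from evaluations of the
    black-box cost oracle `cost` at randomly perturbed policies."""
    rng = np.random.default_rng(seed)
    K = np.array(K0, dtype=float)
    d = K.size
    for _ in range(num_iters):
        grad_est = np.zeros_like(K)
        for _ in range(num_samples):
            U = rng.standard_normal(K.shape)
            U *= radius / np.linalg.norm(U)      # perturbation of norm `radius`
            # Two-point estimate of the smoothed gradient along direction U.
            grad_est += d * (cost(K + U) - cost(K - U)) / (2 * radius**2) * U
        K -= alpha * grad_est / num_samples
    return K
```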
More generally, PO provides a streamlined approach to learning-based system design.
For example, PO gives a general-purpose paradigm for addressing complex nonlinear dy-
namics with user-specified cost functions: for tasks involving nonlinear dynamics and com-
plex design objectives, one can parameterize the policy as a neural network to be “trained”
using gradient-based methods to obtain a reasonable solution. The PO perspective can
also be adopted for other insufficiently parameterized decision problems such as end-to-end
perception-based control (12–14). In this setting, it might be desired to synthesize a policy
directly on images. As such, one can envision parameterizing a mapping from pixels (ob-
servation) to actions (decisions) as a neural network, and learn the corresponding policy
using the PO formalism. Lastly, we mention the use of scalable gradient-based algorithms
to efficiently train nonlinear policies with many parameters, making PO suitable for high-
dimensional tasks. Computational flexibility and conceptual accessibility of PO have made
it a main workhorse for modern RL.
In control theory, yet another decision-theoretic science, PO has a long history (15–
20); in fact, it has been popular among control practitioners when the system model is
poorly understood or parameterized. Nevertheless, despite its generality and flexibility, the PO
formulation of control synthesis is typically nonconvex and, as such, makes obtaining
strong performance certificates challenging, rendering it unpopular amongst system theorists. Since the
1980s, convex reformulations or relaxations of control problems have become popular due
to the development of convex programming and related global convergence theory (21). It
has been realized that many problems in optimal and robust control can be reformulated
as convex programs, namely, semidefinite programs (SDP) (22–24), or relaxed via sum-
of-squares (SOS) (25, 26), expressed in terms of “certificates,” e.g., matrix inequalities
that represent Lyapunov or dissipativity conditions. However, these formulations have
limitations when there is deviation from the canonical synthesis problems, e.g., when there
are constraints on the structure of the desired control/policy. When convex reformulations
are not available, PO assumes an important role as the main viable option. Examples of
1Go is considered a challenging game to master, partially as the number of its legal board
positions is significantly larger than the number of atoms in the observable universe.
such scenarios include the static output feedback problem (27), structured $\mathcal{H}_\infty$ synthesis (28–
33), and distributed control (34), all of significant importance in applications. The PO
framework is more flexible, as evidenced by the recent advances in deep RL. PO is also more
scalable for high-dimensional problems as it does not generally introduce extra variables in
the optimization problems and enjoys a broader range of optimization methods as compared
with the SDP or SOS formulations. However, as pointed out previously, nonconvexity of
the PO formulation, even for relatively simple linear control problems, has made deriving
theoretical guarantees for direct policy optimization challenging, preventing the acceptance
of PO as a mainstream control design tool.
In this survey, our aim is to revisit these issues from a modern optimization perspective,
and provide a unified account of the recently-developed global convergence/complexity
theory for PO in the context of control synthesis. Recent theoretical results on PO for par-
ticular classes of control synthesis problems, some of which are discussed in this survey, are
not only exciting, but also lead to a new research thrust at the interface of control theory
and machine learning. This survey includes control synthesis related to linear quadratic
regulator theory (35–44), stabilization (45–47), linear robust/risk-sensitive control (48–55),
Markov jump linear quadratic control (56–59), Lur’e system control (60), output feedback
control (61–67), and dynamic filtering (68). Surprisingly, some of these strong global conver-
gence results for PO have been obtained in the absence of convexity in the design objective
and/or the underlying feasible set.
These global convergence guarantees have a number of implications for learning and
control. Firstly, these results facilitate examining other classes of synthesis problems in
the same general framework. As it will be pointed out in this survey, there is an elegant
geometry at play between certificates and controllers in the synthesis process, with imme-
diate algorithmic implications. Secondly, the theoretical developments in PO have created
a renewed interest in the control community to examine synthesis of dynamic systems from
a complementary perspective, that in our view, is more integrated with learning in gen-
eral, and RL in particular. This will complement and strengthen the existing connections
between RL and control (69–71). Lastly, the geometric analysis of PO-inspired algorithms
may shed light on issues in state-of-the-art policy-based RL, critical for deriving guarantees
for any subsequent RL-based synthesis procedure for dynamic systems.
This survey is organized to reflect our perspective – and our excitement – on how PO
(and in particular PG) methods provide a streamlined approach for system synthesis, and
build a bridge between control and learning. First, we provide the PO formulations for
various control problems in §2. Then we delve into the PO convergence theory on the clas-
sic linear quadratic regulator (LQR) problem in §3. As it turns out, a key ingredient for
analyzing LQR PO hinges on coerciveness of the cost function and its gradient dominance
property (see §3.2). These properties can then be utilized to devise gradient updates ensur-
ing stabilizing feedback policies at each iteration, and convergence to the globally optimal
policy. In §3.3 we highlight some of the challenges in extending the LQR PO theory to other
classes of problems, including the role of coerciveness, gradient dominance, smoothness, and
the landscape of the optimization problem. The PO perspective is then extended to more
elaborate synthesis problems such as linear robust/risk-sensitive control, dynamic games,
and nonsmooth $\mathcal{H}_\infty$ state-feedback synthesis in §4. Through these extensions, we highlight
how variations on the general theme set by the LQR PO theory can be adopted to address
lack of coerciveness or nonsmoothness of the objective in these problems while ensuring
the convergence of the iterates to solutions of interest. This is then followed by examining
PO for control synthesis with partial observations, and in particular, PO theory for linear
quadratic Gaussian and output feedback in §5. Our discussion in §5 underscores the im-
portance of the underlying geometry of the policy landscape in developing any PO-based
algorithms. Fundamental connections between PO theory and convex parameterization in
control are discussed in §6. In particular, it is shown how the geometry of policies and cer-
tificates is intertwined through appropriately constructed maps between the nonconvex PO
formulation of the synthesis problems and the (convex) semidefinite programming parame-
terizations. This provides a unified approach for analyzing PO in various control problems
studied on a case-by-case basis so far. In §7, we present current challenges and our outlook
for a comprehensive PO theory for synthesizing dynamical systems that ensures stability,
robustness, safety, and optimality; and underscore the challenges in addressing synthesis
problems in the face of partial observations, nonlinearities, and multiagent settings. §7
also examines further connections between PO theory and machine learning, and highlights
the possibility of integrating model-based (70) and model-free methods to achieve the best
of both worlds, illustrating how the main theme of this survey fits within the big picture of
learning-based control.
2. Policy Optimization for Linear Control: Formulation
Control design can generally be formulated as a policy optimization problem of the form,
$$\min_{K \in \mathcal{K}}\; J(K), \tag{1}$$
where the decision variable $K$ is determined by the controller parameterization (e.g., linear
mapping, polynomials, kernels, neural networks, etc.), the cost function $J(K)$ is some task-
dependent control performance measure (e.g., tracking errors, closed-loop $\mathcal{H}_2$ or $\mathcal{H}_\infty$ norm,
etc.), and the feasible set $\mathcal{K}$ represents the class of controllers of interest, for example, en-
suring closed-loop stability/robustness requirements. Such a PO formulation is general, and
enables flexible policy parameterizations. For example, consider a modern deep RL setting
where one wants to design a policy maximizing some task-dependent reward function for a
complicated nonlinear system $x_{t+1} = f(x_t, u_t, w_t)$, with $(x_t, u_t, w_t)$ being the state, action,
and disturbance triplet. PO has served as the main workhorse for addressing such tasks.
Specifically, one just needs to parameterize the policy $K$ as a (deep) neural network and
then apply iterative PO algorithms such as trust-region policy optimization (TRPO) (10)
and proximal policy optimization (PPO) (11) to learn the optimal weights.
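As a rough illustration of what such an iterative PO algorithm computes, the sketch below
implements a plain score-function (REINFORCE-style) policy-gradient update for a Gaussian policy
whose mean is linear in a feature map; TRPO and PPO add trust-region and clipping machinery on
top of this basic estimator, and a deep RL implementation would replace the feature map and
linear weights by a neural network. The simulator interface `rollout_fn` and the feature map
`feature_fn` are hypothetical placeholders, not part of any specific library.

```python
import numpy as np

def reinforce_update(theta, rollout_fn, feature_fn, sigma=0.1,
                     num_rollouts=16, horizon=100, lr=1e-3, seed=0):
    """One vanilla policy-gradient (score-function) ascent step for the
    Gaussian policy u_t ~ N(theta @ phi(x_t), sigma^2 I).  `rollout_fn(policy,
    horizon)` must return a trajectory as a list of (x_t, u_t, r_t) tuples."""
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(theta)
    for _ in range(num_rollouts):
        policy = lambda x: theta @ feature_fn(x) + sigma * rng.standard_normal(theta.shape[0])
        traj = rollout_fn(policy, horizon)
        ret = sum(r for (_, _, r) in traj)          # total return of the rollout
        for (x, u, _) in traj:
            # grad_theta log N(u; theta @ phi(x), sigma^2 I) = (u - mean) phi(x)^T / sigma^2
            grad += ret * np.outer(u - theta @ feature_fn(x), feature_fn(x)) / sigma**2
    return theta + lr * grad / num_rollouts         # ascend the expected return
```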
The focus of this survey article is the recently-developed (global) convergence, complex-
ity, and landscape theory of PO on classic control tasks including LQR, risk-sensitive/robust
control, and output feedback control. In this section, we formulate these linear control prob-
lems as PO via properly selecting $(K, J, \mathcal{K})$ in Equation 1.
Case I: Linear quadratic regulator (LQR). There are several ways to formulate the
LQR problem. For simplicity, we start by considering a discrete-time linear time-invariant
(LTI) system $x_{t+1} = Ax_t + Bu_t$, where $x_t$ is the state and $u_t$ is the control action. The
design objective is to choose the control actions $\{u_t\}$ to minimize the quadratic cost function
$J := \mathbb{E}_{x_0 \sim \mathcal{D}} \sum_{t=0}^{\infty} \left( x_t^T Q x_t + u_t^T R u_t \right)$, with $Q \succeq 0$ and $R \succ 0$ being pre-selected cost weighting
matrices. In this setting, the only randomness stems from the initial condition $x_0$, which is
sampled from a certain distribution $\mathcal{D}$ with a full-rank covariance matrix. It is well known
that, under some standard stabilizability and detectability assumptions, the optimal cost is
finite and can be achieved by a linear state-feedback controller of the form $u_t = -Kx_t$.
Therefore, we can formulate the LQR problem as a special case of the PO problem as in
Equation 1. Specifically, the decision variable $K$ is simply the feedback gain matrix. Under
a fixed policy $K$, we have $u_t = -Kx_t$ for all $t$, and the LQR cost can be rewritten as
$J(K) = \mathbb{E}_{x_0 \sim \mathcal{D}} \sum_{t=0}^{\infty} x_0^T \left((A-BK)^T\right)^t (Q + K^T R K)(A-BK)^t x_0$, which is a function of
$K$. This cost can also be computed as $J(K) = \mathrm{Tr}(P_K \Sigma_0)$, where $\Sigma_0 = \mathbb{E}[x_0 x_0^T]$ is the (full-
rank) covariance matrix of $x_0$, and $P_K$ is the solution of the following Lyapunov equation:
$$(A-BK)^T P_K (A-BK) + Q + K^T R K = P_K. \tag{2}$$
The above cost $J(K)$ is only well defined when the closed-loop system matrix $(A-BK)$
is Schur stable, i.e., when the spectral radius satisfies $\rho(A-BK) < 1$. Therefore, one can
define the feasible set $\mathcal{K}$ as
$$\mathcal{K} = \{K : \rho(A-BK) < 1\}. \tag{3}$$
Now we can see that the LQR problem is a special case of the PO problem in Equation 1.
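As a numerical illustration of this formulation, the cost $J(K) = \mathrm{Tr}(P_K \Sigma_0)$ can be
evaluated by solving the Lyapunov equation 2 directly. The sketch below uses SciPy's discrete
Lyapunov solver and returns $+\infty$ outside the feasible set in Equation 3, one common way of
encoding the stability constraint; it is only a sanity-check utility, not part of any algorithm
surveyed here.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def lqr_cost(A, B, K, Q, R, Sigma0):
    """Evaluate J(K) = Tr(P_K Sigma0) for the policy u_t = -K x_t, where P_K
    solves (A - BK)^T P_K (A - BK) + Q + K^T R K = P_K (Equation 2).
    Returns +inf when rho(A - BK) >= 1, i.e., when K is outside the set in
    Equation 3."""
    A_cl = A - B @ K
    if np.max(np.abs(np.linalg.eigvals(A_cl))) >= 1.0:
        return np.inf
    P_K = solve_discrete_lyapunov(A_cl.T, Q + K.T @ R @ K)
    return float(np.trace(P_K @ Sigma0))
```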
There are several other slightly different ways to formulate the LQR problem. In an alter-
native formulation, we can add stochastic process noise and consider the LTI system,
$$x_{t+1} = Ax_t + Bu_t + w_t, \tag{4}$$
where the disturbance $\{w_t\}$ is a zero-mean i.i.d. process with a full-rank covariance matrix
$W$. The design objective is then to choose $\{u_t\}$ to minimize the time-average cost
$$J := \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\!\left[\sum_{t=0}^{T-1}\left(x_t^T Q x_t + u_t^T R u_t\right)\right], \tag{5}$$
where $Q \succeq 0$ and $R \succ 0$ are pre-selected weighting matrices. Again, it suffices to param-
eterize the policy as $u_t = -Kx_t$. For a fixed policy $K$, the cost in Equation 5 can be
computed as $J(K) = \mathrm{Tr}(P_K W)$, where $P_K$ is the solution of Equation 2. Again, the cost
is well defined only for $K$ satisfying $\rho(A-BK) < 1$. This setting leads to almost the same
PO formulation as before. Similarly, the discounted LQR problem can be formulated as PO.
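For this time-average formulation, the gradient of $J(K) = \mathrm{Tr}(P_K W)$ admits a closed form
that is standard in the LQR policy-optimization literature,
$\nabla J(K) = 2\big[(R + B^T P_K B)K - B^T P_K A\big]\Sigma_K$, where $\Sigma_K$ is the stationary state
covariance solving $\Sigma_K = (A-BK)\Sigma_K(A-BK)^T + W$ (the initial-state formulation has the same
form with $W$ replaced by $\Sigma_0$). Assuming known system matrices, one exact policy-gradient step
can be sketched as follows; the step size is an illustrative placeholder. Gradient updates of this
form, with suitably chosen step sizes, are precisely the object of the convergence theory reviewed
in §3.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def lqr_exact_gradient_step(A, B, K, Q, R, W, alpha=1e-3):
    """One model-based step K <- K - alpha * grad J(K) for the time-average
    LQR cost J(K) = Tr(P_K W), using
        grad J(K) = 2 [(R + B^T P_K B) K - B^T P_K A] Sigma_K,
    where Sigma_K is the stationary state covariance under u_t = -K x_t."""
    A_cl = A - B @ K
    assert np.max(np.abs(np.linalg.eigvals(A_cl))) < 1.0, "K must be stabilizing"
    P_K = solve_discrete_lyapunov(A_cl.T, Q + K.T @ R @ K)   # Equation 2
    Sigma_K = solve_discrete_lyapunov(A_cl, W)               # Sigma = A_cl Sigma A_cl^T + W
    grad = 2.0 * ((R + B.T @ P_K @ B) @ K - B.T @ P_K @ A) @ Sigma_K
    return K - alpha * grad
```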
Case II: Linear risk-sensitive/robust control. One can enforce risk-sensitivity and
robustness via the formulation of linear exponential quadratic Gaussian (LEQG) (72) or
$\mathcal{H}_\infty$ control (73), respectively. For linear risk-sensitive control, we still consider the LTI
system as in Equation 4 with $w_t \sim \mathcal{N}(0, W)$ being i.i.d. Gaussian noise, and the design
objective is to choose control actions $\{u_t\}$ to minimize the exponentiated quadratic cost
$$J := \limsup_{T \to \infty} \frac{1}{T}\cdot\frac{2}{\beta}\,\log \mathbb{E}\exp\!\left[\frac{\beta}{2}\sum_{t=0}^{T-1}\left(x_t^T Q x_t + u_t^T R u_t\right)\right], \tag{6}$$
where $\beta$ is the parameter quantifying the intensity of risk-sensitivity, and the expectation is
taken over the distribution of $x_0$ and $w_t$ for all $t \geq 0$. One typically chooses $\beta > 0$ to make
the control “risk-averse.” As $\beta \to 0$, the objective in Equation 6 reduces to the LQR cost.
The above LEQG problem is also a special case of the PO problem as in Equation 1. It is
known that the optimal cost can be achieved by a linear state-feedback controller. Again,
one can just parameterize the controller as $u_t = -Kx_t$, where the gain matrix $K$ is the
decision variable.
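For a fixed gain $K$, the LEQG objective in Equation 6 can be approximated by truncating the
horizon and replacing the expectation with a sample average over simulated rollouts; the sketch
below does exactly this, with the initial state set to zero for simplicity. The horizon, rollout
count, and initialization are illustrative assumptions, and the exponentiation makes such a naive
Monte Carlo estimate numerically delicate (indeed, the objective itself is finite only for
sufficiently small $\beta$).

```python
import numpy as np

def leqg_cost_estimate(A, B, K, Q, R, W, beta, horizon=200, num_rollouts=1000, seed=0):
    """Finite-horizon Monte Carlo estimate of the LEQG objective (Equation 6)
    under u_t = -K x_t:  (1/T) * (2/beta) * log E exp((beta/2) * sum_t c_t)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    W_sqrt = np.linalg.cholesky(W)
    exponents = np.empty(num_rollouts)
    for i in range(num_rollouts):
        x = np.zeros(n)
        accumulated_cost = 0.0
        for _ in range(horizon):
            u = -K @ x
            accumulated_cost += x @ Q @ x + u @ R @ u
            x = A @ x + B @ u + W_sqrt @ rng.standard_normal(n)   # w_t ~ N(0, W)
        exponents[i] = 0.5 * beta * accumulated_cost
    # Stable log-mean-exp; the variance can still be large when beta * T is large.
    m = exponents.max()
    log_mean_exp = m + np.log(np.mean(np.exp(exponents - m)))
    return (2.0 / (beta * horizon)) * log_mean_exp
```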