Model Predictive Control via On-Policy Imitation Learning
Kwangjun Ahn
MIT EECS/LIDS
kjahn@mit.edu
Zakaria Mhammedi
MIT IDSS/LIDS
mhammedi@mit.edu
Horia Mania
MIT EECS/LIDS
hmania@mit.edu
Zhang-Wei Hong
MIT EECS/CSAIL
zwhong@mit.edu
Ali Jadbabaie
MIT CEE/LIDS/IDSS
jadbabai@mit.edu
October 18, 2022
Abstract
In this paper, we leverage the rapid advances in imitation learning, a topic of intense recent
focus in the Reinforcement Learning (RL) literature, to develop new sample complexity results
and performance guarantees for data-driven Model Predictive Control (MPC) for constrained
linear systems. In its simplest form, imitation learning is an approach that tries to learn an
expert policy by querying samples from an expert. Recent approaches to data-driven MPC
have used the simplest form of imitation learning known as behavior cloning to learn controllers
that mimic the performance of MPC by online sampling of the trajectories of the closed-loop
MPC system. Behavior cloning, however, is a method that is known to be data inefficient and to suffer from distribution shift. As an alternative, we develop a variant of the forward training algorithm, which is an on-policy imitation learning method proposed by Ross and Bagnell [39].
Our algorithm uses the structure of constrained linear MPC, and our analysis uses the properties
of the explicit MPC solution to theoretically bound the number of online MPC trajectories
needed to achieve optimal performance. We validate our results through simulations and show
that the forward training algorithm is indeed superior to behavior cloning when applied to MPC.
1 Introduction
Optimization-based control methods such as model predictive control (MPC) have been among the
most versatile techniques in feedback control design for more than 40 years. Such techniques have
been successfully applied to control of dynamic systems in a variety of domains such as autonomous vehicles [30, 13, 7, 38], chemical plants [33], humanoid robots [19], and many others. Nonetheless,
MPC’s versatility comes at a cost. Having to solve optimization problems online makes it difficult
to deploy MPC on high-dimensional systems that have strict latency requirements and limited
computational or energy resources. To mitigate this issue, considerable effort has gone into developing faster, tailored optimization methods for MPC [4, 10, 14, 17, 20, 21, 23, 37, 47].
Instead of following these approaches, we pursue a data-driven methodology. We propose and
study a scheme to collect data interactively from a dynamical system in feedback with an MPC
controller in order to learn an explicit controller that maps states to inputs. Such approaches are known in the reinforcement learning literature as imitation learning [32, 41] and they are well suited for MPC because one can query MPC for the next input at any desired state; all that is
needed is to solve the corresponding optimization problem. Nonetheless, in order to learn controllers
that are guaranteed to stabilize dynamical systems, to satisfy state and action constraints, and to
obtain low cost, we would need to exploit several properties of MPC.
Our goal of obtaining an explicit map from states to inputs that encapsulates an MPC controller
falls under the purview of explicit MPC [2], which aims to pre-compute and store the solutions of the optimization problems that might be encountered at runtime [1].
In general, explicit MPC aims to pre-compute an exact representation of the MPC controller
while we aim to learn a controller that performs as well as MPC with high probability. In the same
vein, Hertneck et al. [12] and Karg and Lucia [16] suggest learning a controller from data. However,
their approaches collect all the trajectory data using MPC before any learning occurs and do not
interact with the dynamics further. This lack of interaction is known to lead to sub-optimal performance: small learning errors cause a controller produced by such a method to visit states whose distribution differs from that of the states produced by MPC during training. In other words, distribution shift leads to error compounding. Our proposed approach avoids this issue by design. Our contributions in this paper can be summarized as follows:
We start by analyzing the imitation learning method known as the forward training algorithm
(Forward) in the setting of control-affine systems [39].
We modify Forward to make it suitable for MPC applications with constraints. Firstly,
Forward learns a different controller for each distinct time step and hence it cannot be applied
straightforwardly to problems with long or infinite horizons. Fortunately, after sufficiently many time steps, the MPC controller applied to time-invariant linear systems becomes equivalent to the classical linear quadratic regulator (LQR) [45]. We exploit this property: we modify Forward to switch to LQR after a number of time steps estimated from data. Secondly, to improve the robustness of our method, we require Forward to imitate robust MPC [27] instead of standard MPC. We refer to our modified method as Forward-switch.
We theoretically guarantee that a controller learned with Forward-switch stabilizes linear
systems and satisfies their constraints as long as a certain amount of data is available. Moreover,
we bound the cost suboptimality of the learned controller, showing that it approaches optimal
performance as more data becomes available. None of the previous works on imitating MPC
included such guarantees. We also provide theoretical sample complexity bounds using state-of-the-art tools from high-dimensional statistics and statistical learning theory.
We validate the efficacy of the modified forward training algorithm on simulated MPC problems,
showing that it surpasses non-interactive approaches.
2 The Forward Training Algorithm for Control
In this section, we present the imitation learning method Forward [39] and bound the distance between the trajectories produced by the learned controller and those produced by the expert when the dynamics are control-affine. In subsequent sections, we specialize our analysis to the case where the expert is an MPC controller applied to constrained linear systems.
Imitation learning aims to learn from demonstrations a controller $\hat\pi$ that imitates the behavior of a target controller $\pi^\star$, called the expert policy or simply the expert in the reinforcement learning literature. Imitation learning is valuable when $\pi^\star$ lacks a closed-form expression or is expensive to query in general. For instance, $\pi^\star$ could be a human performing a task or an MPC controller. More formally, in imitation learning it is assumed that for a state $x$ we can access the input $\pi^\star(x)$. Then, the aim is to use data $\{x_i, \pi^\star(x_i)\}$ to learn a controller $\hat\pi$ such that $\hat\pi(x) \approx \pi^\star(x)$.
In this section, we consider control-affine dynamical systems with constraints:
\[
x_{t+1} = f(x_t) + g(x_t)\,u_t, \qquad x_t \in \mathcal{X},\; u_t \in \mathcal{U}, \tag{2.1}
\]
where $\mathcal{X} \subseteq \mathbb{R}^{d_x}$ is the state space and $\mathcal{U} \subseteq \mathbb{R}^{d_u}$ is the input space. We also find it useful to denote by $\phi_t(x_0, \{u_t\}_{t \ge 0})$ the state $x_t$ that evolves according to $x_{t+1} = f(x_t) + g(x_t)u_t$ and starts at the initial state $x_0$. When the dynamics evolve according to a time-varying feedback controller $\pi = \pi_{0:t-1}$ (i.e., $\pi_0$ is used at time 0, $\pi_1$ at time 1, etc.), we denote the state at time $t$ by $\phi_t(x_0; \pi_{0:t-1})$. If the controller $\pi$ is time-invariant, we simply write $\phi_t(x_0; \pi)$.
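To make the trajectory notation $\phi_t(x_0; \pi_{0:t-1})$ concrete, here is a minimal Python sketch of such a rollout, assuming the dynamics are supplied as generic callables `f` and `g`; the pendulum-like system and feedback gain in the usage example are arbitrary placeholders, not systems studied in this paper.

```python
import numpy as np

def rollout(x0, controllers, f, g):
    """Return [x_0, x_1, ..., x_T] for x_{t+1} = f(x_t) + g(x_t) u_t with
    u_t = pi_t(x_t); the last entry is phi_T(x_0; pi_{0:T-1})."""
    x = np.asarray(x0, dtype=float)
    states = [x]
    for pi_t in controllers:              # pi_t maps the current state to an input
        x = f(x) + g(x) @ pi_t(x)         # control-affine update
        states.append(x)
    return states

# Usage with a pendulum-like system and a time-invariant linear feedback.
f = lambda x: np.array([x[0] + 0.1 * x[1], x[1] - 0.1 * np.sin(x[0])])
g = lambda x: np.array([[0.0], [0.1]])
pi = lambda x: -np.array([[1.0, 1.5]]) @ x
print(rollout(np.array([1.0, 0.0]), [pi] * 5, f, g)[-1])   # phi_5(x_0; pi)
```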
Behavior cloning (BC) is the simplest imitation learning method. It consists of collecting $m$ independent trajectories $\phi_t(x_0^{(i)}; \pi^\star)$ with initial states $x_0^{(1)}, x_0^{(2)}, \ldots, x_0^{(m)}$ sampled randomly from an initial distribution $\mathcal{D}$. Then, BC produces a controller $\hat\pi_{\mathrm{BC}}$ through empirical risk minimization (ERM):
\[
\hat\pi_{\mathrm{BC}} \in \operatorname*{argmin}_{\pi \in \Pi} \sum_{i=1}^{m} \sum_{t=0}^{T-1} \big\|\pi\big(\phi_t(x_0^{(i)}; \pi^\star)\big) - \pi^\star\big(\phi_t(x_0^{(i)}; \pi^\star)\big)\big\|, \tag{Behavior Cloning}
\]
where $\Pi$ is a class of models that map the state space to the input space and $\|\cdot\|$ is any norm (although it could be replaced by a more general loss function). All our results assume that $\pi^\star \in \Pi$.
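A minimal sketch of Behavior Cloning for a finite class $\Pi$ is given below, assuming hypothetical helpers `expert` (standing in for $\pi^\star$), `sample_x0` (sampling from $\mathcal{D}$), and `step` (one step of the dynamics (2.1)); the ERM is solved by exhaustive search over the candidates, which is only sensible for a small finite class.

```python
import numpy as np

def behavior_cloning(expert, candidates, sample_x0, step, m, T):
    """Collect m expert trajectories of length T, then return the candidate
    controller in the finite class Pi with the smallest empirical loss."""
    data = []                                   # pairs (state, expert input)
    for _ in range(m):
        x = sample_x0()                         # x_0 ~ D
        for _ in range(T):
            u = expert(x)                       # query pi_star at the visited state
            data.append((x, u))
            x = step(x, u)                      # x_{t+1} = f(x_t) + g(x_t) u_t
    def bc_loss(pi):                            # objective in (Behavior Cloning)
        return sum(np.linalg.norm(pi(x) - u) for x, u in data)
    return min(candidates, key=bc_loss)         # ERM over the finite class
```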
Distribution Shift: The states collected using the expert $\pi^\star$ have a particular distribution $\mathcal{D}^\star$. BC produces a controller $\hat\pi_{\mathrm{BC}}$ that, when evaluated on samples from $\mathcal{D}^\star$, behaves similarly to the expert $\pi^\star$. However, $\hat\pi_{\mathrm{BC}}$ is not a perfect copy of the expert and hence the states encountered during its deployment have a different distribution than $\mathcal{D}^\star$. This discrepancy is well known and leads to errors compounding in practice [32, 39, 40]. More explicitly, consider an initial state $x_0$ sampled from $\mathcal{D}$. Then, at the first time step $\hat\pi_{\mathrm{BC}}$ and $\pi^\star$ perform similarly since $\hat\pi_{\mathrm{BC}}$ was trained using data sampled from $\mathcal{D}$. However, at the second time step the distributions over states produced by $\hat\pi_{\mathrm{BC}}$ and $\pi^\star$ are different, which means that at the second time step $\hat\pi_{\mathrm{BC}}$ would be evaluated on a distribution different than the one on which it was trained. Hence, with each time step, $\hat\pi_{\mathrm{BC}}$ can take the dynamical system to parts of the state space that are less and less covered by the training trajectories, resulting in error compounding.
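As a toy numerical illustration of this compounding effect (a hypothetical example, not an experiment from this paper), the snippet below compares an expert linear feedback with an imitator whose input is off by a constant $\varepsilon$ at every visited state; on a marginally stable system the gap between the two trajectories keeps growing with the horizon even though the per-step error does not.

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])          # marginally stable dynamics
B = np.array([[0.0], [0.1]])
K = np.array([[0.0, 0.5]])                      # expert feedback gain
eps = 0.05                                      # constant per-step imitation error

x_exp = x_bc = np.array([1.0, 1.0])
for t in range(1, 51):
    x_exp = A @ x_exp + B @ (-K @ x_exp)        # expert closed loop
    x_bc = A @ x_bc + B @ (-K @ x_bc + eps)     # imitator: same gain, small offset
    if t in (10, 25, 50):
        print(t, np.linalg.norm(x_exp - x_bc))  # trajectory gap keeps growing
```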
Since BC does not account for the intrinsic distribution shift in imitation learning, the number of
training trajectories it requires to guarantee a good learned controller can be large (e.g. exponential
in the number of time steps or dimension). The methods for learning an MPC controller due to Hertneck et al. [12] and Karg and Lucia [16] are variants of behavior cloning and hence also suffer from distribution shift. Instead, we use and theoretically analyze the forward training algorithm that was initially used by Ross and Bagnell [39] for the tabular MDP setting.
Forward Training Algorithm: Forward learns a time-varying feedback controller $\hat\pi_{0:T-1}$ in an inductive fashion: during stage 0, it obtains $\hat\pi_0$ from the ERM (2.2). The controller $\hat\pi_0$ is used in the dynamical system just at the initial time step. Then, given the already learned controllers $\hat\pi_0, \ldots, \hat\pi_{t-1}$, to learn the policy $\hat\pi_t$ for time step $t$, Forward samples states $\hat x_t^{(i)} = \phi_t(x_0^{(i)}; \hat\pi_{0:t-1})$, where $x_0^{(1)}, x_0^{(2)}, \ldots$ are sampled i.i.d. from the initial state distribution $\mathcal{D}$. The full procedure is summarized below.
Forward Training Algorithm (Ross and Bagnell [39]). Given $n$ and $T$, a time-varying policy $\hat\pi_{0:T-1}$ is computed iteratively according to the following procedure:

Stage 0: Sample $n$ initial states $x_0^{(1)}, \ldots, x_0^{(n)} \sim \mathcal{D}$ and solve the following ERM:
\[
\hat\pi_0 \in \operatorname*{argmin}_{\pi \in \Pi} \frac{1}{n} \sum_{i=1}^{n} \big\|\pi^\star(x_0^{(i)}) - \pi(x_0^{(i)})\big\|. \tag{2.2}
\]

Stage $t$: Sample fresh initial states $x_0^{(1)}, \ldots, x_0^{(n_t)} \sim \mathcal{D}$, where $n_t := \lceil c\,n\,t \ln^2(t+1) \rceil + n$ and $c := \sum_{t=1}^{\infty} 1/(t \ln^2(t+1))$, then evaluate the states $\hat x_t^{(i)} := \phi_t(x_0^{(i)}; \hat\pi_{0:t-1})$, using the controllers $\hat\pi_{0:t-1}$ learned in previous stages. Then, select $\hat\pi_t$ s.t.
\[
\hat\pi_t \in \operatorname*{argmin}_{\pi \in \Pi} \frac{1}{n_t} \sum_{i=1}^{n_t} \big\|\pi^\star(\hat x_t^{(i)}) - \pi(\hat x_t^{(i)})\big\|. \tag{2.3}
\]
Since $\pi^\star$ is only defined on $\mathcal{X}$ and since $\hat x_t^{(i)}$ could lie outside $\mathcal{X}$, we define $\pi^\star(x) = \pi^\star(\mathrm{proj}_{\mathcal{X}}\, x)$.

Output: The time-varying controller $\hat\pi = \hat\pi_{0:T-1}$.
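For intuition, a compact Python sketch of the procedure above follows, assuming the same hypothetical helpers as in the Behavior Cloning sketch (`expert`, `sample_x0`, `step`) and a finite list of candidate controllers for $\Pi$; for simplicity it keeps the per-stage sample size fixed at $n$ rather than using the schedule $n_t$.

```python
import numpy as np

def forward_training(expert, candidates, sample_x0, step, T, n):
    """Forward training: learn one controller per time step, each trained on the
    states reached at that step by the controllers learned for earlier steps."""
    learned = []                                    # [pi_hat_0, ..., pi_hat_{t-1}]
    for t in range(T):
        # Roll fresh initial states forward with the already-learned controllers,
        # so stage t trains on the state distribution it will face at time t.
        states_t = []
        for _ in range(n):
            x = sample_x0()                         # x_0 ~ D
            for pi_s in learned:
                x = step(x, pi_s(x))                # x_{s+1} = f(x_s) + g(x_s) u_s
            states_t.append(x)
        labels = [expert(x) for x in states_t]      # query pi_star at those states
        def stage_loss(pi):                         # empirical loss as in (2.3)
            return np.mean([np.linalg.norm(pi(x) - u)
                            for x, u in zip(states_t, labels)])
        learned.append(min(candidates, key=stage_loss))
    return learned                                  # time-varying controller
```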
The advantage of this method is that at time step $t$ during deployment the controller $\hat\pi_t$ would be evaluated on the same distribution as that on which it was trained. Other recent works have also proposed inductively learning time-varying policies as a way to avoid distribution shift [28, 44].
2.1 The Sample Complexity of Learning a Controller with Forward
In this section, we discuss our statistical guarantees for the controllers produced by Forward. For simplicity, we consider here the setting without state constraints, i.e., $\mathcal{X} = \mathbb{R}^{d_x}$. Before we can state the main results of this section, we need to make an assumption on the class of controllers $\Pi$ used by Forward.
Assumption 2.1. The model class $\Pi$ is finite and contains $\pi^\star$. Moreover, for any $\pi \in \Pi$ and any $x \in \mathcal{X}$ we have $\pi(x) \in \mathcal{U}$.
The second part of the assumption just guarantees that $\Pi$ enforces the input constraints. Any controller class can be modified to satisfy this property by projecting the outputs of the controllers onto $\mathcal{U}$. We assume that the controller class $\Pi$ is finite for simplicity. In this case, our sample complexity guarantees scale with $\ln|\Pi|$, a quantity that arises through a standard generalization bound. When $\Pi$ is not finite, one can replace $\ln|\Pi|$ by learning-theoretic complexity measures such as the Rademacher complexity. Finally, in the MPC application we care about, the assumption $\pi^\star \in \Pi$ is easily satisfied. In the case of constrained linear dynamics with quadratic costs, the optimal MPC controller is piecewise affine and it can be expressed as a neural network with ReLU activations, as extensively discussed by Karg and Lucia [16, Section I-D] (see also [3]).
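As a minimal sketch of the projection trick mentioned above, assuming a box-shaped input set $\mathcal{U} = [u_{\mathrm{lo}}, u_{\mathrm{hi}}]^{d_u}$, the wrapper below clips a controller's outputs onto $\mathcal{U}$; for a general convex $\mathcal{U}$ one would substitute the corresponding Euclidean projection. The gain in the example is an arbitrary placeholder.

```python
import numpy as np

def project_onto_box(pi, u_lo, u_hi):
    """Wrap a controller so that its outputs always lie in the box [u_lo, u_hi]."""
    return lambda x: np.clip(pi(x), u_lo, u_hi)

# Example: saturate a linear feedback so every input lies in [-1, 1].
K = np.array([[2.0, 0.5]])
pi_sat = project_onto_box(lambda x: -K @ x, -1.0, 1.0)
print(pi_sat(np.array([3.0, 0.0])))   # [-1.] instead of the unconstrained [-6.]
```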
Now we are ready to state the main result of this section. Its proof relies on the empirical
Bernstein inequality [24] and is deferred to Subsection C.1.
Theorem 2.1. Let $T \ge 1$ be the target time step, $\delta \in (0,1)$, $n \ge 2$, and $B_u := \sup_{u \in \mathcal{U}} \|u\|$. Let $\hat x_t = \phi_t(x_0; \hat\pi_{0:t-1})$. When Assumption 2.1 holds, then under an event $\mathcal{E}$ of probability at least $1-\delta$ (over the randomness in the training process), Forward produces a time-varying controller $\hat\pi_{0:T-1}$ that satisfies
\[
\mathbb{E}\big[\|\pi^\star(\hat x_t) - \hat\pi_t(\hat x_t)\| \,\big|\, \hat\pi_{0:T-1}\big] \;\le\; \frac{7 B_u \ln(2T|\Pi|/\delta)}{n_t}, \qquad \forall t \ge 0, \tag{2.4}
\]
where $n_t := \lceil c\,n\,t \ln^2(t+1) \rceil + n$, $c := \sum_{t=1}^{\infty} 1/(t \ln^2(t+1))$, and the expectation in (2.4) is with respect to the randomness in the initial state.
This result guarantees that the time-varying controller learned by Forward is close in expectation
to the optimal controller. The following corollary to Theorem 2.1 bounds this difference with high
probability using Markov’s inequality (see Subsection C.2 for a proof):
Corollary 2.2. Let $\delta \in (0,1)$, $n \ge 2$, and $B_u := \sup_{u \in \mathcal{U}} \|u\|$. Let $\hat x_t = \phi_t(x_0; \hat\pi_{0:t-1})$. When Assumption 2.1 holds, then under the event $\mathcal{E}$ of probability at least $1-\delta$ (over the randomness in the training process), Forward produces a time-varying controller $\hat\pi_{0:T-1}$ such that
\[
\mathbb{P}\left(\forall t \ge 0,\; \|\pi^\star(\hat x_t) - \hat\pi_t(\hat x_t)\| \le \frac{14 B_u \ln(2T|\Pi|/\delta)}{n\,\delta} \;\middle|\; \hat\pi_{0:T-1}\right) \ge 1-\delta, \tag{2.5}
\]
where the probability is with respect to the randomness in the initial state.
Infinite Model Classes: The results presented in this section assume $\Pi$ is finite for simplicity. This assumption can be easily relaxed. For example, to get an analogue of the result of Theorem 2.1 for an infinite class $\Pi$, one can use the empirical Bernstein inequality [24, Lemma 6], which replaces $\ln|\Pi|$ by the logarithm of a "growth function" for the class $\Pi$ (see Appendix A). In the case where $\Pi$ is a class of ReLU neural networks, the latter quantity can be bounded by $\tilde{O}(N_{\mathrm{params}})$, where $N_{\mathrm{params}}$ is the number of parameters of the neural networks in $\Pi$.
Trajectory Guarantees: Theorem 2.1 guarantees that Forward produces a controller $\hat\pi$ that generates inputs to the system that are close to those output by $\pi^\star$. However, this result does not immediately imply that $\hat\pi$ and $\pi^\star$ follow similar trajectories (errors could compound over time, causing $\hat\pi$'s trajectories to diverge from those of $\pi^\star$). Following the main ideas of Tu et al. [46] and Pfrommer et al. [31], one can in fact show guarantees in terms of trajectories when the closed-loop system under $\pi^\star$ is robust in an appropriate sense. See Appendix B for the details.
In subsequent sections, we refine the results presented so far to the case of MPC.
3 Background on the Control of Linear Systems
In this section, we review some background material on the control of linear systems, with a focus
on the linear quadratic regulator (LQR) and on MPC. This section is not intended to be exhaustive;
we only cover the notions needed in this work. We consider the linear dynamical system
\[
x_{t+1} = A x_t + B u_t, \tag{3.1}
\]
which clearly maps to the control-affine setting (2.1) with $f(x_t) = A x_t$ and $g(x_t) = B$.
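As a small companion to this review (a standard computation, not code from the paper), the snippet below computes an infinite-horizon LQR gain for (3.1) by iterating the discrete-time Riccati recursion; the matrices $A$, $B$ and the quadratic weights $Q$, $R$ are arbitrary placeholders, not the systems considered later in the paper.

```python
import numpy as np

def lqr_gain(A, B, Q, R, iters=500):
    """Iterate the discrete-time Riccati recursion and return the feedback gain K
    for the control law u_t = -K x_t."""
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
K = lqr_gain(A, B, Q=np.eye(2), R=np.eye(1))
print(np.abs(np.linalg.eigvals(A - B @ K)))   # closed-loop eigenvalues, all < 1
```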