Model Predictive Control via On-Policy Imitation Learning
Kwangjun Ahn
MIT EECS/LIDS
kjahn@mit.edu
Zakaria Mhammedi
MIT IDSS/LIDS
mhammedi@mit.edu
Horia Mania
MIT EECS/LIDS
hmania@mit.edu
Zhang-Wei Hong
MIT EECS/CSAIL
zwhong@mit.edu
Ali Jadbabaie
MIT CEE/LIDS/IDSS
jadbabai@mit.edu
October 18, 2022
Abstract
In this paper, we leverage the rapid advances in imitation learning, a topic of intense recent
focus in the Reinforcement Learning (RL) literature, to develop new sample complexity results
and performance guarantees for data-driven Model Predictive Control (MPC) for constrained
linear systems. In its simplest form, imitation learning is an approach that tries to learn an
expert policy by querying samples from an expert. Recent approaches to data-driven MPC
have used the simplest form of imitation learning known as behavior cloning to learn controllers
that mimic the performance of MPC by online sampling of the trajectories of the closed-loop
MPC system. Behavior cloning, however, is a method that is known to be data inefficient and to suffer from distribution shift. As an alternative, we develop a variant of the forward training algorithm, which is an on-policy imitation learning method proposed by Ross and Bagnell [39].
Our algorithm uses the structure of constrained linear MPC, and our analysis uses the properties
of the explicit MPC solution to theoretically bound the number of online MPC trajectories
needed to achieve optimal performance. We validate our results through simulations and show
that the forward training algorithm is indeed superior to behavior cloning when applied to MPC.
1 Introduction
Optimization-based control methods such as model predictive control (MPC) have been among the
most versatile techniques in feedback control design for more than 40 years. Such techniques have
been successfully applied to control of dynamic systems in a variety of domains such as autonomous vehicles [30, 13, 7, 38], chemical plants [33], humanoid robots [19], and many others. Nonetheless,
MPC’s versatility comes at a cost. Having to solve optimization problems online makes it difficult
to deploy MPC on high-dimensional systems that have strict latency requirements and limited
computational or energy resources. To mitigate this issue, considerable effort has gone into developing faster, tailored optimization methods for MPC [4, 10, 14, 17, 20, 21, 23, 37, 47].
Instead of following these approaches, we pursue a data-driven methodology. We propose and
study a scheme to collect data interactively from a dynamical system in feedback with an MPC
controller in order to learn an explicit controller that maps states to inputs. Such approaches are known in the reinforcement learning literature as imitation learning [32, 41] and they are well suited for MPC because one can query MPC for the next input at any desired state; all that is
needed is to solve the corresponding optimization problem. Nonetheless, in order to learn controllers
that are guaranteed to stabilize dynamical systems, to satisfy state and action constraints, and to
obtain low cost, we would need to exploit several properties of MPC.
Our goal of obtaining an explicit map from states to inputs that encapsulates an MPC controller
falls under the purview of explicit MPC [2], which aims to pre-compute and store the solutions of the optimization problems that might be encountered at runtime [1].
In general, explicit MPC aims to pre-compute an exact representation of the MPC controller
while we aim to learn a controller that performs as well as MPC with high probability. In the same
vein, Hertneck et al. [12] and Karg and Lucia [16] suggest learning a controller from data. However,
their approaches collect all the trajectory data using MPC before any learning occurs and do not
interact with the dynamics further. This lack of interaction is known to lead to sub-optimal performance: small learning errors cause a controller produced by such a method to visit states whose distribution differs from that of the states produced by MPC during training. In other words, distribution shift leads to error compounding. Our proposed approach avoids this issue by design. Our contributions in this paper can be summarized as follows:
We start by analyzing the imitation learning method known as the forward training algorithm
(Forward) in the setting of control-affine systems [39].
We modify Forward to make it suitable for MPC applications with constraints. Firstly,
Forward learns a different controller for each distinct time step and hence it cannot be applied
straightforwardly to problems with long or infinite horizons. Fortunately, after sufficiently many time steps, the MPC controller applied to time-invariant linear systems becomes equivalent to the classical linear quadratic regulator (LQR) [45]. We exploit this property: we modify Forward to switch to LQR after a number of time steps estimated from data. Secondly, to improve the robustness of our method, we require Forward to imitate robust MPC [27] instead of standard MPC. We refer to our modified method as Forward-switch.
We theoretically guarantee that a controller learned with Forward-switch stabilizes linear
systems and satisfies their constraints as long as a certain amount of data is available. Moreover,
we bound the cost suboptimality of the learned controller, showing that it approaches optimal
performance as more data becomes available. None of the previous works on imitating MPC
included such guarantees. We also provide theoretical sample complexity bounds using state-of-the-art tools from high-dimensional statistics and statistical learning theory.
We validate the efficacy of the modified forward training algorithm on simulated MPC problems,
showing that it surpasses non-interactive approaches.
2 The Forward Training Algorithm for Control
In this section, we present the imitation learning method Forward [39] and bound the distance between the trajectories produced by the learned controller and those produced by the expert when the dynamics are control-affine. In subsequent sections, we specialize our analysis to the case where the expert is an MPC controller applied to constrained linear systems.
Imitation learning aims to learn from demonstrations a controller $\hat\pi$ that imitates the behavior of a target controller $\pi^\star$, called the expert policy or simply the expert in the reinforcement learning literature. Imitation learning is valuable when $\pi^\star$ lacks a closed-form expression or is expensive to query in general. For instance, $\pi^\star$ could be a human performing a task or an MPC controller. More formally, in imitation learning it is assumed that for a state $x$ we can access the input $\pi^\star(x)$. Then, the aim is to use data $\{x_i, \pi^\star(x_i)\}$ to learn a controller $\hat\pi$ such that $\hat\pi(x) \approx \pi^\star(x)$.
In this section, we consider control-affine dynamical systems with constraints:
\[
x_{t+1} = f(x_t) + g(x_t)\,u_t, \qquad x_t \in \mathcal{X},\; u_t \in \mathcal{U}, \tag{2.1}
\]
where $\mathcal{X} \subseteq \mathbb{R}^{d_x}$ is the state space and $\mathcal{U} \subseteq \mathbb{R}^{d_u}$ is the input space. We also find it useful to denote by $\phi_t(x_0, \{u_t\}_{t \ge 0})$ the state $x_t$ that evolves according to $x_{t+1} = f(x_t) + g(x_t)u_t$ and starts at the initial state $x_0$. When the dynamics evolve according to a time-varying feedback controller $\pi = \pi_{0:t-1}$ (i.e., $\pi_0$ is used at time 0, $\pi_1$ at time 1, etc.), we denote the state at time $t$ by $\phi_t(x_0; \pi_{0:t-1})$. If the controller $\pi$ is time-invariant, we simply write $\phi_t(x_0; \pi)$.
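To make the trajectory notation $\phi_t(x_0; \pi_{0:t-1})$ concrete, here is a minimal Python sketch of such a rollout, assuming the dynamics are supplied as generic callables `f` and `g`; the pendulum-like system and feedback gain in the usage example are arbitrary placeholders, not systems studied in this paper.

```python
import numpy as np

def rollout(x0, controllers, f, g):
    """Return [x_0, x_1, ..., x_T] for x_{t+1} = f(x_t) + g(x_t) u_t with
    u_t = pi_t(x_t); the last entry is phi_T(x_0; pi_{0:T-1})."""
    x = np.asarray(x0, dtype=float)
    states = [x]
    for pi_t in controllers:              # pi_t maps the current state to an input
        x = f(x) + g(x) @ pi_t(x)         # control-affine update
        states.append(x)
    return states

# Usage with a pendulum-like system and a time-invariant linear feedback.
f = lambda x: np.array([x[0] + 0.1 * x[1], x[1] - 0.1 * np.sin(x[0])])
g = lambda x: np.array([[0.0], [0.1]])
pi = lambda x: -np.array([[1.0, 1.5]]) @ x
print(rollout(np.array([1.0, 0.0]), [pi] * 5, f, g)[-1])   # phi_5(x_0; pi)
```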
Behavior cloning (BC) is the simplest imitation learning method. It consists of collecting $m$ independent trajectories $\phi_t(x_0^{(i)}; \pi^\star)$ with initial states $x_0^{(1)}, x_0^{(2)}, \ldots, x_0^{(m)}$ sampled randomly from an initial distribution $\mathcal{D}$. Then, BC produces a controller $\hat\pi_{\mathrm{BC}}$ through empirical risk minimization (ERM):
\[
\hat\pi_{\mathrm{BC}} \in \operatorname*{argmin}_{\pi \in \Pi} \sum_{i=1}^{m} \sum_{t=0}^{T-1} \big\|\pi\big(\phi_t(x_0^{(i)}; \pi^\star)\big) - \pi^\star\big(\phi_t(x_0^{(i)}; \pi^\star)\big)\big\|, \tag{Behavior Cloning}
\]
where $\Pi$ is a class of models that map the state space to the input space and $\|\cdot\|$ is any norm (although it could be replaced by a more general loss function). All our results assume that $\pi^\star \in \Pi$.
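A minimal sketch of Behavior Cloning for a finite class $\Pi$ is given below, assuming hypothetical helpers `expert` (standing in for $\pi^\star$), `sample_x0` (sampling from $\mathcal{D}$), and `step` (one step of the dynamics (2.1)); the ERM is solved by exhaustive search over the candidates, which is only sensible for a small finite class.

```python
import numpy as np

def behavior_cloning(expert, candidates, sample_x0, step, m, T):
    """Collect m expert trajectories of length T, then return the candidate
    controller in the finite class Pi with the smallest empirical loss."""
    data = []                                   # pairs (state, expert input)
    for _ in range(m):
        x = sample_x0()                         # x_0 ~ D
        for _ in range(T):
            u = expert(x)                       # query pi_star at the visited state
            data.append((x, u))
            x = step(x, u)                      # x_{t+1} = f(x_t) + g(x_t) u_t
    def bc_loss(pi):                            # objective in (Behavior Cloning)
        return sum(np.linalg.norm(pi(x) - u) for x, u in data)
    return min(candidates, key=bc_loss)         # ERM over the finite class
```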
Distribution Shift: The states collected using the expert $\pi^\star$ have a particular distribution $\mathcal{D}^\star$. BC produces a controller $\hat\pi_{\mathrm{BC}}$ that, when evaluated on samples from $\mathcal{D}^\star$, behaves similarly to the expert $\pi^\star$. However, $\hat\pi_{\mathrm{BC}}$ is not a perfect copy of the expert and hence the states encountered during its deployment have a different distribution than $\mathcal{D}^\star$. This discrepancy is well known and leads to errors compounding in practice [32, 39, 40]. More explicitly, consider an initial state $x_0$ sampled from $\mathcal{D}$. Then, at the first time step $\hat\pi_{\mathrm{BC}}$ and $\pi^\star$ perform similarly since $\hat\pi_{\mathrm{BC}}$ was trained using data sampled from $\mathcal{D}$. However, at the second time step the distributions over states produced by $\hat\pi_{\mathrm{BC}}$ and $\pi^\star$ are different, which means that at the second time step $\hat\pi_{\mathrm{BC}}$ would be evaluated on a distribution different than the one on which it was trained. Hence, with each time step, $\hat\pi_{\mathrm{BC}}$ can take the dynamical system to parts of the state space that are less and less covered by the training trajectories, resulting in error compounding.
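As a toy numerical illustration of this compounding effect (a hypothetical example, not an experiment from this paper), the snippet below compares an expert linear feedback with an imitator whose input is off by a constant $\varepsilon$ at every visited state; on a marginally stable system the gap between the two trajectories keeps growing with the horizon even though the per-step error does not.

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])          # marginally stable dynamics
B = np.array([[0.0], [0.1]])
K = np.array([[0.0, 0.5]])                      # expert feedback gain
eps = 0.05                                      # constant per-step imitation error

x_exp = x_bc = np.array([1.0, 1.0])
for t in range(1, 51):
    x_exp = A @ x_exp + B @ (-K @ x_exp)        # expert closed loop
    x_bc = A @ x_bc + B @ (-K @ x_bc + eps)     # imitator: same gain, small offset
    if t in (10, 25, 50):
        print(t, np.linalg.norm(x_exp - x_bc))  # trajectory gap keeps growing
```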
Since BC does not account for the intrinsic distribution shift in imitation learning, the number of
training trajectories it requires to guarantee a good learned controller can be large (e.g. exponential
in the number of time steps or dimension). The methods for learning an MPC controller due to Hertneck et al. [12] and Karg and Lucia [16] are variants of behavior cloning and hence also suffer from distribution shift. Instead, we use and theoretically analyze the forward training algorithm that was initially used by Ross and Bagnell [39] for the tabular MDP setting.
Forward Training Algorithm: Forward learns a time-varying feedback controller $\hat\pi_{0:T-1}$ in an inductive fashion: during stage 0, it obtains $\hat\pi_0$ from the ERM (2.2). The controller $\hat\pi_0$ is used in the dynamical system just at the initial time step. Then, given the already learned controllers $\hat\pi_0, \ldots, \hat\pi_{t-1}$, to learn the policy $\hat\pi_t$ for time step $t$, Forward samples states $\hat x_t^{(i)} = \phi_t(x_0^{(i)}; \hat\pi_{0:t-1})$, where $x_0^{(1)}, x_0^{(2)}, \ldots$ are sampled i.i.d. from the initial state distribution $\mathcal{D}$. The full procedure is summarized below.
Forward Training Algorithm (Ross and Bagnell [39]). Given $n$ and $T$, a time-varying policy $\hat\pi_{0:T-1}$ is computed iteratively according to the following procedure:

Stage 0: Sample $n$ initial states $x_0^{(1)}, \ldots, x_0^{(n)} \sim \mathcal{D}$ and solve the following ERM:
\[
\hat\pi_0 \in \operatorname*{argmin}_{\pi \in \Pi} \frac{1}{n} \sum_{i=1}^{n} \big\|\pi^\star(x_0^{(i)}) - \pi(x_0^{(i)})\big\|. \tag{2.2}
\]

Stage $t$: Sample fresh initial states $x_0^{(1)}, \ldots, x_0^{(n_t)} \sim \mathcal{D}$, where $n_t := \lceil c\,n\,t \ln^2(t+1) \rceil + n$ and $c := \sum_{t=1}^{\infty} 1/(t \ln^2(t+1))$, then evaluate the states $\hat x_t^{(i)} := \phi_t(x_0^{(i)}; \hat\pi_{0:t-1})$, using the controllers $\hat\pi_{0:t-1}$ learned in previous stages. Then, select $\hat\pi_t$ s.t.
\[
\hat\pi_t \in \operatorname*{argmin}_{\pi \in \Pi} \frac{1}{n_t} \sum_{i=1}^{n_t} \big\|\pi^\star(\hat x_t^{(i)}) - \pi(\hat x_t^{(i)})\big\|. \tag{2.3}
\]
Since $\pi^\star$ is only defined on $\mathcal{X}$ and since $\hat x_t^{(i)}$ could lie outside $\mathcal{X}$, we define $\pi^\star(x) = \pi^\star(\mathrm{proj}_{\mathcal{X}}\, x)$.

Output: The time-varying controller $\hat\pi = \hat\pi_{0:T-1}$.
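For intuition, a compact Python sketch of the procedure above follows, assuming the same hypothetical helpers as in the Behavior Cloning sketch (`expert`, `sample_x0`, `step`) and a finite list of candidate controllers for $\Pi$; for simplicity it keeps the per-stage sample size fixed at $n$ rather than using the schedule $n_t$.

```python
import numpy as np

def forward_training(expert, candidates, sample_x0, step, T, n):
    """Forward training: learn one controller per time step, each trained on the
    states reached at that step by the controllers learned for earlier steps."""
    learned = []                                    # [pi_hat_0, ..., pi_hat_{t-1}]
    for t in range(T):
        # Roll fresh initial states forward with the already-learned controllers,
        # so stage t trains on the state distribution it will face at time t.
        states_t = []
        for _ in range(n):
            x = sample_x0()                         # x_0 ~ D
            for pi_s in learned:
                x = step(x, pi_s(x))                # x_{s+1} = f(x_s) + g(x_s) u_s
            states_t.append(x)
        labels = [expert(x) for x in states_t]      # query pi_star at those states
        def stage_loss(pi):                         # empirical loss as in (2.3)
            return np.mean([np.linalg.norm(pi(x) - u)
                            for x, u in zip(states_t, labels)])
        learned.append(min(candidates, key=stage_loss))
    return learned                                  # time-varying controller
```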
The advantage of this method is that at time step $t$ during deployment the controller $\hat\pi_t$ would be evaluated on the same distribution as that on which it was trained. Other recent works have also proposed inductively learning time-varying policies as a way to avoid distribution shift [28, 44].
2.1 The Sample Complexity of Learning a Controller with Forward
In this section, we discuss our statistical guarantees for the controllers produced by Forward. For simplicity, we consider here the setting without state constraints, i.e., $\mathcal{X} = \mathbb{R}^{d_x}$. Before we can state the main results of this section, we need to make an assumption on the class of controllers $\Pi$ used by Forward.
Assumption 2.1. The model class $\Pi$ is finite and contains $\pi^\star$. Moreover, for any $\pi \in \Pi$ and any $x \in \mathcal{X}$ we have $\pi(x) \in \mathcal{U}$.
The second part of the assumption just guarantees that $\Pi$ enforces the input constraints. Any controller class can be modified to satisfy this property by projecting the outputs of the controllers onto $\mathcal{U}$. We assume that the controller class $\Pi$ is finite for simplicity. In this case, our sample complexity guarantees scale with $\ln|\Pi|$, a quantity that arises through a standard generalization bound. When $\Pi$ is not finite, one can replace $\ln|\Pi|$ by learning-theoretic complexity measures such as the Rademacher complexity. Finally, in the MPC application we care about, the assumption $\pi^\star \in \Pi$ is easily satisfied. In the case of constrained linear dynamics with quadratic costs, the optimal MPC controller is piecewise affine and it can be expressed as a neural network with ReLU activations, as extensively discussed by Karg and Lucia [16, Section I-D] (see also [3]).
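As a minimal sketch of the projection trick mentioned above, assuming a box-shaped input set $\mathcal{U} = [u_{\mathrm{lo}}, u_{\mathrm{hi}}]^{d_u}$, the wrapper below clips a controller's outputs onto $\mathcal{U}$; for a general convex $\mathcal{U}$ one would substitute the corresponding Euclidean projection. The gain in the example is an arbitrary placeholder.

```python
import numpy as np

def project_onto_box(pi, u_lo, u_hi):
    """Wrap a controller so that its outputs always lie in the box [u_lo, u_hi]."""
    return lambda x: np.clip(pi(x), u_lo, u_hi)

# Example: saturate a linear feedback so every input lies in [-1, 1].
K = np.array([[2.0, 0.5]])
pi_sat = project_onto_box(lambda x: -K @ x, -1.0, 1.0)
print(pi_sat(np.array([3.0, 0.0])))   # [-1.] instead of the unconstrained [-6.]
```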
Now we are ready to state the main result of this section. Its proof relies on the empirical
Bernstein inequality [24] and is deferred to Subsection C.1.
Theorem 2.1. Let $T \ge 1$ be the target time step, $\delta \in (0,1)$, $n \ge 2$, and $B_u := \sup_{u \in \mathcal{U}} \|u\|$. Let $\hat x_t = \phi_t(x_0; \hat\pi_{0:t-1})$. When Assumption 2.1 holds, then under an event $\mathcal{E}$ of probability at least $1-\delta$ (over the randomness in the training process), Forward produces a time-varying controller $\hat\pi_{0:T-1}$ that satisfies
\[
\mathbb{E}\big[\|\pi^\star(\hat x_t) - \hat\pi_t(\hat x_t)\| \,\big|\, \hat\pi_{0:T-1}\big] \;\le\; \frac{7 B_u \ln(2T|\Pi|/\delta)}{n_t}, \qquad \forall t \ge 0, \tag{2.4}
\]
where $n_t := \lceil c\,n\,t \ln^2(t+1) \rceil + n$, $c := \sum_{t=1}^{\infty} 1/(t \ln^2(t+1))$, and the expectation in (2.4) is with respect to the randomness in the initial state.
This result guarantees that the time-varying controller learned by Forward is close in expectation
to the optimal controller. The following corollary to Theorem 2.1 bounds this difference with high
probability using Markov’s inequality (see Subsection C.2 for a proof):
Corollary 2.2. Let $\delta \in (0,1)$, $n \ge 2$, and $B_u := \sup_{u \in \mathcal{U}} \|u\|$. Let $\hat x_t = \phi_t(x_0; \hat\pi_{0:t-1})$. When Assumption 2.1 holds, then under the event $\mathcal{E}$ of probability at least $1-\delta$ (over the randomness in the training process), Forward produces a time-varying controller $\hat\pi_{0:T-1}$ such that
\[
\mathbb{P}\left(\forall t \ge 0,\; \|\pi^\star(\hat x_t) - \hat\pi_t(\hat x_t)\| \le \frac{14 B_u \ln(2T|\Pi|/\delta)}{n\,\delta} \;\middle|\; \hat\pi_{0:T-1}\right) \ge 1-\delta, \tag{2.5}
\]
where the probability is with respect to the randomness in the initial state.
Infinite Model Classes: The results presented in this section assume $\Pi$ is finite for simplicity. This assumption can be easily relaxed. For example, to get an analogue of the result of Theorem 2.1 for an infinite class $\Pi$, one can use the empirical Bernstein inequality [24, Lemma 6], which replaces $\ln|\Pi|$ by the logarithm of a "growth function" for the class $\Pi$ (see Appendix A). In the case where $\Pi$ is a class of ReLU neural networks, the latter quantity can be bounded by $\tilde{O}(N_{\mathrm{params}})$, where $N_{\mathrm{params}}$ is the number of parameters of the neural networks in $\Pi$.
Trajectory Guarantees: Theorem 2.1 guarantees that Forward produces a controller $\hat\pi$ that generates inputs to the system that are close to those output by $\pi^\star$. However, this result does not immediately imply that $\hat\pi$ and $\pi^\star$ follow similar trajectories (errors could compound over time, causing $\hat\pi$'s trajectories to diverge from those of $\pi^\star$). Following the main ideas of Tu et al. [46] and Pfrommer et al. [31], one can in fact show guarantees in terms of trajectories when the closed-loop system under $\pi^\star$ is robust in an appropriate sense. See Appendix B for the details.
In subsequent sections, we refine the results presented so far to the case of MPC.
3 Background on the Control of Linear Systems
In this section, we review some background material on the control of linear systems, with a focus
on the linear quadratic regulator (LQR) and on MPC. This section is not intended to be exhaustive;
we only cover the notions needed in this work. We consider the linear dynamical system
\[
x_{t+1} = A x_t + B u_t, \tag{3.1}
\]
which clearly maps to the control-affine setting (2.1) with $f(x_t) = A x_t$ and $g(x_t) = B$.
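As a small companion to this review (a standard computation, not code from the paper), the snippet below computes an infinite-horizon LQR gain for (3.1) by iterating the discrete-time Riccati recursion; the matrices $A$, $B$ and the quadratic weights $Q$, $R$ are arbitrary placeholders, not the systems considered later in the paper.

```python
import numpy as np

def lqr_gain(A, B, Q, R, iters=500):
    """Iterate the discrete-time Riccati recursion and return the feedback gain K
    for the control law u_t = -K x_t."""
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
K = lqr_gain(A, B, Q=np.eye(2), R=np.eye(1))
print(np.abs(np.linalg.eigvals(A - B @ K)))   # closed-loop eigenvalues, all < 1
```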