Online Adaptive Policy Selection in Time-Varying
Systems: No-Regret via Contractive Perturbations
Yiheng Lin, James A. Preiss, Emile Anand, Yingying Li, Yisong Yue, Adam Wierman
Department of Computing + Mathematical Sciences
California Institute of Technology
Pasadena, California, USA
{yihengl, japreiss, eanand, yingli2, yyue, adamw}@caltech.edu
Abstract
We study online adaptive policy selection in systems with time-varying costs and
dynamics. We develop the Gradient-based Adaptive Policy Selection (GAPS)
algorithm together with a general analytical framework for online policy selection
via online optimization. Under our proposed notion of contractive policy classes,
we show that GAPS approximates the behavior of an ideal online gradient descent
algorithm on the policy parameters while requiring less information and computation. When convexity holds, our algorithm is the first to achieve optimal policy
regret. When convexity does not hold, we provide the first local regret bound for
online policy selection. Our numerical experiments show that GAPS can adapt to
changing environments more quickly than existing benchmarks.
1 Introduction
We study the problem of online adaptive policy selection for nonlinear time-varying discrete-time dynamical systems. The dynamics are given by $x_{t+1} = g_t(x_t, u_t)$, where $x_t$ is the state and $u_t$ is the control input at time $t$. The policy class is a time-varying mapping $\pi_t$ from the state $x_t$ and a policy parameter $\theta_t$ to a control input $u_t$. At every time step $t$, the online policy incurs a stage cost $c_t = f_t(x_t, u_t)$ that depends on the current state and control input. The goal of policy selection is to pick the parameter $\theta_t$ online to minimize the total stage costs over a finite horizon $T$.
Online adaptive policy selection and general online control have received significant attention recently [1–8] because many control tasks require running the policy on a single trajectory, as opposed to restarting the episode to evaluate a different policy from the same initial state. Adaptivity is also important when the dynamics and cost functions are time-varying. For example, in robotics, time-varying dynamics arise when we control a drone under changing wind conditions [9].
In this paper, we are interested in developing a unified framework that can leverage a broad suite of theoretical results from online optimization and efficiently translate them to online policy selection, where efficiency includes both preserving the tightness of the guarantees and computational considerations. A central issue is that, in online policy selection, the stage cost $c_t$ depends on all previously selected parameters $(\theta_0, \dots, \theta_{t-1})$ via the state $x_t$. Many prior works along this direction have addressed this issue by finite-memory reductions. This approach leads to the first regret bounds on online policy selection, but the bounds are not tight, the computational cost can be large, and the dynamics and policy classes studied are restrictive [1, 3, 6–8].
Contributions. We propose and analyze the algorithm Gradient-based Adaptive Policy Selection (GAPS, Algorithm 1) to address three limitations of existing results on online policy selection. First, under the assumption that $c_t$ is a convex function of $(\theta_0, \dots, \theta_t)$, prior work left a $\log T$ regret gap between OCO and online policy selection. We close this gap by showing that GAPS achieves the
optimal regret of $O(\sqrt{T})$ (Theorem 3.3). Second, many previous approaches require oracle access to the dynamics/costs and expensive resimulation from imaginary previous states. In contrast, GAPS only requires partial derivatives of the dynamics and costs along the visited trajectory, and computes $O(\log T)$ matrix multiplications at each step. Third, the application of previous results is limited to specific policy classes and systems because they require $c_t$ to be convex in $(\theta_0, \dots, \theta_t)$. We address this limitation by showing the first local regret bound for online policy selection when convexity does not hold. Specifically, GAPS achieves a local regret of $O(\sqrt{(1+V)T})$, where $V$ is a measure of how much $(g_t, f_t, \pi_t)$ changes over the entire horizon.
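To make the computational picture concrete, the following numpy sketch illustrates a truncated-memory, chain-rule gradient update of the kind described above: it keeps a short buffer of Jacobians evaluated along the visited trajectory and takes a projected gradient step using only the last $B$ steps of memory. The interface, buffer length B, step size eta, and projection are illustrative assumptions; this is a sketch, not the paper's Algorithm 1.

```python
import numpy as np

class TruncatedGradientLearner:
    """Illustrative sketch of a truncated-memory policy-gradient update (not the
    paper's Algorithm 1): accumulate the chain-rule gradient of the current stage
    cost with respect to the last B policy parameters, using only Jacobians
    evaluated along the visited trajectory.

    Expected Jacobian shapes (n = dim x, m = dim u, d = dim theta):
      dfdx (1, n), dfdu (1, m), dpidx (m, n), dpidtheta (m, d),
      dgdx (n, n), dgdu (n, m).
    """

    def __init__(self, theta0, eta, B, project):
        self.theta = np.asarray(theta0, dtype=float)
        self.eta = eta          # step size
        self.B = B              # buffer length (O(log T) in the paper's analysis)
        self.project = project  # projection onto the parameter set Theta
        self.buffer = []        # per-step (d x_{t+1}/d x_t, d x_{t+1}/d theta_t)

    def update(self, dfdx, dfdu, dpidx, dpidtheta, dgdx, dgdu):
        # Sensitivity of the current cost to the current state (through x and u).
        dc_dx = dfdx + dfdu @ dpidx
        # Direct dependence of c_t on theta_t through u_t = pi_t(x_t, theta_t).
        grad = (dfdu @ dpidtheta).ravel()
        # Dependence on the last B past parameters through the visited states.
        prod = np.eye(dgdx.shape[0])
        for Ax, Atheta in reversed(self.buffer[-self.B:]):
            grad += (dc_dx @ prod @ Atheta).ravel()
            prod = prod @ Ax
        # Store closed-loop sensitivities of x_{t+1} for future steps.
        self.buffer.append((dgdx + dgdu @ dpidx, dgdu @ dpidtheta))
        # Projected gradient step on the policy parameter.
        self.theta = self.project(self.theta - self.eta * grad)
        return self.theta
```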
To derive these performance guarantees, we develop a novel proof framework based on a general exponentially decaying, or "contractive", perturbation property (Definition 2.6) on the policy-induced closed-loop dynamics. This generalizes a key property of disturbance-action controllers [e.g., 1, 8] and includes other important policy classes such as model predictive control (MPC) [e.g., 10] and linear feedback controllers [e.g., 11]. Under this property, we prove an approximation error bound (Theorem 3.2), which shows that GAPS can mimic the update of an ideal online gradient descent (OGD) algorithm [12] that has oracle knowledge of how the current policy parameter $\theta_t$ would have performed if used exclusively over the whole trajectory. This error bound bridges online policy selection and online optimization, which means regret guarantees on OGD for online optimization can be transferred to GAPS for online policy selection.
In numerical experiments, we demonstrate that GAPS adapts faster than an existing follow-the-leader-type baseline in MPC with imperfect disturbance predictions, and outperforms a strong optimal control baseline in a nonlinear system with non-i.i.d. disturbances. We include the source code in the supplementary material.
Related Work. Our work is related to online control and adaptive-learning-based control [2, 13–18], especially online control with adversarial disturbances and regret guarantees [1, 6–8, 19, 20]. For example, there is a rich literature on policy regret bounds for time-invariant dynamics [1, 6, 8, 14, 19, 21]. There is also a growing interest in algorithms for time-varying systems with small adaptive regret [7, 20], dynamic regret [22–24], and competitive ratio [25–28]. Many prior works study a specific policy class called the disturbance-action controller (DAC) [1, 3, 6–8]. When applied to linear dynamics $g_t$ with convex cost functions $f_t$, DAC renders the stage cost $c_t$ a convex function in past policy parameters $(\theta_0, \dots, \theta_t)$. Our work contributes to the literature by proposing a general contractive perturbation property that includes DAC as a special case, and showing local regret bounds that do not require $c_t$ to be convex in $(\theta_0, \dots, \theta_t)$. A recent work also handles nonconvex $c_t$, but it studies an episodic setting and requires $c_t$ to be "nearly convex", which holds under its policy class [29].
In addition to online control, this work is also related to online learning/optimization [12, 30, 31], especially online optimization with memory and/or switching costs, where the cost at each time step depends on past decisions. Specifically, our online adaptive policy selection problem is related to online optimization with memory [28, 32–38]. Our analysis for GAPS provides insight on how to handle indefinite memory when the impact of a past decision decays exponentially with time.
Our contractive perturbation property and the analytical framework based on this property are closely related to prior works on discrete-time incremental stability and contraction theory in nonlinear systems [39–45], as well as works that leverage such properties to derive guarantees for (online) controllers [46–48]. In complicated systems, it may be hard to design policies that provably satisfy these properties. This motivates some recent works to study neural-based approaches that can learn a controller together with its certificate for contraction properties simultaneously [49, 50]. Our work contributes to this field by showing that, when the system satisfies the contractive perturbation property, one can leverage this property to bridge online policy selection with online optimization.
Notation. We use $[t_1:t_2]$ to denote the sequence $(t_1, \dots, t_2)$, $a_{t_1:t_2}$ to denote $(a_{t_1}, a_{t_1+1}, \dots, a_{t_2})$ for $t_1 \le t_2$, and $a^{\times \tau}$ for $(a, \dots, a)$ with $a$ repeated $\tau \ge 0$ times. We define $q(x, Q) = x^\top Q x$. Symbols $\mathbf{1}$ and $\mathbf{0}$ denote the all-one and all-zero vectors/matrices respectively, with dimension implied by context. The Euclidean ball with center $0$ and radius $R$ in $\mathbb{R}^n$ is denoted by $B_n(0, R)$. We let $\|\cdot\|$ denote the (induced) Euclidean norm for vectors (matrices). The diameter of a set $\Theta$ is $\mathrm{diam}(\Theta) := \sup_{x, y \in \Theta} \|x - y\|$. The projection onto the set $\Theta$ is $\Pi_\Theta(x) = \arg\min_{y \in \Theta} \|y - x\|$.
Figure 1: Diagram of the causal relationships between states, policy parameters, control inputs, and costs.
2 Preliminaries
We consider online policy selection on a single trajectory. The setting is a discrete-time dynamical system with state $x_t \in \mathbb{R}^n$ for time index $t \in \mathcal{T} := [0 : T-1]$. At time step $t \in \mathcal{T}$, the policy picks a control action $u_t \in \mathbb{R}^m$, and the next state and the incurred cost are given by
$$\text{Dynamics: } x_{t+1} = g_t(x_t, u_t), \qquad \text{Cost: } c_t := f_t(x_t, u_t),$$
respectively, where $g_t(\cdot, \cdot)$ is a time-varying dynamics function and $f_t(\cdot, \cdot)$ is a time-varying stage cost. The goal is to minimize the total cost $\sum_{t=0}^{T-1} c_t$.
We consider parameterized time-varying policies of the form $u_t = \pi_t(x_t, \theta_t)$, where $x_t$ is the current state at time step $t$ and $\theta_t \in \Theta$ is the current policy parameter. $\Theta$ is a closed convex subset of $\mathbb{R}^d$. We assume the dynamics, cost, and policy functions $\{g_t, f_t, \pi_t\}_{t \in \mathcal{T}}$ are oblivious, meaning they are fixed before the game begins. The online policy selection algorithm optimizes the total cost by selecting $\theta_t$ sequentially. We illustrate how the policy parameter sequence $\theta_{0:T-1}$ affects the trajectory $\{x_t, u_t\}_{t \in \mathcal{T}}$ and per-step costs $c_{0:T-1}$ in Figure 1. The online algorithm has access to the partial derivatives of the dynamics $g_t$ and cost $f_t$ along the visited trajectory, but does not have oracle access to $f_t, g_t$ for arbitrary states and actions.
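As a minimal illustration of this single-trajectory interaction, the sketch below rolls out the closed loop for a given parameter sequence; the callables g[t], f[t], pi[t] are hypothetical stand-ins for the dynamics, costs, and policies.

```python
import numpy as np

def rollout(x0, g, f, pi, thetas):
    """Roll out one closed-loop trajectory under a given parameter sequence.
    g[t], f[t], pi[t] are hypothetical callables for the (oblivious) dynamics,
    stage costs, and policies; thetas[t] is the parameter chosen at step t.
    Returns the states, inputs, and stage costs observed along the trajectory."""
    xs, us, cs = [np.asarray(x0, dtype=float)], [], []
    for t, theta in enumerate(thetas):
        u = pi[t](xs[-1], theta)
        us.append(u)
        cs.append(f[t](xs[-1], u))
        xs.append(g[t](xs[-1], u))
    return xs, us, cs
```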
We provide two motivating examples for our setting. Appendix H contains more details and a third example. The first example is learning-augmented Model Predictive Control, a generalization of [37].
Example 2.1 (MPC with Confidence Coefficients). Consider a linear time-varying (LTV) system $g_t(x_t, u_t) = A_t x_t + B_t u_t + w_t$, with time-varying costs $f_t(x_t, u_t) = q(x_t, Q_t) + q(u_t, R_t)$. At time $t$, the policy observes $\{A_{t:t+k-1}, B_{t:t+k-1}, Q_{t:t+k-1}, R_{t:t+k-1}, w_{t:t+k-1|t}\}$, where $w_{\tau|t}$ is a (noisy) prediction of the future disturbance $w_\tau$. Then, $\pi_t(x_t, \theta_t)$ commits the first entry of
$$\arg\min_{u_{t:t+k-1|t}} \sum_{\tau=t}^{t+k-1} f_\tau(x_{\tau|t}, u_{\tau|t}) + q(x_{t+k|t}, \tilde{Q}) \quad \text{s.t.} \quad x_{t|t} = x_t, \; x_{\tau+1|t} = A_\tau x_{\tau|t} + B_\tau u_{\tau|t} + \lambda_t^{[\tau-t]} w_{\tau|t}, \; t \le \tau < t+k, \tag{1}$$
where $\theta_t = \big(\lambda_t^{[0]}, \lambda_t^{[1]}, \dots, \lambda_t^{[k-1]}\big)$, $\Theta = [0,1]^k$, and $\tilde{Q}$ is a fixed positive-definite matrix. Intuitively, $\lambda_t^{[i]}$ represents our level of confidence in the disturbance prediction $i$ steps into the future at time step $t$, with entry $1$ being fully confident and $0$ being not confident at all.
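A minimal sketch of this policy class is given below, assuming the cvxpy modeling library; the function name and interface are illustrative, and the observed matrices and disturbance predictions are passed in as arguments. At runtime, $\pi_t(x_t, \theta_t)$ would call this with lam set to $\theta_t$.

```python
import cvxpy as cp

def mpc_confidence_action(x_t, A_seq, B_seq, Q_seq, R_seq, w_pred, lam, Q_tilde):
    """Solve the k-step problem in (1) and return its first input u_{t|t}.
    lam is the confidence vector theta_t in [0, 1]^k: lam[i] scales the
    disturbance predicted i steps ahead (w_pred[i])."""
    k = len(A_seq)
    n, m = B_seq[0].shape
    x = cp.Variable((k + 1, n))
    u = cp.Variable((k, m))
    cost = cp.quad_form(x[k], Q_tilde)  # terminal cost q(x_{t+k|t}, Q_tilde)
    constraints = [x[0] == x_t]
    for i in range(k):
        cost += cp.quad_form(x[i], Q_seq[i]) + cp.quad_form(u[i], R_seq[i])
        constraints.append(
            x[i + 1] == A_seq[i] @ x[i] + B_seq[i] @ u[i] + lam[i] * w_pred[i]
        )
    cp.Problem(cp.Minimize(cost), constraints).solve()
    return u.value[0]
```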
The second example studies a nonlinear control model motivated by [11, 46].
Example 2.2 (Linear Feedback Control in Nonlinear Systems). Consider a time-varying nonlinear control problem with dynamics $g_t(x_t, u_t) = Ax_t + Bu_t + \delta_t(x_t, u_t)$ and costs $f_t(x_t, u_t) = q(x_t, Q) + q(u_t, R)$. Here, the nonlinear residual $\delta_t$ comes from linearization and is assumed to be sufficiently small and Lipschitz. Inspired by [11], we construct an online policy based on the optimal controller $u_t = -\bar{K}x_t$ for the linear-quadratic regulator $\mathrm{LQR}(A, B, Q, R)$. Specifically, we let $\pi_t(x_t, \theta_t) = -K(\theta_t)x_t$, where $K$ is a mapping from $\Theta$ to $\mathbb{R}^{m \times n}$ such that $\|K(\theta_t) - \bar{K}\|$ is uniformly bounded.
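The sketch below shows one possible instantiation of such a policy class, assuming scipy for the Riccati solve; the specific parameterization $K(\theta) = \bar{K} + \Delta(\theta)$ with a norm clip is an illustrative choice, not the paper's construction.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_gain(A, B, Q, R):
    """Infinite-horizon discrete-time LQR gain K_bar, so that u = -K_bar @ x."""
    P = solve_discrete_are(A, B, Q, R)
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

def linear_feedback_policy(x, theta, K_bar, radius=0.5):
    """One possible parameterization K(theta) = K_bar + Delta(theta), with the
    perturbation clipped so that ||K(theta) - K_bar|| stays uniformly bounded.
    The clipping radius and the reshaping of theta are illustrative choices."""
    Delta = np.asarray(theta, dtype=float).reshape(K_bar.shape)
    norm = np.linalg.norm(Delta)
    if norm > radius:
        Delta *= radius / norm
    return -(K_bar + Delta) @ x
```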
2.1 Policy Class and Performance Metrics
In our setting, the state $x_t$ at time $t$ is uniquely determined by the combination of 1) a state $x_\tau$ at a previous time $\tau < t$, and 2) the parameter sequence $\theta_{\tau:t-1}$. Similarly, the cost at time $t$ is uniquely determined by $x_\tau$ and $\theta_{\tau:t}$. Since we use these properties often, we introduce the following notation.
Definition 2.3 (Multi-Step Dynamics and Cost). The multi-step dynamics $g_{t|\tau}$ between two time steps $\tau \le t$ specifies the state $x_t$ as a function of the previous state $x_\tau$ and previous policy parameters $\theta_{\tau:t-1}$. It is defined recursively, with the base case $g_{\tau|\tau}(x_\tau) := x_\tau$ and the recursive case
$$g_{t+1|\tau}(x_\tau, \theta_{\tau:t}) = g_t(z_t, \pi_t(z_t, \theta_t)), \quad \forall t \ge \tau,$$
in which $z_t := g_{t|\tau}(x_\tau, \theta_{\tau:t-1})$ is an auxiliary variable denoting the state at time $t$ under initial state $x_\tau$ and parameters $\theta_{\tau:t-1}$. The multi-step cost $f_{t|\tau}$ specifies the cost $c_t$ as a function of $x_\tau$ and $\theta_{\tau:t}$. It is defined as $f_{t|\tau}(x_\tau, \theta_{\tau:t}) := f_t(z_t, \pi_t(z_t, \theta_t))$.
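The recursion in Definition 2.3 translates directly into code; the sketch below reuses the hypothetical g[t], f[t], pi[t] callables from the earlier rollout sketch.

```python
def multi_step_dynamics(x_tau, thetas, g, pi, tau):
    """g_{t|tau}(x_tau, theta_{tau:t-1}): unroll the recursion of Definition 2.3.
    Here t = tau + len(thetas); g[s], pi[s] are the hypothetical callables above."""
    z = x_tau
    for i, theta in enumerate(thetas):
        s = tau + i
        z = g[s](z, pi[s](z, theta))
    return z

def multi_step_cost(x_tau, thetas, g, f, pi, tau):
    """f_{t|tau}(x_tau, theta_{tau:t}): the last parameter enters only through
    the policy at time t, not the dynamics."""
    t = tau + len(thetas) - 1
    z = multi_step_dynamics(x_tau, thetas[:-1], g, pi, tau)
    return f[t](z, pi[t](z, thetas[-1]))
```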
In this paper, we frequently compare the trajectory of our algorithm against the trajectory achieved by applying a fixed parameter $\theta$ since time step $0$, which we denote as $\hat{x}_t(\theta) := g_{t|0}(x_0, \theta^{\times t})$ and $\hat{u}_t(\theta) := \pi_t(\hat{x}_t(\theta), \theta)$. A related concept that is heavily used is the surrogate cost $F_t$, which maps a single policy parameter to a real number.

Definition 2.4 (Surrogate Cost). The surrogate cost function is defined as $F_t(\theta) := f_t(\hat{x}_t(\theta), \hat{u}_t(\theta))$.
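In code, the surrogate cost is simply the multi-step cost evaluated with the same parameter repeated at every step; a sketch reusing the helpers above:

```python
def surrogate_cost(theta, t, x0, g, f, pi):
    """F_t(theta): the stage cost at time t if the single parameter theta had
    been applied at every step 0, ..., t (Definition 2.4)."""
    return multi_step_cost(x0, [theta] * (t + 1), g, f, pi, tau=0)
```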
Figure 1 shows the overall causal structure, from which these concepts follow.
To measure the performance of an online algorithm, we adopt the objective of adaptive policy regret, which has been used by [7, 51]. It is a stronger benchmark than the static policy regret [1, 6] and is more suited to time-varying environments. We use $\{x_t, u_t, \theta_t\}_{t \in \mathcal{T}}$ to denote the trajectory of the online algorithm throughout the paper. The adaptive policy regret $R_A(T)$ is defined as the maximum difference between the cost of the online policy and the cost of the optimal fixed-parameter policy over any sub-interval of the whole horizon $\mathcal{T}$, i.e.,
$$R_A(T) := \max_{I = [t_1:t_2] \subseteq \mathcal{T}} \Big( \textstyle\sum_{t \in I} f_t(x_t, u_t) - \inf_{\theta \in \Theta} \sum_{t \in I} F_t(\theta) \Big). \tag{2}$$
In contrast, the (static) policy regret defined in [1, 6] restricts the time interval $I$ to be the whole horizon $\mathcal{T}$. Thus, a bound on adaptive regret is strictly stronger than the same bound on static regret. This metric is particularly useful in time-varying environments like Examples 2.1 and 2.2 because an online algorithm must adapt quickly to compete against a comparator policy parameter that can change indefinitely with every time interval [31, Section 10.2].
In the general case when surrogate costs $F_{0:T-1}$ are nonconvex, it is difficult (if not impossible) for online algorithms to achieve meaningful guarantees on classic regret metrics like $R_A(T)$ or static policy regret because they lack oracle knowledge of the surrogate costs. Therefore, we introduce the metric of local regret, which bounds the sum of squared gradient norms over the whole horizon:
$$R_L(T) := \textstyle\sum_{t=0}^{T-1} \|\nabla F_t(\theta_t)\|^2. \tag{3}$$
Similar metrics have been adopted by previous works on online nonconvex optimization [52]. Intuitively, $R_L(T)$ measures how well the online agent chases the (changing) stationary point of the surrogate cost sequence $F_{0:T-1}$. Since the surrogate cost functions are changing over time, the bound on $R_L(T)$ will depend on how much the system $\{g_t, f_t, \pi_t\}_{t \in \mathcal{T}}$ changes over the whole horizon $T$. We defer the details to Section 3.3.
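For concreteness, the sketch below evaluates $R_L(T)$ for a given parameter sequence by approximating each $\nabla F_t$ with central finite differences on the surrogate-cost helper defined earlier; in practice one would use automatic differentiation, and this oracle-style computation is for illustration only (an online algorithm cannot evaluate it).

```python
import numpy as np

def local_regret(thetas, x0, g, f, pi, eps=1e-5):
    """R_L(T) = sum_t ||grad F_t(theta_t)||^2, with each gradient approximated
    by central finite differences on the surrogate cost defined above."""
    total = 0.0
    for t, theta in enumerate(thetas):
        theta = np.asarray(theta, dtype=float)
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            e = np.zeros_like(theta)
            e[i] = eps
            grad[i] = (surrogate_cost(theta + e, t, x0, g, f, pi)
                       - surrogate_cost(theta - e, t, x0, g, f, pi)) / (2 * eps)
        total += float(grad @ grad)
    return total
```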
2.2 Contractive Perturbation and Stability
In this section, we introduce two key properties needed for our sub-linear regret guarantees in adaptive
online policy selection. We define both with respect to trajectories generated by “slowly” time-varying
parameters, which are easier to analyze than arbitrary parameter sequences.
Definition 2.5. We denote the set of policy parameter sequences with ε-constrained step size by
$$S_\varepsilon(t_1:t_2) := \big\{\theta_{t_1:t_2} \in \Theta^{t_2-t_1+1} \;\big|\; \|\theta_{\tau+1} - \theta_\tau\| \le \varepsilon, \; \forall \tau \in [t_1 : t_2 - 1]\big\}.$$
The first property we require is an exponentially decaying, or “contractive”, perturbation property of
the closed-loop dynamics of the system with the policy class. We now formalize this property.
Definition 2.6 ($\varepsilon$-Time-Varying Contractive Perturbation). The $\varepsilon$-time-varying contractive perturbation property holds for $R_C > 0$, $C > 0$, $\rho \in (0, 1)$, and $\varepsilon \ge 0$ if, for any $\theta_{\tau:t-1} \in S_\varepsilon(\tau : t-1)$,
$$\big\| g_{t|\tau}(x_\tau, \theta_{\tau:t-1}) - g_{t|\tau}(x'_\tau, \theta_{\tau:t-1}) \big\| \le C \rho^{t-\tau} \|x_\tau - x'_\tau\|$$
holds for arbitrary $x_\tau, x'_\tau \in B_n(0, R_C)$ and time steps $\tau \le t$.
Intuitively, $\varepsilon$-time-varying contractive perturbation requires two trajectories starting from different states (in a bounded ball) to converge towards each other if they adopt the same slowly time-varying policy parameter sequence. We call the special case of $\varepsilon = 0$ time-invariant contractive perturbation, meaning the policy parameter is fixed. Although it may be difficult to verify the time-varying property directly since it allows the policy parameters to change, we show in Lemma 2.8 that time-invariant contractive perturbation implies that the time-varying version also holds for some small $\varepsilon > 0$.
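The property can be probed numerically. The sketch below rolls out two trajectories from different initial states under the same slowly varying parameter sequence and checks the bound of Definition 2.6 for given constants $(C, \rho)$; passing the check for sampled instances does not prove the property, it only fails to falsify it.

```python
import numpy as np

def check_contractive_perturbation(g, pi, thetas, x_a, x_b, tau, C, rho, tol=1e-9):
    """Empirically test Definition 2.6 on one instance: roll out from two initial
    states in B_n(0, R_C) under the same slowly varying parameter sequence and
    check ||x_t - x'_t|| <= C * rho**(t - tau) * ||x_tau - x'_tau|| at every step."""
    za, zb = np.asarray(x_a, dtype=float), np.asarray(x_b, dtype=float)
    gap0 = np.linalg.norm(za - zb)
    for i in range(len(thetas) + 1):
        if np.linalg.norm(za - zb) > C * rho**i * gap0 + tol:
            return False
        if i < len(thetas):
            s = tau + i
            za = g[s](za, pi[s](za, thetas[i]))
            zb = g[s](zb, pi[s](zb, thetas[i]))
    return True
```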
The time-invariant contractive perturbation property is closely related to discrete-time incremental stability [e.g., 44] and contraction theory [e.g., 45], which have been studied in control theory. While some specific policies, including DAC and MPC, satisfy $\varepsilon$-time-varying contractive perturbation globally in linear systems, in other cases it is hard to verify. Our property is local and thus is easier to establish for broader applications in nonlinear systems (e.g., Example 2.2).
Besides contractive perturbation, another important property we need is the stability of the policy class, which requires that $\pi_{0:T-1}$ can stabilize the system starting from the zero state as long as the policy parameter varies slowly. This property is stated formally below:
Definition 2.7 ($\varepsilon$-Time-Varying Stability). The $\varepsilon$-time-varying stability property holds for $R_S > 0$ and $\varepsilon \ge 0$ if, for any $\theta_{\tau:t-1} \in S_\varepsilon(\tau : t-1)$, $\big\| g_{t|\tau}(0, \theta_{\tau:t-1}) \big\| \le R_S$ holds for any time steps $t \ge \tau$.
Intuitively, $\varepsilon$-time-varying stability guarantees that the policy class $\pi_{0:T-1}$ can achieve stability if the policy parameters $\theta_{0:T-1}$ vary slowly. (This property is standard in online control and is satisfied by DAC [1, 3, 6–8] as well as Examples 2.1 and 2.2.) Similarly to contractive perturbation, one only needs to verify time-invariant stability (i.e., $\varepsilon = 0$ and the policy parameter is fixed) to claim that time-varying stability holds for some strictly positive $\varepsilon$ (see Lemma 2.8). The reason we still use the time-varying contractive perturbation and stability in our assumptions is that they hold for $\varepsilon = +\infty$ in some cases, including DAC and MPC with confidence coefficients. Applying Lemma 2.8 for those systems would lead to a small, overly pessimistic $\varepsilon$.
2.3 Key Assumptions
We make two assumptions about the online policy selection problem to achieve regret guarantees.
Assumption 2.1. The dynamics $g_{0:T-1}$, policies $\pi_{0:T-1}$, and costs $f_{0:T-1}$ are differentiable at every time step and satisfy that, for any convex compact sets $\mathcal{X} \subseteq \mathbb{R}^n$, $\mathcal{U} \subseteq \mathbb{R}^m$, one can find Lipschitzness/smoothness constants (which can depend on $\mathcal{X}$ and $\mathcal{U}$) such that:
1. The dynamics $g_t(x, u)$ is $(L_{g,x}, L_{g,u})$-Lipschitz and $(\ell_{g,x}, \ell_{g,u})$-smooth in $(x, u)$ on $\mathcal{X} \times \mathcal{U}$.
2. The policy function $\pi_t(x, \theta)$ is $(L_{\pi,x}, L_{\pi,\theta})$-Lipschitz and $(\ell_{\pi,x}, \ell_{\pi,\theta})$-smooth in $(x, \theta)$ on $\mathcal{X} \times \Theta$.
3. The stage cost function $f_t(x, u)$ is $(L_f, L_f)$-Lipschitz and $(\ell_{f,x}, \ell_{f,u})$-smooth in $(x, u)$ on $\mathcal{X} \times \mathcal{U}$.
Assumption 2.1 is general because we only require the Lipschitzness/smoothness of $g_t$ and $f_t$ to hold for bounded states/actions within $\mathcal{X}$ and $\mathcal{U}$, where the coefficients may depend on $\mathcal{X}$ and $\mathcal{U}$. Similar assumptions are common in the literature of online control/optimization [24, 28, 46].
Our second assumption is on the contractive perturbation and the stability of the closed-loop dynamics
induced by a slowly time-varying policy parameter sequence.
Assumption 2.2. Let $\mathcal{G}$ denote the set of all possible dynamics/policy sequences $\{g_t, \pi_t\}_{t \in \mathcal{T}}$ the environment/policy class may provide. For a fixed $\varepsilon \in \mathbb{R}_{\ge 0}$, the $\varepsilon$-time-varying contractive perturbation (Definition 2.6) holds with $(R_C, C, \rho)$ for any sequence in $\mathcal{G}$. The $\varepsilon$-time-varying stability (Definition 2.7) holds with $R_S < R_C$ for any sequence in $\mathcal{G}$. We assume that the initial state satisfies $\|x_0\| < (R_C - R_S)/C$. Further, we assume that if $\{g, \pi\}$ is the dynamics/policy at an intermediate time step of a sequence in $\mathcal{G}$, then the time-invariant sequence $\{g, \pi\}^{\times T}$ is also in $\mathcal{G}$.
Compared to other settings where contractive perturbation holds globally [1, 8, 53], our assumption brings a new challenge because we need to guarantee the starting state stays within $B(0, R_C)$ whenever we apply this property in the proof. Therefore, we assume $R_C > R_S + C\|x_0\|$ in Assumption 2.2. Similarly, to leverage the Lipschitzness/smoothness property, we require $\mathcal{X} \supseteq B(0, R_x)$ where $R_x \ge C(R_S + C\|x_0\|) + R_S$ and $\mathcal{U} = \{\pi(x, \theta) \mid x \in \mathcal{X}, \theta \in \Theta, \pi \in \mathcal{G}\}$. Since the coefficients in Assumption 2.1 depend on $\mathcal{X}$ and $\mathcal{U}$, we will set $\mathcal{X} = B(0, R_x)$ and $R_x = C(R_S + C\|x_0\|) + R_S$ by default when presenting these constants.