Online Adaptive Policy Selection in Time-Varying
Systems: No-Regret via Contractive Perturbations
Yiheng Lin, James A. Preiss, Emile Anand, Yingying Li, Yisong Yue, Adam Wierman
Department of Computing + Mathematical Sciences
California Institute of Technology
Pasadena, California, USA
{yihengl, japreiss, eanand, yingli2, yyue, adamw}@caltech.edu
Abstract
We study online adaptive policy selection in systems with time-varying costs and
dynamics. We develop the Gradient-based Adaptive Policy Selection (GAPS)
algorithm together with a general analytical framework for online policy selection
via online optimization. Under our proposed notion of contractive policy classes,
we show that GAPS approximates the behavior of an ideal online gradient descent
algorithm on the policy parameters while requiring less information and computation. When convexity holds, our algorithm is the first to achieve optimal policy
regret. When convexity does not hold, we provide the first local regret bound for
online policy selection. Our numerical experiments show that GAPS can adapt to
changing environments more quickly than existing benchmarks.
1 Introduction
We study the problem of online adaptive policy selection for nonlinear time-varying discrete-time dynamical systems. The dynamics are given by $x_{t+1} = g_t(x_t, u_t)$, where $x_t$ is the state and $u_t$ is the control input at time $t$. The policy class is a time-varying mapping $\pi_t$ from the state $x_t$ and a policy parameter $\theta_t$ to a control input $u_t$. At every time step $t$, the online policy incurs a stage cost $c_t = f_t(x_t, u_t)$ that depends on the current state and control input. The goal of policy selection is to pick the parameter $\theta_t$ online to minimize the total stage costs over a finite horizon $T$.
Online adaptive policy selection and general online control have received significant attention recently [1–8] because many control tasks require running the policy on a single trajectory, as opposed to restarting the episode to evaluate a different policy from the same initial state. Adaptivity is also important when the dynamics and cost functions are time-varying. For example, in robotics, time-varying dynamics arise when we control a drone under changing wind conditions [9].
In this paper, we are interested in developing a unified framework that can leverage a broad suite of theoretical results from online optimization and efficiently translate them to online policy selection, where efficiency includes both preserving the tightness of the guarantees and computational considerations. A central issue is that, in online policy selection, the stage cost $c_t$ depends on all previously selected parameters $(\theta_0, \dots, \theta_{t-1})$ via the state $x_t$. Many prior works along this direction have addressed this issue by finite-memory reductions. This approach leads to the first regret bounds on online policy selection, but the bounds are not tight, the computational cost can be large, and the dynamics and policy classes studied are restrictive [1, 3, 6–8].
Contributions. We propose and analyze the algorithm Gradient-based Adaptive Policy Selection (GAPS, Algorithm 1) to address three limitations of existing results on online policy selection. First, under the assumption that $c_t$ is a convex function of $(\theta_0, \dots, \theta_t)$, prior work left a $\log T$ regret gap between OCO and online policy selection. We close this gap by showing that GAPS achieves the
optimal regret of $O(\sqrt{T})$ (Theorem 3.3). Second, many previous approaches require oracle access to the dynamics/costs and expensive resimulation from imaginary previous states. In contrast, GAPS only requires partial derivatives of the dynamics and costs along the visited trajectory, and computes $O(\log T)$ matrix multiplications at each step. Third, the application of previous results is limited to specific policy classes and systems because they require $c_t$ to be convex in $(\theta_0, \dots, \theta_t)$. We address this limitation by showing the first local regret bound for online policy selection when convexity does not hold. Specifically, GAPS achieves a local regret of $O(\sqrt{(1+V)T})$, where $V$ is a measure of how much $(g_t, f_t, \pi_t)$ changes over the entire horizon.
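To make the computational picture concrete, the following numpy sketch illustrates a truncated-memory, chain-rule gradient update of the kind described above: it keeps a short buffer of Jacobians evaluated along the visited trajectory and takes a projected gradient step using only the last $B$ steps of memory. The interface, buffer length B, step size eta, and projection are illustrative assumptions; this is a sketch, not the paper's Algorithm 1.

```python
import numpy as np

class TruncatedGradientLearner:
    """Illustrative sketch of a truncated-memory policy-gradient update (not the
    paper's Algorithm 1): accumulate the chain-rule gradient of the current stage
    cost with respect to the last B policy parameters, using only Jacobians
    evaluated along the visited trajectory.

    Expected Jacobian shapes (n = dim x, m = dim u, d = dim theta):
      dfdx (1, n), dfdu (1, m), dpidx (m, n), dpidtheta (m, d),
      dgdx (n, n), dgdu (n, m).
    """

    def __init__(self, theta0, eta, B, project):
        self.theta = np.asarray(theta0, dtype=float)
        self.eta = eta          # step size
        self.B = B              # buffer length (O(log T) in the paper's analysis)
        self.project = project  # projection onto the parameter set Theta
        self.buffer = []        # per-step (d x_{t+1}/d x_t, d x_{t+1}/d theta_t)

    def update(self, dfdx, dfdu, dpidx, dpidtheta, dgdx, dgdu):
        # Sensitivity of the current cost to the current state (through x and u).
        dc_dx = dfdx + dfdu @ dpidx
        # Direct dependence of c_t on theta_t through u_t = pi_t(x_t, theta_t).
        grad = (dfdu @ dpidtheta).ravel()
        # Dependence on the last B past parameters through the visited states.
        prod = np.eye(dgdx.shape[0])
        for Ax, Atheta in reversed(self.buffer[-self.B:]):
            grad += (dc_dx @ prod @ Atheta).ravel()
            prod = prod @ Ax
        # Store closed-loop sensitivities of x_{t+1} for future steps.
        self.buffer.append((dgdx + dgdu @ dpidx, dgdu @ dpidtheta))
        # Projected gradient step on the policy parameter.
        self.theta = self.project(self.theta - self.eta * grad)
        return self.theta
```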
To derive these performance guarantees, we develop a novel proof framework based on a general exponentially decaying, or "contractive", perturbation property (Definition 2.6) on the policy-induced closed-loop dynamics. This generalizes a key property of disturbance-action controllers [e.g., 1, 8] and includes other important policy classes such as model predictive control (MPC) [e.g., 10] and linear feedback controllers [e.g., 11]. Under this property, we prove an approximation error bound (Theorem 3.2), which shows that GAPS can mimic the update of an ideal online gradient descent (OGD) algorithm [12] that has oracle knowledge of how the current policy parameter $\theta_t$ would have performed if used exclusively over the whole trajectory. This error bound bridges online policy selection and online optimization, which means regret guarantees on OGD for online optimization can be transferred to GAPS for online policy selection.
In numerical experiments, we demonstrate that GAPS adapts faster than an existing follow-the-leader-type baseline in MPC with imperfect disturbance predictions, and outperforms a strong optimal control baseline in a nonlinear system with non-i.i.d. disturbances. We include the source code in the supplementary material.
Related Work. Our work is related to online control and adaptive-learning-based control [2, 13–18], especially online control with adversarial disturbances and regret guarantees [1, 6–8, 19, 20]. For example, there is a rich literature on policy regret bounds for time-invariant dynamics [1, 6, 8, 14, 19, 21]. There is also a growing interest in algorithms for time-varying systems with small adaptive regret [7, 20], dynamic regret [22–24], and competitive ratio [25–28]. Many prior works study a specific policy class called the disturbance-action controller (DAC) [1, 3, 6–8]. When applied to linear dynamics $g_t$ with convex cost functions $f_t$, DAC renders the stage cost $c_t$ a convex function in past policy parameters $(\theta_0, \dots, \theta_t)$. Our work contributes to the literature by proposing a general contractive perturbation property that includes DAC as a special case, and showing local regret bounds that do not require $c_t$ to be convex in $(\theta_0, \dots, \theta_t)$. A recent work also handles nonconvex $c_t$, but it studies an episodic setting and requires $c_t$ to be "nearly convex", which holds under its policy class [29].
In addition to online control, this work is also related to online learning/optimization [12, 30, 31], especially online optimization with memory and/or switching costs, where the cost at each time step depends on past decisions. Specifically, our online adaptive policy selection problem is related to online optimization with memory [28, 32–38]. Our analysis for GAPS provides insight on how to handle indefinite memory when the impact of a past decision decays exponentially with time.
Our contractive perturbation property and the analytical framework based on this property are closely related to prior works on discrete-time incremental stability and contraction theory in nonlinear systems [39–45], as well as works that leverage such properties to derive guarantees for (online) controllers [46–48]. In complicated systems, it may be hard to design policies that provably satisfy these properties. This motivates some recent works to study neural-based approaches that can learn a controller together with its certificate for contraction properties simultaneously [49, 50]. Our work contributes to this field by showing that, when the system satisfies the contractive perturbation property, one can leverage this property to bridge online policy selection with online optimization.
Notation. We use $[t_1:t_2]$ to denote the sequence $(t_1, \dots, t_2)$, $a_{t_1:t_2}$ to denote $(a_{t_1}, a_{t_1+1}, \dots, a_{t_2})$ for $t_1 \le t_2$, and $a^{\times \tau}$ for $(a, \dots, a)$ with $a$ repeated $\tau \ge 0$ times. We define $q(x, Q) = x^\top Q x$. Symbols $\mathbf{1}$ and $\mathbf{0}$ denote the all-one and all-zero vectors/matrices respectively, with dimension implied by context. The Euclidean ball with center $0$ and radius $R$ in $\mathbb{R}^n$ is denoted by $B_n(0, R)$. We let $\|\cdot\|$ denote the (induced) Euclidean norm for vectors (matrices). The diameter of a set $\Theta$ is $\mathrm{diam}(\Theta) := \sup_{x, y \in \Theta} \|x - y\|$. The projection onto the set $\Theta$ is $\Pi_\Theta(x) = \arg\min_{y \in \Theta} \|y - x\|$.
Figure 1: Diagram of the causal relationships between states, policy parameters, control inputs, and costs.
2 Preliminaries
We consider online policy selection on a single trajectory. The setting is a discrete-time dynamical system with state $x_t \in \mathbb{R}^n$ for time index $t \in \mathcal{T} := [0 : T-1]$. At time step $t \in \mathcal{T}$, the policy picks a control action $u_t \in \mathbb{R}^m$, and the next state and the incurred cost are given by
$$\text{Dynamics: } x_{t+1} = g_t(x_t, u_t), \qquad \text{Cost: } c_t := f_t(x_t, u_t),$$
respectively, where $g_t(\cdot, \cdot)$ is a time-varying dynamics function and $f_t(\cdot, \cdot)$ is a time-varying stage cost. The goal is to minimize the total cost $\sum_{t=0}^{T-1} c_t$.
We consider parameterized time-varying policies of the form $u_t = \pi_t(x_t, \theta_t)$, where $x_t$ is the current state at time step $t$ and $\theta_t \in \Theta$ is the current policy parameter. $\Theta$ is a closed convex subset of $\mathbb{R}^d$. We assume the dynamics, cost, and policy functions $\{g_t, f_t, \pi_t\}_{t \in \mathcal{T}}$ are oblivious, meaning they are fixed before the game begins. The online policy selection algorithm optimizes the total cost by selecting $\theta_t$ sequentially. We illustrate how the policy parameter sequence $\theta_{0:T-1}$ affects the trajectory $\{x_t, u_t\}_{t \in \mathcal{T}}$ and per-step costs $c_{0:T-1}$ in Figure 1. The online algorithm has access to the partial derivatives of the dynamics $g_t$ and cost $f_t$ along the visited trajectory, but does not have oracle access to $f_t, g_t$ for arbitrary states and actions.
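As a minimal illustration of this single-trajectory interaction, the sketch below rolls out the closed loop for a given parameter sequence; the callables g[t], f[t], pi[t] are hypothetical stand-ins for the dynamics, costs, and policies.

```python
import numpy as np

def rollout(x0, g, f, pi, thetas):
    """Roll out one closed-loop trajectory under a given parameter sequence.
    g[t], f[t], pi[t] are hypothetical callables for the (oblivious) dynamics,
    stage costs, and policies; thetas[t] is the parameter chosen at step t.
    Returns the states, inputs, and stage costs observed along the trajectory."""
    xs, us, cs = [np.asarray(x0, dtype=float)], [], []
    for t, theta in enumerate(thetas):
        u = pi[t](xs[-1], theta)
        us.append(u)
        cs.append(f[t](xs[-1], u))
        xs.append(g[t](xs[-1], u))
    return xs, us, cs
```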
We provide two motivating examples for our setting. Appendix H contains more details and a third example. The first example is learning-augmented Model Predictive Control, a generalization of [37].
Example 2.1 (MPC with Confidence Coefficients). Consider a linear time-varying (LTV) system $g_t(x_t, u_t) = A_t x_t + B_t u_t + w_t$, with time-varying costs $f_t(x_t, u_t) = q(x_t, Q_t) + q(u_t, R_t)$. At time $t$, the policy observes $\{A_{t:t+k-1}, B_{t:t+k-1}, Q_{t:t+k-1}, R_{t:t+k-1}, w_{t:t+k-1|t}\}$, where $w_{\tau|t}$ is a (noisy) prediction of the future disturbance $w_\tau$. Then, $\pi_t(x_t, \theta_t)$ commits the first entry of
$$\arg\min_{u_{t:t+k-1|t}} \sum_{\tau=t}^{t+k-1} f_\tau(x_{\tau|t}, u_{\tau|t}) + q(x_{t+k|t}, \tilde{Q}) \quad \text{s.t.} \quad x_{t|t} = x_t, \; x_{\tau+1|t} = A_\tau x_{\tau|t} + B_\tau u_{\tau|t} + \lambda_t^{[\tau-t]} w_{\tau|t}, \; t \le \tau < t+k, \tag{1}$$
where $\theta_t = \big(\lambda_t^{[0]}, \lambda_t^{[1]}, \dots, \lambda_t^{[k-1]}\big)$, $\Theta = [0,1]^k$, and $\tilde{Q}$ is a fixed positive-definite matrix. Intuitively, $\lambda_t^{[i]}$ represents our level of confidence in the disturbance prediction $i$ steps into the future at time step $t$, with entry $1$ being fully confident and $0$ being not confident at all.
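A minimal sketch of this policy class is given below, assuming the cvxpy modeling library; the function name and interface are illustrative, and the observed matrices and disturbance predictions are passed in as arguments. At runtime, $\pi_t(x_t, \theta_t)$ would call this with lam set to $\theta_t$.

```python
import cvxpy as cp

def mpc_confidence_action(x_t, A_seq, B_seq, Q_seq, R_seq, w_pred, lam, Q_tilde):
    """Solve the k-step problem in (1) and return its first input u_{t|t}.
    lam is the confidence vector theta_t in [0, 1]^k: lam[i] scales the
    disturbance predicted i steps ahead (w_pred[i])."""
    k = len(A_seq)
    n, m = B_seq[0].shape
    x = cp.Variable((k + 1, n))
    u = cp.Variable((k, m))
    cost = cp.quad_form(x[k], Q_tilde)  # terminal cost q(x_{t+k|t}, Q_tilde)
    constraints = [x[0] == x_t]
    for i in range(k):
        cost += cp.quad_form(x[i], Q_seq[i]) + cp.quad_form(u[i], R_seq[i])
        constraints.append(
            x[i + 1] == A_seq[i] @ x[i] + B_seq[i] @ u[i] + lam[i] * w_pred[i]
        )
    cp.Problem(cp.Minimize(cost), constraints).solve()
    return u.value[0]
```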
The second example studies a nonlinear control model motivated by [11, 46].
Example 2.2 (Linear Feedback Control in Nonlinear Systems). Consider a time-varying nonlinear control problem with dynamics $g_t(x_t, u_t) = Ax_t + Bu_t + \delta_t(x_t, u_t)$ and costs $f_t(x_t, u_t) = q(x_t, Q) + q(u_t, R)$. Here, the nonlinear residual $\delta_t$ comes from linearization and is assumed to be sufficiently small and Lipschitz. Inspired by [11], we construct an online policy based on the optimal controller $u_t = -\bar{K}x_t$ for the linear-quadratic regulator $\mathrm{LQR}(A, B, Q, R)$. Specifically, we let $\pi_t(x_t, \theta_t) = -K(\theta_t)x_t$, where $K$ is a mapping from $\Theta$ to $\mathbb{R}^{m \times n}$ such that $\|K(\theta_t) - \bar{K}\|$ is uniformly bounded.
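The sketch below shows one possible instantiation of such a policy class, assuming scipy for the Riccati solve; the specific parameterization $K(\theta) = \bar{K} + \Delta(\theta)$ with a norm clip is an illustrative choice, not the paper's construction.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_gain(A, B, Q, R):
    """Infinite-horizon discrete-time LQR gain K_bar, so that u = -K_bar @ x."""
    P = solve_discrete_are(A, B, Q, R)
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

def linear_feedback_policy(x, theta, K_bar, radius=0.5):
    """One possible parameterization K(theta) = K_bar + Delta(theta), with the
    perturbation clipped so that ||K(theta) - K_bar|| stays uniformly bounded.
    The clipping radius and the reshaping of theta are illustrative choices."""
    Delta = np.asarray(theta, dtype=float).reshape(K_bar.shape)
    norm = np.linalg.norm(Delta)
    if norm > radius:
        Delta *= radius / norm
    return -(K_bar + Delta) @ x
```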
2.1 Policy Class and Performance Metrics
In our setting, the state $x_t$ at time $t$ is uniquely determined by the combination of 1) a state $x_\tau$ at a previous time $\tau < t$, and 2) the parameter sequence $\theta_{\tau:t-1}$. Similarly, the cost at time $t$ is uniquely determined by $x_\tau$ and $\theta_{\tau:t}$. Since we use these properties often, we introduce the following notation.
Definition 2.3 (Multi-Step Dynamics and Cost). The multi-step dynamics $g_{t|\tau}$ between two time steps $\tau \le t$ specifies the state $x_t$ as a function of the previous state $x_\tau$ and previous policy parameters $\theta_{\tau:t-1}$. It is defined recursively, with the base case $g_{\tau|\tau}(x_\tau) := x_\tau$ and the recursive case
$$g_{t+1|\tau}(x_\tau, \theta_{\tau:t}) = g_t(z_t, \pi_t(z_t, \theta_t)), \quad \forall t \ge \tau,$$
in which $z_t := g_{t|\tau}(x_\tau, \theta_{\tau:t-1})$ is an auxiliary variable denoting the state at time $t$ under initial state $x_\tau$ and parameters $\theta_{\tau:t-1}$. The multi-step cost $f_{t|\tau}$ specifies the cost $c_t$ as a function of $x_\tau$ and $\theta_{\tau:t}$. It is defined as $f_{t|\tau}(x_\tau, \theta_{\tau:t}) := f_t(z_t, \pi_t(z_t, \theta_t))$.
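The recursion in Definition 2.3 translates directly into code; the sketch below reuses the hypothetical g[t], f[t], pi[t] callables from the earlier rollout sketch.

```python
def multi_step_dynamics(x_tau, thetas, g, pi, tau):
    """g_{t|tau}(x_tau, theta_{tau:t-1}): unroll the recursion of Definition 2.3.
    Here t = tau + len(thetas); g[s], pi[s] are the hypothetical callables above."""
    z = x_tau
    for i, theta in enumerate(thetas):
        s = tau + i
        z = g[s](z, pi[s](z, theta))
    return z

def multi_step_cost(x_tau, thetas, g, f, pi, tau):
    """f_{t|tau}(x_tau, theta_{tau:t}): the last parameter enters only through
    the policy at time t, not the dynamics."""
    t = tau + len(thetas) - 1
    z = multi_step_dynamics(x_tau, thetas[:-1], g, pi, tau)
    return f[t](z, pi[t](z, thetas[-1]))
```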
In this paper, we frequently compare the trajectory of our algorithm against the trajectory achieved by applying a fixed parameter $\theta$ since time step $0$, which we denote as $\hat{x}_t(\theta) := g_{t|0}(x_0, \theta^{\times t})$ and $\hat{u}_t(\theta) := \pi_t(\hat{x}_t(\theta), \theta)$. A related concept that is heavily used is the surrogate cost $F_t$, which maps a single policy parameter to a real number.

Definition 2.4 (Surrogate Cost). The surrogate cost function is defined as $F_t(\theta) := f_t(\hat{x}_t(\theta), \hat{u}_t(\theta))$.
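In code, the surrogate cost is simply the multi-step cost evaluated with the same parameter repeated at every step; a sketch reusing the helpers above:

```python
def surrogate_cost(theta, t, x0, g, f, pi):
    """F_t(theta): the stage cost at time t if the single parameter theta had
    been applied at every step 0, ..., t (Definition 2.4)."""
    return multi_step_cost(x0, [theta] * (t + 1), g, f, pi, tau=0)
```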
Figure 1 shows the overall causal structure, from which these concepts follow.
To measure the performance of an online algorithm, we adopt the objective of adaptive policy regret, which has been used by [7, 51]. It is a stronger benchmark than the static policy regret [1, 6] and is more suited to time-varying environments. We use $\{x_t, u_t, \theta_t\}_{t \in \mathcal{T}}$ to denote the trajectory of the online algorithm throughout the paper. The adaptive policy regret $R_A(T)$ is defined as the maximum difference between the cost of the online policy and the cost of the optimal fixed-parameter policy over any sub-interval of the whole horizon $\mathcal{T}$, i.e.,
$$R_A(T) := \max_{I = [t_1:t_2] \subseteq \mathcal{T}} \Big( \textstyle\sum_{t \in I} f_t(x_t, u_t) - \inf_{\theta \in \Theta} \sum_{t \in I} F_t(\theta) \Big). \tag{2}$$
In contrast, the (static) policy regret defined in [1, 6] restricts the time interval $I$ to be the whole horizon $\mathcal{T}$. Thus, a bound on adaptive regret is strictly stronger than the same bound on static regret. This metric is particularly useful in time-varying environments like Examples 2.1 and 2.2 because an online algorithm must adapt quickly to compete against a comparator policy parameter that can change indefinitely with every time interval [31, Section 10.2].
In the general case when surrogate costs $F_{0:T-1}$ are nonconvex, it is difficult (if not impossible) for online algorithms to achieve meaningful guarantees on classic regret metrics like $R_A(T)$ or static policy regret because they lack oracle knowledge of the surrogate costs. Therefore, we introduce the metric of local regret, which bounds the sum of squared gradient norms over the whole horizon:
$$R_L(T) := \textstyle\sum_{t=0}^{T-1} \|\nabla F_t(\theta_t)\|^2. \tag{3}$$
Similar metrics have been adopted by previous works on online nonconvex optimization [52]. Intuitively, $R_L(T)$ measures how well the online agent chases the (changing) stationary point of the surrogate cost sequence $F_{0:T-1}$. Since the surrogate cost functions are changing over time, the bound on $R_L(T)$ will depend on how much the system $\{g_t, f_t, \pi_t\}_{t \in \mathcal{T}}$ changes over the whole horizon $T$. We defer the details to Section 3.3.
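For concreteness, the sketch below evaluates $R_L(T)$ for a given parameter sequence by approximating each $\nabla F_t$ with central finite differences on the surrogate-cost helper defined earlier; in practice one would use automatic differentiation, and this oracle-style computation is for illustration only (an online algorithm cannot evaluate it).

```python
import numpy as np

def local_regret(thetas, x0, g, f, pi, eps=1e-5):
    """R_L(T) = sum_t ||grad F_t(theta_t)||^2, with each gradient approximated
    by central finite differences on the surrogate cost defined above."""
    total = 0.0
    for t, theta in enumerate(thetas):
        theta = np.asarray(theta, dtype=float)
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            e = np.zeros_like(theta)
            e[i] = eps
            grad[i] = (surrogate_cost(theta + e, t, x0, g, f, pi)
                       - surrogate_cost(theta - e, t, x0, g, f, pi)) / (2 * eps)
        total += float(grad @ grad)
    return total
```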
2.2 Contractive Perturbation and Stability
In this section, we introduce two key properties needed for our sub-linear regret guarantees in adaptive
online policy selection. We define both with respect to trajectories generated by “slowly” time-varying
parameters, which are easier to analyze than arbitrary parameter sequences.
Definition 2.5. We denote the set of policy parameter sequences with ε-constrained step size by
$$S_\varepsilon(t_1:t_2) := \big\{\theta_{t_1:t_2} \in \Theta^{t_2-t_1+1} \;\big|\; \|\theta_{\tau+1} - \theta_\tau\| \le \varepsilon, \; \forall \tau \in [t_1 : t_2 - 1]\big\}.$$
The first property we require is an exponentially decaying, or “contractive”, perturbation property of
the closed-loop dynamics of the system with the policy class. We now formalize this property.
Definition 2.6 ($\varepsilon$-Time-Varying Contractive Perturbation). The $\varepsilon$-time-varying contractive perturbation property holds for $R_C > 0$, $C > 0$, $\rho \in (0, 1)$, and $\varepsilon \ge 0$ if, for any $\theta_{\tau:t-1} \in S_\varepsilon(\tau : t-1)$,
$$\big\| g_{t|\tau}(x_\tau, \theta_{\tau:t-1}) - g_{t|\tau}(x'_\tau, \theta_{\tau:t-1}) \big\| \le C \rho^{t-\tau} \|x_\tau - x'_\tau\|$$
holds for arbitrary $x_\tau, x'_\tau \in B_n(0, R_C)$ and time steps $\tau \le t$.
Intuitively, $\varepsilon$-time-varying contractive perturbation requires two trajectories starting from different states (in a bounded ball) to converge towards each other if they adopt the same slowly time-varying policy parameter sequence. We call the special case of $\varepsilon = 0$ time-invariant contractive perturbation, meaning the policy parameter is fixed. Although it may be difficult to verify the time-varying property directly since it allows the policy parameters to change, we show in Lemma 2.8 that time-invariant contractive perturbation implies that the time-varying version also holds for some small $\varepsilon > 0$.
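The property can be probed numerically. The sketch below rolls out two trajectories from different initial states under the same slowly varying parameter sequence and checks the bound of Definition 2.6 for given constants $(C, \rho)$; passing the check for sampled instances does not prove the property, it only fails to falsify it.

```python
import numpy as np

def check_contractive_perturbation(g, pi, thetas, x_a, x_b, tau, C, rho, tol=1e-9):
    """Empirically test Definition 2.6 on one instance: roll out from two initial
    states in B_n(0, R_C) under the same slowly varying parameter sequence and
    check ||x_t - x'_t|| <= C * rho**(t - tau) * ||x_tau - x'_tau|| at every step."""
    za, zb = np.asarray(x_a, dtype=float), np.asarray(x_b, dtype=float)
    gap0 = np.linalg.norm(za - zb)
    for i in range(len(thetas) + 1):
        if np.linalg.norm(za - zb) > C * rho**i * gap0 + tol:
            return False
        if i < len(thetas):
            s = tau + i
            za = g[s](za, pi[s](za, thetas[i]))
            zb = g[s](zb, pi[s](zb, thetas[i]))
    return True
```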
The time-invariant contractive perturbation property is closely related to discrete-time incremental stability [e.g., 44] and contraction theory [e.g., 45], which have been studied in control theory. While some specific policies, including DAC and MPC, satisfy $\varepsilon$-time-varying contractive perturbation globally in linear systems, in other cases it is hard to verify. Our property is local and thus is easier to establish for broader applications in nonlinear systems (e.g., Example 2.2).
Besides contractive perturbation, another important property we need is the stability of the policy class, which requires that $\pi_{0:T-1}$ can stabilize the system starting from the zero state as long as the policy parameter varies slowly. This property is stated formally below:
Definition 2.7 ($\varepsilon$-Time-Varying Stability). The $\varepsilon$-time-varying stability property holds for $R_S > 0$ and $\varepsilon \ge 0$ if, for any $\theta_{\tau:t-1} \in S_\varepsilon(\tau : t-1)$, $\big\| g_{t|\tau}(0, \theta_{\tau:t-1}) \big\| \le R_S$ holds for any time steps $t \ge \tau$.
Intuitively, $\varepsilon$-time-varying stability guarantees that the policy class $\pi_{0:T-1}$ can achieve stability if the policy parameters $\theta_{0:T-1}$ vary slowly. (This property is standard in online control and is satisfied by DAC [1, 3, 6–8] as well as Examples 2.1 and 2.2.) Similarly to contractive perturbation, one only needs to verify time-invariant stability (i.e., $\varepsilon = 0$ and the policy parameter is fixed) to claim that time-varying stability holds for some strictly positive $\varepsilon$ (see Lemma 2.8). The reason we still use the time-varying contractive perturbation and stability in our assumptions is that they hold for $\varepsilon = +\infty$ in some cases, including DAC and MPC with confidence coefficients. Applying Lemma 2.8 for those systems would lead to a small, overly pessimistic $\varepsilon$.
2.3 Key Assumptions
We make two assumptions about the online policy selection problem to achieve regret guarantees.
Assumption 2.1. The dynamics $g_{0:T-1}$, policies $\pi_{0:T-1}$, and costs $f_{0:T-1}$ are differentiable at every time step and satisfy that, for any convex compact sets $\mathcal{X} \subseteq \mathbb{R}^n$, $\mathcal{U} \subseteq \mathbb{R}^m$, one can find Lipschitzness/smoothness constants (which can depend on $\mathcal{X}$ and $\mathcal{U}$) such that:
1. The dynamics $g_t(x, u)$ is $(L_{g,x}, L_{g,u})$-Lipschitz and $(\ell_{g,x}, \ell_{g,u})$-smooth in $(x, u)$ on $\mathcal{X} \times \mathcal{U}$.
2. The policy function $\pi_t(x, \theta)$ is $(L_{\pi,x}, L_{\pi,\theta})$-Lipschitz and $(\ell_{\pi,x}, \ell_{\pi,\theta})$-smooth in $(x, \theta)$ on $\mathcal{X} \times \Theta$.
3. The stage cost function $f_t(x, u)$ is $(L_f, L_f)$-Lipschitz and $(\ell_{f,x}, \ell_{f,u})$-smooth in $(x, u)$ on $\mathcal{X} \times \mathcal{U}$.
Assumption 2.1 is general because we only require the Lipschitzness/smoothness of $g_t$ and $f_t$ to hold for bounded states/actions within $\mathcal{X}$ and $\mathcal{U}$, where the coefficients may depend on $\mathcal{X}$ and $\mathcal{U}$. Similar assumptions are common in the literature of online control/optimization [24, 28, 46].
Our second assumption is on the contractive perturbation and the stability of the closed-loop dynamics
induced by a slowly time-varying policy parameter sequence.
Assumption 2.2. Let $\mathcal{G}$ denote the set of all possible dynamics/policy sequences $\{g_t, \pi_t\}_{t \in \mathcal{T}}$ the environment/policy class may provide. For a fixed $\varepsilon \in \mathbb{R}_{\ge 0}$, the $\varepsilon$-time-varying contractive perturbation (Definition 2.6) holds with $(R_C, C, \rho)$ for any sequence in $\mathcal{G}$. The $\varepsilon$-time-varying stability (Definition 2.7) holds with $R_S < R_C$ for any sequence in $\mathcal{G}$. We assume that the initial state satisfies $\|x_0\| < (R_C - R_S)/C$. Further, we assume that if $\{g, \pi\}$ is the dynamics/policy at an intermediate time step of a sequence in $\mathcal{G}$, then the time-invariant sequence $\{g, \pi\}^{\times T}$ is also in $\mathcal{G}$.
Compared to other settings where contractive perturbation holds globally [1, 8, 53], our assumption brings a new challenge because we need to guarantee the starting state stays within $B(0, R_C)$ whenever we apply this property in the proof. Therefore, we assume $R_C > R_S + C\|x_0\|$ in Assumption 2.2. Similarly, to leverage the Lipschitzness/smoothness property, we require $\mathcal{X} \supseteq B(0, R_x)$ where $R_x \ge C(R_S + C\|x_0\|) + R_S$ and $\mathcal{U} = \{\pi(x, \theta) \mid x \in \mathcal{X}, \theta \in \Theta, \pi \in \mathcal{G}\}$. Since the coefficients in Assumption 2.1 depend on $\mathcal{X}$ and $\mathcal{U}$, we will set $\mathcal{X} = B(0, R_x)$ and $R_x = C(R_S + C\|x_0\|) + R_S$ by default when presenting these constants.