
optimal regret of $O(\sqrt{T})$ (Theorem 3.3). Second, many previous approaches require oracle access to the dynamics/costs and expensive resimulation from imaginary previous states. In contrast, GAPS only requires partial derivatives of the dynamics and costs along the visited trajectory, and computes $O(\log T)$ matrix multiplications at each step. Third, the application of previous results is limited to specific policy classes and systems because they require $c_t$ to be convex in $(\theta_0, \dots, \theta_t)$. We address this limitation by showing the first local regret bound for online policy selection when this convexity does not hold. Specifically, GAPS achieves a local regret of $O(\sqrt{(1 + V)T})$, where $V$ is a measure of how much $(g_t, f_t, \pi_t)$ changes over the entire horizon.
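As an informal illustration (not the exact algorithm statement given later), the following Python sketch shows a GAPS-style update under the simplified abstraction $x_{t+1} = g_t(x_t, \theta_t)$ with stage cost $c_t(x_t, \theta_t)$; the identifiers (`gaps_step`, `buffer`, `B_max`, `project`) are illustrative only, and details such as time-varying step sizes are omitted.

```python
import numpy as np
from collections import deque

def gaps_step(theta, buffer, A_t, B_t, dc_dx, dc_dtheta, eta, B_max, project):
    """One GAPS-style update (illustrative sketch, not the paper's exact algorithm).

    buffer holds the sensitivities d x_t / d theta_{t-b}, b = 1, ..., len(buffer),
    computed by the chain rule along the *visited* trajectory (no resimulation).
    A_t  = d g_t / d x      evaluated at the visited (x_t, theta_t)
    B_t  = d g_t / d theta  evaluated at the visited (x_t, theta_t)
    dc_dx, dc_dtheta: partial derivatives of the stage cost c_t at (x_t, theta_t).
    """
    # Surrogate gradient: direct term plus the contribution of the last B_max parameters.
    G = dc_dtheta + sum(dc_dx @ S for S in buffer)

    # Projected gradient step on the policy parameter.
    theta_next = project(theta - eta * G)

    # Propagate each stored sensitivity one step forward along the visited trajectory:
    # d x_{t+1} / d theta_{t-b} = A_t @ (d x_t / d theta_{t-b}).
    # One matrix product per buffered term, i.e. O(B_max) = O(log T) work per step.
    new_buffer = deque((A_t @ S for S in buffer), maxlen=B_max)
    new_buffer.appendleft(B_t)  # newest sensitivity: d x_{t+1} / d theta_t = B_t
    return theta_next, new_buffer
```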
To derive these performance guarantees, we develop a novel proof framework based on a general exponentially decaying, or “contractive”, perturbation property (Definition 2.6) on the policy-induced closed-loop dynamics. This generalizes a key property of disturbance-action controllers [e.g. 1, 8] and includes other important policy classes such as model predictive control (MPC) [e.g. 10] and linear feedback controllers [e.g. 11]. Under this property, we prove an approximation error bound (Theorem 3.2), which shows that GAPS can mimic the update of an ideal online gradient descent (OGD) algorithm [12] that has oracle knowledge of how the current policy parameter $\theta_t$ would have performed if used exclusively over the whole trajectory. This error bound bridges online policy selection and online optimization, which means regret guarantees on OGD for online optimization can be transferred to GAPS for online policy selection.
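Schematically (with $F_t(\theta)$ informally denoting the stage cost at time $t$ had the fixed parameter $\theta$ been used at every step, and $G_t$ denoting the gradient estimate GAPS forms from the visited trajectory), the two updates are
\[
\theta_{t+1}^{\mathrm{OGD}} = \Pi_\Theta\!\bigl(\theta_t - \eta \nabla F_t(\theta_t)\bigr),
\qquad
\theta_{t+1}^{\mathrm{GAPS}} = \Pi_\Theta\!\bigl(\theta_t - \eta\, G_t\bigr),
\]
and the approximation error bound controls, roughly speaking, how far $G_t$ can drift from $\nabla F_t(\theta_t)$.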
In numerical experiments, we demonstrate that GAPS can adapt faster than an existing follow-the-
leader-type baseline in MPC with imperfect disturbance predictions, and outperforms a strong optimal
control baseline in a nonlinear system with non-i.i.d. disturbances. We include the source code in the
supplementary material.
Related Work. Our work is related to online control and adaptive-learning-based control [2, 13–18], especially online control with adversarial disturbances and regret guarantees [1, 6–8, 19, 20]. For example, there is a rich literature on policy regret bounds for time-invariant dynamics [1, 6, 8, 14, 19, 21]. There is also a growing interest in algorithms for time-varying systems with small adaptive regret [7, 20], dynamic regret [22–24], and competitive ratio [25–28]. Many prior works study a specific policy class called the disturbance-action controller (DAC) [1, 3, 6–8]. When applied to linear dynamics $g_t$ with convex cost functions $f_t$, DAC renders the stage cost $c_t$ a convex function of past policy parameters $(\theta_0, \dots, \theta_t)$. Our work contributes to this literature by proposing a general contractive perturbation property that includes DAC as a special case, and by showing local regret bounds that do not require $c_t$ to be convex in $(\theta_0, \dots, \theta_t)$. A recent work also handles nonconvex $c_t$, but it studies an episodic setting and requires $c_t$ to be “nearly convex”, which holds under its policy class [29].
In addition to online control, this work is also related to online learning/optimization [12, 30, 31], especially online optimization with memory and/or switching costs, where the cost at each time step depends on past decisions. Specifically, our online adaptive policy selection problem is related to online optimization with memory [28, 32–38]. Our analysis for GAPS provides insight on how to handle indefinite memory when the impact of a past decision decays exponentially with time.
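For intuition (a back-of-the-envelope calculation rather than a formal statement): if the influence of a decision made $b$ steps ago on the current cost is bounded by $C\rho^{b}$ for some constants $C > 0$ and $\rho \in (0, 1)$, then discarding all but the most recent $B$ decisions changes the cost by at most
\[
\sum_{b > B} C\rho^{b} = \frac{C\rho^{B+1}}{1-\rho},
\]
which is $O(1/T)$ once $B = \Theta(\log T)$; this is the sense in which a buffer of $O(\log T)$ terms suffices.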
Our contractive perturbation property and the analytical framework based on this property are closely related to prior works on discrete-time incremental stability and contraction theory in nonlinear systems [39–45], as well as works that leverage such properties to derive guarantees for (online) controllers [46–48]. In complicated systems, it may be hard to design policies that provably satisfy these properties. This motivates some recent works to study neural-based approaches that learn a controller together with a certificate for its contraction properties [49, 50]. Our work contributes to this field by showing that, when the system satisfies the contractive perturbation property, one can leverage this property to bridge online policy selection with online optimization.
Notation. We use $[t_1:t_2]$ to denote the sequence $(t_1, \dots, t_2)$, $a_{t_1:t_2}$ to denote $(a_{t_1}, a_{t_1+1}, \dots, a_{t_2})$ for $t_1 \le t_2$, and $a^{\times\tau}$ for $(a, \dots, a)$ with $a$ repeated $\tau \ge 0$ times. We define $q(x, Q) = x^\top Q x$. Symbols $\mathbf{1}$ and $\mathbf{0}$ denote the all-one and all-zero vectors/matrices respectively, with dimension implied by context. The Euclidean ball with center $0$ and radius $R$ in $\mathbb{R}^n$ is denoted by $B_n(0, R)$. We let $\|\cdot\|$ denote the (induced) Euclidean norm for vectors (matrices). The diameter of a set $\Theta$ is $\operatorname{diam}(\Theta) := \sup_{x, y \in \Theta} \|x - y\|$. The projection onto the set $\Theta$ is $\Pi_\Theta(x) = \arg\min_{y \in \Theta} \|y - x\|$.
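For concreteness, a minimal NumPy sketch of the quadratic form $q(x, Q)$ and of the projection $\Pi_\Theta$ in the special case $\Theta = B_n(0, R)$, where the projection has a closed form (function names are illustrative only):

```python
import numpy as np

def q(x, Q):
    """Quadratic form q(x, Q) = x^T Q x."""
    return x @ Q @ x

def project_onto_ball(x, R):
    """Projection onto B_n(0, R): rescale x if it lies outside the ball."""
    norm = np.linalg.norm(x)
    return x if norm <= R else (R / norm) * x
```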