$\pi_{\text{DAC}}$ is given by
$$u_t^{\pi_{\text{DAC}}} = \sum_{k=1}^{h} M_t^{[k]} w_{t-k}. \qquad (3)$$
Here, $M_t = \big(M_t^{[1]}, \ldots, M_t^{[h]}\big)$ are the feedback gains or the disturbance gains and are the (time-varying) parameters of $\pi_{\text{DAC}}$. We note that $\pi_{\text{DAC}}$ can be dynamic, i.e., its parameters can vary with time. Therefore, the regret defined in Eq. (2) is a notion of dynamic regret. We note
that the policy is implementable with disturbance feedback.
Extension to the case without the disturbance feedback can be
made by using estimates of the disturbances instead. We defer
the treatment without any disturbance feedback to future work.
Our objective here is to optimize the parameters $M_t$ online so that the regret with respect to the best DAC policy in hindsight is sub-linear.
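To make Eq. (3) concrete, the following is a minimal sketch of how a DAC controller forms the control input from the $h$ most recent disturbances; the dimensions, gain values, and disturbance values below are hypothetical placeholders used only for illustration.

import numpy as np

def dac_control(M, w_hist):
    # Compute u_t = sum_{k=1}^{h} M^{[k]} w_{t-k}, as in Eq. (3).
    # M      : list of h gain matrices; M[k-1] plays the role of M_t^{[k]}
    # w_hist : list [w_{t-1}, w_{t-2}, ..., w_{t-h}] of past disturbances
    u = np.zeros(M[0].shape[0])
    for k in range(1, len(M) + 1):
        u += M[k - 1] @ w_hist[k - 1]
    return u

# Hypothetical example: h = 3, disturbance dimension 2, control dimension 1.
rng = np.random.default_rng(0)
M = [0.1 * rng.standard_normal((1, 2)) for _ in range(3)]
w_hist = [rng.standard_normal(2) for _ in range(3)]
print(dac_control(M, w_hist))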
The DAC policy is typically used in online control for
regulating systems with disturbances; see [4]. The important
feature of the DAC policy is that the optimization problem to
find the optimal fixed disturbance gain for a given sequence
of cost functions is a convex problem and is thus amenable to
online optimization and online performance analysis. Another appealing feature of DAC is that, for time-invariant systems, the optimal disturbance action controller for a given sequence of cost functions achieves performance very close to that of the optimal linear state feedback controller; see [4]. Thus, for time-invariant systems, by optimizing the DAC policy online, it is possible to achieve sub-linear regret with respect to the best linear state feedback controller, whose direct computation is a non-convex optimization problem.
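To sketch the reason for this convexity (writing the dynamics of Eq. (1), for illustration, as $x_{t+1} = A_t x_t + B_t u_t + B_{t,w} w_t$), note that with a fixed gain $M$ the DAC input is linear in $M$, so
$$x_{t+1} = A_t x_t + \sum_{k=1}^{h} B_t M^{[k]} w_{t-k} + B_{t,w} w_t,$$
and, by induction, every state reached under the DAC policy is an affine function of $(M^{[1]}, \ldots, M^{[h]})$ with coefficients that depend only on the system matrices and the disturbances. Each cost $c_t(x_t, u_t)$ is therefore a convex function composed with an affine map of $M$, and hence convex in $M$.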
For time-varying dynamical systems, as pointed out in [1,
Theorem 2.1], there exist problem instances where the DAC
class (with disturbance feedback) incurs a much better cost
than other types of classes such as linear state or output
feedback policies and vice versa. Therefore, the DAC class
is not a weaker class to compete against compared to these
standard classes. Moreover, as the impossibility result [1, Theorem 3.1] shows, it is an equally hard class to compete against in terms of regret. In this work, we focus our study
on the regret minimization problem with respect to the DAC
class (with disturbance feedback) and defer the treatment of
other control structures to future work.
Even with disturbance feedback, the challenge of estimating the unknown system parameters does not diminish. This is because of the presence of measurement noise and of the system variations themselves. In the time-invariant case, following an analysis similar to [5], it can be shown that, even with disturbance feedback, only a regret of $T^{2/3}$ can be achieved with the state-of-the-art methods, which is no better than the regret that can be achieved without disturbance feedback (see [5]). The same holds for the time-varying case: it can be shown that the regret [1] achieves for the system in Eq. (1) cannot be improved even with disturbance feedback. Therefore, the conclusions we draw later by comparing the bounds we derive with the regret upper bound of [1] remain valid. We state our other assumptions below.
Assumption 1 (System). (i) The system is stable, i.e., $\|C_{t+k+1} A_{t+k} \cdots A_{t+1} B_t\|_2 \le \kappa_a \kappa_b (1-\gamma)^k$, $\forall k \ge 0$, $\forall t$, where $\kappa_a > 0$, $\kappa_b > 0$, and $0 < \gamma < 1$ are constants. $B_t$ is bounded, i.e., $\|B_t\| \le \kappa_b$. (ii) The disturbance $w_t$ and the noise $e_t$ are bounded. Specifically, $\|w_t\| \le \kappa_w$, where $\kappa_w > 0$ is a constant, and $\|e_t\| \le \kappa_e$, where $\kappa_e > 0$ is a constant.
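For a time-invariant triple $(A, B, C)$, the product in Assumption 1.(i) reduces to $C A^k B$, and the decay condition can be checked numerically. The snippet below is our own illustration with hypothetical matrices and candidate constants; it is not part of the assumption itself.

import numpy as np

# Hypothetical stable system and candidate constants for Assumption 1.(i).
A = np.array([[0.8, 0.1],
              [0.0, 0.7]])
B = np.array([[1.0],
              [0.5]])
C = np.eye(2)
kappa_b = np.linalg.norm(B, 2)   # also serves as the bound on ||B_t||
kappa_a, gamma = 2.0, 0.2        # candidate constants; the spectral radius of A is 0.8 = 1 - gamma

holds = all(
    np.linalg.norm(C @ np.linalg.matrix_power(A, k) @ B, 2)
    <= kappa_a * kappa_b * (1 - gamma) ** k + 1e-12
    for k in range(50)
)
print("decay condition holds for k < 50:", holds)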
Assumption 2 (Cost Functions). (i) The cost function $c_t$ is convex $\forall t$. (ii) $|c_t(x, u) - c_t(x', u')| \le L R \|z - z'\|$ for a given $z^\top := [x^\top, u^\top]$, $(z')^\top := [(x')^\top, (u')^\top]$, where $R := \max\{\|z\|, \|z'\|, 1\}$. (iii) For any $d > 0$, when $\|x\| \le d$ and $\|u\| \le d$, $\|\nabla_x c_t(x, u)\| \le G d$ and $\|\nabla_u c_t(x, u)\| \le G d$.
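As a simple illustration (our own example, not taken from the cited works), the quadratic cost $c_t(x, u) = \|x\|^2 + \|u\|^2 = \|z\|^2$ satisfies Assumption 2: it is convex, and
$$\big|\|z\|^2 - \|z'\|^2\big| = (\|z\| + \|z'\|)\,\big|\|z\| - \|z'\|\big| \le 2 \max\{\|z\|, \|z'\|, 1\}\,\|z - z'\|,$$
so (ii) holds with $L = 2$; moreover, $\|\nabla_x c_t(x, u)\| = 2\|x\| \le 2d$ whenever $\|x\| \le d$ (and similarly for $u$), so (iii) holds with $G = 2$.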
Remark 1 (System Assumptions). Assumption 1.(i) is the equivalent of the stability assumption used for time-invariant systems. Such an assumption is typically used in online control when the system is unknown; see, e.g., [1], [5]. Assumption 1.(ii), that the disturbance and noise are bounded, is necessary, especially in the non-stochastic setting [4], [5]. The assumptions on the cost functions are also standard [4].
Definition 1. (i) $\mathcal{M} := \{M = (M^{[1]}, \ldots, M^{[h]}) : \|M^{[k]}\| \le \kappa_M\}$ (Disturbance Action Policy Class). (ii) $\mathcal{G} = \{G^{[1:h]} : \|G^{[k]}\|_2 \le \kappa_a \kappa_b (1-\gamma)^{k-1}\}$. (iii) Setting (S-1): matched disturbance system with convex cost functions: $B_t = B_{t,w}$, $C = I$, $e_t = 0$. Setting (S-2): general system with linear cost functions: $B_{t,w} = I$, and there exists a coefficient $\alpha_t \in \mathbb{R}^{p+m}$ such that $c_t(y, u) = \alpha_t^\top z$ and $\|\alpha_t\| \le G$.
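The online optimization described next keeps the gains inside the class $\mathcal{M}$ of Definition 1, which requires a projection step onto that set. Below is a minimal sketch of such a projection, assuming, as an illustration (the paper does not fix this choice here), that $\|M^{[k]}\| \le \kappa_M$ is a spectral norm bound, in which case the Euclidean projection clips singular values.

import numpy as np

def project_spectral_ball(M, kappa):
    # Frobenius-norm projection of a matrix onto {X : ||X||_2 <= kappa},
    # obtained by clipping its singular values at kappa.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.minimum(s, kappa)) @ Vt

def project_dac(M_list, kappa_M):
    # Project each component gain M^[k] onto the ball of radius kappa_M.
    return [project_spectral_ball(Mk, kappa_M) for Mk in M_list]

# Hypothetical usage: project three random 2x3 gains onto the ball of radius 1.
rng = np.random.default_rng(0)
M_proj = project_dac([rng.standard_normal((2, 3)) for _ in range(3)], 1.0)
print([np.linalg.norm(Mk, 2) for Mk in M_proj])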
III. ONLINE LEARNING CONTROL ALGORITHM
Typically, online learning control algorithms for time-
invariant dynamical systems explore first for a period of time,
and then exploit, i.e., adapt or optimize the control policy.
While, in the time-invariant case, this strategy results in sub-
linear regret, in the time-varying case, it can be less effective.
For instance, consider the case where the system remains
unchanged for the duration of the exploration phase and
then changes around the instant when the exploration ends.
Clearly, in this case, the estimate obtained at the end of the exploration phase will be far from the system parameters realized after the exploration phase, and the strategy therefore will not result in sub-linear regret.
We propose an online algorithm that continuously learns an estimate of the time-varying system parameters and simultaneously optimizes the control policy online. Our estimation algorithm combines (i) a change point detection algorithm that detects changes in the underlying system and (ii) a regular estimation algorithm. The online algorithm runs an online optimization in parallel with the estimation to optimize the parameters of the control policy, which in our case is a DAC policy; a schematic sketch of this interplay is given below.
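At a high level, the interplay between continual estimation with change point detection and the online update of the DAC gains can be pictured with the following toy sketch. It is our own scalar illustration under many simplifying assumptions (known input gain, directly observed state, quadratic instantaneous proxy cost, residual-threshold change detection), not the algorithm analyzed in this paper.

import numpy as np

rng = np.random.default_rng(1)
T, h, eta, kappa_M = 400, 3, 0.05, 5.0

# Toy scalar system x_{t+1} = a_t x_t + b u_t + w_t whose parameter a_t changes at t = 200.
b = 1.0
a_true = np.where(np.arange(T) < 200, 0.9, 0.3)
w = 0.5 * rng.standard_normal(T)

M = np.zeros(h)        # scalar DAC gains M^[1], ..., M^[h]
x, data = 0.0, []      # data holds (x_t, u_t, x_{t+1}) triples for estimation

for t in range(T):
    w_hist = w[max(t - h, 0):t][::-1]              # w_{t-1}, ..., w_{t-h} (shorter at the start)
    u = float(np.dot(M[:len(w_hist)], w_hist))     # DAC control, Eq. (3)
    x_next = a_true[t] * x + b * u + w[t]
    data.append((x, u, x_next))

    # Regular estimation: least squares for a_t over the current data window.
    X = np.array([d[0] for d in data])
    U = np.array([d[1] for d in data])
    Y = np.array([d[2] for d in data])
    a_hat = np.dot(X, Y - b * U) / (np.dot(X, X) + 1e-8)

    # Change point detection: restart the window if the latest residual is too large.
    if len(data) > 10 and abs(x_next - (a_hat * x + b * u)) > 1.5:
        data = [data[-1]]

    # Online gradient step on the proxy cost (x_{t+1})^2 with respect to M, then project.
    grad = 2.0 * x_next * b * np.concatenate([w_hist, np.zeros(h - len(w_hist))])
    M = np.clip(M - eta * grad, -kappa_M, kappa_M)
    x = x_next

print("final estimate a_hat:", round(float(a_hat), 3), "final DAC gains:", np.round(M, 3))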
Online Optimization: Since the cost functions and the disturbances are unknown a priori, the optimal parameters $M$ of the DAC policy cannot be computed in advance. Rather, the parameters have to be adapted online continuously with
the information gathered along the way to achieve the best
performance. Given the convexity of the cost functions and
the linearity of the system dynamics, we can apply the Online