
However, in many applications the loss of the learner depends not only on the current decision but on the entire history of decisions until that point. For example, in online linear control [Agarwal et al., 2019b], in each round the learner chooses a “control policy” (i.e., decision), suffers a loss that is a function of the action taken by this policy and the current state of the system, and the system's state evolves according to linear dynamics. The current state depends on the entire history of actions and, therefore, the current loss depends not only on the current decision but also on the entire history of decisions. The OCO framework cannot capture such long-term dependence of the current loss on past decisions, and neither can existing generalizations that allow the loss to depend on a constant number of past decisions [Anava et al., 2015]. Although a series of approximation arguments can be used to apply finite-memory generalizations of OCO to the online linear control problem, there is no OCO framework that captures the complete long-term dependence of current losses on past decisions.
Furthermore, there are no non-trivial lower bounds for OCO in the memory setting,¹ which leaves open the question whether the dependence on memory is tight.
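To make the online linear control example concrete (with illustrative notation, not the formal setup of Section 4): if the state evolves as x_{t+1} = A x_t + B u_t for system matrices A and B, then unrolling the dynamics gives

x_t = A^t x_0 + Σ_{s=0}^{t-1} A^{t-1-s} B u_s,

so the loss in round t, which is a function of x_t and u_t, depends on every past control u_0, ..., u_{t-1} and not just on a bounded window of recent decisions.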
Contributions. In this paper we introduce a generalization of the OCO framework, “Online Convex
Optimization with Unbounded Memory” (Section 2), that allows the loss in the current round to
depend on the entire history of decisions until that point. We introduce the notion of p-effective memory capacity, H_p, that quantifies the maximum influence of past decisions on present losses. We prove an O(√(H_p T)) upper bound on the policy regret (Theorem 3.1) and a matching (worst-case)
lower bound (Theorem 3.2). As a special case, we prove the first non-trivial lower bound for OCO
with finite memory (Theorem 3.4), which could be of independent interest, and also improve existing
upper bounds (Theorem 3.3). Our lower bound technique extends existing techniques developed
for memoryless settings. We design novel adversarial loss functions that exploit the fact that an
algorithm cannot overwrite its history. We illustrate the power of our framework by bringing together
the regret analysis of two seemingly disparate problems under the same umbrella. First, we show
how our framework improves and simplifies existing regret bounds for the online linear control
problem [Agarwal et al., 2019b] in Theorem 4.1. Second, we show how our framework can be used to derive regret bounds for an online variant of performative prediction [Perdomo et al., 2020]
in Theorem 4.2. This demonstrates the broad applicability of our framework for deriving regret
bounds for a variety of online learning problems, particularly those that exhibit long-term dependence
of current losses on past decisions.
Related work. The most closely related work to ours is the OCO with finite memory frame-
work [Anava et al., 2015]. They consider a generalization of the OCO framework that allows the
current loss to depend on a constant number of past decisions. There have been a number of follow-up
works that extend the framework in a variety of other ways, such as non-stationarity [Zhao et al., 2022] and switching costs [Shi et al., 2020]. However, none of these existing works goes beyond a constant memory length, and none proves a non-trivial lower bound with a dependence on
the memory length. In a different line of work, Bhatia and Sridharan [2020] consider a much more
general online learning framework that goes beyond a constant memory length, but they only provide
non-constructive upper bounds on regret. In contrast, our OCO with unbounded memory framework allows the current loss to depend on an unbounded number of past decisions, provides constructive upper bounds on regret, and proves lower bounds for a broad class of problems that includes OCO with finite memory with a general memory length m.
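As a concrete point of reference, here is a minimal sketch of the OCO with finite memory setting of Anava et al. [2015] (the notation is illustrative and may differ from theirs in details): in round t the learner chooses a decision x_t, suffers a loss f_t(x_{t-m+1}, ..., x_t) that depends on its last m decisions, and its policy regret compares the incurred losses against the best fixed decision played in every round,

R_T = Σ_t f_t(x_{t-m+1}, ..., x_t) − min_x Σ_t f_t(x, ..., x).

Our framework recovers this as a special case and, unlike prior work, yields lower bounds whose dependence on the memory length m is explicit.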
A different framework for sequential decision-making is multi-armed bandits [Bubeck and Cesa-Bianchi, 2012, Slivkins, 2019]. Qin et al. [2023] study a variant of contextual stochastic bandits
where the current loss can depend on a sparse subset of all prior contexts. This setting differs from
ours due to the feedback model, stochasticity, and decision space. Reinforcement learning [Sutton
and Barto, 2018] is yet another popular framework for sequential decision-making that considers
very general state-action models of feedback and dynamics. In reinforcement learning one typically
measures regret with respect to the best state-action policy from some policy class, rather than the
best fixed decision as in online learning and OCO. In the special case of linear control, policies can
be reformulated as decisions while preserving convexity; we discuss this application in Section 4.
Handling the general reinforcement learning framework is an active area of research.
We defer discussion of related work for specific applications to Section 4.
¹ The trivial lower bound refers to the Ω(√T) lower bound for OCO in the memoryless setting.