Online Convex Optimization with Unbounded Memory
Raunak Kumar
Department of Computer Science
Cornell University
Ithaca, NY 14853
raunak@cs.cornell.edu
Sarah Dean
Department of Computer Science
Cornell University
Ithaca, NY 14853
sdean@cornell.edu
Robert Kleinberg
Department of Computer Science
Cornell University
Ithaca, NY 14853
rdk@cs.cornell.edu
Abstract
Online convex optimization (OCO) is a widely used framework in online learning.
In each round, the learner chooses a decision in a convex set and an adversary
chooses a convex loss function, and then the learner suffers the loss associated with
their current decision. However, in many applications the learner’s loss depends not
only on the current decision but on the entire history of decisions until that point.
The OCO framework and its existing generalizations do not capture this, and they
can only be applied to many settings of interest after a long series of approximation
arguments. They also leave open the question of whether the dependence on
memory is tight because there are no non-trivial lower bounds. In this work we
introduce a generalization of the OCO framework, “Online Convex Optimization
with Unbounded Memory”, that captures long-term dependence on past decisions.
We introduce the notion of $p$-effective memory capacity, $H_p$, that quantifies the
maximum influence of past decisions on present losses. We prove an $O(\sqrt{H_p T})$
upper bound on the policy regret and a matching (worst-case) lower bound. As
a special case, we prove the first non-trivial lower bound for OCO with finite
memory [Anava et al.,2015], which could be of independent interest, and also
improve existing upper bounds. We demonstrate the broad applicability of our
framework by using it to derive regret bounds, and to improve and simplify existing
regret bound derivations, for a variety of online learning problems including online
linear control and an online variant of performative prediction.
1 Introduction
Numerous applications are characterized by multiple rounds of sequential interactions with an
environment, e.g., prediction from expert advice [Littlestone and Warmuth, 1989, 1994], portfolio
selection [Cover, 1991], routing [Awerbuch and Kleinberg, 2008], etc. One of the most popular
frameworks for modelling such sequential decision-making problems is online convex optimization
(OCO) [Zinkevich,2003]. The OCO framework is as follows. In each round, the learner chooses a
decision in a convex set and an adversary chooses a convex loss function, and then the learner suffers
the loss associated with their current decision. The performance of an algorithm is measured by
regret: the difference between the algorithm’s total loss and that of the best fixed decision. We refer
the reader to Shalev-Shwartz [2012], Hazan [2019], Orabona [2019] for surveys on this topic.
37th Conference on Neural Information Processing Systems (NeurIPS 2023).
arXiv:2210.09903v5 [cs.LG] 29 Mar 2024
However, in many applications the loss of the learner depends not only on the current decisions but
on the entire history of decisions until that point. For example, in online linear control [Agarwal
et al.,2019b], in each round the learner chooses a “control policy” (i.e., decision), suffers a loss that
is a function of the action taken by this policy and the current state of the system, and the system’s
state evolves according to linear dynamics. The current state depends on the entire history of actions
and, therefore, the current loss depends not only on the current decision but the entire history of
decisions. The OCO framework cannot capture such long-term dependence of the current loss on the
past decisions and neither can existing generalizations that allow the loss to depend on a constant
number of past decisions [Anava et al.,2015]. Although a series of approximation arguments can be
used to apply finite memory generalizations of OCO to the online linear control problem, there is no
OCO framework that captures the complete long-term dependence of current losses on past decisions.
Furthermore, there are no non-trivial lower bounds for OCO in the memory setting,¹ which leaves
open the question of whether the dependence on memory is tight.
Contributions. In this paper we introduce a generalization of the OCO framework, “Online Convex
Optimization with Unbounded Memory” (Section 2), that allows the loss in the current round to
depend on the entire history of decisions until that point. We introduce the notion of $p$-effective
memory capacity, $H_p$, that quantifies the maximum influence of past decisions on present losses. We
prove an $O(\sqrt{H_p T})$ upper bound on the policy regret (Theorem 3.1) and a matching (worst-case)
lower bound (Theorem 3.2). As a special case, we prove the first non-trivial lower bound for OCO
with finite memory (Theorem 3.4), which could be of independent interest, and also improve existing
upper bounds (Theorem 3.3). Our lower bound technique extends existing techniques developed
for memoryless settings. We design novel adversarial loss functions that exploit the fact that an
algorithm cannot overwrite its history. We illustrate the power of our framework by bringing together
the regret analysis of two seemingly disparate problems under the same umbrella. First, we show
how our framework improves and simplifies existing regret bounds for the online linear control
problem [Agarwal et al.,2019b] in Theorem 4.1. Second, we show how our framework can be
used to derive regret bounds for an online variant of performative prediction [Perdomo et al.,2020]
in Theorem 4.2. This demonstrates the broad applicability of our framework for deriving regret
bounds for a variety of online learning problems, particularly those that exhibit long-term dependence
of current losses on past decisions.
Related work. The most closely related work to ours is the OCO with finite memory frame-
work [Anava et al.,2015]. They consider a generalization of the OCO framework that allows the
current loss to depend on a constant number of past decisions. There have been a number of follow-up
works that extend the framework in a variety of other ways, such as non-stationarity [Zhao et al.,
2022], incorporating switching costs [Shi et al.,2020], etc. However, none of these existing works go
beyond a constant memory length and do not prove a non-trivial lower bound with a dependence on
the memory length. In a different line of work, Bhatia and Sridharan [2020] consider a much more
general online learning framework that goes beyond a constant memory length, but they only provide
non-constructive upper bounds on regret. In contrast, our OCO with unbounded memory framework
allows the current loss to depend on an unbounded number of past decisions, provides constructive
upper bounds on regret, and lower bounds for a broad class of problems that includes OCO with finite
memory with a general memory length m.
A different framework for sequential decision-making is multi-armed bandits [Bubeck and Cesa-
Bianchi,2012,Slivkins,2019]. Qin et al. [2023] study a variant of contextual stochastic bandits
where the current loss can depend on a sparse subset of all prior contexts. This setting differs from
ours due to the feedback model, stochasticity, and decision space. Reinforcement learning [Sutton
and Barto,2018] is yet another popular framework for sequential decision-making that considers
very general state-action models of feedback and dynamics. In reinforcement learning one typically
measures regret with respect to the best state-action policy from some policy class, rather than the
best fixed decision as in online learning and OCO. In the special case of linear control, policies can
be reformulated as decisions while preserving convexity; we discuss this application in Section 4.
Considering the general framework is an active area of research.
We defer discussion of related work for specific applications to Section 4.
¹The trivial lower bound refers to the $\Omega(\sqrt{T})$ lower bound for OCO in the memoryless setting.
2 Framework
We begin with some motivation for the formalism used in our framework (Section 2.1). Many
real-world applications involve controlling a physical dynamical system, for example, variable-speed
wind turbines in wind energy electric power production [Boukhezzar and Siguerdidjane,2010].
The typical solution for these problems has been to model them as offline control problems with
linear time-invariant dynamics and use classical methods such as LQR and LQG [Boukhezzar and
Siguerdidjane,2010]. Instead of optimizing over the space of control inputs, the typical feedback
control approach optimizes over the space of controllers, i.e., policies that choose a control input
as a function of the system state. The standard controllers considered in the literature are linear
controllers. Even when the losses are convex in the state and input, they are nonconvex in the linear
controller. In the special case of quadratic losses in terms of the state and input, there is a closed-form
solution for the optimal solution using the algebraic Riccati equations [Lancaster and Rodman,1995].
But this does not hold for general convex losses resulting in convex reparameterizations such as
Youla [Youla et al., 1976, Kučera, 1975] and SLS [Wang et al., 2019, Anderson et al., 2019]. The
resulting parameterization represents an infinite dimensional system response and is characterized
by a sequence of matrices. Recent work has studied an online approach for some of these control
theory problems, where a sequence of controllers is chosen adaptively rather than choosing one
offline [Abbasi-Yadkori and Szepesvári,2011,Dean et al.,2018,Simchowitz and Foster,2020,
Agarwal et al.,2019b].
The takeaway from the above is that there are online learning problems in which (i) the current
loss depends on the entire history of decisions; and (ii) the decision space can be more complicated
than just a subset of $\mathbb{R}^d$, e.g., it can be an unbounded sequence of matrices. This motivates us to
model the decision space as a Hilbert space and the history space as a Banach space in the formal
problem setup below, and this subsumes the special cases of OCO and OCO with finite memory. This
formalism not only lets us consider a wide range of spaces, such as $\mathbb{R}^d$, unbounded sequences of
matrices, etc., but also lets us define appropriate norms on these spaces. This latter feature is crucial
for deriving strong regret bounds for some applications such as online linear control. For this problem
we derive improved regret bounds (Theorem 4.1) by defining weighted norms on the decision and
history spaces, where the weights are chosen to leverage the problem structure.
Notation. We use $\|\cdot\|_{\mathcal{U}}$ to denote the norm associated with a space $\mathcal{U}$. The operator norm of a linear
operator $L$ from space $\mathcal{U} \to \mathcal{V}$ is defined as $\|L\|_{\mathcal{U} \to \mathcal{V}} = \max_{u : \|u\|_{\mathcal{U}} \le 1} \|Lu\|_{\mathcal{V}}$. For convenience,
sometimes we simply use $\|\cdot\|$ when the meaning is clear from the context. For a finite-dimensional
matrix we use $\|\cdot\|_F$ and $\|\cdot\|_2$ to denote its Frobenius norm and operator norm respectively.
2.1 Setup
Let the decision space $\mathcal{X}$ be a closed and convex subset of a Hilbert space $\mathcal{W}$ with norm $\|\cdot\|_{\mathcal{X}}$ and
the history space $\mathcal{H}$ be a Banach space with norm $\|\cdot\|_{\mathcal{H}}$. Let $A : \mathcal{H} \to \mathcal{H}$ and $B : \mathcal{W} \to \mathcal{H}$ be
linear operators. The game between the learner and an oblivious adversary proceeds as follows. Let $T$
denote the time horizon and $f_t : \mathcal{H} \to \mathbb{R}$ be loss functions chosen by the adversary. The initial
history is $h_0 = 0$. In each round $t \in [T]$, the learner chooses $x_t \in \mathcal{X}$, the history is updated to
$h_t = A h_{t-1} + B x_t$, and the learner suffers loss $f_t(h_t)$. An instance of an online convex optimization
with unbounded memory problem is specified by the tuple $(\mathcal{X}, \mathcal{H}, A, B)$.
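To make the protocol concrete, the following is a minimal sketch of the interaction loop for the finite-dimensional case $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{H} = \mathbb{R}^n$, in which $A$ and $B$ are matrices; the `run_protocol` and `learner` interfaces are illustrative names we introduce here, not part of the framework.

```python
import numpy as np

def run_protocol(learner, losses, A, B, T):
    """Simulate T rounds of OCO with unbounded memory.

    A: (n, n) history-transition matrix, B: (n, d) lifting matrix.
    losses[t] maps a history vector h in R^n to a scalar loss.
    This is an illustrative sketch; in the paper the spaces may be
    infinite-dimensional, in which case A and B are linear operators.
    """
    n = A.shape[0]
    h = np.zeros(n)                     # h_0 = 0
    total_loss = 0.0
    for t in range(T):
        x = learner.choose()            # learner picks x_t in X
        h = A @ h + B @ x               # history update h_t = A h_{t-1} + B x_t
        total_loss += losses[t](h)      # learner suffers f_t(h_t)
        learner.observe(losses[t])      # full-information feedback (Assumption A1)
    return total_loss
```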
We use the notion of policy regret [Dekel et al.,2012] as the performance measure in our framework.
The policy regret of a learner is the difference between its total loss and the total loss of a strategy
that plays the best fixed decision in every round. The history after round $t$ for a strategy that chooses
$x$ in every round is $h_t = \sum_{k=0}^{t-1} A^k B x$, which motivates the following definition.
Definition 2.1. Given $f_t : \mathcal{H} \to \mathbb{R}$, the function $\tilde{f}_t : \mathcal{X} \to \mathbb{R}$ is defined by $\tilde{f}_t(x) = f_t\left(\sum_{k=0}^{t-1} A^k B x\right)$.
Definition 2.2 (Policy Regret). The policy regret of an algorithm $\mathcal{A}$ is defined as
$$R_T(\mathcal{A}) = \sum_{t=1}^{T} f_t(h_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} \tilde{f}_t(x).$$
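For illustration, here is a hedged sketch of how the surrogate losses $\tilde{f}_t$ and the policy regret could be evaluated numerically in the matrix case; the helper names are ours, and a generic off-the-shelf optimizer stands in for the exact minimization over $\mathcal{X}$ (which is a convex problem whenever the $\tilde{f}_t$ are convex).

```python
import numpy as np
from scipy.optimize import minimize

def surrogate_loss(f_t, A, B, t, x):
    """tilde{f}_t(x) = f_t( sum_{k=0}^{t-1} A^k B x ): loss of playing x in every round."""
    n = A.shape[0]
    h, power = np.zeros(n), np.eye(n)
    for _ in range(t):
        h += power @ (B @ x)
        power = power @ A
    return f_t(h)

def policy_regret(alg_losses, losses, A, B, x0):
    """Algorithm's total loss minus the total loss of the best fixed decision.
    Uses a local optimizer as a stand-in for the exact minimization over X."""
    T = len(losses)
    def total_fixed(x):
        return sum(surrogate_loss(losses[t], A, B, t + 1, x) for t in range(T))
    best = minimize(total_fixed, x0).fun
    return sum(alg_losses) - best
```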
In many motivating examples such as online linear control (Section 4.1), the history at the end of a
round is a sequence of linear transformations of past decisions. The following definition captures this
formally and we leverage this structure to prove stronger regret bounds (Theorem 3.1).
Definition 2.3 (Linear Sequence Dynamics). Consider an online convex optimization with unbounded
memory problem specified by $(\mathcal{X}, \mathcal{H}, A, B)$. Let $(\xi_k)_{k=0}^{\infty}$ be a sequence of nonnegative real numbers
satisfying $\xi_0 = 1$. We say that $(\mathcal{X}, \mathcal{H}, A, B)$ follows linear sequence dynamics with the $\xi$-weighted
$p$-norm for $p \ge 1$ if
1. $\mathcal{H}$ is the $\xi$-weighted $p$-direct sum of a finite or countably infinite number of copies of $\mathcal{W}$:
every element $y \in \mathcal{H}$ is a sequence $y = (y_i)_{i \in I}$, where $I = \mathbb{N}$ or $I = \{0, \dots, n\}$ for some
$n \in \mathbb{N}$, and $\|y\|_{\mathcal{H}} = \left( \sum_{i \in I} (\xi_i \|y_i\|)^p \right)^{1/p} < \infty$.
2. We have $A(y_0, y_1, \dots) = (0, A_0 y_0, A_1 y_1, \dots)$, where $A_i : \mathcal{W} \to \mathcal{W}$ are linear operators.
3. The operator $B$ satisfies $B(x) = (x, 0, \dots)$.
Note that since the norm on $\mathcal{H}$ depends on the weights $\xi$, the operator norm $\|A^k\|$ also depends on $\xi$.
If the weights are all equal to 1, then we simply say $p$-norm instead of $\xi$-weighted $p$-norm.
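To make the definition concrete, the following is a small sketch of linear sequence dynamics on a history truncated to finitely many blocks; the truncation and the function names are our illustrative choices, not part of the definition.

```python
import numpy as np

def apply_A(blocks, A_ops):
    """A(y_0, y_1, ...) = (0, A_0 y_0, A_1 y_1, ...), truncated to len(blocks) entries."""
    shifted = [np.zeros_like(blocks[0])]
    for i, y in enumerate(blocks[:-1]):
        shifted.append(A_ops[i] @ y)
    return shifted

def apply_B(x, n_blocks):
    """B(x) = (x, 0, ...)."""
    return [x] + [np.zeros_like(x) for _ in range(n_blocks - 1)]

def weighted_p_norm(blocks, xi, p):
    """||y||_H = ( sum_i (xi_i * ||y_i||)^p )^(1/p)."""
    return sum((xi[i] * np.linalg.norm(y)) ** p for i, y in enumerate(blocks)) ** (1.0 / p)
```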
2.2 Assumptions
We make the following assumptions about the feedback model and the loss functions.
A1 The learner knows the operators $A$ and $B$, and observes $f_t$ at the end of each round $t$.
A2 The operator norm of $B$ is at most 1, i.e., $\|B\| \le 1$.
A3 The functions $f_t$ are convex.
A4 The functions $f_t$ are $L$-Lipschitz continuous: for all $h, \tilde{h} \in \mathcal{H}$ and $t \in [T]$, we have $|f_t(h) - f_t(\tilde{h})| \le L \|h - \tilde{h}\|_{\mathcal{H}}$.
Regarding Assumption A1, our results easily extend to the case where instead of observing $f_t$,
the learner receives a gradient $\nabla \tilde{f}_t(x_t)$ from a gradient oracle, which can be implemented using
knowledge of $f_t$ and the dynamics $A$ and $B$. Handling the cases when the operators $A$ and $B$ are
unknown and/or the learner observes bandit feedback (i.e., only $f_t(h_t)$) are important problems and
we leave them as future work. Note that our assumption that $A$ and $B$ are known is no more restrictive
than in the existing literature on OCO with finite memory [Anava et al., 2015], where it is assumed
that the learner knows the constant memory length. In fact, our assumption is more general because
our framework not only captures constant memory length as a special case but allows for richer
dynamics, as we illustrate in Section 4. Assumption A2 is made for convenience, and it amounts to
a rescaling of the problem. Assumption A3 can be replaced by the weaker assumption that the $\tilde{f}_t$ are
convex (similar to the literature on OCO with finite memory [Anava et al., 2015]), and this is what we
use in the rest of the paper.
Assumptions A1 and A4 imply that the $\tilde{f}_t$ are $\tilde{L}$-Lipschitz continuous for the following $\tilde{L}$.
Theorem 2.1. Consider an online convex optimization with unbounded memory problem specified
by $(\mathcal{X}, \mathcal{H}, A, B)$. If $f_t$ is $L$-Lipschitz continuous, then $\tilde{f}_t$ is $\tilde{L}$-Lipschitz continuous for
$\tilde{L} \le L \sum_{k=0}^{\infty} \|A^k\|$. If $(\mathcal{X}, \mathcal{H}, A, B)$ follows linear sequence dynamics with the $\xi$-weighted $p$-norm
for $p \ge 1$, then $\tilde{L} \le L \left( \sum_{k=0}^{\infty} \|A^k\|^p \right)^{1/p}$.
The proof follows from the definitions of $\tilde{f}_t$ and $\|\cdot\|_{\mathcal{H}}$, and we defer it to Appendix A. The above
bound is tighter than similar results in the literature on OCO with finite memory and online linear
control. This theorem is a key ingredient, amongst others, in improving existing upper bounds on
regret for OCO with finite memory (Theorem 3.3) and for online linear control (Theorem 4.1). Before
presenting our final assumption we introduce the notion of $p$-effective memory capacity, which quantifies
the maximum influence of past decisions on present losses.
Definition 2.4 ($p$-Effective Memory Capacity). Consider an online convex optimization with unbounded
memory problem specified by $(\mathcal{X}, \mathcal{H}, A, B)$. For $p \ge 1$, the $p$-effective memory capacity is
defined as
$$H_p(\mathcal{X}, \mathcal{H}, A, B) = \left( \sum_{k=0}^{\infty} k^p \|A^k\|^p \right)^{1/p}. \tag{1}$$
When the meaning is clear from the context we simply use $H_p$ instead. The $p$-effective memory
capacity is an upper bound on the difference in histories for two sequences of decisions whose
difference grows at most linearly with time. To see this, consider two sequences of decisions, $(x_k)$ and
$(\tilde{x}_k)$, whose elements differ by no more than $k$ at time $k$: $\|x_k - \tilde{x}_k\| \le k$. Then the histories generated
by the two sequences have difference bounded as $\|h - \tilde{h}\| = \|\sum_k A^k B (x_k - \tilde{x}_k)\| \le
\sum_k k \|A^k\| \|B\| \le \sum_k k \|A^k\| = H_1$, where the last inequality follows from Assumption A2. A
similar bound holds with $H_p$ instead when $(\mathcal{X}, \mathcal{H}, A, B)$ follows linear sequence dynamics with the
$\xi$-weighted $p$-norm.
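As a quick illustration, the following sketch estimates $H_p$ numerically by truncating the series in (1); the truncation length and tolerance are choices we introduce for the example, and the series converges whenever $\|A^k\|$ decays fast enough (e.g., for a stable matrix $A$).

```python
import numpy as np

def effective_memory_capacity(A, p=1, max_k=10_000, tol=1e-12):
    """Truncated estimate of H_p = ( sum_{k>=0} k^p ||A^k||^p )^(1/p).
    Here ||.|| is the matrix operator (spectral) norm; the series is cut off
    once terms fall below `tol` or after `max_k` terms."""
    total, power = 0.0, np.eye(A.shape[0])
    for k in range(max_k):
        term = (k ** p) * (np.linalg.norm(power, 2) ** p)
        total += term
        if k > 0 and term < tol:
            break
        power = power @ A
    return total ** (1.0 / p)

# Example: a scalar discounted system with rho = 0.9 gives H_1 = rho / (1 - rho)^2 = 90.
print(effective_memory_capacity(np.array([[0.9]]), p=1))  # approximately 90
```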
A5 The 1-effective memory capacity is finite, i.e., $H_1 < \infty$.
Since $H_p$ is decreasing in $p$, $H_1 < \infty$ implies $H_p < \infty$ for all $p \ge 1$. For the case of linear sequence
dynamics with the $\xi$-weighted $p$-norm it suffices to make the weaker assumption that $H_p < \infty$.
However, for simplicity of exposition, we assume that $H_1 < \infty$.
2.3 Special Cases
OCO with Finite Memory. Consider the OCO with finite memory problem with constant memory
length $m$. It can be specified in our framework by $(\mathcal{X}, \mathcal{H}, A_{\text{finite},m}, B_{\text{finite},m})$, where $\mathcal{H}$ is the $2$-direct
sum of $m$ copies of $\mathcal{X}$, $A_{\text{finite},m}(x[m], \dots, x[1]) = (0, x[m], \dots, x[2])$, and $B_{\text{finite},m}(x) = (x, 0, \dots, 0)$.
Note that $(\mathcal{X}, \mathcal{H}, A_{\text{finite},m}, B_{\text{finite},m})$ follows linear sequence dynamics with the $2$-norm.
Our framework can even model an extension where the problem follows linear sequence dynamics
with the $p$-norm for $p \ge 1$ by simply defining $\mathcal{H}$ to be the $p$-direct sum of $m$ copies of $\mathcal{X}$.
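For concreteness, here is a sketch of the finite-memory operators when $\mathcal{X} \subseteq \mathbb{R}^d$, so that the history is simply the last $m$ decisions stacked into an $m \times d$ array (the function names are ours). As a quick calculation of our own, with the $2$-norm one has $\|A_{\text{finite},m}^k\| = 1$ for $k < m$ and $0$ otherwise, so Theorem 2.1 gives $\tilde{L} \le L\sqrt{m}$.

```python
import numpy as np

def A_finite(history):
    """Shift in a zero block and drop the oldest decision: (x_m, ..., x_1) -> (0, x_m, ..., x_2)."""
    shifted = np.zeros_like(history)
    shifted[1:] = history[:-1]
    return shifted

def B_finite(x, m):
    """Lift a single decision into the history space: x -> (x, 0, ..., 0)."""
    lifted = np.zeros((m, x.shape[0]))
    lifted[0] = x
    return lifted

# One step of the finite-memory dynamics: the history keeps the last m decisions.
m, d = 3, 2
h = np.zeros((m, d))
for x in [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([2.0, 2.0])]:
    h = A_finite(h) + B_finite(x, m)
print(h)  # rows: most recent decision first
```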
OCO with $\rho$-discounted Infinite Memory. Our framework can also model OCO with infinite
memory problems that are not modelled by existing OCO frameworks. Let $\rho \in (0, 1)$
be the discount factor and $p \ge 1$. An OCO with $\rho$-discounted infinite memory problem is
specified by $(\mathcal{X}, \mathcal{H}, A_{\text{infinite}}, B_{\text{infinite}})$, where $\mathcal{H}$ is the $p$-direct sum of countably many copies
of $\mathcal{X}$, $A_{\text{infinite}}((y_0, y_1, \dots)) = (0, \rho y_0, \rho y_1, \dots)$, and $B_{\text{infinite}}(x) = (x, 0, \dots)$. Note that
$(\mathcal{X}, \mathcal{H}, A_{\text{infinite}}, B_{\text{infinite}})$ follows linear sequence dynamics with the $p$-norm. Due to space constraints
we defer proofs of regret bounds for this problem to the appendix.
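For this special case the effective memory capacity admits a closed form. As a quick sanity check (our own calculation, following Definition 2.4), applying $A_{\text{infinite}}$ $k$ times scales every block by $\rho^k$ and shifts it, so $\|A_{\text{infinite}}^k\| = \rho^k$ and
$$H_1 = \sum_{k=0}^{\infty} k \rho^k = \frac{\rho}{(1-\rho)^2}, \qquad H_p = \Big( \sum_{k=0}^{\infty} k^p \rho^{kp} \Big)^{1/p} < \infty,$$
so Assumption A5 is satisfied for every discount factor $\rho \in (0,1)$.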
3 Regret Analysis
We present two algorithms for choosing the decisions $x_t$. Algorithm 1 uses follow-the-regularized-leader
(FTRL) [Shalev-Shwartz and Singer, 2006, Abernethy et al., 2008] on the loss functions $\tilde{f}_t$.
Due to space constraints, we discuss how to implement it efficiently in Appendix G and present
simple simulation experiments in Appendix I. Algorithm 2, which we only present in Appendix H,
combines FTRL with a mini-batching approach [Dekel et al., 2012, Altschuler and Talwar, 2018,
Chen et al., 2020] to additionally guarantee that the decisions switch at most $O(T\tilde{L}/(L H_1))$ times. We
defer the proofs of the following upper and lower bounds to Appendices C and D respectively.
Algorithm 1: FTRL
Input: Time horizon $T$, step size $\eta$, $\alpha$-strongly-convex regularizer $R : \mathcal{X} \to \mathbb{R}$.
1: Initialize history $h_0 = 0$.
2: for $t = 1, 2, \dots, T$ do
3:     Learner chooses $x_t \in \arg\min_{x \in \mathcal{X}} \sum_{s=1}^{t-1} \tilde{f}_s(x) + \frac{R(x)}{\eta}$.
4:     Set $h_t = A h_{t-1} + B x_t$.
5:     Learner suffers loss $f_t(h_t)$ and observes $f_t$.
6: end
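The following is a minimal Python sketch of Algorithm 1 for the finite-dimensional case, assuming a projection oracle for $\mathcal{X}$ and solving the inner minimization approximately by projected gradient descent; the solver, its hyperparameters, and the gradient-oracle interface are illustrative choices of ours, not part of the algorithm's specification.

```python
import numpy as np

def ftrl_unbounded_memory(loss_grads, A, B, project, eta, R_grad,
                          T, d, inner_steps=200, inner_lr=0.01):
    """Follow-the-regularized-leader on the surrogate losses tilde{f}_s.

    loss_grads[s](h) returns the gradient of f_{s+1} at history h, so the
    gradient of tilde{f}_{s+1}(x) = f_{s+1}(M_{s+1} x) is M_{s+1}^T grad f_{s+1}(M_{s+1} x),
    where M_t = sum_{k=0}^{t-1} A^k B.  `project` maps a point onto X and
    `R_grad` is the gradient of the regularizer R.
    """
    n = A.shape[0]
    M = []                                   # M[t-1] = sum_{k=0}^{t-1} A^k B
    power = np.eye(n)                        # A^{t-1}, built incrementally
    x, h = project(np.zeros(d)), np.zeros(n)
    decisions = []
    for t in range(1, T + 1):
        # Solve x_t ~ argmin_x sum_{s<t} tilde{f}_s(x) + R(x)/eta by projected GD.
        for _ in range(inner_steps):
            g = R_grad(x) / eta
            for s in range(t - 1):
                g += M[s].T @ loss_grads[s](M[s] @ x)
            x = project(x - inner_lr * g)
        decisions.append(x.copy())
        h = A @ h + B @ x                    # h_t = A h_{t-1} + B x_t
        M.append(power @ B if t == 1 else M[-1] + power @ B)
        power = power @ A
    return decisions
```

In practice the inner minimization can be warm-started from $x_{t-1}$; Appendix G discusses more efficient implementations.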
Theorem 3.1. Consider an online convex optimization with unbounded memory problem specified by
$(\mathcal{X}, \mathcal{H}, A, B)$. Let the regularizer $R : \mathcal{X} \to \mathbb{R}$ be $\alpha$-strongly-convex and satisfy $|R(x) - R(\tilde{x})| \le D$
for all $x, \tilde{x} \in \mathcal{X}$. Algorithm 1 with step-size $\eta$ satisfies
$$R_T(\mathrm{FTRL}) \le \frac{D}{\eta} + \frac{\eta T \tilde{L}^2}{\alpha} + \frac{\eta T L \tilde{L} H_1}{\alpha}.$$
If