
However, in many applications the loss of the learner depends not only on the current decision but on the entire history of decisions until that point. For example, in online linear control [Agarwal et al., 2019b], in each round the learner chooses a “control policy” (i.e., decision), suffers a loss that is a function of the action taken by this policy and the current state of the system, and the system's state evolves according to linear dynamics. The current state depends on the entire history of actions and, therefore, the current loss depends not only on the current decision but also on the entire history of decisions. The OCO framework cannot capture such long-term dependence of the current loss on past decisions, and neither can existing generalizations that allow the loss to depend on a constant number of past decisions [Anava et al., 2015]. Although a series of approximation arguments can be used to apply finite-memory generalizations of OCO to the online linear control problem, there is no OCO framework that captures the complete long-term dependence of current losses on past decisions.
Furthermore, there are no non-trivial lower bounds for OCO in the memory setting,¹ which leaves open the question whether the dependence on memory is tight.
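To make the online linear control example concrete (with illustrative notation, not the formal setup of Section 4): if the state evolves as x_{t+1} = A x_t + B u_t for system matrices A and B, then unrolling the dynamics gives

x_t = A^t x_0 + Σ_{s=0}^{t-1} A^{t-1-s} B u_s,

so the loss in round t, which is a function of x_t and u_t, depends on every past control u_0, ..., u_{t-1} and not just on a bounded window of recent decisions.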
Contributions. In this paper we introduce a generalization of the OCO framework, “Online Convex
Optimization with Unbounded Memory” (Section 2), that allows the loss in the current round to
depend on the entire history of decisions until that point. We introduce the notion of p-effective memory capacity, H_p, that quantifies the maximum influence of past decisions on present losses. We prove an O(√(H_p T)) upper bound on the policy regret (Theorem 3.1) and a matching (worst-case)
lower bound (Theorem 3.2). As a special case, we prove the first non-trivial lower bound for OCO
with finite memory (Theorem 3.4), which could be of independent interest, and also improve existing
upper bounds (Theorem 3.3). Our lower bound technique extends existing techniques developed
for memoryless settings. We design novel adversarial loss functions that exploit the fact that an
algorithm cannot overwrite its history. We illustrate the power of our framework by bringing together
the regret analysis of two seemingly disparate problems under the same umbrella. First, we show
how our framework improves and simplifies existing regret bounds for the online linear control
problem [Agarwal et al., 2019b] in Theorem 4.1. Second, we show how our framework can be used to derive regret bounds for an online variant of performative prediction [Perdomo et al., 2020]
in Theorem 4.2. This demonstrates the broad applicability of our framework for deriving regret
bounds for a variety of online learning problems, particularly those that exhibit long-term dependence
of current losses on past decisions.
Related work. The most closely related work to ours is the OCO with finite memory frame-
work [Anava et al., 2015]. They consider a generalization of the OCO framework that allows the
current loss to depend on a constant number of past decisions. There have been a number of follow-up
works that extend the framework in a variety of other ways, such as non-stationarity [Zhao et al., 2022] and switching costs [Shi et al., 2020]. However, none of these existing works goes beyond a constant memory length, and none proves a non-trivial lower bound with a dependence on
the memory length. In a different line of work, Bhatia and Sridharan [2020] consider a much more
general online learning framework that goes beyond a constant memory length, but they only provide
non-constructive upper bounds on regret. In contrast, our OCO with unbounded memory framework allows the current loss to depend on an unbounded number of past decisions, provides constructive upper bounds on regret, and proves lower bounds for a broad class of problems that includes OCO with finite memory with a general memory length m.
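As a concrete point of reference, here is a minimal sketch of the OCO with finite memory setting of Anava et al. [2015] (the notation is illustrative and may differ from theirs in details): in round t the learner chooses a decision x_t, suffers a loss f_t(x_{t-m+1}, ..., x_t) that depends on its last m decisions, and its policy regret compares the incurred losses against the best fixed decision played in every round,

R_T = Σ_t f_t(x_{t-m+1}, ..., x_t) − min_x Σ_t f_t(x, ..., x).

Our framework recovers this as a special case and, unlike prior work, yields lower bounds whose dependence on the memory length m is explicit.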
A different framework for sequential decision-making is multi-armed bandits [Bubeck and Cesa-Bianchi, 2012, Slivkins, 2019]. Qin et al. [2023] study a variant of contextual stochastic bandits
where the current loss can depend on a sparse subset of all prior contexts. This setting differs from
ours due to the feedback model, stochasticity, and decision space. Reinforcement learning [Sutton
and Barto, 2018] is yet another popular framework for sequential decision-making that considers
very general state-action models of feedback and dynamics. In reinforcement learning one typically
measures regret with respect to the best state-action policy from some policy class, rather than the
best fixed decision as in online learning and OCO. In the special case of linear control, policies can
be reformulated as decisions while preserving convexity; we discuss this application in Section 4.
Handling the general reinforcement learning framework is an active area of research.
We defer discussion of related work for specific applications to Section 4.
¹ The trivial lower bound refers to the Ω(√T) lower bound for OCO in the memoryless setting.