Parameter-free Regret in High Probability with
Heavy Tails
Jiujia Zhang
Electrical and Computer Engineering
Boston University
jiujiaz@bu.edu
Ashok Cutkosky
Electrical and Computer Engineering
Boston University
ashok@cutkosky.com
Abstract
We present new algorithms for online convex optimization over unbounded domains that obtain parameter-free regret in high probability given access only to potentially heavy-tailed subgradient estimates. Previous work in unbounded domains considers only in-expectation results for sub-exponential subgradients. Unlike in the bounded-domain case, we cannot rely on straightforward martingale concentration due to the exponentially large iterates produced by the algorithm. We develop new regularization techniques to overcome these problems. Overall, with probability at least $1 - \delta$, for all comparators $u$ our algorithm achieves regret $\tilde{O}(\|u\| T^{1/p} \log(1/\delta))$ for subgradients with bounded $p$th moments for some $p \in (1, 2]$.
1 Introduction
In this paper, we consider the problem of online learning with convex losses, also called online convex optimization, with heavy-tailed stochastic subgradients. In the classical online convex optimization setting, given a convex set $W$, a learning algorithm must repeatedly output a vector $w_t \in W$, and then observe a convex loss function $\ell_t : W \to \mathbb{R}$ and incur a loss of $\ell_t(w_t)$. After $T$ such rounds, the algorithm's quality is measured by the regret with respect to a fixed competitor $u \in W$:

$$R_T(u) = \sum_{t=1}^T \ell_t(w_t) - \sum_{t=1}^T \ell_t(u)$$

Online convex optimization is widely applicable, and has been used to design popular stochastic optimization algorithms [Duchi et al., 2010a, Kingma and Ba, 2014, Reddi et al., 2018], for control of linear dynamical systems [Agarwal et al., 2019], and even to build concentration inequalities [Vovk, 2007, Waudby-Smith and Ramdas, 2020, Orabona and Jun, 2021].
A popular approach to this problem reduces it to online linear optimization (OLO): if $g_t$ is a subgradient of $\ell_t$ at $w_t$, then $R_T(u) \le \sum_{t=1}^T \langle g_t, w_t - u\rangle$, so that it suffices to design an algorithm that considers only linear losses $w \mapsto \langle g_t, w\rangle$. Then, by assuming that the domain $W$ has some finite diameter $D$, standard arguments show that online gradient descent [Zinkevich, 2003] and its variants achieve $R_T(u) \le O(D\sqrt{T})$ for all $u \in W$. See the excellent books Cesa-Bianchi and Lugosi [2006], Shalev-Shwartz [2011], Hazan [2019], Orabona [2019] for more detail.
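To make the finite-diameter baseline concrete, here is a minimal sketch of projected online gradient descent on 1-D linear losses. The helper name `ogd_regret`, the interval domain, and the uniform-noise gradients are our own illustration (not from the paper); the final check simply confirms the classical $R_T(u) \le DG\sqrt{T}$ bound on one run.

```python
import numpy as np

def ogd_regret(gs, D, G):
    """Projected online gradient descent on linear losses <g_t, w>
    over the interval [-D/2, D/2], with fixed step size eta = D/(G*sqrt(T))."""
    T = len(gs)
    eta = D / (G * np.sqrt(T))
    w = 0.0
    ws = []
    for g in gs:
        ws.append(w)
        w = np.clip(w - eta * g, -D / 2, D / 2)  # gradient step + projection
    ws = np.array(ws)
    # Regret against the best fixed comparator in hindsight.
    u = -D / 2 if gs.sum() > 0 else D / 2
    return float(np.dot(gs, ws) - gs.sum() * u)

rng = np.random.default_rng(0)
T, D, G = 10_000, 2.0, 1.0
gs = rng.uniform(-G, G, size=T)
R = ogd_regret(gs, D, G)
# The standard analysis gives R_T(u) <= D * G * sqrt(T); verify on this run.
assert R <= D * G * np.sqrt(T)
```

Note that the step size depends on the diameter $D$; it is exactly this kind of tuning knowledge that the unbounded setting below takes away.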
Deviating from the classical setting, we study the more difficult case in which (1) the domain $W$ may have infinite diameter (such as $W = \mathbb{R}^d$), and (2) instead of observing the loss $\ell_t$, the algorithm is presented only with a potentially heavy-tailed stochastic subgradient estimate $g_t$ with $\mathbb{E}[g_t \mid w_t] \in \partial \ell_t(w_t)$. Our goal is to develop algorithms that, with high probability, obtain essentially the same regret bound that would be achievable even if the full information was available.
Considering only the setting of infinite diameter $W$ with exact subgradients $g_t \in \partial \ell_t(w_t)$, past work has achieved bounds of the form $R_T(u) \le \tilde{O}(\epsilon + \|u\|\sqrt{T})$ for all $u \in W$ simultaneously for any user-specified $\epsilon$, directly generalizing the $O(D\sqrt{T})$ rate available when $D < \infty$ [Orabona and Pál, 2016, Cutkosky and Orabona, 2018, Foster et al., 2017, Mhammedi and Koolen, 2020, Chen et al., 2021]. As such algorithms do not require knowledge of the norm $\|u\|$ that is usually used to specify a learning rate for gradient descent, we will call them parameter-free. Note that such algorithms typically guarantee constant $R_T(0)$, which is not achieved by any known form of gradient descent.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.14355v2 [stat.ML] 25 Feb 2023
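As one concrete instance of a parameter-free method (our illustration, not the algorithm of this paper), a Krichevsky-Trofimov coin-betting learner bets a signed fraction of its current wealth on each round. Because its wealth can never go negative, its regret against $u = 0$ is at most the initial wealth $\epsilon$, for any sequence of gradients with $|g_t| \le 1$:

```python
import numpy as np

def kt_bettor(gs, eps=1.0):
    """Krichevsky-Trofimov coin-betting OLO learner for 1-D gradients with
    |g_t| <= 1. Bets the fraction (sum of -g_i)/t of its current wealth."""
    wealth = eps
    ssum = 0.0      # running sum of -g_1, ..., -g_{t-1}
    regret0 = 0.0   # regret against u = 0 is just the total linear loss
    ws = []
    for t, g in enumerate(gs, start=1):
        w = (ssum / t) * wealth  # betting fraction times current wealth
        ws.append(w)
        wealth -= g * w          # wealth stays positive since |ssum/t| < 1
        regret0 += g * w
        ssum += -g
    return np.array(ws), regret0

rng = np.random.default_rng(1)
gs = np.sign(rng.standard_normal(5000))  # adversarial-style +/-1 gradients
ws, R0 = kt_bettor(gs, eps=1.0)
# Constant regret at the origin: the learner never loses more than eps.
assert R0 <= 1.0
```

This constant-$R_T(0)$ behavior is exactly what a fixed-learning-rate gradient descent cannot provide.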
While parameter-free algorithms appear to fully generalize the finite-diameter case, they fall short when $g_t$ is a stochastic subgradient estimate. In particular, lower bounds suggest that parameter-free algorithms must require Lipschitz $\ell_t$ [Cutkosky and Boahen, 2017], which means that care must be taken when using $g_t$ with unbounded noise, as this may make $\ell_t$ "appear" to be non-Lipschitz. In the case of sub-exponential $g_t$, Jun and Orabona [2019], van der Hoeven [2019] provide parameter-free algorithms that achieve $\mathbb{E}[R_T(u)] \le \tilde{O}(\epsilon + \|u\|\sqrt{T})$, but these techniques do not easily extend to heavy-tailed $g_t$ or to high-probability bounds. The high-probability statement is particularly elusive (even with sub-exponential $g_t$) because standard martingale concentration approaches appear to fail spectacularly. This failure may be counterintuitive: for finite diameter $W$, one can observe that $\langle g_t - \mathbb{E}[g_t], w_t - u\rangle$ forms a martingale difference sequence with variance determined by $\|w_t - u\| \le D$, which allows for relatively straightforward high-probability bounds. However, parameter-free algorithms typically exhibit exponentially growing $\|w_t\|$ in order to compete with all possible scales of $\|u\|$, which appears to stymie such arguments.
Our work overcomes these issues. Requiring only that $g_t$ have a bounded $p$th moment for some $p \in (1, 2]$, we devise a new algorithm whose regret with probability at least $1 - \delta$ is $R_T(u) \le \tilde{O}(\epsilon + \|u\| T^{1/p} \log(1/\delta))$ for all $u$ simultaneously. The $T^{1/p}$ dependency is unimprovable [Bubeck et al., 2013, Vural et al., 2022]. Moreover, we achieve these results simply by adding novel and carefully designed regularizers to the losses $\ell_t$ in a way that converts any parameter-free algorithm with sufficiently small regret into one with the desired high-probability guarantee.
Motivation: High-probability analysis is appealing since it provides a confidence guarantee for an algorithm over a single run. This is crucially important in the online setting, in which we must make irrevocable decisions. It is also important in the standard stochastic optimization setting encountered throughout machine learning, as it ensures that even a single, potentially very expensive training run will produce a good result. See Harvey et al. [2019], Li and Orabona [2020], Madden et al. [2020], Kavis et al. [2022] for more discussion on the importance of high-probability bounds in this setting. This goal naturally synergizes with the overall objective of parameter-free algorithms, which attempt to provide the best-tuned performance after a single pass over the data. In addition, we consider the presence of heavy-tailed stochastic gradients, which are empirically observed in large neural network architectures [Zhang et al., 2020, Zhou et al., 2020]. The online optimization problem we consider is actually fundamentally more difficult than the stochastic optimization problem: indeed, Carmon and Hinder [2022] show that lower bounds for parameter-free online optimization do not apply to stochastic optimization, and provide a high-probability analysis for the latter setting. In contrast, the more flexible online setting allows us to build more robust algorithms that can perform well in non-stationary or even adversarial environments.
Contribution and Organization: After formally introducing and discussing our setup in Sections 2 and 3, we then proceed to conduct an initial analysis for the 1-D case $W = \mathbb{R}$. First (Section 4), we introduce a parameter-free algorithm for sub-exponential $g_t$ that achieves regret $\tilde{O}(\epsilon + |u|\sqrt{T})$ in high probability. This already improves significantly on prior work, and is accomplished by introducing a novel regularizer that "cancels" some unbounded martingale concentration terms, a technique that may have wider application. Second (Section 5), we extend to heavy-tailed $g_t$ by employing clipping, which has been used in prior work on optimization [Bubeck et al., 2013, Gorbunov et al., 2020, Zhang et al., 2020, Cutkosky and Mehta, 2021] to convert heavy-tailed estimates into sub-exponential ones. This clipping introduces some bias that must be carefully offset by yet another novel regularization (which may again be of independent interest) in order to yield our final $\tilde{O}(\epsilon + |u| T^{1/p})$ parameter-free regret guarantee. Finally (Section 6), we extend to arbitrary dimensions via the reduction from Cutkosky and Orabona [2018].
2 Preliminaries
Our algorithms interact with an adversary in which, for $t = 1, \dots, T$, the algorithm first outputs a vector $w_t \in W$ for $W$ a convex subset of some real Hilbert space, and then the adversary chooses a convex and $G$-Lipschitz loss function $\ell_t : W \to \mathbb{R}$ and a distribution $P_t$ such that for $g_t \sim P_t$, $\mathbb{E}[g_t] \in \partial \ell_t(w_t)$ and $\mathbb{E}[\|g_t - \mathbb{E}[g_t]\|^p] \le \sigma^p$ for some $p \in (1, 2]$. The algorithm then observes a random sample $g_t \sim P_t$. After $t$ rounds, we compute the regret, which is a function $R_t(u) = \sum_{i=1}^t \ell_i(w_i) - \ell_i(u)$. Our goal is to guarantee $R_T(u) \le \epsilon + \tilde{O}(\|u\| T^{1/p})$ for all $u$ simultaneously with high probability.
Throughout this paper we will employ the notion of a sub-exponential random sequence:

Definition 1. Suppose $\{X_t\}$ is a sequence of random variables adapted to a filtration $\mathcal{F}_t$ such that $\{X_t, \mathcal{F}_t\}$ is a martingale difference sequence. Further, suppose $\{\sigma_t, b_t\}$ are random variables such that $\sigma_t, b_t$ are both $\mathcal{F}_{t-1}$-measurable for all $t$. Then, $\{X_t, \mathcal{F}_t\}$ is $\{\sigma_t, b_t\}$ sub-exponential if

$$\mathbb{E}[\exp(\lambda X_t) \mid \mathcal{F}_{t-1}] \le \exp(\lambda^2 \sigma_t^2 / 2)$$

almost everywhere for all $\mathcal{F}_{t-1}$-measurable $\lambda$ satisfying $\lambda < 1/b_t$.

We drop the subscript $t$ when we have uniform (not time-varying) sub-exponential parameters $(\sigma, b)$.
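As a quick numerical sanity check of Definition 1 (our illustration, not from the paper): a Rademacher variable $X \in \{-1, +1\}$ is a martingale difference with $\sigma = 1$, and its moment generating function $\cosh(\lambda)$ satisfies the sub-exponential bound for every $\lambda$ (so the constraint $\lambda < 1/b$ is vacuous here; bounded variables are in fact sub-Gaussian):

```python
import numpy as np

# MGF of a Rademacher variable X in {-1, +1}: E[exp(lambda * X)] = cosh(lambda).
# Definition 1 with sigma = 1 requires cosh(lambda) <= exp(lambda^2 / 2).
lams = np.linspace(-5, 5, 1001)
mgf = np.cosh(lams)
bound = np.exp(lams**2 / 2)  # exp(lambda^2 * sigma^2 / 2) with sigma = 1
assert np.all(mgf <= bound + 1e-12)  # holds on the whole grid
```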
We use bold font ($\mathbf{g}_t$) to refer to vectors and normal font ($g_t$) to refer to scalars. Occasionally, we abuse notation to write $\nabla \ell_t(w_t)$ for an arbitrary element of $\partial \ell_t(w_t)$.

We present our results using $O(\cdot)$ to hide constant factors and $\tilde{O}(\cdot)$ to hide $\log$ factors (such as some power of $\log T$) in the main text; the exact results are left to the last line of each proof for interested readers.
Finally, observe that by the unconstrained-to-constrained conversion of Cutkosky and Orabona [2018], we need only consider the case that $W$ is an entire vector space. By solving the problem for this case, the reduction implies a high-probability regret algorithm for any convex $W$.
3 Challenges
A reader experienced with high probability bounds in online optimization may suspect that one
could apply fairly standard approaches such as gradient clipping and martingale concentration to
easily achieve high probability bounds with heavy tails. While such techniques do appear in our
development, the story is far from straightforward. In this section, we will outline these non-intuitive
difficulties. For a further discussion, see Section 3 of Jun and Orabona [2019].
For simplicity, consider $w_t \in \mathbb{R}$. Before attempting a high-probability bound, one may try to derive a regret bound in expectation with heavy-tailed (or even light-tailed) gradients $g_t$ via the following calculation:

$$\mathbb{E}[R_T(u)] = \mathbb{E}\left[\sum_{t=1}^T \ell_t(w_t) - \ell_t(u)\right] \le \sum_{t=1}^T \mathbb{E}[\langle g_t, w_t - u\rangle] + \sum_{t=1}^T \mathbb{E}[\langle \nabla \ell_t(w_t) - g_t, w_t - u\rangle]$$
The second sum above vanishes, so one is tempted to send $g_t$ directly to some existing parameter-free algorithm to obtain low regret. Unfortunately, most parameter-free algorithms require a uniform bound on $|g_t|$; even a single bound-violating $g_t$ could be catastrophic [Cutkosky and Boahen, 2017]. With heavy-tailed $g_t$, we are quite likely to encounter such a bound-violating $g_t$ for any reasonable uniform bound. In fact, the issue is difficult even for light-tailed $g_t$, as described in detail by Jun and Orabona [2019].
A natural approach to overcome this uniform-bound issue is to incorporate some form of clipping, a commonly used technique for controlling heavy-tailed subgradients. The clipped subgradient $\hat{g}_t$ is defined below with a positive clipping parameter $\tau$ as:

$$\hat{g}_t = \frac{g_t}{|g_t|} \min(\tau, |g_t|)$$
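In code, the clipping operation is a single line; this sketch (our illustration) handles the 1-D case, where $g_t / |g_t|$ is just the sign of $g_t$:

```python
import numpy as np

def clip_grad(g, tau):
    """Clip the magnitude of g to tau, preserving its sign (1-D case)."""
    mag = np.abs(g)
    if mag == 0:
        return g  # convention: leave a zero gradient unchanged
    return g / mag * min(tau, mag)

# Magnitudes above tau are truncated; those below pass through unchanged.
assert clip_grad(5.0, 2.0) == 2.0
assert clip_grad(-5.0, 2.0) == -2.0
assert clip_grad(0.5, 2.0) == 0.5
```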
If we run algorithms on the uniformly bounded $\hat{g}_t$ instead, the expected regret can now be written as:

$$\mathbb{E}[R_T(u)] \le \underbrace{\sum_{t=1}^T \mathbb{E}[\langle \hat{g}_t, w_t - u\rangle]}_{\text{parameter-free regret}} + \underbrace{\sum_{t=1}^T \mathbb{E}[\langle \mathbb{E}[\hat{g}_t] - \hat{g}_t, w_t - u\rangle]}_{\text{martingale concentration?}} + \underbrace{\sum_{t=1}^T \mathbb{E}[\langle \nabla \ell_t(w_t) - \mathbb{E}[\hat{g}_t], w_t - u\rangle]}_{\text{bias}} \quad (1)$$
Since $|\hat{g}_t| \le \tau$, the first term can in fact be controlled for appropriate $\tau$ at a rate of $\tilde{O}(\epsilon + |u|\sqrt{T})$ using sufficiently advanced parameter-free algorithms (e.g. Cutkosky and Orabona [2018]). However, now bias accumulates in the last term, which is difficult to bound due to the dependency on $w_t$. On the surface, understanding this dependency appears to require detailed (and difficult) analysis of the dynamics of the parameter-free algorithm. In fact, from naive inspection of the updates for standard parameter-free algorithms, one expects that $|w_t|$ could actually grow exponentially fast in $t$, leading to a very large bias term.
Finally, disregarding these challenges faced even in expectation, to derive a high-probability bound the natural approach is to bound the middle sum in (1) via some martingale concentration argument. Unfortunately, the variance process for this martingale depends on $w_t$, just like the bias term. In fact, this issue appears even if the original $g_t$ already have bounded norm, which is the most extreme version of light tails! Thus, we again appear to encounter a need for small $w_t$, which may instead grow exponentially. In summary, the unbounded nature of $w_t$ makes dealing with any kind of stochasticity in the $g_t$ very difficult. In this work we will develop techniques based on regularization that intuitively force the $w_t$ to behave well, eventually enabling our high-probability regret bounds.
4 Bounded Sub-exponential Noise via Cancellation
In this section, we describe how to obtain a regret bound in high probability for stochastic subgradients $g_t$ for which $\mathbb{E}[g_t^2] \le \sigma^2$ and $|g_t| \le b$ for some $\sigma$ and $b$ (in particular, $g_t$ exhibits $(\sigma, 4b)$ sub-exponential noise). We focus on the 1-dimensional case with $W = \mathbb{R}$. The extension to more general $W$ is covered in Section 6. Our method involves two coordinated techniques. First, we introduce a carefully designed regularizer $\psi_t$ such that any algorithm that achieves low regret with respect to the losses $w \mapsto g_t w + \psi_t(w)$ will automatically ensure low regret with high probability on the original losses $\ell_t$. Unfortunately, $\psi_t$ is not Lipschitz, and so it is still not obvious how to obtain low regret. We overcome this final issue by an "implicit" modification of the optimistic parameter-free algorithm of Cutkosky [2019]. Our overall goal is a regret bound of $R_T(u) \le \tilde{O}(\epsilon + |u|(\sigma + G)\sqrt{T} + b|u|)$ for all $u$ with high probability. Note that with this bound, $b$ can be $O(\sqrt{T})$ before it becomes a significant factor in the regret.
Let us proceed to sketch the first (and most critical) part of this procedure. Define $\epsilon_t = \nabla \ell_t(w_t) - g_t$, so that $\epsilon_t$ captures the "noise" in the gradient estimate $g_t$. In this section, we assume that $\epsilon_t$ is $(\sigma, 4b)$ sub-exponential for all $t$ for some given $\sigma, b$ and $|g_t| \le b$. Then we can write:

$$R_T(u) \le \sum_{t=1}^T \langle \nabla \ell_t(w_t), w_t - u\rangle = \sum_{t=1}^T \langle g_t, w_t - u\rangle + \sum_{t=1}^T \langle \epsilon_t, w_t\rangle - \sum_{t=1}^T \langle \epsilon_t, u\rangle$$
$$\le \sum_{t=1}^T \langle g_t, w_t - u\rangle + \underbrace{\sum_{t=1}^T \epsilon_t w_t + |u|\left|\sum_{t=1}^T \epsilon_t\right|}_{\text{"noise term", NOISE}} \quad (2)$$
Now, the natural strategy is to run an OLO algorithm $\mathcal{A}$ on the observed $g_t$, which will obtain some regret $R^{\mathcal{A}}_T(u) = \sum_{t=1}^T \langle g_t, w_t - u\rangle$, and then show that the remaining NOISE terms are small. To this end, from sub-exponential martingale concentration, we might hope to show that with probability $1 - \delta$, we have an inequality similar to:

$$\text{NOISE} \le \sigma\sqrt{\sum_{t=1}^T w_t^2 \log(1/\delta)} + b\max_t |w_t| \log(1/\delta) + |u|\sigma\sqrt{T\log(1/\delta)} + |u| b \log(1/\delta)$$
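To illustrate the kind of concentration being invoked (this simulation is ours, not an argument from the paper, and uses the looser Azuma-Hoeffding constant $\sqrt{2}$ rather than the constants in the display above), one can check numerically that for bounded noise $\epsilon_t = \pm\sigma$ the deviation of $\sum_t \epsilon_t w_t$ is controlled at scale $\sigma\sqrt{\sum_t w_t^2 \log(1/\delta)}$ even when the $w_t$ grow with $t$:

```python
import numpy as np

# Azuma-Hoeffding for eps_t = +/- sigma (a simple sub-Gaussian case):
# P( sum_t eps_t * w_t >= sigma * sqrt(2 * sum_t w_t^2 * log(1/delta)) ) <= delta.
rng = np.random.default_rng(0)
T, sigma, delta, trials = 1000, 1.0, 0.01, 2000
ws = np.exp(0.01 * np.arange(T))  # iterates that grow with t, as in the text
threshold = sigma * np.sqrt(2 * np.sum(ws**2) * np.log(1 / delta))
eps = sigma * rng.choice([-1.0, 1.0], size=(trials, T))  # Rademacher noise
exceed = np.mean(eps @ ws >= threshold)  # empirical failure frequency
assert exceed <= delta  # the bound holds, with room to spare
```

The point of the simulation is the scale of the threshold: it is governed by $\sqrt{\sum_t w_t^2}$, which is exactly the algorithm-dependent quantity the regularizer below is designed to cancel.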
The dependency on $|u|$ above appears to be relatively innocuous, as it only contributes $\tilde{O}(|u|\sigma\sqrt{T} + |u|b)$ to the regret. The $w_t$-dependent terms are more difficult, as they involve a dependency on the algorithm $\mathcal{A}$. This captures the complexity of our unbounded setting: in a bounded domain, the situation is far simpler, as we can uniformly bound $|w_t| \le D$, ideally leaving us with an $\tilde{O}(D\sqrt{T})$ bound overall.
Unfortunately, in the unconstrained case, $|w_t|$ could grow exponentially ($|w_t| \sim 2^t$) even when $u$ is very small, so we cannot rely on a uniform bound. In fact, even in the finite-diameter case, if we wish to guarantee $R_T(0) \le \epsilon$, the bound $|w_t| \le D$ is still too coarse. The resolution is to instead feed the algorithm $\mathcal{A}$ a regularized loss $\hat{\ell}_t(w) = \langle g_t, w\rangle + \psi_t(w)$, where $\psi_t$ will "cancel" the $w_t$ dependency in the martingale concentration. That is, we now define $R^{\mathcal{A}}_T(u) = \sum_{t=1}^T \hat{\ell}_t(w_t) - \hat{\ell}_t(u)$ and rearrange:

$$\sum_{t=1}^T \langle g_t, w_t - u\rangle \le R^{\mathcal{A}}_T(u) - \sum_{t=1}^T \psi_t(w_t) + \sum_{t=1}^T \psi_t(u) \quad (3)$$
And now combine equations (2) and (3):

$$R_T(u) \le R^{\mathcal{A}}_T(u) - \sum_{t=1}^T \psi_t(w_t) + \sum_{t=1}^T \psi_t(u) + \text{NOISE}$$
$$\le R^{\mathcal{A}}_T(u) + \left(\sigma\sqrt{\sum_{t=1}^T w_t^2 \log(1/\delta)} + b\max_t |w_t|\log(1/\delta) - \sum_{t=1}^T \psi_t(w_t)\right) + |u|\sigma\sqrt{T\log(1/\delta)} + |u|b\log(1/\delta) + \sum_{t=1}^T \psi_t(u) \quad (4)$$
From this, we can read off the desired properties of $\psi_t$: (1) $\psi_t$ should be large enough that $\sum_{t=1}^T \psi_t(w_t) \ge \sigma\sqrt{\sum_{t=1}^T w_t^2 \log(1/\delta)} + b\max_t |w_t|\log(1/\delta)$; (2) $\psi_t$ should be small enough that $\sum_{t=1}^T \psi_t(u) \le \tilde{O}(|u|\sqrt{T})$; and (3) $\psi_t$ should be such that $R^{\mathcal{A}}_T(u) = \tilde{O}(\epsilon + |u|\sqrt{T})$ for an appropriate algorithm $\mathcal{A}$. If we can exhibit a $\psi_t$ satisfying all three properties, we will have developed a regret bound of $\tilde{O}(\epsilon + |u|\sqrt{T})$ in high probability.
It turns out that the modified Huber loss $r_t(w)$ defined in equations (5) and (6), with appropriately chosen constants $c_1, c_2, p_1, p_2, \alpha_1, \alpha_2$, satisfies criteria (1) and (2).

$$r_t(w; c, p, \alpha_0) = \begin{cases} \dfrac{c\,(p|w| - (p-1)|w_t|)\,|w_t|^{p-1}}{\left(\sum_{i=1}^t |w_i|^p + \alpha_0^p\right)^{1-1/p}}, & |w| > |w_t| \\[2ex] \dfrac{c\,|w|^p}{\left(\sum_{i=1}^t |w_i|^p + \alpha_0^p\right)^{1-1/p}}, & |w| \le |w_t| \end{cases} \quad (5)$$

$$\psi_t(w) = r_t(w; c_1, p_1, \alpha_1) + r_t(w; c_2, p_2, \alpha_2) \quad (6)$$
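A direct transcription of equation (5) (the helper name `r_t` is our own) makes it easy to verify numerically that the two branches agree at the kink $|w| = |w_t|$, using the same constants as in Figure 1:

```python
import numpy as np

def r_t(w, ws_hist, c=1.0, p=2.0, alpha0=1.0):
    """Modified Huber regularizer r_t(w; c, p, alpha0) from equation (5).
    ws_hist = [w_1, ..., w_t]; the last entry plays the role of w_t."""
    wt = abs(ws_hist[-1])
    denom = (sum(abs(wi)**p for wi in ws_hist) + alpha0**p) ** (1 - 1 / p)
    if abs(w) <= wt:
        return c * abs(w)**p / denom  # polynomial region (quadratic for p = 2)
    return c * (p * abs(w) - (p - 1) * wt) * wt**(p - 1) / denom  # linear region

# The setting of Figure 1: sum_i w_i^2 = 10, w_t = 2, c = 1, p = 2, alpha0 = 1.
hist = [np.sqrt(6), 2.0]  # 6 + 4 = 10
assert abs(r_t(2.0, hist) - 4 / np.sqrt(11)) < 1e-9  # value at the kink
# Continuity across the kink |w| = |w_t|:
inside = r_t(2.0 - 1e-9, hist)
outside = r_t(2.0 + 1e-9, hist)
assert abs(inside - outside) < 1e-6
```

The matching slopes on either side of the kink are what make $r_t$ continuously differentiable, as discussed next.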
Let us take a moment to gain some intuition for these functions $r_t$ and $\psi_t$. First, observe that $r_t$ is always continuously differentiable, and that $r_t$'s definition requires knowledge of $w_t$. This is acceptable because online learning algorithms must be able to handle even adaptively chosen losses. In particular, consider the $p = 2$ case, $r_t(w; c, 2, \alpha)$ for some positive constants $c$ and $\alpha$. We plot this function in Figure 1, where one can see that $r_t$ grows quadratically for $|w| \le |w_t|$, but grows only linearly afterwards, so that $r_t$ is Lipschitz.
Figure 1: $r_t(w; 1, 2, 1)$ when $\sum_{i=1}^t w_i^2 = 10$ and $w_t = 2$. The dashed line has slope $\dfrac{c\,p\,|w_t|^{p-1}}{\left(\sum_{i=1}^t |w_i|^p + \alpha_0^p\right)^{1-1/p}}$, so that $r_t$ is quadratic for $|w| \le |w_t|$ and linear otherwise. Notice that $w_t$ is a constant used to define $r_t$; it is not the argument of the function.