Parameter-free Regret in High Probability with
Heavy Tails
Jiujia Zhang
Electrical and Computer Engineering
Boston University
jiujiaz@bu.edu
Ashok Cutkosky
Electrical and Computer Engineering
Boston University
ashok@cutkosky.com
Abstract
We present new algorithms for online convex optimization over unbounded domains that obtain parameter-free regret in high probability given access only to potentially heavy-tailed subgradient estimates. Previous work in unbounded domains considers only in-expectation results for sub-exponential subgradients. Unlike in the bounded-domain case, we cannot rely on straightforward martingale concentration due to the exponentially large iterates produced by the algorithm. We develop new regularization techniques to overcome these problems. Overall, with probability at least $1 - \delta$, for all comparators $u$ our algorithm achieves regret $\tilde{O}(\|u\| T^{1/p} \log(1/\delta))$ for subgradients with bounded $p$th moments for some $p \in (1, 2]$.
1 Introduction
In this paper, we consider the problem of online learning with convex losses, also called online convex optimization, with heavy-tailed stochastic subgradients. In the classical online convex optimization setting, given a convex set $W$, a learning algorithm must repeatedly output a vector $w_t \in W$, and then observe a convex loss function $\ell_t : W \to \mathbb{R}$ and incur a loss of $\ell_t(w_t)$. After $T$ such rounds, the algorithm's quality is measured by the regret with respect to a fixed competitor $u \in W$:

$$R_T(u) = \sum_{t=1}^T \ell_t(w_t) - \sum_{t=1}^T \ell_t(u)$$

Online convex optimization is widely applicable, and has been used to design popular stochastic optimization algorithms [Duchi et al., 2010a, Kingma and Ba, 2014, Reddi et al., 2018], for control of linear dynamical systems [Agarwal et al., 2019], and even to build concentration inequalities [Vovk, 2007, Waudby-Smith and Ramdas, 2020, Orabona and Jun, 2021].
A popular approach to this problem reduces it to online linear optimization (OLO): if $g_t$ is a subgradient of $\ell_t$ at $w_t$, then $R_T(u) \le \sum_{t=1}^T \langle g_t, w_t - u\rangle$, so that it suffices to design an algorithm that considers only linear losses $w \mapsto \langle g_t, w\rangle$. Then, by assuming that the domain $W$ has some finite diameter $D$, standard arguments show that online gradient descent [Zinkevich, 2003] and its variants achieve $R_T(u) \le O(D\sqrt{T})$ for all $u \in W$. See the excellent books Cesa-Bianchi and Lugosi [2006], Shalev-Shwartz [2011], Hazan [2019], Orabona [2019] for more detail.
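To make the finite-diameter baseline concrete, here is a minimal sketch of projected online gradient descent on 1-D linear losses. The helper name `ogd_regret`, the interval domain, and the uniform-noise gradients are our own illustration (not from the paper); the final check simply confirms the classical $R_T(u) \le DG\sqrt{T}$ bound on one run.

```python
import numpy as np

def ogd_regret(gs, D, G):
    """Projected online gradient descent on linear losses <g_t, w>
    over the interval [-D/2, D/2], with fixed step size eta = D/(G*sqrt(T))."""
    T = len(gs)
    eta = D / (G * np.sqrt(T))
    w = 0.0
    ws = []
    for g in gs:
        ws.append(w)
        w = np.clip(w - eta * g, -D / 2, D / 2)  # gradient step + projection
    ws = np.array(ws)
    # Regret against the best fixed comparator in hindsight.
    u = -D / 2 if gs.sum() > 0 else D / 2
    return float(np.dot(gs, ws) - gs.sum() * u)

rng = np.random.default_rng(0)
T, D, G = 10_000, 2.0, 1.0
gs = rng.uniform(-G, G, size=T)
R = ogd_regret(gs, D, G)
# The standard analysis gives R_T(u) <= D * G * sqrt(T); verify on this run.
assert R <= D * G * np.sqrt(T)
```

Note that the step size depends on the diameter $D$; it is exactly this kind of tuning knowledge that the unbounded setting below takes away.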
Deviating from the classical setting, we study the more difficult case in which (1) the domain $W$ may have infinite diameter (such as $W = \mathbb{R}^d$), and (2) instead of observing the loss $\ell_t$, the algorithm is presented only with a potentially heavy-tailed stochastic subgradient estimate $g_t$ with $\mathbb{E}[g_t \mid w_t] \in \partial \ell_t(w_t)$. Our goal is to develop algorithms that, with high probability, obtain essentially the same regret bound that would be achievable even if the full information was available.
Considering only the setting of infinite diameter $W$ with exact subgradients $g_t \in \partial \ell_t(w_t)$, past work has achieved bounds of the form $R_T(u) \le \tilde{O}(\epsilon + \|u\|\sqrt{T})$ for all $u \in W$ simultaneously for any user-specified $\epsilon$, directly generalizing the $O(D\sqrt{T})$ rate available when $D < \infty$ [Orabona and Pál, 2016, Cutkosky and Orabona, 2018, Foster et al., 2017, Mhammedi and Koolen, 2020, Chen et al., 2021]. As such algorithms do not require knowledge of the norm $\|u\|$ that is usually used to specify a learning rate for gradient descent, we will call them parameter-free. Note that such algorithms typically guarantee constant $R_T(0)$, which is not achieved by any known form of gradient descent.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.14355v2 [stat.ML] 25 Feb 2023
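As one concrete instance of a parameter-free method (our illustration, not the algorithm of this paper), a Krichevsky-Trofimov coin-betting learner bets a signed fraction of its current wealth on each round. Because its wealth can never go negative, its regret against $u = 0$ is at most the initial wealth $\epsilon$, for any sequence of gradients with $|g_t| \le 1$:

```python
import numpy as np

def kt_bettor(gs, eps=1.0):
    """Krichevsky-Trofimov coin-betting OLO learner for 1-D gradients with
    |g_t| <= 1. Bets the fraction (sum of -g_i)/t of its current wealth."""
    wealth = eps
    ssum = 0.0      # running sum of -g_1, ..., -g_{t-1}
    regret0 = 0.0   # regret against u = 0 is just the total linear loss
    ws = []
    for t, g in enumerate(gs, start=1):
        w = (ssum / t) * wealth  # betting fraction times current wealth
        ws.append(w)
        wealth -= g * w          # wealth stays positive since |ssum/t| < 1
        regret0 += g * w
        ssum += -g
    return np.array(ws), regret0

rng = np.random.default_rng(1)
gs = np.sign(rng.standard_normal(5000))  # adversarial-style +/-1 gradients
ws, R0 = kt_bettor(gs, eps=1.0)
# Constant regret at the origin: the learner never loses more than eps.
assert R0 <= 1.0
```

This constant-$R_T(0)$ behavior is exactly what a fixed-learning-rate gradient descent cannot provide.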
While parameter-free algorithms appear to fully generalize the finite-diameter case, they fall short when $g_t$ is a stochastic subgradient estimate. In particular, lower bounds suggest that parameter-free algorithms must require Lipschitz $\ell_t$ [Cutkosky and Boahen, 2017], which means that care must be taken when using $g_t$ with unbounded noise, as this may make $\ell_t$ "appear" to be non-Lipschitz. In the case of sub-exponential $g_t$, Jun and Orabona [2019], van der Hoeven [2019] provide parameter-free algorithms that achieve $\mathbb{E}[R_T(u)] \le \tilde{O}(\epsilon + \|u\|\sqrt{T})$, but these techniques do not easily extend to heavy-tailed $g_t$ or to high-probability bounds. The high-probability statement is particularly elusive (even with sub-exponential $g_t$) because standard martingale concentration approaches appear to fail spectacularly. This failure may be counterintuitive: for finite diameter $W$, one can observe that $\langle g_t - \mathbb{E}[g_t], w_t - u\rangle$ forms a martingale difference sequence with variance determined by $\|w_t - u\| \le D$, which allows for relatively straightforward high-probability bounds. However, parameter-free algorithms typically exhibit exponentially growing $\|w_t\|$ in order to compete with all possible scales of $\|u\|$, which appears to stymie such arguments.
Our work overcomes these issues. Requiring only that $g_t$ have a bounded $p$th moment for some $p \in (1, 2]$, we devise a new algorithm whose regret with probability at least $1 - \delta$ is $R_T(u) \le \tilde{O}(\epsilon + \|u\| T^{1/p} \log(1/\delta))$ for all $u$ simultaneously. The $T^{1/p}$ dependency is unimprovable [Bubeck et al., 2013, Vural et al., 2022]. Moreover, we achieve these results simply by adding novel and carefully designed regularizers to the losses $\ell_t$ in a way that converts any parameter-free algorithm with sufficiently small regret into one with the desired high-probability guarantee.
Motivation: High-probability analysis is appealing since it provides a confidence guarantee for an algorithm over a single run. This is crucially important in the online setting, in which we must make irrevocable decisions. It is also important in the standard stochastic optimization setting encountered throughout machine learning, as it ensures that even a single, potentially very expensive training run will produce a good result. See Harvey et al. [2019], Li and Orabona [2020], Madden et al. [2020], Kavis et al. [2022] for more discussion on the importance of high-probability bounds in this setting. This goal naturally synergizes with the overall objective of parameter-free algorithms, which attempt to provide the best-tuned performance after a single pass over the data. In addition, we consider the presence of heavy-tailed stochastic gradients, which are empirically observed in large neural network architectures [Zhang et al., 2020, Zhou et al., 2020]. The online optimization problem we consider is actually fundamentally more difficult than the stochastic optimization problem: indeed, Carmon and Hinder [2022] show that lower bounds for parameter-free online optimization do not apply to stochastic optimization, and provide a high-probability analysis for the latter setting. In contrast, the more flexible online setting allows us to build more robust algorithms that can perform well in non-stationary or even adversarial environments.
Contribution and Organization: After formally introducing and discussing our setup in Sections 2 and 3, we then proceed to conduct an initial analysis for the 1-D case $W = \mathbb{R}$. First (Section 4), we introduce a parameter-free algorithm for sub-exponential $g_t$ that achieves regret $\tilde{O}(\epsilon + |u|\sqrt{T})$ in high probability. This already improves significantly on prior work, and is accomplished by introducing a novel regularizer that "cancels" some unbounded martingale concentration terms, a technique that may have wider application. Second (Section 5), we extend to heavy-tailed $g_t$ by employing clipping, which has been used in prior work on optimization [Bubeck et al., 2013, Gorbunov et al., 2020, Zhang et al., 2020, Cutkosky and Mehta, 2021] to convert heavy-tailed estimates into sub-exponential ones. This clipping introduces some bias that must be carefully offset by yet another novel regularization (which may again be of independent interest) in order to yield our final $\tilde{O}(\epsilon + |u| T^{1/p})$ parameter-free regret guarantee. Finally (Section 6), we extend to arbitrary dimensions via the reduction from Cutkosky and Orabona [2018].
2 Preliminaries
Our algorithms interact with an adversary in which, for $t = 1, \dots, T$, the algorithm first outputs a vector $w_t \in W$ for $W$ a convex subset of some real Hilbert space, and then the adversary chooses a convex and $G$-Lipschitz loss function $\ell_t : W \to \mathbb{R}$ and a distribution $P_t$ such that for $g_t \sim P_t$, $\mathbb{E}[g_t] \in \partial \ell_t(w_t)$ and $\mathbb{E}[\|g_t - \mathbb{E}[g_t]\|^p] \le \sigma^p$ for some $p \in (1, 2]$. The algorithm then observes a random sample $g_t \sim P_t$. After $t$ rounds, we compute the regret, which is a function $R_t(u) = \sum_{i=1}^t \ell_i(w_i) - \ell_i(u)$. Our goal is to guarantee $R_T(u) \le \epsilon + \tilde{O}(\|u\| T^{1/p})$ for all $u$ simultaneously with high probability.
Throughout this paper we will employ the notion of a sub-exponential random sequence:

Definition 1. Suppose $\{X_t\}$ is a sequence of random variables adapted to a filtration $\mathcal{F}_t$ such that $\{X_t, \mathcal{F}_t\}$ is a martingale difference sequence. Further, suppose $\{\sigma_t, b_t\}$ are random variables such that $\sigma_t, b_t$ are both $\mathcal{F}_{t-1}$-measurable for all $t$. Then, $\{X_t, \mathcal{F}_t\}$ is $\{\sigma_t, b_t\}$ sub-exponential if

$$\mathbb{E}[\exp(\lambda X_t) \mid \mathcal{F}_{t-1}] \le \exp(\lambda^2 \sigma_t^2 / 2)$$

almost everywhere for all $\mathcal{F}_{t-1}$-measurable $\lambda$ satisfying $\lambda < 1/b_t$.

We drop the subscript $t$ when we have uniform (not time-varying) sub-exponential parameters $(\sigma, b)$.
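As a quick numerical sanity check of Definition 1 (our illustration, not from the paper): a Rademacher variable $X \in \{-1, +1\}$ is a martingale difference with $\sigma = 1$, and its moment generating function $\cosh(\lambda)$ satisfies the sub-exponential bound for every $\lambda$ (so the constraint $\lambda < 1/b$ is vacuous here; bounded variables are in fact sub-Gaussian):

```python
import numpy as np

# MGF of a Rademacher variable X in {-1, +1}: E[exp(lambda * X)] = cosh(lambda).
# Definition 1 with sigma = 1 requires cosh(lambda) <= exp(lambda^2 / 2).
lams = np.linspace(-5, 5, 1001)
mgf = np.cosh(lams)
bound = np.exp(lams**2 / 2)  # exp(lambda^2 * sigma^2 / 2) with sigma = 1
assert np.all(mgf <= bound + 1e-12)  # holds on the whole grid
```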
We use bold font ($\mathbf{g}_t$) to refer to vectors and normal font ($g_t$) to refer to scalars. Occasionally, we abuse notation to write $\nabla \ell_t(w_t)$ for an arbitrary element of $\partial \ell_t(w_t)$.

We present our results using $O(\cdot)$ to hide constant factors and $\tilde{O}(\cdot)$ to hide $\log$ factors (such as some power of $\log T$) in the main text; the exact results are left to the last line of each proof for interested readers.
Finally, observe that by the unconstrained-to-constrained conversion of Cutkosky and Orabona [2018], we need only consider the case that $W$ is an entire vector space. By solving the problem for this case, the reduction implies a high-probability regret algorithm for any convex $W$.
3 Challenges
A reader experienced with high probability bounds in online optimization may suspect that one
could apply fairly standard approaches such as gradient clipping and martingale concentration to
easily achieve high probability bounds with heavy tails. While such techniques do appear in our
development, the story is far from straightforward. In this section, we will outline these non-intuitive
difficulties. For a further discussion, see Section 3 of Jun and Orabona [2019].
For simplicity, consider $w_t \in \mathbb{R}$. Before attempting a high-probability bound, one may try to derive a regret bound in expectation with heavy-tailed (or even light-tailed) gradients $g_t$ via the following calculation:

$$\mathbb{E}[R_T(u)] = \mathbb{E}\left[\sum_{t=1}^T \ell_t(w_t) - \ell_t(u)\right] \le \sum_{t=1}^T \mathbb{E}[\langle g_t, w_t - u\rangle] + \sum_{t=1}^T \mathbb{E}[\langle \nabla \ell_t(w_t) - g_t, w_t - u\rangle]$$
The second sum above vanishes, so one is tempted to send $g_t$ directly to some existing parameter-free algorithm to obtain low regret. Unfortunately, most parameter-free algorithms require a uniform bound on $|g_t|$; even a single bound-violating $g_t$ could be catastrophic [Cutkosky and Boahen, 2017]. With heavy-tailed $g_t$, we are quite likely to encounter such a bound-violating $g_t$ for any reasonable uniform bound. In fact, the issue is difficult even for light-tailed $g_t$, as described in detail by Jun and Orabona [2019].
A natural approach to overcome this uniform-bound issue is to incorporate some form of clipping, a commonly used technique for controlling heavy-tailed subgradients. The clipped subgradient $\hat{g}_t$ is defined below with a positive clipping parameter $\tau$ as:

$$\hat{g}_t = \frac{g_t}{|g_t|} \min(\tau, |g_t|)$$
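In code, the clipping operation is a single line; this sketch (our illustration) handles the 1-D case, where $g_t / |g_t|$ is just the sign of $g_t$:

```python
import numpy as np

def clip_grad(g, tau):
    """Clip the magnitude of g to tau, preserving its sign (1-D case)."""
    mag = np.abs(g)
    if mag == 0:
        return g  # convention: leave a zero gradient unchanged
    return g / mag * min(tau, mag)

# Magnitudes above tau are truncated; those below pass through unchanged.
assert clip_grad(5.0, 2.0) == 2.0
assert clip_grad(-5.0, 2.0) == -2.0
assert clip_grad(0.5, 2.0) == 0.5
```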
If we run algorithms on the uniformly bounded $\hat{g}_t$ instead, the expected regret can now be written as:

$$\mathbb{E}[R_T(u)] \le \underbrace{\sum_{t=1}^T \mathbb{E}[\langle \hat{g}_t, w_t - u\rangle]}_{\text{parameter-free regret}} + \underbrace{\sum_{t=1}^T \mathbb{E}[\langle \mathbb{E}[\hat{g}_t] - \hat{g}_t, w_t - u\rangle]}_{\text{martingale concentration?}} + \underbrace{\sum_{t=1}^T \mathbb{E}[\langle \nabla \ell_t(w_t) - \mathbb{E}[\hat{g}_t], w_t - u\rangle]}_{\text{bias}} \quad (1)$$
Since $|\hat{g}_t| \le \tau$, the first term can in fact be controlled for appropriate $\tau$ at a rate of $\tilde{O}(\epsilon + |u|\sqrt{T})$ using sufficiently advanced parameter-free algorithms (e.g. Cutkosky and Orabona [2018]). However, now bias accumulates in the last term, which is difficult to bound due to the dependency on $w_t$. On the surface, understanding this dependency appears to require detailed (and difficult) analysis of the dynamics of the parameter-free algorithm. In fact, from naive inspection of the updates for standard parameter-free algorithms, one expects that $|w_t|$ could actually grow exponentially fast in $t$, leading to a very large bias term.
Finally, disregarding these challenges faced even in expectation, to derive a high-probability bound the natural approach is to bound the middle sum in (1) via some martingale concentration argument. Unfortunately, the variance process for this martingale depends on $w_t$, just like the bias term. In fact, this issue appears even if the original $g_t$ already have bounded norm, which is the most extreme version of light tails! Thus, we again appear to encounter a need for small $w_t$, which may instead grow exponentially. In summary, the unbounded nature of $w_t$ makes dealing with any kind of stochasticity in the $g_t$ very difficult. In this work we will develop techniques based on regularization that intuitively force the $w_t$ to behave well, eventually enabling our high-probability regret bounds.
4 Bounded Sub-exponential Noise via Cancellation
In this section, we describe how to obtain a regret bound in high probability for stochastic subgradients $g_t$ for which $\mathbb{E}[g_t^2] \le \sigma^2$ and $|g_t| \le b$ for some $\sigma$ and $b$ (in particular, $g_t$ exhibits $(\sigma, 4b)$ sub-exponential noise). We focus on the 1-dimensional case with $W = \mathbb{R}$. The extension to more general $W$ is covered in Section 6. Our method involves two coordinated techniques. First, we introduce a carefully designed regularizer $\psi_t$ such that any algorithm that achieves low regret with respect to the losses $w \mapsto g_t w + \psi_t(w)$ will automatically ensure low regret with high probability on the original losses $\ell_t$. Unfortunately, $\psi_t$ is not Lipschitz, and so it is still not obvious how to obtain low regret. We overcome this final issue by an "implicit" modification of the optimistic parameter-free algorithm of Cutkosky [2019]. Our overall goal is a regret bound of $R_T(u) \le \tilde{O}(\epsilon + |u|(\sigma + G)\sqrt{T} + b|u|)$ for all $u$ with high probability. Note that with this bound, $b$ can be $O(\sqrt{T})$ before it becomes a significant factor in the regret.
Let us proceed to sketch the first (and most critical) part of this procedure. Define $\epsilon_t = \nabla \ell_t(w_t) - g_t$, so that $\epsilon_t$ captures the "noise" in the gradient estimate $g_t$. In this section, we assume that $\epsilon_t$ is $(\sigma, 4b)$ sub-exponential for all $t$ for some given $\sigma, b$ and $|g_t| \le b$. Then we can write:

$$R_T(u) \le \sum_{t=1}^T \langle \nabla \ell_t(w_t), w_t - u\rangle = \sum_{t=1}^T \langle g_t, w_t - u\rangle + \sum_{t=1}^T \langle \epsilon_t, w_t\rangle - \sum_{t=1}^T \langle \epsilon_t, u\rangle$$
$$\le \sum_{t=1}^T \langle g_t, w_t - u\rangle + \underbrace{\sum_{t=1}^T \epsilon_t w_t + |u|\left|\sum_{t=1}^T \epsilon_t\right|}_{\text{"noise term", NOISE}} \quad (2)$$
Now, the natural strategy is to run an OLO algorithm $\mathcal{A}$ on the observed $g_t$, which will obtain some regret $R^{\mathcal{A}}_T(u) = \sum_{t=1}^T \langle g_t, w_t - u\rangle$, and then show that the remaining NOISE terms are small. To this end, from sub-exponential martingale concentration, we might hope to show that with probability $1 - \delta$, we have an inequality similar to:

$$\text{NOISE} \le \sigma\sqrt{\sum_{t=1}^T w_t^2 \log(1/\delta)} + b\max_t |w_t| \log(1/\delta) + |u|\sigma\sqrt{T\log(1/\delta)} + |u| b \log(1/\delta)$$
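To illustrate the kind of concentration being invoked (this simulation is ours, not an argument from the paper, and uses the looser Azuma-Hoeffding constant $\sqrt{2}$ rather than the constants in the display above), one can check numerically that for bounded noise $\epsilon_t = \pm\sigma$ the deviation of $\sum_t \epsilon_t w_t$ is controlled at scale $\sigma\sqrt{\sum_t w_t^2 \log(1/\delta)}$ even when the $w_t$ grow with $t$:

```python
import numpy as np

# Azuma-Hoeffding for eps_t = +/- sigma (a simple sub-Gaussian case):
# P( sum_t eps_t * w_t >= sigma * sqrt(2 * sum_t w_t^2 * log(1/delta)) ) <= delta.
rng = np.random.default_rng(0)
T, sigma, delta, trials = 1000, 1.0, 0.01, 2000
ws = np.exp(0.01 * np.arange(T))  # iterates that grow with t, as in the text
threshold = sigma * np.sqrt(2 * np.sum(ws**2) * np.log(1 / delta))
eps = sigma * rng.choice([-1.0, 1.0], size=(trials, T))  # Rademacher noise
exceed = np.mean(eps @ ws >= threshold)  # empirical failure frequency
assert exceed <= delta  # the bound holds, with room to spare
```

The point of the simulation is the scale of the threshold: it is governed by $\sqrt{\sum_t w_t^2}$, which is exactly the algorithm-dependent quantity the regularizer below is designed to cancel.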
The dependency on $|u|$ above appears to be relatively innocuous, as it only contributes $\tilde{O}(|u|\sigma\sqrt{T} + |u|b)$ to the regret. The $w_t$-dependent terms are more difficult, as they involve a dependency on the algorithm $\mathcal{A}$. This captures the complexity of our unbounded setting: in a bounded domain, the situation is far simpler, as we can uniformly bound $|w_t| \le D$, ideally leaving us with an $\tilde{O}(D\sqrt{T})$ bound overall.
Unfortunately, in the unconstrained case, $|w_t|$ could grow exponentially ($|w_t| \sim 2^t$) even when $u$ is very small, so we cannot rely on a uniform bound. In fact, even in the finite-diameter case, if we wish to guarantee $R_T(0) \le \epsilon$, the bound $|w_t| \le D$ is still too coarse. The resolution is to instead feed the algorithm $\mathcal{A}$ a regularized loss $\hat{\ell}_t(w) = \langle g_t, w\rangle + \psi_t(w)$, where $\psi_t$ will "cancel" the $w_t$ dependency in the martingale concentration. That is, we now define $R^{\mathcal{A}}_T(u) = \sum_{t=1}^T \hat{\ell}_t(w_t) - \hat{\ell}_t(u)$ and rearrange:

$$\sum_{t=1}^T \langle g_t, w_t - u\rangle \le R^{\mathcal{A}}_T(u) - \sum_{t=1}^T \psi_t(w_t) + \sum_{t=1}^T \psi_t(u) \quad (3)$$
And now combine equations (2) and (3):

$$R_T(u) \le R^{\mathcal{A}}_T(u) - \sum_{t=1}^T \psi_t(w_t) + \sum_{t=1}^T \psi_t(u) + \text{NOISE}$$
$$\le R^{\mathcal{A}}_T(u) + \left(\sigma\sqrt{\sum_{t=1}^T w_t^2 \log(1/\delta)} + b\max_t |w_t|\log(1/\delta) - \sum_{t=1}^T \psi_t(w_t)\right) + |u|\sigma\sqrt{T\log(1/\delta)} + |u|b\log(1/\delta) + \sum_{t=1}^T \psi_t(u) \quad (4)$$
From this, we can read off the desired properties of $\psi_t$: (1) $\psi_t$ should be large enough that $\sum_{t=1}^T \psi_t(w_t) \ge \sigma\sqrt{\sum_{t=1}^T w_t^2 \log(1/\delta)} + b\max_t |w_t|\log(1/\delta)$; (2) $\psi_t$ should be small enough that $\sum_{t=1}^T \psi_t(u) \le \tilde{O}(|u|\sqrt{T})$; and (3) $\psi_t$ should be such that $R^{\mathcal{A}}_T(u) = \tilde{O}(\epsilon + |u|\sqrt{T})$ for an appropriate algorithm $\mathcal{A}$. If we can exhibit a $\psi_t$ satisfying all three properties, we will have developed a regret bound of $\tilde{O}(\epsilon + |u|\sqrt{T})$ in high probability.
It turns out that the modified Huber loss $r_t(w)$ defined in equations (5) and (6), with appropriately chosen constants $c_1, c_2, p_1, p_2, \alpha_1, \alpha_2$, satisfies criteria (1) and (2).

$$r_t(w; c, p, \alpha_0) = \begin{cases} \dfrac{c\,(p|w| - (p-1)|w_t|)\,|w_t|^{p-1}}{\left(\sum_{i=1}^t |w_i|^p + \alpha_0^p\right)^{1-1/p}}, & |w| > |w_t| \\[2ex] \dfrac{c\,|w|^p}{\left(\sum_{i=1}^t |w_i|^p + \alpha_0^p\right)^{1-1/p}}, & |w| \le |w_t| \end{cases} \quad (5)$$

$$\psi_t(w) = r_t(w; c_1, p_1, \alpha_1) + r_t(w; c_2, p_2, \alpha_2) \quad (6)$$
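A direct transcription of equation (5) (the helper name `r_t` is our own) makes it easy to verify numerically that the two branches agree at the kink $|w| = |w_t|$, using the same constants as in Figure 1:

```python
import numpy as np

def r_t(w, ws_hist, c=1.0, p=2.0, alpha0=1.0):
    """Modified Huber regularizer r_t(w; c, p, alpha0) from equation (5).
    ws_hist = [w_1, ..., w_t]; the last entry plays the role of w_t."""
    wt = abs(ws_hist[-1])
    denom = (sum(abs(wi)**p for wi in ws_hist) + alpha0**p) ** (1 - 1 / p)
    if abs(w) <= wt:
        return c * abs(w)**p / denom  # polynomial region (quadratic for p = 2)
    return c * (p * abs(w) - (p - 1) * wt) * wt**(p - 1) / denom  # linear region

# The setting of Figure 1: sum_i w_i^2 = 10, w_t = 2, c = 1, p = 2, alpha0 = 1.
hist = [np.sqrt(6), 2.0]  # 6 + 4 = 10
assert abs(r_t(2.0, hist) - 4 / np.sqrt(11)) < 1e-9  # value at the kink
# Continuity across the kink |w| = |w_t|:
inside = r_t(2.0 - 1e-9, hist)
outside = r_t(2.0 + 1e-9, hist)
assert abs(inside - outside) < 1e-6
```

The matching slopes on either side of the kink are what make $r_t$ continuously differentiable, as discussed next.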
Let us take a moment to gain some intuition for these functions $r_t$ and $\psi_t$. First, observe that $r_t$ is always continuously differentiable, and that $r_t$'s definition requires knowledge of $w_t$. This is acceptable because online learning algorithms must be able to handle even adaptively chosen losses. In particular, consider the $p = 2$ case, $r_t(w; c, 2, \alpha)$ for some positive constants $c$ and $\alpha$. We plot this function in Figure 1, where one can see that $r_t$ grows quadratically for $|w| \le |w_t|$, but grows only linearly afterwards, so that $r_t$ is Lipschitz.
Figure 1: $r_t(w; 1, 2, 1)$ when $\sum_{i=1}^t w_i^2 = 10$ and $w_t = 2$. The dashed line has slope $\dfrac{c\,p\,|w_t|^{p-1}}{\left(\sum_{i=1}^t |w_i|^p + \alpha_0^p\right)^{1-1/p}}$, so that $r_t$ is quadratic for $|w| \le |w_t|$ and linear otherwise. Notice that $w_t$ is a constant used to define $r_t$; it is not the argument of the function.