
user-specified, directly generalizing the $O(D\sqrt{T})$ rate available when $D < \infty$ [Orabona and Pál, 2016, Cutkosky and Orabona, 2018, Foster et al., 2017, Mhammedi and Koolen, 2020, Chen et al., 2021]. As such algorithms do not require knowledge of the norm $\|u\|$ that is usually used to specify a learning rate for gradient descent, we will call them parameter-free. Note that such algorithms typically guarantee constant $R_T(0)$, which is not achieved by any known form of gradient descent.
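To see why constant $R_T(0)$ is out of reach for gradient descent, recall the textbook online gradient descent guarantee (a generic sketch with fixed learning rate $\eta$ and $G$-Lipschitz losses, not specific to our setting):
\[
R_T(u) \;\le\; \frac{\|u - w_1\|^2}{2\eta} + \frac{\eta}{2}\sum_{t=1}^{T}\|g_t\|^2 \;\le\; \frac{\|u - w_1\|^2}{2\eta} + \frac{\eta G^2 T}{2}.
\]
Even at $u = w_1$ the bound is $\eta G^2 T / 2$, which grows with $T$ for any fixed $\eta$; shrinking $\eta$ to control this term inflates the $\|u - w_1\|^2/\eta$ term for every other comparator.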
While parameter-free algorithms appear to fully generalize the finite-diameter case, they fall short when $g_t$ is a stochastic subgradient estimate. In particular, lower bounds suggest that parameter-free algorithms must require Lipschitz $\ell_t$ [Cutkosky and Boahen, 2017], which means that care must be taken when using $g_t$ with unbounded noise, as this may make $\ell_t$ “appear” to be non-Lipschitz. In the case of sub-exponential $g_t$, Jun and Orabona [2019], van der Hoeven [2019] provide parameter-free algorithms that achieve $\mathbb{E}[R_T(u)] \le \tilde{O}(\epsilon + \|u\|\sqrt{T})$
, but these techniques do not easily extend to heavy-tailed $g_t$ or to high-probability bounds. The high-probability statement is particularly elusive (even with sub-exponential $g_t$) because standard martingale concentration approaches appear to fail spectacularly. This failure may be counterintuitive: for finite diameter $W$, one can observe that $\langle g_t - \mathbb{E}[g_t], w_t - u\rangle$ forms a martingale difference sequence with variance determined by $\|w_t - u\| \le D$, which allows for relatively straightforward high-probability bounds. However, parameter-free algorithms typically exhibit exponentially growing $\|w_t\|$ in order to compete with all possible scales of $\|u\|$, which appears to stymie such arguments.
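Concretely, the finite-diameter argument proceeds as follows (a standard sketch, writing $\bar{g}_t = \mathbb{E}[g_t \mid g_1, \dots, g_{t-1}]$):
\[
\sum_{t=1}^{T}\langle g_t, w_t - u\rangle = \sum_{t=1}^{T}\langle \bar{g}_t, w_t - u\rangle + \sum_{t=1}^{T}\langle g_t - \bar{g}_t, w_t - u\rangle,
\]
where the first sum is controlled by the algorithm's deterministic regret analysis, and each increment of the second satisfies $|\langle g_t - \bar{g}_t, w_t - u\rangle| \le \|g_t - \bar{g}_t\|\,\|w_t - u\| \le \|g_t - \bar{g}_t\|\, D$. Freedman-style concentration then yields an $\tilde{O}(D\sqrt{T})$ deviation term with probability $1-\delta$. When $\|w_t\|$ grows exponentially, no such uniform bound on the increments is available, and the argument breaks down.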
Our work overcomes these issues. Requiring only that $g_t$ have a bounded $p$th moment for some $p \in (1,2]$, we devise a new algorithm whose regret with probability at least $1-\delta$ is $R_T(u) \le \tilde{O}(\epsilon + \|u\| T^{1/p}\log(1/\delta))$ for all $u$ simultaneously. The $T^{1/p}$ dependency is unimprovable [Bubeck et al., 2013, Vural et al., 2022]. Moreover, we achieve these results simply by adding novel and carefully designed regularizers to the losses $\ell_t$ in a way that converts any parameter-free algorithm with sufficiently small regret into one with the desired high-probability guarantee.
Motivation: High-probability analysis is appealing since it provides a confidence guarantee for an algorithm over a single run. This is crucially important in the online setting, in which we must make irrevocable decisions. It is also important in the standard stochastic optimization setting encountered throughout machine learning, as it ensures that even a single, potentially very expensive training run will produce a good result. See Harvey et al. [2019], Li and Orabona [2020], Madden et al. [2020], Kavis et al. [2022] for more discussion on the importance of high-probability bounds in this setting. This goal naturally synergizes with the overall objective of parameter-free algorithms, which attempt to provide the best-tuned performance after a single pass over the data. In addition, we consider the presence of heavy-tailed stochastic gradients, which are empirically observed in large neural network architectures [Zhang et al., 2020, Zhou et al., 2020]. The online optimization problem we consider is actually fundamentally more difficult than the stochastic optimization problem: indeed, Carmon and Hinder [2022] show that lower bounds for parameter-free online optimization do not apply to stochastic optimization, and provide a high-probability analysis for the latter setting. In contrast, the more flexible online setting allows us to build more robust algorithms that can perform well in non-stationary or even adversarial environments.
Contribution and Organization: After formally introducing and discussing our setup in Section 2, we conduct an initial analysis for the 1-D case $W = \mathbb{R}$ in Section 3. First (Section 4), we introduce a parameter-free algorithm for sub-exponential $g_t$ that achieves regret $\tilde{O}(\epsilon + |u|\sqrt{T})$ in high probability. This already improves significantly on prior work, and is accomplished by introducing a novel regularizer that “cancels” some unbounded martingale concentration terms, a technique that may have wider application. Second (Section 5), we extend to heavy-tailed $g_t$ by employing clipping, which has been used in prior work on optimization [Bubeck et al., 2013, Gorbunov et al., 2020, Zhang et al., 2020, Cutkosky and Mehta, 2021] to convert heavy-tailed estimates into sub-exponential ones. This clipping introduces some bias that must be carefully offset by yet another novel regularization (which may again be of independent interest) in order to yield our final $\tilde{O}(\epsilon + |u| T^{1/p})$ parameter-free regret guarantee. Finally (Section 6), we extend to arbitrary dimensions via the reduction from Cutkosky and Orabona [2018].
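The clipping operation referenced above is standard in this line of work; a minimal sketch follows (the function name `clip` and threshold parameter `tau` are illustrative choices here, not the paper's precise construction):

```python
import math

def clip(g, tau):
    """Project the gradient estimate g (a sequence of floats) onto the
    ball of radius tau.

    A heavy-tailed g becomes bounded (hence sub-exponential), at the
    cost of a bias E[clip(g, tau)] - E[g] that must be offset elsewhere
    in the analysis.
    """
    norm = math.sqrt(sum(x * x for x in g))
    if norm <= tau:
        return list(g)
    # Rescale to norm exactly tau, preserving direction.
    return [(tau / norm) * x for x in g]
```

Under a bounded $p$th moment $\mathbb{E}\|g_t\|^p \le G^p$, a standard Markov-type argument bounds the bias of such clipping by roughly $G^p/\tau^{p-1}$, which is the quantity the additional regularization must absorb.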
2 Preliminaries

Our algorithms interact with an adversary: for $t = 1, \dots, T$, the algorithm first outputs a vector $w_t \in W$, for $W$ a convex subset of some real Hilbert space, and then the adversary