Trading Off Resource Budgets For Improved Regret Bounds

Damon Falck
University of Oxford
damon.falck@gmail.com

Thomas Orton
University of Oxford
thomas.orton@cs.ox.ac.uk
Abstract
In this work we consider a variant of adversarial online learning where in each round one picks $B$ out of $N$ arms and incurs cost equal to the minimum of the costs of each arm chosen. We propose an algorithm called Follow the Perturbed Multiple Leaders (FPML) for this problem, which we show (by adapting the techniques of Kalai and Vempala [2005]) achieves expected regret $\mathcal{O}(T^{\frac{1}{B+1}}\ln(N)^{\frac{B}{B+1}})$ over time horizon $T$ relative to the single best arm in hindsight. This introduces a trade-off between the budget $B$ and the single-best-arm regret, and we proceed to investigate several applications of this trade-off. First, we observe that algorithms which use standard regret minimizers as subroutines can sometimes be adapted by replacing these subroutines with FPML, and we use this to generalize existing algorithms for Online Submodular Function Maximization [Streeter and Golovin, 2008] in both the full feedback and semi-bandit feedback settings. Next, we empirically evaluate our new algorithms on an online black-box hyperparameter optimization problem. Finally, we show how FPML can lead to new algorithms for Linear Programming which require stronger oracles at the benefit of fewer oracle calls.
1 Introduction
Adversarial online learning is a well-studied framework for sequential decision making with numerous
applications. In each round $t = 1, \ldots, T$, an adversary chooses a hidden cost function $c_t : \mathcal{A} \to [0,1]$ from a set of arms $\mathcal{A}$ to costs in $[0,1]$. An algorithm must then choose an arm $a_t \in \mathcal{A}$, and incurs cost $c_t(a_t)$. In the full feedback setting (Online Learning with Experts (OLwE)), the algorithm then observes the cost function $c_t$. In the partial feedback setting (Multi-Armed Bandits (MAB)) the algorithm only observes the incurred cost $c_t(a_t)$. The objective is to find algorithms which minimize regret, defined as the difference between the algorithm's cumulative cost and the cumulative cost of the single best arm in hindsight.

In this paper we consider a search-like variant of these problems where in each round one can pick a subset of arms $S_t \subset \mathcal{A}$ with $|S_t| = B \geq 1$, and keep the arm with the smallest cost. This variant appears naturally in many settings, including:
1. Online algorithm portfolios [Gomes and Selman, 2001]: In each round $t$, one receives a problem instance $x_t$, and can pick a subset $S_t \subset \mathcal{A}$ of algorithms to run in parallel to solve $x_t$. For example, $x_t$ could be a boolean satisfiability (SAT) problem, and $\mathcal{A}$ could be a collection of different SAT solving algorithms. We let $c_t(a) = 0$ if $a$ solves $x_t$ and $c_t(a) = 1$ otherwise. Then if any $a \in S_t$ finds a solution to $x_t$ we incur $0$ cost in this round (a minimal sketch of this cost encoding is given after this list). Another example is online hyperparameter optimization (see Section 4).
*Equal contribution.
2. Online bidding [Chen et al., 2016a]: In each round $t$, an auctioneer sets up a first-price auction for bidders $S_t \subset \mathcal{A}$. Each bidder $a \in \mathcal{A}$ has a price $1 - c_t(a)$ they are willing to pay, and the auctioneer receives $\max_{a \in S_t} 1 - c_t(a) = 1 - \min_{a \in S_t} c_t(a)$, and so maximizing revenue is equivalent to minimizing costs.
3. Adaptive network routing [Awerbuch and Kleinberg, 2008]: In each round $t$, a network router receives a data packet $x_t$ and can pick a selection of network routes $S_t \subset \mathcal{A}$ to send it to its destination via in parallel. Let the cost $c_t(a)$ of a route $a$ be the total time taken for $x_t$ to reach its destination via $a$; the router receives cost $\min_{a \in S_t} c_t(a)$ equal to the smallest delay.
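To make the min-of-costs feedback concrete, here is a minimal Python sketch of the cost encoding from the algorithm-portfolio example in item 1. The solver interface and the toy two-solver portfolio below are hypothetical placeholders, not artifacts from the paper.

```python
from typing import Callable, Dict, List

# Hypothetical solver interface: a solver maps a problem instance to True (solved) or False.
Solver = Callable[[str], bool]

def round_costs(instance: str, solvers: Dict[str, Solver]) -> Dict[str, float]:
    """c_t(a) = 0 if solver a solves instance x_t, and 1 otherwise."""
    return {name: 0.0 if solve(instance) else 1.0 for name, solve in solvers.items()}

def incurred_cost(chosen: List[str], costs: Dict[str, float]) -> float:
    """The learner keeps the best arm it picked: c_t(S_t) = min_{a in S_t} c_t(a)."""
    return min(costs[a] for a in chosen)

# Toy usage with placeholder solvers that just check a property of the instance string.
solvers = {
    "solver_A": lambda x: x.endswith("easy"),
    "solver_B": lambda x: len(x) < 10,
}
costs = round_costs("instance_easy", solvers)
print(incurred_cost(["solver_A", "solver_B"], costs))  # prints 0.0: solver_A succeeds
```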
In many applications the budget $B$ is a restricted resource (e.g. compute time or number of cores) we would like to keep small; this paper studies how one can trade off budget resources for better guarantees on the standard regret objective.
Formally, for any randomized algorithm ALG which chooses subset $S_t \subset \mathcal{A}$ in round $t$, and thus incurs cost $c_t(S_t) := \min_{a \in S_t} c_t(a)$, define
$$R_T(\mathrm{ALG}) := \max_{c_1,\ldots,c_T} \mathbb{E}\left[\sum_{t=1}^{T} c_t(S_t) - \min_{a \in \mathcal{A}} \sum_{t=1}^{T} c_t(a)\right]$$
to be the worst-case expected regret of ALG relative to the single best arm in hindsight, where the expectation is with respect to the randomness of ALG.* What guarantees can we give on $R_T$ as a function of our budget $B$? In the full feedback setting when $B = 1$, this is the standard OLwE problem where it is known that $R_T = \Omega(\sqrt{T})$ and there are algorithms which achieve $R_T \leq 2\sqrt{T\ln(N)}$ [Lattimore and Szepesvári, 2020], where $N = |\mathcal{A}|$. When $B = N$ the algorithm which chooses $S_t = \mathcal{A}$ in each round achieves $R_T = 0$. But what bounds on $R_T$ can be achieved in the intermediate regime when $1 < B < N$? To the best of our knowledge this question has not been directly answered by any prior work.
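As a concrete illustration of the quantities just defined (not an algorithm from the paper), the following sketch computes the realized regret of an arbitrary subset-choosing policy on a fixed cost sequence; the uniformly random policy used here is only a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, B = 1000, 20, 3

# A fixed (oblivious) cost sequence: costs[t, a] = c_t(a) in [0, 1].
costs = rng.random((T, N))

# Placeholder policy: pick B distinct arms uniformly at random each round.
chosen = [rng.choice(N, size=B, replace=False) for _ in range(T)]

alg_cost = sum(costs[t, chosen[t]].min() for t in range(T))  # sum_t c_t(S_t)
best_arm_cost = costs.sum(axis=0).min()                      # min_a sum_t c_t(a)
print("realized regret:", alg_cost - best_arm_cost)
```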
1.1 Contributions
Theoretical results: We present a new algorithm for this learning problem called Follow the Perturbed Multiple Leaders (FPML), and show that in the full feedback setting $R_T(\mathrm{FPML}) \leq \mathcal{O}(T^{\frac{1}{B+1}}\ln(N)^{\frac{B}{B+1}})$. This allows for a direct trade-off between the budget $B$ and the regret bound (in particular, allowing resources $B \geq \Omega(\ln(T))$ leads to regret constant in $T$) and recovers the familiar $\mathcal{O}(\sqrt{T\ln(N)})$ bound when $B = 1$. We then show that in the semi-bandit feedback setting (where the algorithm finds out only the costs of the arms it chooses) this bound can be converted to $R_T(\mathrm{FPML}) \leq \mathcal{O}(T^{\frac{1}{B+1}}(K\ln(N))^{\frac{B}{B+1}})$ if one has unbiased cost estimators bounded in $[0, K]$.
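To see the budget/regret trade-off numerically, the short script below evaluates the explicit bound $2T^{\frac{1}{B+1}}(1+\ln(N))^{\frac{B}{B+1}}$ from Theorem 2 for a few budgets; the particular values of $T$ and $N$ are arbitrary choices for illustration.

```python
import math

T, N = 100_000, 50

def fpml_regret_bound(B: int) -> float:
    """Regret bound 2 * T^(1/(B+1)) * (1 + ln N)^(B/(B+1)) from Theorem 2."""
    return 2 * T ** (1 / (B + 1)) * (1 + math.log(N)) ** (B / (B + 1))

for B in (1, 2, 4, 8, int(math.log(T))):
    print(f"B = {B:2d}: regret bound ~ {fpml_regret_bound(B):.1f}")
```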
We also consider the more general problem of Online Submodular Function Maximization (OSFM), for which prior work gives an online greedy algorithm OG [Streeter and Golovin, 2008]. When given a budget of $B$ per round, OG achieves regret $\mathcal{O}(\sqrt{TB\ln(N)})$ with respect to $(1 - e^{-1})\,\mathrm{OPT}(B)$, where $\mathrm{OPT}(B)$ is the performance of the best fixed length-$B$ schedule (see Section 3 for a formal definition of OSFM). Note that in this guarantee the regret benchmark is a function of the algorithm budget. By replacing a subroutine in OG with FPML, we generalize OG to a new algorithm $\mathrm{OG}_{\mathrm{hybrid}}$. Unlike OG, $\mathrm{OG}_{\mathrm{hybrid}}$ is able to give regret bounds against benchmarks which are decoupled from the algorithm budget. This allows one to more easily quantify the trade-off of increasing the budget against a fixed regret objective. As a special case, we are able to show that having a budget of $B = B_0\lceil\ln(T)^2\rceil$ per round allows one to achieve regret $\mathcal{O}(B_0\ln(T)\ln(N))$ with respect to $\mathrm{OPT}(B_0)$. One interpretation of this result is that if you are willing to increase your budget (e.g. runtime) by a factor of $\ln(T)^2$, you are able to improve your performance guarantee benchmark from $(1 - e^{-1})\,\mathrm{OPT}(B_0)$ to $\mathrm{OPT}(B_0)$. Likewise, your regret growth rate in terms of the number of rounds changes from $\mathcal{O}(\sqrt{T})$ to $\mathcal{O}(\ln(T))$.
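As a rough numerical illustration of the budget inflation involved (the value of $T$ is an arbitrary choice, not a figure from the paper): with $T = 10^4$ rounds,
$$\ln(T)^2 = (4\ln 10)^2 \approx 84.8, \qquad B = B_0\lceil\ln(T)^2\rceil = 85\,B_0,$$
so paying a factor-$85$ increase in budget upgrades the benchmark from $(1 - e^{-1})\,\mathrm{OPT}(B_0) \approx 0.632\,\mathrm{OPT}(B_0)$ to $\mathrm{OPT}(B_0)$.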
Finally, in Section 5 we show how to use FPML to generalize a technique for solving linear programs assuming access to an oracle which solves relaxed forms of the linear program. To obtain an $\varepsilon$-approximate solution to the linear program requires $\frac{1}{\varepsilon^{\frac{B+1}{B}}}(4\rho)^{\frac{B+1}{B}}(1 + \ln(n))$ oracle calls, where the parameters $(B, \rho)$ are related to the power of the oracle and $n$ is the number of linear constraints. The case $B = 1$ coincides with known results.

*Here we consider an oblivious adversary model for simplicity, but we believe the results of this paper carry through to adaptive adversaries as well.
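Taking the oracle-call count displayed above at face value, the first two budget levels compare as follows (a back-of-envelope comparison, not a claim from the paper):
$$B = 1:\ \frac{(4\rho)^{2}(1+\ln n)}{\varepsilon^{2}} = \frac{16\rho^{2}(1+\ln n)}{\varepsilon^{2}}, \qquad B = 2:\ \frac{(4\rho)^{3/2}(1+\ln n)}{\varepsilon^{3/2}} = \frac{8\rho^{3/2}(1+\ln n)}{\varepsilon^{3/2}},$$
so a stronger ($B = 2$) oracle reduces the dependence on both $\varepsilon$ and the width parameter $\rho$.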
Experimental results: We benchmark both FPML and $\mathrm{OG}_{\mathrm{hybrid}}$ on an online black-box hyperparameter optimization problem based on the 2020 NeurIPS BBO challenge [Turner et al., 2021]. We find that both these new algorithms outperform OG for various compute budgets. We are able to explain why this happens for this specific dataset, and discuss the scenarios under which each algorithm would perform better.
Techniques: Minimizing $R_T$ is an important subroutine for a large variety of applications including Linear Programming, Boosting, and solving zero-sum games [Arora et al., 2012]. Traditionally an experts algorithm such as Hedge [Littlestone and Warmuth, 1994], which pulls a single arm per round, will be used as a subroutine to minimize $R_T$. We highlight how in the cases of OSFM and Linear Programming, one can simply replace a single-arm $R_T$-minimizing subroutine with FPML and get performance bounds with little or no alteration to the original proofs. The resulting algorithms have improved bounds (due to improved bounds on $R_T$ when $B > 1$) at the cost of qualitatively changing the algorithm (e.g. requiring a larger budget or more powerful oracle). This is significant because it highlights the potential of how bounds on $R_T$ when $B > 1$ can lead to new results in other application areas. In Section 2.1 we also highlight how the proof techniques of Kalai and Vempala [2005] for bounding $R_T$ in the traditional experts setting can naturally be generalized to the case when $B > 1$, which is of independent interest.
1.2 Relation to prior work
One can alternatively formulate the problem we consider as receiving the maximum reward $r_t(a) = 1 - c_t(a)$ of each arm chosen instead of the minimum cost. In this maximum-of-rewards formulation, the problem fits within the OSFM framework where (a) all actions are unit-time and (b) the submodular job function is always a maximum of rewards. The rewards formulation of the problem has also been separately studied as the K-MAX problem (here $K = B$) [Chen et al., 2016a]. In the OSFM setting, Streeter and Golovin [2008] give an online greedy approximation algorithm which guarantees $\mathbb{E}[(1 - e^{-1})\mathrm{OPT}(B) - \mathrm{Reward}_T] \leq \mathcal{O}(\sqrt{TB\ln(N)})$ in the full feedback adversarial setting, where $\mathrm{OPT}(B)$ is the cumulative reward of the best fixed subset of $B$ arms in hindsight, and $\mathrm{Reward}_T$ is the cumulative reward of the algorithm. A similar bound of $\mathcal{O}(B\sqrt{TN\ln(N)})$ can be given in a semi-bandit feedback setting. Conversely, in the full feedback setting, Streeter and Golovin [2007] show that any algorithm has worst-case regret $\mathbb{E}[\mathrm{OPT}(B) - \mathrm{Reward}_T] \geq \Omega(\sqrt{TB\ln(N/B)})$ when one receives the maximum of rewards in each round. Chen et al. [2016a] study the K-MAX problem and other non-linear reward functions in the stochastic combinatorial multi-armed bandit setting. Assuming the rewards satisfy certain distributional assumptions, they give an algorithm which achieves distribution-independent regret bounds of $\mathbb{E}[(1 - \varepsilon)\mathrm{OPT}(B) - \mathrm{Reward}_T] \leq \mathcal{O}(\sqrt{TBN\ln(T)})$ for $\varepsilon > 0$ with semi-bandit feedback. Note however that we consider the adversarial setting in this paper.
More broadly, these problems fall within the combinatorial online learning setting where an algorithm
may pull a subset of arms in each round. Much prior work has focused on combinatorial bandits
where the reward is linear in the subset of arms chosen, which can model applications including online
advertising and online shortest paths [Cesa-Bianchi and Lugosi, 2012, Audibert et al., 2014, Combes
et al., 2015]. The case of non-linear reward is comparatively less studied, but having non-linear
rewards (such as max) allows one to model a wider variety of problems including online expected
utility maximization [Li and Deshpande, 2011, Chen et al., 2016a]. As some examples of prior work in the stochastic setting, Gopalan et al. [2014] use Thompson Sampling to deal with non-linear rewards that are functions of subsets of arms (including the max function), but require the rewards to come from a known parametric distribution. Chen et al. [2016b] consider a model where the subset of arms pulled is randomized based on pulling a 'super-arm', and the reward is a non-linear function of the values of the arms pulled. In the adversarial setting, Han et al. [2021] study the combinatorial MAB problem when rewards can be expressed as a degree-$d$ polynomial.
In contrast to prior work which focuses on giving algorithms which compete against benchmarks which have the same budget as the algorithm, this work is concerned with the trade-off between regret bounds and budget size. We focus on giving regret bounds against $\mathrm{OPT}(1)$, and we use this result in Section 3 to get regret bounds against $\mathrm{OPT}(B_0)$ for $B_0 < B$ in OSFM. Decoupling the regret benchmark $\mathrm{OPT}(B_0)$ from the algorithm budget $B$ can be useful when one would like to control the strength of a regret bound against a specific target $\mathrm{OPT}(B_0)$ for theoretical or applied reasons. For example, Arora et al. [2012] survey a wide variety of applications which rely on bounding $R_T$, but bounds such as $\mathbb{E}[(1 - e^{-1})\mathrm{OPT}(B) - \mathrm{Reward}_T] \leq \mathcal{O}(\sqrt{TB\ln(N)})$ do not immediately imply useful bounds on $\mathrm{OPT}(1) - \mathrm{Reward}_T$.
2 Follow the Perturbed Multiple Leaders
We begin by considering the full feedback setting. We first check that allowing the algorithm to choose $B > 1$ arms per round, while only competing against the best single fixed arm in hindsight, does not make the problem trivial. We do this by showing that any deterministic algorithm with budget $B < N$ still achieves linear regret in the number of rounds. This is achieved by setting $c_t(a) = 1$ if $a \in S_t$, and $c_t(a) = 0$ otherwise.

Proposition 1. In the full feedback setting, any deterministic algorithm with arm budget $B \leq N$ per round has worst-case regret $R_T \geq \left(1 - \frac{B}{N}\right)T$.
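The averaging argument behind this bound is short (a sketch, not quoted from the paper): against the cost sequence above the algorithm pays $c_t(S_t) = 1$ in every round, while only $B$ units of cost are handed out per round, so some fixed arm accumulates cost at most $BT/N$:
$$R_T \geq \sum_{t=1}^{T} c_t(S_t) - \min_{a\in\mathcal{A}} \sum_{t=1}^{T} c_t(a) \geq T - \frac{1}{N}\sum_{a\in\mathcal{A}}\sum_{t=1}^{T} c_t(a) = T - \frac{BT}{N} = \left(1 - \frac{B}{N}\right)T.$$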
Likewise, it can be shown that the algorithm which chooses a uniformly random subset of $B$ arms in each round has worst-case expected regret at least $T\left(1 - \frac{B}{N}\right)^B$ (achieved by having one arm have cost $0$ across all rounds and every other arm having cost $1$). These two observations show that any solution for achieving sub-linear regret in $T$ requires randomization which depends in some non-trivial way on the prior observed costs, even when $B > 1$.
2.1 Generalizing Follow the Perturbed Leader
Choosing the current lowest perturbed-cost arm in each round, Follow the Perturbed Leader (FPL) [Kalai and Vempala, 2005], is a well-known regret minimization technique which achieves optimal worst-case regret against adaptive adversaries in the OLwE setting. In this section we generalize the FPL algorithm to Follow the Perturbed Multiple Leaders (FPML). In each round, FPML perturbs the cumulative costs of each arm by adding noise, and then picks the $B$ arms with lowest cumulative perturbed cost. This is precisely FPL when $B = 1$. We show how one can extend the proof techniques of Kalai and Vempala [2005] in a natural way to prove that FPML achieves worst-case regret $R_T(\mathrm{FPML}) \leq 2T^{\frac{1}{B+1}}(1 + \ln(N))^{\frac{B}{B+1}}$.
Algorithm 1 FPML($B$, $\varepsilon$)
Require: $N \geq B \geq 1$, $\varepsilon > 0$.
Initialize the cumulative cost $C_0(a) \leftarrow 0$ for each arm $a \in \mathcal{A}$.
for round $t = 1, \ldots, T$ do
  1. For each arm $a \in \mathcal{A}$, draw a noise perturbation $p_t(a) \sim \frac{1}{\varepsilon}\mathrm{Exp}$.
  2. Calculate the perturbed cumulative costs for round $t-1$: $\tilde{C}_{t-1}(a) \leftarrow C_{t-1}(a) - p_t(a)$.
  3. Pull the $B$ arms with the lowest perturbed cumulative costs according to $\tilde{C}_{t-1}$. Break ties arbitrarily.
  4. Update the cumulative costs for each arm: $C_t(a) \leftarrow C_{t-1}(a) + c_t(a)$.
end for
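Below is a minimal NumPy sketch of Algorithm 1, assuming full feedback (the entire cost vector is revealed after each round) and reading $\frac{1}{\varepsilon}\mathrm{Exp}$ as an exponential distribution with mean $1/\varepsilon$; the random cost matrix stands in for an adversarial sequence.

```python
import numpy as np

def fpml(costs: np.ndarray, B: int, eps: float, seed: int = 0) -> float:
    """Run FPML (Algorithm 1) on a T x N cost matrix and return the total cost sum_t c_t(S_t)."""
    rng = np.random.default_rng(seed)
    T, N = costs.shape
    C = np.zeros(N)                                   # cumulative costs C_{t-1}(a)
    total = 0.0
    for t in range(T):
        p = rng.exponential(scale=1.0 / eps, size=N)  # p_t(a) ~ (1/eps) Exp
        perturbed = C - p                             # perturbed cumulative costs
        S = np.argsort(perturbed)[:B]                 # B arms with lowest perturbed cost
        total += costs[t, S].min()                    # incur c_t(S_t) = min over chosen arms
        C += costs[t]                                 # full feedback: update every arm
    return total

# Toy usage on a random cost sequence, with eps chosen as in Theorem 2.
rng = np.random.default_rng(1)
T, N, B = 2000, 30, 3
costs = rng.random((T, N))
eps = ((np.log(N) + 1) / T) ** (1 / (B + 1))
print("realized regret:", fpml(costs, B, eps) - costs.sum(axis=0).min())
```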
Theorem 2. In the full feedback setting, where $S_t \subset \mathcal{A}$ is the subset of arms chosen by FPML in round $t$, we have:
$$\max_{c_1,\ldots,c_T} \mathbb{E}\left[(1 - \varepsilon^B)\sum_{t=1}^{T} c_t(S_t) - \min_{a\in\mathcal{A}}\sum_{t=1}^{T} c_t(a)\right] \leq \frac{1 + \ln(N)}{\varepsilon}.$$
In particular, for $\varepsilon = ((\ln(N) + 1)/T)^{1/(B+1)}$, we have
$$R_T(\mathrm{FPML}) \leq 2T^{\frac{1}{B+1}}(1 + \ln(N))^{\frac{B}{B+1}}.$$
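One way to see how the stated choice of $\varepsilon$ yields the second bound (a sketch, using only $c_t(S_t) \in [0,1]$ so that $\sum_t c_t(S_t) \leq T$): rearranging the first display gives
$$R_T \leq \varepsilon^{B}\,\mathbb{E}\Bigl[\sum_{t=1}^{T} c_t(S_t)\Bigr] + \frac{1+\ln(N)}{\varepsilon} \leq \varepsilon^{B} T + \frac{1+\ln(N)}{\varepsilon},$$
and with $\varepsilon = \bigl(\frac{1+\ln(N)}{T}\bigr)^{\frac{1}{B+1}}$ both terms equal $T^{\frac{1}{B+1}}(1+\ln(N))^{\frac{B}{B+1}}$.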
The proof follows the same three high-level steps which appear in Kalai and Vempala [2005] for FPL, but we extend these ideas to the case where $B > 1$. We first observe that the algorithm which picks the $B$ lowest cumulative cost arms in each round only incurs regret when the best arm in round $t$ is not one of the best $B$ arms in round $t-1$.
Lemma 3. Consider a fixed sequence of cost functions $c_1, \ldots, c_T$. Let $a^{*,j}_t$ be the $j$th lowest cumulative cost arm in hindsight after the first $t$ rounds, breaking ties arbitrarily. Let $S^*_t := \{a^{*,j}_{t-1} \mid j \in [B]\}$ be the set of the $B$ lowest cost arms at the end of round $t-1$. Then for each $i \in [T]$,
$$R_i := \sum_{t=1}^{i} c_t(S^*_t) - \min_{a\in\mathcal{A}}\sum_{t=1}^{i} c_t(a) \;\leq\; \sum_{t=1}^{i} \mathbb{1}\left[a^{*,1}_t \notin S^*_t\right]$$
and $R_i - R_{i-1} \leq \mathbb{1}\left[a^{*,1}_i \notin S^*_i\right]$.
This is a generalization of the familiar result that when $B = 1$, following the leader has regret bounded by the number of times the leader is overtaken [Kalai and Vempala, 2005].
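The inequality in Lemma 3 is easy to check numerically; the following quick simulation (not part of the paper) compares both sides of the bound for the follow-the-$B$-leaders strategy on a random cost sequence.

```python
import numpy as np

rng = np.random.default_rng(2)
T, N, B = 500, 10, 3
costs = rng.random((T, N))

C = np.zeros(N)                    # cumulative costs after round t-1
alg_cost, leader_misses = 0.0, 0
for t in range(T):
    S = np.argsort(C)[:B]          # S*_t: the B lowest cumulative cost arms so far
    alg_cost += costs[t, S].min()
    C += costs[t]
    if np.argmin(C) not in S:      # best arm after round t not among the previous top B
        leader_misses += 1

regret = alg_cost - C.min()        # R_T for this fixed sequence
print(f"regret {regret:.2f} <= leader misses {leader_misses}: {regret <= leader_misses}")
```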
The second step is to argue that if the cumulative costs are perturbed slightly, it becomes unlikely that the event $\{a^{*,1}_t \notin S^*_t\}$ will occur. One way to see this is as follows: fix a round $t$, and let $C_{t-1}(a)$ be the cumulative cost of $a$ at the end of round $t-1$. Let $M := C_{t-1}(a^{*,B+1}_{t-1})$. Then every $a \in S^*_t$ has $C_{t-1}(a) \leq M$. If it is also true that $C_{t-1}(a) < M - c_t(a)$ for any $a \in S^*_t$ then the event $\{a^{*,1}_t \notin S^*_t\}$ cannot occur. This is because $C_t(a) < (M - c_t(a)) + c_t(a) = M$ but any $a' \in \mathcal{A} \setminus S^*_t$ has $C_t(a') \geq M$, so $a' \neq a^{*,1}_t$. If we had initially perturbed each $C_{t-1}(a)$ by subtracting independent exponential noise $p(a) \sim \frac{1}{\varepsilon}\mathrm{Exp}$, then conditional on $M$ the event $\{C_{t-1}(a) < M - c_t(a)\}$ is jointly independent for each $a \in S^*_t$. Moreover the probability of this inequality not holding is equal to $\mathbb{P}[p(a) < v + c_t(a) \mid p(a) \geq v]$ for $v$ equal to the unperturbed cost of $a$ at round $t-1$ minus $M$, which is bounded by $\varepsilon c_t(a)$ (due to the memorylessness property of the exponential distribution).
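The memorylessness step amounts to the following calculation (assuming $\mathrm{Exp}$ denotes a unit-rate exponential, so that $p(a)$ has rate $\varepsilon$); when $v \geq 0$ the first step is an equality by memorylessness, and when $v < 0$ it holds because $v + c_t(a) \leq c_t(a)$:
$$\mathbb{P}\bigl[p(a) < v + c_t(a) \,\big|\, p(a) \geq v\bigr] \leq \mathbb{P}\bigl[p(a) < c_t(a)\bigr] = 1 - e^{-\varepsilon c_t(a)} \leq \varepsilon\, c_t(a).$$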
Lemma 4. Fix a sequence of cost functions $c_1, \ldots, c_T$. Let $C_i(a) = \sum_{t=1}^{i} c_t(a)$ and $\tilde{C}_i(a) = C_i(a) - p(a)$ be the perturbed cumulative cost of arm $a$ at the end of round $i$, where $p(a) \sim \frac{1}{\varepsilon}\mathrm{Exp}$. Let $\tilde{a}^{*,j}_t$ be the $j$th lowest cumulative cost arm in hindsight after the first $t$ rounds using these perturbed costs, and let $\tilde{S}^*_t = \{\tilde{a}^{*,j}_{t-1} \mid j \in [B]\}$. Then
$$\mathbb{E}\left[\mathbb{1}\left[\tilde{a}^{*,1}_t \notin \tilde{S}^*_t\right]\right] \leq \mathbb{E}\left[\varepsilon^B c_t(\tilde{S}^*_t)\right].$$
Again, when $B = 1$ this argument and bound coincide with the argument given by Kalai and Vempala [2005].
The final step is to combine Lemmas 3 and 4 to argue that FPML achieves expected regret at most $\mathbb{E}\bigl[\varepsilon^B \sum_{t=1}^{T} c_t(\tilde{S}^*_t)\bigr]$ with respect to the perturbed cumulative cost $\tilde{C}_T$. Since $\mathbb{E}\bigl[\max_{a\in\mathcal{A}} p(a)\bigr] \leq \frac{1 + \ln(N)}{\varepsilon}$, we can argue we also achieve low expected regret with respect to the unperturbed cost $C_T$.
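The bound on the expected maximum perturbation is the standard fact about the maximum of $N$ i.i.d. rate-$\varepsilon$ exponentials (a brief sketch, using the order-statistics decomposition into independent exponential spacings):
$$\mathbb{E}\Bigl[\max_{a\in\mathcal{A}} p(a)\Bigr] = \frac{1}{\varepsilon}\sum_{k=1}^{N}\frac{1}{k} = \frac{H_N}{\varepsilon} \leq \frac{1+\ln(N)}{\varepsilon}.$$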
In the setting of this paper, drawing new random perturbations $p_t(a)$ in each round is not strictly necessary (we can take $p_t(a) = p_1(a)$ for $t > 1$), but it is necessary to achieve regret bounds when cost functions can depend on prior arm choices of the algorithm (the adaptive adversarial setting). When the costs are fixed in advance, as in the setting of this paper, the expected regret in either case is the same.
Probabilistic guarantees: One advantage of this proof technique is that the regret is bounded using the positive random variable $\sum_{t=1}^{T}\mathbb{1}\left[\tilde{a}^{*,1}_t \notin \tilde{S}^*_t\right]$. This means that one can apply methods like Markov's inequality to give a probabilistic guarantee of small regret, which is substantially stronger than saying the regret is small in expectation.
Comments on settings of parameters: When $B = 1$ we recover the standard $\mathcal{O}(\sqrt{T\ln(N)})$ regret bound for the OLwE problem. For $B > 1$, the regret growth rate as a function of the number of rounds is $T^{\frac{1}{B+1}}$. In particular, when $B = \Omega(\ln(T))$ grows slowly with the number of rounds, the expected regret becomes $\mathcal{O}(\ln(N))$ and does not grow with the number of rounds $T$. If we use a tighter inequality in the proof of Lemma 4, it is possible to get constant expected regret when $B = \ln(T)\ln(|\mathcal{A}|)$ grows slowly with the number of arms and rounds.
Lower bounds: A standard technique for constructing lower bounds in the online experts setting with $B = 1$ is to consider costs which are i.i.d. Bernoulli($p$) [Lattimore and Szepesvári, 2020]. Unfortunately this technique fails when $B > 1$ because the expected cost of the minimum of $B$ i.i.d. Bernoulli random variables is generally smaller than the expected cost of the best arm in hindsight