
the parameters $(B, \rho)$ are related to the power of the oracle and $n$ is the number of linear constraints. The case $B = 1$ coincides with known results.
Experimental results: We benchmark both FPML and OGhybrid on an online black-box hyperparameter optimization problem based on the 2020 NeurIPS BBO challenge [Turner et al., 2021]. We find that both of these new algorithms outperform OG for various compute budgets. We are able to explain why this happens for this specific dataset, and discuss the scenarios under which each algorithm would perform better.
Techniques: Minimizing $R^*_T$ is an important subroutine for a large variety of applications including Linear Programming, Boosting, and solving zero-sum games [Arora et al., 2012]. Traditionally an experts algorithm such as Hedge [Littlestone and Warmuth, 1994], which pulls a single arm per round, is used as a subroutine to minimize $R^*_T$. We highlight how, in the cases of OSFM and Linear Programming, one can simply replace a single-arm $R^*_T$-minimizing subroutine with FPML and get performance bounds with little or no alteration to the original proofs. The resulting algorithms have improved bounds (due to improved bounds on $R^*_T$ when $B > 1$) at the cost of qualitatively changing the algorithm (e.g. requiring a larger budget or a more powerful oracle). This is significant because it highlights how bounds on $R^*_T$ when $B > 1$ can lead to new results in other application areas. In Section 2.1 we also highlight how the proof techniques of Kalai and Vempala [2005] for bounding $R^*_T$ in the traditional experts setting can naturally be generalized to the case when $B > 1$, which is of independent interest.
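To make the drop-in replacement concrete, the sketch below contrasts a single-arm Hedge subroutine with an FPML-style subroutine that pulls the $B$ perturbed leaders in each round and is charged the minimum cost over the pulled set. This is a minimal illustration only: the exponential perturbations, the learning rates, and the helper names hedge and fpml_sketch are our assumptions for exposition, not the exact algorithm or parameters analyzed in this paper.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def hedge(cost_matrix, eta=0.1):
    # Classic single-arm experts subroutine: keep a weight per arm,
    # sample one arm per round, full-information multiplicative update.
    # cost_matrix has shape (T, N) with per-round costs in [0, 1].
    T, N = cost_matrix.shape
    log_w = np.zeros(N)
    total_cost = 0.0
    for t in range(T):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()
        arm = rng.choice(N, p=p)
        total_cost += cost_matrix[t, arm]
        log_w -= eta * cost_matrix[t]  # downweight costly arms
    return total_cost

def fpml_sketch(cost_matrix, B, eta=1.0):
    # FPML-style subroutine (our sketch): perturb the cumulative costs and
    # pull the B arms that look best (the "perturbed leaders"); the round's
    # cost is the minimum cost over the pulled set.
    T, N = cost_matrix.shape
    cum_cost = np.zeros(N)
    total_cost = 0.0
    for t in range(T):
        noise = rng.exponential(scale=eta, size=N)
        pulled = np.argsort(cum_cost - noise)[:B]   # B smallest perturbed costs
        total_cost += cost_matrix[t, pulled].min()  # pay the best pulled arm
        cum_cost += cost_matrix[t]
    return total_cost
\end{verbatim}

In a reduction such as the LP or boosting applications mentioned above, only this subroutine changes: it now returns a set of $B$ arms per round rather than a single arm, and the surrounding analysis is charged the minimum cost over that set.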
1.2 Relation to prior work
One can alternatively formulate the problem we consider as receiving the maximum reward $r_t(a) = 1 - c_t(a)$ of each arm chosen, instead of the minimum cost. In this maximum-of-rewards formulation,
the problem fits within the OSFM framework where (a) all actions are unit-time and (b) the submodular job function is always a maximum of rewards. The rewards formulation of the problem has also been separately studied as the K-MAX problem (here $K = B$) [Chen et al., 2016a]. In the OSFM setting, Streeter and Golovin [2008] give an online greedy approximation algorithm which guarantees $\mathbb{E}[(1 - e^{-1})\,\mathrm{OPT}(B) - \mathrm{Reward}_T] \le O(\sqrt{TB\ln(N)})$ in the full-feedback adversarial setting, where $\mathrm{OPT}(B)$ is the cumulative reward of the best fixed subset of $B$ arms in hindsight, and $\mathrm{Reward}_T$ is the cumulative reward of the algorithm. A similar bound of $O(B\sqrt{TN\ln(N)})$ can be given in a semi-bandit feedback setting. Conversely, in the full-feedback setting, Streeter and Golovin [2007] show that any algorithm has worst-case regret $\mathbb{E}[\mathrm{OPT}(B) - \mathrm{Reward}_T] \ge \Omega(\sqrt{TB\ln(N/B)})$ when one receives the maximum of rewards in each round. Chen et al. [2016a] study the K-MAX problem and other non-linear reward functions in the stochastic combinatorial multi-armed bandit setting. Assuming the rewards satisfy certain distributional assumptions, they give an algorithm which achieves distribution-independent regret bounds of $\mathbb{E}[(1 - \varepsilon)\,\mathrm{OPT}(B) - \mathrm{Reward}_T] \le O(\sqrt{TBN\ln(T)})$ for $\varepsilon > 0$ with semi-bandit feedback. Note, however, that we consider the adversarial setting in this paper.
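For concreteness, the cost and reward formulations are linked by an elementary identity. Writing $S_t$ for the set of arms pulled in round $t$ (the notation $S_t$ is ours, introduced only for this display),
\[
\min_{a \in S_t} c_t(a) \;=\; 1 - \max_{a \in S_t} r_t(a) \qquad \text{where } r_t(a) = 1 - c_t(a),
\]
so minimizing the cumulative minimum cost is equivalent to maximizing the cumulative maximum reward, which is exactly the K-MAX objective with $K = B$.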
More broadly, these problems fall within the combinatorial online learning setting, where an algorithm may pull a subset of arms in each round. Much prior work has focused on combinatorial bandits where the reward is linear in the subset of arms chosen, which can model applications including online advertising and online shortest paths [Cesa-Bianchi and Lugosi, 2012, Audibert et al., 2014, Combes et al., 2015]. The case of non-linear rewards is comparatively less studied, but having non-linear rewards (such as max) allows one to model a wider variety of problems, including online expected utility maximization [Li and Deshpande, 2011, Chen et al., 2016a]. As examples of prior work in the stochastic setting, Gopalan et al. [2014] use Thompson Sampling to deal with non-linear rewards that are functions of subsets of arms (including the max function), but require the rewards to come from a known parametric distribution. Chen et al. [2016b] consider a model where the subset of arms pulled is randomized based on pulling a ‘super-arm’, and the reward is a non-linear function of the values of the arms pulled. In the adversarial setting, Han et al. [2021] study the combinatorial MAB problem when rewards can be expressed as a $d$-degree polynomial.
In contrast to prior work, which focuses on giving algorithms that compete against benchmarks with the same budget as the algorithm, this work is concerned with the trade-off between regret bounds and budget size. We focus on giving regret bounds against $\mathrm{OPT}(1)$, and we use this result in Section 3 to get regret bounds against $\mathrm{OPT}(B_0)$ for $B_0 < B$ in OSFM. Decoupling the regret