One Arrow, Two Kills: A Unified Framework for Achieving Optimal Regret Guarantees in Sleeping Bandits

Pierre Gaillard (Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France; pierre.gaillard@inria.fr)
Aadirupa Saha (Toyota Technological Institute at Chicago (TTIC), US; aadirupa@ttic.edu)
Soham Dan (IBM Research, US; soham.dan@ibm.com; a major part of the work was done while the author was at the University of Pennsylvania)

arXiv:2210.14998v1 [cs.LG] 26 Oct 2022
Abstract
We address the problem of ‘Internal Regret’ in Sleeping Bandits in the fully adversarial setup, and draw connections between the different existing notions of sleeping regret in the multiarmed bandits (MAB) literature, analyzing their implications. Our first contribution is to propose a new notion of Internal Regret for sleeping MAB. We then propose an algorithm that yields sublinear regret in that measure, even for a completely adversarial sequence of losses and availabilities. We further show that low sleeping internal regret always implies low external regret, as well as low policy regret for an i.i.d. sequence of losses. The main contribution of this work lies precisely in unifying the different existing notions of regret in sleeping bandits and in understanding the implications of one for another. Finally, we also extend our results to the setting of Dueling Bandits (DB), a preference-feedback variant of MAB, and propose a reduction-to-MAB idea to design a low-regret algorithm for sleeping dueling bandits with stochastic preferences and adversarial availabilities. The efficacy of our algorithms is justified through empirical evaluations.
1 Introduction
The problem of online sequential decision-making in the standard $K$-armed multiarmed bandit (MAB) is well studied in machine learning [4, 45] and used to model online decision-making problems under uncertainty. Due to their implicit exploration-vs-exploitation tradeoff, bandits are able to model clinical trials, movie recommendations, retail management, job scheduling, etc., where the goal is to keep pulling the ‘best item’ in hindsight by sequentially querying one item at a time and subsequently observing a noisy reward feedback for the queried arm [14, 5, 4, 1, 10]. However, from a practical viewpoint, the decision space (or arm space $\mathcal{A} = \{1, \ldots, K\}$) often changes over time due to the unavailability of some items: for example, some items might go out of stock in a retail store, some websites could be down, some restaurants might be closed, etc. This setting is studied in the multiarmed bandit (MAB) literature as sleeping bandits [22, 26, 20, 19], where at any round the set $S_t \subseteq \mathcal{A}$ of available actions could vary stochastically [26, 13] or adversarially [19, 23, 20]. Over the years, several lines of research have been conducted for sleeping multi-armed bandits (MAB) with different notions of regret performance, e.g., policy, ordering, or sleeping external regret [8, 26, 39].
Figure 1: One Arrow, Two Kills: the connections between our proposed notion of Sleeping Internal Regret and the different existing notions of regret for sleeping MAB, and their implications.

In this paper, we introduce a new notion of sleeping regret, called Sleeping Internal Regret, which helps to bridge the gaps between the different existing notions of sleeping regret in MAB. We show that our regret notion can be controlled in the fully adversarial setup, that it implies sleeping external regret in the fully adversarial setup (i.e., when both losses and item availabilities are adversarial), and that it implies policy regret in the stochastic setting (i.e., when losses are stochastic). We further propose an efficient $O(\sqrt{T})$ worst-case regret algorithm for sleeping internal regret. Finally, we also motivate the implications of our results for the Dueling Bandits (DB) framework, an online learning framework that generalizes the standard multiarmed bandit (MAB) [5] setting to identifying a set of ‘good’ arms from a fixed decision space (set of items) by querying preference feedback on actively chosen item pairs [49, 2, 52, 35, 36]. The main contributions can be listed as follows:
Connecting Existing Notions of Sleeping Regret. The first contribution (Sec. 2) lies in relating the existing notions of sleeping regret, given as follows:

The first one, sleeping external regret, is mostly used in prediction with expert advice [8, 16]: if the learner had played $j$ instead of $k_t$ at all rounds where $j$ was available, we want the learner not to incur large regret. It is widely used to design dynamic regret algorithms [28, 51, 11, 50, 46]. It has the advantage that efficient no-regret algorithms can be designed even when both the availabilities $S_t$ and the losses $\ell_t$ are adversarial.
The second one, called ordering regret, is mostly used in the bandit literature [23, 39, 21, 27]. It compares the cumulative loss of the learner with that of the best ordering $\sigma$, which selects the best available action according to $\sigma$ at every round. No efficient algorithm exists when both $\ell_t$ and $S_t$ are adversarial: either $S_t$ or $\ell_t$ should be i.i.d. [23].
We also note that in some works, policies $\pi$ (i.e., functions from subsets of $[K]$ to $[K]$) are considered instead of orderings $\sigma$; this is termed policy regret [26, 39]. The latter two are equivalent when the losses are i.i.d., or come from an oblivious adversary with stochastic sleeping.
General Notion of Sleeping Regret. Our second, and one of the primary, contributions lies in introducing a new notion of sleeping regret, called Internal Sleeping Regret (Definition 1), which we show unifies the different notions of sleeping regret under a general umbrella (see Fig. 1): We show that (i) low sleeping internal regret always implies low sleeping external regret, even under the fully adversarial setup; and (ii) for stochastic losses it also implies low ordering regret (equivalently, policy regret), even under adversarial availabilities. Thus we now have a tool, Sleeping Internal Regret, optimizing which simultaneously optimizes all the existing notions of sleeping regret (which also justifies the title of this work!) (Sec. 2.3).
Algorithm Design and Regret Implications. The main contribution of this work is to propose an efficient algorithm (SI-EXP3, Alg. 1) for Sleeping Internal Regret, with an $O(\sqrt{T})$ regret guarantee (Thm. 4). As motivated above, the generality of our regret notion further implies $O(\sqrt{T})$ external regret in any setting, and also ordering regret for i.i.d. losses (Rem. 3). We are the first to achieve this regret unification with a single algorithm (Sec. 3).
Extensions: Generalized Regret for Dueling Bandits (DB) and Algorithm. Another strength of Internal Sleeping Regret is that it can be used to design no-regret algorithms for the sleeping dueling bandits (DB) setup, a relative-feedback-based variant of standard MAB [52, 2, 7] (Sec. 4).
General Sleeping DB. Towards this, we propose a new and more unifying sleeping dueling bandits setup that allows the environment to play from different subsets of available dueling pairs ($A_t \subseteq [K]^2$) at each round $t$. This generalizes the standard DB setting, where $A_t = [K]^2$ without sleeping, as well as the Sleeping DB setup of [31], where $A_t = S_t \times S_t$.
Unifying Sleeping DB Regret. Next, taking cues from our notion of Sleeping Internal Regret for MAB, we propose a generalized dueling bandit regret, Internal Sleeping DB Regret (Eq. (10)), which unifies the classical dueling bandit regret [52] as well as the sleeping DB regret of [31] (Rem. 4).
Optimal Algorithm Design. Having established this new notion of sleeping regret in dueling bandits, we propose an efficient and order-optimal $O(\sqrt{T})$ sleeping DB algorithm, using a reduction to the MAB setup [32] (Thm. 5). This improves over the regret bound of [31], which only obtains $O(T^{2/3})$ worst-case regret, even in the simpler $A_t = S_t \times S_t$ setting.
Experiments. Finally, in Sec. 5, we corroborate our theoretical results with extensive empirical evaluation. In particular, our algorithm significantly outperforms the baselines as soon as there is a dependency between $S_t$ and $\ell_t$. The experiments also suggest that our algorithm can be used efficiently to converge to Nash equilibria of two-player zero-sum games with sleeping actions (see Rem. 5).
Related Works. The problem of regret minimization for stochastic multiarmed bandits (MAB) is widely studied in the online learning literature [5, 1, 25, 3], and, as motivated above, the problem of item non-availability in the MAB setting is a practical one, studied as the problem of sleeping MAB [22, 26, 20, 19], both for stochastic rewards and adversarial availabilities [19, 23, 20] and for adversarial rewards and stochastic availabilities [22, 26, 13]. In the case of stochastic rewards and adversarial availabilities, the achievable regret lower bound is known to be $\Omega(\sqrt{KT})$, $K$ being the number of actions in the decision space $\mathcal{A} = [K]$. The well-studied EXP4 algorithm does achieve the above optimal regret bound, although it is computationally inefficient [23, 19]. The optimal and efficient algorithm for this case is by [39], which is known to yield $\tilde{O}(\sqrt{T})$ regret.¹
On the other hand, over the last decade, the relative feedback variants of the stochastic MAB problem have seen a widespread resurgence in the form of the Dueling Bandit problem, where, instead of getting noisy feedback on the reward of the chosen arm, the learner only gets to see noisy feedback on the pairwise preference between two arms selected by the learner [52, 53, 24, 47, 40, 38, 30, 41], or even extensions of pairwise preferences to subsetwise preferences [44, 9, 33, 36, 37, 17, 29].
Surprisingly, there has been almost no work on dueling bandits in the sleeping setup, despite the huge practicality of the problem framework. In a very recent work, [31] attempted the problem of Sleeping DB for the setup of stochastic preferences and adversarial availabilities; however, their proposed algorithms can only yield a suboptimal regret guarantee of $O(T^{2/3})$. Our work is the first to achieve $\tilde{O}(\sqrt{T})$ regret for Sleeping Dueling Bandits (see Thm. 5).

¹ The $\tilde{O}(\cdot)$ notation hides logarithmic dependencies.
2 Problem Formulation
In this section, we formally introduce the problem of sleeping multiarmed bandits, followed by the definition of Internal Sleeping Regret, a new notion of learner performance in sleeping MAB (Sec. 2.3). We also discuss the different existing notions of regret in sleeping MAB (Sec. 2.1) and their connections (Sec. 2.2, summarized in Fig. 1).
Problem Setting: Sleeping MAB. Let $[K] = \{1, \ldots, K\}$ be a set of arms. At each round $t \geq 1$, a set of available arms $S_t \subseteq [K]$ is revealed to the learner, who is asked to select an arm $k_t \in S_t$, upon which the learner gets to observe the loss $\ell_t(k_t)$ of the selected arm. Note that the sequence of item availabilities $\{S_t\}_{t=1}^T$ as well as the loss sequence $\{\ell_t\}_{t=1}^T$ can be stochastic or adversarial (oblivious) in nature. We consider the hardest setting of adversarial losses and availabilities, which clearly subsumes the other settings as special cases (see Sec. 2.3 for details).
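For concreteness, the interaction protocol can be summarized by the following minimal simulation sketch in Python; the specific availability process, loss model, and uniform-random learner are illustrative assumptions of ours, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 5, 1000

def available_arms():
    """Illustrative availability process: each arm sleeps independently,
    but at least one arm is always kept awake."""
    S = [k for k in range(K) if rng.random() < 0.7]
    return S if S else [int(rng.integers(K))]

def loss(t, k):
    """Illustrative Bernoulli losses with fixed means (a stochastic special case)."""
    means = np.linspace(0.2, 0.8, K)
    return float(rng.random() < means[k])

history = []
for t in range(T):
    S_t = available_arms()        # environment reveals the set of awake arms
    k_t = int(rng.choice(S_t))    # learner picks an available arm (here uniformly at random)
    ell_t = loss(t, k_t)          # bandit feedback: only the played arm's loss is observed
    history.append((S_t, k_t))
```

Any of the regret notions below can then be evaluated offline from such a history, provided the evaluator has access to the full loss functions.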
The next question is how we should evaluate the learner, i.e., what is the final objective? Before proceeding to our unifying notion of sleeping MAB regret, let us give a quick overview of the existing notions of sleeping MAB regret studied in the prior bandit literature.
2.1 Existing Objectives for Sleeping MAB
1. External Sleeping Regret. The first notion was introduced by [8]. Here, the learner is compared with each arm, only on the rounds in which that arm is available:
$$R^{\mathrm{ext}}_T(k) := \sum_{t=1}^T \big(\ell_t(k_t) - \ell_t(k)\big)\,\mathbf{1}\{k \in S_t\}. \qquad (1)$$
The learner is asked to control $\max_{k \in [K]} R^{\mathrm{ext}}_T(k) = o(T)$ as $T \to \infty$. In [8], the authors provide an algorithm which achieves $R^{\mathrm{ext}}_T(k) \leq O(\sqrt{T})$ for all $k$.
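As a concrete illustration, Eq. (1) can be computed offline from a play history as follows. This is a minimal sketch under our own interface assumptions: `history` is a list of `(S_t, k_t)` pairs as in the simulation above, and `loss_fn(t, arm)` returns $\ell_t(\mathrm{arm})$, which requires full knowledge of the losses and is therefore meant only for analysis or simulation.

```python
def external_sleeping_regret(history, loss_fn, k):
    """Eq. (1): regret against arm k, counted only on rounds where k was awake."""
    return sum(
        loss_fn(t, k_t) - loss_fn(t, k)
        for t, (S_t, k_t) in enumerate(history)
        if k in S_t
    )

# The quantity the learner must control is the worst case over arms, e.g.:
# max(external_sleeping_regret(history, loss_fn, k) for k in range(K))
```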
2. Ordering Regret. This second notion compares the performance of the learner, on all rounds, with that of any fixed ordering $\sigma = (\sigma_1, \ldots, \sigma_K) \in \Sigma$ of the arms, where $\Sigma$ denotes the set of all possible orderings of $[K]$:
$$R^{\mathrm{ordering}}_T(\sigma) := \sum_{t=1}^T \big(\ell_t(k_t) - \ell_t(\sigma(S_t))\big), \qquad (2)$$
where $\sigma(S_t) = \sigma_k$ with $k = \operatorname{argmin}\{i : \sigma_i \in S_t\}$ denotes the best arm available in $S_t$ according to $\sigma$. Consequently, in this case, the learner's regret is evaluated against the best ordering, $\max_{\sigma \in \Sigma} R^{\mathrm{ordering}}_T(\sigma)$.
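The selection rule $\sigma(S_t)$ and Eq. (2) translate directly into code (same illustrative interfaces as the sketch above; `sigma` is assumed to be a sequence of arms listed from most to least preferred):

```python
def best_available(sigma, S_t):
    """sigma(S_t): the highest-ranked arm of the ordering sigma that is awake in S_t."""
    return next(arm for arm in sigma if arm in S_t)

def ordering_regret(history, loss_fn, sigma):
    """Eq. (2): the learner's cumulative loss minus that of always playing
    the best available arm according to the fixed ordering sigma."""
    return sum(
        loss_fn(t, k_t) - loss_fn(t, best_available(sigma, S_t))
        for t, (S_t, k_t) in enumerate(history)
    )
```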
It is known that no polynomial-time algorithm can achieve sublinear regret without stochastic assumptions on the losses $\ell_t$ or the availabilities $S_t$, as the problem is known to be NP-hard when both rewards and availabilities are adversarial [23, 20, 19]. For adversarial losses and i.i.d. $S_t$ (where each arm is independently available according to a Bernoulli distribution), [39] proposed an algorithm with $O(\sqrt{T})$ regret. For i.i.d. losses and adversarial availabilities, a UCB-based algorithm with logarithmic regret was proposed in [23].
3. Policy Regret. A policy $\pi : 2^{[K]} \to [K]$ denotes here a mapping from a set of available actions/experts to an item. Let $\Pi := \{\pi \mid \pi : 2^{[K]} \to [K]\}$ be the class of all policies. In this case, the regret of the learner measured against a fixed policy $\pi$ is defined as:
$$R^{\mathrm{policy}}_T(\pi) = \mathbb{E}\Big[\sum_{t=1}^T \ell_t(k_t) - \sum_{t=1}^T \ell_t(\pi(S_t))\Big], \qquad (3)$$
where the expectation is taken w.r.t. the availabilities and the randomness of the player's strategy [39]. As usual, in this case, the learner's regret is evaluated against the best policy, $\max_{\pi \in \Pi} R^{\mathrm{policy}}_T(\pi)$.
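In code, a policy is simply a map from availability sets to arms, and Eq. (3) for a single run mirrors the previous sketches; the callable interface below is our own illustrative choice, and averaging over independent runs would approximate the expectation in Eq. (3).

```python
def policy_regret(history, loss_fn, policy):
    """Eq. (3) for one realization of the availabilities and of the learner's
    randomness; the expectation would average this over independent runs."""
    return sum(
        loss_fn(t, k_t) - loss_fn(t, policy(frozenset(S_t)))
        for t, (S_t, k_t) in enumerate(history)
    )

# Every ordering sigma induces one particular policy (cf. Sec. 2.2), e.g.:
# induced_policy = lambda S: best_available(sigma, S)
```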
2.2 Relations across Different Notions of Existing Sleeping MAB Regret
One may wonder how the above notions are related. Is one stronger than the others? Does optimizing one imply optimizing the others? Under what assumptions on the sequence of losses $\{\ell_t\}_{t \in [T]}$ and availabilities $\{S_t\}_{t \in [T]}$? We answer these questions in this section and summarize the answers in Fig. 1.
1. Relationship between (ii) Ordering Regret and (iii) Policy Regret. These two notions are very close; in principle, they are equivalent in all practical contexts in which they can be controlled. Note that for stochastic losses and availabilities, it is easy to see that both are equivalent, i.e., $\max_{\sigma \in \Sigma} R^{\mathrm{ordering}}_T(\sigma) = \max_{\pi \in \Pi} R^{\mathrm{policy}}_T(\pi)$. In fact, even when either the losses or the availabilities are stochastic, and the losses are independent of the availabilities (which are the only settings in which algorithms exist for these notions), we can claim the same equivalence; see App. A.3 for a proof. Thus, for the rest of this paper, we will only work with the ordering regret $R^{\mathrm{ordering}}_T$.
2. Relationship between (i) External Sleeping Regret and (ii) Ordering Regret.
Does Ordering Regret (2) Imply External Regret (1)?
Case (i): Stochastic losses, adversarial $S_t$: Yes, in this case it does. Since the losses are stochastic, let $\mathbb{E}[\ell_t(i)] = \mu_i$ for all $i \in [K]$ and every round $t$. Then,
$$\mathbb{E}[R^{\mathrm{ext}}_T(k)] = \sum_{t=1}^T (\mu_{k_t} - \mu_k)\,\mathbf{1}\{k \in S_t\} \;\leq\; \sum_{t=1}^T (\mu_{k_t} - \mu_{k^*_t})\,\mathbf{1}\{k \in S_t\} \;\leq\; \sum_{t=1}^T (\mu_{k_t} - \mu_{k^*_t}) = \mathbb{E}[R^{\mathrm{ordering}}_T(\sigma)],$$
where the first inequality simply follows from the definition of $k^*_t = \sigma(S_t)$, $\sigma$ being the best ordering in hindsight, and both inequalities use that, for i.i.d. losses, $\mu_i - \mu_{k^*_t} \geq 0$ for any $i \in S_t$ (applied with $i = k$ for the first inequality and $i = k_t$ for the second).
Case (ii): Adversarial losses, stochastic $S_t$: The implication does not hold in this case. We can construct examples showing that it is possible to have $\mathbb{E}[R^{\mathrm{ordering}}_T(\sigma)] = 0$ while $\mathbb{E}[R^{\mathrm{ext}}_T(k)]$ grows linearly in $T$ for some $k \in [K]$ (see App. A). The key observation lies in making the losses $\ell_t$ dependent on the availability $S_t$.
Does External Regret (1) Imply Ordering Regret (2)? Clearly, this direction is not true in general, as it would otherwise contradict the hardness result for ordering regret: minimizing ordering regret is known to be NP-hard for adversarial $\ell_t$ and $S_t$ [23], while sleeping external regret can be controlled efficiently even in that fully adversarial setting (cf. Sec. 2.1).