One Arrow, Two Kills: A Unified Framework for Achieving Optimal Regret Guarantees in Sleeping Bandits

Pierre Gaillard (Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France; pierre.gaillard@inria.fr)
Aadirupa Saha (Toyota Technological Institute at Chicago (TTIC), US; aadirupa@ttic.edu)
Soham Dan (IBM Research, US; soham.dan@ibm.com; a major part of the work was done while the author was at the University of Pennsylvania)

arXiv:2210.14998v1 [cs.LG] 26 Oct 2022
Abstract
We address the problem of ‘Internal Regret’ in Sleeping Bandits in the fully adversarial setup, and draw connections between the different existing notions of sleeping regret in the multiarmed bandits (MAB) literature, analyzing their implications. Our first contribution is to propose a new notion of Internal Regret for sleeping MAB. We then propose an algorithm that yields sublinear regret in that measure, even for a completely adversarial sequence of losses and availabilities. We further show that low sleeping internal regret always implies low external regret, as well as low policy regret for an i.i.d. sequence of losses. The main contribution of this work lies precisely in unifying the different existing notions of regret in sleeping bandits and in understanding the implications of one for another. Finally, we also extend our results to the setting of Dueling Bandits (DB), a preference-feedback variant of MAB, and propose a reduction-to-MAB idea to design a low-regret algorithm for sleeping dueling bandits with stochastic preferences and adversarial availabilities. The efficacy of our algorithms is justified through empirical evaluations.
1 Introduction
The problem of online sequential decision-making in the standard $K$-armed multiarmed bandit (MAB) is well studied in machine learning [4, 45] and used to model online decision-making problems under uncertainty. Due to their implicit exploration-vs-exploitation tradeoff, bandits are able to model clinical trials, movie recommendations, retail management, job scheduling, etc., where the goal is to keep pulling the ‘best item’ in hindsight by sequentially querying one item at a time and subsequently observing a noisy reward feedback for the queried arm [14, 5, 4, 1, 10]. However, from a practical viewpoint, the decision space (or arm space $\mathcal{A} = \{1, \ldots, K\}$) often changes over time due to the unavailability of some items: for example, some items might go out of stock in a retail store, some websites could be down, some restaurants might be closed, etc. This setting is studied in the multiarmed bandit (MAB) literature as sleeping bandits [22, 26, 20, 19], where at any round the set $S_t \subseteq \mathcal{A}$ of available actions could vary stochastically [26, 13] or adversarially [19, 23, 20]. Over the years, several lines of research have been conducted for sleeping multi-armed bandits (MAB) with different notions of regret performance, e.g., policy, ordering, or sleeping external regret [8, 26, 39].
Figure 1: One Arrow, Two Kills: the connections between our proposed notion of Sleeping Internal Regret and the different existing notions of regret for sleeping MAB, and their implications.

In this paper, we introduce a new notion of sleeping regret, called Sleeping Internal Regret, which helps to bridge the gaps between the different existing notions of sleeping regret in MAB. We show that our regret notion can be controlled in the fully adversarial setup, that it implies sleeping external regret in the fully adversarial setup (i.e., when both losses and item availabilities are adversarial), and that it implies policy regret in the stochastic setting (i.e., when losses are stochastic). We further propose an efficient $O(\sqrt{T})$ worst-case regret algorithm for sleeping internal regret. Finally, we also motivate the implications of our results for the Dueling Bandits (DB) framework, an online learning framework that generalizes the standard multiarmed bandit (MAB) [5] setting to identifying a set of ‘good’ arms from a fixed decision space (set of items) by querying preference feedback on actively chosen item pairs [49, 2, 52, 35, 36]. The main contributions can be listed as follows:
Connecting Existing Notions of Sleeping Regret. The first contribution (Sec. 2) lies in relating the existing notions of sleeping regret, given as follows:

The first one, sleeping external regret, is mostly used in prediction with expert advice [8, 16]: if the learner had played $j$ instead of $k_t$ at all rounds where $j$ was available, we want the learner not to incur large regret. It is widely used to design dynamic regret algorithms [28, 51, 11, 50, 46]. It has the advantage that efficient no-regret algorithms can be designed even when both the availabilities $S_t$ and the losses $\ell_t$ are adversarial.
The second one, called ordering regret, is mostly used in the bandit literature [23, 39, 21, 27]. It compares the cumulative loss of the learner with that of the best ordering $\sigma$, which selects the best available action according to $\sigma$ at every round. No efficient algorithm exists when both $\ell_t$ and $S_t$ are adversarial: either $S_t$ or $\ell_t$ should be i.i.d. [23].
We also note that in some works, policies $\pi$ (i.e., functions from subsets of $[K]$ to $[K]$) are considered instead of orderings $\sigma$; this is termed policy regret [26, 39]. The latter two are equivalent when the losses are i.i.d., or come from an oblivious adversary with stochastic sleeping.
General Notion of Sleeping Regret. Our second, and one of the primary, contributions lies in introducing a new notion of sleeping regret, called Internal Sleeping Regret (Definition 1), which we show unifies the different notions of sleeping regret under a general umbrella (see Fig. 1): We show that (i) low sleeping internal regret always implies low sleeping external regret, even under the fully adversarial setup; and (ii) for stochastic losses it also implies low ordering regret (equivalently, policy regret), even under adversarial availabilities. Thus we now have a tool, Sleeping Internal Regret, optimizing which simultaneously optimizes all the existing notions of sleeping regret (which also justifies the title of this work!) (Sec. 2.3).
Algorithm Design and Regret Implications. The main contribution of this work is to propose an efficient algorithm (SI-EXP3, Alg. 1) for Sleeping Internal Regret, with an $O(\sqrt{T})$ regret guarantee (Thm. 4). As motivated above, the generality of our regret notion further implies $O(\sqrt{T})$ external regret in any setting, and also ordering regret for i.i.d. losses (Rem. 3). We are the first to achieve this regret unification with a single algorithm (Sec. 3).
Extensions: Generalized Regret for Dueling Bandits (DB) and Algorithm. Another strength of Internal Sleeping Regret is that it can be used to design no-regret algorithms for the sleeping dueling bandits (DB) setup, a relative-feedback-based variant of standard MAB [52, 2, 7] (Sec. 4).
General Sleeping DB. Towards this, we propose a new and more unifying sleeping dueling bandits setup that allows the environment to play from different subsets of available dueling pairs ($A_t \subseteq [K]^2$) at each round $t$. This generalizes the standard DB setting, where $A_t = [K]^2$ without sleeping, as well as the Sleeping DB setup of [31], where $A_t = S_t \times S_t$.
Unifying Sleeping DB Regret. Next, taking cues from our notion of Sleeping Internal Regret for MAB, we propose a generalized dueling bandit regret, Internal Sleeping DB Regret (Eq. (10)), which unifies the classical dueling bandit regret [52] as well as the sleeping DB regret of [31] (Rem. 4).
Optimal Algorithm Design. Having established this new notion of sleeping regret in dueling bandits, we propose an efficient and order-optimal $O(\sqrt{T})$ sleeping DB algorithm, using a reduction to the MAB setup [32] (Thm. 5). This improves over the regret bound of [31], which only obtains $O(T^{2/3})$ worst-case regret, even in the simpler $A_t = S_t \times S_t$ setting.
Experiments. Finally, in Sec. 5, we corroborate our theoretical results with extensive empirical evaluation. In particular, our algorithm significantly outperforms the baselines as soon as there is a dependency between $S_t$ and $\ell_t$. The experiments also suggest that our algorithm can be used efficiently to converge to Nash equilibria of two-player zero-sum games with sleeping actions (see Rem. 5).
Related Works. The problem of regret minimization for stochastic multiarmed bandits (MAB) is widely studied in the online learning literature [5, 1, 25, 3], and, as motivated above, the problem of item non-availability in the MAB setting is a practical one, studied as the problem of sleeping MAB [22, 26, 20, 19], both for stochastic rewards and adversarial availabilities [19, 23, 20] and for adversarial rewards and stochastic availabilities [22, 26, 13]. In the case of stochastic rewards and adversarial availabilities, the achievable regret lower bound is known to be $\Omega(\sqrt{KT})$, $K$ being the number of actions in the decision space $\mathcal{A} = [K]$. The well-studied EXP4 algorithm does achieve the above optimal regret bound, although it is computationally inefficient [23, 19]. The optimal and efficient algorithm for this case is by [39], which is known to yield $\tilde{O}(\sqrt{T})$ regret.¹
On the other hand, over the last decade, the relative feedback variants of the stochastic MAB problem have seen a widespread resurgence in the form of the Dueling Bandit problem, where, instead of getting noisy feedback on the reward of the chosen arm, the learner only gets to see noisy feedback on the pairwise preference between two arms selected by the learner [52, 53, 24, 47, 40, 38, 30, 41], or even extensions of pairwise preferences to subsetwise preferences [44, 9, 33, 36, 37, 17, 29].
Surprisingly, there has been almost no work on dueling bandits in the sleeping setup, despite the huge practicality of the problem framework. In a very recent work, [31] attempted the problem of Sleeping DB for the setup of stochastic preferences and adversarial availabilities; however, their proposed algorithms can only yield a suboptimal regret guarantee of $O(T^{2/3})$. Our work is the first to achieve $\tilde{O}(\sqrt{T})$ regret for Sleeping Dueling Bandits (see Thm. 5).

¹ The $\tilde{O}(\cdot)$ notation hides logarithmic dependencies.
2 Problem Formulation
In this section, we formally introduce the problem of sleeping multiarmed bandits, followed by the definition of Internal Sleeping Regret, a new notion of learner performance in sleeping MAB (Sec. 2.3). We also discuss the different existing notions of regret in sleeping MAB (Sec. 2.1) and their connections (Sec. 2.2, summarized in Fig. 1).
Problem Setting: Sleeping MAB. Let $[K] = \{1, \ldots, K\}$ be a set of arms. At each round $t \geq 1$, a set of available arms $S_t \subseteq [K]$ is revealed to the learner, who is asked to select an arm $k_t \in S_t$, upon which the learner gets to observe the loss $\ell_t(k_t)$ of the selected arm. Note that the sequence of item availabilities $\{S_t\}_{t=1}^T$ as well as the loss sequence $\{\ell_t\}_{t=1}^T$ can be stochastic or adversarial (oblivious) in nature. We consider the hardest setting of adversarial losses and availabilities, which clearly subsumes the other settings as special cases (see Sec. 2.3 for details).
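For concreteness, the interaction protocol can be summarized by the following minimal simulation sketch in Python; the specific availability process, loss model, and uniform-random learner are illustrative assumptions of ours, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 5, 1000

def available_arms():
    """Illustrative availability process: each arm sleeps independently,
    but at least one arm is always kept awake."""
    S = [k for k in range(K) if rng.random() < 0.7]
    return S if S else [int(rng.integers(K))]

def loss(t, k):
    """Illustrative Bernoulli losses with fixed means (a stochastic special case)."""
    means = np.linspace(0.2, 0.8, K)
    return float(rng.random() < means[k])

history = []
for t in range(T):
    S_t = available_arms()        # environment reveals the set of awake arms
    k_t = int(rng.choice(S_t))    # learner picks an available arm (here uniformly at random)
    ell_t = loss(t, k_t)          # bandit feedback: only the played arm's loss is observed
    history.append((S_t, k_t))
```

Any of the regret notions below can then be evaluated offline from such a history, provided the evaluator has access to the full loss functions.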
The next question is how we should evaluate the learner, i.e., what is the final objective? Before proceeding to our unifying notion of sleeping MAB regret, let us give a quick overview of the existing notions of sleeping MAB regret studied in the prior bandit literature.
2.1 Existing Objectives for Sleeping MAB
1. External Sleeping Regret. The first notion was introduced by [8]. Here, the learner is compared with each arm, only on the rounds in which that arm is available:
$$R^{\mathrm{ext}}_T(k) := \sum_{t=1}^T \big(\ell_t(k_t) - \ell_t(k)\big)\,\mathbf{1}\{k \in S_t\}. \qquad (1)$$
The learner is asked to control $\max_{k \in [K]} R^{\mathrm{ext}}_T(k) = o(T)$ as $T \to \infty$. In [8], the authors provide an algorithm which achieves $R^{\mathrm{ext}}_T(k) \leq O(\sqrt{T})$ for all $k$.
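As a concrete illustration, Eq. (1) can be computed offline from a play history as follows. This is a minimal sketch under our own interface assumptions: `history` is a list of `(S_t, k_t)` pairs as in the simulation above, and `loss_fn(t, arm)` returns $\ell_t(\mathrm{arm})$, which requires full knowledge of the losses and is therefore meant only for analysis or simulation.

```python
def external_sleeping_regret(history, loss_fn, k):
    """Eq. (1): regret against arm k, counted only on rounds where k was awake."""
    return sum(
        loss_fn(t, k_t) - loss_fn(t, k)
        for t, (S_t, k_t) in enumerate(history)
        if k in S_t
    )

# The quantity the learner must control is the worst case over arms, e.g.:
# max(external_sleeping_regret(history, loss_fn, k) for k in range(K))
```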
2. Ordering Regret. This second notion compares the performance of the learner, on all rounds, with that of any fixed ordering $\sigma = (\sigma_1, \ldots, \sigma_K) \in \Sigma$ of the arms, where $\Sigma$ denotes the set of all possible orderings of $[K]$:
$$R^{\mathrm{ordering}}_T(\sigma) := \sum_{t=1}^T \big(\ell_t(k_t) - \ell_t(\sigma(S_t))\big), \qquad (2)$$
where $\sigma(S_t) = \sigma_k$ with $k = \operatorname{argmin}\{i : \sigma_i \in S_t\}$ denotes the best arm available in $S_t$ according to $\sigma$. Consequently, in this case, the learner's regret is evaluated against the best ordering, $\max_{\sigma \in \Sigma} R^{\mathrm{ordering}}_T(\sigma)$.
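The selection rule $\sigma(S_t)$ and Eq. (2) translate directly into code (same illustrative interfaces as the sketch above; `sigma` is assumed to be a sequence of arms listed from most to least preferred):

```python
def best_available(sigma, S_t):
    """sigma(S_t): the highest-ranked arm of the ordering sigma that is awake in S_t."""
    return next(arm for arm in sigma if arm in S_t)

def ordering_regret(history, loss_fn, sigma):
    """Eq. (2): the learner's cumulative loss minus that of always playing
    the best available arm according to the fixed ordering sigma."""
    return sum(
        loss_fn(t, k_t) - loss_fn(t, best_available(sigma, S_t))
        for t, (S_t, k_t) in enumerate(history)
    )
```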
It is known that no polynomial-time algorithm can achieve sublinear regret without stochastic assumptions on the losses $\ell_t$ or the availabilities $S_t$, as the problem is known to be NP-hard when both rewards and availabilities are adversarial [23, 20, 19]. For adversarial losses and i.i.d. $S_t$ (where each arm is independently available according to a Bernoulli distribution), [39] proposed an algorithm with $O(\sqrt{T})$ regret. For i.i.d. losses and adversarial availabilities, a UCB-based algorithm with logarithmic regret was proposed in [23].
3. Policy Regret. A policy $\pi : 2^{[K]} \to [K]$ denotes here a mapping from a set of available actions/experts to an item. Let $\Pi := \{\pi \mid \pi : 2^{[K]} \to [K]\}$ be the class of all policies. In this case, the regret of the learner measured against a fixed policy $\pi$ is defined as:
$$R^{\mathrm{policy}}_T(\pi) = \mathbb{E}\Big[\sum_{t=1}^T \ell_t(k_t) - \sum_{t=1}^T \ell_t(\pi(S_t))\Big], \qquad (3)$$
where the expectation is taken w.r.t. the availabilities and the randomness of the player's strategy [39]. As usual, in this case, the learner's regret is evaluated against the best policy, $\max_{\pi \in \Pi} R^{\mathrm{policy}}_T(\pi)$.
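In code, a policy is simply a map from availability sets to arms, and Eq. (3) for a single run mirrors the previous sketches; the callable interface below is our own illustrative choice, and averaging over independent runs would approximate the expectation in Eq. (3).

```python
def policy_regret(history, loss_fn, policy):
    """Eq. (3) for one realization of the availabilities and of the learner's
    randomness; the expectation would average this over independent runs."""
    return sum(
        loss_fn(t, k_t) - loss_fn(t, policy(frozenset(S_t)))
        for t, (S_t, k_t) in enumerate(history)
    )

# Every ordering sigma induces one particular policy (cf. Sec. 2.2), e.g.:
# induced_policy = lambda S: best_available(sigma, S)
```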
2.2 Relations across Different Notions of Existing Sleeping MAB Regret
One may wonder how the above notions are related. Is one stronger than the others? Does optimizing one imply optimizing the others? Under what assumptions on the sequence of losses $\{\ell_t\}_{t \in [T]}$ and availabilities $\{S_t\}_{t \in [T]}$? We answer these questions in this section and summarize the answers in Fig. 1.
1. Relationship between (ii) Ordering Regret and (iii) Policy Regret. These two notions are very close; in principle, they are equivalent in all practical contexts in which they can be controlled. Note that for stochastic losses and availabilities, it is easy to see that both are equivalent, i.e., $\max_{\sigma \in \Sigma} R^{\mathrm{ordering}}_T(\sigma) = \max_{\pi \in \Pi} R^{\mathrm{policy}}_T(\pi)$. In fact, even when either the losses or the availabilities are stochastic, and the losses are independent of the availabilities (which are the only settings in which algorithms exist for these notions), we can claim the same equivalence; see App. A.3 for a proof. Thus, for the rest of this paper, we will only work with the ordering regret $R^{\mathrm{ordering}}_T$.
2. Relationship between (i) External Sleeping Regret and (ii) Ordering Regret.
Does Ordering Regret (2) Imply External Regret (1)?
Case (i): Stochastic losses, adversarial $S_t$: Yes, in this case it does. Since the losses are stochastic, let $\mathbb{E}[\ell_t(i)] = \mu_i$ for all $i \in [K]$ and every round $t$. Then,
$$\mathbb{E}[R^{\mathrm{ext}}_T(k)] = \sum_{t=1}^T (\mu_{k_t} - \mu_k)\,\mathbf{1}\{k \in S_t\} \;\leq\; \sum_{t=1}^T (\mu_{k_t} - \mu_{k^*_t})\,\mathbf{1}\{k \in S_t\} \;\leq\; \sum_{t=1}^T (\mu_{k_t} - \mu_{k^*_t}) = \mathbb{E}[R^{\mathrm{ordering}}_T(\sigma)],$$
where the first inequality simply follows from the definition of $k^*_t = \sigma(S_t)$, $\sigma$ being the best ordering in hindsight, and both inequalities use that, for i.i.d. losses, $\mu_i - \mu_{k^*_t} \geq 0$ for any $i \in S_t$ (applied with $i = k$ for the first inequality and $i = k_t$ for the second).
Case (ii): Adversarial losses, stochastic $S_t$: The implication does not hold in this case. We can construct examples showing that it is possible to have $\mathbb{E}[R^{\mathrm{ordering}}_T(\sigma)] = 0$ while $\mathbb{E}[R^{\mathrm{ext}}_T(k)]$ grows linearly in $T$ for some $k \in [K]$ (see App. A). The key observation lies in making the losses $\ell_t$ dependent on the availability $S_t$.
Does External Regret (1) Imply Ordering Regret (2)? Clearly, this direction is not true in general, as it would otherwise contradict the hardness result for ordering regret: minimizing ordering regret is known to be NP-hard for adversarial $\ell_t$ and $S_t$ [23], while sleeping external regret can be controlled efficiently even in that fully adversarial setting (cf. Sec. 2.1).