
propose an efficient algorithm (SI-EXP3, Alg. 1) w.r.t. Sleeping Internal Regret, and design an $O(\sqrt{T})$-regret algorithm (Thm. 4). As motivated, the generality of our regret notion further implies $O(\sqrt{T})$ external regret in any setting, as well as ordering regret for i.i.d. losses (Rem. 3). We are the first to achieve this regret unification with only a single algorithm (Sec. 3).
• Extensions: Generalized Regret for Dueling Bandits (DB) and Algorithm. A further strength of Sleeping Internal Regret is that it can be used to design no-regret algorithms for the sleeping dueling bandit (DB) setup, a relative-feedback variant of the standard MAB problem [52, 2, 7] (Sec. 4).
– General Sleeping DB. Towards this, we propose a new and more unifying sleeping dueling bandit setup that allows the environment to make available a different subset of dueling pairs ($A_t \subseteq [K]^2$) at each round $t$. This generalizes not only the standard (non-sleeping) DB setting, where $A_t = [K]^2$, but also the Sleeping DB setup of [31], where $A_t = S_t \times S_t$.
– Unifying Sleeping DB Regret. Next, taking cues from our notion of Sleeping Internal Regret for MAB, we propose a generalized dueling bandit regret, Internal Sleeping DB Regret (Eq. (10)), which unifies the classical dueling bandit regret [52] as well as the sleeping DB regret of [31] (Rem. 4).
– Optimal Algorithm Design. Having established this new notion of sleeping regret for dueling bandits, we propose an efficient and order-optimal $O(\sqrt{T})$ sleeping DB algorithm via a reduction to the MAB setup [32] (Thm. 5); an illustrative sparring-style sketch of such a reduction is given after this list. This improves over [31], which only guarantees $O(T^{2/3})$ worst-case regret even in the simpler $A_t = S_t \times S_t$ setting.
• Experiments. Finally, in Sec. 5 we corroborate our theoretical results with extensive empirical evaluation. In particular, our algorithm significantly outperforms the baselines as soon as there is a dependency between the availabilities $S_t$ and the losses $\ell_t$. The experiments also indicate that our algorithm can be used to converge efficiently to Nash equilibria of two-player zero-sum games with sleeping actions (see Rem. 5; an illustrative self-play sketch is given below).
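To make the sleeping constraint and the DB-to-MAB reduction above concrete, here is a minimal sketch, assuming product availability sets $A_t = S_t \times S_t$ and a plain EXP3-style learner with importance-weighted losses. It is not the paper's SI-EXP3 (Alg. 1) nor the exact reduction behind Thm. 5 (which follows [32]); the class and function names, and the sparring-style pairing of two learners, are illustrative assumptions. Handling a general $A_t \subseteq [K]^2$ would require sampling a dueling pair jointly rather than one arm per learner.

```python
import numpy as np

class SleepingEXP3:
    """EXP3-style learner that samples only from the currently awake arms.

    Purely illustrative: the paper's SI-EXP3 (Alg. 1) controls the stronger
    sleeping *internal* regret; this baseline only targets external regret
    over the awake set.
    """

    def __init__(self, n_arms, eta=0.1):
        self.eta = eta                      # learning rate, typically ~ sqrt(log K / T)
        self.cum_loss = np.zeros(n_arms)    # importance-weighted cumulative losses

    def play(self, awake, rng):
        awake = np.asarray(awake)
        w = np.exp(-self.eta * (self.cum_loss[awake] - self.cum_loss[awake].min()))
        p = w / w.sum()
        idx = rng.choice(len(awake), p=p)
        self._last = (awake[idx], p[idx])   # remember the played arm and its probability
        return int(awake[idx])

    def update(self, loss):
        arm, prob = self._last
        self.cum_loss[arm] += loss / prob   # importance-weighted loss estimate


def sparring_duel_round(left, right, awake, duel_oracle, rng):
    """One round of a sparring-style reduction from sleeping DB to two MAB learners.

    Assumes the product-set availability A_t = S_t x S_t for simplicity;
    `duel_oracle(i, j)` returns 1 if arm j beats arm i in the observed duel.
    Each learner treats losing its duel as a loss of 1.
    """
    i = left.play(awake, rng)
    j = right.play(awake, rng)
    j_wins = duel_oracle(i, j)              # noisy binary preference feedback
    left.update(float(j_wins))
    right.update(1.0 - float(j_wins))
    return i, j
```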
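Likewise, the game-theoretic use case mentioned around Rem. 5 can be pictured with the classical (non-sleeping, full-information) version of the idea: two exponential-weights learners in self-play on a zero-sum matrix game, whose time-averaged strategies approach a Nash equilibrium. The sketch below is only this textbook variant under those simplifying assumptions, not the paper's experimental setup; extending it to sleeping actions requires the considerations of Rem. 5.

```python
import numpy as np

def hedge_self_play(G, T=5000, eta=0.05):
    """Self-play of two Hedge learners on a zero-sum matrix game.

    G[i, j] in [0, 1] is the row player's loss (= column player's gain) when
    row plays i and column plays j. Returns the time-averaged strategies,
    which approximate a Nash equilibrium of the game.
    """
    n, m = G.shape
    Lr, Lc = np.zeros(n), np.zeros(m)        # cumulative losses of each player
    avg_p, avg_q = np.zeros(n), np.zeros(m)  # running sums of the played strategies
    for _ in range(T):
        p = np.exp(-eta * (Lr - Lr.min())); p /= p.sum()
        q = np.exp(-eta * (Lc - Lc.min())); q /= q.sum()
        avg_p += p; avg_q += q
        Lr += G @ q                          # row player's expected loss per action
        Lc += 1.0 - G.T @ p                  # column player's loss = 1 - gain
    return avg_p / T, avg_q / T

# Example: matching pennies (row loses on a match); both averages tend to (1/2, 1/2).
G = np.array([[1.0, 0.0], [0.0, 1.0]])
p_bar, q_bar = hedge_self_play(G)
```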
Related Works. The problem of regret minimization for stochastic multi-armed bandits (MAB) is widely studied in the online learning literature [5, 1, 25, 3], and, as motivated above, item non-availability in the MAB setting is a practical concern, studied as the sleeping MAB problem [22, 26, 20, 19], both for stochastic rewards with adversarial availabilities [19, 23, 20] and for adversarial rewards with stochastic availabilities [22, 26, 13]. In the case of stochastic rewards and adversarial availabilities, the achievable regret lower bound is known to be $\Omega(\sqrt{KT})$, $K$ being the number of actions in the decision space $A = [K]$. The well-studied EXP4 algorithm achieves this optimal regret bound, although it is computationally inefficient [23, 19]. The optimal and efficient algorithm for this case is by [39], which is known to yield $\tilde{O}(\sqrt{T})$ regret¹.
On the other hand, over the last decade the relative-feedback variant of the stochastic MAB problem has seen a widespread resurgence in the form of the Dueling Bandit problem, where, instead of receiving noisy feedback on the reward of the chosen arm, the learner only observes noisy feedback on the pairwise preference between the two arms it selects [52, 53, 24, 47, 40, 38, 30, 41], with extensions from pairwise to subsetwise preferences [44, 9, 33, 36, 37, 17, 29].
Surprisingly, there has been almost no work on dueling bandits in the sleeping setup, despite the strong practical relevance of this framework. In a very recent work, [31] considered the problem of Sleeping DB for the setup of stochastic preferences and adversarial availabilities; however, there
¹ The $\tilde{O}(\cdot)$ notation hides the logarithmic dependencies.