
2 Discussion of related work
Games with limited feedback and $O(\sqrt{T})$ regret: In the standard setting of the multi-armed bandit problem, the learner repeatedly obtains rewards (or incurs losses) by choosing from a fixed set of $K$ actions and only gets to see the reward of the chosen action. Algorithms such as EXP3-IX [Neu, 2015] or EXP3.P [Auer et al., 2002] achieve the optimal regret of order $O(\sqrt{KT})$, up to a logarithmic factor, with high probability. A more general setting, closer to ours, was introduced by Seldin et al. [2014]. Given a budget $m \in \llbracket K \rrbracket$, in each round $t$ the learner plays the prediction of one expert $I_t$, then gets to choose a subset of experts $C_t$ with $I_t \in C_t$ whose predictions it observes. A careful adaptation of the EXP3 algorithm to this setting leads to an expected regret of order $O(\sqrt{(K/m)\,T})$, which is optimal up to a logarithmic factor in $K$.
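For concreteness, denote by $R_T$ the expected regret after $T$ rounds (this notation is used here only to compare the two settings). Up to logarithmic factors, the two bounds above read
\[
  R_T^{\mathrm{bandit}} = O\bigl(\sqrt{KT}\bigr)
  \qquad \text{and} \qquad
  R_T^{\text{$m$ observations}} = O\bigl(\sqrt{(K/m)\,T}\bigr),
\]
so that observing $m$ experts per round improves the dependence on $K$ by a factor $\sqrt{m}$, while the case $m = 1$ recovers the standard bandit rate.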
There are two significant differences between our framework and the setting presented by Seldin et al. [2014]. First, we allow the player to combine up to $p$ experts out of $K$ in each round for prediction. Second, we make an additional exp-concavity-type assumption (Assumption 1) on the loss function. These two differences allow us to achieve constant regret bounds (independent of $T$).
Playing multiple arms per round was considered in the literature on multiple-play multi-armed bandits. This problem was investigated under a budget constraint $C$ by Zhou and Tomlin [2018] and Xia et al. [2016]. In each round, the player picks $m$ out of $K$ arms and incurs the sum of their losses. In addition to observing the losses of the played arms, the learner observes a vector of costs which has to be covered by a pre-defined budget $C$. Once the budget is consumed, the game ends. An extension of the EXP3 algorithm yields a strategy for the adversarial setting with regret of order $O(\sqrt{KC \log(K/m)})$. The cost of each arm is assumed to lie in an interval $[c_{\min}, 1]$ for a positive constant $c_{\min}$, so each round consumes between $m\,c_{\min}$ and $m$ units of budget; hence the total number of rounds $T$ in this game satisfies $T = \Theta(C/m)$. Another online problem aims at minimizing the
cumulative regret in an adversarial setting with a small effective range of losses. Gerchinovitz
and Lattimore [2016] have shown the impossibility of regret scaling with the effective range
of losses in the bandit setting, while Thune and Seldin [2018] showed that it is possible to
circumvent this impossibility result if the player is allowed one additional observation per
round. However, it is impossible to achieve a regret dependence on $T$ better than the rate of order $O(\sqrt{T})$ in this setting.
Decoupling exploration and exploitation was considered by Avner et al. [2012]. In each round, the player plays one arm, then chooses one arm out of $K$ whose reward it observes (not necessarily the played arm, as would be required in the canonical multi-armed bandit problem). They devised algorithms for this setting and showed that the dependence on the number of arms $K$ can be improved. However, it is impossible to achieve a regret dependence on $T$ better than $O(\sqrt{T})$.
Prediction with limited expert advice was also investigated by Helmbold and Panizza [1997], Cesa-Bianchi and Lugosi [2006, Chap. 6], and Cesa-Bianchi et al. [2005]. However, in these problems, known as label efficient prediction, the forecaster has full access to the experts' advice but limited information about the past outcomes of the sequence to be predicted. More precisely, the outcome $y_t$ is not necessarily revealed to the learner. In such a framework, the optimal regret is of order $O(\sqrt{T})$.