In this paper, we consider a novel setup featuring aspects of both settings (ii) and (iv) above.
Specifically, we are given a finite number of groups, with each group having an (uncountably) infinite
number of arms, and our goal is to identify the group whose top $(1-\alpha)$-quantile arm (in terms of the mean reward) is as high as possible. While this setup builds on existing ones, we will find that it comes with unique challenges, leading to notable differences in both the upper and lower bounds.
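To preview the objective in symbols (the notation here is illustrative only; the precise definitions, including the exact quantile convention, are given in the formal problem setup), suppose there are $M$ groups, and let $G_m$ denote the CDF of the mean rewards of arms drawn from group $m$'s reservoir distribution. The goal can then roughly be stated as identifying
\begin{equation*}
m^\star \in \operatorname*{arg\,max}_{m \in \{1,\dots,M\}} G_m^{-1}(1-\alpha), \qquad G_m^{-1}(1-\alpha) := \inf\{x \in \mathbb{R} : G_m(x) \ge 1-\alpha\},
\end{equation*}
i.e., the group whose $(1-\alpha)$-quantile of mean rewards is highest, while using as few arm pulls as possible.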
A natural motivation for this problem is comparing large populations (e.g., of users, items, or
otherwise). As a concrete example, in recommendation systems, one may wish to know which
among several populations has the highest median click-through rate, while displaying as few total
recommendations as possible. Beyond any specific applications, we believe that our problem is
natural to consider in the broader context of infinite-arm bandits.
Before formally introducing the problem and stating our contributions, we outline some related work.
1.1. Related Work
The related work on multi-armed bandits is extensive (e.g., see (Slivkins, 2019; Lattimore and Szepesvári, 2020) and the references therein); we only provide a brief outline here, with an emphasis on the most closely related works.
Finite-arm settings.
The standard (single) best arm identification problem was studied in (Audibert et al., 2010; Gabillon et al., 2012; Jamieson and Nowak, 2014; Kaufmann et al., 2016; Garivier and Kaufmann, 2016), among others. These works are commonly distinguished according to whether the time horizon is fixed (fixed-budget setting) or the target error probability is fixed (fixed-confidence setting), and our focus is on the latter. In particular, we will utilize anytime confidence bounds from (Kaufmann et al., 2016) for upper-bounding the number of pulls.
A grouped best-arm identification problem was studied in (Gabillon et al., 2011; Bubeck et al., 2013; Scarlett et al., 2019), where the arms are allocated into groups, and the goal is to find the best arm in each group. Another notable setting in which multiple arms are returned is that of subset selection, where one seeks to find a subset of $k$ arms attaining the highest mean rewards (Kalyanakrishnan et al., 2012; Kaufmann and Kalyanakrishnan, 2013; Kaufmann et al., 2016). Group structure can also be incorporated into structured bandit frameworks (Huang et al., 2017; Gupta et al., 2020; Mukherjee et al., 2020; Neopane et al., 2021). Perhaps the most closely related among these is the max-min grouped bandits problem (Wang and Scarlett, 2021), which seeks a group whose worst arm is as high as possible in a finite-arm setting. We discuss the main similarities and differences relative to our setting in Appendix F, and highlight a connection to structured best-arm identification (Huang et al., 2017).
Infinite-arm settings.
One line of works in infinite-arm bandits assumes that the actions lie in a metric space, and the associated mean rewards satisfy some smoothness condition such as being locally Lipschitz (Kleinberg et al., 2008; Bubeck et al., 2008, 2011; Grill et al., 2015; Kleinberg et al., 2019). More relevant to our paper is the line of works considering a reservoir distribution on arms; throughout the learning process, new arms can be requested and previously requested ones can be pulled. Earlier works in this direction focused primarily on regret minimization (e.g., (Berry et al., 1997; Wang et al., 2008; Bonald and Proutière, 2013; David and Shimkin, 2014; Li and Xia, 2017; Kalvit and Zeevi, 2021)), and several recent works have considered pure exploration (e.g.,