Max-Quantile Grouped Infinite-Arm Bandits
Ivan Lau IVAN.LAU@RICE.EDU
Rice University
Yan Hao Ling LINGYH@NUS.EDU.SG
National University of Singapore
Mayank Shrivastava MAYANKS4@ILLINOIS.EDU
University of Illinois Urbana-Champaign
Jonathan Scarlett SCARLETT@COMP.NUS.EDU.SG
National University of Singapore
Abstract
In this paper, we consider a bandit problem in which there are a number of groups each consisting
of infinitely many arms. Whenever a new arm is requested from a given group, its mean reward
is drawn from an unknown reservoir distribution (different for each group), and the uncertainty in
the arm’s mean reward can only be reduced via subsequent pulls of the arm. The goal is to identify
the infinite-arm group whose reservoir distribution has the highest $(1-\alpha)$-quantile (e.g., the median if $\alpha = \frac{1}{2}$), using as few total arm pulls as possible. We introduce a two-step algorithm that first
requests a fixed number of arms from each group and then runs a finite-arm grouped max-quantile
bandit algorithm. We characterize both the instance-dependent and worst-case regret, and provide a
matching lower bound for the latter, while discussing various strengths, weaknesses, algorithmic
improvements, and potential lower bounds associated with our instance-dependent upper bounds.
Keywords: Bandit algorithms, infinite-arm bandits, grouped bandits, lower bounds
1. Introduction
Multi-armed bandit (MAB) algorithms are widely adopted in scenarios of decision-making under
uncertainty (Slivkins, 2019; Lattimore and Szepesvári, 2020). In theoretical MAB studies, two
particularly common performance goals are regret minimization and best arm identification, and
this paper is more related to the latter. The most basic form of best arm identification seeks to identify a single arm with the highest mean reward from a finite set of arms (Kaufmann et al., 2016). Other variations include, but are not limited to: (i) identifying multiple arms with the highest mean rewards (Kalyanakrishnan et al., 2012); (ii) identifying a single arm with a near-maximal mean reward from an infinite set of arms (Aziz et al., 2018; Chaudhuri and Kalyanakrishnan, 2018); (iii) identifying, for each group of arms, a single arm with the highest mean reward (Gabillon et al., 2011; Bubeck et al., 2013); (iv) identifying a single arm-group from a set of groups (of arms) whose worst arm has the highest mean reward (Wang and Scarlett, 2021).
In this paper, we consider a novel setup featuring aspects of both settings (ii) and (iv) above.
Specifically, we are given a finite number of groups, with each group having an (uncountably) infinite
number of arms, and our goal is to identify the group whose top $(1-\alpha)$-quantile arm (in terms of
the mean reward) is as high as possible. While this setup builds on existing ones, we will find that it
comes with unique challenges, with non-minor differences in both the upper and lower bounds.
A natural motivation for this problem is comparing large populations (e.g., of users, items, or
otherwise). As a concrete example, in recommendation systems, one may wish to know which
among several populations has the highest median click-through rate, while displaying as few total
recommendations as possible. Beyond any specific applications, we believe that our problem is
natural to consider in the broader context of infinite-arm bandits.
Before formally introducing the problem and stating our contributions, we outline some related work.
1.1. Related Work
The related work on multi-armed bandits is extensive (e.g., see (Slivkins, 2019; Lattimore and Szepesvári, 2020) and the references therein); we only provide a brief outline here, with an emphasis on the most closely related works.
Finite-arm settings.
The standard (single) best arm identification problem was studied in (Audibert et al., 2010; Gabillon et al., 2012; Jamieson and Nowak, 2014; Kaufmann et al., 2016; Garivier and Kaufmann, 2016), among others. These works are commonly distinguished according to whether the time horizon is fixed (fixed-budget setting) or the target error probability is fixed (fixed-confidence setting), and our focus is on the latter. In particular, we will utilize anytime confidence bounds from (Kaufmann et al., 2016) for upper-bounding the number of pulls.
A grouped best-arm identification problem was studied in (Gabillon et al., 2011; Bubeck et al., 2013; Scarlett et al., 2019), where the arms are allocated into groups, and the goal is to find the best arm in each group. Another notable setting in which multiple arms are returned is that of subset selection,
where one seeks to find a subset of $k$ arms attaining the highest mean rewards (Kalyanakrishnan et al., 2012; Kaufmann and Kalyanakrishnan, 2013; Kaufmann et al., 2016). Group structure can also be
incorporated into structured bandit frameworks (Huang et al., 2017; Gupta et al., 2020; Mukherjee et al., 2020; Neopane et al., 2021). Perhaps the most related among these is that of max-min grouped bandits (Wang and Scarlett, 2021), which seeks a group whose worst arm is as high as possible in a finite-arm setting. We discuss the main similarities and differences to our setting in Appendix F, as well as highlighting a connection to structured best-arm identification (Huang et al., 2017).
Infinite-arm settings.
One line of works in infinite-arm bandits assumes that the actions lie in a metric space, and the associated mean rewards satisfy some smoothness condition such as being locally Lipschitz (Kleinberg et al., 2008; Bubeck et al., 2008, 2011; Grill et al., 2015; Kleinberg et al., 2019). More relevant to our paper is the line of works considering a reservoir distribution on arms; throughout the learning process, new arms can be requested and previously-requested ones
can be pulled. Earlier works in this direction focused primarily on regret minimization (e.g., (Berry et al., 1997; Wang et al., 2008; Bonald and Proutière, 2013; David and Shimkin, 2014; Li and Xia, 2017; Kalvit and Zeevi, 2021)), and several recent works have considered pure exploration (e.g.,
(Carpentier and Valko, 2015; Jamieson et al., 2016; Chaudhuri and Kalyanakrishnan, 2018; Aziz et al., 2018; Chaudhuri and Kalyanakrishnan, 2019; Ren et al., 2019; Katz-Samuels and Jamieson, 2020; Zhang and Ong, 2021)). Throughout the paper, we particularly focus on the work (Aziz et al.,
2018), which studies the problem of finding a single arm in the top $(1-\alpha)$-quantile. See Appendix F
for a discussion of some key similarities and differences.
The idea of measuring the performance against a quantile has appeared in several of these works, including (Chaudhuri and Kalyanakrishnan, 2018; Aziz et al., 2018). Related notions include finding a “good” arm in a scenario where there are two types of arm (Jamieson et al., 2016), and finding a subset of a top fraction of arms in a finite-arm setting (Chaudhuri and Kalyanakrishnan, 2019; Ren et al., 2019). The notion of finding an optimal quantile has also appeared extensively in risk-aware
bandits (Tan et al., 2022), but these problems consist of a single set of arms and consider quantiles of the reward distributions, which are handled very differently from quantiles of reservoir distributions.
2. Problem Setup and Contributions
Arms and rewards.
We first describe the problem aspects that are the same as regular MAB problems. We are given a collection $\mathcal{A}$ of arms, which has some unknown mean-reward mapping $\mu : \mathcal{A} \to \mathbb{R}$. In each round, indexed by $t \ge 1$, the algorithm pulls one or more arms¹ in $\mathcal{A}$ and observes their corresponding rewards. We consider the stochastic reward setting, in which for each arm $j \in \mathcal{A}$, the observations of its reward $\{X_{j,t}\}_{t \ge 1}$ are i.i.d. random variables from some distribution with mean $\mu_j := \mu(j)$. It is also useful to define the empirical mean of arm $j$ at round $t$, defined as
$$\hat{\mu}_{j,T_j(t)} := \frac{1}{T_j(t)} \sum_{\substack{\tau \in \{1,\dots,t\}: \\ \text{arm } j \text{ pulled}}} X_{j,\tau}, \tag{1}$$
where $T_j(t) \le t$ is the number of pulls of arm $j$ up to round $t$. In standard best-arm identification problems, the goal is to identify the best arm with high probability using as few pulls as possible.
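As a small illustration of the bookkeeping behind (1), the following sketch (our own, not part of the paper's algorithms) maintains $T_j(t)$ and the running empirical mean $\hat{\mu}_{j,T_j(t)}$; the arm identifiers and the source of rewards are placeholders.

```python
from collections import defaultdict

class EmpiricalMeans:
    """Tracks pull counts T_j(t) and running empirical means as in Eq. (1)."""

    def __init__(self):
        self.pulls = defaultdict(int)    # T_j(t): number of pulls of arm j so far
        self.means = defaultdict(float)  # running empirical mean of arm j

    def update(self, arm, reward):
        # Incremental mean update; equivalent to averaging all observed rewards of `arm`.
        self.pulls[arm] += 1
        n = self.pulls[arm]
        self.means[arm] += (reward - self.means[arm]) / n

    def mean(self, arm):
        return self.means[arm]
```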
We will restrict our attention to classes of arm distributions that are uniquely parametrized by their mean, e.g., Bernoulli($\mu$) or $\mathcal{N}(\mu, \sigma^2)$ with $\sigma^2 > 0$ being the same for every arm. By doing so, the notion of a reservoir distribution (see below) can be introduced using probability distributions on $\mathbb{R}$, which is significantly more convenient compared to more general distributions.
Infinite-arm and group notions.
In our setup, the number of arms $|\mathcal{A}|$ is (potentially uncountably) infinite. Furthermore, these arms are partitioned into a finite number of disjoint groups, with each group having a (potentially uncountably) infinite number of arms. The set of these disjoint groups is denoted by $\mathcal{G}$. The mean rewards for each group $G \in \mathcal{G}$ form a probability space $(M_G, \mathcal{F}_G, P_G)$, where $M_G = \mu(G) = \{\mu_j : j \in G\}$. For each group $G$, we define a corresponding reservoir distribution CDF $F_G : M_G \to [0,1]$, as well as a quantile function $F_G^{-1} : [0,1] \to M_G$, by
$$F_G(\tau) := P_G(\{\mu \le \tau\}) \quad \text{and} \quad F_G^{-1}(p) := \inf\{\mu : F_G(\mu) \ge p\}, \tag{2}$$
where $F_G$ is assumed to have bounded support. Our goal is to identify a group in $\mathcal{G}$ with the highest $(1-\alpha)$-quantile, i.e., a group $G$ satisfying
$$F_G^{-1}(1-\alpha) = \max_{G' \in \mathcal{G}} F_{G'}^{-1}(1-\alpha). \tag{3}$$

1. We will find it useful to let $t$ index “rounds”, each possibly consisting of several arm pulls, but we will still be interested in the total number of arm pulls.
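To make the definitions in (2) and the goal (3) concrete, here is a minimal sketch (our own; the discrete reservoir distributions are made up for illustration) that evaluates each group's $(1-\alpha)$-quantile via the left-continuous inverse $F_G^{-1}(p) = \inf\{\mu : F_G(\mu) \ge p\}$ and reports the max-quantile group.

```python
import numpy as np

def reservoir_quantile(atoms, probs, p):
    """F_G^{-1}(p) = inf{mu : F_G(mu) >= p} for a discrete reservoir distribution."""
    order = np.argsort(atoms)
    atoms = np.asarray(atoms, dtype=float)[order]
    cdf = np.cumsum(np.asarray(probs, dtype=float)[order])
    return atoms[np.searchsorted(cdf, p)]  # first atom at which the CDF reaches p

# Hypothetical reservoirs: each group places the given probabilities on the given arm means.
reservoirs = {
    "G1": ([0.2, 0.5, 0.9], [0.5, 0.3, 0.2]),
    "G2": ([0.3, 0.6, 0.7], [0.2, 0.5, 0.3]),
}
alpha = 0.5  # target the median, i.e., the (1 - alpha)-quantile with alpha = 1/2
scores = {G: reservoir_quantile(a, w, 1 - alpha) for G, (a, w) in reservoirs.items()}
best_group = max(scores, key=scores.get)  # a group attaining (3); here "G2"
print(scores, best_group)
```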
Observe that when $\alpha$ is close to zero, this problem resembles that of finding a near-maximal arm across all groups, whereas when $\alpha$ is close to one, the problem resembles the max-min problem of finding the group whose worst arm is as high as possible (Wang and Scarlett, 2021). Throughout the paper, we treat $\alpha \in (0,1)$ as a fixed constant (e.g., $\alpha = \frac{1}{2}$ for the median), meaning its dependence may be omitted in $O(\cdot)$ notation.
Throughout the course of the algorithm, the following can be performed:
• The algorithm can request to receive one or more additional arms from any group of its choice; if that group is $G \in \mathcal{G}$, then the arm mean is drawn according to $F_G$.
• Among all the arms requested so far, the algorithm can perform pulls of the arms and observe the corresponding rewards, as usual.
We are interested in keeping the total number of arm pulls low; the total number of arms requested is not of direct importance (similar to previous infinite-arm problems such as (Aziz et al., 2018)).
It will be convenient to uniquely index each arm in a given group $G$ by $j \in [0,1]$ such that any new arm drawn from the reservoir distribution has its $j$ value drawn uniformly from $[0,1]$, and has mean reward $\mu_{G,j} = F_G^{-1}(j)$.² We will use this convention, but it is important to note that whenever the algorithm requests an arm, its index $j$ remains unknown. We also note that two different indices $j \ne j'$ in a given group could still have the same means (i.e., $\mu_{G,j} = \mu_{G,j'}$), e.g., when the underlying reservoir distribution is discrete.

2. We may assume that the same $j$ value is never drawn twice, since this is a zero-probability event. We emphasize that with $\mu_{G,j} = F_G^{-1}(j)$, the assumption of unique $j$ values does not necessarily imply unique arm means, as we still allow $F_G$ to have mass points.
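The arm-requesting and pulling interface above, together with the $j \in [0,1]$ indexing convention, can be mimicked by the following simulation sketch (our own construction, not the paper's; the uniform reservoirs and the Gaussian reward noise are assumptions chosen for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

class Group:
    """One group: requesting an arm draws j ~ Uniform[0,1] and sets its mean to F_G^{-1}(j)."""

    def __init__(self, inv_cdf, noise_std=1.0):
        self.inv_cdf = inv_cdf     # quantile function F_G^{-1} of the reservoir
        self.noise_std = noise_std
        self.arm_means = []        # means of arms requested so far (hidden from the learner)

    def request_arm(self):
        j = rng.uniform()          # the index j itself is never revealed to the algorithm
        self.arm_means.append(self.inv_cdf(j))
        return len(self.arm_means) - 1  # local handle for the newly requested arm

    def pull(self, arm):
        # Sub-Gaussian (here Gaussian) reward centred at the arm's mean.
        return self.arm_means[arm] + self.noise_std * rng.normal()

# Hypothetical reservoirs: arm means uniform on [0.2, 0.6] and on [0.3, 0.5] respectively.
groups = [Group(lambda j: 0.2 + 0.4 * j), Group(lambda j: 0.3 + 0.2 * j)]
arm = groups[0].request_arm()
reward = groups[0].pull(arm)
```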
$(\gamma, \epsilon)$-relaxations.
When no assumptions are made on the reservoir distributions, one may encounter scenarios where the $(1-\alpha)$-quantile of a given group is arbitrarily hard to pinpoint (e.g., because the underlying CDF has a near-horizontal region, implying a very low probability of any given arm being near the precise quantile). To alleviate this challenge, we add a $\gamma$-relaxation on $\alpha$ for some $\gamma < \min(\alpha, 1-\alpha)$ to limit the effort spent on identifying a quantile that is hard to pinpoint. In particular, we relax the task into finding a group $G$ satisfying $F_G^{-1}(1-\alpha+\gamma) \ge \max_{G' \in \mathcal{G}} F_{G'}^{-1}(1-\alpha-\gamma)$. Moreover, we further relax the task by only requiring that the chosen group $G$ satisfies
$$F_G^{-1}(1-\alpha+\gamma) \ge \max_{G' \in \mathcal{G}} F_{G'}^{-1}(1-\alpha-\gamma) - \epsilon, \tag{4}$$
for some $\epsilon > 0$. This relaxation allows us to limit the effort on distinguishing arms (potentially from different groups) whose expected rewards are very close to each other; analogous relaxations are common in standard best-arm identification problems. The general goal of the algorithm for our setup is to identify a group satisfying (4) with high probability while using as few arm pulls as possible.
Summary of contributions
With the problem setup in place, we can now summarize our main contributions (beyond the problem formulation itself):
• We introduce a two-step algorithm based on first requesting a fixed number of arms for each group, and then running a finite-arm subroutine. Our main result for this algorithm is an instance-dependent upper bound on the number of pulls needed to guarantee (4) with high probability (Corollary 4), expressed in terms of the reservoir distributions and fundamental gap quantities introduced in Section 3.3.
• In addition, we establish a worst-case upper bound in terms of $\gamma$ and $\epsilon$ alone (Eq. (13)), and a guarantee for the finite-arm setting (Theorem 3). We believe that these should be of independent interest.
• We show that adapting our two-step algorithm to a multi-step algorithm can lead to a better instance-dependent bound (Corollary 5), and discuss how the resulting gap terms may be similar to those of a potential instance-dependent lower bound (Appendix F.3), though formalizing the latter is left for future work.
• By deriving a worst-case lower bound (Theorem 6) and comparing it with our upper bound, we show that for worst-case instances, the optimal number of arm pulls scales as $\frac{|\mathcal{G}|}{\gamma^2 \epsilon^2}$ (for $|\mathcal{G}| \ge 2$) up to logarithmic factors.
Some of the innovations in our analysis include (i) suitably identifying the relevant “gaps” in the finite-arm elimination algorithm and adapting the analysis accordingly; (ii) setting up the relaxation (4) and determining how it impacts the number of arms to request and how to characterize the high-probability behavior of their gaps; (iii) suitably extending the two-step algorithm to a multi-step algorithm; and (iv) proving the lower bound via a likelihood-ratio based analysis that is, to the best of our knowledge, distinct from others in the bandit literature.
3. Algorithm and Upper Bound
In this section, we introduce our main algorithm and provide its performance guarantee. We make
the following standard assumptions on the reward distributions.
Assumption 1 For every arm $j$ (in every group $G$; the group dependence is left implicit here), we assume that the mean reward $\mu_j$ is bounded in $[0,1]$,³ and that the reward distribution is sub-Gaussian with parameter $\sigma^2 \le 1$.⁴ That is, for each arm $j$, and for each $\lambda \in \mathbb{R}$, if $X$ is a random variable drawn from the arm's reward distribution, then $\mathbb{E}[X] = \mu_j$ and $\mathbb{E}[e^{\lambda(X-\mu_j)}] \le \exp(\lambda^2\sigma^2/2)$. Moreover, we assume that all of the reward distributions come from a common family of distributions that are uniquely parametrized by their mean (e.g., Bernoulli, or Gaussian with a fixed variance).
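For concreteness, here is a minimal sketch (ours) of two reward families consistent with Assumption 1: Bernoulli rewards, which are bounded in $[0,1]$ and hence sub-Gaussian with $\sigma^2 = 1/4$, and Gaussian rewards with a fixed variance shared across arms; both families are uniquely parametrized by their mean.

```python
import numpy as np

rng = np.random.default_rng(0)

def bernoulli_reward(mu):
    # Bounded in [0, 1], hence sub-Gaussian with sigma^2 = 1/4 <= 1.
    return float(rng.random() < mu)

def gaussian_reward(mu, sigma2=0.25):
    # N(mu, sigma2) with a fixed variance shared by all arms; sub-Gaussian with parameter sigma2.
    return mu + np.sqrt(sigma2) * rng.normal()
```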
3.1. Description of the Algorithm
We present our two-step algorithm in Algorithm 1. This algorithm uses a finite-arm best-quantile identification (BQID) algorithm FiniteArmBQID as a sub-routine, and its description is deferred to Algorithm 2 in Appendix C.3. For now, we only need to treat this step in a “black-box” manner with certain guarantees outlined at the end of this subsection.
3. Any finite interval can be shifted and scaled to this range.
4. The upper bound $\sigma^2 \le 1$ is for convenience in applying known confidence bounds, and can easily be relaxed.
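At a high level, the two-step structure can be sketched as below (our own paraphrase; the choice of `n_arms_per_group` and the `finite_arm_bqid` subroutine are placeholders, with Algorithm 1 in the paper specifying the former and Algorithm 2 in Appendix C.3 providing the latter). The `Group` objects are assumed to expose `request_arm` and `pull` as in the earlier simulation sketch.

```python
def two_step_max_quantile(groups, n_arms_per_group, finite_arm_bqid):
    """Step 1: request a fixed number of arms from each group (no pulls yet).
    Step 2: run a finite-arm grouped best-quantile identification subroutine,
    treated as a black box, on the requested arms; it adaptively pulls them
    and returns a group intended to satisfy the relaxed guarantee (4)."""
    requested = {G: [G.request_arm() for _ in range(n_arms_per_group)] for G in groups}
    return finite_arm_bqid(groups, requested)
```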