In this paper, we consider a novel setup featuring aspects of both settings (ii) and (iv) above.
Specifically, we are given a finite number of groups, with each group having an (uncountably) infinite
number of arms, and our goal is to identify the group whose top $(1-\alpha)$-quantile arm (in terms of the mean reward) is as high as possible. While this setup builds on existing ones, we will find that it comes with unique challenges, leading to notable differences in both the upper and lower bounds.
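To preview the objective in symbols (the notation here is illustrative only; the precise definitions, including the exact quantile convention, are given in the formal problem setup), suppose there are $M$ groups, and let $G_m$ denote the CDF of the mean rewards of arms drawn from group $m$'s reservoir distribution. The goal can then roughly be stated as identifying
\begin{equation*}
m^\star \in \operatorname*{arg\,max}_{m \in \{1,\dots,M\}} G_m^{-1}(1-\alpha), \qquad G_m^{-1}(1-\alpha) := \inf\{x \in \mathbb{R} : G_m(x) \ge 1-\alpha\},
\end{equation*}
i.e., the group whose $(1-\alpha)$-quantile of mean rewards is highest, while using as few arm pulls as possible.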
A natural motivation for this problem is comparing large populations (e.g., of users, items, or
otherwise). As a concrete example, in recommendation systems, one may wish to know which
among several populations has the highest median click-through rate, while displaying as few total
recommendations as possible. Beyond any specific applications, we believe that our problem is
natural to consider in the broader context of infinite-arm bandits.
Before formally introducing the problem and stating our contributions, we outline some related work.
1.1. Related Work
The related work on multi-armed bandits is extensive (e.g., see (Slivkins, 2019; Lattimore and Szepesvári, 2020) and the references therein); we only provide a brief outline here, with an emphasis on the most closely related works.
Finite-arm settings.
The standard (single) best arm identification problem was studied in (Audibert et al., 2010; Gabillon et al., 2012; Jamieson and Nowak, 2014; Kaufmann et al., 2016; Garivier and Kaufmann, 2016), among others. These works are commonly distinguished according to whether the time horizon is fixed (fixed-budget setting) or the target error probability is fixed (fixed-confidence setting), and our focus is on the latter. In particular, we will utilize anytime confidence bounds from (Kaufmann et al., 2016) for upper-bounding the number of pulls.
A grouped best-arm identification problem was studied in (Gabillon et al., 2011; Bubeck et al., 2013; Scarlett et al., 2019), where the arms are allocated into groups, and the goal is to find the best arm in each group. Another notable setting in which multiple arms are returned is that of subset selection, where one seeks to find a subset of $k$ arms attaining the highest mean rewards (Kalyanakrishnan et al., 2012; Kaufmann and Kalyanakrishnan, 2013; Kaufmann et al., 2016). Group structure can also be incorporated into structured bandit frameworks (Huang et al., 2017; Gupta et al., 2020; Mukherjee et al., 2020; Neopane et al., 2021). Perhaps the most closely related among these is the max-min grouped bandits problem (Wang and Scarlett, 2021), which seeks a group whose worst arm is as high as possible in a finite-arm setting. We discuss the main similarities and differences relative to our setting in Appendix F, and highlight a connection to structured best-arm identification (Huang et al., 2017).
Infinite-arm settings.
One line of works in infinite-arm bandits assumes that the actions lie in a metric space, and the associated mean rewards satisfy some smoothness condition such as being locally Lipschitz (Kleinberg et al., 2008; Bubeck et al., 2008, 2011; Grill et al., 2015; Kleinberg et al., 2019). More relevant to our paper is the line of works considering a reservoir distribution on arms; throughout the learning process, new arms can be requested and previously requested ones can be pulled. Earlier works in this direction focused primarily on regret minimization (e.g., (Berry et al., 1997; Wang et al., 2008; Bonald and Proutière, 2013; David and Shimkin, 2014; Li and Xia, 2017; Kalvit and Zeevi, 2021)), and several recent works have considered pure exploration (e.g.,