The Typical Behavior of Bandit Algorithms
Lin Fan
Department of Management Science and Engineering, Stanford University, Stanford, CA 94305, linfan@stanford.edu
Peter W. Glynn
Department of Management Science and Engineering, Stanford University, Stanford, CA 94305, glynn@stanford.edu
We establish strong laws of large numbers and central limit theorems for the regret of two of the most popular
bandit algorithms: Thompson sampling and UCB. Here, our characterizations of the regret distribution
complement the characterizations of the tail of the regret distribution recently developed in Fan and Glynn
(2021b). The tail characterizations there are associated with atypical bandit behavior on trajectories where
the optimal arm mean is under-estimated, leading to mis-identification of the optimal arm and large regret.
In contrast, our SLLN’s and CLT’s here describe the typical behavior and fluctuation of regret on trajectories
where the optimal arm mean is properly estimated. We find that Thompson sampling and UCB satisfy the
same SLLN and CLT, with the asymptotics of both the SLLN and the (mean) centering sequence in the
CLT matching the asymptotics of expected regret. Both the mean and variance in the CLT grow at log(T)
rates with the time horizon T. Asymptotically as T → ∞, the variability in the number of plays of each
sub-optimal arm depends only on the rewards received for that arm, which indicates that each sub-optimal
arm contributes independently to the overall CLT variance.
Key words : Multi-armed Bandits, Regret Distribution, Limit Theorems
Contents
1 Introduction
  1.1 Model and Preliminaries
2 Analysis of Thompson Sampling
  2.1 Strong Law of Large Numbers
  2.2 Central Limit Theorem
  2.3 Extension to Multiple Arms
3 Analysis of UCB
  3.1 Strong Law of Large Numbers
  3.2 Central Limit Theorem
  3.3 Extension to Multiple Arms
4 Modifications and Model Mis-specification
5 Numerical Simulations
A Technical Lemmas
1. Introduction
The multi-armed bandit (MAB) problem has become an extremely fruitful area of both research
and practice in recent decades. Along with the widespread deployment of bandit algorithms in
numerous diverse electronic applications, there has been a great deal of effort to better understand
the performance of empirically successful algorithms from a theoretical perspective. This literature,
by now vast, is almost entirely focused on algorithm design principles which produce small regret
in expectation. Here, small means that expected regret grows as a constant multiple of log(T) with
the time horizon T, as motivated by the Lai-Robbins lower bound (Lai and Robbins 1985) which
characterizes the minimum possible log(T) rate.
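For reference, a standard statement of this bound, specialized to the unit-variance Gaussian reward setting studied later in this paper (this specialization is supplied here for concreteness, using the notation of Section 1.1, and is not quoted from Lai and Robbins 1985), reads: for any uniformly good algorithm,
\[
\liminf_{T \to \infty} \frac{E[R(T)]}{\log(T)} \;\ge\; \sum_{k \neq k^*} \frac{\Delta_k}{\mathrm{KL}(P_k, P_{k^*})} \;=\; \sum_{k \neq k^*} \frac{2}{\Delta_k},
\]
where the equality uses the fact that the Kullback-Leibler divergence between N(μ_k, 1) and N(μ_{k*}, 1) equals Δ_k²/2.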
However, as highlighted by the recent work of Fan and Glynn (2021b), it is important to consider
other aspects of the regret distribution besides just the expected regret in bandit algorithm design.
It is shown there that focusing purely on expected regret minimization comes with several highly
undesirable side effects. First, algorithms with small or minimal rates of expected regret growth,
including the most popular ones based on the Thompson sampling (TS) (Thompson 1933) and
upper confidence bound (UCB) (Lai and Robbins 1985, Auer et al. 2002) strategies, must have
regret distributions with heavy (power-law) tails. These tails are essentially those of a truncated
Cauchy distribution, implying that there is a quite large probability of suffering very large regret.
Second, expected regret minimization provides no control over the growth rate of higher moments of
the regret, and notably there is no control over the variability of regret. Third, the truncated
Cauchy tails cause an algorithm to suffer large expected regret (growing as T^a for some 0 < a < 1)
when the bandit environment is just slightly mis-specified relative to the algorithm’s design.
In this paper, we develop approximations to the regret distribution that complement those of
Fan and Glynn (2021b). For fixed bandit environments, we show that as the time horizon T → ∞,
the regret of TS and UCB satisfies strong laws of large numbers (SLLN's) and central limit theorems
(CLT’s). (For simplicity, we consider versions of TS and UCB designed for Gaussian rewards.)
In fact, these limit theorems for TS and UCB are the same, with the asymptotics of both the
SLLN and the (mean) centering sequence in the CLT matching the asymptotics of expected regret.
Complementary to the characterizations of the regret distribution tail in Fan and Glynn (2021b),
the CLT’s here describe the concentration and shape of the main probability mass of the regret
distribution, centered around the expected regret. The tail characterizations in Fan and Glynn
(2021b) are obtained through changes of measure associated with trajectories where the optimal
arm mean is under-estimated, thereby causing the optimal arm to be mis-identified and resulting
in large regret. Here, our SLLN’s and CLT’s are implicitly associated with trajectories where the
mean of the optimal arm is properly estimated and the optimal arm is correctly identified. In
this sense, the tail of the regret distribution describes the atypical behavior of regret, whereas the
SLLN’s and CLT’s describe the typical behavior and fluctuation of regret.
Some additional highlights of our results are as follows. Both the means and variances in our
CLT’s grow at log(T) rates with T. By analogy with the large deviations theory for sums of iid
random variables, this suggests that large deviations of regret correspond to deviations from the
expected regret that are of order log(T). (Fan and Glynn (2021b) characterize the tail of the
regret distribution beyond log^{1+ε}(T) for small ε > 0. Future work will analyze deviations on the
log(T) scale.) The variability in our CLT’s is purely due to the variability of the sub-optimal arm
rewards. Asymptotically as T → ∞, the number of plays of each sub-optimal arm depends only
on the rewards received for that arm. So the numbers of plays of different sub-optimal arms are
asymptotically independent and contribute additively to the overall CLT variance. Lastly, we find
that the CLT becomes a better approximation to the regret distribution as the regret distribution
tail is made lighter by increasing the amount of exploration performed by the algorithm. (See
Section 5 of Fan and Glynn (2021b) for a sharp trade-off between the amount of exploration
performed by UCB-type algorithms and the resulting heaviness of the regret tail.)
In terms of related work, Wager and Xu (2021) and Fan and Glynn (2021a) develop diffusion
approximations for the regret of TS and related algorithms, and Kalvit and Zeevi (2021) develop
diffusion approximations for the regret of UCB. The approximations of the regret distribution in
these works are developed for bandit settings where the gaps between the arm means are roughly
of size 1/√T for time horizon T. So these distributional approximations are distinct from those
in this paper, which are developed for bandit settings with fixed gaps between arm means as
T → ∞. In Cowan and Katehakis (2019), SLLN's and laws of the iterated logarithm (LIL's) are
developed for a version of UCB and also for algorithms based on forced arm sampling according to
a predefined schedule. (Our SLLN for UCB is adapted from that of Cowan and Katehakis (2019),
but our SLLN’s for TS and our CLT’s for both UCB and TS are new.)
The rest of the paper is structured as follows. In Section 1.1, we provide a formal framework
for the MAB problem and introduce notation. In Section 2, we develop SLLN's (Theorems 1 and
3) and CLT's (Theorems 2 and 4) for the regret of TS in two- and multi-armed settings. Then, in
Section 3, we develop a SLLN (Theorem 5) and CLT's (Theorems 6 and 7) for the regret of UCB in
two- and multi-armed settings. In Sections 2 and 3, we work with versions of TS and UCB designed
for environments with iid Gaussian rewards with variance 1 (for simplicity), and we analyze their
regret behavior when they operate in such environments (i.e., in well-specified settings). Later, in
Section 4, we develop SLLN's and CLT's (Propositions 1 and 2) for the regret of TS and UCB in
possibly mis-specified settings. In such settings, we work with versions of TS and UCB designed
for iid Gaussian rewards with a specified variance σ² > 0, but we analyze their regret behavior in
environments with essentially arbitrary reward distributions. (In these mis-specified settings, we
still assume that the rewards are iid for simplicity, but our technical arguments can be adapted
to accommodate rewards evolving as stochastic processes.) Finally, we examine the validity of the
CLT’s over finite time horizons through numerical simulations in Section 5.
1.1. Model and Preliminaries
A K-armed MAB evolves within a bandit environment ν = (P_1, ..., P_K), where each P_k is a distribution on ℝ. At time t, the decision-maker selects an arm A(t) ∈ [K] = {1, ..., K} to play. The conditional distribution of A(t) given A(1), Y(1), ..., A(t − 1), Y(t − 1) is π_t(· | A(1), Y(1), ..., A(t − 1), Y(t − 1)), where π = (π_t, t ≥ 1) is a sequence of probability kernels, which constitutes the bandit algorithm (with π_t defined on ([K] × ℝ)^{t−1} × 2^{[K]}). Upon selecting the arm A(t), a reward Y(t) from arm A(t) is received as feedback. The conditional distribution of Y(t) given A(1), Y(1), ..., A(t − 1), Y(t − 1), A(t) is P_{A(t)}(·). We write X_k(t) to denote the reward received when arm k is played for the t-th instance, so that Y(t) = X_{A(t)}(N_{A(t)}(t)), where N_k(t) = Σ_{i=1}^t I(A(i) = k) denotes the number of plays of arm k up to and including time t. For each arm k, corresponding to N_k(t), we use T_k(j) to denote the time of the j-th play of arm k, and we use
\[
\tau_k(j) = T_k(j+1) - T_k(j)
\]
to denote the time in between the j-th and (j + 1)-th plays of arm k. At time t, the filtration for the bandit algorithms studied in this paper is given by
\[
\mathcal{F}_t = \{ A(1), \ldots, A(t),\; X_k(1), \ldots, X_k(N_k(t)),\; 1 \le k \le K \}.
\]
For any time n, the interaction between the algorithm π and the environment ν induces a unique probability P_{νπ}(·) on ([K] × ℝ)^∞ for which
\[
P_{\nu\pi}\big(A(1) = a_1, Y(1) \in dy_1, \ldots, A(n) = a_n, Y(n) \in dy_n\big) \;=\; \prod_{t=1}^{n} \pi_t(a_t \mid a_1, y_1, \ldots, a_{t-1}, y_{t-1})\, P_{a_t}(dy_t).
\]
Throughout the paper, all expectations and probabilities will be taken with respect to Pνπ (·). The
particular environment νand algorithm πunder consideration will be clear from the context, and
we will not write it explicitly.
The performance of an algorithm π is measured by the (pseudo-)regret (at time t):
\[
R(t) = \sum_{k \neq k^*} N_k(t)\, \Delta_k,
\]
where Δ_k = μ_{k*} − μ_k, k* is the optimal arm, and for any arm k′, μ_{k′} is the mean of its reward distribution. (We will always assume the optimal arm is unique for technical simplicity.) The goal in most settings is to find an algorithm π which minimizes the expected regret E[R(t)] as t → ∞.
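To fix ideas, the following sketch (an illustration added for this write-up, not part of the paper's formal development) simulates the interaction protocol above for unit-variance Gaussian rewards and records the pseudo-regret trajectory; the function name run_bandit and the select_arm callback interface are illustrative choices.

import numpy as np

def run_bandit(select_arm, arm_means, T, rng):
    # Simulate T rounds of a K-armed bandit with N(mu_k, 1) rewards.
    # select_arm(t, counts, sums, rng) returns the arm A(t+1) in {0, ..., K-1},
    # using only the history summarized by per-arm play counts and reward sums.
    K = len(arm_means)
    counts = np.zeros(K, dtype=int)                   # N_k(t)
    sums = np.zeros(K)                                # cumulative reward of each arm
    gaps = np.max(arm_means) - np.asarray(arm_means)  # Delta_k = mu_{k*} - mu_k
    regret = np.zeros(T)
    for t in range(T):
        a = select_arm(t, counts, sums, rng)          # play arm A(t+1)
        reward = rng.normal(arm_means[a], 1.0)        # Y(t+1) ~ N(mu_a, 1)
        counts[a] += 1
        sums[a] += reward
        regret[t] = counts @ gaps                     # R(t+1) = sum_k N_k(t+1) Delta_k
    return regret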
2. Analysis of Thompson Sampling
In this section, we analyze a version of TS that is designed for iid Gaussian rewards with variance 1.
We assume that the actual arm rewards are also iid Gaussian with variance 1, i.e., TS is operating
in a well-specified environment. For modifications and consideration of model mis-specification, see
Section 4.
Asymptotically, the prior on the arm means used in TS does not matter, so we put a N(0,1)
prior on all arm means for simplicity. Given F_t (the information collected up to and including time t), at time t + 1 TS generates one sample from the posterior distribution of the mean for each arm,
and then plays the arm with the highest sampled mean. This can be implemented by generating
exogenous N(0,1) random variables Z_k(t + 1) for each arm k, and then playing the arm:
\[
A(t+1) = \arg\max_{k \in [K]} \left\{ \hat\mu_k(N_k(t)) + \frac{Z_k(t+1)}{\sqrt{1 + N_k(t)}} \right\},
\]
where μ̂_k(n) = (1/(1 + n)) Σ_{j=1}^n X_k(j). As shown in Korda et al. (2013), the expected regret for well-specified Gaussian TS satisfies:
\[
\lim_{T \to \infty} \frac{E[R(T)]}{\log(T)} = \sum_{k \neq k^*} \frac{2}{\Delta_k}. \tag{1}
\]
(A Jeffreys prior is used in Korda et al. (2013) to derive a more general result that applies to any
exponential family reward distribution.)
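As a concrete illustration of this sampling rule (added here; the paper itself gives no code), the following selector implements Gaussian TS with N(0,1) priors and plugs into the hypothetical run_bandit sketch from Section 1.1; the two-armed instance in the usage example is an arbitrary choice.

import numpy as np

def thompson_sampling_select(t, counts, sums, rng):
    # With an N(0,1) prior and unit-variance rewards, the posterior of arm k's
    # mean after n plays is N(sums[k] / (1 + n), 1 / (1 + n)).  Sampling once
    # from each posterior and playing the largest sample is equivalent to
    #   argmax_k  mu_hat_k(N_k(t)) + Z_k(t+1) / sqrt(1 + N_k(t)),
    # with independent Z_k(t+1) ~ N(0,1).
    posterior_means = sums / (1.0 + counts)           # mu_hat_k(N_k(t))
    z = rng.normal(size=len(counts))                  # Z_k(t+1)
    return int(np.argmax(posterior_means + z / np.sqrt(1.0 + counts)))

# Example: a two-armed instance with means (0, 0.5), so the single gap is 0.5.
# Equation (1) then predicts E[R(T)] / log(T) -> 2 / 0.5 = 4 as T grows.
rng = np.random.default_rng(0)
regret = run_bandit(thompson_sampling_select, arm_means=[0.0, 0.5], T=10_000, rng=rng)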
To develop our SLLN's and CLT's, we analyze the times (T_k(j), j ≥ 1) at which each sub-optimal arm k is played. TS does not stop playing sub-optimal arms for any time horizon, and each sub-optimal arm is played roughly O(log(T)) times by time T. Thus, the spacings between the T_k(j) should increase exponentially with j. The log(T_k(j)) should then be on a linear scale and satisfy SLLN's and CLT's. Using basic identities from renewal theory, we can obtain corresponding limit theorems for the N_k(T).
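The renewal-type connection invoked here is, at bottom, the elementary duality between play counts and play times (recorded here for the reader's convenience; the formal arguments appear in the theorems that follow):
\[
\{ N_k(T) \ge j \} \;=\; \{ T_k(j) \le T \}, \qquad j \ge 1,
\]
so that SLLN's and CLT's for log(T_k(j)) as j → ∞ can be transferred to the play counts N_k(T) as T → ∞.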
To analyze the T_k(j), we consider probabilities of playing sub-optimal arms, as well as approximations to such probabilities. For each sub-optimal arm k, define:
\[
p_k(N_k(t), t+1) = P\left( \hat\mu_k(N_k(t)) + \frac{Z_k(t+1)}{\sqrt{1 + N_k(t)}} > \hat\mu_{k^*}(N_{k^*}(t)) + \frac{Z_{k^*}(t+1)}{\sqrt{1 + N_{k^*}(t)}} \,\middle|\, \mathcal{F}_t \right)
= P\left( V_k(t+1) < B_k(N_k(t), t+1) \,\middle|\, \mathcal{F}_t \right), \tag{2}
\]
where the V_k(t + 1) = 1 − Φ(Z_k(t + 1)) are distributed according to Unif(0,1), and we define
\[
B_k(N_k(t), t+1) = 1 - \Phi\!\left( \sqrt{1 + N_k(t)} \left( \hat\mu_{k^*}(N_{k^*}(t)) - \hat\mu_k(N_k(t)) + \frac{Z_{k^*}(t+1)}{\sqrt{1 + N_{k^*}(t)}} \right) \right). \tag{3}
\]
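To see how the second expression in (2) arises from the first (a one-line verification added here for completeness; the arguments N_k(t) and t + 1 are suppressed), rearrange the event and use the fact that x ↦ 1 − Φ(x) is strictly decreasing:
\[
\hat\mu_k + \frac{Z_k}{\sqrt{1+N_k}} > \hat\mu_{k^*} + \frac{Z_{k^*}}{\sqrt{1+N_{k^*}}}
\;\Longleftrightarrow\;
Z_k > \sqrt{1+N_k}\left( \hat\mu_{k^*} - \hat\mu_k + \frac{Z_{k^*}}{\sqrt{1+N_{k^*}}} \right)
\;\Longleftrightarrow\;
V_k < B_k,
\]
since V_k = 1 − Φ(Z_k) and B_k is exactly 1 − Φ applied to the right-hand side of the middle inequality, as in (3).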