The Typical Behavior of Bandit Algorithms
Lin Fan
Department of Management Science and Engineering, Stanford University, Stanford, CA 94305, linfan@stanford.edu
Peter W. Glynn
Department of Management Science and Engineering, Stanford University, Stanford, CA 94305, glynn@stanford.edu
We establish strong laws of large numbers and central limit theorems for the regret of two of the most popular
bandit algorithms: Thompson sampling and UCB. Here, our characterizations of the regret distribution
complement the characterizations of the tail of the regret distribution recently developed in Fan and Glynn
(2021b). The tail characterizations there are associated with atypical bandit behavior on trajectories where
the optimal arm mean is under-estimated, leading to mis-identification of the optimal arm and large regret.
In contrast, our SLLN’s and CLT’s here describe the typical behavior and fluctuation of regret on trajectories
where the optimal arm mean is properly estimated. We find that Thompson sampling and UCB satisfy the
same SLLN and CLT, with the asymptotics of both the SLLN and the (mean) centering sequence in the
CLT matching the asymptotics of expected regret. Both the mean and variance in the CLT grow at log(T)
rates with the time horizon T. Asymptotically as T → ∞, the variability in the number of plays of each
sub-optimal arm depends only on the rewards received for that arm, which indicates that each sub-optimal
arm contributes independently to the overall CLT variance.
Key words : Multi-armed Bandits, Regret Distribution, Limit Theorems
Contents
1 Introduction
  1.1 Model and Preliminaries
2 Analysis of Thompson Sampling
  2.1 Strong Law of Large Numbers
  2.2 Central Limit Theorem
  2.3 Extension to Multiple Arms
3 Analysis of UCB
  3.1 Strong Law of Large Numbers
  3.2 Central Limit Theorem
  3.3 Extension to Multiple Arms
4 Modifications and Model Mis-specification
5 Numerical Simulations
A Technical Lemmas
1. Introduction
The multi-armed bandit (MAB) problem has become an extremely fruitful area of both research
and practice in recent decades. Along with the widespread deployment of bandit algorithms in
numerous diverse electronic applications, there has been a great deal of effort to better understand
the performance of empirically successful algorithms from a theoretical perspective. This literature,
by now vast, is almost entirely focused on algorithm design principles which produce small regret
in expectation. Here, small means that expected regret grows as a constant multiple of log(T) with
the time horizon T, as motivated by the Lai-Robbins lower bound (Lai and Robbins 1985) which
characterizes the minimum possible log(T) rate.
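For reference, a standard statement of this bound, specialized to the unit-variance Gaussian reward setting studied later in this paper (this specialization is supplied here for concreteness, using the notation of Section 1.1, and is not quoted from Lai and Robbins 1985), reads: for any uniformly good algorithm,
\[
\liminf_{T \to \infty} \frac{E[R(T)]}{\log(T)} \;\ge\; \sum_{k \neq k^*} \frac{\Delta_k}{\mathrm{KL}(P_k, P_{k^*})} \;=\; \sum_{k \neq k^*} \frac{2}{\Delta_k},
\]
where the equality uses the fact that the Kullback-Leibler divergence between N(μ_k, 1) and N(μ_{k*}, 1) equals Δ_k²/2.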
However, as highlighted by the recent work of Fan and Glynn (2021b), it is important to consider
other aspects of the regret distribution besides just the expected regret in bandit algorithm design.
It is shown there that focusing purely on expected regret minimization comes with several highly
undesirable side effects. First, algorithms with small or minimal rates of expected regret growth,
including the most popular ones based on the Thompson sampling (TS) (Thompson 1933) and
upper confidence bound (UCB) (Lai and Robbins 1985, Auer et al. 2002) strategies, must have
regret distributions with heavy (power-law) tails. These tails are essentially those of a truncated
Cauchy distribution, implying that there is a quite large probability of suffering very large regret.
Second, expected regret minimization provides no control over the growth rate of higher moments of
the regret, and notably there is no control over the variability of regret. Third, the truncated
Cauchy tails cause an algorithm to suffer large expected regret (growing as T^a for some 0 < a < 1)
when the bandit environment is just slightly mis-specified relative to the algorithm’s design.
In this paper, we develop approximations to the regret distribution that complement those of
Fan and Glynn (2021b). For fixed bandit environments, we show that as the time horizon T → ∞,
the regret of TS and UCB satisfies strong laws of large numbers (SLLN's) and central limit theorems
(CLT’s). (For simplicity, we consider versions of TS and UCB designed for Gaussian rewards.)
In fact, these limit theorems for TS and UCB are the same, with the asymptotics of both the
SLLN and the (mean) centering sequence in the CLT matching the asymptotics of expected regret.
Complementary to the characterizations of the regret distribution tail in Fan and Glynn (2021b),
the CLT’s here describe the concentration and shape of the main probability mass of the regret
distribution, centered around the expected regret. The tail characterizations in Fan and Glynn
(2021b) are obtained through changes of measure associated with trajectories where the optimal
arm mean is under-estimated, thereby causing the optimal arm to be mis-identified and resulting
in large regret. Here, our SLLN’s and CLT’s are implicitly associated with trajectories where the
mean of the optimal arm is properly estimated and the optimal arm is correctly identified. In
this sense, the tail of the regret distribution describes the atypical behavior of regret, whereas the
SLLN’s and CLT’s describe the typical behavior and fluctuation of regret.
Some additional highlights of our results are as follows. Both the means and variances in our
CLT’s grow at log(T) rates with T. By analogy with the large deviations theory for sums of iid
random variables, this suggests that large deviations of regret correspond to deviations from the
expected regret that are of order log(T). (Fan and Glynn (2021b) characterize the tail of the
regret distribution beyond log^{1+ε}(T) for small ε > 0. Future work will analyze deviations on the
log(T) scale.) The variability in our CLT’s is purely due to the variability of the sub-optimal arm
rewards. Asymptotically as T → ∞, the number of plays of each sub-optimal arm depends only
on the rewards received for that arm. So the numbers of plays of different sub-optimal arms are
asymptotically independent and contribute additively to the overall CLT variance. Lastly, we find
that the CLT becomes a better approximation to the regret distribution as the regret distribution
tail is made lighter by increasing the amount of exploration performed by the algorithm. (See
Section 5 of Fan and Glynn (2021b) for a sharp trade-off between the amount of exploration
performed by UCB-type algorithms and the resulting heaviness of the regret tail.)
In terms of related work, Wager and Xu (2021) and Fan and Glynn (2021a) develop diffusion
approximations for the regret of TS and related algorithms, and Kalvit and Zeevi (2021) develop
diffusion approximations for the regret of UCB. The approximations of the regret distribution in
these works are developed for bandit settings where the gaps between the arm means are roughly
of size 1/√T for time horizon T. So these distributional approximations are distinct from those
in this paper, which are developed for bandit settings with fixed gaps between arm means as
T → ∞. In Cowan and Katehakis (2019), SLLN's and laws of the iterated logarithm (LIL's) are
developed for a version of UCB and also for algorithms based on forced arm sampling according to
a predefined schedule. (Our SLLN for UCB is adapted from that of Cowan and Katehakis (2019),
but our SLLN’s for TS and our CLT’s for both UCB and TS are new.)
The rest of the paper is structured as follows. In Section 1.1, we provide a formal framework
for the MAB problem and introduce notation. In Section 2, we develop SLLN's (Theorems 1 and
3) and CLT's (Theorems 2 and 4) for the regret of TS in two- and multi-armed settings. Then, in
Section 3, we develop a SLLN (Theorem 5) and CLT's (Theorems 6 and 7) for the regret of UCB in
two- and multi-armed settings. In Sections 2 and 3, we work with versions of TS and UCB designed
for environments with iid Gaussian rewards with variance 1 (for simplicity), and we analyze their
regret behavior when they operate in such environments (i.e., in well-specified settings). Later, in
Section 4, we develop SLLN's and CLT's (Propositions 1 and 2) for the regret of TS and UCB in
possibly mis-specified settings. In such settings, we work with versions of TS and UCB designed
for iid Gaussian rewards with a specified variance σ² > 0, but we analyze their regret behavior in
environments with essentially arbitrary reward distributions. (In these mis-specified settings, we
still assume that the rewards are iid for simplicity, but our technical arguments can be adapted
to accommodate rewards evolving as stochastic processes.) Finally, we examine the validity of the
CLT’s over finite time horizons through numerical simulations in Section 5.
1.1. Model and Preliminaries
A K-armed MAB evolves within a bandit environment ν = (P_1, ..., P_K), where each P_k is a distribution on ℝ. At time t, the decision-maker selects an arm A(t) ∈ [K] = {1, ..., K} to play. The conditional distribution of A(t) given A(1), Y(1), ..., A(t − 1), Y(t − 1) is π_t(· | A(1), Y(1), ..., A(t − 1), Y(t − 1)), where π = (π_t, t ≥ 1) is a sequence of probability kernels, which constitutes the bandit algorithm (with π_t defined on ([K] × ℝ)^{t−1} × 2^{[K]}). Upon selecting the arm A(t), a reward Y(t) from arm A(t) is received as feedback. The conditional distribution of Y(t) given A(1), Y(1), ..., A(t − 1), Y(t − 1), A(t) is P_{A(t)}(·). We write X_k(t) to denote the reward received when arm k is played for the t-th instance, so that Y(t) = X_{A(t)}(N_{A(t)}(t)), where N_k(t) = Σ_{i=1}^t I(A(i) = k) denotes the number of plays of arm k up to and including time t. For each arm k, corresponding to N_k(t), we use T_k(j) to denote the time of the j-th play of arm k, and we use
\[
\tau_k(j) = T_k(j+1) - T_k(j)
\]
to denote the time in between the j-th and (j + 1)-th plays of arm k. At time t, the filtration for the bandit algorithms studied in this paper is given by
\[
\mathcal{F}_t = \{ A(1), \ldots, A(t),\; X_k(1), \ldots, X_k(N_k(t)),\; 1 \le k \le K \}.
\]
For any time n, the interaction between the algorithm π and the environment ν induces a unique probability P_{νπ}(·) on ([K] × ℝ)^∞ for which
\[
P_{\nu\pi}\big(A(1) = a_1, Y(1) \in dy_1, \ldots, A(n) = a_n, Y(n) \in dy_n\big) \;=\; \prod_{t=1}^{n} \pi_t(a_t \mid a_1, y_1, \ldots, a_{t-1}, y_{t-1})\, P_{a_t}(dy_t).
\]
Throughout the paper, all expectations and probabilities will be taken with respect to Pνπ (·). The
particular environment νand algorithm πunder consideration will be clear from the context, and
we will not write it explicitly.
The performance of an algorithm π is measured by the (pseudo-)regret (at time t):
\[
R(t) = \sum_{k \neq k^*} N_k(t)\, \Delta_k,
\]
where Δ_k = μ_{k*} − μ_k, k* is the optimal arm, and for any arm k′, μ_{k′} is the mean of its reward distribution. (We will always assume the optimal arm is unique for technical simplicity.) The goal in most settings is to find an algorithm π which minimizes the expected regret E[R(t)] as t → ∞.
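To fix ideas, the following sketch (an illustration added for this write-up, not part of the paper's formal development) simulates the interaction protocol above for unit-variance Gaussian rewards and records the pseudo-regret trajectory; the function name run_bandit and the select_arm callback interface are illustrative choices.

import numpy as np

def run_bandit(select_arm, arm_means, T, rng):
    # Simulate T rounds of a K-armed bandit with N(mu_k, 1) rewards.
    # select_arm(t, counts, sums, rng) returns the arm A(t+1) in {0, ..., K-1},
    # using only the history summarized by per-arm play counts and reward sums.
    K = len(arm_means)
    counts = np.zeros(K, dtype=int)                   # N_k(t)
    sums = np.zeros(K)                                # cumulative reward of each arm
    gaps = np.max(arm_means) - np.asarray(arm_means)  # Delta_k = mu_{k*} - mu_k
    regret = np.zeros(T)
    for t in range(T):
        a = select_arm(t, counts, sums, rng)          # play arm A(t+1)
        reward = rng.normal(arm_means[a], 1.0)        # Y(t+1) ~ N(mu_a, 1)
        counts[a] += 1
        sums[a] += reward
        regret[t] = counts @ gaps                     # R(t+1) = sum_k N_k(t+1) Delta_k
    return regret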
2. Analysis of Thompson Sampling
In this section, we analyze a version of TS that is designed for iid Gaussian rewards with variance 1.
We assume that the actual arm rewards are also iid Gaussian with variance 1, i.e., TS is operating
in a well-specified environment. For modifications and consideration of model mis-specification, see
Section 4.
Asymptotically, the prior on the arm means used in TS does not matter, so we put a N(0,1)
prior on all arm means for simplicity. Given F_t (the information collected up to and including time t), at time t + 1 TS generates one sample from the posterior distribution of the mean for each arm,
and then plays the arm with the highest sampled mean. This can be implemented by generating
exogenous N(0,1) random variables Z_k(t + 1) for each arm k, and then playing the arm:
\[
A(t+1) = \arg\max_{k \in [K]} \left\{ \hat\mu_k(N_k(t)) + \frac{Z_k(t+1)}{\sqrt{1 + N_k(t)}} \right\},
\]
where μ̂_k(n) = (1/(1 + n)) Σ_{j=1}^n X_k(j). As shown in Korda et al. (2013), the expected regret for well-specified Gaussian TS satisfies:
\[
\lim_{T \to \infty} \frac{E[R(T)]}{\log(T)} = \sum_{k \neq k^*} \frac{2}{\Delta_k}. \tag{1}
\]
(A Jeffreys prior is used in Korda et al. (2013) to derive a more general result that applies to any
exponential family reward distribution.)
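As a concrete illustration of this sampling rule (added here; the paper itself gives no code), the following selector implements Gaussian TS with N(0,1) priors and plugs into the hypothetical run_bandit sketch from Section 1.1; the two-armed instance in the usage example is an arbitrary choice.

import numpy as np

def thompson_sampling_select(t, counts, sums, rng):
    # With an N(0,1) prior and unit-variance rewards, the posterior of arm k's
    # mean after n plays is N(sums[k] / (1 + n), 1 / (1 + n)).  Sampling once
    # from each posterior and playing the largest sample is equivalent to
    #   argmax_k  mu_hat_k(N_k(t)) + Z_k(t+1) / sqrt(1 + N_k(t)),
    # with independent Z_k(t+1) ~ N(0,1).
    posterior_means = sums / (1.0 + counts)           # mu_hat_k(N_k(t))
    z = rng.normal(size=len(counts))                  # Z_k(t+1)
    return int(np.argmax(posterior_means + z / np.sqrt(1.0 + counts)))

# Example: a two-armed instance with means (0, 0.5), so the single gap is 0.5.
# Equation (1) then predicts E[R(T)] / log(T) -> 2 / 0.5 = 4 as T grows.
rng = np.random.default_rng(0)
regret = run_bandit(thompson_sampling_select, arm_means=[0.0, 0.5], T=10_000, rng=rng)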
To develop our SLLN's and CLT's, we analyze the times (T_k(j), j ≥ 1) at which each sub-optimal arm k is played. TS does not stop playing sub-optimal arms for any time horizon, and each sub-optimal arm is played roughly O(log(T)) times by time T. Thus, the spacings between the T_k(j) should increase exponentially with j. The log(T_k(j)) should then be on a linear scale and satisfy SLLN's and CLT's. Using basic identities from renewal theory, we can obtain corresponding limit theorems for the N_k(T).
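The renewal-type connection invoked here is, at bottom, the elementary duality between play counts and play times (recorded here for the reader's convenience; the formal arguments appear in the theorems that follow):
\[
\{ N_k(T) \ge j \} \;=\; \{ T_k(j) \le T \}, \qquad j \ge 1,
\]
so that SLLN's and CLT's for log(T_k(j)) as j → ∞ can be transferred to the play counts N_k(T) as T → ∞.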
To analyze the T_k(j), we consider probabilities of playing sub-optimal arms, as well as approximations to such probabilities. For each sub-optimal arm k, define:
\[
p_k(N_k(t), t+1) = P\left( \hat\mu_k(N_k(t)) + \frac{Z_k(t+1)}{\sqrt{1 + N_k(t)}} > \hat\mu_{k^*}(N_{k^*}(t)) + \frac{Z_{k^*}(t+1)}{\sqrt{1 + N_{k^*}(t)}} \,\middle|\, \mathcal{F}_t \right)
= P\left( V_k(t+1) < B_k(N_k(t), t+1) \,\middle|\, \mathcal{F}_t \right), \tag{2}
\]
where the V_k(t + 1) = 1 − Φ(Z_k(t + 1)) are distributed according to Unif(0,1), and we define
\[
B_k(N_k(t), t+1) = 1 - \Phi\!\left( \sqrt{1 + N_k(t)} \left( \hat\mu_{k^*}(N_{k^*}(t)) - \hat\mu_k(N_k(t)) + \frac{Z_{k^*}(t+1)}{\sqrt{1 + N_{k^*}(t)}} \right) \right). \tag{3}
\]
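To see how the second expression in (2) arises from the first (a one-line verification added here for completeness; the arguments N_k(t) and t + 1 are suppressed), rearrange the event and use the fact that x ↦ 1 − Φ(x) is strictly decreasing:
\[
\hat\mu_k + \frac{Z_k}{\sqrt{1+N_k}} > \hat\mu_{k^*} + \frac{Z_{k^*}}{\sqrt{1+N_{k^*}}}
\;\Longleftrightarrow\;
Z_k > \sqrt{1+N_k}\left( \hat\mu_{k^*} - \hat\mu_k + \frac{Z_{k^*}}{\sqrt{1+N_{k^*}}} \right)
\;\Longleftrightarrow\;
V_k < B_k,
\]
since V_k = 1 − Φ(Z_k) and B_k is exactly 1 − Φ applied to the right-hand side of the middle inequality, as in (3).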