arXiv:2210.10768v3 [stat.ME] 15 Aug 2024
Anytime-valid off-policy inference for contextual bandits
Ian Waudby-Smith1, Lili Wu2, Aaditya Ramdas1, Nikos Karampatziakis2, and Paul Mineiro2
1Carnegie Mellon University
2Microsoft
ianws@cmu.edu,liliwu@microsoft.com,aramdas@cmu.edu,
nikosk@microsoft.com,pmineiro@microsoft.com
Abstract
Contextual bandit algorithms are ubiquitous tools for active sequential experimentation in healthcare and the tech industry. They involve online learning algorithms that adaptively learn policies over time to map observed contexts $X_t$ to actions $A_t$ in an attempt to maximize stochastic rewards $R_t$. This adaptivity raises interesting but hard statistical inference questions, especially counterfactual ones: for example, it is often of interest to estimate the properties of a hypothetical policy that is different from the logging policy that was used to collect the data — a problem known as "off-policy evaluation" (OPE). Using modern martingale techniques, we present a comprehensive framework for OPE inference that relaxes unnecessary conditions made in some past works (such as performing inference at prespecified sample sizes, uniformly bounded importance weights, constant logging policies, constant policy values, among others), significantly improving on them both theoretically and empirically. Importantly, our methods can be employed while the original experiment is still running (that is, not necessarily post-hoc), when the logging policy may itself be changing (due to learning), and even if the context distributions are a highly dependent time-series (such as if they are drifting over time). More concretely, we derive confidence sequences for various functionals of interest in OPE. These include doubly robust ones for time-varying off-policy mean reward values, but also confidence bands for the entire cumulative distribution function of the off-policy reward distribution. All of our methods (a) are valid at arbitrary stopping times, (b) only make nonparametric assumptions, (c) do not require importance weights to be uniformly bounded and, if they are, we do not need to know these bounds, and (d) adapt to the empirical variance of our estimators. In summary, our methods enable anytime-valid off-policy inference using adaptively collected contextual bandit data.
Contents

1 Introduction
1.1 Off-policy inference, confidence intervals, and confidence sequences
1.2 Desiderata for anytime-valid off-policy inference
1.3 Outline and contributions
1.4 Related work
1.5 Notation: supermartingales, filtrations, and stopping times
2 Warmup: Off-policy inference for constant policy values
2.1 Tighter confidence sequences via doubly robust pseudo-outcomes
2.2 Tuning, truncating, and mirroring
2.3 Closed-form confidence sequences
2.4 Fixed-time confidence intervals
3 Inference for time-varying policy values
3.1 A remark on policy value differences
3.2 Time-varying treatment effects in adaptive experiments
3.3 Sequential testing and anytime p-values for off-policy inference
4 Time-uniform inference for the off-policy CDF
5 Summary & extensions
A Proofs of the main results
A.1 A technical lemma
A.2 Proof of Theorem 1
A.3 Proof of Theorem 2
A.4 Proof of Proposition 2
A.5 Proof of Proposition 3
A.6 Proof of Theorem 3
B A causal view of contextual bandits via potential outcomes
1 Introduction
The so-called "contextual bandit" problem is an abstraction that can be used to describe several problem setups in statistics and machine learning [35, 36]. For example, it generalizes the multi-armed bandit problem by allowing for "contextual" side information, and it can be used to describe many adaptive sequential experiments. The general contextual bandit problem can be described informally as follows: an agent (such as a medical scientist in a clinical trial) views contextual information $X_t \in \mathcal{X}$ for subject $t$ (such as the clinical patient's demographics, medical history, etc.), takes an action $A_t \in \mathcal{A}$ (such as whether to administer a placebo, a low dose, or a high dose), and observes some reward $R_t$ (such as whether their adverse symptoms have subsided). This description is made formal in the protocol for the generation of contextual bandit data in Algorithm 1. In the present paper, no restrictions are placed on the dimensionality or structure of the context and action spaces $\mathcal{X}$ and $\mathcal{A}$ beyond them being measurable, but it is often helpful to think about $\mathcal{X}$ as a $d$-dimensional Euclidean space, and $\mathcal{A}$ as $\{0, 1\}$ for binary treatments, or $\mathbb{R}$ for different dosages, and so on. Indeed, while high-dimensional settings often pose certain challenges in contextual bandits (such as computational ones, or inflated variances), none of these issues will affect the validity of our statistical inference methods. Throughout, we will require that the rewards are real-valued and bounded in $[0, 1]$ — a common assumption in contextual bandits [50, 28] — except for Section 4, where we relax the boundedness constraint.
There are two main objectives that one can study in the contextual bandit setup: (1) policy optimization, and (2) off-policy evaluation (OPE) [36, 35, 11, 12]. Here, a "policy" $\pi(a \mid x)$ is simply a conditional distribution over actions, such as the probability that patient $t$ should receive various treatments given their context $X_t$. Policy optimization is concerned with finding policies that achieve high cumulative rewards (typically measured through regret), while off-policy evaluation is concerned with asking the counterfactual question: "how would we have done if we used some policy $\pi$ instead of the policy that is currently collecting data?". In this paper, we study the latter with a particular focus on statistical inference in adaptive, sequential environments under nonparametric assumptions.
1.1 Off-policy inference, confidence intervals, and confidence sequences
By far the most common parameter of interest in the OPE problem is the expected reward $\nu := \mathbb{E}_\pi(R)$ that would result from taking an action from the policy $\pi$. This expectation $\nu$ is called the "value" of the policy $\pi$. While several estimators for $\nu$ have been derived and refined over the years, many practical problems call for more than just a point estimate: we may also wish to quantify the uncertainty surrounding our estimates via statistical inference tools such as confidence intervals (CIs).
However, a major drawback of CIs is the fact that they are only valid at fixed and prespecified sample
sizes, while contextual bandit data are collected in a sequential and adaptive fashion over time.
We lay out the assumed protocol for the generation of contextual bandit data in Algorithm 1, and in particular, all of our results will assume access to the output of this algorithm, namely the (potentially infinite) sequence of tuples $(X_t, A_t, R_t)_{t=1}^T$ for $T \in \mathbb{N} \cup \{\infty\}$. As is standard in OPE, we will always assume that the policy $\pi$ is (almost surely) absolutely continuous with respect to $h_t$ so that $\pi / h_t$ is almost surely finite (without which, estimation and inference are not possible in general). Indeed, this permits many bandit techniques and in principle allows for Thompson sampling since it always assigns positive probability to an action (note that it may not always be easy to compute the probability of taking that action via Thompson sampling, but if those probabilities can be computed, they can be used directly within our framework). However, $(h_t)_{t=1}^\infty$ cannot be the result of Upper Confidence Bound (UCB)-style algorithms, since they take conditionally deterministic actions given the past, violating the absolute continuity of $\pi$ with respect to $h_t$.
In Algorithm 1, the term "exogenously time-varying" simply means that the context and reward distributions at time $t$ can only depend on the past through $X_1^{t-1} \equiv (X_1, \ldots, X_{t-1})$, and not on the actions taken (or rewards received). Formally, we allow for any joint distribution over $(X_t, A_t, R_t)_{t=1}^\infty$ as long as

$$p_{R_t}(r \mid x, a, H_{t-1}) = p_{R_t}(r \mid x, a, X_1^{t-1}) \quad \text{and} \quad p_{X_t}(x \mid H_{t-1}) = p_{X_t}(x \mid X_1^{t-1}), \tag{1}$$

where $H_t$ is all of the history $\sigma\big((X_i, A_i, R_i)_{i=1}^t\big)$ up until time $t$. This conditional independence requirement (1) includes as a special case more classical setups where $X_t$ is independent of all else given $A_t$, such as those considered in Bibaut et al. [4] or iid scenarios [28], but is strictly more general, since, for example, $(X_t)_{t=1}^\infty$ can be a highly dependent time-series. However, we do not go as far as to consider the adversarial setting that is sometimes studied in the context of regret minimization. We impose this conditional independence requirement since otherwise, the interpretation of $\mathbb{E}_\pi(R_t \mid H_{t-1})$ changes depending on which sequence of actions was played by the logging policy. Making matters more concrete, the conditional off-policy value $\mathbb{E}_\pi(R_t \mid H_{t-1})$ at time $t$ is given by

$$\nu_t := \mathbb{E}_\pi(R_t \mid H_{t-1}) \equiv \int_{\mathcal{X} \times \mathcal{A} \times \mathbb{R}} r \cdot p_{R_t}(r \mid a, x, H_{t-1})\, \pi(a \mid x)\, p_{X_t}(x \mid H_{t-1})\, dx\, da\, dr \tag{2}$$
$$= \int_{\mathcal{X} \times \mathcal{A} \times \mathbb{R}} r \cdot p_{R_t}(r \mid a, x, X_1^{t-1})\, \pi(a \mid x)\, p_{X_t}(x \mid X_1^{t-1})\, dx\, da\, dr, \tag{3}$$

and the equality (3) follows from (1). Notice that (2) could in principle depend on the logging policies and actions played, but (3) does not. Despite imposing the conditional independence assumption (1), the integral (2) is still a perfectly well-defined functional, and if (1) is not satisfied, then our CSs will still cover a quantity in terms of this functional. However, its interpretation would no longer be counterfactual with respect to the entire sequence of actions (only conditional on the past).
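To build intuition for why inference about $\nu_t$ is possible at all from data logged under $h_t$, recall the following standard importance-weighting identity (an expository addition here, not a result stated in this section; the doubly robust pseudo-outcomes of Section 2 build on it). Writing $w_t := \pi(A_t \mid X_t)/h_t(A_t \mid X_t)$ for the importance weight, and using the fact that, conditionally on $H_{t-1}$, the triple $(X_t, A_t, R_t)$ has joint density $p_{R_t}(r \mid a, x, H_{t-1})\, h_t(a \mid x)\, p_{X_t}(x \mid H_{t-1})$, we have

$$\mathbb{E}\left(w_t R_t \mid H_{t-1}\right) = \int_{\mathcal{X} \times \mathcal{A} \times \mathbb{R}} r \cdot \frac{\pi(a \mid x)}{h_t(a \mid x)}\, p_{R_t}(r \mid a, x, H_{t-1})\, h_t(a \mid x)\, p_{X_t}(x \mid H_{t-1})\, dx\, da\, dr = \nu_t,$$

since $h_t(a \mid x)$ cancels against the importance weight and the remaining integral is exactly (2). The absolute continuity of $\pi$ with respect to $h_t$ ensures that the ratio is well-defined wherever $\pi(a \mid x) > 0$.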
While most prior work on OPE in contextual bandits is not written causally in terms of potential outcomes (e.g. [50, 28, 4, 5, 25]), it is nevertheless possible to write down a causal target $\nu_t^\star$ (i.e. a functional of the potential outcome distribution) and show that it is equal to $\nu_t$ under certain causal identification assumptions. These assumptions resemble the familiar consistency, exchangeability, and positivity conditions that are ubiquitous in the treatment effect estimation literature. Moreover, there is a close relationship between OPE and the estimation of so-called stochastic interventions in causal inference; indeed, they can essentially be seen as equivalent but with slightly different emphases and setups. However, given that neither the potential outcomes view nor the stochastic intervention interpretation of OPE are typically emphasized in the contextual bandit literature (with the exception of Zhan et al. [61], who use potential outcomes throughout), we leave this discussion to Appendix B.
Algorithm 1 Protocol for the generation of contextual bandit data

// Here, $T \in \mathbb{N} \cup \{\infty\}$.
for $t = 1, 2, \ldots, T$ do
    // The agent selects a policy $h_t$ based on the history $H_{t-1} = \sigma\big((X_i, A_i, R_i)_{i=1}^{t-1}\big)$.
    $h_t \in H_{t-1}$.
    // The environment draws a context from an (exogenously time-varying) distribution.
    $X_t \sim p_{X_t}(\cdot)$.
    // The agent plays a random action drawn from the selected policy.
    $A_t \sim h_t(\cdot \mid X_t)$.
    // The environment draws a reward from an (exogenously time-varying) distribution based on the action and context.
    $R_t \sim p_{R_t}(\cdot \mid A_t, X_t)$.
end for
// Return a (potentially infinite) sequence of contextual bandit data.
return $(X_t, A_t, R_t)_{t=1}^T$
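As a concrete illustration of this protocol, the following Python snippet (our own sketch, not code from the paper) simulates Algorithm 1 with a drifting context distribution and an epsilon-greedy logging policy updated from the history, and records the logging propensities $h_t(A_t \mid X_t)$ needed for off-policy inference. The function name `run_bandit_protocol` and all distributional choices (the context drift, the reward model, the exploration schedule) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)


def run_bandit_protocol(T=1_000, n_actions=3, eps0=0.5):
    """Minimal simulation of Algorithm 1 (illustrative choices throughout)."""
    reward_sums = np.zeros(n_actions)   # sufficient statistics of the history H_{t-1}
    reward_counts = np.ones(n_actions)  # used by the (predictable) logging policy h_t
    logs = []
    for t in range(1, T + 1):
        # h_t is chosen based on H_{t-1}: epsilon-greedy over empirical mean rewards
        # (this particular h_t happens to ignore the context, which is allowed).
        eps = max(eps0 / np.sqrt(t), 0.01)
        greedy = np.argmax(reward_sums / reward_counts)
        h_t = np.full(n_actions, eps / n_actions)
        h_t[greedy] += 1.0 - eps

        # Exogenously time-varying context: depends on t (drift) but not on past actions.
        x_t = rng.normal(loc=np.sin(t / 200.0), scale=1.0)

        # The agent plays A_t ~ h_t(. | X_t) and records its propensity.
        a_t = rng.choice(n_actions, p=h_t)
        propensity = h_t[a_t]

        # Reward in [0, 1] depending on the action and context (again an arbitrary choice).
        p_reward = 1.0 / (1.0 + np.exp(-(0.5 * x_t + a_t - 1.0)))
        r_t = rng.binomial(1, p_reward)

        reward_sums[a_t] += r_t
        reward_counts[a_t] += 1
        logs.append((x_t, a_t, r_t, propensity))
    return logs


data = run_bandit_protocol()
print(len(data), data[:2])
```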
To illustrate the shortcomings of CIs for OPE, suppose we run a contextual bandit algorithm and want to see whether $\pi$ is better than the current state-of-the-art policy $h$ — e.g. whether $\mathbb{E}_\pi(R) > \mathbb{E}_h(R)$. (Here, we are implicitly assuming that $\mathbb{E}_{\pi'}(R) = \mathbb{E}_{\pi'}(R_t \mid H_{t-1})$ for any policy $\pi'$ for the sake of illustration.) Suppose we compute a CI for the value of $\pi$ based on $n$ samples (for some prespecified $n$), and while $\pi$ seems promising, the CI turns out to be inconclusive (the CI for $\mathbb{E}_\pi(R)$ includes $\mathbb{E}_h(R)$ if the latter is known, or the two CIs overlap if the latter is unknown). It is tempting to collect more data, for a total of $n'$ points, to see if the result is now conclusive; however, the resulting sample size $n'$ is now a data-dependent quantity, rendering the CI invalid. (This could happen more than once.)
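The danger of re-checking a fixed-sample CI after collecting more data can be seen in a small simulation (again our own illustration, with arbitrary choices of checkpoints, a Bernoulli(0.5) reward stream, and a standard CLT-based interval): even though each individual 95% CI is approximately valid at its own fixed sample size, the probability that the truth escapes at least one of the sequentially re-computed CIs is noticeably larger than 5%.

```python
import numpy as np

rng = np.random.default_rng(1)

alpha, mu = 0.05, 0.5          # nominal error level and true mean
z = 1.959963984540054          # standard normal 0.975 quantile
checkpoints = [100, 200, 400, 800, 1600, 3200]  # "collect more data and look again"
n_sims = 2000

ever_missed = 0
for _ in range(n_sims):
    x = rng.binomial(1, mu, size=max(checkpoints)).astype(float)
    missed = False
    for n in checkpoints:
        xbar = x[:n].mean()
        half_width = z * np.sqrt(xbar * (1 - xbar) / n + 1e-12)
        if abs(xbar - mu) > half_width:   # the fixed-n CLT interval misses the truth
            missed = True
            break
    ever_missed += missed

# Each individual CI misses with probability near 0.05, but the chance of *ever*
# missing across repeated looks is clearly inflated; this is exactly what invalidates
# data-dependent stopping, and what a confidence sequence controls by design.
print(f"P(some CI among the looks misses mu) is approximately {ever_missed / n_sims:.3f}")
```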
Fortunately, there exist statistical tools that permit adaptive stopping in these types of sequential data collection problems: confidence sequences (CSs) [9, 34]. A CS is a sequence of confidence intervals, valid at all sample sizes uniformly (and hence at arbitrary stopping times). Importantly for the aforementioned OPE setup, CSs allow practitioners to collect additional data and continuously monitor it, so that the resulting CI is indeed valid at the data-dependent stopped sample size $n'$. More formally, we say that a sequence of intervals $[L_t, U_t]_{t=1}^\infty$ is a CS for the parameter $\theta \in \mathbb{R}$ if

$$\mathbb{P}\left(\forall t \in \mathbb{N},\ \theta \in [L_t, U_t]\right) \geq 1 - \alpha, \quad \text{or equivalently,} \quad \mathbb{P}\left(\exists t \in \mathbb{N} : \theta \notin [L_t, U_t]\right) \leq \alpha. \tag{4}$$

Contrast (4) above with the definition of a CI, which states that $\forall n \in \mathbb{N},\ \mathbb{P}(\theta \in [L_n, U_n]) \geq 1 - \alpha$, so that the "$\forall n$" is outside the probability $\mathbb{P}(\cdot)$ rather than inside. A powerful consequence of (4) is that $[L_\tau, U_\tau]$ is a valid CI for any stopping time $\tau$. In fact, $[L_\tau, U_\tau]$ being a valid CI is not just an implication of (4); the two statements are actually equivalent; see Howard et al. [24, Lemma 3].
The consequence for the OPE practitioner is that they can continuously update and monitor a CS for the value of $\pi$ while the contextual bandit algorithm is running, and deploy $\pi$ as soon as they are confident that it is better than the current state-of-the-art $h$. Karampatziakis et al. [28] refer to this adaptive policy switching as "gated deployment", and we will return to this motivating example throughout the paper. Let us now lay out five desiderata that we want all of our CSs to satisfy.
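Before turning to those desiderata, here is a minimal sketch of the gated deployment loop (our own illustration, not the paper's method). It monitors a crude anytime-valid lower confidence bound on $\mathbb{E}_\pi(R)$, built from a fixed-$\lambda$ Hoeffding supermartingale and Ville's inequality, and flags $\pi$ for deployment as soon as the bound exceeds a known baseline value $\mathbb{E}_h(R)$. Unlike the CSs developed in this paper, this sketch assumes a known bound `W` on the importance weights (so the rescaled pseudo-outcomes lie in $[0, 1]$), uses an iid two-action example with made-up probabilities, and its width does not shrink to zero; it is only meant to illustrate the monitoring workflow.

```python
import numpy as np


def hoeffding_lower_cs(y, alpha=0.05, lam=0.2):
    """Running lower confidence bound on the common mean of observations
    y_1, y_2, ... in [0, 1], valid uniformly over all times t.  It follows from
    applying Ville's inequality to the nonnegative supermartingale
    exp(lam * sum_i (y_i - mu) - t * lam**2 / 8): crude (fixed lam, non-vanishing
    width), but genuinely anytime-valid."""
    y = np.asarray(y, dtype=float)
    t = np.arange(1, len(y) + 1)
    running_mean = np.cumsum(y) / t
    radius = np.log(1.0 / alpha) / (lam * t) + lam / 8.0
    return running_mean - radius


# Hypothetical two-action example (all numbers are illustrative assumptions):
# the logging policy h plays action 1 with probability 0.3, the target policy pi
# with probability 0.9; rewards are Bernoulli(0.8) under action 1 and Bernoulli(0.4)
# under action 0, so E_pi(R) = 0.76 while the known incumbent value is E_h(R) = 0.52.
rng = np.random.default_rng(2)
T, alpha = 5000, 0.05
a = rng.binomial(1, 0.3, size=T)                      # actions logged under h
r = rng.binomial(1, np.where(a == 1, 0.8, 0.4))       # rewards in [0, 1]
w = np.where(a == 1, 0.9 / 0.3, 0.1 / 0.7)            # importance weights pi / h
W = 0.9 / 0.3                                         # known weight bound (not needed by the paper's CSs)

pseudo = np.clip(w * r / W, 0.0, 1.0)                 # rescaled IPW pseudo-outcomes in [0, 1]
lower = W * hoeffding_lower_cs(pseudo, alpha=alpha)   # anytime-valid lower bound on E_pi(R)

baseline = 0.52                                       # assumed-known E_h(R)
hits = np.nonzero(lower > baseline)[0]
if hits.size:
    print(f"gated deployment: deploy pi at t = {hits[0] + 1}, lower bound = {lower[hits[0]]:.3f}")
else:
    print("lower bound never exceeded the baseline within T samples")
```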
1.2 Desiderata for anytime-valid off-policy inference
Throughout this paper, we will derive methods for off-policy evaluation and inference in a variety
of settings — including fixed policies (Section 2), time-varying policies (Section 3), and for entire
cumulative distribution functions (Section 4). However, what all of these approaches will have in
common is that they will satisfy five desirable properties which we enumerate here.
1. Nonasymptotic: Our confidence sets will satisfy exact coverage guarantees for any sample size,
unlike procedures based on the central limit theorem, which only satisfy approximate guarantees for large samples.^1
2. Nonparametric: We will not make any parametric assumptions on the distribution of the
contexts, policies, or rewards.
3. Time-uniform / anytime-valid: Our confidence sets will be uniformly valid for all sample
sizes, and permit off-policy inference at arbitrary data-dependent stopping times.
4. Adaptive data collection (via online learning): All of our off-policy inference methods will allow for the sequence of logging policies $(h_t)_{t=1}^\infty$ to be predictable (i.e. $h_t$ can depend on $H_{t-1}$). In particular, $(h_t)_{t=1}^\infty$ can be the result of an online learning algorithm.
5. Unknown and unbounded $w_{\max}$: In all of our algorithms, the maximal importance weight

$$w_{\max} := \operatorname*{ess\,sup}_{t \in \mathbb{N},\, a \in \mathcal{A},\, x \in \mathcal{X}} \frac{\pi(a \mid x)}{h_t(a \mid x)}$$

can be unknown, and need not be uniformly bounded (i.e. $w_{\max}$ can be infinite). Note that we do require that the importance weights $\pi(a \mid x)/h_t(a \mid x)$ themselves are finite for each $(t, a, x)$, but their essential supremum need not be (see the short sketch following this list for a concrete example). Perhaps surprisingly, even if $w_{\max}$ is infinite, it is still possible for our CSs to shrink to zero-width since they depend only on empirical variances. As an illustrative example, see Proposition 3 for a closed-form CS whose width can shrink at a rate of $\sqrt{\log \log t / t}$ as long as the importance-weighted rewards are well-behaved (e.g. in the iid setting, if they have finite second moments).
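As a toy illustration of desideratum 5 (our own example with made-up numbers), consider a logging policy whose exploration probability decays over time, as happens with many online learning algorithms: every individual importance weight $\pi(a \mid x)/h_t(a \mid x)$ is finite, yet $w_{\max} = \infty$ because the weights grow without bound over $t$.

```python
import numpy as np

# Two actions; the target policy pi always plays action 1.
pi = np.array([0.0, 1.0])

T = 10_000
max_weight_so_far = 0.0
for t in range(1, T + 1):
    # Epsilon-greedy logging policy whose exploration rate decays like 1/t and whose
    # greedy action happens to be action 0, so h_t(1 | x) = eps_t / 2 shrinks to zero.
    eps_t = 1.0 / t
    h_t = np.array([1.0 - eps_t / 2.0, eps_t / 2.0])

    # Each individual weight is finite (h_t(1 | x) > 0 for every finite t) ...
    weight_on_action_1 = pi[1] / h_t[1]          # equals 2 * t here
    max_weight_so_far = max(max_weight_so_far, weight_on_action_1)

# ... but their supremum over t is unbounded, i.e. w_max is infinite in the limit.
print(f"largest weight seen up to T = {T}: {max_weight_so_far:.0f}")
```

Desideratum 5 asks that inference remain valid in exactly this kind of regime, without knowledge of any bound on the weights.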
In addition to the above, we will design procedures so that they have strong empirical performance
and are straightforward to compute. While some of these desiderata are quite intuitive and common in
statistical inference (such as nonasymptotic, nonparametric, and time-uniform validity, so as to avoid
relying on large sample theory, unrealistic parametric assumptions, or prespecified sample sizes),
desiderata 4 and 5 are more specific to OPE and have not been satisfied in several prior works as we
outline in Sections 2, 3, and 4. Given their central importance to our work, let us elaborate on them
here.
Why allow for logging policies to be predictable? For the purpose of policy optimization, contextual bandit algorithms are tasked with balancing exploration and exploitation: simultaneously figuring out which policies will yield high rewards (at the expense of trying out suboptimal policies) and playing actions from the policies that have proven effective so far. On the other hand, in adaptive sequential trials, an experimenter might aim to balance context distributions between treatment arms (such as via Efron's biased coin design [14]) or to adaptively find the treatment policy that yields the most efficient treatment effect estimators [29]. In both cases, the logging policies $(h_t)_{t=1}^\infty$ are not only changing over time, but adaptively so based on previous observations. We strive to design procedures that permit inference in precisely these types of adaptive data collection regimes, despite most prior works on off-policy inference for contextual bandits having assumed that there is a fixed, prespecified logging policy [50, 28, 5, 25]. Of course, if a CS or CI is valid under adaptive data collection, it is also valid when fixed logging policies are used instead.
Why not rely on knowledge of $w_{\max}$? Related to the previous discussion, it may not be known a priori how the range of the predictable logging policies will evolve over time. Moreover, one can imagine a situation where $\sup_{a,x} \pi(a \mid x)/h_t(a \mid x) \to \infty$, even if every individual importance weight

^1 Note that nonasymptotic procedures may be more conservative than asymptotic ones, as they satisfy more rigorous coverage (similarly, type-I error) guarantees. Whether one should sacrifice stronger guarantees for tightness comes down to philosophical preference. For the purposes of this paper, we focus solely on nonasymptotics.