arXiv:2210.10768v3 [stat.ME] 15 Aug 2024
Anytime-valid off-policy inference for contextual bandits
Ian Waudby-Smith1, Lili Wu2, Aaditya Ramdas1, Nikos Karampatziakis2, and Paul Mineiro2
1Carnegie Mellon University
2Microsoft
ianws@cmu.edu,liliwu@microsoft.com,aramdas@cmu.edu,
nikosk@microsoft.com,pmineiro@microsoft.com
Abstract
Contextual bandit algorithms are ubiquitous tools for active sequential experimentation in healthcare and the tech industry. They involve online learning algorithms that adaptively learn policies over time to map observed contexts $X_t$ to actions $A_t$ in an attempt to maximize stochastic rewards $R_t$. This adaptivity raises interesting but hard statistical inference questions, especially counterfactual ones: for example, it is often of interest to estimate the properties of a hypothetical policy that is different from the logging policy that was used to collect the data — a problem known as "off-policy evaluation" (OPE). Using modern martingale techniques, we present a comprehensive framework for OPE inference that relaxes unnecessary conditions made in some past works (such as performing inference at prespecified sample sizes, uniformly bounded importance weights, constant logging policies, constant policy values, among others), significantly improving on them both theoretically and empirically. Importantly, our methods can be employed while the original experiment is still running (that is, not necessarily post-hoc), when the logging policy may itself be changing (due to learning), and even if the context distributions are a highly dependent time-series (such as if they are drifting over time). More concretely, we derive confidence sequences for various functionals of interest in OPE. These include doubly robust ones for time-varying off-policy mean reward values, but also confidence bands for the entire cumulative distribution function of the off-policy reward distribution. All of our methods (a) are valid at arbitrary stopping times, (b) only make nonparametric assumptions, (c) do not require importance weights to be uniformly bounded and, if they are, we do not need to know these bounds, and (d) adapt to the empirical variance of our estimators. In summary, our methods enable anytime-valid off-policy inference using adaptively collected contextual bandit data.
Contents

1 Introduction
1.1 Off-policy inference, confidence intervals, and confidence sequences
1.2 Desiderata for anytime-valid off-policy inference
1.3 Outline and contributions
1.4 Related work
1.5 Notation: supermartingales, filtrations, and stopping times
2 Warmup: Off-policy inference for constant policy values
2.1 Tighter confidence sequences via doubly robust pseudo-outcomes
2.2 Tuning, truncating, and mirroring
2.3 Closed-form confidence sequences
2.4 Fixed-time confidence intervals
3 Inference for time-varying policy values
3.1 A remark on policy value differences
3.2 Time-varying treatment effects in adaptive experiments
3.3 Sequential testing and anytime p-values for off-policy inference
4 Time-uniform inference for the off-policy CDF
5 Summary & extensions
A Proofs of the main results
A.1 A technical lemma
A.2 Proof of Theorem 1
A.3 Proof of Theorem 2
A.4 Proof of Proposition 2
A.5 Proof of Proposition 3
A.6 Proof of Theorem 3
B A causal view of contextual bandits via potential outcomes
1 Introduction
The so-called "contextual bandit" problem is an abstraction that can be used to describe several problem setups in statistics and machine learning [35, 36]. For example, it generalizes the multi-armed bandit problem by allowing for "contextual" side information, and it can be used to describe many adaptive sequential experiments. The general contextual bandit problem can be described informally as follows: an agent (such as a medical scientist in a clinical trial) views contextual information $X_t \in \mathcal{X}$ for subject $t$ (such as the clinical patient's demographics, medical history, etc.), takes an action $A_t \in \mathcal{A}$ (such as whether to administer a placebo, a low dose, or a high dose), and observes some reward $R_t$ (such as whether their adverse symptoms have subsided). This description is made formal in the protocol for the generation of contextual bandit data in Algorithm 1. In the present paper, no restrictions are placed on the dimensionality or structure of the context and action spaces $\mathcal{X}$ and $\mathcal{A}$ beyond them being measurable, but it is often helpful to think about $\mathcal{X}$ as a $d$-dimensional Euclidean space, and $\mathcal{A}$ as $\{0, 1\}$ for binary treatments, or $\mathbb{R}$ for different dosages, and so on. Indeed, while high-dimensional settings often pose certain challenges in contextual bandits (such as computational ones, or inflated variances), none of these issues will affect the validity of our statistical inference methods. Throughout, we will require that the rewards are real-valued and bounded in $[0, 1]$ — a common assumption in contextual bandits [50, 28] — except for Section 4, where we relax the boundedness constraint.
There are two main objectives that one can study in the contextual bandit setup: (1) policy optimization, and (2) off-policy evaluation (OPE) [36, 35, 11, 12]. Here, a "policy" $\pi(a \mid x)$ is simply a conditional distribution over actions, such as the probability that patient $t$ should receive various treatments given their context $X_t$. Policy optimization is concerned with finding policies that achieve high cumulative rewards (typically measured through regret), while off-policy evaluation is concerned with asking the counterfactual question: "how would we have done if we used some policy $\pi$ instead of the policy that is currently collecting data?". In this paper, we study the latter with a particular focus on statistical inference in adaptive, sequential environments under nonparametric assumptions.
1.1 Off-policy inference, confidence intervals, and confidence sequences
By far the most common parameter of interest in the OPE problem is the expected reward $\nu := \mathbb{E}_\pi(R)$ that would result from taking an action from the policy $\pi$. This expectation $\nu$ is called the "value" of the policy $\pi$. While several estimators for $\nu$ have been derived and refined over the years, many practical problems call for more than just a point estimate: we may also wish to quantify the uncertainty surrounding our estimates via statistical inference tools such as confidence intervals (CIs).
However, a major drawback of CIs is the fact that they are only valid at fixed and prespecified sample
sizes, while contextual bandit data are collected in a sequential and adaptive fashion over time.
We lay out the assumed protocol for the generation of contextual bandit data in Algorithm 1, and in particular, all of our results will assume access to the output of this algorithm, namely the (potentially infinite) sequence of tuples $(X_t, A_t, R_t)_{t=1}^T$ for $T \in \mathbb{N} \cup \{\infty\}$. As is standard in OPE, we will always assume that the policy $\pi$ is (almost surely) absolutely continuous with respect to $h_t$ so that $\pi / h_t$ is almost surely finite (without which, estimation and inference are not possible in general). Indeed, this permits many bandit techniques and in principle allows for Thompson sampling since it always assigns positive probability to an action (note that it may not always be easy to compute the probability of taking that action via Thompson sampling, but if those probabilities can be computed, they can be used directly within our framework). However, $(h_t)_{t=1}^\infty$ cannot be the result of Upper Confidence Bound (UCB)-style algorithms, since they take conditionally deterministic actions given the past, violating the absolute continuity of $\pi$ with respect to $h_t$.
In Algorithm 1, the term "exogenously time-varying" simply means that the context and reward distributions at time $t$ can only depend on the past through $X_1^{t-1} \equiv (X_1, \ldots, X_{t-1})$, and not on the actions taken (or rewards received). Formally, we allow for any joint distribution over $(X_t, A_t, R_t)_{t=1}^\infty$ as long as

$$p_{R_t}(r \mid x, a, H_{t-1}) = p_{R_t}(r \mid x, a, X_1^{t-1}) \quad \text{and} \quad p_{X_t}(x \mid H_{t-1}) = p_{X_t}(x \mid X_1^{t-1}), \tag{1}$$

where $H_t$ is all of the history $\sigma\big((X_i, A_i, R_i)_{i=1}^t\big)$ up until time $t$. This conditional independence requirement (1) includes as a special case more classical setups where $X_t$ is independent of all else given $A_t$, such as those considered in Bibaut et al. [4] or iid scenarios [28], but is strictly more general, since, for example, $(X_t)_{t=1}^\infty$ can be a highly dependent time-series. However, we do not go as far as to consider the adversarial setting that is sometimes studied in the context of regret minimization. We impose this conditional independence requirement since otherwise, the interpretation of $\mathbb{E}_\pi(R_t \mid H_{t-1})$ changes depending on which sequence of actions was played by the logging policy. Making matters more concrete, the conditional off-policy value $\mathbb{E}_\pi(R_t \mid H_{t-1})$ at time $t$ is given by

$$\nu_t := \mathbb{E}_\pi(R_t \mid H_{t-1}) \equiv \int_{\mathcal{X} \times \mathcal{A} \times \mathbb{R}} r \cdot p_{R_t}(r \mid a, x, H_{t-1})\, \pi(a \mid x)\, p_{X_t}(x \mid H_{t-1})\, dx\, da\, dr \tag{2}$$
$$= \int_{\mathcal{X} \times \mathcal{A} \times \mathbb{R}} r \cdot p_{R_t}(r \mid a, x, X_1^{t-1})\, \pi(a \mid x)\, p_{X_t}(x \mid X_1^{t-1})\, dx\, da\, dr, \tag{3}$$

and the equality (3) follows from (1). Notice that (2) could in principle depend on the logging policies and actions played, but (3) does not. Despite imposing the conditional independence assumption (1), the integral (2) is still a perfectly well-defined functional, and if (1) is not satisfied, then our CSs will still cover a quantity in terms of this functional. However, its interpretation would no longer be counterfactual with respect to the entire sequence of actions (only conditional on the past).
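To build intuition for why inference about $\nu_t$ is possible at all from data logged under $h_t$, recall the following standard importance-weighting identity (an expository addition here, not a result stated in this section; the doubly robust pseudo-outcomes of Section 2 build on it). Writing $w_t := \pi(A_t \mid X_t)/h_t(A_t \mid X_t)$ for the importance weight, and using the fact that, conditionally on $H_{t-1}$, the triple $(X_t, A_t, R_t)$ has joint density $p_{R_t}(r \mid a, x, H_{t-1})\, h_t(a \mid x)\, p_{X_t}(x \mid H_{t-1})$, we have

$$\mathbb{E}\left(w_t R_t \mid H_{t-1}\right) = \int_{\mathcal{X} \times \mathcal{A} \times \mathbb{R}} r \cdot \frac{\pi(a \mid x)}{h_t(a \mid x)}\, p_{R_t}(r \mid a, x, H_{t-1})\, h_t(a \mid x)\, p_{X_t}(x \mid H_{t-1})\, dx\, da\, dr = \nu_t,$$

since $h_t(a \mid x)$ cancels against the importance weight and the remaining integral is exactly (2). The absolute continuity of $\pi$ with respect to $h_t$ ensures that the ratio is well-defined wherever $\pi(a \mid x) > 0$.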
While most prior work on OPE in contextual bandits is not written causally in terms of potential outcomes (e.g. [50, 28, 4, 5, 25]), it is nevertheless possible to write down a causal target $\nu_t^\star$ (i.e. a functional of the potential outcome distribution) and show that it is equal to $\nu_t$ under certain causal identification assumptions. These assumptions resemble the familiar consistency, exchangeability, and positivity conditions that are ubiquitous in the treatment effect estimation literature. Moreover, there is a close relationship between OPE and the estimation of so-called stochastic interventions in causal inference; indeed, they can essentially be seen as equivalent but with slightly different emphases and setups. However, given that neither the potential outcomes view nor the stochastic intervention interpretation of OPE are typically emphasized in the contextual bandit literature (with the exception of Zhan et al. [61], who use potential outcomes throughout), we leave this discussion to Appendix B.
Algorithm 1 Protocol for the generation of contextual bandit data

// Here, $T \in \mathbb{N} \cup \{\infty\}$.
for $t = 1, 2, \ldots, T$ do
    // The agent selects a policy $h_t$ based on the history $H_{t-1} = \sigma\big((X_i, A_i, R_i)_{i=1}^{t-1}\big)$.
    $h_t \in H_{t-1}$.
    // The environment draws a context from an (exogenously time-varying) distribution.
    $X_t \sim p_{X_t}(\cdot)$.
    // The agent plays a random action drawn from the selected policy.
    $A_t \sim h_t(\cdot \mid X_t)$.
    // The environment draws a reward from an (exogenously time-varying) distribution based on the action and context.
    $R_t \sim p_{R_t}(\cdot \mid A_t, X_t)$.
end for
// Return a (potentially infinite) sequence of contextual bandit data.
return $(X_t, A_t, R_t)_{t=1}^T$
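As a concrete illustration of this protocol, the following Python snippet (our own sketch, not code from the paper) simulates Algorithm 1 with a drifting context distribution and an epsilon-greedy logging policy updated from the history, and records the logging propensities $h_t(A_t \mid X_t)$ needed for off-policy inference. The function name `run_bandit_protocol` and all distributional choices (the context drift, the reward model, the exploration schedule) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)


def run_bandit_protocol(T=1_000, n_actions=3, eps0=0.5):
    """Minimal simulation of Algorithm 1 (illustrative choices throughout)."""
    reward_sums = np.zeros(n_actions)   # sufficient statistics of the history H_{t-1}
    reward_counts = np.ones(n_actions)  # used by the (predictable) logging policy h_t
    logs = []
    for t in range(1, T + 1):
        # h_t is chosen based on H_{t-1}: epsilon-greedy over empirical mean rewards
        # (this particular h_t happens to ignore the context, which is allowed).
        eps = max(eps0 / np.sqrt(t), 0.01)
        greedy = np.argmax(reward_sums / reward_counts)
        h_t = np.full(n_actions, eps / n_actions)
        h_t[greedy] += 1.0 - eps

        # Exogenously time-varying context: depends on t (drift) but not on past actions.
        x_t = rng.normal(loc=np.sin(t / 200.0), scale=1.0)

        # The agent plays A_t ~ h_t(. | X_t) and records its propensity.
        a_t = rng.choice(n_actions, p=h_t)
        propensity = h_t[a_t]

        # Reward in [0, 1] depending on the action and context (again an arbitrary choice).
        p_reward = 1.0 / (1.0 + np.exp(-(0.5 * x_t + a_t - 1.0)))
        r_t = rng.binomial(1, p_reward)

        reward_sums[a_t] += r_t
        reward_counts[a_t] += 1
        logs.append((x_t, a_t, r_t, propensity))
    return logs


data = run_bandit_protocol()
print(len(data), data[:2])
```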
To illustrate the shortcomings of CIs for OPE, suppose we run a contextual bandit algorithm and want to see whether $\pi$ is better than the current state-of-the-art policy $h$ — e.g. whether $\mathbb{E}_\pi(R) > \mathbb{E}_h(R)$. (Here, we are implicitly assuming that $\mathbb{E}_{\pi'}(R) = \mathbb{E}_{\pi'}(R_t \mid H_{t-1})$ for any policy $\pi'$ for the sake of illustration.) Suppose we compute a CI for the value of $\pi$ based on $n$ samples (for some prespecified $n$), and while $\pi$ seems promising, the CI turns out to be inconclusive (the CI for $\mathbb{E}_\pi(R)$ includes $\mathbb{E}_h(R)$ if the latter is known, or the two CIs overlap if the latter is unknown). It is tempting to collect more data, for a total of $n'$ points, to see if the result is now conclusive; however, the resulting sample size $n'$ is now a data-dependent quantity, rendering the CI invalid. (This could happen more than once.)
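The danger of re-checking a fixed-sample CI after collecting more data can be seen in a small simulation (again our own illustration, with arbitrary choices of checkpoints, a Bernoulli(0.5) reward stream, and a standard CLT-based interval): even though each individual 95% CI is approximately valid at its own fixed sample size, the probability that the truth escapes at least one of the sequentially re-computed CIs is noticeably larger than 5%.

```python
import numpy as np

rng = np.random.default_rng(1)

alpha, mu = 0.05, 0.5          # nominal error level and true mean
z = 1.959963984540054          # standard normal 0.975 quantile
checkpoints = [100, 200, 400, 800, 1600, 3200]  # "collect more data and look again"
n_sims = 2000

ever_missed = 0
for _ in range(n_sims):
    x = rng.binomial(1, mu, size=max(checkpoints)).astype(float)
    missed = False
    for n in checkpoints:
        xbar = x[:n].mean()
        half_width = z * np.sqrt(xbar * (1 - xbar) / n + 1e-12)
        if abs(xbar - mu) > half_width:   # the fixed-n CLT interval misses the truth
            missed = True
            break
    ever_missed += missed

# Each individual CI misses with probability near 0.05, but the chance of *ever*
# missing across repeated looks is clearly inflated; this is exactly what invalidates
# data-dependent stopping, and what a confidence sequence controls by design.
print(f"P(some CI among the looks misses mu) is approximately {ever_missed / n_sims:.3f}")
```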
Fortunately, there exist statistical tools that permit adaptive stopping in these types of sequential data collection problems: confidence sequences (CSs) [9, 34]. A CS is a sequence of confidence intervals, valid at all sample sizes uniformly (and hence at arbitrary stopping times). Importantly for the aforementioned OPE setup, CSs allow practitioners to collect additional data and continuously monitor it, so that the resulting CI is indeed valid at the data-dependent stopped sample size $n'$. More formally, we say that a sequence of intervals $[L_t, U_t]_{t=1}^\infty$ is a CS for the parameter $\theta \in \mathbb{R}$ if

$$\mathbb{P}\left(\forall t \in \mathbb{N},\ \theta \in [L_t, U_t]\right) \geq 1 - \alpha, \quad \text{or equivalently,} \quad \mathbb{P}\left(\exists t \in \mathbb{N} : \theta \notin [L_t, U_t]\right) \leq \alpha. \tag{4}$$

Contrast (4) above with the definition of a CI, which states that $\forall n \in \mathbb{N},\ \mathbb{P}(\theta \in [L_n, U_n]) \geq 1 - \alpha$, so that the "$\forall n$" is outside the probability $\mathbb{P}(\cdot)$ rather than inside. A powerful consequence of (4) is that $[L_\tau, U_\tau]$ is a valid CI for any stopping time $\tau$. In fact, $[L_\tau, U_\tau]$ being a valid CI is not just an implication of (4); the two statements are actually equivalent; see Howard et al. [24, Lemma 3].
The consequence for the OPE practitioner is that they can continuously update and monitor a CS for the value of $\pi$ while the contextual bandit algorithm is running, and deploy $\pi$ as soon as they are confident that it is better than the current state-of-the-art $h$. Karampatziakis et al. [28] refer to this adaptive policy switching as "gated deployment", and we will return to this motivating example throughout the paper. Let us now lay out five desiderata that we want all of our CSs to satisfy.
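Before turning to those desiderata, here is a minimal sketch of the gated deployment loop (our own illustration, not the paper's method). It monitors a crude anytime-valid lower confidence bound on $\mathbb{E}_\pi(R)$, built from a fixed-$\lambda$ Hoeffding supermartingale and Ville's inequality, and flags $\pi$ for deployment as soon as the bound exceeds a known baseline value $\mathbb{E}_h(R)$. Unlike the CSs developed in this paper, this sketch assumes a known bound `W` on the importance weights (so the rescaled pseudo-outcomes lie in $[0, 1]$), uses an iid two-action example with made-up probabilities, and its width does not shrink to zero; it is only meant to illustrate the monitoring workflow.

```python
import numpy as np


def hoeffding_lower_cs(y, alpha=0.05, lam=0.2):
    """Running lower confidence bound on the common mean of observations
    y_1, y_2, ... in [0, 1], valid uniformly over all times t.  It follows from
    applying Ville's inequality to the nonnegative supermartingale
    exp(lam * sum_i (y_i - mu) - t * lam**2 / 8): crude (fixed lam, non-vanishing
    width), but genuinely anytime-valid."""
    y = np.asarray(y, dtype=float)
    t = np.arange(1, len(y) + 1)
    running_mean = np.cumsum(y) / t
    radius = np.log(1.0 / alpha) / (lam * t) + lam / 8.0
    return running_mean - radius


# Hypothetical two-action example (all numbers are illustrative assumptions):
# the logging policy h plays action 1 with probability 0.3, the target policy pi
# with probability 0.9; rewards are Bernoulli(0.8) under action 1 and Bernoulli(0.4)
# under action 0, so E_pi(R) = 0.76 while the known incumbent value is E_h(R) = 0.52.
rng = np.random.default_rng(2)
T, alpha = 5000, 0.05
a = rng.binomial(1, 0.3, size=T)                      # actions logged under h
r = rng.binomial(1, np.where(a == 1, 0.8, 0.4))       # rewards in [0, 1]
w = np.where(a == 1, 0.9 / 0.3, 0.1 / 0.7)            # importance weights pi / h
W = 0.9 / 0.3                                         # known weight bound (not needed by the paper's CSs)

pseudo = np.clip(w * r / W, 0.0, 1.0)                 # rescaled IPW pseudo-outcomes in [0, 1]
lower = W * hoeffding_lower_cs(pseudo, alpha=alpha)   # anytime-valid lower bound on E_pi(R)

baseline = 0.52                                       # assumed-known E_h(R)
hits = np.nonzero(lower > baseline)[0]
if hits.size:
    print(f"gated deployment: deploy pi at t = {hits[0] + 1}, lower bound = {lower[hits[0]]:.3f}")
else:
    print("lower bound never exceeded the baseline within T samples")
```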
1.2 Desiderata for anytime-valid off-policy inference
Throughout this paper, we will derive methods for off-policy evaluation and inference in a variety
of settings — including fixed policies (Section 2), time-varying policies (Section 3), and for entire
cumulative distribution functions (Section 4). However, what all of these approaches will have in
common is that they will satisfy five desirable properties which we enumerate here.
1. Nonasymptotic: Our confidence sets will satisfy exact coverage guarantees for any sample size,
unlike procedures based on the central limit theorem, which only satisfy approximate guarantees for large samples.^1
2. Nonparametric: We will not make any parametric assumptions on the distribution of the
contexts, policies, or rewards.
3. Time-uniform / anytime-valid: Our confidence sets will be uniformly valid for all sample
sizes, and permit off-policy inference at arbitrary data-dependent stopping times.
4. Adaptive data collection (via online learning): All of our off-policy inference methods will allow for the sequence of logging policies $(h_t)_{t=1}^\infty$ to be predictable (i.e. $h_t$ can depend on $H_{t-1}$). In particular, $(h_t)_{t=1}^\infty$ can be the result of an online learning algorithm.
5. Unknown and unbounded $w_{\max}$: In all of our algorithms, the maximal importance weight

$$w_{\max} := \operatorname*{ess\,sup}_{t \in \mathbb{N},\, a \in \mathcal{A},\, x \in \mathcal{X}} \frac{\pi(a \mid x)}{h_t(a \mid x)}$$

can be unknown, and need not be uniformly bounded (i.e. $w_{\max}$ can be infinite). Note that we do require that the importance weights $\pi(a \mid x)/h_t(a \mid x)$ themselves are finite for each $(t, a, x)$, but their essential supremum need not be (see the short sketch following this list for a concrete example). Perhaps surprisingly, even if $w_{\max}$ is infinite, it is still possible for our CSs to shrink to zero-width since they depend only on empirical variances. As an illustrative example, see Proposition 3 for a closed-form CS whose width can shrink at a rate of $\sqrt{\log \log t / t}$ as long as the importance-weighted rewards are well-behaved (e.g. in the iid setting, if they have finite second moments).
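As a toy illustration of desideratum 5 (our own example with made-up numbers), consider a logging policy whose exploration probability decays over time, as happens with many online learning algorithms: every individual importance weight $\pi(a \mid x)/h_t(a \mid x)$ is finite, yet $w_{\max} = \infty$ because the weights grow without bound over $t$.

```python
import numpy as np

# Two actions; the target policy pi always plays action 1.
pi = np.array([0.0, 1.0])

T = 10_000
max_weight_so_far = 0.0
for t in range(1, T + 1):
    # Epsilon-greedy logging policy whose exploration rate decays like 1/t and whose
    # greedy action happens to be action 0, so h_t(1 | x) = eps_t / 2 shrinks to zero.
    eps_t = 1.0 / t
    h_t = np.array([1.0 - eps_t / 2.0, eps_t / 2.0])

    # Each individual weight is finite (h_t(1 | x) > 0 for every finite t) ...
    weight_on_action_1 = pi[1] / h_t[1]          # equals 2 * t here
    max_weight_so_far = max(max_weight_so_far, weight_on_action_1)

# ... but their supremum over t is unbounded, i.e. w_max is infinite in the limit.
print(f"largest weight seen up to T = {T}: {max_weight_so_far:.0f}")
```

Desideratum 5 asks that inference remain valid in exactly this kind of regime, without knowledge of any bound on the weights.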
In addition to the above, we will design procedures so that they have strong empirical performance
and are straightforward to compute. While some of these desiderata are quite intuitive and common in
statistical inference (such as nonasymptotic, nonparametric, and time-uniform validity, so as to avoid
relying on large sample theory, unrealistic parametric assumptions, or prespecified sample sizes),
desiderata 4 and 5 are more specific to OPE and have not been satisfied in several prior works as we
outline in Sections 2, 3, and 4. Given their central importance to our work, let us elaborate on them
here.
Why allow for logging policies to be predictable? For the purpose of policy optimization, contextual bandit algorithms are tasked with balancing exploration and exploitation: simultaneously figuring out which policies will yield high rewards (at the expense of trying out suboptimal policies) and playing actions from the policies that have proven effective so far. On the other hand, in adaptive sequential trials, an experimenter might aim to balance context distributions between treatment arms (such as via Efron's biased coin design [14]) or to adaptively find the treatment policy that yields the most efficient treatment effect estimators [29]. In both cases, the logging policies $(h_t)_{t=1}^\infty$ are not only changing over time, but adaptively so based on previous observations. We strive to design procedures that permit inference in precisely these types of adaptive data collection regimes, despite most prior works on off-policy inference for contextual bandits having assumed that there is a fixed, prespecified logging policy [50, 28, 5, 25]. Of course, if a CS or CI is valid under adaptive data collection, it is also valid when fixed logging policies are used instead.
Why not rely on knowledge of $w_{\max}$? Related to the previous discussion, it may not be known a priori how the range of the predictable logging policies will evolve over time. Moreover, one can imagine a situation where $\sup_{a,x} \pi(a \mid x)/h_t(a \mid x) \to \infty$, even if every individual importance weight

^1 Note that nonasymptotic procedures may be more conservative than asymptotic ones, as they satisfy more rigorous coverage (similarly, type-I error) guarantees. Whether one should sacrifice stronger guarantees for tightness comes down to philosophical preference. For the purposes of this paper, we focus solely on nonasymptotics.