However, a major drawback of CIs is that they are only valid at fixed, prespecified sample sizes, while contextual bandit data are collected sequentially and adaptively over time.
We lay out the assumed protocol for the generation of contextual bandit data in Algorithm 1,
and in particular, all of our results will assume access to the output of this algorithm, namely the
(potentially infinite) sequence of tuples $(X_t, A_t, R_t)_{t=1}^{T}$ for $T \in \mathbb{N} \cup \{\infty\}$. As is standard in OPE, we will always assume that the policy $\pi$ is (almost surely) absolutely continuous with respect to $h_t$, so that $\pi/h_t$ is almost surely finite (without which estimation and inference are not possible in general).
Indeed, this permits many bandit techniques and in principle allows for Thompson sampling, since it always assigns positive probability to each action (note that it may not always be easy to compute the probability of taking that action via Thompson sampling, but if those probabilities can be computed, they can be used directly within our framework). However, $(h_t)_{t=1}^{\infty}$ cannot be the result of Upper Confidence Bound (UCB)-style algorithms, since they take conditionally deterministic actions given the past, violating the absolute continuity of $\pi$ with respect to $h_t$.
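To make the sampling protocol concrete, the following minimal Python sketch generates logged bandit data in the spirit of the description above; it is an illustration under assumed toy distributions and policies, not a transcription of Algorithm 1. Contexts evolve exogenously over time, the logging policy $h_t$ is stochastic with full support (so $\pi/h_t$ is finite, as required), and each round records the tuple $(X_t, A_t, R_t)$ together with the ratio $\pi/h_t$.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
ACTIONS = np.array([0, 1, 2])  # assumed finite action space, for illustration

def logging_policy(x, t):
    # Hypothetical logging policy h_t(. | x): strictly positive on every action,
    # so the ratio pi / h_t remains finite (absolute continuity holds).
    logits = np.array([0.0, x, 0.5 * x]) / np.sqrt(t)
    p = np.exp(logits - logits.max())
    return p / p.sum()

def target_policy(x):
    # Hypothetical target policy pi(. | x) that we wish to evaluate off-policy.
    return np.array([0.2, 0.5, 0.3]) if x > 0 else np.array([0.6, 0.2, 0.2])

def generate_log(T):
    # Sequentially generate (X_t, A_t, R_t) tuples plus the weights pi / h_t.
    log, x_prev = [], 0.0
    for t in range(1, T + 1):
        x = 0.8 * x_prev + rng.normal()   # exogenously time-varying context (AR(1))
        h = logging_policy(x, t)
        a = rng.choice(ACTIONS, p=h)      # stochastic action: h_t(a | x) > 0 for all a
        r = rng.normal(loc=x + a)         # reward depends on (x, a), not on past actions
        log.append((x, a, r, target_policy(x)[a] / h[a]))
        x_prev = x
    return log

data = generate_log(T=1000)
\end{verbatim}
A UCB-style logging policy would instead place a point mass on a single action given the past, which is exactly the case ruled out above.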
In Algorithm 1, the term “exogenously time-varying” simply means that the context and reward distributions at time $t$ can only depend on the past through $X_1^{t-1} \equiv (X_1, \dots, X_{t-1})$, and not on the actions taken (or rewards received). Formally, we allow for any joint distribution over $(X_t, A_t, R_t)_{t=1}^{\infty}$ as long as
\[
p_{R_t}(r \mid x, a, H_{t-1}) = p_{R_t}(r \mid x, a, X_1^{t-1}) \quad \text{and} \quad p_{X_t}(x \mid H_{t-1}) = p_{X_t}(x \mid X_1^{t-1}), \tag{1}
\]
where $H_t$ is all of the history $\sigma\big((X_i, A_i, R_i)_{i=1}^{t}\big)$ up until time $t$. This conditional independence
requirement (1) includes as a special case more classical setups where $X_t$ is independent of all else given $A_t$, such as those considered in Bibaut et al. [4] or iid scenarios [28], but it is strictly more general since, for example, $(X_t)_{t=1}^{\infty}$ can be a highly dependent time series. However, we do not go as far as to consider the adversarial setting that is sometimes studied in the context of regret minimization. We impose this conditional independence requirement since otherwise the interpretation of $\mathbb{E}^{\pi}(R_t \mid H_{t-1})$ changes depending on which sequence of actions was played by the logging policy. Making matters
more concrete, the conditional off-policy value $\mathbb{E}^{\pi}(R_t \mid H_{t-1})$ at time $t$ is given by
\begin{align*}
\nu_t := \mathbb{E}^{\pi}(R_t \mid H_{t-1}) &\equiv \int_{\mathcal{X} \times \mathcal{A} \times \mathcal{R}} r \cdot p_{R_t}(r \mid a, x, H_{t-1})\, \pi(a \mid x)\, p_{X_t}(x \mid H_{t-1})\, dx\, da\, dr \tag{2}\\
&= \int_{\mathcal{X} \times \mathcal{A} \times \mathcal{R}} r \cdot p_{R_t}(r \mid a, x, X_1^{t-1})\, \pi(a \mid x)\, p_{X_t}(x \mid X_1^{t-1})\, dx\, da\, dr, \tag{3}
\end{align*}
and the equality (3) follows from (1). Notice that (2) could in principle depend on the logging policies
and actions played, but (3) does not. Even though we impose the conditional independence assumption (1), the integral (2) is a perfectly well-defined functional on its own, and if (1) is not satisfied, our CSs will still cover a quantity defined in terms of this functional. However, its interpretation would no longer be counterfactual with respect to the entire sequence of actions (only conditional on the past).
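For intuition, the following toy computation (our own illustration, with assumed discrete distributions) evaluates (3) in a fully discrete setting, where the integral reduces to a finite sum; it also checks, by simulation, that the same value can be recovered from data logged under a stochastic policy $h$ by reweighting rewards with the ratio $\pi/h$.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)

# Toy discrete setting: 2 contexts, 2 actions, binary rewards; contexts are iid,
# which is a special case of the exogeneity condition (1).
p_X = np.array([0.6, 0.4])                  # p_X(x)
pi = np.array([[0.9, 0.1], [0.3, 0.7]])     # target policy pi(a | x)
h = np.array([[0.5, 0.5], [0.5, 0.5]])      # stochastic logging policy h(a | x)
p_R1 = np.array([[0.2, 0.8], [0.7, 0.4]])   # P(R = 1 | x, a), i.e. E[R | x, a]

# Exact value via (3): sum over (x, a) of E[R | x, a] * pi(a | x) * p_X(x).
nu_exact = float(np.sum(p_X[:, None] * pi * p_R1))

# Simulation check: log data under h, then reweight rewards by pi / h.
T = 200_000
x = rng.choice(2, size=T, p=p_X)
a = np.array([rng.choice(2, p=h[xi]) for xi in x])
r = rng.binomial(1, p_R1[x, a])
nu_reweighted = float(np.mean(pi[x, a] / h[x, a] * r))

print(nu_exact, nu_reweighted)  # agree up to Monte Carlo error (about 0.352 here)
\end{verbatim}
The reweighting step is valid precisely because $h$ puts positive probability on every action that $\pi$ might take, which is the absolute continuity requirement discussed earlier.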
While most prior work on OPE in contextual bandits is not written causally in terms of potential
outcomes (e.g. [50, 28, 4, 5, 25]), it is nevertheless possible to write down a causal target $\nu_t^{\star}$ (i.e. a functional of the potential outcome distribution) and show that it is equal to $\nu_t$ under certain causal identification assumptions. These assumptions resemble the familiar consistency, exchangeability, and
positivity conditions that are ubiquitous in the treatment effect estimation literature. Moreover, there
is a close relationship between OPE and the estimation of so-called stochastic interventions in causal
inference; indeed, they can essentially be seen as equivalent but with slightly different emphases
and setups. However, given that neither the potential outcomes view nor the stochastic intervention
interpretation of OPE is typically emphasized in the contextual bandit literature (with the exception
of Zhan et al. [61], who use potential outcomes throughout), we leave this discussion to Appendix B.
To illustrate the shortcomings of CIs for OPE, suppose we run a contextual bandit algorithm and
want to see whether $\pi$ is better than the current state-of-the-art policy $h$, e.g. whether $\mathbb{E}^{\pi}(R) > \mathbb{E}^{h}(R)$. (Here, we are implicitly assuming that $\mathbb{E}^{\pi'}(R) = \mathbb{E}^{\pi'}(R_t \mid H_{t-1})$ for any policy $\pi'$, for the sake of illustration.) Suppose we compute a CI for the value of $\pi$ based on $n$ samples (for some prespecified