
Adaptive Experimental Design and Counterfactual Inference
TANNER FIEZ, SERGIO GAMEZ, ARICK CHEN, HOUSSAM NASSIF, and LALIT JAIN∗, Amazon
Adaptive experimental design methods are increasingly being used in industry as a tool to boost testing throughput or reduce
experimentation cost relative to traditional A/B/N testing methods. This paper shares lessons learned regarding the challenges and
pitfalls of naively using adaptive experimentation systems in industrial settings where non-stationarity is prevalent, while also
providing perspectives on the proper objectives and system specifications in these settings. We developed an adaptive experimental
design framework for counterfactual inference based on these experiences, and tested it in a commercial environment.
1 INTRODUCTION
A/B/N testing is a classic and ubiquitous form of experimentation that has a proven track record of driving key
performance indicators within industry [11]. Yet, experimenters are steadily shifting toward Adaptive Experimental
Design (AED) methods with the goal of increasing testing throughput or reducing the cost of experimentation. AED
promises to use a fraction of the impressions that traditional A/B/N tests require to yield high confidence inferences
or to directly drive business impact. In this paper, we share lessons learned regarding the challenges and pitfalls of
naively using adaptive experimentation systems in industrial settings where non-stationarity is the norm rather than
the exception. Moreover, we provide perspectives on the proper objectives and system specifications in these settings.
This culminates in a high-level presentation of an AED framework for counterfactual inference. To provide a robust and
flexible tool for experimenters with performance certificates at minimal cost, our methodology combines cumulative
gain estimators, always-valid confidence intervals, and an elimination algorithm.
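As a concrete illustration of how these pieces fit together, consider the following Python sketch. It is a minimal rendition under simplifying assumptions rather than our production system: rewards are modeled as 1-sub-Gaussian, the anytime radius comes from a simple union bound over sample sizes and arms (coarser than sharper law-of-the-iterated-logarithm intervals), running sample means stand in for cumulative gain estimators, and pull, delta, and max_rounds are illustrative placeholders.

import math

def anytime_radius(n, delta, num_arms):
    # Always-valid confidence radius for the mean of a 1-sub-Gaussian reward
    # after n samples: union bounding over every sample size n and every arm
    # keeps the interval valid at all times simultaneously.
    return math.sqrt(2.0 * math.log(4.0 * num_arms * n * n / delta) / n)

def successive_elimination(pull, arms, delta=0.05, max_rounds=100000):
    # pull(arm) -> stochastic reward; an illustrative interface, not a real API.
    k = len(arms)
    counts = {a: 0 for a in arms}
    means = {a: 0.0 for a in arms}
    active = set(arms)
    for _ in range(max_rounds):
        if len(active) == 1:
            break
        # Allocate uniformly over the surviving arms.
        for a in list(active):
            r = pull(a)
            counts[a] += 1
            means[a] += (r - means[a]) / counts[a]
        # Drop any arm whose upper confidence bound falls below the largest
        # lower confidence bound among the survivors.
        rad = {a: anytime_radius(counts[a], delta, k) for a in active}
        best_lcb = max(means[a] - rad[a] for a in active)
        active = {a for a in active if means[a] + rad[a] >= best_lcb}
    return active, means

With Bernoulli rewards, for instance, pull could be lambda a: float(random.random() < ctr[a]) for hypothetical click-through rates ctr; because the confidence intervals are valid at every sample size, the best arm survives elimination with probability at least 1 − delta no matter when the loop stops.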
2 A CASE STUDY
Imagine a setting where on a retailer web page, a marketer has been running a message 𝐴 for the last year and now
wants to test whether message 𝐵 beats 𝐴. At the start of the experiment the messages are initialized with a default
prior distribution, and then at each round a Thompson sampling bandit dynamically allocates traffic to each treatment,
playing each message according to the posterior probability of its mean being the highest [15]. After day 8, the algorithm
directs most traffic to message 𝐴 (see Figure 1). On day 14, the experimenter needs to decide whether 𝐴 has actually
beaten 𝐵. They conduct a paired t-test which, somewhat surprisingly, does not produce a significant 𝑝-value. As the
bandit shifted all traffic to message 𝐴, not enough traffic was directed to message 𝐵, diminishing the power of the test.
∗Also with University of Washington.