
Banerjee et al.: Artificial Replay 3
number of data points in the historical dataset from this suboptimal action tends towards infinity,
Full Start uses an unbounded amount of historical data, even though only a fraction of the data was
necessary to rule out this action as optimal. This inefficient use of data increases computational
overhead and degrades the algorithm's data efficiency. Clearly, Full Start is not
an optimal approach for incorporating historical data, as it fails to selectively process only the most
informative data points, leading to unnecessary computational and storage burdens.
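To make the inefficiency concrete, consider a minimal sketch of how Full Start would warm-start a K-armed UCB algorithm: every historical (arm, reward) pair is folded into the per-arm counts and empirical means before online play begins. The function names (`full_start_init`, `ucb_index`) and the synthetic dataset are illustrative assumptions, not from the paper.

```python
import numpy as np

def full_start_init(K, historical_data):
    """Ingest the ENTIRE historical dataset up front (Full Start).

    historical_data: iterable of (arm, reward) pairs.
    Returns per-arm pull counts and empirical mean rewards,
    the "sufficient statistic" for a K-armed UCB algorithm.
    """
    counts = np.zeros(K, dtype=int)
    means = np.zeros(K)
    for arm, reward in historical_data:
        counts[arm] += 1
        # incremental running-mean update
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means

def ucb_index(counts, means, t, c=2.0):
    """Standard UCB index at round t; never-pulled arms get +inf."""
    with np.errstate(divide="ignore"):
        bonus = np.sqrt(c * np.log(max(t, 1)) / counts)
    return np.where(counts > 0, means + bonus, np.inf)

# Usage: a heavily imbalanced historical dataset. Full Start
# processes all 10,000 samples of a clearly suboptimal arm,
# although a handful would suffice to rule it out.
rng = np.random.default_rng(0)
hist = [(0, rng.normal(0.2, 1.0)) for _ in range(10_000)]  # suboptimal arm
hist += [(1, rng.normal(0.8, 1.0)) for _ in range(5)]      # near-optimal arm
counts, means = full_start_init(K=3, historical_data=hist)
next_arm = int(np.argmax(ucb_index(counts, means, t=len(hist) + 1)))
```

Note that after ingesting everything, the first online pull goes to the never-observed arm 2, exactly as it would have if only a small fraction of arm 0's data had been processed.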
Full Start thus suffers from issues which we call spurious data and imbalanced data coverage,
arising because the algorithm unnecessarily processes and stores data regardless of its quality.
These issues are particularly salient since many applications of bandit algorithms, such as online
recommender systems, may have historical data with billions of past records; processing all data
points would require exceptionally high upfront compute and storage costs. Similarly, in the context
of wildlife poaching, historical data takes the form of past patrol data, which is guided by the
knowledge of domain experts and geographic constraints. As we will later see in Section 6, the
historical data is extremely geographically biased, so processing the entire historical dataset for
regions with large amounts of historical data is costly and unnecessary. Together, these challenges
highlight a key aspect of incorporating historical data to warm-start any bandit algorithm: unless
the historical data is collected by an optimal approach such as following a bandit algorithm, the
value of information of the historical data may not be a direct function of its size.
Lastly, while we have so far discussed the stylized 𝐾-armed bandit setup as an example, many
models of real-world systems require continuous actions over potentially combinatorial spaces with
constraints. For these more complex bandit settings, it may not even be clear how to instantiate
Full Start, as the base algorithm may not clearly admit a low-dimensional “sufficient statistic”.
Furthermore, the computation and storage costs increase dramatically with these more complex
models, for example, when estimates of the underlying mean reward function are computed as a high-
dimensional regression problem. In such instances, the computational complexity of generating
the estimate scales superlinearly with the number of historical data points used.
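As an illustrative sketch of this superlinear scaling (the example and names are assumptions, not from the paper): if the mean-reward function is estimated nonparametrically, say via kernel ridge regression with an RBF kernel, fitting requires building an n×n Gram matrix (O(n²)) and solving an n×n linear system (O(n³)) in the number n of historical data points, so doubling the data roughly octuples the fitting cost.

```python
import numpy as np

def kernel_ridge_fit(X, y, lengthscale=1.0, lam=1e-3):
    """Fit kernel ridge regression with an RBF kernel.

    The Gram matrix is n x n, so memory is O(n^2) and the
    linear solve is O(n^3) in the number of data points n --
    superlinear in the amount of historical data processed.
    """
    # pairwise squared distances, shape (n, n)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2 * lengthscale**2))
    # ridge term lam * I keeps the system well-conditioned
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
    return alpha

# Usage: estimate a mean-reward surface from 200 historical points.
rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 2))                      # historical contexts
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=200)  # noisy rewards
alpha = kernel_ridge_fit(X, y)
```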
Thus motivated, this paper seeks to answer the following questions:
• Is there a computationally efficient way to use only a subset of the historical data while
maintaining the same regret guarantees as Full Start? Equivalently, is there a data-efficient
way to incorporate historical data?
• Can we do so across different bandit set-ups, such as continuous or combinatorial bandits?
• How can we quantify the quality or usefulness of a given set of historical data?