Artificial Replay: A Meta-Algorithm for Harnessing
Historical Data in Bandits
Siddhartha Banerjee
Cornell University, School of Operations Research and Information Engineering, sbanerjee@cornell.edu
Sean R. Sinclair
Northwestern University, Department of Industrial Engineering and Management Sciences, sean.sinclair@northwestern.edu
Milind Tambe
Harvard University, Department of Computer Science, milind_tambe@harvard.edu
Lily Xu
University of Oxford, Leverhulme Centre for Nature Recovery & Department of Economics, lily.x@columbia.edu
Christina Lee Yu
Cornell University, School of Operations Research and Information Engineering, cleeyu@cornell.edu
Abstract. Most real-world deployments of bandit algorithms exist somewhere in between the offline and online set-up, where some
historical data is available upfront and additional data is collected dynamically online. How best to incorporate historical data to
“warm start” bandit algorithms is an open question: naively initializing reward estimates using all historical samples can suffer
from spurious data and imbalanced data coverage, leading to data inefficiency (amount of historical data used) — particularly for
continuous action spaces. To address these challenges, we propose Artificial Replay, a meta-algorithm for incorporating historical
data into any arbitrary base bandit algorithm. We show that Artificial Replay uses only a fraction of the historical data compared
to a full warm-start approach, while still achieving identical regret for base algorithms that satisfy independence of irrelevant data
(IIData), a novel and broadly applicable property that we introduce. We complement these theoretical results with experiments on
𝐾-armed bandits and continuous combinatorial bandits, on which we model green security domains using real poaching data. Our
results show the practical benefits of Artificial Replay for improving data efficiency, including for base algorithms that do not satisfy
IIData.
Key words: Multi-Armed Bandits, Historical Data, Regret Analysis, Combinatorial Bandits, Green Security
1. Introduction
Multi-armed bandits model decision-making settings with repeated choices between multiple
actions (known as “arms”), where the expected reward of each arm is unknown, and each arm pull
yields a noisy sample from the true reward distribution. Multi-armed bandits and their variants
have been used effectively to model many real-world problems. Resulting algorithms have been
applied to wireless networks (Zuo and Joe-Wong 2021), COVID testing regulations (Bastani et al.
2021), and wildlife conservation to prevent poaching (Xu et al. 2021). Typical bandit algorithms
are designed in one of two regimes: offline or online, depending on the assumptions on how data
is collected. In offline settings, the entire dataset is provided to the algorithm upfront, while in the
online setting the algorithm starts from scratch and dynamically collects data online. However, most
practical deployments exist between these two extremes; in applying bandits to wildlife conserva-
tion, for example, we may have years of historical patrol records that could help learn poaching risk
before any online bandit algorithm is deployed to assign new patrol patterns.
A priori, it is not immediately clear how one should initialize an algorithm given historical data
(especially when using more complex bandit algorithms). As a warm-up, consider the well-studied
UCB1 algorithm of Auer et al. (2002) for the 𝐾-armed bandit problem. For a finite set of
actions 𝑎 ∈ [𝐾] with unknown mean rewards 𝜇(𝑎), the algorithm maintains an estimate 𝜇𝑡(𝑎) of the
mean reward of each action 𝑎 ∈ [𝐾] and a count 𝑛𝑡(𝑎) of the number of times action 𝑎 has been
selected by the algorithm. At time 𝑡 the algorithm selects the action 𝐴𝑡 which maximizes the
so-called upper confidence bound (UCB):
UCB𝑡(𝑎) = 𝜇𝑡(𝑎) + 𝑂(√(1/𝑛𝑡(𝑎))),
where the 𝑂(·) hides polylogarithmic factors depending on the total number of timesteps 𝑇.
A simple approach to incorporating historical data into this algorithm would be to initialize the mean
estimates 𝜇𝑡(𝑎) and sample counts 𝑛𝑡(𝑎) at 𝑡 = 1 from the historical dataset, as if it had been collected
by the bandit algorithm itself. This approach exploits the fact that the mean estimates and counts for
each action form a “sufficient statistic” for UCB. We refer to this algorithm as Full Start, which is practical
and reasonable so long as the dataset is representative of the underlying model. This algorithm has
indeed been proposed for the simple 𝐾-armed bandit setting (Shivaswamy and Joachims 2012).
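To make the warm start concrete, the following is a minimal sketch of UCB1 with a Full Start initialization. This is our own illustrative code, not the paper's implementation: the class name, the interface, and the exact √(2 log 𝑇 / 𝑛) bonus term are assumptions made for the example.

import math

class UCB1FullStart:
    """Sketch of UCB1 warm-started with Full Start (illustrative only)."""

    def __init__(self, num_arms, horizon, historical_data):
        self.num_arms, self.horizon = num_arms, horizon
        self.counts = [0] * num_arms    # n_t(a): times arm a has been observed
        self.means = [0.0] * num_arms   # mu_t(a): empirical mean reward of arm a
        # Full Start: fold every historical (arm, reward) pair into the
        # sufficient statistics before the first online round.
        for arm, reward in historical_data:
            self.observe(arm, reward)

    def observe(self, arm, reward):
        # Incremental update of the count and empirical mean.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

    def select_arm(self):
        # Pull each arm with no data once, then maximize the UCB index.
        for arm in range(self.num_arms):
            if self.counts[arm] == 0:
                return arm
        return max(range(self.num_arms),
                   key=lambda a: self.means[a]
                   + math.sqrt(2 * math.log(self.horizon) / self.counts[a]))

For instance, UCB1FullStart(num_arms=5, horizon=10000, historical_data=hist) processes every record in hist up front, regardless of how informative those records are; this is precisely the data inefficiency discussed next.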
1.1. Data Inefficiency of Full Start
While Full Start is intuitive and simple, this naive approach can lead to excessive data inefficiency
(the amount of historical data used by the algorithm), resulting in potentially unbounded computa-
tion and storage costs. We outline the challenges of Full Start here, then in Section 4 we theoretically
demonstrate that our approach can achieve arbitrarily better data efficiency than Full Start, leading
to significant reductions in computational and storage overhead.
In an extreme scenario, consider padding the historical data with samples from an arm 𝑎 whose
mean reward 𝜇(𝑎) is lowest. Full Start will use the entire historical dataset at timestep 𝑡 = 1, regardless
of its inherent quality. This is especially surprising, since the celebrated result of Lai and
Robbins (1985) shows that 𝑂(log(𝑇)/Δ(𝑎)²) samples are sufficient to discern that an action 𝑎 is
suboptimal (where Δ(𝑎) is the gap between 𝜇(𝑎) and the reward of the optimal action). As the
number of data points in the historical dataset from this suboptimal action tends towards infinity,
Full Start uses an unbounded amount of historical data, even though only a fraction of the data was
necessary to appropriately rule out this action as being optimal. This inefficient use of data increases
computational overhead and degrades the algorithm’s data efficiency. Clearly, Full Start is not
an optimal approach for incorporating historical data, as it fails to selectively process only the most
informative data points, leading to unnecessary computational and storage burdens.
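For a rough sense of scale, with illustrative numbers of our own (ignoring constants): with horizon 𝑇 = 10⁶ and gap Δ(𝑎) = 0.1, the bound above suggests that roughly log(10⁶)/0.1² ≈ 1,400 samples of arm 𝑎 suffice to rule it out as optimal. If the historical dataset contains 10⁸ samples of that arm, Full Start still processes all of them, roughly 70,000 times more data than necessary.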
Full Start thus suffers from issues which we call spurious data and imbalanced data coverage,
arising because the algorithm unnecessarily processes and stores data regardless of its quality.
These issues are particularly salient since many applications of bandit algorithms, such as online
recommender systems, may have historical data with billions of past records; processing all data
points would require exceptionally high upfront compute and storage costs. Similarly, in the context
of wildlife poaching, historical data takes the form of past patrol data, which is guided by the
knowledge of domain experts and geographic constraints. As we will later see in Section 6, the
historical data is extremely geographically biased, so processing the entire historical dataset for
regions with large amounts of historical data is costly and unnecessary. Together, these challenges
highlight a key aspect of incorporating historical data to warm-start any bandit algorithm: unless
the historical data is collected by an optimal approach such as following a bandit algorithm, the
value of information of the historical data may not be a direct function of its size.
Lastly, while we’ve so far discussed the stylized 𝐾-armed bandit setup as an example, many
models of real-world systems require continuous actions over potentially combinatorial spaces with
constraints. For these more complex bandit settings, it may not even be clear how to instantiate
Full Start, as the base algorithm may not clearly admit a low-dimensional “sufficient statistic”.
Furthermore, the computation and storage costs increase dramatically with these more complex
models, for example, when estimates of the underlying mean reward function are computed by solving a
high-dimensional regression problem. In such instances, the computational complexity of generating
the estimate scales superlinearly with the data efficiency (the number of data points used).
Thus motivated, this paper seeks to answer the following questions:
• Is there a computationally efficient way to use only a subset of the historical data while
maintaining the same regret guarantees as Full Start? Equivalently, is there a data-efficient
way to incorporate historical data?
• Can we do so across different bandit set-ups, such as continuous or combinatorial bandits?
• How can we quantify the quality or usefulness of a given set of historical data?
We will evaluate algorithms in terms of their performance (measured by regret) and data efficiency
(measured by the number of historical datapoints accessed). We emphasize that this latter measure
is correlated with the computation and storage costs of the algorithm.
1.2. Our Contributions
In our first contribution, we propose Artificial Replay, a meta-algorithm that modifies any base
bandit algorithm to optimally harness historical data — that is, using the minimal data required
to achieve the highest possible performance. Artificial Replay improves data efficiency by using
historical data as needed — specifically, only when recommended by the base bandit algorithm.
Concretely, Artificial Replay uses the historical data as a replay buffer to artificially simulate online
actions. When the base algorithm selects an action, we first check the historical data for any unused
samples from the chosen action. If an unused sample exists, we update the reward estimates using
that historical data point and continue without advancing to the next timestep. This allows the
algorithm to obtain better estimates of the reward of that action without incurring additional regret.
Otherwise, we take an online step by sampling from the environment (incurring regret if that action
is suboptimal) and advance to the next timestep. We will demonstrate how this meta-algorithm
can be applied to a wide set of practical models: standard 𝐾-armed bandits, metric bandits with
continuous actions, and models with semi-bandit feedback.
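The following sketch shows this replay loop for the 𝐾-armed case. It is our own illustrative code rather than the paper's implementation: the interfaces base_alg.select_arm(), base_alg.observe(arm, reward), and env.pull(arm), as well as the (arm, reward) format of the historical data, are assumptions made for the example.

from collections import defaultdict, deque

def artificial_replay(base_alg, historical_data, env, horizon):
    """Sketch of the Artificial Replay loop for K-armed bandits (illustrative)."""
    # Index the historical data as per-arm replay buffers.
    buffers = defaultdict(deque)
    for arm, reward in historical_data:
        buffers[arm].append(reward)

    online_actions, used_history, t = [], 0, 0
    while t < horizon:
        arm = base_alg.select_arm()
        if buffers[arm]:
            # An unused historical sample exists for the chosen action:
            # feed it to the base algorithm without advancing the timestep
            # (and without incurring any regret).
            base_alg.observe(arm, buffers[arm].popleft())
            used_history += 1
        else:
            # No unused history for this action: take a real online step.
            base_alg.observe(arm, env.pull(arm))
            online_actions.append(arm)
            t += 1
    return online_actions, used_history

Paired with the UCB1 sketch above, the base algorithm would be constructed without a warm start, e.g. base_alg = UCB1FullStart(num_arms, horizon, historical_data=[]), so that historical samples enter only through the replay buffer, and only when the base algorithm selects the corresponding action.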
Given that Artificial Replay only uses a subset of the data, one might wonder if it incurs higher
regret than Full Start (a generalization of the algorithm described above for 𝐾-armed bandits).
Surprisingly, in our second contribution we prove that under a widely applicable condition, the
regret of Artificial Replay (as a random variable) is identical to that of Full Start, while also
guaranteeing significantly better data efficiency. Specifically, we show a sample-path coupling
between Artificial Replay and Full Start with the same base algorithm, as long as the base algorithm
satisfies a novel (and widely applicable) independence of irrelevant data (IIData) assumption.
We complement this result by also highlighting the regret improvement of Artificial Replay as
a function of the historical data it uses. Additionally, we prove that Artificial Replay can lead to
arbitrarily better computational complexity. These results highlight that Artificial Replay is a simple
approach for incorporating historical data with identical regret to Full Start and significantly better
data efficiency. To summarize, Artificial Replay Pareto-dominates Full Start in terms of the two
main metrics of interest: (𝑖) regret and (𝑖𝑖) data efficiency.
Finally, we show the benefits of Artificial Replay by instantiating it for several classes of bandits
and evaluating on real-world poaching data. To highlight the breadth of algorithms that satisfy the
IIData property, we show how standard UCB algorithms can be easily modified to be IIData while
still guaranteeing regret-optimality, such as for 𝐾-armed and continuous combinatorial bandits.
While the results for 𝐾-armed bandits are fairly straightforward, the fact that these results extend
to complex settings such as metric bandits with continuous actions or combinatorial bandits is not
at all obvious. The surprising insight in our result is that a similarly simple approach for complex
bandit settings can attain the same guarantees as full usage of the historical data with significant
data efficiency gains. Even for algorithms that do not satisfy IIData, such as Thompson sampling
and Information Directed Sampling (IDS), we demonstrate on real poaching data that Artificial
Replay still achieves concrete gains in regret and data efficiency over a range of base algorithms. We
close with a case study of combinatorial bandit algorithms with continuous actions in the context
of green security, using a novel adaptive discretization technique.
1.3. Related Work
Multi-armed bandit problems and their variants (including the finite-armed and CMAB-CRA
models to be discussed here) have a long history in the online learning and optimization literature.
We highlight the most closely related works below, but for more extensive references see Bubeck
et al. (2012), Slivkins (2019), and Lattimore and Szepesvári (2020).
Multi-Armed Bandits. The design and analysis of bandit algorithms have been considered under
a wide range of models. Lai and Robbins (1985), Auer et al. (2002) first studied the 𝐾-armed bandit,
where the decision maker chooses between a finite set of 𝐾 actions at each timestep. Numerous
follow-up works have extended these approaches to handle continuous action spaces (Kleinberg
et al. 2019) and combinatorial constraints (Chen et al. 2013, Xu et al. 2021). Zuo and Joe-Wong
(2021) propose a discrete model of the combinatorial multi-armed bandit that generalizes previous
work (Lattimore et al. 2014, 2015, Dagan and Koby 2018) to schedule a finite set of resources
to maximize the expected number of jobs finished. We extend their model to continuous actions,
tailored to the green security setting from Xu et al. (2021). Our work provides a framework to
modify existing algorithms to efficiently incorporate historical data. Moreover, we also propose and
evaluate a novel algorithm to incorporate adaptive discretization for combinatorial multi-armed
bandits for continuous resource allocation, extending the discrete model from Zuo and Joe-Wong
(2021).