Artificial Replay: A Meta-Algorithm for Harnessing
Historical Data in Bandits
Siddhartha Banerjee
Cornell University, School of Operations Research and Information Engineering, sbanerjee@cornell.edu
Sean R. Sinclair
Northwestern University, Department of Industrial Engineering and Management Sciences, sean.sinclair@northwestern.edu
Milind Tambe
Harvard University, Department of Computer Science, milind_tambe@harvard.edu
Lily Xu
University of Oxford, Leverhulme Centre for Nature Recovery & Department of Economics, lily.x@columbia.edu
Christina Lee Yu
Cornell University, School of Operations Research and Information Engineering, cleeyu@cornell.edu
Abstract. Most real-world deployments of bandit algorithms exist somewhere in between the offline and online set-up, where some
historical data is available upfront and additional data is collected dynamically online. How best to incorporate historical data to
“warm start” bandit algorithms is an open question: naively initializing reward estimates using all historical samples can suffer
from spurious data and imbalanced data coverage, leading to data inefficiency (amount of historical data used) — particularly for
continuous action spaces. To address these challenges, we propose Artificial Replay, a meta-algorithm for incorporating historical
data into any arbitrary base bandit algorithm. We show that Artificial Replay uses only a fraction of the historical data compared
to a full warm-start approach, while still achieving identical regret for base algorithms that satisfy independence of irrelevant data
(IIData), a novel and broadly applicable property that we introduce. We complement these theoretical results with experiments on
𝐾-armed bandits and continuous combinatorial bandits, on which we model green security domains using real poaching data. Our
results show the practical benefits of Artificial Replay for improving data efficiency, including for base algorithms that do not satisfy
IIData.
Key words: Multi-Armed Bandits, Historical Data, Regret Analysis, Combinatorial Bandits, Green Security
1. Introduction
Multi-armed bandits model decision-making settings with repeated choices between multiple
actions (known as “arms”), where the expected reward of each arm is unknown, and each arm pull
yields a noisy sample from the true reward distribution. Multi-armed bandits and their variants
have been used effectively to model many real-world problems. Resulting algorithms have been
applied to wireless networks (Zuo and Joe-Wong 2021), COVID testing regulations (Bastani et al.
2021), and wildlife conservation to prevent poaching (Xu et al. 2021). Typical bandit algorithms
are designed in one of two regimes: offline or online, depending on the assumptions on how data
is collected. In offline settings, the entire dataset is provided to the algorithm upfront, while in the
online setting the algorithm starts from scratch and dynamically collects data online. However, most
practical deployments exist between these two extremes; in applying bandits to wildlife conserva-
tion, for example, we may have years of historical patrol records that could help learn poaching risk
before any online bandit algorithm is deployed to assign new patrol patterns.
A priori, it is not immediately clear how one should initialize an algorithm given historical data
(especially when using more complex bandit algorithms). As a warm-up, consider the well-studied
UCB1 algorithm of Auer et al. (2002) for the 𝐾-armed bandit problem. For a finite set of
actions 𝑎 ∈ [𝐾] with unknown mean rewards 𝜇(𝑎), the algorithm maintains an estimate 𝜇𝑡(𝑎) of the
mean reward of each action 𝑎 ∈ [𝐾] and a count 𝑛𝑡(𝑎) of the number of times action 𝑎 has been
selected by the algorithm. At time 𝑡 the algorithm selects the action 𝐴𝑡 which maximizes the
so-called upper confidence bound (UCB):
UCB𝑡(𝑎) = 𝜇𝑡(𝑎) + 𝑂(√(1/𝑛𝑡(𝑎))),
where the 𝑂(·) hides polylogarithmic factors depending on the total number of timesteps 𝑇.
A simple approach to incorporating historical data into this algorithm would be to initialize the mean
estimates 𝜇𝑡(𝑎) and sample counts 𝑛𝑡(𝑎) at 𝑡 = 1 from the historical dataset, as if it had been collected
by the bandit algorithm itself. This approach exploits the fact that the mean estimates and counts for
each action form a “sufficient statistic” for UCB. We refer to this algorithm as Full Start, which is practical
and reasonable so long as the dataset is representative of the underlying model. This algorithm has
indeed been proposed for the simple 𝐾-armed bandit setting (Shivaswamy and Joachims 2012).
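To make the warm start concrete, the following is a minimal sketch of UCB1 with a Full Start initialization. This is our own illustrative code, not the paper's implementation: the class name, the interface, and the exact √(2 log 𝑇 / 𝑛) bonus term are assumptions made for the example.

import math

class UCB1FullStart:
    """Sketch of UCB1 warm-started with Full Start (illustrative only)."""

    def __init__(self, num_arms, horizon, historical_data):
        self.num_arms, self.horizon = num_arms, horizon
        self.counts = [0] * num_arms    # n_t(a): times arm a has been observed
        self.means = [0.0] * num_arms   # mu_t(a): empirical mean reward of arm a
        # Full Start: fold every historical (arm, reward) pair into the
        # sufficient statistics before the first online round.
        for arm, reward in historical_data:
            self.observe(arm, reward)

    def observe(self, arm, reward):
        # Incremental update of the count and empirical mean.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

    def select_arm(self):
        # Pull each arm with no data once, then maximize the UCB index.
        for arm in range(self.num_arms):
            if self.counts[arm] == 0:
                return arm
        return max(range(self.num_arms),
                   key=lambda a: self.means[a]
                   + math.sqrt(2 * math.log(self.horizon) / self.counts[a]))

For instance, UCB1FullStart(num_arms=5, horizon=10000, historical_data=hist) processes every record in hist up front, regardless of how informative those records are; this is precisely the data inefficiency discussed next.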
1.1. Data Inefficiency of Full Start
While Full Start is intuitive and simple, this naive approach can lead to excessive data inefficiency
(the amount of historical data used by the algorithm), resulting in potentially unbounded computa-
tion and storage costs. We outline the challenges of Full Start here, then in Section 4 we theoretically
demonstrate that our approach can achieve arbitrarily better data efficiency than Full Start, leading
to significant reductions in computational and storage overhead.
In an extreme scenario, consider padding the historical data with samples from an arm 𝑎 whose
mean reward 𝜇(𝑎) is lowest. Full Start will use the entire historical dataset at timestep 𝑡 = 1, regardless
of its inherent quality. This is especially surprising, since the celebrated result of Lai and
Robbins (1985) shows that 𝑂(log(𝑇)/Δ(𝑎)²) samples are sufficient to discern that an action 𝑎 is
suboptimal (where Δ(𝑎) is the gap between 𝜇(𝑎) and the reward of the optimal action). As the
number of data points in the historical dataset from this suboptimal action tends towards infinity,
Full Start uses an unbounded amount of historical data, even though only a fraction of the data was
necessary to appropriately rule out this action as being optimal. This inefficient use of data increases
computational overhead and degrades the algorithm’s data efficiency. Clearly, Full Start is not
an optimal approach for incorporating historical data, as it fails to selectively process only the most
informative data points, leading to unnecessary computational and storage burdens.
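For a rough sense of scale, with illustrative numbers of our own (ignoring constants): with horizon 𝑇 = 10⁶ and gap Δ(𝑎) = 0.1, the bound above suggests that roughly log(10⁶)/0.1² ≈ 1,400 samples of arm 𝑎 suffice to rule it out as optimal. If the historical dataset contains 10⁸ samples of that arm, Full Start still processes all of them, roughly 70,000 times more data than necessary.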
Full Start thus suffers from issues which we call spurious data and imbalanced data coverage,
arising because the algorithm unnecessarily processes and stores data regardless of its quality.
These issues are particularly salient since many applications of bandit algorithms, such as online
recommender systems, may have historical data with billions of past records; processing all data
points would require exceptionally high upfront compute and storage costs. Similarly, in the context
of wildlife poaching, historical data takes the form of past patrol data, which is guided by the
knowledge of domain experts and geographic constraints. As we will later see in Section 6, the
historical data is extremely geographically biased, so processing the entire historical dataset for
regions with large amounts of historical data is costly and unnecessary. Together, these challenges
highlight a key aspect of incorporating historical data to warm-start any bandit algorithm: unless
the historical data is collected by an optimal approach such as following a bandit algorithm, the
value of information of the historical data may not be a direct function of its size.
Lastly, while we’ve so far discussed the stylized 𝐾-armed bandit setup as an example, many
models of real-world systems require continuous actions over potentially combinatorial spaces with
constraints. For these more complex bandit settings, it may not even be clear how to instantiate
Full Start, as the base algorithm may not clearly admit a low-dimensional “sufficient statistic”.
Furthermore, the computation and storage costs increase dramatically with these more complex
models, for example, when estimates of the underlying mean reward function are computed by solving a
high-dimensional regression problem. In such instances, the computational complexity of generating
the estimate scales superlinearly with the data efficiency (the number of data points used).
Thus motivated, this paper seeks to answer the following questions:
• Is there a computationally efficient way to use only a subset of the historical data while
maintaining the same regret guarantees as Full Start? Equivalently, is there a data-efficient
way to incorporate historical data?
• Can we do so across different bandit set-ups, such as continuous or combinatorial bandits?
• How can we quantify the quality or usefulness of a given set of historical data?
We will evaluate algorithms in terms of their performance (measured by regret) and data efficiency
(measured by the number of historical datapoints accessed). We emphasize that this latter measure
is correlated with the computation and storage costs of the algorithm.
1.2. Our Contributions
In our first contribution, we propose Artificial Replay, a meta-algorithm that modifies any base
bandit algorithm to optimally harness historical data — that is, using the minimal data required
to achieve the highest possible performance. Artificial Replay improves data efficiency by using
historical data as needed — specifically, only when recommended by the base bandit algorithm.
Concretely, Artificial Replay uses the historical data as a replay buffer to artificially simulate online
actions. When the base algorithm selects an action, we first check the historical data for any unused
samples from the chosen action. If an unused sample exists, we update the reward estimates using
that historical data point and continue without advancing to the next timestep. This allows the
algorithm to obtain better estimates of the reward of that action without incurring additional regret.
Otherwise, we take an online step by sampling from the environment (incurring regret if that action
is suboptimal) and advance to the next timestep. We will demonstrate how this meta-algorithm
can be applied to a wide set of practical models: standard 𝐾-armed bandits, metric bandits with
continuous actions, and models with semi-bandit feedback.
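The following sketch shows this replay loop for the 𝐾-armed case. It is our own illustrative code rather than the paper's implementation: the interfaces base_alg.select_arm(), base_alg.observe(arm, reward), and env.pull(arm), as well as the (arm, reward) format of the historical data, are assumptions made for the example.

from collections import defaultdict, deque

def artificial_replay(base_alg, historical_data, env, horizon):
    """Sketch of the Artificial Replay loop for K-armed bandits (illustrative)."""
    # Index the historical data as per-arm replay buffers.
    buffers = defaultdict(deque)
    for arm, reward in historical_data:
        buffers[arm].append(reward)

    online_actions, used_history, t = [], 0, 0
    while t < horizon:
        arm = base_alg.select_arm()
        if buffers[arm]:
            # An unused historical sample exists for the chosen action:
            # feed it to the base algorithm without advancing the timestep
            # (and without incurring any regret).
            base_alg.observe(arm, buffers[arm].popleft())
            used_history += 1
        else:
            # No unused history for this action: take a real online step.
            base_alg.observe(arm, env.pull(arm))
            online_actions.append(arm)
            t += 1
    return online_actions, used_history

Paired with the UCB1 sketch above, the base algorithm would be constructed without a warm start, e.g. base_alg = UCB1FullStart(num_arms, horizon, historical_data=[]), so that historical samples enter only through the replay buffer, and only when the base algorithm selects the corresponding action.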
Given that Artificial Replay only uses a subset of the data, one might wonder if it incurs higher
regret than Full Start (a generalization of the algorithm described above for 𝐾-armed bandits).
Surprisingly, in our second contribution we prove that under a widely applicable condition, the
regret of Artificial Replay (as a random variable) is identical to that of Full Start, while also
guaranteeing significantly better data efficiency. Specifically, we show a sample-path coupling
between Artificial Replay and Full Start with the same base algorithm, as long as the base algorithm
satisfies a novel (and widely applicable) independence of irrelevant data (IIData) assumption.
We complement this result by also highlighting the regret improvement of Artificial Replay as
a function of the historical data it uses. Additionally, we prove that Artificial Replay can lead to
arbitrarily better computational complexity. These results highlight that Artificial Replay is a simple
approach for incorporating historical data with identical regret to Full Start and significantly better
data efficiency. To summarize, Artificial Replay Pareto-dominates Full Start in terms of the two
main metrics of interest: (𝑖) regret and (𝑖𝑖) data efficiency.
Finally, we show the benefits of Artificial Replay by instantiating it for several classes of bandits
and evaluating on real-world poaching data. To highlight the breadth of algorithms that satisfy the
IIData property, we show how standard UCB algorithms can be easily modified to be IIData while
still guaranteeing regret-optimality, such as for 𝐾-armed and continuous combinatorial bandits.
While the results for 𝐾-armed bandits are fairly straightforward, the fact that these results extend
to complex settings such as metric bandits with continuous actions or combinatorial bandits is not
at all obvious. The surprising insight in our result is that a similarly simple approach for complex
bandit settings can attain the same guarantees as full usage of the historical data with significant
data efficiency gains. Even for algorithms that do not satisfy IIData, such as Thompson sampling
and Information Directed Sampling (IDS), we demonstrate on real poaching data that Artificial
Replay still achieves concrete gains in regret and data efficiency over a range of base algorithms. We
close with a case study of combinatorial bandit algorithms with continuous actions in the context
of green security, using a novel adaptive discretization technique.
1.3. Related Work
Multi-armed bandit problems and their variants (including the finite-armed and CMAB-CRA
models to be discussed here) have a long history in the online learning and optimization literature.
We highlight the most closely related works below, but for more extensive references see Bubeck
et al. (2012), Slivkins (2019), and Lattimore and Szepesvári (2020).
Multi-Armed Bandits. The design and analysis of bandit algorithms have been considered under
a wide range of models. Lai and Robbins (1985), Auer et al. (2002) first studied the 𝐾-armed bandit,
where the decision maker chooses between a finite set of 𝐾 actions at each timestep. Numerous
follow-up works have extended these approaches to handle continuous action spaces (Kleinberg
et al. 2019) and combinatorial constraints (Chen et al. 2013, Xu et al. 2021). Zuo and Joe-Wong
(2021) propose a discrete model of the combinatorial multi-armed bandit that generalizes previous
work (Lattimore et al. 2014, 2015, Dagan and Koby 2018) to schedule a finite set of resources
to maximize the expected number of jobs finished. We extend their model to continuous actions,
tailored to the green security setting from Xu et al. (2021). Our work provides a framework to
modify existing algorithms to efficiently incorporate historical data. Moreover, we also propose and
evaluate a novel algorithm to incorporate adaptive discretization for combinatorial multi-armed
bandits for continuous resource allocation, extending the discrete model from Zuo and Joe-Wong
(2021).