
Protocol 1 Contextual Batched Bandit (CBB)
INPUT: Batch size $B$, number of episodes $N$, action space $\mathcal{A}=\{A_j\}_{j\in[M]}$, context space $\mathcal{S} \subseteq \mathbb{R}^d$
1: Initialize policy $p_0 \leftarrow 1/M$; sample data buffer $\mathcal{D}_1=\{(s_{0,b}, A_{I_{0,b}}, R_{0,b})\}_{b\in[B]}$ using the initial policy $p_0$
2: for $n = 1$ to $N$ do
3:   Update the policy $p_n$ on $\mathcal{D}_n$
4:   for $b = 1$ to $B$ do
5:     Observe context $s_{n,b}$ and choose $A_{I_{n,b}} \in \mathcal{A}$ following the updated policy $p_n(s_{n,b})$
6:   end for
7:   $\mathcal{D}_{n+1} \leftarrow \{(s_{n,b}, A_{I_{n,b}}, R_{n,b})\}_{b\in[B]}$, where $R_{n,b}$ denotes the reward of action $A_{I_{n,b}}$ on context $s_{n,b}$
8: end for
$\ell_2$-norm of a vector $a$; $\sigma_{\min}(A)$ and $\sigma_{\max}(A)$ denote the minimum and maximum singular values of $A$. In this paper, we focus on the setting of Contextual Batched Bandits (CBB) in Protocol 1, where the decision process is partitioned into $N$ episodes, and each episode consists of two phases: (1) policy updating, which approximates the optimal policy based on the received contexts and rewards; (2) online decision, which chooses actions for execution following the updated and fixed policy $p$ for $B$ steps ($B$ is also called the batch size), and stores the context-action pairs and the observed rewards of the executed actions into a data buffer $\mathcal{D}$. The reward $R$ in CBB is a partial-information feedback: rewards are unobserved for the non-executed actions.
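For concreteness, the interaction loop of Protocol 1 can be sketched as follows. The environment and policy objects, and the method names sample_context, reward, update, and select, are illustrative placeholders for whatever batched policy and environment are used; this is a minimal sketch of the protocol, not the paper's implementation.

    import numpy as np

    def run_cbb(env, policy, N, B, M, seed=0):
        # Minimal sketch of Protocol 1: alternate policy updating and batched online decisions.
        rng = np.random.default_rng(seed)

        # Episode 0: collect an initial buffer D_1 with the uniform policy p_0 = 1/M.
        buffer = []
        for _ in range(B):
            s = env.sample_context()          # context s_{0,b}
            a = int(rng.integers(M))          # uniform initial policy p_0
            r = env.reward(s, a)              # reward observed only for the executed action
            buffer.append((s, a, r))

        for n in range(1, N + 1):
            policy.update(buffer)             # phase (1): policy updating on D_n
            new_buffer = []
            for _ in range(B):                # phase (2): online decisions for B steps
                s = env.sample_context()
                a = policy.select(s)          # action chosen by the fixed, updated policy p_n
                r = env.reward(s, a)          # partial-information feedback R_{n,b}
                new_buffer.append((s, a, r))
            buffer = new_buffer               # data buffer D_{n+1}
        return policy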
In contrast to the existing batched bandit setting (Han et al., 2020; Esfandiari et al., 2021), where the true reward feedbacks for all actions are controlled by the same parameter vector while the received contexts differ across actions at each step, we assume that in the CBB setting the mechanism of true reward feedback varies across actions, while the received context is shared among actions. Formally, for any context $s_i \in \mathcal{S} \subseteq \mathbb{R}^d$ and action $A \in \mathcal{A}$, we assume that the expectation of the true reward $R^{\mathrm{true}}_{i,A}$ is determined by an unknown action-specific reward parameter vector $\theta^*_A \in \mathbb{R}^d$: $\mathbb{E}[R^{\mathrm{true}}_{i,A} \mid s_i] = \langle \theta^*_A, s_i \rangle$ (the linear reward will be extended to the nonlinear case in Section 5). This setting for reward feedback matches many real-world applications, e.g., each action corresponds to a different category of candidate coupons in coupon recommendation, and the reward feedback mechanism of each category differs due to the different discount pricing strategies.
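The assumed reward model can be made concrete with a small synthetic environment in which each action keeps its own parameter vector $\theta^*_A$ while all actions share the observed context. The class below is a sketch under assumed distributions (unit-norm Gaussian contexts and parameters, additive Gaussian noise); it is not the synthetic environment of Section 6.

    import numpy as np

    class LinearCBBEnv:
        # Sketch of a CBB environment with an action-specific linear reward model:
        # E[R^true_{i,A} | s_i] = <theta*_A, s_i>, with the context s_i shared by all actions.

        def __init__(self, d, M, noise_std=0.1, seed=0):
            self.rng = np.random.default_rng(seed)
            self.d, self.M = d, M
            # One unknown reward parameter vector per action.
            theta = self.rng.normal(size=(M, d))
            self.theta_star = theta / np.linalg.norm(theta, axis=1, keepdims=True)
            self.noise_std = noise_std

        def sample_context(self):
            # A single context shared by every action at this step.
            s = self.rng.normal(size=self.d)
            return s / np.linalg.norm(s)

        def reward(self, s, a):
            # Noisy realization of the action-specific linear mean reward.
            return float(self.theta_star[a] @ s + self.noise_std * self.rng.normal())

An object of this form plugs directly into the loop sketched after Protocol 1.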
Next, we delve deeper into the impact of unobserved feedbacks on the performance of policy updating in the CBB setting. We first conducted an empirical comparison by applying the batch UCB policy (SBUCB) (Han et al., 2020) to environments with different proportions of received reward feedbacks. In particular, an agent under full-information feedback receives the rewards of all actions, executed and non-executed, which we call the Full-Information CBB (FI-CBB) setting. From Figure 1, we observe that partial-information feedbacks hurt policy updating, and that the batched bandit policy benefits from more reward feedbacks, with the performance of 80% feedback being very close to that of FI-CBB.
[Figure 1: Average rewards of the batch UCB policy (Han et al., 2020) under different proportions of received reward feedbacks (partial-information CBB, 40%, 60%, and 80% feedback, and full-information FI-CBB), interacting with the synthetic environment in Section 6. The x-axis is the number of episodes $N$ and the y-axis is the average reward; $x\%$ feedback means that $x\%$ of actions can receive their true rewards.]
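The feedback conditions compared in Figure 1 can be emulated with a simple masking helper: with fraction 0 only the executed action's reward is revealed (CBB), with fraction 1 every action's reward is revealed (FI-CBB), and intermediate fractions reveal a random subset of the non-executed actions. Reading "$x\%$ feedback" as the executed action plus a random $x\%$ of the remaining actions is one plausible interpretation for illustration, not necessarily the exact experimental protocol.

    import numpy as np

    def observed_feedback(env, s, executed_a, feedback_frac, rng):
        # Return {action: reward} visible to the learner for one step.
        # feedback_frac = 0.0 -> plain CBB (only the executed action's reward);
        # feedback_frac = 1.0 -> FI-CBB (rewards of all M actions).
        feedback = {executed_a: env.reward(s, executed_a)}
        others = [a for a in range(env.M) if a != executed_a]
        k = int(round(feedback_frac * len(others)))
        chosen = rng.choice(others, size=k, replace=False) if k > 0 else []
        for a in chosen:
            feedback[int(a)] = env.reward(s, a)   # extra feedback beyond partial information
        return feedback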
Then, we prove the difference in instantaneous regrets between the CBB and FI-CBB settings in Theorem 1 (the proof can be found in Appendix A).

Theorem 1. For any action $A \in \mathcal{A}$ and context $s_i \in \mathcal{S}$, let $\theta^n_A$ be the reward parameter vector estimated by the batched UCB policy in the $n$-th episode. The upper bound of the instantaneous regret (defined by $|\langle \theta^n_A, s_i \rangle - \langle \theta^*_A, s_i \rangle|$) in the FI-CBB setting is tighter than that in the CBB setting (i.e., using the partial-information feedback).
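To see intuitively why the FI-CBB bound is tighter, recall the generic shape of the elliptical confidence bound for a per-action ridge estimator that batched UCB-style policies rely on. In the display below, $\Phi_A$ stacks the contexts for which the reward of action $A$ was observed, $R_A$ the corresponding rewards, $\lambda$ the ridge parameter, and $\beta_A$ a confidence radius; this is only the standard form of such bounds under our assumptions, while the exact constants of Theorem 1 are given in Appendix A.
\[
  \theta^n_A = \bigl(\Phi_A^{\top}\Phi_A + \lambda I\bigr)^{-1}\Phi_A^{\top}R_A,
  \qquad
  V_A = \Phi_A^{\top}\Phi_A + \lambda I,
\]
\[
  \bigl|\langle \theta^n_A, s_i\rangle - \langle \theta^{*}_A, s_i\rangle\bigr|
  \;\le\; \beta_A\,\|s_i\|_{V_A^{-1}}
  \quad\text{with high probability.}
\]
Under FI-CBB every context contributes a row to $\Phi_A$ for every action, so $V_A$ accumulates faster and the width $\|s_i\|_{V_A^{-1}}$ shrinks faster than in CBB, where only the contexts on which $A$ was actually executed enter $\Phi_A$.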
From Theorem 1, we can infer that utilizing partial-
information feedbacks leads to a deterioration in
the regret of the bandit policy. Ideally, the policy
would be updated using full-information feedback.
However, in CBB, full-information feedback is un-
available. Fortunately, in CBB, different reward
parameter vectors are maintained and estimated
separately for each action, and the potential reward
structures of the non-executed actions have been captured to some extent. Therefore, why not utilize