Reward Imputation with Sketching for
Contextual Batched Bandits
Xiao Zhang1,2, Ninglu Shao1,2,∗, Zihua Si1,2,∗, Jun Xu1,2,†,
Wenhan Wang3, Hanjing Su3, Ji-Rong Wen1,2
1Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
2Beijing Key Laboratory of Big Data Management and Analysis Methods, Beijing, China
3Tencent Inc., Shenzhen, China
{zhangx89, ninglu_shao, zihua_si, junxu, jrwen}@ruc.edu.cn
{justinsu, ezewang}@tencent.com
Abstract
Contextual batched bandit (CBB) is a setting where a batch of rewards is observed from the environment at the end of each episode, but the rewards of the non-executed actions are unobserved, resulting in partial-information feedback. Existing approaches for CBB often ignore the rewards of the non-executed actions, leading to underutilization of feedback information. In this paper, we propose an efficient approach called Sketched Policy Updating with Imputed Rewards (SPUIR) that completes the unobserved rewards using sketching, which approximates the full-information feedback. We formulate reward imputation as an imputation-regularized ridge regression problem that captures the feedback mechanisms of both executed and non-executed actions. To reduce time complexity, we solve the regression problem using randomized sketching. We prove that our approach achieves an instantaneous regret with controllable bias and smaller variance than approaches without reward imputation. Furthermore, our approach enjoys a sublinear regret bound against the optimal policy. We also present two extensions, a rate-scheduled version and a version for nonlinear rewards, making our approach more practical. Experimental results show that SPUIR outperforms state-of-the-art baselines on synthetic, public benchmark, and real-world datasets.
1 Introduction
Contextual bandits have gained significant popularity in solving sequential decision-making problems
(Li et al., 2010; Lan and Baraniuk, 2016; Yom-Tov et al., 2017; Yang et al., 2021), where the agent
continuously updates its decision-making policy fully online (i.e., at each step), considering the
context and the received reward feedback to maximize cumulative rewards. In this paper, we address
a more general setting called contextual batched bandits (CBB). In CBB, the decision process is divided into $N$ episodes, and within each episode, the agent interacts with the environment for a fixed number of $B$ steps. At the end of each episode, the agent collects reward feedback and contexts. Subsequently, the policy is updated using the collected data to guide the decision-making process in the subsequent episode. CBB offers a practical framework for real-world streaming applications (e.g., streaming recommendation (Zhang et al., 2021, 2022)). In the CBB setting, the batch size $B$ can be adjusted by the agent to achieve improved regret guarantees and meet the data throughput requirements based on the available computing resources (Zhou, 2023).
∗Ninglu Shao and Zihua Si have made equal contributions to this paper.
†Corresponding author: Jun Xu.
37th Conference on Neural Information Processing Systems (NeurIPS 2023).
arXiv:2210.06719v3 [cs.LG] 7 Oct 2023
In bandit settings, it is common for the environment to only provide feedback on the rewards of
executed actions to the agent, while concealing the rewards of non-executed actions. This type of
limited feedback is referred to as partial-information feedback (also called “bandit feedback”). In
the CBB setting, existing approaches tend to overlook the potential rewards associated with non-executed
actions. Instead, they address the challenge of partial-information feedback through an exploration-
exploitation tradeoff in both the context space and reward space (Han et al., 2020; Zhang et al.,
2020). However, CBB agents typically estimate and maintain reward models for the action-selection
policy, thereby capturing some information about the potential rewards of non-executed actions. This
additional reward structure information is available for policy updating in each episode but remains
untapped by existing batched bandit approaches.
In contextual bandit settings where the policy is updated online, several bias-correction approaches have been introduced to tackle the issue of partial-information feedback. Dimakopoulou et al. (2019) presented linear contextual bandits integrating the balancing approach from causal inference, which reweights the contexts and rewards by the inverse propensity scores. Chou et al. (2015) designed pseudo-reward algorithms for contextual bandits using the upper confidence bound (UCB) strategy, which use a direct method to estimate the unobserved rewards. Kim and Paik (2019) focused on the correction of feedback bias for the LASSO bandit with high-dimensional contexts, and applied the doubly-robust approach to the reward modification using average contexts.
approaches have demonstrated effectiveness in contextual bandit settings, little attention has been
given to addressing the underutilization of partial-information feedback in the CBB setting.
Theoretical and experimental analyses in Section 2 indicate that better performance of CBB is
achievable if the rewards of the non-executed actions can be received. Motivated by these observations,
we propose a novel reward imputation approach for the non-executed actions, which mimics the
reward generation mechanism of the environment. We summarize our contributions as follows.
(1) To fully utilize feedback information in CBB, we formulate reward imputation as an imputation-regularized ridge regression problem, where the policy can be updated efficiently using sketching.
(2) We prove that our reward imputation approach obtains a relative-error bound for the sketching approximation, achieves an instantaneous regret with a controllable bias and a smaller variance than that without reward imputation, admits a lower bound on the sketch size that is independent of the overall number of steps, enjoys a sublinear regret bound against the optimal policy, and reduces the time complexity from $O(Bd^2)$ to $O(cd^2)$ for each action in one episode, where $B$ denotes the batch size, $c$ the sketch size, and $d$ the dimension of the context space, satisfying $d < c < B$.
(3) We present two practical variants of our reward imputation approach, including the rate-scheduled
version that sets the imputation rate without tuning, and the version for nonlinear rewards.
(4) We carried out extensive experiments on a synthetic dataset, the publicly available Criteo dataset,
and a dataset from a commercial app to demonstrate our performance, empirically analyzed the
influence of different parameters, and verified the correctness of the theoretical results.
Related Work. Recently, batched bandits have become an active research topic in statistics and learning theory, including 2-armed bandits (Perchet et al., 2016), multi-armed bandits (Gao et al., 2019; Zhang et al., 2020; Wang and Cheng, 2020), and contextual bandits (Han et al., 2020; Ren and Zhou, 2020; Gu et al., 2021). Han et al. (2020) defined linear contextual bandits and designed UCB-type algorithms for both stochastic and adversarial contexts, where the true rewards of different actions share the same parameter vector. Zhang et al. (2020) provided methods for inference on data collected in batches using bandits, and introduced a batched least squares estimator for both multi-armed and contextual
bandits. Recently, Esfandiari et al. (2021) proved refined regret upper bounds of batched bandits in
stochastic and adversarial settings. There are several recent works that consider similar settings to
CBB, e.g., episodic Markov decision process (Jin et al., 2018), LASSO bandits (Wang and Cheng,
2020). Sketching is another related technique that compresses a large matrix to a much smaller one
by multiplying a (usually) random matrix while retaining certain properties (Woodruff, 2014), which
has been used in online convex optimization (Calandriello et al., 2017; Zhang and Liao, 2019).
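To make the complexity benefit of sketching concrete, the following minimal sketch (our illustration, not code from the cited works or from SPUIR) solves a ridge regression after compressing the $B \times d$ design matrix with a Gaussian random sketch; the matrix sizes, sketch type, and function names are illustrative assumptions.

```python
import numpy as np

def sketched_ridge(X, y, lam, c, rng):
    """Approximate argmin_theta ||X @ theta - y||^2 + lam * ||theta||^2
    by sketching the B x d design matrix X down to c x d (d < c < B)."""
    B, d = X.shape
    S = rng.normal(0.0, 1.0 / np.sqrt(c), size=(c, B))  # Gaussian sketch matrix
    Xs, ys = S @ X, S @ y                                # sketched data
    # Solving the sketched normal equations costs O(c d^2) instead of O(B d^2).
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(d), Xs.T @ ys)

# Toy comparison against the exact ridge solution.
rng = np.random.default_rng(0)
B, d, c, lam = 4096, 32, 256, 1.0
X = rng.normal(size=(B, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=B)
theta_exact = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
theta_sketch = sketched_ridge(X, y, lam, c, rng)
print(np.linalg.norm(theta_sketch - theta_exact) / np.linalg.norm(theta_exact))
```

Other sketch matrices (e.g., CountSketch or subsampled randomized Hadamard transforms) can replace the Gaussian sketch in this illustration.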
2 Problem Formulation and Analysis
Let $[x] = \{1, 2, \ldots, x\}$, $\mathcal{S} \subseteq \mathbb{R}^d$ be the context space whose dimension is $d$, $\mathcal{A} = \{A_j\}_{j \in [M]}$ the action space containing $M$ actions, $[\mathbf{A}; \mathbf{B}] = [\mathbf{A}^{\top}, \mathbf{B}^{\top}]^{\top}$, $\|\mathbf{A}\|_{\mathrm{F}}$, $\|\mathbf{A}\|_{1}$, and $\|\mathbf{A}\|_{2}$ denote the Frobenius norm, $1$-norm, and spectral norm of a matrix $\mathbf{A}$, respectively, $\|\mathbf{a}\|_{1}$ and $\|\mathbf{a}\|_{2}$ be the $1$-norm and the $2$-norm of a vector $\mathbf{a}$, and $\sigma_{\min}(\mathbf{A})$ and $\sigma_{\max}(\mathbf{A})$ denote the minimum and maximum of the singular values of $\mathbf{A}$.
Protocol 1 Contextual Batched Bandit (CBB)
INPUT: Batch size $B$, number of episodes $N$, action space $\mathcal{A} = \{A_j\}_{j \in [M]}$, context space $\mathcal{S} \subseteq \mathbb{R}^d$
1: Initialize policy $p_0 \leftarrow 1/M$, sample data buffer $\mathcal{D}_1 = \{(\mathbf{s}_{0,b}, A_{I_{0,b}}, R_{0,b})\}_{b \in [B]}$ using the initial policy $p_0$
2: for $n = 1$ to $N$ do
3:   Update the policy $p_n$ on $\mathcal{D}_n$
4:   for $b = 1$ to $B$ do
5:     Observe context $\mathbf{s}_{n,b}$ and choose $A_{I_{n,b}} \in \mathcal{A}$ following the updated policy $p_n(\mathbf{s}_{n,b})$
6:   end for
7:   $\mathcal{D}_{n+1} \leftarrow \{(\mathbf{s}_{n,b}, A_{I_{n,b}}, R_{n,b})\}_{b \in [B]}$, where $R_{n,b}$ denotes the reward of action $A_{I_{n,b}}$ on context $\mathbf{s}_{n,b}$
8: end for
In this paper, we focus on the setting of Contextual Batched Bandits (CBB) in Protocol 1, where the decision process is partitioned into $N$ episodes, and in each episode, CBB consists of two phases: (1) the policy updating approximates the optimal policy based on the received contexts and rewards; (2) the online decision chooses actions for execution following the updated and fixed policy $p$ for $B$ steps ($B$ is also called the batch size), and stores the context-action pairs and the observed rewards of the executed actions into a data buffer $\mathcal{D}$. The reward $R$ in CBB is a partial-information feedback where rewards are unobserved for the non-executed actions.
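For readers who prefer code to pseudocode, here is a minimal simulation of the interaction loop in Protocol 1; the uniform placeholder policy, the synthetic context sampler, and the reward function are hypothetical stand-ins, and a real agent would fit its per-action reward models inside `update`.

```python
import numpy as np

rng = np.random.default_rng(0)
N, B, M, d = 20, 64, 5, 8          # episodes, batch size, number of actions, context dim

def sample_context():
    return rng.normal(size=d)

def env_reward(s, a):
    # Placeholder environment; only the executed action's reward is ever revealed.
    return float(rng.normal())

class UniformPolicy:
    """Placeholder for p_n; initialized as p_0 = 1/M (uniform over actions)."""
    def update(self, batch):
        pass                        # a real agent would refit per-action reward models on D_n
    def act(self, s):
        return int(rng.integers(M))

def collect_batch(policy):
    """Online-decision phase: B steps following the (fixed) updated policy."""
    batch = []
    for _ in range(B):
        s = sample_context()
        a = policy.act(s)           # rewards of the other M - 1 actions stay unobserved
        batch.append((s, a, env_reward(s, a)))
    return batch

policy = UniformPolicy()
buffer = collect_batch(policy)      # D_1, collected with the initial policy p_0
for n in range(1, N + 1):
    policy.update(buffer)           # phase 1: policy updating on D_n
    buffer = collect_batch(policy)  # phase 2: online decisions producing D_{n+1}
```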
In contrast to the existing batched bandit setting (Han et al., 2020; Esfandiari et al., 2021), where the true reward feedback for all actions is controlled by the same parameter vector while the received contexts differ across actions at each step, we assume that in the CBB setting the mechanism of true reward feedback varies across actions, while the received context is shared among actions. Formally, for any context $\mathbf{s}_i \in \mathcal{S} \subseteq \mathbb{R}^d$ and action $A \in \mathcal{A}$, we assume that the expectation of the true reward $R^{\mathrm{true}}_{i,A}$ is determined by an unknown action-specific reward parameter vector $\boldsymbol{\theta}^{*}_{A} \in \mathbb{R}^d$: $\mathbb{E}[R^{\mathrm{true}}_{i,A} \mid \mathbf{s}_i] = \langle \boldsymbol{\theta}^{*}_{A}, \mathbf{s}_i \rangle$ (the linear reward will be extended to the nonlinear case in Section 5). This setting for reward feedback matches many real-world applications, e.g., each action corresponds to a different category of candidate coupons in coupon recommendation, and the reward feedback mechanism of each category differs due to the different discount pricing strategies.
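A minimal sketch of this feedback assumption: each action keeps its own unknown parameter vector, the context is shared across actions, and only the executed action's (noisy) reward is revealed to the agent. The Gaussian noise level and the dimensions below are our own illustrative assumptions, not the paper's exact environment.

```python
import numpy as np

rng = np.random.default_rng(0)
M, d = 5, 8
theta_star = rng.normal(size=(M, d))        # one unknown reward parameter vector per action

def full_information_rewards(s, noise=0.1):
    """Noisy rewards of all M actions for a shared context s: E[R_j | s] = <theta_star_j, s>."""
    return theta_star @ s + noise * rng.normal(size=M)

s = rng.normal(size=d)
rewards = full_information_rewards(s)
executed = 2                                # suppose the policy executes this action
observed = {executed: rewards[executed]}    # CBB feedback: only this entry is observed
```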
Next, we delve deeper into understanding the impact of unobserved feedback on the performance of policy updating in the CBB setting. We first conducted an empirical comparison by applying the batch UCB policy (SBUCB) (Han et al., 2020) to environments under different proportions of received reward feedback. In particular, the agent under full-information feedback can receive all the rewards of the executed and non-executed actions, which we call the Full-Information CBB (FI-CBB) setting. From Figure 1, we can observe that partial-information feedback hurts policy updating, and that the batched bandit policy can benefit from more reward feedback, where the performance of 80% feedback is very close to that of FI-CBB.
Figure 1: Average rewards of the batch UCB policy (Han et al., 2020) under different proportions of received reward feedback, interacting with the synthetic environment in Section 6, where x% feedback means that x% of actions can receive their true rewards.
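The x% feedback conditions in Figure 1 can be read as a per-step mask over the action set; the sketch below is our interpretation of that experimental knob, not the exact masking code used in the paper.

```python
import numpy as np

def feedback_mask(num_actions, reveal_frac, rng):
    """Mark which actions reveal their true rewards at one step.
    reveal_frac=1.0 corresponds to FI-CBB (full information); small fractions approach CBB."""
    k = int(round(reveal_frac * num_actions))
    mask = np.zeros(num_actions, dtype=bool)
    mask[rng.choice(num_actions, size=k, replace=False)] = True
    return mask

rng = np.random.default_rng(0)
print(feedback_mask(num_actions=10, reveal_frac=0.8, rng=rng))  # ~80% of rewards observed
```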
Then, we prove the difference of instantaneous regrets between the CBB and FI-CBB settings in Theorem 1 (the proof can be found in Appendix A).

Theorem 1. For any action $A \in \mathcal{A}$ and context $\mathbf{s}_i \in \mathcal{S}$, let $\boldsymbol{\theta}^{n}_{A}$ be the reward parameter vector estimated by the batched UCB policy in the $n$-th episode. The upper bound of the instantaneous regret (defined by $|\langle \boldsymbol{\theta}^{n}_{A}, \mathbf{s}_i \rangle - \langle \boldsymbol{\theta}^{*}_{A}, \mathbf{s}_i \rangle|$) in the FI-CBB setting is tighter than that in the CBB setting (i.e., using the partial-information feedback).
From Theorem 1, we can infer that utilizing partial-information feedback leads to a deterioration in the regret of the bandit policy. Ideally, the policy would be updated using full-information feedback. However, in CBB, full-information feedback is unavailable. Fortunately, in CBB, different reward parameter vectors are maintained and estimated separately for each action, and the potential reward structures of the non-executed actions have been captured to some extent. Therefore, why not utilize these captured reward structures to impute the rewards of the non-executed actions?