
Protocol 1 Contextual Batched Bandit (CBB)
INPUT: Batch size $B$, number of episodes $N$, action space $\mathcal{A}=\{A_j\}_{j\in[M]}$, context space $\mathcal{S} \subseteq \mathbb{R}^d$
1: Initialize policy $p_0 \leftarrow 1/M$; sample data buffer $\mathcal{D}_1=\{(s_{0,b}, A_{I_{0,b}}, R_{0,b})\}_{b\in[B]}$ using the initial policy $p_0$
2: for $n = 1$ to $N$ do
3:   Update the policy $p_n$ on $\mathcal{D}_n$
4:   for $b = 1$ to $B$ do
5:     Observe context $s_{n,b}$ and choose $A_{I_{n,b}} \in \mathcal{A}$ following the updated policy $p_n(s_{n,b})$
6:   end for
7:   $\mathcal{D}_{n+1} \leftarrow \{(s_{n,b}, A_{I_{n,b}}, R_{n,b})\}_{b\in[B]}$, where $R_{n,b}$ denotes the reward of action $A_{I_{n,b}}$ on context $s_{n,b}$
8: end for
$\ell_2$-norm of a vector $a$; $\sigma_{\min}(A)$ and $\sigma_{\max}(A)$ denote the minimum and maximum singular values of $A$. In this paper, we focus on the setting of Contextual Batched Bandits (CBB) in Protocol 1, where the decision process is partitioned into $N$ episodes, and each episode consists of two phases: (1) policy updating, which approximates the optimal policy based on the received contexts and rewards; (2) online decision, which chooses actions for execution following the updated and fixed policy $p$ for $B$ steps ($B$ is also called the batch size), and stores the context-action pairs and the observed rewards of the executed actions into a data buffer $\mathcal{D}$. The reward $R$ in CBB is a partial-information feedback: rewards are unobserved for the non-executed actions.
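For concreteness, the interaction loop of Protocol 1 can be sketched as follows. The environment and policy objects, and the method names sample_context, reward, update, and select, are illustrative placeholders for whatever batched policy and environment are used; this is a minimal sketch of the protocol, not the paper's implementation.

    import numpy as np

    def run_cbb(env, policy, N, B, M, seed=0):
        # Minimal sketch of Protocol 1: alternate policy updating and batched online decisions.
        rng = np.random.default_rng(seed)

        # Episode 0: collect an initial buffer D_1 with the uniform policy p_0 = 1/M.
        buffer = []
        for _ in range(B):
            s = env.sample_context()          # context s_{0,b}
            a = int(rng.integers(M))          # uniform initial policy p_0
            r = env.reward(s, a)              # reward observed only for the executed action
            buffer.append((s, a, r))

        for n in range(1, N + 1):
            policy.update(buffer)             # phase (1): policy updating on D_n
            new_buffer = []
            for _ in range(B):                # phase (2): online decisions for B steps
                s = env.sample_context()
                a = policy.select(s)          # action chosen by the fixed, updated policy p_n
                r = env.reward(s, a)          # partial-information feedback R_{n,b}
                new_buffer.append((s, a, r))
            buffer = new_buffer               # data buffer D_{n+1}
        return policy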
In contrast to the existing batched bandit setting (Han et al., 2020; Esfandiari et al., 2021), where the true reward feedbacks for all actions are controlled by the same parameter vector while the received contexts differ across actions at each step, we assume that in the CBB setting the mechanism of true reward feedback varies across actions, while the received context is shared among actions. Formally, for any context $s_i \in \mathcal{S} \subseteq \mathbb{R}^d$ and action $A \in \mathcal{A}$, we assume that the expectation of the true reward $R^{\mathrm{true}}_{i,A}$ is determined by an unknown action-specific reward parameter vector $\theta^*_A \in \mathbb{R}^d$: $\mathbb{E}[R^{\mathrm{true}}_{i,A} \mid s_i] = \langle \theta^*_A, s_i \rangle$ (the linear reward will be extended to the nonlinear case in Section 5). This setting for reward feedback matches many real-world applications, e.g., each action corresponds to a different category of candidate coupons in coupon recommendation, and the reward feedback mechanism of each category differs due to the different discount pricing strategies.
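The assumed reward model can be made concrete with a small synthetic environment in which each action keeps its own parameter vector $\theta^*_A$ while all actions share the observed context. The class below is a sketch under assumed distributions (unit-norm Gaussian contexts and parameters, additive Gaussian noise); it is not the synthetic environment of Section 6.

    import numpy as np

    class LinearCBBEnv:
        # Sketch of a CBB environment with an action-specific linear reward model:
        # E[R^true_{i,A} | s_i] = <theta*_A, s_i>, with the context s_i shared by all actions.

        def __init__(self, d, M, noise_std=0.1, seed=0):
            self.rng = np.random.default_rng(seed)
            self.d, self.M = d, M
            # One unknown reward parameter vector per action.
            theta = self.rng.normal(size=(M, d))
            self.theta_star = theta / np.linalg.norm(theta, axis=1, keepdims=True)
            self.noise_std = noise_std

        def sample_context(self):
            # A single context shared by every action at this step.
            s = self.rng.normal(size=self.d)
            return s / np.linalg.norm(s)

        def reward(self, s, a):
            # Noisy realization of the action-specific linear mean reward.
            return float(self.theta_star[a] @ s + self.noise_std * self.rng.normal())

An object of this form plugs directly into the loop sketched after Protocol 1.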
Next, we delve deeper into the impact of unobserved feedbacks on the performance of policy updating in the CBB setting. We first conducted an empirical comparison by applying the batch UCB policy (SBUCB) (Han et al., 2020) to environments with different proportions of received reward feedbacks. In particular, an agent under full-information feedback receives the rewards of all actions, executed and non-executed, which we call the Full-Information CBB (FI-CBB) setting. From Figure 1, we observe that partial-information feedbacks hurt policy updating, and that the batched bandit policy benefits from more reward feedbacks, with the performance of 80% feedback being very close to that of FI-CBB.
[Figure 1: Average rewards of the batch UCB policy (Han et al., 2020) under different proportions of received reward feedbacks (partial-information CBB, 40%, 60%, and 80% feedback, and full-information FI-CBB), interacting with the synthetic environment in Section 6. The x-axis is the number of episodes $N$ and the y-axis is the average reward; $x\%$ feedback means that $x\%$ of actions can receive their true rewards.]
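The feedback conditions compared in Figure 1 can be emulated with a simple masking helper: with fraction 0 only the executed action's reward is revealed (CBB), with fraction 1 every action's reward is revealed (FI-CBB), and intermediate fractions reveal a random subset of the non-executed actions. Reading "$x\%$ feedback" as the executed action plus a random $x\%$ of the remaining actions is one plausible interpretation for illustration, not necessarily the exact experimental protocol.

    import numpy as np

    def observed_feedback(env, s, executed_a, feedback_frac, rng):
        # Return {action: reward} visible to the learner for one step.
        # feedback_frac = 0.0 -> plain CBB (only the executed action's reward);
        # feedback_frac = 1.0 -> FI-CBB (rewards of all M actions).
        feedback = {executed_a: env.reward(s, executed_a)}
        others = [a for a in range(env.M) if a != executed_a]
        k = int(round(feedback_frac * len(others)))
        chosen = rng.choice(others, size=k, replace=False) if k > 0 else []
        for a in chosen:
            feedback[int(a)] = env.reward(s, a)   # extra feedback beyond partial information
        return feedback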
Then, we prove the difference in instantaneous regrets between the CBB and FI-CBB settings in Theorem 1 (the proof can be found in Appendix A).

Theorem 1. For any action $A \in \mathcal{A}$ and context $s_i \in \mathcal{S}$, let $\theta^n_A$ be the reward parameter vector estimated by the batched UCB policy in the $n$-th episode. The upper bound of the instantaneous regret (defined by $|\langle \theta^n_A, s_i \rangle - \langle \theta^*_A, s_i \rangle|$) in the FI-CBB setting is tighter than that in the CBB setting (i.e., using the partial-information feedback).
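To see intuitively why the FI-CBB bound is tighter, recall the generic shape of the elliptical confidence bound for a per-action ridge estimator that batched UCB-style policies rely on. In the display below, $\Phi_A$ stacks the contexts for which the reward of action $A$ was observed, $R_A$ the corresponding rewards, $\lambda$ the ridge parameter, and $\beta_A$ a confidence radius; this is only the standard form of such bounds under our assumptions, while the exact constants of Theorem 1 are given in Appendix A.
\[
  \theta^n_A = \bigl(\Phi_A^{\top}\Phi_A + \lambda I\bigr)^{-1}\Phi_A^{\top}R_A,
  \qquad
  V_A = \Phi_A^{\top}\Phi_A + \lambda I,
\]
\[
  \bigl|\langle \theta^n_A, s_i\rangle - \langle \theta^{*}_A, s_i\rangle\bigr|
  \;\le\; \beta_A\,\|s_i\|_{V_A^{-1}}
  \quad\text{with high probability.}
\]
Under FI-CBB every context contributes a row to $\Phi_A$ for every action, so $V_A$ accumulates faster and the width $\|s_i\|_{V_A^{-1}}$ shrinks faster than in CBB, where only the contexts on which $A$ was actually executed enter $\Phi_A$.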
From Theorem 1, we can infer that utilizing partial-
information feedbacks leads to a deterioration in
the regret of the bandit policy. Ideally, the policy
would be updated using full-information feedback.
However, in CBB, full-information feedback is un-
available. Fortunately, in CBB, different reward
parameter vectors are maintained and estimated
separately for each action, and the potential reward
structures of the non-executed actions have been captured to some extent. Therefore, why not utilize