
use the idea of the Upper Confidence Bound as in the unconstrained setting (Auer et al., 2002). Ferreira et al. (2018) apply the BwK framework to the network revenue management problem and develop a Thompson-sampling counterpart for BwK. Their assumption on the demand function is further generalized in recent works (Chen et al., 2022; Miao and Wang, 2021).
Online Optimization with Knapsacks: One closely related area is online optimization with knapsacks, which can be seen as a full-information variant of the BwK problem: after a decision is made, the feedback for all actions is available to the learner. Such problems often lead to online linear/convex programming, which has been studied by Balseiro et al. (2022), Jenatton et al. (2016), Agrawal and Devanur (2014b), Mahdavi et al. (2012), Liu and Grigas (2022), and Castiglioni et al. (2022). Liu and Grigas (2022) study online contextual decision-making with knapsack constraints. Specifically, they consider the continuous-action setting by developing an online primal-dual algorithm based on the “Smart Predict-then-Optimize” framework (Elmachtoub and Grigas, 2022) from the unconstrained setting, and they leave the problem with bandit feedback open. Our work provides a partial answer to this open problem in the finite-armed setting.
Concurrent works in CBwK: During the review period of our paper, two very recent works (Ai et al., 2022; Slivkins and Foster, 2022) appeared on arXiv that also consider the CBwK problem with general function classes. In independent and concurrent work, Slivkins and Foster (2022) consider the CBwK problem via regression oracles, similar to our work. They propose an algorithm based on a Lagrangian-game perspective that results in a slightly different arm-selection strategy than ours. They also attain the same regret bound, up to a constant factor, as our Theorem 3.1. Nevertheless, the focus of their work is limited to the setting described in Section 3 (i.e., the $B = \Omega(T)$ regime), where they provide only an upper bound on the regret. In contrast, our work demonstrates optimality by proving lower bound results for the $B = \Omega(T)$ regime. Moreover, we derive a two-stage algorithm and corresponding regret guarantees for the $B = o(T)$ regime that require less stringent budget assumptions.
Another recent work (Ai et al., 2022) studies the CBwK problem with general function classes by combining the re-solving heuristic with distribution estimation techniques. Their result is interesting in that it may apply to some function classes without an efficient online oracle. They also achieve logarithmic regret under suitable conditions, and their guarantees are thus more problem-dependent, in contrast to the oracle-based approach, as discussed in Section 4 of Foster and Rakhlin (2020). However, their framework involves a distribution estimation procedure that requires additional assumptions on the regularity of the underlying distribution. Consequently, their method’s regret depends on the regularity parameters, potentially leading to suboptimal results.
2 PRELIMINARIES
Notations Throughout this paper, $a \lesssim b$ and $a = O(b)$ ($a \gtrsim b$ and $a = \Omega(b)$) mean $a \leq Cb$ ($a \geq Cb$) for some absolute constant $C$; $a = \tilde{O}(b)$ ($a = \tilde{\Omega}(b)$) means $a = O(b \max\{1, \mathrm{polylog}(b)\})$ ($a = \Omega(b \max\{1, \mathrm{polylog}(b)\})$); and $a \asymp b$ means $a = O(b)$ and $b = O(a)$.
2.1 Basic setup
We consider the stochastic CBwK setting. Given the budget $B \in \mathbb{R}^d$ for $d$ different resources and the time horizon $T$, at each step the decision maker needs to select an arm $a_t \in [K]$ upon observing a context $x_t \in \mathcal{X}$ drawn i.i.d. from some unknown distribution $P_X$. Then a reward $r_{t,a_t} \in [0,1]$ and a consumption vector $c_{t,a_t} \in [0,1]^d$ are observed. We assume that $r_{t,a}$ and $c_{t,a}$ are generated i.i.d. from a fixed distribution parameterized by the given $x_t$ and $a$. The goal is to learn a policy, i.e., a mapping from contexts to actions, that maximizes the total reward while ensuring that the consumption of each resource does not exceed its budget.
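To make the interaction protocol concrete, the following is a minimal sketch of the round-by-round loop under an assumed environment/policy interface; the names `sample_context`, `sample_outcome`, `select`, and `update` are illustrative placeholders, not part of our algorithm, and the stopping rule (halt once any remaining budget drops below 1) is one common convention that guarantees the budget is never exceeded.

```python
import numpy as np

def run_cbwk(env, policy, T, B, d):
    """Sketch of the stochastic CBwK protocol (hypothetical interface, not our algorithm)."""
    remaining = np.full(d, float(B))          # equal budget B per resource (w.l.o.g.)
    total_reward = 0.0
    for t in range(T):
        x = env.sample_context()              # context x_t ~ P_X, drawn i.i.d.
        a = policy.select(x, remaining)       # arm a_t in {0, ..., K-1}
        r, c = env.sample_outcome(x, a)       # bandit feedback for the chosen arm only
        total_reward += r
        remaining -= c
        policy.update(x, a, r, c)
        if np.any(remaining < 1.0):           # per-round consumption is at most 1 per resource,
            break                             # so stopping here never exceeds the budget
    return total_reward
```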
Without loss of generality, we assume $B_1 = B_2 = \cdots = B_d = B$. We focus on the regime $B = \Omega(T)$, that is, the budget scales linearly with the time horizon. (We relax this assumption on the budget in Section 4.) Moreover, we make the standard assumption that the $K$-th arm is a null arm that generates no reward and consumes no resources when pulled.
Similar to the unconstrained setting (Foster et al., 2018; Foster and Rakhlin, 2020; Simchi-Levi and Xu, 2021), we assume access to two function classes $\mathcal{F} \subset (\mathcal{X} \times [K] \to [0,1])$ and $\mathcal{G} \subset (\mathcal{X} \times [K] \to [0,1]^d)$ that characterize the expectations of the reward and consumption distributions, respectively. Note that only one function class $\mathcal{F}$ for the reward distribution is considered in the unconstrained setting; to fit CBwK, we add another regression class $\mathcal{G}$ to model the consumption distribution. We assume the following realizability condition.
Assumption 2.1. There exist $f^* \in \mathcal{F}$ and $g^* \in \mathcal{G}$ such that $f^*(x, a) = \mathbb{E}[r_{t,a} \mid x_t = x]$ and $g^*(x, a) = \mathbb{E}[c_{t,a} \mid x_t = x]$ for all $a \in [K]$.
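As a toy illustration of Assumption 2.1, one may take $\mathcal{F}$ and $\mathcal{G}$ to be (clipped) linear classes over the context vector, so that the true conditional means lie in the classes by construction; the parameters, dimensions, and Bernoulli noise below are hypothetical choices for illustration only, not constructions used in the paper.

```python
import numpy as np

# Hypothetical linear classes: F = {(x, a) -> clip(<theta_a, x>)}, G = {(x, a) -> clip(W_a x)}.
rng = np.random.default_rng(0)
K, d, dim = 3, 2, 5                           # arms, resources, context dimension (illustrative)
theta = rng.uniform(0, 1 / dim, (K, dim))     # true reward parameters, so f* lies in F
W = rng.uniform(0, 1 / dim, (K, d, dim))      # true consumption parameters, so g* lies in G
theta[K - 1] = 0.0                            # null arm: zero expected reward ...
W[K - 1] = 0.0                                # ... and zero expected consumption

def f_star(x, a):          # E[r_{t,a} | x_t = x], valued in [0, 1]
    return float(np.clip(theta[a] @ x, 0.0, 1.0))

def g_star(x, a):          # E[c_{t,a} | x_t = x], valued in [0, 1]^d
    return np.clip(W[a] @ x, 0.0, 1.0)

def sample_outcome(x, a):  # bandit feedback: Bernoulli draws whose means satisfy Assumption 2.1
    r = float(rng.binomial(1, f_star(x, a)))
    c = rng.binomial(1, g_star(x, a)).astype(float)
    return r, c

x_example = rng.uniform(0.0, 1.0, dim)        # example context with entries in [0, 1]
r_example, c_example = sample_outcome(x_example, a=0)
```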
The benchmark for our algorithm is the best dynamic policy, which knows the contextual distribution $P_X$, $f^*$, and $g^*$ and can dynamically maximize the total reward given the historical information and the current context. We denote by $\mathrm{OPT}_{\mathrm{DP}}$ the expected total reward of the best dynamic policy. The goal of the decision-maker is to minimize the following regret: