Optimal Contextual Bandits with Knapsacks under Realizability via
Regression Oracles
Yuxuan Han∗† Jialin Zeng∗† Yang Wang†‡ Yang Xiang†§ Jiheng Zhang‡
†Department of Mathematics, HKUST
‡Department of Industrial Engineering and Decision Analytics, HKUST
§HKUST Shenzhen-Hong Kong Collaborative Innovation Research Institute
Abstract
We study the stochastic contextual bandit with
knapsacks (CBwK) problem, where each action,
taken upon a context, not only leads to a
random reward but also costs a random resource
consumption in a vector form. The challenge is
to maximize the total reward without violating
the budget for each resource. We study this
problem under a general realizability setting
where the expected reward and expected cost are
functions of contexts and actions in some given
general function classes F and G, respectively.
Existing works on CBwK are restricted to the
linear function class since they use UCB-type
algorithms, which heavily rely on the linear
form and thus are difficult to extend to general
function classes. Motivated by online regression
oracles that have been successfully applied to
contextual bandits, we propose the first universal
and optimal algorithmic framework for CBwK
by reducing it to online regression. We also
establish the lower regret bound to show the
optimality of our algorithm for a variety of
function classes.
1 INTRODUCTION
Contextual Bandits (CB) is a fundamental online learning
framework with an exploration-exploitation tradeoff. At
each step, the decision maker, to maximize the total
reward, takes one out of K actions upon observing the
context and then receives a random reward. It has received
extensive attention due to its wide range of applications,
such as recommendation systems, clinical trials, and
online advertisement (Bietti et al., 2021; Lattimore and
Szepesvári, 2020; Slivkins et al., 2019).
∗Equal contribution. Correspondence to Yang Xiang, Jiheng
Zhang (<maxiang@ust.hk>, <jiheng@ust.hk>).
Proceedings of the 26th International Conference on Artificial
Intelligence and Statistics (AISTATS) 2023, Valencia, Spain.
PMLR: Volume 206. Copyright 2023 by the author(s).
However, the standard contextual bandit setting ignores
the budget constraint that commonly arises in real-world
applications. For example, when a search engine picks
an ad to display, it needs to consider both the advertiser’s
feature and his remaining budget. Thus, canonical
contextual bandits have been generalized to Contextual
Bandit with Knapsacks (CBwK) setting, where a selected
action at each time t will lead to a context-dependent
consumption of several constrained resources and there
exists a global budget B on the resources.
In contextual decision-making, it is crucial to model the
relationship between the outcome (reward and costs) and
the context to design an effective and efficient policy. In a
non-budget setting, i.e., without the knapsack constraints,
the contextual bandit algorithms fall into two main
categories: agnostic and realizability-based approaches.
The agnostic approaches aim to find the best policy out
of a given policy set Π and make no assumptions on the
context-outcome relationships. In this case, one often
requires access to a cost-sensitive classification oracle over
Πto achieve computational efficiency (Agarwal et al.,
2014; Dudik et al., 2011; Langford and Zhang, 2007).
On the other hand, realizability-based approaches assume
that there exists a given function class F that models the
outcome distribution with the given context. When F is
a linear class, the optimal regret has been achieved by
UCB-type algorithms (Abbasi-Yadkori et al., 2011; Chu
et al., 2011). Moreover, a unified approach to developing
optimal algorithms has recently been studied in Foster
et al. (2018), Foster and Rakhlin (2020), Simchi-Levi
and Xu (2021) by introducing regression oracles over the
class F. While the realizability assumption could face
the problem of model mismatch, it has been shown that
empirically realizability-based approaches outperform
agnostic-based ones when no model misspecification
exists (Foster et al., 2018; Krishnamurthy et al., 2016).
However, these oracle-based algorithms only apply to
unconstrained contextual bandits without considering the
knapsack constraints.
In the CBwK setting, two similar approaches as above have
been explored. For the agnostic approach, Badanidiyuru
et al. (2014) and Agrawal et al. (2016) extend the general
contextual bandits to the knapsack setting by generalizing
the techniques in contextual bandits without knapsack
constraints (Agarwal et al., 2014; Dudik et al., 2011).
However, as shown in Agrawal and Devanur (2016), the
optimization problem using classification oracle in the
agnostic setting could be NP-hard even for the linear
case. On the other hand, CBwK under linear realizability
assumption (i.e., both the reward function class F and
the cost function class G are linear function classes) and
its variants have been studied. Based on the confidence
ellipsoid results guaranteed by the linear model, Agrawal
and Devanur (2016) develop algorithms for the stochastic
linear CBwK with near-optimal regret bounds. Sivakumar
et al. (2022) study the linear CBwK under the smoothed
contextual setting where the contexts are perturbed by
Gaussian noise. However, to our knowledge, no work
has considered realizability beyond the linear case. The
difficulty lies in the lack of confidence ellipsoid results for
the general function class as in linear model assumptions.
In this paper, we try to address the following open problem:
Is there a unified algorithm that applies to CBwK under the
general realizability assumption?
We answer this question positively by proposing
SquareCBwK, the first general and optimal algorithmic
framework for stochastic CBwK based on online regression
oracles. Compared to the contextual bandits, additional
resource constraints in CBwK lead to more intricate
couplings across contexts, which makes it substantially
more challenging to strike a balance between exploration
and exploitation. It is thus nontrivial to apply regression
oracle techniques to CBwK. The key challenge lies in
designing a proper score of actions for decision-making
that can well balance the reward and the cost gained over
time to maximize the rewards while ensuring no resource
runs out.
1.1 Our contributions
In this paper, we address the above issue and successfully
apply regression oracles to CBwK under the realizability
assumption. Our contributions are summarized in the
following three aspects.
Algorithm for CBwK with Online Regression Oracles:
In Section 3.1, we propose the first unified algorithm
SquareCBwK for solving CBwK under the general
realizability assumption, which addresses the above open
problem. Motivated by the works in contextual bandits
that apply regression oracles to estimate the reward (to
which we refer to as the reward oracle), we propose to
apply a new cost oracle in SquareCBwK to tackle the
additional resource constraints in CBwK. Specifically,
under the setting that the budget B scales linearly with the
time horizon T, we construct a penalized version of the
unconstrained score based on the cost oracle and adjust
the penalization adaptive to the resource consumption over
time. Our algorithm can be instantiated with any
available efficient oracle for any general function class to
obtain an upper regret bound for CBwK. In this sense, we
provide a meta-algorithm that reduces CBwK to regression
problems, which is a well-studied area in machine learning
with various computationally efficient algorithms for
different function classes. In Section 3.3, we show that
through different instantiations, SquareCBwK can achieve
the optimal regret bound in the linear CBwK and derive
new guarantees for more general function classes like
non-parametric classes.
Lower Bound Results: In Section 3.2, we develop a new
approach for establishing lower bounds for CBwK under
the general realizability assumption. Previous works in
CBwK under the realizability assumption usually show the
optimality of their results by matching the lower bound in
the unconstrained setting. However, when the cost function
class G is more complicated than the reward function class F,
the lower bound in the unconstrained setting will be very
loose since it does not consider the information of G. Our
method, in contrast, can provide lower bounds that include
information from both F and G. Moreover, we apply this
new method to construct lower bounds that match the regret
bound given by SquareCBwK for various choices of F
and G, which demonstrates SquareCBwK’s optimality for
different function classes.
Relaxed Assumption on the Budget: While we present
our main result under the regime that the budget B scales
linearly with the time horizon T, such an assumption may
be restrictive compared with previous works. For example,
in the linear CBwK, both Sivakumar et al. (2022) and
Agrawal and Devanur (2016) allow a relaxed budget
$B = \Omega(T^{3/4})$. To close this gap, in Section 4, we design a
two-stage algorithm based on SquareCBwK that allows a
weaker condition on the budget for solving CBwK under
the general realizability assumption. In particular, in the
linear CBwK setting, our relaxed condition recovers the
$B = \Omega(T^{3/4})$ condition.
1.2 Other related works
Bandit with Knapsack and Demand Learning: One
area closely related to our work is bandits with knapsacks
(BwK), which does not consider contextual information.
The non-contextual BwK problem is first investigated in
a general formulation by Badanidiyuru et al. (2018) and
further generalized to concave reward/convex constraint
setting by Agrawal and Devanur (2014a). Both papers
use the idea of Upper-Confidence Bound as in the non-
constrained setting (Auer et al., 2002). Ferreira et al.
(2018) apply the BwK framework to the network revenue
management problem and develop a Thompson-sampling
counterpart for BwK. Their assumption on the demand
function is further generalized in the recent works (Chen
et al., 2022; Miao and Wang, 2021).
Online Optimization with Knapsacks: One closely
related area is online optimization with knapsacks, which
can be seen as a full-information variant of the BwK
problem: after a decision is made, the feedback for all
actions is available to the learner. Such problems often
lead to solving online linear/convex programming, which
has been studied in Balseiro et al. (2022), Jenatton et al.
(2016), Agrawal and Devanur (2014b), Mahdavi et al.
(2012), Liu and Grigas (2022) and Castiglioni et al. (2022).
Liu and Grigas (2022) study online
contextual decision-making with knapsack constraints.
Specifically, they study the continuous action setting by
developing an online primal-dual algorithm based on the
“Smart Predict-then-Optimize” framework (Elmachtoub
and Grigas, 2022) in the unconstrained setting and leave
the problem with bandit feedback open. Our work provides
a partial answer to this open problem in the finite-armed
setting.
Concurrent works in CBwK: During the review period of
our paper, two very recent works appeared on arXiv
(Ai et al., 2022; Slivkins and Foster, 2022), which also
consider the CBwK problem with general function classes.
In independent and concurrent work, Slivkins and Foster
(2022) consider the CBwK problem via regression oracles
similar to our work. They propose an algorithm based
on the perspective of Lagrangian game that results in a
slightly different choice of arm selection strategy than ours.
They also attain the same regret bound, up to constant factors, as
our Theorem 3.1. Nevertheless, the focus of their work
is limited to the setting described in Section 3 (i.e., the
B = Ω(T) regime), where they provide only an upper
regret bound. In contrast, our work demonstrates
optimality by proving lower bound results for the B = Ω(T)
regime. Moreover, we derive a two-stage algorithm and
corresponding regret guarantees for the B = o(T) regime
under less stringent budget assumptions.
Another recent work (Ai et al., 2022) studies the CBwK
problem with general function classes by combining
the re-solving heuristic and the distribution estimation
techniques. Their result is interesting in that it may
work in some function classes without an efficient online
oracle. They also achieve logarithmic regret under
suitable conditions and are thus more problem-dependent, in
contrast to the oracle-based approach, as discussed in
Section 4 of Foster and Rakhlin (2020). However, their
framework involves a distribution estimation procedure
that requires additional assumptions about the regularity
of the underlying distribution. Consequently, their
method’s regret is influenced by the regularity parameters,
potentially leading to suboptimal results.
2 PRELIMINARIES
Notations Throughout this paper, $a \lesssim b$ and $a = O(b)$ (resp. $a \gtrsim b$ and $a = \Omega(b)$)
mean $a \le Cb$ (resp. $a \ge Cb$) for some absolute constant $C$.
$a = \tilde{O}(b)$ (resp. $a = \tilde{\Omega}(b)$) means $a = O(b\max\{1, \mathrm{polylog}(b)\})$
(resp. $a = \Omega(b\max\{1, \mathrm{polylog}(b)\})$), and $a \asymp b$ means $a = O(b)$ and $b = O(a)$.
2.1 Basic setup
We consider the stochastic CBwK setting. Given the budget $B \in \mathbb{R}^d$ for $d$ different
resources and the time horizon $T$, at each step the decision maker needs to select an arm
$a_t \in [K]$ upon observing a context $x_t \in \mathcal{X}$ drawn i.i.d. from some unknown
distribution $P_X$. Then a reward $r_{t,a_t} \in [0,1]$ and a consumption vector
$c_{t,a_t} \in [0,1]^d$ are observed. We assume that $r_{t,a}$ and $c_{t,a}$ are generated
i.i.d. from a fixed distribution parameterized by the given $x_t$ and $a$. The goal is to learn
a policy, a mapping from contexts to actions, that maximizes the total reward while ensuring
that the consumption of each resource does not exceed the budget.
Without loss of generality we assume $B_1 = B_2 = \cdots = B_d = B$. We focus on the regime
$B = \Omega(T)$; that is, the budget scales linearly with time. (We relax this assumption
on the budget in Section 4.) Moreover, we make a standard assumption that the $K$-th arm
is a null arm that generates no reward or consumption of any resource when being pulled.
Similar to the unconstrained setting (Foster et al., 2018; Foster and Rakhlin, 2020;
Simchi-Levi and Xu, 2021), we assume access to two classes of functions
$\mathcal{F} \subseteq (\mathcal{X} \times [K] \to [0,1])$ and
$\mathcal{G} \subseteq (\mathcal{X} \times [K] \to [0,1]^d)$ that characterize the expectations
of the reward and consumption distributions, respectively. Note that only one function class
$\mathcal{F}$ for the reward distribution is considered in the unconstrained setting, while to
fit CBwK, we add another regression class $\mathcal{G}$ to model the consumption distribution.
We assume the following realizability condition.

Assumption 2.1. There exist some $f^\star \in \mathcal{F}$ and $g^\star \in \mathcal{G}$ such that
$f^\star(x, a) = \mathbb{E}[r_{t,a} \mid x_t = x]$ and $g^\star(x, a) = \mathbb{E}[c_{t,a} \mid x_t = x]$
for all $a \in [K]$.
The algorithm benchmark is the best dynamic policy, which knows the contextual distribution
$P_X$, $f^\star$, and $g^\star$ and can dynamically maximize the total reward given the
historical information and the current context. We denote by $\mathrm{OPT}_{\mathrm{DP}}$ the
expected total reward of the best dynamic policy. The goal of the decision maker is to
minimize the following regret:
Optimal Contextual Bandits with Knapsacks under Realizability via Regression Oracles
Definition 2.1. $\mathrm{Reg}(T) := \mathrm{OPT}_{\mathrm{DP}} - \mathbb{E}\big[\sum_{t=1}^{\tau} r_{t,a_t}\big]$,
where $\tau$ is the stopping time at which there exists some $j \in [d]$ such that
$\sum_{t=1}^{\tau} (c_{t,a_t})_j > B - 1$.
Due to the intractability of the best dynamic policy in Definition 2.1, we consider a static
relaxation problem that provides an upper bound on $\mathrm{OPT}_{\mathrm{DP}}$.

Denote by $\Delta_K$ the set of probability distributions over the action set $[K]$. The
following program aims to find the best static randomized policy $p : \mathcal{X} \to \Delta_K$
that maximizes the expected per-round reward while ensuring resources are not exceeded in
expectation:
$$\max_{p:\mathcal{X}\to\Delta_K} \; \mathbb{E}_{x\sim P_X}\Big[\sum_{a\in[K]} p_a(x)\, f^\star(x,a)\Big]
\quad \text{s.t.} \quad \mathbb{E}_{x\sim P_X}\Big[\sum_{a\in[K]} p_a(x)\, g^\star(x,a)\Big] \le B/T \cdot \mathbf{1}. \tag{1}$$
Lemma 2.1. Let $\mathrm{OPT}$ denote the value of the optimal static policy (1); then we have
$T \cdot \mathrm{OPT} \ge \mathrm{OPT}_{\mathrm{DP}}$.

Lemma 2.1 can be derived directly following the proof of Lemma 1 in Agrawal and Devanur (2016),
where they consider the linear CBwK case but the reasoning therein is completely independent of
the linear structure. With Lemma 2.1, we can control the regret bound by considering
$T \cdot \mathrm{OPT} - \mathbb{E}\big[\sum_{t=1}^{\tau} r_{t,a_t}\big]$.
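To make the static relaxation concrete, the following is a minimal sketch that solves program (1)
with an off-the-shelf LP solver, assuming a finite context set and known $f^\star$, $g^\star$, and
$P_X$; this idealized setting is used only for illustration, and the function and argument names
are ours:

```python
import numpy as np
from scipy.optimize import linprog

def solve_static_relaxation(P_x, f, g, B, T):
    """Solve program (1) for a finite context set with known f*, g*, P_X.

    P_x : (n,) context probabilities
    f   : (n, K) expected rewards f*(x, a)
    g   : (n, K, d) expected costs g*(x, a)
    Returns the optimal static policy p (n, K) and the per-round value OPT.
    """
    n, K = f.shape
    d = g.shape[2]
    # Decision variables: p_a(x), flattened in x-major order.
    c = -(P_x[:, None] * f).ravel()                      # maximize => negate
    A_ub = (P_x[:, None, None] * g).reshape(n * K, d).T  # E_x[sum_a p_a(x) g_j] <= B/T
    b_ub = np.full(d, B / T)
    A_eq = np.kron(np.eye(n), np.ones((1, K)))           # sum_a p_a(x) = 1 for each x
    b_eq = np.ones(n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, 1)] * (n * K))
    return res.x.reshape(n, K), -res.fun
```

Multiplying the returned per-round value by $T$ gives the benchmark $T \cdot \mathrm{OPT}$ used in
the regret analysis.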
2.2 Online Regression Oracles
Under the realizability Assumption 2.1, We introduce
the online regression problem and the notion of online
regression oracles.
The general setting of online regression problem with input
space Z, output space Y, function class Hand loss `:
Y×Y R+is described as following: At the beginning of
the game, the environment chooses an h:Z → Y, h
Has the underlying response generating function. Then
at each round t, (i) the learner receives an input zt∈ Z
possibly chosen in adversarial by the environment, (ii) the
learner then predicts a value ˆy(zt)∈ Y based on historical
information, (iii) the learner observes a noisy response of
h(zt)and suffers a loss `(ˆy(zt), h(zt). The goal of the
learner is to minimize the cumulative loss
Regsq(T;h) :=
T
X
t=1
`ˆy(zt), h(zt).
In our problem, we assume access to oracles $\mathcal{R}_r$, $\mathcal{R}_c$ for two online
regression problems, respectively:

Reward Regression Oracle The reward regression oracle $\mathcal{R}_r$ is assumed to be an
algorithm for the online regression problem with $\mathcal{Z} = \mathcal{X} \times [K]$,
$\mathcal{Y} = [0,1]$, $\mathcal{H} = \mathcal{F}$, $\ell(y_1, y_2) = (y_1 - y_2)^2$, so that when
$\hat{y}_t$ is generated by $\mathcal{R}_r$, there exists some $\mathrm{Reg}^r_{\mathrm{sq}}$ as a
function of $T$ such that
$$\mathrm{Reg}_{\mathrm{sq}}(T; f) \le \mathrm{Reg}^r_{\mathrm{sq}}(T), \quad \forall f \in \mathcal{F}. \tag{2}$$
Cost Regression Oracle The cost regression oracle $\mathcal{R}_c$ is assumed to be an algorithm
for the online regression problem with $\mathcal{Z} = \mathcal{X} \times [K]$,
$\mathcal{Y} = [0,1]^d$, $\mathcal{H} = \mathcal{G}$,
$\ell(\mathbf{y}_1, \mathbf{y}_2) = \|\mathbf{y}_1 - \mathbf{y}_2\|_\infty^2$, so that when
$\hat{\mathbf{y}}_t$ is generated by $\mathcal{R}_c$, there exists some
$\mathrm{Reg}^c_{\mathrm{sq}}$ as a function of $T$ such that
$$\mathrm{Reg}_{\mathrm{sq}}(T; g) \le \mathrm{Reg}^c_{\mathrm{sq}}(T), \quad \forall g \in \mathcal{G}. \tag{3}$$
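As one concrete illustration of what such an oracle can look like, the sketch below implements a
per-arm online ridge regression forecaster for the linear class
$\mathcal{F} = \{(x,a) \mapsto \langle\theta_a, x\rangle\}$; the class name, the regularizer
`reg`, and the clipping to $[0,1]$ are our own choices and not part of the paper's definitions.
A cost oracle $\mathcal{R}_c$ for a linear $\mathcal{G}$ can be built analogously with
vector-valued responses.

```python
import numpy as np

class OnlineRidgeOracle:
    """Illustrative reward oracle R_r for a linear class: per-arm online ridge regression."""

    def __init__(self, K, dim, reg=1.0):
        # One regularized Gram matrix and response vector per arm.
        self.A = np.stack([reg * np.eye(dim) for _ in range(K)])  # (K, dim, dim)
        self.b = np.zeros((K, dim))

    def predict(self, x):
        # Predicted expected reward of every arm for context x, clipped to [0, 1].
        preds = np.array([np.linalg.solve(self.A[a], self.b[a]) @ x
                          for a in range(len(self.b))])
        return np.clip(preds, 0.0, 1.0)

    def update(self, x, a, y):
        # Online least-squares update for the chosen arm only.
        self.A[a] += np.outer(x, x)
        self.b[a] += y * x
```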
2.3 Online Mirror Descent
Our algorithm adopts an online primal-dual framework to
tackle the challenge brought by knapsack constraints. Our
strategy of updating the dual variable λtfalls into the
general online convex optimization (OCO) framework. In
the OCO problem with parameter set Λ, adversary class L
and time horizon T, the learner needs to choose λtΛ
adaptively at each round t. After each choice, he will
observe an adversarial convex loss Lt L and pay the
cost Lt(λt).The goal of the learner is to minimize the
cumulative regret:
RegOCO(T; Λ,L) :=
T
X
t=1
Lt(λt)min
λΛ
T
X
t=1
Lt(λ).
In our designed adversary, Lt(λ) = hB
T·1ct,at,λi
and minimizing Ltcorresponds to penalize the violation
of the budget. We focus on the following adversary class
and parameter set:
L={L(λ) := hθ,λi:θRd,kθk1}
Λ = {λRd:λ0,kλk1Z},
where Z > 0is the `1radius of Λto be determined.
The online mirror descent (OMD) algorithm (Hazan et al., 2016; Shalev-Shwartz et al., 2012) is a
simple and fast algorithm achieving the optimal OCO regret in non-Euclidean geometry. OMD follows
the update rule
$$\lambda_t = \operatorname*{arg\,min}_{\lambda \in \Lambda} \; \langle \nabla L_{t-1}(\lambda_{t-1}), \lambda\rangle + \frac{1}{\eta_t} D_h(\lambda, \lambda_{t-1}), \tag{4}$$
where $h$ is the metric-generating function and $D_h$ is the associated Bregman divergence. After
adding a slack variable and re-scaling, the OCO problem over $\mathcal{L}, \Lambda$ is equivalent
to the OCO problem over
$$\tilde{\mathcal{L}} = \{\tilde{L}(\tilde{\lambda}) := \langle\tilde{\theta}, \tilde{\lambda}\rangle : \tilde{\theta} \in \mathbb{R}^{d+1}, \|\tilde{\theta}\|_\infty \le Z, \tilde{\theta}_{d+1} = 0\},
\qquad \tilde{\Lambda} = \{\tilde{\lambda} \in \mathbb{R}^{d+1} : \tilde{\lambda} \ge 0, \|\tilde{\lambda}\|_1 = 1\},$$
which can be solved via the normalized Exponentiated Gradient algorithm (i.e., selecting $h$ as
the negative entropy function). In this case, the OMD algorithm has the following OCO regret
guarantee (Hazan et al., 2016; Shalev-Shwartz et al., 2012):
Yuxuan Han∗†, Jialin Zeng∗†, Yang Wang†‡, Yang Xiang†§, Jiheng Zhang
Lemma 2.2. Setting $\eta_t = \eta = O(\sqrt{\log d / T})$, the OMD yields regret
$$\mathrm{Reg}_{\mathrm{OCO}}(T; \Lambda, \mathcal{L}) \lesssim Z\sqrt{T \log d}. \tag{5}$$
As in the literature (Agrawal and Devanur, 2016;
Castiglioni et al., 2022) that introduces the online primal-
dual framework, Lemma 2.2 plays an essential role in
controlling the regret induced by knapsack constraints in
SquareCBwK.
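For concreteness, the following is a minimal NumPy sketch of one normalized exponentiated-gradient
step on the lifted simplex described above; the function name and interface are ours, and
step-size constants are omitted:

```python
import numpy as np

def eg_dual_update(w, cost, B, T, Z, eta):
    """One normalized exponentiated-gradient (OMD with negative entropy) step for the dual.

    w    : positive weights over the lifted (d+1)-simplex (last entry is the slack)
    cost : realized consumption vector c_{t,a_t} in [0,1]^d
    Returns the updated weights and the dual variable lam in Lambda.
    """
    d = len(cost)
    grad = np.append(B / T - cost, 0.0)   # gradient of L_t; slack coordinate is 0
    w = w * np.exp(-eta * Z * grad)       # multiplicative-weights step on the lifted problem
    lam = Z * w[:d] / w.sum()             # map the simplex point back to Lambda
    return w, lam

# Typical use: start from w = np.ones(d + 1) and call eg_dual_update after each round.
```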
3 ALGORITHM AND THEORETICAL
GUARANTEES
We are ready to present our main algorithm SquareCBwK
for solving stochastic CBwK under the general realizability
assumption. We first present the algorithm and theoretical
results under the B = Ω(T) regime for simplicity of algorithm
design. The algorithm and analysis for more general choices of
B are presented in Section 4.
3.1 The SquareCBwK Algorithm
The main body of SquareCBwK has a similar structure as
the SquareCB algorithm for contextual bandits (Foster and
Rakhlin, 2020), but with significant changes necessary to
handle the knapsack constraints. SquareCBwK is presented
in Algorithm 1 with three key modules: prediction through
oracles, arm selection scheme, and dual update through
OMD.
Prediction through oracles At each step, after observing the context, SquareCBwK will
simultaneously access the reward oracle $\mathcal{R}_r$ and the cost oracle $\mathcal{R}_c$ to
predict the reward and cost of each action for this round. Then these two predicted scores are
incorporated through the following Lagrangian to form a final predicted score $\hat{\ell}_t$:
$$\hat{\ell}_{t,a} := \hat{r}_{t,a} + \lambda_t^\top\big(\mathbf{1}\cdot B/T - \hat{c}_{t,a}\big), \quad \forall a \in [K].$$
Arm selection scheme After computing the predicted score $\hat{\ell}_t$, we employ the probability
selection strategy of Abe and Long (1999): we choose the greedy action's score evaluated by
$\hat{\ell}_t$ as the benchmark and select each arm $a$ with a probability $p_{t,a}$ that is
inversely proportional to the gap between the arm's score and the benchmark. This strategy strikes
a balance between exploration and exploitation: when the predicted score of an action is close to
that of the greedy action, we tend to explore it with probability roughly $1/K$, otherwise with a
very small chance.
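A minimal NumPy sketch of this inverse-gap-weighting rule (the function name is ours); it
corresponds to line 7 of Algorithm 1 below:

```python
import numpy as np

def igw_probabilities(scores, gamma):
    """Inverse-gap weighting (Abe and Long, 1999) over K predicted scores.

    Every non-greedy arm a gets probability 1 / (K + gamma * gap_a), where gap_a is the score
    gap to the greedy arm; the greedy arm receives all the remaining probability mass.
    """
    scores = np.asarray(scores, dtype=float)
    K = len(scores)
    best = int(np.argmax(scores))
    gaps = scores[best] - scores           # nonnegative gaps to the benchmark
    p = 1.0 / (K + gamma * gaps)
    p[best] = 0.0
    p[best] = 1.0 - p.sum()                # remaining mass on the greedy arm
    return p

# Example: with a large gamma, most mass concentrates on the greedy arm.
# igw_probabilities([0.9, 0.5, 0.2], gamma=20.0)  ->  approx [0.85, 0.09, 0.06]
```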
Dual update through OMD After choosing the arm $a_t$, the new data
$\{x_t, a_t, r_{t,a_t}, c_{t,a_t}\}$ will be fed into $\mathcal{R}_r$, $\mathcal{R}_c$, and OMD.
We then update the dual variable $\lambda_t \in \Lambda$ successively through OMD.
Algorithm 1: SquareCBwK
Input: Time horizon $T$, total initial budget $B$, learning rate $\gamma > 0$, online regression
oracles for reward $\mathcal{R}_r$ and cost $\mathcal{R}_c$, radius $Z = T/B$ of the parameter set $\Lambda$.
1: Initialization: initialize $\mathcal{R}_r$, $\mathcal{R}_c$, and OMD.
2: for $t = 1, \ldots, T$ do
3:   Observe context $x_t$.
4:   $\mathcal{R}_r$ predicts $\hat{r}_{t,a}$ and $\mathcal{R}_c$ predicts $\hat{c}_{t,a}$ for all $a \in [K]$.
5:   Compute $\hat{\ell}_{t,a} := \hat{r}_{t,a} + \lambda_t^\top(B/T \cdot \mathbf{1} - \hat{c}_{t,a})$, $\forall a \in [K]$.
6:   Let $b_t = \arg\max_{a \in [K]} \hat{\ell}_{t,a}$.
7:   For each $a \neq b_t$, define $p_{t,a} := \frac{1}{K + \gamma(\hat{\ell}_{t,b_t} - \hat{\ell}_{t,a})}$ and let $p_{t,b_t} = 1 - \sum_{a \neq b_t} p_{t,a}$.
8:   Sample arm $a_t \sim p_t$. Observe reward $r_{t,a_t}$ and cost $c_{t,a_t}$.
9:   if $\exists j \in [d]$ such that $\sum_{t'=1}^{t} c_{t',a_{t'}}[j] \ge B - 1$ then
10:    Exit.
11:  end if
12:  Feed $\mathcal{R}_r$ with data $\{(x_t, a_t), r_{t,a_t}\}$ and $\mathcal{R}_c$ with data $\{(x_t, a_t), c_{t,a_t}\}$.
13:  Feed OMD with $c_{t,a_t}$; OMD updates $\lambda_{t+1} \in \Lambda$.
14: end for
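The pseudocode above translates directly into a short implementation. The sketch below is our own
rendering of the loop, assuming hypothetical `env`, `reward_oracle`, and `cost_oracle` interfaces
(for instance, the OnlineRidgeOracle sketch in Section 2.2) and reusing the `igw_probabilities`
and `eg_dual_update` helpers shown earlier:

```python
import numpy as np

def square_cbwk(T, B, K, d, gamma, reward_oracle, cost_oracle, env, rng):
    """Runnable sketch of the SquareCBwK loop (Algorithm 1); interfaces are hypothetical."""
    Z = T / B                                 # radius of the dual set Lambda
    eta = np.sqrt(np.log(d + 1) / T)          # OMD step size (up to constants)
    w = np.ones(d + 1)                        # EG weights; last entry is the slack coordinate
    lam = np.zeros(d)
    spent = np.zeros(d)
    total_reward = 0.0
    for t in range(T):
        x = env.observe_context()
        r_hat = reward_oracle.predict(x)                  # shape (K,)
        c_hat = cost_oracle.predict(x)                    # shape (K, d)
        scores = r_hat + (B / T - c_hat) @ lam            # predicted Lagrangian score
        p = igw_probabilities(scores, gamma)              # inverse-gap weighting
        a = rng.choice(K, p=p)
        r, c = env.play(a)                                # observe reward and cost vector
        total_reward += r
        spent += c
        if np.any(spent >= B - 1):                        # stop before overspending any resource
            break
        reward_oracle.update(x, a, r)
        cost_oracle.update(x, a, c)
        w, lam = eg_dual_update(w, c, B, T, Z, eta)       # dual (OMD) step
    return total_reward
```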
Compared with SquareCB in Foster and Rakhlin (2020), we introduce three novel technical elements
to deal with knapsack constraints. First, we apply a new cost oracle to generate the cost
prediction. Next, to balance the reward and the cost prediction over time, we propose the
predicted Lagrangian to construct a proper score function $\hat{\ell}_t$. Finally, we introduce
OMD to update the dual variable so that the predicted scores $\hat{\ell}_t$ adapt to the resource
consumption throughout the process. Notably, the $\ell_1$ radius of the parameter set $\Lambda$ is
carefully chosen to be $T/B$, since we expect $\lambda_t$ to capture the sensitivity of the
optimal static policy (1) to knapsack constraint violations over time.
Specifically, if we increase the budget $B$ by $\varepsilon$, the increased reward over $T$ rounds
is at most $\frac{T \cdot \mathrm{OPT}}{B}\varepsilon$. This observation suggests that $Z$, the
radius of $\Lambda$, should be at least $\frac{T \cdot \mathrm{OPT}}{B}$. On the other hand, $Z$
should be of constant level so that OMD achieves the optimal regret bound $O(\sqrt{T})$ by
Lemma 2.2. Therefore, setting $Z = \frac{T \cdot \mathrm{OPT}}{B}$ would be the desired choice. In
practice, since $\mathrm{OPT}$ is unknown, we need to estimate $Z$ so that
$\frac{T \cdot \mathrm{OPT}}{B} \le Z \lesssim \frac{T \cdot \mathrm{OPT}}{B}$. The regime
$B = \Omega(T)$ guarantees that $T/B$ is approximately $\frac{T \cdot \mathrm{OPT}}{B}$, without
the need to further estimate $\mathrm{OPT}$. This is why we set $Z = T/B$ in SquareCBwK. We will
discuss estimating $\mathrm{OPT}$ when $B = \Omega(T)$ fails to hold in Section 4.
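The sensitivity claim above can be justified by a standard argument; here is a short sketch under
the assumption (which holds for the linear program (1)) that the value of (1), viewed as a
function $\mathrm{OPT}(\cdot)$ of the per-round budget, is concave and nonnegative at zero. Since
$\frac{B}{T} = \frac{B}{B+\varepsilon}\cdot\frac{B+\varepsilon}{T} + \frac{\varepsilon}{B+\varepsilon}\cdot 0$,
concavity gives
$$\mathrm{OPT}\Big(\tfrac{B}{T}\Big) \;\ge\; \tfrac{B}{B+\varepsilon}\,\mathrm{OPT}\Big(\tfrac{B+\varepsilon}{T}\Big),
\qquad\text{i.e.,}\qquad
\mathrm{OPT}\Big(\tfrac{B+\varepsilon}{T}\Big) \;\le\; \mathrm{OPT}\Big(\tfrac{B}{T}\Big) + \tfrac{\varepsilon}{B}\,\mathrm{OPT}\Big(\tfrac{B}{T}\Big),$$
so the per-round gain from $\varepsilon$ extra budget is at most $\frac{\mathrm{OPT}}{B}\varepsilon$,
and the gain over $T$ rounds is at most $\frac{T \cdot \mathrm{OPT}}{B}\varepsilon$.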
Now we state the theoretical guarantee of SquareCBwK: