Robust Contextual Linear Bandits
Rong Zhu
Institute of Science and Technology for Brain-Inspired Intelligence
Fudan University

Branislav Kveton
Amazon
Abstract
Model misspecification is a major consideration in applications of statistical methods and machine learning. However, it is often neglected in contextual bandits. This paper studies a common form of misspecification, an inter-arm heterogeneity that is not captured by context. To address this issue, we assume that the heterogeneity arises due to arm-specific random variables, which can be learned. We call this setting a robust contextual bandit. The arm-specific variables explain the unknown inter-arm heterogeneity, and we incorporate them in the robust contextual estimator of the mean reward and its uncertainty. We develop two efficient bandit algorithms for our setting: a UCB algorithm called RoLinUCB and a posterior-sampling algorithm called RoLinTS. We analyze both algorithms and bound their n-round Bayes regret. Our experiments show that RoLinTS is comparably statistically efficient to the classic methods when the misspecification is low, more robust when the misspecification is high, and significantly more computationally efficient than its naive implementation.
1 Introduction
A stochastic contextual bandit (Auer et al., 2002; Li et al., 2010; Lattimore and Szepesvari, 2019) is an online learning problem where a learning agent sequentially interacts with an environment over $n$ rounds. In each round, the agent observes context, pulls an arm conditioned on the context, and receives a corresponding stochastic reward. Contextual bandits have many applications in practice, such as personalized recommendations (Li et al., 2010; Jeunen and Goethals, 2021). Because the mean rewards of the arms are tied together through known context and learned model parameters, the contextual approach can be more statistically efficient than a naive multi-armed bandit solution (Auer et al., 2002; Agrawal and Goyal, 2012). The linear model, where the mean reward of an arm is the dot product of its context and an unknown parameter, is versatile and popular (Dani et al., 2008; Rusmevichientong and Tsitsiklis, 2010; Abbasi-Yadkori et al., 2011; Agrawal and Goyal, 2013), and we consider it in this work.
There are two common approaches to using linear models in contextual bandits. One maintains a separate parameter per arm (Section 3.1 of Li et al. (2010)). While this approach can learn complex models, it is not very statistically efficient because the arm parameters are not shared. This matters especially when each arm is pulled a different number of times. The other approach maintains a single parameter shared by all arms. While this approach can be statistically efficient, it is more rigid and likely to fail due to model misspecification, when the optimal arm under the assumed model is not the actual optimal arm.
To address the above issues, we propose a new contextual linear model. This model assumes that the mean reward of an arm is a dot product of its context and an unknown shared parameter, which is offset by an arm-specific variable. This approach is statistically efficient because the model parameter is shared by all arms, yet flexible because the arm-specific variables can address model misspecification. We call this setting a robust contextual linear bandit. To provide an efficient solution to the problem, we assume that the arm-specific variables are random and drawn from a distribution known by the agent. This allows us to develop a joint estimator of the shared parameter and the arm-specific variables, which interpolates between the two and also uses the context.
One motivating example for our setting is recommender systems, where the features of an item cannot explain all information about the item, such as its intrinsic popularity (Koren et al., 2009). This is why so-called behavioral features, features that summarize the past engagement with the item, exist. The intrinsic popularity can be viewed as the average engagement, click, or purchase rate in the absence of any other information. The item features then offset the engagement, either up or down, depending on their affinity. For instance, a feature representing the position of the item in the recommended list would have a negative weight, meaning that lower-ranked items are less likely to be clicked, no matter how engaging they are.
We make the following contributions. First, we propose robust contextual linear bandits, where the model misspecification can be learned using arm-specific variables (Section 3). Under the assumption that the variables are random, both the Bayesian and random-effect viewpoints can be used to derive efficient joint estimators of the shared model parameter and arm-specific variables. We derive the estimators in Section 4, and show how to incorporate them in the estimate of the mean arm reward and its uncertainty. Second, we propose upper confidence bound (UCB) and Thompson sampling (TS) algorithms for this problem, RoLinUCB and RoLinTS (Section 5). Both algorithms are computationally efficient and robust to model misspecification. We analyze both algorithms and derive upper bounds on their n-round Bayes regret (Section 6). Our proofs rely on analyzing an equivalent linear bandit, and the resulting regret bounds improve in constants due to the special covariance structure of the learned parameters. Our algorithms are also significantly more computationally efficient than naive implementations, which take $O((d + K)^3)$ time for $d$ dimensions and $K$ arms, instead of our $O(d^2(d + K))$. Finally, we evaluate RoLinTS on both synthetic and real-world problems. We observe that RoLinTS is comparably statistically efficient to the classic methods when the misspecification is low, more robust when the misspecification is high, and significantly more computationally efficient than its naive implementation.
2 Related Work
Our model is related to a hybrid linear model (Section 3.2 of Li et al. (2010)) with shared and arm-specific parameters. Unlike the hybrid linear model, where the coefficients of some features are shared by all arms while the others are not, we introduce arm-specific random variables to capture the model misspecification. We further study the impact of this structure on regret and propose an especially efficient implementation for this setting. Another related work is hLinUCB of Wang et al. (2016). hLinUCB is a variant of LinUCB that learns a portion of the feature vector, and we compare to it in Section 7.
Due to our focus on robustness, our work is related to misspecified linear bandits. Ghosh et al. (2017) proposed an algorithm that switches from a linear to a multi-armed bandit algorithm when the linear model is detected to be misspecified. Differently from this work, we adapt to the misspecification. We do not compare to this algorithm because it is non-contextual, and thus would have a linear regret in our setting. Foster et al. (2020) and Krishnamurthy et al. (2021) proposed oracle-efficient algorithms that reduce contextual bandits to online regression and are robust to misspecification. Since Safe-FALCON of Krishnamurthy et al. (2021) is an improvement upon Foster et al. (2020), we discuss it in more detail. Safe-FALCON is more general than our approach because it does not make any distributional assumptions. On the other hand, it is very conservative in our setting because of inverse gap weighting. We compare to it in Section 7. Finally, Bogunovic et al. (2021) and Ding et al. (2022) proposed linear bandit algorithms that are robust to adversarial noise attacks. The notion of robustness in these works is very different from ours.
Our work is also related to random-effect bandits (Zhu and Kveton, 2022). As in Zhu and Kveton (2022), we assume that each arm is associated with a random variable that can help with explaining its unknown mean reward. Zhu and Kveton (2022) used this structure to design a bandit algorithm that is comparably efficient to TS without knowing the prior. Their algorithm is a UCB algorithm and is not contextual. A similar idea was explored by Wan et al. (2022) and applied to structured bandits. This work is also non-contextual. Wan et al. (2021) assumed a hierarchical structure over tasks and modeled inter-task heterogeneity. We focus on a single task and model inter-arm heterogeneity. Our work is also related to recent papers on hierarchical Bayesian bandits (Kveton et al., 2021; Basu et al., 2021; Hong et al., 2022). All of these papers considered a similar graphical model to Wan et al. (2021) and therefore model inter-task heterogeneity.
3 Robust Contextual Linear Bandits
We adopt the following notation. For any positive integer $n$, we denote by $[n]$ the set $\{1, \ldots, n\}$. We let $\mathbb{1}\{\cdot\}$ be the indicator function. For any matrix $M \in \mathbb{R}^{d \times d}$, the maximum eigenvalue is $\lambda_1(M)$ and the minimum is $\lambda_d(M)$. The big-O notation up to logarithmic factors is $\tilde{O}$.
We consider a contextual bandit (Li et al., 2010), where the relationship between the mean reward of an arm and its context is represented by a model. In round $t \in [n]$, an agent pulls one of $K$ arms with feature vectors $x_{i,t} \in \mathbb{R}^d$ for $i \in [K]$. The vector $x_{i,t}$ summarizes information specific to arm $i$ in round $t$ and we call it a context. Compared to context-free bandits (Lai and Robbins, 1985; Auer et al., 2002; Agrawal and Goyal, 2012), contextual bandits have more practical applications because they model the reward as a function of context. For instance, in online advertising, the arms would be different ads, the context would be user features, and the contextual bandit agent would pull arms according to the user features (Li et al., 2010; Agrawal and Goyal, 2013). More formally, in round $t$, the agent pulls arm $I_t \in [K]$ based on the context and rewards in past rounds, and receives the reward of arm $I_t$, $r_{I_t,t}$, whose mean depends on the context $x_{I_t,t}$. Since the number of contexts is large, the agent assumes some generalization model, such as that the mean reward is linear in $x_{i,t}$ and some unknown parameter. When this model is incorrectly specified, the contextual bandit algorithm may perform poorly.
To improve the robustness of contextual linear bandits to misspecification, we introduce a novel modeling assumption. Specifically, the reward $r_{i,t}$ of arm $i$ in round $t$ is generated as
$$r_{i,t} = \mu_{i,t} + \varepsilon_{i,t}\,, \tag{1}$$
$$\mu_{i,t} = x_{i,t}^\top \theta + v_i\,, \tag{2}$$
$$\theta \sim P_\theta(0, \lambda^{-1} I_d)\,, \tag{3}$$
$$v_i \sim P_v(0, \sigma_0^2)\,, \tag{4}$$
$$\varepsilon_{i,t} \sim P_\varepsilon(0, \sigma^2)\,. \tag{5}$$
Here $\mu_{i,t}$ and $\varepsilon_{i,t}$ are the mean reward and reward noise, respectively, of arm $i$ in round $t$. The mean reward $\mu_{i,t}$ has two terms: a linear function of context $x_{i,t}$ and parameter $\theta \in \mathbb{R}^d$ shared by all arms, and the inter-arm heterogeneity $v_i \in \mathbb{R}$, which is an unobserved arm-specific random variable. The distributions of $\theta$, $v_i$, and $\varepsilon_{i,t}$ are denoted by $P_\theta$, $P_v$, and $P_\varepsilon$; and their hyper-parameters are $\lambda$, $\sigma_0^2$, and $\sigma^2$. We call our model a robust contextual linear bandit because $v_i$ makes it robust to the misspecification due to context. Our model can be viewed as an instance of unobserved-effect models commonly used in panel and longitudinal data analyses (Wooldridge, 2001; Diggle et al., 2002). For brevity, and since we only study linear models, we often call our model a robust contextual bandit.
Our goal is to design an algorithm that minimizes its regret with respect to the optimal arm-selection strategy. The $n$-round Bayes regret $R(n)$ of an agent is defined as
$$R(n) = \mathbb{E}\left[\sum_{t=1}^{n} \mu_{I_t^*,t} - \mu_{I_t,t}\right]\,, \tag{6}$$
where $I_t^*$ is the arm with the highest mean reward in round $t$ and $I_t$ is the pulled arm in round $t$. The expectation is over the randomness of $I_t$ and $I_t^*$, and that of $\theta$, $v_i$, and $\varepsilon_{i,t}$.
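To make the setting concrete, the following is a minimal simulation sketch of the reward model (1)-(5) and the realized regret inside the expectation in (6). It is our own illustration, not the authors' code: the Gaussian choices for $P_\theta$, $P_v$, and $P_\varepsilon$ and the `agent.pull`/`agent.update` interface are assumptions.

```python
import numpy as np

def simulate(agent, n=1000, K=10, d=5, lam=1.0, sigma0=0.5, sigma=1.0, seed=0):
    """Simulate the robust contextual bandit (1)-(5); return realized n-round regret."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, np.sqrt(1.0 / lam), d)  # theta ~ P_theta(0, lam^{-1} I_d), eq. (3)
    v = rng.normal(0.0, sigma0, K)                  # v_i ~ P_v(0, sigma_0^2), eq. (4)
    regret = 0.0
    for t in range(n):
        X = rng.normal(size=(K, d))                 # contexts x_{i,t}, one row per arm
        mu = X @ theta + v                          # mean rewards, eq. (2)
        i = agent.pull(X)                           # hypothetical agent interface
        r = mu[i] + rng.normal(0.0, sigma)          # reward, eqs. (1) and (5)
        agent.update(i, X[i], r)
        regret += mu.max() - mu[i]                  # per-round regret inside eq. (6)
    return regret
```

Averaging the returned value over random seeds approximates the Bayes regret, since the expectation in (6) is also over $\theta$ and $v_i$.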
3.1 Discussion
We introduce an unobserved effect $v_i$, which can be interpreted as capturing the characteristics of arm $i$ that are not explained by context, but are assumed not to change over the $n$ rounds. We call it the inter-arm heterogeneity. For example, in online advertising, the arms would be different ads and the context would be user features. In this problem, $v_i$ may contain unobserved ad characteristics, such as its intrinsic quality, that can be viewed as roughly constant.
We assume that the parameter $\theta$ is shared by all arms, while the inter-arm heterogeneity is modeled by arm-specific variables. From the statistical-efficiency viewpoint, this model reduces the number of parameters compared to modeling arms separately, and therefore increases statistical efficiency. Li et al. (2010) proposed hybrid linear models, where the coefficients of some features are shared by all arms while the others are arm-specific. However, choosing the features to share may be challenging in practice. From the practical viewpoint, it is more convenient to apply the robust contextual bandit, as it avoids this challenging choice of shared features. At the same time, the model is still flexible enough, because it uses the unobserved effect $v_i$ to capture inter-arm heterogeneity, information not explained by the context. For instance, imagine a contextual recommendation problem with $K$ arms, where the arms represent items. In addition to what the item and user features can explain, there may still be item-specific biases (Koren et al., 2009).
4 Estimation
This section introduces our estimators for robust contextual bandits. In Section 4.1, we derive the estimators of $\theta$ and $v_i$ for $i \in [K]$. In Section 4.2, we derive the estimator of $\mu_{i,t} = x_{i,t}^\top \theta + v_i$ and its uncertainty.
4.1 Maximum a Posteriori Estimation of $\theta$ and $v_i$
Fix round $t$. Let $T_{i,t}$ be the set of rounds where arm $i$ is pulled by the beginning of round $t$ and $n_{i,t} = |T_{i,t}|$ be the size of $T_{i,t}$. Let $r_{i,t} = (r_{i,\ell})_{\ell \in T_{i,t}}^\top$ be the column vector of rewards obtained by pulling arm $i$, $\varepsilon_{i,t} = (\varepsilon_{i,\ell})_{\ell \in T_{i,t}}^\top$ be the column vector of the corresponding reward noise, and $X_{i,t} = (x_{i,\ell})_{\ell \in T_{i,t}}^\top$ be an $n_{i,t} \times d$ matrix with the corresponding contexts. From (1) and (2),
$$r_{i,t} = X_{i,t} \theta + v_i 1_{n_{i,t}} + \varepsilon_{i,t}\,,$$
where $1_k$ is a vector of length $k$ whose entries are all one. The covariance matrix $V_{i,t}$ of the vector $r_{i,t}$ is given by $V_{i,t} = \sigma_0^2 1_{n_{i,t}} 1_{n_{i,t}}^\top + \sigma^2 I_{n_{i,t}}$, where $I_k$ is the identity matrix of size $k \times k$. The terms $\sigma_0^2 1_{n_{i,t}} 1_{n_{i,t}}^\top$ and $\sigma^2 I_{n_{i,t}}$ represent the randomness from $v_i$ and $\varepsilon_{i,t}$, respectively. By the Woodbury matrix identity,
$$V_{i,t}^{-1} = \sigma^{-2} I_{n_{i,t}} - \sigma^{-2} n_{i,t}^{-1} w_{i,t} 1_{n_{i,t}} 1_{n_{i,t}}^\top\,, \tag{7}$$
where $w_{i,t}$ is the weight defined in (10) below.
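As a quick numerical sanity check of (7) (our own, not from the paper; `w` is the weight $w_{i,t}$ from (10)):

```python
import numpy as np

n, sigma0, sigma = 7, 0.5, 1.3
V = sigma0**2 * np.ones((n, n)) + sigma**2 * np.eye(n)      # V_{i,t}
w = sigma0**2 / (sigma0**2 + sigma**2 / n)                  # w_{i,t} from (10)
V_inv = (np.eye(n) - (w / n) * np.ones((n, n))) / sigma**2  # right-hand side of (7)
assert np.allclose(np.linalg.inv(V), V_inv)
```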
Assuming that $P_\theta$, $P_v$, and $P_\varepsilon$ are Gaussian, maximum a posteriori (MAP) estimation is equivalent to minimizing the loss function
$$L(v_1, \cdots, v_K, \theta) = \sum_{i=1}^{K} \left(\sigma^{-2} \|r_{i,t} - X_{i,t}\theta - v_i 1_{n_{i,t}}\|^2 + \sigma_0^{-2} v_i^2\right) + \lambda \|\theta\|^2 \tag{8}$$
with respect to $(v_i)_{i \in [K]}$ and $\theta$, where $\|\cdot\|$ is the Euclidean norm. The term $\sigma^{-2} \sum_{i=1}^{K} \|r_{i,t} - X_{i,t}\theta - v_i 1_{n_{i,t}}\|^2$ is from the conditional likelihood of $r_{i,t}$ given $(v_i)_{i \in [K]}$ and $\theta$. The regularization term $\sigma_0^{-2} \sum_{i=1}^{K} v_i^2$ is from the prior of $(v_i)_{i \in [K]}$ in (4). The other term $\lambda \|\theta\|^2$ is from the prior of $\theta$ in (3).
Differentiating $L(v_1, \cdots, v_K, \theta)$ with respect to $v_i$ and setting the derivative to zero, $v_i$ is estimated by
$$\tilde{v}_{i,t} = w_{i,t}(\bar{r}_{i,t} - \bar{x}_{i,t}^\top \theta)\,, \tag{9}$$
where $\bar{r}_{i,t} = n_{i,t}^{-1} \sum_{\ell \in T_{i,t}} r_{i,\ell}$ is the average reward of arm $i$ up to round $t$, $\bar{x}_{i,t} = n_{i,t}^{-1} \sum_{\ell \in T_{i,t}} x_{i,\ell}$ is the average context associated with the pulls of arm $i$ up to round $t$, and
$$w_{i,t} = \frac{\sigma_0^2}{\sigma_0^2 + \sigma^2 / n_{i,t}} \tag{10}$$
is a weight that interpolates between the context and the arm-specific variable. We discuss its role in Section 4.2.
Let $u_{i,t} = r_{i,t} - X_{i,t}\theta$ and $\bar{u}_{i,t} = n_{i,t}^{-1} \sum_{\ell \in T_{i,t}} u_{i,\ell}$, where $u_{i,\ell}$ is the $\ell$-th element of $u_{i,t}$. Inserting (9) into (8), it follows that
$$\begin{aligned}
L(\theta) &= \sum_{i=1}^{K} \left(\sigma^{-2} \|u_{i,t} - w_{i,t} \bar{u}_{i,t} 1_{n_{i,t}}\|^2 + \sigma_0^{-2} w_{i,t}^2 \bar{u}_{i,t}^2\right) + \lambda \|\theta\|^2 \\
&= \sigma^{-2} \sum_{i=1}^{K} \left(\|u_{i,t}\|^2 - n_{i,t} w_{i,t} \bar{u}_{i,t}^2\right) + \lambda \|\theta\|^2 \\
&= \sum_{i=1}^{K} (r_{i,t} - X_{i,t}\theta)^\top V_{i,t}^{-1} (r_{i,t} - X_{i,t}\theta) + \lambda \|\theta\|^2\,,
\end{aligned}$$
where the last step is from (7).
To obtain the MAP estimate of $\theta$, we minimize $L(\theta)$ with respect to $\theta$ and get
$$\hat{\theta}_t = \left(\lambda I_d + \sum_{i=1}^{K} X_{i,t}^\top V_{i,t}^{-1} X_{i,t}\right)^{-1} \sum_{i=1}^{K} X_{i,t}^\top V_{i,t}^{-1} r_{i,t}\,. \tag{11}$$
To obtain the MAP estimate of $v_i$, we insert $\hat{\theta}_t$ into (9),
$$\hat{v}_{i,t} = w_{i,t}(\bar{r}_{i,t} - \bar{x}_{i,t}^\top \hat{\theta}_t)\,. \tag{12}$$
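The estimators (11) and (12) can be computed without ever forming the $n_{i,t} \times n_{i,t}$ matrices $V_{i,t}^{-1}$, since expanding (7) gives $X_{i,t}^\top V_{i,t}^{-1} X_{i,t} = \sigma^{-2}(X_{i,t}^\top X_{i,t} - n_{i,t} w_{i,t} \bar{x}_{i,t} \bar{x}_{i,t}^\top)$ and $X_{i,t}^\top V_{i,t}^{-1} r_{i,t} = \sigma^{-2}(X_{i,t}^\top r_{i,t} - n_{i,t} w_{i,t} \bar{x}_{i,t} \bar{r}_{i,t})$. The sketch below is a minimal rendering of this computation under the Gaussian assumptions, not the authors' implementation; the per-arm history arrays `X[i]` and `r[i]` are assumed inputs.

```python
import numpy as np

def map_estimates(X, r, lam, sigma0, sigma, d):
    """MAP estimates (11)-(12). X[i]: n_i x d context matrix; r[i]: n_i rewards."""
    A = lam * np.eye(d)   # accumulates lam*I_d + sum_i X_i^T V_i^{-1} X_i, i.e., M_t
    b = np.zeros(d)       # accumulates sum_i X_i^T V_i^{-1} r_i
    stats = []
    for Xi, ri in zip(X, r):
        ni = len(ri)
        if ni == 0:
            stats.append((0.0, np.zeros(d), 0.0))   # unpulled arm: w_{i,t} = 0
            continue
        w = sigma0**2 / (sigma0**2 + sigma**2 / ni)  # weight (10)
        xbar, rbar = Xi.mean(axis=0), ri.mean()
        A += (Xi.T @ Xi - ni * w * np.outer(xbar, xbar)) / sigma**2
        b += (Xi.T @ ri - ni * w * xbar * rbar) / sigma**2
        stats.append((w, xbar, rbar))
    theta_hat = np.linalg.solve(A, b)                # (11)
    v_hat = np.array([w * (rbar - xbar @ theta_hat)  # (12)
                      for w, xbar, rbar in stats])
    return theta_hat, v_hat, A, stats
```

Because only $d \times d$ statistics are kept per arm, the cost is dominated by the $O(d^3)$ solve plus $O(K d^2)$ updates, consistent with the $O(d^2(d + K))$ complexity claimed in Section 1.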
4.2 Prediction of $\mu_{i,t}$ and Its Uncertainty
Based on (11) and (12), the estimated mean reward of arm $i$ in context $x_{i,t}$ in round $t$ is
$$\hat{\mu}_{i,t} = x_{i,t}^\top \hat{\theta}_t + w_{i,t}(\bar{r}_{i,t} - \bar{x}_{i,t}^\top \hat{\theta}_t)\,. \tag{13}$$
In (2), $v_i$ represents the inter-arm heterogeneity. It is the arm-specific effect that cannot be explained by context. This effect is estimated in (13) by $w_{i,t}(\bar{r}_{i,t} - \bar{x}_{i,t}^\top \hat{\theta}_t)$.

Now consider $w_{i,t}$ in (10). If the arm has not been pulled, $n_{i,t} = 0$ and $w_{i,t} = 0$; therefore $\hat{\mu}_{i,t} = x_{i,t}^\top \hat{\theta}_t$. Similarly, if the arm has not been pulled often, $\hat{\mu}_{i,t}$ is close to $x_{i,t}^\top \hat{\theta}_t$. This means that the prediction $\hat{\mu}_{i,t}$ is statistically efficient for small $n_{i,t}$. This is helpful in the initial rounds, when there are only a few observations of the arms.
We further explain (13) by rewriting it as
$$\hat{\mu}_{i,t} = w_{i,t} \bar{r}_{i,t} + (x_{i,t} - w_{i,t} \bar{x}_{i,t})^\top \hat{\theta}_t\,. \tag{14}$$
Here $\hat{\mu}_{i,t}$ is a weighted estimator with two terms: the sample mean of arm $i$, $\bar{r}_{i,t}$, and an additional calibration from contexts, $(x_{i,t} - w_{i,t} \bar{x}_{i,t})^\top \hat{\theta}_t$. The weight is $w_{i,t}$. When $n_{i,t} \to \infty$, $w_{i,t} \to 1$ and $\hat{\mu}_{i,t} \to \bar{r}_{i,t} + (x_{i,t} - \bar{x}_{i,t})^\top \hat{\theta}_t$. This shows why our prediction $\hat{\mu}_{i,t}$ is robust. Informally, it uses $\bar{r}_{i,t}$ as a baseline and corrects it using context as $(x_{i,t} - \bar{x}_{i,t})^\top \hat{\theta}_t$. Therefore, we reduce the reliance on the contextual model by automatically balancing the contextual and multi-armed bandits. This is why we call our framework a robust contextual bandit.
The prediction $\hat{\mu}_{i,t}$ can also degenerate to that of a contextual linear bandit (Rusmevichientong and Tsitsiklis, 2010; Agrawal and Goyal, 2013). More specifically, $w_{i,t} \to 0$ as $\sigma_0^2 \to 0$, and then $\hat{\mu}_{i,t}$ approaches
$$\hat{\mu}^{\mathrm{lin}}_{i,t} = x_{i,t}^\top \left(\lambda I_d + \sum_{i=1}^{K} X_{i,t}^\top X_{i,t}\right)^{-1} \sum_{i=1}^{K} X_{i,t}^\top r_{i,t}\,,$$
which is the prediction of a simple linear model without inter-arm heterogeneity. This observation is important because it shows that our framework is as general as the contextual linear bandit.
Now we characterize the uncertainty of $\hat{\mu}_{i,t}$ in (14) for efficient exploration. We measure the uncertainty by the mean squared error $\mathbb{E}[(\hat{\mu}_{i,t} - \mu_{i,t})^2]$. Let $\bar{\varepsilon}_{i,t} = n_{i,t}^{-1} \sum_{\ell \in T_{i,t}} \varepsilon_{i,\ell}$ and $M_t = \lambda I_d + \sum_{i=1}^{K} X_{i,t}^\top V_{i,t}^{-1} X_{i,t}$. A direct calculation shows that
$$\begin{aligned}
\mathbb{E}[(\hat{\mu}_{i,t} - \mu_{i,t})^2]
&= \mathbb{E}[((w_{i,t} - 1) v_i + w_{i,t} \bar{\varepsilon}_{i,t} + (x_{i,t} - w_{i,t} \bar{x}_{i,t})^\top (\hat{\theta}_t - \theta))^2] \\
&= \sigma_0^2 (1 - w_{i,t}) + (x_{i,t} - w_{i,t} \bar{x}_{i,t})^\top M_t^{-1} (x_{i,t} - w_{i,t} \bar{x}_{i,t}) \\
&= \tau_{i,t}^2\,, \tag{15}
\end{aligned}$$
where the last step is from $\mathbb{E}[(w_{i,t} v_i - v_i + w_{i,t} \bar{\varepsilon}_{i,t})(\hat{\theta}_t - \theta)^\top (x_{i,t} - w_{i,t} \bar{x}_{i,t})] = 0$, shown in Kackar and Harville (1984).
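Continuing the sketch from Section 4.1 (same assumptions, hypothetical helper names), (14) and (15) translate to a few lines; `M` is the matrix $M_t$, which equals the accumulator `A` returned by `map_estimates`:

```python
import numpy as np

def predict(x, w, xbar, rbar, theta_hat, M, sigma0):
    """Mean-reward prediction (14) and its uncertainty (15) for one arm."""
    z = x - w * xbar                               # x_{i,t} - w_{i,t} * xbar_{i,t}
    mu_hat = w * rbar + z @ theta_hat              # (14)
    tau2 = sigma0**2 * (1.0 - w) + z @ np.linalg.solve(M, z)  # (15)
    return mu_hat, tau2
```

An optimistic index such as $\hat{\mu}_{i,t} + c\,\tau_{i,t}$, or a posterior sample $\hat{\mu}_{i,t} + \tau_{i,t} Z$ with $Z \sim \mathcal{N}(0,1)$, is the natural way to turn (14) and (15) into UCB- and TS-style arm-selection rules of the kind developed in Section 5; the exact constants are given there.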
4.3 Computational Efficiency
We also investigate whether the robust contextual bandit can be implemented as computationally efficiently as a contextual linear bandit.