Robust Contextual Linear Bandits
Rong Zhu
Institute of Science and Technology for Brain-Inspired Intelligence
Fudan University

Branislav Kveton
Amazon
Abstract
Model misspecification is a major consideration in applications of statistical methods and machine learning. However, it is often neglected in contextual bandits. This paper studies a common form of misspecification, an inter-arm heterogeneity that is not captured by context. To address this issue, we assume that the heterogeneity arises due to arm-specific random variables, which can be learned. We call this setting a robust contextual bandit. The arm-specific variables explain the unknown inter-arm heterogeneity, and we incorporate them in the robust contextual estimator of the mean reward and its uncertainty. We develop two efficient bandit algorithms for our setting: a UCB algorithm called RoLinUCB and a posterior-sampling algorithm called RoLinTS. We analyze both algorithms and bound their n-round Bayes regret. Our experiments show that RoLinTS is comparably statistically efficient to the classic methods when the misspecification is low, more robust when the misspecification is high, and significantly more computationally efficient than its naive implementation.
1 Introduction
A stochastic contextual bandit (Auer et al., 2002; Li et al., 2010; Lattimore and Szepesvari, 2019) is an online learning problem where a learning agent sequentially interacts with an environment over $n$ rounds. In each round, the agent observes context, pulls an arm conditioned on the context, and receives a corresponding stochastic reward. Contextual bandits have many applications in practice, such as personalized recommendations (Li et al., 2010; Jeunen and Goethals, 2021). Because the mean rewards of the arms are tied together through known context and learned model parameters, the contextual approach can be more statistically efficient than a naive multi-armed bandit solution (Auer et al., 2002; Agrawal and Goyal, 2012). The linear model, where the mean reward of an arm is the dot product of its context and an unknown parameter, is versatile and popular (Dani et al., 2008; Rusmevichientong and Tsitsiklis, 2010; Abbasi-Yadkori et al., 2011; Agrawal and Goyal, 2013), and we consider it in this work.
There are two common approaches to using linear models in contextual bandits. One maintains a separate parameter per arm (Section 3.1 of Li et al. (2010)). While this approach can learn complex models, it is not very statistically efficient because the arm parameters are not shared. This matters especially when each arm is pulled a different number of times. The other approach maintains a single parameter shared by all arms. While this approach can be statistically efficient, it is more rigid and likely to fail due to model misspecification, when the optimal arm under the assumed model is not the actual optimal arm.
To address the above issues, we propose a new contextual linear model. This model assumes that the mean reward of an arm is a dot product of its context and an unknown shared parameter, which is offset by an arm-specific variable. This approach is statistically efficient because the model parameter is shared by all arms, yet flexible because the arm-specific variables can address model misspecification. We call this setting a robust contextual linear bandit. To provide an efficient solution to the problem, we assume that the arm-specific variables are random and drawn from a distribution known by the agent. This allows us to develop a joint estimator of the shared parameter and the arm-specific variables, which interpolates between the two and also uses the context.
One motivating example for our setting is recommender systems, where the features of an item cannot explain all information about the item, such as its intrinsic popularity (Koren et al., 2009). This is why so-called behavioral features, features that summarize the past engagement with the item, exist. The intrinsic popularity can be viewed as the average engagement, click, or purchase rate in the absence of any other information. The item features then offset the engagement, either up or down, depending on their affinity. For instance, a feature representing the position of the item in the recommended list would have a negative weight, meaning that lower-ranked items are less likely to be clicked, no matter how engaging they are.
We make the following contributions. First, we propose robust contextual linear bandits, where the model misspecification can be learned using arm-specific variables (Section 3). Under the assumption that the variables are random, both the Bayesian and random-effect viewpoints can be used to derive efficient joint estimators of the shared model parameter and arm-specific variables. We derive the estimators in Section 4, and show how to incorporate them in the estimate of the mean arm reward and its uncertainty. Second, we propose upper confidence bound (UCB) and Thompson sampling (TS) algorithms for this problem, RoLinUCB and RoLinTS (Section 5). Both algorithms are computationally efficient and robust to model misspecification. We analyze both algorithms and derive upper bounds on their n-round Bayes regret (Section 6). Our proofs rely on analyzing an equivalent linear bandit, and the resulting regret bounds improve in constants due to the special covariance structure of the learned parameters. Our algorithms are also significantly more computationally efficient than naive implementations, which take $O((d + K)^3)$ time for $d$ dimensions and $K$ arms, instead of our $O(d^2(d + K))$. Finally, we evaluate RoLinTS on both synthetic and real-world problems. We observe that RoLinTS is comparably statistically efficient to the classic methods when the misspecification is low, more robust when the misspecification is high, and significantly more computationally efficient than its naive implementation.
2 Related Work
Our model is related to a hybrid linear model (Section 3.2 of Li et al. (2010)) with shared and arm-specific parameters. Unlike the hybrid linear model, where the coefficients of some features are shared by all arms while the others are not, we introduce arm-specific random variables to capture the model misspecification. We further study the impact of this structure on regret and propose an especially efficient implementation for this setting. Another related work is hLinUCB of Wang et al. (2016). hLinUCB is a variant of LinUCB that learns a portion of the feature vector, and we compare to it in Section 7.
Due to our focus on robustness, our work is related to misspecified linear bandits. Ghosh et al. (2017) proposed an algorithm that switches from a linear to a multi-armed bandit algorithm when the linear model is detected to be misspecified. Differently from this work, we adapt to the misspecification. We do not compare to this algorithm because it is non-contextual, and thus would have a linear regret in our setting. Foster et al. (2020) and Krishnamurthy et al. (2021) proposed oracle-efficient algorithms that reduce contextual bandits to online regression and are robust to misspecification. Since Safe-FALCON of Krishnamurthy et al. (2021) is an improvement upon Foster et al. (2020), we discuss it in more detail. Safe-FALCON is more general than our approach because it does not make any distributional assumptions. On the other hand, it is very conservative in our setting because of inverse gap weighting. We compare to it in Section 7. Finally, Bogunovic et al. (2021) and Ding et al. (2022) proposed linear bandit algorithms that are robust to adversarial noise attacks. The notion of robustness in these works is very different from ours.
Our work is also related to random-effect bandits (Zhu and Kveton, 2022). As in Zhu and Kveton (2022), we assume that each arm is associated with a random variable that can help with explaining its unknown mean reward. Zhu and Kveton (2022) used this structure to design a bandit algorithm that is comparably efficient to TS without knowing the prior. Their algorithm is a UCB algorithm and is not contextual. A similar idea was explored by Wan et al. (2022) and applied to structured bandits. This work is also non-contextual. Wan et al. (2021) assumed a hierarchical structure over tasks and modeled inter-task heterogeneity. We focus on a single task and model inter-arm heterogeneity. Our work is also related to recent papers on hierarchical Bayesian bandits (Kveton et al., 2021; Basu et al., 2021; Hong et al., 2022). All of these papers considered a similar graphical model to Wan et al. (2021) and therefore model inter-task heterogeneity.
3 Robust Contextual Linear Bandits
We adopt the following notation. For any positive integer $n$, we denote by $[n]$ the set $\{1, \ldots, n\}$. We let $\mathbb{1}\{\cdot\}$ be the indicator function. For any matrix $M \in \mathbb{R}^{d \times d}$, the maximum eigenvalue is $\lambda_1(M)$ and the minimum is $\lambda_d(M)$. The big-O notation up to logarithmic factors is $\tilde{O}$.
We consider a contextual bandit (Li et al., 2010), where the relationship between the mean reward of an arm and its context is represented by a model. In round $t \in [n]$, an agent pulls one of $K$ arms with feature vectors $x_{i,t} \in \mathbb{R}^d$ for $i \in [K]$. The vector $x_{i,t}$ summarizes information specific to arm $i$ in round $t$ and we call it a context. Compared to context-free bandits (Lai and Robbins, 1985; Auer et al., 2002; Agrawal and Goyal, 2012), contextual bandits have more practical applications because they model the reward as a function of context. For instance, in online advertising, the arms would be different ads, the context would be user features, and the contextual bandit agent would pull arms according to the user features (Li et al., 2010; Agrawal and Goyal, 2013). More formally, in round $t$, the agent pulls arm $I_t \in [K]$ based on the context and rewards in past rounds, and receives the reward of arm $I_t$, $r_{I_t,t}$, whose mean depends on the context $x_{I_t,t}$. Since the number of contexts is large, the agent assumes some generalization model, such as that the mean reward is linear in $x_{i,t}$ and some unknown parameter. When this model is incorrectly specified, the contextual bandit algorithm may perform poorly.
To improve the robustness of contextual linear bandits to misspecification, we introduce a novel modeling assumption. Specifically, the reward $r_{i,t}$ of arm $i$ in round $t$ is generated as
$$r_{i,t} = \mu_{i,t} + \varepsilon_{i,t}\,, \tag{1}$$
$$\mu_{i,t} = x_{i,t}^\top \theta + v_i\,, \tag{2}$$
$$\theta \sim P_\theta(0, \lambda^{-1} I_d)\,, \tag{3}$$
$$v_i \sim P_v(0, \sigma_0^2)\,, \tag{4}$$
$$\varepsilon_{i,t} \sim P_\varepsilon(0, \sigma^2)\,. \tag{5}$$
Here $\mu_{i,t}$ and $\varepsilon_{i,t}$ are the mean reward and reward noise, respectively, of arm $i$ in round $t$. The mean reward $\mu_{i,t}$ has two terms: a linear function of context $x_{i,t}$ and parameter $\theta \in \mathbb{R}^d$ shared by all arms, and the inter-arm heterogeneity $v_i \in \mathbb{R}$, which is an unobserved arm-specific random variable. The distributions of $\theta$, $v_i$, and $\varepsilon_{i,t}$ are denoted by $P_\theta$, $P_v$, and $P_\varepsilon$; and their hyper-parameters are $\lambda$, $\sigma_0^2$, and $\sigma^2$. We call our model a robust contextual linear bandit because $v_i$ makes it robust to the misspecification due to context. Our model can be viewed as an instance of unobserved-effect models commonly used in panel and longitudinal data analyses (Wooldridge, 2001; Diggle et al., 2002). For brevity, and since we only study linear models, we often call our model a robust contextual bandit.
Our goal is to design an algorithm that minimizes its regret with respect to the optimal arm-selection strategy. The $n$-round Bayes regret $R(n)$ of an agent is defined as
$$R(n) = \mathbb{E}\left[\sum_{t=1}^{n} \mu_{I_t^*,t} - \mu_{I_t,t}\right]\,, \tag{6}$$
where $I_t^*$ is the arm with the highest mean reward in round $t$ and $I_t$ is the pulled arm in round $t$. The expectation is over the randomness of $I_t$ and $I_t^*$, and that of $\theta$, $v_i$, and $\varepsilon_{i,t}$.
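To make the setting concrete, the following is a minimal simulation sketch of the reward model (1)-(5) and the realized regret inside the expectation in (6). It is our own illustration, not the authors' code: the Gaussian choices for $P_\theta$, $P_v$, and $P_\varepsilon$ and the `agent.pull`/`agent.update` interface are assumptions.

```python
import numpy as np

def simulate(agent, n=1000, K=10, d=5, lam=1.0, sigma0=0.5, sigma=1.0, seed=0):
    """Simulate the robust contextual bandit (1)-(5); return realized n-round regret."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, np.sqrt(1.0 / lam), d)  # theta ~ P_theta(0, lam^{-1} I_d), eq. (3)
    v = rng.normal(0.0, sigma0, K)                  # v_i ~ P_v(0, sigma_0^2), eq. (4)
    regret = 0.0
    for t in range(n):
        X = rng.normal(size=(K, d))                 # contexts x_{i,t}, one row per arm
        mu = X @ theta + v                          # mean rewards, eq. (2)
        i = agent.pull(X)                           # hypothetical agent interface
        r = mu[i] + rng.normal(0.0, sigma)          # reward, eqs. (1) and (5)
        agent.update(i, X[i], r)
        regret += mu.max() - mu[i]                  # per-round regret inside eq. (6)
    return regret
```

Averaging the returned value over random seeds approximates the Bayes regret, since the expectation in (6) is also over $\theta$ and $v_i$.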
3.1 Discussion
We introduce an unobserved effect $v_i$, which can be interpreted as capturing the characteristics of arm $i$ that are not explained by context, but are assumed not to change over the $n$ rounds. We call it the inter-arm heterogeneity. For example, in online advertising, the arms would be different ads and the context would be user features. In this problem, $v_i$ may contain unobserved ad characteristics, such as its intrinsic quality, that can be viewed as roughly constant.
We assume that the parameter $\theta$ is shared by all arms, while the inter-arm heterogeneity is modeled by arm-specific variables. From the statistical-efficiency viewpoint, this model reduces the number of parameters compared to modeling arms separately, and therefore increases statistical efficiency. Li et al. (2010) proposed hybrid linear models, where the coefficients of some features are shared by all arms while the others are arm-specific. However, choosing the features to share may be challenging in practice. From the practical viewpoint, it is more convenient to apply the robust contextual bandit, as it avoids this challenging choice of shared features. At the same time, the model is still flexible enough, because it uses the unobserved effect $v_i$ to capture inter-arm heterogeneity, information not explained by the context. For instance, imagine a contextual recommendation problem with $K$ arms, where the arms represent items. In addition to what the item and user features can explain, there may still be item-specific biases (Koren et al., 2009).
4 Estimation
This section introduces our estimators for robust contextual bandits. In Section 4.1, we derive the estimators of $\theta$ and $v_i$ for $i \in [K]$. In Section 4.2, we derive the estimator of $\mu_{i,t} = x_{i,t}^\top \theta + v_i$ and its uncertainty.
4.1 Maximum a Posteriori Estimation of $\theta$ and $v_i$
Fix round $t$. Let $T_{i,t}$ be the set of rounds where arm $i$ is pulled by the beginning of round $t$ and $n_{i,t} = |T_{i,t}|$ be the size of $T_{i,t}$. Let $r_{i,t} = (r_{i,\ell})_{\ell \in T_{i,t}}^\top$ be the column vector of rewards obtained by pulling arm $i$, $\varepsilon_{i,t} = (\varepsilon_{i,\ell})_{\ell \in T_{i,t}}^\top$ be the column vector of the corresponding reward noise, and $X_{i,t} = (x_{i,\ell})_{\ell \in T_{i,t}}^\top$ be an $n_{i,t} \times d$ matrix with the corresponding contexts. From (1) and (2),
$$r_{i,t} = X_{i,t} \theta + v_i 1_{n_{i,t}} + \varepsilon_{i,t}\,,$$
where $1_k$ is a vector of length $k$ whose entries are all one. The covariance matrix $V_{i,t}$ of the vector $r_{i,t}$ is given by $V_{i,t} = \sigma_0^2 1_{n_{i,t}} 1_{n_{i,t}}^\top + \sigma^2 I_{n_{i,t}}$, where $I_k$ is the identity matrix of size $k \times k$. The terms $\sigma_0^2 1_{n_{i,t}} 1_{n_{i,t}}^\top$ and $\sigma^2 I_{n_{i,t}}$ represent the randomness from $v_i$ and $\varepsilon_{i,t}$, respectively. By the Woodbury matrix identity,
$$V_{i,t}^{-1} = \sigma^{-2} I_{n_{i,t}} - \sigma^{-2} n_{i,t}^{-1} w_{i,t} 1_{n_{i,t}} 1_{n_{i,t}}^\top\,, \tag{7}$$
where $w_{i,t}$ is the weight defined in (10) below.
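As a quick numerical sanity check of (7) (our own, not from the paper; `w` is the weight $w_{i,t}$ from (10)):

```python
import numpy as np

n, sigma0, sigma = 7, 0.5, 1.3
V = sigma0**2 * np.ones((n, n)) + sigma**2 * np.eye(n)      # V_{i,t}
w = sigma0**2 / (sigma0**2 + sigma**2 / n)                  # w_{i,t} from (10)
V_inv = (np.eye(n) - (w / n) * np.ones((n, n))) / sigma**2  # right-hand side of (7)
assert np.allclose(np.linalg.inv(V), V_inv)
```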
Assuming that $P_\theta$, $P_v$, and $P_\varepsilon$ are Gaussian, maximum a posteriori (MAP) estimation is equivalent to minimizing the loss function
$$L(v_1, \cdots, v_K, \theta) = \sum_{i=1}^{K} \left(\sigma^{-2} \|r_{i,t} - X_{i,t}\theta - v_i 1_{n_{i,t}}\|^2 + \sigma_0^{-2} v_i^2\right) + \lambda \|\theta\|^2 \tag{8}$$
with respect to $(v_i)_{i \in [K]}$ and $\theta$, where $\|\cdot\|$ is the Euclidean norm. The term $\sigma^{-2} \sum_{i=1}^{K} \|r_{i,t} - X_{i,t}\theta - v_i 1_{n_{i,t}}\|^2$ is from the conditional likelihood of $r_{i,t}$ given $(v_i)_{i \in [K]}$ and $\theta$. The regularization term $\sigma_0^{-2} \sum_{i=1}^{K} v_i^2$ is from the prior of $(v_i)_{i \in [K]}$ in (4). The other term $\lambda \|\theta\|^2$ is from the prior of $\theta$ in (3).
Differentiating $L(v_1, \cdots, v_K, \theta)$ with respect to $v_i$ and setting the derivative to zero, $v_i$ is estimated by
$$\tilde{v}_{i,t} = w_{i,t}(\bar{r}_{i,t} - \bar{x}_{i,t}^\top \theta)\,, \tag{9}$$
where $\bar{r}_{i,t} = n_{i,t}^{-1} \sum_{\ell \in T_{i,t}} r_{i,\ell}$ is the average reward of arm $i$ up to round $t$, $\bar{x}_{i,t} = n_{i,t}^{-1} \sum_{\ell \in T_{i,t}} x_{i,\ell}$ is the average context associated with the pulls of arm $i$ up to round $t$, and
$$w_{i,t} = \frac{\sigma_0^2}{\sigma_0^2 + \sigma^2 / n_{i,t}} \tag{10}$$
is a weight that interpolates between the context and the arm-specific variable. We discuss its role in Section 4.2.
Let $u_{i,t} = r_{i,t} - X_{i,t}\theta$ and $\bar{u}_{i,t} = n_{i,t}^{-1} \sum_{\ell \in T_{i,t}} u_{i,\ell}$, where $u_{i,\ell}$ is the $\ell$-th element of $u_{i,t}$. Inserting (9) into (8), it follows that
$$\begin{aligned}
L(\theta) &= \sum_{i=1}^{K} \left(\sigma^{-2} \|u_{i,t} - w_{i,t} \bar{u}_{i,t} 1_{n_{i,t}}\|^2 + \sigma_0^{-2} w_{i,t}^2 \bar{u}_{i,t}^2\right) + \lambda \|\theta\|^2 \\
&= \sigma^{-2} \sum_{i=1}^{K} \left(\|u_{i,t}\|^2 - n_{i,t} w_{i,t} \bar{u}_{i,t}^2\right) + \lambda \|\theta\|^2 \\
&= \sum_{i=1}^{K} (r_{i,t} - X_{i,t}\theta)^\top V_{i,t}^{-1} (r_{i,t} - X_{i,t}\theta) + \lambda \|\theta\|^2\,,
\end{aligned}$$
where the last step is from (7).
To obtain the MAP estimate of $\theta$, we minimize $L(\theta)$ with respect to $\theta$ and get
$$\hat{\theta}_t = \left(\lambda I_d + \sum_{i=1}^{K} X_{i,t}^\top V_{i,t}^{-1} X_{i,t}\right)^{-1} \sum_{i=1}^{K} X_{i,t}^\top V_{i,t}^{-1} r_{i,t}\,. \tag{11}$$
To obtain the MAP estimate of $v_i$, we insert $\hat{\theta}_t$ into (9),
$$\hat{v}_{i,t} = w_{i,t}(\bar{r}_{i,t} - \bar{x}_{i,t}^\top \hat{\theta}_t)\,. \tag{12}$$
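The estimators (11) and (12) can be computed without ever forming the $n_{i,t} \times n_{i,t}$ matrices $V_{i,t}^{-1}$, since expanding (7) gives $X_{i,t}^\top V_{i,t}^{-1} X_{i,t} = \sigma^{-2}(X_{i,t}^\top X_{i,t} - n_{i,t} w_{i,t} \bar{x}_{i,t} \bar{x}_{i,t}^\top)$ and $X_{i,t}^\top V_{i,t}^{-1} r_{i,t} = \sigma^{-2}(X_{i,t}^\top r_{i,t} - n_{i,t} w_{i,t} \bar{x}_{i,t} \bar{r}_{i,t})$. The sketch below is a minimal rendering of this computation under the Gaussian assumptions, not the authors' implementation; the per-arm history arrays `X[i]` and `r[i]` are assumed inputs.

```python
import numpy as np

def map_estimates(X, r, lam, sigma0, sigma, d):
    """MAP estimates (11)-(12). X[i]: n_i x d context matrix; r[i]: n_i rewards."""
    A = lam * np.eye(d)   # accumulates lam*I_d + sum_i X_i^T V_i^{-1} X_i, i.e., M_t
    b = np.zeros(d)       # accumulates sum_i X_i^T V_i^{-1} r_i
    stats = []
    for Xi, ri in zip(X, r):
        ni = len(ri)
        if ni == 0:
            stats.append((0.0, np.zeros(d), 0.0))   # unpulled arm: w_{i,t} = 0
            continue
        w = sigma0**2 / (sigma0**2 + sigma**2 / ni)  # weight (10)
        xbar, rbar = Xi.mean(axis=0), ri.mean()
        A += (Xi.T @ Xi - ni * w * np.outer(xbar, xbar)) / sigma**2
        b += (Xi.T @ ri - ni * w * xbar * rbar) / sigma**2
        stats.append((w, xbar, rbar))
    theta_hat = np.linalg.solve(A, b)                # (11)
    v_hat = np.array([w * (rbar - xbar @ theta_hat)  # (12)
                      for w, xbar, rbar in stats])
    return theta_hat, v_hat, A, stats
```

Because only $d \times d$ statistics are kept per arm, the cost is dominated by the $O(d^3)$ solve plus $O(K d^2)$ updates, consistent with the $O(d^2(d + K))$ complexity claimed in Section 1.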
4.2 Prediction of $\mu_{i,t}$ and Its Uncertainty
Based on (11) and (12), the estimated mean reward of arm $i$ in context $x_{i,t}$ in round $t$ is
$$\hat{\mu}_{i,t} = x_{i,t}^\top \hat{\theta}_t + w_{i,t}(\bar{r}_{i,t} - \bar{x}_{i,t}^\top \hat{\theta}_t)\,. \tag{13}$$
In (2), $v_i$ represents the inter-arm heterogeneity. It is the arm-specific effect that cannot be explained by context. This effect is estimated in (13) by $w_{i,t}(\bar{r}_{i,t} - \bar{x}_{i,t}^\top \hat{\theta}_t)$.

Now consider $w_{i,t}$ in (10). If the arm has not been pulled, $n_{i,t} = 0$ and $w_{i,t} = 0$; therefore $\hat{\mu}_{i,t} = x_{i,t}^\top \hat{\theta}_t$. Similarly, if the arm has not been pulled often, $\hat{\mu}_{i,t}$ is close to $x_{i,t}^\top \hat{\theta}_t$. This means that the prediction $\hat{\mu}_{i,t}$ is statistically efficient for small $n_{i,t}$. This is helpful in the initial rounds, when there are only a few observations of the arms.
We further explain (13) by rewriting it as
$$\hat{\mu}_{i,t} = w_{i,t} \bar{r}_{i,t} + (x_{i,t} - w_{i,t} \bar{x}_{i,t})^\top \hat{\theta}_t\,. \tag{14}$$
Here $\hat{\mu}_{i,t}$ is a weighted estimator with two terms: the sample mean of arm $i$, $\bar{r}_{i,t}$, and an additional calibration from contexts, $(x_{i,t} - w_{i,t} \bar{x}_{i,t})^\top \hat{\theta}_t$. The weight is $w_{i,t}$. When $n_{i,t} \to \infty$, $w_{i,t} \to 1$ and $\hat{\mu}_{i,t} \to \bar{r}_{i,t} + (x_{i,t} - \bar{x}_{i,t})^\top \hat{\theta}_t$. This shows why our prediction $\hat{\mu}_{i,t}$ is robust. Informally, it uses $\bar{r}_{i,t}$ as a baseline and corrects it using context as $(x_{i,t} - \bar{x}_{i,t})^\top \hat{\theta}_t$. Therefore, we reduce the reliance on the contextual model by automatically balancing the contextual and multi-armed bandits. This is why we call our framework a robust contextual bandit.
The prediction $\hat{\mu}_{i,t}$ can also degenerate to that of a contextual linear bandit (Rusmevichientong and Tsitsiklis, 2010; Agrawal and Goyal, 2013). More specifically, $w_{i,t} \to 0$ as $\sigma_0^2 \to 0$, and then $\hat{\mu}_{i,t}$ approaches
$$\hat{\mu}^{\mathrm{lin}}_{i,t} = x_{i,t}^\top \left(\lambda I_d + \sum_{i=1}^{K} X_{i,t}^\top X_{i,t}\right)^{-1} \sum_{i=1}^{K} X_{i,t}^\top r_{i,t}\,,$$
which is the prediction of a simple linear model without inter-arm heterogeneity. This observation is important because it shows that our framework is as general as the contextual linear bandit.
Now we characterize the uncertainty of $\hat{\mu}_{i,t}$ in (14) for efficient exploration. We measure the uncertainty by the mean squared error $\mathbb{E}[(\hat{\mu}_{i,t} - \mu_{i,t})^2]$. Let $\bar{\varepsilon}_{i,t} = n_{i,t}^{-1} \sum_{\ell \in T_{i,t}} \varepsilon_{i,\ell}$ and $M_t = \lambda I_d + \sum_{i=1}^{K} X_{i,t}^\top V_{i,t}^{-1} X_{i,t}$. A direct calculation shows that
$$\begin{aligned}
\mathbb{E}[(\hat{\mu}_{i,t} - \mu_{i,t})^2]
&= \mathbb{E}[((w_{i,t} - 1) v_i + w_{i,t} \bar{\varepsilon}_{i,t} + (x_{i,t} - w_{i,t} \bar{x}_{i,t})^\top (\hat{\theta}_t - \theta))^2] \\
&= \sigma_0^2 (1 - w_{i,t}) + (x_{i,t} - w_{i,t} \bar{x}_{i,t})^\top M_t^{-1} (x_{i,t} - w_{i,t} \bar{x}_{i,t}) \\
&= \tau_{i,t}^2\,, \tag{15}
\end{aligned}$$
where the last step is from $\mathbb{E}[(w_{i,t} v_i - v_i + w_{i,t} \bar{\varepsilon}_{i,t})(\hat{\theta}_t - \theta)^\top (x_{i,t} - w_{i,t} \bar{x}_{i,t})] = 0$, shown in Kackar and Harville (1984).
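Continuing the sketch from Section 4.1 (same assumptions, hypothetical helper names), (14) and (15) translate to a few lines; `M` is the matrix $M_t$, which equals the accumulator `A` returned by `map_estimates`:

```python
import numpy as np

def predict(x, w, xbar, rbar, theta_hat, M, sigma0):
    """Mean-reward prediction (14) and its uncertainty (15) for one arm."""
    z = x - w * xbar                               # x_{i,t} - w_{i,t} * xbar_{i,t}
    mu_hat = w * rbar + z @ theta_hat              # (14)
    tau2 = sigma0**2 * (1.0 - w) + z @ np.linalg.solve(M, z)  # (15)
    return mu_hat, tau2
```

An optimistic index such as $\hat{\mu}_{i,t} + c\,\tau_{i,t}$, or a posterior sample $\hat{\mu}_{i,t} + \tau_{i,t} Z$ with $Z \sim \mathcal{N}(0,1)$, is the natural way to turn (14) and (15) into UCB- and TS-style arm-selection rules of the kind developed in Section 5; the exact constants are given there.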
4.3 Computational Efficiency
We also investigate whether the robust contextual bandit can be implemented as computationally efficiently as a contextual linear bandit.