
offset the engagement, either up or down, depending on
their affinity. For instance, a feature representing the po-
sition of the item in the recommended list would have a
negative weight, meaning that lower ranked items are less
likely to be clicked, no matter how engaging they are.
We make the following contributions. First, we propose
robust contextual linear bandits, where the model misspec-
ification can be learned using arm-specific variables (Sec-
tion 3). Under the assumption that the variables are ran-
dom, both the Bayesian and random-effect viewpoints can
be used to derive efficient joint estimators of the shared
model parameter and arm-specific variables. We derive
the estimators in Section 4, and show how to incorporate
them in the estimate of the mean arm reward and its uncer-
tainty. Second, we propose upper confidence bound (UCB)
and Thompson sampling (TS) algorithms for this problem,
RoLinUCB and RoLinTS (Section 5). Both algorithms are
computationally efficient and robust to model misspecifica-
tion. We analyze both algorithms and derive upper bounds
on their n-round Bayes regret (Section 6). Our proofs rely
on analyzing an equivalent linear bandit, and the resulting
regret bounds improve in constants due to the special co-
variance structure of learned parameters. Our algorithms
are also significantly more computationally efficient than
naive implementations, which take $O((d+K)^3)$ time for $d$ dimensions and $K$ arms, instead of our $O(d^2(d+K))$.
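To give intuition for where such a speedup can come from, here is a minimal sketch under our own illustrative assumption (not the exact implementation, which is described in Section 5): if the joint system over the $d$ shared coefficients and the $K$ arm-specific variables has a block structure whose $K \times K$ arm block is diagonal, it can be solved through the Schur complement of that block instead of inverting the full $(d+K) \times (d+K)$ matrix.

import numpy as np

def solve_block_diagonal(A, B, D_diag, v1, v2):
    # Solve [[A, B], [B^T, diag(D_diag)]] z = [v1, v2] via the Schur
    # complement of the diagonal block (an assumed structure for this sketch).
    # A: (d, d), B: (d, K), D_diag: (K,).
    # Cost: O(d^2 (d + K)) instead of O((d + K)^3) for a dense solve.
    BD = B / D_diag                        # B diag(D)^{-1}, O(dK)
    S = A - BD @ B.T                       # d x d Schur complement, O(d^2 K)
    z1 = np.linalg.solve(S, v1 - BD @ v2)  # O(d^3 + dK)
    z2 = (v2 - B.T @ z1) / D_diag          # O(dK)
    return z1, z2

# Sanity check against a dense solve on hypothetical sizes.
rng = np.random.default_rng(0)
d, K = 5, 50
B = rng.standard_normal((d, K))
A = B @ B.T + np.eye(d)                    # positive-definite shared block
D_diag = np.sum(B * B, axis=0) + 1.0       # positive diagonal arm block
v = rng.standard_normal(d + K)
z1, z2 = solve_block_diagonal(A, B, D_diag, v[:d], v[d:])
M = np.block([[A, B], [B.T, np.diag(D_diag)]])
assert np.allclose(np.concatenate([z1, z2]), np.linalg.solve(M, v))

The diagonal arm block here is only an assumption of this sketch; the structure actually exploited by our algorithms is given in Section 5.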
Finally, we evaluate RoLinTS on both synthetic and real-
world problems. We observe that RoLinTS is comparably
statistically efficient to the classic methods when the misspecification is low, more robust when the misspecification is
high, and significantly more computationally efficient than
its naive implementation.
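For concreteness, the snippet below sketches the kind of misspecification we target (our own illustrative assumptions; the formal model is deferred to Section 3): rewards are linear in the context up to an arm-specific random offset, which is exactly the part that a plain linear model cannot explain and that our arm-specific variables are meant to absorb.

import numpy as np

rng = np.random.default_rng(1)
d, K = 4, 10                                  # hypothetical sizes
theta = rng.standard_normal(d)                # shared model parameter (assumed)
sigma_0, sigma = 0.5, 0.1                     # assumed offset and noise scales
b = rng.normal(0.0, sigma_0, size=K)          # arm-specific random offsets

def reward(i, x):
    # Mean reward is linear in the context x up to the arm-specific offset b[i];
    # a model that uses x alone is misspecified by exactly b[i].
    return x @ theta + b[i] + rng.normal(0.0, sigma)

x = rng.standard_normal(d)                    # one shared context
print([round(reward(i, x), 2) for i in range(K)])

Under this view, estimating the offsets jointly with the shared parameter, as in Section 4, removes the bias that a purely linear fit would otherwise carry into the reward estimates.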
2 Related Work
Our model is related to a hybrid linear model (Section 3.2
in Li et al. (2010)) with shared and arm-specific parameters.
Unlike the hybrid linear model, where the coefficients of
some features are shared by all arms while the others are
not, we introduce arm-specific random variables to capture
the model misspecification. We further study the impact of
this structure on regret and propose an especially efficient
implementation for this setting. Another related work is
hLinUCB of Wang et al. (2016). hLinUCB is a variant of
LinUCB that learns a portion of the feature vector and we
compare to it in Section 7.
Due to our focus on robustness, our work is related to mis-
specified linear bandits. Ghosh et al. (2017) proposed an
algorithm that switches from a linear to a multi-armed bandit algorithm when the linear model is detected to be misspecified. Unlike this work, we adapt to misspecification. We do not compare to this algorithm because it is non-contextual, and thus it would have linear regret in our
setting. Foster et al. (2020) and Krishnamurthy et al. (2021)
proposed oracle-efficient algorithms that reduce contextual
bandits to online regression, and are robust to misspecifi-
cation. Since Safe-FALCON of Krishnamurthy et al. (2021)
is an improvement upon Foster et al. (2020), we discuss
it in more detail. Safe-FALCON is more general than our
approach because it does not make any distributional as-
sumptions. On the other hand, it is very conservative in our
setting because of inverse gap weighting. We compare to
it in Section 7. Finally, Bogunovic et al. (2021) and Ding
et al. (2022) proposed linear bandit algorithms that are ro-
bust to adversarial noise attacks. The notion of robustness
in these works is very different from ours.
Our work is also related to random-effect bandits (Zhu and Kveton, 2022). As in Zhu and Kveton (2022), we assume
that each arm is associated with a random variable that can
help with explaining its unknown mean reward. Zhu and
Kveton (2022) used this structure to design a bandit algo-
rithm that is comparably efficient to TS without knowing
the prior. Their algorithm is based on UCB and is not contextual. A simi-
lar idea was explored by Wan et al. (2022) and applied to
structured bandits. This work is also non-contextual. Wan
et al. (2021) assumed a hierarchical structure over tasks
and modeled inter-task heterogeneity. We focus on a sin-
gle task and model inter-arm heterogeneity. Our work is
also related to recent papers on hierarchical Bayesian ban-
dits (Kveton et al., 2021; Basu et al., 2021; Hong et al., 2022). All of these papers consider a graphical model similar to that of Wan et al. (2021) and therefore model inter-task heterogeneity.
3 Robust Contextual Linear Bandits
We adopt the following notation. For any positive integer
$n$, we denote by $[n]$ the set $\{1, \ldots, n\}$. We let $\mathbb{1}\{\cdot\}$ be the indicator function. For any matrix $M \in \mathbb{R}^{d \times d}$, the maximum eigenvalue is $\lambda_1(M)$ and the minimum is $\lambda_d(M)$. The big $O$ notation up to logarithmic factors is $\tilde{O}$.
We consider a contextual bandit (Li et al., 2010), where the relationship between the mean reward of an arm and its context is represented by a model. In round $t \in [n]$, an agent pulls one of $K$ arms with feature vectors $x_{i,t} \in \mathbb{R}^d$ for $i \in [K]$. The vector $x_{i,t}$ summarizes information specific to arm $i$ in round $t$ and we call it a context. Compared to context-free bandits (Lai and Robbins, 1985; Auer et al., 2002; Agrawal and Goyal, 2012), contextual bandits have more practical applications because they model the reward as a function of context. For instance, in online advertising, the arms would be different ads, the context would be user features, and the contextual bandit agent would pull arms according to user features (Li et al., 2010; Agrawal and Goyal, 2013). More formally, in round $t$, the agent pulls arm $I_t \in [K]$ based on context and rewards in past rounds; and receives the reward of arm $I_t$, $r_{I_t,t}$, whose mean reward depends on the context $x_{I_t,t}$. Since the num-