
However, one may be interested in optimizing the expectation of a function $f(r)$ instead of $r$ itself. In other words, $r_t$ may be seen as an intermediate reward that mediates between the agent's actions $a_t$ and the ultimate reward $f(r_t)$. In various fields of application, it is common to study a function of the outcome rather than the outcome itself. For instance, the conditional value at risk (CVaR), which has seen extensive use in financial portfolio optimization (Rockafellar et al. [2000], Zhu and Fukushima [2009]), considers $f(r) = r\,\mathbb{1}_{r \leq r_\alpha}$, with $r_\alpha$ the $\alpha$-quantile of $r$. The median and other quantiles may be of interest for noisy data (Even-Dar et al. [2002], Altschuler et al. [2018]).
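As a concrete illustration, the following minimal Python sketch (ours, not part of the cited works; the reward distribution is purely illustrative) applies this CVaR-style transformation to simulated rewards:

```python
import numpy as np

# Sketch of the CVaR-style transformation f(r) = r * 1{r <= r_alpha},
# where r_alpha is the (empirical) alpha-quantile of the rewards.
def cvar_transform(rewards, alpha=0.05):
    r_alpha = np.quantile(rewards, alpha)   # empirical alpha-quantile
    return rewards * (rewards <= r_alpha)   # keep only the lower tail

rng = np.random.default_rng(0)
r = rng.normal(loc=1.0, scale=2.0, size=10_000)  # illustrative reward draws
# E[f(r)] / alpha recovers the conditional value at risk E[r | r <= r_alpha].
print(cvar_transform(r, alpha=0.05).mean() / 0.05)
```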
In more general settings, the function $f$ may itself be unknown. Noisy data of the form
$$\{(r_i,\, f(r_i) + \xi_i)\}_{i=1}^{M} \qquad (3)$$
may be the only available information on such a dependency. In this case, uncertainty arises not only from the lack of information on the relationship between the traditional CMAB variables $r$, $a$, $s$, but also from the unknown function $f$.
Note that this scenario opens the door to considering multiple sources of data. In contrast to the usual matched scenario, where the sequential data is of the form $\{(s_i, a_i, r_i, f(r_i) + \xi_i)\}_{i=1}^{t}$, it may be of interest to consider an unmatched scenario, where a second data set of the form $\{(r_j, f(r_j) + \xi_j)\}_{j=1}^{M}$ is given on top of the sequential data. Algorithms that handle unmatched data sets have been attracting increasing interest in the so-called data fusion problem (Meng et al. [2020]).
For instance, a new movie recommendation system might obtain a multivariate $r_j$ consisting of: whether the user showed interest in the recommendation, the number of clicks after the recommendation screen, the time spent reading the description of the recommendation, etc. The final reward $f(r_j) + \xi_j$ is given by the number of minutes that the client ended up watching the content for. A rich data set $\{(r_j, f(r_j) + \xi_j)\}_{j=1}^{M}$ might be obtained from other platforms that have been in the market longer. Not considering this second data set would imply a substantial loss of information, which would lead to suboptimal approaches.
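A minimal sketch of how such an unmatched data set could be exploited (all data and names here are hypothetical; we use a generic Gaussian process regressor as a stand-in for the model of the unknown $f$):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Hypothetical unmatched data set {(r_j, f(r_j) + xi_j)}_{j=1}^M from an
# external platform; f is unknown, so we simulate it for illustration.
rng = np.random.default_rng(0)
M = 200
r_ext = rng.uniform(0.0, 1.0, size=(M, 1))                      # external outcomes r_j
y_ext = np.sin(3.0 * r_ext).ravel() + 0.1 * rng.normal(size=M)  # noisy f(r_j) + xi_j

# Fit a GP regressor as a generic model of the unknown f.
f_hat = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-2)
f_hat.fit(r_ext, y_ext)

# Score rewards observed in our own (matched) sequential data.
r_seq = rng.uniform(0.0, 1.0, size=(10, 1))
mean, std = f_hat.predict(r_seq, return_std=True)  # estimate of f(r_t) with uncertainty
```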
In this work, we revisit the problem of sequentially maximising the conditional expectation of a function in the CMAB framework. Our contributions are twofold:

• We design a novel algorithm, the Contextual Bayesian Mean Process Upper Confidence Bound, which accounts for two sources of uncertainty: the lack of information on $r$, $a$, $s$, and the unknown function $f$. It allows for the use of both matched and unmatched data sets.

• We empirically show that the Bayesian Mean Process Upper Confidence Bound outperforms the state-of-the-art Conditional Mean Embeddings Upper Confidence Bound (Chowdhury et al. [2020]) in the experimental settings considered.
2 RELATED WORK
Multiple algorithms have been proposed for addressing the CMAB problem. The contextual Gaussian process upper
confidence bound (CGP-UCB), which motivates the algorithm proposed in this work, was first presented in Krause and
Ong [2011]. It generalizes the context-free Gaussian process upper confidence bound (GP-UCB) introduced in Srinivas
et al. [2009]. The CGP-UCB attains sublinear contextual regret in many real-life applications, i.e., it is able to compete with the optimal mapping from contexts to actions. Note, however, that there exist prominent alternatives to CGP-UCB, such as Thompson sampling (Agrawal and Goyal [2013]). Some algorithms designed for the MAB problem (Chowdhury and Gopalan [2017]) may also be adapted to the CMAB framework.
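To fix ideas, here is a simplified sketch of the CGP-UCB acquisition step (our paraphrase of the rule in Krause and Ong [2011]; the GP interface, $\beta$ value, and names are illustrative assumptions):

```python
import numpy as np

# Simplified CGP-UCB acquisition step: given a GP fitted on joint inputs
# (context, action) -> reward, pick the action with the highest upper
# confidence bound for the current context. `gp` is assumed to expose a
# scikit-learn-style predict(X, return_std=True).
def cgp_ucb_action(gp, context, candidate_actions, beta=2.0):
    X = np.array([np.concatenate([context, a]) for a in candidate_actions])
    mean, std = gp.predict(X, return_std=True)
    return candidate_actions[int(np.argmax(mean + np.sqrt(beta) * std))]
```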
The problem of sequentially maximising the conditional expectation has not received as much attention in the literature,
although a number of closely related problems have been presented in different forms. Oliveira et al. [2019] proposed
a sequential decision problem where both the reward and actions are noisy (execution and localisation noise), with
applications in robotics and stochastic simulations. The contextual combinatorial multi-armed bandit problem (Chen
et al. [2013], Qin et al. [2014]) may be understood as the problem of sequentially maximising the conditional expectation
of a function, where the chosen super arm may be seen as a multivariate binary action, and the dependency between the
scores of the individual arms and the reward as the unknown function.
For the problem of sequentially maximising the expectation of a function, Chowdhury et al. [2020] proposed the
Conditional Mean Embeddings Upper Confidence Bound (CME-UCB) algorithm. The CME-UCB uses the conditional
mean embedding for modeling the conditional distribution of the features. In contrast, alternative approaches for such
conditional density estimation scale poorly with the dimension of the underlying space (Grunewalder et al. [2012]).
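For reference, the standard empirical estimator of a conditional expectation via the conditional mean embedding (see Muandet et al. [2017]) can be sketched as follows; the kernel, regularisation choice, and function names here are ours:

```python
import numpy as np

# Empirical conditional-mean-embedding estimate of E[g(Y) | X = x]:
# g_Y^T (K_XX + n * lam * I)^{-1} k_X(x), for samples {(x_i, y_i)}_{i=1}^n.
def rbf_gram(A, B, length_scale=0.5):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * length_scale ** 2))

def cme_conditional_expectation(X, g_of_Y, x_query, lam=1e-3):
    n = X.shape[0]
    K = rbf_gram(X, X)                                 # Gram matrix K_XX
    k_x = rbf_gram(X, x_query)                         # cross-kernel k_X(x)
    weights = np.linalg.solve(K + n * lam * np.eye(n), k_x)
    return g_of_Y @ weights                            # one estimate per query point
```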
The conditional mean embedding, a key element of the CME-UCB algorithm, extends the concept of kernel mean
embeddings to conditional distributions (Muandet et al. [2017]). In terms of Bayesian procedures, Flaxman et al. [2016]
proposed a Bayesian approach for learning mean embeddings, and Chau et al. [2021a] generalized the approach for