
PAC-Bayesian Offline Contextual Bandits With Guarantees
Otmane Sakhi 1 2   Pierre Alquier 3   Nicolas Chopin 2

1 Criteo AI Lab, Paris, France   2 CREST, ENSAE, IPP, Palaiseau, France   3 ESSEC Business School, Asia-Pacific campus, Singapore.
Correspondence to: Otmane Sakhi <o.sakhi@criteo.com>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
Abstract
This paper introduces a new principled approach for off-policy learning in contextual bandits. Unlike previous work, our approach does not derive learning principles from intractable or loose bounds. We analyse the problem through the PAC-Bayesian lens, interpreting policies as mixtures of decision rules. This allows us to propose novel generalization bounds and provide tractable algorithms to optimize them. We prove that the derived bounds are tighter than their competitors, and can be optimized directly to confidently improve upon the logging policy offline. Our approach learns policies with guarantees, uses all available data and does not require tuning additional hyperparameters on held-out sets. We demonstrate through extensive experiments the effectiveness of our approach in providing performance guarantees in practical scenarios.
1. Introduction
Online industrial systems encounter sequential decision problems as they interact with the environment and strive to improve based on the received feedback. The contextual bandit framework formalizes this mechanism, and has proved valuable in applications such as recommender systems (Valko et al., 2014) and clinical trials (Villar et al., 2015). It describes a game of repeated interactions between a system and an environment, where the environment reveals a context, the system acts on it, and receives feedback in return. While the online solution to this problem involves strategies that find an optimal trade-off between exploration and exploitation to minimize the regret (Lattimore & Szepesvári, 2020), we are concerned with its offline formulation (Swaminathan & Joachims, 2015b), which is arguably better suited for real-life applications, where more control over the
decision-maker, often referred to as the policy, is needed. The policy is learned offline, from historical data typically obtained by logging the interactions between an older version of the decision system and the environment. By leveraging this data, our goal is to discover new strategies of greater performance.
There are two main paths to address this learning problem. The direct method (Sakhi et al., 2020a; Jeunen & Goethals, 2021) attacks the problem by modelling the feedback and deriving a policy according to this model. This approach is praised for its simplicity (Brandfonbrener et al., 2021) and is well studied in the offline setting (Nguyen-Tang et al., 2022), but it often suffers from bias, as the received feedback is complex and the efficiency of the method depends directly on our ability to understand the problem's structure.
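To make this concrete, the direct method can be sketched as follows (the notation here is ours, purely illustrative, and not necessarily that of the paper): fit a model of the expected reward from logged data and act greedily with respect to it,
$$
\hat{r}(x, a) \approx \mathbb{E}\left[ r \mid x, a \right], \qquad \pi_{\mathrm{DM}}(x) \in \operatorname*{arg\,max}_{a \in \mathcal{A}} \hat{r}(x, a),
$$
so that any misspecification of $\hat{r}$ propagates directly to the learned policy.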
We will consider the second path of off-policy learning, or IPS: inverse propensity scoring (Horvitz & Thompson, 1952), where we learn the policy directly from the logged data after correcting its bias with importance sampling (Owen & Zhou, 2000). As these estimators (Bottou et al., 2013; Swaminathan & Joachims, 2015a) can suffer from high variance as we drift away from the logging policy, the literature has produced different learning principles (Swaminathan & Joachims, 2015b; Ma et al., 2019; London & Sandler, 2019; Faury et al., 2020) motivating penalties that keep the learned policy close to the logging policy. These principles are inspired by generalization bounds, but introduce a hyperparameter λ to either replace intractable quantities (Swaminathan & Joachims, 2015b) or tighten a potentially vacuous bound (London & Sandler, 2019). These approaches require tuning λ on a held-out set and sometimes fail to improve the previous decision system (London & Sandler, 2019; Chen et al., 2019).
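For concreteness, a common form of these estimators and principles is sketched below, in our own illustrative notation rather than the paper's. Given logged interactions $(x_i, a_i, r_i)_{i=1}^{n}$ collected under a logging policy $\pi_0$, the IPS estimator reweights the observed rewards, and the penalized principles trade this estimate off against a data-dependent term controlled by $\lambda$:
$$
\hat{R}_{\mathrm{IPS}}(\pi) = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)}\, r_i, \qquad \hat{\pi} \in \operatorname*{arg\,max}_{\pi} \; \hat{R}_{\mathrm{IPS}}(\pi) - \lambda\, \mathrm{Pen}_n(\pi, \pi_0),
$$
where $\mathrm{Pen}_n$ is, for instance, an empirical-variance term (Swaminathan & Joachims, 2015b) or a regularizer pulling $\pi$ towards $\pi_0$ (London & Sandler, 2019), and $\lambda$ has to be tuned on held-out data.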
In this work, we analyse off-policy learning from the PAC-Bayesian perspective (McAllester, 1998; Catoni, 2007). We introduce a novel, theoretically grounded approach, based on the direct optimization of newly derived tight generalization bounds, to obtain guaranteed improvement over the previous system offline, without the need for held-out sets or hyperparameter tuning. We show that the PAC-Bayesian framework is perfectly suited to this problem, as it naturally incorporates information about the old decision system and allows us to confidently improve upon it.
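As a rough sketch of this viewpoint, again in our own illustrative notation, a stochastic policy can be seen as a mixture of deterministic decision rules $d : \mathcal{X} \to \mathcal{A}$ weighted by a distribution $Q$,
$$
\pi_Q(a \mid x) = \mathbb{E}_{d \sim Q}\!\left[ \mathbf{1}\{ d(x) = a \} \right],
$$
and PAC-Bayesian bounds control the risk of $\pi_Q$ simultaneously for all distributions $Q$ through a complexity term $\mathrm{KL}(Q \,\|\, P)$, where the prior $P$ can encode the logging policy. This is what makes it possible to optimize the bound itself over $Q$, without a held-out set.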