PAC-Bayesian Offline Contextual Bandits With Guarantees

Otmane Sakhi (1,2), Pierre Alquier (3), Nicolas Chopin (2)

(1) Criteo AI Lab, Paris, France. (2) CREST, ENSAE, IPP, Palaiseau, France. (3) ESSEC Business School, Asia-Pacific campus, Singapore. Correspondence to: Otmane Sakhi <o.sakhi@criteo.com>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
Abstract
This paper introduces a new principled approach for off-policy learning in contextual bandits. Unlike previous work, our approach does not derive learning principles from intractable or loose bounds. We analyse the problem through the PAC-Bayesian lens, interpreting policies as mixtures of decision rules. This allows us to propose novel generalization bounds and provide tractable algorithms to optimize them. We prove that the derived bounds are tighter than their competitors, and can be optimized directly to confidently improve upon the logging policy offline. Our approach learns policies with guarantees, uses all available data and does not require tuning additional hyperparameters on held-out sets. We demonstrate through extensive experiments the effectiveness of our approach in providing performance guarantees in practical scenarios.
1. Introduction
Online industrial systems encounter sequential decision problems as they interact with the environment and strive to improve based on the received feedback. The contextual bandit framework formalizes this mechanism, and has proved valuable in applications such as recommender systems (Valko et al., 2014) and clinical trials (Villar et al., 2015). It describes a game of repeated interactions between a system and an environment, where the latter reveals a context that the system acts on, and returns a feedback in response. While the online solution to this problem involves strategies that find an optimal trade-off between exploration and exploitation to minimize the regret (Lattimore & Szepesvári, 2020), we are concerned with its offline formulation (Swaminathan & Joachims, 2015b), which is arguably better suited for real-life applications, where more control over the decision-maker, often called the policy, is needed. The policy is learned offline, based on historical data, typically obtained by logging the interactions between an older version of the decision system and the environment. By leveraging this data, our goal is to discover new strategies of greater performance.
There are two main paths to address this learning problem. The direct method (Sakhi et al., 2020a; Jeunen & Goethals, 2021) attacks the problem by modelling the feedback and deriving a policy according to this model. This approach can be praised for its simplicity (Brandfonbrener et al., 2021) and is well-studied in the offline setting (Nguyen-Tang et al., 2022), but it will often suffer from bias, as the feedback received is complex and the efficiency of the method depends directly on our ability to understand the problem's structure.

We will consider the second path of off-policy learning, or IPS: inverse propensity scoring (Horvitz & Thompson, 1952), where we learn the policy directly from the logged data after correcting its bias with importance sampling (Owen & Zhou, 2000). As these estimators (Bottou et al., 2013; Swaminathan & Joachims, 2015a) can suffer from a variance problem as we drift away from the logging policy, the literature gave birth to different learning principles (Swaminathan & Joachims, 2015b; Ma et al., 2019; London & Sandler, 2019; Faury et al., 2020) motivating penalizations toward the logging policy. These principles are inspired by generalization bounds, but introduce a hyperparameter λ to either replace intractable quantities (Swaminathan & Joachims, 2015b) or to tighten a potentially vacuous bound (London & Sandler, 2019). These approaches require tuning λ on a held-out set and sometimes fail at improving the previous decision system (London & Sandler, 2019; Chen et al., 2019).

In this work, we analyse off-policy learning from the PAC-Bayesian perspective (McAllester, 1998; Catoni, 2007). We aim at introducing a novel, theoretically grounded approach, based on the direct optimization of newly derived tight generalization bounds, to obtain guaranteed improvement of the previous system offline, without the need for held-out sets or hyperparameter tuning. We show that this framework is perfectly suited to our problem, as it naturally incorporates information about the old decision system and can confidently improve it.
2. Preliminaries
2.1. Setting
We use $x \in \mathcal{X}$ to denote a context and $a \in \mathcal{A} = [K]$ an action, where $K$ denotes the number of available actions. For a context $x$, each action is associated with a cost $c(x, a) \in [-1, 0]$, with the convention that better actions have smaller cost. The cost function $c$ is unknown. Our decision system is represented by its policy $\pi : \mathcal{X} \to \mathcal{P}(\mathcal{A})$, which, given $x \in \mathcal{X}$, defines a probability distribution over the discrete action space $\mathcal{A}$ of size $K$. Assuming that the contexts are stochastic and follow an unknown distribution $\nu$, we define the risk of the policy $\pi$ as the expected cost one suffers when playing actions according to $\pi$:

$$R(\pi) = \mathbb{E}_{x \sim \nu,\, a \sim \pi(\cdot|x)}\left[c(x, a)\right].$$
The learning problem is to find a policy $\pi$ which minimizes the risk. This risk could be naively estimated by deploying the policy online and gathering enough interactions to construct an accurate estimate. Unfortunately, we do not have this luxury in most real-world problems, as the cost of deploying bad policies can be extremely high. We can instead obtain an estimate by exploiting the logged interactions collected by the previous system. Indeed, the previous system is represented by a logging policy $\pi_0$ (e.g. a previous version of a recommender system that we are trying to improve), which gathered interaction data of the following form:

$$\mathcal{D}_n = \{x_i,\; a_i \sim \pi_0(\cdot|x_i),\; c_i\}_{i \in [n]}, \quad \text{with } c_i = c(x_i, a_i).$$
Given this data, one can build various estimators, the clipped IPS (Bottou et al., 2013) being the most commonly used. It is constructed by clipping either the importance weights or the logging propensities to mitigate variance issues (Ionides, 2008). We are more interested in clipping the logging probabilities, as we need objectives that are linear in the policy $\pi$ for our study. The cIPS estimator is given by:

$$\hat{R}^\tau_n(\pi) = \frac{1}{n}\sum_{i=1}^{n} \frac{\pi(a_i|x_i)}{\max\{\pi_0(a_i|x_i), \tau\}}\, c_i \qquad (1)$$

with $\tau \in [0, 1]$ being the clipping factor. Choosing $\tau \ll 1$ reduces the bias of cIPS. We recover the classical IPS estimator (Horvitz & Thompson, 1952), which is unbiased under mild conditions, by taking $\tau = 0$.
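To make the estimator concrete, here is a minimal NumPy sketch of Equation (1); it assumes the logged propensities $\pi_0(a_i|x_i)$, the new policy's probabilities $\pi(a_i|x_i)$ and the costs $c_i$ are available as arrays (the function and argument names are ours).

```python
import numpy as np

def cips_estimate(pi_probs, pi0_probs, costs, tau):
    """Clipped IPS estimate of R(pi) from logged data (Equation (1)).

    pi_probs[i]  : pi(a_i | x_i), probability of the logged action under the new policy
    pi0_probs[i] : pi_0(a_i | x_i), logging propensity
    costs[i]     : c(x_i, a_i) in [-1, 0]
    tau          : clipping factor in [0, 1]; tau = 0 recovers plain IPS
    """
    weights = pi_probs / np.maximum(pi0_probs, tau)
    return float(np.mean(weights * costs))
```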
Another estimator with better statistical properties is the doubly robust estimator (Ben-Tal et al., 2013), which uses the importance weights as control variates to further reduce the variance of the cIPS estimator. This estimator is asymptotically optimal in terms of variance (Farajtabar et al., 2018) amongst the class of unbiased and consistent off-policy estimators.

We consider a simplified version of this estimator, which replaces the cost model $\hat{c}$ with a single parameter $\xi \in [-1, 0]$ that can be chosen freely. We define the control variate clipped IPS, or cvcIPS, as follows:

$$\hat{R}^\tau_n(\pi) = \xi + \frac{1}{n}\sum_{i=1}^{n} \frac{\pi(a_i|x_i)}{\max\{\pi_0(a_i|x_i), \tau\}}\, (c_i - \xi). \qquad (2)$$

The cvcIPS estimator can be seen as a special case of the doubly robust estimator when the cost model $\hat{c} = \xi$ is constant and $\tau = 0$. cIPS is recovered by setting $\xi = 0$. This simple estimator is deeply connected to the SNIPS estimator (Swaminathan & Joachims, 2015a) and was shown to be better suited to off-policy learning, as it mitigates the problem of propensity overfitting (Joachims et al., 2018).
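Under the same assumptions as the previous sketch, Equation (2) can be written as follows; setting $\xi = 0$ recovers the cIPS function above.

```python
import numpy as np

def cvcips_estimate(pi_probs, pi0_probs, costs, tau, xi):
    """Control-variate clipped IPS estimate of R(pi) (Equation (2)).

    xi : constant baseline in [-1, 0] used as a control variate; xi = 0 gives cIPS.
    """
    weights = pi_probs / np.maximum(pi0_probs, tau)
    return float(xi + np.mean(weights * (costs - xi)))
```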
2.2. Related Work: Learning Principles
The literature so far has focused on deriving new principles to learn policies with good online performance. The first line of work in this direction is CRM: Counterfactual Risk Minimization (Swaminathan & Joachims, 2015b), which adopted SVP: Sample Variance Penalization (Maurer & Pontil, 2009) to favor policies with small empirical risk and controlled variance. The intuition behind it is that the variance of cIPS depends on the disparity between $\pi$ and $\pi_0$, making the estimator unreliable when $\pi$ drifts away from $\pi_0$. The analysis focused on the cIPS estimator and used uniform bounds based on empirical Bernstein inequalities (Maurer & Pontil, 2009), where intractable quantities were replaced by a tuning parameter $\lambda$, giving the following learning objective:

$$\arg\min_{\pi}\; \hat{R}^\tau_n(\pi) + \lambda\sqrt{\frac{\hat{V}_n(\pi)}{n}} \qquad (3)$$

with $\hat{V}_n(\pi)$ the empirical variance of the cIPS estimator. A majorization-minimization algorithm was provided in (Swaminathan & Joachims, 2015b) to solve Equation (3) for parametrized softmax policies.
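For illustration, a hedged sketch of how objective (3) can be evaluated for a given $\lambda$ (names are ours); the original work minimizes it with the majorization-minimization algorithm mentioned above rather than evaluating it pointwise.

```python
import numpy as np

def svp_objective(pi_probs, pi0_probs, costs, tau, lam):
    """Sample-variance-penalized (CRM) objective of Equation (3) for a fixed lambda."""
    n = len(costs)
    terms = costs * pi_probs / np.maximum(pi0_probs, tau)  # per-sample cIPS terms
    r_hat = terms.mean()                                   # empirical cIPS risk
    v_hat = terms.var(ddof=1)                              # empirical variance of cIPS
    return float(r_hat + lam * np.sqrt(v_hat / n))
```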
In the same spirit, (Faury et al., 2020; Sakhi et al., 2020b) generalize SVP using the distributional robustness framework, showing that the CRM principle can be retrieved with a particular choice of divergence, and provide asymptotic coverage results for the true risk. Their objectives are competitive with SVP while providing simple ways to scale its optimization to large datasets.
Another line of research, closer to our work, uses PAC-Bayesian bounds to derive learning objectives in the same fashion as (Swaminathan & Joachims, 2015b). Indeed, (London & Sandler, 2019) introduce the Bayesian CRM, motivating the use of $L_2$ regularization towards the parameter $\theta_0$ of the logging policy $\pi_0$. The analysis uses (McAllester, 2003)'s bound, is conducted on the cIPS estimator and controls the $L_2$ norm by a hyperparameter $\lambda$, giving the following learning objective for parametrized softmax policies:

$$\arg\min_{\theta}\; \left\{\hat{R}^\tau_n(\pi_\theta) + \lambda\,\|\theta - \theta_0\|^2\right\}. \qquad (4)$$

(London & Sandler, 2019) minimize a convex upper bound of objective (4) (obtained by taking a log transform of the policy), which is amenable to stochastic optimization, giving better results than (3) while scaling better with the size of the dataset.
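Similarly, a sketch of objective (4) for a parametrized softmax policy; `softmax_policy` is an assumed helper returning $\pi_\theta(a|x)$, and the squared $L_2$ penalty follows our reading of the regularizer.

```python
import numpy as np

def bayesian_crm_objective(theta, theta_0, contexts, actions, costs, pi0_probs,
                           tau, lam, softmax_policy):
    """Objective (4): clipped IPS risk of pi_theta plus an L2 pull towards theta_0."""
    pi_probs = np.array([softmax_policy(theta, x, a) for x, a in zip(contexts, actions)])
    r_hat = np.mean(costs * pi_probs / np.maximum(pi0_probs, tau))
    return float(r_hat + lam * np.sum((theta - theta_0) ** 2))
```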
Limitations. The principles found in the literature are mainly inspired by generalization bounds. However, the bounds from which these principles are derived either depend on intractable quantities (Swaminathan & Joachims, 2015b) or are not tight enough to be used as-is (London & Sandler, 2019). For example, (Swaminathan & Joachims, 2015b) derive a generalization bound (see Theorem 1 in (Swaminathan & Joachims, 2015b)) for offline policy learning using the notion of covering number. This introduces the complexity measure $Q_{\mathcal{H}}(n, \gamma)$, which cannot be computed (even for simple policy classes), making their bound intractable. This forces the introduction of a hyperparameter $\lambda$ that needs further tuning. Unfortunately, this approach suffers from numerous problems:

• No Theoretical Guarantees. Introducing the hyperparameter $\lambda$ in Equations (3) and (4) gives tractable objectives, but loses the theoretical guarantees given by the initial bounds. These objectives do not necessarily cover the true risk, and optimizing them can lead to policies worse than the logging policy $\pi_0$. Empirical evidence can be found in (Chen et al., 2019; London & Sandler, 2019), where the SVP principle in Equation (3) fails to improve on $\pi_0$.

• Inconsistent Strategy. These principles were first introduced to mitigate the suboptimality of learning with off-policy estimators, deemed untrustworthy for their potentially high variance. The strategy minimizes the objectives for different values $\{\lambda_1, \ldots, \lambda_m\}$, generating a set of policy candidates $\{\pi_{\lambda_1}, \ldots, \pi_{\lambda_m}\}$, from which we select the best policy $\pi_{\lambda^*}$ according to the same untrustworthy, high-variance estimators on a held-out set. This makes the selection strategy inconsistent with what these principles claim to solve.

• Tuning requires additional care. Tuning $\lambda$ needs to be done on a held-out set. This means that we need to train multiple policies (computational burden) on a fraction of the data (data inefficiency), and select the best policy among the candidates using off-policy estimators (variance problem) on the held-out set.

In this paper, we derive a coherent principle that learns policies using all the available data and provides guarantees on their performance, without introducing new hyperparameters.
2.3. Learning With Guaranteed Improvements
Our first concern in most applications is to improve upon the current system $\pi_0$. As $\mathcal{D}_n$ is collected by $\pi_0$, we have access to $R(\pi_0)$, up to a small $O(1/n)$ approximation error. Given a new policy $\pi$, we want to be confident that the improvement $\mathcal{I}(\pi, \pi_0) = R(\pi_0) - R(\pi)$ is positive before deployment.

Let us suppose that we are restricted to a class of policies $\mathcal{H}$, and have access to a generalization bound that gives the following result with high probability over draws of $\mathcal{D}_n$:

$$R(\pi) \le \mathcal{UB}_n(\pi) \quad \forall \pi \in \mathcal{H},$$

with $\mathcal{UB}_n$ an empirical upper bound that depends on $\mathcal{D}_n$. For any $\pi$, we define the guaranteed improvement:

$$\mathcal{GI}_{\mathcal{UB}_n}(\pi, \pi_0) = R(\pi_0) - \mathcal{UB}_n(\pi).$$

We can be sure of improving $R(\pi_0)$ offline if we manage to find $\pi \in \mathcal{H}$ that achieves $\mathcal{GI}_{\mathcal{UB}_n}(\pi, \pi_0) > 0$, as the following result will hold with high probability:

$$\mathcal{I}(\pi, \pi_0) \ge \mathcal{GI}_{\mathcal{UB}_n}(\pi, \pi_0) > 0.$$

To obtain such a policy, we look for the minimizer of $\mathcal{UB}_n$ over the class of policies $\mathcal{H}$:

$$\pi^*_{\mathcal{UB}_n} \in \arg\min_{\pi \in \mathcal{H}} \mathcal{UB}_n(\pi) = \arg\max_{\pi \in \mathcal{H}} \mathcal{GI}_{\mathcal{UB}_n}(\pi, \pi_0).$$

We define the best guaranteed risk and the best guaranteed improvement as follows:

$$\mathcal{GR}^*_{\mathcal{UB}_n} = \mathcal{UB}_n(\pi^*_{\mathcal{UB}_n}), \qquad \mathcal{GI}^*_{\mathcal{UB}_n}(\pi_0) = R(\pi_0) - \mathcal{GR}^*_{\mathcal{UB}_n}.$$
A theoretically grounded strategy to improve $\pi_0$ is therefore to deploy $\pi^*_{\mathcal{UB}_n}$ if we obtain a positive guaranteed improvement $\mathcal{GI}^*_{\mathcal{UB}_n}(\pi_0) > 0$, and otherwise to continue collecting data with the current system $\pi_0$. This strategy will always produce policies that are at least as good as $\pi_0$, optimizes a bound directly over all the data, and requires neither held-out sets nor new hyperparameters.
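The strategy can be summarized by the sketch below, where for illustration we assume a finite set of candidate policies and a generic bound function `upper_bound`; in practice the minimization of $\mathcal{UB}_n$ is carried out over a parametrized policy class.

```python
def should_deploy(risk_pi0, candidate_policies, upper_bound):
    """Offline deployment rule based on a high-probability upper bound UB_n.

    risk_pi0    : R(pi_0), estimated from the logged data
    upper_bound : callable returning UB_n(pi) for a candidate policy pi
    Returns the bound minimizer if its guaranteed improvement is positive, else None.
    """
    best_pi = min(candidate_policies, key=upper_bound)        # pi*_UBn
    guaranteed_improvement = risk_pi0 - upper_bound(best_pi)  # GI*_UBn(pi_0)
    if guaranteed_improvement > 0:
        return best_pi       # confident, guaranteed improvement: deploy
    return None              # otherwise keep collecting data with pi_0
```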
However, the tightness of the bound $\mathcal{UB}_n$ plays an important role. Indeed, if we fix $\mathcal{D}_n$ and $\pi_0$, then $\mathcal{GI}^*_{\mathcal{UB}_n}(\pi_0)$ only depends on the minimum of $\mathcal{UB}_n$, motivating the derivation of bounds that are tractable and tight enough to achieve the smallest minimum possible.

In this regard, we opt for the PAC-Bayesian framework to tackle this problem, as it has been proven to give tractable, non-vacuous bounds even in difficult settings (Dziugaite & Roy, 2017). Its paradigm also fits our application, as we can incorporate information about the previous system $\pi_0$ in the form of a prior; see (Alquier, 2021) for a recent review.
Contributions. We advocate a theoretically grounded strategy that uses generalization bounds to improve $\pi_0$ with guarantees. So far, the existing bounds are either intractable (Swaminathan & Joachims, 2015b) or can be proven to be suboptimal (London & Sandler, 2019). In this work,

• we derive new, tractable and tight generalization bounds using the PAC-Bayesian framework. These bounds are fully tractable, unlike (Swaminathan & Joachims, 2015b)'s bound, and are tighter than (London & Sandler, 2019)'s bound;

• we provide a way to optimize our bounds over a particular class of policies and show empirically that they can guarantee improvement over $\pi_0$ in practical scenarios.
3. Motivating PAC-Bayesian tools
As previously discussed, in the contextual bandit setting, we seek a policy that minimizes the expected cost:

$$R(\pi) = \mathbb{E}_{x \sim \nu,\, a \sim \pi(\cdot|x)}\left[c(x, a)\right].$$

The minimizer of this objective over the unrestricted space of policies is a deterministic decision rule defined by:

$$\forall x, a, \quad \pi^*(a|x) = \mathbb{1}\left[\arg\min_{a'} c(x, a') = a\right].$$

Given a context $x$, this solution always chooses the action with the minimum cost. However, as the function $c$ is generally unknown, we instead learn a parametric score function $f_\theta \in \mathcal{F}_\Theta = \{f_\theta : \mathcal{X} \times [K] \to \mathbb{R},\ \theta \in \Theta\}$ that encodes an action's relevance to a context $x$. Given a function $f_\theta$, we define the decision rule $d_\theta$ by:

$$d_\theta(a|x) = \mathbb{1}\left[\arg\max_{a'} f_\theta(x, a') = a\right].$$
These parametric decision rules will be the building blocks of our analysis. We view stochastic policies as smoothed decision rules, with smoothness induced by distributions over the space of score functions $\mathcal{F}_\Theta$. Given a context $x$, instead of sampling an action $a$ directly from a distribution on the action set, we sample a function $f_\theta$ from a distribution over $\mathcal{F}_\Theta$ and compute the action as $a = \arg\max_{a'} f_\theta(x, a')$. With this interpretation, for any distribution $Q$ on $\mathcal{F}_\Theta$, the probability of an action $a \in \mathcal{A}$ given a context $x \in \mathcal{X}$ is defined as the expected value of $d_\theta$ over $Q$, that is:

$$\pi_Q(a|x) = \mathbb{E}_{\theta \sim Q}\left[d_\theta(a|x)\right] = \mathbb{P}_{\theta \sim Q}\left[\arg\max_{a'} f_\theta(x, a') = a\right].$$
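For intuition, a simple Monte Carlo approximation of this definition; `sample_theta` and `score_fn` are assumed helpers that draw $\theta \sim Q$ and evaluate the $K$ scores $f_\theta(x, \cdot)$.

```python
import numpy as np

def pi_Q_probability(x, a, sample_theta, score_fn, n_samples=10_000):
    """Monte Carlo estimate of pi_Q(a | x) = E_{theta ~ Q}[d_theta(a | x)]."""
    hits = 0
    for _ in range(n_samples):
        theta = sample_theta()                 # draw a score function f_theta ~ Q
        scores = score_fn(theta, x)            # array of K scores f_theta(x, .)
        if int(np.argmax(scores)) == a:        # d_theta plays the argmax action
            hits += 1
    return hits / n_samples
```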
Policies as mixtures of decision rules. This perspective on policies was introduced in (Seldin et al., 2011) and later developed in (London & Sandler, 2019). Constructing policies as mixtures of deterministic decision rules does not restrict the class of policies our study applies to. Indeed, if the family $\mathcal{F}_\Theta$ is rich enough (e.g. neural networks), the following theorem proves that any policy $\pi$ can be written as a mixture of deterministic policies.

Theorem 3.1. Let us fix a policy $\pi$. Then there is a probability distribution $Q_\pi$ on the set of all functions $f : \mathcal{X} \times [K] \to \{0, 1\}$ such that

$$\forall x, a, \quad \pi(a|x) = \mathbb{P}_{f \sim Q_\pi}\left[\arg\max_{a'} f(x, a') = a\right].$$
A formal proof of Theorem 3.1 is given in Appendix A.1. This means that adopting this perspective on policies does not narrow the scope of our study. For a policy $\pi_Q$ defined by a distribution $Q$ over $\mathcal{F}_\Theta$, we observe that, by linearity, its true risk can be written as:

$$R(\pi_Q) = \mathbb{E}_{\theta \sim Q}\left[R(d_\theta)\right].$$

Similarly, clipping the logging propensities in our empirical estimators yields estimators that are linear in $\pi$. For instance, we can estimate the risk of the policy $\pi_Q$ empirically with cvcIPS (as it generalizes cIPS) and obtain:

$$\hat{R}^\tau_n(\pi_Q) = \mathbb{E}_{\theta \sim Q}\left[\hat{R}^\tau_n(d_\theta)\right].$$
By linearity, both the true and the empirical risk of a policy $\pi_Q$ can be interpreted as the average risk of decision rules drawn from the distribution $Q$. This duality is at the heart of our analysis and paves the way to the PAC-Bayesian framework, which studies the generalization properties of the average risk of randomized predictors (Alquier, 2021). If we fix a reference distribution $P$ over $\mathcal{F}_\Theta$ and define the KL divergence from $P$ to $Q$ as:

$$\mathrm{KL}(Q\|P) = \begin{cases} \int \ln\left\{\frac{dQ}{dP}\right\} dQ & \text{if } Q \text{ is absolutely continuous w.r.t. } P, \\ +\infty & \text{otherwise,} \end{cases}$$

we can construct, with the help of PAC-Bayesian tools, bounds holding for the average risk of decision rules under any distribution $Q$:

$$\mathbb{E}_{\theta \sim Q}\left[R(d_\theta)\right] \le \mathbb{E}_{\theta \sim Q}\left[\hat{R}^\tau_n(d_\theta)\right] + O\left(\mathrm{KL}(Q\|P)\right).$$

Our objective is to find tight generalization bounds of this form, as this construction, coupled with the linearity of our objective and estimator, allows us to obtain tight bounds holding for any policy $\pi_Q$:

$$R(\pi_Q) \le \hat{R}^\tau_n(\pi_Q) + O\left(\mathrm{KL}(Q\|P)\right).$$
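The bounds below are driven by this KL term. As an illustration only (this section does not fix $Q$ and $P$), if both are taken to be independent Gaussians over the parameters $\theta$, a common tractable choice, the divergence has a closed form:

```python
import numpy as np

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(Q || P) for independent (diagonal) Gaussian distributions over theta."""
    var_q, var_p = sigma_q ** 2, sigma_p ** 2
    return float(0.5 * np.sum(np.log(var_p / var_q)
                              + (var_q + (mu_q - mu_p) ** 2) / var_p
                              - 1.0))
```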
The PAC-Bayesian Paradigm. Before diving deeper into the analysis, we want to emphasize the similarities between the PAC-Bayesian paradigm and the offline contextual bandit problem. This learning framework proceeds as follows: given a class of functions $\mathcal{F}_\Theta$, we fix a prior (reference distribution) $P$ on $\mathcal{F}_\Theta$ before seeing the data; then we receive some data $\mathcal{D}_n$, which helps us learn a distribution $Q$ over $\mathcal{F}_\Theta$ that is better than our reference $P$. With the previous perspective on policies, the prior $P$, even though it can be any data-free distribution, will be our logging policy (i.e. $\pi_0 = \pi_P$), and we will use the data $\mathcal{D}_n$ to learn a distribution $Q$, and thus a new policy $\pi_Q$ that improves on the logging policy $\pi_0$.
4. PAC-Bayesian Analysis
4.1. Bounds for clipped IPS
The clipped IPS estimator (Bottou et al., 2013) is often studied for offline policy learning (Swaminathan & Joachims, 2015b; London & Sandler, 2019) as it is easy to analyse and has a negative bias (once the cost is negative), facilitating the derivation of learning bounds.

(London & Sandler, 2019) adapted (McAllester, 2003)'s bound to derive their learning objective. We state a slightly tighter version in Proposition 4.1 for the cIPS estimator. The proof of this bound cannot be adapted naively to the cvcIPS estimator because, once $\xi \ne 0$, the bias of the estimator, an intractable quantity that depends on the unknown distribution $\nu$, is no longer negative and needs to be incorporated in the bound, making the bound itself intractable.
Proposition 4.1. Given a prior $P$ on $\mathcal{F}_\Theta$, $\tau \in (0, 1]$ and $\delta \in (0, 1]$, the following bound holds with probability at least $1 - \delta$ uniformly over all distributions $Q$ over $\mathcal{F}_\Theta$:

$$R(\pi_Q) \le \hat{R}^\tau_n(\pi_Q) + \frac{2\left(\mathrm{KL}(Q\|P) + \ln\frac{2\sqrt{n}}{\delta}\right)}{\tau n} + \sqrt{\frac{2\left[\hat{R}^\tau_n(\pi_Q) + \frac{1}{\tau}\right]\left(\mathrm{KL}(Q\|P) + \ln\frac{2\sqrt{n}}{\delta}\right)}{\tau n}}.$$

The upper bound stated in this proposition will be denoted by $\mathcal{LS}^{P,\delta,\tau}_n(\pi_Q)$. When there is no ambiguity, we will drop $P, \delta, \tau$ (all fixed) and simply write $\mathcal{LS}_n(\pi_Q)$.
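A sketch evaluating this bound for fixed empirical quantities (the function name and the $\ln\frac{2\sqrt{n}}{\delta}$ term follow our reconstruction of the statement above):

```python
import numpy as np

def ls_bound(r_hat, kl, n, tau, delta):
    """McAllester-style upper bound of Proposition 4.1, denoted LS_n(pi_Q).

    r_hat : clipped IPS estimate of R(pi_Q) on the n logged samples
    kl    : KL(Q || P)
    """
    slack = kl + np.log(2.0 * np.sqrt(n) / delta)
    return float(r_hat + 2.0 * slack / (tau * n)
                 + np.sqrt(2.0 * (r_hat + 1.0 / tau) * slack / (tau * n)))
```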
(McAllester, 2003)'s bound can give tight results in the $[0, 1]$-bounded loss case when the empirical risk is close to 0, as one then obtains fast convergence rates in $O(1/n)$. However, its use in the case of offline contextual bandits is far from optimal. Indeed, to achieve a fast rate in this context, one needs $\hat{R}^\tau_n(\pi_Q) + \frac{1}{\tau} \approx 0$. This is hardly achievable in practice, especially when $n$ is large and $\tau \ll 1$.

To support this claim, let us suppose that for each context $x$ there is one optimal action $a^*_x$ for which $c(x, a^*_x) = -1$, and that the cost is 0 otherwise. Let us write down the clipped IPS:

$$\hat{R}^\tau_n(\pi) = \frac{1}{n}\sum_{i=1}^{n} \frac{\pi(a_i|x_i)}{\max\{\pi_0(a_i|x_i), \tau\}}\, c_i \ge -\frac{1}{\tau}.$$

To attain equality, we need:

$$\forall i \in [n], \quad c_i = -1, \quad \pi_0(a_i|x_i) \le \tau, \quad \pi(a_i|x_i) = 1.$$
If $n$ is large, the first condition on the costs means that $\pi_0$ is near optimal and the played actions $a_i$ are optimal. In that case, we would have $\forall i \in [n],\ \pi_0(a_i|x_i) \approx 1$, which, combined with the second condition on $\pi_0$, gives $\tau \approx 1$. In practice, $\pi_0$ is never the optimal policy and $\tau \ll 1$, which makes the fast-rate condition $\hat{R}^\tau_n(\pi) + \frac{1}{\tau} \approx 0$ unachievable. In the majority of scenarios, as we penalize $\pi_Q$ to stay close to $\pi_0$ through the KL divergence, we will have $\hat{R}^\tau_n(\pi_Q) \in [-1, 0]$, thus $\hat{R}^\tau_n(\pi_Q) + \frac{1}{\tau} \approx \frac{1}{\tau}$, giving the limiting behaviour:

$$\mathcal{LS}_n(\pi_Q) = \hat{R}^\tau_n(\pi_Q) + O\left(\frac{1}{\tau}\sqrt{\frac{\mathrm{KL}(Q\|P)}{n}}\right).$$
If we want tighter results, we need to look for bounds with better dependencies on $\tau$ and $n$. In our pursuit of a tighter bound, we derive the following result:

Proposition 4.2. Given a prior $P$ on $\mathcal{F}_\Theta$, $\tau \in (0, 1]$ and $\delta \in (0, 1]$, the following bound holds with probability at least $1 - \delta$ uniformly over all distributions $Q$ over $\mathcal{F}_\Theta$:

$$R(\pi_Q) \le \min_{\lambda > 0} \frac{1 - \exp\left(-\tau\lambda \hat{R}^\tau_n(\pi_Q) - \frac{1}{n}\left[\mathrm{KL}(Q\|P) + \ln\frac{2\sqrt{n}}{\delta}\right]\right)}{\tau\left(e^{\lambda} - 1\right)}.$$

We will denote by $\mathcal{C}^{P,\delta,\tau}_n(\pi_Q)$ the upper bound stated in this proposition. When there is no ambiguity, we will drop $P, \delta, \tau$ (all fixed) and simply write $\mathcal{C}_n(\pi_Q)$.
This is a direct application of (Catoni, 2007)'s bound to the bounded-loss cIPS estimator, exploiting the fact that its bias is negative. A full derivation can be found in Appendix A.2. Note that Proposition 4.2 cannot be applied to the cvcIPS estimator ($\xi \ne 0$), as its bias is non-negative and intractable.
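For illustration, the bound of Proposition 4.2 can be evaluated by a one-dimensional search over $\lambda$; in the paper the bound is optimized jointly over the policy, whereas this sketch only evaluates it for fixed empirical quantities (and follows our reconstruction of the statement):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def catoni_bound(r_hat, kl, n, tau, delta):
    """Catoni-style upper bound of Proposition 4.2, denoted C_n(pi_Q)."""
    slack = (kl + np.log(2.0 * np.sqrt(n) / delta)) / n

    def bound(lam):
        # Bound of Proposition 4.2 evaluated at a fixed lambda > 0.
        return (1.0 - np.exp(-tau * lam * r_hat - slack)) / (tau * (np.exp(lam) - 1.0))

    res = minimize_scalar(bound, bounds=(1e-6, 50.0), method="bounded")
    return float(bound(res.x))
```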
To be able to measure the tightness of this bound, we estimate its limiting behaviour to understand its dependency on $\tau$ and $n$. We derive in Appendix A.3 the following result:

$$\mathcal{C}_n(\pi_Q) = \hat{R}^\tau_n(\pi_Q) + O\left(\frac{\mathrm{KL}(Q\|P)}{\tau n}\right).$$
This shows that $\mathcal{C}_n$ has a better dependency on $n$ than $\mathcal{LS}_n$. In fact, we can prove that $\mathcal{C}_n$ always gives tighter results, as the next theorem states that it is smaller than $\mathcal{LS}_n$ in all scenarios.

Theorem 4.3. For any $\mathcal{D}_n \sim (\nu, \pi_0)^n$, any distributions $P, Q$ and any $\tau \in (0, 1]$, $\delta \in (0, 1]$, we have:

$$\mathcal{C}^{P,\delta,\tau}_n(\pi_Q) \le \mathcal{LS}^{P,\delta,\tau}_n(\pi_Q).$$
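As a quick numerical illustration of Theorem 4.3, reusing the `ls_bound` and `catoni_bound` sketches above with purely hypothetical values:

```python
# Hypothetical values for illustration only; C_n should never exceed LS_n.
r_hat, kl, n, tau, delta = -0.45, 25.0, 100_000, 0.1, 0.05
print("LS_n:", ls_bound(r_hat, kl, n, tau, delta))       # looser, McAllester-style
print("C_n: ", catoni_bound(r_hat, kl, n, tau, delta))   # tighter, Catoni-style
```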