PAC-Bayesian Offline Contextual Bandits With Guarantees

Otmane Sakhi (1,2), Pierre Alquier (3), Nicolas Chopin (2)

(1) Criteo AI Lab, Paris, France. (2) CREST, ENSAE, IPP, Palaiseau, France. (3) ESSEC Business School, Asia-Pacific campus, Singapore. Correspondence to: Otmane Sakhi <o.sakhi@criteo.com>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
Abstract
This paper introduces a new principled approach for off-policy learning in contextual bandits. Unlike previous work, our approach does not derive learning principles from intractable or loose bounds. We analyse the problem through the PAC-Bayesian lens, interpreting policies as mixtures of decision rules. This allows us to propose novel generalization bounds and provide tractable algorithms to optimize them. We prove that the derived bounds are tighter than their competitors, and can be optimized directly to confidently improve upon the logging policy offline. Our approach learns policies with guarantees, uses all available data and does not require tuning additional hyperparameters on held-out sets. We demonstrate through extensive experiments the effectiveness of our approach in providing performance guarantees in practical scenarios.
1. Introduction
Online industrial systems encounter sequential decision problems as they interact with the environment and strive to improve based on the received feedback. The contextual bandit framework formalizes this mechanism, and has proved valuable in applications such as recommender systems (Valko et al., 2014) and clinical trials (Villar et al., 2015). It describes a game of repeated interactions between a system and an environment, where the latter reveals a context that the system acts on, and returns a feedback in response. While the online solution to this problem involves strategies that find an optimal trade-off between exploration and exploitation to minimize the regret (Lattimore & Szepesvári, 2020), we are concerned with its offline formulation (Swaminathan & Joachims, 2015b), which is arguably better suited for real-life applications, where more control over the decision-maker, often called the policy, is needed. The policy is learned offline, based on historical data, typically obtained by logging the interactions between an older version of the decision system and the environment. By leveraging this data, our goal is to discover new strategies of greater performance.
There are two main paths to address this learning problem. The direct method (Sakhi et al., 2020a; Jeunen & Goethals, 2021) attacks the problem by modelling the feedback and deriving a policy according to this model. This approach can be praised for its simplicity (Brandfonbrener et al., 2021) and is well-studied in the offline setting (Nguyen-Tang et al., 2022), but it will often suffer from bias, as the feedback received is complex and the efficiency of the method depends directly on our ability to understand the problem's structure.

We will consider the second path of off-policy learning, or IPS: inverse propensity scoring (Horvitz & Thompson, 1952), where we learn the policy directly from the logged data after correcting its bias with importance sampling (Owen & Zhou, 2000). As these estimators (Bottou et al., 2013; Swaminathan & Joachims, 2015a) can suffer from a variance problem as we drift away from the logging policy, the literature gave birth to different learning principles (Swaminathan & Joachims, 2015b; Ma et al., 2019; London & Sandler, 2019; Faury et al., 2020) motivating penalizations toward the logging policy. These principles are inspired by generalization bounds, but introduce a hyperparameter λ to either replace intractable quantities (Swaminathan & Joachims, 2015b) or to tighten a potentially vacuous bound (London & Sandler, 2019). These approaches require tuning λ on a held-out set and sometimes fail at improving the previous decision system (London & Sandler, 2019; Chen et al., 2019).

In this work, we analyse off-policy learning from the PAC-Bayesian perspective (McAllester, 1998; Catoni, 2007). We aim at introducing a novel, theoretically grounded approach, based on the direct optimization of newly derived tight generalization bounds, to obtain guaranteed improvement of the previous system offline, without the need for held-out sets or hyperparameter tuning. We show that this framework is perfectly suited to our problem, as it naturally incorporates information about the old decision system and can confidently improve it.
2. Preliminaries
2.1. Setting
We use $x \in \mathcal{X}$ to denote a context and $a \in \mathcal{A} = [K]$ an action, where $K$ denotes the number of available actions. For a context $x$, each action is associated with a cost $c(x, a) \in [-1, 0]$, with the convention that better actions have smaller cost. The cost function $c$ is unknown. Our decision system is represented by its policy $\pi : \mathcal{X} \to \mathcal{P}(\mathcal{A})$, which, given $x \in \mathcal{X}$, defines a probability distribution over the discrete action space $\mathcal{A}$ of size $K$. Assuming that the contexts are stochastic and follow an unknown distribution $\nu$, we define the risk of the policy $\pi$ as the expected cost one suffers when playing actions according to $\pi$:

$$R(\pi) = \mathbb{E}_{x \sim \nu,\, a \sim \pi(\cdot|x)}\left[c(x, a)\right].$$
The learning problem is to find a policy $\pi$ which minimizes the risk. This risk could be naively estimated by deploying the policy online and gathering enough interactions to construct an accurate estimate. Unfortunately, we do not have this luxury in most real-world problems, as the cost of deploying bad policies can be extremely high. We can instead obtain an estimate by exploiting the logged interactions collected by the previous system. Indeed, the previous system is represented by a logging policy $\pi_0$ (e.g. a previous version of a recommender system that we are trying to improve), which gathered interaction data of the following form:

$$\mathcal{D}_n = \{x_i,\; a_i \sim \pi_0(\cdot|x_i),\; c_i\}_{i \in [n]}, \quad \text{with } c_i = c(x_i, a_i).$$
Given this data, one can build various estimators, the clipped IPS (Bottou et al., 2013) being the most commonly used. It is constructed by clipping either the importance weights or the logging propensities to mitigate variance issues (Ionides, 2008). We are more interested in clipping the logging probabilities, as we need objectives that are linear in the policy $\pi$ for our study. The cIPS estimator is given by:

$$\hat{R}^\tau_n(\pi) = \frac{1}{n}\sum_{i=1}^{n} \frac{\pi(a_i|x_i)}{\max\{\pi_0(a_i|x_i), \tau\}}\, c_i \qquad (1)$$

with $\tau \in [0, 1]$ being the clipping factor. Choosing $\tau \ll 1$ reduces the bias of cIPS. We recover the classical IPS estimator (Horvitz & Thompson, 1952), which is unbiased under mild conditions, by taking $\tau = 0$.
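To make the estimator concrete, here is a minimal NumPy sketch of Equation (1); it assumes the logged propensities $\pi_0(a_i|x_i)$, the new policy's probabilities $\pi(a_i|x_i)$ and the costs $c_i$ are available as arrays (the function and argument names are ours).

```python
import numpy as np

def cips_estimate(pi_probs, pi0_probs, costs, tau):
    """Clipped IPS estimate of R(pi) from logged data (Equation (1)).

    pi_probs[i]  : pi(a_i | x_i), probability of the logged action under the new policy
    pi0_probs[i] : pi_0(a_i | x_i), logging propensity
    costs[i]     : c(x_i, a_i) in [-1, 0]
    tau          : clipping factor in [0, 1]; tau = 0 recovers plain IPS
    """
    weights = pi_probs / np.maximum(pi0_probs, tau)
    return float(np.mean(weights * costs))
```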
Another estimator with better statistical properties is the doubly robust estimator (Ben-Tal et al., 2013), which uses the importance weights as control variates to further reduce the variance of the cIPS estimator. This estimator is asymptotically optimal in terms of variance (Farajtabar et al., 2018) amongst the class of unbiased and consistent off-policy estimators.

We consider a simplified version of this estimator, which replaces the cost model $\hat{c}$ with a single parameter $\xi \in [-1, 0]$ that can be chosen freely. We define the control variate clipped IPS, or cvcIPS, as follows:

$$\hat{R}^\tau_n(\pi) = \xi + \frac{1}{n}\sum_{i=1}^{n} \frac{\pi(a_i|x_i)}{\max\{\pi_0(a_i|x_i), \tau\}}\, (c_i - \xi). \qquad (2)$$

The cvcIPS estimator can be seen as a special case of the doubly robust estimator when the cost model $\hat{c} = \xi$ is constant and $\tau = 0$. cIPS is recovered by setting $\xi = 0$. This simple estimator is deeply connected to the SNIPS estimator (Swaminathan & Joachims, 2015a) and was shown to be better suited to off-policy learning, as it mitigates the problem of propensity overfitting (Joachims et al., 2018).
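Under the same assumptions as the previous sketch, Equation (2) can be written as follows; setting $\xi = 0$ recovers the cIPS function above.

```python
import numpy as np

def cvcips_estimate(pi_probs, pi0_probs, costs, tau, xi):
    """Control-variate clipped IPS estimate of R(pi) (Equation (2)).

    xi : constant baseline in [-1, 0] used as a control variate; xi = 0 gives cIPS.
    """
    weights = pi_probs / np.maximum(pi0_probs, tau)
    return float(xi + np.mean(weights * (costs - xi)))
```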
2.2. Related Work: Learning Principles
The literature so far has focused on deriving new principles to learn policies with good online performance. The first line of work in this direction is CRM: Counterfactual Risk Minimization (Swaminathan & Joachims, 2015b), which adopted SVP: Sample Variance Penalization (Maurer & Pontil, 2009) to favor policies with small empirical risk and controlled variance. The intuition behind it is that the variance of cIPS depends on the disparity between $\pi$ and $\pi_0$, making the estimator unreliable when $\pi$ drifts away from $\pi_0$. The analysis focused on the cIPS estimator and used uniform bounds based on empirical Bernstein inequalities (Maurer & Pontil, 2009), where intractable quantities were replaced by a tuning parameter $\lambda$, giving the following learning objective:

$$\arg\min_{\pi}\; \hat{R}^\tau_n(\pi) + \lambda\sqrt{\frac{\hat{V}_n(\pi)}{n}} \qquad (3)$$

with $\hat{V}_n(\pi)$ the empirical variance of the cIPS estimator. A majorization-minimization algorithm was provided in (Swaminathan & Joachims, 2015b) to solve Equation (3) for parametrized softmax policies.
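For illustration, a hedged sketch of how objective (3) can be evaluated for a given $\lambda$ (names are ours); the original work minimizes it with the majorization-minimization algorithm mentioned above rather than evaluating it pointwise.

```python
import numpy as np

def svp_objective(pi_probs, pi0_probs, costs, tau, lam):
    """Sample-variance-penalized (CRM) objective of Equation (3) for a fixed lambda."""
    n = len(costs)
    terms = costs * pi_probs / np.maximum(pi0_probs, tau)  # per-sample cIPS terms
    r_hat = terms.mean()                                   # empirical cIPS risk
    v_hat = terms.var(ddof=1)                              # empirical variance of cIPS
    return float(r_hat + lam * np.sqrt(v_hat / n))
```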
In the same spirit, (Faury et al., 2020; Sakhi et al., 2020b) generalize SVP using the distributional robustness framework, showing that the CRM principle can be retrieved with a particular choice of divergence, and provide asymptotic coverage results for the true risk. Their objectives are competitive with SVP while providing simple ways to scale its optimization to large datasets.
Another line of research, closer to our work, uses PAC-Bayesian bounds to derive learning objectives in the same fashion as (Swaminathan & Joachims, 2015b). Indeed, (London & Sandler, 2019) introduce the Bayesian CRM, motivating the use of $L_2$ regularization towards the parameter $\theta_0$ of the logging policy $\pi_0$. The analysis uses (McAllester, 2003)'s bound, is conducted on the cIPS estimator and controls the $L_2$ norm by a hyperparameter $\lambda$, giving the following learning objective for parametrized softmax policies:

$$\arg\min_{\theta}\; \left\{\hat{R}^\tau_n(\pi_\theta) + \lambda\,\|\theta - \theta_0\|^2\right\}. \qquad (4)$$

(London & Sandler, 2019) minimize a convex upper bound of objective (4) (obtained by taking a log transform of the policy), which is amenable to stochastic optimization, giving better results than (3) while scaling better with the size of the dataset.
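Similarly, a sketch of objective (4) for a parametrized softmax policy; `softmax_policy` is an assumed helper returning $\pi_\theta(a|x)$, and the squared $L_2$ penalty follows our reading of the regularizer.

```python
import numpy as np

def bayesian_crm_objective(theta, theta_0, contexts, actions, costs, pi0_probs,
                           tau, lam, softmax_policy):
    """Objective (4): clipped IPS risk of pi_theta plus an L2 pull towards theta_0."""
    pi_probs = np.array([softmax_policy(theta, x, a) for x, a in zip(contexts, actions)])
    r_hat = np.mean(costs * pi_probs / np.maximum(pi0_probs, tau))
    return float(r_hat + lam * np.sum((theta - theta_0) ** 2))
```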
Limitations. The principles found in the literature are mainly inspired by generalization bounds. However, the bounds from which these principles are derived either depend on intractable quantities (Swaminathan & Joachims, 2015b) or are not tight enough to be used as-is (London & Sandler, 2019). For example, (Swaminathan & Joachims, 2015b) derive a generalization bound (see Theorem 1 in (Swaminathan & Joachims, 2015b)) for offline policy learning using the notion of covering number. This introduces the complexity measure $Q_{\mathcal{H}}(n, \gamma)$, which cannot be computed (even for simple policy classes), making their bound intractable. This forces the introduction of a hyperparameter $\lambda$ that needs further tuning. Unfortunately, this approach suffers from numerous problems:

• No Theoretical Guarantees. Introducing the hyperparameter $\lambda$ in Equations (3) and (4) gives tractable objectives, but loses the theoretical guarantees given by the initial bounds. These objectives do not necessarily cover the true risk, and optimizing them can lead to policies worse than the logging policy $\pi_0$. Empirical evidence can be found in (Chen et al., 2019; London & Sandler, 2019), where the SVP principle in Equation (3) fails to improve on $\pi_0$.

• Inconsistent Strategy. These principles were first introduced to mitigate the suboptimality of learning with off-policy estimators, deemed untrustworthy for their potentially high variance. The strategy minimizes the objectives for different values $\{\lambda_1, \ldots, \lambda_m\}$, generating a set of policy candidates $\{\pi_{\lambda_1}, \ldots, \pi_{\lambda_m}\}$, from which we select the best policy $\pi_{\lambda^*}$ according to the same untrustworthy, high-variance estimators on a held-out set. This makes the selection strategy inconsistent with what these principles claim to solve.

• Tuning requires additional care. Tuning $\lambda$ needs to be done on a held-out set. This means that we need to train multiple policies (computational burden) on a fraction of the data (data inefficiency), and select the best policy among the candidates using off-policy estimators (variance problem) on the held-out set.

In this paper, we derive a coherent principle that learns policies using all the available data and provides guarantees on their performance, without introducing new hyperparameters.
2.3. Learning With Guaranteed Improvements
Our first concern in most applications is to improve upon the current system $\pi_0$. As $\mathcal{D}_n$ is collected by $\pi_0$, we have access to $R(\pi_0)$, up to a small $O(1/n)$ approximation error. Given a new policy $\pi$, we want to be confident that the improvement $\mathcal{I}(\pi, \pi_0) = R(\pi_0) - R(\pi)$ is positive before deployment.

Let us suppose that we are restricted to a class of policies $\mathcal{H}$, and have access to a generalization bound that gives the following result with high probability over draws of $\mathcal{D}_n$:

$$R(\pi) \le \mathcal{UB}_n(\pi) \quad \forall \pi \in \mathcal{H},$$

with $\mathcal{UB}_n$ an empirical upper bound that depends on $\mathcal{D}_n$. For any $\pi$, we define the guaranteed improvement:

$$\mathcal{GI}_{\mathcal{UB}_n}(\pi, \pi_0) = R(\pi_0) - \mathcal{UB}_n(\pi).$$

We can be sure of improving $R(\pi_0)$ offline if we manage to find $\pi \in \mathcal{H}$ that achieves $\mathcal{GI}_{\mathcal{UB}_n}(\pi, \pi_0) > 0$, as the following result will hold with high probability:

$$\mathcal{I}(\pi, \pi_0) \ge \mathcal{GI}_{\mathcal{UB}_n}(\pi, \pi_0) > 0.$$

To obtain such a policy, we look for the minimizer of $\mathcal{UB}_n$ over the class of policies $\mathcal{H}$:

$$\pi^*_{\mathcal{UB}_n} \in \arg\min_{\pi \in \mathcal{H}} \mathcal{UB}_n(\pi) = \arg\max_{\pi \in \mathcal{H}} \mathcal{GI}_{\mathcal{UB}_n}(\pi, \pi_0).$$

We define the best guaranteed risk and the best guaranteed improvement as follows:

$$\mathcal{GR}^*_{\mathcal{UB}_n} = \mathcal{UB}_n(\pi^*_{\mathcal{UB}_n}), \qquad \mathcal{GI}^*_{\mathcal{UB}_n}(\pi_0) = R(\pi_0) - \mathcal{GR}^*_{\mathcal{UB}_n}.$$
A theoretically grounded strategy to improve $\pi_0$ is therefore to deploy $\pi^*_{\mathcal{UB}_n}$ if we obtain a positive guaranteed improvement $\mathcal{GI}^*_{\mathcal{UB}_n}(\pi_0) > 0$, and otherwise to continue collecting data with the current system $\pi_0$. This strategy will always produce policies that are at least as good as $\pi_0$, optimizes a bound directly over all the data, and requires neither held-out sets nor new hyperparameters.
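The strategy can be summarized by the sketch below, where for illustration we assume a finite set of candidate policies and a generic bound function `upper_bound`; in practice the minimization of $\mathcal{UB}_n$ is carried out over a parametrized policy class.

```python
def should_deploy(risk_pi0, candidate_policies, upper_bound):
    """Offline deployment rule based on a high-probability upper bound UB_n.

    risk_pi0    : R(pi_0), estimated from the logged data
    upper_bound : callable returning UB_n(pi) for a candidate policy pi
    Returns the bound minimizer if its guaranteed improvement is positive, else None.
    """
    best_pi = min(candidate_policies, key=upper_bound)        # pi*_UBn
    guaranteed_improvement = risk_pi0 - upper_bound(best_pi)  # GI*_UBn(pi_0)
    if guaranteed_improvement > 0:
        return best_pi       # confident, guaranteed improvement: deploy
    return None              # otherwise keep collecting data with pi_0
```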
However, the tightness of the bound $\mathcal{UB}_n$ plays an important role. Indeed, if we fix $\mathcal{D}_n$ and $\pi_0$, then $\mathcal{GI}^*_{\mathcal{UB}_n}(\pi_0)$ only depends on the minimum of $\mathcal{UB}_n$, motivating the derivation of bounds that are tractable and tight enough to achieve the smallest minimum possible.

In this regard, we opt for the PAC-Bayesian framework to tackle this problem, as it has been proven to give tractable, non-vacuous bounds even in difficult settings (Dziugaite & Roy, 2017). Its paradigm also fits our application, as we can incorporate information about the previous system $\pi_0$ in the form of a prior; see (Alquier, 2021) for a recent review.
Contributions. We advocate a theoretically grounded strategy that uses generalization bounds to improve $\pi_0$ with guarantees. So far, the existing bounds are either intractable (Swaminathan & Joachims, 2015b) or can be proven to be suboptimal (London & Sandler, 2019). In this work,

• we derive new, tractable and tight generalization bounds using the PAC-Bayesian framework. These bounds are fully tractable, unlike (Swaminathan & Joachims, 2015b)'s bound, and are tighter than (London & Sandler, 2019)'s bound;

• we provide a way to optimize our bounds over a particular class of policies and show empirically that they can guarantee improvement over $\pi_0$ in practical scenarios.
3. Motivating PAC-Bayesian tools
As previously discussed, in the contextual bandit setting, we seek a policy that minimizes the expected cost:

$$R(\pi) = \mathbb{E}_{x \sim \nu,\, a \sim \pi(\cdot|x)}\left[c(x, a)\right].$$

The minimizer of this objective over the unrestricted space of policies is a deterministic decision rule defined by:

$$\forall x, a, \quad \pi^*(a|x) = \mathbb{1}\left[\arg\min_{a'} c(x, a') = a\right].$$

Given a context $x$, this solution always chooses the action with the minimum cost. However, as the function $c$ is generally unknown, we instead learn a parametric score function $f_\theta \in \mathcal{F}_\Theta = \{f_\theta : \mathcal{X} \times [K] \to \mathbb{R},\ \theta \in \Theta\}$ that encodes an action's relevance to a context $x$. Given a function $f_\theta$, we define the decision rule $d_\theta$ by:

$$d_\theta(a|x) = \mathbb{1}\left[\arg\max_{a'} f_\theta(x, a') = a\right].$$
These parametric decision rules will be the building blocks of our analysis. We view stochastic policies as smoothed decision rules, with smoothness induced by distributions over the space of score functions $\mathcal{F}_\Theta$. Given a context $x$, instead of sampling an action $a$ directly from a distribution on the action set, we sample a function $f_\theta$ from a distribution over $\mathcal{F}_\Theta$ and compute the action as $a = \arg\max_{a'} f_\theta(x, a')$. With this interpretation, for any distribution $Q$ on $\mathcal{F}_\Theta$, the probability of an action $a \in \mathcal{A}$ given a context $x \in \mathcal{X}$ is defined as the expected value of $d_\theta$ over $Q$, that is:

$$\pi_Q(a|x) = \mathbb{E}_{\theta \sim Q}\left[d_\theta(a|x)\right] = \mathbb{P}_{\theta \sim Q}\left[\arg\max_{a'} f_\theta(x, a') = a\right].$$
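For intuition, a simple Monte Carlo approximation of this definition; `sample_theta` and `score_fn` are assumed helpers that draw $\theta \sim Q$ and evaluate the $K$ scores $f_\theta(x, \cdot)$.

```python
import numpy as np

def pi_Q_probability(x, a, sample_theta, score_fn, n_samples=10_000):
    """Monte Carlo estimate of pi_Q(a | x) = E_{theta ~ Q}[d_theta(a | x)]."""
    hits = 0
    for _ in range(n_samples):
        theta = sample_theta()                 # draw a score function f_theta ~ Q
        scores = score_fn(theta, x)            # array of K scores f_theta(x, .)
        if int(np.argmax(scores)) == a:        # d_theta plays the argmax action
            hits += 1
    return hits / n_samples
```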
Policies as mixtures of decision rules. This perspective on policies was introduced in (Seldin et al., 2011) and later developed in (London & Sandler, 2019). Constructing policies as mixtures of deterministic decision rules does not restrict the class of policies our study applies to. Indeed, if the family $\mathcal{F}_\Theta$ is rich enough (e.g. neural networks), the following theorem proves that any policy $\pi$ can be written as a mixture of deterministic policies.

Theorem 3.1. Let us fix a policy $\pi$. Then there is a probability distribution $Q_\pi$ on the set of all functions $f : \mathcal{X} \times [K] \to \{0, 1\}$ such that

$$\forall x, a, \quad \pi(a|x) = \mathbb{P}_{f \sim Q_\pi}\left[\arg\max_{a'} f(x, a') = a\right].$$
A formal proof of Theorem 3.1 is given in Appendix A.1. This means that adopting this perspective on policies does not narrow the scope of our study. For a policy $\pi_Q$ defined by a distribution $Q$ over $\mathcal{F}_\Theta$, we observe that, by linearity, its true risk can be written as:

$$R(\pi_Q) = \mathbb{E}_{\theta \sim Q}\left[R(d_\theta)\right].$$

Similarly, clipping the logging propensities in our empirical estimators yields estimators that are linear in $\pi$. For instance, we can estimate the risk of the policy $\pi_Q$ empirically with cvcIPS (as it generalizes cIPS) and obtain:

$$\hat{R}^\tau_n(\pi_Q) = \mathbb{E}_{\theta \sim Q}\left[\hat{R}^\tau_n(d_\theta)\right].$$
By linearity, both the true and the empirical risk of a policy $\pi_Q$ can be interpreted as the average risk of decision rules drawn from the distribution $Q$. This duality is at the heart of our analysis and paves the way to the PAC-Bayesian framework, which studies the generalization properties of the average risk of randomized predictors (Alquier, 2021). If we fix a reference distribution $P$ over $\mathcal{F}_\Theta$ and define the KL divergence from $P$ to $Q$ as:

$$\mathrm{KL}(Q\|P) = \begin{cases} \int \ln\left\{\frac{dQ}{dP}\right\} dQ & \text{if } Q \text{ is absolutely continuous w.r.t. } P, \\ +\infty & \text{otherwise,} \end{cases}$$

we can construct, with the help of PAC-Bayesian tools, bounds holding for the average risk of decision rules under any distribution $Q$:

$$\mathbb{E}_{\theta \sim Q}\left[R(d_\theta)\right] \le \mathbb{E}_{\theta \sim Q}\left[\hat{R}^\tau_n(d_\theta)\right] + O\left(\mathrm{KL}(Q\|P)\right).$$

Our objective is to find tight generalization bounds of this form, as this construction, coupled with the linearity of our objective and estimator, allows us to obtain tight bounds holding for any policy $\pi_Q$:

$$R(\pi_Q) \le \hat{R}^\tau_n(\pi_Q) + O\left(\mathrm{KL}(Q\|P)\right).$$
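The bounds below are driven by this KL term. As an illustration only (this section does not fix $Q$ and $P$), if both are taken to be independent Gaussians over the parameters $\theta$, a common tractable choice, the divergence has a closed form:

```python
import numpy as np

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(Q || P) for independent (diagonal) Gaussian distributions over theta."""
    var_q, var_p = sigma_q ** 2, sigma_p ** 2
    return float(0.5 * np.sum(np.log(var_p / var_q)
                              + (var_q + (mu_q - mu_p) ** 2) / var_p
                              - 1.0))
```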
The PAC-Bayesian Paradigm. Before diving deeper into the analysis, we want to emphasize the similarities between the PAC-Bayesian paradigm and the offline contextual bandit problem. This learning framework proceeds as follows: given a class of functions $\mathcal{F}_\Theta$, we fix a prior (reference distribution) $P$ on $\mathcal{F}_\Theta$ before seeing the data; then we receive some data $\mathcal{D}_n$, which helps us learn a distribution $Q$ over $\mathcal{F}_\Theta$ that is better than our reference $P$. With the previous perspective on policies, the prior $P$, even though it can be any data-free distribution, will be our logging policy (i.e. $\pi_0 = \pi_P$), and we will use the data $\mathcal{D}_n$ to learn a distribution $Q$, and thus a new policy $\pi_Q$ that improves on the logging policy $\pi_0$.
4. PAC-Bayesian Analysis
4.1. Bounds for clipped IPS
The clipped IPS estimator (Bottou et al., 2013) is often studied for offline policy learning (Swaminathan & Joachims, 2015b; London & Sandler, 2019) as it is easy to analyse and has a negative bias (once the cost is negative), facilitating the derivation of learning bounds.

(London & Sandler, 2019) adapted (McAllester, 2003)'s bound to derive their learning objective. We state a slightly tighter version in Proposition 4.1 for the cIPS estimator. The proof of this bound cannot be adapted naively to the cvcIPS estimator because, once $\xi \ne 0$, the bias of the estimator, an intractable quantity that depends on the unknown distribution $\nu$, is no longer negative and needs to be incorporated in the bound, making the bound itself intractable.
Proposition 4.1. Given a prior $P$ on $\mathcal{F}_\Theta$, $\tau \in (0, 1]$ and $\delta \in (0, 1]$, the following bound holds with probability at least $1 - \delta$ uniformly over all distributions $Q$ over $\mathcal{F}_\Theta$:

$$R(\pi_Q) \le \hat{R}^\tau_n(\pi_Q) + \frac{2\left(\mathrm{KL}(Q\|P) + \ln\frac{2\sqrt{n}}{\delta}\right)}{\tau n} + \sqrt{\frac{2\left[\hat{R}^\tau_n(\pi_Q) + \frac{1}{\tau}\right]\left(\mathrm{KL}(Q\|P) + \ln\frac{2\sqrt{n}}{\delta}\right)}{\tau n}}.$$

The upper bound stated in this proposition will be denoted by $\mathcal{LS}^{P,\delta,\tau}_n(\pi_Q)$. When there is no ambiguity, we will drop $P, \delta, \tau$ (all fixed) and simply write $\mathcal{LS}_n(\pi_Q)$.
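A sketch evaluating this bound for fixed empirical quantities (the function name and the $\ln\frac{2\sqrt{n}}{\delta}$ term follow our reconstruction of the statement above):

```python
import numpy as np

def ls_bound(r_hat, kl, n, tau, delta):
    """McAllester-style upper bound of Proposition 4.1, denoted LS_n(pi_Q).

    r_hat : clipped IPS estimate of R(pi_Q) on the n logged samples
    kl    : KL(Q || P)
    """
    slack = kl + np.log(2.0 * np.sqrt(n) / delta)
    return float(r_hat + 2.0 * slack / (tau * n)
                 + np.sqrt(2.0 * (r_hat + 1.0 / tau) * slack / (tau * n)))
```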
(McAllester, 2003)'s bound can give tight results in the $[0, 1]$-bounded loss case when the empirical risk is close to 0, as one then obtains fast convergence rates in $O(1/n)$. However, its use in the case of offline contextual bandits is far from optimal. Indeed, to achieve a fast rate in this context, one needs $\hat{R}^\tau_n(\pi_Q) + \frac{1}{\tau} \approx 0$. This is hardly achievable in practice, especially when $n$ is large and $\tau \ll 1$.

To support this claim, let us suppose that for each context $x$ there is one optimal action $a^*_x$ for which $c(x, a^*_x) = -1$, and that the cost is 0 otherwise. Let us write down the clipped IPS:

$$\hat{R}^\tau_n(\pi) = \frac{1}{n}\sum_{i=1}^{n} \frac{\pi(a_i|x_i)}{\max\{\pi_0(a_i|x_i), \tau\}}\, c_i \ge -\frac{1}{\tau}.$$

To attain equality, we need:

$$\forall i \in [n], \quad c_i = -1, \quad \pi_0(a_i|x_i) \le \tau, \quad \pi(a_i|x_i) = 1.$$
If $n$ is large, the first condition on the costs means that $\pi_0$ is near optimal and the played actions $a_i$ are optimal. In that case, we would have $\forall i \in [n],\ \pi_0(a_i|x_i) \approx 1$, which, combined with the second condition on $\pi_0$, gives $\tau \approx 1$. In practice, $\pi_0$ is never the optimal policy and $\tau \ll 1$, which makes the fast-rate condition $\hat{R}^\tau_n(\pi) + \frac{1}{\tau} \approx 0$ unachievable. In the majority of scenarios, as we penalize $\pi_Q$ to stay close to $\pi_0$ through the KL divergence, we will have $\hat{R}^\tau_n(\pi_Q) \in [-1, 0]$, thus $\hat{R}^\tau_n(\pi_Q) + \frac{1}{\tau} \approx \frac{1}{\tau}$, giving the limiting behaviour:

$$\mathcal{LS}_n(\pi_Q) = \hat{R}^\tau_n(\pi_Q) + O\left(\frac{1}{\tau}\sqrt{\frac{\mathrm{KL}(Q\|P)}{n}}\right).$$
If we want tighter results, we need to look for bounds with better dependencies on $\tau$ and $n$. In our pursuit of a tighter bound, we derive the following result:

Proposition 4.2. Given a prior $P$ on $\mathcal{F}_\Theta$, $\tau \in (0, 1]$ and $\delta \in (0, 1]$, the following bound holds with probability at least $1 - \delta$ uniformly over all distributions $Q$ over $\mathcal{F}_\Theta$:

$$R(\pi_Q) \le \min_{\lambda > 0} \frac{1 - \exp\left(-\tau\lambda \hat{R}^\tau_n(\pi_Q) - \frac{1}{n}\left[\mathrm{KL}(Q\|P) + \ln\frac{2\sqrt{n}}{\delta}\right]\right)}{\tau\left(e^{\lambda} - 1\right)}.$$

We will denote by $\mathcal{C}^{P,\delta,\tau}_n(\pi_Q)$ the upper bound stated in this proposition. When there is no ambiguity, we will drop $P, \delta, \tau$ (all fixed) and simply write $\mathcal{C}_n(\pi_Q)$.
This is a direct application of (Catoni, 2007)'s bound to the bounded-loss cIPS estimator, exploiting the fact that its bias is negative. A full derivation can be found in Appendix A.2. Note that Proposition 4.2 cannot be applied to the cvcIPS estimator ($\xi \ne 0$), as its bias is non-negative and intractable.
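For illustration, the bound of Proposition 4.2 can be evaluated by a one-dimensional search over $\lambda$; in the paper the bound is optimized jointly over the policy, whereas this sketch only evaluates it for fixed empirical quantities (and follows our reconstruction of the statement):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def catoni_bound(r_hat, kl, n, tau, delta):
    """Catoni-style upper bound of Proposition 4.2, denoted C_n(pi_Q)."""
    slack = (kl + np.log(2.0 * np.sqrt(n) / delta)) / n

    def bound(lam):
        # Bound of Proposition 4.2 evaluated at a fixed lambda > 0.
        return (1.0 - np.exp(-tau * lam * r_hat - slack)) / (tau * (np.exp(lam) - 1.0))

    res = minimize_scalar(bound, bounds=(1e-6, 50.0), method="bounded")
    return float(bound(res.x))
```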
To be able to measure the tightness of this bound, we estimate its limiting behaviour to understand its dependency on $\tau$ and $n$. We derive in Appendix A.3 the following result:

$$\mathcal{C}_n(\pi_Q) = \hat{R}^\tau_n(\pi_Q) + O\left(\frac{\mathrm{KL}(Q\|P)}{\tau n}\right).$$
This shows that $\mathcal{C}_n$ has a better dependency on $n$ than $\mathcal{LS}_n$. In fact, we can prove that $\mathcal{C}_n$ always gives tighter results, as the next theorem states that it is smaller than $\mathcal{LS}_n$ in all scenarios.

Theorem 4.3. For any $\mathcal{D}_n \sim (\nu, \pi_0)^n$, any distributions $P, Q$ and any $\tau \in (0, 1]$, $\delta \in (0, 1]$, we have:

$$\mathcal{C}^{P,\delta,\tau}_n(\pi_Q) \le \mathcal{LS}^{P,\delta,\tau}_n(\pi_Q).$$
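As a quick numerical illustration of Theorem 4.3, reusing the `ls_bound` and `catoni_bound` sketches above with purely hypothetical values:

```python
# Hypothetical values for illustration only; C_n should never exceed LS_n.
r_hat, kl, n, tau, delta = -0.45, 25.0, 100_000, 0.1, 0.05
print("LS_n:", ls_bound(r_hat, kl, n, tau, delta))       # looser, McAllester-style
print("C_n: ", catoni_bound(r_hat, kl, n, tau, delta))   # tighter, Catoni-style
```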