Bayesian Optimization with Conformal Prediction Sets

Samuel Stanton^{1,2}, Wesley Maddox^2, Andrew Gordon Wilson^2
^1 Prescient Design, Genentech   ^2 New York University
Abstract

Bayesian optimization is a coherent, ubiquitous approach to decision-making under uncertainty, with applications including multi-armed bandits, active learning, and black-box optimization. Bayesian optimization selects decisions (i.e. objective function queries) with maximal expected utility with respect to the posterior distribution of a Bayesian model, which quantifies reducible, epistemic uncertainty about query outcomes. In practice, subjectively implausible outcomes can occur regularly for two reasons: 1) model misspecification and 2) covariate shift. Conformal prediction is an uncertainty quantification method with coverage guarantees even for misspecified models and a simple mechanism to correct for covariate shift. We propose conformal Bayesian optimization, which directs queries towards regions of search space where the model predictions have guaranteed validity, and investigate its behavior on a suite of black-box optimization tasks and tabular ranking tasks. In many cases we find that query coverage can be significantly improved without harming sample-efficiency.
1 INTRODUCTION

Bayesian optimization (BayesOpt) is a popular strategy to focus data collection towards improving a specific objective, such as discovering useful new materials or drugs (Terayama et al., 2021; Wang and Dowling, 2022). BayesOpt relies on a Bayesian model of the objective (a surrogate model) to select new observations (queries) that maximize the user's expected utility. If the surrogate does not fit the objective well, then the expected utility of new queries may not correspond well at all to their actual utility, leading to little or no improvement in the objective value after many rounds of data collection.
Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS) 2023, Valencia, Spain. PMLR: Volume 206. Copyright 2023 by the author(s).
The most practical way to check how well the surrogate fits the objective is to compute its accuracy on a random held-out subset of the training data. Unfortunately such a holdout set is not at all representative of points we are likely to query, since the goal is to find queries that are better than the training data in some way. In other words, there is feedback covariate shift between the likely query points and the existing training data, which degrades the accuracy of the surrogate (Fannjiang et al., 2022). Even without covariate shift, we cannot guarantee the accuracy of the surrogate predictions at all, and instead can only hope that the predictions are accurate enough to provide a useful signal for data collection.
The crux of the issue is that the coverage (i.e., the frequency with which a prediction set contains the true outcome over many repeated measurements) of Bayes credible prediction sets is directly tied to the correctness of our modeling assumptions, which we cannot entirely control (Datta et al., 2000; Duanmu et al., 2020). We would prefer the price of assumption error to be lost efficiency (i.e., wider prediction sets), rather than poor coverage.
Conformal prediction is a distribution-free uncertainty quantification method which provides prediction sets with reliable coverage under very mild assumptions (Vovk et al., 2005). In particular, conformal prediction can accommodate post hoc covariate shift (i.e., covariate shift that is only known after training the surrogate) and does not assume the surrogate is well-specified (e.g., we could use a linear model on data following a cubic trend). Unfortunately conformal prediction is challenging to use in a BayesOpt algorithm since it is non-differentiable, requires continuous outcomes to be discretized, and needs density ratio estimates for unknown densities. Furthermore, because conformal prediction sets are defined over observable outcomes, they cannot distinguish between epistemic and aleatoric uncertainty, a distinction that is important for effective exploration.
In this work we present conformal Bayesian optimization, with a motivating example in Figure 1. Conformal BayesOpt adjusts how far new queries will move from the training data by choosing an acceptable miscoverage tolerance $\alpha \in (0, 1]$. If $\alpha = 1$ then we recover conventional BayesOpt, but if $\alpha < 1$ then the search will be directed to the region where conformal predictions are guaranteed coverage of at least $(1 - \alpha)$, keeping feedback covariate shift in check and accounting for potential error in modeling assumptions.
[Figure 1 panels: (a) Objective fn. (b) UCB (c) CUCB (d) Normalized density ratio]

Figure 1: A motivating example of feedback covariate shift. We want $x \in [0, 1]^2$ which maximizes the Branin objective (a), starting from 8 examples in the upper right (the black dots). The upper-confidence bound (UCB) acquisition function (b) selects the next query (the red star) far from any training data, where we cannot guarantee reliable predictions. In higher dimensions, we will exhaust our query budget long before covering the whole search space with training data. Given a miscoverage tolerance $\alpha = 1/8$, conformal UCB (c) directs the search to the region where conformal predictions are guaranteed coverage of at least $(1 - \alpha)$. (d) The dashed line is the set of $x$ such that $w(x) \propto p_{\text{query}}(x)/p_{\text{train}}(x)$ is exactly $\alpha$.
In summary, our contributions are as follows:

- We show how to integrate conformal prediction into BayesOpt through the conformal Bayes posterior, with corresponding generalizations of common BayesOpt acquisition functions, enabling the reliable coverage of conformal prediction while still distinguishing between epistemic and aleatoric uncertainty in a principled way.
- An efficient, differentiable implementation of full conformal Bayes for Gaussian process (GP) regression models, which is necessary for effective query optimization, and a practical procedure to estimate the density ratio for BayesOpt query proposal distributions.
- Demonstrations on synthetic black-box optimization tasks and real tabular ranking tasks that conformal BayesOpt has superior sample-efficiency when the surrogate is misspecified and is comparable otherwise, while improving query coverage significantly. Note that while conformal BayesOpt has promising performance, our goal is not primarily to "beat" classical alternatives; rather, we show how to introduce conformal prediction into BayesOpt, and explore the corresponding empirical behavior and results.^1
2 PRELIMINARIES
In this work we will focus on black-box optimization problems of the form $\max_{x \in \mathcal{X}} (f_1(x), \dots, f_d(x))$, where each $f_i : \mathcal{X} \to \mathbb{R}$ is an unknown function of decision variables $x \in \mathcal{X}$, and $d$ is the number of objectives. We do not observe $f$ directly, but instead receive noisy outcomes (i.e. labels) $y \in \mathcal{Y}$ according to some likelihood $p(y|f)$.
^1 Code is available at github.com/samuelstanton/conformal-bayesopt.git
2.1 Bayesian optimization
BayesOpt alternates between inference and selection, inferring the expected utility of potential query points from available data, which then serves as a proxy objective to select the next batch of observations, which are fed back into the inference procedure, completing one iteration of a repeating loop (Brochu et al., 2010; Frazier, 2018). Inference consists of applying Bayes' rule to a prior $p(f)$, a likelihood $p(y|f)$, and a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=0}^{n-1}$ to obtain a Bayes posterior $p(f|\mathcal{D})$. The expected utility of $x$ is given by an acquisition function $a : \mathcal{X} \to \mathbb{R}$ with the general form

$$a(x, \mathcal{D}) = \int u(x, f, \mathcal{D}) \, p(f|\mathcal{D}) \, df, \qquad (1)$$

where $u$ is a user-specified utility function. For example, taking $u(x, f, \mathcal{D}) = [f(x) - \max_{y_i \in \mathcal{D}} y_i]_+$, where $[\cdot]_+ = \max(\cdot, 0)$, yields the expected improvement (EI) acquisition function (Jones et al., 1998). Since $a$ is the Bayes posterior expectation of $u$, maximizers $x^* = \mathrm{argmax}_{x \in \mathcal{X}} \, a(x, \mathcal{D})$ are Bayes-optimal with respect to $u$. Bayes-optimality means a decision is coherent with our posterior beliefs about $f$. We think of $\mathrm{argmax}_{x \in \mathcal{X}} \, a(x, \mathcal{D})$ as inducing a distribution $p_{\text{query}}(x) = p(x|\mathcal{D}) \propto \exp\{a(x, \mathcal{D})\}$ (Levine, 2018).
2.2 Bayesian inference and model misspecification
One way to assess the quality of our posterior beliefs is to check the coverage of the corresponding Bayes $\beta$-credible prediction sets, which are subsets $K_\beta(x) \subseteq \mathcal{Y}$ satisfying

$$\beta = \int_{y \in K_\beta(x)} \int p(y|f(x)) \, p(f|\mathcal{D}) \, df \, dy, \qquad (2)$$

where $\beta \in (0, 1]$ is the level of subjective credibility (Gneiting et al., 2007). $\beta$-credible sets may exhibit poor coverage, meaning the frequency of "implausible" events outside the set happening is much more than $1 - \beta$ (Bachoc, 2013).
[Figure 2 panels: (a) conformal scores $s$ (b) imp. weights $w$ (c) IW partial sums $\bar{w}$ (d) $C_\alpha(x_n)$]

Figure 2: Constructing a conformal prediction set $C_\alpha(x_n)$ in the regression setting. First, (a) we choose some $\hat{y}_n \in \mathcal{Y}$ and guess $y_n = \hat{y}_n$, computing conformal scores $s$ of $\{(x_0, y_0), \dots, (x_{n-1}, y_{n-1}), (x_n, \hat{y}_n)\}$. (b) We note which examples score better than our guess (shown in blue), and mask out the corresponding importance weights $w$. (c) We compute the partial sums $\bar{w}$ of the masked importance weights, adding $\hat{y}_n$ to $C_\alpha(x_n)$ if $\bar{w}_n > \alpha$. (d) Repeat steps (a)-(c) for many guesses of $\hat{y}_n$. Rejected and accepted guesses are shaded light and dark, respectively.
Note that poor coverage does not necessarily imply that $K_\beta(x)$ was computed incorrectly; it may simply indicate a faulty assumption.
For example, in BayesOpt it is very common to assume $f \sim \mathcal{GP}(0, \kappa)$, where $\kappa$ is a Matérn kernel. Matérn kernels support functions that are at least once differentiable, and can struggle to model objectives with discontinuities. As another example, we typically do not know the true likelihood $p(y|f)$, and often choose a simple homoscedastic likelihood $p(y|f) = \mathcal{N}(f, \sigma^2 I_d)$ for convenience. In reality the true noise process may be correlated with $x$, across objectives, or may not be Gaussian at all (Assael et al., 2014; Griffiths et al., 2019; Makarova et al., 2021). These examples are common instances of model misspecification.

In practice faulty assumptions are nearly inevitable, and they are not always harmful, since simplifying assumptions can confer significant practical benefits. Indeed, theoretical convergence rates for acquisition functions like UCB (Srinivas et al., 2010) suggest that for BayesOpt we want to use the smoothest possible model, subject to the constraint that we can still model $f$ sufficiently well. Similarly, Gaussian likelihoods have significant computational advantages, and there is no guarantee that constructing a task-specific likelihood for every optimization problem would be worth the effort. We propose accepting that some assumption error will always be present, and instead focus on how to alter BayesOpt to accommodate imperfect models.
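As a concrete illustration of the coverage being checked here, the following sketch estimates the empirical coverage of central Gaussian $\beta$-credible intervals on held-out data. The predictive mean and standard deviation are assumed to come from the surrogate; the function and argument names are hypothetical.

```python
import numpy as np
from scipy import stats

def credible_interval_coverage(pred_mean, pred_std, y_holdout, beta=0.95):
    """Empirical coverage of central Gaussian beta-credible intervals
    K_beta(x) = [mean - z * std, mean + z * std] on held-out labels.
    If the model is well specified the returned frequency should be
    close to beta; misspecification often drives it well below beta."""
    z = stats.norm.ppf(0.5 * (1.0 + beta))   # half-width in predictive std units
    covered = np.abs(y_holdout - pred_mean) <= z * pred_std
    return covered.mean()
```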
2.3 Conformal prediction
See Shafer and Vovk (2008) for a complete tutorial on conformal prediction, or Angelopoulos and Bates (2021) for a modern, accessible introduction. Informally, a conformal prediction set $C_\alpha(x_n) \subset \mathcal{Y}$ is a set of possible labels for a test point $x_n$ given a sequence of observations $\mathcal{D}$. Candidate labels $\hat{y}_n$ are included in $C_\alpha(x_n)$ if the resulting pair $(x_n, \hat{y}_n)$ is sufficiently similar to the actual examples in $\mathcal{D}$. The degree of similarity is measured by a score function $s$ and importance weights (IWs) $w$, and the similarity threshold is determined by the miscoverage tolerance $\alpha$. In Figure 2 we visualize the process of constructing $C_\alpha(x_n)$.
Conformal prediction is a distribution-free uncertainty quantification method because it does not assume $\mathcal{D}$ is drawn from any particular distribution, nor does it assume $s$ is derived from a well-specified model. In our context the critical assumption is that $\mathcal{D} \cup \{(x_n, y_n)\}$ is pseudo-exchangeable. Fannjiang et al. (2022) provide a formal statement of pseudo-exchangeability (Definition 2), and prove a coverage guarantee for conformal prediction sets in the special case when $\mathcal{D}$ is IID from $p(x)p(y|x)$ and $p(x|\mathcal{D})$ is invariant to shuffling of $\mathcal{D}$, which we restate below:
Definition 2.1. Let $\mathcal{D} \sim p(x)p(y|x)$ and $(x_n, y_n) \sim p(x|\mathcal{D})p(y|x)$, where $p(x|\mathcal{D})$ is chosen such that $\mathcal{D} \cup \{(x_n, y_n)\}$ is pseudo-exchangeable. Given $w_i \propto p(x_i|\hat{\mathcal{D}}_i)/p(x_i)$ s.t. $\sum_i w_i = 1$, and $\alpha \in (0, 1]$, the conformal prediction set corresponding to score function $s$ is

$$C_\alpha(x_n) := \{\hat{y}_n \in \mathcal{Y} \mid h^\top w > \alpha\}, \qquad (3)$$
$$\text{where } h_i := \mathbb{1}\{s(x_i, y_i, \hat{\mathcal{D}}) \le s(x_n, \hat{y}_n, \hat{\mathcal{D}})\},$$

$\hat{\mathcal{D}} = \mathcal{D} \cup \{(x_n, \hat{y}_n)\}$, and $p(x|\hat{\mathcal{D}}_i)$ is the query proposal density given training data $\hat{\mathcal{D}}_i = \hat{\mathcal{D}} - \{(x_i, y_i)\}$. The importance weights $w$ account for covariate shift (Tibshirani et al., 2019), and $w_i = 1/(n+1) \ \forall i$ in the special case where $\mathcal{D} \cup \{(x_n, y_n)\}$ is fully exchangeable (e.g. IID).
Conformal prediction enjoys a frequentist marginal coverage guarantee on $C_\alpha(x_n)$ with respect to the joint distribution over $\mathcal{D} \cup \{(x_n, y_n)\}$,

$$\mathbb{P}[y_n \in C_\alpha(x_n)] \ge 1 - \alpha, \qquad (4)$$

meaning if we repeatedly draw $\mathcal{D} \sim p(x)p(y|x)$ and $(x_n, y_n) \sim p(x|\mathcal{D})p(y|x)$, then $C_\alpha(x_n)$ will contain the observed label $y_n$ with frequency at least $(1 - \alpha)$. A prediction set with a coverage guarantee like Eq. (4) is conservatively valid at the $1 - \alpha$ level. In Appendix B.1 we discuss randomized conformal prediction, which is exactly valid, meaning the long-run frequency of errors converges to exactly $\alpha$.
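For illustration, the membership test in Eq. (3) is only a few lines of code once the scores and normalized importance weights are in hand. The sketch below is an exact (non-differentiable) version with a generic conformity score; the function and argument names are our own.

```python
import numpy as np

def in_conformal_set(train_scores, test_score, weights, alpha):
    """Exact membership test for Eq. (3).

    train_scores: conformity scores s(x_i, y_i, D-hat) for the n training
                  points (higher = more conforming, e.g. log-likelihood).
    test_score:   s(x_n, y-hat_n, D-hat) for the candidate label.
    weights:      importance weights over all n + 1 points, summing to 1,
                  with weights[-1] belonging to the test point.
    Returns True iff y-hat_n is accepted into C_alpha(x_n)."""
    scores = np.append(train_scores, test_score)
    h = (scores <= test_score).astype(float)   # h_i = 1{s_i <= s_n}
    return float(h @ weights) > alpha
```

In the fully exchangeable special case, setting `weights = np.full(n + 1, 1.0 / (n + 1))` recovers ordinary full conformal prediction.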
Marginal coverage is distinct from conditional coverage, since it does not guarantee the coverage of $C_\alpha$ for any specific $x_n$, only the average coverage over the whole domain $\mathcal{X}$.
Full conformal Bayes corresponds to the score function $s(x_i, y_i, \hat{\mathcal{D}}) = \log p(y_i|x_i, \hat{\mathcal{D}})$. Conditioning an existing posterior $p(y|x_i, \mathcal{D})$ on the additional observation $(x_n, \hat{y}_n)$ is commonly referred to as "retraining" in the conformal prediction literature. If the surrogate just so happens to be correctly specified (e.g. $f \sim p(f)$), then $\log p(y_i|x_i, \hat{\mathcal{D}})$ is the optimal choice of score function, meaning it provides the most efficient prediction sets (i.e. smallest by expected volume w.r.t. the prior $p(f)$) among all prediction sets that are valid at the $1 - \alpha$ level (Hoff, 2021). In the typical situation where we think our model assumptions are plausible but do not really believe them, full conformal Bayes rewards us if our assumptions turn out to be correct, yet it produces valid predictions as long as the true data generation process is some pseudo-exchangeable sequence.
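With a Gaussian predictive, as in GP regression, the full conformal Bayes score is just a Gaussian log-density evaluation. A minimal sketch, assuming the predictive has already been conditioned ("retrained") on the augmented dataset $\hat{\mathcal{D}}$:

```python
from scipy import stats

def full_conformal_bayes_score(y, pred_mean, pred_std):
    """s(x_i, y_i, D-hat) = log p(y_i | x_i, D-hat) for a Gaussian
    predictive; pred_mean and pred_std must come from the posterior
    already conditioned on D-hat = D + {(x_n, y-hat_n)}."""
    return stats.norm.logpdf(y, loc=pred_mean, scale=pred_std)
```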
BayesOpt and pseudo-exchangeability: unfortunately, if $\mathcal{D}$ is collected by some active online selection strategy such as BayesOpt, then $\mathcal{D}$ is not IID and the coverage guarantee for Defn. 2.1 does not apply. Note that even large offline datasets are not guaranteed to be IID, so the same issue may arise even in the single-round design setting considered by Fannjiang et al. (2022). Despite the gap in theory, in this work we investigate the technical challenges associated with incorporating conformal prediction sets into BayesOpt, and find that empirically they can still improve query coverage (Section 5.3). In Appendix A.1 we include further discussion of the assumptions and limitations of conformal prediction.
3 RELATED WORK
Conformal prediction: Our work is related to Fannjiang et al. (2022), who propose a black-box optimization method based on conformal prediction specifically to address feedback covariate shift. However, because they assume new queries are drawn from a closed-form proposal distribution, and because exact conformal prediction is not differentiable, their approach cannot be easily extended to most BayesOpt methods.^2 Bai et al. (2022) propose a differentiable approximation of conformal prediction, but it requires solving a minimax optimization subproblem. Stutz et al. (2021) independently proposed a continuous relaxation of conformal prediction similar to ours, but only for fully exchangeable classification data. We propose a more general form that allows for covariate shift, and we also provide an efficient discretization procedure for regression and show how to estimate the importance weights when the queries are drawn from an implicit density.

^2 BayesOpt proposal distributions are usually implicit, obtained through gradient-based optimization of the acquisition function.
Robust BayesOpt: There is a substantial body of work on adaptation to model misspecification in the bandit setting (i.e. discrete actions), e.g. Lattimore et al. (2020) and Foster et al. (2020); however, we are primarily focused on problems with continuous decisions. Since the seminal analysis of GP-UCB regret bounds by Srinivas et al. (2010), follow-up work has proposed UCB variants for misspecified likelihoods (Makarova et al., 2021), misspecified GP priors (Bogunovic and Krause, 2021), or to guarantee $f(x_i) > c$ for some threshold $c \in \mathbb{R}$ (Sui et al., 2015). These approaches are not easy to extend to other acquisition functions, and tend to rely on fairly strong assumptions on the smoothness of $f$ or fix a specific kind of model misspecification.^3 Wang et al. (2018) prove regret bounds for GP-UCB when $f$ is drawn from a GP with unknown mean and kernel functions, but assume we know the right hypothesis class of mean and kernel functions and have a collection of offline datasets available for pretraining.
Finally, Eriksson et al. (2019) propose TuRBO, which is superficially similar to conformal BayesOpt since it constrains queries to a Latin hypercube trust region around the best known local optimum. While TuRBO can be very effective in practice, the size of the trust region is controlled by a heuristic with five hyperparameters in the single-objective case, and even more in the multi-objective case (Daulton et al., 2022b). Despite the additional complexity, the credible set coverage on queries in TuRBO trust regions can still vary wildly (see Section 5.3). In contrast, conformal prediction provides distribution-free coverage guarantees under very mild assumptions, and our approach can be applied to any reparameterizable acquisition function (Wilson et al., 2017). To our knowledge our approach is the first BayesOpt procedure to incorporate conformal prediction.
4 METHOD
We now describe the key ideas behind conformal Bayesian optimization. First, in Section 4.1 we show how to efficiently compute $C_\alpha(x_n)$, addressing differentiability and discretization of continuous outcomes. Our procedure is summarized in Algorithm 1. In Section 4.2 we introduce the conformal Bayes posterior $p_\alpha(f(x_n)|\mathcal{D})$, allowing us to distinguish between aleatoric and epistemic uncertainty and to combine conformal prediction with many well-known BayesOpt utility functions. Finally, in Section 4.3 we address feedback covariate shift without requiring closed-form expressions for $p(x)$ and $p(x_i|\hat{\mathcal{D}}_i)$. In Appendix D.1 we provide a detailed overview of the whole method, along with a discussion of the computational complexity in Appendix D.4.
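Although the details are deferred to Section 4.3, one standard way to estimate a density ratio such as $p(x_i|\hat{\mathcal{D}}_i)/p(x_i)$ without closed-form densities is to train a probabilistic classifier to distinguish samples from the two distributions and read the ratio off its predicted probabilities. The sketch below illustrates that generic recipe; it is not necessarily the exact procedure used in this paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_density_ratio(x_train, x_query):
    """Classifier-based density ratio estimation. If c(x) = P(query | x)
    for a classifier trained on {train: 0, query: 1} labels, Bayes' rule
    gives p_query(x) / p_train(x) = (n_train / n_query) * c(x) / (1 - c(x))."""
    X = np.vstack([x_train, x_query])
    z = np.concatenate([np.zeros(len(x_train)), np.ones(len(x_query))])
    clf = LogisticRegression(max_iter=1000).fit(X, z)

    def ratio(x):
        c = clf.predict_proba(np.atleast_2d(x))[:, 1]
        prior_correction = len(x_train) / len(x_query)
        return prior_correction * c / np.clip(1.0 - c, 1e-12, None)

    return ratio
```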
^3 For example, it is commonly assumed that $f$ has bounded RKHS norm w.r.t. the chosen GP kernel, that we know a good bound in order to set hyperparameters correctly, and that $f$ is Lipschitz continuous.
Figure 3: Constructing full conformal Bayes prediction sets starting with a Bayes posterior $p(\hat{y}|x, \mathcal{D})$. In this example $\mathcal{D}$ is composed of $n = 27$ noisy observations (shown as black dots) of the true objective (shown as a black dashed line) and $\alpha = 1 - \beta = 0.19$. In panel (a) we show $K_\beta(x)$, the $\beta$-credible prediction set. In panel (b) we show $\mathcal{Y}_{\text{cand}}$ populated by samples from $p(\hat{y}|x, \mathcal{D})$. In panel (c) we show $C_\alpha(x)$, the conformal prediction set. The coverage of $C_\alpha(x)$ is noticeably better than $K_\beta(x)$ in regions where there is little training data, though the nominal confidence level is the same.
Algorithm 1: Differentiable conformal prediction masks

Data: train data $\mathcal{D} = \{(x_i, y_i)\}_{i=0}^{n-1}$, test point $x_n$, importance weights $w$, label candidates $\mathcal{Y}_{\text{cand}}$, score function $s$, miscoverage tolerance $\alpha$, relaxation strength $\tau > 0$.

$m_j \leftarrow 0, \ \forall j \in \{0, \dots, k - 1\}$
for $\hat{y}_j \in \mathcal{Y}_{\text{cand}}$ do
    $\hat{\mathcal{D}} \leftarrow \mathcal{D} \cup \{(x_n, \hat{y}_j)\}$
    $s \leftarrow [s(x_0, y_0, \hat{\mathcal{D}}) \ \cdots \ s(x_n, \hat{y}_j, \hat{\mathcal{D}})]$
    $h \leftarrow \mathrm{sigmoid}(\tau^{-1}(s_n - s))$
    $\bar{w} \leftarrow h \odot w$
    $m_j \leftarrow \mathrm{sigmoid}(\tau^{-1}(\sum_i \bar{w}_i - \alpha))$
end
Result: $m$
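A minimal PyTorch sketch of Algorithm 1 is given below, assuming the retrained scores for every candidate have already been collected into a tensor (the loop over $\mathcal{Y}_{\text{cand}}$ is vectorized away); the function and argument names are our own.

```python
import torch

def conformal_masks(scores, weights, alpha, tau):
    """Differentiable relaxation of Eq. (3), following Algorithm 1.

    scores:  (k, n + 1) tensor; row j holds the retrained scores
             s(x_0, y_0, D-hat), ..., s(x_n, y-hat_j, D-hat) for
             candidate y-hat_j (the last column is the test score s_n).
    weights: (k, n + 1) tensor of normalized importance weights.
    Returns m, a (k,) tensor of soft membership probabilities."""
    s_n = scores[:, -1:]                              # test score per candidate
    h = torch.sigmoid((s_n - scores) / tau)           # relaxes 1{s_i <= s_n}
    w_bar = h * weights                               # masked importance weights
    total = w_bar.sum(dim=-1)                         # relaxes h^T w
    m = torch.sigmoid((total - alpha) / tau)          # relaxes 1{h^T w > alpha}
    return m
```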
4.1 Full conformal Bayes with Gaussian processes
Efficient retraining: full conformal Bayes requires us to compute $\log p(y_i|x_i, \hat{\mathcal{D}})$ $\forall i$ and $\hat{y}_j \in \mathcal{Y}_{\text{cand}}$, where $\mathcal{Y}_{\text{cand}}$ is some discretization of $\mathcal{Y}$. This incremental posterior update can be done very efficiently if the surrogate is a GP regression model (Gardner et al., 2018; Stanton et al., 2021; Maddox et al., 2021), and we will later reuse the conditioned posteriors to estimate expectations w.r.t. $p_\alpha(f(x)|\mathcal{D})$. Note that computing the GP posterior likelihood of training data can be numerically unstable, which we address in Appendix D. Other Bayesian predictive posteriors (e.g. from Bayesian neural networks) are conditioned on training data via iterative methods such as gradient descent, making full conformal Bayes very expensive (Fong and Holmes, 2021).
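As an illustration of the incremental update, GPyTorch exposes low-rank ("fantasy") conditioning for exact GPs; the sketch below uses it to produce the retrained log-density scores for a single candidate label. The `get_fantasy_model` call and the surrounding shape conventions are assumptions to verify against your installed GPyTorch version.

```python
import torch

def retrained_log_scores(model, likelihood, train_x, train_y, x_n, y_hat):
    """Condition a fitted GPyTorch ExactGP on one extra observation
    (x_n, y_hat) via a low-rank fantasy update, then return the marginal
    scores s_i = log p(y_i | x_i, D-hat) for every example in D-hat."""
    model.eval(); likelihood.eval()
    fantasy = model.get_fantasy_model(x_n.unsqueeze(0), y_hat.unsqueeze(0))
    xs = torch.cat([train_x, x_n.unsqueeze(0)])
    ys = torch.cat([train_y, y_hat.unsqueeze(0)])
    pred = likelihood(fantasy(xs))                     # predictive p(y | x, D-hat)
    marginals = torch.distributions.Normal(pred.mean, pred.stddev)
    return marginals.log_prob(ys)                      # one score per example
```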
Differentiable prediction masks: the definition of $C_\alpha(x_n)$ in Eq. (3) can be broken down into a sequence of simple vector operations interspersed with Heaviside functions. The Heaviside function is piecewise constant, with ill-defined derivatives, so we replace it with its continuous relaxation, the sigmoid function (Algorithm 1). Informally, the output $m_j$ of the final sigmoid can be interpreted as the probability of accepting some $\hat{y}_j$ into $C_\alpha(x_n)$. The smoothness of the relaxation is controlled by a single hyperparameter $\tau \in (0, +\infty)$. As $\tau \to 0$ the relaxation becomes exact but the gradients become very poorly behaved.
Efficient discretization of $\mathcal{Y}$: now we need a good way to choose $\mathcal{Y}_{\text{cand}}$. When $y$ is low-dimensional (e.g. sequential, single-objective tasks), then $\mathcal{Y}_{\text{cand}}$ can be a dense grid; however, dense grids are inefficient since they must be wide enough to capture all possible values of $y$ and dense enough to pinpoint the boundary of $C_\alpha(x_n)$. Even if we do not fully believe $p(y|x_n, \mathcal{D})$, it is still our best guess of where $y|x_n$ should be, so instead of a dense grid we populate $\mathcal{Y}_{\text{cand}}$ with proposals $\hat{y}_j \sim p(y|x_n, \mathcal{D})$, as sketched below. This approach not only reduces computational effort for low-dimensional $y$, it also allows us to extend to tasks with multiple objectives and batched queries (Appendix B.6). In Figure 3 we visualize the computation of a conformal Bayes prediction set.
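A minimal sketch of this sampling-based discretization, assuming a Gaussian predictive at the test point (the argument names are hypothetical):

```python
import torch

def sample_label_candidates(pred_mean, pred_std, k=64):
    """Populate Y_cand with k proposals y-hat_j ~ p(y | x_n, D), here
    assumed Gaussian. Unlike a fixed dense grid, the candidates track
    the location and scale of the predictive automatically."""
    predictive = torch.distributions.Normal(pred_mean, pred_std)
    return predictive.sample((k,))   # shape (k,): label candidates
```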
4.2 Conformal acquisition functions
For the sake of clarity, in the following sections we will omit the subscript from $x_n$. By the sum rule of probability, we can rewrite $p(f(x)|\mathcal{D})$ as an integral over all possible outcomes $y|x$,

$$p(f(x)|\mathcal{D}) = \int_{\hat{y} \in \mathcal{Y}} p(f(x)|\hat{\mathcal{D}}) \, p(\hat{y}|x, \mathcal{D}) \, d\hat{y}. \qquad (5)$$

In other words, $p(f(x)|\mathcal{D})$ can be seen as a Bayesian model average, where we condition each component model on a different potential observation $(x, \hat{y})$, and weight the components by $p(\hat{y}|x, \mathcal{D})$.

We are free to change the component weights to any other valid distribution over $\hat{y}$ we like. Now we introduce the conformal Bayes predictive posterior $p_\alpha(\hat{y}|x, \mathcal{D})$,

$$p_\alpha(\hat{y}|x, \mathcal{D}) := \begin{cases} (1 - \alpha)/Z_1 & \text{if } \hat{y} \in C_\alpha(x), \\ \alpha \, p(\hat{y}|x, \mathcal{D})/Z_2 & \text{otherwise,} \end{cases}$$
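In code, approximating the integral over $\hat{y}$ with the candidates in $\mathcal{Y}_{\text{cand}}$, the definition above amounts to a reweighting of the component models. A minimal sketch, assuming $Z_1$ and $Z_2$ normalize each branch so that total mass $(1 - \alpha)$ falls inside $C_\alpha(x)$ and $\alpha$ outside, and using the soft masks $m_j$ from Algorithm 1 in place of exact membership:

```python
import torch

def conformal_bayes_weights(m, log_p, alpha, eps=1e-12):
    """Discrete approximation of p_alpha(y-hat | x, D) over Y_cand.

    m:     (k,) soft conformal membership masses from Algorithm 1.
    log_p: (k,) log p(y-hat_j | x, D) under the Bayes predictive.
    Mass (1 - alpha) is shared uniformly among candidates inside
    C_alpha(x); the remaining alpha is spread over the rest in
    proportion to the Bayes predictive."""
    p = log_p.exp()
    inside = m / m.sum().clamp_min(eps)                           # ~uniform over C_alpha(x)
    outside = ((1 - m) * p) / ((1 - m) * p).sum().clamp_min(eps)
    return (1.0 - alpha) * inside + alpha * outside               # sums to 1
```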