Bayesian Optimization with Conformal Prediction Sets

Samuel Stanton^{1,2}, Wesley Maddox^2, Andrew Gordon Wilson^2
^1 Prescient Design, Genentech   ^2 New York University
Abstract

Bayesian optimization is a coherent, ubiquitous approach to decision-making under uncertainty, with applications including multi-armed bandits, active learning, and black-box optimization. Bayesian optimization selects decisions (i.e. objective function queries) with maximal expected utility with respect to the posterior distribution of a Bayesian model, which quantifies reducible, epistemic uncertainty about query outcomes. In practice, subjectively implausible outcomes can occur regularly for two reasons: 1) model misspecification and 2) covariate shift. Conformal prediction is an uncertainty quantification method with coverage guarantees even for misspecified models and a simple mechanism to correct for covariate shift. We propose conformal Bayesian optimization, which directs queries towards regions of search space where the model predictions have guaranteed validity, and investigate its behavior on a suite of black-box optimization tasks and tabular ranking tasks. In many cases we find that query coverage can be significantly improved without harming sample-efficiency.
1 INTRODUCTION

Bayesian optimization (BayesOpt) is a popular strategy to focus data collection towards improving a specific objective, such as discovering useful new materials or drugs (Terayama et al., 2021; Wang and Dowling, 2022). BayesOpt relies on a Bayesian model of the objective (a surrogate model) to select new observations (queries) that maximize the user's expected utility. If the surrogate does not fit the objective well, then the expected utility of new queries may not correspond well at all to their actual utility, leading to little or no improvement in the objective value after many rounds of data collection.
Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS) 2023, Valencia, Spain. PMLR: Volume 206. Copyright 2023 by the author(s).
The most practical way to check how well the surrogate fits the objective is to compute its accuracy on a random held-out subset of the training data. Unfortunately such a holdout set is not at all representative of points we are likely to query, since the goal is to find queries that are better than the training data in some way. In other words, there is feedback covariate shift between the likely query points and the existing training data, which degrades the accuracy of the surrogate (Fannjiang et al., 2022). Even without covariate shift, we cannot guarantee the accuracy of the surrogate predictions at all, and instead can only hope that the predictions are accurate enough to provide a useful signal for data collection.
The crux of the issue is that the coverage (i.e., the frequency with which a prediction set contains the true outcome over many repeated measurements) of Bayes credible prediction sets is directly tied to the correctness of our modeling assumptions, which we cannot entirely control (Datta et al., 2000; Duanmu et al., 2020). We would prefer the price of assumption error to be lost efficiency (i.e., wider prediction sets), rather than poor coverage.
Conformal prediction is a distribution-free uncertainty quantification method which provides prediction sets with reliable coverage under very mild assumptions (Vovk et al., 2005). In particular, conformal prediction can accommodate post hoc covariate shift (i.e., covariate shift that is only known after training the surrogate) and does not assume the surrogate is well-specified (e.g., we could use a linear model on data following a cubic trend). Unfortunately conformal prediction is challenging to use in a BayesOpt algorithm since it is non-differentiable, requires continuous outcomes to be discretized, and needs density ratio estimates for unknown densities. Furthermore, because conformal prediction sets are defined over observable outcomes, they cannot distinguish between epistemic and aleatoric uncertainty, a distinction that is important for effective exploration.
In this work we present conformal Bayesian optimization, with a motivating example in Figure 1. Conformal BayesOpt adjusts how far new queries will move from the training data by choosing an acceptable miscoverage tolerance $\alpha \in (0, 1]$. If $\alpha = 1$ then we recover conventional BayesOpt, but if $\alpha < 1$ then the search will be directed to the region where conformal predictions are guaranteed coverage of at least $(1 - \alpha)$, keeping feedback covariate shift in check and accounting for potential error in modeling assumptions.
[Figure 1 panels: (a) Objective fn. (b) UCB (c) CUCB (d) Normalized density ratio]

Figure 1: A motivating example of feedback covariate shift. We want $x \in [0, 1]^2$ which maximizes the Branin objective (a), starting from 8 examples in the upper right (the black dots). The upper-confidence bound (UCB) acquisition function (b) selects the next query (the red star) far from any training data, where we cannot guarantee reliable predictions. In higher dimensions, we will exhaust our query budget long before covering the whole search space with training data. Given a miscoverage tolerance $\alpha = 1/8$, conformal UCB (c) directs the search to the region where conformal predictions are guaranteed coverage of at least $(1 - \alpha)$. (d) The dashed line is the set of $x$ such that $w(x) \propto p_{\text{query}}(x)/p_{\text{train}}(x)$ is exactly $\alpha$.
In summary, our contributions are as follows:

- We show how to integrate conformal prediction into BayesOpt through the conformal Bayes posterior, with corresponding generalizations of common BayesOpt acquisition functions, enabling the reliable coverage of conformal prediction while still distinguishing between epistemic and aleatoric uncertainty in a principled way.
- An efficient, differentiable implementation of full conformal Bayes for Gaussian process (GP) regression models, which is necessary for effective query optimization, and a practical procedure to estimate the density ratio for BayesOpt query proposal distributions.
- Demonstrations on synthetic black-box optimization tasks and real tabular ranking tasks that conformal BayesOpt has superior sample-efficiency when the surrogate is misspecified and is comparable otherwise, while improving query coverage significantly. Note that while conformal BayesOpt has promising performance, our goal is not primarily to "beat" classical alternatives; rather, we show how to introduce conformal prediction into BayesOpt, and explore the corresponding empirical behavior and results.^1
2 PRELIMINARIES
In this work we will focus on black-box optimization problems of the form $\max_{x \in \mathcal{X}} (f_1(x), \dots, f_d(x))$, where each $f_i : \mathcal{X} \to \mathbb{R}$ is an unknown function of decision variables $x \in \mathcal{X}$, and $d$ is the number of objectives. We do not observe $f$ directly, but instead receive noisy outcomes (i.e. labels) $y \in \mathcal{Y}$ according to some likelihood $p(y|f)$.
^1 Code is available at github.com/samuelstanton/conformal-bayesopt.git
2.1 Bayesian optimization
BayesOpt alternates between inference and selection, inferring the expected utility of potential query points from available data, which then serves as a proxy objective to select the next batch of observations, which are fed back into the inference procedure, completing one iteration of a repeating loop (Brochu et al., 2010; Frazier, 2018). Inference consists of applying Bayes' rule to a prior $p(f)$, a likelihood $p(y|f)$, and a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=0}^{n-1}$ to obtain a Bayes posterior $p(f|\mathcal{D})$. The expected utility of $x$ is given by an acquisition function $a : \mathcal{X} \to \mathbb{R}$ with the general form

$$a(x, \mathcal{D}) = \int u(x, f, \mathcal{D}) \, p(f|\mathcal{D}) \, df, \qquad (1)$$

where $u$ is a user-specified utility function. For example, taking $u(x, f, \mathcal{D}) = [f(x) - \max_{y_i \in \mathcal{D}} y_i]_+$, where $[\cdot]_+ = \max(\cdot, 0)$, yields the expected improvement (EI) acquisition function (Jones et al., 1998). Since $a$ is the Bayes posterior expectation of $u$, maximizers $x^* = \mathrm{argmax}_{x \in \mathcal{X}} \, a(x, \mathcal{D})$ are Bayes-optimal with respect to $u$. Bayes-optimality means a decision is coherent with our posterior beliefs about $f$. We think of $\mathrm{argmax}_{x \in \mathcal{X}} \, a(x, \mathcal{D})$ as inducing a distribution $p_{\text{query}}(x) = p(x|\mathcal{D}) \propto \exp\{a(x, \mathcal{D})\}$ (Levine, 2018).
2.2 Bayesian inference and model misspecification
One way to assess the quality of our posterior beliefs is to check the coverage of the corresponding Bayes $\beta$-credible prediction sets, which are subsets $K_\beta(x) \subseteq \mathcal{Y}$ satisfying

$$\beta = \int_{y \in K_\beta(x)} \int p(y|f(x)) \, p(f|\mathcal{D}) \, df \, dy, \qquad (2)$$

where $\beta \in (0, 1]$ is the level of subjective credibility (Gneiting et al., 2007). $\beta$-credible sets may exhibit poor coverage, meaning the frequency of "implausible" events outside the set happening is much more than $1 - \beta$ (Bachoc, 2013).
[Figure 2 panels: (a) conformal scores $s$ (b) imp. weights $w$ (c) IW partial sums $\bar{w}$ (d) $C_\alpha(x_n)$]

Figure 2: Constructing a conformal prediction set $C_\alpha(x_n)$ in the regression setting. First, (a) we choose some $\hat{y}_n \in \mathcal{Y}$ and guess $y_n = \hat{y}_n$, computing conformal scores $s$ of $\{(x_0, y_0), \dots, (x_{n-1}, y_{n-1}), (x_n, \hat{y}_n)\}$. (b) We note which examples score better than our guess (shown in blue), and mask out the corresponding importance weights $w$. (c) We compute the partial sums $\bar{w}$ of the masked importance weights, adding $\hat{y}_n$ to $C_\alpha(x_n)$ if $\bar{w}_n > \alpha$. (d) Repeat steps (a)-(c) for many guesses of $\hat{y}_n$. Rejected and accepted guesses are shaded light and dark, respectively.
Note that poor coverage does not necessarily imply that $K_\beta(x)$ was computed incorrectly; it may simply indicate a faulty assumption.
For example, in BayesOpt it is very common to assume $f \sim \mathcal{GP}(0, \kappa)$, where $\kappa$ is a Matérn kernel. Matérn kernels support functions that are at least once differentiable, and can struggle to model objectives with discontinuities. As another example, we typically do not know the true likelihood $p(y|f)$, and often choose a simple homoscedastic likelihood $p(y|f) = \mathcal{N}(f, \sigma^2 I_d)$ for convenience. In reality the true noise process may be correlated with $x$, across objectives, or may not be Gaussian at all (Assael et al., 2014; Griffiths et al., 2019; Makarova et al., 2021). These examples are common instances of model misspecification.

In practice faulty assumptions are nearly inevitable, and they are not always harmful, since simplifying assumptions can confer significant practical benefits. Indeed, theoretical convergence rates for acquisition functions like UCB (Srinivas et al., 2010) suggest that for BayesOpt we want to use the smoothest possible model, subject to the constraint that we can still model $f$ sufficiently well. Similarly, Gaussian likelihoods have significant computational advantages, and there is no guarantee that constructing a task-specific likelihood for every optimization problem would be worth the effort. We propose accepting that some assumption error will always be present, and instead focus on how to alter BayesOpt to accommodate imperfect models.
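As a concrete illustration of the coverage being checked here, the following sketch estimates the empirical coverage of central Gaussian $\beta$-credible intervals on held-out data. The predictive mean and standard deviation are assumed to come from the surrogate; the function and argument names are hypothetical.

```python
import numpy as np
from scipy import stats

def credible_interval_coverage(pred_mean, pred_std, y_holdout, beta=0.95):
    """Empirical coverage of central Gaussian beta-credible intervals
    K_beta(x) = [mean - z * std, mean + z * std] on held-out labels.
    If the model is well specified the returned frequency should be
    close to beta; misspecification often drives it well below beta."""
    z = stats.norm.ppf(0.5 * (1.0 + beta))   # half-width in predictive std units
    covered = np.abs(y_holdout - pred_mean) <= z * pred_std
    return covered.mean()
```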
2.3 Conformal prediction
See Shafer and Vovk (2008) for a complete tutorial on conformal prediction, or Angelopoulos and Bates (2021) for a modern, accessible introduction. Informally, a conformal prediction set $C_\alpha(x_n) \subset \mathcal{Y}$ is a set of possible labels for a test point $x_n$ given a sequence of observations $\mathcal{D}$. Candidate labels $\hat{y}_n$ are included in $C_\alpha(x_n)$ if the resulting pair $(x_n, \hat{y}_n)$ is sufficiently similar to the actual examples in $\mathcal{D}$. The degree of similarity is measured by a score function $s$ and importance weights (IWs) $w$, and the similarity threshold is determined by the miscoverage tolerance $\alpha$. In Figure 2 we visualize the process of constructing $C_\alpha(x_n)$.
Conformal prediction is a distribution-free uncertainty quantification method because it does not assume $\mathcal{D}$ is drawn from any particular distribution, nor does it assume $s$ is derived from a well-specified model. In our context the critical assumption is that $\mathcal{D} \cup \{(x_n, y_n)\}$ is pseudo-exchangeable. Fannjiang et al. (2022) provide a formal statement of pseudo-exchangeability (Definition 2), and prove a coverage guarantee for conformal prediction sets in the special case when $\mathcal{D}$ is IID from $p(x)p(y|x)$ and $p(x|\mathcal{D})$ is invariant to shuffling of $\mathcal{D}$, which we restate below:
Definition 2.1. Let $\mathcal{D} \sim p(x)p(y|x)$ and $(x_n, y_n) \sim p(x|\mathcal{D})p(y|x)$, where $p(x|\mathcal{D})$ is chosen such that $\mathcal{D} \cup \{(x_n, y_n)\}$ is pseudo-exchangeable. Given $w_i \propto p(x_i|\hat{\mathcal{D}}_i)/p(x_i)$ s.t. $\sum_i w_i = 1$, and $\alpha \in (0, 1]$, the conformal prediction set corresponding to score function $s$ is

$$C_\alpha(x_n) := \{\hat{y}_n \in \mathcal{Y} \mid h^\top w > \alpha\}, \qquad (3)$$
$$\text{where } h_i := \mathbb{1}\{s(x_i, y_i, \hat{\mathcal{D}}) \le s(x_n, \hat{y}_n, \hat{\mathcal{D}})\},$$

$\hat{\mathcal{D}} = \mathcal{D} \cup \{(x_n, \hat{y}_n)\}$, and $p(x|\hat{\mathcal{D}}_i)$ is the query proposal density given training data $\hat{\mathcal{D}}_i = \hat{\mathcal{D}} - \{(x_i, y_i)\}$. The importance weights $w$ account for covariate shift (Tibshirani et al., 2019), and $w_i = 1/(n+1) \ \forall i$ in the special case where $\mathcal{D} \cup \{(x_n, y_n)\}$ is fully exchangeable (e.g. IID).
Conformal prediction enjoys a frequentist marginal coverage guarantee on $C_\alpha(x_n)$ with respect to the joint distribution over $\mathcal{D} \cup \{(x_n, y_n)\}$,

$$\mathbb{P}[y_n \in C_\alpha(x_n)] \ge 1 - \alpha, \qquad (4)$$

meaning if we repeatedly draw $\mathcal{D} \sim p(x)p(y|x)$ and $(x_n, y_n) \sim p(x|\mathcal{D})p(y|x)$, then $C_\alpha(x_n)$ will contain the observed label $y_n$ with frequency at least $(1 - \alpha)$. A prediction set with a coverage guarantee like Eq. (4) is conservatively valid at the $1 - \alpha$ level. In Appendix B.1 we discuss randomized conformal prediction, which is exactly valid, meaning the long-run frequency of errors converges to exactly $\alpha$.
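For illustration, the membership test in Eq. (3) is only a few lines of code once the scores and normalized importance weights are in hand. The sketch below is an exact (non-differentiable) version with a generic conformity score; the function and argument names are our own.

```python
import numpy as np

def in_conformal_set(train_scores, test_score, weights, alpha):
    """Exact membership test for Eq. (3).

    train_scores: conformity scores s(x_i, y_i, D-hat) for the n training
                  points (higher = more conforming, e.g. log-likelihood).
    test_score:   s(x_n, y-hat_n, D-hat) for the candidate label.
    weights:      importance weights over all n + 1 points, summing to 1,
                  with weights[-1] belonging to the test point.
    Returns True iff y-hat_n is accepted into C_alpha(x_n)."""
    scores = np.append(train_scores, test_score)
    h = (scores <= test_score).astype(float)   # h_i = 1{s_i <= s_n}
    return float(h @ weights) > alpha
```

In the fully exchangeable special case, setting `weights = np.full(n + 1, 1.0 / (n + 1))` recovers ordinary full conformal prediction.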
Marginal coverage is distinct from conditional coverage, since it does not guarantee the coverage of $C_\alpha$ for any specific $x_n$, only the average coverage over the whole domain $\mathcal{X}$.
Full conformal Bayes corresponds to the score function $s(x_i, y_i, \hat{\mathcal{D}}) = \log p(y_i|x_i, \hat{\mathcal{D}})$. Conditioning an existing posterior $p(y|x_i, \mathcal{D})$ on the additional observation $(x_n, \hat{y}_n)$ is commonly referred to as "retraining" in the conformal prediction literature. If the surrogate just so happens to be correctly specified (e.g. $f \sim p(f)$), then $\log p(y_i|x_i, \hat{\mathcal{D}})$ is the optimal choice of score function, meaning it provides the most efficient prediction sets (i.e. smallest by expected volume w.r.t. the prior $p(f)$) among all prediction sets that are valid at the $1 - \alpha$ level (Hoff, 2021). In the typical situation where we think our model assumptions are plausible but do not really believe them, full conformal Bayes rewards us if our assumptions turn out to be correct, yet it produces valid predictions as long as the true data generation process is some pseudo-exchangeable sequence.
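With a Gaussian predictive, as in GP regression, the full conformal Bayes score is just a Gaussian log-density evaluation. A minimal sketch, assuming the predictive has already been conditioned ("retrained") on the augmented dataset $\hat{\mathcal{D}}$:

```python
from scipy import stats

def full_conformal_bayes_score(y, pred_mean, pred_std):
    """s(x_i, y_i, D-hat) = log p(y_i | x_i, D-hat) for a Gaussian
    predictive; pred_mean and pred_std must come from the posterior
    already conditioned on D-hat = D + {(x_n, y-hat_n)}."""
    return stats.norm.logpdf(y, loc=pred_mean, scale=pred_std)
```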
BayesOpt and pseudo-exchangeability: unfortunately, if $\mathcal{D}$ is collected by some active online selection strategy such as BayesOpt, then $\mathcal{D}$ is not IID and the coverage guarantee for Defn. 2.1 does not apply. Note that even large offline datasets are not guaranteed to be IID, so the same issue may arise even in the single-round design setting considered by Fannjiang et al. (2022). Despite the gap in theory, in this work we investigate the technical challenges associated with incorporating conformal prediction sets into BayesOpt, and find that empirically they can still improve query coverage (Section 5.3). In Appendix A.1 we include further discussion of the assumptions and limitations of conformal prediction.
3 RELATED WORK
Conformal prediction: Our work is related to Fannjiang et al. (2022), who propose a black-box optimization method based on conformal prediction specifically to address feedback covariate shift. However, because they assume new queries are drawn from a closed-form proposal distribution, and because exact conformal prediction is not differentiable, their approach cannot be easily extended to most BayesOpt methods.^2 Bai et al. (2022) propose a differentiable approximation of conformal prediction, but it requires solving a minimax optimization subproblem. Stutz et al. (2021) independently proposed a continuous relaxation of conformal prediction similar to ours, but only for fully exchangeable classification data. We propose a more general form that allows for covariate shift, and we also provide an efficient discretization procedure for regression and show how to estimate the importance weights when the queries are drawn from an implicit density.

^2 BayesOpt proposal distributions are usually implicit, obtained through gradient-based optimization of the acquisition function.
Robust BayesOpt: There is a substantial body of work on adaptation to model misspecification in the bandit setting (i.e. discrete actions), e.g. Lattimore et al. (2020) and Foster et al. (2020); however, we are primarily focused on problems with continuous decisions. Since the seminal analysis of GP-UCB regret bounds by Srinivas et al. (2010), follow-up work has proposed UCB variants for misspecified likelihoods (Makarova et al., 2021), misspecified GP priors (Bogunovic and Krause, 2021), or to guarantee $f(x_i) > c$ for some threshold $c \in \mathbb{R}$ (Sui et al., 2015). These approaches are not easy to extend to other acquisition functions, and tend to rely on fairly strong assumptions on the smoothness of $f$ or fix a specific kind of model misspecification.^3 Wang et al. (2018) prove regret bounds for GP-UCB when $f$ is drawn from a GP with unknown mean and kernel functions, but assume we know the right hypothesis class of mean and kernel functions and have a collection of offline datasets available for pretraining.
Finally, Eriksson et al. (2019) propose TuRBO, which is superficially similar to conformal BayesOpt since it constrains queries to a Latin hypercube trust region around the best known local optimum. While TuRBO can be very effective in practice, the size of the trust region is controlled by a heuristic with five hyperparameters in the single-objective case, and even more in the multi-objective case (Daulton et al., 2022b). Despite the additional complexity, the credible set coverage on queries in TuRBO trust regions can still vary wildly (see Section 5.3). In contrast, conformal prediction provides distribution-free coverage guarantees under very mild assumptions, and our approach can be applied to any reparameterizable acquisition function (Wilson et al., 2017). To our knowledge our approach is the first BayesOpt procedure to incorporate conformal prediction.
4 METHOD
We now describe the key ideas behind conformal Bayesian optimization. First, in Section 4.1 we show how to efficiently compute $C_\alpha(x_n)$, addressing differentiability and discretization of continuous outcomes. Our procedure is summarized in Algorithm 1. In Section 4.2 we introduce the conformal Bayes posterior $p_\alpha(f(x_n)|\mathcal{D})$, allowing us to distinguish between aleatoric and epistemic uncertainty and to combine conformal prediction with many well-known BayesOpt utility functions. Finally, in Section 4.3 we address feedback covariate shift without requiring closed-form expressions for $p(x)$ and $p(x_i|\hat{\mathcal{D}}_i)$. In Appendix D.1 we provide a detailed overview of the whole method, along with a discussion of the computational complexity in Appendix D.4.
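Although the details are deferred to Section 4.3, one standard way to estimate a density ratio such as $p(x_i|\hat{\mathcal{D}}_i)/p(x_i)$ without closed-form densities is to train a probabilistic classifier to distinguish samples from the two distributions and read the ratio off its predicted probabilities. The sketch below illustrates that generic recipe; it is not necessarily the exact procedure used in this paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_density_ratio(x_train, x_query):
    """Classifier-based density ratio estimation. If c(x) = P(query | x)
    for a classifier trained on {train: 0, query: 1} labels, Bayes' rule
    gives p_query(x) / p_train(x) = (n_train / n_query) * c(x) / (1 - c(x))."""
    X = np.vstack([x_train, x_query])
    z = np.concatenate([np.zeros(len(x_train)), np.ones(len(x_query))])
    clf = LogisticRegression(max_iter=1000).fit(X, z)

    def ratio(x):
        c = clf.predict_proba(np.atleast_2d(x))[:, 1]
        prior_correction = len(x_train) / len(x_query)
        return prior_correction * c / np.clip(1.0 - c, 1e-12, None)

    return ratio
```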
^3 For example, it is commonly assumed that $f$ has bounded RKHS norm w.r.t. the chosen GP kernel, that we know a good bound in order to set hyperparameters correctly, and that $f$ is Lipschitz continuous.
Figure 3: Constructing full conformal Bayes prediction sets starting with a Bayes posterior $p(\hat{y}|x, \mathcal{D})$. In this example $\mathcal{D}$ is composed of $n = 27$ noisy observations (shown as black dots) of the true objective (shown as a black dashed line) and $\alpha = 1 - \beta = 0.19$. In panel (a) we show $K_\beta(x)$, the $\beta$-credible prediction set. In panel (b) we show $\mathcal{Y}_{\text{cand}}$ populated by samples from $p(\hat{y}|x, \mathcal{D})$. In panel (c) we show $C_\alpha(x)$, the conformal prediction set. The coverage of $C_\alpha(x)$ is noticeably better than $K_\beta(x)$ in regions where there is little training data, though the nominal confidence level is the same.
Algorithm 1: Differentiable conformal prediction masks

Data: train data $\mathcal{D} = \{(x_i, y_i)\}_{i=0}^{n-1}$, test point $x_n$, importance weights $w$, label candidates $\mathcal{Y}_{\text{cand}}$, score function $s$, miscoverage tolerance $\alpha$, relaxation strength $\tau > 0$.

$m_j \leftarrow 0, \ \forall j \in \{0, \dots, k - 1\}$
for $\hat{y}_j \in \mathcal{Y}_{\text{cand}}$ do
    $\hat{\mathcal{D}} \leftarrow \mathcal{D} \cup \{(x_n, \hat{y}_j)\}$
    $s \leftarrow [s(x_0, y_0, \hat{\mathcal{D}}) \ \cdots \ s(x_n, \hat{y}_j, \hat{\mathcal{D}})]$
    $h \leftarrow \mathrm{sigmoid}(\tau^{-1}(s_n - s))$
    $\bar{w} \leftarrow h \odot w$
    $m_j \leftarrow \mathrm{sigmoid}(\tau^{-1}(\sum_i \bar{w}_i - \alpha))$
end
Result: $m$
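A minimal PyTorch sketch of Algorithm 1 is given below, assuming the retrained scores for every candidate have already been collected into a tensor (the loop over $\mathcal{Y}_{\text{cand}}$ is vectorized away); the function and argument names are our own.

```python
import torch

def conformal_masks(scores, weights, alpha, tau):
    """Differentiable relaxation of Eq. (3), following Algorithm 1.

    scores:  (k, n + 1) tensor; row j holds the retrained scores
             s(x_0, y_0, D-hat), ..., s(x_n, y-hat_j, D-hat) for
             candidate y-hat_j (the last column is the test score s_n).
    weights: (k, n + 1) tensor of normalized importance weights.
    Returns m, a (k,) tensor of soft membership probabilities."""
    s_n = scores[:, -1:]                              # test score per candidate
    h = torch.sigmoid((s_n - scores) / tau)           # relaxes 1{s_i <= s_n}
    w_bar = h * weights                               # masked importance weights
    total = w_bar.sum(dim=-1)                         # relaxes h^T w
    m = torch.sigmoid((total - alpha) / tau)          # relaxes 1{h^T w > alpha}
    return m
```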
4.1 Full conformal Bayes with Gaussian processes
Efficient retraining: full conformal Bayes requires us to compute $\log p(y_i|x_i, \hat{\mathcal{D}})$ $\forall i$ and $\hat{y}_j \in \mathcal{Y}_{\text{cand}}$, where $\mathcal{Y}_{\text{cand}}$ is some discretization of $\mathcal{Y}$. This incremental posterior update can be done very efficiently if the surrogate is a GP regression model (Gardner et al., 2018; Stanton et al., 2021; Maddox et al., 2021), and we will later reuse the conditioned posteriors to estimate expectations w.r.t. $p_\alpha(f(x)|\mathcal{D})$. Note that computing the GP posterior likelihood of training data can be numerically unstable, which we address in Appendix D. Other Bayesian predictive posteriors (e.g. from Bayesian neural networks) are conditioned on training data via iterative methods such as gradient descent, making full conformal Bayes very expensive (Fong and Holmes, 2021).
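As an illustration of the incremental update, GPyTorch exposes low-rank ("fantasy") conditioning for exact GPs; the sketch below uses it to produce the retrained log-density scores for a single candidate label. The `get_fantasy_model` call and the surrounding shape conventions are assumptions to verify against your installed GPyTorch version.

```python
import torch

def retrained_log_scores(model, likelihood, train_x, train_y, x_n, y_hat):
    """Condition a fitted GPyTorch ExactGP on one extra observation
    (x_n, y_hat) via a low-rank fantasy update, then return the marginal
    scores s_i = log p(y_i | x_i, D-hat) for every example in D-hat."""
    model.eval(); likelihood.eval()
    fantasy = model.get_fantasy_model(x_n.unsqueeze(0), y_hat.unsqueeze(0))
    xs = torch.cat([train_x, x_n.unsqueeze(0)])
    ys = torch.cat([train_y, y_hat.unsqueeze(0)])
    pred = likelihood(fantasy(xs))                     # predictive p(y | x, D-hat)
    marginals = torch.distributions.Normal(pred.mean, pred.stddev)
    return marginals.log_prob(ys)                      # one score per example
```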
Differentiable prediction masks: the definition of $C_\alpha(x_n)$ in Eq. (3) can be broken down into a sequence of simple vector operations interspersed with Heaviside functions. The Heaviside function is piecewise constant, with ill-defined derivatives, so we replace it with its continuous relaxation, the sigmoid function (Algorithm 1). Informally, the output $m_j$ of the final sigmoid can be interpreted as the probability of accepting some $\hat{y}_j$ into $C_\alpha(x_n)$. The smoothness of the relaxation is controlled by a single hyperparameter $\tau \in (0, +\infty)$. As $\tau \to 0$ the relaxation becomes exact but the gradients become very poorly behaved.
Efficient discretization of $\mathcal{Y}$: now we need a good way to choose $\mathcal{Y}_{\text{cand}}$. When $y$ is low-dimensional (e.g. sequential, single-objective tasks), then $\mathcal{Y}_{\text{cand}}$ can be a dense grid; however, dense grids are inefficient since they must be wide enough to capture all possible values of $y$ and dense enough to pinpoint the boundary of $C_\alpha(x_n)$. Even if we do not fully believe $p(y|x_n, \mathcal{D})$, it is still our best guess of where $y|x_n$ should be, so instead of a dense grid we populate $\mathcal{Y}_{\text{cand}}$ with proposals $\hat{y}_j \sim p(y|x_n, \mathcal{D})$, as sketched below. This approach not only reduces computational effort for low-dimensional $y$, it also allows us to extend to tasks with multiple objectives and batched queries (Appendix B.6). In Figure 3 we visualize the computation of a conformal Bayes prediction set.
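A minimal sketch of this sampling-based discretization, assuming a Gaussian predictive at the test point (the argument names are hypothetical):

```python
import torch

def sample_label_candidates(pred_mean, pred_std, k=64):
    """Populate Y_cand with k proposals y-hat_j ~ p(y | x_n, D), here
    assumed Gaussian. Unlike a fixed dense grid, the candidates track
    the location and scale of the predictive automatically."""
    predictive = torch.distributions.Normal(pred_mean, pred_std)
    return predictive.sample((k,))   # shape (k,): label candidates
```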
4.2 Conformal acquisition functions
For the sake of clarity, in the following sections we will omit the subscript from $x_n$. By the sum rule of probability, we can rewrite $p(f(x)|\mathcal{D})$ as an integral over all possible outcomes $y|x$,

$$p(f(x)|\mathcal{D}) = \int_{\hat{y} \in \mathcal{Y}} p(f(x)|\hat{\mathcal{D}}) \, p(\hat{y}|x, \mathcal{D}) \, d\hat{y}. \qquad (5)$$

In other words, $p(f(x)|\mathcal{D})$ can be seen as a Bayesian model average, where we condition each component model on a different potential observation $(x, \hat{y})$, and weight the components by $p(\hat{y}|x, \mathcal{D})$.

We are free to change the component weights to any other valid distribution over $\hat{y}$ we like. Now we introduce the conformal Bayes predictive posterior $p_\alpha(\hat{y}|x, \mathcal{D})$,

$$p_\alpha(\hat{y}|x, \mathcal{D}) := \begin{cases} (1 - \alpha)/Z_1 & \text{if } \hat{y} \in C_\alpha(x), \\ \alpha \, p(\hat{y}|x, \mathcal{D})/Z_2 & \text{otherwise,} \end{cases}$$
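In code, approximating the integral over $\hat{y}$ with the candidates in $\mathcal{Y}_{\text{cand}}$, the definition above amounts to a reweighting of the component models. A minimal sketch, assuming $Z_1$ and $Z_2$ normalize each branch so that total mass $(1 - \alpha)$ falls inside $C_\alpha(x)$ and $\alpha$ outside, and using the soft masks $m_j$ from Algorithm 1 in place of exact membership:

```python
import torch

def conformal_bayes_weights(m, log_p, alpha, eps=1e-12):
    """Discrete approximation of p_alpha(y-hat | x, D) over Y_cand.

    m:     (k,) soft conformal membership masses from Algorithm 1.
    log_p: (k,) log p(y-hat_j | x, D) under the Bayes predictive.
    Mass (1 - alpha) is shared uniformly among candidates inside
    C_alpha(x); the remaining alpha is spread over the rest in
    proportion to the Bayes predictive."""
    p = log_p.exp()
    inside = m / m.sum().clamp_min(eps)                           # ~uniform over C_alpha(x)
    outside = ((1 - m) * p) / ((1 - m) * p).sum().clamp_min(eps)
    return (1.0 - alpha) * inside + alpha * outside               # sums to 1
```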