
Marginal coverage is distinct from conditional coverage, since it does not guarantee the coverage of $C_\alpha$ for any specific $x_n$, only the average coverage over the whole domain $\mathcal{X}$.
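To make the distinction concrete, the following toy sketch (ours, not from the paper; the heteroscedastic data, `mu`, and all constants are illustrative) builds a split conformal interval that hits the marginal $1-\alpha$ rate while over- and under-covering on different parts of $\mathcal{X}$:

import numpy as np

# A constant-width interval is marginally valid at the 90% level, yet
# over-covers where the noise is small and under-covers where it is large.
rng = np.random.default_rng(0)
noise = lambda x: 0.1 + 0.9 * (x > 0.5)           # noise std jumps at x = 0.5
mu = lambda x: np.sin(2 * np.pi * x)              # assume a perfect mean model

def sample(n):
    x = rng.uniform(0, 1, n)
    return x, mu(x) + rng.normal(0.0, noise(x))

alpha = 0.1
x_cal, y_cal = sample(2000)
scores = np.abs(y_cal - mu(x_cal))                # absolute residual scores
level = np.ceil((1 - alpha) * (len(scores) + 1)) / len(scores)
q = np.quantile(scores, level)                    # split conformal quantile

x_test, y_test = sample(100_000)
covered = np.abs(y_test - mu(x_test)) <= q
print("marginal coverage: ", covered.mean())                 # ~0.90 overall
print("coverage, x < 0.5: ", covered[x_test < 0.5].mean())   # ~1.00 (too wide)
print("coverage, x > 0.5: ", covered[x_test > 0.5].mean())   # ~0.80 (too narrow)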
Full conformal Bayes corresponds to the score function $s(x_i, y_i, \hat{\mathcal{D}}) = \log p(y_i \mid x_i, \hat{\mathcal{D}})$. Conditioning an existing posterior $p(y \mid x_i, \mathcal{D})$ on the additional observation $(x_n, \hat{y}_n)$ is commonly referred to as "retraining" in the conformal prediction literature. If the surrogate just so happens to be correctly specified (e.g. $f^* \sim p(f)$), then $\log p(y_i \mid x_i, \hat{\mathcal{D}})$ is the optimal choice of score function, meaning it provides the most efficient prediction sets (i.e. smallest by expected volume w.r.t. the prior $p(f)$) among all prediction sets that are valid at the $1 - \alpha$ level (Hoff, 2021). In the typical situation where we think our model assumptions are plausible but do not really believe them, full conformal Bayes rewards us if our assumptions turn out to be correct, yet it produces valid predictions as long as the true data generating process is some pseudo-exchangeable sequence.
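As a hedged illustration of this score function, the sketch below runs full conformal Bayes with a toy conjugate Bayesian linear model standing in for the surrogate; `log_pred_density`, `conformal_set`, and all hyperparameters are our own illustrative choices, not the paper's implementation. Each candidate label $\hat{y}$ is added to the data ("retraining"), every point is scored by its log posterior predictive density under $\hat{\mathcal{D}}$, and $\hat{y}$ is kept whenever its conformal p-value exceeds $\alpha$:

import numpy as np
from scipy import stats

def log_pred_density(x, y, X, Y, obs_var=0.25, prior_var=4.0):
    # Posterior predictive log density of a 1-D Bayesian linear model
    # y = w * x + noise, with w ~ N(0, prior_var). All hyperparameters
    # are illustrative stand-ins for the surrogate.
    prec = 1.0 / prior_var + X @ X / obs_var
    w_mean = (X @ Y / obs_var) / prec
    mean, var = w_mean * x, obs_var + x ** 2 / prec
    return stats.norm.logpdf(y, mean, np.sqrt(var))

def conformal_set(x_n, y_grid, X, Y, alpha=0.1):
    kept = []
    for y_hat in y_grid:                          # discretized candidate labels
        X_aug, Y_aug = np.append(X, x_n), np.append(Y, y_hat)
        # "Retrain": condition on (x_n, y_hat), then score every point by
        # s(x_i, y_i, D_hat) = log p(y_i | x_i, D_hat).
        s = log_pred_density(X_aug, Y_aug, X_aug, Y_aug)
        p_value = np.mean(s <= s[-1])             # rank of the candidate's score
        if p_value > alpha:                       # keep y_hat in C_alpha(x_n)
            kept.append(y_hat)
    return np.array(kept)

On a sufficiently fine grid this returns a discretized approximation of $C_\alpha(x_n)$; Section 4.1 addresses how to discretize continuous outcomes and make the computation differentiable.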
BayesOpt and pseudo-exchangeability: unfortunately, if $\mathcal{D}$ is collected by some active online selection strategy such as BayesOpt, then $\mathcal{D}$ is not IID and the coverage guarantee for Defn. 2.1 does not apply. Note that even large offline datasets are not guaranteed to be IID, so the same issue may arise even in the single-round design setting considered by Fannjiang et al. (2022). Despite the gap in theory, in this work we investigate the technical challenges associated with incorporating conformal prediction sets into BayesOpt, and find that empirically they can still improve query coverage (Section 5.3). In Appendix A.1 we include further discussion of the assumptions and limitations of conformal prediction.
3 RELATED WORK
Conformal prediction: Our work is related to Fannjiang et al. (2022), who propose a black-box optimization method based on conformal prediction specifically to address feedback covariate shift. However, because they assume new queries are drawn from a closed-form proposal distribution, and because exact conformal prediction is not differentiable, their approach cannot be easily extended to most BayesOpt methods.² Bai et al. (2022) propose a differentiable approximation of conformal prediction, but it requires solving a minimax optimization subproblem. Stutz et al. (2021) independently proposed a continuous relaxation of conformal prediction, like our work, but only for fully exchangeable classification data. We propose a more general form that allows for covariate shift, and we also provide an efficient discretization procedure for regression and show how to estimate the importance weights when the queries are drawn from an implicit density.
² BayesOpt proposal distributions are usually implicit, obtained through gradient-based optimization of the acquisition function.
Robust BayesOpt: There is a substantial body of work on adaptation to model misspecification in the bandit setting (i.e. discrete actions), e.g. Lattimore et al. (2020) and Foster et al. (2020); however, we are primarily focused on problems with continuous decisions. Since the seminal analysis of GP-UCB regret bounds by Srinivas et al. (2010), follow-up work has proposed UCB variants for misspecified likelihoods (Makarova et al., 2021), misspecified GP priors (Bogunovic and Krause, 2021), or to guarantee $f^*(x_i) > c$ for some threshold $c \in \mathbb{R}$ (Sui et al., 2015). These approaches are not easy to extend to other acquisition functions, and tend to rely on fairly strong assumptions on the smoothness of $f^*$ or fix a specific kind of model misspecification.³ Wang et al. (2018) prove regret bounds for GP-UCB when $f^*$ is drawn from a GP with unknown mean and kernel functions, but assume we know the right hypothesis class of mean and kernel functions and have a collection of offline datasets available for pretraining.
Finally, Eriksson et al. (2019) propose TuRBO, which is superficially similar to conformal BayesOpt since it constrains queries to a Latin hypercube trust region around the best known local optimum. While TuRBO can be very effective in practice, the size of the trust region is controlled by a heuristic with five hyperparameters in the single-objective case, and even more in the multi-objective case (Daulton et al., 2022b). Despite the additional complexity, the credible set coverage on queries in TuRBO trust regions can still vary wildly (see Section 5.3). In contrast, conformal prediction provides distribution-free coverage guarantees under very mild assumptions, and our approach can be applied to any reparameterizable acquisition function (Wilson et al., 2017). To our knowledge, our approach is the first BayesOpt procedure to incorporate conformal prediction.
4 METHOD
We now describe the key ideas behind conformal Bayesian optimization. First, in Section 4.1 we show how to efficiently compute $C_\alpha(x_n)$, addressing differentiability and discretization of continuous outcomes. Our procedure is summarized in Algorithm 1. In Section 4.2 we introduce the conformal Bayes posterior $p_\alpha(f(x_n) \mid \mathcal{D})$, allowing us to distinguish between aleatoric and epistemic uncertainty and to combine conformal prediction with many well-known BayesOpt utility functions. Finally, in Section 4.3 we address feedback covariate shift without requiring closed-form expressions for $p(x)$ and $p'(x_i \mid \hat{\mathcal{D}}_{-i})$. Appendix D.1 provides a detailed overview of the whole method, and Appendix D.4 discusses its computational complexity.
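As a preview of the first ingredient, the sketch below (our own simplification, not Algorithm 1) relaxes the hard conformal rank test with sigmoids so that membership of a discretized candidate label in $C_\alpha(x_n)$ admits gradients; the helper name and temperature `tau` are hypothetical:

import torch

def soft_membership(scores, alpha, tau=0.01):
    # scores: (m, k+1) tensor of log predictive densities; row j holds the
    # scores of the k training points plus candidate j itself (column -1),
    # computed after conditioning on (x_n, y_hat_j).
    # Smooth stand-in for the p-value mean_i[ s_i <= s_candidate ]:
    p_value = torch.sigmoid((scores[:, -1:] - scores) / tau).mean(dim=1)
    # Smooth stand-in for the hard membership test p_value > alpha:
    return torch.sigmoid((p_value - alpha) / tau)   # -> {0, 1} as tau -> 0

As $\tau \to 0$ this recovers the hard prediction set, at the cost of vanishing gradients; the discretization and relaxation actually used are given in Section 4.1.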
³ For example, it is commonly assumed that $f^*$ has bounded RKHS norm w.r.t. the chosen GP kernel, that we know a good bound in order to set hyperparameters correctly, and that $f^*$ is Lipschitz continuous.