Multi-Fidelity Bayesian Optimization with Unreliable Information Sources

Petrus Mikkola (Aalto University), Julien Martinelli (Aalto University), Louis Filstroff (ENSAI, CREST), Samuel Kaski (Aalto University; University of Manchester)

Abstract
Bayesian optimization (BO) is a powerful framework for optimizing black-box, expensive-to-evaluate functions. Over the past decade, many algorithms have been proposed to integrate cheaper, lower-fidelity approximations of the objective function into the optimization process, with the goal of converging towards the global optimum at a reduced cost. This task is generally referred to as multi-fidelity Bayesian optimization (MFBO). However, MFBO algorithms can lead to higher optimization costs than their vanilla BO counterparts, especially when the low-fidelity sources are poor approximations of the objective function, thereby defeating their purpose. To address this issue, we propose rMFBO (robust MFBO), a methodology to make any GP-based MFBO scheme robust to the addition of unreliable information sources. rMFBO comes with a theoretical guarantee that its performance can be bounded relative to its vanilla BO analog, with high, controllable probability. We demonstrate the effectiveness of the proposed methodology on a number of numerical benchmarks, outperforming earlier MFBO methods in the presence of unreliable sources. We expect rMFBO to be particularly useful for reliably including human experts with varying knowledge within BO processes.
1 INTRODUCTION
Bayesian optimization (BO) has become a popular framework for global optimization of black-box functions, especially when they are expensive to evaluate (Jones et al., 1998; Brochu et al., 2010). Such functions have neither a known functional form nor derivatives, and conventional
optimization techniques such as gradient descent cannot be directly employed. BO rests upon two key elements. First, it constructs a probabilistic surrogate model of the objective function with built-in uncertainty estimates, typically a Gaussian process (GP), based on evaluations of the function. The obtained surrogate is then used to select the next point to evaluate by maximizing a so-called acquisition function, which quantifies the expected utility of evaluating a specific point for the purpose of optimizing the black-box function. Many off-the-shelf acquisition functions achieve this task while balancing exploration and exploitation. Iterating these two steps produces a sequence of designs whose aim is to converge to the global optimum using a limited number of function queries. BO has proven effective for a variety of problems, including hyperparameter optimization (Snoek et al., 2012), materials science (Zhang et al., 2020), and drug discovery (Gómez-Bombarelli et al., 2018; Korovina et al., 2020).
In many scenarios, lower-fidelity approximations of the objective function are available at a cheaper query cost. This occurs for instance when the evaluation of the objective function involves a numerical scheme, where computational cost and accuracy can be traded off. Another example is the knowledge of domain experts. Indeed, practitioners may have implicit knowledge of the objective function; for instance, they may be able to point out good candidate regions for the location of the global optimum (Hvarfner et al., 2022). Such knowledge may naturally be considered as a low-fidelity version of the true objective function.

The problem of integrating these auxiliary information sources (ISs) to reduce the cost of BO has been tackled in the literature under the name multi-fidelity Bayesian optimization (MFBO) (Huang et al., 2006; Kandasamy et al., 2016; Zhang et al., 2017; Sen et al., 2018; Song et al., 2019; Takeno et al., 2020; Li et al., 2020; Moss et al., 2021) when the different sources can be ranked by their degree of fidelity; when this is not possible, the problem has been studied as multi-task BO (Swersky et al., 2013), non-hierarchical multi-fidelity BO (Lam et al., 2015), or multi-information source BO (Poloczek et al., 2017). However, as we will empirically demonstrate, state-of-the-art MFBO algorithms can fail when the auxiliary ISs are poor approximations of the
primary IS. More precisely, for a fixed budget, these algorithms will lead to a higher regret w.r.t. their single-fidelity counterparts (i.e., vanilla BO, which uses the primary IS only), defeating their purpose. For instance, the guarantees of Kandasamy et al. (2016) require that the deviation between an auxiliary IS and the primary IS is bounded by a constant known beforehand, which hardly ever holds in practice, for example when working with a human expert, or when experimenting with simulations to find the optimal control parameters for a robotic system (Marco et al., 2017).

Surprisingly enough, this issue has not been formally addressed in the BO literature so far. We fill this gap by introducing rMFBO, a methodology to make any GP-based MFBO algorithm robust to the addition of unreliable information sources. More precisely, rMFBO comes with a theoretical guarantee that its performance can be bounded relative to its vanilla BO analog, with high, controllable probability. To the best of our knowledge, rMFBO is the first MFBO scheme providing such performance guarantees. We then proceed to demonstrate the effectiveness of the proposed methodology in various numerical settings using different MFBO algorithms from the literature. Through its building-block nature, rMFBO paves the way towards a more systematic usage of auxiliary ISs independently of their degree of fidelity, allowing human experts to join the optimization process in a reliable manner.
2 PRELIMINARIES
Gaussian process regression
We begin by introducing the notation for single-output Gaussian process regression, the probabilistic surrogate upon which BO rests. Consider a dataset $\mathcal{D} = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\}$ with $(\mathbf{x}_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$, for which we want to learn a model of the form $y_i = f(\mathbf{x}_i) + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^2_{\mathrm{noise}})$ for all $i$. We may place a (zero-mean) GP prior on $f$:
$$f(\mathbf{x}) \sim \mathcal{GP}(0, k(\mathbf{x}, \mathbf{x}')). \qquad (1)$$
This defines a distribution over functions $f$ whose mean is $\mathbb{E}[f(\mathbf{x})] = 0$ and covariance $\mathrm{cov}[f(\mathbf{x}), f(\mathbf{x}')] = k(\mathbf{x}, \mathbf{x}')$. Here, $k$ is a kernel function measuring the similarity between inputs. Consequently, for any finite collection of inputs $(\mathbf{x}_1, \dots, \mathbf{x}_n)$, the function values $\mathbf{f} = (f(\mathbf{x}_1), \dots, f(\mathbf{x}_n))^\top \in \mathbb{R}^n$ follow a multivariate normal distribution $\mathbf{f} \sim \mathcal{N}(\mathbf{0}, K)$, where $K = (k(\mathbf{x}_i, \mathbf{x}_j))_{1 \le i,j \le n} \in \mathbb{R}^{n \times n}$ is the kernel matrix.

Given $\mathcal{D}$, the posterior predictive distribution $p(f(\mathbf{x}) \mid \mathcal{D})$ is Gaussian for all $\mathbf{x}$, with mean $\mu(\mathbf{x})$ and variance $\sigma^2(\mathbf{x})$ such that
$$\mu(\mathbf{x}) = \mathbf{k}_{\mathbf{x}}^\top (K + \sigma^2_{\mathrm{noise}} I)^{-1} \mathbf{y},$$
$$\sigma^2(\mathbf{x}) = k(\mathbf{x}, \mathbf{x}) - \mathbf{k}_{\mathbf{x}}^\top (K + \sigma^2_{\mathrm{noise}} I)^{-1} \mathbf{k}_{\mathbf{x}},$$
where $\mathbf{y} = [y_1, \dots, y_n]^\top \in \mathbb{R}^n$ and $\mathbf{k}_{\mathbf{x}} = [k(\mathbf{x}, \mathbf{x}_1), \dots, k(\mathbf{x}, \mathbf{x}_n)]^\top \in \mathbb{R}^n$.
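As a concrete illustration of these formulas, here is a minimal NumPy sketch of the posterior mean and variance. The RBF kernel, the noise level, and the function names are illustrative assumptions, not choices made in the paper.

import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 * lengthscale^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X, y, X_star, sigma_noise=0.1, lengthscale=1.0):
    """Posterior mean and variance of a zero-mean GP at test points X_star."""
    K = rbf_kernel(X, X, lengthscale)                 # n x n kernel matrix
    K_s = rbf_kernel(X_star, X, lengthscale)          # n* x n cross-covariances k_x
    A = K + sigma_noise**2 * np.eye(len(X))
    alpha = np.linalg.solve(A, y)                     # (K + sigma^2 I)^{-1} y
    mu = K_s @ alpha
    v = np.linalg.solve(A, K_s.T)                     # (K + sigma^2 I)^{-1} k_x
    var = rbf_kernel(X_star, X_star, lengthscale).diagonal() - np.sum(K_s * v.T, axis=1)
    return mu, var

# toy usage
X = np.random.rand(10, 2); y = np.sin(X.sum(axis=1))
mu, var = gp_posterior(X, y, np.random.rand(5, 2))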
Multi-output Gaussian process regression
GPs can be extended to multi-output Gaussian processes (MOGPs), modeling any collection of $m$-sized vector-valued outputs $(\mathbf{y}_1, \dots, \mathbf{y}_n)$ based on inputs $(\mathbf{x}_1, \dots, \mathbf{x}_n)$ as a multivariate normal distribution. One way to achieve this extension is through the addition of a $(d+1)$-th dimension to the input space, representing the output index $1 \le \ell \le m$. This enables treating a MOGP as a single-output GP acting on the augmented space $\mathbb{R}^{d+1}$ through the kernel $k((\mathbf{x}, \ell), (\mathbf{x}', \ell'))$. The latter can, for instance, take the separable form $k((\mathbf{x}, \ell), (\mathbf{x}', \ell')) = k_{\mathrm{input}}(\mathbf{x}, \mathbf{x}') \times k_{\mathrm{IS}}(\ell, \ell')$. In particular, this setting allows for the use of the readily available analytical formulae for the posterior mean and variance of single-output GPs.
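To make the augmented-space view concrete, the following is a minimal sketch of such a separable kernel, assuming an RBF kernel over inputs and a Cholesky-parameterized covariance over IS indices; these modeling choices and names are illustrative, not prescribed by the paper.

import numpy as np

def separable_mf_kernel(XL_a, XL_b, lengthscale=1.0, L_chol=None):
    """k((x, l), (x', l')) = k_input(x, x') * k_IS(l, l').

    XL_a, XL_b: arrays of shape (n, d + 1); the last column holds the integer IS index.
    k_IS is parameterized as B = L_chol @ L_chol.T (positive semi-definite).
    """
    Xa, la = XL_a[:, :-1], XL_a[:, -1].astype(int)
    Xb, lb = XL_b[:, :-1], XL_b[:, -1].astype(int)
    d2 = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(-1)
    k_input = np.exp(-0.5 * d2 / lengthscale**2)
    if L_chol is None:                        # default: 2 sources, mildly correlated
        L_chol = np.array([[1.0, 0.0], [0.8, 0.6]])
    B = L_chol @ L_chol.T                     # k_IS(l, l') = B[l, l']
    return k_input * B[np.ix_(la, lb)]

The resulting kernel matrix can be plugged directly into the single-output posterior formulas above, which is precisely the point of the augmented-space construction.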
Problem setup
We consider the problem of optimizing a black-box function $f^{(m)}: \mathcal{X} \to \mathbb{R}$, where $\mathcal{X}$ is a subset of $\mathbb{R}^d$, i.e., solving
$$\arg\max_{\mathbf{x} \in \mathcal{X}} f^{(m)}(\mathbf{x}). \qquad (2)$$
In addition to $f^{(m)}$ (the primary IS), we may also query $m-1$ other auxiliary functions (auxiliary ISs), $f^{(\ell)}: \mathcal{X} \to \mathbb{R}$, where $\ell \in \llbracket m-1 \rrbracket$ denotes the IS index. The cost of evaluating $f^{(\ell)}(\mathbf{x})$ is $\lambda_\ell$ for any $\mathbf{x} \in \mathcal{X}$. We assume that $\lambda_\ell < \lambda_m$ for any auxiliary IS $\ell \in \llbracket m-1 \rrbracket$. The objective is to solve (2) within the budget $\Lambda > 0$.
Bayesian optimization
At each round $t$, an input-IS pair $(\mathbf{x}, \ell) \in \mathcal{X} \times \llbracket m \rrbracket$ is selected by maximizing the acquisition function $\alpha$, which depends on the GP surrogate model of $f$ given all the data acquired up until round $t-1$:
$$\mathbf{x}_t, \ell_t = \arg\max_{(\mathbf{x}, \ell) \in \mathcal{X} \times \llbracket m \rrbracket} \alpha(\mathbf{x}, \ell). \qquad (3)$$
Querying $f^{(\ell)}(\mathbf{x})$ returns a noisy observation $y^{(\ell)}_{\mathbf{x}} = f^{(\ell)}(\mathbf{x}) + \epsilon$, with i.i.d. noise $\epsilon \sim \mathcal{N}(0, \sigma^2_{\mathrm{noise}})$. We refer to the sequence of queries $\{\mathbf{x}_t\}_{t=1}^T$ returned by a BO algorithm as an acquisition trajectory.

Note that vanilla BO (i.e., BO with the primary IS only) amounts to using the acquisition function $\mathbf{x} \mapsto \alpha(\mathbf{x}, m)$, which will be referred to as single-fidelity BO (SFBO) from now on.
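As a minimal illustration of Eq. (3) and the SFBO special case, the sketch below maximizes a toy cost-aware UCB acquisition over a finite candidate set of input-IS pairs. The acquisition itself, the candidate-set discretization, and all function names are illustrative assumptions, not the acquisition functions used in the paper.

import numpy as np

def select_query(candidates, mu_mf, sigma_mf, costs, m, beta=2.0, primary_only=False):
    """Pick the input-IS pair maximizing a toy cost-aware UCB over a finite set.

    candidates: iterable of (x, l) pairs with l in {1, ..., m};
    mu_mf, sigma_mf: MOGP posterior mean / std callables; costs: dict l -> lambda_l.
    With primary_only=True this collapses to the SFBO rule x -> alpha(x, m).
    """
    best_pair, best_val = None, -np.inf
    for x, l in candidates:
        if primary_only and l != m:
            continue
        alpha = (mu_mf(x, l) + beta * sigma_mf(x, l)) / costs[l]   # toy acquisition
        if alpha > best_val:
            best_pair, best_val = (x, l), alpha
    return best_pair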
Recall that we wish to optimize $f^{(m)}$ within the budget $\Lambda$. In this scenario, the performance metric of interest is the regret of the algorithm, whose definition is recalled below.
Definition 1 (BO regret). The regret of a BO algorithm that spends $\Lambda_{\mathrm{cost}}$ and returns the final choice $\mathbf{x}_{\mathrm{choice}}$ is defined by
$$R(\Lambda_{\mathrm{cost}}, \mathbf{x}_{\mathrm{choice}}) := \begin{cases} f^* - f^{(m)}(\mathbf{x}_{\mathrm{choice}}) & \text{if } \Lambda_{\mathrm{cost}} \le \Lambda, \\ \infty & \text{otherwise,} \end{cases}$$
where $f^* = \max_{\mathbf{x} \in \mathcal{X}} f^{(m)}(\mathbf{x})$ is the global maximum of the primary IS.

Definition 2 (Number of queries). Let $T := \lfloor \Lambda / \lambda_m \rfloor$ be the available number of primary IS queries. Let $T^{(\ell)}$ be the random variable describing the number of $\ell$-th IS queries spent by the MFBO algorithm.

There are two popular choices for $\mathbf{x}_{\mathrm{choice}}$. First, the Bayes-optimal choice
$$\mathbf{x}_{\mathrm{choice}} = \arg\max_{\mathbf{x} \in \mathcal{X}} \mu(\mathbf{x}, m),$$
where $\mu(\mathbf{x}, m)$ is the posterior mean of the GP model given all the data up to the final query $T_{\mathrm{last}} = \sum_{\ell=1}^m T^{(\ell)}$. The regret in this case is called the inference regret. Second, the simple choice
$$\mathbf{x}_{\mathrm{choice}} = \arg\max_{t \in \llbracket T^{(m)} \rrbracket} f^{(m)}(\mathbf{x}_t),$$
where $(\mathbf{x}_1, \dots, \mathbf{x}_{T^{(m)}})$ is the primary IS acquisition trajectory returned by the MFBO algorithm. The regret in that case is called the simple regret.
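For concreteness, here is a minimal sketch of how the two regret notions could be computed at the end of a run on a synthetic benchmark where the global maximum f* is known; all names, and the finite candidate set standing in for X, are illustrative assumptions.

import numpy as np

def regrets(f_primary, f_star, primary_traj, mu_posterior, candidates,
            budget, spent_cost):
    """Simple and inference regret for a finished run (Definitions 1-2).

    primary_traj: list of inputs queried on the primary IS.
    mu_posterior: callable x -> posterior mean mu(x, m) after the last query.
    candidates:   finite set standing in for X when maximizing the posterior mean.
    """
    if spent_cost > budget:                     # over budget => infinite regret
        return np.inf, np.inf
    x_simple = max(primary_traj, key=f_primary)            # simple choice
    x_bayes = max(candidates, key=mu_posterior)            # Bayes-optimal choice
    simple_regret = f_star - f_primary(x_simple)
    inference_regret = f_star - f_primary(x_bayes)
    return simple_regret, inference_regret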
Lastly, we provide an informal definition of an unreliable IS. To the best of our knowledge, no formal definition has been proposed in the literature so far, in view of the non-trivial character of the task. Broadly speaking, an auxiliary IS $\ell$ is said to be unreliable if querying $f^{(\ell)}(\mathbf{x})$ does not lead to a decreasing inference regret when averaged over all sequences of queries $\mathbf{x} \in \mathcal{X}$ and all datasets $\mathcal{D}$ of any size. As such, the relevance of an IS then also depends on the MOGP model chosen.
3 RELATED WORK
As discussed in the introduction, many different multi-fidelity extensions of Bayesian optimization have been proposed in the literature; we refer the interested reader to Takeno et al. (2020, Section 5) for a review. The closest to our work are methods that do not assume a hierarchy between the sources (e.g., when the degree of fidelity cannot be assessed in advance), as in Lam et al. (2015), where the focus lies in designing a GP model that takes into account the non-hierarchical nature of the sources. The multi-fidelity kernel introduced by Poloczek et al. (2017) (see Supplementary F) is one example of such a design.

Surprisingly, the problem of the potential performance degradation of MFBO algorithms has been largely ignored in the literature, with the exception of Kandasamy et al. (2016), who noticed that their multi-fidelity method performed poorly compared to all single-fidelity variants in one experiment (Kandasamy et al., 2016, Supplementary D.3).

Lastly, we mention that robustness has been studied for vanilla BO in the context of (sometimes adversarially) noisy inputs or outputs (Martinez-Cantin et al., 2018; Bogunovic et al., 2018; Fröhlich et al., 2020; Kirschner and Krause, 2021). This notion of robustness is fundamentally different from ours; indeed, we wish to provide guarantees that the addition of an auxiliary IS (or several) will not lead to worse performance w.r.t. vanilla BO.
4 PITFALLS OF MFBO METHODS
We now demonstrate, on a simple example, the influence of the auxiliary IS quality on the performance of MFBO algorithms. Let us consider the Hartmann6D function as the objective (i.e., the primary IS). We examine two scenarios: in the first one, the auxiliary IS is informative, consisting of a biased version of the primary IS with a degree of fidelity of 0.2. In the second scenario, the auxiliary IS is taken to be the 6-dimensional Rosenbrock function, an irrelevant source for this problem. Analytical forms for these examples can be found in Supplementary G. We evaluate the multi-fidelity max-value entropy search (MF-MES) method from Takeno et al. (2020) on these two scenarios, as well as its single-fidelity counterpart (SF-MES), and our proposed algorithm rMFBO built on top of MF-MES (rMF-MES). In both cases, the cost of the primary IS is set to 1, and the cost of the auxiliary IS to 0.2. The simple regret of the three algorithms is displayed in Figure 1.

When the auxiliary IS is informative (left panel), MF-MES converges faster than SF-MES. This is the expected behavior of MFBO algorithms: they use cheap IS queries in the beginning to clear out unpromising regions of the space at a low cost, which eventually speeds up convergence. However, when the auxiliary IS is irrelevant (right panel), there is a clear gap between MF-MES and SF-MES, even in the long run. This demonstrates the inability of MF-MES to deal with an irrelevant IS. In that scenario, we hypothesize that the budget is wasted on uninformative queries, and thus too many rounds are spent on learning that the sources are not correlated (Figure 1, right bar plot), leading to a sub-optimal data acquisition trajectory.

There is therefore a need for a robust method for such a scenario. This is what the proposed rMF-MES, formally introduced in the next section, achieves by taking the best of both worlds: sticking close to the single-fidelity track in case of an irrelevant IS, while using informative lower-fidelity queries to accelerate convergence.
[Figure 1: two panels ("Informative Auxiliary Information Source", "Irrelevant Auxiliary Information Source") plotting simple regret against budget for MF-MES, SF-MES, and rMF-MES on Hartmann6D, with bar plots of the proportion of auxiliary vs. primary IS queries.]

Figure 1: Simple regret as a function of budget spent in two multi-fidelity problems, averaged over 100 repetitions. The informative auxiliary IS helps reduce the cost of BO (left panel: MF-MES in purple reduces regret faster than SF-MES in green), whereas the irrelevant IS catastrophically disrupts performance (right panel: MF-MES does not reach the low regret of SF-MES at all). In both settings, the primary IS cost is set to 1 and the auxiliary IS cost to 0.2. From relevant to irrelevant IS, the proportion of auxiliary IS queries remains high for MF-MES, while rMF-MES is more consistent (lower panel).
5 ROBUST MFBO ALGORITHM
In this section, we introduce rMFBO (robust MFBO), a methodology to make any GP-based MFBO scheme robust to the addition of unreliable ISs. The key idea is to control the quality of the acquisitions to prevent the MFBO algorithm from behaving as described at the end of Section 4.

At round $t$, based on the acquisition function $\alpha$, MFBO proposes the query $(\mathbf{x}^{\mathrm{MF}}_t, \ell_t)$ according to Eq. (3). The question is to decide whether to execute this query, or to go with a more conservative query from the primary IS. Indeed, we wish to curb the potential performance deterioration w.r.t. vanilla BO. To do so, we introduce a concurrent pseudo-SFBO algorithm, which constructs a GP surrogate based on data from the primary IS only, together with so-called pseudo-observations, introduced later on. The pseudo-SFBO uses the acquisition function $\mathbf{x} \mapsto \alpha(\mathbf{x}, m)$ and a separate single-output GP, yielding the query $(\mathbf{x}^{\mathrm{pSF}}_t, m)$.

Let us denote the predictive mean and standard deviation of the MOGP model (used by the MFBO algorithm) by $\mu_{\mathrm{MF}}$ and $\sigma_{\mathrm{MF}}$, and those of the GP model (used by the pseudo-SFBO algorithm) by $\mu_{\mathrm{SF}}$ and $\sigma_{\mathrm{SF}}$. In a nutshell, the proposed rMFBO follows the conservative query $\mathbf{x}^{\mathrm{pSF}}_t$, unless the predictive variance of the MOGP model at $\mathbf{x}^{\mathrm{pSF}}_t$ is small enough:
$$\sigma_{\mathrm{MF}}(\mathbf{x}^{\mathrm{pSF}}_t, m) \le c_1, \qquad (4)$$
where $c_1 > 0$ is a user-specified parameter. The pseudo-SFBO learns from all the samples, even when the MFBO candidate $\mathbf{x}^{\mathrm{MF}}_t$ is queried, by adding the pseudo-observation $\mu_{\mathrm{MF}}(\mathbf{x}^{\mathrm{pSF}}_t, m)$.

It can easily be seen that if $c_1 \to 0$, the described algorithm becomes the SFBO algorithm, since the MFBO proposals would always be ignored. Our main result, formally discussed in Section 6, is that we are able to derive a lower bound on the regret difference between robust MFBO and SFBO as a function of $c_1 > 0$.
While condition (4) ensures that we can achieve performance similar to SFBO when the auxiliary IS is irrelevant, we also want to reap the benefits of the multi-fidelity approach when the auxiliary IS is relevant. To that end, we introduce a measure of IS relevance, $s$, and add this second condition for the acceptance of the MF query:
$$s(\mathbf{x}^{\mathrm{MF}}_t, \ell_t) \ge c_2, \qquad (5)$$
with $c_2 > 0$ a user-specified parameter. We want to draw attention to the fact that condition (4) operates on the SFBO proposal, while condition (5) acts on the MFBO proposal, possibly based on an auxiliary IS. Condition (5) makes rMFBO revert more often to the primary IS, and makes the pseudo-observations more accurate overall. This allows the algorithm to consider less conservative values for $c_1$, opening the door for the exploitation of auxiliary ISs. In this paper, we use a cost-adjusted information gain (Takeno et al., 2020),
$$s(\mathbf{x}, \ell) = \frac{I\big(f^{(\ell)}(\mathbf{x}), f^* \mid \mathcal{D}_{\mathrm{MF}}\big)}{\lambda_\ell}, \qquad (6)$$
where $I$ is the mutual information between the observation $f^{(\ell)}(\mathbf{x})$ and the maximal value of $f^{(m)}$, $f^* := \max_{\mathbf{x} \in \mathcal{X}} f^{(m)}(\mathbf{x})$; in other words, the information gain on $f^*$ brought by the observation $f^{(\ell)}(\mathbf{x})$.
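As a rough illustration of Eq. (6), the sketch below estimates the mutual information with a standard max-value entropy search style approximation based on posterior moments and Monte Carlo samples of f*, then divides by the query cost. This is only a single-fidelity-style stand-in for the information gain used by Takeno et al. (2020); the function names and the way the f* samples are obtained are assumptions made for illustration.

import numpy as np
from scipy.stats import norm

def cost_adjusted_relevance(mu, sigma, f_star_samples, cost):
    """Toy estimate of s(x, l) = I(f^(l)(x); f* | D_MF) / lambda_l.

    mu, sigma: MOGP posterior mean and std of f^(l)(x) given D_MF.
    f_star_samples: Monte Carlo samples of the posterior maximum f*
    (e.g., via Thompson sampling of the primary-IS posterior).
    """
    gamma = (np.asarray(f_star_samples) - mu) / max(sigma, 1e-12)
    mi = np.mean(gamma * norm.pdf(gamma) / (2.0 * norm.cdf(gamma)) - norm.logcdf(gamma))
    return mi / cost

# toy usage: a cheap IS (cost 0.2) with a fairly uncertain prediction
print(cost_adjusted_relevance(mu=0.3, sigma=0.5,
                              f_star_samples=np.random.normal(1.0, 0.1, 100),
                              cost=0.2))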
Algorithm 1 Robust MFBO algorithm
1: Input: Budget $\Lambda$, costs $(\lambda_1, \dots, \lambda_m)$, acquisition function $\alpha$, hyperparameters $c_1$ and $c_2$, relevance measure $s$
2: Initialize $\mathcal{D}_{\mathrm{pSF}}, \mathcal{D}_{\mathrm{MF}}$
3: Perform Bayesian updates $\mu_{\mathrm{SF}}, \sigma_{\mathrm{SF}}, \mu_{\mathrm{MF}}, \sigma_{\mathrm{MF}}$
4: $t \leftarrow 1$
5: while $\lfloor \Lambda / \lambda_m \rfloor \ge 2\lambda_m$ do
6:   $\mathbf{x}^{\mathrm{pSF}}_t \leftarrow \arg\max_{\mathbf{x}} \alpha(\mathbf{x}, m \mid \mu_{\mathrm{SF}}, \sigma_{\mathrm{SF}})$
7:   $(\mathbf{x}^{\mathrm{MF}}_t, \ell_t) \leftarrow \arg\max_{\mathbf{x}, \ell} \alpha(\mathbf{x}, \ell \mid \mu_{\mathrm{MF}}, \sigma_{\mathrm{MF}})$
8:   if $\sigma_{\mathrm{MF}}(\mathbf{x}^{\mathrm{pSF}}_t, m) \le c_1$ and $s(\mathbf{x}^{\mathrm{MF}}_t, \ell_t) \ge c_2$ then
9:     $y_t \leftarrow f(\mathbf{x}^{\mathrm{MF}}_t, \ell_t)$
10:    $\mathcal{D}_{\mathrm{MF}} \leftarrow \mathcal{D}_{\mathrm{MF}} \cup \{((\mathbf{x}^{\mathrm{MF}}_t, \ell_t), y_t)\}$
11:    Perform Bayesian updates $\mu_{\mathrm{MF}}, \sigma_{\mathrm{MF}}$
12:    $y_t \leftarrow \mu_{\mathrm{MF}}(\mathbf{x}^{\mathrm{pSF}}_t, m)$   # pseudo-observation
13:    $\mathcal{D}_{\mathrm{pSF}} \leftarrow \mathcal{D}_{\mathrm{pSF}} \cup \{(\mathbf{x}^{\mathrm{pSF}}_t, y_t)\}$
14:    $\Lambda \leftarrow \Lambda - \lambda_{\ell_t}$
15:  else
16:    $y_t \leftarrow f(\mathbf{x}^{\mathrm{pSF}}_t, m)$
17:    $\mathcal{D}_{\mathrm{pSF}} \leftarrow \mathcal{D}_{\mathrm{pSF}} \cup \{(\mathbf{x}^{\mathrm{pSF}}_t, y_t)\}$
18:    $\mathcal{D}_{\mathrm{MF}} \leftarrow \mathcal{D}_{\mathrm{MF}} \cup \{((\mathbf{x}^{\mathrm{pSF}}_t, m), y_t)\}$
19:    $\Lambda \leftarrow \Lambda - \lambda_m$
20:  end if
21:  Perform Bayesian updates $\mu_{\mathrm{SF}}, \sigma_{\mathrm{SF}}, \mu_{\mathrm{MF}}, \sigma_{\mathrm{MF}}$
22:  $t \leftarrow t + 1$
23: end while
24: $S \leftarrow \{\mathbf{x} \in \mathcal{X} \mid \sigma_{\mathrm{MF}}(\mathbf{x}, m) \le c_1\}$
25: $\mathbf{x}^{\mathrm{pSF}}_t \leftarrow \arg\max_{\mathbf{x} \in S} \mu_{\mathrm{MF}}(\mathbf{x}, m)$
26: $y_t \leftarrow f(\mathbf{x}^{\mathrm{pSF}}_t, m)$
The whole procedure is summarized in Algorithm 1, and an extended version is discussed in Supplementary C. Note that lines 24-26 guarantee that if the maximizer is one of the unobserved pseudo-points, it is converted into an actual observation, a requirement in the proof of Theorem 1.
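To make the control flow of Algorithm 1 concrete, below is a minimal Python sketch of a single rMFBO round. The helper interfaces (gp_sf, mogp, propose_sf, propose_mf, query, relevance) are illustrative assumptions and not an implementation accompanying the paper.

def rmfbo_round(gp_sf, mogp, propose_sf, propose_mf, query, costs, m, c1, c2, relevance):
    """One round of Algorithm 1: accept the MFBO proposal only when both
    conditions (4) and (5) hold; otherwise fall back to the primary IS m."""
    x_psf = propose_sf(gp_sf)                      # line 6: pseudo-SFBO proposal
    x_mf, l_mf = propose_mf(mogp)                  # line 7: MFBO proposal
    if mogp.std(x_psf, m) <= c1 and relevance(mogp, x_mf, l_mf) >= c2:
        y = query(x_mf, l_mf)                      # lines 9-10: (possibly cheap) query
        mogp.add((x_mf, l_mf), y)
        y_pseudo = mogp.mean(x_psf, m)             # line 12: pseudo-observation
        gp_sf.add(x_psf, y_pseudo)                 # line 13
        spent = costs[l_mf]                        # line 14
    else:
        y = query(x_psf, m)                        # line 16: primary IS query
        gp_sf.add(x_psf, y)                        # line 17
        mogp.add((x_psf, m), y)                    # line 18
        spent = costs[m]                           # line 19
    return spent                                   # caller subtracts this from the budget

The caller subtracts the returned cost from the remaining budget, mirroring lines 14 and 19 of Algorithm 1.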
6 THEORETICAL RESULTS
In this section, we tie the regret of rMFBO to that of its SFBO counterpart. The derivation holds for any relevance measure $s$.
Let us first define the function $f: \mathcal{X} \times \llbracket m \rrbracket \to \mathbb{R}$ such that $f(\mathbf{x}, \ell) = f^{(\ell)}(\mathbf{x})$ for all $(\mathbf{x}, \ell) \in \mathcal{X} \times \llbracket m \rrbracket$. We assume that $\mathcal{X}$ is a convex compact subset of $\mathbb{R}^d$, and we make the following assumptions about $f$:

Assumption 1 ($f$ is drawn from a MOGP). Assume $f$ is a draw from a MOGP with zero mean and covariance function $\kappa((\mathbf{x}, \ell), (\mathbf{x}', \ell'))$. In other words, $\{f(\mathbf{x}_i, \ell_i)\}_i$ is multivariate normal for any finite set of input-IS pairs $\{(\mathbf{x}_i, \ell_i)\}_i$.

Assumption 2. $\kappa$ is known.

Assumption 3. $\kappa$ is at least twice differentiable.

These assumptions are common in the Bayesian optimization literature; see, for instance, Srinivas et al. (2012) and Kandasamy et al. (2016). We also follow these authors in the next assumption.
Assumption 4 (Bounded derivatives with high probability).
$$P\left(\sup_{\mathbf{x} \in \mathcal{X}} \left|\frac{\partial f}{\partial x_j}\right| > L\right) \le a e^{-(L/b_j)^2}, \quad \forall j \in \llbracket d \rrbracket,$$
for some constants $a, b_j > 0$.

Since a function with bounded partial derivatives (with a uniform bound $L$) is Lipschitz continuous (with a Lipschitz constant $dL$), Assumption 4 implies, by complementing and the union bound, that
$$|f^{(m)}(\mathbf{x}) - f^{(m)}(\mathbf{x}')| \le dL \, \|\mathbf{x} - \mathbf{x}'\|_2 \quad \forall \mathbf{x}, \mathbf{x}' \in \mathcal{X},$$
with probability greater than $1 - d\,a\,e^{-(L/b)^2}$, where $b := \max_j b_j$. Further, Assumption 4 is satisfied for four times differentiable kernels (Ghosal and Roy, 2006, Theorem 5).
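For completeness, the complementing-and-union-bound step behind the Lipschitz statement can be spelled out as follows; this is a short derivation sketch consistent with Assumption 4, not text taken verbatim from the paper.

% union bound over the d coordinates
$$P\Big(\exists\, j \in \llbracket d \rrbracket:\ \sup_{\mathbf{x} \in \mathcal{X}} \Big|\frac{\partial f}{\partial x_j}\Big| > L\Big) \;\le\; \sum_{j=1}^{d} a e^{-(L/b_j)^2} \;\le\; d\,a\,e^{-(L/b)^2}, \qquad b := \max_j b_j.$$
% on the complementary event all partial derivatives are bounded by L,
% and the mean value theorem gives
$$|f^{(m)}(\mathbf{x}) - f^{(m)}(\mathbf{x}')| \;\le\; L\,\|\mathbf{x}-\mathbf{x}'\|_1 \;\le\; \sqrt{d}\,L\,\|\mathbf{x}-\mathbf{x}'\|_2 \;\le\; d\,L\,\|\mathbf{x}-\mathbf{x}'\|_2 \qquad \forall\, \mathbf{x},\mathbf{x}' \in \mathcal{X},$$
which therefore holds with probability at least $1 - d\,a\,e^{-(L/b)^2}$.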
The next assumptions relate to the acquisition function.

Assumption 5. For any round $t \in \mathbb{N}$, we assume that the mapping $(\mathbf{x}, \mathcal{D}_t) \mapsto \alpha(\mathbf{x}, m, \mathcal{D}_t)$ is $C^2$.

Assumption 6. The Hessian $\nabla^2_{\mathbf{x}} \alpha(\mathbf{x}, m)$ is a definite matrix at the optimum $\mathbf{x} = \mathbf{x}^*$.

Running Algorithm 1 for $T$ rounds returns the trajectory $\{\mathbf{x}^{\mathrm{pSF}}_t\}_{t=1}^T$, which consists of primary IS queries and pseudo-primary IS queries. Moreover, we denote the acquisition trajectory returned by the single-fidelity counterpart as $\{\mathbf{x}^{\mathrm{SF}}_t\}_{t=1}^T$. Our reasoning is as follows: we first control the closeness of the two acquisition trajectories, then derive a lower bound on the difference of their regrets.
Let us consider the dataset as a $t(d+1)$-dimensional vector $\mathcal{D}_t = (x^{(1)}_1, \dots, x^{(d)}_t, y_1, \dots, y_t)$. Let $\overline{\mathcal{D}_t}$ be the closed line segment joining two datasets $\mathcal{D}^A_t$ and $\mathcal{D}^B_t$. We introduce a concept, the maximum rate of change of the next query with respect to $\mathcal{D}_t$, defined as the random variable $M_t$,
$$M_t = \max_{\mathcal{D} \in \overline{\mathcal{D}_t}} \left\| \frac{\partial \mathbf{x}_{t+1}}{\partial \mathcal{D}}(\mathcal{D}) \right\|_{\mathrm{op}},$$
where $\|\cdot\|_{\mathrm{op}}$ is the operator norm. $M_t$ measures the sensitivity of the next query when moving from a dataset $\mathcal{D}^A_t$ to a dataset $\mathcal{D}^B_t$. It depends on the smoothness of the objective function $f$, the kernel $k$, and the acquisition function $\alpha$. The detailed formulas, Eqs. (13)-(14), and the computation details for $M_t$ can be found in Supplementary B and D. Consider $\mathcal{D}^A_t = \mathcal{D}^{\mathrm{pSF}}_t$ and $\mathcal{D}^B_t = \mathcal{D}^{\mathrm{SF}}_t$, and let us denote by $\hat{M}_t$ the largest product of any combination of $M_0, \dots, M_{t-1}$,
$$\hat{M}_t = \max_{S \in 2^{\llbracket t-1 \rrbracket}} \prod_{k \in S} M_k. \qquad (7)$$
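Since each $M_k$ is a non-negative operator norm, the subset maximizing the product in Eq. (7) keeps exactly the factors larger than one (the empty subset contributes the value 1). The following minimal sketch computes $\hat{M}_t$ this way; the function name is an illustrative assumption.

import numpy as np

def m_hat(M_values):
    """Largest product over any subset of {M_0, ..., M_{t-1}} (Eq. 7).

    Because every M_k >= 0, the maximizing subset keeps exactly the factors
    larger than 1; the empty subset (product 1) covers the remaining cases.
    """
    M = np.asarray(M_values, dtype=float)
    return float(np.prod(M[M > 1.0])) if np.any(M > 1.0) else 1.0

# toy usage: sensitivities below 1 do not inflate the bound
print(m_hat([0.5, 1.3, 0.9, 2.0]))   # -> 2.6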