Multi-Fidelity Bayesian Optimization with Unreliable Information Sources

Petrus Mikkola (Aalto University), Julien Martinelli (Aalto University), Louis Filstroff (ENSAI, CREST), Samuel Kaski (Aalto University; University of Manchester)

Abstract
Bayesian optimization (BO) is a powerful framework for optimizing black-box, expensive-to-evaluate functions. Over the past decade, many algorithms have been proposed to integrate cheaper, lower-fidelity approximations of the objective function into the optimization process, with the goal of converging towards the global optimum at a reduced cost. This task is generally referred to as multi-fidelity Bayesian optimization (MFBO). However, MFBO algorithms can lead to higher optimization costs than their vanilla BO counterparts, especially when the low-fidelity sources are poor approximations of the objective function, thereby defeating their purpose. To address this issue, we propose rMFBO (robust MFBO), a methodology to make any GP-based MFBO scheme robust to the addition of unreliable information sources. rMFBO comes with a theoretical guarantee that its performance can be bounded relative to its vanilla BO analog, with high, controllable probability. We demonstrate the effectiveness of the proposed methodology on a number of numerical benchmarks, outperforming earlier MFBO methods in the presence of unreliable sources. We expect rMFBO to be particularly useful for reliably including human experts with varying knowledge within BO processes.
1 INTRODUCTION
Bayesian optimization (BO) has become a popular framework for global optimization of black-box functions, especially when they are expensive to evaluate (Jones et al., 1998; Brochu et al., 2010). Such functions have neither a known functional form nor derivatives, and conventional
optimization techniques such as gradient descent cannot be directly employed. BO rests upon two key elements. First, it constructs a probabilistic surrogate model of the objective function with built-in uncertainty estimates, typically a Gaussian process (GP), based on evaluations of the function. The obtained surrogate is then used to select the next point to evaluate by maximizing a so-called acquisition function, which quantifies the expected utility of evaluating a specific point for the purpose of optimizing the black-box function. Many off-the-shelf acquisition functions achieve this task while balancing exploration and exploitation. Iterating these two steps produces a sequence of designs whose aim is to converge to the global optimum using a limited number of function queries. BO has proven effective for a variety of problems, including hyperparameter optimization (Snoek et al., 2012), materials science (Zhang et al., 2020), and drug discovery (Gómez-Bombarelli et al., 2018; Korovina et al., 2020).
In many scenarios, lower-fidelity approximations of the objective function are available at a cheaper query cost. This occurs for instance when the evaluation of the objective function involves a numerical scheme, where computational cost and accuracy can be traded off. Another example is the knowledge of domain experts. Indeed, practitioners may have implicit knowledge of the objective function; for instance, they may be able to point out good candidate regions for the location of the global optimum (Hvarfner et al., 2022). Such knowledge may naturally be considered as a low-fidelity version of the true objective function.

The problem of integrating these auxiliary information sources (ISs) to reduce the cost of BO has been tackled in the literature under the name multi-fidelity Bayesian optimization (MFBO) (Huang et al., 2006; Kandasamy et al., 2016; Zhang et al., 2017; Sen et al., 2018; Song et al., 2019; Takeno et al., 2020; Li et al., 2020; Moss et al., 2021) when the different sources can be ranked by their degree of fidelity; when this is not possible, the problem has been studied as multi-task BO (Swersky et al., 2013), non-hierarchical multi-fidelity BO (Lam et al., 2015), or multi-information source BO (Poloczek et al., 2017). However, as we will empirically demonstrate, state-of-the-art MFBO algorithms can fail when the auxiliary ISs are poor approximations of the
primary IS. More precisely, for a fixed budget, these algorithms will lead to a higher regret w.r.t. their single-fidelity counterparts (i.e., vanilla BO, which uses the primary IS only), defeating their purpose. For instance, the guarantees of Kandasamy et al. (2016) require that the deviation between an auxiliary IS and the primary IS is bounded by a constant known beforehand, which hardly ever holds in practice, for example when working with a human expert, or when experimenting with simulations to find the optimal control parameters for a robotic system (Marco et al., 2017).

Surprisingly enough, this issue has not been formally addressed in the BO literature so far. We fill this gap by introducing rMFBO, a methodology to make any GP-based MFBO algorithm robust to the addition of unreliable information sources. More precisely, rMFBO comes with a theoretical guarantee that its performance can be bounded relative to its vanilla BO analog, with high, controllable probability. To the best of our knowledge, rMFBO is the first MFBO scheme providing such performance guarantees. We then proceed to demonstrate the effectiveness of the proposed methodology in various numerical settings using different MFBO algorithms from the literature. Through its building-block nature, rMFBO paves the way towards a more systematic usage of auxiliary ISs independently of their degree of fidelity, allowing human experts to join the optimization process in a reliable manner.
2 PRELIMINARIES
Gaussian process regression
We begin by introducing the notation for single-output Gaussian process regression, the probabilistic surrogate upon which BO rests. Consider a dataset $\mathcal{D} = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)\}$ with $(\mathbf{x}_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$, for which we want to learn a model of the form $y_i = f(\mathbf{x}_i) + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^2_{\mathrm{noise}})$ for all $i$. We may place a (zero-mean) GP prior on $f$:
$$f(\mathbf{x}) \sim \mathcal{GP}(0, k(\mathbf{x}, \mathbf{x}')). \qquad (1)$$
This defines a distribution over functions $f$ whose mean is $\mathbb{E}[f(\mathbf{x})] = 0$ and covariance $\mathrm{cov}[f(\mathbf{x}), f(\mathbf{x}')] = k(\mathbf{x}, \mathbf{x}')$. Here, $k$ is a kernel function measuring the similarity between inputs. Consequently, for any finite collection of inputs $(\mathbf{x}_1, \dots, \mathbf{x}_n)$, the function values $\mathbf{f} = (f(\mathbf{x}_1), \dots, f(\mathbf{x}_n))^\top \in \mathbb{R}^n$ follow a multivariate normal distribution $\mathbf{f} \sim \mathcal{N}(\mathbf{0}, K)$, where $K = (k(\mathbf{x}_i, \mathbf{x}_j))_{1 \le i,j \le n} \in \mathbb{R}^{n \times n}$ is the kernel matrix.

Given $\mathcal{D}$, the posterior predictive distribution $p(f(\mathbf{x}) \mid \mathcal{D})$ is Gaussian for all $\mathbf{x}$, with mean $\mu(\mathbf{x})$ and variance $\sigma^2(\mathbf{x})$ such that
$$\mu(\mathbf{x}) = \mathbf{k}_{\mathbf{x}}^\top (K + \sigma^2_{\mathrm{noise}} I)^{-1} \mathbf{y},$$
$$\sigma^2(\mathbf{x}) = k(\mathbf{x}, \mathbf{x}) - \mathbf{k}_{\mathbf{x}}^\top (K + \sigma^2_{\mathrm{noise}} I)^{-1} \mathbf{k}_{\mathbf{x}},$$
where $\mathbf{y} = [y_1, \dots, y_n]^\top \in \mathbb{R}^n$ and $\mathbf{k}_{\mathbf{x}} = [k(\mathbf{x}, \mathbf{x}_1), \dots, k(\mathbf{x}, \mathbf{x}_n)]^\top \in \mathbb{R}^n$.
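As a concrete illustration of these formulas, here is a minimal NumPy sketch of the posterior mean and variance. The RBF kernel, the noise level, and the function names are illustrative assumptions, not choices made in the paper.

import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 * lengthscale^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X, y, X_star, sigma_noise=0.1, lengthscale=1.0):
    """Posterior mean and variance of a zero-mean GP at test points X_star."""
    K = rbf_kernel(X, X, lengthscale)                 # n x n kernel matrix
    K_s = rbf_kernel(X_star, X, lengthscale)          # n* x n cross-covariances k_x
    A = K + sigma_noise**2 * np.eye(len(X))
    alpha = np.linalg.solve(A, y)                     # (K + sigma^2 I)^{-1} y
    mu = K_s @ alpha
    v = np.linalg.solve(A, K_s.T)                     # (K + sigma^2 I)^{-1} k_x
    var = rbf_kernel(X_star, X_star, lengthscale).diagonal() - np.sum(K_s * v.T, axis=1)
    return mu, var

# toy usage
X = np.random.rand(10, 2); y = np.sin(X.sum(axis=1))
mu, var = gp_posterior(X, y, np.random.rand(5, 2))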
Multi-output Gaussian process regression
GPs can be extended to multi-output Gaussian processes (MOGPs), modeling any collection of $m$-sized vector-valued outputs $(\mathbf{y}_1, \dots, \mathbf{y}_n)$ based on inputs $(\mathbf{x}_1, \dots, \mathbf{x}_n)$ as a multivariate normal distribution. One way to achieve this extension is through the addition of a $(d+1)$-th dimension to the input space, representing the output index $1 \le \ell \le m$. This enables treating a MOGP as a single-output GP acting on the augmented space $\mathbb{R}^{d+1}$ through the kernel $k((\mathbf{x}, \ell), (\mathbf{x}', \ell'))$. The latter can, for instance, take the separable form $k((\mathbf{x}, \ell), (\mathbf{x}', \ell')) = k_{\mathrm{input}}(\mathbf{x}, \mathbf{x}') \times k_{\mathrm{IS}}(\ell, \ell')$. In particular, this setting allows for the use of the readily available analytical formulae for the posterior mean and variance of single-output GPs.
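To make the augmented-space view concrete, the following is a minimal sketch of such a separable kernel, assuming an RBF kernel over inputs and a Cholesky-parameterized covariance over IS indices; these modeling choices and names are illustrative, not prescribed by the paper.

import numpy as np

def separable_mf_kernel(XL_a, XL_b, lengthscale=1.0, L_chol=None):
    """k((x, l), (x', l')) = k_input(x, x') * k_IS(l, l').

    XL_a, XL_b: arrays of shape (n, d + 1); the last column holds the integer IS index.
    k_IS is parameterized as B = L_chol @ L_chol.T (positive semi-definite).
    """
    Xa, la = XL_a[:, :-1], XL_a[:, -1].astype(int)
    Xb, lb = XL_b[:, :-1], XL_b[:, -1].astype(int)
    d2 = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(-1)
    k_input = np.exp(-0.5 * d2 / lengthscale**2)
    if L_chol is None:                        # default: 2 sources, mildly correlated
        L_chol = np.array([[1.0, 0.0], [0.8, 0.6]])
    B = L_chol @ L_chol.T                     # k_IS(l, l') = B[l, l']
    return k_input * B[np.ix_(la, lb)]

The resulting kernel matrix can be plugged directly into the single-output posterior formulas above, which is precisely the point of the augmented-space construction.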
Problem setup
We consider the problem of optimizing a black-box function $f^{(m)}: \mathcal{X} \to \mathbb{R}$, where $\mathcal{X}$ is a subset of $\mathbb{R}^d$, i.e., solving
$$\arg\max_{\mathbf{x} \in \mathcal{X}} f^{(m)}(\mathbf{x}). \qquad (2)$$
In addition to $f^{(m)}$ (the primary IS), we may also query $m-1$ other auxiliary functions (auxiliary ISs), $f^{(\ell)}: \mathcal{X} \to \mathbb{R}$, where $\ell \in \llbracket m-1 \rrbracket$ denotes the IS index. The cost of evaluating $f^{(\ell)}(\mathbf{x})$ is $\lambda_\ell$ for any $\mathbf{x} \in \mathcal{X}$. We assume that $\lambda_\ell < \lambda_m$ for any auxiliary IS $\ell \in \llbracket m-1 \rrbracket$. The objective is to solve (2) within the budget $\Lambda > 0$.
Bayesian optimization
At each round $t$, an input-IS pair $(\mathbf{x}, \ell) \in \mathcal{X} \times \llbracket m \rrbracket$ is selected by maximizing the acquisition function $\alpha$, which depends on the GP surrogate model of $f$ given all the data acquired up until round $t-1$:
$$\mathbf{x}_t, \ell_t = \arg\max_{(\mathbf{x}, \ell) \in \mathcal{X} \times \llbracket m \rrbracket} \alpha(\mathbf{x}, \ell). \qquad (3)$$
Querying $f^{(\ell)}(\mathbf{x})$ returns a noisy observation $y^{(\ell)}_{\mathbf{x}} = f^{(\ell)}(\mathbf{x}) + \epsilon$, with i.i.d. noise $\epsilon \sim \mathcal{N}(0, \sigma^2_{\mathrm{noise}})$. We refer to the sequence of queries $\{\mathbf{x}_t\}_{t=1}^T$ returned by a BO algorithm as an acquisition trajectory.

Note that vanilla BO (i.e., BO with the primary IS only) amounts to using the acquisition function $\mathbf{x} \mapsto \alpha(\mathbf{x}, m)$, which will be referred to as single-fidelity BO (SFBO) from now on.
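As a minimal illustration of Eq. (3) and the SFBO special case, the sketch below maximizes a toy cost-aware UCB acquisition over a finite candidate set of input-IS pairs. The acquisition itself, the candidate-set discretization, and all function names are illustrative assumptions, not the acquisition functions used in the paper.

import numpy as np

def select_query(candidates, mu_mf, sigma_mf, costs, m, beta=2.0, primary_only=False):
    """Pick the input-IS pair maximizing a toy cost-aware UCB over a finite set.

    candidates: iterable of (x, l) pairs with l in {1, ..., m};
    mu_mf, sigma_mf: MOGP posterior mean / std callables; costs: dict l -> lambda_l.
    With primary_only=True this collapses to the SFBO rule x -> alpha(x, m).
    """
    best_pair, best_val = None, -np.inf
    for x, l in candidates:
        if primary_only and l != m:
            continue
        alpha = (mu_mf(x, l) + beta * sigma_mf(x, l)) / costs[l]   # toy acquisition
        if alpha > best_val:
            best_pair, best_val = (x, l), alpha
    return best_pair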
Recall that we wish to optimize $f^{(m)}$ within the budget $\Lambda$. In this scenario, the performance metric of interest is the regret of the algorithm, whose definition is recalled below.
Definition 1 (BO regret). The regret of a BO algorithm that spends $\Lambda_{\mathrm{cost}}$ and returns the final choice $\mathbf{x}_{\mathrm{choice}}$ is defined by
$$R(\Lambda_{\mathrm{cost}}, \mathbf{x}_{\mathrm{choice}}) := \begin{cases} f^* - f^{(m)}(\mathbf{x}_{\mathrm{choice}}) & \text{if } \Lambda_{\mathrm{cost}} \le \Lambda, \\ \infty & \text{otherwise,} \end{cases}$$
where $f^* = \max_{\mathbf{x} \in \mathcal{X}} f^{(m)}(\mathbf{x})$ is the global maximum of the primary IS.

Definition 2 (Number of queries). Let $T := \lfloor \Lambda / \lambda_m \rfloor$ be the available number of primary IS queries. Let $T^{(\ell)}$ be the random variable describing the number of $\ell$-th IS queries spent by the MFBO algorithm.

There are two popular choices for $\mathbf{x}_{\mathrm{choice}}$. First, the Bayes-optimal choice
$$\mathbf{x}_{\mathrm{choice}} = \arg\max_{\mathbf{x} \in \mathcal{X}} \mu(\mathbf{x}, m),$$
where $\mu(\mathbf{x}, m)$ is the posterior mean of the GP model given all the data up to the final query $T_{\mathrm{last}} = \sum_{\ell=1}^m T^{(\ell)}$. The regret in this case is called the inference regret. Second, the simple choice
$$\mathbf{x}_{\mathrm{choice}} = \arg\max_{t \in \llbracket T^{(m)} \rrbracket} f^{(m)}(\mathbf{x}_t),$$
where $(\mathbf{x}_1, \dots, \mathbf{x}_{T^{(m)}})$ is the primary IS acquisition trajectory returned by the MFBO algorithm. The regret in that case is called the simple regret.
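For concreteness, here is a minimal sketch of how the two regret notions could be computed at the end of a run on a synthetic benchmark where the global maximum f* is known; all names, and the finite candidate set standing in for X, are illustrative assumptions.

import numpy as np

def regrets(f_primary, f_star, primary_traj, mu_posterior, candidates,
            budget, spent_cost):
    """Simple and inference regret for a finished run (Definitions 1-2).

    primary_traj: list of inputs queried on the primary IS.
    mu_posterior: callable x -> posterior mean mu(x, m) after the last query.
    candidates:   finite set standing in for X when maximizing the posterior mean.
    """
    if spent_cost > budget:                     # over budget => infinite regret
        return np.inf, np.inf
    x_simple = max(primary_traj, key=f_primary)            # simple choice
    x_bayes = max(candidates, key=mu_posterior)            # Bayes-optimal choice
    simple_regret = f_star - f_primary(x_simple)
    inference_regret = f_star - f_primary(x_bayes)
    return simple_regret, inference_regret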
Lastly, we provide an informal definition of an unreliable IS. To the best of our knowledge, no formal definition has been proposed in the literature so far, in view of the non-trivial character of the task. Broadly speaking, an auxiliary IS $\ell$ is said to be unreliable if querying $f^{(\ell)}(\mathbf{x})$ does not lead to a decreasing inference regret when averaged over all sequences of queries $\mathbf{x} \in \mathcal{X}$ and all datasets $\mathcal{D}$ of any size. As such, the relevance of an IS then also depends on the MOGP model chosen.
3 RELATED WORK
As discussed in the introduction, many different multi-fidelity extensions of Bayesian optimization have been proposed in the literature; we refer the interested reader to Takeno et al. (2020, Section 5) for a review. The closest to our work are methods that do not assume a hierarchy between the sources (e.g., when the degree of fidelity cannot be assessed in advance), as in Lam et al. (2015), where the focus lies in designing a GP model that takes into account the non-hierarchical nature of the sources. The multi-fidelity kernel introduced by Poloczek et al. (2017) (see Supplementary F) is one example of such a design.

Surprisingly, the problem of the potential performance degradation of MFBO algorithms has been largely ignored in the literature, with the exception of Kandasamy et al. (2016), who noticed that their multi-fidelity method performed poorly compared to all single-fidelity variants in one experiment (Kandasamy et al., 2016, Supplementary D.3).

Lastly, we mention that robustness has been studied for vanilla BO in the context of (sometimes adversarially) noisy inputs or outputs (Martinez-Cantin et al., 2018; Bogunovic et al., 2018; Fröhlich et al., 2020; Kirschner and Krause, 2021). This notion of robustness is fundamentally different from ours; indeed, we wish to provide guarantees that the addition of an auxiliary IS (or several) will not lead to worse performance w.r.t. vanilla BO.
4 PITFALLS OF MFBO METHODS
We now demonstrate, on a simple example, the influence of the auxiliary IS quality on the performance of MFBO algorithms. Let us consider the Hartmann6D function as the objective (i.e., the primary IS). We examine two scenarios: in the first one, the auxiliary IS is informative, consisting of a biased version of the primary IS with a degree of fidelity of 0.2. In the second scenario, the auxiliary IS is taken to be the 6-dimensional Rosenbrock function, an irrelevant source for this problem. Analytical forms for these examples can be found in Supplementary G. We evaluate the multi-fidelity max-value entropy search (MF-MES) method from Takeno et al. (2020) on these two scenarios, as well as its single-fidelity counterpart (SF-MES), and our proposed algorithm rMFBO built on top of MF-MES (rMF-MES). In both cases, the cost of the primary IS is set to 1, and the cost of the auxiliary IS to 0.2. The simple regret of the three algorithms is displayed in Figure 1.

When the auxiliary IS is informative (left panel), MF-MES converges faster than SF-MES. This is the expected behavior of MFBO algorithms: they use cheap IS queries in the beginning to clear out unpromising regions of the space at a low cost, which eventually speeds up convergence. However, when the auxiliary IS is irrelevant (right panel), there is a clear gap between MF-MES and SF-MES, even in the long run. This demonstrates the inability of MF-MES to deal with an irrelevant IS. In that scenario, we hypothesize that the budget is wasted on uninformative queries, and thus too many rounds are spent on learning that the sources are not correlated (Figure 1, right bar plot), leading to a sub-optimal data acquisition trajectory.

There is therefore a need for a robust method for such a scenario. This is what the proposed rMF-MES, formally introduced in the next section, achieves by taking the best of both worlds: sticking close to the single-fidelity track in case of an irrelevant IS, while using informative lower-fidelity queries to accelerate convergence.
[Figure 1: two panels ("Informative Auxiliary Information Source", "Irrelevant Auxiliary Information Source") plotting simple regret against budget for MF-MES, SF-MES, and rMF-MES on Hartmann6D, with bar plots of the proportion of auxiliary vs. primary IS queries.]

Figure 1: Simple regret as a function of budget spent in two multi-fidelity problems, averaged over 100 repetitions. The informative auxiliary IS helps reduce the cost of BO (left panel: MF-MES in purple reduces regret faster than SF-MES in green), whereas the irrelevant IS catastrophically disrupts performance (right panel: MF-MES does not reach the low regret of SF-MES at all). In both settings, the primary IS cost is set to 1 and the auxiliary IS cost to 0.2. From relevant to irrelevant IS, the proportion of auxiliary IS queries remains high for MF-MES, while rMF-MES is more consistent (lower panel).
5 ROBUST MFBO ALGORITHM
In this section, we introduce rMFBO (robust MFBO), a methodology to make any GP-based MFBO scheme robust to the addition of unreliable ISs. The key idea is to control the quality of the acquisitions to prevent the MFBO algorithm from behaving as described at the end of Section 4.

At round $t$, based on the acquisition function $\alpha$, MFBO proposes the query $(\mathbf{x}^{\mathrm{MF}}_t, \ell_t)$ according to Eq. (3). The question is to decide whether to execute this query, or to go with a more conservative query from the primary IS. Indeed, we wish to curb the potential performance deterioration w.r.t. vanilla BO. To do so, we introduce a concurrent pseudo-SFBO algorithm, which constructs a GP surrogate based on data from the primary IS only, together with so-called pseudo-observations, introduced later on. The pseudo-SFBO uses the acquisition function $\mathbf{x} \mapsto \alpha(\mathbf{x}, m)$ and a separate single-output GP, yielding the query $(\mathbf{x}^{\mathrm{pSF}}_t, m)$.

Let us denote the predictive mean and standard deviation of the MOGP model (used by the MFBO algorithm) by $\mu_{\mathrm{MF}}$ and $\sigma_{\mathrm{MF}}$, and those of the GP model (used by the pseudo-SFBO algorithm) by $\mu_{\mathrm{SF}}$ and $\sigma_{\mathrm{SF}}$. In a nutshell, the proposed rMFBO follows the conservative query $\mathbf{x}^{\mathrm{pSF}}_t$, unless the predictive variance of the MOGP model at $\mathbf{x}^{\mathrm{pSF}}_t$ is small enough:
$$\sigma_{\mathrm{MF}}(\mathbf{x}^{\mathrm{pSF}}_t, m) \le c_1, \qquad (4)$$
where $c_1 > 0$ is a user-specified parameter. The pseudo-SFBO learns from all the samples, even when the MFBO candidate $\mathbf{x}^{\mathrm{MF}}_t$ is queried, by adding the pseudo-observation $\mu_{\mathrm{MF}}(\mathbf{x}^{\mathrm{pSF}}_t, m)$.

It can easily be seen that if $c_1 \to 0$, the described algorithm becomes the SFBO algorithm, since the MFBO proposals would always be ignored. Our main result, formally discussed in Section 6, is that we are able to derive a lower bound on the regret difference between robust MFBO and SFBO as a function of $c_1 > 0$.
While condition (4) ensures that we can achieve performance similar to SFBO when the auxiliary IS is irrelevant, we also want to reap the benefits of the multi-fidelity approach when the auxiliary IS is relevant. To that end, we introduce a measure of IS relevance, $s$, and add this second condition for the acceptance of the MF query:
$$s(\mathbf{x}^{\mathrm{MF}}_t, \ell_t) \ge c_2, \qquad (5)$$
with $c_2 > 0$ a user-specified parameter. We want to draw attention to the fact that condition (4) operates on the SFBO proposal, while condition (5) acts on the MFBO proposal, possibly based on an auxiliary IS. Condition (5) makes rMFBO revert more often to the primary IS, and makes the pseudo-observations more accurate overall. This allows the algorithm to consider less conservative values for $c_1$, opening the door for the exploitation of auxiliary ISs. In this paper, we use a cost-adjusted information gain (Takeno et al., 2020),
$$s(\mathbf{x}, \ell) = \frac{I\big(f^{(\ell)}(\mathbf{x}), f^* \mid \mathcal{D}_{\mathrm{MF}}\big)}{\lambda_\ell}, \qquad (6)$$
where $I$ is the mutual information between the observation $f^{(\ell)}(\mathbf{x})$ and the maximal value of $f^{(m)}$, $f^* := \max_{\mathbf{x} \in \mathcal{X}} f^{(m)}(\mathbf{x})$; in other words, the information gain on $f^*$ brought by the observation $f^{(\ell)}(\mathbf{x})$.
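As a rough illustration of Eq. (6), the sketch below estimates the mutual information with a standard max-value entropy search style approximation based on posterior moments and Monte Carlo samples of f*, then divides by the query cost. This is only a single-fidelity-style stand-in for the information gain used by Takeno et al. (2020); the function names and the way the f* samples are obtained are assumptions made for illustration.

import numpy as np
from scipy.stats import norm

def cost_adjusted_relevance(mu, sigma, f_star_samples, cost):
    """Toy estimate of s(x, l) = I(f^(l)(x); f* | D_MF) / lambda_l.

    mu, sigma: MOGP posterior mean and std of f^(l)(x) given D_MF.
    f_star_samples: Monte Carlo samples of the posterior maximum f*
    (e.g., via Thompson sampling of the primary-IS posterior).
    """
    gamma = (np.asarray(f_star_samples) - mu) / max(sigma, 1e-12)
    mi = np.mean(gamma * norm.pdf(gamma) / (2.0 * norm.cdf(gamma)) - norm.logcdf(gamma))
    return mi / cost

# toy usage: a cheap IS (cost 0.2) with a fairly uncertain prediction
print(cost_adjusted_relevance(mu=0.3, sigma=0.5,
                              f_star_samples=np.random.normal(1.0, 0.1, 100),
                              cost=0.2))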
Algorithm 1 Robust MFBO algorithm
1: Input: Budget $\Lambda$, costs $(\lambda_1, \dots, \lambda_m)$, acquisition function $\alpha$, hyperparameters $c_1$ and $c_2$, relevance measure $s$
2: Initialize $\mathcal{D}_{\mathrm{pSF}}, \mathcal{D}_{\mathrm{MF}}$
3: Perform Bayesian updates $\mu_{\mathrm{SF}}, \sigma_{\mathrm{SF}}, \mu_{\mathrm{MF}}, \sigma_{\mathrm{MF}}$
4: $t \leftarrow 1$
5: while $\lfloor \Lambda / \lambda_m \rfloor \ge 2\lambda_m$ do
6:   $\mathbf{x}^{\mathrm{pSF}}_t \leftarrow \arg\max_{\mathbf{x}} \alpha(\mathbf{x}, m \mid \mu_{\mathrm{SF}}, \sigma_{\mathrm{SF}})$
7:   $(\mathbf{x}^{\mathrm{MF}}_t, \ell_t) \leftarrow \arg\max_{\mathbf{x}, \ell} \alpha(\mathbf{x}, \ell \mid \mu_{\mathrm{MF}}, \sigma_{\mathrm{MF}})$
8:   if $\sigma_{\mathrm{MF}}(\mathbf{x}^{\mathrm{pSF}}_t, m) \le c_1$ and $s(\mathbf{x}^{\mathrm{MF}}_t, \ell_t) \ge c_2$ then
9:     $y_t \leftarrow f(\mathbf{x}^{\mathrm{MF}}_t, \ell_t)$
10:    $\mathcal{D}_{\mathrm{MF}} \leftarrow \mathcal{D}_{\mathrm{MF}} \cup \{((\mathbf{x}^{\mathrm{MF}}_t, \ell_t), y_t)\}$
11:    Perform Bayesian updates $\mu_{\mathrm{MF}}, \sigma_{\mathrm{MF}}$
12:    $y_t \leftarrow \mu_{\mathrm{MF}}(\mathbf{x}^{\mathrm{pSF}}_t, m)$   # pseudo-observation
13:    $\mathcal{D}_{\mathrm{pSF}} \leftarrow \mathcal{D}_{\mathrm{pSF}} \cup \{(\mathbf{x}^{\mathrm{pSF}}_t, y_t)\}$
14:    $\Lambda \leftarrow \Lambda - \lambda_{\ell_t}$
15:  else
16:    $y_t \leftarrow f(\mathbf{x}^{\mathrm{pSF}}_t, m)$
17:    $\mathcal{D}_{\mathrm{pSF}} \leftarrow \mathcal{D}_{\mathrm{pSF}} \cup \{(\mathbf{x}^{\mathrm{pSF}}_t, y_t)\}$
18:    $\mathcal{D}_{\mathrm{MF}} \leftarrow \mathcal{D}_{\mathrm{MF}} \cup \{((\mathbf{x}^{\mathrm{pSF}}_t, m), y_t)\}$
19:    $\Lambda \leftarrow \Lambda - \lambda_m$
20:  end if
21:  Perform Bayesian updates $\mu_{\mathrm{SF}}, \sigma_{\mathrm{SF}}, \mu_{\mathrm{MF}}, \sigma_{\mathrm{MF}}$
22:  $t \leftarrow t + 1$
23: end while
24: $S \leftarrow \{\mathbf{x} \in \mathcal{X} \mid \sigma_{\mathrm{MF}}(\mathbf{x}, m) \le c_1\}$
25: $\mathbf{x}^{\mathrm{pSF}}_t \leftarrow \arg\max_{\mathbf{x} \in S} \mu_{\mathrm{MF}}(\mathbf{x}, m)$
26: $y_t \leftarrow f(\mathbf{x}^{\mathrm{pSF}}_t, m)$
The whole procedure is summarized in Algorithm 1, and an extended version is discussed in Supplementary C. Note that lines 24-26 guarantee that if the maximizer is one of the unobserved pseudo-points, it is converted into an actual observation, a requirement in the proof of Theorem 1.
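To make the control flow of Algorithm 1 concrete, below is a minimal Python sketch of a single rMFBO round. The helper interfaces (gp_sf, mogp, propose_sf, propose_mf, query, relevance) are illustrative assumptions and not an implementation accompanying the paper.

def rmfbo_round(gp_sf, mogp, propose_sf, propose_mf, query, costs, m, c1, c2, relevance):
    """One round of Algorithm 1: accept the MFBO proposal only when both
    conditions (4) and (5) hold; otherwise fall back to the primary IS m."""
    x_psf = propose_sf(gp_sf)                      # line 6: pseudo-SFBO proposal
    x_mf, l_mf = propose_mf(mogp)                  # line 7: MFBO proposal
    if mogp.std(x_psf, m) <= c1 and relevance(mogp, x_mf, l_mf) >= c2:
        y = query(x_mf, l_mf)                      # lines 9-10: (possibly cheap) query
        mogp.add((x_mf, l_mf), y)
        y_pseudo = mogp.mean(x_psf, m)             # line 12: pseudo-observation
        gp_sf.add(x_psf, y_pseudo)                 # line 13
        spent = costs[l_mf]                        # line 14
    else:
        y = query(x_psf, m)                        # line 16: primary IS query
        gp_sf.add(x_psf, y)                        # line 17
        mogp.add((x_psf, m), y)                    # line 18
        spent = costs[m]                           # line 19
    return spent                                   # caller subtracts this from the budget

The caller subtracts the returned cost from the remaining budget, mirroring lines 14 and 19 of Algorithm 1.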
6 THEORETICAL RESULTS
In this section, we tie the regret of rMFBO to that of its SFBO counterpart. The derivation holds for any relevance measure $s$.
Let us first define the function $f: \mathcal{X} \times \llbracket m \rrbracket \to \mathbb{R}$ such that $f(\mathbf{x}, \ell) = f^{(\ell)}(\mathbf{x})$ for all $(\mathbf{x}, \ell) \in \mathcal{X} \times \llbracket m \rrbracket$. We assume that $\mathcal{X}$ is a convex compact subset of $\mathbb{R}^d$, and we make the following assumptions about $f$:

Assumption 1 ($f$ is drawn from a MOGP). Assume $f$ is a draw from a MOGP with zero mean and covariance function $\kappa((\mathbf{x}, \ell), (\mathbf{x}', \ell'))$. In other words, $\{f(\mathbf{x}_i, \ell_i)\}_i$ is multivariate normal for any finite set of input-IS pairs $\{(\mathbf{x}_i, \ell_i)\}_i$.

Assumption 2. $\kappa$ is known.

Assumption 3. $\kappa$ is at least twice differentiable.

These assumptions are common in the Bayesian optimization literature; see, for instance, Srinivas et al. (2012) and Kandasamy et al. (2016). We also follow these authors in the next assumption.
Assumption 4 (Bounded derivatives with high probability).
$$P\left(\sup_{\mathbf{x} \in \mathcal{X}} \left|\frac{\partial f}{\partial x_j}\right| > L\right) \le a e^{-(L/b_j)^2}, \quad \forall j \in \llbracket d \rrbracket,$$
for some constants $a, b_j > 0$.

Since a function with bounded partial derivatives (with a uniform bound $L$) is Lipschitz continuous (with a Lipschitz constant $dL$), Assumption 4 implies, by complementing and the union bound, that
$$|f^{(m)}(\mathbf{x}) - f^{(m)}(\mathbf{x}')| \le dL \, \|\mathbf{x} - \mathbf{x}'\|_2 \quad \forall \mathbf{x}, \mathbf{x}' \in \mathcal{X},$$
with probability greater than $1 - d\,a\,e^{-(L/b)^2}$, where $b := \max_j b_j$. Further, Assumption 4 is satisfied for four times differentiable kernels (Ghosal and Roy, 2006, Theorem 5).
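For completeness, the complementing-and-union-bound step behind the Lipschitz statement can be spelled out as follows; this is a short derivation sketch consistent with Assumption 4, not text taken verbatim from the paper.

% union bound over the d coordinates
$$P\Big(\exists\, j \in \llbracket d \rrbracket:\ \sup_{\mathbf{x} \in \mathcal{X}} \Big|\frac{\partial f}{\partial x_j}\Big| > L\Big) \;\le\; \sum_{j=1}^{d} a e^{-(L/b_j)^2} \;\le\; d\,a\,e^{-(L/b)^2}, \qquad b := \max_j b_j.$$
% on the complementary event all partial derivatives are bounded by L,
% and the mean value theorem gives
$$|f^{(m)}(\mathbf{x}) - f^{(m)}(\mathbf{x}')| \;\le\; L\,\|\mathbf{x}-\mathbf{x}'\|_1 \;\le\; \sqrt{d}\,L\,\|\mathbf{x}-\mathbf{x}'\|_2 \;\le\; d\,L\,\|\mathbf{x}-\mathbf{x}'\|_2 \qquad \forall\, \mathbf{x},\mathbf{x}' \in \mathcal{X},$$
which therefore holds with probability at least $1 - d\,a\,e^{-(L/b)^2}$.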
The next assumptions relate to the acquisition function.

Assumption 5. For any round $t \in \mathbb{N}$, we assume that the mapping $(\mathbf{x}, \mathcal{D}_t) \mapsto \alpha(\mathbf{x}, m, \mathcal{D}_t)$ is $C^2$.

Assumption 6. The Hessian $\nabla^2_{\mathbf{x}} \alpha(\mathbf{x}, m)$ is a definite matrix at the optimum $\mathbf{x} = \mathbf{x}^*$.

Running Algorithm 1 for $T$ rounds returns the trajectory $\{\mathbf{x}^{\mathrm{pSF}}_t\}_{t=1}^T$, which consists of primary IS queries and pseudo-primary IS queries. Moreover, we denote the acquisition trajectory returned by the single-fidelity counterpart as $\{\mathbf{x}^{\mathrm{SF}}_t\}_{t=1}^T$. Our reasoning is as follows: we first control the closeness of the two acquisition trajectories, then derive a lower bound on the difference of their regrets.
Let us consider the dataset as a $t(d+1)$-dimensional vector $\mathcal{D}_t = (x^{(1)}_1, \dots, x^{(d)}_t, y_1, \dots, y_t)$. Let $\overline{\mathcal{D}_t}$ be the closed line segment joining two datasets $\mathcal{D}^A_t$ and $\mathcal{D}^B_t$. We introduce a concept, the maximum rate of change of the next query with respect to $\mathcal{D}_t$, defined as the random variable $M_t$,
$$M_t = \max_{\mathcal{D} \in \overline{\mathcal{D}_t}} \left\| \frac{\partial \mathbf{x}_{t+1}}{\partial \mathcal{D}}(\mathcal{D}) \right\|_{\mathrm{op}},$$
where $\|\cdot\|_{\mathrm{op}}$ is the operator norm. $M_t$ measures the sensitivity of the next query when moving from a dataset $\mathcal{D}^A_t$ to a dataset $\mathcal{D}^B_t$. It depends on the smoothness of the objective function $f$, the kernel $k$, and the acquisition function $\alpha$. The detailed formulas, Eqs. (13)-(14), and the computation details for $M_t$ can be found in Supplementary B and D. Consider $\mathcal{D}^A_t = \mathcal{D}^{\mathrm{pSF}}_t$ and $\mathcal{D}^B_t = \mathcal{D}^{\mathrm{SF}}_t$, and let us denote by $\hat{M}_t$ the largest product of any combination of $M_0, \dots, M_{t-1}$,
$$\hat{M}_t = \max_{S \in 2^{\llbracket t-1 \rrbracket}} \prod_{k \in S} M_k. \qquad (7)$$
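Since each $M_k$ is a non-negative operator norm, the subset maximizing the product in Eq. (7) keeps exactly the factors larger than one (the empty subset contributes the value 1). The following minimal sketch computes $\hat{M}_t$ this way; the function name is an illustrative assumption.

import numpy as np

def m_hat(M_values):
    """Largest product over any subset of {M_0, ..., M_{t-1}} (Eq. 7).

    Because every M_k >= 0, the maximizing subset keeps exactly the factors
    larger than 1; the empty subset (product 1) covers the remaining cases.
    """
    M = np.asarray(M_values, dtype=float)
    return float(np.prod(M[M > 1.0])) if np.any(M > 1.0) else 1.0

# toy usage: sensitivities below 1 do not inflate the bound
print(m_hat([0.5, 1.3, 0.9, 2.0]))   # -> 2.6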