Joint Entropy Search for Multi-Objective Bayesian
Optimization
Ben Tu†, Axel Gandy†, Nikolas Kantas†, Behrang Shafei‡
†Imperial College London
‡BASF SE
ben.tu16@imperial.ac.uk
Abstract
Many real-world problems can be phrased as a multi-objective optimization prob-
lem, where the goal is to identify the best set of compromises between the compet-
ing objectives. Multi-objective Bayesian optimization (BO) is a sample efficient
strategy that can be deployed to solve these vector-valued optimization problems
where access is limited to a number of noisy objective function evaluations. In
this paper, we propose a novel information-theoretic acquisition function for BO
called Joint Entropy Search (JES), which considers the joint information gain for
the optimal set of inputs and outputs. We present several analytical approximations
to the JES acquisition function and also introduce an extension to the batch setting.
We showcase the effectiveness of this new approach on a range of synthetic and
real-world problems in terms of the hypervolume and its weighted variants.
1 Introduction
Bayesian optimization (BO) has demonstrated a lot of success in solving black-box optimization problems in various domains such as machine learning [76, 77, 91], chemistry [27, 35], robotics [7, 14] and clinical trials [63, 79]. The procedure works by maintaining a probabilistic model of the
observed data in order to guide the optimization procedure into regions of interest. Specifically, at
each iteration the black-box function is evaluated at one or more input locations that maximize
an acquisition function on the model. Implicitly, this function strikes a balance between exploring
new areas and exploiting areas that have been shown to be promising. In this work, we consider the
more general problem, where the black-box function of interest is vector-valued. This increases
the difficulty of the problem because there are now many directions in which the objectives can be
improved, in contrast to the single-objective setting where there is only one. Informally, the end goal
of multi-objective optimization is to identify a collection of points that describe the best trade-offs
between the different objectives.
There are several ways to define an acquisition function for multi-objective BO. A popular
strategy is random scalarization [51,64], which works by transforming the multi-objective problem
into a distribution of single-objective problems. These approaches are appealing because they enable
the use of standard single-objective acquisition functions. A weakness of this approach is that it
relies on random sampling to encourage exploration and therefore the performance of this method
might suffer early on when the scale of the objectives is unknown or when either the input space or
the objective space is high-dimensional [21, 64]. Another popular class of multi-objective acquisition functions are improvement-based. These strategies focus on improving a performance metric over sets, for example the hypervolume indicator [18, 19, 26, 93] or the R2 indicator [24]. The main drawback of these approaches is that the performance of these methods can be biased towards a single performance metric, which can be inadequate to assess the multi-objective aspects of the problem [98]. There are also many other multi-objective acquisition functions discussed in the literature, which mainly differ by how they navigate the exploration-exploitation trade-off [8, 9, 52, 68, 69].
Instead of relying on scalarizations or an improvement-based criterion, this paper considers
the perspective where the goal of interest is to improve the posterior distribution over the optimal
points. We propose a novel information-theoretic acquisition function called the Joint Entropy
Search (JES), which assesses how informative an observation will be in learning more about the joint
distribution of the optimal inputs and outputs. This acquisition function combines the advantages
of existing information-theoretic methods, which focus solely on improving the posterior of either
the optimal inputs [31, 33, 39] or the optimal outputs [4, 6, 80]. We connect JES with the existing information-theoretic acquisition functions by showing that it is an upper bound to these utilities.
After acceptance of this work, we learnt of a parallel line of inquiry by Hvarfner et al.
[46], who independently came up with the same JES acquisition function (3). Their work focussed
on the single-objective setting and the approximation scheme they devised is subtly different to the
ones we present. We see our work as being complementary to theirs because we both demonstrate
the effectiveness of this new acquisition function in different settings.
Contributions and organization.
In Section 2, we set up the problem and introduce the novel
JES acquisition function. In Section 3, we present a catalogue of conditional entropy estimates to
approximate this utility and present a simple extension to the batch setting. These approximations are
analytically tractable and differentiable, which means that we can take advantage of gradient-based
optimization. The main results that we developed here can be viewed as direct extensions to the
recent work in the Bayesian optimization literature by Suzuki et al. [80] and Moss et al. [59].
In Section 4, we present a discussion on the hypervolume indicator and explain how it can be a
misleading performance criterion because it is sensitive to the scale of the objectives. We show that
information-theoretic approaches are naturally invariant to reparameterization of the objectives, which
makes them well-suited for multi-objective black-box optimization. For a more complete picture of
performance, we propose a novel weighted hypervolume strategy (Appendix K), which allows us to
assess the performance of a multi-objective algorithm over different parts of the objective space. In
Section 5, we demonstrate the effectiveness of JES on some synthetic and real-life multi-objective
problems. Finally in Section 6, we provide some concluding remarks. Additional results and proofs
are presented in the Appendix.
2 Preliminaries
We consider the problem of maximizing a vector-valued function $f: \mathbb{X} \to \mathbb{R}^M$ over a bounded space of inputs $\mathbb{X} \subset \mathbb{R}^D$. To define the maximum $\max_{x \in \mathbb{X}} f(x)$, we appeal to the Pareto partial ordering in $\mathbb{R}^M$. For the rest of this paper, we will denote vectors by $y = (y^{(1)}, \dots, y^{(M)}) \in \mathbb{R}^M$, the non-negative real numbers by $\mathbb{R}_{\geq 0}$, and diagonal matrices by $\operatorname{diag}(\cdot)$.
Pareto domination. We say a vector $y \in \mathbb{R}^M$ weakly Pareto dominates another vector $y' \in \mathbb{R}^M$ if it performs just as well in all objectives if not better: $y \succeq y' \iff y - y' \in \mathbb{R}^M_{\geq 0}$. Additionally, if the vectors are not equivalent, $y \neq y'$, then we say strict Pareto domination holds: $y \succ y' \iff y - y' \in \mathbb{R}^M_{\geq 0} \setminus \{0_M\}$, where $0_M$ is the $M$-dimensional zero vector. This binary relation can be further extended to define domination among sets. Let $A, B \subset \mathbb{R}^M$ be sets; if the set $B$ lies in the weakly dominated region of $A$, namely $B \subseteq \mathbb{D}(A) = \bigcup_{a \in A} \{y \in \mathbb{R}^M : y \preceq a\}$, then we say $A$ weakly dominates $B$, denoted by $A \succeq B$. In addition, if it also holds that the dominated regions are not equal, $\mathbb{D}(A) \neq \mathbb{D}(B)$, we say strict Pareto domination holds, denoted by $A \succ B$.
Multi-objective optimization. The goal of multi-objective optimization is to identify the Pareto optimal set of inputs $\mathbb{X}^* = \arg\max_{x \in \mathbb{X}} f(x) \subseteq \mathbb{X}$. The Pareto set is defined as the set of inputs whose objective vectors are not strictly Pareto dominated by another: $x^* \in \mathbb{X}^* \iff x^* \in \mathbb{X}$ and $\nexists\, x \in \mathbb{X}$ such that $f(x) \succ f(x^*)$. The image of the Pareto set in the objective space, $\mathbb{Y}^* = f(\mathbb{X}^*) = \max_{x \in \mathbb{X}} f(x)$, is called the Pareto front. For convenience of notation, we will denote the set of Pareto optimal input-output pairs by $(\mathbb{X}^*, \mathbb{Y}^*)$.
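To make the definitions concrete, the following minimal sketch extracts the non-dominated subset of a finite set of evaluated points under the maximization convention used above (NumPy; the toy arrays are purely illustrative and not taken from the paper).

```python
import numpy as np

def pareto_mask(Y: np.ndarray) -> np.ndarray:
    """Boolean mask of the non-dominated rows of Y (shape N x M, maximization).

    A row is kept if no other row strictly Pareto dominates it, i.e. no other
    row is >= in every objective and > in at least one objective.
    """
    keep = np.ones(len(Y), dtype=bool)
    for i in range(len(Y)):
        others = np.delete(Y, i, axis=0)
        if np.any(np.all(others >= Y[i], axis=1) & np.any(others > Y[i], axis=1)):
            keep[i] = False
    return keep

# Toy example: two objectives evaluated at five inputs.
X = np.array([[0.1], [0.3], [0.5], [0.7], [0.9]])
Y = np.array([[1.0, 0.2], [0.8, 0.5], [0.5, 0.5], [0.4, 0.9], [0.2, 1.0]])
mask = pareto_mask(Y)
pareto_set, pareto_front = X[mask], Y[mask]  # empirical Pareto set and front of the evaluated points
```

On a continuous domain the true Pareto set $(\mathbb{X}^*, \mathbb{Y}^*)$ is generally not available in closed form; in practice one works with the non-dominated subset of a finite collection of points, as above.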
Figure 1: Comparison of the samples (top) and change in entropy (bottom) for the posterior and conditional distributions. The red line in the posterior plots denotes the reference sample that is used to obtain the maximizer $x^*$ and maximum $y^*$, whilst the shaded blue region is the 95% credible interval of the posterior $p(f(x)|x, D_n)$. Conditioning on $x^*$ reduces the entropy for all inputs according to how correlated it is with $x^*$. Conditioning on $y^*$ reduces the entropy for all inputs according to the posterior probability that the objective surpasses $y^*$. Conditioning on $(x^*, y^*)$ leads to a drop in entropy based on both the input correlation with $x^*$ and the posterior probability of exceeding $y^*$.

Bayesian Optimization is a sample efficient global optimization strategy, which relies on a probabilistic model in order to decide which points to query. In Appendix A.1, we present the pseudo-code for the standard BO procedure—for more details see [13, 29, 75]. In this work, we will use independent Gaussian process priors [71] on each objective, $f^{(m)} \sim \mathcal{GP}(\mu^{(m)}_0, \Sigma^{(m)}_0)$, where $\mu^{(m)}: \mathbb{X} \to \mathbb{R}$ is the mean function and $\Sigma^{(m)}: \mathbb{X} \times \mathbb{X} \to \mathbb{R}$ is the covariance function for objective $m$. The observations at location $x \in \mathbb{X}$ will be assumed to be corrupted with additive Gaussian noise, $y = f(x) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \operatorname{diag}(\sigma(x)))$ denotes the observation noise with variance $\sigma(x) \in \mathbb{R}^M_{\geq 0}$. After $n$ evaluations, we have a data set $D_n = \{(x_t, y_t)\}_{t=1,\dots,n}$. The posterior model $p(f|D_n)$ is a collection of independent Gaussian processes $f^{(m)}|D_n \sim \mathcal{GP}(\mu^{(m)}_n, \Sigma^{(m)}_n)$. The explicit expressions for the mean and covariance are presented in Appendix A.2. The main focus of this work is on designing the acquisition function, $\alpha: \mathbb{X} \to \mathbb{R}$, which is used to select the inputs: $x_{n+1} = \arg\max_{x \in \mathbb{X}} \alpha(x|D_n)$.
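For completeness, a minimal NumPy sketch of the Gaussian process posterior update used for each objective is given below; the paper's exact expressions are in Appendix A.2, and the squared-exponential kernel, its hyperparameters, and the toy data here are illustrative choices, not the paper's.

```python
import numpy as np

def rbf(x1: np.ndarray, x2: np.ndarray, lengthscale: float = 0.2, outputscale: float = 1.0) -> np.ndarray:
    """Squared-exponential covariance between two sets of 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return outputscale * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise_var: float = 1e-2):
    """Posterior mean mu_n and covariance Sigma_n of one objective at x_test."""
    K = rbf(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = rbf(x_train, x_test)
    mean = K_s.T @ np.linalg.solve(K, y_train)
    cov = rbf(x_test, x_test) - K_s.T @ np.linalg.solve(K, K_s)
    return mean, cov

# One of the M independent objective models; each objective m gets its own copy.
x_train = np.array([0.1, 0.4, 0.8])
y_train = np.sin(6.0 * x_train)              # stand-in for noisy observations of one objective
mu_n, Sigma_n = gp_posterior(x_train, y_train, np.linspace(0.0, 1.0, 50))
```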
Information-theoretic acquisition functions focus on maximizing the gain in information between the next observation and a function of the probabilistic model. Initial work in BO focussed on picking points to learn more about the distribution of the maximizer $p(\mathbb{X}^*|D_n)$. Specifically, the goal of interest was to maximize the mutual information between the observation $y$ and the Pareto set $\mathbb{X}^*$ conditional on the current data set $D_n$:

$$\alpha_{\text{PES}}(x|D_n) = \text{MI}(y; \mathbb{X}^* \,|\, x, D_n) = H[p(y|x, D_n)] - \mathbb{E}_{p(\mathbb{X}^*|D_n)}\big[H[p(y|x, D_n, \mathbb{X}^*)]\big], \quad (1)$$

where $H[p(x)] = -\int p(x) \log p(x)\, dx$ represents the differential entropy.
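As a small illustration of this quantity (not from the paper), the differential entropy of a Gaussian can be estimated by Monte Carlo and checked against its closed form; the same Monte Carlo principle reappears when approximating the expectations in these acquisition functions.

```python
import numpy as np

# Monte Carlo check of the differential entropy H[p] = -E_p[log p(x)] for a 1-D Gaussian.
rng = np.random.default_rng(0)
sigma = 0.7
samples = rng.normal(0.0, sigma, size=200_000)
log_p = -0.5 * np.log(2.0 * np.pi * sigma**2) - 0.5 * (samples / sigma) ** 2
mc_estimate = -np.mean(log_p)                              # Monte Carlo estimate of the entropy
closed_form = 0.5 * np.log(2.0 * np.pi * np.e * sigma**2)  # exact Gaussian entropy
```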
This acquisition function is commonly referred to as predictive entropy search (PES) [39, 40, 74], but it was formerly¹ known as entropy search (ES) [38, 84]. Despite the importance of obtaining more information about the maximizer, the PES acquisition function is heavily dependent on the approximation of $p(y|x, D_n, \mathbb{X}^*)$, which is both computationally difficult to implement and optimize. This motivated researchers to consider a simpler scheme that focusses on learning more about the distribution of the maximum $p(\mathbb{Y}^*|D_n)$. The resulting acquisition function is known as the max-value entropy search (MES) [4, 44, 80, 86]:
$$\alpha_{\text{MES}}(x|D_n) = \text{MI}(y; \mathbb{Y}^* \,|\, x, D_n) = H[p(y|x, D_n)] - \mathbb{E}_{p(\mathbb{Y}^*|D_n)}\big[H[p(y|x, D_n, \mathbb{Y}^*)]\big]. \quad (2)$$
Unlike PES, the conditional probability $p(y|x, D_n, \mathbb{Y}^*)$ arising in MES can be approximated and optimized more easily because some approximations lead to closed-form expressions. Despite the favourable properties of MES, the primary goal of interest is to identify the location of the maximizer $\mathbb{X}^*$ and not necessarily the value of the maximum $\mathbb{Y}^*$. To combine the advantages of both of these approaches, we propose the joint entropy search acquisition function, which focusses on learning more about the joint distribution of the optimal points $p((\mathbb{X}^*, \mathbb{Y}^*)|D_n)$:

$$\alpha_{\text{JES}}(x|D_n) = \text{MI}(y; (\mathbb{X}^*, \mathbb{Y}^*) \,|\, x, D_n) = H[p(y|x, D_n)] - \mathbb{E}_{p((\mathbb{X}^*, \mathbb{Y}^*)|D_n)}\big[H[p(y|x, D_n, (\mathbb{X}^*, \mathbb{Y}^*))]\big]. \quad (3)$$
¹The difference in the naming convention stems solely from the approximation strategy used to estimate the mutual information. At a high level, ES applies expectation propagation [58] to estimate $p(\mathbb{X}^*|D_n \cup \{x, y\})$, whilst PES applies expectation propagation to estimate $p(y|x, D_n, \mathbb{X}^*)$.
The JES acquisition function inherits the advantages of the PES and MES acquisition functions
because it considers the knowledge learnt about the optimal points and is also simple to implement—
more details in the next section. The following proposition shows that we can also interpret JES as an
upper bound to both the PES and MES acquisition functions.
Proposition 1. The JES is an upper bound to any convex combination of the PES and MES acquisition functions: $\alpha_{\text{JES}}(x|D_n) \geq \beta\, \alpha_{\text{PES}}(x|D_n) + (1 - \beta)\, \alpha_{\text{MES}}(x|D_n)$, for any $\beta \in [0, 1]$.
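One way to see why such a bound should hold (the formal proof is given in the Appendix) is via the chain rule of mutual information together with the non-negativity of conditional mutual information:

$$\alpha_{\text{JES}}(x|D_n) = \text{MI}(y; \mathbb{X}^* \,|\, x, D_n) + \text{MI}(y; \mathbb{Y}^* \,|\, \mathbb{X}^*, x, D_n) \geq \text{MI}(y; \mathbb{X}^* \,|\, x, D_n) = \alpha_{\text{PES}}(x|D_n),$$

and symmetrically $\alpha_{\text{JES}}(x|D_n) \geq \alpha_{\text{MES}}(x|D_n)$, so JES upper bounds every convex combination of the two.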
In Figure 1, we illustrate the subtle differences between the different information-theoretic acquisition
functions. More specifically, we visualise the difference between the conditional distributions arising
in each acquisition function for a single-objective problem using one sample of the optimal points.
Remark.
In the BO literature it is common to distinguish between single-objective and multi-
objective acquisition functions by appending ‘MO’ to the end of the acronym. For notational
simplicity, we opt against this convention in this paper. In Appendix C, we emphasize the main
differences that arise when computing the information-theoretic algorithms in both settings.
3 Approximating JES
In this section, we present several approximations to the JES acquisition function (3) and a simple extension to the batch setting. The first term in the JES criterion (3) is the entropy of a multivariate normal distribution:

$$H[p(y|x, D_n)] = \frac{M}{2}\log(2\pi e) + \frac{1}{2}\sum_{m=1}^{M} \log\!\big(\Sigma^{(m)}_n(x, x) + \sigma^{(m)}(x)\big). \quad (4)$$
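This first term is cheap to evaluate; a minimal sketch (NumPy, with illustrative variable names) given the posterior and noise variances of the $M$ objectives at a candidate $x$:

```python
import numpy as np

def initial_entropy(post_var: np.ndarray, noise_var: np.ndarray) -> float:
    """Entropy of p(y | x, D_n) for M independent Gaussian objectives, as in eq. (4).

    post_var[m]  corresponds to Sigma_n^(m)(x, x), the posterior variance of objective m at x.
    noise_var[m] corresponds to sigma^(m)(x), the observation-noise variance of objective m at x.
    """
    M = len(post_var)
    return 0.5 * M * np.log(2.0 * np.pi * np.e) + 0.5 * float(np.sum(np.log(post_var + noise_var)))

# Example: two objectives with posterior variances 0.3 and 0.5, noise variances 0.01 and 0.02.
h0 = initial_entropy(np.array([0.3, 0.5]), np.array([0.01, 0.02]))
```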
The second term is an intractable expectation which is approximated by drawing Monte Carlo samples from $p((\mathbb{X}^*, \mathbb{Y}^*)|D_n)$. The conditional entropy $H[p(y|x, D_n, (\mathbb{X}^*, \mathbb{Y}^*))]$ is also an intractable quantity which has to be estimated. The overall approximation of (3) will take the form

$$\hat{\alpha}_{\text{JES}}(x|D_n) = H[p(y|x, D_n)] - \frac{1}{S}\sum_{s=1}^{S} h\big((\mathbb{X}^*_s, \mathbb{Y}^*_s); x, D_n\big), \quad (5)$$

where $h$ denotes the conditional entropy estimate and $(\mathbb{X}^*_s, \mathbb{Y}^*_s) \sim p((\mathbb{X}^*, \mathbb{Y}^*)|D_n)$ are the Monte Carlo samples. The distribution $p(y|x, D_n, (\mathbb{X}^*, \mathbb{Y}^*))$ is very challenging to work with because it enforces the global optimality condition that the function lies below the Pareto front, $f(\mathbb{X}) \preceq \mathbb{Y}^*$.
Instead of enforcing global optimality, we make the common simplifying assumption as in [59, 80, 86] and only enforce the optimality condition at the considered location: $f(x) \preceq \mathbb{Y}^*$. By applying Bayes' theorem, the resulting density of interest becomes

$$p(y \,|\, x, D_n^*, f(x) \preceq \mathbb{Y}^*) = \frac{p(f(x) \preceq \mathbb{Y}^* \,|\, x, D_n^{*+})}{p(f(x) \preceq \mathbb{Y}^* \,|\, x, D_n^*)}\, p(y \,|\, x, D_n^*), \quad (6)$$

where we have denoted the augmented data sets by $D_n^* = D_n \cup (\mathbb{X}^*, \mathbb{Y}^*)$ and $D_n^{*+} = D_n^* \cup \{(x, y)\}$.
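Spelled out, (6) is just Bayes' theorem applied to the event $E = \{f(x) \preceq \mathbb{Y}^*\}$ under the augmented posterior: conditioning additionally on the value $y$ observed at $x$ is the same as conditioning on $D_n^{*+}$, so

$$p(y \,|\, x, D_n^*, E) = \frac{p(E \,|\, x, y, D_n^*)\, p(y \,|\, x, D_n^*)}{p(E \,|\, x, D_n^*)} = \frac{p(E \,|\, x, D_n^{*+})}{p(E \,|\, x, D_n^*)}\, p(y \,|\, x, D_n^*).$$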
We will refer to the quantity $p(f(x) \preceq \mathbb{Y}^*)$ as the cumulative distribution function (CDF). The following lemma shows that this CDF can be computed analytically when the set $\mathbb{Y}^* \subset \mathbb{R}^M$ is discrete. This is a standard result [16, 50, 68, 80] which can be derived by first partitioning the region of integration, $\mathbb{D}(\mathbb{Y}^*) = \bigcup_{y \in \mathbb{Y}^*} \{z \in \mathbb{R}^M : z \preceq y\}$, into a collection of hyperrectangle subsets and then summing up the individual contributions—see Figure 2 for a visual. This partition can be computed using an incremental approach (Algorithm 1 of [55]), which has a cost of $\mathcal{O}(|\mathbb{Y}^*|^{\lfloor M/2 \rfloor + 1})$. In the single-objective setting, the maximum is a single point $y^* \in \mathbb{R}$ and the box decomposition is simply the interval $\mathbb{D}(\{y^*\}) = (-\infty, y^*]$.
Lemma 1. Let $\mathbb{Y}^* \subset \mathbb{R}^M$ be a finite set and $z \sim \mathcal{N}(a, \operatorname{diag}(b))$ be an $M$-dimensional multivariate normal with mean $a \in \mathbb{R}^M$ and variances $b \in \mathbb{R}^M_{\geq 0}$. Let $\mathbb{D}(\mathbb{Y}^*) = \bigcup_{j=1}^{J} B_j = \bigcup_{j=1}^{J} \prod_{m=1}^{M} \big(l_j^{(m)}, u_j^{(m)}\big]$ be the box decomposition of the dominated space, then

$$p(z \preceq \mathbb{Y}^*) = \sum_{j=1}^{J} \prod_{m=1}^{M} \left[ \Phi\!\left(\frac{u_j^{(m)} - a^{(m)}}{\sqrt{b^{(m)}}}\right) - \Phi\!\left(\frac{l_j^{(m)} - a^{(m)}}{\sqrt{b^{(m)}}}\right) \right]. \quad (7)$$
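The lemma translates directly into code once a box decomposition is available. The sketch below (NumPy/SciPy; the helper names are ours, not the paper's) constructs the simple two-dimensional decomposition by sorting the Pareto front and then evaluates (7); for $M > 2$ one would substitute the incremental partitioning algorithm of [55].

```python
import numpy as np
from scipy.stats import norm

def box_decomposition_2d(front: np.ndarray):
    """Disjoint boxes (l_j, u_j] covering the dominated region of a 2-D Pareto front.

    front: (K, 2) array of mutually non-dominated points (maximization).
    Returns lower and upper bounds of shape (J, 2), with -inf lower bounds.
    """
    # Sort by the first objective descending; the second objective then increases.
    front = front[np.argsort(-front[:, 0])]
    lowers, uppers = [], []
    prev_y2 = -np.inf
    for y1, y2 in front:
        lowers.append([-np.inf, prev_y2])
        uppers.append([y1, y2])
        prev_y2 = y2
    return np.array(lowers), np.array(uppers)

def cdf_dominated(mean: np.ndarray, var: np.ndarray, lowers: np.ndarray, uppers: np.ndarray) -> float:
    """p(z <= Y*) for z ~ N(mean, diag(var)), summed over the boxes as in (7)."""
    std = np.sqrt(var)
    terms = norm.cdf((uppers - mean) / std) - norm.cdf((lowers - mean) / std)  # shape (J, M)
    return float(np.sum(np.prod(terms, axis=1)))

# Example: a three-point Pareto front and a Gaussian belief about f(x).
front = np.array([[1.0, 0.2], [0.8, 0.5], [0.2, 1.0]])
L, U = box_decomposition_2d(front)
W = cdf_dominated(mean=np.array([0.6, 0.4]), var=np.array([0.04, 0.09]), lowers=L, uppers=U)
```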
Figure 2: Box decompositions for a two-dimensional and three-dimensional Pareto front.
In Algorithm 1, we present the pseudo-code for the estimation of the JES acquisition function at a
single candidate input. Several variables that are calculated within the algorithm are independent of the
input (coloured in blue). For computational efficiency, we only compute these variables once and
then save them to memory for later use.
Algorithm 1: Joint Entropy Search (JES).
Input: A candidate $x$; the data set $D_n$.
// Cached variables are coloured in blue.
1: Compute the initial entropy $h_0 = H[p(y|x, D_n)]$.
2: for $s = 1, \dots, S$ do
3:    Sample a path $f_s \sim p(f|D_n)$.
4:    Compute the Pareto optimal points $\mathbb{X}^*_s = \arg\max_{x' \in \mathbb{X}} f_s(x')$ and $\mathbb{Y}^*_s = f_s(\mathbb{X}^*_s)$.
5:    Compute the box decomposition $\mathbb{D}(\mathbb{Y}^*_s) = \bigcup_{j=1}^{J} B_j$.
6:    Compute the conditional $p(f \,|\, D_n \cup (\mathbb{X}^*_s, \mathbb{Y}^*_s))$.
7:    Compute the estimate $h_s = h((\mathbb{X}^*_s, \mathbb{Y}^*_s); x, D_n)$.
8: end
9: return $\hat{\alpha}_{\text{JES}}(x|D_n) = h_0 - \frac{1}{S}\sum_{s=1}^{S} h_s$.
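As a structural illustration of steps 2–4 (not the paper's implementation), the sketch below draws sample paths for two objectives on a discretized input space and extracts the Pareto optimal points of each path; posterior paths are obtained in the same way after conditioning each Gaussian process on $D_n$, and continuous-domain implementations typically replace the grid with approximate sample paths that can be optimized directly.

```python
import numpy as np

def rbf(x1: np.ndarray, x2: np.ndarray, lengthscale: float = 0.2) -> np.ndarray:
    """Squared-exponential covariance for 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def pareto_mask(Y: np.ndarray) -> np.ndarray:
    """Non-dominated rows of Y under maximization (strict Pareto domination)."""
    keep = np.ones(len(Y), dtype=bool)
    for i in range(len(Y)):
        others = np.delete(Y, i, axis=0)
        if np.any(np.all(others >= Y[i], axis=1) & np.any(others > Y[i], axis=1)):
            keep[i] = False
    return keep

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 200)                     # discretized input space X
chol = np.linalg.cholesky(rbf(grid, grid) + 1e-8 * np.eye(len(grid)))
M, S = 2, 16                                          # number of objectives and Monte Carlo samples

pareto_samples = []
for s in range(S):
    # Step 3: one sample path per objective (prior draws here; posterior draws are analogous).
    f_s = chol @ rng.standard_normal((len(grid), M))  # shape (|grid|, M)
    # Step 4: Pareto optimal inputs and outputs of the sampled path.
    mask = pareto_mask(f_s)
    pareto_samples.append((grid[mask], f_s[mask]))    # one draw of (X*_s, Y*_s)
```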
3.1 Estimating the conditional entropy
The entropy of (6) can be written as an $M$-dimensional expectation over the multivariate normal distribution $p(y|x, D_n^*) = \mathcal{N}(y; \mu_n(x), \Sigma_n(x, x))$:

$$H[p(y \,|\, x, D_n^*, f(x) \preceq \mathbb{Y}^*)] = -\mathbb{E}_{p(y|x, D_n^*)}\left[ \frac{p(f(x) \preceq \mathbb{Y}^* \,|\, x, D_n^{*+})}{p(f(x) \preceq \mathbb{Y}^* \,|\, x, D_n^*)} \log\!\left( \frac{p(f(x) \preceq \mathbb{Y}^* \,|\, x, D_n^{*+})}{p(f(x) \preceq \mathbb{Y}^* \,|\, x, D_n^*)}\, p(y \,|\, x, D_n^*) \right) \right]. \quad (8)$$
To simplify the notation, we define the $m$-th standardized value by

$$\gamma_m(z) = \big(z - \mu^{(m)}_n(x)\big) \Big/ \sqrt{\Sigma^{(m)}_n(x, x)} \quad (9)$$
for any scalar $z \in \mathbb{R}$. Using this function together with Lemma 1, we denote the cumulative distribution $p(f(x) \preceq \mathbb{Y}^* \,|\, x, D_n^*)$ by $W = \sum_{j=1}^{J} W_j = \sum_{j=1}^{J} \prod_{m=1}^{M} W_{j,m}$, where

$$W_{j,m} = \Phi\big(\gamma_m(u_j^{(m)})\big) - \Phi\big(\gamma_m(l_j^{(m)})\big) \quad (10)$$

are the differences appearing in (7). Moreover, we denote the differences of the first derivative of $W_{j,m}$ and the negative of the second derivative (with respect to $\gamma_m$) by

$$G_{j,m} = \phi\big(\gamma_m(u_j^{(m)})\big) - \phi\big(\gamma_m(l_j^{(m)})\big), \quad (11)$$

$$V_{j,m} = \gamma_m(u_j^{(m)})\, \phi\big(\gamma_m(u_j^{(m)})\big) - \gamma_m(l_j^{(m)})\, \phi\big(\gamma_m(l_j^{(m)})\big), \quad (12)$$

where $\phi$ is the probability density function of a standard normal distribution.
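A small sketch of these quantities (SciPy; the array names are ours): gamma_l and gamma_u hold the standardized lower and upper box bounds $\gamma_m(l_j^{(m)})$ and $\gamma_m(u_j^{(m)})$ in a $(J, M)$ array, and unbounded lower edges are encoded as $-\infty$.

```python
import numpy as np
from scipy.stats import norm

def _gamma_phi(g: np.ndarray) -> np.ndarray:
    """Elementwise gamma * phi(gamma), taking the limiting value 0 at +/- infinity."""
    g_safe = np.where(np.isfinite(g), g, 0.0)
    return np.where(np.isfinite(g), g_safe * norm.pdf(g_safe), 0.0)

def wgv_terms(gamma_l: np.ndarray, gamma_u: np.ndarray):
    """W_{j,m}, G_{j,m}, V_{j,m} of eqs. (10)-(12) for standardized box bounds of shape (J, M)."""
    W_jm = norm.cdf(gamma_u) - norm.cdf(gamma_l)        # eq. (10)
    G_jm = norm.pdf(gamma_u) - norm.pdf(gamma_l)        # eq. (11)
    V_jm = _gamma_phi(gamma_u) - _gamma_phi(gamma_l)    # eq. (12)
    W = float(np.sum(np.prod(W_jm, axis=1)))            # W = sum_j prod_m W_{j,m}
    return W, W_jm, G_jm, V_jm

# Example with J = 2 boxes and M = 2 objectives; -inf encodes an unbounded lower edge.
gamma_l = np.array([[-np.inf, -np.inf], [-np.inf, 0.3]])
gamma_u = np.array([[1.2, 0.3], [0.4, 1.5]])
W, W_jm, G_jm, V_jm = wgv_terms(gamma_l, gamma_u)
```

Each term is a smooth function of $\gamma_m$, consistent with the paper's point that the resulting approximations are differentiable and amenable to gradient-based optimization.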
In the setting where the observation noise is zero, the conditional distribution is a truncated multivariate normal, which is known to have an analytical formula for the entropy (Theorem 3.1 in [80]). In Appendix E,