Joint Entropy Search for Multi-Objective Bayesian
Optimization
Ben Tu†, Axel Gandy†, Nikolas Kantas†, Behrang Shafei‡
†Imperial College London
‡BASF SE
ben.tu16@imperial.ac.uk
Abstract
Many real-world problems can be phrased as a multi-objective optimization prob-
lem, where the goal is to identify the best set of compromises between the compet-
ing objectives. Multi-objective Bayesian optimization (BO) is a sample efficient
strategy that can be deployed to solve these vector-valued optimization problems
where access is limited to a number of noisy objective function evaluations. In
this paper, we propose a novel information-theoretic acquisition function for BO
called Joint Entropy Search (JES), which considers the joint information gain for
the optimal set of inputs and outputs. We present several analytical approximations
to the JES acquisition function and also introduce an extension to the batch setting.
We showcase the effectiveness of this new approach on a range of synthetic and
real-world problems in terms of the hypervolume and its weighted variants.
1 Introduction
Bayesian optimization (BO) has demonstrated a lot of success in solving black-box optimization problems in various domains such as machine learning [76, 77, 91], chemistry [27, 35], robotics [7, 14] and clinical trials [63, 79]. The procedure works by maintaining a probabilistic model of the
observed data in order to guide the optimization procedure into regions of interest. Specifically, at
each iteration the black-box function is evaluated at one or more input locations that maximize
an acquisition function on the model. Implicitly, this function strikes a balance between exploring
new areas and exploiting areas that have been shown to be promising. In this work, we consider the
more general problem, where the black-box function of interest is vector-valued. This increases
the difficulty of the problem because there are now many directions in which the objectives can be
improved, in contrast to the single-objective setting where there is only one. Informally, the end goal
of multi-objective optimization is to identify a collection of points that describe the best trade-offs
between the different objectives.
There are several ways to define an acquisition function for multi-objective BO. A popular
strategy is random scalarization [51,64], which works by transforming the multi-objective problem
into a distribution of single-objective problems. These approaches are appealing because they enable
the use of standard single-objective acquisition functions. A weakness of this approach is that it
relies on random sampling to encourage exploration and therefore the performance of this method
might suffer early on when the scale of the objectives is unknown or when either the input space or
the objective space is high-dimensional [21, 64]. Another popular class of multi-objective acquisition functions are improvement-based. These strategies focus on improving a performance metric over sets, for example the hypervolume indicator [18, 19, 26, 93] or the R2 indicator [24]. The main drawback of these approaches is that the performance of these methods can be biased towards a single performance metric, which can be inadequate to assess the multi-objective aspects of the problem [98]. There are also many other multi-objective acquisition functions discussed in the literature, which mainly differ by how they navigate the exploration-exploitation trade-off [8, 9, 52, 68, 69].
Instead of relying on scalarizations or an improvement-based criterion, this paper considers
the perspective where the goal of interest is to improve the posterior distribution over the optimal
points. We propose a novel information-theoretic acquisition function called the Joint Entropy
Search (JES), which assesses how informative an observation will be in learning more about the joint
distribution of the optimal inputs and outputs. This acquisition function combines the advantages
of existing information-theoretic methods, which focus solely on improving the posterior of either
the optimal inputs [31, 33, 39] or the optimal outputs [4, 6, 80]. We connect JES with the existing information-theoretic acquisition functions by showing that it is an upper bound to these utilities.
After acceptance of this work, we learnt of a parallel line of inquiry by Hvarfner et al.
[46], who independently came up with the same JES acquisition function (3). Their work focussed
on the single-objective setting and the approximation scheme they devised is subtly different to the
ones we present. We see our work as being complementary to theirs because we both demonstrate
the effectiveness of this new acquisition function in different settings.
Contributions and organization.
In Section 2, we set up the problem and introduce the novel
JES acquisition function. In Section 3, we present a catalogue of conditional entropy estimates to
approximate this utility and present a simple extension to the batch setting. These approximations are
analytically tractable and differentiable, which means that we can take advantage of gradient-based
optimization. The main results that we developed here can be viewed as direct extensions to the
recent work in the Bayesian optimization literature by Suzuki et al. [80] and Moss et al. [59].
In Section 4, we present a discussion on the hypervolume indicator and explain how it can be a
misleading performance criterion because it is sensitive to the scale of the objectives. We show that
information-theoretic approaches are naturally invariant to reparameterization of the objectives, which
makes them well-suited for multi-objective black-box optimization. For a more complete picture of
performance, we propose a novel weighted hypervolume strategy (Appendix K), which allows us to
assess the performance of a multi-objective algorithm over different parts of the objective space. In
Section 5, we demonstrate the effectiveness of JES on some synthetic and real-life multi-objective
problems. Finally in Section 6, we provide some concluding remarks. Additional results and proofs
are presented in the Appendix.
2 Preliminaries
We consider the problem of maximizing a vector-valued function $f: \mathbb{X} \to \mathbb{R}^M$ over a bounded space of inputs $\mathbb{X} \subset \mathbb{R}^D$. To define the maximum $\max_{x \in \mathbb{X}} f(x)$, we appeal to the Pareto partial ordering in $\mathbb{R}^M$. For the rest of this paper, we will denote vectors by $y = (y^{(1)}, \dots, y^{(M)}) \in \mathbb{R}^M$, the non-negative real numbers by $\mathbb{R}_{\geq 0}$, and diagonal matrices by $\operatorname{diag}(\cdot)$.
Pareto domination. We say a vector $y \in \mathbb{R}^M$ weakly Pareto dominates another vector $y' \in \mathbb{R}^M$ if it performs just as well in all objectives if not better: $y \succeq y' \iff y - y' \in \mathbb{R}^M_{\geq 0}$. Additionally, if the vectors are not equivalent, $y \neq y'$, then we say strict Pareto domination holds: $y \succ y' \iff y - y' \in \mathbb{R}^M_{\geq 0} \setminus \{0_M\}$, where $0_M$ is the $M$-dimensional zero vector. This binary relation can be further extended to define domination among sets. Let $A, B \subset \mathbb{R}^M$ be sets; if the set $B$ lies in the weakly dominated region of $A$, namely $B \subseteq \mathbb{D}(A) = \bigcup_{a \in A} \{y \in \mathbb{R}^M : y \preceq a\}$, then we say $A$ weakly dominates $B$, denoted by $A \succeq B$. In addition, if it also holds that the dominated regions are not equal, $\mathbb{D}(A) \neq \mathbb{D}(B)$, we say strict Pareto domination holds, denoted by $A \succ B$.
Multi-objective optimization. The goal of multi-objective optimization is to identify the Pareto optimal set of inputs $\mathbb{X}^* = \arg\max_{x \in \mathbb{X}} f(x) \subseteq \mathbb{X}$. The Pareto set is defined as the set of inputs whose objective vectors are not strictly Pareto dominated by another: $x^* \in \mathbb{X}^* \iff x^* \in \mathbb{X}$ and $\nexists\, x \in \mathbb{X}$ such that $f(x) \succ f(x^*)$. The image of the Pareto set in the objective space, $\mathbb{Y}^* = f(\mathbb{X}^*) = \max_{x \in \mathbb{X}} f(x)$, is called the Pareto front. For convenience of notation, we will denote the set of Pareto optimal input-output pairs by $(\mathbb{X}^*, \mathbb{Y}^*)$.
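To make the definitions concrete, the following minimal sketch extracts the non-dominated subset of a finite set of evaluated points under the maximization convention used above (NumPy; the toy arrays are purely illustrative and not taken from the paper).

```python
import numpy as np

def pareto_mask(Y: np.ndarray) -> np.ndarray:
    """Boolean mask of the non-dominated rows of Y (shape N x M, maximization).

    A row is kept if no other row strictly Pareto dominates it, i.e. no other
    row is >= in every objective and > in at least one objective.
    """
    keep = np.ones(len(Y), dtype=bool)
    for i in range(len(Y)):
        others = np.delete(Y, i, axis=0)
        if np.any(np.all(others >= Y[i], axis=1) & np.any(others > Y[i], axis=1)):
            keep[i] = False
    return keep

# Toy example: two objectives evaluated at five inputs.
X = np.array([[0.1], [0.3], [0.5], [0.7], [0.9]])
Y = np.array([[1.0, 0.2], [0.8, 0.5], [0.5, 0.5], [0.4, 0.9], [0.2, 1.0]])
mask = pareto_mask(Y)
pareto_set, pareto_front = X[mask], Y[mask]  # empirical Pareto set and front of the evaluated points
```

On a continuous domain the true Pareto set $(\mathbb{X}^*, \mathbb{Y}^*)$ is generally not available in closed form; in practice one works with the non-dominated subset of a finite collection of points, as above.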
Figure 1: Comparison of the samples (top) and change in entropy (bottom) for the posterior and conditional distributions. The red line in the posterior plots denotes the reference sample that is used to obtain the maximizer $x^*$ and maximum $y^*$, whilst the shaded blue region is the 95% credible interval of the posterior $p(f(x)|x, D_n)$. Conditioning on $x^*$ reduces the entropy for all inputs according to how correlated it is with $x^*$. Conditioning on $y^*$ reduces the entropy for all inputs according to the posterior probability that the objective surpasses $y^*$. Conditioning on $(x^*, y^*)$ leads to a drop in entropy based on both the input correlation with $x^*$ and the posterior probability of exceeding $y^*$.

Bayesian Optimization is a sample efficient global optimization strategy, which relies on a probabilistic model in order to decide which points to query. In Appendix A.1, we present the pseudo-code for the standard BO procedure—for more details see [13, 29, 75]. In this work, we will use independent Gaussian process priors [71] on each objective, $f^{(m)} \sim \mathcal{GP}(\mu^{(m)}_0, \Sigma^{(m)}_0)$, where $\mu^{(m)}: \mathbb{X} \to \mathbb{R}$ is the mean function and $\Sigma^{(m)}: \mathbb{X} \times \mathbb{X} \to \mathbb{R}$ is the covariance function for objective $m$. The observations at location $x \in \mathbb{X}$ will be assumed to be corrupted with additive Gaussian noise, $y = f(x) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \operatorname{diag}(\sigma(x)))$ denotes the observation noise with variance $\sigma(x) \in \mathbb{R}^M_{\geq 0}$. After $n$ evaluations, we have a data set $D_n = \{(x_t, y_t)\}_{t=1,\dots,n}$. The posterior model $p(f|D_n)$ is a collection of independent Gaussian processes $f^{(m)}|D_n \sim \mathcal{GP}(\mu^{(m)}_n, \Sigma^{(m)}_n)$. The explicit expressions for the mean and covariance are presented in Appendix A.2. The main focus of this work is on designing the acquisition function, $\alpha: \mathbb{X} \to \mathbb{R}$, which is used to select the inputs: $x_{n+1} = \arg\max_{x \in \mathbb{X}} \alpha(x|D_n)$.
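For completeness, a minimal NumPy sketch of the Gaussian process posterior update used for each objective is given below; the paper's exact expressions are in Appendix A.2, and the squared-exponential kernel, its hyperparameters, and the toy data here are illustrative choices, not the paper's.

```python
import numpy as np

def rbf(x1: np.ndarray, x2: np.ndarray, lengthscale: float = 0.2, outputscale: float = 1.0) -> np.ndarray:
    """Squared-exponential covariance between two sets of 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return outputscale * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise_var: float = 1e-2):
    """Posterior mean mu_n and covariance Sigma_n of one objective at x_test."""
    K = rbf(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = rbf(x_train, x_test)
    mean = K_s.T @ np.linalg.solve(K, y_train)
    cov = rbf(x_test, x_test) - K_s.T @ np.linalg.solve(K, K_s)
    return mean, cov

# One of the M independent objective models; each objective m gets its own copy.
x_train = np.array([0.1, 0.4, 0.8])
y_train = np.sin(6.0 * x_train)              # stand-in for noisy observations of one objective
mu_n, Sigma_n = gp_posterior(x_train, y_train, np.linspace(0.0, 1.0, 50))
```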
Information-theoretic acquisition functions focus on maximizing the gain in information between the next observation and a function of the probabilistic model. Initial work in BO focussed on picking points to learn more about the distribution of the maximizer $p(\mathbb{X}^*|D_n)$. Specifically, the goal of interest was to maximize the mutual information between the observation $y$ and the Pareto set $\mathbb{X}^*$ conditional on the current data set $D_n$:

$$\alpha_{\text{PES}}(x|D_n) = \text{MI}(y; \mathbb{X}^* \,|\, x, D_n) = H[p(y|x, D_n)] - \mathbb{E}_{p(\mathbb{X}^*|D_n)}\big[H[p(y|x, D_n, \mathbb{X}^*)]\big], \quad (1)$$

where $H[p(x)] = -\int p(x) \log p(x)\, dx$ represents the differential entropy.
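As a small illustration of this quantity (not from the paper), the differential entropy of a Gaussian can be estimated by Monte Carlo and checked against its closed form; the same Monte Carlo principle reappears when approximating the expectations in these acquisition functions.

```python
import numpy as np

# Monte Carlo check of the differential entropy H[p] = -E_p[log p(x)] for a 1-D Gaussian.
rng = np.random.default_rng(0)
sigma = 0.7
samples = rng.normal(0.0, sigma, size=200_000)
log_p = -0.5 * np.log(2.0 * np.pi * sigma**2) - 0.5 * (samples / sigma) ** 2
mc_estimate = -np.mean(log_p)                              # Monte Carlo estimate of the entropy
closed_form = 0.5 * np.log(2.0 * np.pi * np.e * sigma**2)  # exact Gaussian entropy
```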
This acquisition function is commonly referred to as predictive entropy search (PES) [39, 40, 74], but it was formerly¹ known as entropy search (ES) [38, 84]. Despite the importance of obtaining more information about the maximizer, the PES acquisition function is heavily dependent on the approximation of $p(y|x, D_n, \mathbb{X}^*)$, which is both computationally difficult to implement and optimize. This motivated researchers to consider a simpler scheme that focusses on learning more about the distribution of the maximum $p(\mathbb{Y}^*|D_n)$. The resulting acquisition function is known as the max-value entropy search (MES) [4, 44, 80, 86]:
$$\alpha_{\text{MES}}(x|D_n) = \text{MI}(y; \mathbb{Y}^* \,|\, x, D_n) = H[p(y|x, D_n)] - \mathbb{E}_{p(\mathbb{Y}^*|D_n)}\big[H[p(y|x, D_n, \mathbb{Y}^*)]\big]. \quad (2)$$
Unlike PES, the conditional probability $p(y|x, D_n, \mathbb{Y}^*)$ arising in MES can be approximated and optimized more easily because some approximations lead to closed-form expressions. Despite the favourable properties of MES, the primary goal of interest is to identify the location of the maximizer $\mathbb{X}^*$ and not necessarily the value of the maximum $\mathbb{Y}^*$. To combine the advantages of both of these approaches, we propose the joint entropy search acquisition function, which focusses on learning more about the joint distribution of the optimal points $p((\mathbb{X}^*, \mathbb{Y}^*)|D_n)$:

$$\alpha_{\text{JES}}(x|D_n) = \text{MI}(y; (\mathbb{X}^*, \mathbb{Y}^*) \,|\, x, D_n) = H[p(y|x, D_n)] - \mathbb{E}_{p((\mathbb{X}^*, \mathbb{Y}^*)|D_n)}\big[H[p(y|x, D_n, (\mathbb{X}^*, \mathbb{Y}^*))]\big]. \quad (3)$$
¹The difference in the naming convention stems solely from the approximation strategy used to estimate the mutual information. At a high level, ES applies expectation propagation [58] to estimate $p(\mathbb{X}^*|D_n \cup \{x, y\})$, whilst PES applies expectation propagation to estimate $p(y|x, D_n, \mathbb{X}^*)$.
The JES acquisition function inherits the advantages of the PES and MES acquisition functions
because it considers the knowledge learnt about the optimal points and is also simple to implement—
more details in the next section. The following proposition shows that we can also interpret JES as an
upper bound to both the PES and MES acquisition functions.
Proposition 1. The JES is an upper bound to any convex combination of the PES and MES acquisition functions: $\alpha_{\text{JES}}(x|D_n) \geq \beta\, \alpha_{\text{PES}}(x|D_n) + (1 - \beta)\, \alpha_{\text{MES}}(x|D_n)$, for any $\beta \in [0, 1]$.
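One way to see why such a bound should hold (the formal proof is given in the Appendix) is via the chain rule of mutual information together with the non-negativity of conditional mutual information:

$$\alpha_{\text{JES}}(x|D_n) = \text{MI}(y; \mathbb{X}^* \,|\, x, D_n) + \text{MI}(y; \mathbb{Y}^* \,|\, \mathbb{X}^*, x, D_n) \geq \text{MI}(y; \mathbb{X}^* \,|\, x, D_n) = \alpha_{\text{PES}}(x|D_n),$$

and symmetrically $\alpha_{\text{JES}}(x|D_n) \geq \alpha_{\text{MES}}(x|D_n)$, so JES upper bounds every convex combination of the two.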
In Figure 1, we illustrate the subtle differences between the different information-theoretic acquisition
functions. More specifically, we visualise the difference between the conditional distributions arising
in each acquisition function for a single-objective problem using one sample of the optimal points.
Remark.
In the BO literature it is common to distinguish between single-objective and multi-
objective acquisition functions by appending ‘MO’ to the end of the acronym. For notational
simplicity, we opt against this convention in this paper. In Appendix C, we emphasize the main
differences that arise when computing the information-theoretic algorithms in both settings.
3 Approximating JES
In this section, we present several approximations to the JES acquisition function (3) and a simple extension to the batch setting. The first term in the JES criterion (3) is the entropy of a multivariate normal distribution:

$$H[p(y|x, D_n)] = \frac{M}{2}\log(2\pi e) + \frac{1}{2}\sum_{m=1}^{M} \log\!\big(\Sigma^{(m)}_n(x, x) + \sigma^{(m)}(x)\big). \quad (4)$$
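This first term is cheap to evaluate; a minimal sketch (NumPy, with illustrative variable names) given the posterior and noise variances of the $M$ objectives at a candidate $x$:

```python
import numpy as np

def initial_entropy(post_var: np.ndarray, noise_var: np.ndarray) -> float:
    """Entropy of p(y | x, D_n) for M independent Gaussian objectives, as in eq. (4).

    post_var[m]  corresponds to Sigma_n^(m)(x, x), the posterior variance of objective m at x.
    noise_var[m] corresponds to sigma^(m)(x), the observation-noise variance of objective m at x.
    """
    M = len(post_var)
    return 0.5 * M * np.log(2.0 * np.pi * np.e) + 0.5 * float(np.sum(np.log(post_var + noise_var)))

# Example: two objectives with posterior variances 0.3 and 0.5, noise variances 0.01 and 0.02.
h0 = initial_entropy(np.array([0.3, 0.5]), np.array([0.01, 0.02]))
```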
The second term is an intractable expectation which is approximated by drawing Monte Carlo samples from $p((\mathbb{X}^*, \mathbb{Y}^*)|D_n)$. The conditional entropy $H[p(y|x, D_n, (\mathbb{X}^*, \mathbb{Y}^*))]$ is also an intractable quantity which has to be estimated. The overall approximation of (3) will take the form

$$\hat{\alpha}_{\text{JES}}(x|D_n) = H[p(y|x, D_n)] - \frac{1}{S}\sum_{s=1}^{S} h\big((\mathbb{X}^*_s, \mathbb{Y}^*_s); x, D_n\big), \quad (5)$$

where $h$ denotes the conditional entropy estimate and $(\mathbb{X}^*_s, \mathbb{Y}^*_s) \sim p((\mathbb{X}^*, \mathbb{Y}^*)|D_n)$ are the Monte Carlo samples. The distribution $p(y|x, D_n, (\mathbb{X}^*, \mathbb{Y}^*))$ is very challenging to work with because it enforces the global optimality condition that the function lies below the Pareto front, $f(\mathbb{X}) \preceq \mathbb{Y}^*$.
Instead of enforcing global optimality, we make the common simplifying assumption as in [59, 80, 86] and only enforce the optimality condition at the considered location: $f(x) \preceq \mathbb{Y}^*$. By applying Bayes' theorem, the resulting density of interest becomes

$$p(y \,|\, x, D_n^*, f(x) \preceq \mathbb{Y}^*) = \frac{p(f(x) \preceq \mathbb{Y}^* \,|\, x, D_n^{*+})}{p(f(x) \preceq \mathbb{Y}^* \,|\, x, D_n^*)}\, p(y \,|\, x, D_n^*), \quad (6)$$

where we have denoted the augmented data sets by $D_n^* = D_n \cup (\mathbb{X}^*, \mathbb{Y}^*)$ and $D_n^{*+} = D_n^* \cup \{(x, y)\}$.
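Spelled out, (6) is just Bayes' theorem applied to the event $E = \{f(x) \preceq \mathbb{Y}^*\}$ under the augmented posterior: conditioning additionally on the value $y$ observed at $x$ is the same as conditioning on $D_n^{*+}$, so

$$p(y \,|\, x, D_n^*, E) = \frac{p(E \,|\, x, y, D_n^*)\, p(y \,|\, x, D_n^*)}{p(E \,|\, x, D_n^*)} = \frac{p(E \,|\, x, D_n^{*+})}{p(E \,|\, x, D_n^*)}\, p(y \,|\, x, D_n^*).$$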
We will refer to the quantity $p(f(x) \preceq \mathbb{Y}^*)$ as the cumulative distribution function (CDF). The following lemma shows that this CDF can be computed analytically when the set $\mathbb{Y}^* \subset \mathbb{R}^M$ is discrete. This is a standard result [16, 50, 68, 80] which can be derived by first partitioning the region of integration, $\mathbb{D}(\mathbb{Y}^*) = \bigcup_{y \in \mathbb{Y}^*} \{z \in \mathbb{R}^M : z \preceq y\}$, into a collection of hyperrectangle subsets and then summing up the individual contributions—see Figure 2 for a visual. This partition can be computed using an incremental approach (Algorithm 1 of [55]), which has a cost of $\mathcal{O}(|\mathbb{Y}^*|^{\lfloor M/2 \rfloor + 1})$. In the single-objective setting, the maximum is a single point $y^* \in \mathbb{R}$ and the box decomposition is simply the interval $\mathbb{D}(\{y^*\}) = (-\infty, y^*]$.
Lemma 1. Let $\mathbb{Y}^* \subset \mathbb{R}^M$ be a finite set and $z \sim \mathcal{N}(a, \operatorname{diag}(b))$ be an $M$-dimensional multivariate normal with mean $a \in \mathbb{R}^M$ and variances $b \in \mathbb{R}^M_{\geq 0}$. Let $\mathbb{D}(\mathbb{Y}^*) = \bigcup_{j=1}^{J} B_j = \bigcup_{j=1}^{J} \prod_{m=1}^{M} \big(l_j^{(m)}, u_j^{(m)}\big]$ be the box decomposition of the dominated space, then

$$p(z \preceq \mathbb{Y}^*) = \sum_{j=1}^{J} \prod_{m=1}^{M} \left[ \Phi\!\left(\frac{u_j^{(m)} - a^{(m)}}{\sqrt{b^{(m)}}}\right) - \Phi\!\left(\frac{l_j^{(m)} - a^{(m)}}{\sqrt{b^{(m)}}}\right) \right]. \quad (7)$$
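The lemma translates directly into code once a box decomposition is available. The sketch below (NumPy/SciPy; the helper names are ours, not the paper's) constructs the simple two-dimensional decomposition by sorting the Pareto front and then evaluates (7); for $M > 2$ one would substitute the incremental partitioning algorithm of [55].

```python
import numpy as np
from scipy.stats import norm

def box_decomposition_2d(front: np.ndarray):
    """Disjoint boxes (l_j, u_j] covering the dominated region of a 2-D Pareto front.

    front: (K, 2) array of mutually non-dominated points (maximization).
    Returns lower and upper bounds of shape (J, 2), with -inf lower bounds.
    """
    # Sort by the first objective descending; the second objective then increases.
    front = front[np.argsort(-front[:, 0])]
    lowers, uppers = [], []
    prev_y2 = -np.inf
    for y1, y2 in front:
        lowers.append([-np.inf, prev_y2])
        uppers.append([y1, y2])
        prev_y2 = y2
    return np.array(lowers), np.array(uppers)

def cdf_dominated(mean: np.ndarray, var: np.ndarray, lowers: np.ndarray, uppers: np.ndarray) -> float:
    """p(z <= Y*) for z ~ N(mean, diag(var)), summed over the boxes as in (7)."""
    std = np.sqrt(var)
    terms = norm.cdf((uppers - mean) / std) - norm.cdf((lowers - mean) / std)  # shape (J, M)
    return float(np.sum(np.prod(terms, axis=1)))

# Example: a three-point Pareto front and a Gaussian belief about f(x).
front = np.array([[1.0, 0.2], [0.8, 0.5], [0.2, 1.0]])
L, U = box_decomposition_2d(front)
W = cdf_dominated(mean=np.array([0.6, 0.4]), var=np.array([0.04, 0.09]), lowers=L, uppers=U)
```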
Figure 2: Box decompositions for a two-dimensional and three-dimensional Pareto front.
In Algorithm 1, we present the pseudo-code for the estimation of the JES acquisition function at a
single candidate input. Several variables that are calculated within the algorithm are independent of the
input (coloured in blue). For computational efficiency, we only compute these variables once and
then save them to memory for later use.
Algorithm 1: Joint Entropy Search (JES).
Input: A candidate $x$; the data set $D_n$.
// Cached variables are coloured in blue.
1: Compute the initial entropy $h_0 = H[p(y|x, D_n)]$.
2: for $s = 1, \dots, S$ do
3:    Sample a path $f_s \sim p(f|D_n)$.
4:    Compute the Pareto optimal points $\mathbb{X}^*_s = \arg\max_{x' \in \mathbb{X}} f_s(x')$ and $\mathbb{Y}^*_s = f_s(\mathbb{X}^*_s)$.
5:    Compute the box decomposition $\mathbb{D}(\mathbb{Y}^*_s) = \bigcup_{j=1}^{J} B_j$.
6:    Compute the conditional $p(f \,|\, D_n \cup (\mathbb{X}^*_s, \mathbb{Y}^*_s))$.
7:    Compute the estimate $h_s = h((\mathbb{X}^*_s, \mathbb{Y}^*_s); x, D_n)$.
8: end
9: return $\hat{\alpha}_{\text{JES}}(x|D_n) = h_0 - \frac{1}{S}\sum_{s=1}^{S} h_s$.
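As a structural illustration of steps 2–4 (not the paper's implementation), the sketch below draws sample paths for two objectives on a discretized input space and extracts the Pareto optimal points of each path; posterior paths are obtained in the same way after conditioning each Gaussian process on $D_n$, and continuous-domain implementations typically replace the grid with approximate sample paths that can be optimized directly.

```python
import numpy as np

def rbf(x1: np.ndarray, x2: np.ndarray, lengthscale: float = 0.2) -> np.ndarray:
    """Squared-exponential covariance for 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def pareto_mask(Y: np.ndarray) -> np.ndarray:
    """Non-dominated rows of Y under maximization (strict Pareto domination)."""
    keep = np.ones(len(Y), dtype=bool)
    for i in range(len(Y)):
        others = np.delete(Y, i, axis=0)
        if np.any(np.all(others >= Y[i], axis=1) & np.any(others > Y[i], axis=1)):
            keep[i] = False
    return keep

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 200)                     # discretized input space X
chol = np.linalg.cholesky(rbf(grid, grid) + 1e-8 * np.eye(len(grid)))
M, S = 2, 16                                          # number of objectives and Monte Carlo samples

pareto_samples = []
for s in range(S):
    # Step 3: one sample path per objective (prior draws here; posterior draws are analogous).
    f_s = chol @ rng.standard_normal((len(grid), M))  # shape (|grid|, M)
    # Step 4: Pareto optimal inputs and outputs of the sampled path.
    mask = pareto_mask(f_s)
    pareto_samples.append((grid[mask], f_s[mask]))    # one draw of (X*_s, Y*_s)
```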
3.1 Estimating the conditional entropy
The entropy of (6) can be written as an $M$-dimensional expectation over the multivariate normal distribution $p(y|x, D_n^*) = \mathcal{N}(y; \mu_n(x), \Sigma_n(x, x))$:

$$H[p(y \,|\, x, D_n^*, f(x) \preceq \mathbb{Y}^*)] = -\mathbb{E}_{p(y|x, D_n^*)}\left[ \frac{p(f(x) \preceq \mathbb{Y}^* \,|\, x, D_n^{*+})}{p(f(x) \preceq \mathbb{Y}^* \,|\, x, D_n^*)} \log\!\left( \frac{p(f(x) \preceq \mathbb{Y}^* \,|\, x, D_n^{*+})}{p(f(x) \preceq \mathbb{Y}^* \,|\, x, D_n^*)}\, p(y \,|\, x, D_n^*) \right) \right]. \quad (8)$$
To simplify the notation, we define the $m$-th standardized value by

$$\gamma_m(z) = \big(z - \mu^{(m)}_n(x)\big) \Big/ \sqrt{\Sigma^{(m)}_n(x, x)} \quad (9)$$
for any scalar $z \in \mathbb{R}$. Using this function together with Lemma 1, we denote the cumulative distribution $p(f(x) \preceq \mathbb{Y}^* \,|\, x, D_n^*)$ by $W = \sum_{j=1}^{J} W_j = \sum_{j=1}^{J} \prod_{m=1}^{M} W_{j,m}$, where

$$W_{j,m} = \Phi\big(\gamma_m(u_j^{(m)})\big) - \Phi\big(\gamma_m(l_j^{(m)})\big) \quad (10)$$

are the differences appearing in (7). Moreover, we denote the differences of the first derivative of $W_{j,m}$ and the negative of the second derivative (with respect to $\gamma_m$) by

$$G_{j,m} = \phi\big(\gamma_m(u_j^{(m)})\big) - \phi\big(\gamma_m(l_j^{(m)})\big), \quad (11)$$

$$V_{j,m} = \gamma_m(u_j^{(m)})\, \phi\big(\gamma_m(u_j^{(m)})\big) - \gamma_m(l_j^{(m)})\, \phi\big(\gamma_m(l_j^{(m)})\big), \quad (12)$$

where $\phi$ is the probability density function of a standard normal distribution.
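A small sketch of these quantities (SciPy; the array names are ours): gamma_l and gamma_u hold the standardized lower and upper box bounds $\gamma_m(l_j^{(m)})$ and $\gamma_m(u_j^{(m)})$ in a $(J, M)$ array, and unbounded lower edges are encoded as $-\infty$.

```python
import numpy as np
from scipy.stats import norm

def _gamma_phi(g: np.ndarray) -> np.ndarray:
    """Elementwise gamma * phi(gamma), taking the limiting value 0 at +/- infinity."""
    g_safe = np.where(np.isfinite(g), g, 0.0)
    return np.where(np.isfinite(g), g_safe * norm.pdf(g_safe), 0.0)

def wgv_terms(gamma_l: np.ndarray, gamma_u: np.ndarray):
    """W_{j,m}, G_{j,m}, V_{j,m} of eqs. (10)-(12) for standardized box bounds of shape (J, M)."""
    W_jm = norm.cdf(gamma_u) - norm.cdf(gamma_l)        # eq. (10)
    G_jm = norm.pdf(gamma_u) - norm.pdf(gamma_l)        # eq. (11)
    V_jm = _gamma_phi(gamma_u) - _gamma_phi(gamma_l)    # eq. (12)
    W = float(np.sum(np.prod(W_jm, axis=1)))            # W = sum_j prod_m W_{j,m}
    return W, W_jm, G_jm, V_jm

# Example with J = 2 boxes and M = 2 objectives; -inf encodes an unbounded lower edge.
gamma_l = np.array([[-np.inf, -np.inf], [-np.inf, 0.3]])
gamma_u = np.array([[1.2, 0.3], [0.4, 1.5]])
W, W_jm, G_jm, V_jm = wgv_terms(gamma_l, gamma_u)
```

Each term is a smooth function of $\gamma_m$, consistent with the paper's point that the resulting approximations are differentiable and amenable to gradient-based optimization.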
In the setting where the observation noise is zero, the conditional distribution is a truncated multivariate normal, which is known to have an analytical formula for the entropy (Theorem 3.1 in [80]). In Appendix E,