ture, which mainly differ by how they navigate the exploration-exploitation trade-off [8, 9, 52, 68, 69].
Instead of relying on scalarizations or an improvement-based criterion, this paper considers
the perspective where the goal of interest is to improve the posterior distribution over the optimal
points. We propose a novel information-theoretic acquisition function called the Joint Entropy
Search (JES), which assesses how informative an observation will be in learning more about the joint
distribution of the optimal inputs and outputs. This acquisition function combines the advantages
of existing information-theoretic methods, which focus solely on improving the posterior of either
the optimal inputs [31, 33, 39] or the optimal outputs [4, 6, 80]. We connect JES with the existing
information-theoretic acquisition functions by showing that it is an upper bound to these utilities.
After acceptance of this work, we learnt of a parallel line of inquiry by Hvarfner et al. [46],
who independently arrived at the same JES acquisition function (3). Their work focussed
on the single-objective setting, and the approximation scheme they devised is subtly different from the
ones we present. We see our work as complementary to theirs because we both demonstrate
the effectiveness of this new acquisition function in different settings.
Contributions and organization.
In Section 2, we set up the problem and introduce the novel
JES acquisition function. In Section 3, we present a catalogue of conditional entropy estimates to
approximate this utility and describe a simple extension to the batch setting. These approximations are
analytically tractable and differentiable, which means that we can take advantage of gradient-based
optimization. The main results developed here can be viewed as direct extensions of
recent work in the Bayesian optimization literature by Suzuki et al. [80] and Moss et al. [59].
In Section 4, we discuss the hypervolume indicator and explain how it can be a
misleading performance criterion because it is sensitive to the scale of the objectives. We show that
information-theoretic approaches are naturally invariant to reparameterization of the objectives,
making them well-suited to multi-objective black-box optimization. For a more complete picture of
performance, we propose a novel weighted hypervolume strategy (Appendix K), which allows us to
assess the performance of a multi-objective algorithm over different parts of the objective space. In
Section 5, we demonstrate the effectiveness of JES on synthetic and real-world multi-objective
problems. Finally, in Section 6, we provide some concluding remarks. Additional results and proofs
are presented in the Appendix.
2 Preliminaries
We consider the problem of maximizing a vector-valued function $f: \mathbb{X} \to \mathbb{R}^M$ over a bounded
space of inputs $\mathbb{X} \subset \mathbb{R}^D$. To define the maximum $\max_{x \in \mathbb{X}} f(x)$, we appeal to the Pareto partial
ordering in $\mathbb{R}^M$. For the rest of this paper, we will denote vectors by $y = (y^{(1)}, \dots, y^{(M)}) \in \mathbb{R}^M$,
the non-negative real numbers by $\mathbb{R}_{\geq 0}$, and diagonal matrices by $\operatorname{diag}(\cdot)$.
Pareto domination.
We say a vector $y \in \mathbb{R}^M$ weakly Pareto dominates another vector $y' \in \mathbb{R}^M$
if it performs at least as well in every objective: $y \succeq y' \iff y - y' \in \mathbb{R}^M_{\geq 0}$. Additionally,
if the vectors are not equivalent, $y \neq y'$, then we say strict Pareto domination holds:
$y \succ y' \iff y - y' \in \mathbb{R}^M_{\geq 0} \setminus \{0_M\}$, where $0_M$ is the $M$-dimensional zero vector. This binary relation can be
further extended to define domination among sets. Let $A, B \subset \mathbb{R}^M$ be sets. If the set $B$ lies in the
weakly dominated region of $A$, namely $B \subseteq \mathbb{D}(A) = \cup_{a \in A} \{y \in \mathbb{R}^M : y \preceq a\}$, then we say $A$
weakly dominates $B$, denoted by $A \succeq B$. In addition, if it also holds that the dominated regions are
not equal, $\mathbb{D}(A) \neq \mathbb{D}(B)$, we say strict Pareto domination holds, denoted by $A \succ B$.
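As a concrete illustration, the following is a minimal sketch of these domination relations under the maximization convention (assuming NumPy; the function names are our own and not from the paper):

```python
import numpy as np

def weakly_dominates(y, y_prime):
    """y weakly Pareto dominates y' iff y - y' lies in R^M_{>=0}."""
    y, y_prime = np.asarray(y), np.asarray(y_prime)
    return bool(np.all(y >= y_prime))

def strictly_dominates(y, y_prime):
    """Strict domination: weak domination together with y != y'."""
    return weakly_dominates(y, y_prime) and not np.array_equal(y, y_prime)

def set_weakly_dominates(A, B):
    """A weakly dominates B iff every b in B lies in the weakly dominated
    region D(A), i.e. is weakly dominated by some a in A."""
    return all(any(weakly_dominates(a, b) for a in A) for b in B)

# Example with M = 2: (2, 1) strictly dominates (1, 1), while (2, 0) and (1, 1)
# are incomparable under the Pareto partial ordering.
assert strictly_dominates([2, 1], [1, 1])
assert not weakly_dominates([2, 0], [1, 1]) and not weakly_dominates([1, 1], [2, 0])
```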
Multi-objective optimization.
The goal of multi-objective optimization is to identify the Pareto
optimal set of inputs $\mathbb{X}^* = \arg\max_{x \in \mathbb{X}} f(x) \subseteq \mathbb{X}$. The Pareto set is defined as the set of inputs
whose objective vectors are not strictly Pareto dominated by that of another input: $x^* \in \mathbb{X}^* \iff x^* \in \mathbb{X}$
and $\nexists x \in \mathbb{X}$ such that $f(x) \succ f(x^*)$. The image of the Pareto set in the objective space,
$\mathbb{Y}^* = f(\mathbb{X}^*) = \max_{x \in \mathbb{X}} f(x)$, is called the Pareto front. For convenience of notation, we will denote the
set of Pareto optimal input-output pairs by $(\mathbb{X}^*, \mathbb{Y}^*)$.
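For a finite set of evaluated points, the empirical Pareto set and front can be recovered by filtering out strictly dominated objective vectors. The sketch below (again assuming NumPy and maximization; it is illustrative, not the paper's implementation) makes this concrete:

```python
import numpy as np

def pareto_optimal_mask(Y):
    """Boolean mask over the rows of Y that are not strictly Pareto dominated
    by any other row (maximization convention)."""
    Y = np.asarray(Y)
    mask = np.ones(Y.shape[0], dtype=bool)
    for i in range(Y.shape[0]):
        others = np.delete(Y, i, axis=0)
        # Row i is dominated if some other row is >= in every objective
        # and strictly > in at least one.
        dominated = np.any(np.all(others >= Y[i], axis=1) & np.any(others > Y[i], axis=1))
        mask[i] = not dominated
    return mask

# Example: four candidate inputs and two objectives.
X = np.array([[0.1], [0.4], [0.7], [0.9]])
Y = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.4, 0.4]])
mask = pareto_optimal_mask(Y)
X_star, Y_star = X[mask], Y[mask]  # empirical Pareto set and Pareto front
# The point (0.4, 0.4) is strictly dominated by (0.5, 0.5) and is filtered out.
```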
Bayesian Optimization
is a sample-efficient global optimization strategy that relies on a probabilistic
model to decide which points to query. In Appendix A.1, we present the pseudo-code