3 Decision-theoretic Entropy Search
In this section, we first describe a family of decision-theoretic entropies from past work in Bayesian decision theory [13, 39, 21], which are parameterized by a problem-specific action set A and loss function ℓ. This family includes Shannon entropy as a special case. We denote this family using the symbol H_{ℓ,A}, and refer to it as the H_{ℓ,A}-entropy.
Definition 3.1 (H_{ℓ,A}-entropy of f). Given a prior distribution p(f) on functions, and a dataset D of observed function evaluations, the posterior H_{ℓ,A}-entropy with loss ℓ and action set A is defined as

    H_{ℓ,A}[f | D] = inf_{a ∈ A} E_{p(f|D)}[ℓ(f, a)].    (1)
Intuitively, after expending our budget of function queries, suppose that we must make a terminal action a* ∈ A, where this action incurs a loss ℓ(f, a*) defined by the loss ℓ and function f. Given a posterior p(f | D) that describes our belief about f after observing D, we take the terminal action a* to be the Bayes action, i.e. the action that minimizes the posterior expected loss, a* = arg inf_{a ∈ A} E_{p(f|D)}[ℓ(f, a)]. The H_{ℓ,A}-entropy can then be viewed as the posterior expected loss of the Bayes action. We next describe how this generalizes Shannon entropy, and why it is a reasonable definition for an uncertainty measure.
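As a concrete illustration (not part of the method itself), the following is a minimal sketch of estimating Eq. (1) by Monte Carlo over posterior samples when the action set A is finite. The grid of candidate actions and the regret-style loss used here are purely illustrative assumptions.

```python
import numpy as np

def h_ell_A(posterior_samples, actions, loss):
    """Monte Carlo estimate of the H_{ell,A}-entropy in Eq. (1).

    posterior_samples: samples of f drawn from p(f | D), each represented
        here as a vector of function values on a fixed grid.
    actions: a finite action set A.
    loss: a callable loss(f_sample, a) returning ell(f, a).
    """
    # Posterior expected loss of each candidate action.
    expected_losses = [
        np.mean([loss(f, a) for f in posterior_samples]) for a in actions
    ]
    # H_{ell,A}-entropy: expected loss of the Bayes action, i.e. the
    # minimum posterior expected loss over the action set.
    return min(expected_losses)

# Illustrative use: actions are grid indices at which we might commit to a
# maximizer, and the loss is the simple regret of that commitment.
rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 50)
samples = [np.sin(6 * w * grid) + 0.1 * rng.standard_normal(grid.size)
           for w in rng.uniform(0.5, 1.5, size=30)]
regret = lambda f, a: f.max() - f[a]          # ell(f, a)
print(h_ell_A(samples, range(grid.size), regret))
```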
Example: Shannon entropy
Let P(F) denote a set of probability distributions on a function space F, which we assume contains the posterior distribution p(f | D) ∈ P(F). Suppose, for the H_{ℓ,A}-entropy, that we let the action set A = P(F), and loss function ℓ(f, a) = −log a(f), for a ∈ P(F). Unlike the previous examples, note that the action set is now a set of distributions. Then, the Bayes action will be a* = p(f | D) (this can be shown by writing out the definition of the Bayes action as a cross entropy, see Appendix A.1), and thus

    H_{ℓ,A}[f | D] = E_{p(f|D)}[−log a*(f)] = H[f | D],    (2)

where H[f | D] = −∫ p(f | D) log p(f | D) df is the Shannon differential entropy. Thus, the H_{ℓ,A}-entropy using the above (ℓ, A) is equal to the Shannon differential entropy.
Note that we have focused here on the Shannon entropy of the posterior over functions p(f | D). In Section 4 we show how this example can be extended to the Shannon entropy of the posterior over properties of f, such as the location (or values) of optima, which will provide a direct equivalence to entropy search methods in BO.
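As a quick numerical check of this special case (a sketch under illustrative assumptions, with a one-dimensional Gaussian standing in for the posterior), the expected negative log density under the Bayes action a* = p(f | D) recovers the closed-form Gaussian differential entropy:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma = 0.3, 1.7                      # stand-in 1-D Gaussian "posterior"
samples = rng.normal(mu, sigma, size=200_000)

# With A = P(F) and ell(f, a) = -log a(f), the Bayes action is the posterior
# itself, so H_{ell,A}[f | D] = E_{p(f|D)}[-log p(f | D)].
mc_entropy = -np.mean(norm.logpdf(samples, loc=mu, scale=sigma))

# Closed-form differential entropy of a Gaussian: 0.5 * log(2*pi*e*sigma^2).
closed_form = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)

print(mc_entropy, closed_form)            # agree up to Monte Carlo error
```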
Why is this a reasonable measure of uncertainty?
The H_{ℓ,A}-entropy has been interpreted as a measurement of uncertainty in the literature because it satisfies a few intuitive properties. First, similar to Shannon differential entropy, the H_{ℓ,A}-entropy is a concave uncertainty measure [13, 21]. Intuitively, if we have two distributions p1 and p2, and flip a coin to sample from p1 or p2, then we should have less uncertainty if we were told the outcome of the coin flip than if we weren't. In other words, the average uncertainty of p1 and p2 (i.e. coin flip outcome known) should be less than the uncertainty of 0.5 p1 + 0.5 p2 (coin flip outcome unknown). Since H_{ℓ,A} is concave, it has this property. As a consequence (also similar to Shannon differential entropy), the H_{ℓ,A}-entropy of the posterior is less than the H_{ℓ,A}-entropy of the prior, in expectation. Intuitively, whenever we make additional observations (i.e. gain more information), the posterior entropy is expected to decrease.
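For intuition only, a small illustration of the concavity property using Shannon entropy of Bernoulli distributions (an illustrative choice, not from the paper): the average of the two entropies stays below the entropy of the 50/50 mixture.

```python
import numpy as np

def bernoulli_entropy(p):
    """Shannon entropy (in nats) of a Bernoulli(p) distribution."""
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

p1, p2 = 0.1, 0.7                                  # two stand-in distributions
avg_of_entropies = 0.5 * (bernoulli_entropy(p1) + bernoulli_entropy(p2))
entropy_of_mixture = bernoulli_entropy(0.5 * p1 + 0.5 * p2)

# Concavity: the average entropy (coin flip outcome known) is at most the
# entropy of the mixture (coin flip outcome unknown).
print(avg_of_entropies, entropy_of_mixture)
assert avg_of_entropies <= entropy_of_mixture
```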
Acquisition function
We propose a family of acquisition functions for BO based on the H_{ℓ,A}-entropy, which are similar in structure to information-theoretic acquisition functions in the entropy search family. Like these, our acquisition function selects the query x_t ∈ X that maximally reduces the uncertainty, as characterized by the H_{ℓ,A}-entropy, in expectation. We refer to this quantity as the expected H_{ℓ,A}-information gain (EHIG).
Definition 3.2 (Expected H_{ℓ,A}-information gain). Given a prior p(f) on functions and a dataset of observed function evaluations D_t, the expected H_{ℓ,A}-information gain (EHIG), with loss ℓ and action set A, is defined as

    EHIG_t(x; ℓ, A) = H_{ℓ,A}[f | D_t] − E_{p(y_x|D_t)}[H_{ℓ,A}[f | D_t ∪ {(x, y_x)}]].    (3)
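One way Eq. (3) could be estimated is by nested Monte Carlo, sketched below under stated assumptions: the `model.sample_functions` and `model.sample_y` methods are hypothetical stand-ins for a surrogate model's posterior sampler and posterior predictive at x, and `h_ell_A` is the estimator of Eq. (1) sketched earlier. This is not the paper's implementation, only an illustration of the quantity being computed.

```python
def ehig(x, data, model, actions, loss, n_y=32, n_f=64):
    """Nested Monte Carlo sketch of EHIG_t(x; ell, A) from Eq. (3).

    Assumes a hypothetical model interface with two methods:
      model.sample_functions(data, n)  -> n samples from p(f | data)
      model.sample_y(x, data, n)       -> n samples from p(y_x | data)
    h_ell_A is the Monte Carlo estimator of Eq. (1) sketched above.
    """
    # Current uncertainty: H_{ell,A}[f | D_t].
    current = h_ell_A(model.sample_functions(data, n_f), actions, loss)

    # Expected uncertainty after observing a fantasized y_x at the query x.
    future = 0.0
    for y in model.sample_y(x, data, n_y):
        fantasy_data = data + [(x, y)]            # D_t union {(x, y_x)}
        future += h_ell_A(model.sample_functions(fantasy_data, n_f),
                          actions, loss)
    future /= n_y

    # EHIG: expected reduction in H_{ell,A}-entropy from querying x.
    return current - future
```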
There are multiple benefits to developing this acquisition function. Though similar in form to entropy search acquisition functions, the EHIG yields (based on the definition of H_{ℓ,A}) the one-step Bayes