Generalizing Bayesian Optimization with
Decision-theoretic Entropies
Willie Neiswanger, Lantao Yu, Shengjia Zhao, Chenlin Meng, Stefano Ermon
Computer Science Department, Stanford University
Stanford, CA 94305
{neiswanger,lantaoyu,sjzhao,chenlin,ermon}@cs.stanford.edu
Abstract
Bayesian optimization (BO) is a popular method for efficiently inferring optima of
an expensive black-box function via a sequence of queries. Existing information-
theoretic BO procedures aim to make queries that most reduce the uncertainty
about optima, where the uncertainty is captured by Shannon entropy. However,
an optimal measure of uncertainty would, ideally, factor in how we intend to use
the inferred quantity in some downstream procedure. In this paper, we instead
consider a generalization of Shannon entropy from work in statistical decision
theory [13, 39], which contains a broad class of uncertainty measures parameterized
by a problem-specific loss function corresponding to a downstream task. We first
show that special cases of this entropy lead to popular acquisition functions used in
BO procedures such as knowledge gradient, expected improvement, and entropy
search. We then show how alternative choices for the loss yield a flexible family
of acquisition functions that can be customized for use in novel optimization
settings. Additionally, we develop gradient-based methods to efficiently optimize
our proposed family of acquisition functions, and demonstrate strong empirical
performance on a diverse set of sequential decision making tasks, including variants
of top-k optimization, multi-level set estimation, and sequence search.
1 Introduction
Bayesian optimization (BO) is a popular method for efficient global optimization of an expensive
black-box function, which leverages a probabilistic model to judiciously choose a sequence of
function queries. In BO, there are a few key paradigms that motivate existing methodologies. One
paradigm is decision-theoretic BO, which includes methods such as knowledge gradient [16] and
expected improvement [33, 25]. At each iteration of BO, these methods aim to make a query that
maximally increases the expected value, under the posterior, of a final estimate of the optima (some-
times referred to as a terminal action). Another common paradigm is based on maximal uncertainty
reduction and includes information-based BO methods such as the family of entropy search methods
[22, 24, 47, 35]. At each iteration of BO, these methods aim to make a query that most reduces the
uncertainty, under the posterior, about a quantity of interest (such as the location of the optima).
In the uncertainty-reduction paradigm, the information-based methods have predominantly used
Shannon entropy as the measure of uncertainty. While Shannon entropy is one measure of uncertainty
that we could aim to reduce at each iteration of BO, it is not the only measure, and it is not necessarily
the most ideal measure for every optimization task. For instance, an optimal uncertainty function
would, ideally, factor in how we intend to use the final uncertainty about an inferred quantity in some
downstream procedure.
The first two authors contributed equally to this work.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.01383v1 [stat.ML] 4 Oct 2022
In this paper, we develop a framework that aims to first unify and then extend these two paradigms.
Specifically, we adopt a generalized definition of entropy from past work in Bayesian decision
theory [13, 39, 21], which proposes a family of decision-theoretic entropies parameterized by a
problem-specific loss function and action set. This family includes Shannon entropy as a special case.
Using this generalized entropy, we can view information-based BO methods as instances of decision-
theoretic BO, with a terminal action chosen from a different type of action set. Similarly, this frame-
work also includes as special cases the decision-theoretic methods such as expected improvement and
knowledge gradient, which yields an uncertainty-reduction view of these methods. Beyond this uni-
fied view, our framework can be easily adapted to novel problem settings by choosing an appropriate
loss and action set tailored to a given downstream use case. This allows for handling new optimization
scenarios that have not previously been studied and where no BO procedure currently exists.
As an example, there are many real-world problems where we want to estimate a set of optimal points,
rather than a single global optimum. Use cases include when we wish to find a set of highest-value
points subject to some constraints on the similarity between these points (e.g. to produce a diverse
set of candidates in drug or materials design [38, 44, 45]), or points which satisfy some sequential
relation (e.g. to construct a library of molecules that attains a sequence of desired measurements [15]).
Further, we may wish to estimate other properties of a black-box function, such as certain curves,
surfaces, or subsets of the domain [54, 28, 40]. Due to the vast number of possibilities, most custom
problem settings have not been explicitly studied in the literature. A key advantage of our framework
is that it provides a way to approach these problems where no suitable methods have been developed.
Additionally, since we define this family of generalized entropies in a standardized way, we can
develop a common acquisition optimization procedure, which applies generically to many members
of this family (where each member is induced by a specific loss function and action set). In particular,
we develop a fully differentiable acquisition optimization method inspired by recent work on one-shot
knowledge gradient procedures [4]. This yields an effective and computationally efficient algorithm
for many optimization and sequential decision making tasks, as long as the problem-specific loss
function is differentiable. In summary, our main contributions are the following:
• We propose an acquisition function based on a family of decision-theoretic entropies
parameterized by a loss function ℓ and action set A. Under certain choices of ℓ and A, we can view
multiple BO acquisition functions in a single decision-theoretic perspective, which sheds light
on the settings for which each is best suited.
• By selecting a suitable ℓ and A, we can produce a problem-specific acquisition function, which
is tailored to a given downstream use case. This yields a customizable BO method that can be
applied to new optimization problems and other sequential decision making tasks, where no
applicable methods currently exist.
• We develop an acquisition optimization procedure that applies generically to many instances of
our framework. This procedure is computationally efficient, using a gradient-based approach.
• We demonstrate that our method shows strong empirical performance on a diverse set of tasks,
including top-k optimization with diversity, multi-level set estimation, and sequence search.
2 Setup
Let f : X → Y ⊆ R denote an expensive black-box function that maps from an input search space X
to an output space Y, with f ∈ F. We assume that we can evaluate f at an input x ∈ X, and will
observe a noisy function value yx = f(x) + ε, where ε ∼ N(0, η²).
We also assume that our uncertainty about f is captured by a probabilistic model with prior distribution
p(f), which reflects our prior beliefs about f. Given a dataset of observed function evaluations
Dt = {(xi, yxi)}_{i=1}^{t−1}, our model gives a posterior distribution over F, denoted by p(f | Dt).
Suppose that, after a given BO procedure is complete, we intend to choose a terminal action a from
some set of actions A, and then incur a loss based on both this action a and the function f. We denote
this loss as ℓ : F × A → R. As one example, after the BO procedure, suppose we make a single
guess for the function maximizer, and then incur a loss based on the value of the function at this
guess. In this case, the action set is A = X and the loss is ℓ(f, a) = −f(a).
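This terminal-action example can be sketched numerically. Below, a hypothetical posterior p(f | D) is stood in for by a small set of sampled functions tabulated on a grid, and the Bayes action for the loss ℓ(f, a) = −f(a) is simply the maximizer of the posterior mean (the toy samples and all names here are illustrative, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in for p(f | D): posterior "samples" of f, each
# tabulated on a shared grid of candidate inputs X.
X = np.linspace(0.0, 1.0, 101)                      # search space (grid)
f_samples = np.stack([
    np.sin(2 * np.pi * (X - rng.uniform(0, 0.2)))   # each row is one plausible f
    for _ in range(50)
])

# Terminal action: a single guess a in A = X for the maximizer.
# With loss l(f, a) = -f(a), the Bayes action minimizes the posterior
# expected loss, i.e. it maximizes the posterior mean of f.
expected_loss = -f_samples.mean(axis=0)             # E_{p(f|D)}[-f(a)] for each a
a_star = X[np.argmin(expected_loss)]                # Bayes action
```

The minimized expected loss attained at `a_star` is exactly the quantity that Section 3 names the Hℓ,A-entropy for this choice of ℓ and A.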
3 Decision-theoretic Entropy Search
In this section, we first describe a family of decision-theoretic entropies from past work in Bayesian
decision theory [13, 39, 21], which are parameterized by a problem-specific action set A and loss
function ℓ. This family includes Shannon entropy as a special case. We denote this family using
the symbol Hℓ,A, and refer to it as the Hℓ,A-entropy.
Definition 3.1 (Hℓ,A-entropy of f). Given a prior distribution p(f) on functions, and a dataset D of
observed function evaluations, the posterior Hℓ,A-entropy with loss ℓ and action set A is defined as

Hℓ,A[f | D] = inf_{a ∈ A} E_{p(f|D)}[ℓ(f, a)].    (1)
Intuitively, after expending our budget of function queries, suppose that we must make a terminal
action a ∈ A, where this action incurs a loss ℓ(f, a) defined by the loss ℓ and function f. Given
a posterior p(f | D) that describes our belief about f after observing D, we take the terminal
action a to be the Bayes action, i.e. the action that minimizes the posterior expected loss,
a* = arg inf_{a ∈ A} E_{p(f|D)}[ℓ(f, a)]. The Hℓ,A-entropy can then be viewed as the posterior
expected loss of the Bayes action. We next describe how this generalizes Shannon entropy, and why
it is a reasonable definition for an uncertainty measure.
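Definition 3.1 suggests a direct Monte Carlo estimator: approximate the expectation with posterior samples of f and take the minimum of the sample-average loss over a discretized action set. A minimal sketch under those assumptions (the toy samples, action set, and loss below are illustrative):

```python
import numpy as np

def h_entropy(f_samples, actions, loss):
    """Estimate H_{l,A}[f | D] = inf_{a in A} E_{p(f|D)}[l(f, a)]
    from posterior samples of f and a finite action set."""
    # Sample-average expected loss for every candidate action, then minimize.
    expected = np.array([np.mean([loss(f, a) for f in f_samples]) for a in actions])
    return expected.min()

# Toy posterior: each "sample" of f is a vector of values on a 3-point grid.
f_samples = [np.array([0.0, 1.0, 0.5]), np.array([0.2, 0.8, 0.9])]
actions = [0, 1, 2]                         # guess a grid index of the maximizer
loss = lambda f, a: -f[a]                   # l(f, a) = -f(a), as in Section 2
print(h_entropy(f_samples, actions, loss))
```

Here the minimizing action is index 1, where the two samples average the highest function value, so the estimated entropy is −0.9.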
Example: Shannon entropy.  Let P(F) denote a set of probability distributions on a function
space F, which we assume contains the posterior distribution p(f | D) ∈ P(F). Suppose, for the
Hℓ,A-entropy, that we let the action set be A = P(F), and the loss function be ℓ(f, a) = −log a(f),
for a ∈ P(F). Unlike the previous example, note that the action set is now a set of distributions.
Then, the Bayes action will be a* = p(f | D) (this can be shown by writing out the definition of the
Bayes action as a cross entropy; see Appendix A.1), and thus

Hℓ,A[f | D] = E_{p(f|D)}[−log a*(f)] = H[f | D],    (2)

where H[f | D] = −∫ p(f | D) log p(f | D) df is the Shannon differential entropy. Thus, the
Hℓ,A-entropy using the above (ℓ, A) is equal to the Shannon differential entropy.
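The cross-entropy argument can be checked numerically with a discrete stand-in distribution: under the log loss, no candidate action q achieves lower expected loss than the posterior itself (Gibbs' inequality), and the value attained at q = p is exactly the Shannon entropy. A small sketch (the distribution p is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.2])             # stand-in posterior over a discrete f

def expected_log_loss(p, q):
    # E_p[-log q(f)], i.e. the cross entropy H(p, q)
    return -np.sum(p * np.log(q))

shannon = expected_log_loss(p, p)         # attained at the Bayes action q = p
# Any other candidate action q incurs at least as much expected loss:
for _ in range(1000):
    q = rng.dirichlet(np.ones(3))
    assert expected_log_loss(p, q) >= shannon
```

So the infimum in Definition 3.1 is attained at the posterior itself, and the attained value is the Shannon entropy H(p).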
Note that we have focused here on the Shannon entropy of the posterior over functions p(f | D). In
Section 4 we show how this example can be extended to the Shannon entropy of the posterior over
properties of f, such as the location (or values) of optima, which will provide a direct equivalence to
entropy search methods in BO.
Why is this a reasonable measure of uncertainty?  The Hℓ,A-entropy has been interpreted as
a measurement of uncertainty in the literature because it satisfies a few intuitive properties. First,
similar to Shannon differential entropy, the Hℓ,A-entropy is a concave uncertainty measure [13, 21].
Intuitively, if we have two distributions p1 and p2, and flip a coin to sample from p1 or p2, then we
should have less uncertainty if we were told the outcome of the coin flip than if we weren't. In other
words, the average uncertainty of p1 and p2 (i.e. coin flip outcome known) should be less than the
uncertainty of 0.5 p1 + 0.5 p2 (coin flip outcome unknown). Since Hℓ,A is concave, it has this
property. As a consequence, and also similar to Shannon differential entropy, the Hℓ,A-entropy of the
posterior is less than the Hℓ,A-entropy of the prior, in expectation. Intuitively, whenever we make
additional observations (i.e. gain more information), the posterior entropy is expected to decrease.
Acquisition function.  We propose a family of acquisition functions for BO based on the Hℓ,A-
entropy, which are similar in structure to information-theoretic acquisition functions in the entropy
search family. Like these, our acquisition function selects the query xt ∈ X that maximally reduces
the uncertainty, as characterized by the Hℓ,A-entropy, in expectation. We refer to this quantity as the
expected Hℓ,A-information gain (EHIG).
Definition 3.2 (Expected Hℓ,A-information gain). Given a prior p(f) on functions and a dataset
of observed function evaluations Dt, the expected Hℓ,A-information gain (EHIG), with loss ℓ and
action set A, is defined as

EHIGt(x; ℓ, A) = Hℓ,A[f | Dt] − E_{p(yx|Dt)}[Hℓ,A[f | Dt ∪ {(x, yx)}]].    (3)
There are multiple benefits to developing this acquisition function. Though similar in form to entropy
search acquisition functions, the EHIG yields (based on the definition of Hℓ,A) the one-step Bayes
optimal query for the associated decision problem specified by the given loss ℓ and action set A. We
prove in Section 4 that the EHIG casts both uncertainty-reduction and decision-theoretic acquisition
functions under a common umbrella, using different choices of ℓ and A; this standardization provides
guidance on which acquisition function is optimal for a given use case, based on details of the
associated terminal action. More interestingly, in Section 5 we show how the EHIG allows us to
derive problem-specific acquisition functions tailored to novel optimization and sequential decision
making tasks. And importantly, since we frame acquisition optimization of this family in a common
way, as a bilevel optimization problem over the sample space and action space, we can develop a
single acquisition optimization method that can generically apply to many custom tasks (Section 6).
In Algorithm 1, we present Hℓ,A-ENTROPY SEARCH, our full Bayesian optimization procedure using
the EHIG acquisition function. This procedure takes as input a loss ℓ, action set A, and prior model
p(f). At each iteration, the procedure optimizes EHIGt(x; ℓ, A) to select a design xt ∈ X to query,
and then evaluates the black-box function on this design to observe an outcome yxt ← f(xt) + ε.
In Section 6 we describe methods for optimizing the EHIG acquisition function via gradient-based
procedures, which provide a computationally efficient algorithm for many A and ℓ.
Algorithm 1 Hℓ,A-ENTROPY SEARCH
Input: initial dataset D1, prior p(f), action set A, loss ℓ.
1: for t = 1, . . . , T do
2:   xt ← arg max_{x ∈ X} EHIGt(x; ℓ, A)    ▷ Optimize the EHIG acquisition function
3:   yxt ← f(xt) + ε    ▷ Evaluate the function f at xt
4:   Dt+1 ← Dt ∪ {(xt, yxt)}    ▷ Update the dataset
Output: distribution p(f | DT+1)
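Algorithm 1 can be sketched end-to-end with a small discrete belief over hypothesis functions. The sketch below assumes noiseless queries, so the predictive distribution over yx is a finite mixture and the Bayes update is exact; the hypothesis-set model, the loss ℓ(f, a) = −f(a), and the budget T = 2 are all illustrative assumptions:

```python
import numpy as np

# Toy instance of Algorithm 1: the belief over f is a weight vector over
# three hypothesis functions tabulated on a 4-point grid (illustrative).
F = np.array([[0., 1., 0., 0.],
              [0., 0., 1., 0.],
              [0., 0., 0., 1.]])
true_f = F[1]                                   # the hidden black-box function
w = np.ones(3) / 3                              # prior weights over hypotheses

def h_entropy(w):
    # H_{l,A} with actions A = grid indices and loss l(f, a) = -f(a).
    return np.min(-(w @ F))

def ehig(w, x):
    # Noiseless predictive: y_x takes value F[k, x] with probability w[k],
    # so the expectation in Eq. (3) is a finite sum.
    h_now, h_post = h_entropy(w), 0.0
    for y in np.unique(F[:, x]):
        mask = np.isclose(F[:, x], y)
        p_y = w[mask].sum()
        if p_y > 0:
            h_post += p_y * h_entropy(w * mask / p_y)
    return h_now - h_post

for t in range(2):                              # T = 2 rounds of the BO loop
    x_t = int(np.argmax([ehig(w, x) for x in range(F.shape[1])]))
    y_t = true_f[x_t]                           # evaluate the black box at x_t
    mask = np.isclose(F[:, x_t], y_t)           # exact Bayes update (noiseless)
    w = w * mask / w[mask].sum()
print(w)                                        # → [0. 1. 0.]
```

After two EHIG-maximizing queries the belief collapses onto the true hypothesis; the uninformative input x = 0, where all hypotheses agree, is never selected.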
4 A Unified View of Information-based and Decision-theoretic Acquisitions
In this section, we aim to show how acquisition functions commonly used in BO are special cases
of the proposed EHIG family, for particular choices of ℓ and A. This will allow us to view each
acquisition function (including information-based ones) from the perspective of a common decision
problem: after the BO procedure is complete, we choose a terminal action from action set A and
then incur a loss defined by ℓ. Each acquisition function can be viewed as reducing the posterior
uncertainty over f in a way that yields a terminal action with lowest expected loss.
This unified view provides two main benefits. First, it sheds light on the particular scenarios in which
one of the existing acquisition functions is optimal over the others (which we focus on in this section).
Second, it shows how using the EHIG with other choices for ℓ and A provides new acquisition
functions for a broader set of optimization scenarios and related tasks (which is the focus of Section 5).
Information-based acquisition functions.  We state the family of entropy search acquisition
functions in a general way that includes the entropy search (ES) [22], predictive entropy search
(PES) [24], and max-value entropy search (MES) [47] algorithms. Let θf ∈ Θ denote a property of f
we would like to infer. For example, we could set θf = arg max_{x ∈ X} f(x) = x* ∈ X, i.e. the
location of the global maximizer of f, or θf = max_{x ∈ X} f(x) ∈ R, i.e. the maximum value
achieved by f in X.
This family of entropy search acquisition functions can then be written as:

ESt(x) = H[θf | Dt] − E_{p(yx|Dt)}[H[θf | Dt ∪ {(x, yx)}]].
We can view this acquisition function as a special case of the EHIG in the following way. Suppose,
after the BO procedure is complete, we choose a distribution q from a set of distributions P(Θ) and
then incur a loss equal to the negative log-likelihood of q for the true value of θf. In this case, we
view the action set as A = P(Θ) and the loss function as ℓ(f, a) = −log a(θf), where a ∈ A. To
visualize this, in the case where θf = x*, see Figure 1 (left), which shows the terminal action (gold
density function) and corresponding loss (horizontal dashed line).
Under this choice, the Hℓ,A-entropy of f will be equal to the Shannon entropy of θf, and thus the
EHIGt will be equal to ESt. We formalize this in the following proposition.
Proposition 1. If we choose A = P(Θ) and ℓ(f, q) = −log q(θf), then the EHIG is equivalent to
the entropy search acquisition function, i.e. EHIGt(x; ℓ, A) = ESt(x).