Generalizing Bayesian Optimization with
Decision-theoretic Entropies
Willie Neiswanger, Lantao Yu, Shengjia Zhao, Chenlin Meng, Stefano Ermon
Computer Science Department, Stanford University
Stanford, CA 94305
{neiswanger,lantaoyu,sjzhao,chenlin,ermon}@cs.stanford.edu
Abstract
Bayesian optimization (BO) is a popular method for efficiently inferring optima of
an expensive black-box function via a sequence of queries. Existing information-
theoretic BO procedures aim to make queries that most reduce the uncertainty
about optima, where the uncertainty is captured by Shannon entropy. However,
an optimal measure of uncertainty would, ideally, factor in how we intend to use
the inferred quantity in some downstream procedure. In this paper, we instead
consider a generalization of Shannon entropy from work in statistical decision
theory [13, 39], which contains a broad class of uncertainty measures parameterized
by a problem-specific loss function corresponding to a downstream task. We first
show that special cases of this entropy lead to popular acquisition functions used in
BO procedures such as knowledge gradient, expected improvement, and entropy
search. We then show how alternative choices for the loss yield a flexible family
of acquisition functions that can be customized for use in novel optimization
settings. Additionally, we develop gradient-based methods to efficiently optimize
our proposed family of acquisition functions, and demonstrate strong empirical
performance on a diverse set of sequential decision making tasks, including variants
of top-k optimization, multi-level set estimation, and sequence search.
1 Introduction
Bayesian optimization (BO) is a popular method for efficient global optimization of an expensive
black-box function, which leverages a probabilistic model to judiciously choose a sequence of
function queries. In BO, there are a few key paradigms that motivate existing methodologies. One
paradigm is decision-theoretic BO, which includes methods such as knowledge gradient [16] and
expected improvement [33, 25]. At each iteration of BO, these methods aim to make a query that
maximally increases the expected value, under the posterior, of a final estimate of the optima (some-
times referred to as a terminal action). Another common paradigm is based on maximal uncertainty
reduction and includes information-based BO methods such as the family of entropy search methods
[22, 24, 47, 35]. At each iteration of BO, these methods aim to make a query that most reduces the
uncertainty, under the posterior, about a quantity of interest (such as the location of the optima).
In the uncertainty-reduction paradigm, the information-based methods have predominantly used
Shannon entropy as the measure of uncertainty. While Shannon entropy is one measure of uncertainty
that we could aim to reduce at each iteration of BO, it is not the only measure, and it is not necessarily
the most ideal measure for every optimization task. For instance, an optimal uncertainty function
would, ideally, factor in how we intend to use the final uncertainty about an inferred quantity in some
downstream procedure.
The first two authors contributed equally to this work.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.01383v1 [stat.ML] 4 Oct 2022
In this paper, we develop a framework that aims to first unify and then extend these two paradigms.
Specifically, we adopt a generalized definition of entropy from past work in Bayesian decision
theory [13, 39, 21], which proposes a family of decision-theoretic entropies parameterized by a
problem-specific loss function and action set. This family includes Shannon entropy as a special case.
Using this generalized entropy, we can view information-based BO methods as instances of decision-
theoretic BO, with a terminal action chosen from a different type of action set. Similarly, this frame-
work also includes as special cases the decision-theoretic methods such as expected improvement and
knowledge gradient, which yields an uncertainty-reduction view of these methods. Beyond this uni-
fied view, our framework can be easily adapted to novel problem settings by choosing an appropriate
loss and action set tailored to a given downstream use case. This allows for handling new optimization
scenarios that have not previously been studied and where no BO procedure currently exists.
As an example, there are many real-world problems where we want to estimate a set of optimal points,
rather than a single global optimum. Use cases include when we wish to find a set of highest-value
points subject to some constraints on the similarity between these points (e.g. to produce a diverse
set of candidates in drug or materials design [38, 44, 45]), or points which satisfy some sequential
relation (e.g. to construct a library of molecules that attains a sequence of desired measurements [15]).
Further, we may wish to estimate other properties of a black-box function, such as certain curves,
surfaces, or subsets of the domain [54, 28, 40]. Due to the vast number of possibilities, most custom
problem settings have not been explicitly studied in the literature. A key advantage of our framework
is that it provides a way to approach these problems where no suitable methods have been developed.
Additionally, since we define this family of generalized entropies in a standardized way, we can
develop a common acquisition optimization procedure, which applies generically to many members
of this family (where each member is induced by a specific loss function and action set). In particular,
we develop a fully differentiable acquisition optimization method inspired by recent work on one-shot
knowledge gradient procedures [4]. This yields an effective and computationally efficient algorithm
for many optimization and sequential decision making tasks, as long as the problem-specific loss
function is differentiable. In summary, our main contributions are the following:
• We propose an acquisition function based on a family of decision-theoretic entropies
parameterized by a loss function ℓ and action set A. Under certain choices of ℓ and A, we can view
multiple BO acquisition functions in a single decision-theoretic perspective, which sheds light
on the settings for which each is best suited.
• By selecting a suitable ℓ and A, we can produce a problem-specific acquisition function, which
is tailored to a given downstream use case. This yields a customizable BO method that can be
applied to new optimization problems and other sequential decision making tasks, where no
applicable methods currently exist.
• We develop an acquisition optimization procedure that applies generically to many instances of
our framework. This procedure is computationally efficient, using a gradient-based approach.
• We demonstrate that our method shows strong empirical performance on a diverse set of tasks,
including top-k optimization with diversity, multi-level set estimation, and sequence search.
2 Setup
Let f : X → Y ⊆ R denote an expensive black-box function that maps from an input search space X
to an output space Y, with f ∈ F. We assume that we can evaluate f at an input x ∈ X, and will
observe a noisy function value yx = f(x) + ε, where ε ∼ N(0, η²).
We also assume that our uncertainty about f is captured by a probabilistic model with prior distribution
p(f), which reflects our prior beliefs about f. Given a dataset of observed function evaluations
Dt = {(xi, yxi)}_{i=1}^{t−1}, our model gives a posterior distribution over F, denoted by p(f | Dt).
Suppose that, after a given BO procedure is complete, we intend to choose a terminal action a from
some set of actions A, and then incur a loss based on both this action a and the function f. We denote
this loss as ℓ : F × A → R. As one example, after the BO procedure, suppose we make a single
guess for the function maximizer, and then incur a loss based on the value of the function at this
guess. In this case, the action set is A = X and the loss is ℓ(f, a) = −f(a).
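This terminal-action example can be sketched numerically. Below, a hypothetical posterior p(f | D) is stood in for by a small set of sampled functions tabulated on a grid, and the Bayes action for the loss ℓ(f, a) = −f(a) is simply the maximizer of the posterior mean (the toy samples and all names here are illustrative, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in for p(f | D): posterior "samples" of f, each
# tabulated on a shared grid of candidate inputs X.
X = np.linspace(0.0, 1.0, 101)                      # search space (grid)
f_samples = np.stack([
    np.sin(2 * np.pi * (X - rng.uniform(0, 0.2)))   # each row is one plausible f
    for _ in range(50)
])

# Terminal action: a single guess a in A = X for the maximizer.
# With loss l(f, a) = -f(a), the Bayes action minimizes the posterior
# expected loss, i.e. it maximizes the posterior mean of f.
expected_loss = -f_samples.mean(axis=0)             # E_{p(f|D)}[-f(a)] for each a
a_star = X[np.argmin(expected_loss)]                # Bayes action
```

The minimized expected loss attained at `a_star` is exactly the quantity that Section 3 names the Hℓ,A-entropy for this choice of ℓ and A.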
3 Decision-theoretic Entropy Search
In this section, we first describe a family of decision-theoretic entropies from past work in Bayesian
decision theory [13, 39, 21], which are parameterized by a problem-specific action set A and loss
function ℓ. This family includes Shannon entropy as a special case. We denote this family using
the symbol Hℓ,A, and refer to it as the Hℓ,A-entropy.
Definition 3.1 (Hℓ,A-entropy of f). Given a prior distribution p(f) on functions, and a dataset D of
observed function evaluations, the posterior Hℓ,A-entropy with loss ℓ and action set A is defined as

Hℓ,A[f | D] = inf_{a ∈ A} E_{p(f|D)}[ℓ(f, a)].    (1)
Intuitively, after expending our budget of function queries, suppose that we must make a terminal
action a ∈ A, where this action incurs a loss ℓ(f, a) defined by the loss ℓ and function f. Given
a posterior p(f | D) that describes our belief about f after observing D, we take the terminal
action a to be the Bayes action, i.e. the action that minimizes the posterior expected loss,
a* = arg inf_{a ∈ A} E_{p(f|D)}[ℓ(f, a)]. The Hℓ,A-entropy can then be viewed as the posterior
expected loss of the Bayes action. We next describe how this generalizes Shannon entropy, and why
it is a reasonable definition for an uncertainty measure.
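Definition 3.1 suggests a direct Monte Carlo estimator: approximate the expectation with posterior samples of f and take the minimum of the sample-average loss over a discretized action set. A minimal sketch under those assumptions (the toy samples, action set, and loss below are illustrative):

```python
import numpy as np

def h_entropy(f_samples, actions, loss):
    """Estimate H_{l,A}[f | D] = inf_{a in A} E_{p(f|D)}[l(f, a)]
    from posterior samples of f and a finite action set."""
    # Sample-average expected loss for every candidate action, then minimize.
    expected = np.array([np.mean([loss(f, a) for f in f_samples]) for a in actions])
    return expected.min()

# Toy posterior: each "sample" of f is a vector of values on a 3-point grid.
f_samples = [np.array([0.0, 1.0, 0.5]), np.array([0.2, 0.8, 0.9])]
actions = [0, 1, 2]                         # guess a grid index of the maximizer
loss = lambda f, a: -f[a]                   # l(f, a) = -f(a), as in Section 2
print(h_entropy(f_samples, actions, loss))
```

Here the minimizing action is index 1, where the two samples average the highest function value, so the estimated entropy is −0.9.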
Example: Shannon entropy.  Let P(F) denote a set of probability distributions on a function
space F, which we assume contains the posterior distribution p(f | D) ∈ P(F). Suppose, for the
Hℓ,A-entropy, that we let the action set be A = P(F), and the loss function be ℓ(f, a) = −log a(f),
for a ∈ P(F). Unlike the previous example, note that the action set is now a set of distributions.
Then, the Bayes action will be a* = p(f | D) (this can be shown by writing out the definition of the
Bayes action as a cross entropy; see Appendix A.1), and thus

Hℓ,A[f | D] = E_{p(f|D)}[−log a*(f)] = H[f | D],    (2)

where H[f | D] = −∫ p(f | D) log p(f | D) df is the Shannon differential entropy. Thus, the
Hℓ,A-entropy using the above (ℓ, A) is equal to the Shannon differential entropy.
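The cross-entropy argument can be checked numerically with a discrete stand-in distribution: under the log loss, no candidate action q achieves lower expected loss than the posterior itself (Gibbs' inequality), and the value attained at q = p is exactly the Shannon entropy. A small sketch (the distribution p is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.2])             # stand-in posterior over a discrete f

def expected_log_loss(p, q):
    # E_p[-log q(f)], i.e. the cross entropy H(p, q)
    return -np.sum(p * np.log(q))

shannon = expected_log_loss(p, p)         # attained at the Bayes action q = p
# Any other candidate action q incurs at least as much expected loss:
for _ in range(1000):
    q = rng.dirichlet(np.ones(3))
    assert expected_log_loss(p, q) >= shannon
```

So the infimum in Definition 3.1 is attained at the posterior itself, and the attained value is the Shannon entropy H(p).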
Note that we have focused here on the Shannon entropy of the posterior over functions p(f | D). In
Section 4 we show how this example can be extended to the Shannon entropy of the posterior over
properties of f, such as the location (or values) of optima, which will provide a direct equivalence to
entropy search methods in BO.
Why is this a reasonable measure of uncertainty?  The Hℓ,A-entropy has been interpreted as
a measurement of uncertainty in the literature because it satisfies a few intuitive properties. First,
similar to Shannon differential entropy, the Hℓ,A-entropy is a concave uncertainty measure [13, 21].
Intuitively, if we have two distributions p1 and p2, and flip a coin to sample from p1 or p2, then we
should have less uncertainty if we were told the outcome of the coin flip than if we weren't. In other
words, the average uncertainty of p1 and p2 (i.e. coin flip outcome known) should be less than the
uncertainty of 0.5 p1 + 0.5 p2 (coin flip outcome unknown). Since Hℓ,A is concave, it has this
property. As a consequence, and also similar to Shannon differential entropy, the Hℓ,A-entropy of the
posterior is less than the Hℓ,A-entropy of the prior, in expectation. Intuitively, whenever we make
additional observations (i.e. gain more information), the posterior entropy is expected to decrease.
Acquisition function.  We propose a family of acquisition functions for BO based on the Hℓ,A-
entropy, which are similar in structure to information-theoretic acquisition functions in the entropy
search family. Like these, our acquisition function selects the query xt ∈ X that maximally reduces
the uncertainty, as characterized by the Hℓ,A-entropy, in expectation. We refer to this quantity as the
expected Hℓ,A-information gain (EHIG).
Definition 3.2 (Expected Hℓ,A-information gain). Given a prior p(f) on functions and a dataset
of observed function evaluations Dt, the expected Hℓ,A-information gain (EHIG), with loss ℓ and
action set A, is defined as

EHIGt(x; ℓ, A) = Hℓ,A[f | Dt] − E_{p(yx|Dt)}[Hℓ,A[f | Dt ∪ {(x, yx)}]].    (3)
There are multiple benefits to developing this acquisition function. Though similar in form to entropy
search acquisition functions, the EHIG yields (based on the definition of Hℓ,A) the one-step Bayes
optimal query for the associated decision problem specified by the given loss ℓ and action set A. We
prove in Section 4 that the EHIG casts both uncertainty-reduction and decision-theoretic acquisition
functions under a common umbrella, using different choices of ℓ and A; this standardization provides
guidance on which acquisition function is optimal for a given use case, based on details of the
associated terminal action. More interestingly, in Section 5 we show how the EHIG allows us to
derive problem-specific acquisition functions tailored to novel optimization and sequential decision
making tasks. And importantly, since we frame acquisition optimization of this family in a common
way, as a bilevel optimization problem over the sample space and action space, we can develop a
single acquisition optimization method that can generically apply to many custom tasks (Section 6).
In Algorithm 1, we present Hℓ,A-ENTROPY SEARCH, our full Bayesian optimization procedure using
the EHIG acquisition function. This procedure takes as input a loss ℓ, action set A, and prior model
p(f). At each iteration, the procedure optimizes EHIGt(x; ℓ, A) to select a design xt ∈ X to query,
and then evaluates the black-box function on this design to observe an outcome yxt ← f(xt) + ε.
In Section 6 we describe methods for optimizing the EHIG acquisition function via gradient-based
procedures, which provide a computationally efficient algorithm for many A and ℓ.
Algorithm 1 Hℓ,A-ENTROPY SEARCH
Input: initial dataset D1, prior p(f), action set A, loss ℓ.
1: for t = 1, . . . , T do
2:   xt ← arg max_{x ∈ X} EHIGt(x; ℓ, A)    ▷ Optimize the EHIG acquisition function
3:   yxt ← f(xt) + ε    ▷ Evaluate the function f at xt
4:   Dt+1 ← Dt ∪ {(xt, yxt)}    ▷ Update the dataset
Output: distribution p(f | DT+1)
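Algorithm 1 can be sketched end-to-end with a small discrete belief over hypothesis functions. The sketch below assumes noiseless queries, so the predictive distribution over yx is a finite mixture and the Bayes update is exact; the hypothesis-set model, the loss ℓ(f, a) = −f(a), and the budget T = 2 are all illustrative assumptions:

```python
import numpy as np

# Toy instance of Algorithm 1: the belief over f is a weight vector over
# three hypothesis functions tabulated on a 4-point grid (illustrative).
F = np.array([[0., 1., 0., 0.],
              [0., 0., 1., 0.],
              [0., 0., 0., 1.]])
true_f = F[1]                                   # the hidden black-box function
w = np.ones(3) / 3                              # prior weights over hypotheses

def h_entropy(w):
    # H_{l,A} with actions A = grid indices and loss l(f, a) = -f(a).
    return np.min(-(w @ F))

def ehig(w, x):
    # Noiseless predictive: y_x takes value F[k, x] with probability w[k],
    # so the expectation in Eq. (3) is a finite sum.
    h_now, h_post = h_entropy(w), 0.0
    for y in np.unique(F[:, x]):
        mask = np.isclose(F[:, x], y)
        p_y = w[mask].sum()
        if p_y > 0:
            h_post += p_y * h_entropy(w * mask / p_y)
    return h_now - h_post

for t in range(2):                              # T = 2 rounds of the BO loop
    x_t = int(np.argmax([ehig(w, x) for x in range(F.shape[1])]))
    y_t = true_f[x_t]                           # evaluate the black box at x_t
    mask = np.isclose(F[:, x_t], y_t)           # exact Bayes update (noiseless)
    w = w * mask / w[mask].sum()
print(w)                                        # → [0. 1. 0.]
```

After two EHIG-maximizing queries the belief collapses onto the true hypothesis; the uninformative input x = 0, where all hypotheses agree, is never selected.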
4 A Unified View of Information-based and Decision-theoretic Acquisitions
In this section, we aim to show how acquisition functions commonly used in BO are special cases
of the proposed EHIG family, for particular choices of ℓ and A. This will allow us to view each
acquisition function (including information-based ones) from the perspective of a common decision
problem: after the BO procedure is complete, we choose a terminal action from action set A and
then incur a loss defined by ℓ. Each acquisition function can be viewed as reducing the posterior
uncertainty over f in a way that yields a terminal action with lowest expected loss.
This unified view provides two main benefits. First, it sheds light on the particular scenarios in which
one of the existing acquisition functions is optimal over the others (which we focus on in this section).
Second, it shows how using the EHIG with other choices for ℓ and A provides new acquisition
functions for a broader set of optimization scenarios and related tasks (which is the focus of Section 5).
Information-based acquisition functions.  We state the family of entropy search acquisition
functions in a general way that includes the entropy search (ES) [22], predictive entropy search
(PES) [24], and max-value entropy search (MES) [47] algorithms. Let θf ∈ Θ denote a property of f
we would like to infer. For example, we could set θf = arg max_{x ∈ X} f(x) = x* ∈ X, i.e. the
location of the global maximizer of f, or θf = max_{x ∈ X} f(x) ∈ R, i.e. the maximum value
achieved by f in X.
This family of entropy search acquisition functions can then be written as:

ESt(x) = H[θf | Dt] − E_{p(yx|Dt)}[H[θf | Dt ∪ {(x, yx)}]].
We can view this acquisition function as a special case of the EHIG in the following way. Suppose,
after the BO procedure is complete, we choose a distribution q from a set of distributions P(Θ) and
then incur a loss equal to the negative log-likelihood of q for the true value of θf. In this case, we
view the action set as A = P(Θ) and the loss function as ℓ(f, a) = −log a(θf), where a ∈ A. To
visualize this, in the case where θf = x*, see Figure 1 (left), which shows the terminal action (gold
density function) and corresponding loss (horizontal dashed line).
Under this choice, the Hℓ,A-entropy of f will be equal to the Shannon entropy of θf, and thus the
EHIGt will be equal to ESt. We formalize this in the following proposition.
Proposition 1. If we choose A = P(Θ) and ℓ(f, q) = −log q(θf), then the EHIG is equivalent to
the entropy search acquisition function, i.e. EHIGt(x; ℓ, A) = ESt(x).