Bayesian Optimization over Discrete and Mixed
Spaces via Probabilistic Reparameterization
Samuel Daulton
University of Oxford, Meta
sdaulton@meta.com
Xingchen Wan
University of Oxford
xwan@robots.ox.ac.uk
David Eriksson
Meta
deriksson@meta.com
Maximilian Balandat
Meta
balandat@meta.com
Michael A. Osborne
University of Oxford
mosb@robots.ox.ac.uk
Eytan Bakshy
Meta
ebakshy@meta.com
Abstract
Optimizing expensive-to-evaluate black-box functions of discrete (and potentially continuous) design parameters is a ubiquitous problem in scientific and engineering applications. Bayesian optimization (BO) is a popular, sample-efficient method that leverages a probabilistic surrogate model and an acquisition function (AF) to select promising designs to evaluate. However, maximizing the AF over mixed or high-cardinality discrete search spaces is challenging: standard gradient-based methods cannot be used directly, and evaluating the AF at every point in the search space would be computationally prohibitive. To address this issue, we propose using probabilistic reparameterization (PR). Instead of directly optimizing the AF over the search space containing discrete parameters, we instead maximize the expectation of the AF over a probability distribution defined by continuous parameters. We prove that under suitable reparameterizations, the BO policy that maximizes the probabilistic objective is the same as that which maximizes the AF, and therefore, PR enjoys the same regret bounds as the original BO policy using the underlying AF. Moreover, our approach provably converges to a stationary point of the probabilistic objective under gradient ascent using scalable, unbiased estimators of both the probabilistic objective and its gradient. Therefore, as the number of starting points and gradient steps increases, our approach will recover a maximizer of the AF (an often-neglected requisite for commonly used BO regret bounds). We validate our approach empirically and demonstrate state-of-the-art optimization performance on a wide range of real-world applications. PR is complementary to (and benefits) recent work and naturally generalizes to settings with multiple objectives and black-box constraints.
1 Introduction
Many scientific and engineering problems involve tuning discrete and/or continuous parameters to optimize an objective function. Often, the objective function is "black-box", meaning it has no known closed-form expression. For example, optimizing the design of an electrospun oil sorbent (a material that can be used to absorb oil in the case of a marine oil spill to mitigate ecological harm) to maximize properties such as the oil absorption capacity and mechanical strength [59] can involve tuning both discrete ordinal experimental conditions and continuous parameters controlling the composition of the material. As another example, optimizing the structural design of a welded beam can involve tuning the type of metal (categorical), the welding type (binary), and the dimensions of the different components of the beam (discrete ordinals), resulting in a search space with over 370 million possible designs [54]. We consider the scenario where querying the objective function is expensive and sample-efficiency is crucial. In the case of designing the oil sorbent, evaluating the objective function requires manufacturing the material and measuring its properties in a laboratory, requiring significant time and resources.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.10199v1 [cs.LG] 18 Oct 2022
Bayesian optimization (BO) is a popular technique for sample-efficient black-box optimization, due to its proven performance guarantees in many settings [5, 52] and its strong empirical performance [23, 55]. BO leverages a probabilistic surrogate model of the unknown objective(s) and an acquisition function (AF) that provides utility values for evaluating a new design to balance exploration and exploitation. Typically, the maximizer of the AF is selected as the next design to evaluate. However, maximizing the AF over mixed search spaces (i.e., those consisting of discrete and continuous parameters) or large discrete search spaces is challenging¹, and continuous (or gradient-based) optimization routines cannot be directly applied. Theoretical performance guarantees of BO policies require that the maximizer of the AF is found and selected as the next design to evaluate on the black-box objective function [52]. When the maximizer is not found, regret properties are not guaranteed, and the performance of the BO policy may degrade.
To tackle these challenges, we propose a technique for improving AF optimization using a probabilistic reparameterization (PR) of the discrete parameters. Our main contributions are:

1. We propose a technique, probabilistic reparameterization (PR), for maximizing AFs over discrete and mixed spaces by instead optimizing a probabilistic objective (PO): the expectation of the AF over a probability distribution of discrete random variables corresponding to the discrete parameters.

2. We prove that there is an equivalence between the maximizers of the acquisition function and the maximizers of the PO, and hence the policy that chooses designs that are best with respect to the PO enjoys the same performance guarantees as the standard BO policy.

3. We derive scalable, unbiased Monte Carlo estimators of the PO and its gradient with respect to the parameters of the introduced probability distribution. We show that stochastic gradient ascent using our gradient estimator is guaranteed to converge to a stationary point on the PO surface and will recover a global maximum of the underlying AF as the number of starting points and gradient steps increases. This is important because many BO regret bounds require maximizing the AF [52]. Although the AF is often non-convex and maximization is hard, empirically, with a modest number of starting points, PR leads to better AF optimization than alternative methods.

4. We show that PR yields state-of-the-art optimization performance on a wide variety of real-world design problems with discrete and mixed search spaces. Importantly, PR is complementary to many existing approaches such as popular multi-objective, constrained, and trust region-based approaches; in particular, PR is agnostic to the underlying probabilistic model over discrete parameters, which is not the case for many alternative methods.
2 Preliminaries

Bayesian Optimization. We consider the problem of optimizing a black-box function f : X × Z → R over a compact search space X × Z, where X = X^(1) × ··· × X^(d) is the domain of the d ≥ 0 continuous parameters (x^(i) ∈ X^(i) for i = 1, ..., d) and Z = Z^(1) × ··· × Z^(d_z) is the domain of the d_z ≥ 1 discrete parameters (z^(i) ∈ Z^(i) for i = 1, ..., d_z).²
BO leverages (i) a probabilistic surrogate model, typically a Gaussian process (GP) [46], fit to a data set D_n = {(x_i, z_i, y_i)}_{i=1}^{n} of designs and corresponding (potentially noisy) observations y_i = f(x_i, z_i) + ε_i, where ε_i ∼ N(0, σ²), and (ii) an acquisition function α(x, z) that uses the surrogate model's posterior distribution to quantify the value of evaluating a new design. Common AFs include expected improvement (EI) [32] and upper confidence bound (UCB) [52]; the latter enjoys no-regret guarantees in certain settings [52]. The next design to evaluate is chosen by maximizing the AF α(x, z) over X × Z. Although the black-box objective f is expensive to evaluate, the AF is relatively cheap to query, and therefore it can be optimized numerically. Gradient-based optimization routines are often used to maximize the AF over continuous domains [25].

¹If the discrete search space has low enough cardinality that the AF can be evaluated at every discrete element, then acquisition optimization can be solved trivially.

²Throughout this paper, we use a mixed search space X × Z in our derivations, theorems, and proofs, without loss of generality with respect to the case of a purely discrete search space. If d = 0, then the objective function f : Z → R is defined over the discrete space Z and the continuous parameters in this exposition can simply be ignored.
Discrete Parameters. In its basic form, BO assumes that the inputs are continuous. However, discrete parameters such as binary, discrete ordinal, and non-ordered categorical parameters are ubiquitous in many applications. In the presence of such parameters, optimizing the AF is more difficult, as standard gradient-based approaches cannot be directly applied. Recent works have proposed various approaches, including multi-armed bandits [40, 48] and local search [41] for discrete domains and interleaved discrete/continuous optimization procedures for mixed domains [15, 57]. A simple and widely used approach across many popular BO packages [1, 53] is to one-hot encode the categorical parameters, apply a continuous relaxation when solving the optimization, and discretize (round) the resulting continuous candidates. Examples of continuous relaxations and discretization functions are listed in Table 1.
Table 1: Different parameter types, their continuous relaxations, and discretization functions.

TYPE         DOMAIN              CONT. RELAXATION        discretize(·) FUNCTION
BINARY       z ∈ {0, 1}          z′ ∈ [0, 1]             round(z′)
ORDINAL      z ∈ {0, ..., C−1}   z′ ∈ [−0.5, C − 0.5)    round(z′)
CATEGORICAL  z ∈ {0, ..., C−1}   z′ ∈ [0, 1]^C           argmax_c z′^(c)
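As an illustration, the discretization functions in Table 1 can be sketched in Python as follows (a minimal sketch; the function names are ours, not from any particular BO package):

```python
import numpy as np

def discretize_binary(z_relaxed):
    # round a relaxed value in [0, 1] to {0, 1}
    return int(np.round(z_relaxed))

def discretize_ordinal(z_relaxed):
    # round a relaxed value in [-0.5, C - 0.5) to {0, ..., C-1}
    return int(np.round(z_relaxed))

def discretize_categorical(z_relaxed):
    # map a relaxed one-hot vector in [0, 1]^C to the index of its largest entry
    return int(np.argmax(z_relaxed))
```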
Although using a continuous relaxation allows for efficient optimization using standard optimization routines in an alternate continuous domain Z′ ⊆ R^m, the AF value for an infeasible continuous value (i.e., z′ ∉ Z) does not account for the discretization that must occur before the black-box function is evaluated. Moreover, the acquisition value for an infeasible continuous value can be larger than the AF value after discretization. For an illustration of this, see Fig. 1 (middle/right). In the worst case, BO will repeatedly select the same infeasible continuous design due to its high AF value, but discretization will result in a design that has already been evaluated and has zero AF value. To mitigate this degenerate behavior and avoid the over-estimation issue, Garrido-Merchán and Hernández-Lobato [26] propose discretizing z′ before evaluating the AF, but the AF is then non-differentiable with respect to z′. While this improves performance on small search spaces, the response surface has large flat regions after discretizing z′, which makes it difficult to optimize the AF. The authors of [26] propose to approximate the gradients using finite differences, but, empirically, we find that this approach leads to sub-optimal AF optimization relative to PR.
3 Probabilistic Reparameterization

Algorithm 1 BO with PR
1: Input: black-box objective f : X × Z → R
2: Initialize D_0 ← ∅, GP_0 ∼ GP(0, k)
3: for n = 1 to N iterations do
4:   (x_n, θ_n) ← argmax_{(x,θ) ∈ X×Θ} E_{Z∼p(Z|θ)}[α(x, Z)]
5:   Sample z_n ∼ p(Z|θ_n)
6:   Evaluate f(x_n, z_n)
7:   D_n ← D_{n−1} ∪ {(x_n, z_n, f(x_n, z_n))}
8:   Update posterior GP_n given D_n
9: end for

We propose an alternative approach based on probabilistic reparameterization, a relaxation of the original optimization problem involving discrete parameters. Rather than directly optimizing the AF via a continuous relaxation z′ of the design z, we instead reparameterize the optimization problem by introducing a discrete probability distribution p(Z|θ) over a random variable Z with support exclusively over Z. This distribution is parameterized by a vector of continuous parameters θ. We use z to denote the vector (z^(1), ..., z^(d_z)), where each element is a different (possibly vector-valued) discrete parameter. Given this reparameterization, we define the probabilistic objective (PO):

    E_{Z∼p(Z|θ)}[α(x, Z)].    (1)

Algorithm 1 outlines BO with probabilistic reparameterization.
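To make the loop concrete, here is a minimal, self-contained sketch of Algorithm 1 for one continuous parameter and one binary parameter. This is not the paper's implementation: the acquisition function below is a fixed stand-in (a real implementation would use, e.g., expected improvement under a fitted GP), the inner maximization on line 4 is done by grid search rather than gradient ascent, and the GP posterior update is omitted. All names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def acquisition(x, z):
    # stand-in AF: a real implementation would query a GP posterior (e.g., EI)
    return -(x - 0.3) ** 2 + 0.5 * z

def prob_objective(x, theta):
    # Eq. (1) for a single binary parameter: closed-form expectation over
    # Z ~ Bernoulli(theta)
    return (1 - theta) * acquisition(x, 0) + theta * acquisition(x, 1)

data = []  # the growing data set D_n
for n in range(3):  # N BO iterations
    # line 4: maximize the PO over (x, theta); grid search stands in for
    # gradient-based optimization
    xs = np.linspace(0.0, 1.0, 101)
    thetas = np.linspace(0.0, 1.0, 101)
    X, T = np.meshgrid(xs, thetas)
    vals = prob_objective(X, T)
    i, j = np.unravel_index(np.argmax(vals), vals.shape)
    x_n, theta_n = X[i, j], T[i, j]
    # line 5: sample a feasible discrete design from p(Z | theta_n)
    z_n = int(rng.random() < theta_n)
    # lines 6-7: evaluate the black box (here, the stand-in) and record the result
    data.append((x_n, z_n, acquisition(x_n, z_n)))
```

With this toy AF the PO is maximized at θ = 1 and x ≈ 0.3, so the sampled design is always the feasible maximizer (x ≈ 0.3, z = 1).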
PR allows us to optimize θ and x over a continuous space to maximize the PO instead of optimizing x and z to maximize α directly over the mixed search space X × Z. As we will show later, maximizing
Figure 1: (Left) A comparison of AF optimization using different methods over a mixed search space shows that PR outperforms alternative methods for AF optimization and has much lower variance across replications. The violin plots show the distribution of final AF values and the mean. "Cont. Relax." denotes optimizing a continuous relaxation of the categoricals with exact gradients. "Exact Round" refers to optimizing a continuous relaxation with approximate gradients (via finite differences), but discretizes the relaxation before evaluating the surrogate [26]. "Interleaved" alternates between one step of local search on the discrete parameters and one step of gradient ascent on the continuous parameters (used in CASMOPOLITAN [57]). For each method, the best candidate across 20 restarts is selected (after discretization) and the acquisition value of the resulting feasible candidate is recorded. The AF is expected improvement [32]. (Middle/Right) AF values with a continuous relaxation (middle) and the PO (right) for the Branin function over a mixed domain with one continuous parameter (x0) and one binary parameter (z0) (see Appendix C for details on Branin). (Middle) Under a continuous relaxation, the maximizer of the AF is an infeasible point in the domain (grey circle), which results in a suboptimal AF value when rounded (black star); the resulting candidate only has 86% of the AF value of the true maximizer. The maximum AF value across the feasible search space is shown in white, and the red regions indicate that the continuous relaxation overestimates the AF value since it is greater than the maximum AF value of any feasible design. (Right) The PO is maximized at the AF's unique maximizer within the valid search domain. These contours show that PR avoids the overestimation issue that the naive continuous relaxation suffers from.
the PO allows us to recover a maximizer of α over the space X × Z. Choosing p(Z|θ) to be a discrete distribution over Z means the realizations of Z are feasible values in Z. Hence, the AF is only evaluated for feasible discrete designs. Since p(Z|θ) is a discrete probability distribution, we can express E_{Z∼p(Z|θ)}[α(x, Z)] as a linear combination where each discrete design is weighted by its probability mass:

    E_{Z∼p(Z|θ)}[α(x, Z)] = Σ_{z∈Z} p(z|θ) α(x, z).    (2)
Example distributions for binary, ordinal, and categorical parameters are provided in Table 2.
Table 2: Examples of probabilistic reparameterizations for different parameter types. We denote the (C−1)-simplex as Δ^{C−1}.

PARAMETER TYPE  RANDOM VARIABLE                               CONTINUOUS PARAMETER
BINARY          Z ∼ Bernoulli(θ)                              θ ∈ [0, 1]
ORDINAL         Z = ⌊θ⌋ + B,  B ∼ Bernoulli(θ − ⌊θ⌋)          θ ∈ [0, C − 1]
CATEGORICAL     Z ∼ Categorical(θ),  θ = (θ^(1), ..., θ^(C))  θ ∈ Δ^{C−1}
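Sampling from the reparameterizations in Table 2 is straightforward; a minimal sketch (the function names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_binary(theta, rng):
    # Z ~ Bernoulli(theta), theta in [0, 1]
    return int(rng.random() < theta)

def sample_ordinal(theta, rng):
    # Z = floor(theta) + B with B ~ Bernoulli(theta - floor(theta)),
    # theta in [0, C-1]; encodes the ordering with a single scalar parameter
    base = int(np.floor(theta))
    return base + int(rng.random() < theta - base)

def sample_categorical(theta, rng):
    # Z ~ Categorical(theta) with theta on the (C-1)-simplex
    return int(rng.choice(len(theta), p=theta))
```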
Although ordinal parameters could use the same categorical distributions as the non-ordered categorical parameters, we opt for the provided proposal distribution since it uses a scalar θ (rather than a C-element vector) and it naturally encodes the ordering of the values. Using an independent random variable Z^(i) ∼ p(Z^(i)|θ^(i)) for each parameter z^(i), for i = 1, ..., d_z, means that the probabilistic objective can be expressed as

    E_{Z∼p(Z|θ)}[α(x, Z)] = Σ_{z^(1)∈Z^(1)} ··· Σ_{z^(d_z)∈Z^(d_z)} α(x, z^(1), ..., z^(d_z)) Π_{i=1}^{d_z} p(z^(i)|θ^(i)).    (3)
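For small discrete spaces, Equation (3) can be evaluated exactly by enumeration. A minimal sketch, where each discrete parameter's distribution is given as a list of probability masses over its support (the names and list-based representation are our assumptions):

```python
import itertools

def po_enumerate(alpha, x, pmfs):
    # Eq. (3): enumerate all discrete designs and weight each alpha value
    # by the product of per-parameter probability masses p(z_i | theta_i)
    supports = [range(len(pmf)) for pmf in pmfs]
    total = 0.0
    for z in itertools.product(*supports):
        weight = 1.0
        for zi, pmf in zip(z, pmfs):
            weight *= pmf[zi]
        total += weight * alpha(x, z)
    return total
```

The cost grows as |Z| = Π_i |Z^(i)|, which motivates the Monte Carlo estimators of Section 4.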
3.1 Analytic Gradients

One important benefit of PR is that the PO in (1) is differentiable with respect to θ (and x, if the gradient of α with respect to x exists), whereas α(x, z) is not differentiable with respect to z. The gradients of the PO with respect to θ and x can be obtained by differentiating Equation (2):

    ∇_θ E_{Z∼p(Z|θ)}[α(x, Z)] = Σ_{z∈Z} α(x, z) ∇_θ p(z|θ)    (4)

    ∇_x E_{Z∼p(Z|θ)}[α(x, Z)] = Σ_{z∈Z} p(z|θ) ∇_x α(x, z)    (5)

This enables optimizing the PO (line 4 of Algorithm 1) efficiently and effectively using gradient-based methods.
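For a single binary parameter reparameterized as Z ∼ Bernoulli(θ), Equation (4) reduces to α(x, 1) − α(x, 0), and projected gradient ascent on θ over [0, 1] can be sketched as follows (a toy illustration; the names are ours):

```python
import numpy as np

def po_grad_theta(alpha, x, theta):
    # Eq. (4) for one binary parameter: the PO is
    # (1 - theta) * alpha(x, 0) + theta * alpha(x, 1),
    # so its derivative w.r.t. theta is constant in theta
    return alpha(x, 1) - alpha(x, 0)

def ascend_theta(alpha, x, theta0, lr=0.1, steps=50):
    # projected gradient ascent on the PO over theta in [0, 1]
    theta = theta0
    for _ in range(steps):
        theta = np.clip(theta + lr * po_grad_theta(alpha, x, theta), 0.0, 1.0)
    return theta
```

Because the PO is linear in θ for a Bernoulli reparameterization, gradient ascent drives θ to the vertex corresponding to the better discrete value.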
3.2 Theoretical Properties

In this section, we derive theoretical properties of PR. Proofs are provided in Appendix B. Our first result is that there is an equivalence between the maximizers of the PO and the maximizers of the AF over X × Z.

Theorem 1 (Consistent Maximizers). Suppose that α is continuous in x for every z ∈ Z. Let H be the set of maximizers of α(x, z): H = {(x, z) ∈ argmax_{(x,z)∈X×Z} α(x, z)}. Let J ⊆ X × Θ be the set of maximizers of E_{Z∼p(Z|θ)}[α(x, Z)]: J = {(x, θ) ∈ argmax_{(x,θ)∈X×Θ} E_{Z∼p(Z|θ)}[α(x, Z)]}, where Θ is the domain of θ. Let Ĥ ⊆ X × Z be defined as Ĥ = {(x, z̃) : (x, θ) ∈ J, z̃ ∼ p(Z|θ)}. Then Ĥ = H.
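Theorem 1 can be checked numerically on a toy example with one binary parameter and no continuous parameters: the PO is linear in θ, so its maximizer sits at the vertex of [0, 1] that puts all probability mass on the AF's maximizer (the AF values below are illustrative numbers of our choosing):

```python
import numpy as np

alpha_vals = {0: 0.2, 1: 0.9}  # toy AF over a single binary parameter

def po(theta):
    # PO under Z ~ Bernoulli(theta); linear in theta, maximized at a vertex
    return (1 - theta) * alpha_vals[0] + theta * alpha_vals[1]

thetas = np.linspace(0.0, 1.0, 1001)
theta_star = thetas[np.argmax(po(thetas))]    # maximizer of the PO
z_star = max(alpha_vals, key=alpha_vals.get)  # maximizer of the AF
prob_of_z_star = theta_star                   # p(Z = 1 | theta_star)
```

Here p(Z | θ*) is a point mass on z* = argmax α, so any sample from it recovers the AF maximizer, as the theorem states.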
Importantly, Theorem 1 states that sampling from the distribution parameterized by a maximizer of the PO yields a maximizer of α, and therefore Algorithm 1 enjoys the performance guarantees of α(·).
Corollary 1 (Regret Bounds). Let α(x, z) be an acquisition function over a search space X × Z such that, when α is applied as part of a BO strategy, that strategy has bounded regret. If the conditions for the regret bounds of that BO strategy using α are satisfied, then Algorithm 1 using α enjoys the same regret bound.
Examples of BO policies with bounded regret include those based on AFs such as upper confidence bound (UCB) [52] or Thompson sampling (TS) [49] for single-objective optimization, and UCB or TS with Chebyshev [43] or hypervolume [66] scalarizations in the multi-objective setting.
Although the BO policy that selects a maximizer of α is equivalent to the BO policy in Algorithm 1, maximizing the AF over mixed or high-dimensional discrete search spaces is challenging because commonly used gradient-based methods cannot be directly applied. The key advantage of our approach is that maximizers of the AF can be identified efficiently and effectively by optimizing the PO using gradient information instead of directly optimizing the AF. We find that optimizing PR yields better results than directly optimizing α or other common relaxations, as shown in Figure 1 (left), where we compare AF optimization methods on the mixed Rosenbrock test problem (see Appendix C for details).
4 Practical Monte Carlo Estimators

4.1 Unbiased Estimators of the Probabilistic Reparameterization and its Gradient

As the number of discrete configurations |Z| increases, the PO and its gradient may become computationally expensive to evaluate analytically because both require a summation over |Z| terms. Therefore, we propose to estimate the PO and its gradient using Monte Carlo (MC) sampling. The MC estimator of the PO is given by
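For a single Bernoulli-reparameterized parameter, such estimators can be sketched as follows, assuming the standard sample-mean form for the PO and a score-function (likelihood-ratio) form for its gradient; these forms are our assumption for illustration, and the function names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_po(alpha, x, theta, n=10_000, rng=rng):
    # unbiased MC estimate of E_{Z ~ Bernoulli(theta)}[alpha(x, Z)]:
    # the sample mean of alpha over i.i.d. draws z_i ~ p(Z | theta)
    z = (rng.random(n) < theta).astype(float)
    return np.mean(alpha(x, z))

def mc_grad_theta(alpha, x, theta, n=10_000, rng=rng):
    # score-function (likelihood-ratio) estimate of the gradient w.r.t. theta:
    # E[alpha(x, Z) * d/dtheta log p(Z | theta)], where for a Bernoulli
    # d/dtheta log p(z | theta) = z / theta - (1 - z) / (1 - theta)
    z = (rng.random(n) < theta).astype(float)
    score = z / theta - (1 - z) / (1 - theta)
    return np.mean(alpha(x, z) * score)
```

Both estimators touch only n sampled configurations rather than all |Z| terms, which is what makes the approach scalable to high-cardinality discrete spaces.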