Bayesian Optimization over Discrete and Mixed
Spaces via Probabilistic Reparameterization
Samuel Daulton
University of Oxford, Meta
sdaulton@meta.com
Xingchen Wan
University of Oxford
xwan@robots.ox.ac.uk
David Eriksson
Meta
deriksson@meta.com
Maximilian Balandat
Meta
balandat@meta.com
Michael A. Osborne
University of Oxford
mosb@robots.ox.ac.uk
Eytan Bakshy
Meta
ebakshy@meta.com
Abstract
Optimizing expensive-to-evaluate black-box functions of discrete (and potentially continuous) design parameters is a ubiquitous problem in scientific and engineering applications. Bayesian optimization (BO) is a popular, sample-efficient method that leverages a probabilistic surrogate model and an acquisition function (AF) to select promising designs to evaluate. However, maximizing the AF over mixed or high-cardinality discrete search spaces is challenging: standard gradient-based methods cannot be used directly, and evaluating the AF at every point in the search space would be computationally prohibitive. To address this issue, we propose using probabilistic reparameterization (PR). Instead of directly optimizing the AF over the search space containing discrete parameters, we instead maximize the expectation of the AF over a probability distribution defined by continuous parameters. We prove that under suitable reparameterizations, the BO policy that maximizes the probabilistic objective is the same as that which maximizes the AF, and therefore, PR enjoys the same regret bounds as the original BO policy using the underlying AF. Moreover, our approach provably converges to a stationary point of the probabilistic objective under gradient ascent using scalable, unbiased estimators of both the probabilistic objective and its gradient. Therefore, as the number of starting points and gradient steps increases, our approach will recover a maximizer of the AF (an often-neglected requisite for commonly used BO regret bounds). We validate our approach empirically and demonstrate state-of-the-art optimization performance on a wide range of real-world applications. PR is complementary to (and benefits) recent work and naturally generalizes to settings with multiple objectives and black-box constraints.
1 Introduction
Many scientific and engineering problems involve tuning discrete and/or continuous parameters to optimize an objective function. Often, the objective function is "black-box", meaning it has no known closed-form expression. For example, optimizing the design of an electrospun oil sorbent (a material that can be used to absorb oil in the case of a marine oil spill to mitigate ecological harm) to maximize properties such as the oil absorption capacity and mechanical strength [59] can involve tuning both discrete ordinal experimental conditions and continuous parameters controlling the composition of the material. As another example, optimizing the structural design of a welded beam can involve tuning the type of metal (categorical), the welding type (binary), and the dimensions of the different components of the beam (discrete ordinals), resulting in a search space with over 370 million possible designs [54]. We consider the scenario where querying the objective function is expensive and sample-efficiency is crucial. In the case of designing the oil sorbent, evaluating the objective function requires manufacturing the material and measuring its properties in a laboratory, requiring significant time and resources.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.10199v1 [cs.LG] 18 Oct 2022
Bayesian optimization (BO) is a popular technique for sample-efficient black-box optimization, due to its proven performance guarantees in many settings [5, 52] and its strong empirical performance [23, 55]. BO leverages a probabilistic surrogate model of the unknown objective(s) and an acquisition function (AF) that provides utility values for evaluating a new design to balance exploration and exploitation. Typically, the maximizer of the AF is selected as the next design to evaluate. However, maximizing the AF over mixed search spaces (i.e., those consisting of discrete and continuous parameters) or large discrete search spaces is challenging¹, and continuous (or gradient-based) optimization routines cannot be directly applied. Theoretical performance guarantees of BO policies require that the maximizer of the AF is found and selected as the next design to evaluate on the black-box objective function [52]. When the maximizer is not found, regret properties are not guaranteed, and the performance of the BO policy may degrade.
To tackle these challenges, we propose a technique for improving AF optimization using a probabilistic reparameterization (PR) of the discrete parameters. Our main contributions are:

1. We propose a technique, probabilistic reparameterization (PR), for maximizing AFs over discrete and mixed spaces by instead optimizing a probabilistic objective (PO): the expectation of the AF over a probability distribution of discrete random variables corresponding to the discrete parameters.

2. We prove that there is an equivalence between the maximizers of the acquisition function and the maximizers of the PO, and hence the policy that chooses designs that are best with respect to the PO enjoys the same performance guarantees as the standard BO policy.

3. We derive scalable, unbiased Monte Carlo estimators of the PO and its gradient with respect to the parameters of the introduced probability distribution. We show that stochastic gradient ascent using our gradient estimator is guaranteed to converge to a stationary point on the PO surface and will recover a global maximum of the underlying AF as the number of starting points and gradient steps increases. This is important because many BO regret bounds require maximizing the AF [52]. Although the AF is often non-convex and maximization is hard, empirically, with a modest number of starting points, PR leads to better AF optimization than alternative methods.

4. We show that PR yields state-of-the-art optimization performance on a wide variety of real-world design problems with discrete and mixed search spaces. Importantly, PR is complementary to many existing approaches such as popular multi-objective, constrained, and trust region-based approaches; in particular, PR is agnostic to the underlying probabilistic model over discrete parameters, which is not the case for many alternative methods.
2 Preliminaries

Bayesian Optimization. We consider the problem of optimizing a black-box function f : X × Z → R over a compact search space X × Z, where X = X^(1) × ··· × X^(d) is the domain of the d ≥ 0 continuous parameters (x^(i) ∈ X^(i) for i = 1, ..., d) and Z = Z^(1) × ··· × Z^(d_z) is the domain of the d_z ≥ 1 discrete parameters (z^(i) ∈ Z^(i) for i = 1, ..., d_z).²
BO leverages (i) a probabilistic surrogate model, typically a Gaussian process (GP) [46], fit to a data set D_n = {(x_i, z_i, y_i)}_{i=1}^{n} of designs and corresponding (potentially noisy) observations y_i = f(x_i, z_i) + ε_i, where ε_i ∼ N(0, σ²), and (ii) an acquisition function α(x, z) that uses the surrogate model's posterior distribution to quantify the value of evaluating a new design. Common AFs include expected improvement (EI) [32] and upper confidence bound (UCB) [52]; the latter enjoys no-regret guarantees in certain settings [52]. The next design to evaluate is chosen by maximizing the AF α(x, z) over X × Z. Although the black-box objective f is expensive to evaluate, the AF is relatively cheap to query, and therefore it can be optimized numerically. Gradient-based optimization routines are often used to maximize the AF over continuous domains [25].

¹If the discrete search space has low enough cardinality that the AF can be evaluated at every discrete element, then acquisition optimization can be solved trivially.

²Throughout this paper, we use a mixed search space X × Z in our derivations, theorems, and proofs, without loss of generality with respect to the case of a purely discrete search space. If d = 0, then the objective function f : Z → R is defined over the discrete space Z and the continuous parameters in this exposition can simply be ignored.
Discrete Parameters. In its basic form, BO assumes that the inputs are continuous. However, discrete parameters such as binary, discrete ordinal, and non-ordered categorical parameters are ubiquitous in many applications. In the presence of such parameters, optimizing the AF is more difficult, as standard gradient-based approaches cannot be directly applied. Recent works have proposed various approaches, including multi-armed bandits [40, 48] and local search [41] for discrete domains and interleaved discrete/continuous optimization procedures for mixed domains [15, 57]. A simple and widely used approach across many popular BO packages [1, 53] is to one-hot encode the categorical parameters, apply a continuous relaxation when solving the optimization, and discretize (round) the resulting continuous candidates. Examples of continuous relaxations and discretization functions are listed in Table 1.
Table 1: Different parameter types, their continuous relaxations, and discretization functions.

TYPE         DOMAIN              CONT. RELAXATION        discretize(·) FUNCTION
BINARY       z ∈ {0, 1}          z′ ∈ [0, 1]             round(z′)
ORDINAL      z ∈ {0, ..., C−1}   z′ ∈ [−0.5, C − 0.5)    round(z′)
CATEGORICAL  z ∈ {0, ..., C−1}   z′ ∈ [0, 1]^C           argmax_c z′^(c)
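As an illustration, the discretization functions in Table 1 can be sketched in Python as follows (a minimal sketch; the function names are ours, not from any particular BO package):

```python
import numpy as np

def discretize_binary(z_relaxed):
    # round a relaxed value in [0, 1] to {0, 1}
    return int(np.round(z_relaxed))

def discretize_ordinal(z_relaxed):
    # round a relaxed value in [-0.5, C - 0.5) to {0, ..., C-1}
    return int(np.round(z_relaxed))

def discretize_categorical(z_relaxed):
    # map a relaxed one-hot vector in [0, 1]^C to the index of its largest entry
    return int(np.argmax(z_relaxed))
```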
Although using a continuous relaxation allows for efficient optimization using standard optimization routines in an alternate continuous domain Z′ ⊆ R^m, the AF value for an infeasible continuous value (i.e., z′ ∉ Z) does not account for the discretization that must occur before the black-box function is evaluated. Moreover, the acquisition value for an infeasible continuous value can be larger than the AF value after discretization. For an illustration of this, see Fig. 1 (middle/right). In the worst case, BO will repeatedly select the same infeasible continuous design due to its high AF value, but discretization will result in a design that has already been evaluated and has zero AF value. To mitigate this degenerate behavior and avoid the over-estimation issue, Garrido-Merchán and Hernández-Lobato [26] propose discretizing z′ before evaluating the AF, but the AF is then non-differentiable with respect to z′. While this improves performance on small search spaces, the response surface has large flat regions after discretizing z′, which makes it difficult to optimize the AF. The authors of [26] propose to approximate the gradients using finite differences, but, empirically, we find that this approach leads to sub-optimal AF optimization relative to PR.
3 Probabilistic Reparameterization

Algorithm 1 BO with PR
1: Input: black-box objective f : X × Z → R
2: Initialize D_0 ← ∅, GP_0 ∼ GP(0, k)
3: for n = 1 to N iterations do
4:   (x_n, θ_n) ← argmax_{(x,θ) ∈ X×Θ} E_{Z∼p(Z|θ)}[α(x, Z)]
5:   Sample z_n ∼ p(Z|θ_n)
6:   Evaluate f(x_n, z_n)
7:   D_n ← D_{n−1} ∪ {(x_n, z_n, f(x_n, z_n))}
8:   Update posterior GP_n given D_n
9: end for

We propose an alternative approach based on probabilistic reparameterization, a relaxation of the original optimization problem involving discrete parameters. Rather than directly optimizing the AF via a continuous relaxation z′ of the design z, we instead reparameterize the optimization problem by introducing a discrete probability distribution p(Z|θ) over a random variable Z with support exclusively over Z. This distribution is parameterized by a vector of continuous parameters θ. We use z to denote the vector (z^(1), ..., z^(d_z)), where each element is a different (possibly vector-valued) discrete parameter. Given this reparameterization, we define the probabilistic objective (PO):

    E_{Z∼p(Z|θ)}[α(x, Z)].    (1)

Algorithm 1 outlines BO with probabilistic reparameterization.
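To make the loop concrete, here is a minimal, self-contained sketch of Algorithm 1 for one continuous parameter and one binary parameter. This is not the paper's implementation: the acquisition function below is a fixed stand-in (a real implementation would use, e.g., expected improvement under a fitted GP), the inner maximization on line 4 is done by grid search rather than gradient ascent, and the GP posterior update is omitted. All names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def acquisition(x, z):
    # stand-in AF: a real implementation would query a GP posterior (e.g., EI)
    return -(x - 0.3) ** 2 + 0.5 * z

def prob_objective(x, theta):
    # Eq. (1) for a single binary parameter: closed-form expectation over
    # Z ~ Bernoulli(theta)
    return (1 - theta) * acquisition(x, 0) + theta * acquisition(x, 1)

data = []  # the growing data set D_n
for n in range(3):  # N BO iterations
    # line 4: maximize the PO over (x, theta); grid search stands in for
    # gradient-based optimization
    xs = np.linspace(0.0, 1.0, 101)
    thetas = np.linspace(0.0, 1.0, 101)
    X, T = np.meshgrid(xs, thetas)
    vals = prob_objective(X, T)
    i, j = np.unravel_index(np.argmax(vals), vals.shape)
    x_n, theta_n = X[i, j], T[i, j]
    # line 5: sample a feasible discrete design from p(Z | theta_n)
    z_n = int(rng.random() < theta_n)
    # lines 6-7: evaluate the black box (here, the stand-in) and record the result
    data.append((x_n, z_n, acquisition(x_n, z_n)))
```

With this toy AF the PO is maximized at θ = 1 and x ≈ 0.3, so the sampled design is always the feasible maximizer (x ≈ 0.3, z = 1).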
PR allows us to optimize θ and x over a continuous space to maximize the PO instead of optimizing x and z to maximize α directly over the mixed search space X × Z. As we will show later, maximizing
Figure 1: (Left) A comparison of AF optimization using different methods over a mixed search space shows that PR outperforms alternative methods for AF optimization and has much lower variance across replications. The violin plots show the distribution of final AF values and the mean. "Cont. Relax." denotes optimizing a continuous relaxation of the categoricals with exact gradients. "Exact Round" refers to optimizing a continuous relaxation with approximate gradients (via finite differences), but discretizes the relaxation before evaluating the surrogate [26]. "Interleaved" alternates between one step of local search on the discrete parameters and one step of gradient ascent on the continuous parameters (used in CASMOPOLITAN [57]). For each method, the best candidate across 20 restarts is selected (after discretization) and the acquisition value of the resulting feasible candidate is recorded. The AF is expected improvement [32]. (Middle/Right) AF values with a continuous relaxation (middle) and the PO (right) for the Branin function over a mixed domain with one continuous parameter (x0) and one binary parameter (z0) (see Appendix C for details on Branin). (Middle) Under a continuous relaxation, the maximizer of the AF is an infeasible point in the domain (grey circle), which results in a suboptimal AF value when rounded (black star); the resulting candidate only has 86% of the AF value of the true maximizer. The maximum AF value across the feasible search space is shown in white, and the red regions indicate that the continuous relaxation overestimates the AF value since it is greater than the maximum AF value of any feasible design. (Right) The PO is maximized at the AF's unique maximizer within the valid search domain. These contours show that PR avoids the overestimation issue that the naive continuous relaxation suffers from.
the PO allows us to recover a maximizer of α over the space X × Z. Choosing p(Z|θ) to be a discrete distribution over Z means the realizations of Z are feasible values in Z. Hence, the AF is only evaluated for feasible discrete designs. Since p(Z|θ) is a discrete probability distribution, we can express E_{Z∼p(Z|θ)}[α(x, Z)] as a linear combination where each discrete design is weighted by its probability mass:

    E_{Z∼p(Z|θ)}[α(x, Z)] = Σ_{z∈Z} p(z|θ) α(x, z).    (2)
Example distributions for binary, ordinal, and categorical parameters are provided in Table 2.
Table 2: Examples of probabilistic reparameterizations for different parameter types. We denote the (C−1)-simplex as Δ^{C−1}.

PARAMETER TYPE  RANDOM VARIABLE                               CONTINUOUS PARAMETER
BINARY          Z ∼ Bernoulli(θ)                              θ ∈ [0, 1]
ORDINAL         Z = ⌊θ⌋ + B,  B ∼ Bernoulli(θ − ⌊θ⌋)          θ ∈ [0, C − 1]
CATEGORICAL     Z ∼ Categorical(θ),  θ = (θ^(1), ..., θ^(C))  θ ∈ Δ^{C−1}
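Sampling from the reparameterizations in Table 2 is straightforward; a minimal sketch (the function names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_binary(theta, rng):
    # Z ~ Bernoulli(theta), theta in [0, 1]
    return int(rng.random() < theta)

def sample_ordinal(theta, rng):
    # Z = floor(theta) + B with B ~ Bernoulli(theta - floor(theta)),
    # theta in [0, C-1]; encodes the ordering with a single scalar parameter
    base = int(np.floor(theta))
    return base + int(rng.random() < theta - base)

def sample_categorical(theta, rng):
    # Z ~ Categorical(theta) with theta on the (C-1)-simplex
    return int(rng.choice(len(theta), p=theta))
```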
Although ordinal parameters could use the same categorical distributions as the non-ordered categorical parameters, we opt for the provided proposal distribution since it uses a scalar θ (rather than a C-element vector) and it naturally encodes the ordering of the values. Using an independent random variable Z^(i) ∼ p(Z^(i)|θ^(i)) for each parameter z^(i), for i = 1, ..., d_z, means that the probabilistic objective can be expressed as

    E_{Z∼p(Z|θ)}[α(x, Z)] = Σ_{z^(1)∈Z^(1)} ··· Σ_{z^(d_z)∈Z^(d_z)} α(x, z^(1), ..., z^(d_z)) Π_{i=1}^{d_z} p(z^(i)|θ^(i)).    (3)
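For small discrete spaces, Equation (3) can be evaluated exactly by enumeration. A minimal sketch, where each discrete parameter's distribution is given as a list of probability masses over its support (the names and list-based representation are our assumptions):

```python
import itertools

def po_enumerate(alpha, x, pmfs):
    # Eq. (3): enumerate all discrete designs and weight each alpha value
    # by the product of per-parameter probability masses p(z_i | theta_i)
    supports = [range(len(pmf)) for pmf in pmfs]
    total = 0.0
    for z in itertools.product(*supports):
        weight = 1.0
        for zi, pmf in zip(z, pmfs):
            weight *= pmf[zi]
        total += weight * alpha(x, z)
    return total
```

The cost grows as |Z| = Π_i |Z^(i)|, which motivates the Monte Carlo estimators of Section 4.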
3.1 Analytic Gradients

One important benefit of PR is that the PO in (1) is differentiable with respect to θ (and x, if the gradient of α with respect to x exists), whereas α(x, z) is not differentiable with respect to z. The gradients of the PO with respect to θ and x can be obtained by differentiating Equation (2):

    ∇_θ E_{Z∼p(Z|θ)}[α(x, Z)] = Σ_{z∈Z} α(x, z) ∇_θ p(z|θ)    (4)

    ∇_x E_{Z∼p(Z|θ)}[α(x, Z)] = Σ_{z∈Z} p(z|θ) ∇_x α(x, z)    (5)

This enables optimizing the PO (line 4 of Algorithm 1) efficiently and effectively using gradient-based methods.
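For a single binary parameter reparameterized as Z ∼ Bernoulli(θ), Equation (4) reduces to α(x, 1) − α(x, 0), and projected gradient ascent on θ over [0, 1] can be sketched as follows (a toy illustration; the names are ours):

```python
import numpy as np

def po_grad_theta(alpha, x, theta):
    # Eq. (4) for one binary parameter: the PO is
    # (1 - theta) * alpha(x, 0) + theta * alpha(x, 1),
    # so its derivative w.r.t. theta is constant in theta
    return alpha(x, 1) - alpha(x, 0)

def ascend_theta(alpha, x, theta0, lr=0.1, steps=50):
    # projected gradient ascent on the PO over theta in [0, 1]
    theta = theta0
    for _ in range(steps):
        theta = np.clip(theta + lr * po_grad_theta(alpha, x, theta), 0.0, 1.0)
    return theta
```

Because the PO is linear in θ for a Bernoulli reparameterization, gradient ascent drives θ to the vertex corresponding to the better discrete value.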
3.2 Theoretical Properties

In this section, we derive theoretical properties of PR. Proofs are provided in Appendix B. Our first result is that there is an equivalence between the maximizers of the PO and the maximizers of the AF over X × Z.

Theorem 1 (Consistent Maximizers). Suppose that α is continuous in x for every z ∈ Z. Let H be the set of maximizers of α(x, z): H = {(x, z) ∈ argmax_{(x,z)∈X×Z} α(x, z)}. Let J ⊆ X × Θ be the set of maximizers of E_{Z∼p(Z|θ)}[α(x, Z)]: J = {(x, θ) ∈ argmax_{(x,θ)∈X×Θ} E_{Z∼p(Z|θ)}[α(x, Z)]}, where Θ is the domain of θ. Let Ĥ ⊆ X × Z be defined as Ĥ = {(x, z̃) : (x, θ) ∈ J, z̃ ∼ p(Z|θ)}. Then Ĥ = H.
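Theorem 1 can be checked numerically on a toy example with one binary parameter and no continuous parameters: the PO is linear in θ, so its maximizer sits at the vertex of [0, 1] that puts all probability mass on the AF's maximizer (the AF values below are illustrative numbers of our choosing):

```python
import numpy as np

alpha_vals = {0: 0.2, 1: 0.9}  # toy AF over a single binary parameter

def po(theta):
    # PO under Z ~ Bernoulli(theta); linear in theta, maximized at a vertex
    return (1 - theta) * alpha_vals[0] + theta * alpha_vals[1]

thetas = np.linspace(0.0, 1.0, 1001)
theta_star = thetas[np.argmax(po(thetas))]    # maximizer of the PO
z_star = max(alpha_vals, key=alpha_vals.get)  # maximizer of the AF
prob_of_z_star = theta_star                   # p(Z = 1 | theta_star)
```

Here p(Z | θ*) is a point mass on z* = argmax α, so any sample from it recovers the AF maximizer, as the theorem states.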
Importantly, Theorem 1 states that sampling from the distribution parameterized by a maximizer of the PO yields a maximizer of α, and therefore Algorithm 1 enjoys the performance guarantees of α(·).
Corollary 1 (Regret Bounds). Let α(x, z) be an acquisition function over a search space X × Z such that, when α is applied as part of a BO strategy, that strategy has bounded regret. If the conditions for the regret bounds of that BO strategy using α are satisfied, then Algorithm 1 using α enjoys the same regret bound.
Examples of BO policies with bounded regret include those based on AFs such as upper confidence bound (UCB) [52] or Thompson sampling (TS) [49] for single-objective optimization, and UCB or TS with Chebyshev [43] or hypervolume [66] scalarizations in the multi-objective setting.
Although the BO policy that selects a maximizer of α is equivalent to the BO policy in Algorithm 1, maximizing the AF over mixed or high-dimensional discrete search spaces is challenging because commonly used gradient-based methods cannot be directly applied. The key advantage of our approach is that maximizers of the AF can be identified efficiently and effectively by optimizing the PO using gradient information instead of directly optimizing the AF. We find that optimizing PR yields better results than directly optimizing α or other common relaxations, as shown in Figure 1 (left), where we compare AF optimization methods on the mixed Rosenbrock test problem (see Appendix C for details).
4 Practical Monte Carlo Estimators

4.1 Unbiased Estimators of the Probabilistic Reparameterization and its Gradient

As the number of discrete configurations |Z| increases, the PO and its gradient may become computationally expensive to evaluate analytically because both require a summation over |Z| terms. Therefore, we propose to estimate the PO and its gradient using Monte Carlo (MC) sampling. The MC estimator of the PO is given by
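For a single Bernoulli-reparameterized parameter, such estimators can be sketched as follows, assuming the standard sample-mean form for the PO and a score-function (likelihood-ratio) form for its gradient; these forms are our assumption for illustration, and the function names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_po(alpha, x, theta, n=10_000, rng=rng):
    # unbiased MC estimate of E_{Z ~ Bernoulli(theta)}[alpha(x, Z)]:
    # the sample mean of alpha over i.i.d. draws z_i ~ p(Z | theta)
    z = (rng.random(n) < theta).astype(float)
    return np.mean(alpha(x, z))

def mc_grad_theta(alpha, x, theta, n=10_000, rng=rng):
    # score-function (likelihood-ratio) estimate of the gradient w.r.t. theta:
    # E[alpha(x, Z) * d/dtheta log p(Z | theta)], where for a Bernoulli
    # d/dtheta log p(z | theta) = z / theta - (1 - z) / (1 - theta)
    z = (rng.random(n) < theta).astype(float)
    score = z / theta - (1 - z) / (1 - theta)
    return np.mean(alpha(x, z) * score)
```

Both estimators touch only n sampled configurations rather than all |Z| terms, which is what makes the approach scalable to high-cardinality discrete spaces.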