Local Bayesian optimization via maximizing probability of descent

Quan Nguyen¹, Kaiwen Wu², Jacob R. Gardner², Roman Garnett¹
¹Washington University in St. Louis   ²University of Pennsylvania
{quan,garnett}@wustl.edu
{kaiwenwu,jacobrg}@seas.upenn.edu
Equal contribution.
Abstract
Local optimization presents a promising approach to expensive, high-dimensional
black-box optimization by sidestepping the need to globally explore the search
space. For objective functions whose gradient cannot be evaluated directly,
Bayesian optimization offers one solution – we construct a probabilistic model of
the objective, design a policy to learn about the gradient at the current location, and
use the resulting information to navigate the objective landscape. Previous work
has realized this scheme by minimizing the variance in the estimate of the gradient,
then moving in the direction of the expected gradient. In this paper, we re-examine
and refine this approach. We demonstrate that, surprisingly, the expected value
of the gradient is not always the direction maximizing the probability of descent,
and in fact, these directions may be nearly orthogonal. This observation then
inspires an elegant optimization scheme seeking to maximize the probability of
descent while moving in the direction of most-probable descent. Experiments
on both synthetic and real-world objectives show that our method outperforms
previous realizations of this optimization scheme and is competitive against other,
significantly more complicated baselines.
1 Introduction
The optimization of expensive-to-evaluate, high-dimensional black-box functions is ubiquitous in
machine learning, science, engineering, and beyond; examples range from hyperparameter tuning
[21] and policy search in reinforcement learning [3, 7], to configuring physics simulations [14].
High-dimensional global optimization faces an inherent difficulty stemming from the curse of
dimensionality, as a thorough exploration of the search space becomes exponentially more expensive.
It is more feasible to seek to locally optimize these high-dimensional objective functions, as we can
then sidestep this inherent burden. This is true even in settings where we cannot directly observe the
gradient of the objective function, as we may appeal to sophisticated techniques such as Bayesian
optimization to nonetheless learn about the gradient of the objective through noisy observations, and
then use this knowledge to navigate the high-dimensional search space locally.
A realization of this scheme was proposed by Müller et al. [18], where a Gaussian process (GP) models the objective function, and the procedure alternates between designing observations that minimize the variance – and thus uncertainty – of the GP's estimate of the gradient at the current location and moving in the direction of the expected gradient. Although this approach seems natural,
it fails to account for some nuances in the distribution of the directional derivative induced by the GP.
Specifically, it turns out that beliefs about the gradient with identical uncertainty may nonetheless have
different probabilities of descent along the expected gradient. Further and perhaps surprisingly, the
expected gradient is not necessarily the direction maximizing the probability of descent – in fact, these
directions can be nearly orthogonal. In other words, simply minimizing the gradient variance and mov-
ing in the direction of the expected gradient may lead to suboptimal (local) optimization performance.
With this insight, we propose a scheme for local Bayesian optimization that alternates between
identifying the direction of most probable descent, then moving in that direction. The result is a
local optimizer that is efficient by design. To this end, we derive a closed-form solution for the
direction of most probable descent at a given location in the input space under a GP belief about the
objective function. We then design a corresponding closed-form acquisition function that optimizes
(an upper bound of) the one-step maximum descent probability. Taken together, these components
comprise an elegant and efficient optimization scheme. We demonstrate empirically that, across many
synthetic and real-world functions, our method outperforms the aforementioned prior realization of
this framework and is competitive against other, significantly more complicated baselines.
2 Preliminaries
We first introduce the problem setting and the local Bayesian optimization framework. We aim to
numerically solve optimization problems of the form:
$$\text{given } x_0 \in D, \qquad \text{find } x^* = \arg\min_{x \in D(x_0)} f(x),$$
where f : D → R is the black-box objective function we wish to optimize locally from a starting point x₀, and D(x₀) is the local region around x₀ inside the domain D. We model the objective function as a black box, and only assume that we may obtain potentially noisy function evaluations y = f(x) + ε, where ε ∼ N(0, σ²), at locations of our choosing. We further assume the gradient
cannot be measured directly, but only estimated from such noisy evaluations of the function. Finally,
we consider the case where querying the objective is relatively expensive, limiting the number of
times it may be evaluated. This constraint on our querying budget requires strategically selecting
where to evaluate during optimization.
Bayesian optimization (BO) is one potential approach to this problem that offers unparalleled sample
efficiency. BO constructs a probabilistic model of the objective function, typically a Gaussian process
(GP) [19], and uses this model to design the next point(s) to evaluate the objective. After each
observation, the GP is updated to reflect our current belief about the objective, which is then used to
inform future decisions. We refer the reader to Garnett [6] for a thorough treatment of GPs and BO.
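To make this loop concrete, below is a minimal, generic BO sketch in Python. It is not the method proposed in this paper; the RBF kernel, the expected-improvement acquisition, the random candidate pool, the toy objective, and all helper names and hyperparameters are illustrative assumptions. It simply shows the model–design–evaluate cycle described above: fit a GP to the data, pick the next query by maximizing an acquisition function, observe, and update.

```python
import numpy as np
from scipy.stats import norm

def rbf(A, B, ell=0.5, s2=1.0):
    """RBF kernel matrix between the rows of A (n, d) and B (m, d)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return s2 * np.exp(-0.5 * sq / ell ** 2)

def gp_posterior(Xq, X, y, noise=1e-4):
    """Posterior mean and variance of a zero-mean GP at query points Xq."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(Xq, X)
    mu = Ks @ np.linalg.solve(K, y)
    var = np.diag(rbf(Xq, Xq)) - np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T))
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, best):
    """EI for minimization: E[max(best - f(x), 0)] under the GP posterior."""
    sd = np.sqrt(var)
    z = (best - mu) / sd
    return (best - mu) * norm.cdf(z) + sd * norm.pdf(z)

rng = np.random.default_rng(0)
f = lambda x: np.sin(3.0 * x[..., 0]) + x[..., 0] ** 2    # toy 1-D objective
X = rng.uniform(-2.0, 2.0, size=(3, 1))
y = f(X)
for _ in range(10):
    cand = rng.uniform(-2.0, 2.0, size=(512, 1))          # random candidate pool
    mu, var = gp_posterior(cand, X, y)
    x_next = cand[np.argmax(expected_improvement(mu, var, y.min()))]
    X = np.vstack([X, x_next])
    y = np.append(y, f(x_next))
print("best value found:", y.min())
```

A practical implementation would also fit the GP hyperparameters to the data and optimize the acquisition function properly rather than over a random candidate pool; the sketch only illustrates the structure of the loop.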
2.1 Local Bayesian optimization
In many applications, the objective function f is high-dimensional. The curse of dimensionality poses
a challenge for BO, as it will take exponentially more function evaluations to sufficiently cover the
search space and find the global optimum. It may be more fruitful, therefore, to instead pursue local
optimization, where we aim to descend from the current location, by probing the objective function
in nearby regions to learn about its gradient.
It turns out the BO framework is particularly amenable to this idea, as a GP belief on the objective
function induces a joint GP belief with its gradient [19], which we may use to guide local optimization.
In particular, given a GP belief about the objective function f with a once-differentiable mean function µ and a twice-differentiable covariance function K, the joint distribution of the noisy observations (X, y) and the gradient of f at some point x is
$$p\!\begin{pmatrix} y \\ \nabla f(x) \end{pmatrix} = \mathcal{N}\!\left( \begin{bmatrix} \mu(X) \\ \nabla\mu(x) \end{bmatrix}\!,\; \begin{bmatrix} K(X, X) + \sigma^2 I & K(X, x)\nabla^{\!\top} \\ \nabla K(x, X) & \nabla K(x, x)\nabla^{\!\top} \end{bmatrix} \right)\!.$$
Here, when placed in front of K, the differential operator ∇ indicates that we are taking the derivative of K with respect to its first input; when placed behind K, it indicates the derivative is with respect to its second input. Conditioned on the observations (X, y), the posterior distribution of the derivative ∇f(x) may be obtained as:
$$p\big(\nabla f(x) \mid x, X, y\big) = \mathcal{N}(\mu_x, \Sigma_x),$$
where
$$\mu_x = \nabla\mu(x) + \nabla K(x, X)\,\big[K(X, X) + \sigma^2 I\big]^{-1}\big(y - \mu(X)\big),$$
$$\Sigma_x = \nabla K(x, x)\nabla^{\!\top} - \nabla K(x, X)\,\big[K(X, X) + \sigma^2 I\big]^{-1} K(X, x)\nabla^{\!\top}. \tag{1}$$
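For concreteness, the following is a minimal numpy sketch of Eq. (1) under the common special case of a zero prior mean and an isotropic RBF kernel, for which the needed kernel derivatives have simple closed forms; the helper names and hyperparameters are illustrative assumptions rather than the implementation used in this paper.

```python
# A minimal sketch of Eq. (1), assuming a zero prior mean and the RBF kernel
# k(a, b) = outputscale * exp(-||a - b||^2 / (2 * lengthscale^2)).
import numpy as np

def rbf(A, B, lengthscale=1.0, outputscale=1.0):
    """RBF kernel matrix between the rows of A (n, d) and B (m, d)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return outputscale * np.exp(-0.5 * sq / lengthscale ** 2)

def gradient_posterior(x, X, y, lengthscale=1.0, outputscale=1.0, noise=1e-2):
    """Gaussian belief (mu_x, Sigma_x) about grad f(x) given noisy data (X, y)."""
    n, d = X.shape
    K = rbf(X, X, lengthscale, outputscale) + noise * np.eye(n)
    k_xX = rbf(x[None, :], X, lengthscale, outputscale)[0]            # (n,)
    # d/dx k(x, X_i) = k(x, X_i) * (X_i - x) / lengthscale^2  ->  shape (d, n)
    dK_xX = (k_xX[None, :] * (X - x[None, :]).T) / lengthscale ** 2
    # d^2/(dx dx'^T) k(x, x') at x' = x  ->  (outputscale / lengthscale^2) * I
    dK_xx = (outputscale / lengthscale ** 2) * np.eye(d)
    alpha = np.linalg.solve(K, y)                                     # [K(X,X) + s^2 I]^{-1} y
    mu_x = dK_xX @ alpha                                              # posterior mean, Eq. (1)
    Sigma_x = dK_xx - dK_xX @ np.linalg.solve(K, dK_xX.T)             # posterior cov, Eq. (1)
    return mu_x, Sigma_x

# Tiny usage example: f(x) = sum_i sin(x_i), whose gradient at the origin is all ones.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 3))
y = np.sin(X).sum(axis=1) + 0.1 * rng.standard_normal(20)
mu_x, Sigma_x = gradient_posterior(np.zeros(3), X, y)
```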
Given the ability to reason about the objective function gradient from noisy function observations, we may realize a Bayesian local optimization scheme as follows. From a current location x, we devise a policy that first designs observations seeking relevant information about the gradient ∇f(x), then, once satisfied, moves within the search space to a new location (that is, updates x) seeking to descend on the objective. A particular realization of this local BO scheme, named GIBO, was investigated by Müller et al. [18]. In that study, the authors choose to learn about ∇f(x) by minimizing the uncertainty about the gradient (quantified by the trace of its posterior covariance matrix), followed by moving in the direction of the expected gradient. This algorithm may be thought of as simulating gradient descent, as it actively builds and then follows a noisy estimate of the gradient. Although effective, GIBO fails to account for nuances in our belief about the objective function gradient and may behave suboptimally during optimization as a result. Our work addresses this gap by exploiting the rich structure in the belief about ∇f(x) to design an elegant and principled policy for local BO.
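As a rough, hedged sketch of the uncertainty-reduction step described above (not GIBO's actual code), one may score a candidate batch of query locations Z by the trace of the posterior covariance of ∇f(x) that would result from observing at X together with Z; since the posterior covariance in Eq. (1) does not depend on the observed values y, this score can be computed before any new evaluations are made. The zero-mean GP, RBF kernel, and all names and hyperparameters below are assumptions for illustration.

```python
import numpy as np

def rbf(A, B, ell=1.0, s2=1.0):
    """RBF kernel matrix between the rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return s2 * np.exp(-0.5 * sq / ell ** 2)

def grad_cov_trace(x, X_all, ell=1.0, s2=1.0, noise=1e-2):
    """Trace of Cov[grad f(x)] after (noisy) observations at the rows of X_all."""
    n, d = X_all.shape
    K = rbf(X_all, X_all, ell, s2) + noise * np.eye(n)
    k = rbf(x[None, :], X_all, ell, s2)[0]                   # k(x, X_all), shape (n,)
    dK = (k[None, :] * (X_all - x[None, :]).T) / ell ** 2    # grad_x k(x, X_all), (d, n)
    prior = (s2 / ell ** 2) * np.eye(d)                      # prior Cov[grad f(x)]
    return np.trace(prior - dK @ np.linalg.solve(K, dK.T))

def gibo_like_acquisition(Z, x, X, **kw):
    """Lower is better: gradient uncertainty remaining after adding queries Z."""
    return grad_cov_trace(x, np.vstack([X, Z]), **kw)

# Example: score a hypothetical two-point batch near the current location x.
rng = np.random.default_rng(0)
x, X = np.zeros(2), rng.uniform(-1.0, 1.0, size=(5, 2))
Z = rng.uniform(-0.5, 0.5, size=(2, 2))
print(gibo_like_acquisition(Z, x, X))
```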
2.2 Related work
We re-examine and extend the work of Müller et al. [18], who proposed using local BO for the
purpose of policy search in reinforcement learning (RL). As mentioned, their proposed algorithm
GIBO alternates between minimizing the variance of the estimate of the gradient – this is analogous
to the goal of A-optimality in optimal design – and moving in the direction of the expected gradient.
This scheme was shown to outperform baselines such as global BO using expected improvement [11] and the evolutionary algorithm CMA-ES [8] on several problems. Prior to this work, Mania et al. [16]
noted that local black-box optimization is a promising approach for RL. They developed a simple
algorithm, Augmented Random Search (ARS), that estimates the gradient of the objective via finite
differencing and random perturbations; this simple method was competitive in their experiments on
RL tasks. GIBO and ARS are the two main baselines that we will be comparing our method against.
As mentioned, scaling to high-dimensional problems has been an enduring challenge in the BO
community, and there have been many proposals to make BO “more local” as a way to relieve the
burden of the curse of dimensionality. In particular, several lines of research have proposed restricting
the search space to only specific regions, e.g., maintaining a belief about the local optimum [1], using trust regions [5, 25], and forcing queries to stay close to past observations [13]. Among these, of note is the TuRBO algorithm [5], which expands and shrinks the size of its trust regions based on the
optimization history within each region, and has been shown to achieve strong performance across
many tasks. We include TuRBO as another baseline in our experiments.
Other approaches have considered dynamically switching between global and gradient-based local optimization, particularly when a local region is believed to contain the global optimum. For example, McLeod et al. [17] proposed alternating between global BO and using BFGS for local optimization when there is high certainty that we are close to the global optimum. Diouane et al. [4] leveraged the same scheme to identify good local regions and used a trust region-based policy for the local phase. Wang et al. [26]
, on the other hand, proposed using Monte Carlo tree search to learn which subregions of the search space are more likely to contain good objective values and should be locally exploited, recursively partitioning the space based on optimization performance. The authors also
showed that when combined with TuRBO, their algorithm achieves state-of-the-art performance on a
wide range of tasks. Our optimization method can replace the local optimizer in these approaches,
and in general can act as a subroutine within a larger framework relying on local optimization.
Tackling local optimization from a probabilistic angle, our method belongs to a larger class of probabilistic numerical methods; see chapter 4 of Hennig et al. [10] for a thorough discussion of probabilistic numerics for local optimization. Within this line of research are other efforts at leveraging probabilistic reasoning in optimization, including a Bayesian quasi-Newton algorithm that learns from noisy observations of the gradient [9], a probabilistic interpretation of incremental proximal methods [2], and probabilistic line searches [15].
We note that Le Roux et al. [12] arrived at an update expression similar to that of our algorithm (see Sect. 3), though they aimed at developing fast optimization algorithms for good generalization, a different problem from BO. Moreover, their derivation is devoted to justifying natural gradient descent. In particular,
they show that the descent direction maximizing the probability of not increasing generalization error
is precisely the natural gradient direction.
3 Maximizing probability of descent
What behavior is desirable for a local optimization routine that values sample efficiency? We argue
that we should seek to quickly identify directions that will, with high probability, yield progress
on the objective function. Pursuing this idea requires reasoning about the probability that a given
direction leads “downhill” from a given location. Although one might guess that the direction most
likely to lead downhill is always the (negative) expected gradient, this is not necessarily the case.
Consider the directional derivative of the objective f with respect to a unit vector v at a point x:
$$\nabla_v f(x) = v^\top \nabla f(x),$$
which quantifies the rate of change of f at x along the direction of v. According to our GP belief, ∇f(x) follows a multivariate normal distribution, so the directional derivative ∇_v f(x) is then distributed as
$$p\big(\nabla_v f(x) \mid x, v\big) = \mathcal{N}\big(v^\top \mu_x,\; v^\top \Sigma_x v\big),$$
where µx and Σx are the mean and covariance matrix of the normal belief about ∇f(x), as defined
in Eq. (1). This distribution allows us to reason about the probability that we descend on the objective
function by moving along the direction of v from x, which is simply the probability that the directional
derivative is negative. Thus, we have the following definition.
Definition 3.1 (Descent probability and most probable descent direction). Given a unit vector v, the descent probability of the direction v at the location x is given by
$$\Pr\big(\nabla_v f(x) < 0 \mid x, v\big) = \Phi\!\left(\frac{-\,v^\top \mu_x}{\sqrt{v^\top \Sigma_x v}}\right), \tag{2}$$
where Φ is the CDF of the standard normal distribution. If v* achieves the maximum descent probability, v* ∈ arg max_v Pr(∇_v f(x) < 0 | x, v), then we call v* a most probable descent direction.
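A direct transcription of Eq. (2) in Python, assuming µx and Σx come from a gradient posterior such as the Eq. (1) sketch earlier; the function name is illustrative rather than taken from the paper's code.

```python
import numpy as np
from scipy.stats import norm

def descent_probability(v, mu_x, Sigma_x):
    """Pr(directional derivative of f along v is negative) at x, per Eq. (2)."""
    v = v / np.linalg.norm(v)          # only the direction of v matters; normalize
    return norm.cdf(-v @ mu_x / np.sqrt(v @ Sigma_x @ v))
```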
Note that the definition in Eq. (2) is scale-invariant: the length of v does not matter, since the descent probability only depends on its direction. Moreover, we note that the descent probability depends on both the expected gradient µx and the gradient uncertainty Σx. Therefore, learning about
the gradient by minimizing uncertainty via the trace of the posterior covariance matrix (which does
not consider the expected gradient) and moving in the direction of the negative expected gradient
(which does not consider uncertainty in the gradient) in a decoupled manner may lead to suboptimal
behavior. We first present a simple example to demonstrate the nuances that are not captured by this
scheme and to motivate our proposed solution.
3.1 The (negative) expected gradient does not always maximize descent probability
In Fig. 1, we show polar plots of the descent probability Pr(∇_v f(x) < 0 | x, v) with respect to different beliefs about the gradient. The angles in the polar plots are the angles between v and the vector [1, 0]^⊤. Critically for the discussion below, the uncertainty in the gradient, as measured by the trace of the covariance matrix, is identical for all three examples.
In the first example in the left panel of Fig. 1, the negative expected gradient happens to maximize
the descent probability, and moving in this direction is almost certain to lead downhill. In the middle
panel, the expected gradient is the same as in the left panel, but the covariance matrix has been
permuted. Here, the negative expected gradient again maximizes the descent probability; however,
the largest descent probability is now much lower. In fact, there is non-negligible probability that the
descent direction is in the opposite direction. This is because most of the uncertainty we have about
the gradient concentrates on the first element of µx, which determines its direction. We note that the situation in the left panel is inarguably preferable to that in the middle panel, but distinguishing these two is impossible from the uncertainty in ∇f(x) alone.
Finally, in the right panel, the direction of the expected gradient has rotated with respect to that in the
first two panels. Now the (negative) expected gradient is entirely different from the most probable
descent direction. Intuitively, the variance in the first coordinate is much smaller than in the second
coordinate, and thus the mean in the first coordinate is more likely to have the same sign as the
true gradient. However, using the negative expected gradient as a descent direction entirely ignores the estimated uncertainty in the gradient. This example shows that, when we reason about the descent of a function, the mean vector µx and the covariance matrix Σx need to be jointly considered, as the probability of descent depends on both of these quantities (Eq. (2)).
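As a quick numerical illustration of this point, the toy computation below uses a hypothetical belief in the spirit of the right panel (small variance in the first coordinate, large variance in the second; the specific numbers are made up for illustration), sweeps unit directions, finds the one maximizing Eq. (2), and compares it with the negative expected gradient.

```python
import numpy as np
from scipy.stats import norm

mu_x = np.array([0.2, 1.0])            # expected gradient (hypothetical numbers)
Sigma_x = np.diag([0.01, 4.0])         # gradient covariance (hypothetical numbers)

# Sweep unit directions v(theta) = [cos(theta), sin(theta)] and evaluate Eq. (2).
thetas = np.linspace(0.0, 2.0 * np.pi, 3601)
vs = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)
probs = norm.cdf(-(vs @ mu_x) / np.sqrt(np.einsum("ij,jk,ik->i", vs, Sigma_x, vs)))
v_star = vs[np.argmax(probs)]          # numerically most probable descent direction

v_neg_grad = -mu_x / np.linalg.norm(mu_x)
p_neg_grad = norm.cdf(-(v_neg_grad @ mu_x) / np.sqrt(v_neg_grad @ Sigma_x @ v_neg_grad))
angle = np.degrees(np.arccos(np.clip(v_star @ v_neg_grad, -1.0, 1.0)))
print("negative expected gradient:", v_neg_grad, "descent prob:", p_neg_grad)
print("most probable descent dir :", v_star, "descent prob:", probs.max())
print("angle between them (deg)  :", angle)
```

With these numbers, the maximizer is nearly aligned with the well-determined first coordinate, while the negative expected gradient points mostly along the uncertain second coordinate, so the two directions are far apart.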