Local Bayesian optimization via maximizing probability of descent

Quan Nguyen¹, Kaiwen Wu², Jacob R. Gardner², Roman Garnett¹
¹Washington University in St. Louis   ²University of Pennsylvania
{quan,garnett}@wustl.edu
{kaiwenwu,jacobrg}@seas.upenn.edu
Equal contribution.
Abstract
Local optimization presents a promising approach to expensive, high-dimensional
black-box optimization by sidestepping the need to globally explore the search
space. For objective functions whose gradient cannot be evaluated directly,
Bayesian optimization offers one solution – we construct a probabilistic model of
the objective, design a policy to learn about the gradient at the current location, and
use the resulting information to navigate the objective landscape. Previous work
has realized this scheme by minimizing the variance in the estimate of the gradient,
then moving in the direction of the expected gradient. In this paper, we re-examine
and refine this approach. We demonstrate that, surprisingly, the expected value
of the gradient is not always the direction maximizing the probability of descent,
and in fact, these directions may be nearly orthogonal. This observation then
inspires an elegant optimization scheme seeking to maximize the probability of
descent while moving in the direction of most-probable descent. Experiments
on both synthetic and real-world objectives show that our method outperforms
previous realizations of this optimization scheme and is competitive against other,
significantly more complicated baselines.
1 Introduction
The optimization of expensive-to-evaluate, high-dimensional black-box functions is ubiquitous in
machine learning, science, engineering, and beyond; examples range from hyperparameter tuning
[21] and policy search in reinforcement learning [3, 7], to configuring physics simulations [14].
High-dimensional global optimization faces an inherent difficulty stemming from the curse of
dimensionality, as a thorough exploration of the search space becomes exponentially more expensive.
It is more feasible to seek to locally optimize these high-dimensional objective functions, as we can
then sidestep this inherent burden. This is true even in settings where we cannot directly observe the
gradient of the objective function, as we may appeal to sophisticated techniques such as Bayesian
optimization to nonetheless learn about the gradient of the objective through noisy observations, and
then use this knowledge to navigate the high-dimensional search space locally.
A realization of this scheme was proposed by Müller et al. [18], where a Gaussian process (GP) models the objective function, and the procedure alternates between designing observations that minimize the variance – and thus uncertainty – of the GP's estimate of the gradient at the current location and moving in the direction of the expected gradient. Although this approach seems natural,
it fails to account for some nuances in the distribution of the directional derivative induced by the GP.
Specifically, it turns out that beliefs about the gradient with identical uncertainty may nonetheless have
different probabilities of descent along the expected gradient. Further and perhaps surprisingly, the
expected gradient is not necessarily the direction maximizing the probability of descent – in fact, these
directions can be nearly orthogonal. In other words, simply minimizing the gradient variance and mov-
ing in the direction of the expected gradient may lead to suboptimal (local) optimization performance.
With this insight, we propose a scheme for local Bayesian optimization that alternates between
identifying the direction of most probable descent, then moving in that direction. The result is a
local optimizer that is efficient by design. To this end, we derive a closed-form solution for the
direction of most probable descent at a given location in the input space under a GP belief about the
objective function. We then design a corresponding closed-form acquisition function that optimizes
(an upper bound of) the one-step maximum descent probability. Taken together, these components
comprise an elegant and efficient optimization scheme. We demonstrate empirically that, across many
synthetic and real-world functions, our method outperforms the aforementioned prior realization of
this framework and is competitive against other, significantly more complicated baselines.
2 Preliminaries
We first introduce the problem setting and the local Bayesian optimization framework. We aim to
numerically solve optimization problems of the form:
$$\text{given } x_0 \in D, \qquad \text{find } x^* = \arg\min_{x \in D(x_0)} f(x),$$
where f : D → R is the black-box objective function we wish to optimize locally from a starting point x₀, and D(x₀) is the local region around x₀ inside the domain D. We model the objective function as a black box, and only assume that we may obtain potentially noisy function evaluations y = f(x) + ε, where ε ∼ N(0, σ²), at locations of our choosing. We further assume the gradient
cannot be measured directly, but only estimated from such noisy evaluations of the function. Finally,
we consider the case where querying the objective is relatively expensive, limiting the number of
times it may be evaluated. This constraint on our querying budget requires strategically selecting
where to evaluate during optimization.
Bayesian optimization (BO) is one potential approach to this problem that offers unparalleled sample
efficiency. BO constructs a probabilistic model of the objective function, typically a Gaussian process
(GP) [19], and uses this model to design the next point(s) to evaluate the objective. After each
observation, the GP is updated to reflect our current belief about the objective, which is then used to
inform future decisions. We refer the reader to Garnett [6] for a thorough treatment of GPs and BO.
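To make this loop concrete, below is a minimal, generic BO sketch in Python. It is not the method proposed in this paper; the RBF kernel, the expected-improvement acquisition, the random candidate pool, the toy objective, and all helper names and hyperparameters are illustrative assumptions. It simply shows the model–design–evaluate cycle described above: fit a GP to the data, pick the next query by maximizing an acquisition function, observe, and update.

```python
import numpy as np
from scipy.stats import norm

def rbf(A, B, ell=0.5, s2=1.0):
    """RBF kernel matrix between the rows of A (n, d) and B (m, d)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return s2 * np.exp(-0.5 * sq / ell ** 2)

def gp_posterior(Xq, X, y, noise=1e-4):
    """Posterior mean and variance of a zero-mean GP at query points Xq."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(Xq, X)
    mu = Ks @ np.linalg.solve(K, y)
    var = np.diag(rbf(Xq, Xq)) - np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T))
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, best):
    """EI for minimization: E[max(best - f(x), 0)] under the GP posterior."""
    sd = np.sqrt(var)
    z = (best - mu) / sd
    return (best - mu) * norm.cdf(z) + sd * norm.pdf(z)

rng = np.random.default_rng(0)
f = lambda x: np.sin(3.0 * x[..., 0]) + x[..., 0] ** 2    # toy 1-D objective
X = rng.uniform(-2.0, 2.0, size=(3, 1))
y = f(X)
for _ in range(10):
    cand = rng.uniform(-2.0, 2.0, size=(512, 1))          # random candidate pool
    mu, var = gp_posterior(cand, X, y)
    x_next = cand[np.argmax(expected_improvement(mu, var, y.min()))]
    X = np.vstack([X, x_next])
    y = np.append(y, f(x_next))
print("best value found:", y.min())
```

A practical implementation would also fit the GP hyperparameters to the data and optimize the acquisition function properly rather than over a random candidate pool; the sketch only illustrates the structure of the loop.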
2.1 Local Bayesian optimization
In many applications, the objective function f is high-dimensional. The curse of dimensionality poses
a challenge for BO, as it will take exponentially more function evaluations to sufficiently cover the
search space and find the global optimum. It may be more fruitful, therefore, to instead pursue local
optimization, where we aim to descend from the current location, by probing the objective function
in nearby regions to learn about its gradient.
It turns out the BO framework is particularly amenable to this idea, as a GP belief on the objective
function induces a joint GP belief with its gradient [19], which we may use to guide local optimization.
In particular, given a GP belief about the objective function f with a once-differentiable mean function µ and a twice-differentiable covariance function K, the joint distribution of the noisy observations (X, y) and the gradient of f at some point x is
$$p\!\begin{pmatrix} y \\ \nabla f(x) \end{pmatrix} = \mathcal{N}\!\left( \begin{bmatrix} \mu(X) \\ \nabla\mu(x) \end{bmatrix}\!,\; \begin{bmatrix} K(X, X) + \sigma^2 I & K(X, x)\nabla^{\!\top} \\ \nabla K(x, X) & \nabla K(x, x)\nabla^{\!\top} \end{bmatrix} \right)\!.$$
Here, when placed in front of K, the differential operator ∇ indicates that we are taking the derivative of K with respect to its first input; when placed behind K, it indicates the derivative is with respect to its second input. Conditioned on the observations (X, y), the posterior distribution of the derivative ∇f(x) may be obtained as:
$$p\big(\nabla f(x) \mid x, X, y\big) = \mathcal{N}(\mu_x, \Sigma_x),$$
where
$$\mu_x = \nabla\mu(x) + \nabla K(x, X)\,\big[K(X, X) + \sigma^2 I\big]^{-1}\big(y - \mu(X)\big),$$
$$\Sigma_x = \nabla K(x, x)\nabla^{\!\top} - \nabla K(x, X)\,\big[K(X, X) + \sigma^2 I\big]^{-1} K(X, x)\nabla^{\!\top}. \tag{1}$$
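For concreteness, the following is a minimal numpy sketch of Eq. (1) under the common special case of a zero prior mean and an isotropic RBF kernel, for which the needed kernel derivatives have simple closed forms; the helper names and hyperparameters are illustrative assumptions rather than the implementation used in this paper.

```python
# A minimal sketch of Eq. (1), assuming a zero prior mean and the RBF kernel
# k(a, b) = outputscale * exp(-||a - b||^2 / (2 * lengthscale^2)).
import numpy as np

def rbf(A, B, lengthscale=1.0, outputscale=1.0):
    """RBF kernel matrix between the rows of A (n, d) and B (m, d)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return outputscale * np.exp(-0.5 * sq / lengthscale ** 2)

def gradient_posterior(x, X, y, lengthscale=1.0, outputscale=1.0, noise=1e-2):
    """Gaussian belief (mu_x, Sigma_x) about grad f(x) given noisy data (X, y)."""
    n, d = X.shape
    K = rbf(X, X, lengthscale, outputscale) + noise * np.eye(n)
    k_xX = rbf(x[None, :], X, lengthscale, outputscale)[0]            # (n,)
    # d/dx k(x, X_i) = k(x, X_i) * (X_i - x) / lengthscale^2  ->  shape (d, n)
    dK_xX = (k_xX[None, :] * (X - x[None, :]).T) / lengthscale ** 2
    # d^2/(dx dx'^T) k(x, x') at x' = x  ->  (outputscale / lengthscale^2) * I
    dK_xx = (outputscale / lengthscale ** 2) * np.eye(d)
    alpha = np.linalg.solve(K, y)                                     # [K(X,X) + s^2 I]^{-1} y
    mu_x = dK_xX @ alpha                                              # posterior mean, Eq. (1)
    Sigma_x = dK_xx - dK_xX @ np.linalg.solve(K, dK_xX.T)             # posterior cov, Eq. (1)
    return mu_x, Sigma_x

# Tiny usage example: f(x) = sum_i sin(x_i), whose gradient at the origin is all ones.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 3))
y = np.sin(X).sum(axis=1) + 0.1 * rng.standard_normal(20)
mu_x, Sigma_x = gradient_posterior(np.zeros(3), X, y)
```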
Given the ability to reason about the objective function gradient from noisy function observations, we may realize a Bayesian local optimization scheme as follows. From a current location x, we devise a policy that first designs observations seeking relevant information about the gradient ∇f(x), then, once satisfied, moves within the search space to a new location (that is, updates x) seeking to descend on the objective. A particular realization of this local BO scheme, named GIBO, was investigated by Müller et al. [18]. In that study, the authors choose to learn about ∇f(x) by minimizing the uncertainty about the gradient (quantified by the trace of its posterior covariance matrix), followed by moving in the direction of the expected gradient. This algorithm may be thought of as simulating gradient descent, as it actively builds and then follows a noisy estimate of the gradient. Although effective, GIBO fails to account for nuances in our belief about the objective function gradient and may behave suboptimally during optimization as a result. Our work addresses this gap by exploiting the rich structure in the belief about ∇f(x) to design an elegant and principled policy for local BO.
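As a rough, hedged sketch of the uncertainty-reduction step described above (not GIBO's actual code), one may score a candidate batch of query locations Z by the trace of the posterior covariance of ∇f(x) that would result from observing at X together with Z; since the posterior covariance in Eq. (1) does not depend on the observed values y, this score can be computed before any new evaluations are made. The zero-mean GP, RBF kernel, and all names and hyperparameters below are assumptions for illustration.

```python
import numpy as np

def rbf(A, B, ell=1.0, s2=1.0):
    """RBF kernel matrix between the rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return s2 * np.exp(-0.5 * sq / ell ** 2)

def grad_cov_trace(x, X_all, ell=1.0, s2=1.0, noise=1e-2):
    """Trace of Cov[grad f(x)] after (noisy) observations at the rows of X_all."""
    n, d = X_all.shape
    K = rbf(X_all, X_all, ell, s2) + noise * np.eye(n)
    k = rbf(x[None, :], X_all, ell, s2)[0]                   # k(x, X_all), shape (n,)
    dK = (k[None, :] * (X_all - x[None, :]).T) / ell ** 2    # grad_x k(x, X_all), (d, n)
    prior = (s2 / ell ** 2) * np.eye(d)                      # prior Cov[grad f(x)]
    return np.trace(prior - dK @ np.linalg.solve(K, dK.T))

def gibo_like_acquisition(Z, x, X, **kw):
    """Lower is better: gradient uncertainty remaining after adding queries Z."""
    return grad_cov_trace(x, np.vstack([X, Z]), **kw)

# Example: score a hypothetical two-point batch near the current location x.
rng = np.random.default_rng(0)
x, X = np.zeros(2), rng.uniform(-1.0, 1.0, size=(5, 2))
Z = rng.uniform(-0.5, 0.5, size=(2, 2))
print(gibo_like_acquisition(Z, x, X))
```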
2.2 Related work
We re-examine and extend the work of Müller et al. [18], who proposed using local BO for the
purpose of policy search in reinforcement learning (RL). As mentioned, their proposed algorithm
GIBO alternates between minimizing the variance of the estimate of the gradient – this is analogous
to the goal of A-optimality in optimal design – and moving in the direction of the expected gradient.
This scheme was shown to outperform baselines such as global BO using expected improvement [11] and the evolutionary algorithm CMA-ES [8] on several problems. Prior to this work, Mania et al. [16]
noted that local black-box optimization is a promising approach for RL. They developed a simple
algorithm, Augmented Random Search (ARS), that estimates the gradient of the objective via finite
differencing and random perturbations; this simple method was competitive in their experiments on
RL tasks. GIBO and ARS are the two main baselines that we will be comparing our method against.
As mentioned, scaling to high-dimensional problems has been an enduring challenge in the BO
community, and there have been many proposals to make BO “more local” as a way to relieve the
burden of the curse of dimensionality. In particular, several lines of research have proposed restricting
the search space to only specific regions, e.g., maintaining a belief about the local optimum [1], using trust regions [5, 25], and forcing queries to stay close to past observations [13]. Among these, of note is the TuRBO algorithm [5], which expands and shrinks the size of its trust regions based on the
optimization history within each region, and has been shown to achieve strong performance across
many tasks. We include TuRBO as another baseline in our experiments.
Other approaches have considered dynamically switching between global and gradient-based local optimization, particularly when a local region is believed to contain the global optimum. For example, McLeod et al. [17] proposed alternating between global BO and using BFGS for local optimization when there is high certainty that we are close to the global optimum. Diouane et al. [4] leveraged the same scheme to identify good local regions and used a trust region-based policy for the local phase. Wang et al. [26]
, on the other hand, proposed using Monte Carlo tree search to learn which subregions of the search space are more likely to contain good objective values and should be locally exploited, recursively partitioning the space based on optimization performance. The authors also
showed that when combined with TuRBO, their algorithm achieves state-of-the-art performance on a
wide range of tasks. Our optimization method can replace the local optimizer in these approaches,
and in general can act as a subroutine within a larger framework relying on local optimization.
Tackling local optimization from a probabilistic angle, our method belongs to a larger class of probabilistic numerical methods; see chapter 4 of Hennig et al. [10] for a thorough discussion of probabilistic numerics for local optimization. Within this line of research are other efforts at leveraging probabilistic reasoning in optimization, including a Bayesian quasi-Newton algorithm that learns from noisy observations of the gradient [9], a probabilistic interpretation of incremental proximal methods [2], and probabilistic line searches [15].
We note that Le Roux et al. [12] arrived at an update expression similar to that of our algorithm (see Sect. 3), though they aimed at developing fast optimization algorithms for good generalization, a different problem from BO. Moreover, their derivation is devoted to justifying natural gradient descent. In particular,
they show that the descent direction maximizing the probability of not increasing generalization error
is precisely the natural gradient direction.
3 Maximizing probability of descent
What behavior is desirable for a local optimization routine that values sample efficiency? We argue
that we should seek to quickly identify directions that will, with high probability, yield progress
on the objective function. Pursuing this idea requires reasoning about the probability that a given
direction leads “downhill” from a given location. Although one might guess that the direction most
likely to lead downhill is always the (negative) expected gradient, this is not necessarily the case.
Consider the directional derivative of the objective f with respect to a unit vector v at a point x:
$$\nabla_v f(x) = v^\top \nabla f(x),$$
which quantifies the rate of change of f at x along the direction of v. According to our GP belief, ∇f(x) follows a multivariate normal distribution, so the directional derivative ∇_v f(x) is then distributed as
$$p\big(\nabla_v f(x) \mid x, v\big) = \mathcal{N}\big(v^\top \mu_x,\; v^\top \Sigma_x v\big),$$
where µx and Σx are the mean and covariance matrix of the normal belief about ∇f(x), as defined
in Eq. (1). This distribution allows us to reason about the probability that we descend on the objective
function by moving along the direction of v from x, which is simply the probability that the directional
derivative is negative. Thus, we have the following definition.
Definition 3.1 (Descent probability and most probable descent direction). Given a unit vector v, the descent probability of the direction v at the location x is given by
$$\Pr\big(\nabla_v f(x) < 0 \mid x, v\big) = \Phi\!\left(\frac{-\,v^\top \mu_x}{\sqrt{v^\top \Sigma_x v}}\right), \tag{2}$$
where Φ is the CDF of the standard normal distribution. If v* achieves the maximum descent probability, v* ∈ arg max_v Pr(∇_v f(x) < 0 | x, v), then we call v* a most probable descent direction.
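A direct transcription of Eq. (2) in Python, assuming µx and Σx come from a gradient posterior such as the Eq. (1) sketch earlier; the function name is illustrative rather than taken from the paper's code.

```python
import numpy as np
from scipy.stats import norm

def descent_probability(v, mu_x, Sigma_x):
    """Pr(directional derivative of f along v is negative) at x, per Eq. (2)."""
    v = v / np.linalg.norm(v)          # only the direction of v matters; normalize
    return norm.cdf(-v @ mu_x / np.sqrt(v @ Sigma_x @ v))
```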
Note that the definition in Eq. (2) is scale-invariant: the length of v does not matter, since the descent probability only depends on its direction. Moreover, we note that the descent probability depends on both the expected gradient µx and the gradient uncertainty Σx. Therefore, learning about
the gradient by minimizing uncertainty via the trace of the posterior covariance matrix (which does
not consider the expected gradient) and moving in the direction of the negative expected gradient
(which does not consider uncertainty in the gradient) in a decoupled manner may lead to suboptimal
behavior. We first present a simple example to demonstrate the nuances that are not captured by this
scheme and to motivate our proposed solution.
3.1 The (negative) expected gradient does not always maximize descent probability
In Fig. 1, we show polar plots of the descent probability Pr(∇_v f(x) < 0 | x, v) with respect to different beliefs about the gradient. The angles in the polar plots are the angles between v and the vector [1, 0]^⊤. Critically for the discussion below, the uncertainty in the gradient, as measured by the trace of the covariance matrix, is identical for all three examples.
In the first example in the left panel of Fig. 1, the negative expected gradient happens to maximize
the descent probability, and moving in this direction is almost certain to lead downhill. In the middle
panel, the expected gradient is the same as in the left panel, but the covariance matrix has been
permuted. Here, the negative expected gradient again maximizes the descent probability; however,
the largest descent probability is now much lower. In fact, there is non-negligible probability that the
descent direction is in the opposite direction. This is because most of the uncertainty we have about
the gradient concentrates on the first element of µx, which determines its direction. We note that the situation in the left panel is inarguably preferable to that in the middle panel, but distinguishing these two is impossible from the uncertainty in ∇f(x) alone.
Finally, in the right panel, the direction of the expected gradient has rotated with respect to that in the
first two panels. Now the (negative) expected gradient is entirely different from the most probable
descent direction. Intuitively, the variance in the first coordinate is much smaller than in the second
coordinate, and thus the mean in the first coordinate is more likely to have the same sign as the
true gradient. However, using the negative expected gradient as a descent direction entirely ignores the estimated uncertainty in the gradient. This example shows that, when we reason about the descent of a function, the mean vector µx and the covariance matrix Σx need to be jointly considered, as the probability of descent depends on both of these quantities (Eq. (2)).
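As a quick numerical illustration of this point, the toy computation below uses a hypothetical belief in the spirit of the right panel (small variance in the first coordinate, large variance in the second; the specific numbers are made up for illustration), sweeps unit directions, finds the one maximizing Eq. (2), and compares it with the negative expected gradient.

```python
import numpy as np
from scipy.stats import norm

mu_x = np.array([0.2, 1.0])            # expected gradient (hypothetical numbers)
Sigma_x = np.diag([0.01, 4.0])         # gradient covariance (hypothetical numbers)

# Sweep unit directions v(theta) = [cos(theta), sin(theta)] and evaluate Eq. (2).
thetas = np.linspace(0.0, 2.0 * np.pi, 3601)
vs = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)
probs = norm.cdf(-(vs @ mu_x) / np.sqrt(np.einsum("ij,jk,ik->i", vs, Sigma_x, vs)))
v_star = vs[np.argmax(probs)]          # numerically most probable descent direction

v_neg_grad = -mu_x / np.linalg.norm(mu_x)
p_neg_grad = norm.cdf(-(v_neg_grad @ mu_x) / np.sqrt(v_neg_grad @ Sigma_x @ v_neg_grad))
angle = np.degrees(np.arccos(np.clip(v_star @ v_neg_grad, -1.0, 1.0)))
print("negative expected gradient:", v_neg_grad, "descent prob:", p_neg_grad)
print("most probable descent dir :", v_star, "descent prob:", probs.max())
print("angle between them (deg)  :", angle)
```

With these numbers, the maximizer is nearly aligned with the well-determined first coordinate, while the negative expected gradient points mostly along the uncertain second coordinate, so the two directions are far apart.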