Given the ability to reason about the objective function gradient from noisy function observations, we may realize a Bayesian local optimization scheme as follows. From a current location x, we devise a policy that first designs observations seeking relevant information about the gradient ∇f(x), and then, once satisfied, moves within the search space to a new location (that is, updates x) seeking to descend on the objective. A particular realization of this local BO scheme, named GIBO, was investigated by Müller et al. [18]. In that study, the authors learn about ∇f(x) by minimizing the uncertainty about the gradient (quantified by the trace of the posterior covariance matrix), and then move in the direction of the expected gradient. This algorithm may be thought of as simulating gradient descent, as it actively builds and then follows a noisy estimate of the gradient. Although effective, GIBO fails to account for nuances in our belief about the objective function gradient and may consequently behave suboptimally during optimization. Our work addresses this gap by exploiting the rich structure in the belief about ∇f(x) to design an elegant and principled policy for local BO.
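To make this scheme concrete, the following is a minimal sketch of one such iteration under a zero-mean GP prior with an RBF kernel. It is an illustration of the general recipe, not the implementation of Müller et al. [18]; the candidate-sampling rule, the hyperparameters (ell, sigma, noise, step_size), and the helper names gradient_posterior and gibo_like_step are our own illustrative assumptions.

```python
import numpy as np

def rbf(A, B, ell=1.0, sigma=1.0):
    """Squared-exponential kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma**2 * np.exp(-0.5 * d2 / ell**2)

def gradient_posterior(x, X, y, ell=1.0, sigma=1.0, noise=1e-2):
    """Posterior mean and covariance of grad f(x) under a GP(0, RBF) prior.
    Differentiation is linear, so the gradient of a GP is itself a GP whose
    cross-covariances with the observations are kernel derivatives."""
    K = rbf(X, X, ell, sigma) + noise**2 * np.eye(len(X))
    k_x = rbf(x[None, :], X, ell, sigma)[0]               # k(x, X), shape (n,)
    dk = -(k_x[None, :] * (x[:, None] - X.T)) / ell**2    # d k(x, X) / dx, shape (d, n)
    mean = dk @ np.linalg.solve(K, y)                     # E[grad f(x) | data]
    prior = (sigma**2 / ell**2) * np.eye(len(x))          # d^2 k / dx dx' at x = x'
    cov = prior - dk @ np.linalg.solve(K, dk.T)           # Cov[grad f(x) | data]
    return mean, cov

def gibo_like_step(f, x, X, y, n_designs=5, n_candidates=64, step_size=0.1, rng=None):
    """One local-BO iteration: greedily design observations that shrink the trace of
    the posterior covariance of grad f(x), then descend along the expected gradient."""
    rng = np.random.default_rng(rng)
    for _ in range(n_designs):
        cands = x + 0.5 * rng.standard_normal((n_candidates, len(x)))
        scores = []
        for c in cands:
            # The GP posterior covariance does not depend on the observed value,
            # so each candidate can be scored before actually querying f.
            _, cov = gradient_posterior(x, np.vstack([X, c]), np.append(y, 0.0))
            scores.append(np.trace(cov))
        best = cands[int(np.argmin(scores))]
        X, y = np.vstack([X, best]), np.append(y, f(best))
    mean_grad, _ = gradient_posterior(x, X, y)
    return x - step_size * mean_grad, X, y

# Usage: a few local steps on a noisy quadratic, starting from one observation.
f = lambda z: float(np.sum(z**2) + 0.01 * np.random.randn())
x = np.array([2.0, -1.5])
X, y = x[None, :], np.array([f(x)])
for _ in range(10):
    x, X, y = gibo_like_step(f, x, X, y)
```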
2.2 Related work
We re-examine and extend the work of Müller et al. [18], who proposed using local BO for the purpose of policy search in reinforcement learning (RL). As mentioned, their proposed algorithm GIBO alternates between minimizing the variance of the gradient estimate (analogous to the goal of A-optimality in optimal experimental design) and moving in the direction of the expected gradient. This scheme was shown to outperform baselines such as global BO with expected improvement [11] and the evolutionary algorithm CMA-ES [8] on several problems. Prior to that work, Mania et al. [16] noted that local black-box optimization is a promising approach for RL. They developed a simple algorithm, Augmented Random Search (ARS), which estimates the gradient of the objective via finite differencing with random perturbations; this simple method was competitive in their experiments on RL tasks. GIBO and ARS are the two main baselines against which we compare our method.
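The random-search gradient estimate at the core of ARS can be sketched as follows. This is a simplified version of the basic estimator and omits refinements of the full algorithm of Mania et al. [16] (such as scaling the step by the reward standard deviation and keeping only the top-performing directions); the function name and parameter values are illustrative.

```python
import numpy as np

def ars_gradient_estimate(f, x, n_directions=8, delta=0.05, rng=None):
    """Antithetic finite-difference gradient estimate from random perturbations.
    Averages [f(x + delta d) - f(x - delta d)] / (2 delta) * d over random
    Gaussian directions d, a Monte Carlo estimate of the gradient of a
    smoothed version of f."""
    rng = np.random.default_rng(rng)
    g = np.zeros_like(x, dtype=float)
    for _ in range(n_directions):
        d = rng.standard_normal(x.shape)
        g += (f(x + delta * d) - f(x - delta * d)) / (2.0 * delta) * d
    return g / n_directions

# Usage: descend a toy objective with the estimated gradient.
f = lambda z: float(np.sum(z**2))
x = np.array([1.0, -2.0])
for _ in range(50):
    x = x - 0.1 * ars_gradient_estimate(f, x)
```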
As mentioned, scaling to high-dimensional problems has been an enduring challenge in the BO community, and there have been many proposals to make BO "more local" as a way to mitigate the curse of dimensionality. In particular, several lines of research have proposed restricting the search space to specific regions, e.g., maintaining a belief about the local optimum [1], using trust regions [5, 25], and forcing queries to stay close to past observations [13]. Among these, of note is the TuRBO algorithm [5], which expands and shrinks its trust regions based on the optimization history within each region, and has been shown to achieve strong performance across many tasks. We include TuRBO as another baseline in our experiments.
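As a rough illustration of this trust-region mechanism, the following sketch captures the grow-on-success, shrink-on-failure rule described above; the thresholds, scaling factors, and bounds here are illustrative placeholders, not the exact values used by TuRBO [5].

```python
def update_trust_region(length, successes, failures,
                        success_tol=3, failure_tol=5,
                        grow=2.0, shrink=0.5,
                        length_min=1e-3, length_max=1.6):
    """Grow the trust-region side length after repeated successes, shrink it
    after repeated failures; signal a restart once the region is too small."""
    if successes >= success_tol:
        length, successes = min(grow * length, length_max), 0
    elif failures >= failure_tol:
        length, failures = shrink * length, 0
    restart = length < length_min
    return length, successes, failures, restart
```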
Other approaches have considered dynamically switching between global and gradient-based local optimization, particularly when a local region is believed to contain the global optimum. For example, McLeod et al. [17] proposed alternating between global BO and BFGS-based local optimization when there is high certainty that we are close to the global optimum. Diouane et al. [4] leveraged the same scheme to identify good local regions and used a trust region-based policy for the local phase. Wang et al. [26], on the other hand, proposed using Monte Carlo tree search to learn which subregions of the search space are more likely to contain good objective values and should be locally exploited, recursively partitioning the space based on optimization performance. The authors also showed that, when combined with TuRBO, their algorithm achieves state-of-the-art performance on a wide range of tasks. Our optimization method can replace the local optimizer in these approaches, and more generally can act as a subroutine within a larger framework relying on local optimization.
Tackling local optimization from a probabilistic angle, our method belongs to a larger class of probabilistic numerical methods; see Chapter 4 of Hennig et al. [10] for a thorough discussion of probabilistic numerics for local optimization. Within this line of research are other efforts to leverage probabilistic reasoning in optimization, including a Bayesian quasi-Newton algorithm that learns from noisy observations of the gradient [9], a probabilistic interpretation of incremental proximal methods [2], and probabilistic line searches [15].
We note that Le Roux et al. [12] arrived at an update expression similar to that of our algorithm (see Sect. 3), though with the aim of developing fast optimization algorithms with good generalization, a problem different from BO. Moreover, their derivation is devoted to justifying natural gradient descent: in particular, they show that the descent direction maximizing the probability of not increasing the generalization error is precisely the natural gradient direction.