age classification problem demonstrates the effectiveness of our
framework in capturing multiple local and global optima.
2 Related Work
2.1 Bayesian Optimization and Acquisition Functions
Among many optimization frameworks [37, 32, 41], Bayesian opti-
mization has emerged as a popular method for the sample-efficient
optimization of expensive objective functions [35, 13]. It is capable
of optimizing objective functions with no available closed-form ex-
pression, where only point-based (and possibly noisy) evaluation of
the function is possible. The sample efficiency of BO schemes has
made them suitable for domains with non-convex and costly function
evaluations. Fundamentally, BO is a sequential model-based approach
to finding a global optimum of an unknown objective function $f(\cdot)$:
$x^* = \arg\max_{x \in \mathcal{X}} f(x)$, where $\mathcal{X} \subset \mathbb{R}^d$
is a compact set. It performs a sequential search: at each iteration $k$,
it selects a new location $x_{k+1}$ at which to evaluate $f$ and observe its
value, until making a final recommendation of the best estimate of
the optimum $x^*$.
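The sequential loop described above can be sketched as follows. This is a minimal illustration, not the paper's method: the RBF kernel, the fixed candidate grid for the inner maximization, and the UCB acquisition rule are all illustrative assumptions.

```python
import numpy as np

def rbf(a, b, ls=0.2):
    # Squared-exponential (RBF) kernel on 1-D inputs (illustrative choice).
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

def gp_posterior(x_obs, y_obs, x_new, jitter=1e-6):
    # Standard GP conditional: posterior mean and std at x_new given (x_obs, y_obs).
    K = rbf(x_obs, x_obs) + jitter * np.eye(len(x_obs))
    ks = rbf(x_obs, x_new)
    mu = ks.T @ np.linalg.solve(K, y_obs)
    cov = rbf(x_new, x_new) - ks.T @ np.linalg.solve(K, ks)
    return mu, np.sqrt(np.clip(np.diag(cov), 0.0, None))

def bo_maximize(f, grid, n_init=3, n_iters=15, beta=2.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.choice(grid, size=n_init, replace=False)  # initial design
    y = f(x)
    for _ in range(n_iters):
        mu, sd = gp_posterior(x, y, grid)
        x_next = grid[np.argmax(mu + beta * sd)]      # maximize acquisition a(x)
        x = np.append(x, x_next)                      # evaluate f, update surrogate
        y = np.append(y, f(np.array([x_next]))[0])
    return x[np.argmax(y)]                            # recommend best estimate of x*
```

On a simple 1-D objective such as $f(t) = -(t-0.3)^2$ over $[0,1]$, this loop concentrates its evaluations near the optimum within a handful of iterations.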
The sequential selection is achieved through the acquisition function
$a: \mathcal{X} \to \mathbb{R}$, defined over the posterior of the GP model;
BO selects the sample in the search space with the highest acquisition
value. Upon evaluating the objective function at the selected input,
the surrogate GP model is updated. Various BO policies have been
developed, differing in their choice of acquisition function,
including expected improvement (EI) [27], knowledge
gradient (KG) [12], probability of improvement (PI) [19], upper-
confidence bounds (UCB) [20], and entropy search (ES) [15]. Recent
studies have also considered accounting for future improvements in
solution quality [44, 17, 39] and BO methods for constrained prob-
lems [2, 30]. However, existing work typically focuses on finding a
single global optimum rather than a set of local/global optima, which
is also important for multimodal objective functions and is the focus
of this paper.
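For concreteness, the acquisition functions named above have simple closed forms when evaluated from a GP posterior mean $\mu$ and standard deviation $\sigma$, with incumbent value $y_{\text{best}}$. The sketch below is a generic illustration of EI, PI, and UCB for maximization, not a particular implementation from the cited works.

```python
import math

def _pdf(z):
    # Standard normal density.
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _cdf(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sd, y_best):
    # EI [27]: expected gain over the incumbent y_best.
    z = (mu - y_best) / sd
    return (mu - y_best) * _cdf(z) + sd * _pdf(z)

def probability_of_improvement(mu, sd, y_best):
    # PI [19]: probability that the posterior exceeds the incumbent.
    return _cdf((mu - y_best) / sd)

def upper_confidence_bound(mu, sd, beta=2.0):
    # UCB [20]: optimism in the face of uncertainty.
    return mu + beta * sd
```

For example, at a point with posterior mean 0 and standard deviation 1 and incumbent 0, PI evaluates to 0.5 and EI to $\varphi(0) \approx 0.399$.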
2.2 Local Maxima in Gaussian Random Fields
A separate line of statistical applications concentrates on the tail
distribution of the heights of local maxima in non-stationary Gaus-
sian random fields, such as in peak detection problems. Such a tail
distribution is defined as the probability that the height of the local
maximum surpasses a given threshold at a point $x$, conditioned on
$x$ being a local maximum of the Gaussian process model, given by
$\Pr\big(f(x) > \xi \mid x \text{ is a local maximum}\big)$, where $\xi$
is the threshold. Existing work [6] has investigated this problem for
the Gaussian process model.
General formulae are derived in [6] for non-stationary Gaussian
fields over subsets of Euclidean space or Riemannian manifolds of
arbitrary dimension [21]; these formulae depend on local properties
of the Gaussian process model rather than on the global supremum of the field.
Although the goal of these works is to characterize certain local
properties rather than to find a set of local/global optima, their
contributions provide key intuition for the design of our new BO
framework.
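The conditional tail probability above can be illustrated empirically (this Monte Carlo sketch is our own illustration, not the analytic approach of [6]): draw sample paths from a zero-mean stationary GP prior on a 1-D grid, collect the heights of interior local maxima, and estimate the tail frequency.

```python
import numpy as np

def local_max_tail(xi, n_paths=1000, n_grid=80, ls=0.1, seed=0):
    """Monte Carlo estimate of Pr(f(x) > xi | x is a local maximum)
    for a zero-mean GP with an RBF kernel on a 1-D grid (illustrative)."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n_grid)
    K = np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / ls**2)
    L = np.linalg.cholesky(K + 1e-6 * np.eye(n_grid))  # jitter for stability
    heights = []
    for _ in range(n_paths):
        f = L @ rng.standard_normal(n_grid)            # one GP sample path
        interior = f[1:-1]
        # Interior local maxima: strictly above both grid neighbours.
        is_max = (interior > f[:-2]) & (interior > f[2:])
        heights.extend(interior[is_max])
    return float(np.mean(np.asarray(heights) > xi))
```

As expected, the estimated tail probability decreases monotonically in the threshold $\xi$, and local-maximum heights are biased upward relative to the marginal distribution of the field.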
2.3 First-Order Derivative in Bayesian Optimization
First-order derivative information has been exploited in existing BO
work, such as [25, 45]. In addition, [26] makes gradient information
available for hyperparameter tuning, and adjoint methods provide
gradients cheaply in the optimization of engineering systems [16, 31].
Gradients provide useful information about the objective function and
can help BO during the optimization process. For instance, [40] develop
a derivative-enabled knowledge-gradient algorithm by incorporating
derivative information into the GP for BO, given by
$\big(f(x_{k+1}), \nabla f(x_{k+1})\big)^\top \mid f(x_{1:k}), \nabla f(x_{1:k}) \sim \mathcal{N}\big(\big(f(x_{1:k}), \nabla f(x_{1:k})\big)^\top, \operatorname{diag}(\sigma^2)\big)$,
where $\sigma^2$ is the variance.
These methods assume the gradients of the objective function can
be queried along with the objective function during the optimization
process. GIBO [28] alternates between minimizing the variance of the
gradient estimate and moving in the direction of the expected gradient.
The subsequently proposed MPD [29] extends and refines this approach
by finding the maximum look-ahead gradient descent direction. In
contrast, we propose a BO framework that finds local optima by
computing and updating the joint distribution of the prediction and its
first-order derivative with respect to the kernel functions. Moreover,
existing gradient-based methods use gradients as additional information
to improve traditional BO models that target the global optimum,
whereas our algorithm aims to find as many local optima as possible.
Despite this similarity, we do not directly access the objective's gradient.
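The key fact underlying derivative-aware GP models is that differentiation is a linear operator, so the derivative of a GP is again a GP, and the cross-covariance between $f$ and $f'$ is obtained by differentiating the kernel. The sketch below illustrates this for a 1-D RBF kernel (an illustrative choice) and checks the analytic posterior gradient against finite differences; it is a generic illustration, not the model of [40].

```python
import numpy as np

LS = 0.3  # kernel length-scale (illustrative)

def k(a, b):
    # RBF kernel k(a, b) = exp(-(a - b)^2 / (2 * LS^2)).
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / LS**2)

def dk_db(a, b):
    # d k(a, b) / d b: cross-covariance between f(a) and f'(b).
    return k(a, b) * (a[:, None] - b[None, :]) / LS**2

def posterior_mean_and_grad(x_obs, y_obs, x_new, jitter=1e-8):
    # Joint use of the GP posterior: mean of f and mean of f' at x_new,
    # both driven by the same weights alpha = K^{-1} y.
    K = k(x_obs, x_obs) + jitter * np.eye(len(x_obs))
    alpha = np.linalg.solve(K, y_obs)
    mu = k(x_obs, x_new).T @ alpha       # posterior mean of f
    dmu = dk_db(x_obs, x_new).T @ alpha  # posterior mean of f'
    return mu, dmu
```

Because both means share the same $\alpha = K^{-1}y$, the derivative information comes for free once the kernel derivatives are available; no gradient queries of the objective itself are needed.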
3 Background
Gaussian model: The Gaussian process (GP) model is the most
commonly used model for standard BO and is also adopted by our
framework; it provides the posterior distribution of the objective
function given all past evaluations. In this part, we introduce several
preliminaries of the GP model. Considering the objective function
$f(\cdot)$ and a GP model with $k+1$ input samples $x$ of dimension
$n$, the prior of the model is:
$$f(x_{1:k+1}) \sim \mathcal{N}\big(\mu_{x_{1:k+1}}, \Sigma_{x_{1:k+1}, x_{1:k+1}}\big),$$
where $\mu_{x_{1:k+1}}$ denotes the mean of the prior and
$\Sigma_{x_{1:k+1}, x_{1:k+1}}$ the initial covariance of the $k+1$ input
samples. If the values of the first $k$ samples are known as observations
$f(x_{1:k})$, then based on the prior, the posterior of the GP model
representing the objective function at the next sampling point $x_{k+1}$
can be obtained as:
$$p_f \overset{\text{def}}{=} f(x_{k+1}) \mid f(x_{1:k}) \sim \mathcal{N}\big(\mu_{x_{k+1}}, \Sigma_{x_{k+1}, x_{k+1}}\big), \quad (1)$$
where $\mu_{x_{k+1}}$ and $\Sigma_{x_{k+1}, x_{k+1}}$ are the mean and
variance, respectively, at this step. For simplicity, we use $p_f$ as
shorthand for the posterior $f(x_{k+1}) \mid f(x_{1:k})$.
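Numerically, the posterior in Eq. (1) is the standard conditional Gaussian. The sketch below assumes a zero prior mean and an RBF kernel (both illustrative choices, not mandated by the framework).

```python
import numpy as np

def gp_mean_var(x_obs, f_obs, x_new, ls=0.5, jitter=1e-8):
    # Posterior mean mu_{x_{k+1}} and variance Sigma_{x_{k+1}, x_{k+1}}
    # of Eq. (1), for 1-D inputs under a zero-mean RBF-kernel prior.
    def k(a, b):
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)
    K = k(x_obs, x_obs) + jitter * np.eye(len(x_obs))   # prior covariance + jitter
    ks = k(x_obs, x_new)
    mu = ks.T @ np.linalg.solve(K, f_obs)               # posterior mean
    cov = k(x_new, x_new) - ks.T @ np.linalg.solve(K, ks)  # posterior covariance
    return mu, np.diag(cov)
```

A quick sanity check: at an already-observed point, the (noise-free) posterior mean reproduces the observation and the posterior variance collapses to zero.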
4 Finding Local Optima via Bayesian Optimization
We aim to develop a multimodal BO framework capable of efficiently
computing optimal solutions. Our method will achieve optimization
regarding a joint distribution containing the Gaussian posterior and
its first-order derivative: the Gaussian posterior reflects the surrogate
model as we observe the objectives in new places, and the first-order
derivative of the posterior informs us about the latent location of lo-
cal/global optima. To this end, we compute the first-order derivative
of the objective function to obtain the gradient and apply it
in our new BO framework. Since the derivative operation is linear,
we can combine the Gaussian posterior of the objective and its first-
order derivative into a joint distribution and then compute the joint
mean/variance for the acquisition functions. In this framework, we
design new acquisition functions to fulfill the needs of finding local
optima. Inspired by existing work [6], we first introduce a particular