Optimization on Manifolds via Graph Gaussian Processes
Hwanwoo Kim1, Daniel Sanz-Alonso1, and Ruiyi Yang2
1University of Chicago, 2Princeton University
Abstract
This paper integrates manifold learning techniques within a Gaussian process upper confidence
bound algorithm to optimize an objective function on a manifold. Our approach is motivated by
applications where a full representation of the manifold is not available and querying the objective
is expensive. We rely on a point cloud of manifold samples to define a graph Gaussian process
surrogate model for the objective. Query points are sequentially chosen using the posterior
distribution of the surrogate model given all previous queries. We establish regret bounds in
terms of the number of queries and the size of the point cloud. Several numerical examples
complement the theory and illustrate the performance of our method.
1 Introduction
Optimization problems on manifolds are ubiquitous in science and engineering. For instance, low-
rank matrix completion and rotational alignment of 3D bodies can be formulated as optimization
problems over spaces of matrices that are naturally endowed with manifold structures. These matrix
manifolds belong to agreeable families [56] for which Riemannian gradients, geodesics, and other
geometric quantities have closed-form expressions that facilitate the use of Riemannian optimization
algorithms [19, 1, 9]. In contrast, this paper is motivated by optimization problems where the search
space is a manifold that the practitioner can only access through a discrete point cloud representation,
preventing direct use of Riemannian optimization algorithms. Moreover, the hidden manifold may
not belong to an agreeable family, further hindering the use of classical methods. Illustrative
examples where manifolds are represented by point cloud data include computer vision, robotics,
and shape analysis of geometric morphometrics [33, 23, 25]. Additionally, across many applications
in data science, high-dimensional point cloud data contains low-dimensional structure that can be
modeled as a manifold for algorithmic design and theoretical analysis [14, 3, 27]. Motivated by these
problems, this paper introduces a Bayesian optimization method with convergence guarantees to
optimize an expensive-to-evaluate function on a point cloud of manifold samples.
To formalize our setting, consider the optimization problem
    maximize f(x),   x ∈ ℳ_N,    (1.1)
where ℳ_N = {x_i}_{i=1}^N is a collection of samples from a compact manifold ℳ ⊂ ℝ^d. We assume that
the manifold ℳ is unknown to the practitioner, but that they have access to the samples ℳ_N.
The objective function f in (1.1) is defined on the hidden manifold ℳ; however, since ℳ is unknown,
we restrict the search domain to the given point cloud ℳ_N. Motivating examples include locating
the portion of busiest traffic along a highway (idealized as a one-dimensional manifold), or finding
the point of highest temperature on an artificial surface for material design. In these and other
applications, the search domains are manifolds for which only a discrete representation may be
available. As a result, Riemannian optimization methods [19, 1, 9, 34, 56] that require Riemannian
gradients or geodesics are not directly applicable.
Although the optimization problem (1.1) is discrete, it is challenging when the objective function
f is expensive to evaluate due to computational, monetary, or opportunity costs. For instance,
querying f may involve numerically solving a system of partial differential equations, placing a
sensor at a new location, or time-consuming human labor. In such cases, solving (1.1) by exhaustive
search over ℳ_N is infeasible for large N, and it is important to design optimization algorithms that
provably require fewer evaluations of the objective than the size N of the point cloud. Solving (1.1) is
also challenging in applications where the objective function does not satisfy structural assumptions
(e.g., concavity or linearity) other than a sufficient degree of smoothness, and in applications where
f is a black box, in that one only has access to noisy output from f rather than to an analytic
expression of the function. We refer to [22] for a survey of problems where these conditions arise.
Motivated by these geometric and computational challenges, we introduce an approach to solve
(1.1) that works directly on the point cloud ℳ_N and necessitates few evaluations of the objective.
In particular, we show that in the large N limit and under suitable smoothness assumptions, our
method provably requires far fewer evaluations of the objective than the size N of the point cloud.
Our algorithm falls in the general framework of Bayesian optimization and is specifically designed to
achieve such a convergence guarantee. The main focus will be on the mathematical analysis of the
proposed approach, but we also present simulation studies to illustrate and complement our theory.
1.1 Overview of our Approach
The facts that gradients are not available and that evaluating the objective is expensive naturally
lead us to adopt a Bayesian optimization (BO) approach to solve (1.1). BO is an iterative
procedure that relies on solving a sequence of surrogate optimization problems to sidestep the need
for gradient information on f. At each iteration, the surrogate problem is to optimize an acquisition
function defined using a probabilistic model of the objective function conditioned on previous iterates.
The acquisition function should be inexpensive to evaluate and optimize, and at the same time
provide useful information about where the optimizer of f is most likely to lie. The probabilistic
model should be sufficiently rich to adequately represent the objective function. Many choices of
acquisition function have been proposed in the literature, including expected improvement, entropy
search, and knowledge gradient (see [22] for a review). Popular probabilistic models for f include
Gaussian processes [54, 30] and Bayesian additive regression trees [13]. Adequately choosing the
acquisition function and the probabilistic model is essential to the success of BO algorithms.
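As a concrete point of reference, the following is a minimal sketch of the generic BO loop just described, specialized to a finite candidate set. The helper names (query, posterior_mean_std, acquisition) are illustrative placeholders for the noisy objective evaluation, the surrogate-model update, and the acquisition rule made precise in Section 2; they are not part of the paper's notation.

import numpy as np

def bo_loop(candidates, query, posterior_mean_std, acquisition, num_queries, seed=0):
    # Generic Bayesian optimization over a finite candidate set (illustrative sketch).
    # candidates: (N, d) array of admissible query points; query(z) returns a noisy evaluation.
    rng = np.random.default_rng(seed)
    first = candidates[rng.integers(len(candidates))]        # initialize with a randomly chosen query
    queried, observations = [first], [query(first)]
    for _ in range(num_queries - 1):
        # Update the surrogate given all previous queries, then maximize the acquisition function.
        mean, std = posterior_mean_std(np.array(queried), np.array(observations), candidates)
        z_next = candidates[int(np.argmax(acquisition(mean, std)))]
        queried.append(z_next)
        observations.append(query(z_next))
    # Report the queried point with the largest (noisy) observed value.
    return queried[int(np.argmax(observations))]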
The BO method that we propose and analyze has the distinctive feature that both the probabilistic
model and the acquisition function are carefully chosen to ensure convergence of the returned solution
to a global maximizer of f under suitable smoothness assumptions. A natural way to characterize
the smoothness of f is to assume it is a sample path from a Gaussian process (GP) defined on ℳ.
Under this smoothness assumption, we adopt a graph GP model [44, 8] for f|_{ℳ_N}, the restriction of
f to the point cloud. The graph GP is designed to be a discretely indexed GP that approximates
a Matérn or squared exponential GP on the hidden manifold ℳ as the size of the point cloud
grows to infinity. Applications of graph GPs in Bayesian inverse problems, spatial statistics, and
semi-supervised learning are discussed in [44, 28, 31, 32]. In this paper, we extend the convergence
analysis for Matérn graph GP models in [44, 45, 27, 24] to also cover squared exponential kernels,
see Proposition 2.3.
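For readers unfamiliar with graph GPs, the following is a minimal sketch of one common construction of such a prior: a covariance matrix of the form (τ I + Δ_N)^{−s}, where Δ_N is a graph Laplacian assembled from a proximity graph on the point cloud. The weighting scheme, normalization, and parameter values below are illustrative assumptions only; the construction and scaling actually used by our method are specified in Subsection 2.3.

import numpy as np
from scipy.spatial.distance import cdist

def graph_matern_covariance(points, eps, tau=1.0, s=2.0):
    # Matern-type graph GP covariance on a point cloud (illustrative sketch, not the paper's exact scaling).
    # points: (N, d) array of manifold samples; eps: connectivity radius; tau, s: assumed hyperparameters.
    dists = cdist(points, points)                        # pairwise Euclidean distances
    W = np.exp(-(dists / eps) ** 2) * (dists < eps)      # localized edge weights on an eps-neighborhood graph
    np.fill_diagonal(W, 0.0)
    laplacian = np.diag(W.sum(axis=1)) - W               # unnormalized graph Laplacian Delta_N
    evals, evecs = np.linalg.eigh(laplacian)             # spectral decomposition of Delta_N
    return (evecs * (tau + evals) ** (-s)) @ evecs.T     # C_N = (tau I + Delta_N)^(-s)

# A prior draw u_N ~ N(0, C_N) is then a discretely indexed GP over the point cloud:
# C_N = graph_matern_covariance(points, eps=0.5)
# u_N = np.random.default_rng(0).multivariate_normal(np.zeros(len(points)), C_N)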
The error analysis in Proposition 2.3 is important since it allows us to quantify the misspecification
error when modeling f|_{ℳ_N} with a graph GP. In particular, the model that we use for computation does not
necessarily match the true distribution of f|_{ℳ_N} due to the missing information about ℳ; to obtain
convergence guarantees, this geometric misspecification needs to be corrected by suitably choosing
the acquisition function. We accomplish this goal by applying the framework developed in [5]. In so
doing, we adapt their formulation to cover our problem setting, where f is a sample path from a
GP instead of an element of a reproducing kernel Hilbert space.
1.2 Contributions and Related Work
Our careful choice of probabilistic model and acquisition function allows us to establish a bound on
the simple regret (see (2.3) for its definition) that converges to zero as the number L of evaluations
of the objective and the size N of the point cloud converge to infinity while keeping the relation
L ≪ N (see Theorem 2.5, Remark 2.6, and Corollary 2.7). In other words, our algorithm can
provably find a global maximizer of f as we acquire more samples from the compact manifold ℳ,
while still keeping the number of evaluations of the objective much smaller than the size of
the point cloud. We are not aware of an existing algorithm to solve (1.1) that enjoys a similar
convergence guarantee. Synthetic computed examples will complement the theory, illustrate the
applicability of our method, and showcase the importance of incorporating geometric information in
the probabilistic model.
As noted in [22], BO algorithms have been most popular in continuous Euclidean domains.
Methods that are tailored to manifold settings [35, 36] and discrete spaces [2, 42, 49, 16] have
received less attention. On the one hand, the search domain in our setting (1.1) is a discrete subset
of a manifold, but naive application of discrete BO (e.g., using a standard Euclidean GP on the
ambient space ℝ^d) would fail to adequately exploit the geometric information contained in the point
cloud; in particular, it would fail to suitably encode smoothness of the probabilistic model for f
along the hidden manifold ℳ. The empirical advantage of our graph GPs over Euclidean kernels will
be illustrated in our numerical experiments (see Subsection 4.2). On the other hand, the manifold ℳ
in our setting is only available as a point cloud, which precludes the use of manifold BO approaches
[35, 36] that require access to geodesic distances and eigenpairs of the Laplace-Beltrami operator on
ℳ for modeling f, and to Riemannian gradients for optimizing the acquisition function. Therefore,
our algorithm solves a practical problem for which limited tools with theoretical guarantees are
available. In the context of Riemannian optimization, our algorithm is still applicable when the
differential geometric quantities necessary for gradient-based methods are not readily available. A
closely related work in this direction is [47], which also assumes a point cloud representation of the
manifold but instead reconstructs from it tangent spaces, gradients, and retractions, followed by
an approximate Riemannian gradient descent. Our paper differs from [47] in that our algorithm is
based on Bayesian optimization and no gradient approximation is carried out, as a result of which
we do not need to assume the point cloud to be quasi-uniform. Going beyond manifold constraints,
optimization of functions with low effective dimensionality has been addressed in [52, 38, 11, 12]
employing subspace methods (see also the references therein).
1.3 Outline
Section 2 introduces the graph Gaussian process upper confidence bound (GGP-UCB) algorithm
and describes the choice of surrogate model and acquisition function. Our main result, Theorem
2.5, establishes convergence rates.
Section 3 discusses important practical considerations such as estimating the parameters of
the surrogate model and tuning the acquisition function.
Section 4 contains numerical examples that illustrate and complement the theory.
Section 5 closes with a summary of our paper and directions for further research.
The proofs of our main results can be found in the appendices.
1.4 Notation
For a, b two real numbers, we denote a ∧ b = min{a, b} and a ∨ b = max{a, b}. The symbol ≲ will
denote less than or equal to up to a universal constant. For two real sequences {a_i} and {b_i}, we
denote (i) a_i ≪ b_i if lim_i (a_i/b_i) = 0; (ii) a_i = O(b_i) if lim sup_i (a_i/b_i) ≤ C for some positive constant
C; and (iii) a_i ≍ b_i if c_1 ≤ lim inf_i (a_i/b_i) ≤ lim sup_i (a_i/b_i) ≤ c_2 for some positive constants c_1, c_2.
2 The GGP-UCB Algorithm
In this section we introduce our algorithm and establish convergence guarantees. We start in
Subsection 2.1 by formalizing the problem setting. Subsection 2.2 describes the main GGP-UCB
algorithm. The choices of surrogate model and acquisition function are discussed in Subsections 2.3
and 2.4, respectively. Finally, Subsection 2.5 presents our main theoretical result, Theorem 2.5.
2.1 Problem Formulation
Let f be a function defined over a compact Riemannian submanifold ℳ ⊂ ℝ^d of dimension m.
Suppose that a full representation of ℳ is not available and we are only given the dimension m
and a point cloud of manifold samples {x_i}_{i=1}^N =: ℳ_N ⊂ ℳ. We are interested in solving the
optimization problem

    max_{x ∈ ℳ_N} f(x)    (2.1)

in applications where the objective f is expensive to evaluate and we may only collect L ≪ N noisy
measurements y_ℓ of the form

    y_ℓ = f(z_ℓ) + η_ℓ,   η_ℓ ∼ 𝒩(0, σ²) i.i.d.,   1 ≤ ℓ ≤ L,    (2.2)

where {z_ℓ}_{ℓ=1}^L are query points and σ is a given noise level. The goal is then to solve (2.1) with
L ≪ N queries of f.
Let 𝒵_L := {z_ℓ}_{ℓ=1}^L ⊂ ℳ_N denote the query points sequentially found by our algorithm,
introduced in Subsection 2.2 below. We shall quantify the performance of our approach using the
simple regret, defined as

    r_{N,L} := f(z^*_N) − f(z^*_L),   z^*_N = arg max_{z ∈ ℳ_N} f(z),   z^*_L = arg max_{z ∈ 𝒵_L} f(z).    (2.3)

Note that the simple regret depends both on the number L of queries and on the size N of the
point cloud, since z^*_N and z^*_L both depend implicitly on N. One should interpret N as a large
fixed number and L as the running index. The dependence on N of the query points z_ℓ's will be
omitted for notational simplicity.
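In synthetic benchmarks, where f can afterwards be evaluated over the whole point cloud, the simple regret (2.3) is just a difference of two maxima; a minimal sketch (the array names are illustrative):

import numpy as np

def simple_regret(f_values_cloud, queried_indices):
    # Simple regret r_{N,L} of Eq. (2.3): best value of f over the point cloud minus best value among queries.
    # f_values_cloud: f(x_i) over all of M_N (available only for benchmarking); queried_indices: indices of Z_L.
    f_values_cloud = np.asarray(f_values_cloud)
    return np.max(f_values_cloud) - np.max(f_values_cloud[np.asarray(queried_indices)])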
Remark 2.1. The optimizer z^*_N over the point cloud ℳ_N is not necessarily the global optimizer of
f over ℳ. Since we only have access to ℳ_N, finding the maximizer over ℳ_N is the best we can
hope for without reconstructing or estimating the hidden manifold ℳ. Nevertheless, we will show
in Corollary 2.7 that the continuum regret, defined as

    r^cont_{N,L} := f(z^*) − f(z^*_L),   z^* = arg max_{z ∈ ℳ} f(z),   z^*_L = arg max_{z ∈ 𝒵_L} f(z),    (2.4)

also converges to zero as both N and L approach infinity while keeping L ≪ N if the x_i's satisfy
Assumption 2.2. In other words, the maximizer z^*_L returned by our algorithm is an approximate
global maximizer of f over ℳ despite the fact that z^*_L ∈ ℳ_N.
2.2 Main Algorithm
The Bayesian approach to optimization starts by constructing a GP model for the function to be
optimized. We recall that a GP with mean μ(·) and covariance c(·, ·) is a stochastic process where the
joint distribution over any finite set of indices s_1, . . . , s_n is a multivariate Gaussian with mean vector
[μ(s_i)]_{i=1}^n and covariance matrix [c(s_i, s_j)]_{i,j=1}^n [54]. The mean and covariance functions together
encode information about the values of the function, their correlation, and their uncertainty.
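For instance, given a mean function μ and covariance function c, drawing the process at finitely many indices amounts to sampling the corresponding multivariate Gaussian; a minimal sketch with an illustrative squared exponential covariance on the real line:

import numpy as np

def sample_gp(mu, c, indices, seed=0):
    # One realization of a GP with mean mu(.) and covariance c(.,.) at a finite set of indices.
    mean_vec = np.array([mu(s) for s in indices])                          # [mu(s_i)]_{i=1}^n
    cov_mat = np.array([[c(s, t) for t in indices] for s in indices])      # [c(s_i, s_j)]_{i,j=1}^n
    return np.random.default_rng(seed).multivariate_normal(mean_vec, cov_mat)

# Example (illustrative choices of mean and covariance):
# sample = sample_gp(lambda s: 0.0, lambda s, t: np.exp(-0.5 * (s - t) ** 2), np.linspace(0, 1, 50))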
In our setting, we need to construct a GP surrogate prior model π_N for f|_{ℳ_N}, where π_N would
simply be an N-dimensional multivariate Gaussian. A natural requirement is that, for u_N ∼ π_N,
u_N(x_i) and u_N(x_j) should be highly correlated iff x_i and x_j are close along the manifold, that is,
if the geodesic distance d_ℳ(x_i, x_j) is small. We shall discuss in Subsection 2.3 prior models π_N
that fulfill this requirement. Defining the covariance matrix of π_N by using a standard covariance
function in the Euclidean space ℝ^d would in general fail to meet this requirement, since two points
may be close in Euclidean space but far apart in terms of the geodesic distance d_ℳ in ℳ.
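This failure mode is easy to reproduce numerically: on a point cloud sampled from a curved manifold, pairs of points can be nearby in ℝ^d yet far apart along the manifold, so a Euclidean kernel assigns them a spuriously large correlation. The sketch below contrasts Euclidean distances with shortest-path distances on an ε-neighborhood graph, a standard discrete proxy for d_ℳ; the spiral example and all parameter values are purely illustrative.

import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import shortest_path

# Point cloud on a planar spiral: the two endpoints are moderately close in R^2
# but very far apart along the curve.
t = np.linspace(2 * np.pi, 6 * np.pi, 400)
points = np.column_stack([t * np.cos(t), t * np.sin(t)])

euclidean = cdist(points, points)
eps = 2.0                                                   # connectivity radius (illustrative)
adjacency = np.where(euclidean < eps, euclidean, 0.0)       # weighted eps-neighborhood graph
graph_geodesic = shortest_path(adjacency, directed=False)   # graph proxy for the geodesic distance

i, j = 0, len(points) - 1
print(f"Euclidean distance:      {euclidean[i, j]:.1f}")        # about 12.6
print(f"Graph geodesic distance: {graph_geodesic[i, j]:.1f}")   # about 158, an order of magnitude larger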
Once a choice of surrogate prior model is made, the next step is to sequentially find query points
by maximizing an acquisition function [48]. Suppose we have picked query points z_1, . . . , z_{ℓ−1} in the
first ℓ−1 iterations and obtained noisy measurements

    y_k = f(z_k) + η_k,   η_k ∼ 𝒩(0, σ²) i.i.d.,   1 ≤ k ≤ ℓ−1.    (2.5)

At the ℓ-th iteration, we will pick the next query point z_ℓ by maximizing an upper confidence bound
acquisition function [48, 5] of the form

    A_{N,ℓ}(z) = μ_{N,ℓ−1}(z) + B_{N,ℓ} σ_{N,ℓ−1}(z),   z ∈ ℳ_N,    (2.6)

where B_{N,ℓ} is a user-chosen parameter, and μ_{N,ℓ−1}, σ_{N,ℓ−1} are the mean and standard deviation
of the posterior distribution π_N(· | y_1, . . . , y_{ℓ−1}). Denoting by c_N(·, ·) the covariance function of the
surrogate prior π_N, i.e., c_N(x_i, x_j) is the covariance between u_N(x_i) and u_N(x_j) for u_N ∼ π_N, we
have the expressions

    μ_{N,ℓ−1}(z) = c_{N,ℓ−1}(z)^⊤ (C_{N,ℓ−1} + σ² I)^{−1} Y_{ℓ−1},
    σ²_{N,ℓ−1}(z) = c_N(z, z) − c_{N,ℓ−1}(z)^⊤ (C_{N,ℓ−1} + σ² I)^{−1} c_{N,ℓ−1}(z),   z ∈ ℳ_N,    (2.7)

where Y_{ℓ−1} = (y_1, . . . , y_{ℓ−1})^⊤ ∈ ℝ^{ℓ−1}, c_{N,ℓ−1}(z) ∈ ℝ^{ℓ−1} is a vector with entries [c_{N,ℓ−1}(z)]_i =
c_N(z, z_i), and C_{N,ℓ−1} ∈ ℝ^{(ℓ−1)×(ℓ−1)} is a matrix with entries (C_{N,ℓ−1})_{ij} = c_N(z_i, z_j).
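The expressions in (2.6) and (2.7) translate directly into code once the surrogate prior covariance over the point cloud is stored as an N × N matrix C_N (for instance, a graph GP covariance). The sketch below is a transcription of these formulas only, with indexing into the point cloud replacing function evaluation since the surrogate is discretely indexed; it does not prescribe the specific choice of B_{N,ℓ} analyzed later.

import numpy as np

def posterior_mean_std(C_N, queried_idx, y, sigma):
    # Posterior mean and standard deviation of Eq. (2.7) for a zero-mean prior N(0, C_N),
    # given noisy observations y at the point-cloud indices queried_idx (at least one observation).
    C_zz = C_N[np.ix_(queried_idx, queried_idx)]          # C_{N,l-1}, shape (l-1, l-1)
    C_xz = C_N[:, queried_idx]                            # rows are c_{N,l-1}(z)^T, shape (N, l-1)
    A = C_zz + sigma ** 2 * np.eye(len(queried_idx))
    mean = C_xz @ np.linalg.solve(A, y)                                            # first line of (2.7)
    var = np.diag(C_N) - np.einsum('ij,ji->i', C_xz, np.linalg.solve(A, C_xz.T))   # second line of (2.7)
    return mean, np.sqrt(np.maximum(var, 0.0))

def ucb_acquisition(mean, std, B):
    # Upper confidence bound acquisition of Eq. (2.6), evaluated at every point of the cloud.
    return mean + B * std

# One UCB step over the point cloud (sketch): pick the index maximizing the acquisition.
# mean, std = posterior_mean_std(C_N, queried_idx, y, sigma)
# next_idx = int(np.argmax(ucb_acquisition(mean, std, B=2.0)))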