Equispaced Fourier representations for efficient Gaussian
process regression from a billion data points
Philip Greengard*¹, Manas Rachh², and Alex Barnett²
¹Department of Statistics, Columbia University, New York, NY 10027
²Center for Computational Mathematics, Flatiron Institute, New York, NY 10010
May 19, 2023
Abstract
We introduce a Fourier-based fast algorithm for Gaussian process regression in low dimensions. It approximates a translationally-invariant covariance kernel by complex exponentials on an equispaced Cartesian frequency grid of M nodes. This results in a weight-space M × M system matrix with Toeplitz structure, which can thus be applied to a vector in O(M log M) operations via the fast Fourier transform (FFT), independent of the number of data points N. The linear system can be set up in O(N + M log M) operations using nonuniform FFTs. This enables efficient massive-scale regression via an iterative solver, even for kernels with fat-tailed spectral densities (large M). We provide bounds on both kernel approximation and posterior mean errors. Numerical experiments for squared-exponential and Matérn kernels in one, two and three dimensions often show 1–2 orders of magnitude acceleration over state-of-the-art rank-structured solvers at comparable accuracy. Our method allows 2D Matérn-3/2 regression from N = 10^9 data points to be performed in 2 minutes on a standard desktop, with posterior mean accuracy 10^{-3}. This opens up spatial statistics applications 100 times larger than previously possible.
1 Introduction
Gaussian process (GP) regression is ubiquitous in spatial statistics and machine learning, due in large part to its high predictive performance and rigorous foundation in Bayesian inference [4, 8, 32, 36, 44, 45, 51]. The goal is to recover a stochastic function with a certain covariance kernel, given noisy pointwise observations. More precisely, one is given N points x_1, …, x_N ∈ R^d, with corresponding data observations y_1, …, y_N which are noisy samples from a function f with a zero-mean Gaussian process distribution. This is denoted by

    y_n = f(x_n) + ε_n,   n = 1, …, N,    (1)
    f(x) ~ GP(0, k(x, x')),               (2)

where ε_n ~ N(0, σ²) is independent and identically distributed (iid) noise of known variance σ² > 0, and k : R^d × R^d → R is a given covariance kernel. We address the common translationally invariant case where k(x, x') = k(x − x'), using these notations interchangeably.
*pg2118@columbia.edu. Research supported by the Alfred P. Sloan Foundation.
This covers a wide range of statistical problems, where k is generally chosen to belong to one of a handful of families of kernels such as Matérn, squared-exponential (SE), or rational quadratic. In this paper we present a regression method efficient in low dimensions d (say d ≤ 5), targeting applications in time series (d = 1) and spatial/geostatistics (d = 2 or 3); we do not address high-dimensional machine learning tasks.
In the GP prior model the vector {f(x_n)}_{n=1}^N is a zero-mean multivariate Gaussian with N × N covariance matrix K with elements K_nl := k(x_n, x_l). Then, given data y := {y_n}_{n=1}^N, the regression task is to compute the (necessarily Gaussian) posterior of f(x), at new target points x ∈ R^d. A standard calculation [44, §2.2] (conditioning an (N+1)-dimensional multivariate Gaussian on the known subset of N variables) gives the posterior mean of f(x) as

    μ(x) = Σ_{n=1}^{N} α_n k(x, x_n),    (3)

where α := {α_n}_{n=1}^N is the vector in R^N uniquely solving the N × N symmetric linear system

    (K + σ²I) α = y,    (4)

I being the N × N identity matrix. Computing μ(x), the main goal of this paper, is equivalent to kernel ridge regression [36, Ch. 18.3.7].
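To make (3)–(4) concrete, the following is a minimal dense-solve sketch in Python/NumPy for a toy 1D squared-exponential kernel (the kernel, length scale, and data are our own illustrative choices, not the authors' MATLAB/Octave code). Its O(N³) cost is exactly what the fast methods discussed below, and in this paper, aim to avoid.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma, ell = 500, 0.1, 0.2              # data size, noise level, kernel length scale

# Toy 1D data: noisy samples of an unknown function, as in eq. (1).
x = rng.uniform(0.0, 1.0, N)
y = np.sin(2 * np.pi * x) + sigma * rng.standard_normal(N)

# Squared-exponential covariance kernel k(x, x') = exp(-|x - x'|^2 / (2 ell^2)).
def kern(a, b):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

# Solve the N x N symmetric system (K + sigma^2 I) alpha = y of eq. (4) by Cholesky.
K = kern(x, x)
L = np.linalg.cholesky(K + sigma**2 * np.eye(N))
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

# Posterior mean at new targets, eq. (3): mu(x*) = sum_n alpha_n k(x*, x_n).
xstar = np.linspace(0.0, 1.0, 9)
mu = kern(xstar, x) @ alpha
print(np.round(mu, 3))                     # close to sin(2 pi x*), up to the noise level
```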
Remark 1 (variance evaluation). Formulae similar to (3)–(4) exist for the variance s(x) (or covariance between targets) [44, (2.26)] [25, §1–2]. Unlike the mean, they demand a new linear solve for each target. Our proposed accelerated method would also apply to such cases, with scaling as in the last column of Table 1; for space reasons we leave this for future study.
The main obstacle to the use of GP regression as a large-scale practical tool is the solution of linear systems such as (4). Dense direct solution (e.g., by Cholesky factorization of K + σ²I) costs O(N³) operations, limiting N to order 10^4 on a single machine, prohibitive for even a moderately small modern data set. Thus much literature has emerged on more scalable methods. While most of these tackle the above N × N linear system (this is called the "function space" approach), others solve a different linear system for basis coefficients ("weight space") [44, §2] [36, Ch. 18].
1.1 Prior work in GP regression in low dimensions
Much prior work has applied iterative solvers, usually conjugate gradients (CG) [20, §10.2], to the N × N function-space linear system (4). This requires multiplications of arbitrary vectors by the matrix K + σ²I. These may be performed densely with O(N²) cost per iteration [19]; for a brute-force parallel implementation see [54]. However, most approaches instead apply an approximation to K which exploits the low-rank structure of covariance matrices.¹ The kernel approximation k(x − x') ≈ k̃(x − x') := Σ_{j=1}^{M} φ_j(x) φ_j(x')*, for a set of M basis functions φ_j, leads to a global factorization K ≈ K̃ := ΦΦ*, whose rank is thus at most M.

¹These are sometimes called "sparse" GPs [27, 32].
method      | error (SE kernel) | precomputation            | linear solve          | mean at q points   | variance at q points
direct      | ε_mach            | N²                        | N³                    | Nq                 | N²q
SKI [59]    | M^{-3/d}          | —                         | n_iter(N + M log M)   | Nq                 | n_iter(N + M log M)q
FLAM [35]   | set by ε          | max[N log N, N^{3-3/d}]   | N log N               | (N + q) log(N + q) | Nq log N
RLCM [7]    | set by rank       | N                         | N                     | N + q log N        | N + q log N
PG '21 [24] | e^{-γM^{1/d}}     | N + M² log M              | M³                    | q + M log M        | M²q
EFGP        | e^{-γM^{1/d}}     | N + M log M               | n_iter M log M        | q + M log M        | n_iter M q log M

Table 1: Error and computational complexity of GP regression for different algorithms, for N points defined on R^d, with regression to q new targets. All entries are O(·); constants are not compared. For "SKI," M denotes the number of inducing points; otherwise M is the number of Fourier nodes. For "EFGP," the proposal in this paper, M = (2m + 1)^d, since 2m + 1 nodes are used in each dimension. n_iter is the number of iterations in an iterative solver, ε is a matrix compression tolerance, and ε_mach is machine error. "Error" refers to kernel approximation, not downstream prediction errors that also scale with N and σ. SE denotes squared-exponential.
Perhaps the most popular examples of this are the Nyström method [37, 44], inducing point methods [42, 48], and subset of regressors (SoR) [44, §8.3.1]. These factorizations often involve a subsampling of rows and columns of K, so that the bases are themselves kernels centered at a size-M subset of the given points. Their cost is typically O(NM²). Building on inducing points, "structured kernel interpolation" (SKI) [59] (similar to precorrected FFT in PDE solvers [41]) replaces these x_{n_j} by a regular spatial product grid of interpolation points; the resulting Toeplitz structure allows a fast matrix-vector multiply via the d-dimensional FFT, giving the much improved scaling O(N + M log M) per iteration (see Table 1). Recently, analytic expansions have improved upon its speed at high accuracy up to moderate d [46].
Most such iterative methods fit into the framework of "black box" matrix-vector multiplications (BBMM) [18] with K. However, their disadvantage is that each iteration costs at least O(N), while poor condition numbers mean that many iterations are required. Preconditioning can help in certain settings [18, 52], but is far from a panacea for GP regression [10]. Furthermore, the cost per posterior mean evaluation at each new target is again O(N). In 2015 it was stated that "with sparse approximations it is inconceivable to apply GPs to training set sizes of N ≥ O(10^7)" [12]. The largest data sizes tested on a desktop machine to date in this framework are around N ≈ 10^6 [18, 32, 54].
We now turn to prior work in direct solvers, which have the advantage of immunity to ill-conditioning. Simple truncation of the kernel beyond a certain range sparsifies K, allowing a sparse direct solution [17], but this can only be effective for short-range kernels. In dimension d = 1 (e.g., time series), certain kernels make the upper and lower triangular halves of K low rank; this "semiseparable" structure allows a direct solution in O(N) time [16]. Recently, fast direct solvers have been successfully applied to the function-space linear system in low dimension d > 1 [1, 5, 7, 29, 35]. Going beyond the above "one-level" compression for K, these exploit hierarchical (multi-scale) low-rank structure to compute a compressed approximate representation of (K + σ²I)^{-1}, which may be quickly (in O(N) or O(N log N) time) applied to new vectors y in (4). For instance, in HODLR (hierarchically off-diagonal low-rank) structure, at each scale off-diagonal blocks are numerically low rank, while the remaining diagonal blocks themselves have HODLR structure at a finer scale [1]. Refinements of this structure (e.g., HSS, H, H², and HBS), as well as randomized linear algebra and "proxy point" sampling [33, 35], are needed to achieve low ranks and fast factorizations. These build upon recent advances in fast direct solvers for various kernels arising in integral equations and PDEs [26, 33], which in turn exploit adaptive domain decomposition and far-field expansion ideas from tree codes and the fast multipole method [23]. One disadvantage of fast direct solvers is that the ranks grow rapidly with d, and that compression times scale quadratically with the ranks [1]. Another is their high RAM usage. The largest N is around 2 × 10^6 in recent single workstation examples [1, 7, 35].
In contrast to all of the above function-space methods, weight-space methods avoid solving the N × N linear system [42]. Recalling the low-rank kernel approximation K ≈ K̃ = ΦΦ*, they instead solve a dual system involving the smaller M × M matrix Φ*Φ. We recap the equivalence of these two approaches in Section 2. The needed rank (basis size) M depends on the kernel, dimension d, and desired accuracy. An advantage is that mean prediction at new targets involves only evaluating the basis expansion, an O(M) cost independent of the data size N. However, traditional weight-space methods (including SoR and "sparse"/inducing variable methods [42, 48]) cost O(NM²) to form Φ*Φ, again prohibitive for large N.
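As a concrete (though naive, O(NM²)) illustration of this function-space/weight-space equivalence, the sketch below builds an arbitrary real feature map φ, solves both the N × N system (4) with K = ΦΦᵀ and the dual M × M system (ΦᵀΦ + σ²I)β = Φᵀy, and checks that the two posterior means agree; the feature map and data are our own toy choices, not those of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, sigma = 2000, 64, 0.1

# Toy 1D data and an arbitrary real feature map phi (any basis with
# k(x, x') ~ sum_j phi_j(x) phi_j(x') would do; this one is purely illustrative).
x = rng.uniform(-1.0, 1.0, N)
y = np.sin(3 * x) + sigma * rng.standard_normal(N)
freqs = 3.0 * rng.standard_normal(M)
def phi(pts):                                # (npts, M) feature matrix
    return np.cos(np.outer(pts, freqs)) / np.sqrt(M)

Phi = phi(x)

# Function space: (K + sigma^2 I) alpha = y with K = Phi Phi^T; O(N^3).
alpha = np.linalg.solve(Phi @ Phi.T + sigma**2 * np.eye(N), y)

# Weight space: (Phi^T Phi + sigma^2 I) beta = Phi^T y; O(N M^2 + M^3).
beta = np.linalg.solve(Phi.T @ Phi + sigma**2 * np.eye(M), Phi.T @ y)

# The posterior means at new targets agree; weight space costs only O(M) per target.
xt = np.linspace(-1.0, 1.0, 7)
mu_function_space = phi(xt) @ (Phi.T @ alpha)   # = sum_n alpha_n k(x_t, x_n)
mu_weight_space = phi(xt) @ beta
print(np.max(np.abs(mu_function_space - mu_weight_space)))   # agreement to rounding error
```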
Recently it has been realized that Fourier (real trigonometric or complex exponential) basis functions φ_j(x) are a natural weight-space representation for translationally invariant kernels (for a review of such spectral representations see [27]). In early work their frequencies were chosen stochastically [43], or optimized as part of the inference [31]. Instead, Hensman et al. [27] proposed regular (equispaced) frequencies, following [39]. This allowed regression (in a separable model, i.e., effectively d = 1) from N = 6 × 10^6 points in around 4 minutes. However, since direct Fourier summation was used, the cost was still O(NM).
Recently the first author introduced a weight-space method for d = 1 with Fourier frequencies chosen at the nodes of a high-order accurate quadrature scheme for the Fourier integral [24]. Since smooth kernels (such as SE or Matérn with larger ν) have rapidly decaying Fourier transforms, truncation to a small M can approximate K̃ ≈ K to many digits of accuracy, giving essentially "exact" inference. That work also proposed the nonuniform FFT (NUFFT [14]) of type 3 for filling of the weight-space system matrix and evaluation at targets, leading to a solution cost O(N + M³); see Table 1. This allowed N = 10^8 to be handled in a few seconds.
Finally, for d > 1, the only way to handle N ≫ 10^6 currently is via distributed computing [32]. There, methods include approximation by mixtures (a "committee" [44, §8.3.5]) of fixed-size GP solutions [12], distributed gradient descent [40], and multi-CPU Cholesky preconditioning [34]. For instance, in studies with N = 10^9 points (in d = 9), convergence took around 1 hour, either on 1000 CPU cores [40, Sec. 6.3], or 2 GPUs [34].
1.2 Our contribution
We present a weight-space iterative solver for general low dimension using Fourier frequencies (features) on an equispaced d-dimensional grid, as in [27]. The equal spacing makes the M-by-M system matrix Toeplitz (or its generalization for d > 1) up to diagonal scalings, enabling its multiplication with a vector via the padded FFT, to give O(M log M) cost per iteration. (Although SKI [59] also uses an FFT, the grid there is in physical space, forcing an O(N) cost per iteration as with all BBMM methods.) The result is that large M (e.g., 10^6) may be tackled (unlike in [24, 27]). This is a necessity for higher dimension since M = O(m^d), where m controls the number of grid points per dimension. Furthermore, building on [24], we fill the vector defining the Toeplitz matrix, and the right-hand side, via a modern implementation of the NUFFT [3] with O(N + M log M) cost. Since there is only a single pass through the data, a standard desktop can handle very large N, around 100 times larger than any prior work of which we are aware in d = 2.
We benchmark our equispaced Fourier (EFGP) proposal against three popular methods for GP regression that have desirable theoretical properties and available software: SKI of Wilson et al. [59] implemented in GPyTorch [18], an implementation of the fast direct solver of Minden et al. [35] using FLAM [28], and the linear-scaling direct solver RLCM of Chen and Stein [7]. Their theoretical scalings are compared in Table 1. Our synthetic data tests in dimensions d = 1, 2, and 3 explore the tradeoff of accuracy vs CPU time for all four methods, and show that EFGP is very competitive for the very smooth squared-exponential kernel, and remains competitive even for the least smooth Matérn kernel (ν = 1/2). We show that real satellite data in d = 2 with N = 1.4 × 10^6 may be regressed in a couple of seconds on a laptop. We show large-scale simulated data tests in d = 2 at up to a billion data points; the latter takes only a couple of minutes on a desktop. For reproducibility, we implemented or wrapped all methods in a common, documented, publicly available MATLAB/Octave interface.
Our contribution includes both analysis and numerical study of the error in the computed posterior mean μ relative to "exact" GP inference (exact solution of (3)–(4)). If this error (the bias in the GP regression introduced by approximation) is pushed to a negligible value then the computation becomes effectively exact (a point appreciated recently with other iterative methods [18, 54]). We bound this error rigorously in our main Theorem 11 in terms of ε, a uniform kernel approximation error, and ρ, the weight-space residual norm arising from the iterative solver. We show that with our EFGP representation ε splits into aliasing and truncation parts (Proposition 8). For the popular squared-exponential and Matérn kernels, we give formulae for numerical discretization parameters that guarantee a given ε, using analytic results from [2]. Numerically we study the estimated error in posterior mean (EEPM), which should converge to zero as ε and ρ vanish. This is more informative than the prediction error at held-out targets (commonly RMSE), which instead bottoms out at the noise level (σ in the case of a well-specified model).
Remark 2 (Connection to Fourier imaging). Although the combination of equispaced Fourier modes with an iterative FFT-accelerated Toeplitz solver appears to be new in GP regression, a similar idea has been used in computed tomography [13], MRI [15, 22], and cryo-electron microscopy [55]. The normal equations in those applications are mathematically equivalent to the GP weight-space linear system with σ = 0, but, curiously, the roles of physical and Fourier space are swapped.
1.3 Outline of this paper
In the next section we present the weight-space approach for an exactly factorized covariance kernel, and review its equivalence to GP regression in function space. In Section 3 we describe the proposed equispaced Fourier (EFGP) method and its fast numerical implementation. Our error analysis comprises kernel approximation (Section 4.1), discretization parameter choice (Section 4.2), and posterior mean error (Section 4.3). Section 5 presents