Equispaced Fourier representations for efficient Gaussian
process regression from a billion data points
Philip Greengard*¹, Manas Rachh², and Alex Barnett²
¹Department of Statistics, Columbia University, New York, NY 10027
²Center for Computational Mathematics, Flatiron Institute, New York, NY 10010
May 19, 2023
Abstract
We introduce a Fourier-based fast algorithm for Gaussian process regression in low dimensions. It approximates a translationally-invariant covariance kernel by complex exponentials on an equispaced Cartesian frequency grid of M nodes. This results in a weight-space M × M system matrix with Toeplitz structure, which can thus be applied to a vector in O(M log M) operations via the fast Fourier transform (FFT), independent of the number of data points N. The linear system can be set up in O(N + M log M) operations using nonuniform FFTs. This enables efficient massive-scale regression via an iterative solver, even for kernels with fat-tailed spectral densities (large M). We provide bounds on both kernel approximation and posterior mean errors. Numerical experiments for squared-exponential and Matérn kernels in one, two and three dimensions often show 1–2 orders of magnitude acceleration over state-of-the-art rank-structured solvers at comparable accuracy. Our method allows 2D Matérn-3/2 regression from N = 10^9 data points to be performed in 2 minutes on a standard desktop, with posterior mean accuracy 10^{-3}. This opens up spatial statistics applications 100 times larger than previously possible.
1 Introduction
Gaussian process (GP) regression is ubiquitous in spatial statistics and machine learning, due in large part to its high predictive performance and rigorous foundation in Bayesian inference [4, 8, 32, 36, 44, 45, 51]. The goal is to recover a stochastic function with a certain covariance kernel, given noisy pointwise observations. More precisely, one is given N points x_1, …, x_N ∈ R^d, with corresponding data observations y_1, …, y_N which are noisy samples from a function f with a zero-mean Gaussian process distribution. This is denoted by

    y_n = f(x_n) + ε_n,   n = 1, …, N,    (1)
    f(x) ~ GP(0, k(x, x')),               (2)

where ε_n ~ N(0, σ²) is independent and identically distributed (iid) noise of known variance σ² > 0, and k : R^d × R^d → R is a given covariance kernel. We address the common translationally invariant case where k(x, x') = k(x − x'), using these notations interchangeably.
*pg2118@columbia.edu. Research supported by the Alfred P. Sloan Foundation.
This covers a wide range of statistical problems, where k is generally chosen to belong to one of a handful of families of kernels such as Matérn, squared-exponential (SE), or rational quadratic. In this paper we present a regression method efficient in low dimensions d (say d ≤ 5), targeting applications in time series (d = 1) and spatial/geostatistics (d = 2 or 3); we do not address high-dimensional machine learning tasks.
In the GP prior model the vector {f(x_n)}_{n=1}^N is a zero-mean multivariate Gaussian with N × N covariance matrix K with elements K_nl := k(x_n, x_l). Then, given data y := {y_n}_{n=1}^N, the regression task is to compute the (necessarily Gaussian) posterior of f(x), at new target points x ∈ R^d. A standard calculation [44, §2.2] (conditioning an (N+1)-dimensional multivariate Gaussian on the known subset of N variables) gives the posterior mean of f(x) as

    μ(x) = Σ_{n=1}^{N} α_n k(x, x_n),    (3)

where α := {α_n}_{n=1}^N is the vector in R^N uniquely solving the N × N symmetric linear system

    (K + σ²I) α = y,    (4)

I being the N × N identity matrix. Computing μ(x), the main goal of this paper, is equivalent to kernel ridge regression [36, Ch. 18.3.7].
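To make (3)–(4) concrete, the following is a minimal dense-solve sketch in Python/NumPy for a toy 1D squared-exponential kernel (the kernel, length scale, and data are our own illustrative choices, not the authors' MATLAB/Octave code). Its O(N³) cost is exactly what the fast methods discussed below, and in this paper, aim to avoid.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma, ell = 500, 0.1, 0.2              # data size, noise level, kernel length scale

# Toy 1D data: noisy samples of an unknown function, as in eq. (1).
x = rng.uniform(0.0, 1.0, N)
y = np.sin(2 * np.pi * x) + sigma * rng.standard_normal(N)

# Squared-exponential covariance kernel k(x, x') = exp(-|x - x'|^2 / (2 ell^2)).
def kern(a, b):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

# Solve the N x N symmetric system (K + sigma^2 I) alpha = y of eq. (4) by Cholesky.
K = kern(x, x)
L = np.linalg.cholesky(K + sigma**2 * np.eye(N))
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

# Posterior mean at new targets, eq. (3): mu(x*) = sum_n alpha_n k(x*, x_n).
xstar = np.linspace(0.0, 1.0, 9)
mu = kern(xstar, x) @ alpha
print(np.round(mu, 3))                     # close to sin(2 pi x*), up to the noise level
```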
Remark 1 (variance evaluation). Formulae similar to (3)–(4) exist for the variance s(x) (or covariance between targets) [44, (2.26)] [25, §1–2]. Unlike the mean, they demand a new linear solve for each target. Our proposed accelerated method would also apply to such cases, with scaling as in the last column of Table 1; for space reasons we leave this for future study.
The main obstacle to the use of GP regression as a large-scale practical tool is the solution of linear systems such as (4). Dense direct solution (e.g., by Cholesky factorization of K + σ²I) costs O(N³) operations, limiting N to order 10^4 on a single machine, prohibitive for even a moderately small modern data set. Thus much literature has emerged on more scalable methods. While most of these tackle the above N × N linear system (this is called the "function space" approach), others solve a different linear system for basis coefficients ("weight space") [44, §2] [36, Ch. 18].
1.1 Prior work in GP regression in low dimensions
Much prior work has applied iterative solvers, usually conjugate gradients (CG) [20, §10.2], to the N × N function-space linear system (4). This requires multiplications of arbitrary vectors by the matrix K + σ²I. These may be performed densely with O(N²) cost per iteration [19]; for a brute-force parallel implementation see [54]. However, most approaches instead apply an approximation to K which exploits the low-rank structure of covariance matrices.¹ The kernel approximation k(x − x') ≈ k̃(x − x') := Σ_{j=1}^{M} φ_j(x) φ_j(x')*, for a set of M basis functions φ_j, leads to a global factorization K ≈ K̃ := ΦΦ*, whose rank is thus at most M.

¹These are sometimes called "sparse" GPs [27, 32].
method      | error (SE kernel) | precomputation            | linear solve          | mean at q points   | variance at q points
direct      | ε_mach            | N²                        | N³                    | Nq                 | N²q
SKI [59]    | M^{-3/d}          | —                         | n_iter(N + M log M)   | Nq                 | n_iter(N + M log M)q
FLAM [35]   | set by ε          | max[N log N, N^{3-3/d}]   | N log N               | (N + q) log(N + q) | Nq log N
RLCM [7]    | set by rank       | N                         | N                     | N + q log N        | N + q log N
PG '21 [24] | e^{-γM^{1/d}}     | N + M² log M              | M³                    | q + M log M        | M²q
EFGP        | e^{-γM^{1/d}}     | N + M log M               | n_iter M log M        | q + M log M        | n_iter M q log M

Table 1: Error and computational complexity of GP regression for different algorithms, for N points defined on R^d, with regression to q new targets. All entries are O(·); constants are not compared. For "SKI," M denotes the number of inducing points; otherwise M is the number of Fourier nodes. For "EFGP," the proposal in this paper, M = (2m + 1)^d, since 2m + 1 nodes are used in each dimension. n_iter is the number of iterations in an iterative solver, ε is a matrix compression tolerance, and ε_mach is machine error. "Error" refers to kernel approximation, not downstream prediction errors that also scale with N and σ. SE denotes squared-exponential.
Perhaps the most popular examples of this are the Nyström method [37, 44], inducing point methods [42, 48], and subset of regressors (SoR) [44, §8.3.1]. These factorizations often involve a subsampling of rows and columns of K, so that the bases are themselves kernels centered at a size-M subset of the given points. Their cost is typically O(NM²). Building on inducing points, "structured kernel interpolation" (SKI) [59] (similar to precorrected FFT in PDE solvers [41]) replaces these x_{n_j} by a regular spatial product grid of interpolation points; the resulting Toeplitz structure allows a fast matrix-vector multiply via the d-dimensional FFT, giving the much improved scaling O(N + M log M) per iteration (see Table 1). Recently, analytic expansions have improved upon its speed at high accuracy up to moderate d [46].
Most such iterative methods fit into the framework of "black box" matrix-vector multiplications (BBMM) [18] with K. However, their disadvantage is that each iteration costs at least O(N), while poor condition numbers mean that many iterations are required. Preconditioning can help in certain settings [18, 52], but is far from a panacea for GP regression [10]. Furthermore, the cost per posterior mean evaluation at each new target is again O(N). In 2015 it was stated that "with sparse approximations it is inconceivable to apply GPs to training set sizes of N ≥ O(10^7)" [12]. The largest data sizes tested on a desktop machine to date in this framework are around N ≈ 10^6 [18, 32, 54].
We now turn to prior work in direct solvers, which have the advantage of immunity to ill-conditioning. Simple truncation of the kernel beyond a certain range sparsifies K, allowing a sparse direct solution [17], but this can only be effective for short-range kernels. In dimension d = 1 (e.g., time series), certain kernels make the upper and lower triangular halves of K low rank; this "semiseparable" structure allows a direct solution in O(N) time [16]. Recently, fast direct solvers have been successfully applied to the function-space linear system in low dimension d > 1 [1, 5, 7, 29, 35]. Going beyond the above "one-level" compression for K, these exploit hierarchical (multi-scale) low-rank structure to compute a compressed approximate representation of (K + σ²I)^{-1}, which may be quickly (in O(N) or O(N log N) time) applied to new vectors y in (4). For instance, in HODLR (hierarchically off-diagonal low-rank) structure, at each scale off-diagonal blocks are numerically low rank, while the remaining diagonal blocks themselves have HODLR structure at a finer scale [1]. Refinements of this structure (e.g., HSS, H, H², and HBS), as well as randomized linear algebra and "proxy point" sampling [33, 35], are needed to achieve low ranks and fast factorizations. These build upon recent advances in fast direct solvers for various kernels arising in integral equations and PDEs [26, 33], which in turn exploit adaptive domain decomposition and far-field expansion ideas from tree codes and the fast multipole method [23]. One disadvantage of fast direct solvers is that the ranks grow rapidly with d, and that compression times scale quadratically with the ranks [1]. Another is their high RAM usage. The largest N is around 2 × 10^6 in recent single workstation examples [1, 7, 35].
In contrast to all of the above function-space methods, weight-space methods avoid solving the N × N linear system [42]. Recalling the low-rank kernel approximation K ≈ K̃ = ΦΦ*, they instead solve a dual system involving the smaller M × M matrix Φ*Φ. We recap the equivalence of these two approaches in Section 2. The needed rank (basis size) M depends on the kernel, dimension d, and desired accuracy. An advantage is that mean prediction at new targets involves only evaluating the basis expansion, an O(M) cost independent of the data size N. However, traditional weight-space methods (including SoR and "sparse"/inducing variable methods [42, 48]) cost O(NM²) to form Φ*Φ, again prohibitive for large N.
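As a concrete (though naive, O(NM²)) illustration of this function-space/weight-space equivalence, the sketch below builds an arbitrary real feature map φ, solves both the N × N system (4) with K = ΦΦᵀ and the dual M × M system (ΦᵀΦ + σ²I)β = Φᵀy, and checks that the two posterior means agree; the feature map and data are our own toy choices, not those of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, sigma = 2000, 64, 0.1

# Toy 1D data and an arbitrary real feature map phi (any basis with
# k(x, x') ~ sum_j phi_j(x) phi_j(x') would do; this one is purely illustrative).
x = rng.uniform(-1.0, 1.0, N)
y = np.sin(3 * x) + sigma * rng.standard_normal(N)
freqs = 3.0 * rng.standard_normal(M)
def phi(pts):                                # (npts, M) feature matrix
    return np.cos(np.outer(pts, freqs)) / np.sqrt(M)

Phi = phi(x)

# Function space: (K + sigma^2 I) alpha = y with K = Phi Phi^T; O(N^3).
alpha = np.linalg.solve(Phi @ Phi.T + sigma**2 * np.eye(N), y)

# Weight space: (Phi^T Phi + sigma^2 I) beta = Phi^T y; O(N M^2 + M^3).
beta = np.linalg.solve(Phi.T @ Phi + sigma**2 * np.eye(M), Phi.T @ y)

# The posterior means at new targets agree; weight space costs only O(M) per target.
xt = np.linspace(-1.0, 1.0, 7)
mu_function_space = phi(xt) @ (Phi.T @ alpha)   # = sum_n alpha_n k(x_t, x_n)
mu_weight_space = phi(xt) @ beta
print(np.max(np.abs(mu_function_space - mu_weight_space)))   # agreement to rounding error
```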
Recently it has been realized that Fourier (real trigonometric or complex exponential) basis functions φ_j(x) are a natural weight-space representation for translationally invariant kernels (for a review of such spectral representations see [27]). In early work their frequencies were chosen stochastically [43], or optimized as part of the inference [31]. Instead, Hensman et al. [27] proposed regular (equispaced) frequencies, following [39]. This allowed regression (in a separable model, i.e., effectively d = 1) from N = 6 × 10^6 points in around 4 minutes. However, since direct Fourier summation was used, the cost was still O(NM).
Recently the first author introduced a weight-space method for d = 1 with Fourier frequencies chosen at the nodes of a high-order accurate quadrature scheme for the Fourier integral [24]. Since smooth kernels (such as SE or Matérn with larger ν) have rapidly decaying Fourier transforms, truncation to a small M can approximate K̃ ≈ K to many digits of accuracy, giving essentially "exact" inference. That work also proposed the nonuniform FFT (NUFFT [14]) of type 3 for filling of the weight-space system matrix and evaluation at targets, leading to a solution cost O(N + M³); see Table 1. This allowed N = 10^8 to be handled in a few seconds.
Finally, for d > 1, the only way to handle N ≫ 10^6 currently is via distributed computing [32]. There, methods include approximation by mixtures (a "committee" [44, §8.3.5]) of fixed-size GP solutions [12], distributed gradient descent [40], and multi-CPU Cholesky preconditioning [34]. For instance, in studies with N = 10^9 points (in d = 9), convergence took around 1 hour, either on 1000 CPU cores [40, Sec. 6.3], or 2 GPUs [34].
1.2 Our contribution
We present a weight-space iterative solver for general low dimension using Fourier frequencies (features) on an equispaced d-dimensional grid, as in [27]. The equal spacing makes the M-by-M system matrix Toeplitz (or its generalization for d > 1) up to diagonal scalings, enabling its multiplication with a vector via the padded FFT, to give O(M log M) cost per iteration. (Although SKI [59] also uses an FFT, the grid there is in physical space, forcing an O(N) cost per iteration as with all BBMM methods.) The result is that large M (e.g., 10^6) may be tackled (unlike in [24, 27]). This is a necessity for higher dimension since M = O(m^d), where m controls the number of grid points per dimension. Furthermore, building on [24], we fill the vector defining the Toeplitz matrix, and the right-hand side, via a modern implementation of the NUFFT [3] with O(N + M log M) cost. Since there is only a single pass through the data, a standard desktop can handle very large N, around 100 times larger than any prior work of which we are aware in d = 2.
We benchmark our equispaced Fourier (EFGP) proposal against three popular methods for GP regression that have desirable theoretical properties and available software: SKI of Wilson et al. [59] implemented in GPyTorch [18], an implementation of the fast direct solver of Minden et al. [35] using FLAM [28], and the linear-scaling direct solver RLCM of Chen and Stein [7]. Their theoretical scalings are compared in Table 1. Our synthetic data tests in dimensions d = 1, 2, and 3 explore the tradeoff of accuracy vs CPU time for all four methods, and show that EFGP is very competitive for the very smooth squared-exponential kernel, and remains competitive even for the least smooth Matérn kernel (ν = 1/2). We show that real satellite data in d = 2 with N = 1.4 × 10^6 may be regressed in a couple of seconds on a laptop. We show large-scale simulated data tests in d = 2 at up to a billion data points; the latter takes only a couple of minutes on a desktop. For reproducibility, we implemented or wrapped all methods in a common, documented, publicly available MATLAB/Octave interface.
Our contribution includes both analysis and numerical study of the error in the computed posterior mean μ relative to "exact" GP inference (exact solution of (3)–(4)). If this error (the bias in the GP regression introduced by approximation) is pushed to a negligible value then the computation becomes effectively exact (a point appreciated recently with other iterative methods [18, 54]). We bound this error rigorously in our main Theorem 11 in terms of ε, a uniform kernel approximation error, and ρ, the weight-space residual norm arising from the iterative solver. We show that with our EFGP representation ε splits into aliasing and truncation parts (Proposition 8). For the popular squared-exponential and Matérn kernels, we give formulae for numerical discretization parameters that guarantee a given ε, using analytic results from [2]. Numerically we study the estimated error in posterior mean (EEPM), which should converge to zero as ε and ρ vanish. This is more informative than the prediction error at held-out targets (commonly RMSE), which instead bottoms out at the noise level (σ in the case of a well-specified model).
Remark 2 (Connection to Fourier imaging). Although the combination of equispaced Fourier modes with an iterative FFT-accelerated Toeplitz solver appears to be new in GP regression, a similar idea has been used in computed tomography [13], MRI [15, 22], and cryo-electron microscopy [55]. The normal equations in those applications are mathematically equivalent to the GP weight-space linear system with σ = 0, but, curiously, the roles of physical and Fourier space are swapped.
1.3 Outline of this paper
In the next section we present the weight-space approach for an exactly factorized covariance kernel, and review its equivalence to GP regression in function space. In Section 3 we describe the proposed equispaced Fourier (EFGP) method and its fast numerical implementation. Our error analysis comprises kernel approximation (Section 4.1), discretization parameter choice (Section 4.2), and posterior mean error (Section 4.3). Section 5 presents