structure, at each scale off-diagonal blocks are numerically low rank, while the remaining
diagonal blocks themselves have HODLR structure at a finer scale [1]. Refinements of
this structure (e.g., HSS, $\mathcal{H}$, $\mathcal{H}^2$, and HBS), as well as randomized linear algebra and
“proxy point” sampling [33, 35], are needed to achieve low ranks and fast factorizations.
These build upon recent advances in fast direct solvers for various kernels arising in integral
equations and PDEs [26, 33], which in turn exploit adaptive domain decomposition and far-field
expansion ideas from tree codes and the fast multipole method [23]. One disadvantage
of fast direct solvers is that the ranks grow rapidly with $d$, and that compression times
scale quadratically with the ranks [1]. Another is their high RAM usage. The largest $N$ is
around $2 \times 10^6$ in recent single-workstation examples [1, 7, 35].
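To make the hierarchical structure concrete, the following is a minimal sketch (ours, in Python/NumPy, not any of the cited codes) of a HODLR representation and its fast matrix-vector product; truncated SVDs stand in for the rank-revealing or randomized compression used in practice, and `leaf_size` and `rank` are illustrative parameters.

```python
import numpy as np

class HODLR:
    """HODLR sketch: off-diagonal blocks stored as low-rank factors (U, V)
    with block ~ U @ V; diagonal blocks are dense leaves, or HODLR nodes
    at the finer scale."""
    def __init__(self, A, leaf_size=64, rank=8):
        n = A.shape[0]
        if n <= leaf_size:
            self.leaf = A.copy()           # dense diagonal leaf
            return
        self.leaf = None
        m = self.m = n // 2
        # compress the two off-diagonal blocks (truncated SVD as a stand-in
        # for rank-revealing / randomized compression)
        U, s, Vt = np.linalg.svd(A[:m, m:], full_matrices=False)
        self.tr = (U[:, :rank] * s[:rank], Vt[:rank])
        U, s, Vt = np.linalg.svd(A[m:, :m], full_matrices=False)
        self.bl = (U[:, :rank] * s[:rank], Vt[:rank])
        # recurse on the diagonal blocks at the finer scale
        self.d1 = HODLR(A[:m, :m], leaf_size, rank)
        self.d2 = HODLR(A[m:, m:], leaf_size, rank)

    def matvec(self, x):
        if self.leaf is not None:
            return self.leaf @ x
        m = self.m
        y = np.empty_like(x)
        y[:m] = self.d1.matvec(x[:m]) + self.tr[0] @ (self.tr[1] @ x[m:])
        y[m:] = self.d2.matvec(x[m:]) + self.bl[0] @ (self.bl[1] @ x[:m])
        return y

# demo on a smooth 1D kernel matrix, whose off-diagonal blocks are low rank
t = np.linspace(0, 1, 512)
A = np.exp(-np.abs(t[:, None] - t[None, :]))
x = np.random.default_rng(0).standard_normal(512)
print(np.linalg.norm(HODLR(A).matvec(x) - A @ x) / np.linalg.norm(A @ x))
```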
In contrast to all of the above function-space methods, weight-space methods avoid solving
the $N \times N$ linear system [42]. Recalling the low-rank kernel approximation $K \approx \tilde{K} = \Phi\Phi^*$,
they instead solve a dual system involving the smaller $M \times M$ matrix $\Phi^*\Phi$. We
recap the equivalence of these two approaches in Section 2. The needed rank (basis size)
$M$ depends on the kernel, dimension $d$, and desired accuracy. An advantage is that mean
prediction at new targets involves only evaluating the basis expansion, an $O(M)$ cost inde-
pendent of the data size $N$. However, traditional weight-space methods (including SoR and
“sparse”/inducing variable methods [42, 48]) cost $O(NM^2)$ to form $\Phi^*\Phi$, again prohibitive
for large $N$.
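Concretely, with noise variance $\sigma^2$ the dual solve replaces the $N \times N$ system by an $M \times M$ one: the weights solve $(\Phi^*\Phi + \sigma^2 I)\beta = \Phi^* y$, and the mean at a new target $x_\star$ is $\phi(x_\star)^*\beta$. A minimal sketch (ours; the random $\Phi$ is a placeholder feature matrix), including a check against the function-space formula:

```python
import numpy as np

def weight_space_mean(Phi, y, Phi_star, sigma2):
    """Solve the M x M dual system (Phi* Phi + sigma2 I) beta = Phi* y,
    then evaluate the basis expansion at the target features Phi_star.
    Cost: O(N M^2) to form the system, then O(M) per new target."""
    M = Phi.shape[1]
    A = Phi.conj().T @ Phi + sigma2 * np.eye(M)
    beta = np.linalg.solve(A, Phi.conj().T @ y)
    return Phi_star @ beta

rng = np.random.default_rng(0)
N, M, sigma2 = 500, 40, 0.1
Phi = rng.standard_normal((N, M)) / np.sqrt(M)   # placeholder features
y = rng.standard_normal(N)
mu = weight_space_mean(Phi, y, Phi[:5], sigma2)  # mean at 5 training points

# agrees with the function-space mean K (K + sigma2 I)^{-1} y, K = Phi Phi*
K = Phi @ Phi.conj().T
assert np.allclose(mu, K[:5] @ np.linalg.solve(K + sigma2 * np.eye(N), y))
```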
Recently it has been realized that Fourier (real trigonometric or complex exponential)
basis functions $\phi_j(x)$ are a natural weight-space representation for translationally invariant
kernels (for a review of such spectral representations see [27]). In early work their frequencies
were chosen stochastically [43], or optimized as part of the inference [31]. Instead, Hensman
et al. [27] proposed regular (equispaced) frequencies, following [39]. This allowed regression
(in a separable model, i.e., effectively $d = 1$) from $N = 6 \times 10^6$ points in around 4 minutes.
However, since direct Fourier summation was used, the cost was still $O(NM)$.
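For intuition, here is one way (our sketch, not the construction of [27]; the squared-exponential spectral density, grid extent, and trapezoid weighting are illustrative choices) to build equispaced Fourier features in $d = 1$ whose outer product approximates a translationally invariant kernel:

```python
import numpy as np

def equispaced_fourier_features(x, M, h, s_hat):
    """Features phi_m(x) = sqrt(h * s_hat(xi_m)) exp(2 pi i xi_m x) on the
    equispaced grid xi_m = m h, so that (Phi Phi*)_{ij} is a trapezoid-rule
    approximation of k(x_i-x_j) = int s_hat(xi) exp(2 pi i xi (x_i-x_j)) dxi."""
    xi = h * (np.arange(M) - M // 2)
    w = np.sqrt(h * s_hat(xi))
    return w[None, :] * np.exp(2j * np.pi * np.outer(x, xi))

# squared-exponential kernel k(r) = exp(-r^2 / (2 ell^2)), whose spectral
# density is s_hat(xi) = sqrt(2 pi) ell exp(-2 pi^2 ell^2 xi^2)
ell, M = 0.1, 128
s_hat = lambda xi: np.sqrt(2 * np.pi) * ell * np.exp(-2 * (np.pi * ell * xi) ** 2)
x = np.linspace(0, 1, 200)
Phi = equispaced_fourier_features(x, M, h=2.0 / (ell * M), s_hat=s_hat)

K_true = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * ell ** 2))
print(np.abs((Phi @ Phi.conj().T).real - K_true).max())  # should be tiny
```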
Recently the first author introduced a weight-space method for $d = 1$ with Fourier
frequencies chosen at the nodes of a high-order accurate quadrature scheme for the Fourier
integral [24]. Since smooth kernels (such as SE or Matérn with larger $\nu$) have rapidly de-
caying Fourier transforms, truncation to a small $M$ can approximate $\tilde{K} \approx K$ to many digits
of accuracy, giving essentially “exact” inference. That work also proposed the nonuniform
FFT (NUFFT [14]) of type 3 for filling of the weight-space system matrix and evaluation
at targets, leading to a solution cost $O(N + M^3)$; see Table 1. This allowed $N = 10^8$ to be
handled in a few seconds.
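To indicate how the type-3 NUFFT enters, here is a sketch (ours, assuming the `finufft` Python bindings; FINUFFT's type-3 convention takes angular frequencies, hence the $2\pi$ factor) of applying $\Phi^*$ to the data vector in quasi-linear rather than $O(NM)$ time; the same transform, run from the points $x_j$ to the frequency differences $\xi_{m'} - \xi_m$, can fill the system matrix $\Phi^*\Phi$ itself.

```python
import numpy as np
import finufft  # pip install finufft

def apply_Phi_adjoint(x, y, xi, w, eps=1e-12):
    """(Phi* y)_m = conj(w_m) sum_j exp(-2 pi i xi_m x_j) y_j, for nonuniform
    points x_j and frequencies xi_m, via one type-3 NUFFT: quasi-linear in
    N + M, versus O(N M) for direct summation."""
    f = finufft.nufft1d3(x, y.astype(np.complex128), 2 * np.pi * xi,
                         isign=-1, eps=eps)
    return np.conj(w) * f

rng = np.random.default_rng(0)
N, M = 10_000, 512                  # small enough to verify directly below
x = rng.uniform(0.0, 1.0, N)
y = rng.standard_normal(N)
xi = np.linspace(-50.0, 50.0, M)    # quadrature nodes (placeholder values)
w = np.ones(M)                      # quadrature weights (placeholder values)
b = apply_Phi_adjoint(x, y, xi, w)

# check against direct O(N M) summation
b_direct = np.conj(w) * (np.exp(-2j * np.pi * np.outer(xi, x)) @ y)
print(np.abs(b - b_direct).max())
```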
Finally, for $d > 1$, the only way to handle $N \gtrsim 10^6$ currently is via distributed computing
[32]. There, methods include approximation by mixtures (a “committee” [44, §8.3.5]) of
fixed-size GP solutions [12], distributed gradient descent [40], and multi-CPU Cholesky
preconditioning [34]. For instance, in studies with $N = 10^9$ points (in $d = 9$), convergence
took around 1 hour, either on 1000 CPU cores [40, Sec. 6.3], or 2 GPUs [34].
1.2 Our contribution
We present a weight-space iterative solver for general low dimension using Fourier frequencies
(features) on an equispaced $d$-dimensional grid, as in [27]. The equal spacing makes the
$M$-by-$M$ system matrix Toeplitz (or its generalization for $d > 1$) up to diagonal scalings,
enabling its multiplication with a vector via the padded FFT, to give $O(M \log M)$ cost
per iteration. (Although SKI [59] also uses an FFT, the grid there is in physical space,