
better uncertainty propagation, it comes with several
computational challenges, which hinder its scalability
and limit its competitiveness with the MLE approach.
First, the cost of numerical integration in BTG scales
with the dimension of the hyperparameter space, which
can be large when transform and noise-model parameters
are incorporated. Traditional methods such as
Monte Carlo (MC) suffer from slow convergence. We
therefore leverage sparse grid quadrature and quasi-Monte
Carlo (QMC), which offer a higher degree of
precision but require a sufficiently smooth integrand.
Second, the posterior mean of BTG is not guaranteed
to exist, hence the need to use the posterior median
predictor. The posterior median and credible intervals
do not generally have closed forms, so one must resort
to expensive numerical root-finding to compute them.
Finally, while fast cross-validation schemes are known
for vanilla GP models, leave-one-out cross-validation
(LOOCV) for BTG, which naively incurs quartic cost,
is less straightforward to perform because of an
embedded generalized least squares problem.
In this paper, we reduce the overall computational cost
of end-to-end BTG inference, including model predic-
tion and selection. Our main contributions follow.
• We propose efficient and scalable methods for computing BTG predictive medians and quantiles through a combination of doubly sparse quadrature and quantile bounds. We also propose fast LOOCV using rank-one matrix algebra.
• We develop a framework to control the tradeoff between speed and accuracy for BTG and analyze the error in sparsifying QMC and sparse grid quadrature rules.
• We empirically compare the Bayesian and MLE approaches and provide experimental results for BTG and WGP coupled with 1-layer and 2-layer transformations. We find evidence that BTG is well-suited for low-data regimes, where hyperparameters are under-specified by the data.
• We develop a modular Julia package for computing with transformed GPs (e.g., BTG and WGP) which exploits vectorized linear algebra operations and supports MLE and Bayesian inference.
2 Background
2.1 Gaussian Process Regression
A GP $f \sim \mathcal{GP}(\mu, \tau^{-1}k)$ is a distribution over functions
in $\mathbb{R}^d$, where $\mu(x)$ is the expected value of $f(x)$ and
$\tau^{-1}k(x, x')$ is the positive (semi-)definite covariance
between $f(x)$ and $f(x')$. For later clarity, we separate
the precision hyperparameter $\tau$ from lengthscales and
other kernel hyperparameters (typically denoted by $\theta$).
Unless otherwise specified, we assume a linear mean
field and the squared exponential kernel:
$$\mu_\beta(x) = \beta^T m(x), \qquad m: \mathbb{R}^d \to \mathbb{R}^p,$$
$$k_\theta(x, x') = \exp\left(-\tfrac{1}{2}\|x - x'\|^2_{D_\theta^{-2}}\right).$$
Here $m$ is a known function mapping a location to a
vector of covariates, $\beta$ consists of coefficients in the
linear combination, and $D_\theta^2$ is a diagonal matrix of
lengthscales determined by the parameter(s) $\theta$.
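
For concreteness, the following is a minimal Julia sketch of this mean field and kernel. It is not the package described later; the names (`m`, `mean_field`, `se_kernel`) and the particular choice $m(x) = [1; x]$ are assumptions made for illustration.

```julia
using LinearAlgebra

# Hypothetical covariate map m : ℝᵈ → ℝᵖ; here we assume m(x) = [1; x], so p = d + 1.
m(x) = vcat(1.0, x)

# Linear mean field μ_β(x) = βᵀ m(x).
mean_field(x, β) = dot(β, m(x))

# Squared exponential kernel k_θ(x, x′) = exp(-½ (x - x′)ᵀ D_θ⁻² (x - x′)),
# where θ holds the per-dimension lengthscales on the diagonal of D_θ.
function se_kernel(x, y, θ)
    z = (x .- y) ./ θ            # elementwise scaling by lengthscales
    return exp(-0.5 * dot(z, z))
end
```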
For any finite set of input locations, let:
$$X = [x_1, \ldots, x_n]^T \in \mathbb{R}^{n \times d},$$
$$M_X = [m(x_1), \ldots, m(x_n)]^T \in \mathbb{R}^{n \times p},$$
$$f_X = [f(x_1), \ldots, f(x_n)]^T \in \mathbb{R}^n,$$
where $X$ is the matrix of observation locations, $M_X$
is the matrix of covariates at $X$, and $f_X$ is the vector
of observations. A GP has the property that any finite
number of evaluations of $f$ will have a joint Gaussian
distribution: $f_X \mid \beta, \tau, \theta \sim \mathcal{N}(M_X\beta, \tau^{-1}K_X)$, where
$(\tau^{-1}K_X)_{ij} = \tau^{-1}k_\theta(x_i, x_j)$ is the covariance matrix
of $f_X$. We assume $M_X$ to be full rank.
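
Continuing the sketch above (and reusing the hypothetical `m` and `se_kernel`), one can assemble $M_X$ and $K_X$ and draw $f_X \sim \mathcal{N}(M_X\beta, \tau^{-1}K_X)$; the hyperparameter values and the small diagonal jitter are illustrative assumptions.

```julia
using LinearAlgebra

d, n, p = 2, 50, 3
θ, τ, β = ones(d), 4.0, randn(p)                         # hypothetical hyperparameters
X = [randn(d) for _ in 1:n]                              # observation locations x₁, …, xₙ

M_X = reduce(hcat, m.(X))'                               # n × p covariate matrix
K_X = [se_kernel(X[i], X[j], θ) for i in 1:n, j in 1:n]  # unscaled kernel matrix

# One prior draw f_X ~ N(M_X β, τ⁻¹ K_X), via a Cholesky factor of K_X.
L = cholesky(Symmetric(K_X) + 1e-10I).L                  # jitter for numerical stability
f_X = M_X * β + (L * randn(n)) / sqrt(τ)
```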
The posterior predictive density at a point $x$ is:
$$f(x) \mid \beta, \tau, \theta, f_X \sim \mathcal{N}(\mu_{\theta,\beta},\, s_{\theta,\beta}),$$
$$\mu_{\theta,\beta} = \beta^T m(x) + K_{Xx}^T K_X^{-1}(f_X - M_X\beta),$$
$$s_{\theta,\beta} = \tau^{-1}\bigl(k_\theta(x, x) - K_{Xx}^T K_X^{-1} K_{Xx}\bigr),$$
where $(K_{Xx})_i = k_\theta(x_i, x)$. Typically, $\tau$, $\beta$, and $\theta$ are
fit by minimizing the negative log likelihood:
$$-\log L(f_X \mid X, \beta, \tau, \theta) \propto \tfrac{1}{2}\|f_X - M_X\beta\|^2_{K_X^{-1}} + \tfrac{1}{2}\log|K_X|.$$
This is known as maximum likelihood estimation
(MLE) of the kernel hyperparameters. In order to
improve the clarity of later sections, we modified the
standard GP treatment of Rasmussen and Williams
(2008); notational differences aside, our formulations
are equivalent.
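
As a rough illustration, again building on the hypothetical sketch above rather than on our implementation, the posterior predictive moments and this (unnormalized) negative log likelihood can each be computed from a Cholesky factorization of $K_X$:

```julia
# Posterior predictive mean μ_{θ,β} and variance s_{θ,β} at a test point x.
function posterior_predict(x, X, f_X, M_X, K_X, β, τ, θ)
    K_Xx = [se_kernel(xi, x, θ) for xi in X]              # (K_Xx)ᵢ = k_θ(xᵢ, x)
    F = cholesky(Symmetric(K_X) + 1e-10I)
    r = f_X - M_X * β
    μ = dot(β, m(x)) + dot(K_Xx, F \ r)
    s = (se_kernel(x, x, θ) - dot(K_Xx, F \ K_Xx)) / τ
    return μ, s
end

# Negative log likelihood (up to constants), matching the expression above.
function neg_log_likelihood(f_X, M_X, K_X, β)
    F = cholesky(Symmetric(K_X) + 1e-10I)
    r = f_X - M_X * β
    return 0.5 * dot(r, F \ r) + 0.5 * logdet(F)
end

μ, s = posterior_predict(randn(d), X, f_X, M_X, K_X, β, τ, θ)
nll  = neg_log_likelihood(f_X, M_X, K_X, β)
```

In practice one would factor $K_X$ once and reuse it across both routines; the sketch keeps them separate only for readability.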
2.2 Warped Gaussian Processes
While GPs are powerful tools for modeling nonlinear
functions, they make the fairly strong assumptions
of Gaussianity and homoscedasticity. WGPs (Snelson
et al., 2004) address this problem by warping the ob-
servation space to a latent space, which itself is mod-
eled by a GP. Given a strictly increasing, differentiable