Scalable Bayesian Transformed Gaussian Processes
Xinran Zhu (Cornell University), Leo Huang (Cornell University), Cameron Ibrahim (University of Delaware), Eric Hans Lee (SigOpt), David Bindel (Cornell University)
Abstract
The Bayesian transformed Gaussian process (BTG) model, proposed by Kedem and Oliveira, is a fully Bayesian counterpart to the warped Gaussian process (WGP) and marginalizes out a joint prior over input warping and kernel hyperparameters. This fully Bayesian treatment of hyperparameters often provides more accurate regression estimates and superior uncertainty propagation, but is prohibitively expensive. The BTG posterior predictive distribution, itself estimated through high-dimensional integration, must be inverted in order to perform model prediction. To make the Bayesian approach practical and comparable in speed to maximum-likelihood estimation (MLE), we propose principled and fast techniques for computing with BTG. Our framework uses doubly sparse quadrature rules, tight quantile bounds, and rank-one matrix algebra to enable both fast model prediction and model selection. These scalable methods allow us to regress over higher-dimensional datasets and apply BTG with layered transformations that greatly improve its expressibility. We demonstrate that BTG achieves superior empirical performance over MLE-based models.
1 Introduction
Gaussian processes (GPs) provide a powerful probabilistic learning framework, including a marginal likelihood which represents the probability of data given only GP hyperparameters. The marginal likelihood automatically balances model fit and complexity terms to favor the simplest models that explain the data.
A GP assumes normally distributed observations. In practice, however, this condition is not always adequately met. The classic approach to moderate departures from normality is trans-Gaussian kriging, which applies a normalizing nonlinear transformation to the data (Cressie, 1993). This idea was reprised and expanded upon in the machine learning literature. One instance is the warped GP (WGP), which maps the observation space to a latent space in which the data is well-modeled by a GP and which learns GP hyperparameters through maximum likelihood estimation (Snelson et al., 2004). The WGP paper employs a class of parametrized, hyperbolic tangent transformations. Later, Rios and Tobar (2019) introduced compositionally warped GPs (CWGP), which chain together a sequence of parametric transformations with closed-form inverses. Bayesian warped GPs further generalize WGPs by modelling the transformation as a GP (Lázaro-Gredilla, 2012). These are in turn generalized to Deep GPs by Damianou and Lawrence (2013), which stack GPs in the layers of a neural network.
Throughout this line of work, the GP transformation and kernel hyperparameters are typically learned through joint maximum likelihood estimation (MLE). A known drawback of MLE is overconfidence in the data-sparse or low-data regime, which may be exacerbated by warping (Chai and Garnett, 2019). Bayesian approaches, on the other hand, offer a way to account for uncertainty in the values of model parameters.
Bayesian trans-kriging (Spöck et al., 2009) treats both transformation and kernel parameters in a Bayesian fashion. A prototypical Bayesian trans-kriging model is the BTG model developed by Oliveira et al. (1997). The model places an uninformative prior on the precision hyperparameter and analytically marginalizes it out to obtain a posterior distribution that is a mixture of Student's t-distributions. Then, it uses a numerical integration scheme to marginalize out transformation and remaining kernel parameters. In this latter regard, BTG is consistent with other Bayesian methods in the literature, including those of Gibbs (1998); Adams et al. (2009); Lalchand and Rasmussen (2020). While BTG shows improved prediction accuracy and
better uncertainty propagation, it comes with several computational challenges, which hinder its scalability and limit its competitiveness with the MLE approach.
First, the cost of numerical integration in BTG scales with the dimension of hyperparameter space, which can be large when transform and noise model parameters are incorporated. Traditional methods such as Monte Carlo (MC) suffer from slow convergence. As such, we leverage sparse grid quadrature and quasi-Monte Carlo (QMC), which have a higher degree of precision but require a sufficiently smooth integrand. Second, the posterior mean of BTG is not guaranteed to exist, hence the need to use the posterior median predictor. The posterior median and credible intervals do not generally have closed forms, so one must resort to expensive numerical root-finding to compute them. Finally, while fast cross-validation schemes are known for vanilla GP models, leave-one-out cross-validation (LOOCV) on BTG, which incurs quartic cost naively, is less straightforward to perform because of an embedded generalized least squares problem.
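To fix ideas about the first of these challenges, the sketch below illustrates generic quadrature-based marginalization over a box in hyperparameter space: quasi-Monte Carlo nodes are weighted by their unnormalized posterior values. This is only an illustration with hypothetical function names (written in Python for brevity), not the doubly sparse rule developed in this paper.

```python
import numpy as np
from scipy.stats import qmc

def qmc_posterior_weights(log_joint, bounds, n_nodes=256, seed=0):
    """Generic illustration of QMC marginalization over hyperparameters.

    log_joint(node) is assumed to return log p(fX | theta, lambda) plus the
    log prior at a node; bounds is a (d, 2) array of box bounds for the d
    hyperparameters. Returns the nodes and normalized mixture weights w_i.
    """
    bounds = np.asarray(bounds, dtype=float)
    sampler = qmc.Sobol(d=len(bounds), scramble=True, seed=seed)
    unit = sampler.random(n_nodes)                        # points in [0, 1)^d
    nodes = qmc.scale(unit, bounds[:, 0], bounds[:, 1])   # map to the box
    log_w = np.array([log_joint(node) for node in nodes])
    log_w -= log_w.max()                                  # stabilize the exponent
    w = np.exp(log_w)
    return nodes, w / w.sum()
```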
In this paper, we reduce the overall computational cost of end-to-end BTG inference, including model prediction and selection. Our main contributions follow.

- We propose efficient and scalable methods for computing BTG predictive medians and quantiles through a combination of doubly sparse quadrature and quantile bounds. We also propose fast LOOCV using rank-one matrix algebra.
- We develop a framework to control the tradeoff between speed and accuracy for BTG and analyze the error in sparsifying QMC and sparse grid quadrature rules.
- We empirically compare the Bayesian and MLE approaches and provide experimental results for BTG and WGP coupled with 1-layer and 2-layer transformations. We find evidence that BTG is well-suited for low-data regimes, where hyperparameters are under-specified by the data.
- We develop a modular Julia package for computing with transformed GPs (e.g., BTG and WGP) which exploits vectorized linear algebra operations and supports MLE and Bayesian inference.
2 Background
2.1 Gaussian Process Regression
A GP $f \sim \mathcal{GP}(\mu, \tau^{-1}k)$ is a distribution over functions in $\mathbb{R}^d$, where $\mu(x)$ is the expected value of $f(x)$ and $\tau^{-1}k(x, x')$ is the positive (semi-)definite covariance between $f(x)$ and $f(x')$. For later clarity, we separate the precision hyperparameter $\tau$ from lengthscales and other kernel hyperparameters (typically denoted by $\theta$). Unless otherwise specified, we assume a linear mean field and the squared exponential kernel:
$$\mu_\beta(x) = \beta^T m(x), \qquad m: \mathbb{R}^d \to \mathbb{R}^p,$$
$$k_\theta(x, x') = \exp\left(-\tfrac{1}{2}\,\|x - x'\|^2_{D^2_\theta}\right).$$
Here $m$ is a known function mapping a location to a vector of covariates, $\beta$ consists of coefficients in the linear combination, and $D^2_\theta$ is a diagonal matrix of length scales determined by the parameter(s) $\theta$.
For any finite set of input locations, let
$$X = [x_1, \ldots, x_n]^T \in \mathbb{R}^{n \times d}, \qquad M_X = [m(x_1), \ldots, m(x_n)]^T \in \mathbb{R}^{n \times p}, \qquad f_X = [f(x_1), \ldots, f(x_n)]^T \in \mathbb{R}^n,$$
where $X$ is the matrix of observation locations, $M_X$ is the matrix of covariates at $X$, and $f_X$ is the vector of observations. A GP has the property that any finite number of evaluations of $f$ will have a joint Gaussian distribution: $f_X \mid \beta, \tau, \theta \sim \mathcal{N}(M_X\beta, \tau^{-1}K_X)$, where $(\tau^{-1}K_X)_{ij} = \tau^{-1}k_\theta(x_i, x_j)$ is the covariance matrix of $f_X$. We assume $M_X$ to be full rank.
The posterior predictive density at a point $x$ is
$$f(x) \mid \beta, \tau, \theta, f_X \sim \mathcal{N}(\mu_{\theta,\beta},\, s_{\theta,\beta}),$$
$$\mu_{\theta,\beta} = \beta^T m(x) + K_{Xx}^T K_X^{-1}\,(f_X - M_X\beta),$$
$$s_{\theta,\beta} = \tau^{-1}\left(k_\theta(x, x) - K_{Xx}^T K_X^{-1} K_{Xx}\right),$$
where $(K_{Xx})_i = k_\theta(x_i, x)$. Typically, $\tau$, $\beta$, and $\theta$ are fit by minimizing the negative log likelihood:
$$-\log L(f_X \mid X, \beta, \tau, \theta) \propto \tfrac{1}{2}\,\|f_X - M_X\beta\|^2_{K_X^{-1}} + \tfrac{1}{2}\log|K_X|.$$
This is known as maximum likelihood estimation (MLE) of the kernel hyperparameters. In order to improve the clarity of later sections, we have modified the standard GP treatment of Rasmussen and Williams (2008); notational differences aside, our formulations are equivalent.
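For concreteness, the following is a minimal sketch of the formulas above for fixed $\beta$, $\tau$, and lengthscales, assuming unit signal variance in the squared exponential kernel; the function names are illustrative and not taken from any particular package.

```python
import numpy as np

def se_kernel(A, B, lengthscales):
    """ARD squared exponential kernel k_theta between rows of A and B."""
    diff = A[:, None, :] / lengthscales - B[None, :, :] / lengthscales
    return np.exp(-0.5 * np.sum(diff ** 2, axis=-1))

def gp_posterior(X, fX, M, x, m_x, beta, tau, lengthscales, jitter=1e-8):
    """Posterior mean and variance of f(x) given data (X, fX) and covariates M.

    Implements mu = beta^T m(x) + K_Xx^T K_X^{-1} (fX - M beta) and
    s = tau^{-1} (k(x,x) - K_Xx^T K_X^{-1} K_Xx) from Section 2.1.
    """
    K_X = se_kernel(X, X, lengthscales) + jitter * np.eye(len(X))
    K_Xx = se_kernel(X, x[None, :], lengthscales)[:, 0]       # vector K_Xx
    L = np.linalg.cholesky(K_X)                               # K_X = L L^T
    solve = lambda b: np.linalg.solve(L.T, np.linalg.solve(L, b))
    mu = beta @ m_x + K_Xx @ solve(fX - M @ beta)
    s = (1.0 / tau) * (1.0 - K_Xx @ solve(K_Xx))              # k(x,x) = 1 for SE
    return mu, s
```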
2.2 Warped Gaussian Processes
Figure 1: A comparison of GP and BTG: predictive mean/median and 95% equal-tailed credible interval, trained on 48 random samples from the rounded sine function with noise from $\mathcal{N}(0, 0.05)$.

While GPs are powerful tools for modeling nonlinear functions, they make the fairly strong assumption of Gaussianity and homoscedasticity. WGPs (Snelson et al., 2004) address this problem by warping the observation space to a latent space, which itself is modeled by a GP. Given a strictly increasing, differentiable parametric transformation $g_\lambda$, WGPs model the composite function $g_\lambda \circ f$ with a GP:
$$(g_\lambda \circ f) \mid \beta, \tau, \lambda, \theta \sim \mathcal{GP}(\mu_\beta, \tau^{-1}k_\theta).$$
Let $(g_\lambda(f_X))_i = g_\lambda(f(x_i))$. WGP jointly computes the parameters through MLE in the latent space, where the negative log likelihood is
$$-\log L\big(g_\lambda(f_X) \mid X, \beta, \tau, \theta, \lambda\big) \propto \tfrac{1}{2}\,\|g_\lambda(f_X) - M_X\beta\|^2_{K_X^{-1}} + \tfrac{1}{2}\log|K_X| - \log J_\lambda,$$
where $J_\lambda$ is the transformation Jacobian:
$$J_\lambda = \prod_{i=1}^{n} \left|\frac{\partial\, g_\lambda(f(x_i))}{\partial f(x_i)}\right|.$$
WGPs predict the value of a point $x$ by computing its posterior mean in the latent space and then inverting the transformation back to the observation space: $g_\lambda^{-1}(\hat\mu(x))$. Snelson et al. (2004) use the tanh transform family, whose members do not generally have closed-form inverses; they must be computed numerically.
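For illustration only, the sketch below evaluates this negative log likelihood with a single tanh warping layer of the form $g_\lambda(y) = y + a\,\tanh(b(y + c))$, one member of the family used by Snelson et al. (2004); the helper names are hypothetical.

```python
import numpy as np

def warp_tanh(y, a, b, c):
    """One-layer tanh warp g(y) = y + a*tanh(b*(y + c)), increasing for a, b >= 0."""
    return y + a * np.tanh(b * (y + c))

def warp_tanh_deriv(y, a, b, c):
    """dg/dy = 1 + a*b*sech^2(b*(y + c))."""
    return 1.0 + a * b / np.cosh(b * (y + c)) ** 2

def wgp_neg_log_lik(fX, M, K_X, beta, a, b, c):
    """Latent-space negative log likelihood (up to constants):
       0.5*||g(fX) - M beta||^2_{K_X^{-1}} + 0.5*log|K_X| - sum_i log g'(f(x_i))."""
    z = warp_tanh(fX, a, b, c)
    L = np.linalg.cholesky(K_X)
    r = np.linalg.solve(L, z - M @ beta)          # whitened residual
    quad = 0.5 * r @ r
    logdet = np.sum(np.log(np.diag(L)))           # equals 0.5*log|K_X|
    log_jac = np.sum(np.log(warp_tanh_deriv(fX, a, b, c)))
    return quad + logdet - log_jac
```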
2.3 Bayesian Transformed GPs (BTG)
One might think of the Bayesian Transformed Gaussian (BTG) model (Oliveira et al., 1997) as a fully Bayesian generalization of WGP. BTG uses Bayesian model selection and marginalizes out priors over all model parameters: transformation parameters $\lambda$, mean vector $\beta$, precision $\tau$, and lengthscales $\theta$. Just like WGP, BTG models a function $f(x)$ as
$$(g_\lambda \circ f) \mid \beta, \tau, \lambda, \theta \sim \mathcal{GP}(\mu_\beta, \tau^{-1}k_\theta).$$
BTG was originally a Bayesian generalization of trans-kriging models. Because appropriate values for $\beta$, $\tau$, and $\theta$ depend nontrivially on $\lambda$, BTG adopts the improper joint prior
$$p(\beta, \tau, \theta, \lambda) \propto p(\theta)\, p(\lambda)\, /\, \big(\tau J_\lambda^{p/n}\big).$$
As it turns out, BTG's posterior predictive distribution can be approximated as a mixture of t-distributions:
$$p(f(x) \mid f_X) = \sum_{i=1}^{M} w_i\, p\big(g_{\lambda_i}(f(x)) \mid \theta_i, \lambda_i, f_X\big),$$
where here $p$ is the t-distribution pdf. We provide a condensed derivation in §2.4 and 3; for a comprehensive analysis, see Box and Cox (1964). This predictive distribution must then be inverted to perform prediction or uncertainty quantification.
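The sketch below illustrates the naive version of this inversion step: because each $g_{\lambda_i}$ is strictly increasing, the mixture CDF on the original scale can be evaluated by pushing a candidate value through each component's warp and t-distribution CDF, and the median is then found by one-dimensional root-finding. The weights, warps, and t-parameters here are placeholders; the quantile bounds and sparsification introduced later in this paper are what make this step cheap.

```python
import numpy as np
from scipy.stats import t as student_t
from scipy.optimize import brentq

def mixture_cdf(y, weights, warps, dofs, locs, scales):
    """P(f(x) <= y) = sum_i w_i * T_{dof_i}((g_i(y) - loc_i)/scale_i),
    valid because each warp g_i is strictly increasing."""
    return sum(w * student_t.cdf((g(y) - m) / s, df=nu)
               for w, g, nu, m, s in zip(weights, warps, dofs, locs, scales))

def mixture_quantile(p, weights, warps, dofs, locs, scales, lo=-1e3, hi=1e3):
    """Invert the mixture CDF numerically; the median corresponds to p = 0.5."""
    return brentq(lambda y: mixture_cdf(y, weights, warps, dofs, locs, scales) - p,
                  lo, hi)

# Hypothetical two-component example; all numbers are placeholders.
weights = [0.6, 0.4]
warps = [lambda y: y, lambda y: y + np.tanh(y)]   # both strictly increasing
median = mixture_quantile(0.5, weights, warps, dofs=[10, 12],
                          locs=[0.3, 0.1], scales=[0.8, 1.1])
```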
Figure 1 demonstrates the advantage of fully Bayesian model selection. BTG resolves the underlying datapoints much better than a GP. In later sections, we explore the advantages of being Bayesian in the low-data regime.
2.4 The Predictive Density
A key idea of the BTG model is that, conditioned on $\lambda$, $\theta$, and $f_X$, the resulting WGP is a generalized linear model (Oliveira et al., 1997). We estimate $\beta$ by $\hat\beta_{\theta,\lambda}$, the solution to the weighted least squares problem
$$q_{\theta,\lambda} = \min_\beta\, \|g_\lambda(f_X) - M_X\beta\|^2_{K_X^{-1}},$$
where $q_{\theta,\lambda}$ is the residual norm. BTG then adopts a conditional normal-inverse-gamma posterior on $(\beta, \tau)$:
$$\beta \mid \tau, \lambda, \theta, f_X \sim \mathcal{N}\big(\hat\beta_{\lambda,\theta},\ \tau^{-1}(M_X^T K_X^{-1} M_X)^{-1}\big),$$
$$\tau \mid \lambda, \theta, f_X \sim \mathrm{Ga}\big(\tfrac{n-p}{2},\ \tfrac{2}{q_{\lambda,\theta}}\big).$$
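A minimal sketch of this embedded generalized least squares solve (illustrative helper names, not the paper's implementation): whitening by a Cholesky factor of $K_X$ reduces the weighted problem to ordinary least squares, yielding $\hat\beta_{\lambda,\theta}$ and the residual $q_{\theta,\lambda}$.

```python
import numpy as np

def gls_beta_hat(z, M, K_X):
    """Solve min_beta ||z - M beta||^2_{K_X^{-1}} for z = g_lambda(fX).

    Returns beta_hat and the minimized value q = ||z - M beta_hat||^2_{K_X^{-1}}.
    """
    L = np.linalg.cholesky(K_X)
    Mw = np.linalg.solve(L, M)          # whitened covariates L^{-1} M
    zw = np.linalg.solve(L, z)          # whitened observations L^{-1} z
    beta_hat, *_ = np.linalg.lstsq(Mw, zw, rcond=None)
    r = zw - Mw @ beta_hat
    q = float(r @ r)
    return beta_hat, q
```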
At a point $x$, the marginal predictive density of $g_\lambda(f(x))$ is then given by the following t-distribution:
$$g_\lambda(f(x)) \mid \lambda, \theta, f_X \sim \mathcal{T}_{n-p}\big(m_{\lambda,\theta},\ (q_{\theta,\lambda}\, C_{\theta,\lambda})^{-1}\big), \qquad (1)$$
where the mean largely resembles that of a GP:
$$m_{\lambda,\theta} = K_{xX} K_X^{-1}\big(g_\lambda(f_X) - M_X \hat\beta_{\lambda,\theta}\big) + \hat\beta_{\lambda,\theta}^T m(x),$$
and $C_{\lambda,\theta}$ is the final Schur complement $B(x)/[k_\theta(x,x)]$ of the bordered matrix
$$B(x) = \begin{bmatrix} 0 & M_X^T & m(x) \\ M_X & K_X & K_{Xx} \\ m(x)^T & K_{Xx}^T & k_\theta(x,x) \end{bmatrix}.$$
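Continuing the sketch (and reusing gls_beta_hat from above), $m_{\lambda,\theta}$ and $C_{\lambda,\theta}$ can be assembled directly from these blocks; the Schur complement is computed here through its equivalent universal-kriging expansion $k_\theta(x,x) - K_{xX}K_X^{-1}K_{Xx} + d^T(M_X^T K_X^{-1} M_X)^{-1} d$ with $d = m(x) - M_X^T K_X^{-1} K_{Xx}$.

```python
import numpy as np

def btg_predictive_params(m_x, K_Xx, k_xx, z, M, K_X, beta_hat):
    """Location m_{lambda,theta} and scale factor C_{lambda,theta} in Eq. (1).

    z = g_lambda(fX); beta_hat from gls_beta_hat. C is the last Schur
    complement of the bordered matrix B(x), expanded in universal-kriging form.
    """
    L = np.linalg.cholesky(K_X)
    solve = lambda b: np.linalg.solve(L.T, np.linalg.solve(L, b))
    Kinv_KXx = solve(K_Xx)
    m = K_Xx @ solve(z - M @ beta_hat) + beta_hat @ m_x   # predictive location
    d = m_x - M.T @ Kinv_KXx
    W = M.T @ solve(M)                                     # M_X^T K_X^{-1} M_X
    C = k_xx - K_Xx @ Kinv_KXx + d @ np.linalg.solve(W, d)
    return m, C
```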
By Bayes’ theorem, the marginal posterior of BTG is: