
better uncertainty propagation, it comes with several
computational challenges, which hinder its scalability
and limit its competitiveness with the MLE approach.
First, the cost of numerical integration in BTG scales
with the dimension of the hyperparameter space, which
can be large when transform and noise-model parameters
are incorporated. Traditional methods such as
Monte Carlo (MC) suffer from slow convergence. We
therefore leverage sparse grid quadrature and quasi-Monte
Carlo (QMC), which offer a higher degree of
precision but require a sufficiently smooth integrand.
Second, the posterior mean of BTG is not guaranteed
to exist, hence the need to use the posterior median
predictor. The posterior median and credible intervals
do not generally have closed forms, so one must resort
to expensive numerical root-finding to compute them.
Finally, while fast cross-validation schemes are known
for vanilla GP models, leave-one-out cross-validation
(LOOCV) for BTG, which naively incurs quartic cost,
is less straightforward to perform because of an
embedded generalized least squares problem.
In this paper, we reduce the overall computational cost
of end-to-end BTG inference, including model predic-
tion and selection. Our main contributions follow.
• We propose efficient and scalable methods for computing BTG predictive medians and quantiles through a combination of doubly sparse quadrature and quantile bounds. We also propose fast LOOCV using rank-one matrix algebra.
• We develop a framework to control the tradeoff between speed and accuracy for BTG and analyze the error in sparsifying QMC and sparse grid quadrature rules.
• We empirically compare the Bayesian and MLE approaches and provide experimental results for BTG and WGP coupled with 1-layer and 2-layer transformations. We find evidence that BTG is well-suited for low-data regimes, where hyperparameters are under-specified by the data.
• We develop a modular Julia package for computing with transformed GPs (e.g., BTG and WGP) which exploits vectorized linear algebra operations and supports MLE and Bayesian inference.
2 Background
2.1 Gaussian Process Regression
A GP $f \sim \mathcal{GP}(\mu, \tau^{-1}k)$ is a distribution over functions
in $\mathbb{R}^d$, where $\mu(x)$ is the expected value of $f(x)$ and
$\tau^{-1}k(x, x')$ is the positive (semi-)definite covariance
between $f(x)$ and $f(x')$. For later clarity, we separate
the precision hyperparameter $\tau$ from lengthscales and
other kernel hyperparameters (typically denoted by $\theta$).
Unless otherwise specified, we assume a linear mean
field and the squared exponential kernel:
$$\mu_\beta(x) = \beta^T m(x), \qquad m: \mathbb{R}^d \to \mathbb{R}^p,$$
$$k_\theta(x, x') = \exp\left(-\tfrac{1}{2}\|x - x'\|^2_{D_\theta^{-2}}\right).$$
Here $m$ is a known function mapping a location to a
vector of covariates, $\beta$ consists of coefficients in the
linear combination, and $D_\theta^2$ is a diagonal matrix of
lengthscales determined by the parameter(s) $\theta$.
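
For concreteness, the following is a minimal Julia sketch of this mean field and kernel. It is not the package described later; the names (`m`, `mean_field`, `se_kernel`) and the particular choice $m(x) = [1; x]$ are assumptions made for illustration.

```julia
using LinearAlgebra

# Hypothetical covariate map m : ℝᵈ → ℝᵖ; here we assume m(x) = [1; x], so p = d + 1.
m(x) = vcat(1.0, x)

# Linear mean field μ_β(x) = βᵀ m(x).
mean_field(x, β) = dot(β, m(x))

# Squared exponential kernel k_θ(x, x′) = exp(-½ (x - x′)ᵀ D_θ⁻² (x - x′)),
# where θ holds the per-dimension lengthscales on the diagonal of D_θ.
function se_kernel(x, y, θ)
    z = (x .- y) ./ θ            # elementwise scaling by lengthscales
    return exp(-0.5 * dot(z, z))
end
```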
For any finite set of input locations, let:
$$X = [x_1, \ldots, x_n]^T \in \mathbb{R}^{n \times d},$$
$$M_X = [m(x_1), \ldots, m(x_n)]^T \in \mathbb{R}^{n \times p},$$
$$f_X = [f(x_1), \ldots, f(x_n)]^T \in \mathbb{R}^n,$$
where $X$ is the matrix of observation locations, $M_X$
is the matrix of covariates at $X$, and $f_X$ is the vector
of observations. A GP has the property that any finite
number of evaluations of $f$ will have a joint Gaussian
distribution: $f_X \mid \beta, \tau, \theta \sim \mathcal{N}(M_X\beta, \tau^{-1}K_X)$, where
$(\tau^{-1}K_X)_{ij} = \tau^{-1}k_\theta(x_i, x_j)$ is the covariance matrix
of $f_X$. We assume $M_X$ to be full rank.
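
Continuing the sketch above (and reusing the hypothetical `m` and `se_kernel`), one can assemble $M_X$ and $K_X$ and draw $f_X \sim \mathcal{N}(M_X\beta, \tau^{-1}K_X)$; the hyperparameter values and the small diagonal jitter are illustrative assumptions.

```julia
using LinearAlgebra

d, n, p = 2, 50, 3
θ, τ, β = ones(d), 4.0, randn(p)                         # hypothetical hyperparameters
X = [randn(d) for _ in 1:n]                              # observation locations x₁, …, xₙ

M_X = reduce(hcat, m.(X))'                               # n × p covariate matrix
K_X = [se_kernel(X[i], X[j], θ) for i in 1:n, j in 1:n]  # unscaled kernel matrix

# One prior draw f_X ~ N(M_X β, τ⁻¹ K_X), via a Cholesky factor of K_X.
L = cholesky(Symmetric(K_X) + 1e-10I).L                  # jitter for numerical stability
f_X = M_X * β + (L * randn(n)) / sqrt(τ)
```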
The posterior predictive density at a point $x$ is:
$$f(x) \mid \beta, \tau, \theta, f_X \sim \mathcal{N}(\mu_{\theta,\beta},\, s_{\theta,\beta}),$$
$$\mu_{\theta,\beta} = \beta^T m(x) + K_{Xx}^T K_X^{-1}(f_X - M_X\beta),$$
$$s_{\theta,\beta} = \tau^{-1}\bigl(k_\theta(x, x) - K_{Xx}^T K_X^{-1} K_{Xx}\bigr),$$
where $(K_{Xx})_i = k_\theta(x_i, x)$. Typically, $\tau$, $\beta$, and $\theta$ are
fit by minimizing the negative log likelihood:
$$-\log L(f_X \mid X, \beta, \tau, \theta) \propto \tfrac{1}{2}\|f_X - M_X\beta\|^2_{K_X^{-1}} + \tfrac{1}{2}\log|K_X|.$$
This is known as maximum likelihood estimation
(MLE) of the kernel hyperparameters. In order to
improve the clarity of later sections, we modified the
standard GP treatment of Rasmussen and Williams
(2008); notational differences aside, our formulations
are equivalent.
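
As a rough illustration, again building on the hypothetical sketch above rather than on our implementation, the posterior predictive moments and this (unnormalized) negative log likelihood can each be computed from a Cholesky factorization of $K_X$:

```julia
# Posterior predictive mean μ_{θ,β} and variance s_{θ,β} at a test point x.
function posterior_predict(x, X, f_X, M_X, K_X, β, τ, θ)
    K_Xx = [se_kernel(xi, x, θ) for xi in X]              # (K_Xx)ᵢ = k_θ(xᵢ, x)
    F = cholesky(Symmetric(K_X) + 1e-10I)
    r = f_X - M_X * β
    μ = dot(β, m(x)) + dot(K_Xx, F \ r)
    s = (se_kernel(x, x, θ) - dot(K_Xx, F \ K_Xx)) / τ
    return μ, s
end

# Negative log likelihood (up to constants), matching the expression above.
function neg_log_likelihood(f_X, M_X, K_X, β)
    F = cholesky(Symmetric(K_X) + 1e-10I)
    r = f_X - M_X * β
    return 0.5 * dot(r, F \ r) + 0.5 * logdet(F)
end

μ, s = posterior_predict(randn(d), X, f_X, M_X, K_X, β, τ, θ)
nll  = neg_log_likelihood(f_X, M_X, K_X, β)
```

In practice one would factor $K_X$ once and reuse it across both routines; the sketch keeps them separate only for readability.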
2.2 Warped Gaussian Processes
While GPs are powerful tools for modeling nonlinear
functions, they make the fairly strong assumptions
of Gaussianity and homoscedasticity. WGPs (Snelson
et al., 2004) address this problem by warping the ob-
servation space to a latent space, which itself is mod-
eled by a GP. Given a strictly increasing, differentiable