Log-Linear-Time Gaussian Processes
Using Binary Tree Kernels
Michael K. Cohen
University of Oxford
michael.cohen@eng.ox.ac.uk
Samuel Daulton
University of Oxford, Meta
sdaulton@meta.com
Michael A. Osborne
University of Oxford
mosb@robots.ox.ac.uk
Abstract
Gaussian processes (GPs) produce good probabilistic models of functions, but most GP kernels require $O((n+m)n^2)$ time, where $n$ is the number of data points and $m$ the number of predictive locations. We present a new kernel that allows for Gaussian process regression in $O((n+m)\log(n+m))$ time. Our "binary tree" kernel places all data points on the leaves of a binary tree, with the kernel depending only on the depth of the deepest common ancestor. We can store the resulting kernel matrix in $O(n)$ space in $O(n \log n)$ time, as a sum of sparse rank-one matrices, and approximately invert the kernel matrix in $O(n)$ time. Sparse GP methods also offer linear run time, but they predict less well than higher dimensional kernels. On a classic suite of regression tasks, we compare our kernel against Matérn, sparse, and sparse variational kernels. The binary tree GP assigns the highest likelihood to the test data on a plurality of datasets, usually achieves lower mean squared error than the sparse methods, and often ties or beats the Matérn GP. On large datasets, the binary tree GP is fastest, and much faster than a Matérn GP.
1 Introduction
Gaussian processes (GPs) can be used to perform regression with high-quality uncertainty estimates, but they are slow. Naïvely, GP regression requires $O(n^3 + n^2 m)$ computation time and $O(n^2)$ computation space when predicting at $m$ locations given $n$ data points [28]. A kernel matrix of size $n \times n$ must be inverted (or Cholesky decomposed), and then $m$ matrix-vector multiplications must be done with that inverse matrix (or $m$ linear solves with the Cholesky factors). A few methods that we will discuss later achieve $O(n^2 m)$ time complexity [25, 30].
With special kernels, GP regression can be faster and use less space. Inducing point methods, using $z$ inducing points, allow regression to be done in $O(z^2(n+m))$ time and in $O(z^2 + zn)$ space [21, 22, 23, 13]. We will discuss the details of these inducing point kernels later, but they are kernels in their own right, not just approximations to other kernels. Unfortunately, these kernels are low dimensional (having a $z$-dimensional Hilbert space), which limits the expressivity of the GP model.

We present a new kernel, the binary tree kernel, that also allows for GP regression in $O(n+m)$ space and $O((n+m)\log(n+m))$ time (both model fitting and prediction). The time and space complexity of our method is also linear in the depth of the binary tree, which is naïvely linear in the dimension of the data, although in practice we can increase the depth sublinearly. Training some kernel parameters takes time quadratic in the depth of the tree. The dimensionality of the binary tree kernel is exponential in the depth of the tree, making it much more expressive than an inducing points kernel. Whereas for an inducing points kernel, the runtime is quadratic in the dimension of the Hilbert space, for the binary tree kernel, it is only logarithmic: an exponential speedup.
A simple depiction of our kernel is shown in Figure 1, which we will define precisely in Section
3. First, we create a procedure for placing all data points on the leaves of a binary tree. Given the
binary tree, the kernel between two points depends only on the depth of the deepest common ancestor.
Because very different tree structures are possible for the data, we can easily form an ensemble of
diverse GP regression models. Figure 2 depicts a schematic sample from a binary tree kernel. Note
how the posterior mean is piecewise flat, but the pieces can be small.
Figure 1: A binary tree kernel with four data points. In this example, $k(x_1, x_1) = 1$, $k(x_1, x_2) = 0$, $k(x_1, x_3) = 0.8$, and $k(x_1, x_4) = 0.3$.
On a standard suite of benchmark regression tasks [25], we show that our kernel usually achieves better negative log likelihood (NLL) than state-of-the-art sparse methods and conjugate-gradient-based "exact" methods, at lower computational cost in the big-data regime.
There are not many limitations to using our kernel. The main limitation is that other kernels sometimes capture the relationships in the data better. We do not have a good procedure for understanding when data has more Matérn character or more binary tree character (except through running both and comparing training NLL). But given that the binary tree kernel usually outperforms the Matérn, we'll tentatively say the best first guess is that a new dataset has more binary tree character. One concrete limitation for some applications, like Bayesian Optimization, is that the posterior mean is piecewise-flat, so gradient-based heuristics for finding extrema would not work.
In contexts where a piecewise-flat posterior mean is suitable, we struggle to see when one would
prefer a sparse or sparse variational GP to a binary tree kernel. The most thorough approach would
be to run both and see which has a better training NLL, but if you had to pick one, the binary tree
GP seems to be better performing and comparably fast. If minimizing mean-squared error is the
objective, the Matérn kernel seems to do slightly better than the binary tree. If the dataset is small,
and one needs a very fast prediction, a Matérn kernel may be the best option. But otherwise, if one
cares about well-calibrated predictions, these initial results we present tentatively suggest using a
binary tree kernel over the widely-used Matérn kernel.
The log-linear time and linear space complexity of the binary tree GP, with performance exceeding a
“normal” kernel, could profoundly expand the viability of GP regression to larger datasets.
2 Preliminaries
Our problem setting is regression. Given a function $f: \mathcal{X} \to \mathbb{R}$, for some arbitrary set $\mathcal{X}$, we would like to predict $f(x)$ for various $x \in \mathcal{X}$. What we have are observations of $f(x)$ for various (other) $x \in \mathcal{X}$. Let $X \in \mathcal{X}^n$ be an $n$-tuple of elements of $\mathcal{X}$, and let $y \in \mathbb{R}^n$ be an $n$-tuple of real numbers, such that $y_i \sim f(X_i) + \mathcal{N}(0, \lambda)$, for $\lambda \in \mathbb{R}_{\geq 0}$. $X$ and $y$ comprise our training data.
With an $m$-tuple of test locations $X' \in \mathcal{X}^m$, let $y' \in \mathbb{R}^m$, with $y'_i = f(X'_i)$. $y'$ is the ground truth for the target locations. Given training data, we would like to produce a distribution over $\mathbb{R}$ for each target location $X'_i$, such that it assigns high marginal probability to the unknown $y'_i$. Alternatively, we sometimes would like to produce point estimates $\hat{y}'_i$ in order to minimize the squared error $(\hat{y}'_i - y'_i)^2$.
Figure 2: A schematic diagram of a function sampled from a binary tree kernel. The function is over the interval [0, 1], and points on the interval are placed onto the leaves of a depth-4 binary tree according to the first 4 bits of their binary expansion. The sampled function is in black. Purple represents the sample if the tree had depth 3, green depth 2, orange depth 1, and red depth 0.

A GP prior over functions is defined by a mean function $m: \mathcal{X} \to \mathbb{R}$, and a kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. The expected function value at a point $x$ is defined to be $m(x)$, and the covariance of the function values at two points $x_1$ and $x_2$ is defined to be $k(x_1, x_2)$. Let $K_{XX} \in \mathbb{R}^{n \times n}$ be the matrix of kernel values $(K_{XX})_{ij} = k(X_i, X_j)$, and let $m_X \in \mathbb{R}^n$ be the vector of mean values $(m_X)_i = m(X_i)$. For a GP to be well-defined, the kernel must be such that $K_{XX}$ is positive semidefinite for any $X \in \mathcal{X}^n$. For a point $x \in \mathcal{X}$, let $K_{Xx} \in \mathbb{R}^n$ be the vector of kernel values $(K_{Xx})_i = k(X_i, x)$, and let $K_{xX} = K_{Xx}^\top$. Let $\lambda \geq 0$ be the variance of observation noise. Let $\mu_x$ and $\sigma^2_x$ be the mean and variance of our posterior predictive distribution at $x$. Then, with $K^{\lambda\mathrm{inv}}_{XX} = (K_{XX} + \lambda I)^{-1}$,

$$\mu_x := (y - m_X)^\top K^{\lambda\mathrm{inv}}_{XX} K_{Xx} + m(x) \quad (1)$$
$$\sigma^2_x := k(x, x) - K_{xX} K^{\lambda\mathrm{inv}}_{XX} K_{Xx} + \lambda \quad (2)$$

See Williams and Rasmussen [28] for a derivation. We compute Equations 1 and 2 for all $x \in X'$.
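To make the preceding definitions concrete, here is a minimal sketch that computes Equations 1 and 2 naively with a Cholesky factorization, i.e. the $O(n^3 + n^2 m)$ route described in Section 1. The names (`kernel_fn`, `mean_fn`, `noise_var`) are our placeholders, not code from the paper.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_posterior(kernel_fn, X, y, X_test, noise_var, mean_fn=lambda x: 0.0):
    """Naive GP posterior mean and variance (Equations 1 and 2)."""
    n = len(X)
    y = np.asarray(y, dtype=float)
    K_XX = np.array([[kernel_fn(a, b) for b in X] for a in X])   # (K_XX)_ij = k(X_i, X_j)
    m_X = np.array([mean_fn(x) for x in X])
    chol = cho_factor(K_XX + noise_var * np.eye(n))              # factor K_XX + lambda*I once: O(n^3)
    alpha = cho_solve(chol, y - m_X)                             # (K_XX + lambda*I)^{-1} (y - m_X)
    means, variances = [], []
    for x in X_test:                                             # O(n^2) per test location
        K_Xx = np.array([kernel_fn(xi, x) for xi in X])
        means.append(K_Xx @ alpha + mean_fn(x))                                       # Equation 1
        variances.append(kernel_fn(x, x) - K_Xx @ cho_solve(chol, K_Xx) + noise_var)  # Equation 2
    return np.array(means), np.array(variances)
```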
3 Binary tree kernel
We now introduce the binary tree kernel. First, we encode our data points as binary strings. So we have $\mathcal{X} = \mathbb{B}^q$, where $\mathbb{B} = \{0, 1\}$, and $q \in \mathbb{N}$.
If $\mathcal{X} = \mathbb{R}^d$, we must map $\mathbb{R}^d \mapsto \mathbb{B}^q$. First, we rescale all points (training points and test points) to lie within the box $[0,1]^d$. (If we have a stream of test points, and one lands outside of the box $[0,1]^d$, we can either set $K_{xX}$ to $0$ for that point or we rescale and retrain in $O(n \log n)$ time.) Then, for each $x \in [0,1]^d$, for each dimension, we take the binary expansion up to some precision $p$, and for those $d \times p$ bits, we permute them using some fixed permutation. We call this permutation the bit order, and it is the same for all $x \in [0,1]^d$. Note that now $q = dp$. See Figure 3 for an example. We optimize the bit order during training, and we can also form an ensemble of GPs using different bit orders.
Figure 3: Function from $[0,1]^2 \to \mathbb{B}^8$.
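A sketch of this encoding step follows (our own illustration; the function names, the clipping detail, and the identity bit order are assumptions, not the authors' code):

```python
import numpy as np

def to_bits(x, p):
    """Binary expansion of each coordinate of x in [0,1]^d, truncated to p bits per dimension."""
    x = np.clip(np.asarray(x, dtype=float), 0.0, 1.0 - 1e-12)  # keep 1.0 from needing a carry
    bits = []
    for coord in x:
        for _ in range(p):
            coord *= 2.0
            bit = int(coord)                  # next bit of the binary expansion
            bits.append(bit)
            coord -= bit
    return np.array(bits, dtype=np.int8)      # length q = d * p, dimension-major order

def encode(x, p, bit_order):
    """Map a point in [0,1]^d to B^q, permuting the d*p bits by the (learned) bit order."""
    return to_bits(x, p)[bit_order]

# Example with d = 2, p = 4 (so q = 8) and the identity permutation as a placeholder bit order.
print(encode([0.3, 0.8], p=4, bit_order=np.arange(8)))    # [0 1 0 0 1 1 0 0]
```

In practice the bit order would be optimized during training, as described above; the identity permutation here is only a stand-in.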
For $x \in \mathbb{B}^q$, let $x^{\leq i}$ be the first $i$ bits of $x$. $[[\text{expression}]]$ evaluates to 1 if the expression is true, otherwise 0. We now define the kernel:

Definition 1 (Binary Tree Kernel). Given a weight vector $w \in \mathbb{R}^q$, with $w \geq 0$ and $\|w\|_1 = 1$,
$$k_w(x_1, x_2) = \sum_{i=1}^{q} w_i \, [[x_1^{\leq i} = x_2^{\leq i}]].$$
So the more leading bits shared by $x_1$ and $x_2$, the larger the covariance between the function values. Consider, for example, points $x_1$ and $x_4$ from Figure 1, where $x_1$ is (left, left, right), and $x_4$ is (left, right, right); they share only the first leading "bit". We train the weight vector $w$ to maximize the likelihood of the training data.
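A direct transcription of Definition 1 is short enough to state as code. The weights and bit strings below are our own choices, inferred so that the values match Figure 1 ($k(x_1,x_1)=1$, $k(x_1,x_3)=0.8$, $k(x_1,x_4)=0.3$); the paper instead learns $w$ by maximizing the training likelihood.

```python
import numpy as np

def binary_tree_kernel(w, x1, x2):
    """k_w(x1, x2) from Definition 1: the sum of w_i over the shared leading bits of x1 and x2."""
    k = 0.0
    for i, (b1, b2) in enumerate(zip(x1, x2)):
        if b1 != b2:
            break            # once a bit differs, no deeper prefix can match
        k += w[i]
    return k

w = np.array([0.3, 0.5, 0.2])                   # q = 3, w >= 0, ||w||_1 = 1
x1, x3, x4 = [0, 0, 1], [0, 0, 0], [0, 1, 1]    # left = 0, right = 1, as in Figure 1
print(binary_tree_kernel(w, x1, x1))            # ~1.0
print(binary_tree_kernel(w, x1, x3))            # ~0.8 (two shared leading bits)
print(binary_tree_kernel(w, x1, x4))            # ~0.3 (one shared leading bit)
```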
Proposition 1 (Positive Semidefiniteness). For $X \in \mathcal{X}^n$, for $k = k_w$, $K_{XX} \succeq 0$.

Proof. Let $s \in \bigcup_{i=1}^{q} \mathbb{B}^i$ be a binary string, and let $|s|$ be the length of $s$. Let $X_{[s]} \in \mathbb{R}^n$ with $(X_{[s]})_j = [[X_j^{\leq |s|} = s]]$. $X_{[s]} X_{[s]}^\top$ is clearly positive semidefinite. Finally, $K_{XX} = \sum_{i=1}^{q} \sum_{s \in \mathbb{B}^i} w_i X_{[s]} X_{[s]}^\top$, and recall $w_i \geq 0$, so $K_{XX} \succeq 0$.
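The decomposition used in this proof can be written out directly; the following dense sketch (our own, for illustration) builds $K_{XX}$ exactly as the sum $\sum_i \sum_s w_i X_{[s]} X_{[s]}^\top$, and it previews the sparse-rank-one-sum representation of Section 4.

```python
import numpy as np

def kernel_matrix_rank_one_sum(w, X_bits):
    """K_XX = sum_{i=1}^q sum_{s in B^i} w_i X_[s] X_[s]^T (proof of Proposition 1).
    X_bits: (n, q) array of bit strings; w: length-q nonnegative weights summing to 1."""
    n, q = X_bits.shape
    K = np.zeros((n, n))
    for i in range(1, q + 1):
        prefixes = [tuple(row[:i]) for row in X_bits]       # length-i prefix of each point
        for s in set(prefixes):                             # unrealized prefixes give X_[s] = 0
            X_s = np.array([1.0 if pre == s else 0.0 for pre in prefixes])
            K += w[i - 1] * np.outer(X_s, X_s)              # nonnegative-weighted PSD term
    return K
```

On small inputs this agrees entry-by-entry with evaluating $k_w$ pairwise.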
4 Sparse rank one sum representation
In order to do GP regression in $O(n)$ space and $O(n \log n)$ time, we develop a "Sparse Rank One Sum" representation of linear operators (SROS). This was developed separately from the very similar hierarchical matrices [1], which we discuss below. In SROS form, linear transformation of a vector can be done in $O(n)$ time instead of $O(n^2)$. We will store our kernel matrix and inverse kernel matrix in SROS form. The proof of Proposition 1 exemplifies representing a matrix as the sum of sparse rank one matrices. Note that each $X_{[s]}$ is sparse: if $q$ is large, most $X_{[s]}$s are the zero vector.
We now show how to interpret an SROS representation of an $n \times n$ matrix. Let $[n] = \{1, 2, \ldots, n\}$. For $r \in \mathbb{N}$, let $L: [r]^n \times [r]^n \times \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}^{n \times n}$ construct a linear operator from four vectors.
Definition 2 (Linear Operator from Simple SROS Representation). Let $p, p' \in [r]^n$, and let $u, u' \in \mathbb{R}^n$. For $l \in [r]$, let $u^{p=l} \in \mathbb{R}^n$ be the vector where $u^{p=l}_j = u_j [[p_j = l]]$, likewise for $u'$ and $p'$. Then:
$$L(p, p', u, u') \mapsto \sum_{l=1}^{r} u^{p=l} \bigl((u')^{p'=l}\bigr)^\top.$$
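For intuition, Definition 2 can be materialized as a dense matrix on small examples; the sketch below is ours and is only meant to spell out the sum of masked outer products (in practice the dense matrix is never formed).

```python
import numpy as np

def simple_sros_to_dense(p, p_prime, u, u_prime):
    """Dense n x n matrix L(p, p', u, u') = sum_l u^{p=l} ((u')^{p'=l})^T."""
    n = len(u)
    M = np.zeros((n, n))
    for l in np.union1d(p, p_prime):
        u_l = np.where(p == l, u, 0.0)                    # u^{p=l}: u masked to part l of p
        u_prime_l = np.where(p_prime == l, u_prime, 0.0)  # (u')^{p'=l}
        M += np.outer(u_l, u_prime_l)                     # sparse rank-one contribution
    return M

# With p == p', permuting indices so equal labels are adjacent makes M block diagonal.
p = np.array([0, 1, 0, 1, 1])
u = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
M = simple_sros_to_dense(p, p, u, u)
```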
Figure 4: A matrix in standard form constructed from a matrix in SROS form. The large square depicts the matrix $L(p, p', u, u') \in \mathbb{R}^{5 \times 5}$ with elements colored by value. See $u^{p=0}$ for a color legend.
We depict Definition 2 in Figure 4. $p$ and $p'$ represent partitions over $n$ elements: all elements with the same integer value in the vector $p$ belong to the same partition. Note that $r$, the number of parts in the partition, need not exceed $n$, the number of elements being partitioned. If $p = p'$ (which is almost always the case for us) and the elements of $p$, $u$, and $u'$ were shuffled so that all elements in the same partition were next to each other, then $L(p, p', u, u')$ would be block diagonal. Note that $L(p, p', u, u')$ is not necessarily low rank. If $p$ is the finest possible partition, and $p = p'$, $L(p, p', u, u')$ is diagonal. SROS matrices can be thought of as a generalization of two types of matrix that are famously amenable to fast computation: rank one matrices (all points in the same partition) and diagonal matrices (each point in its own partition).
We now extend the definition of $L$ to allow for multiple $p$, $p'$, $u$, and $u'$ vectors.
Definition 3 (Linear Operator from SROS Representation). Let $L: [r]^{n \times q} \times [r]^{n \times q} \times \mathbb{R}^{n \times q} \times \mathbb{R}^{n \times q} \to \mathbb{R}^{n \times n}$. Let $P, P' \in [r]^{n \times q}$, and let $U, U' \in \mathbb{R}^{n \times q}$. Let $P_{:,i}$, $U_{:,i}$, etc. be the $i$th columns of the respective arrays. Then:
$$L(P, P', U, U') \mapsto \sum_{i=1}^{q} L(P_{:,i}, P'_{:,i}, U_{:,i}, U'_{:,i}).$$
Algorithm 1 performs linear transformation of a vector using the SROS representation in $O(nq)$ time.

Algorithm 1 Linear Transformation with SROS Linear Operator. This can be vectorized on a Graphical Processing Unit (GPU), using e.g. torch.Tensor.index_add_ for Line 5 and non-slice indexing for Line 6 [19]. Slight restructuring allows vectorization over $[q]$ as well.

Require: $P, P' \in [r]^{n \times q}$, $U, U' \in \mathbb{R}^{n \times q}$, $x \in \mathbb{R}^n$
Ensure: $y = L(P, P', U, U')\,x$
1: $y \leftarrow 0 \in \mathbb{R}^n$
2: for $i \in [q]$ do   ▷ $O(nq)$ time
3:   $p, p', u, u' \leftarrow P_{:,i}, P'_{:,i}, U_{:,i}, U'_{:,i}$
4:   $z \leftarrow 0 \in \mathbb{R}^r$   ▷ $z_l$ will store the dot product $((u')^{p'=l})^\top x^{p'=l}$
5:   for $j \in [n]$ do $z_{p'_j} \leftarrow z_{p'_j} + u'_j x_j$   ▷ $O(n)$ time
6:   for $j \in [n]$ do $y_j \leftarrow y_j + z_{p_j} u_j$   ▷ $O(n)$ time
7: return $y$
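As a reference, here is a NumPy sketch of Algorithm 1; `np.add.at` plays the role of the `torch.Tensor.index_add_` call mentioned in the caption. This is our illustrative rendering, not the authors' released implementation.

```python
import numpy as np

def sros_matvec(P, P_prime, U, U_prime, x, r):
    """Compute y = L(P, P', U, U') x in O(nq) time, following Algorithm 1."""
    n, q = P.shape
    y = np.zeros(n)
    for i in range(q):                              # one simple SROS component per column
        p, p_prime = P[:, i], P_prime[:, i]
        u, u_prime = U[:, i], U_prime[:, i]
        z = np.zeros(r)
        np.add.at(z, p_prime, u_prime * x)          # Line 5: z[l] = ((u')^{p'=l})^T x
        y += z[p] * u                               # Line 6: scatter each part's dot product back
    return y
```

On small problems the result can be checked against the dense matrix built directly from Definitions 2 and 3.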
We now discuss how to approximately invert a certain kind of symmetric SROS matrix, but our methods could be extended to asymmetric matrices. First, we define a partial ordering over partitions. For two partitions $p, p'$, we say $p' \preceq p$ if $p'$ is finer than or equal to $p$; that is, $p'_j = p'_{j'} \implies p_j = p_{j'}$. Using that partial ordering, a symmetric SROS matrix can be approximately inverted efficiently if for all $1 \leq i, i' \leq q$, $P_{:,i} \preceq P_{:,i'}$ or $P_{:,i'} \preceq P_{:,i}$. As the reader may have recognized, our kernel matrix $K_{XX}$ can be written as an SROS matrix with this property.
We will write symmetric SROS matrices in a slightly more convenient form. All $(u')^{p=l}$ must be a constant times $u^{p=l}$. We will store these constants in an array $C$. Let $L(P, C, U)$ be shorthand for $L(P, P, U, C \odot U)$, where $\odot$ denotes element-wise multiplication. For $L(P, C, U)$ to be symmetric, it must be the case that $P_{ji} = P_{j'i} \implies C_{ji} = C_{j'i}$. Then, all elements of $U$ corresponding to a given $u^{p=l}$ are multiplied by the same constant. We now present an algorithm for calculating $(L(P, C, U) + \lambda I)^{-1}$, for $\lambda \neq 0$, which is an approximate inversion of $L(P, C, U)$. We have not yet analyzed numerical sensitivity for $\lambda \approx 0$, but we conjecture that all floating point numbers involved need to be stored to at least $\log_2(1/\lambda)$ bits. Without loss of generality, let $\lambda = 1$, and note
$$(L(P, C, U) + \lambda I)^{-1} = \lambda^{-1}\bigl(L(P, \lambda^{-1} C, U) + I\bigr)^{-1}.$$
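This rescaling relies on the identity $(A + \lambda I)^{-1} = \lambda^{-1}(\lambda^{-1} A + I)^{-1}$ together with the linearity of $L(P, C, U)$ in $C$; a quick dense sanity check of the identity (ours, for illustration only):

```python
import numpy as np

lam = 0.25
A = np.random.default_rng(0).standard_normal((4, 4))
A = A @ A.T                                        # symmetric PSD stand-in for L(P, C, U)
lhs = np.linalg.inv(A + lam * np.eye(4))
rhs = np.linalg.inv(A / lam + np.eye(4)) / lam     # lambda^{-1} (L(P, lambda^{-1} C, U) + I)^{-1}
assert np.allclose(lhs, rhs)
```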