Log-Linear-Time Gaussian Processes
Using Binary Tree Kernels
Michael K. Cohen
University of Oxford
michael.cohen@eng.ox.ac.uk
Samuel Daulton
University of Oxford, Meta
sdaulton@meta.com
Michael A. Osborne
University of Oxford
mosb@robots.ox.ac.uk
Abstract
Gaussian processes (GPs) produce good probabilistic models of functions, but most GP kernels require $O((n+m)n^2)$ time, where $n$ is the number of data points and $m$ the number of predictive locations. We present a new kernel that allows for Gaussian process regression in $O((n+m)\log(n+m))$ time. Our "binary tree" kernel places all data points on the leaves of a binary tree, with the kernel depending only on the depth of the deepest common ancestor. We can store the resulting kernel matrix in $O(n)$ space in $O(n \log n)$ time, as a sum of sparse rank-one matrices, and approximately invert the kernel matrix in $O(n)$ time. Sparse GP methods also offer linear run time, but they predict less well than higher dimensional kernels. On a classic suite of regression tasks, we compare our kernel against Matérn, sparse, and sparse variational kernels. The binary tree GP assigns the highest likelihood to the test data on a plurality of datasets, usually achieves lower mean squared error than the sparse methods, and often ties or beats the Matérn GP. On large datasets, the binary tree GP is fastest, and much faster than a Matérn GP.
1 Introduction
Gaussian processes (GPs) can be used to perform regression with high-quality uncertainty estimates, but they are slow. Naïvely, GP regression requires $O(n^3 + n^2 m)$ computation time and $O(n^2)$ computation space when predicting at $m$ locations given $n$ data points [28]. A kernel matrix of size $n \times n$ must be inverted (or Cholesky decomposed), and then $m$ matrix-vector multiplications must be done with that inverse matrix (or $m$ linear solves with the Cholesky factors). A few methods that we will discuss later achieve $O(n^2 m)$ time complexity [25, 30].
With special kernels, GP regression can be faster and use less space. Inducing point methods, using $z$ inducing points, allow regression to be done in $O(z^2(n+m))$ time and in $O(z^2 + zn)$ space [21, 22, 23, 13]. We will discuss the details of these inducing point kernels later, but they are kernels in their own right, not just approximations to other kernels. Unfortunately, these kernels are low dimensional (having a $z$-dimensional Hilbert space), which limits the expressivity of the GP model.

We present a new kernel, the binary tree kernel, that also allows for GP regression in $O(n+m)$ space and $O((n+m)\log(n+m))$ time (both model fitting and prediction). The time and space complexity of our method is also linear in the depth of the binary tree, which is naïvely linear in the dimension of the data, although in practice we can increase the depth sublinearly. Training some kernel parameters takes time quadratic in the depth of the tree. The dimensionality of the binary tree kernel is exponential in the depth of the tree, making it much more expressive than an inducing points kernel. Whereas for an inducing points kernel, the runtime is quadratic in the dimension of the Hilbert space, for the binary tree kernel, it is only logarithmic: an exponential speedup.
A simple depiction of our kernel is shown in Figure 1, which we will define precisely in Section
3. First, we create a procedure for placing all data points on the leaves of a binary tree. Given the
binary tree, the kernel between two points depends only on the depth of the deepest common ancestor.
Because very different tree structures are possible for the data, we can easily form an ensemble of
diverse GP regression models. Figure 2 depicts a schematic sample from a binary tree kernel. Note
how the posterior mean is piecewise flat, but the pieces can be small.
Figure 1: A binary tree kernel with four data points. In this example, $k(x_1, x_1) = 1$, $k(x_1, x_2) = 0$, $k(x_1, x_3) = 0.8$, and $k(x_1, x_4) = 0.3$.
On a standard suite of benchmark regression tasks [25], we show that our kernel usually achieves better negative log likelihood (NLL) than state-of-the-art sparse methods and conjugate-gradient-based "exact" methods, at lower computational cost in the big-data regime.
There are not many limitations to using our kernel. The main limitation is that other kernels sometimes capture the relationships in the data better. We do not have a good procedure for understanding when data has more Matérn character or more binary tree character (except through running both and comparing training NLL). But given that the binary tree kernel usually outperforms the Matérn, we'll tentatively say the best first guess is that a new dataset has more binary tree character. One concrete limitation for some applications, like Bayesian Optimization, is that the posterior mean is piecewise-flat, so gradient-based heuristics for finding extrema would not work.
In contexts where a piecewise-flat posterior mean is suitable, we struggle to see when one would
prefer a sparse or sparse variational GP to a binary tree kernel. The most thorough approach would
be to run both and see which has a better training NLL, but if you had to pick one, the binary tree
GP seems to be better performing and comparably fast. If minimizing mean-squared error is the
objective, the Matérn kernel seems to do slightly better than the binary tree. If the dataset is small,
and one needs a very fast prediction, a Matérn kernel may be the best option. But otherwise, if one
cares about well-calibrated predictions, these initial results we present tentatively suggest using a
binary tree kernel over the widely-used Matérn kernel.
The log-linear time and linear space complexity of the binary tree GP, with performance exceeding a
“normal” kernel, could profoundly expand the viability of GP regression to larger datasets.
2 Preliminaries
Our problem setting is regression. Given a function $f: \mathcal{X} \to \mathbb{R}$, for some arbitrary set $\mathcal{X}$, we would like to predict $f(x)$ for various $x \in \mathcal{X}$. What we have are observations of $f(x)$ for various (other) $x \in \mathcal{X}$. Let $X \in \mathcal{X}^n$ be an $n$-tuple of elements of $\mathcal{X}$, and let $y \in \mathbb{R}^n$ be an $n$-tuple of real numbers, such that $y_i \sim f(X_i) + \mathcal{N}(0, \lambda)$, for $\lambda \in \mathbb{R}_{\geq 0}$. $X$ and $y$ comprise our training data.
With an $m$-tuple of test locations $X' \in \mathcal{X}^m$, let $y' \in \mathbb{R}^m$, with $y'_i = f(X'_i)$. $y'$ is the ground truth for the target locations. Given training data, we would like to produce a distribution over $\mathbb{R}$ for each target location $X'_i$, such that it assigns high marginal probability to the unknown $y'_i$. Alternatively, we sometimes would like to produce point estimates $\hat{y}'_i$ in order to minimize the squared error $(\hat{y}'_i - y'_i)^2$.
Figure 2: A schematic diagram of a function sampled from a binary tree kernel. The function is over the interval [0, 1], and points on the interval are placed onto the leaves of a depth-4 binary tree according to the first 4 bits of their binary expansion. The sampled function is in black. Purple represents the sample if the tree had depth 3, green depth 2, orange depth 1, and red depth 0.

A GP prior over functions is defined by a mean function $m: \mathcal{X} \to \mathbb{R}$, and a kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. The expected function value at a point $x$ is defined to be $m(x)$, and the covariance of the function values at two points $x_1$ and $x_2$ is defined to be $k(x_1, x_2)$. Let $K_{XX} \in \mathbb{R}^{n \times n}$ be the matrix of kernel values $(K_{XX})_{ij} = k(X_i, X_j)$, and let $m_X \in \mathbb{R}^n$ be the vector of mean values $(m_X)_i = m(X_i)$. For a GP to be well-defined, the kernel must be such that $K_{XX}$ is positive semidefinite for any $X \in \mathcal{X}^n$. For a point $x \in \mathcal{X}$, let $K_{Xx} \in \mathbb{R}^n$ be the vector of kernel values $(K_{Xx})_i = k(X_i, x)$, and let $K_{xX} = K_{Xx}^\top$. Let $\lambda \geq 0$ be the variance of observation noise. Let $\mu_x$ and $\sigma^2_x$ be the mean and variance of our posterior predictive distribution at $x$. Then, with $K^{\lambda\mathrm{inv}}_{XX} = (K_{XX} + \lambda I)^{-1}$,

$$\mu_x := (y - m_X)^\top K^{\lambda\mathrm{inv}}_{XX} K_{Xx} + m(x) \quad (1)$$
$$\sigma^2_x := k(x, x) - K_{xX} K^{\lambda\mathrm{inv}}_{XX} K_{Xx} + \lambda \quad (2)$$

See Williams and Rasmussen [28] for a derivation. We compute Equations 1 and 2 for all $x \in X'$.
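To make the preceding definitions concrete, here is a minimal sketch that computes Equations 1 and 2 naively with a Cholesky factorization, i.e. the $O(n^3 + n^2 m)$ route described in Section 1. The names (`kernel_fn`, `mean_fn`, `noise_var`) are our placeholders, not code from the paper.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_posterior(kernel_fn, X, y, X_test, noise_var, mean_fn=lambda x: 0.0):
    """Naive GP posterior mean and variance (Equations 1 and 2)."""
    n = len(X)
    y = np.asarray(y, dtype=float)
    K_XX = np.array([[kernel_fn(a, b) for b in X] for a in X])   # (K_XX)_ij = k(X_i, X_j)
    m_X = np.array([mean_fn(x) for x in X])
    chol = cho_factor(K_XX + noise_var * np.eye(n))              # factor K_XX + lambda*I once: O(n^3)
    alpha = cho_solve(chol, y - m_X)                             # (K_XX + lambda*I)^{-1} (y - m_X)
    means, variances = [], []
    for x in X_test:                                             # O(n^2) per test location
        K_Xx = np.array([kernel_fn(xi, x) for xi in X])
        means.append(K_Xx @ alpha + mean_fn(x))                                       # Equation 1
        variances.append(kernel_fn(x, x) - K_Xx @ cho_solve(chol, K_Xx) + noise_var)  # Equation 2
    return np.array(means), np.array(variances)
```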
3 Binary tree kernel
We now introduce the binary tree kernel. First, we encode our data points as binary strings. So we have $\mathcal{X} = \mathbb{B}^q$, where $\mathbb{B} = \{0, 1\}$, and $q \in \mathbb{N}$.
If $\mathcal{X} = \mathbb{R}^d$, we must map $\mathbb{R}^d \mapsto \mathbb{B}^q$. First, we rescale all points (training points and test points) to lie within the box $[0,1]^d$. (If we have a stream of test points, and one lands outside of the box $[0,1]^d$, we can either set $K_{xX}$ to $0$ for that point or we rescale and retrain in $O(n \log n)$ time.) Then, for each $x \in [0,1]^d$, for each dimension, we take the binary expansion up to some precision $p$, and for those $d \times p$ bits, we permute them using some fixed permutation. We call this permutation the bit order, and it is the same for all $x \in [0,1]^d$. Note that now $q = dp$. See Figure 3 for an example. We optimize the bit order during training, and we can also form an ensemble of GPs using different bit orders.
Figure 3: Function from $[0,1]^2 \to \mathbb{B}^8$.
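A sketch of this encoding step follows (our own illustration; the function names, the clipping detail, and the identity bit order are assumptions, not the authors' code):

```python
import numpy as np

def to_bits(x, p):
    """Binary expansion of each coordinate of x in [0,1]^d, truncated to p bits per dimension."""
    x = np.clip(np.asarray(x, dtype=float), 0.0, 1.0 - 1e-12)  # keep 1.0 from needing a carry
    bits = []
    for coord in x:
        for _ in range(p):
            coord *= 2.0
            bit = int(coord)                  # next bit of the binary expansion
            bits.append(bit)
            coord -= bit
    return np.array(bits, dtype=np.int8)      # length q = d * p, dimension-major order

def encode(x, p, bit_order):
    """Map a point in [0,1]^d to B^q, permuting the d*p bits by the (learned) bit order."""
    return to_bits(x, p)[bit_order]

# Example with d = 2, p = 4 (so q = 8) and the identity permutation as a placeholder bit order.
print(encode([0.3, 0.8], p=4, bit_order=np.arange(8)))    # [0 1 0 0 1 1 0 0]
```

In practice the bit order would be optimized during training, as described above; the identity permutation here is only a stand-in.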
For $x \in \mathbb{B}^q$, let $x^{\leq i}$ be the first $i$ bits of $x$. $[[\text{expression}]]$ evaluates to 1 if the expression is true, otherwise 0. We now define the kernel:

Definition 1 (Binary Tree Kernel). Given a weight vector $w \in \mathbb{R}^q$, with $w \geq 0$ and $\|w\|_1 = 1$,
$$k_w(x_1, x_2) = \sum_{i=1}^{q} w_i \, [[x_1^{\leq i} = x_2^{\leq i}]].$$
So the more leading bits shared by $x_1$ and $x_2$, the larger the covariance between the function values. Consider, for example, points $x_1$ and $x_4$ from Figure 1, where $x_1$ is (left, left, right), and $x_4$ is (left, right, right); they share only the first leading "bit". We train the weight vector $w$ to maximize the likelihood of the training data.
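A direct transcription of Definition 1 is short enough to state as code. The weights and bit strings below are our own choices, inferred so that the values match Figure 1 ($k(x_1,x_1)=1$, $k(x_1,x_3)=0.8$, $k(x_1,x_4)=0.3$); the paper instead learns $w$ by maximizing the training likelihood.

```python
import numpy as np

def binary_tree_kernel(w, x1, x2):
    """k_w(x1, x2) from Definition 1: the sum of w_i over the shared leading bits of x1 and x2."""
    k = 0.0
    for i, (b1, b2) in enumerate(zip(x1, x2)):
        if b1 != b2:
            break            # once a bit differs, no deeper prefix can match
        k += w[i]
    return k

w = np.array([0.3, 0.5, 0.2])                   # q = 3, w >= 0, ||w||_1 = 1
x1, x3, x4 = [0, 0, 1], [0, 0, 0], [0, 1, 1]    # left = 0, right = 1, as in Figure 1
print(binary_tree_kernel(w, x1, x1))            # ~1.0
print(binary_tree_kernel(w, x1, x3))            # ~0.8 (two shared leading bits)
print(binary_tree_kernel(w, x1, x4))            # ~0.3 (one shared leading bit)
```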
Proposition 1 (Positive Semidefiniteness). For $X \in \mathcal{X}^n$, for $k = k_w$, $K_{XX} \succeq 0$.

Proof. Let $s \in \bigcup_{i=1}^{q} \mathbb{B}^i$ be a binary string, and let $|s|$ be the length of $s$. Let $X_{[s]} \in \mathbb{R}^n$ with $(X_{[s]})_j = [[X_j^{\leq |s|} = s]]$. $X_{[s]} X_{[s]}^\top$ is clearly positive semidefinite. Finally, $K_{XX} = \sum_{i=1}^{q} \sum_{s \in \mathbb{B}^i} w_i X_{[s]} X_{[s]}^\top$, and recall $w_i \geq 0$, so $K_{XX} \succeq 0$.
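The decomposition used in this proof can be written out directly; the following dense sketch (our own, for illustration) builds $K_{XX}$ exactly as the sum $\sum_i \sum_s w_i X_{[s]} X_{[s]}^\top$, and it previews the sparse-rank-one-sum representation of Section 4.

```python
import numpy as np

def kernel_matrix_rank_one_sum(w, X_bits):
    """K_XX = sum_{i=1}^q sum_{s in B^i} w_i X_[s] X_[s]^T (proof of Proposition 1).
    X_bits: (n, q) array of bit strings; w: length-q nonnegative weights summing to 1."""
    n, q = X_bits.shape
    K = np.zeros((n, n))
    for i in range(1, q + 1):
        prefixes = [tuple(row[:i]) for row in X_bits]       # length-i prefix of each point
        for s in set(prefixes):                             # unrealized prefixes give X_[s] = 0
            X_s = np.array([1.0 if pre == s else 0.0 for pre in prefixes])
            K += w[i - 1] * np.outer(X_s, X_s)              # nonnegative-weighted PSD term
    return K
```

On small inputs this agrees entry-by-entry with evaluating $k_w$ pairwise.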
4 Sparse rank one sum representation
In order to do GP regression in $O(n)$ space and $O(n \log n)$ time, we develop a "Sparse Rank One Sum" representation of linear operators (SROS). This was developed separately from the very similar hierarchical matrices [1], which we discuss below. In SROS form, linear transformation of a vector can be done in $O(n)$ time instead of $O(n^2)$. We will store our kernel matrix and inverse kernel matrix in SROS form. The proof of Proposition 1 exemplifies representing a matrix as the sum of sparse rank one matrices. Note that each $X_{[s]}$ is sparse: if $q$ is large, most $X_{[s]}$s are the zero vector.
We now show how to interpret an SROS representation of an $n \times n$ matrix. Let $[n] = \{1, 2, \ldots, n\}$. For $r \in \mathbb{N}$, let $L: [r]^n \times [r]^n \times \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}^{n \times n}$ construct a linear operator from four vectors.
Definition 2 (Linear Operator from Simple SROS Representation). Let $p, p' \in [r]^n$, and let $u, u' \in \mathbb{R}^n$. For $l \in [r]$, let $u^{p=l} \in \mathbb{R}^n$ be the vector where $u^{p=l}_j = u_j [[p_j = l]]$, likewise for $u'$ and $p'$. Then:
$$L(p, p', u, u') \mapsto \sum_{l=1}^{r} u^{p=l} \bigl((u')^{p'=l}\bigr)^\top.$$
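For intuition, Definition 2 can be materialized as a dense matrix on small examples; the sketch below is ours and is only meant to spell out the sum of masked outer products (in practice the dense matrix is never formed).

```python
import numpy as np

def simple_sros_to_dense(p, p_prime, u, u_prime):
    """Dense n x n matrix L(p, p', u, u') = sum_l u^{p=l} ((u')^{p'=l})^T."""
    n = len(u)
    M = np.zeros((n, n))
    for l in np.union1d(p, p_prime):
        u_l = np.where(p == l, u, 0.0)                    # u^{p=l}: u masked to part l of p
        u_prime_l = np.where(p_prime == l, u_prime, 0.0)  # (u')^{p'=l}
        M += np.outer(u_l, u_prime_l)                     # sparse rank-one contribution
    return M

# With p == p', permuting indices so equal labels are adjacent makes M block diagonal.
p = np.array([0, 1, 0, 1, 1])
u = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
M = simple_sros_to_dense(p, p, u, u)
```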
Figure 4: A matrix in standard form constructed from a matrix in SROS form. The large square depicts the matrix $L(p, p', u, u') \in \mathbb{R}^{5 \times 5}$ with elements colored by value. See $u^{p=0}$ for a color legend.
We depict Definition 2 in Figure 4. $p$ and $p'$ represent partitions over $n$ elements: all elements with the same integer value in the vector $p$ belong to the same partition. Note that $r$, the number of parts in the partition, need not exceed $n$, the number of elements being partitioned. If $p = p'$ (which is almost always the case for us) and the elements of $p$, $u$, and $u'$ were shuffled so that all elements in the same partition were next to each other, then $L(p, p', u, u')$ would be block diagonal. Note that $L(p, p', u, u')$ is not necessarily low rank. If $p$ is the finest possible partition, and $p = p'$, $L(p, p', u, u')$ is diagonal. SROS matrices can be thought of as a generalization of two types of matrix that are famously amenable to fast computation: rank one matrices (all points in the same partition) and diagonal matrices (each point in its own partition).
We now extend the definition of $L$ to allow for multiple $p$, $p'$, $u$, and $u'$ vectors.
Definition 3 (Linear Operator from SROS Representation). Let $L: [r]^{n \times q} \times [r]^{n \times q} \times \mathbb{R}^{n \times q} \times \mathbb{R}^{n \times q} \to \mathbb{R}^{n \times n}$. Let $P, P' \in [r]^{n \times q}$, and let $U, U' \in \mathbb{R}^{n \times q}$. Let $P_{:,i}$, $U_{:,i}$, etc. be the $i$th columns of the respective arrays. Then:
$$L(P, P', U, U') \mapsto \sum_{i=1}^{q} L(P_{:,i}, P'_{:,i}, U_{:,i}, U'_{:,i}).$$
Algorithm 1 performs linear transformation of a vector using the SROS representation in $O(nq)$ time.

Algorithm 1 Linear Transformation with SROS Linear Operator. This can be vectorized on a Graphical Processing Unit (GPU), using e.g. torch.Tensor.index_add_ for Line 5 and non-slice indexing for Line 6 [19]. Slight restructuring allows vectorization over $[q]$ as well.

Require: $P, P' \in [r]^{n \times q}$, $U, U' \in \mathbb{R}^{n \times q}$, $x \in \mathbb{R}^n$
Ensure: $y = L(P, P', U, U')\,x$
1: $y \leftarrow 0 \in \mathbb{R}^n$
2: for $i \in [q]$ do   ▷ $O(nq)$ time
3:   $p, p', u, u' \leftarrow P_{:,i}, P'_{:,i}, U_{:,i}, U'_{:,i}$
4:   $z \leftarrow 0 \in \mathbb{R}^r$   ▷ $z_l$ will store the dot product $((u')^{p'=l})^\top x^{p'=l}$
5:   for $j \in [n]$ do $z_{p'_j} \leftarrow z_{p'_j} + u'_j x_j$   ▷ $O(n)$ time
6:   for $j \in [n]$ do $y_j \leftarrow y_j + z_{p_j} u_j$   ▷ $O(n)$ time
7: return $y$
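As a reference, here is a NumPy sketch of Algorithm 1; `np.add.at` plays the role of the `torch.Tensor.index_add_` call mentioned in the caption. This is our illustrative rendering, not the authors' released implementation.

```python
import numpy as np

def sros_matvec(P, P_prime, U, U_prime, x, r):
    """Compute y = L(P, P', U, U') x in O(nq) time, following Algorithm 1."""
    n, q = P.shape
    y = np.zeros(n)
    for i in range(q):                              # one simple SROS component per column
        p, p_prime = P[:, i], P_prime[:, i]
        u, u_prime = U[:, i], U_prime[:, i]
        z = np.zeros(r)
        np.add.at(z, p_prime, u_prime * x)          # Line 5: z[l] = ((u')^{p'=l})^T x
        y += z[p] * u                               # Line 6: scatter each part's dot product back
    return y
```

On small problems the result can be checked against the dense matrix built directly from Definitions 2 and 3.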
We now discuss how to approximately invert a certain kind of symmetric SROS matrix, but our methods could be extended to asymmetric matrices. First, we define a partial ordering over partitions. For two partitions $p, p'$, we say $p' \preceq p$ if $p'$ is finer than or equal to $p$; that is, $p'_j = p'_{j'} \implies p_j = p_{j'}$. Using that partial ordering, a symmetric SROS matrix can be approximately inverted efficiently if for all $1 \leq i, i' \leq q$, $P_{:,i} \preceq P_{:,i'}$ or $P_{:,i'} \preceq P_{:,i}$. As the reader may have recognized, our kernel matrix $K_{XX}$ can be written as an SROS matrix with this property.
We will write symmetric SROS matrices in a slightly more convenient form. All $(u')^{p=l}$ must be a constant times $u^{p=l}$. We will store these constants in an array $C$. Let $L(P, C, U)$ be shorthand for $L(P, P, U, C \odot U)$, where $\odot$ denotes element-wise multiplication. For $L(P, C, U)$ to be symmetric, it must be the case that $P_{ji} = P_{j'i} \implies C_{ji} = C_{j'i}$. Then, all elements of $U$ corresponding to a given $u^{p=l}$ are multiplied by the same constant. We now present an algorithm for calculating $(L(P, C, U) + \lambda I)^{-1}$, for $\lambda \neq 0$, which is an approximate inversion of $L(P, C, U)$. We have not yet analyzed numerical sensitivity for $\lambda \approx 0$, but we conjecture that all floating point numbers involved need to be stored to at least $\log_2(1/\lambda)$ bits. Without loss of generality, let $\lambda = 1$, and note
$$(L(P, C, U) + \lambda I)^{-1} = \lambda^{-1}\bigl(L(P, \lambda^{-1} C, U) + I\bigr)^{-1}.$$
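This rescaling relies on the identity $(A + \lambda I)^{-1} = \lambda^{-1}(\lambda^{-1} A + I)^{-1}$ together with the linearity of $L(P, C, U)$ in $C$; a quick dense sanity check of the identity (ours, for illustration only):

```python
import numpy as np

lam = 0.25
A = np.random.default_rng(0).standard_normal((4, 4))
A = A @ A.T                                        # symmetric PSD stand-in for L(P, C, U)
lhs = np.linalg.inv(A + lam * np.eye(4))
rhs = np.linalg.inv(A / lam + np.eye(4)) / lam     # lambda^{-1} (L(P, lambda^{-1} C, U) + I)^{-1}
assert np.allclose(lhs, rhs)
```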