
binary tree, the kernel between two points depends only on the depth of the deepest common ancestor.
Because very different tree structures are possible for the data, we can easily form an ensemble of
diverse GP regression models. Figure 2 depicts a schematic sample from a binary tree kernel. Note
how the posterior mean is piecewise flat, but the pieces can be small.
Figure 1: A binary tree kernel with four data points. In this example, $k(x_1, x_1) = 1$, $k(x_1, x_2) = 0$, $k(x_1, x_3) = 0.8$, and $k(x_1, x_4) = 0.3$.
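As a rough illustration of this dependence on the deepest common ancestor, the sketch below assigns points in [0, 1) to leaves via the first bits of their binary expansion (as in Figure 2) and scores a pair by the length of their shared bit prefix. The uniform per-level weighting here is an assumption made only for illustration; as the example values in Figure 1 show, the depths can be weighted non-uniformly.

```python
def leaf_bits(x, depth=4):
    """Map x in [0, 1) to a leaf of a depth-`depth` binary tree,
    using the first `depth` bits of x's binary expansion."""
    bits = []
    for _ in range(depth):
        x *= 2.0
        bit = int(x)
        bits.append(bit)
        x -= bit
    return bits

def binary_tree_kernel(x1, x2, depth=4):
    """Covariance depends only on the depth of the deepest common
    ancestor of the two leaves: here, the shared-prefix length divided
    by the tree depth (an illustrative, uniform per-level weighting)."""
    b1, b2 = leaf_bits(x1, depth), leaf_bits(x2, depth)
    common = 0
    for u, v in zip(b1, b2):
        if u != v:
            break
        common += 1
    return common / depth

print(binary_tree_kernel(0.26, 0.27))  # same leaf -> 1.0
print(binary_tree_kernel(0.26, 0.80))  # paths diverge at the root -> 0.0
```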
On a standard suite of benchmark regression tasks [25], we show that our kernel usually achieves better negative log likelihood (NLL) than state-of-the-art sparse methods and conjugate-gradient-based “exact” methods, at lower computational cost in the big-data regime. There are not many limitations to using our kernel. The main limitation is that other kernels sometimes capture the relationships in the data better. We do not have a good procedure for understanding when data has more Matérn character or more binary tree character (except through running both and comparing training NLL). But given that the binary tree kernel usually outperforms the Matérn, we'll tentatively say the best first guess is that a new dataset has more binary tree character. One concrete limitation for some applications, like Bayesian Optimization, is that the posterior mean is piecewise-flat, so gradient-based heuristics for finding extrema would not work.
In contexts where a piecewise-flat posterior mean is suitable, we struggle to see when one would prefer a sparse or sparse variational GP to a binary tree kernel. The most thorough approach would be to run both and see which has a better training NLL, but if one had to pick, the binary tree GP seems to perform better while being comparably fast. If minimizing mean-squared error is the objective, the Matérn kernel seems to do slightly better than the binary tree. If the dataset is small and one needs very fast predictions, a Matérn kernel may be the best option. But otherwise, if one cares about well-calibrated predictions, the initial results we present tentatively suggest using a binary tree kernel over the widely-used Matérn kernel.
The log-linear time and linear space complexity of the binary tree GP, with performance exceeding that of a “normal” kernel, could profoundly expand the viability of GP regression to larger datasets.
2 Preliminaries
Our problem setting is regression. Given a function $f\colon \mathcal{X} \to \mathbb{R}$, for some arbitrary set $\mathcal{X}$, we would like to predict $f(x)$ for various $x \in \mathcal{X}$. What we have are observations of $f(x)$ for various (other) $x \in \mathcal{X}$. Let $X \in \mathcal{X}^n$ be an $n$-tuple of elements of $\mathcal{X}$, and let $y \in \mathbb{R}^n$ be an $n$-tuple of real numbers, such that $y_i \sim f(X_i) + \mathcal{N}(0, \lambda)$, for $\lambda \in \mathbb{R}_{\geq 0}$. $X$ and $y$ comprise our training data.
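For concreteness, a minimal sketch of training data generated according to this noise model (the choices of $\mathcal{X} = [0, 1]$, $f$, $n$, and $\lambda$ below are arbitrary and only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 100, 0.01                                # noise variance lambda >= 0

f = lambda x: np.sin(2 * np.pi * x)               # some unknown ground-truth function
X = rng.uniform(0.0, 1.0, size=n)                 # n training locations in [0, 1]
y = f(X) + rng.normal(0.0, np.sqrt(lam), size=n)  # y_i ~ f(X_i) + N(0, lambda)
```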
With an $m$-tuple of test locations $X' \in \mathcal{X}^m$, let $y' \in \mathbb{R}^m$, with $y'_i = f(X'_i)$. $y'$ is the ground truth for the target locations. Given training data, we would like to produce a distribution over $\mathbb{R}$ for each target location $X'_i$, such that it assigns high marginal probability to the unknown $y'_i$. Alternatively, we sometimes would like to produce point estimates $\hat{y}'_i$ in order to minimize the squared error $(\hat{y}'_i - y'_i)^2$.
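These two objectives can be written out directly. Assuming the per-location predictive distributions are Gaussian with means and variances produced by the model, a minimal sketch of the (mean) NLL and squared-error metrics:

```python
import numpy as np

def gaussian_nll(y_true, pred_mean, pred_var):
    """Mean negative log likelihood of the targets y'_i under the
    predicted Gaussian marginals N(pred_mean_i, pred_var_i)."""
    return np.mean(0.5 * (np.log(2 * np.pi * pred_var)
                          + (y_true - pred_mean) ** 2 / pred_var))

def mean_squared_error(y_true, pred_mean):
    """Mean squared error of the point estimates \hat{y}'_i."""
    return np.mean((y_true - pred_mean) ** 2)
```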
A GP prior over functions is defined by a mean function $m\colon \mathcal{X} \to \mathbb{R}$, and a kernel $k\colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. The expected function value at a point $x$ is defined to be $m(x)$, and the covariance of the function values at two points $x_1$ and $x_2$ is defined to be $k(x_1, x_2)$. Let $K_{XX} \in \mathbb{R}^{n \times n}$ be the matrix of kernel
Figure 2: A schematic diagram of a function sampled from a binary tree kernel. The function is
over the interval [0, 1], and points on the interval are placed onto the leaves of a depth-4 binary
tree according to the first 4 bits of their binary expansion. The sampled function is in black. Purple
represents the sample if the tree had depth 3, green depth 2, orange depth 1, and red depth 0.
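To make these definitions concrete, here is a minimal sketch of building the matrix of kernel evaluations and computing the standard zero-mean GP posterior at the test locations. This is the textbook $O(n^3)$ conjugate formula, not the paper's log-linear-time algorithm, and the zero mean function is an assumption made here for brevity:

```python
import numpy as np

def kernel_matrix(kernel, A, B):
    """Matrix whose (i, j) entry is k(A_i, B_j)."""
    return np.array([[kernel(a, b) for b in B] for a in A])

def gp_posterior(kernel, X, y, X_test, lam):
    """Posterior mean and marginal variance of f at X_test under a
    zero-mean GP prior with kernel `kernel` and noise variance `lam`."""
    K_XX = kernel_matrix(kernel, X, X) + lam * np.eye(len(X))
    K_sX = kernel_matrix(kernel, X_test, X)
    K_ss = kernel_matrix(kernel, X_test, X_test)
    alpha = np.linalg.solve(K_XX, y)
    mean = K_sX @ alpha
    cov = K_ss - K_sX @ np.linalg.solve(K_XX, K_sX.T)
    return mean, np.diag(cov)

# e.g., with the illustrative kernel sketched earlier:
# mean, var = gp_posterior(binary_tree_kernel, X, y, X_test, lam)
```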