kernel can be interpreted as a similarity measure between two elements of the input space, with
the GP assigning higher correlations to function values whose inputs are more similar according to
the kernel. Furthermore, the kernel governs the main assumptions on the modeled function such as
smoothness, periodicity or long-range correlations and therefore provides the inductive bias of the GP. An important property of GPs is that they are not restricted to Euclidean input spaces $\mathcal{X} \subset \mathbb{R}^d$, but can also be defined on highly structured spaces like trees and graphs, a property we will later use to define a GP over GPs.
While our method might also be used for model selection in classification, we consider Gaussian process regression from now on. For regression, a dataset $\mathcal{D} = (X, \mathbf{y})$ with $X = \{x_1, \dots, x_N\} \subset \mathcal{X}$ and $\mathbf{y} = (y_1, \dots, y_N)^\top \in \mathbb{R}^N$ is given, where we suppose that $f \sim \mathcal{GP}(\mu(\cdot), k(\cdot,\cdot))$ and $y_i = f(x_i) + \epsilon_i$ with $\epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$. Given the observed data $\mathcal{D}$, the posterior distribution $f \mid \mathcal{D}$ is again a GP with mean and covariance functions
$$\mu_{\mathcal{D}}(x) = \mu(x) + \mathbf{k}(x)^\top (K + \sigma^2 I)^{-1} (\mathbf{y} - \mu(X)),$$
$$k_{\mathcal{D}}(x, y) = k(x, y) - \mathbf{k}(x)^\top (K + \sigma^2 I)^{-1} \mathbf{k}(y)$$
with $K = [k(x_m, x_l)]_{m,l=1}^{N}$ and $\mathbf{k}(x) = [k(x, x_1), \dots, k(x, x_N)]^\top$ (see [13]). Probabilistic predictions can be made via the resulting predictive distribution $p(f_* \mid x_*, \mathcal{D}) = \mathcal{N}(\mu_{\mathcal{D}}(x_*), k_{\mathcal{D}}(x_*, x_*))$.
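To make these posterior formulas concrete, the following NumPy sketch evaluates $\mu_{\mathcal{D}}$ and $k_{\mathcal{D}}$ under a zero prior mean; the squared exponential kernel, its hyperparameter values and the toy data are illustrative assumptions and not part of our method.

import numpy as np

def se_kernel(A, B, lengthscale=1.0, variance=1.0):
    # Squared exponential kernel matrix k(A, B) (illustrative choice of kernel)
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X, y, X_star, kernel, noise_var=0.1):
    # Posterior mean and covariance at test points X_star (zero prior mean)
    K = kernel(X, X)                      # K = [k(x_m, x_l)]
    k_star = kernel(X, X_star)            # columns are k(x_*)
    A = K + noise_var * np.eye(len(X))    # K + sigma^2 I
    alpha = np.linalg.solve(A, y)         # (K + sigma^2 I)^{-1} y
    mu = k_star.T @ alpha                 # mu_D(x_*)
    cov = kernel(X_star, X_star) - k_star.T @ np.linalg.solve(A, k_star)  # k_D(x_*, x_*)
    return mu, cov                        # predictive distribution N(mu, cov)

# toy usage
X = np.linspace(0, 5, 20)[:, None]
y = np.sin(X[:, 0]) + 0.1 * np.random.randn(20)
mu, cov = gp_posterior(X, y, np.array([[2.5]]), se_kernel)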
2.2 Model selection for GPs
Typically, the kernel $k_\theta$ comes with a set of parameters $\theta$ that can be learned via maximization of the marginal likelihood $p(\mathbf{y} \mid X, \theta, \sigma^2) = \mathcal{N}(\mathbf{y}; \mu(X), k_\theta(X, X) + \sigma^2 I)$ or via maximum a posteriori (MAP) estimation, in case the parameters $\theta$ are equipped with a prior $p(\theta)$. This procedure is
sometimes also called model selection, as one selects the hyperparameters of the kernel given the data
(see [
13
]). However, we consider selecting the structural form of the kernel itself. The structural form
of the kernel determines the statistical hypothesis that is assumed to be true for the data-generating
process. Intuitively, the kernel is similar to the architecture in deep neural networks, which induces
an inductive bias.
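As an illustration of this hyperparameter-level model selection, the sketch below (reusing se_kernel and the toy data from the previous sketch) maximizes the log marginal likelihood with respect to the SE lengthscale, signal variance and noise variance; adding $\log p(\theta)$ to the objective would turn this into MAP estimation. The optimizer and the log-space parameterization are assumptions made for the example.

from scipy.optimize import minimize

def log_marginal_likelihood(log_params, X, y):
    # log N(y; 0, k_theta(X, X) + sigma^2 I) for the SE kernel (zero mean)
    lengthscale, variance, noise = np.exp(log_params)   # positivity via log-space
    K = se_kernel(X, X, lengthscale, variance) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * len(X) * np.log(2 * np.pi))

# type-II maximum likelihood: maximize the marginal likelihood w.r.t. (theta, sigma^2)
res = minimize(lambda p: -log_marginal_likelihood(p, X, y), x0=np.zeros(3))
lengthscale, variance, noise_var = np.exp(res.x)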
Our goal is to do model selection over a discrete, possibly infinite space of kernels $\mathcal{K} := \{k_1, k_2, \dots\}$. As each kernel comes with its own parameters, we are actually dealing with a space of kernel families. Thus, when mentioning a kernel $k$ we associate it with its whole family over parameters $\{k_\theta \mid \theta \in \Theta\}$.
Once a kernel is selected, predictions are done with learned kernel parameters (which are usually a by-product of calculating the model selection criterion). The parameters $\theta$ are potentially equipped with a prior $p(\theta)$, depending on the selection criterion. As mean function, we always utilize the zero mean function $\mu(x) := 0$ in case $\mathbf{y}$ is centered and a constant mean function otherwise. This is a common choice in GP regression. Given some model selection criterion $g: \mathcal{K} \to \mathbb{R}$, our task is solving $k^* = \arg\max_{k \in \mathcal{K}} g(k \mid \mathcal{D})$. While our method is not restricted to a specific model selection criterion, we focus on the model evidence $p(\mathbf{y} \mid X, k)$, which is a well-known selection criterion for probabilistic models (see [9, 14, 8]). Given a prior $p(\theta)$ on the kernel parameters and a prior $p(\sigma^2)$ on the likelihood variance, the log model evidence of the marginalized GP is given as
$$g(k \mid \mathcal{D}) = \log p(\mathbf{y} \mid X, k) = \log \int p(\mathbf{y} \mid X, \theta, \sigma^2, k)\, p(\sigma^2)\, p(\theta \mid k)\, d\theta\, d\sigma^2.$$
This quantity can be approximated via a Laplace approximation of $p(\theta, \sigma^2 \mid \mathcal{D})$ (see Appendix A for details). Computing this approximation includes performing a MAP estimation of the GP parameters. Thus, once the log evidence has been computed, learned kernel hyperparameters $\theta_{\text{MAP}}$ are automatically provided. Performing the MAP estimation scales cubically in the dataset size $N$, which renders model selection for GPs computationally intensive.
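The following Python sketch indicates how such a Laplace-based evidence estimate could drive the search for $k^*$. It is a generic sketch, not our implementation: the dictionary candidate_joints is hypothetical, the unconstrained parameterization is assumed, and BFGS's inverse-Hessian estimate stands in for the exact Hessian at the MAP; see Appendix A for the precise approximation used in our method.

import numpy as np
from scipy.optimize import minimize

def log_evidence_laplace(neg_log_joint, init):
    # Laplace approximation of log p(y | X, k):
    #   log p(y|X,k) ~ -neg_log_joint(theta_MAP) + (d/2) log(2*pi) - 0.5 log|H|,
    # where H is the Hessian of the negative log joint at the MAP.
    # neg_log_joint(params) returns -log[ p(y|X,theta,sigma^2,k) p(theta|k) p(sigma^2) ].
    res = minimize(neg_log_joint, init, method="BFGS")   # MAP estimation, cubic in N per step
    d = len(res.x)
    # crude stand-in: use BFGS's inverse-Hessian estimate instead of the exact Hessian
    sign, logdet_hess_inv = np.linalg.slogdet(res.hess_inv)
    log_evidence = -res.fun + 0.5 * d * np.log(2 * np.pi) + 0.5 * logdet_hess_inv
    return log_evidence, res.x            # res.x contains theta_MAP as a by-product

# model selection over a (finite slice of the) kernel space K;
# candidate_joints: hypothetical dict mapping kernel names to neg_log_joint callables
scores = {name: log_evidence_laplace(nlj, init=np.zeros(3))[0]
          for name, nlj in candidate_joints.items()}
k_star = max(scores, key=scores.get)      # k* = argmax_{k in K} g(k | D)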
2.3 Kernel Grammar
The kernel grammar introduced by [3] specifies a highly general search space over kernels. The grammar is based on the observation that kernels are closed under addition and multiplication, meaning that for kernels $k_1(x, x')$ and $k_2(x, x')$, both $k_1(x, x') + k_2(x, x')$ and $k_1(x, x') \times k_2(x, x')$ are again kernels.
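As a brief illustration of this closure property, the following NumPy sketch composes standard textbook forms of the SE, PER and LIN base kernels into a sum and a product kernel; the hyperparameter values are arbitrary and the snippet is not part of the grammar implementation.

import numpy as np

def se(x1, x2, lengthscale=1.0):
    # squared exponential base kernel SE (standard form)
    return np.exp(-0.5 * (x1 - x2) ** 2 / lengthscale ** 2)

def per(x1, x2, period=1.0, lengthscale=1.0):
    # periodic base kernel PER (standard form)
    return np.exp(-2.0 * np.sin(np.pi * np.abs(x1 - x2) / period) ** 2 / lengthscale ** 2)

def lin(x1, x2, variance=1.0):
    # linear base kernel LIN
    return variance * x1 * x2

# composite kernels from the grammar, e.g. SE + PER or SE x LIN:
k_sum = lambda x1, x2: se(x1, x2) + per(x1, x2)    # locally smooth plus periodic structure
k_prod = lambda x1, x2: se(x1, x2) * lin(x1, x2)   # linearly growing amplitude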
Given some base kernels, such as the squared exponential kernel $\mathrm{SE}$, the linear kernel $\mathrm{LIN}$, the periodic kernel $\mathrm{PER}$ or the rational quadratic kernel $\mathrm{RQ}$, different statistical hypotheses can be