
Imposing a unit covariance. One simple way to tackle this issue is to remove $\Sigma$ from the
learnable parameters $\xi$, i.e., to fix it to the identity $\Sigma = I_P$. In this case, $\xi = (\theta_0, \mu)$. This
simplification is computationally convenient, but it comes at the cost of model expressivity, as we lose a degree of freedom in
how we can optimize our learned prior distribution $\tilde{p}_\xi(f)$. In particular, we can no longer choose a
prior over the weights of our model that captures correlations between elements of the feature map.
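Concretely, with $\Sigma = I_P$ the weight-space term of the prior predictive covariance (equation 4) simplifies to
$$J(\theta_0, X_i)\, \Sigma\, J(\theta_0, X_i)^\top = J(\theta_0, X_i)\, J(\theta_0, X_i)^\top,$$
so no $P \times P$ covariance ever has to be stored or optimized.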
Learning a low-dimensional representation of the covariance. An alternative is to learn a low-
rank representation of $\Sigma$, allowing for a learnable weight-space prior covariance that can encode
correlations. Specifically, we consider a covariance of the form $\Sigma = Q^\top \mathrm{diag}(s^2)\, Q$, where $Q$ is a
fixed projection matrix onto an $s$-dimensional subspace of $\mathbb{R}^P$, while $s^2$ is learnable. In this case, the
parameters that are learned are $\xi = (\theta_0, \mu, s)$. We define $S := \mathrm{diag}(s^2)$. The computation of the
covariance of the prior predictive (equation 4) can then be broken down into two steps:
$$A := J(\theta_0, X_i)\, Q^\top,$$
$$J(\theta_0, X_i)\, \Sigma\, J(\theta_0, X_i)^\top = A S A^\top,$$
which requires a memory footprint of $O(P(s + N_y K))$ if we include the storage of the Jacobian.
Because $N_y K \ll P$ in typical deep learning contexts, it suffices that $s \ll P$ for this new
representation of the covariance to remain tractable.
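As an illustration, here is a minimal JAX-style sketch of this two-step computation; the array names (jac for $J(\theta_0, X_i)$, Q, s) and shapes are assumptions made for the example, not the exact implementation.

import jax.numpy as jnp

def low_rank_prior_cov(jac, Q, s):
    # jac: (Ny*K, P)  Jacobian of the network outputs at theta_0
    # Q:   (s_dim, P) fixed projection onto an s_dim-dimensional subspace of R^P
    # s:   (s_dim,)   learnable scales, with S = diag(s**2)
    A = jac @ Q.T            # step 1: project the features, shape (Ny*K, s_dim)
    return (A * s**2) @ A.T  # step 2: A S A^T, shape (Ny*K, Ny*K)

Only matrices with at most $P \cdot \max(s, N_y K)$ entries are materialized, matching the stated memory footprint.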
A trade-off between feature-map expressiveness and learning a rich prior over the weights.
Note that even though a low-dimensional representation of $\Sigma$ enriches the prior distribution over the
weights, it also restrains the expressiveness of the feature map in the kernel by projecting the $P$-
dimensional features $J(\theta_0, X)$ onto a subspace of size $s \ll P$ via $Q$. This presents a trade-off: we
can use the full feature map, but limit the weight-space prior covariance to the identity matrix
by keeping $\Sigma = I$ (case UNLIMITD-I). Alternatively, we can learn a low-rank representation
of $\Sigma$ by randomly choosing $s$ orthogonal directions in $\mathbb{R}^P$, with the risk that these directions limit the
expressiveness of the feature map if they are not relevant to the problem under consideration
(case UNLIMITD-R). As a compromise between these two cases, we can choose the projection
matrix more carefully and project onto the most impactful subspace of the full feature map; in
this way, we reap the benefits of a tunable prior covariance while minimizing the useful features
that the projection drops. To select this subspace, we construct the projection matrix from the
top $s$ eigenvectors of the Fisher information matrix (FIM) evaluated on the training dataset $D$ (case
UNLIMITD-F). Recent work has shown that the FIM of deep neural networks tends to exhibit rapid
spectral decay (Sharma et al., 2021), which suggests that keeping only a few of its top eigenvectors
is enough to encode an expressive, task-tailored prior. See Appendix A.1 for more details.
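One possible way to realize this construction is sketched below, under the assumption that the FIM is approximated by $J_D^\top J_D$ for the Jacobians stacked over $D$; in that case its top eigenvectors coincide with the top right-singular vectors of the stacked Jacobian, so a thin SVD suffices. The names (jac_train, s_dim) are illustrative, not the authors' implementation.

import jax.numpy as jnp

def fim_projection(jac_train, s_dim):
    # jac_train: (M, P) Jacobians stacked over the training dataset D
    # If the FIM is approximated as F = jac_train.T @ jac_train, the eigenvectors
    # of F are the right-singular vectors of jac_train, so a thin SVD avoids
    # ever forming the P x P matrix F.
    _, _, vt = jnp.linalg.svd(jac_train, full_matrices=False)
    return vt[:s_dim]  # (s_dim, P): a candidate projection matrix Q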
4.1.2 GENERALIZING THE STRUCTURE TO A MIXTURE OF GAUSSIANS
When learning on multiple clusters of tasks, $p(f)$ can become multimodal, and thus can no longer be
accurately described by a single GP. Instead, we can capture this multimodality by structuring $\tilde{p}_\xi(f)$
as a mixture of Gaussian processes.
Building a more general structure. We assume that at train time a task $T_i$ comes from any of the
clusters $\{C_j\}_{j=1}^{\alpha}$ with equal probability. Thus, we choose to construct $\tilde{p}_\xi(f)$ as an equal-weighted
mixture of $\alpha$ Gaussian processes.
For each element of the mixture, the structure is similar to the single-cluster case, with the pa-
rameters of the $j$-th cluster's weight-space prior given by $(\mu_j, \Sigma_j)$. We choose to share both the
projection matrix $Q$ and the linearization point $\theta_0$ (and hence the feature map $\phi(\cdot) = J(\theta_0, \cdot)$)
across the clusters. This yields improved computational efficiency, as we can compute
the projected features once, simultaneously, for all clusters. The resulting parameters are $\xi_\alpha =
(\theta_0, Q, (\mu_1, s_1), \ldots, (\mu_\alpha, s_\alpha))$.
This can be viewed as a mixture of linear regression models, with a common feature map but sepa-
rate, independent prior distributions over the weights of each cluster. These separate distributions
are encoded using the low-dimensional representations $S_j$ of each $\Sigma_j$. Notice that this is a gener-
alization of the single-cluster case: when $\alpha = 1$, $\tilde{p}_\xi(f)$ becomes a Gaussian and $\xi_\alpha = \xi$.²
²In theory, it is possible to drop $Q$ and extend the identity-covariance case to the multi-cluster setting;
however, this leads to each cluster having an identical covariance function, and is thus not effective at modeling
heterogeneous behaviors among clusters.
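To make the shared-feature structure concrete, a hypothetical sketch of the per-cluster prior predictive covariances is given below, reusing the projected features across all $\alpha$ clusters; the container of per-cluster scales and all names are assumptions made for illustration.

import jax.numpy as jnp

def mixture_prior_covs(jac, Q, scales):
    # jac:    (Ny*K, P)       Jacobian at the shared linearization point theta_0
    # Q:      (s_dim, P)      projection matrix shared across all clusters
    # scales: (alpha, s_dim)  per-cluster scales s_j, with S_j = diag(s_j**2)
    A = jac @ Q.T  # projected features, computed once for all clusters
    return jnp.stack([(A * s**2) @ A.T for s in scales])  # (alpha, Ny*K, Ny*K)

Under the equal-weighted mixture, the prior predictive density of a task is then the average of the $\alpha$ Gaussian densities built from these covariances and the corresponding cluster means.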