gains compared to other approaches to approximate GPs. We demonstrate such performance gains on regression tasks
on standard UCI benchmarks [14].
1.1 Related work
Local learning algorithms were introduced by Bottou and Vapnik [3], with the main objective of estimating the optimal decision function for each individual test point. Examples of local learning algorithms include the well-known K-Nearest Neighbor regression [15] and local polynomial regression [16]. These methods provide simple means for solving regression problems in cases where the training data are nonstationary or their size is prohibitively large for building a global model. However, neither of these methods provides a way to quantify uncertainty in predictions, which is a highly desirable feature in cost-sensitive applications.
Gaussian Process Regression (GPR) [17] is a popular nonparametric regression method based on Bayesian principles which provides uncertainty estimates for its predictions. Like other kernel methods (e.g., SVMs and kernel ridge regression), GPR is a global method, meaning that it takes the whole dataset into account at prediction time. GPR thus inherits the computational complexity of global kernel methods, which is prohibitive for large datasets. Among the large class of scalable approximations for GPR, successful ones are based on Random Fourier Features [11] and on sparsification of the Gram matrix induced by the kernel [17].
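For concreteness, the following minimal numpy sketch of exact GPR prediction (assuming an RBF kernel and a fixed noise variance, neither of which is specified here) makes the global nature of the method explicit: every prediction involves the full n x n Gram matrix and an O(n^3) factorization.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    """Squared-exponential (RBF) kernel between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_predict(X_train, y_train, X_test, kernel=rbf, noise_var=1e-2):
    """Exact GPR predictive mean and variance.

    The whole training set enters every prediction through the n x n
    Gram matrix, so the cost is O(n^3) time and O(n^2) memory -- the
    bottleneck that the approximations discussed below try to remove.
    """
    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_star = kernel(X_test, X_train)                  # m x n cross-covariances
    L = np.linalg.cholesky(K)                         # O(n^3) factorization
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_star @ alpha
    v = np.linalg.solve(L, K_star.T)
    var = kernel(X_test, X_test).diagonal() - np.sum(v ** 2, axis=0)
    return mean, var
```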
The random feature approximation of the kernel proposed in [11] is based on Bochner's theorem and allows representing the kernel function as a dot product of (possibly infinite-dimensional) feature maps applied to the input data. In practice, the infinite feature maps are replaced by a finite Monte Carlo approximation. The disadvantage of this approach is that a specific random feature mapping must be constructed for each type of kernel. While random feature approximations are known for popular kernels such as the RBF [11] and polynomial [18] kernels, there is no straightforward way to apply this method to arbitrary kernels.
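As an illustration, the sketch below approximates the RBF kernel with Monte Carlo random Fourier features; the feature dimension and lengthscale are illustrative choices, not values used in this work.

```python
import numpy as np

def rff_features(X, num_features=256, lengthscale=1.0, seed=None):
    """Random Fourier features approximating the RBF kernel.

    By Bochner's theorem, k(x, y) = E_w[cos(w^T (x - y))] with w drawn
    from the kernel's spectral density (a Gaussian for the RBF kernel),
    so z(x)^T z(y) is a Monte Carlo approximation of k(x, y).
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / lengthscale, size=(d, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

# Z @ Z.T approximates the RBF Gram matrix exp(-||x - y||^2 / (2 * lengthscale^2)).
X = np.random.randn(500, 3)
Z = rff_features(X, num_features=2000)
approx_K = Z @ Z.T
```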
The Gram matrix sparsification approach is based on the idea of introducing so-called inducing points to approximate the full Gram matrix. One of the most popular methods in this family is the Nyström approximation [17]. The main drawback of this approach is that a low number of inducing points might lead to a poor approximation of the original model, which affects predictive performance. An important advancement within this family of approaches, which provides a scalable variational formulation, was proposed in [19].
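A minimal sketch of the Nyström idea follows, assuming the inducing points are simply a random subset of the training inputs (more sophisticated selection and variational schemes exist, e.g. [19]).

```python
import numpy as np

def nystrom_gram(X, kernel, num_inducing=50, seed=None):
    """Nyström approximation of the Gram matrix via m inducing points.

    K ~= K_nm K_mm^{-1} K_mn, where the m inducing points are drawn at
    random from the training inputs; the cost of working with the
    approximation drops from O(n^3) to O(n m^2).
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=num_inducing, replace=False)
    Z = X[idx]                                        # inducing inputs
    K_mm = kernel(Z, Z) + 1e-8 * np.eye(num_inducing) # jitter for stability
    K_nm = kernel(X, Z)
    return K_nm @ np.linalg.solve(K_mm, K_nm.T)
```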
While providing good performance and scalability for large datasets, these approaches still require some design choices for the kernel. For stationary kernels, they assume that the same kernel is suitable for all regions of the input space; if the data are nonstationary, this assumption may harm predictive performance. The literature offers a wide range of proposals to address kernel design by incorporating ideas from deep learning [13, 20].
Recently, partitioning strategies have also gained some attention. The main idea is to divide the input space into regions where local estimators are defined [21, 22, 23, 24]. In partition-based methods, the main challenge is to define an effective partitioning of the space.
There are several approaches that use the idea of local learning for training GP models. The method proposed in [25] and extended in [26] mostly focuses on Bayesian parametric linear regression. The methods in these papers build an ensemble of local models centered at several fixed points, where each training point is weighted according to its distance from the center of the model. Predictions are computed as a weighted sum of the local models. The authors claim that their approach extends to GPR, but in that case each local model considers the full training set. This means that these methods use localization to address nonstationarity but scale poorly to large datasets. The method proposed in [27] builds local GP models that use only subsets of the training data, but it lacks a mechanism that assigns importance weights to the training points of each local model according to their distance from the center of the model. As a result, such a model can make overconfident predictions for points that lie far away from the centers of the local models. To obtain fast approximate predictions at a target point, the authors of [28] propose a forward step-wise variable selection procedure to find the optimal sub-design.
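For intuition, here is a minimal sketch of the distance-weighted ensemble idea used in [25, 26]; the `local_models` callables and the Gaussian weighting function are hypothetical placeholders, not the exact construction of those papers.

```python
import numpy as np

def weighted_local_prediction(x_star, centers, local_models, lengthscale=1.0):
    """Combine local models' predictions with distance-based weights.

    Each local model contributes according to how close the test point
    lies to that model's center; the weights are normalized to sum to one.
    `local_models[i](x_star)` is assumed to return that model's prediction.
    """
    d2 = ((centers - x_star) ** 2).sum(axis=1)
    w = np.exp(-0.5 * d2 / lengthscale ** 2)
    w = w / w.sum()
    preds = np.array([m(x_star) for m in local_models])
    return np.dot(w, preds)
```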
The paper is organized as follows: Section 2 introduces GPs and the idea of localization, along with a discussion of the connection between our localized GPs and local kernel ridge regression. The experimental campaign in Section 3 reports results on a variety of regression benchmarks, and Section 4 concludes the paper.