LOCALLY SMOOTHED GAUSSIAN PROCESS REGRESSION
Davit Gogolashvili, Bogdan Kozyrskiy, Maurizio Filippone
Department of Data Science, EURECOM
450 Route des Chappes, 06410 Biot, France
{davit.gogolashvili, bogdan.kozyrskiy, maurizio.filippone}@eurecom.fr
ABSTRACT
We develop a novel framework to accelerate Gaussian process regression (GPR). In particular, we
consider localization kernels at each data point to down-weight the contributions from other data
points that are far away, and we derive the GPR model stemming from the application of such
localization operation. Through a set of experiments, we demonstrate the competitive performance of
the proposed approach compared to full GPR, other localized models, and deep Gaussian processes.
Crucially, this performance is obtained with considerable speedups compared to standard global
GPR, due to the sparsification of the Gram matrix induced by the localization operation.
Keywords: Gaussian processes · Kernel smoothing · Local regression
1 Introduction
Function estimation is a fundamental problem in Machine Learning. In supervised learning tasks applied to a data set
composed of observed input data and labels, the goal of function estimation is to establish a mapping between these
two groups of observed quantities. Function estimation can be approached in various ways, and we can broadly divide
algorithms into two categories: global and local. Examples of global algorithms are Neural Networks [1] and kernel
machines [2], which impose a functional form yielding a global representation of the function. The functional form is
parameterized by a set of parameters which are optimized or inferred based on all the available data. The estimated
model can later be used to query the function at any input points of interest. In local algorithms such as K-Nearest
Neighbors (KNN), instead, the target point is fixed and the corresponding value of the function is estimated based on
the closest data available.
Obviously, any global algorithm can be made local by training it only on the few training points located in the vicinity
of the target test point. While it may seem that the idea of localizing global algorithms is not a very profound one,
empirical evidence shows that localization could improve the performance of the best global models [3]. The idea of
localization was therefore applied to global models such as SVMs [4, 5]. In addition to performance gains, by operating
on smaller sets of data points, these local approaches enjoy computational advantages, which are particularly attractive
for kernel machines, for which scalability with the number of data points is generally an issue [6, 7, 8].
In this work, we develop novel ideas to implement a localization of Gaussian processes (GPs) in order to obtain
performance gains, as well as computational ones. GPs are great candidates to benefit from computational speedups
given that a naïve implementation requires expensive algebraic computations with the covariance matrix; denoting by
n the number of input data, such operations cost O(n^3) operations and require storing O(n^2) elements, hindering the
applicability of GPs to data sets beyond a few thousand data points [9]. Another issue with GPs is how to choose a
suitable kernel for the problem at hand so as to avoid problems of model misspecification. Both of these issues have
been addressed in various ways, by proposing scalable approximations based on inducing points [10] and random
features [11, 12], and by composing GPs to obtain a rich and flexible class of equivalent kernels [13].
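As an illustration of this cost (not part of the proposed method), the sketch below shows exact GPR prediction in NumPy with an RBF kernel; the kernel choice, the noise level, and the function names (rbf_kernel, gp_predict) are placeholders. The Cholesky factorization of the n x n Gram matrix is the O(n^3) step, and storing the matrix is the O(n^2) memory cost.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential (RBF) kernel matrix between the rows of A and B."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_predict(X, y, X_star, noise=1e-2):
    """Exact GP posterior mean and variance at the test inputs X_star."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))      # n x n Gram matrix: O(n^2) memory
    L = np.linalg.cholesky(K)                          # O(n^3) factorization
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    K_star = rbf_kernel(X_star, X)                     # cross-covariances
    mean = K_star @ alpha
    v = np.linalg.solve(L, K_star.T)
    var = np.diag(rbf_kernel(X_star, X_star)) - np.sum(v**2, axis=0)
    return mean, var
```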
In this work we explore an alternative way to address scalability and kernel design issues by localizing GPs. In particular,
we show how the localization operation leads to a particular form for the localized GP, and what the effect is on the
kernel of this model. Furthermore, the localization makes it apparent how to implement the model with considerable
gains compared to other approaches to approximate GPs. We demonstrate such performance gains on regression tasks
on standard UCI benchmarks [14].
1.1 Related work
Local learning algorithms were introduced by Bottou and Vapnik [3], with the main objective of estimating the optimal
decision function for each single test point. Examples of local learning algorithms include the well-known K-Nearest
Neighbor regression [15] and local polynomial regression [16]. These methods provide simple means for solving
regression problems for the cases where training data are nonstationary or their size is prohibitively large for building
a global model. However, neither of these methods provides ways to quantify uncertainty in predictions, which is a
highly desirable feature in cost-sensitive applications.
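For concreteness, a minimal sketch of two such local estimators is given below: K-Nearest Neighbor regression and Nadaraya-Watson kernel smoothing. The Gaussian weighting, the bandwidth, and the neighborhood size are illustrative choices; neither sketch produces the uncertainty estimates discussed next.

```python
import numpy as np

def knn_regression(X, y, x_star, k=5):
    """K-Nearest Neighbor regression: the average label of the k closest points."""
    idx = np.argsort(np.sum((X - x_star)**2, axis=1))[:k]
    return np.mean(y[idx])

def nadaraya_watson(X, y, x_star, bandwidth=0.5):
    """Kernel-smoothed estimate at x_star: a locally weighted average of the
    labels, with weights decaying with distance from the query point."""
    w = np.exp(-0.5 * np.sum((X - x_star)**2, axis=1) / bandwidth**2)
    return np.sum(w * y) / np.sum(w)
```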
Gaussian Process Regression (GPR) [17] is a popular nonparametric regression method based on Bayesian principles
which provides uncertainty estimates for its predictions. Similarly to other kernel methods (e.g., SVMs and KRR), GPR
is a global method, meaning that it takes into account the whole dataset at prediction time. Thus, GPR inherits the
computational complexity of global kernel methods, which is prohibitive for large datasets. Among the large class of
scalable approximations for GPR, successful ones are based on Random Fourier Features [11] and on sparsification of
the Gram matrix induced by the kernel [17].
The random feature approximation of the kernel proposed in [11] is based on Bochner's theorem and allows representing the
kernel function as a dot product of (possibly infinite) feature maps applied to the input data. In practice, infinite feature
maps are replaced by a finite Monte Carlo approximation. The disadvantage of this approach is that it is necessary to
construct a specific random feature mapping for each type of kernel. While random feature approximations are known
for popular kernels such as the RBF [11] and polynomial [18] kernels, there is no straightforward application of this method to
approximate arbitrary kernels.
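A minimal sketch of this construction for the RBF kernel is given below, assuming NumPy; the number of features, the lengthscale, and the function name rff_features are illustrative placeholders. An approximate Gram matrix is then Phi @ Phi.T with Phi = rff_features(X).

```python
import numpy as np

def rff_features(X, num_features=100, lengthscale=1.0, seed=0):
    """Random Fourier features for the RBF kernel: by Bochner's theorem the
    kernel is the Fourier transform of a Gaussian spectral density, so
    k(x, z) is approximated by the dot product phi(x) . phi(z), with
    frequencies W sampled from that density."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / lengthscale, size=(d, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)
```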
The Gram matrix sparsification approach is based on the idea of introducing so-called inducing points in order to
approximate the full Gram matrix. One of the most popular methods in this family is the Nyström approximation [17].
The main drawback of this approach is that a low number of inducing points might lead to a poor approximation of the
original model, which affects predictive performance. An important advancement within this family of approaches,
which provides a scalable variational formulation, was proposed in [19].
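The sketch below illustrates the basic Nyström construction under the assumption that the inducing points are a random subset of the training data and that `kernel` is any kernel function (e.g., the rbf_kernel sketched earlier); the jitter term and the number of inducing points are illustrative choices.

```python
import numpy as np

def nystrom_gram(X, kernel, num_inducing=50, seed=0):
    """Nystrom approximation of the full Gram matrix,
    K ~= K_nm K_mm^{-1} K_mn, built from m randomly chosen inducing points."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=num_inducing, replace=False)
    Z = X[idx]                                           # inducing points
    K_mm = kernel(Z, Z) + 1e-8 * np.eye(num_inducing)    # jitter for stability
    K_nm = kernel(X, Z)
    return K_nm @ np.linalg.solve(K_mm, K_nm.T)
```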
While providing good performance and scalability for large datasets, these approaches still require some design choices
for the kernel. With stationary kernels, they assume that the same kernel is suitable for all regions of the input space;
if the data are nonstationary, this may harm predictive performance. The literature offers a wide range of proposals to
address kernel design by incorporating ideas from deep learning [13, 20].
Recently, partitioning strategies have also gained some attention. The main idea is to divide the input space into regions
where local estimators are defined [21, 22, 23, 24]. In partition-based methods, the main challenge is to define an
effective partitioning of the space.
There are several approaches that use the idea of local learning for training GP models. The method proposed in [25]
and extended in [26] mostly focuses on Bayesian parametric linear regression. The methods in these papers build an
ensemble of local models centered at several fixed points, where each training point is weighted according to its
distance from the center of the model. Predictions are computed as a weighted sum of the local models. The authors
claim that their approach extends to GPR, but in this case each local model considers the full training set. This means
that these methods use localization to address nonstationarity, but scale poorly to large datasets. The method proposed
in [27] builds local GP models that use only subsets of the training data, but it lacks a mechanism that assigns
importance weights to the training points of each local model according to their distance from the center of the model.
As a result, the model can make overconfident predictions for points that lie far away from the centers of the local
models. In [28], in order to obtain fast approximate predictions at a target point, the authors propose a forward step-wise
variable selection procedure to find the optimal sub-design.
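To make the distance-weighted combination of local models concrete, the following generic sketch (not the specific formulation of [25], [26], or [27]) averages the predictive means of local models with weights that decay with the distance between the query point and each model's center; the Gaussian weighting and the bandwidth are illustrative assumptions.

```python
import numpy as np

def weighted_local_prediction(centers, local_means, x_star, bandwidth=1.0):
    """Combine the predictive means of local models, down-weighting models
    whose centers lie far from the query point x_star."""
    dists = np.sum((centers - x_star)**2, axis=1)   # squared distances to centers
    w = np.exp(-0.5 * dists / bandwidth**2)         # Gaussian localization weights
    w = w / np.sum(w)
    return np.sum(w * np.asarray(local_means))
```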
The paper is organized as follows: Section 2 introduces GPs and the idea of localization, along with a discussion on
the connection between our localized GPs and local kernel ridge regression. The experimental campaign in Section 3
reports results on a variety of benchmarks for regression, and Section 4 concludes the paper.