Optimal plug-in Gaussian processes for modelling derivatives
Zejian Liu* and Meng Li†
Department of Statistics, Rice University
Abstract
Derivatives are a key nonparametric functional in wide-ranging applications where the rate of change of an unknown function is of interest. In the Bayesian paradigm, Gaussian processes (GPs) are routinely used as a flexible prior for unknown functions, and are arguably one of the most popular tools in many areas. However, little is known about the optimal modelling strategy and theoretical properties when using GPs for derivatives. In this article, we study a plug-in strategy by differentiating the posterior distribution with GP priors for derivatives of any order. This practically appealing plug-in GP method has been previously perceived as suboptimal and degraded, but this is not necessarily the case. We provide posterior contraction rates for plug-in GPs and establish that they remarkably adapt to derivative orders. We show that the posterior measure of the regression function and its derivatives, with the same choice of hyperparameter that does not depend on the order of derivatives, converges at the minimax optimal rate up to a logarithmic factor for functions in certain classes. We analyze a data-driven hyperparameter tuning method based on empirical Bayes, and show that it satisfies the optimal rate condition while maintaining computational efficiency. This article, to the best of our knowledge, provides the first positive result for plug-in GPs in the context of inferring derivative functionals, and leads to a practically simple nonparametric Bayesian method with optimal and adaptive hyperparameter tuning for simultaneously estimating the regression function and its derivatives. Simulations show competitive finite sample performance of the plug-in GP method. A climate change application for analyzing the global sea-level rise is discussed.
1 Introduction
Consider the nonparametric regression model
\[ Y_i = f(X_i) + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), \tag{1} \]
*zejian.liu@rice.edu
†meng@rice.edu
where the data $\mathcal{D}_n = \{X_i, Y_i\}_{i=1}^n$ are independent and identically distributed samples from a distribution $P_0$ on $\mathcal{X} \times \mathbb{R}$ that is determined by $P_X$, $f_0$, and $\sigma^2$, which are respectively the marginal distribution of $X_i$, the true regression function, and the noise variance that is possibly unknown. Let $p_X$ denote the density of $P_X$ with respect to the Lebesgue measure $\mu$. Here $\mathcal{X} \subset \mathbb{R}^p$ is a compact metric space for $p \ge 1$.
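To make the data-generating mechanism in (1) concrete, here is a minimal Python sketch that simulates a one-dimensional dataset; the particular regression function, the uniform design for $P_X$, and the noise level are illustrative assumptions rather than choices made in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200        # sample size
sigma = 0.1    # noise standard deviation (illustrative)

def f0(x):
    # an arbitrary smooth "true" regression function, used only for illustration
    return np.sin(2 * np.pi * x) + 0.5 * x**2

X = rng.uniform(0.0, 1.0, size=n)            # covariates drawn i.i.d. from P_X = Uniform[0, 1]
Y = f0(X) + rng.normal(0.0, sigma, size=n)   # responses: Y_i = f0(X_i) + eps_i, eps_i ~ N(0, sigma^2)
```

This synthetic dataset is reused in the sketches that follow.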
We are interested in the inference on the derivative functions of $f$. Derivatives emerge as a key nonparametric quantity when the rate of change of an unknown surface is of interest. Examples include surface roughness for digital terrain models, temperature or rainfall slope in meteorology, and pollution curvature for environmental data. The importance of derivatives, either as a nonparametric functional or a localized characteristic of $f$, can also be found in efficient modelling of functional data (Dai et al., 2018), shape constrained function estimation (Riihimäki and Vehtari,
Gaussian processes (GPs) are a popular nonparametric Bayesian method in many areas such as spatially correlated data analysis (Stein, 2012; Gelfand et al., 2003; Banerjee et al., 2003), functional data analysis (Shi and Choi, 2011), and machine learning (Rasmussen and Williams, 2006); see also the excellent review article by Gelfand and Schliep (2016) which elaborates on the instrumental role GPs have played as a key ingredient in an extensive list of twenty years of modelling work. GPs not only provide a flexible process for unknown functions but also serve as a building block in hierarchical models for broader applications.
For function derivatives, the so-called plug-in strategy that directly differentiates the posterior distribution under GP priors is practically appealing, as it allows users to employ the same prior regardless of whether the inference goal is the regression function or its derivatives. However, this plug-in estimator has been perceived as suboptimal and degraded for a decade (Stein, 2012; Holsclaw et al., 2013) based on heuristics, while a theoretical understanding is lacking, partly owing to technical challenges posed by the irregularity and nonparametric nature of derivative functionals (see Section 2.2 for more details). As a result, plug-in GPs have received limited study ever since, and substantially more complicated methods, which hamper easy implementation and are often restricted to one particular derivative order, have been pursued.
In this article, we study the plug-in strategy with GPs for derivative functionals by characterizing large sample properties of the plug-in posterior measure with GP priors, and obtain the first positive result. We show that the plug-in posterior distribution, with the same choice of hyperparameter in the GP prior, concentrates at the derivative functionals of any order at a nearly minimax rate in specific examples, thus achieving a remarkable plug-in property for nonparametric functionals that has gained increasing attention recently (Yoo and Ghosal, 2016; Liu and Li, 2023). It is known that many commonly used nonparametric methods such as smoothing splines and local polynomials do not enjoy this property when estimating derivatives (Wahba and Wang, 1990; Charnigo et al., 2011), and the only nonparametric Bayesian method with an established plug-in property, to the best of our knowledge, is random series priors with B-splines (Yoo and Ghosal, 2016).
In recent years, the nonparametric Bayesian literature has seen remarkable adaptability of GP priors in various regression settings (van der Vaart and van Zanten, 2009; Bhattacharya et al., 2014; Jiang and Tokdar, 2021). Our findings contribute to this growing literature and indicate that the widely used GP priors offer more than inferring regression functions. In particular, the established theory supports the use of plug-in GPs for optimal modelling of derivatives, and further sheds light on hyperparameter tuning in the presence of varying derivative orders, for which we propose to use an empirical Bayes approach. Our analysis indicates that this data-driven hyperparameter tuning strategy attains theoretical optimality and adapts to the derivative order and the true function's smoothness level with an oversmooth kernel, while maintaining computational efficiency. Therefore, this article shows that the Bayes procedure using GP priors automatically adapts to the order of derivative, leading to a practically simple nonparametric Bayesian method with guided hyperparameter tuning for simultaneously estimating the regression function and its derivatives. These theoretical guarantees are complemented by competitive finite sample performance in simulations, as well as a climate change application to analyzing the global sea-level rise.
The following notation is used throughout this paper. We write $X = (X_1^T, \ldots, X_n^T)^T \in \mathbb{R}^{n \times p}$ and $Y = (Y_1, \ldots, Y_n)^T \in \mathbb{R}^n$. Let $\|\cdot\|$ be the Euclidean norm; for $f, g: \mathcal{X} \to \mathbb{R}$, let $\|f\|_\infty$ be the $L_\infty$ (supremum) norm, $\|f\|_2 = (\int_{\mathcal{X}} f^2 \, dP_X)^{1/2}$ the $L_2$ norm with respect to the covariate distribution $P_X$, and $\langle f, g \rangle_2 = \int_{\mathcal{X}} f g \, dP_X$ the inner product. The corresponding $L_2$ space relative to $P_X$ is denoted by $L^2_{p_X}(\mathcal{X})$; we write $L^2(\mathcal{X})$ for the $L_2$ space with respect to the Lebesgue measure $\mu$. Denote the space of all essentially bounded functions by $L^\infty(\mathcal{X})$. Let $\mathbb{N}$ be the set of all positive integers and write $\mathbb{N}_0 = \mathbb{N} \cup \{0\}$. We let $C(\mathcal{X})$ and $C(\mathcal{X}, \mathcal{X})$ denote the space of continuous functions and continuous bivariate functions. In the one-dimensional case, for $\Omega \subset \mathbb{R}$, a function $f: \Omega \to \mathbb{R}$, and $k \in \mathbb{N}$, we use $f^{(k)}$ to denote its $k$-th derivative as long as it exists, and $f^{(0)} = f$. Let $C^m(\Omega) = \{f: \Omega \to \mathbb{R} \mid f^{(k)} \in C(\Omega) \text{ for all } 1 \le k \le m\}$ denote the space of $m$-times continuously differentiable functions and $C^{2m}(\Omega, \Omega) = \{K: \Omega \times \Omega \to \mathbb{R} \mid \partial_x^k \partial_{x'}^k K(x, x') \in C(\Omega, \Omega) \text{ for all } 1 \le k \le m\}$ denote the space of $m$-times continuously differentiable bivariate functions, where $\partial_x^k = \partial^k / \partial x^k$. For two sequences $a_n$ and $b_n$, we write $a_n \lesssim b_n$ if $a_n \le C b_n$ for a universal constant $C > 0$, and $a_n \asymp b_n$ if $a_n \lesssim b_n$ and $b_n \lesssim a_n$.
2 Main results
2.1 Plug-in Gaussian process for derivative functionals
We assign a Gaussian process prior $\Pi$ on the regression function such that $f \sim \mathrm{GP}(0, \sigma^2 (n\lambda)^{-1} K)$. Here $K(\cdot,\cdot): \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a continuous, symmetric and positive definite kernel function, and $\lambda > 0$ is a regularization parameter that possibly depends on the sample size $n$. The rescaling factor $\sigma^2 (n\lambda)^{-1}$ in the covariance kernel connects the posterior mean Bayes estimator with kernel ridge regression (Wahba, 1990; Cucker and Zhou, 2007); see also Theorem 11.61 in Ghosal and van der Vaart (2017) for more discussion on this connection.
It is not difficult to derive that the posterior distribution $\Pi_n(\cdot \mid \mathcal{D}_n)$ is also a GP: $f \mid \mathcal{D}_n \sim \mathrm{GP}(\hat{f}_n, \hat{V}_n)$, where the posterior mean $\hat{f}_n$ and posterior covariance $\hat{V}_n$ are given by
\[
\hat{f}_n(x) = K(x, X)[K(X, X) + n\lambda I_n]^{-1} Y,
\]
\[
\hat{V}_n(x, x') = \sigma^2 (n\lambda)^{-1} \{ K(x, x') - K(x, X)[K(X, X) + n\lambda I_n]^{-1} K(X, x') \}, \tag{2}
\]
for any $x, x' \in \mathcal{X}$. Here $K(X, X)$ is the $n \times n$ matrix $(K(X_i, X_j))_{i,j=1}^n$, $K(x, X)$ is the $1 \times n$ vector $(K(x, X_i))_{i=1}^n$, and $I_n$ is the $n \times n$ identity matrix.
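As a concrete illustration of (2), the following sketch evaluates the posterior mean and covariance on a grid. The squared-exponential kernel, its length-scale, and the values of $\lambda$ and $\sigma^2$ are illustrative assumptions; any kernel satisfying the stated conditions could be substituted.

```python
import numpy as np

def se_kernel(x, y, ell=0.2):
    """Squared-exponential kernel K(x, y) = exp(-(x - y)^2 / (2 ell^2))."""
    d = np.subtract.outer(x, y)
    return np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior(X, Y, x_new, lam, sigma2, kernel=se_kernel):
    """Posterior mean and covariance in (2) under the prior GP(0, sigma^2 (n lam)^{-1} K)."""
    n = X.shape[0]
    A = kernel(X, X) + n * lam * np.eye(n)   # K(X, X) + n*lam*I_n
    K_new = kernel(x_new, X)                 # rows are the vectors K(x, X), one per grid point x
    mean = K_new @ np.linalg.solve(A, Y)     # K(x, X) [K(X, X) + n*lam*I_n]^{-1} Y
    # sigma^2 (n*lam)^{-1} { K(x, x') - K(x, X) [K(X, X) + n*lam*I_n]^{-1} K(X, x') }
    cov = (sigma2 / (n * lam)) * (kernel(x_new, x_new) - K_new @ np.linalg.solve(A, K_new.T))
    return mean, cov

# illustrative usage on the synthetic data from the earlier sketch
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, 200)
Y = np.sin(2 * np.pi * X) + 0.5 * X**2 + rng.normal(0.0, 0.1, X.size)
x_grid = np.linspace(0.0, 1.0, 101)
mean, cov = gp_posterior(X, Y, x_grid, lam=1e-3, sigma2=0.01)
```

Solving the linear system with $[K(X, X) + n\lambda I_n]$ directly, rather than forming its inverse, is the standard numerically stable way to evaluate (2).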
We now define the plug-in Gaussian process for differential operators. For simplicity we focus on the one-dimensional case where $\mathcal{X} = [0,1]$ throughout the paper, but remark that the studied plug-in strategy can be extended to multivariate cases straightforwardly, despite more complicated notation for high-order derivatives.
For any $k \in \mathbb{N}$, define the $k$-th differential operator $D_k: C^k[0,1] \to C[0,1]$ by $D_k(f) = f^{(k)}$. If $K \in C^{2k}([0,1],[0,1])$, then the posterior distribution of the derivative $f^{(k)} \mid \mathcal{D}_n$, denoted by $\Pi_{n,k}(\cdot \mid \mathcal{D}_n)$, is also a Gaussian process since differentiation is a linear operator. In particular, $f^{(k)} \mid \mathcal{D}_n \sim \mathrm{GP}(\hat{f}^{(k)}_n, \tilde{V}^k_n)$, where
\[
\hat{f}^{(k)}_n(x) = K_{k0}(x, X)[K(X, X) + n\lambda I_n]^{-1} Y,
\]
\[
\tilde{V}^k_n(x, x') = \sigma^2 (n\lambda)^{-1} \{ K_{kk}(x, x') - K_{k0}(x, X)[K(X, X) + n\lambda I_n]^{-1} K_{0k}(X, x') \}, \tag{3}
\]
with $K_{k0}(x, X) = (\partial_x^k K(x, X_i))_{i=1}^n$ and $K_{kk}(x, x') = \partial_x^k \partial_{x'}^k K(x, x')$. Then the nonparametric plug-in procedure for $D_k$ refers to using the plug-in posterior measure $\Pi_{n,k}(\cdot \mid \mathcal{D}_n)$ for inference on $D_k(f)$.
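The sketch below implements the plug-in posterior (3) for the first derivative ($k = 1$) using a squared-exponential kernel, for which the cross-derivatives $K_{10}$ and $K_{11}$ are available in closed form; the kernel, length-scale, $\lambda$, $\sigma^2$, and synthetic data are again illustrative assumptions.

```python
import numpy as np

ell = 0.2  # length-scale of the squared-exponential kernel (illustrative)

def K(x, y):
    d = np.subtract.outer(x, y)
    return np.exp(-0.5 * (d / ell) ** 2)

def K10(x, y):
    # K_{10}(x, y) = d/dx K(x, y) = -((x - y) / ell^2) K(x, y) for the SE kernel
    d = np.subtract.outer(x, y)
    return -(d / ell**2) * np.exp(-0.5 * (d / ell) ** 2)

def K11(x, y):
    # K_{11}(x, y) = d^2/(dx dy) K(x, y) = (1/ell^2 - (x - y)^2/ell^4) K(x, y)
    d = np.subtract.outer(x, y)
    return (1.0 / ell**2 - d**2 / ell**4) * np.exp(-0.5 * (d / ell) ** 2)

def plugin_derivative_posterior(X, Y, x_new, lam, sigma2):
    """Plug-in posterior mean and covariance in (3) for k = 1."""
    n = X.shape[0]
    A = K(X, X) + n * lam * np.eye(n)        # K(X, X) + n*lam*I_n
    K10_new = K10(x_new, X)                  # rows are K_{10}(x, X)
    mean = K10_new @ np.linalg.solve(A, Y)   # plug-in posterior mean of f'(x)
    # by symmetry of the SE kernel, K_{01}(X, x') equals K10_new.T
    cov = (sigma2 / (n * lam)) * (K11(x_new, x_new) - K10_new @ np.linalg.solve(A, K10_new.T))
    return mean, cov

# illustrative usage: estimate f'(x) on a grid from the synthetic data generated earlier
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, 200)
Y = np.sin(2 * np.pi * X) + 0.5 * X**2 + rng.normal(0.0, 0.1, X.size)
x_grid = np.linspace(0.0, 1.0, 101)
dmean, dcov = plugin_derivative_posterior(X, Y, x_grid, lam=1e-3, sigma2=0.01)
```

For a higher derivative order $k$, one would replace $K_{10}$ and $K_{11}$ by the corresponding $k$-th order cross-derivatives $K_{k0}$ and $K_{kk}$, provided the kernel is smooth enough.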
The plug-in posterior measure $\Pi_{n,k}(\cdot \mid \mathcal{D}_n)$ has a closed-form expression for given $\lambda$ and $\sigma^2$, substantially facilitating its implementation in practice. The plug-in strategy is practically appealing but has been perceived as suboptimal for a decade (Stein, 2012; Holsclaw et al., 2013) based on heuristics. To the contrary, we will establish the optimality of plug-in GPs and uncover their adaptivity to derivative orders. Before we move on to studying large sample properties of $\Pi_{n,k}(\cdot \mid \mathcal{D}_n)$, in the next section we first take a detour to present the technical challenges in studying derivative functionals that have hampered theoretical development for this problem.
2.2 Nonparametric plug-in property and technical challenges
We note two technical challenges posed by derivative functionals: the irregularity of function derivatives at fixed points, and the nonparametric extension of such derivatives.
The first challenge is related to the "plug-in property" in the literature. The plug-in property proposed by Bickel and Ritov (2003) refers to the phenomenon that a rate-optimal nonparametric estimator also efficiently estimates some bounded linear functionals. A parallel concept has been studied in the Bayesian paradigm relying on posterior distributions and posterior contraction rates (Rivoirard and Rousseau, 2012; Castillo and Nickl, 2013; Castillo and Rousseau, 2015). However, function derivatives may not fall into the classical plug-in property framework. To see this, let $D_t(f) = f'(t)$ be the functional that maps $f$ to its derivative at a fixed point $t \in [0,1]$. While it is easy to see that $D_t$ is a linear functional, the following Proposition 1 (Conway, 1994, page 13) shows that $D_t$ is not bounded.

Proposition 1. Let $t \in [0,1]$ and define $D_t: C^1[0,1] \to \mathbb{R}$ by $D_t(f) = f'(t)$. Then, there is no bounded linear functional on $L^2[0,1]$ that agrees with $D_t$ on $C^1[0,1]$.
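For intuition, a standard illustration (not taken from the paper) of why $D_t$ admits no bounded extension is to take
\[
h_j(x) = j^{-1} \sin\{ j (x - t) \}, \qquad j \in \mathbb{N},
\]
so that $h_j \in C^1[0,1]$ and $\|h_j\|_2 \le j^{-1} \to 0$, while $D_t(h_j) = h_j'(t) = 1$ for every $j$; a bounded linear functional on $L^2[0,1]$ agreeing with $D_t$ would have to send $h_j$ to $0$.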
Therefore, it appears difficult to analyze function derivatives evaluated at a fixed point, as existing work on the plug-in property typically assumes the functional to be bounded (Bickel and Ritov, 2003; Castillo and Nickl, 2013; Castillo and Rousseau, 2015).
The second challenge is linked to the fact that the differential operator $D_k$ gives rise to function-valued functionals, or nonparametric functionals, as opposed to the real-valued functionals studied in the classical plug-in property literature. Hence, one needs to analyze the function-valued functionals uniformly over all points in the support. To distinguish the plug-in property for nonparametric functionals from its traditional counterpart for real-valued functionals, we term this property the nonparametric plug-in property.
We overcome these challenges by resting on an operator-theoretic framework (Smale and Zhou, 2005, 2007), the equivalent kernel technique, and a recent non-asymptotic analysis of nonparametric quantities (Liu and Li, 2023), and show that GPs enjoy the nonparametric plug-in property for differential operators.
2.3 Posterior contraction for function derivatives
Throughout this article, we assume the true regression function $f_0 \in C^k[0,1]$ and the covariance kernel $K \in C^{2k}([0,1],[0,1])$. Let $\{\mu_i\}_{i=1}^\infty$ and $\{\phi_i\}_{i=1}^\infty$ be the eigenvalues and eigenfunctions of the kernel $K$ such that $K(x, x') = \sum_{i=1}^\infty \mu_i \phi_i(x) \phi_i(x')$ for any $x, x' \in [0,1]$, where the eigenvalues satisfy $\mu_1 \ge \mu_2 \ge \cdots > 0$ and $\mu_i \to 0$, and the eigenfunctions form an orthonormal basis of $L^2_{p_X}[0,1]$. The existence of such an eigendecomposition is ensured by Mercer's theorem. It can also be seen that $\phi_i \in C^k[0,1]$ for all $i \in \mathbb{N}$ as $K \in C^{2k}([0,1],[0,1])$.
We make the following assumptions on the eigenfunctions of the covariance kernel.

Assumption (A). There exists $C_{k,\phi} > 0$ such that $\|\phi_i^{(k)}\|_\infty \le C_{k,\phi}\, i^k$ for any $i \in \mathbb{N}$.

Assumption (B). There exists $L_{k,\phi} > 0$ such that $|\phi_i^{(k)}(x) - \phi_i^{(k)}(x')| \le L_{k,\phi}\, i^{k+1} |x - x'|$ for all $i \in \mathbb{N}$ and any $x, x' \in [0,1]$.
We will make extensive use of the so-called equivalent kernel $\tilde{K}$ (Rasmussen and Williams, 2006, Chapter 7), which shares the same eigenfunctions as $K$ but with altered eigenvalues $\nu_i = \mu_i / (\lambda + \mu_i)$ for $i \in \mathbb{N}$, i.e., $\tilde{K}(x, x') = \sum_{i=1}^\infty \nu_i \phi_i(x) \phi_i(x')$. Note that $\tilde{K}$ is also a continuous, symmetric, and positive definite kernel.
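To make the equivalent kernel concrete, the short sketch below numerically approximates the Mercer eigenvalues of a kernel under a uniform design via a quadrature (Nyström-type) approximation and then applies the transformation $\nu_i = \mu_i/(\lambda + \mu_i)$; the squared-exponential kernel, the grid size, and the value of $\lambda$ are illustrative assumptions.

```python
import numpy as np

# Quadrature (Nystrom-type) approximation of the Mercer eigendecomposition of K on [0, 1],
# taking P_X = Uniform[0, 1]; the SE kernel, grid size, and lambda below are illustrative.
m = 400
grid = (np.arange(m) + 0.5) / m                   # quadrature nodes in [0, 1]
w = 1.0 / m                                       # quadrature weight under Uniform[0, 1]
ell = 0.2
Kmat = np.exp(-0.5 * (np.subtract.outer(grid, grid) / ell) ** 2)

# approximate eigenvalues of the integral operator f -> int_0^1 K(., x) f(x) dP_X(x)
mu = np.sort(np.linalg.eigvalsh(w * Kmat))[::-1]  # mu_1 >= mu_2 >= ...

# equivalent-kernel eigenvalues: nu_i = mu_i / (lambda + mu_i), all strictly below 1
lam = 1e-3
nu = mu / (lam + mu)
print(nu[:5])                                     # leading equivalent-kernel eigenvalues
```

The transformation shrinks eigenvalues with $\mu_i \ll \lambda$ toward zero while leaving those with $\mu_i \gg \lambda$ close to one, which underlies the role of $\tilde{K}$ in the analysis of the regularized posterior.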
Under Assumption (A), we define an $m$-th order analog of effective dimension of the kernel $K$ with