
LEAVE-GROUP-OUT CROSS-VALIDATION FOR LATENT GAUSSIAN MODELS

Zhedong Liu

Statistics Program, Computer, Electrical and Mathematical Sciences and Engineering Division

King Abdullah University of Science and Technology (KAUST)

Kingdom of Saudi Arabia, Thuwal 23955-6900

zhedong.liu@kaust.edu.sa

Håvard Rue

Statistics Program, Computer, Electrical and Mathematical Sciences and Engineering Division

King Abdullah University of Science and Technology (KAUST)

Kingdom of Saudi Arabia, Thuwal 23955-6900

haavard.rue@kaust.edu.sa

September 4, 2024

ABSTRACT

Evaluating the predictive performance of a statistical model is commonly done using cross-validation. Although the leave-one-out method is frequently employed, its application is justified primarily for independent and identically distributed observations. However, this method tends to mimic interpolation rather than prediction when dealing with dependent observations.

This paper proposes a modified cross-validation for dependent observations. This is achieved by excluding an automatically determined set of observations from the training set to mimic a more reasonable prediction scenario. Also, within the framework of latent Gaussian models, we illustrate a method to adjust the joint posterior for this modified cross-validation to avoid model refitting. This new approach is accessible in the R-INLA package (www.r-inla.org).

Keywords: Bayesian Cross-Validation; Latent Gaussian Models; R-INLA

1 Introduction

1.1 Background

Leave-one-out cross-validation (LOOCV) [ ] is a popular method for evaluating a statistical model's predictive performance, selecting models, or estimating critical model parameters. The core concept of LOOCV is straightforward. Suppose we have data $\boldsymbol{y} = \{y_i\}$, $i = 1, \ldots, n$, presumed to be independent and identically distributed (I.I.D.) samples from the true distribution $\pi_T(y)$. Our objective is to determine how well a fitted model can predict a new observation, $\tilde{y}$, sampled from this true distribution. In the Bayesian context, we use the posterior predictive distribution $\pi(\tilde{y} \mid \boldsymbol{y})$ to predict $\tilde{y}$ sampled from $\pi_T(y)$. By using the logarithmic score [ ], we compute $E_{\tilde{y}}[\log \pi(\tilde{y} \mid \boldsymbol{y})]$ as a metric for prediction quality.

Because $\pi_T(y)$ is unknown, this expectation cannot be computed directly. Nonetheless, since $\boldsymbol{y}$ is an I.I.D. sample from $\pi_T(y)$, we can estimate the expectation by evaluating

$$u_{\mathrm{LOOCV}} = \frac{1}{n} \sum_{i=1}^{n} \log \pi(y_i \mid \boldsymbol{y}_{-i}),$$

where $y_i$ is the testing point and $\boldsymbol{y}_{-i}$, the training set, contains all data except the $i$th observation.
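As a concrete illustration of the LOOCV estimate above, the following minimal sketch assumes a conjugate model that is not taken from the paper: $y_i \sim N(\theta, \sigma^2)$ with known $\sigma^2$ and a normal prior on $\theta$. Under that assumption each leave-one-out predictive density is Gaussian in closed form, so no refitting machinery is needed. The function names and default values are illustrative only.

```python
import math


def normal_logpdf(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def loocv_log_score(y, sigma2=1.0, prior_mean=0.0, prior_var=100.0):
    """Average of log pi(y_i | y_{-i}) for the assumed conjugate normal model.

    Assumed model (illustrative, not from the paper):
        y_i | theta ~ N(theta, sigma2), theta ~ N(prior_mean, prior_var).
    """
    n = len(y)
    total = sum(y)
    score = 0.0
    for yi in y:
        # Posterior of theta given the n - 1 remaining observations.
        post_prec = 1.0 / prior_var + (n - 1) / sigma2
        post_mean = (prior_mean / prior_var + (total - yi) / sigma2) / post_prec
        # Posterior predictive variance adds the observation noise.
        pred_var = 1.0 / post_prec + sigma2
        score += normal_logpdf(yi, post_mean, pred_var)
    return score / n


if __name__ == "__main__":
    print(loocv_log_score([0.3, -0.1, 0.8, 0.2]))
```

With I.I.D. data, each held-out point is a fair stand-in for a new observation, which is what makes this average a sensible estimate of the expected log score.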

arXiv:2210.04482v5 [stat.CO] 2 Sep 2024

A PREPRINT - SEPTEMBER 4, 2024

The informal interpretation of LOOCV is that it mimics “using $\boldsymbol{y}$ to predict $\tilde{y}$” by “using $\boldsymbol{y}_{-i}$ to predict $y_i$”. This intuitive interpretation is then used to justify, often implicitly, the use of LOOCV as a “default” way to evaluate predictive performance.

However, issues arise when the I.I.D. assumption does not hold. We can have longitudinal data following each subject in a study [ ], dependence due to time and/or space [ ], or hierarchical structure [ ]. In those cases, LOOCV often gives an overly optimistic assessment of a model's predictive performance, since it no longer mimics a proper prediction task, that is, a specification of how new data are generated given the observed data. This paper proposes an adaptation of LOOCV that is more aligned with “measuring predictive performance” when the I.I.D. assumption is invalid.

1.2 The prediction task

The critical observation is that the meaning of “prediction” is not clearly defined when $\boldsymbol{y}$ are not I.I.D. samples of $\pi_T(y)$. In non-I.I.D. scenarios, $\pi_T(\tilde{y} \mid \boldsymbol{y})$ lacks a unique definition without a clear prediction task, i.e., a specification of how we imagine a new future data point, $\tilde{y}$, is generated given the observed data $\boldsymbol{y}$. This ambiguity extends to the act of “using $\boldsymbol{y}$ to predict $\tilde{y}$”, as it is uncertain what our target, $\tilde{y}$, represents. To illustrate these concepts, let us discuss some more concrete examples.

Time-series model

Assume the data $\boldsymbol{y} = \{y_1, y_2, \ldots, y_T\}$ form a time series, observed sequentially at times $1, 2, \ldots, T$. Given the temporal nature of the data, the inherent prediction task is to predict future values. We can predict a new observation $k \ge 1$ steps ahead into the future by $\pi(y_{T+k} \mid y_1, \ldots, y_T)$.

In this example, LOOCV is computed from

$$\pi(y_t \mid y_1, \ldots, y_{t-1}, y_{t+1}, \ldots, y_T), \quad t = 1, \ldots, T,$$

which is often referred to as interpolation or imputation of missing values rather than prediction. However, the predictive performance of time-series models is often assessed through leave-future-out cross-validation (LFOCV) [5]:

$$\sum_{T'=T_0}^{T-k} \log \pi(y_{T'+k} \mid y_1, \ldots, y_{T'}),$$

where $T'$ starts from time $T_0 > 1$, as we need some data to estimate the model.
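To make the contrast between the two criteria tangible, the sketch below computes the individual LOOCV and LFOCV terms for a stationary Gaussian AR(1) process, $y_t = \phi y_{t-1} + \varepsilon_t$ with $\varepsilon_t \sim N(0, \sigma^2)$. The AR(1) model, the assumption that $\phi$ and $\sigma^2$ are known (so every predictive density is available in closed form), and all function names are simplifications introduced here for illustration; a full Bayesian treatment would integrate over the parameters.

```python
import math


def normal_logpdf(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def ar1_loo_terms(y, phi, sigma2):
    """log pi(y_t | all other y) for a stationary AR(1) with known parameters."""
    T = len(y)
    if T < 2:
        raise ValueError("need at least two observations")
    terms = []
    for t in range(T):
        if t == 0:
            # Only a right neighbour: stationary AR(1) is time-reversible.
            mean, var = phi * y[1], sigma2
        elif t == T - 1:
            mean, var = phi * y[T - 2], sigma2
        else:
            # Interior point: both neighbours inform y_t (interpolation).
            mean = phi * (y[t - 1] + y[t + 1]) / (1.0 + phi ** 2)
            var = sigma2 / (1.0 + phi ** 2)
        terms.append(normal_logpdf(y[t], mean, var))
    return terms


def ar1_lfo_terms(y, phi, sigma2, k=1, t0=2):
    """log pi(y_{T'+k} | y_1..y_{T'}) for T' = t0, ..., T - k (1-based T')."""
    T = len(y)
    # k-step-ahead forecast variance of an AR(1) process.
    var = sigma2 * (1.0 - phi ** (2 * k)) / (1.0 - phi ** 2)
    terms = []
    for t_prime in range(t0, T - k + 1):
        mean = phi ** k * y[t_prime - 1]
        terms.append(normal_logpdf(y[t_prime + k - 1], mean, var))
    return terms


if __name__ == "__main__":
    series = [0.2, 0.5, 0.1, -0.3, -0.4, 0.0, 0.6]
    print(sum(ar1_loo_terms(series, 0.7, 1.0)) / len(series))
    print(sum(ar1_lfo_terms(series, 0.7, 1.0, k=1, t0=2)))
```

For interior points the LOOCV predictive variance is $\sigma^2 / (1 + \phi^2)$, smaller than the one-step forecast variance $\sigma^2$ whenever $\phi \neq 0$, which is one way to see why LOOCV is the easier, more interpolation-like task.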

The message from this example is that LOOCV, when applied to such models, essentially evaluates interpolation performance rather than predictive performance.

We acknowledge two issues. First, the distinction between interpolation and prediction is not always clear-cut, and the two concepts overlap. For example, a one-step-ahead forecast leans more towards interpolation than a two-step-ahead prediction, but less towards interpolation than a missing-value imputation. However, this does not deter our discussion. Second, while an ideal model succeeds in all prediction tasks, real-world scenarios force us to settle on a definition of the “best fit”. Consequently, our choice of evaluation should align with our specific objectives.

Multilevel model

Figure 1 illustrates an example of a multilevel model. Consider observations of student grades or performance. These data exhibit a hierarchical structure: students belong to classes, classes reside within schools, and schools are nested within regions. This hierarchical arrangement is significant because it introduces effects attributed to the class, school, and region levels, substantially deviating from the I.I.D. assumption.

Given such a model, the prediction task becomes ambiguous. Are we aiming to predict the performance of an unobserved student from an observed class? Or are we trying to predict the performance of an unobserved student in an unobserved class, school, or even region? This difficulty mirrors the challenges in defining asymptotic regimes for these models. As students, classes, schools, and regions can grow indefinitely in various ways, it is unclear which of these choices, if any, is the most reasonable.

To properly evaluate predictive performance within this context, users must first explicitly define their prediction task and then assess the model in line with this definition. It is worth noting that applying LOOCV would imply the prediction of individual students within observed classes. In our view, this mimics interpolation more than prediction.

Figure 1: A nested multilevel model. (Diagram nodes: observations $y_1, \ldots, y_{18}$.)

1.3 Improving LOOCV for non-I.I.D. models

Our discussions illuminate an important insight: when dealing with non-I.I.D. data, the prediction task implicitly defined through LOOCV may be less appropriate, as it leans more towards assessing interpolation qualities than predictive performance. This prompts the question: What is a suitable approach moving forward?

One observation is the absence of a “one size fits all” solution. Each model may possess a natural prediction task, or several, based on its intended application. Thus, for a specific assessment of predictive performance, we need to define these prediction tasks explicitly. One can then evaluate distinct predictive performance metrics using our proposed leave-group-out cross-validation (LGOCV):

$$u_{\mathrm{LGOCV}} = \frac{1}{n} \sum_{i=1}^{n} \log \pi(y_i \mid \boldsymbol{y}_{-I_i}). \tag{1}$$

Here, the group (denoted by $I_i$) is an index set including $i$. This configuration ensures that the pair $(y_i, \boldsymbol{y}_{-I_i})$ mimics a specified prediction task, with $\boldsymbol{y}_{-I_i}$ being the data subset excluding the data indexed by $I_i$. In a multilevel model, as depicted in Figure 1, predicting a student's grade from an unseen class necessitates that $I_i$ includes $i$ and all observations from student $i$'s class. However, more complex models, such as models containing both time series and hierarchical elements, pose challenges when defining a natural prediction task. Therefore, except in simple cases, LOOCV is often applied for its simplicity, even if it leans more towards interpolation.
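The effect of choosing $I_i$ as the whole class can be sketched with a deliberately simple random-intercept model, $y_{ij} = \mu + a_j + e_{ij}$ with $a_j \sim N(0, \tau^2)$ and $e_{ij} \sim N(0, \sigma^2)$. The model, the assumption that $\mu$, $\tau^2$ and $\sigma^2$ are known (so that only classmates carry information about a student), and the function names are all illustrative choices made here, not the paper's setup.

```python
import math


def normal_logpdf(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def predictive_given_classmates(classmates, mu, tau2, sigma2):
    """Predictive mean and variance of a score given observed classmates.

    With no classmates the result is the marginal N(mu, tau2 + sigma2).
    """
    post_prec = 1.0 / tau2 + len(classmates) / sigma2
    post_mean_effect = sum(v - mu for v in classmates) / sigma2 / post_prec
    return mu + post_mean_effect, 1.0 / post_prec + sigma2


def cv_score(y, class_of, leave_class_out, mu=0.0, tau2=1.0, sigma2=1.0):
    """Average log predictive density under LOOCV or leave-class-out CV."""
    n = len(y)
    total = 0.0
    for i in range(n):
        if leave_class_out:
            # I_i is the whole class of student i: no classmate is kept.
            kept = []
        else:
            # I_i = {i}: every other student of the same class is kept.
            kept = [y[j] for j in range(n)
                    if j != i and class_of[j] == class_of[i]]
        mean, var = predictive_given_classmates(kept, mu, tau2, sigma2)
        total += normal_logpdf(y[i], mean, var)
    return total / n


if __name__ == "__main__":
    scores = [2.0, 2.0, -2.0, -2.0]
    classes = [0, 0, 1, 1]
    print("LOOCV:          ", cv_score(scores, classes, leave_class_out=False))
    print("leave-class-out:", cv_score(scores, classes, leave_class_out=True))
```

In this toy data set the class effects are strong, so LOOCV can lean on classmates and reports a higher score than leave-class-out, which has to predict a student in a class it has never seen.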

Our primary contribution is offering an automatic LGOCV adaptive to the specified model when prediction tasks are not defined. Our approach automatically constructs a group, $I_i$, for each $i$. Though we will delve into automatically defining $I_i$ later, an initial understanding is that $I_i$ comprises the data points most informative for predicting the testing point, $y_i$. This set ensures that our LGOCV focuses less on interpolation and more on prediction than LOOCV. In other words, LGOCV tests the model on more difficult prediction tasks.
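The idea of a group made of “the most informative points” can be sketched as follows. This is only a rough stand-in for the construction the paper defines later: it assumes that informativeness is measured by the absolute value of a user-supplied correlation matrix and that every group has a fixed size. The function name `build_groups` and the parameter `group_size` are hypothetical.

```python
def build_groups(corr, group_size):
    """For each i, return a sorted index set containing i and the
    group_size - 1 other indices with the largest |corr[i][j]|.

    corr is a square matrix given as a list of lists. Ties are broken
    by the smaller index so that the result is deterministic.
    """
    n = len(corr)
    if not 1 <= group_size <= n:
        raise ValueError("group_size must be between 1 and n")
    groups = []
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: (-abs(corr[i][j]), j),
        )
        groups.append(sorted([i] + others[: group_size - 1]))
    return groups


if __name__ == "__main__":
    # Illustrative correlation that decays with the distance between indices.
    n = 5
    corr = [[0.5 ** abs(i - j) for j in range(n)] for i in range(n)]
    for i, group in enumerate(build_groups(corr, group_size=3)):
        print(i, group)
```

With `group_size=1` every group reduces to `{i}` and the criterion falls back to LOOCV; larger groups remove progressively more of the information that would otherwise make the held-out point easy to interpolate.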

The rationale behind LGOCV is to create more reasonable criteria than LOOCV when the user has difficulty defining manual groups in non-I.I.D. models. In various practical examples, we will show how this automatic procedure produces reasonable groups. Also, many advanced spatial examples applying this method are shown in the published paper [ ]. For a simple time-series example, our new approach will correspond to evaluating

$$\pi(y_t \mid y_1, \ldots, y_{t-k}, y_{t+k}, \ldots, y_T),$$

for fixed $k > 1$. This corresponds to removing a sequence of data with length $2k - 1$ to predict the central one. As we see, this is neither pure interpolation nor pure prediction. Our interpretation is that it is less interpolation, or more prediction, than what LOOCV provides when $k > 1$.
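The following sketch shows what removing $2k - 1$ consecutive points does to the prediction task in one special case. It assumes a stationary Gaussian AR(1) process with known $\phi$ and $\sigma^2$, an assumption made here purely so that the predictive density of $y_t$ given $y_{t-k}$ and $y_{t+k}$ has a closed form (by the Markov property, points further away add nothing). The function names are illustrative.

```python
def group_indices(t, k, T):
    """1-based index set {t-k+1, ..., t+k-1}, clipped to the range 1..T."""
    return list(range(max(1, t - k + 1), min(T, t + k - 1) + 1))


def gap_predictive(phi, sigma2, k):
    """Interior-point predictive of y_t given y_{t-k} and y_{t+k}.

    Returns (weight, variance), where the predictive mean is
    weight * (y_{t-k} + y_{t+k}) for a zero-mean stationary AR(1).
    """
    marginal_var = sigma2 / (1.0 - phi ** 2)
    weight = phi ** k / (1.0 + phi ** (2 * k))
    variance = marginal_var * (1.0 - phi ** (2 * k)) / (1.0 + phi ** (2 * k))
    return weight, variance


if __name__ == "__main__":
    print(group_indices(5, 3, 10))
    for k in (1, 2, 4, 8):
        weight, variance = gap_predictive(0.8, 1.0, k)
        print(k, round(weight, 4), round(variance, 4))
```

At $k = 1$ the variance equals the LOOCV value $\sigma^2 / (1 + \phi^2)$; as $k$ grows it approaches the marginal variance $\sigma^2 / (1 - \phi^2)$, so the held-out point is predicted with less and less help from its neighbours.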

There are two key challenges to address to make our proposal practical. First, we must quantify the information contributed by one data point in predicting another; this is crucial for automatic group construction. Second, we face the computational task of evaluating $u_{\mathrm{LGOCV}}$ given a set of groups. The naive computation of LGOCV, fitting models across all potential training sets and evaluating their utility against the corresponding testing points, is computationally infeasible, especially given the resource-demanding nature of modern statistical models. However, these challenges can be handled elegantly within the framework of latent Gaussian models (LGMs) combined with integrated nested Laplace approximation (INLA) inference, as detailed in [ ]. Throughout this paper, we will assume that our model is an LGM. We will discuss how to integrate the automatic group construction and the fast computation of $u_{\mathrm{LGOCV}}$ into the LGM-INLA framework. Notably, our proposed methodology has been incorporated into the R-INLA package (www.r-inla.org), extending its applicability across all LGMs supported by R-INLA.


1.4 Related studies

The issues of applying LOOCV to non-I.I.D. models have been recognized and addressed by numerous researchers. To name a few, [ ] advocate block cross-validation, partitioning ecological data based on inherent patterns, but this approach should be adopted when the prediction task is not simply interpolation. [ ] offers a modification to LOOCV that ensures an unbiased measure of predictive performance given the correlation between new and observed data, where unbiasedness is understood with respect to randomness in both the observed data and the new data. We should note that an assumed prediction task determines the correlation between new and observed data. [ ] presents an efficient approximation for LFOCV, which is used when the prediction task is forecasting in time series. [ ] considers a multilevel model and demonstrates that marginal WAIC is akin to LOOCV. In contrast, conditional WAIC aligns with LGOCV, where a hierarchical level, such as a school, defines the groups. The choice of criterion depends on one's assumption about the new data, i.e., the prediction task. Additionally, [ ] recommends h-block cross-validation to estimate one-step-ahead performance in stationary processes, which is similar to applying our approach to a stationary model.

These studies assume that model users can select an appropriate evaluation method judiciously according to their understanding of their prediction tasks. Yet, in practice, many users, despite knowing LOOCV's limitations, still resort to LOOCV or randomized K-fold cross-validation, primarily for user-friendliness. This reality motivates our pursuit of a method that combines user-friendliness with better estimation of predictive performance. In this direction, the R package blockCV [ ] implements spatial and environmental blocking and spatial buffering. Spatial blocking forms clusters of data points according to spatial effects, and environmental blocking forms clusters using K-means [ ] on the covariates. Buffering is the same as our cross-validation with the groups formed only by spatial effects. This approach ensures that no test data is spatially next to any training data. This package is written mainly for species distribution modeling, while our approach is designed for a more general purpose. If we put our method in the context of species distribution modeling, our method provides buffers that are generated by a posterior Gaussian process considering all the relevant model effects, including spatial, temporal, and environmental factors. In other words, our method ensures that no testing data abuts training data in terms of the combination of spatial and environmental effects.
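The notion of spatial buffering described above can be sketched in a few lines. This is a generic illustration written for this section, not the blockCV implementation: it assumes planar coordinates, Euclidean distance and a single fixed buffer radius, and the function name is hypothetical.

```python
import math


def buffered_training_sets(coords, radius):
    """For each test point i, return the indices kept for training:
    all points strictly farther than `radius` from point i.

    coords is a list of (x, y) tuples. The test point itself is always
    excluded because its distance to itself is zero.
    """
    if radius < 0:
        raise ValueError("radius must be non-negative")
    training_sets = []
    for xi, yi in coords:
        kept = [
            j for j, (xj, yj) in enumerate(coords)
            if math.hypot(xi - xj, yi - yj) > radius
        ]
        training_sets.append(kept)
    return training_sets


if __name__ == "__main__":
    sites = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
    print(buffered_training_sets(sites, radius=1.5))
```

In the language of equation (1), the group $I_i$ is here the set of points within the buffer around site $i$; the approach in this paper replaces the purely geometric rule with one that reflects all model effects.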

1.5 Theoretical aspects

Cross-validation (CV), particularly LOOCV, is frequently considered an estimator of $E_{\tilde{y}}[\log \pi(\tilde{y} \mid \boldsymbol{y})]$ or $E_{\tilde{y}, \boldsymbol{y}}[\log \pi(\tilde{y} \mid \boldsymbol{y})]$. The first expectation describes the generalized predictive performance given a specific training set, while the second describes the generalized predictive performance averaged over different identically distributed training sets. These expectations can be evaluated when assuming the existence of the joint density $\pi_T(\tilde{y}, \boldsymbol{y})$ representing the true data generation process. Under the I.I.D. assumption and some regularity conditions on the model, the Bernstein-von Mises theorem states that $\log \pi(\tilde{y} \mid \boldsymbol{y})$ converges to a random variable that does not depend on $\boldsymbol{y}$. Consequently, $E_{\tilde{y}}[\log \pi(\tilde{y} \mid \boldsymbol{y})]$ and $E_{\tilde{y}, \boldsymbol{y}}[\log \pi(\tilde{y} \mid \boldsymbol{y})]$ become equivalent in the limit. If we further assume that $\tilde{y}$ is sampled from the same distribution as all the training data, LOOCV is an asymptotically unbiased estimator of these expectations. Commonly used information criteria, such as AIC [ ] and WAIC [ ], are asymptotically equivalent to LOOCV in I.I.D. cases. This type of analysis is prevalent in the literature with various settings [1, 19, 20, 21].

However, a similar analysis does not hold in non-I.I.D. cases in general. First, the existence of different prediction tasks means that neither the model prediction, $\pi(\tilde{y} \mid \boldsymbol{y})$, nor the true data generation process, $\pi_T(\tilde{y} \mid \boldsymbol{y})$, is uniquely defined, as discussed in Section 1.2. Second, the asymptotic scheme is not uniquely defined, even with a specific prediction task. For example, consider a temporal model where the data $\boldsymbol{y} = \{y_1, y_2, \ldots, y_n\}$ form a time series observed at times $t_1 < t_2 < \cdots < t_n$, and denote the last time point by $T$. Several meanings of $n \to \infty$ can be considered:

• $T \to \infty$ and $t_i - t_{i-1}$ is constant
• $t_i - t_{i-1} \to 0$ and $T$ is constant
• $t_i - t_{i-1} \to 0$ and $T \to \infty$ with $T(t_i - t_{i-1})$ fixed

These scenarios correspond to observing more future data and having higher sample rates within a time frame. As mentioned in Section 1.2, multilevel data can also have various asymptotic regimes. Third, if the data generation process is not stationary, the model will not converge under certain asymptotic regimes, which differentiates $E_{\tilde{y}}[\log \pi(\tilde{y} \mid \boldsymbol{y})]$ from $E_{\tilde{y}, \boldsymbol{y}}[\log \pi(\tilde{y} \mid \boldsymbol{y})]$ even asymptotically. These points highlight that the estimand of CV is not uniquely defined in non-I.I.D. cases, preventing the establishment of asymptotic analysis frameworks similar to those for I.I.D. data.
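The three time-series regimes listed above can be made concrete with equally spaced observation times $t_i = i\,\delta$, so that $T = n\,\delta$. The equal spacing, the regime labels and the function name are assumptions made for this sketch. Under that assumption the third regime, read literally as $T\,\delta = c$, gives $n\,\delta^2 = c$, hence $\delta = \sqrt{c/n}$ and $T = \sqrt{c\,n}$.

```python
import math


def time_grid(n, regime, delta=1.0, horizon=1.0, c=1.0):
    """Equally spaced observation times t_1 < ... < t_n for one regime.

    "more_future": spacing fixed at `delta`, last time T = n * delta grows.
    "infill":      last time fixed at `horizon`, spacing horizon / n shrinks.
    "mixed":       spacing shrinks and T grows with T * spacing = c fixed.
    """
    if regime == "more_future":
        step = delta
    elif regime == "infill":
        step = horizon / n
    elif regime == "mixed":
        step = math.sqrt(c / n)
    else:
        raise ValueError("unknown regime: " + regime)
    return [step * i for i in range(1, n + 1)]


if __name__ == "__main__":
    for regime in ("more_future", "infill", "mixed"):
        for n in (4, 16, 64):
            grid = time_grid(n, regime)
            print(regime, n, "T =", grid[-1], "spacing =", grid[1] - grid[0])
```

All three grids have $n \to \infty$, yet they describe different experiments, which is why the limit of a cross-validation criterion depends on the regime one has in mind.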

From the perspective of CV, it is also inappropriate to consider it an estimator, since each summand in CV should be viewed as a sample from a different distribution due to the relevance of data indexes in non-I.I.D. scenarios. For example, if we compute LOOCV in a time series, each $y_t$ is sampled from a different conditional distribution $\pi_T(y_t \mid \boldsymbol{y}_{-t})$ and thus
