LEAVE-GROUP-OUT CROSS-VALIDATION FOR LATENT
GAUSSIAN MODELS
Zhedong Liu
Statistics Program, Computer, Electrical and Mathematical Sciences and Engineering Division
King Abdullah University of Science and Technology (KAUST)
Kingdom of Saudi Arabia, Thuwal 23955-6900
zhedong.liu@kaust.edu.sa
Håvard Rue
Statistics Program, Computer, Electrical and Mathematical Sciences and Engineering Division
King Abdullah University of Science and Technology (KAUST)
Kingdom of Saudi Arabia, Thuwal 23955-6900
haavard.rue@kaust.edu.sa
September 4, 2024
ABSTRACT
Evaluating the predictive performance of a statistical model is commonly done using cross-validation.
Although the leave-one-out method is frequently employed, its application is justified primarily
for independent and identically distributed observations. However, this method tends to mimic
interpolation rather than prediction when dealing with dependent observations.
This paper proposes a modified cross-validation for dependent observations. This is achieved by
excluding an automatically determined set of observations from the training set to mimic a more
reasonable prediction scenario. Also, within the framework of latent Gaussian models, we illustrate a
method to adjust the joint posterior for this modified cross-validation to avoid model refitting. This
new approach is accessible in the R-INLA package (www.r-inla.org).
Keywords: Bayesian Cross-Validation; Latent Gaussian Models; R-INLA
1 Introduction
1.1 Background
Leave-one-out cross-validation (LOOCV) [1] stands as a popular method for evaluating a statistical model's predictive performance, selecting models, or estimating critical parameters of the model. The core concept of LOOCV is elegantly straightforward. Suppose we have data, $y = \{y_i\}$, for $i = 1, \ldots, n$, presumed to be Independent and Identically Distributed (I.I.D.) samples from the true distribution $\pi_T(y)$. Our objective is to determine how well a fitted model can predict a new observation, $\tilde{y}$, sampled from this true distribution. In the Bayesian context, we use the posterior predictive distribution $\pi(\tilde{y} \mid y)$ to predict $\tilde{y}$ sampled from $\pi_T(y)$. By using the logarithmic score [2], we compute $\mathrm{E}_{\tilde{y}}[\log \pi(\tilde{y} \mid y)]$ as a metric of prediction quality.
Owing to the lack of $\pi_T(y)$, directly computing the expectation becomes infeasible. Nonetheless, since each $y_i$ is an I.I.D. sample from $\pi_T(y)$, we can estimate this expectation by evaluating
$$ u_{\text{LOOCV}} = \frac{1}{n} \sum_{i=1}^{n} \log \pi(y_i \mid y_{-i}), $$
where $y_i$ is the testing point and $y_{-i}$ is the training set, consisting of all data except the $i$th observation.
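To make the estimator concrete, here is a minimal Python sketch of $u_{\text{LOOCV}}$ for a toy conjugate model, $y_i \sim N(\mu, \sigma^2)$ with prior $\mu \sim N(0, \tau^2)$, where the posterior predictive $\pi(y_i \mid y_{-i})$ is Gaussian in closed form. The model, function name, and default hyperparameters are illustrative assumptions, not part of the paper's method:

```python
import math

def loo_log_score(y, sigma2=1.0, tau2=10.0):
    """LOOCV estimate u_LOOCV = (1/n) * sum_i log pi(y_i | y_{-i})
    for the toy conjugate model y_i ~ N(mu, sigma2), mu ~ N(0, tau2)."""
    n = len(y)
    total = 0.0
    for i in range(n):
        train = [y[j] for j in range(n) if j != i]
        m = len(train)
        # posterior for mu given the training fold: N(mu_post, v_post)
        v_post = 1.0 / (1.0 / tau2 + m / sigma2)
        mu_post = v_post * sum(train) / sigma2
        # posterior predictive for the held-out point: N(mu_post, v_post + sigma2)
        v_pred = v_post + sigma2
        total += (-0.5 * math.log(2 * math.pi * v_pred)
                  - 0.5 * (y[i] - mu_post) ** 2 / v_pred)
    return total / n
```

For real LGMs the predictive density has no such closed form, which is exactly why the paper develops a posterior-adjustment scheme that avoids refitting.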
arXiv:2210.04482v5 [stat.CO] 2 Sep 2024
A PREPRINT - September 4, 2024
The informal interpretation of LOOCV is that it mimics "using $y$ to predict $\tilde{y}$" by "using $y_{-i}$ to predict $y_i$". This intuitive interpretation is then used to justify, often implicitly, the use of LOOCV as a "default" way to evaluate predictive performance.
However, issues arise when the I.I.D. assumption does not hold. We can have longitudinal data following each subject in a study [3], dependence due to time and/or space [4, 5], or hierarchical structure [6]. In those cases, LOOCV often gives an overly optimistic assessment of a model's predictive performance, since it no longer mimics a proper prediction task, i.e., the process by which new data are generated given observed data. This paper proposes an adaptation of LOOCV that is better aligned with measuring predictive performance when the I.I.D. assumption is invalid.
1.2 The prediction task
The critical observation is that the meaning of "prediction" is not clearly defined when $y$ are not I.I.D. samples from $\pi_T(y)$. The predictive distribution $\pi_T(\tilde{y} \mid y)$ lacks a unique definition in non-I.I.D. scenarios without a clear prediction task, i.e., a description of how we imagine a new future data point, $\tilde{y}$, being generated given the observed data $y$. This ambiguity extends to the act of "using $y$ to predict $\tilde{y}$", as it is uncertain what our target, $\tilde{y}$, represents. To illustrate these concepts, let us discuss some more concrete examples.
Time-series model
Assume the data $y = \{y_1, y_2, \ldots, y_T\}$ form a time series observed sequentially at times $1, 2, \ldots, T$. The inherent prediction task is to predict future values, given the temporal nature of the data. We can predict a new observation $k \geq 1$ steps ahead into the future by $\pi(y_{T+k} \mid y_1, \ldots, y_T)$.
In this example, LOOCV will be computed from
$$ \pi(y_t \mid y_1, \ldots, y_{t-1}, y_{t+1}, \ldots, y_T), \quad t = 1, \ldots, T, $$
which is often referred to as interpolation or imputation of missing values rather than prediction. However, the predictive performance of time series models is often assessed through leave-future-out cross-validation (LFOCV) [5]:
$$ \sum_{t=T_0}^{T-k} \log \pi(y_{t+k} \mid y_1, \ldots, y_t), $$
where $t$ starts from a time $T_0 > 1$, as we need some data to estimate the model.
The message from this example is that LOOCV, when applied to such models, is essentially evaluating interpolation
performance rather than predictive performance.
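As a sketch of the LFOCV sum above, consider an AR(1) process with known coefficient and innovation variance, so that the $k$-step-ahead predictive density is Gaussian in closed form. This plug-in setting (known parameters, no posterior) and the function name are our own illustrative assumptions, not the approximation of [5]:

```python
import math

def lfo_log_score(y, phi=0.8, sigma2=1.0, k=1, t0=3):
    """LFOCV score  sum_{t=t0}^{T-k} log pi(y_{t+k} | y_1..y_t)
    for an AR(1) with known phi and innovation variance sigma2.
    Given y_t, the k-step-ahead predictive is
    N(phi^k * y_t, sigma2 * (1 - phi^(2k)) / (1 - phi^2))."""
    total = 0.0
    T = len(y)
    for t in range(t0, T - k + 1):       # t = 1-based index of last observed point
        mean = (phi ** k) * y[t - 1]     # conditional mean given y_t
        var = sigma2 * (1 - phi ** (2 * k)) / (1 - phi ** 2)
        total += (-0.5 * math.log(2 * math.pi * var)
                  - 0.5 * (y[t + k - 1] - mean) ** 2 / var)
    return total
```

In practice the model parameters would themselves be estimated from $y_1, \ldots, y_t$ at each step, which is what makes naive LFOCV expensive.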
We acknowledge two issues. First, the distinction between interpolation and prediction is not always clear-cut, and the two concepts overlap. For example, a one-step-ahead forecast leans more towards interpolation than a two-step-ahead prediction does, but less so than a missing-value imputation. However, this does not deter our discussion. Second, while an ideal model succeeds in all prediction tasks, real-world scenarios demand that we settle on a definition of the "best fit". Consequently, our choice of evaluation should align with our specific objectives.
Multilevel model
Figure 1 illustrates an example of a multilevel model. Consider observations of student grades or performance. This
data exhibits a hierarchical structure: students belong to classes, classes reside within schools, and schools are nested
within regions. This hierarchical arrangement is significant because it introduces effects attributed to the class, school,
and region levels, substantially deviating from the I.I.D. assumption.
Given such a model, the prediction task becomes ambiguous. Are we aiming to predict the performance of an unobserved student from an observed class? Or are we trying to predict the performance of an unobserved student in an unobserved class, school, or even region? This difficulty mirrors the challenges in defining asymptotic regimes for these models: as students, classes, schools, and regions can grow indefinitely in various ways, it is unclear whether any one of these choices is the most reasonable.
To properly evaluate predictive performance within this context, users must first explicitly define their prediction task and then assess the model in line with this definition. It is worth noting that applying LOOCV would imply predicting individual students within observed classes; in our view, this mimics interpolation more than prediction.
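The "predict a student from an unseen class" task can be encoded as index groups: for each observation, the group collects all observations sharing its class, and the whole group is held out. A minimal sketch, with 0-based indices and a function name of our own choosing:

```python
def leave_class_out_groups(class_of):
    """For each observation i, build the group I_i of all observations
    sharing i's class, so that predicting y_i from y_{-I_i} mimics
    predicting a student in an unseen class.
    class_of maps observation index -> class label."""
    groups = {}
    for i, c in enumerate(class_of):
        groups[i] = [j for j, cj in enumerate(class_of) if cj == c]
    return groups

# class memberships matching Figure 1 (y1..y18 as indices 0..17)
class_of = (["c1"] * 3 + ["c2"] * 2 + ["c3"] * 2 + ["c4"] * 2
            + ["c5"] * 2 + ["c6"] * 4 + ["c7"] + ["c8"] * 2)
groups = leave_class_out_groups(class_of)
```

For example, `groups[0]` is `[0, 1, 2]`: to predict $y_1$, all of class $c_1$ is removed from the training set.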
Figure 1: A nested multilevel model, with observations $y_1, \ldots, y_{18}$ nested in classes $c_1, \ldots, c_8$, which are nested in schools $s_1, \ldots, s_4$, which are nested in regions $r_1$ and $r_2$.
1.3 Improving LOOCV for non-I.I.D. models
Our discussions illuminate an important insight: when dealing with non-I.I.D. data, the prediction task implicitly defined
through LOOCV may be less appropriate, as it leans more towards assessing interpolation qualities than predictive
performance. This prompts the question: What is a suitable approach moving forward?
One observation is the absence of a "one size fits all" solution. Each model may possess a natural prediction task, or several, based on its intended application. Thus, for a specific assessment of predictive performance, we need to define these prediction tasks explicitly. One can then evaluate the corresponding predictive performance metrics using our proposed leave-group-out cross-validation (LGOCV):
$$ u_{\text{LGOCV}} = \frac{1}{n} \sum_{i=1}^{n} \log \pi(y_i \mid y_{-I_i}). \qquad (1) $$
Here, the group (denoted by $I_i$) is an index set including $i$, and $y_{-I_i}$ is the data subset excluding the data indexed by $I_i$. This configuration lets the pair $(y_i, y_{-I_i})$ mimic a specified prediction task. In a multilevel model, as depicted in Figure 1, predicting a student's grade from an unseen class necessitates that $I_i$ include $i$ and all observations from student $i$'s class. However, more complex models, such as models containing both time series and hierarchical elements, pose challenges when defining a natural prediction task. Therefore, except in simple cases, LOOCV is often applied for its simplicity, even though it leans more towards interpolation.
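Given a set of groups, evaluating (1) naively just means holding out $y_{I_i}$ instead of the single point $y_i$. The sketch below does this for the same toy conjugate Normal model used earlier ($y_i \sim N(\mu, \sigma^2)$, $\mu \sim N(0, \tau^2)$); the model and names are illustrative assumptions, not the paper's INLA-based computation:

```python
import math

def lgo_log_score(y, groups, sigma2=1.0, tau2=10.0):
    """u_LGOCV = (1/n) * sum_i log pi(y_i | y_{-I_i}) for the toy model
    y_i ~ N(mu, sigma2), mu ~ N(0, tau2). groups[i] is the index set
    I_i (containing i) removed from the training set for testing point i."""
    n = len(y)
    total = 0.0
    for i in range(n):
        held_out = set(groups[i])
        train = [y[j] for j in range(n) if j not in held_out]
        m = len(train)
        v_post = 1.0 / (1.0 / tau2 + m / sigma2)
        mu_post = v_post * sum(train) / sigma2
        v_pred = v_post + sigma2          # posterior predictive variance
        total += (-0.5 * math.log(2 * math.pi * v_pred)
                  - 0.5 * (y[i] - mu_post) ** 2 / v_pred)
    return total / n
```

With singleton groups `groups[i] = [i]` this reduces to LOOCV; larger groups remove more informative neighbours and so pose a harder prediction task.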
Our primary contribution is an automatic LGOCV that adapts to the specified model when prediction tasks are not defined. Our approach automatically constructs a group, $I_i$, for each $i$. Though we will delve into automatically defining $I_i$ later, an initial understanding is that $I_i$ comprises the data points most informative for predicting the testing point, $y_i$. This set ensures that our LGOCV focuses less on interpolation and more on prediction than LOOCV does. In other words, LGOCV tests the model on more difficult prediction tasks.
The rationale behind LGOCV is to create a more reasonable criterion than LOOCV when the user has difficulty defining manual groups in non-I.I.D. models. In various practical examples, we will show how this automatic procedure produces reasonable groups. Many advanced spatial examples applying this method are also shown in the published paper [7].
For a simple time-series example, our new approach will correspond to evaluating
$$ \pi(y_t \mid y_1, \ldots, y_{t-k}, y_{t+k}, \ldots, y_T), $$
for fixed $k > 1$. This corresponds to removing a sequence of data of length $2k - 1$ to predict the central one. As we see, this is neither pure interpolation nor pure prediction. Our interpretation is that it is less interpolation, or more prediction, than what LOOCV provides when $k > 1$.
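The symmetric leave-out window just described can be written as a small index-construction helper. A sketch with 0-based indexing, truncated at the series boundaries; the function name is our own:

```python
def symmetric_window_groups(T, k):
    """For a time series of length T, build for each t the group I_t
    containing t and its k-1 neighbours on each side, so that y_t is
    predicted from y_1..y_{t-k}, y_{t+k}..y_T (a removed block of
    length 2k-1 in the interior, shorter at the edges)."""
    groups = {}
    for t in range(T):
        groups[t] = list(range(max(0, t - k + 1), min(T, t + k)))
    return groups
```

With `k=1` every group is the singleton `{t}` and the scheme reduces to LOOCV; increasing `k` removes a longer block around the testing point.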
There are two key challenges to address to make our proposal practical. First, we must quantify the information contributed by one data point in predicting another; this is crucial for automatic group construction. Second, we face the computational task of evaluating $u_{\text{LGOCV}}$ given a set of groups. The naive computation of LGOCV, by fitting models across all potential training sets and evaluating their utility against the corresponding testing points, is computationally infeasible, especially given the resource-demanding nature of modern statistical models. However, these challenges can be handled elegantly within the framework of latent Gaussian models (LGMs) combined with integrated nested Laplace approximation (INLA) inference, as detailed in [8, 9, 10, 11]. Throughout this paper, we will assume that our model is an LGM. We will discuss how to integrate the automatic group construction and the fast computation of $u_{\text{LGOCV}}$ into the LGMs-INLA framework. Notably, our proposed methodology has been incorporated into the R-INLA package (www.r-inla.org), extending its applicability across all LGMs supported by R-INLA.
1.4 Related studies
The issues of applying LOOCV to non-I.I.D. models have been recognized and addressed by numerous researchers. To name a few: [4] advocate block cross-validation, partitioning ecological data based on inherent patterns, an approach that should be adopted when the prediction task is not simply interpolation. [12] offers a modification to LOOCV ensuring an unbiased measure of predictive performance given the correlation between new and observed data, where unbiasedness holds when both the observed and new data are treated as random; we should note that an assumed prediction task determines the correlation between new and observed data. [5] presents an efficient approximation for LFOCV, which is used when the prediction task is forecasting in time series. [13] considers a multilevel model and demonstrates that marginal WAIC is akin to LOOCV, while conditional WAIC aligns with LGOCV in which a hierarchical level, such as a school, defines the groups; the choice of criterion depends on one's assumption about the new data, i.e., the prediction task. Additionally, [14] recommends h-block cross-validation to estimate one-step-ahead performance in stationary processes, which is similar to applying our approach to a stationary model.
These studies assume that model users can judiciously select an appropriate evaluation method according to their understanding of their prediction tasks. Yet, in practice, many users, despite knowing LOOCV's limitations, still resort to LOOCV or randomized K-fold cross-validation, primarily for user-friendliness. This reality motivates our pursuit of a method that combines user-friendliness with a better estimate of predictive performance. To this end, the R package blockCV [15] implements spatial and environmental blocking and spatial buffering. Spatial blocking forms clusters of data points according to spatial effects, and environmental blocking forms clusters using K-means [16] on the covariates. Buffering is the same as our cross-validation with groups formed only by spatial effects; this ensures that no test data is spatially adjacent to any training data. That package is written mainly for species distribution modeling, while our approach is designed for a more general purpose. If we put our method in the context of species distribution modeling, our method provides buffers generated by a posterior Gaussian process that considers all the relevant model effects, including spatial, temporal, and environmental factors. In other words, our method ensures that no testing data abuts training data in terms of the combination of spatial and environmental effects.
1.5 Theoretical aspects
Cross-validation (CV), particularly LOOCV, is frequently considered an estimator of $\mathrm{E}_{\tilde{y}}[\log \pi(\tilde{y} \mid y)]$ or $\mathrm{E}_{\tilde{y}, y}[\log \pi(\tilde{y} \mid y)]$. The first expectation describes the generalized predictive performance given a specific training set, while the second describes the generalized predictive performance averaged over different identically distributed training sets. These expectations can be evaluated when assuming the existence of the joint density $\pi_T(\tilde{y}, y)$ representing the true data generation process. Under the I.I.D. assumption and some regularity conditions on the model, the Bernstein-von Mises theorem implies that $\log \pi(\tilde{y} \mid y)$ converges to a random variable that does not depend on $y$. Consequently, $\mathrm{E}_{\tilde{y}}[\log \pi(\tilde{y} \mid y)]$ and $\mathrm{E}_{\tilde{y}, y}[\log \pi(\tilde{y} \mid y)]$ become equivalent in the limit. If we further assume that $\tilde{y}$ is sampled from the same distribution as the training data, LOOCV is an asymptotically unbiased estimator of these expectations. Commonly used information criteria, such as AIC [17] and WAIC [18], are asymptotically equivalent to LOOCV in I.I.D. cases. This type of analysis is prevalent in the literature with various settings [1, 19, 20, 21].
However, a similar analysis does not hold in non-I.I.D. cases in general. First, the existence of different prediction tasks means that both the model prediction, $\pi(\tilde{y} \mid y)$, and the true data generation process, $\pi_T(\tilde{y} \mid y)$, are not uniquely defined, as discussed in Section 1.2. Second, the asymptotic scheme is not uniquely defined, even with a specific prediction task. For example, consider a temporal model where the data $y = \{y_1, y_2, \ldots, y_n\}$ form a time series observed at times $t_1 < t_2 < \cdots < t_n$, and denote the last time point by $T$. Several meanings of $n \to \infty$ can be considered:
- $T \to \infty$ and $t_i - t_{i-1}$ is a constant;
- $t_i - t_{i-1} \to 0$ and $T$ is a constant;
- $t_i - t_{i-1} \to 0$ and $T \to \infty$ with $T(t_i - t_{i-1})$ fixed.
These scenarios correspond to observing more future data and to having higher sampling rates within a time frame. As mentioned in Section 1.2, multilevel data can also have various asymptotic regimes. Third, if the data generation process is not stationary, the model will not converge under certain asymptotic regimes, which differentiates $\mathrm{E}_{\tilde{y}}[\log \pi(\tilde{y} \mid y)]$ from $\mathrm{E}_{\tilde{y}, y}[\log \pi(\tilde{y} \mid y)]$ even in asymptotic scenarios. These points highlight that the estimand of CV is not uniquely defined in non-I.I.D. cases, preventing the establishment of asymptotic analysis frameworks similar to the I.I.D. analysis.
From the perspective of CV, it is also inappropriate to consider it an estimator, since each summand in CV should be viewed as a sample from a different distribution due to the relevance of data indexes in non-I.I.D. scenarios. For example, if we compute LOOCV on a time series, each $y_t$ is sampled from a different conditional distribution $\pi_T(y_t \mid y_{-t})$ and thus