A PREPRINT - SEPTEMBER 4, 2024
1.4 Related studies
The issues of applying LOOCV to non-I.I.D. models have been recognized and addressed by numerous researchers.
To name a few, [4] advocates block cross-validation, partitioning ecological data based on inherent patterns, but this
approach should be adopted when the prediction task is not simply interpolation. [12] offers a modification to LOOCV
that ensures an unbiased measure of predictive performance given the correlation between new and observed data, where
unbiasedness is understood with respect to the randomness of both the observed data and the new data. We should note
that the assumed prediction task determines the correlation between new and observed data. [5] presents an efficient
approximation for LFOCV, which is used when the prediction task is forecasting in time series. [13] considers a
multilevel model and demonstrates that marginal WAIC is akin to LOOCV. In contrast, conditional WAIC aligns with
LGOCV, where a hierarchical level, such as a school, defines the groups. The choice of criterion depends on one's
assumption about the new data, i.e., the prediction task. Additionally, [14] recommends h-block cross-validation to
estimate one-step-ahead performance in stationary processes, which is similar to applying our approach to a stationary model.
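The schemes mentioned above differ mainly in which observations are withheld from training when a given point is predicted. The following is a minimal sketch of that difference for a time-ordered series of length n, not an implementation taken from any of the cited works. The function names are hypothetical, and the h-block rule used here (drop the h neighbours on each side of the test point) is the usual reading of h-block cross-validation, stated as an assumption.

```python
def loo_splits(n):
    """Leave-one-out: test on i, train on every other index."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]


def h_block_splits(n, h):
    """h-block CV (assumed rule): test on i, remove i and its h neighbours
    on each side from the training set."""
    for i in range(n):
        yield [j for j in range(n) if abs(j - i) > h], [i]


def lfo_splits(n, min_train):
    """Leave-future-out: test on i, train only on observations before i."""
    for i in range(min_train, n):
        yield list(range(i)), [i]


if __name__ == "__main__":
    # Show which indices remain in training for each test point.
    for train, test in h_block_splits(n=7, h=2):
        print("test", test, "train", train)
```

With h = 0 the h-block splits coincide with LOOCV, which makes explicit that the methods differ only in how much of the neighbourhood of the test point is treated as unavailable.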
These studies assume that model users can select an appropriate evaluation method judiciously according to their
understanding of their prediction tasks. Yet, in practice, many users, despite knowing LOOCV’s limitations, still resort
to LOOCV or randomized K-fold cross-validation, primarily for user-friendliness. This reality motivates our pursuit of
a method that combines user-friendliness with better estimation of predictive performance. In a similar spirit, the R
package blockCV [15] implements spatial and environmental blocking and spatial buffering. Spatial blocking forms
clusters of data points according to spatial effects, and environmental blocking forms clusters using K-means [16] on
the covariates.
Buffering is the same as our cross-validation with the groups formed only by spatial effects. This approach ensures
that no test data is spatially next to any training data. This package is written mainly for species distribution modeling,
while our approach is designed for a more general purpose. If we put our method in the context of species distribution
modeling, our method provides buffers that are generated by a posterior Gaussian process considering all the relevant
model effects, including spatial, temporal, and environmental factors. In other words, our method ensures that no testing
data abuts training data in terms of the combination of spatial and environmental effects.
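To make the idea of buffering concrete, the sketch below builds leave-one-out splits in which every training point closer than a chosen radius to the test point is discarded. It is an illustration only: it is written in Python rather than R, it does not reproduce the blockCV interface, and the function name, the Euclidean distance, and the fixed `radius` parameter are all assumptions made here. In particular, it uses a purely spatial buffer, not the posterior-based groups of our method.

```python
import math


def buffered_loo_splits(coords, radius):
    """Yield (train_indices, test_index) with a spatial buffer.

    coords: list of (x, y) locations.
    radius: buffer distance, in the same units as coords.
    Points within `radius` of the test point are left out of training.
    """
    for i, p in enumerate(coords):
        train = [j for j, q in enumerate(coords)
                 if j != i and math.dist(p, q) > radius]
        yield train, i


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0), (3.0, 4.0)]
    for train, test in buffered_loo_splits(pts, radius=1.0):
        print("test", test, "train", train)
```

Replacing the Euclidean distance with a measure of dependence that also reflects temporal and environmental effects is the conceptual step that separates a purely spatial buffer from the groups described above.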
1.5 Theoretical aspects
Cross-validation (CV), particularly LOOCV, is frequently regarded as an estimator of $E_{\tilde{y}}[\log \pi(\tilde{y} \mid y)]$
or $E_{\tilde{y}, y}[\log \pi(\tilde{y} \mid y)]$. The first expectation describes the generalized predictive performance
given a specific training set, while the second describes the generalized predictive performance averaged over different
identically distributed training sets. These expectations can be evaluated when assuming the existence of the joint
density $\pi_T(\tilde{y}, y)$, representing the true data generation process. Under the I.I.D. assumption and some
regularity conditions on the model, the Bernstein-von Mises theorem implies that $\log \pi(\tilde{y} \mid y)$ converges to
a random variable that does not depend on $y$. Consequently, $E_{\tilde{y}}[\log \pi(\tilde{y} \mid y)]$ and
$E_{\tilde{y}, y}[\log \pi(\tilde{y} \mid y)]$ become equivalent in the limit. If we further assume that $\tilde{y}$ is
sampled from the same distribution as all the training data, LOOCV is an asymptotically unbiased estimator of the
expectations. Commonly used information criteria, such as AIC [17] and WAIC [18], are asymptotically equivalent to
LOOCV in I.I.D. cases. This type of analysis is prevalent in the literature with various settings [1, 19, 20, 21].
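As a small worked instance of the I.I.D. case, the sketch below computes the LOOCV log score, taken here as the average of $\log \pi(y_i \mid y_{-i})$ over the observations, for a deliberately simple model. The model is an assumption chosen for illustration and is not one used in this paper: $y_i \sim N(\mu, \sigma^2)$ with known $\sigma$ and a flat prior on $\mu$, for which the leave-one-out predictive distribution is available in closed form. The function name is hypothetical.

```python
import math


def loo_log_score(y, sigma=1.0):
    """Average leave-one-out log predictive density for y_i ~ N(mu, sigma^2).

    Assumes known sigma and a flat prior on mu, so that
    y_i | y_{-i} ~ N(mean(y_{-i}), sigma^2 * (1 + 1/(n - 1))).
    """
    n = len(y)
    if n < 2:
        raise ValueError("need at least two observations")
    total = sum(y)
    pred_var = sigma ** 2 * (1.0 + 1.0 / (n - 1))
    score = 0.0
    for yi in y:
        pred_mean = (total - yi) / (n - 1)  # posterior mean without y_i
        score += (-0.5 * math.log(2.0 * math.pi * pred_var)
                  - (yi - pred_mean) ** 2 / (2.0 * pred_var))
    return score / n


if __name__ == "__main__":
    print(loo_log_score([0.3, -1.2, 0.8, 1.9, -0.4]))
```

Every summand here has the same distribution because the observations are exchangeable, which is what allows the average to be read as an estimate of a single expectation.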
However, a similar analysis does not hold in non-I.I.D. cases in general. Firstly, the existence of different prediction
tasks means that both the model prediction, $\pi(\tilde{y} \mid y)$, and the true data generation process,
$\pi_T(\tilde{y} \mid y)$, are not uniquely defined, as discussed in section 1.2. Secondly, the asymptotic scheme is not
uniquely defined, even with a specific prediction task. For example, consider a temporal model where the data
$y = \{y_1, y_2, \ldots, y_n\}$ form a time series observed at times $t_1 < t_2 < \cdots < t_n$, and denote the last time
step by $T$. Several meanings of $n \to \infty$ can be considered:
• $T \to \infty$ and $t_i - t_{i-1}$ is a constant
• $t_i - t_{i-1} \to 0$ and $T$ is a constant
• $t_i - t_{i-1} \to 0$ and $T \to \infty$ with $T(t_i - t_{i-1})$ fixed
These scenarios correspond to observing more future data, sampling at a higher rate within a fixed time frame, or both.
As mentioned in section 1.2, multilevel data can also have various asymptotic regimes. Thirdly, if the data
generation process is not stationary, the model will not converge under certain asymptotic regimes, which differentiates
$E_{\tilde{y}}[\log \pi(\tilde{y} \mid y)]$ from $E_{\tilde{y}, y}[\log \pi(\tilde{y} \mid y)]$ even in asymptotic
scenarios. These points highlight that the estimand of CV is
not uniquely defined in non-I.I.D. cases, preventing the establishment of asymptotic analysis frameworks similar to
I.I.D. analysis.
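The three meanings of n → ∞ listed above for the time-series example can be made concrete by writing out the observation times they produce. The sketch below assumes equally spaced times t_i = i × spacing, so that T = n × spacing. The function name, the regime labels, and the default constants are hypothetical choices made for illustration.

```python
def time_grid(n, regime, delta=1.0, horizon=10.0, c=1.0):
    """Return (times, T, spacing) for n equally spaced observations.

    regime "more_future": spacing fixed at `delta`, so T = n * delta grows.
    regime "infill":      T fixed at `horizon`, so spacing = horizon / n shrinks.
    regime "mixed":       T * spacing fixed at `c`; T grows, spacing shrinks.
    """
    if regime == "more_future":
        spacing = delta
    elif regime == "infill":
        spacing = horizon / n
    elif regime == "mixed":
        # T = n * spacing and T * spacing = c  =>  spacing = sqrt(c / n)
        spacing = (c / n) ** 0.5
    else:
        raise ValueError(f"unknown regime: {regime}")
    times = [(i + 1) * spacing for i in range(n)]
    return times, times[-1], spacing


if __name__ == "__main__":
    for regime in ("more_future", "infill", "mixed"):
        for n in (10, 100, 1000):
            _, T, spacing = time_grid(n, regime)
            print(f"{regime:12s} n={n:5d} T={T:8.3f} spacing={spacing:.4f}")
```

The same n therefore corresponds to very different data sets depending on the regime, which is why a limit statement about CV has to say which regime it refers to.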
From the perspective of CV, it is also inappropriate to consider it an estimator, since each summand in CV should be
viewed as a sample from a different distribution due to the relevance of data indices in non-I.I.D. scenarios. For
example, if we compute LOOCV in a time series, each $y_t$ is sampled from a different conditional distribution
$\pi_T(y_t \mid y_{-t})$ and thus