Exact Bayesian Inference for Geostatistical Models under Preferential Sampling Douglas Mateus da Silva1 Dani Gamerman2

2025-05-06 0 0 1.59MB 24 页 10玖币
侵权投诉
Exact Bayesian Inference for Geostatistical Models
under Preferential Sampling
Douglas Mateus da Silva1, Dani Gamerman2
Abstract
Preferential sampling is a common feature in geostatistics and occurs when the loca-
tions to be sampled are chosen based on information about the phenomena under study.
In this case, point pattern models are commonly used as the probability law for the
distribution of the locations. However, analytic intractability of the point process like-
lihood prevents its direct calculation. Many Bayesian (and non-Bayesian) approaches
in non-parametric model specifications handle this difficulty with approximation-based
methods. These approximations involve errors that are difficult to quantify and can
lead to biased inference. This paper presents an approach for performing exact Bayesian
inference for this setting without the need for model approximation. A qualitatively
minor change on the traditional model is proposed to circumvent the likelihood in-
tractability. This change enables the use of an augmented model strategy. Recent
work on Bayesian inference for point pattern models can be adapted to the geostatis-
tics setting and renders computational tractability for exact inference for the proposed
methodology. Estimation of model parameters and prediction of the response at un-
sampled locations can then be obtained from the joint posterior distribution of the
augmented model. Simulated studies showed good quality of the proposed model for
estimation and prediction in a variety of preferentiality scenarios. The performance
of our approach is illustrated in the analysis of real datasets and compares favourably
against approximation-based approaches. The paper is concluded with comments re-
garding extensions of and improvements to the proposed methodology.
Keywords: Bayesian inference; Data augmentation; Geostatistics; Point process; Pre-
diction; Preferential sampling.
1 Introduction
Preferential sampling (PS) occurs when sampled locations are more likely to be chosen than
others according to mechanisms that depend on the data generating process. In geostatistics,
it is common to ignore the information of the sampling design in the inference step. This is
not a problem in a non preferential sampling (NPS) context, where the sampling design is
stochastically independent of the spatial process, but the use of the traditional model will
provide biased inference and prediction when PS is present.
The NPS model has two main components: a Gaussian process that models the correlated
structure of data and a observational specification based on a random observational noise
added to the underlying Gaussian process. No information about the process that generated
1Universidade Federal de Minas Gerais,
Av. Anonio Carlos 6627, Pampulha, Belo Horizonte, MG 31270-010, Brazil
E-mail: douglas est@yahoo.com.br
2Universidade Federal do Rio de Janeiro.
1
arXiv:2210.14411v1 [stat.ME] 26 Oct 2022
the pattern of the locations (other than their coordinates) is used, since this process is
independent of the data process. When PS is present, the stochastic dependence of the point
and the Gaussian processes must be both taken into account.
In this way, Diggle et al. (2010) proposed a class of models for dealing with PS in geo-
statistics. A component to explicitly describe the sampling design was included in the model
in addition to the latent Gaussian process and the observational error process of the tra-
ditional model. This component consisted on a Cox process, ie, a spatial Poisson process
with intensity based on the same Gaussian process that drives the data. The authors showed
that the addition of this component to the model corrects the resulting inference. Many
papers have been conducted after and based on their study. For example, Pati et al. (2011)
presented theoretical results of the model in a Bayesian framework; Gelfand et al. (2012)
analyzed the effect of PS on parameter estimation and prediction, also using a Bayesian
approach; Shaddick and Zidek (2014) extended the model to the spatial-temporal context;
Ferreira and Gamerman (2015) studied the effect of PS on the optimal design; Dinsdale and
Salibian-Barrera (2019) proposed to use Laplace approximation for the likelihood function
in a classic approach instead of the Monte Carlo approximation used by Diggle et al. (2010).
The work cited above had to deal with the likelihood function of the inhomogeneous
Poisson process. It has no closed form and depends on a integral of the unknown intensity
function over the entire continuous domain of the study region. Approximated methods were
considered, based on discretizing the region under study (Møller and Waagepetersen, 2003).
Usually in this approximation, a regular grid over the region is used and the real locations
of data are approximated by the centroids of the cell that contains the sampled points.
Accordingly, the Gaussian process is approximated by a multivariate Gaussian vector over
the grid. Another option for the discretization is via approximation of the integral in the
likelihood via quadrature rules (Butler and Moffitt, 1982; Karvonen and S¨arkk¨a, 2017).
Discretization is a relatively easy solution for the problem but can generate errors that are
difficult to quantify and lead to biased estimation (Simpson et al., 2016). According to these
authors, two main sources of errors are present in the discrete approach: the approximation
of the Gaussian process in a discrete domain and the approximation of the real locations by
the discrete ones. The last is the dominant source of error in lattice approximation because
of the loss of the real information about the locations.
The likelihood function of the Poisson process is the component of the geostatistical model
under PS that requires approximated methods. Its intractability impacts the inference and,
as consequence, the capability of prediction with the model. Predicting the response variable
at unobserved locations is usually the main goal in geostatistics (Diggle and Ribeiro, 2007).
Thus, results are also compromised if the path to obtain prediction is compromised. This
situation motivates the development of a methodology free from discretization errors.
Gon¸calves and Gamerman (2018) proposed a methodology for performing exact inference
for Cox processes where the intractability of the likelihood function was avoided. This was
possible due to an augmentation data approach in which the inhomogeneous Poisson process
was embedded as a part of a larger homogeneous Poisson process. As a result, the augmented
likelihood function obtained in combination with a bounded link for the intensity leads to
a completely tractable model. No approximation was used in their approach, leading to a
better and correct inference for Cox processes.
Following this idea, a methodology to perform exact inference in geoestatistics under
2
PS is proposed in this paper, by an adaptation of their results to the geostatistics setting.
The proposed method involves the additional restriction of bounded intensity function of the
point process. This is the only change required to the traditional PS model in geostatistics.
This modification seems not only reasonable in many practical situations but also of minor
relevance since most studies are carried out over a bounded region. Therefore, the intensity
of interest is bounded to this study region and almost inevitably becomes bounded there
even in the potentially unbounded intensity scenario of the traditional approach.
The approach leads to an easily computed likelihood function and requires no approxi-
mation of the model, either by discretization or by any other form of approximation. In this
way, the methodology provides correct results for inference and prediction. Correct informa-
tion about the sampled locations is used instead of approximated ones, potentially leading to
better estimates of all model parameters. The improvement is also obtained for estimation
of the dependence between locations since they depend on the actual sampled points. Also,
the methodology works with a continuous spatial domain instead of a discretized one. A
study of simulation in PS and NPS contexts will also be presented to compare results from
the proposed model and the traditional model and to provide performance assessments.
Thus, the novelty of our contribution lies in: a) proposing a change in the intensity speci-
fication and showing how this change allows for exact model specification in the geostatistics
setting, and 2) adapting the inference performed by Gon¸calves and Gamerman (2018) over
point processes for performing inference over the geostatistics setting under PS. The paper
is organized as follows: Section 2 discusses background about the geostatistical model and
presents the proposed novel methodology to the PS approach. A simulation study and their
main findings are presented in Section 3. Section 4 illustrates the methodology in two real
datasets. Finally, Section 5 presents our discussion and final remarks.
2 Context
This section presents a brief review of the geostatistical model under preferential sampling
and shows the main difficulty with the existing methodology. Then, the proposed model to
make exact inference in the preferential sampling context is presented.
2.1 Preliminaries
Let Bbe a compact region in Rm,m1, the region of study. A point xin Bis denoted by
x= (b1, b2, ..., bm). The traditional geostatistical model is defined by
Yi=µ(xi) + S(xi) + i, i = 1, ...n, (1)
µ(xi) = η0+η1d1(xi) + ... +ηpdp(xi),
where YiY(xi) is the observed response at location xi,µ(xi) is the deterministic component
that represents the expected value of Yi, (η0, η1, ..., ηp) and (d1(.), ..., dp(.)) are regression coef-
ficients and their respective covariates, S(.) is a stationary Gaussian process with zero mean,
constant variance σ2and isotropic correlation function ρ(x, x0) = ρφ(h), where h=|xx0|
is the Euclidean distance between xand x0. This process is denoted GP (0, σ2, ρφ) hereafter.
The observation errors i’s are independent of Sand usually assumed to be independent
3
Gaussian variables with zero mean and variance τ2. The model without covariates is a
special case of equation (1), where µ(x) = η0.
Let Xbe the spatial point process that models the sampled locations. In the non-
preferential context, Sis independent of X(denoted by [X|S] = [X]) and the distribution of
Xis irrelevant for inference about Sand hence for prediction. In the preferential sampling
context, [X|S]6= [X] and ignoring the dependence between Xand Scan lead to biased
inference (Diggle et al., 2010). In this way, the complete distribution of S, X and Y must be
informed. Usually the process X|Sis modeled by a Poisson process with intensity function
λ(x) = g(S(x)),(2)
where gis a monotonic function. Its likelihood function is given by
f(X|S)exp ZB
λ(ξ)"n
Y
i=1
λ(xi)#.(3)
The main problem of dealing with the Poisson process is that equation (3) is intractable for
intensity (2) when Sis unknown. Approximations are required to evaluate it. Such approx-
imations are many times obtained by discretizing the region under study in a grid, in which
the intensity is assumed to be constant within each grid cell (Møller and Waagepetersen,
2003). Diggle et al. (2010) used a log-Gaussian Cox Process (LGCP) to model X, with the
link function in equation (2) given by g(u) = exp{α+βu}. This choice also leads to an
intractable likelihood function.
Hereafter, methods with no model approximations are referred to as exact and the exact
approach to solve the likelihood intractability is proposed in the next section.
2.2 Model specification
The model in this paper retains the basic geostatistical model (1) for the measurements Y
and also makes use of a Poisson process with intensity depending on the underlying process
Sto describe the preferentiality of the locations X. However, the link for the relation of
the intensity of Xwith the latent process Sis rather a bounded function. This change is
fundamental for achieving the exactness of our specification.
Hence, the following model is assumed
Y|X, S Nn(Dη +S(x), τ2In),(4)
X|S, λ P P (λ),(5)
λ(x) = λF(βS(x)),(6)
S|σ2, φ GP (0, σ2, ρφ),(7)
θπ(θ), θ = (λ, η, τ2, σ2, φ, β),(8)
where PP means Poisson process, Inis the identity matrix of order n,ρφis the correlation
function indexed by parameter φ,Fis a monotonic function restricted to the unit interval
(e.g. a continuous distribution function) and λ= sup{λ(x)}. The probit link Φ is used for
Fin the sequel but any other bounded function, such as the logistic distribution function,
4
could be used in (6). These two choices are similar but the probit function is chosen because
it is particularly useful for computational purposes.
The following augmented model approach can be used now due to the boundedness of
the intensity. Let Wbe a homogeneous Poisson process with intensity λand suppose that
Xis obtained by applying Poisson thinning in W. Thus, the Wprocess can be split in the
Xand ˜
Xprocess according to the thinning mechanism described below
X={(x, zx) : xW, zxBernoulli(Φ(βS(x))), zx= 1},
˜
X={(˜x, z˜x) : ˜xW, z˜xBernoulli(Φ(βS(˜x))), z˜x= 0}.
The number of points of Wis denoted by k, of Xby nand of ˜
Xby kn.˜
Xrepresents the
process of the discarded points of Wafter thinning. The Xand ˜
Xprocesses are conditionally
independent given the parameters and ˜
X|˜
λP P (˜
λ), with ˜
λ=λ[1 Φ(βS(˜x))] =
λΦ(βS(˜x)).
The likelihood function of the augmented model is given by
l(S, θ;y, x, w) = π(y, x, w|S, θ) = π(y|S, x, θ)π(x, w|S, θ),(9)
in which wand xare the observed locations of Wand X, respectively, and yis the observed
values of Y. Note that Wis partially observed with Xbeing its observed part and the
remainder ˜
Xis missing data in the above specification.
The first term of right side of equation (9) is given by
π(y|S, x, θ) = 1
2πτ 2n
2
exp 1
2τ2(yDη Sn)0(yDη Sn).(10)
Thus, this term depends on S= (Sn, Sn) only through Sn, that represents Sat the n
observed locations, and Snrepresents the values of process Sat all other locations.
The second term of equation (9) is given by π(x, w|S, θ), which is equivalent to P(x, w, k|S, θ)
as follows:
P(x, w, k|S, θ) = P(x|w, S, θ)P(w|k)P(k|λ),
=n
Y
i=1
λ(xi)
λ k
Y
i=n+1
1λ(xi)
λ n
Y
i=1
1
|B|eλ|B|(λ|B|)k
k!,
= Φnβ
σSn;InΦknβ
σSkn;Ikneλ|B|(λ)k
k!,(11)
where Φk(.;Ik) is the cumulative distribution function of a k-dimensional Gaussian distribu-
tion with mean vector 0 and covariance matrix Ikand |B|is the volume of region B.
Equation (11) is obtained because P(x|w, S, θ) is given by the product of the independent
selection (thinning) probabilities Φ(βS(x)) for the Xlocations whereas the probabilities for
the discarded locations of ˜
Xare given by the complementary values Φ(βS(˜x)). P(w|k)
is given by a product of the kindependent uniform locations of the homogeneous process W
over the region of interest Band thus each wihas density 1/|B|. Finally, the distribution of
the number Kof occurrences of the homogeneous process Wis P oisson(λ|B|).
5
摘要:

ExactBayesianInferenceforGeostatisticalModelsunderPreferentialSamplingDouglasMateusdaSilva1,DaniGamerman2AbstractPreferentialsamplingisacommonfeatureingeostatisticsandoccurswhentheloca-tionstobesampledarechosenbasedoninformationaboutthephenomenaunderstudy.Inthiscase,pointpatternmodelsarecommonlyused...

展开>> 收起<<
Exact Bayesian Inference for Geostatistical Models under Preferential Sampling Douglas Mateus da Silva1 Dani Gamerman2.pdf

共24页,预览5页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:24 页 大小:1.59MB 格式:PDF 时间:2025-05-06

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 24
客服
关注