Exact Bayesian Inference for Geostatistical Models under Preferential Sampling Douglas Mateus da Silva1 Dani Gamerman2

2025-05-06 0 0 1.59MB 24 页 10玖币

侵权投诉

Exact Bayesian Inference for Geostatistical Models

under Preferential Sampling

Douglas Mateus da Silva1, Dani Gamerman2

Abstract

Preferential sampling is a common feature in geostatistics and occurs when the loca-

tions to be sampled are chosen based on information about the phenomena under study.

In this case, point pattern models are commonly used as the probability law for the

distribution of the locations. However, analytic intractability of the point process like-

lihood prevents its direct calculation. Many Bayesian (and non-Bayesian) approaches

in non-parametric model speciﬁcations handle this diﬃculty with approximation-based

methods. These approximations involve errors that are diﬃcult to quantify and can

lead to biased inference. This paper presents an approach for performing exact Bayesian

inference for this setting without the need for model approximation. A qualitatively

minor change on the traditional model is proposed to circumvent the likelihood in-

tractability. This change enables the use of an augmented model strategy. Recent

work on Bayesian inference for point pattern models can be adapted to the geostatis-

tics setting and renders computational tractability for exact inference for the proposed

methodology. Estimation of model parameters and prediction of the response at un-

sampled locations can then be obtained from the joint posterior distribution of the

augmented model. Simulated studies showed good quality of the proposed model for

estimation and prediction in a variety of preferentiality scenarios. The performance

of our approach is illustrated in the analysis of real datasets and compares favourably

against approximation-based approaches. The paper is concluded with comments re-

garding extensions of and improvements to the proposed methodology.

Keywords: Bayesian inference; Data augmentation; Geostatistics; Point process; Pre-

diction; Preferential sampling.

1 Introduction

Preferential sampling (PS) occurs when sampled locations are more likely to be chosen than

others according to mechanisms that depend on the data generating process. In geostatistics,

it is common to ignore the information of the sampling design in the inference step. This is

not a problem in a non preferential sampling (NPS) context, where the sampling design is

stochastically independent of the spatial process, but the use of the traditional model will

provide biased inference and prediction when PS is present.

The NPS model has two main components: a Gaussian process that models the correlated

structure of data and a observational speciﬁcation based on a random observational noise

added to the underlying Gaussian process. No information about the process that generated

1Universidade Federal de Minas Gerais,

Av. Antˆonio Carlos 6627, Pampulha, Belo Horizonte, MG 31270-010, Brazil

E-mail: douglas est@yahoo.com.br

2Universidade Federal do Rio de Janeiro.

arXiv:2210.14411v1 [stat.ME] 26 Oct 2022

the pattern of the locations (other than their coordinates) is used, since this process is

independent of the data process. When PS is present, the stochastic dependence of the point

and the Gaussian processes must be both taken into account.

In this way, Diggle et al. (2010) proposed a class of models for dealing with PS in geo-

statistics. A component to explicitly describe the sampling design was included in the model

in addition to the latent Gaussian process and the observational error process of the tra-

ditional model. This component consisted on a Cox process, ie, a spatial Poisson process

with intensity based on the same Gaussian process that drives the data. The authors showed

that the addition of this component to the model corrects the resulting inference. Many

papers have been conducted after and based on their study. For example, Pati et al. (2011)

presented theoretical results of the model in a Bayesian framework; Gelfand et al. (2012)

analyzed the eﬀect of PS on parameter estimation and prediction, also using a Bayesian

approach; Shaddick and Zidek (2014) extended the model to the spatial-temporal context;

Ferreira and Gamerman (2015) studied the eﬀect of PS on the optimal design; Dinsdale and

Salibian-Barrera (2019) proposed to use Laplace approximation for the likelihood function

in a classic approach instead of the Monte Carlo approximation used by Diggle et al. (2010).

The work cited above had to deal with the likelihood function of the inhomogeneous

Poisson process. It has no closed form and depends on a integral of the unknown intensity

function over the entire continuous domain of the study region. Approximated methods were

considered, based on discretizing the region under study (Møller and Waagepetersen, 2003).

Usually in this approximation, a regular grid over the region is used and the real locations

of data are approximated by the centroids of the cell that contains the sampled points.

Accordingly, the Gaussian process is approximated by a multivariate Gaussian vector over

the grid. Another option for the discretization is via approximation of the integral in the

likelihood via quadrature rules (Butler and Moﬃtt, 1982; Karvonen and S¨arkk¨a, 2017).

Discretization is a relatively easy solution for the problem but can generate errors that are

diﬃcult to quantify and lead to biased estimation (Simpson et al., 2016). According to these

authors, two main sources of errors are present in the discrete approach: the approximation

of the Gaussian process in a discrete domain and the approximation of the real locations by

the discrete ones. The last is the dominant source of error in lattice approximation because

of the loss of the real information about the locations.

The likelihood function of the Poisson process is the component of the geostatistical model

under PS that requires approximated methods. Its intractability impacts the inference and,

as consequence, the capability of prediction with the model. Predicting the response variable

at unobserved locations is usually the main goal in geostatistics (Diggle and Ribeiro, 2007).

Thus, results are also compromised if the path to obtain prediction is compromised. This

situation motivates the development of a methodology free from discretization errors.

Gon¸calves and Gamerman (2018) proposed a methodology for performing exact inference

for Cox processes where the intractability of the likelihood function was avoided. This was

possible due to an augmentation data approach in which the inhomogeneous Poisson process

was embedded as a part of a larger homogeneous Poisson process. As a result, the augmented

likelihood function obtained in combination with a bounded link for the intensity leads to

a completely tractable model. No approximation was used in their approach, leading to a

better and correct inference for Cox processes.

Following this idea, a methodology to perform exact inference in geoestatistics under

PS is proposed in this paper, by an adaptation of their results to the geostatistics setting.

The proposed method involves the additional restriction of bounded intensity function of the

point process. This is the only change required to the traditional PS model in geostatistics.

This modiﬁcation seems not only reasonable in many practical situations but also of minor

relevance since most studies are carried out over a bounded region. Therefore, the intensity

of interest is bounded to this study region and almost inevitably becomes bounded there

even in the potentially unbounded intensity scenario of the traditional approach.

The approach leads to an easily computed likelihood function and requires no approxi-

mation of the model, either by discretization or by any other form of approximation. In this

way, the methodology provides correct results for inference and prediction. Correct informa-

tion about the sampled locations is used instead of approximated ones, potentially leading to

better estimates of all model parameters. The improvement is also obtained for estimation

of the dependence between locations since they depend on the actual sampled points. Also,

the methodology works with a continuous spatial domain instead of a discretized one. A

study of simulation in PS and NPS contexts will also be presented to compare results from

the proposed model and the traditional model and to provide performance assessments.

Thus, the novelty of our contribution lies in: a) proposing a change in the intensity speci-

ﬁcation and showing how this change allows for exact model speciﬁcation in the geostatistics

setting, and 2) adapting the inference performed by Gon¸calves and Gamerman (2018) over

point processes for performing inference over the geostatistics setting under PS. The paper

is organized as follows: Section 2 discusses background about the geostatistical model and

presents the proposed novel methodology to the PS approach. A simulation study and their

main ﬁndings are presented in Section 3. Section 4 illustrates the methodology in two real

datasets. Finally, Section 5 presents our discussion and ﬁnal remarks.

2 Context

This section presents a brief review of the geostatistical model under preferential sampling

and shows the main diﬃculty with the existing methodology. Then, the proposed model to

make exact inference in the preferential sampling context is presented.

2.1 Preliminaries

Let Bbe a compact region in Rm,m≥1, the region of study. A point xin Bis denoted by

x= (b1, b2, ..., bm). The traditional geostatistical model is deﬁned by

Yi=µ(xi) + S(xi) + i, i = 1, ...n, (1)

µ(xi) = η0+η1d1(xi) + ... +ηpdp(xi),

where Yi≡Y(xi) is the observed response at location xi,µ(xi) is the deterministic component

that represents the expected value of Yi, (η0, η1, ..., ηp) and (d1(.), ..., dp(.)) are regression coef-

ﬁcients and their respective covariates, S(.) is a stationary Gaussian process with zero mean,

constant variance σ2and isotropic correlation function ρ(x, x0) = ρφ(h), where h=|x−x0|

is the Euclidean distance between xand x0. This process is denoted GP (0, σ2, ρφ) hereafter.

The observation errors i’s are independent of Sand usually assumed to be independent

Gaussian variables with zero mean and variance τ2. The model without covariates is a

special case of equation (1), where µ(x) = η0.

Let Xbe the spatial point process that models the sampled locations. In the non-

preferential context, Sis independent of X(denoted by [X|S] = [X]) and the distribution of

Xis irrelevant for inference about Sand hence for prediction. In the preferential sampling

context, [X|S]6= [X] and ignoring the dependence between Xand Scan lead to biased

inference (Diggle et al., 2010). In this way, the complete distribution of S, X and Y must be

informed. Usually the process X|Sis modeled by a Poisson process with intensity function

λ(x) = g(S(x)),(2)

where gis a monotonic function. Its likelihood function is given by

f(X|S)∝exp −ZB

λ(ξ)dξ"n

i=1

λ(xi)#.(3)

The main problem of dealing with the Poisson process is that equation (3) is intractable for

intensity (2) when Sis unknown. Approximations are required to evaluate it. Such approx-

imations are many times obtained by discretizing the region under study in a grid, in which

the intensity is assumed to be constant within each grid cell (Møller and Waagepetersen,

2003). Diggle et al. (2010) used a log-Gaussian Cox Process (LGCP) to model X, with the

link function in equation (2) given by g(u) = exp{α+βu}. This choice also leads to an

intractable likelihood function.

Hereafter, methods with no model approximations are referred to as exact and the exact

approach to solve the likelihood intractability is proposed in the next section.

2.2 Model speciﬁcation

The model in this paper retains the basic geostatistical model (1) for the measurements Y

and also makes use of a Poisson process with intensity depending on the underlying process

Sto describe the preferentiality of the locations X. However, the link for the relation of

the intensity of Xwith the latent process Sis rather a bounded function. This change is

fundamental for achieving the exactness of our speciﬁcation.

Hence, the following model is assumed

Y|X, S ∼Nn(Dη +S(x), τ2In),(4)

X|S, λ ∼P P (λ),(5)

λ(x) = λ∗F(βS(x)/σ),(6)

S|σ2, φ ∼GP (0, σ2, ρφ),(7)

θ∼π(θ), θ = (λ∗, η, τ2, σ2, φ, β),(8)

where PP means Poisson process, Inis the identity matrix of order n,ρφis the correlation

function indexed by parameter φ,Fis a monotonic function restricted to the unit interval

(e.g. a continuous distribution function) and λ∗= sup{λ(x)}. The probit link Φ is used for

Fin the sequel but any other bounded function, such as the logistic distribution function,

could be used in (6). These two choices are similar but the probit function is chosen because

it is particularly useful for computational purposes.

The following augmented model approach can be used now due to the boundedness of

the intensity. Let Wbe a homogeneous Poisson process with intensity λ∗and suppose that

Xis obtained by applying Poisson thinning in W. Thus, the Wprocess can be split in the

Xand ˜

Xprocess according to the thinning mechanism described below

X={(x, zx) : x∈W, zx∼Bernoulli(Φ(βS(x)/σ)), zx= 1},

X={(˜x, z˜x) : ˜x∈W, z˜x∼Bernoulli(Φ(βS(˜x)/σ)), z˜x= 0}.

The number of points of Wis denoted by k, of Xby nand of ˜

Xby k−n.˜

Xrepresents the

process of the discarded points of Wafter thinning. The Xand ˜

Xprocesses are conditionally

independent given the parameters and ˜

X|˜

λ∼P P (˜

λ), with ˜

λ=λ∗[1 −Φ(βS(˜x)/σ)] =

λ∗Φ(−βS(˜x)/σ).

The likelihood function of the augmented model is given by

l(S, θ;y, x, w) = π(y, x, w|S, θ) = π(y|S, x, θ)π(x, w|S, θ),(9)

in which wand xare the observed locations of Wand X, respectively, and yis the observed

values of Y. Note that Wis partially observed with Xbeing its observed part and the

remainder ˜

Xis missing data in the above speciﬁcation.

The ﬁrst term of right side of equation (9) is given by

π(y|S, x, θ) = 1

2πτ 2n

exp −1

2τ2(y−Dη −Sn)0(y−Dη −Sn).(10)

Thus, this term depends on S= (Sn, S−n) only through Sn, that represents Sat the n

observed locations, and S−nrepresents the values of process Sat all other locations.

The second term of equation (9) is given by π(x, w|S, θ), which is equivalent to P(x, w, k|S, θ)

as follows:

P(x, w, k|S, θ) = P(x|w, S, θ)P(w|k)P(k|λ∗),

=n

i=1

λ(xi)

λ∗ k

i=n+1

1−λ(xi)

λ∗ n

i=1

|B|e−λ∗|B|(λ∗|B|)k

k!,

= Φnβ

σSn;InΦk−n−β

σSk−n;Ik−ne−λ∗|B|(λ∗)k

k!,(11)

where Φk(.;Ik) is the cumulative distribution function of a k-dimensional Gaussian distribu-

tion with mean vector 0 and covariance matrix Ikand |B|is the volume of region B.

Equation (11) is obtained because P(x|w, S, θ) is given by the product of the independent

selection (thinning) probabilities Φ(βS(x)/σ) for the Xlocations whereas the probabilities for

the discarded locations of ˜

Xare given by the complementary values Φ(−βS(˜x)/σ). P(w|k)

is given by a product of the kindependent uniform locations of the homogeneous process W

over the region of interest Band thus each wihas density 1/|B|. Finally, the distribution of

the number Kof occurrences of the homogeneous process Wis P oisson(λ∗|B|).

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

ExactBayesianInferenceforGeostatisticalModelsunderPreferentialSamplingDouglasMateusdaSilva1,DaniGamerman2AbstractPreferentialsamplingisacommonfeatureingeostatisticsandoccurswhentheloca-tionstobesampledarechosenbasedoninformationaboutthephenomenaunderstudy.Inthiscase,pointpatternmodelsarecommonlyused...

展开>> 收起<<

Exact Bayesian Inference for Geostatistical Models under Preferential Sampling Douglas Mateus da Silva1 Dani Gamerman2.pdf

共24页,预览5页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

Exact Bayesian Inference for Geostatistical Models under Preferential Sampling Douglas Mateus da Silva1 Dani Gamerman2

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: