
High-dimensional order-free multivariate spatial disease mapping

Vicente, G.1,2,3, Adin, A.2,3,4, Goicoa, T.2,3,4 and Ugarte, M.D.2,3,4

1Facultad de Ciencias Económicas, Universidad de Cuyo, Argentina.

2Department of Statistics, Computer Sciences and Mathematics, Public University of Navarre, Spain.

3Institute for Advanced Materials and Mathematics, InaMat2, Public University of Navarre, Spain.

4IdiSNA, Health Research Institute of Navarre, Recinto de Complejo Hospitalario de Navarra, Spain.

∗Correspondence to María Dolores Ugarte, Departamento de Estadística, Informática y Matemáticas, Universidad Pública de Navarra, Campus de Arrosadia, 31006 Pamplona, Spain.

E-mail: lola@unavarra.es

Abstract

Despite the amount of research on disease mapping in recent years, the use of multivariate models for areal spatial data remains limited due to difficulties in implementation and computational burden. These problems are exacerbated when the number of small areas is very large. In this paper, we introduce an order-free multivariate scalable Bayesian modelling approach to smooth mortality (or incidence) risks of several diseases simultaneously. The proposal partitions the spatial domain into smaller subregions, fits multivariate models in each subdivision and obtains the posterior distribution of the relative risks across the entire spatial domain. The approach also provides posterior correlations among the spatial patterns of the diseases in each partition that are combined through a consensus Monte Carlo algorithm to obtain correlations for the whole study region. We implement the proposal using integrated nested Laplace approximations (INLA) in the R package bigDM and use it to jointly analyse colorectal, lung, and stomach cancer mortality data in Spanish municipalities. The new proposal permits the analysis of big data sets and provides better results than fitting a single multivariate model.

Keywords: Bayesian inference; High-dimensional data; Scalable models; Spatial epidemiology

arXiv:2210.14849v1 [stat.ME] 26 Oct 2022

1 Introduction

Research on methodology for the spatial (and spatio-temporal) analysis of areal count data has grown tremendously in recent years, and statistical models have proven an essential tool for studying the geographic distribution of data in small areas. The main objective of these techniques is to smooth standardized mortality (incidence) ratios or crude rates to discover geographic patterns of the phenomenon under study. These models and methods have been mainly applied in epidemiology to analyse the incidence and mortality of chronic diseases such as cancer, but some recent research has demonstrated their applicability to the spatial and spatio-temporal analysis of crimes (see for example Li et al., 2014), and in particular crimes against women (see for example Vicente et al., 2018, 2020a). Although research on single disease analysis has been very fruitful and abundant since the seminal work of Besag et al. (1991), joint modelling of several responses offers some advantages. The first is that it improves smoothing by borrowing strength between diseases. The second, and probably more important, is that it allows relationships between different diseases to be established in terms of similar or completely different geographical distributions, i.e. in terms of correlations between spatial patterns. This is crucial, as these correlations may indicate associations with common underlying risk factors and certain (usually unknown) connections between the different diseases. The joint analysis is carried out through multivariate spatial models that can cope with both spatial correlation within diseases and correlation between diseases. Not only can multivariate models account for correlation between diseases, but they also improve estimates by borrowing information from nearby areas.

There is a considerable amount of research on Bayesian multivariate spatial models for count data, most of the proposals relying on Markov chain Monte Carlo (MCMC) algorithms for estimation and inference. However, their use in practice is still limited due to a lack of “easy to use” implementations of the models in statistical packages and the computational burden of most of the proposals, which prevent practitioners from exploiting their undoubted advantages over univariate counterparts. A comprehensive review of the subject can be found in the work of MacNab (2018), which discusses the three main lines in the construction of multivariate proposals based on Gaussian Markov random fields: a multivariate conditionals-based approach (Mardia, 1988), a univariate conditionals-based approach (Sain et al., 2011), and a coregionalization framework (Jin et al., 2007). Regarding the latter, Martinez-Beneito (2013) derives a general theoretical setting for multivariate areal models that covers many of the existing proposals in the literature. However, this procedure is unaffordable for a moderate to large number of diseases due to the high computational cost of the MCMC algorithms. Botella-Rocamora et al. (2015) reformulate the Martínez-Beneito framework and present the so-called M-models as a simpler and more computationally efficient alternative. This approach makes it possible to increase the number of diseases in the model at the expense of the identifiability of certain parameters. Recently, Vicente et al. (2020b) consider the M-models-based approach to analyse in space and time different crimes against women in India. These authors estimate the M-models using integrated nested Laplace approximations (INLA) and numerical integration for Bayesian inference (see Rue et al., 2009) and implement the procedure using the 'rgeneric' construction in R-INLA (Lindgren and Rue, 2015). The result is a “ready-to-use” function for a wide audience with limited programming skills.

Several alternatives to Gaussian Markov random fields have also been proposed in the disease mapping literature. A very attractive modelling approach is the use of splines to smooth risks (Goicoa et al., 2012). Research on multivariate spline models for fitting spatio-temporal count data is not so abundant and focuses on multivariate structures to deal with the spatial and temporal dependence for one response measured in several time periods (see for example MacNab, 2016; Ugarte et al., 2010, 2017). Very recently, Vicente et al. (2021) propose multivariate P-spline models to study the spatio-temporal evolution of four crimes against women. Unfortunately, inference for these multivariate proposals (and also for univariate approaches) becomes infeasible when the number of areas is very large, and the scalability of the procedures is an issue.

New directions in disease mapping point towards developing new methods for Bayesian inference when the number of small areas is very large (MacNab, 2022). Creating computationally efficient methods for large data sets is one of the greatest challenges in the field of univariate and multivariate spatial statistics. Several methods for massive geostatistical (point-referenced) data have already been proposed (see for example Cressie and Johannesson, 2008; Lindgren et al., 2011; Nychka et al., 2015; Katzfuss, 2017; Katzfuss and Guinness, 2021, among others). However, in the case of areal (lattice) count data, research on the scalability of statistical models is not so abundant. Recently, Orozco-Acosta et al. (2021, 2022) propose a scalable Bayesian modelling approach for univariate high-dimensional spatial and spatio-temporal disease mapping data. They propose to divide the spatial domain into D subregions where independent models can be fitted simultaneously. To avoid the border effect in the risk estimates, k-order neighbours are added to each subregion so that some areal units will have several risk estimates. Finally, a unique posterior distribution for these risks is obtained by either computing the mixture distribution of the estimated posterior probability density functions or by selecting the posterior marginal risk estimate corresponding to the original domain to which the area belongs. This proposal reduces computational time and, in contrast to fitting a single model to the whole domain, it allows different degrees of spatial smoothness over the areas within the different subdomains.
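The step of adding k-order neighbours to each subregion can be illustrated with a minimal Python sketch. It is not the bigDM implementation: the function name `extend_with_neighbours`, the toy adjacency list and the two-way partition are all hypothetical, and the sketch only shows how areas near a border end up in more than one subregion.

```python
from collections import deque


def extend_with_neighbours(subregion, adjacency, k):
    """Return the subregion plus every area within k adjacency steps of it.

    Hypothetical helper: a multi-source breadth-first search that stops
    expanding once an area is k steps away from the original subregion.
    """
    extended = set(subregion)
    frontier = deque((area, 0) for area in subregion)
    while frontier:
        area, depth = frontier.popleft()
        if depth == k:
            continue  # do not look beyond k-order neighbours
        for neighbour in adjacency[area]:
            if neighbour not in extended:
                extended.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return extended


# Hypothetical toy map: six areas on a line, 0-1-2-3-4-5.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
# Hypothetical partition into D = 2 subregions.
partition = {"west": [0, 1, 2], "east": [3, 4, 5]}

extended = {
    name: sorted(extend_with_neighbours(areas, adjacency, k=1))
    for name, areas in partition.items()
}
print(extended)

# Areas that appear in more than one extended subregion receive several
# risk estimates, which then have to be merged into a single posterior.
overlap = sorted(set(extended["west"]) & set(extended["east"]))
print(overlap)
```

In this toy setting the areas on either side of the border are the ones that belong to both extended subregions, which is where the merging strategies described above come into play.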

The main objective of this paper is to present a new approach to fit order-free multivariate spatial disease mapping models in domains with a very large number of small areas, avoiding high RAM/CPU usage and making it accessible to users with limited computing facilities. In particular, we combine the Orozco-Acosta et al. (2021, 2022) “divide-and-conquer” approach with a modification of the Botella-Rocamora et al. (2015) M-models to avoid overparametrization. An additional interesting novelty of our proposal is that we are able to retrieve the posterior distributions of the correlations between the spatial patterns of each disease both in the whole spatial domain and in each of the subdivisions. We have implemented the methodology in INLA to reduce computational burden through our R package bigDM (Adin et al., 2022), which also implements recent high-dimensional univariate proposals.

The rest of the article has the following structure. Section 2 reviews the M-models to fit multivariate data. In Section 3 we present the new methodology to make the multivariate models scalable. In Section 4, we conduct a simulation study to compare the performance of this new modelling approach with a single multivariate spatial M-model fitted to the whole domain. Finally, in Section 5, we use the new proposal to jointly analyse lung, colorectal and stomach cancer male mortality in Spanish municipalities. The paper closes with a discussion.

2 M-models for multivariate disease mapping

Let us assume that the area of interest is divided into $I$ contiguous small areas and data are available for $J$ diseases. Let $O_{ij}$ and $E_{ij}$ denote the number of observed and expected cases, respectively, in the $i$-th small area ($i = 1, \ldots, I$) and for the $j$-th disease ($j = 1, \ldots, J$). Conditional on the relative risks $R_{ij}$, the number of observed cases in the $i$-th area and the $j$-th disease is assumed to follow a Poisson distribution with mean $\mu_{ij} = E_{ij} \cdot R_{ij}$, that is,

$$O_{ij} \mid R_{ij} \sim \text{Poisson}(\mu_{ij} = E_{ij} \cdot R_{ij}),$$

$$\log \mu_{ij} = \log E_{ij} + \log R_{ij}.$$

Here $E_{ij}$ is computed using indirect standardization as $E_{ij} = \sum_k n_{ijk} \cdot m_{jk}$, where $k$ is the age-group, $n_{ijk}$ is the population at risk in area $i$ and age-group $k$ for the $j$-th disease, and $m_{jk}$ is the overall mortality (or incidence) rate of the $j$-th disease in the total area of study for the $k$-th age group. The log-risk is modelled as

$$\log R_{ij} = \alpha_j + \theta_{ij}, \tag{1}$$

where $\alpha_j$ is a disease-specific intercept and $\theta_{ij}$ is the spatial effect of the $i$-th area for the $j$-th disease. Following the work by Botella-Rocamora et al. (2015), we rearrange the spatial effects into the matrix $\Theta = \{\theta_{ij} : i = 1, \ldots, I;\ j = 1, \ldots, J\}$ to better comprehend the dependence structure. The main advantage of the multivariate modelling is that dependence between the spatial patterns of the different diseases can be included in the model, so that latent associations between diseases can help to discover potential risk factors related to the phenomena under study. These unknown connections can be crucial to a better understanding of complex diseases such as cancer.
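The indirect standardization used for the expected cases can be sketched in a few lines of Python. All numbers below are made up for illustration, and the sketch assumes, for simplicity, that the population at risk is the same for both diseases (so $n_{ijk}$ does not depend on $j$). The ratio of observed to expected cases is the standardized mortality ratio that the models then smooth.

```python
def expected_cases(population, rates):
    """E_ij = sum_k n_ik * m_jk for one area i and one disease j."""
    return sum(n * m for n, m in zip(population, rates))


# Hypothetical population at risk: population[i][k], 2 areas x 3 age groups.
population = [
    [1000, 800, 200],
    [500, 400, 100],
]
# Hypothetical overall rates in the whole study region: rates[j][k],
# 2 diseases x 3 age groups.
rates = [
    [0.001, 0.005, 0.02],
    [0.002, 0.004, 0.01],
]
# Hypothetical observed counts: observed[i][j].
observed = [
    [12, 5],
    [3, 4],
]

# Expected counts E[i][j] and crude standardized mortality ratios O/E.
expected = [[expected_cases(n_i, m_j) for m_j in rates] for n_i in population]
smr = [
    [o / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

for i, (exp_row, smr_row) in enumerate(zip(expected, smr)):
    print(i, [round(e, 2) for e in exp_row], [round(s, 2) for s in smr_row])
```

With small expected counts these crude ratios are very unstable, which is the usual motivation for smoothing them through a model such as Equation (1).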

The potential associations between the spatial patterns of the different diseases are included in the model by considering the decomposition of $\Theta$ as

$$\Theta = \Phi M, \tag{2}$$

where $\Phi$ and $M$ deal with dependency within and between diseases, respectively. We refer to Equation (2) as the M-model. In the following, we briefly describe the two components of the M-model.

The matrix $\Phi$ is of order $I \times K$ and is composed of stochastically independent columns that follow a spatially correlated distribution. Usually, as many spatial distributions as diseases are considered, that is, $K = J$, although $J$ and $K$ may be different (see Corpas-Burgos et al., 2019, for a discussion). To deal with spatial dependence, different spatial priors have been considered in the literature, most of them based on the well-known intrinsic conditional autoregressive (iCAR) prior (Besag, 1974). These include the proper CAR (pCAR), a proper version of the iCAR; the Besag et al. (1991) prior (BYM), which combines iCAR and exchangeable random effects; the Leroux et al. (1999) prior (LCAR), which models spatially structured and spatially unstructured variability through a weighted sum of the iCAR precision matrix and the identity; and a modified version of the BYM model denoted as BYM2 (Dean et al., 2001; Riebler et al., 2016). In summary, the columns of $\Phi$ follow multivariate normal distributions with mean $0$ and precision matrix $\Omega$, whose expression depends on the spatial prior. In this paper, we consider the iCAR prior for the columns of $\Phi$, and hence the precision matrix is $\Omega_{iCAR} = \tau(D_w - W) = \tau Q$, where $W = (w_{il})$ is the spatial binary adjacency matrix defined as $w_{ii} = 0$, $w_{il} = 1$ if the $i$-th and the $l$-th areas are neighbours (share a common border) and $0$ otherwise, $D_w = \text{diag}(w_{1+}, \ldots, w_{I+})$, with the diagonal elements $w_{i+}$ being the number of neighbours of the $i$-th area, and $\tau$ is the precision parameter. Note that $Q$ is the usual spatial neighbourhood matrix.
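As an illustration of the definition $Q = D_w - W$, the following Python sketch builds the neighbourhood matrix from an adjacency list. The function name `icar_structure` and the four-area map are hypothetical. The sketch also checks that every row of $Q$ sums to zero, which means $Q$ is singular and the iCAR prior is improper.

```python
def icar_structure(adjacency):
    """Build Q = D_w - W from a dict mapping each area to its neighbours."""
    areas = sorted(adjacency)
    index = {area: pos for pos, area in enumerate(areas)}
    size = len(areas)
    Q = [[0] * size for _ in range(size)]
    for area in areas:
        i = index[area]
        Q[i][i] = len(adjacency[area])      # w_{i+}: number of neighbours
        for neighbour in adjacency[area]:
            Q[i][index[neighbour]] = -1     # -w_{il} for neighbouring areas
    return Q


# Hypothetical map with four areas; area 3 only touches area 2.
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
Q = icar_structure(adjacency)

for row in Q:
    print(row)

# Each row sums to zero, so Q times a vector of ones is zero.
row_sums = [sum(row) for row in Q]
print(row_sums)
```

Multiplying $Q$ by the precision parameter $\tau$ gives the iCAR precision matrix used for each column of $\Phi$.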

On the other hand, $M$ is a $K \times J$ nonsingular but arbitrary matrix and it is responsible for inducing dependence between the different columns of $\Theta$, i.e., for inducing correlation between the spatial patterns of the diseases. In Equation (2), the cells of $M$ can be regarded as coefficients of the log-relative risks on the underlying patterns captured in $\Phi$, and they are treated as fixed effects with a normal prior distribution with mean 0 and a large (and fixed) variance. Note that assigning $N(0, \sigma)$ priors to the cells of $M$ is equivalent to assigning a Wishart prior to $M'M$, i.e., $M'M \sim \text{Wishart}(J, \sigma^2 I_J)$. The multivariate approach allows the estimation of the correlation between the spatial patterns of the diseases, an interesting and useful feature, as a high positive correlation would support the hypothesis of common risk factors, and hence connections between diseases. The covariance matrix between the spatial patterns is obtained as $M'M$. For further details see Botella-Rocamora et al. (2015).
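To make the role of $M'M$ concrete, the sketch below computes the between-disease covariance and correlation matrices from a small, made-up matrix $M$ with $K = J = 2$. It also checks numerically that pre-multiplying $M$ by an orthogonal matrix leaves $M'M$ unchanged. The helper names and the values of `M` and `angle` are hypothetical.

```python
import math


def matmul(A, B):
    """Plain matrix product for matrices stored as lists of rows."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def transpose(A):
    return [list(col) for col in zip(*A)]


def between_disease_covariance(M):
    """Return M'M, the covariance between the spatial patterns."""
    return matmul(transpose(M), M)


def correlations(cov):
    """Rescale a covariance matrix to a correlation matrix."""
    size = len(cov)
    return [
        [cov[a][b] / math.sqrt(cov[a][a] * cov[b][b]) for b in range(size)]
        for a in range(size)
    ]


# Hypothetical 2 x 2 matrix M (K = J = 2).
M = [[1.0, 0.5], [0.0, 1.0]]
cov = between_disease_covariance(M)
corr = correlations(cov)
print(cov)
print(round(corr[0][1], 4))

# Rotating the rows of M with an orthogonal matrix P gives a different M
# but the same covariance, since (PM)'(PM) = M'P'PM = M'M.
angle = 0.7
P = [[math.cos(angle), -math.sin(angle)], [math.sin(angle), math.cos(angle)]]
cov_rotated = between_disease_covariance(matmul(P, M))
max_diff = max(
    abs(cov[a][b] - cov_rotated[a][b]) for a in range(2) for b in range(2)
)
print(max_diff < 1e-12)
```

The invariance check shows why summaries based on $M'M$, such as the correlations, are the natural quantities to report rather than the individual cells of $M$.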

For notation purposes and to incorporate the dependencies between different diseases in the model, we introduce the $\operatorname{vec}(\cdot)$ operator. Let $A = (A_1, \ldots, A_J)$ be an $I \times J$ matrix with $I \times 1$ columns $A_j$, for $j = 1, \ldots, J$. The $\operatorname{vec}(\cdot)$ operator transforms $A$ into an $IJ \times 1$ vector by stacking the columns one under the other, that is, $\operatorname{vec}(A) = (A_1', \ldots, A_J')'$. Using this notation, the multivariate Model (1) can be expressed in matrix form as

$$\log R = (I_J \otimes 1_I)\,\alpha + \operatorname{vec}(\Theta), \tag{3}$$

where $\alpha = (\alpha_1, \ldots, \alpha_J)'$, $R = (R_1', \ldots, R_J')'$, $R_j = (R_{1j}, \ldots, R_{Ij})'$, $j = 1, \ldots, J$, and $I_J$ and $1_I$ are the $J \times J$ identity matrix and a column vector of ones of length $I$, respectively. Then, once the between-disease dependencies are incorporated into the model, the resulting prior distribution for $\operatorname{vec}(\Theta)$ with Gaussian kernel has a precision matrix given by

$$\Omega_{\operatorname{vec}(\Theta)} = (M^{-1} \otimes I_I)\, \text{Blockdiag}(\Omega_1, \ldots, \Omega_J)\, (M^{-1} \otimes I_I)'. \tag{4}$$

Recall that this precision matrix accounts for both within- and between-disease dependencies: the $\Omega_1, \ldots, \Omega_J$ matrices control the within-disease spatial variability and the matrix $M$ captures the between-disease variability. Note that if $\Omega_1 = \ldots = \Omega_J = \Omega_w$, the covariance structure is separable and can be expressed as $\Omega_{\operatorname{vec}(\Theta)}^{-1} = \Omega_b^{-1} \otimes \Omega_w^{-1}$, where $\Omega_b^{-1} = M'M$ and $\Omega_w^{-1}$ are the between- and within-disease covariance matrices, respectively. Note that in our case $\Omega_w^{-1} = \Omega_{iCAR}^{-1}$. This M-model based framework includes both separable and non-separable covariance structures, and can accommodate different spatial dependency structures with different within-disease covariance matrices.
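The link between the decomposition in Equation (2), the precision matrix in Equation (4) and the separable case can be traced with a short derivation. It is a sketch under stated assumptions rather than a reproduction of the original proof: it takes $K = J$ with $M$ nonsingular, uses the standard identity $\operatorname{vec}(ABC) = (C' \otimes A)\operatorname{vec}(B)$, and reads the inverses $\Omega_j^{-1}$ as generalized inverses when the spatial prior is improper, as with the iCAR.

```latex
\begin{align*}
\operatorname{vec}(\Theta)
  &= \operatorname{vec}(\Phi M)
   = (M' \otimes I_I)\,\operatorname{vec}(\Phi), \\
\operatorname{Cov}\{\operatorname{vec}(\Phi)\}
  &= \operatorname{Blockdiag}(\Omega_1^{-1}, \ldots, \Omega_J^{-1}), \\
\operatorname{Cov}\{\operatorname{vec}(\Theta)\}
  &= (M' \otimes I_I)\,
     \operatorname{Blockdiag}(\Omega_1^{-1}, \ldots, \Omega_J^{-1})\,
     (M \otimes I_I), \\
\Omega_{\operatorname{vec}(\Theta)}
  &= (M^{-1} \otimes I_I)\,
     \operatorname{Blockdiag}(\Omega_1, \ldots, \Omega_J)\,
     (M^{-1} \otimes I_I)'. \\
\intertext{If $\Omega_1 = \cdots = \Omega_J = \Omega_w$, then
$\operatorname{Blockdiag}(\Omega_w^{-1}, \ldots, \Omega_w^{-1}) = I_J \otimes \Omega_w^{-1}$ and}
\operatorname{Cov}\{\operatorname{vec}(\Theta)\}
  &= (M' \otimes I_I)(I_J \otimes \Omega_w^{-1})(M \otimes I_I)
   = M'M \otimes \Omega_w^{-1}.
\end{align*}
```

The fourth line is obtained by inverting the third, which recovers Equation (4). The last line uses the mixed-product property of the Kronecker product and gives the separable form $\Omega_b^{-1} \otimes \Omega_w^{-1}$ with $\Omega_b^{-1} = M'M$.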

2.1 Model fitting, identifiability issues and prior distributions

Traditionally, MCMC techniques have been used for Bayesian model fitting and inference. However, despite the advances in research, it is widely acknowledged that MCMC techniques can be computationally very demanding. The INLA approach (see Rue et al., 2009) has turned out to be very popular in recent years. It is designed for latent Gaussian fields and is based on integrated nested Laplace approximations and numerical integration. Many models used in practice are implemented in R-INLA (Lindgren and Rue, 2015), and others can be implemented by means of generic functions with some extra programming work. The M-model based approach is not directly available in R-INLA, but it can be implemented using the 'rgeneric' construct (see for example Vicente et al., 2020b). In this paper, we use INLA for model fitting and inference.

Spatial models usually present identifiability issues which are generally overcome using sum-to-zero constraints on the spatial random effects (see Eberly and Carlin, 2000; Goicoa et al., 2018, for details). In the multivariate setting, these constraints are considered for all the diseases in the model. Additionally, the M-models bring about new identifiability issues. As pointed out by Botella-Rocamora et al. (2015), any orthogonal transformation of the columns of $\Phi$ and of the rows of $M$ in Equation (2) yields an alternative decomposition of $\Theta$, and therefore neither $\Phi$ nor $M$ is identifiable and inference on them should be precluded. However, $\Theta$ and the covariance matrix $M'M$ are perfectly identifiable, so inference is confined to those quantities. It is worth noting that the decomposition of the between-diseases covariance matrix as $\Omega_b^{-1} = M'M$ avoids
