High-dimensional order-free multivariate spatial
disease mapping
Vicente, G.1,2,3, Adin, A.2,3,4, Goicoa, T.2,3,4 and Ugarte, M.D.2,3,4
1Facultad de Ciencias Económicas, Universidad de Cuyo, Argentina.
2Department of Statistics, Computer Sciences and Mathematics, Public University of Navarre, Spain.
3Institute for Advanced Materials and Mathematics, InaMat2, Public University of Navarre, Spain.
4IdiSNA, Health Research Institute of Navarre, Recinto de Complejo Hospitalario de Navarra, Spain.
Correspondence to María Dolores Ugarte, Departamento de Estadística, Informática y Matemáticas,
Universidad Pública de Navarra, Campus de Arrosadia, 31006 Pamplona, Spain.
E-mail: lola@unavarra.es
Abstract
Despite the amount of research on disease mapping in recent years, the use of multivari-
ate models for areal spatial data remains limited due to difficulties in implementation and
computational burden. These problems are exacerbated when the number of small areas is
very large. In this paper, we introduce an order-free multivariate scalable Bayesian modelling
approach to smooth mortality (or incidence) risks of several diseases simultaneously. The
proposal partitions the spatial domain into smaller subregions, fits multivariate models in
each subdivision and obtains the posterior distribution of the relative risks across the entire
spatial domain. The approach also provides posterior correlations among the spatial pat-
terns of the diseases in each partition that are combined through a consensus Monte Carlo
algorithm to obtain correlations for the whole study region. We implement the proposal
using integrated nested Laplace approximations (INLA) in the R package bigDM and use it
to jointly analyse colorectal, lung, and stomach cancer mortality data in Spanish municipali-
ties. The new proposal permits the analysis of big data sets and provides better results than
fitting a single multivariate model.
Keywords: Bayesian inference; High-dimensional data; Scalable models; Spatial epidemiol-
ogy
arXiv:2210.14849v1 [stat.ME] 26 Oct 2022
1 Introduction
Research on methodology for the spatial (and spatio-temporal) analysis of areal count data
has grown tremendously in recent years, and statistical models have proven an essential tool
for studying the geographic distribution of data in small areas. The main objective of these
techniques is to smooth standardized mortality (incidence) ratios or crude rates to discover
geographic patterns of the phenomenon under study. These models and methods have been
mainly applied in epidemiology to analyse the incidence and mortality of chronic diseases such
as cancer, but some recent research has demonstrated their applicability to the spatial and
spatio-temporal analysis of crimes (see for example Li et al., 2014), and in particular crimes
against women (see for example Vicente et al., 2018, 2020a). Although research on single disease
analysis has been very fruitful and abundant since the seminal work of Besag et al. (1991), joint
modelling of several responses offers some advantages. The first is that it improves smoothing
by borrowing strength between diseases. The second, and probably more important, is that
it allows relationships to be established between different diseases in terms of similar or completely
different geographical distributions, i.e. in terms of correlations between spatial patterns. This
is crucial, as these correlations may indicate associations with common underlying risk factors
and certain (usually unknown) connections between the different diseases. The joint analysis
is carried out through multivariate spatial models that can cope with both spatial correlation
within diseases and correlation between diseases. Multivariate models not only account for
correlation between diseases, but also improve estimates by borrowing information from nearby
areas.
There is a considerable amount of research on Bayesian multivariate spatial models for count
data, most of the proposals relying on Markov chain Monte Carlo (MCMC) algorithms for
estimation and inference. However, their use in practice is still limited due to a lack of “easy
to use” implementations of the models in statistical packages and the computational burden of
most of the proposals that preclude practitioners from exploiting their undoubted advantages
over univariate counterparts. A comprehensive review of the subject can be found in the work
of MacNab (2018), which discusses the three main lines in the construction of multivariate
proposals based on Gaussian Markov random fields: a multivariate conditionals-based
approach (Mardia, 1988), a univariate conditionals-based approach (Sain et al., 2011), and a
coregionalization framework (Jin et al., 2007). Regarding the latter, Martínez-Beneito (2013)
derives a general theoretical setting for multivariate areal models that covers many of the existing
proposals in the literature. However, this procedure is unaffordable for a moderate to large
number of diseases due to the high computational cost of the MCMC algorithms.
Botella-Rocamora et al. (2015) reformulate the Martínez-Beneito framework and present the so-called
M-models as a simpler and more computationally efficient alternative. This approach makes it
possible to increase the number of diseases in the model at the expense of the identifiability of
certain parameters. Recently, Vicente et al. (2020b) consider the M-models-based approach to
analyse in space and time different crimes against women in India. These authors estimate the
M-models using integrated nested Laplace approximations (INLA) and numerical integration
for Bayesian inference (see Rue et al., 2009) and implement the procedure using the 'rgeneric'
construction in R-INLA (Lindgren and Rue, 2015). The result is a “ready-to-use” function for
a wide audience with limited programming skills.
Several alternatives to Gaussian Markov random fields have also been proposed in the disease
mapping literature. A very attractive modelling approach is the use of splines to smooth risks
(Goicoa et al., 2012). Research on multivariate spline models for fitting spatio-temporal count
data is not so abundant and focuses on multivariate structures to deal with the spatial and
temporal dependence for one response measured in several time periods (see for example
MacNab, 2016; Ugarte et al., 2010, 2017). Very recently, Vicente et al. (2021) propose multivariate
P-spline models to study the spatio-temporal evolution of four crimes against women.
Unfortunately, inference for these multivariate proposals (and also for univariate approaches) becomes
unfeasible when the number of areas is very large, and the scalability of the procedures is an
issue.
New directions in disease mapping point towards developing new methods for Bayesian inference
when the number of small areas is very large (MacNab, 2022). Creating computationally efficient
methods for large data sets is one of the greatest challenges in the field of univariate and
multivariate spatial statistics. Several methods for massive geostatistical data (point-referenced)
have already been proposed (see for example Cressie and Johannesson, 2008; Lindgren et al.,
2011; Nychka et al., 2015; Katzfuss, 2017; Katzfuss and Guinness, 2021, among others). However,
in the case of areal (lattice) count data, research on the scalability of statistical models is not so
abundant. Recently, Orozco-Acosta et al. (2021, 2022) propose a scalable Bayesian modelling
approach for univariate high-dimensional spatial and spatio-temporal disease mapping data.
They propose to divide the spatial domain into D subregions where independent models can
be fitted simultaneously. To avoid the border effect in the risk estimates, k-order neighbours
are added to each subregion so that some areal units will have several risk estimates. Finally,
a unique posterior distribution for these risks is obtained by either computing the mixture
distribution of the estimated posterior probability density functions or by selecting the posterior
marginal risk estimate corresponding to the original domain to which the area belongs. This
proposal reduces computational time and, in contrast to fitting a single model to the whole
domain, it allows different degrees of spatial smoothness over the areas within the different
subdomains.
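The partition step with k-order neighbours can be illustrated with a small sketch. It is not the bigDM implementation: the function name `extend_with_neighbours`, the toy six-area map and the two-subregion partition are assumptions made only for illustration. It shows how adding first-order neighbours to each subregion makes border areas belong to more than one extended subregion, which is why some areal units end up with several risk estimates.

```python
import numpy as np

def extend_with_neighbours(W, members, k):
    """Return the sorted indices of `members` plus their neighbours up to order k.

    W is a binary adjacency matrix; `members` are the areas of one subregion.
    (Hypothetical helper, not part of any package.)
    """
    current = set(members)
    for _ in range(k):
        # Areas adjacent to at least one area already in the set
        touched = np.flatnonzero(W[sorted(current)].sum(axis=0) > 0)
        current |= set(touched.tolist())
    return sorted(current)

# Toy map (assumption): six areas on a line, 0-1-2-3-4-5
n_areas = 6
W = np.zeros((n_areas, n_areas), dtype=int)
for i in range(n_areas - 1):
    W[i, i + 1] = W[i + 1, i] = 1

# Partition of the domain into D = 2 subregions
partition = {"A": [0, 1, 2], "B": [3, 4, 5]}
extended = {d: extend_with_neighbours(W, m, k=1) for d, m in partition.items()}

# Areas in more than one extended subregion receive several risk estimates
overlap = sorted(set(extended["A"]) & set(extended["B"]))
print(extended, overlap)
```

In this toy map the areas on either side of the border are fitted in both submodels, so a rule such as the mixture or the original-domain selection described above is needed to obtain a single posterior for them.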
The main objective of this paper is to present a new approach to fit order-free multivariate
spatial disease mapping models in domains with a very large number of small areas avoiding high
RAM/CPU usage, and making it accessible to users with limited computing facilities. In
particular, we combine the Orozco-Acosta et al. (2021, 2022) “divide-and-conquer” approach with a
modification of the Botella-Rocamora et al. (2015) M-models to avoid overparametrization. An
additional interesting novelty of our proposal is that we are able to retrieve the posterior
distributions of the correlations between the spatial patterns of each disease both in the whole
spatial domain and in each of the subdivisions. We have implemented the methodology in INLA
to reduce computational burden through our R package bigDM (Adin et al., 2022), which also
implements recent high-dimensional univariate proposals.
The rest of the article has the following structure. Section 2 reviews the M-models to fit
multivariate data. In Section 3 we present the new methodology to make the multivariate models
scalable. In Section 4, we conduct a simulation study to compare the performance of this new
modelling approach with a single multivariate spatial M-model fitted to the whole domain.
Finally, in Section 5, we use the new proposal to jointly analyse lung, colorectal and stomach
cancer male mortality in Spanish municipalities. The paper closes with a discussion.
2 M-models for multivariate disease mapping
Let us assume that the area of interest is divided into $I$ contiguous small areas and data are
available for $J$ diseases. Let $O_{ij}$ and $E_{ij}$ denote the number of observed and expected cases
respectively in the $i$-th small area ($i = 1, \ldots, I$) and for the $j$-th disease ($j = 1, \ldots, J$).
Conditional on the relative risks $R_{ij}$, the number of observed cases in the $i$-th area and the $j$-th
disease is assumed to follow a Poisson distribution with mean $\mu_{ij} = E_{ij} \cdot R_{ij}$, that is,
$$O_{ij} \mid R_{ij} \sim \text{Poisson}(\mu_{ij} = E_{ij} \cdot R_{ij}),$$
$$\log \mu_{ij} = \log E_{ij} + \log R_{ij}.$$
Here $E_{ij}$ is computed using indirect standardization as $E_{ij} = \sum_k n_{ijk} \cdot m_{jk}$, where $k$ is the
age-group, $n_{ijk}$ is the population at risk in area $i$ and age-group $k$ for the $j$-th disease, and $m_{jk}$
is the overall mortality (or incidence) rate of the $j$-th disease in the total area of study for the
$k$-th age group. The log-risk is modelled as
$$\log R_{ij} = \alpha_j + \theta_{ij}, \tag{1}$$
where $\alpha_j$ is a disease-specific intercept and $\theta_{ij}$ is the spatial effect of the $i$-th area for the $j$-th
disease. Following the work by Botella-Rocamora et al. (2015), we rearrange the spatial effects
into the matrix $\Theta = \{\theta_{ij} : i = 1, \ldots, I; j = 1, \ldots, J\}$ to better comprehend the dependence
structure. The main advantage of the multivariate modelling is that dependence between the
spatial patterns of the different diseases can be included in the model, so that latent associations
between diseases can help to discover potential risk factors related to the phenomena under study.
These unknown connections can be crucial to a better understanding of complex diseases such
as cancer.
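The indirect standardization used for the expected cases can be made concrete with a short numerical sketch. The populations and death counts below are invented toy values for a single disease, and the variable names are ours; the only thing taken from the text is the formula $E_{ij} = \sum_k n_{ijk} \cdot m_{jk}$, with $m_{jk}$ computed here from the same data (an assumption about how the overall rates are obtained).

```python
import numpy as np

# Toy data (assumption): I = 3 areas, 2 age groups, one disease j
n = np.array([[1000.0, 500.0],
              [2000.0, 1000.0],
              [1500.0, 1500.0]])   # n_ijk: population at risk by area and age group
d = np.array([[2.0, 5.0],
              [3.0, 12.0],
              [2.0, 13.0]])        # observed deaths by area and age group

m = d.sum(axis=0) / n.sum(axis=0)  # m_jk: overall age-specific rates in the whole region
E = n @ m                          # E_ij = sum_k n_ijk * m_jk
O = d.sum(axis=1)                  # O_ij: observed cases per area
smr = O / E                        # standardized mortality ratio, a crude estimate of R_ij

print(E, O, smr)
```

When the rates $m_{jk}$ come from the study region itself, as in this sketch, the expected cases add up to the observed total, so the crude ratios are centred around 1. The models in this section smooth these ratios rather than use them directly.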
The potential associations between the spatial patterns of the different diseases are included
in the model considering the decomposition of $\Theta$ as
$$\Theta = \Phi M, \tag{2}$$
where $\Phi$ and $M$ deal with dependency within and between diseases respectively. We refer to
Equation (2) as the M-model. In the following, we briefly describe the two components of the
M-model.
The matrix $\Phi$ is a matrix of order $I \times K$ and it is composed of stochastically independent
columns that are distributed following a spatially correlated distribution. Usually, as many
spatial distributions as diseases are considered, that is, $K = J$, although $J$ and $K$ may be
different (see Corpas-Burgos et al., 2019, for a discussion). To deal with spatial dependence,
different spatial priors have been considered in the literature, most of them based on the
well-known intrinsic conditional autoregressive (iCAR) prior (Besag, 1974). These include the proper CAR
(pCAR), a proper version of the iCAR; the Besag et al. (1991) prior (BYM), which combines
iCAR and exchangeable random effects; the Leroux et al. (1999) prior (LCAR) that models
spatially structured and spatially unstructured variability through a weighted sum of the iCAR
precision matrix and the identity; and a modified version of the BYM model denoted as BYM2
(Dean et al., 2001; Riebler et al., 2016). In summary, the columns of $\Phi$ follow multivariate
normal distributions with mean $0$ and precision matrix $\Omega$ whose expression depends on the
spatial prior. In this paper, we consider the iCAR prior for the columns of $\Phi$, and hence the
precision matrix is $\Omega_{iCAR} = \tau(D_w - W) = \tau Q$, where $W = (w_{il})$ is the spatial binary adjacency
matrix defined as $w_{ii} = 0$, $w_{il} = 1$ if the $i$-th and the $l$-th areas are neighbours (share a common
border) and 0 otherwise, $D_w = \text{diag}(w_{1+}, \ldots, w_{I+})$, with the diagonal elements $w_{i+}$ being the
number of neighbours of the $i$-th area, and $\tau$ is the precision parameter. Note that $Q$ is the
usual spatial neighbourhood matrix.
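As a minimal sketch of the iCAR precision just defined, the following builds $Q = D_w - W$ for a toy map. The four-area adjacency matrix and the value of $\tau$ are assumptions chosen for illustration. The code also checks two standard properties of $Q$ for a connected map: its rows sum to zero and its rank is $I - 1$, which is the reason the iCAR prior is improper.

```python
import numpy as np

# Toy binary adjacency matrix (assumption): four areas on a line, 0-1-2-3
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])

D_w = np.diag(W.sum(axis=1))   # diagonal matrix with the number of neighbours w_{i+}
Q = D_w - W                    # spatial neighbourhood (structure) matrix
tau = 2.0                      # precision parameter, arbitrary illustrative value
Omega_icar = tau * Q           # iCAR precision matrix

row_sums = Q.sum(axis=1)           # all zero by construction
rank = np.linalg.matrix_rank(Q)    # I - 1 for a connected map

print(Q, row_sums, rank)
```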
On the other hand, $M$ is a $K \times J$ nonsingular but arbitrary matrix and it is responsible for
inducing dependence between the different columns of $\Theta$, i.e., for inducing correlation between
the spatial patterns of the diseases. In Equation (2), the cells of $M$ act as coefficients, so they
can be considered as coefficients of the log-relative risks on the underlying patterns captured in
$\Phi$ and treated as fixed effects with a normal prior distribution with mean 0 and a large (and
fixed) variance. Note that assigning $N(0, \sigma)$ priors to the cells of $M$ is equivalent to assigning
a Wishart prior to $M'M$, i.e., $M'M \sim \text{Wishart}(J, \sigma^2 I_J)$. The multivariate approach allows
the estimation of the correlation between the spatial patterns of the diseases, an interesting
and useful feature, as a high positive correlation would support the hypothesis of common risk
factors, and hence connections between diseases. The covariance matrix between the spatial
patterns is obtained as $M'M$. For further details see Botella-Rocamora et al. (2015).
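The step from $M$ to the between-disease correlations can be sketched as follows. The matrix $M$ here is a random draw used purely for illustration (in practice one would work with posterior samples of $M$), and the variable names are ours. The final lines check numerically that an orthogonal transformation of the rows of $M$ leaves $M'M$ unchanged, which is the identifiability point discussed in Section 2.1.

```python
import numpy as np

rng = np.random.default_rng(0)
J = 3                                 # number of diseases (K = J)
M = rng.normal(size=(J, J))           # illustrative M matrix, not an estimate

Sigma_b = M.T @ M                     # between-disease covariance matrix M'M
sd = np.sqrt(np.diag(Sigma_b))
corr = Sigma_b / np.outer(sd, sd)     # correlations between spatial patterns

# Orthogonal matrix P from a QR decomposition of a random matrix
P, _ = np.linalg.qr(rng.normal(size=(J, J)))
M_rot = P @ M                         # a different M ...
same_cov = np.allclose(M_rot.T @ M_rot, Sigma_b)   # ... with the same M'M

print(corr, same_cov)
```

Because many different $M$ give the same $M'M$, summaries are best computed on the covariance or correlation matrix for each posterior draw and not on the cells of $M$ themselves.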
For notation purposes and to incorporate the dependencies between different diseases in the
model, we introduce the vec(·) operator. Let $A = (A_1, \ldots, A_J)$ be an $I \times J$ matrix with $I \times 1$
columns $A_j$, for $j = 1, \ldots, J$. The vec(·) operator transforms $A$ into an $IJ \times 1$ vector by
stacking the columns one under the other, that is, $\text{vec}(A) = (A_1', \ldots, A_J')'$. Using this notation,
the multivariate Model (1) can be expressed in matrix form as
$$\log R = (I_J \otimes 1_I)\alpha + \text{vec}(\Theta), \tag{3}$$
where $\alpha = (\alpha_1, \ldots, \alpha_J)'$, $R = (R_1', \ldots, R_J')'$, $R_j = (R_{1j}, \ldots, R_{Ij})'$, $j = 1, \ldots, J$, and $I_J$ and $1_I$
are the $J \times J$ identity matrix and a column vector of ones of length $I$ respectively. Then, once the
between-diseases dependencies are incorporated into the model, the resulting prior distribution
for $\text{vec}(\Theta)$, with Gaussian kernel, has a precision matrix given by
$$\Omega_{\text{vec}(\Theta)} = (M^{-1} \otimes I_I)\,\text{Blockdiag}(\Omega_1, \ldots, \Omega_J)\,(M^{-1} \otimes I_I)'. \tag{4}$$
Recall that this precision matrix accounts for both within and between-disease dependencies: the
$\Omega_1, \ldots, \Omega_J$ matrices control the within-diseases spatial variability and the matrix $M$ captures
the between-diseases variability. Note that if $\Omega_1 = \ldots = \Omega_J = \Omega_w$, the covariance structure
is separable and can be expressed as $\Omega_{\text{vec}(\Theta)}^{-1} = \Omega_b^{-1} \otimes \Omega_w^{-1}$, where $\Omega_b^{-1} = M'M$ and $\Omega_w^{-1}$ are
the between- and within-disease covariance matrices, respectively. Note that in our case
$\Omega_w^{-1} = \Omega_{iCAR}^{-1}$. This M-model based framework includes both separable and non-separable covariance
structures, and can accommodate different spatial dependency structures with different
within-disease covariance matrices.
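Equation (4) and the separable case can be checked numerically with a small sketch. Two assumptions are made so that all inverses exist: the within-disease precision is taken as $Q + I_I$ (a proper precision, not the singular iCAR matrix used in the paper), and $M$ is an arbitrary fixed nonsingular matrix. The helper `vec` stacks columns, as in the definition above.

```python
import numpy as np

def vec(A):
    """Stack the columns of A one under the other (column-major order)."""
    return A.reshape(-1, order="F")

n_areas, J = 4, 2
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
Q = np.diag(W.sum(axis=1)) - W
Omega_w = Q + np.eye(n_areas)          # assumption: proper precision so that it is invertible
M = np.array([[1.0, 0.5],
              [-0.3, 2.0]])            # arbitrary nonsingular M (illustrative)

# vec(Phi M) = (M' kron I_I) vec(Phi)
rng = np.random.default_rng(1)
Phi = rng.normal(size=(n_areas, J))
vec_identity_holds = np.allclose(vec(Phi @ M),
                                 np.kron(M.T, np.eye(n_areas)) @ vec(Phi))

# Equation (4) with Omega_1 = ... = Omega_J = Omega_w
Minv_kron = np.kron(np.linalg.inv(M), np.eye(n_areas))
blockdiag = np.kron(np.eye(J), Omega_w)            # Blockdiag(Omega_w, ..., Omega_w)
Omega_vec = Minv_kron @ blockdiag @ Minv_kron.T

# Separable covariance: (M'M) kron Omega_w^{-1}
cov_separable = np.kron(M.T @ M, np.linalg.inv(Omega_w))
separable_holds = np.allclose(np.linalg.inv(Omega_vec), cov_separable)

print(vec_identity_holds, separable_holds)
```

With different $\Omega_j$ in the block-diagonal term the same construction applies, but the inverse no longer factorizes into a single Kronecker product, which corresponds to the non-separable structures mentioned above.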
2.1 Model fitting, identifiability issues and prior distributions
Traditionally, MCMC techniques have been used for Bayesian model fitting and inference. How-
ever, despite the advances in research, it is widely acknowledged that MCMC techniques can be
computationally very demanding. The INLA approach (see Rue et al., 2009) has turned out to
be very popular in recent years. It is designed for latent Gaussian fields and is based on
integrated nested Laplace approximations and numerical integration. Many models used in practice
are implemented in R-INLA (Lindgren and Rue, 2015), and others can be implemented by means
of generic functions with some extra programming work. The M-model based approach is not
directly available in R-INLA, but it can be implemented using the 'rgeneric' construct (see for
example Vicente et al., 2020b). In this paper, we use INLA for model fitting and inference.
Spatial models usually present identifiability issues which are generally overcome using
sum-to-zero constraints on the spatial random effects (see Eberly and Carlin, 2000; Goicoa et al.,
2018, for details). In the multivariate setting, these constraints are considered for all the diseases
in the model. Additionally, the M-models bring about new identifiability issues. As pointed out
by Botella-Rocamora et al. (2015), any orthogonal transformation of the columns of $\Phi$ and of the
rows of $M$ in Equation (2) yields an alternative decomposition of $\Theta$, and therefore neither $\Phi$ nor
$M$ is identifiable and inference on them should be avoided. However, $\Theta$ and the covariance
matrix $M'M$ are perfectly identifiable, so inference is confined to those quantities. It is worth
noting that the decomposition of the between-diseases covariance matrix as $\Omega_b^{-1} = M'M$ avoids