
A Partially Functional Linear Modeling Framework
for Integrating Genetic, Imaging, and Clinical Data
Ting Li1, Yang Yu2, J. S. Marron2,3, and Hongtu Zhu2,3,4,5,6
1School of Statistics and Management, Shanghai University of
Finance and Economics, Shanghai, China
Departments of 2Statistics, 3Biostatistics, 4Genetics, and 5Computer
Science and 6Biomedical Research Imaging Center, University of
North Carolina at Chapel Hill, Chapel Hill 1
1These authors contributed equally: Ting Li and Yang Yu. Address for correspondence: Hongtu
Zhu, Ph.D., Email: htzhu@email.unc.edu. Data used in preparation of this article were obtained
from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such,
the investigators within the ADNI contributed to the design and implementation of ADNI and/or
provided data but did not participate in analysis or writing of this report. A complete listing of
ADNI investigators can be found at:
http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.
arXiv:2210.01084v2 [stat.ME] 22 Feb 2023
Abstract
This paper is motivated by the joint analysis of genetic, imaging, and clin-
ical (GIC) data collected in the Alzheimer’s Disease Neuroimaging Initiative
(ADNI) study. We propose a regression framework based on partially func-
tional linear regression models to map high-dimensional GIC-related pathways
for Alzheimer’s Disease (AD). We develop a joint model selection and estima-
tion procedure by embedding imaging data in the reproducing kernel Hilbert
space and imposing the $\ell_0$ penalty for the coefficients of genetic variables. We
apply the proposed method to the ADNI dataset to identify important features
from tens of thousands of genetic polymorphisms (reduced from millions using
a preprocessing step) and study the effects of a certain set of informative ge-
netic variants and the baseline hippocampus surface on thirteen future cognitive
scores measuring different aspects of cognitive function. We explore the shared
and different heritability patterns of these cognitive scores. Analysis results suggest
that both the hippocampal and genetic data have heterogeneous effects on
different scores, with the trend that the value of both hippocampi is negatively
associated with the severity of cognition deficits. Polygenic effects are observed
for all the thirteen cognitive scores. The well-known APOE4 genotype only
explains a small part of the cognitive function. Shared genetic etiology exists;
however, greater genetic heterogeneity exists within disease classifications after
accounting for the baseline diagnosis status. These analyses are useful in
further investigation of functional mechanisms for AD evolution.
Keywords: Clinical; Genetics; Imaging; Non-asymptotic error bounds; Partially func-
tional linear regression; Reproducing kernel Hilbert space; Sparsity.
1 Introduction
Alzheimer’s disease (AD) is a chronic neurodegenerative disease that causes degen-
eration of brain cells and decline in thinking, behavioral and social skills. It involves
cognitive impairment with substantial between-patient variability in clinical presentation
as well as the burden and distribution of pathology. Such clinicopathologic
heterogeneity presents both challenges and opportunities for carrying out systematic and
biomarker-based studies to refine our understanding of AD biology, diagnosis, and
management (Duong et al., 2022). AD has complex pathophysiological mechanisms
which are not completely understood. The advances in biomarker identification, in-
cluding genetic and imaging data, may improve the identification of individuals at
risk for AD before symptom onset.
The primary aim of this study is to use Genetic, Imaging, and Clinical (GIC) variables
from the ADNI study to map the biological pathways of AD-related phenotypes
of interest (e.g., cognition, intelligence, disease stage, impairment score, and progression
status) (Sudlow et al., 2015; Elliott et al., 2018). Doing so may provide insights into
the biological processes of brain development, healthy aging, and disease progression. For
instance, it is of great interest to integrate GIC data to elucidate the environmental, social,
and genetic etiologies of intelligence and to delineate the foundation of intelligence
differences in brain structure and functioning (Deary et al., 2022). Moreover, many
brain-related disorders including AD are often caused by a combination of multiple
genetic and environmental factors, while being the endpoints of abnormality of brain
structure and function (Shen and Thompson, 2019; Zhao et al., 2019; Knutson et al.,
2020). A thorough understanding of such neuro-biological pathways may lead to the
identification of possibly hundreds of risk genes, environmental risk factors, and brain
structures and functions that underlie brain disorders. Once such identification has
been accomplished, it is possible to detect these risk genes and factors and brain
abnormalities early enough to make a real difference in outcome and to develop their
related treatments, ultimately preventing the onset of brain-related disorders and
reducing their severity.
We extract clinical, imaging and genetic variables from the ADNI study. These include
cognitive scores for quantifying behavioral deficits, ultra-high dimensional genetic
covariates, other demographic covariates at baseline, and brain structures measured by
brain imaging. As previous studies have shown that the hippocampus is particularly
vulnerable to AD pathology and has become a major focus in AD (Braak and Braak,
1998), we characterize the exposure of interest, hippocampal shape, by the left/right
hippocampal morphometry surface data as a 100×150 matrix. We give a detailed data
description in Section 2. Exploring how human brains and genetics connect to human
behavior is a central goal in medical studies. We are interested in how hippocampal
shape and genetics are associated with future cognitive deficits in Alzheimer’s disease studies.
The special data structure of these GIC variables presents new challenges for mapping
the GIC pathway. First, conventional statistical tools that deal with scalar exposure
are not applicable to 2D high-dimensional hippocampal imaging measures. Second,
the dimension of the genetic covariates is much larger than the sample size. An effec-
tive statistical method which can exploit the 2D hippocampal surface data and the
ultra-high dimensional genetic data to map the GIC pathway is urgently needed.
The literature on analysis for imaging genetics has proliferated over the past
decade. There are roughly four categories of statistical methods for such analysis.
The first is identifying genetic risk factors for scalar phenotype of interest through
genome-wide association study, such as Carrasquillo et al. (2009), Bertram and Tanzi
(2012) and Lo et al. (2019). The second is analysis of neuroimaging data, ranging from
acquiring raw neuroimaging data, locating brain activity, to predicting psychological,
psychiatric or cognitive states (Lindquist, 2008). The statistical tools for detecting
association between scalar phenotype of interest and imaging data include voxelwise
regression (Zhou et al., 2014), functional data analysis approaches (Reiss and Ogden,
2010), and tensor regression models that exploit the array structure in imaging data
(Zhou et al., 2013; Wang et al., 2017; Li and Zhang, 2021). The third is investigating
the effects of genetic variations on imaging phenotypes. Blokland et al. (2012)
and Zhao et al. (2019) used imaging traits as phenotypes and quantified the effects of
genetics on the structure and function of the human brain. The fourth is mapping
biological pathways linking genetics and imaging data to neuropsychiatric disorders
and examining the joint effects of both genetic risk factors and imaging data, which
remains challenging and has not been studied systematically compared to the above
three categories (Zhu et al., 2022). Most of the existing methods first extract features
from the imaging data and then focus on the effects of the obtained features and
genetic data; see Dukart et al. (2016), Ossenkoppele et al. (2021), Cruciani et al.
(2022) and references therein. Such approaches ignore the rich smoothness information in the
imaging data.
Figure 1: Directed acyclic graph showing potential relationships between the genetic data, the imaging
data and the future outcome. The colored arrows denote the associations of interest.
To map GIC-related pathways, we consider a high-dimensional Partially Func-
tional Linear Model (PFLM) as follows:
$$Y_i = \alpha + X_i^T \beta + \int_{\mathcal{T}} Z_i(t)\,\xi(t)\,dt + \epsilon_i \quad \text{for } i = 1, \ldots, n, \tag{1}$$
where $Y_i$ is a continuous phenotype of interest for subject $i$, $X_i \in \mathcal{X}$ is a $p \times 1$ vector of
genetic and environmental variables, and $Z_i(t) \in L^2(\mathcal{T})$ is an imaging (or functional)
predictor over a compact set $\mathcal{T}$. Moreover, $\alpha$ is the intercept term, $\beta$ is a $p \times 1$
vector of coefficients, $\xi(t)$ is an unknown slope function, which is assumed to be in
a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, and the $\epsilon_i$ are measurement errors. We
consider the case that the dimension of $\beta$ is either comparable to or much larger than
the sample size $n$ and $\xi(t)$ is an infinite dimensional function. Our statistical problem
of interest is to make statistical inference on $\beta$ and $\{\xi(t) : t \in \mathcal{T}\}$. As illustrated
in Fig 1, there exist genetic and clinical confounders which affect both hippocampal
shape and behavioral deficits (Selkoe and Hardy, 2016). Compared to the classical
high dimensional linear model that only considers the genetic data, the inclusion of
imaging exposure ξ(t) has two important implications for the ADNI dataset. First,
model (1) is able to quantify the direct effects of the confounders by controlling for
the imaging exposure. Second, model (1) investigates the influence of the imaging
exposure while controlling for the confounders and preserving the structure of the
imaging data.
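To make the structure of model (1) concrete, the following minimal Python sketch simulates data from it and approximates the integral term by a trapezoidal sum on a grid. It is an illustration only and is not taken from the paper: the sample sizes, the cosine basis used to generate the curves, the slope function, the noise level, and the helper name `functional_term` are all assumptions chosen for the example, and the domain is taken to be the interval [0, 1] rather than a 2D surface.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: n subjects, p scalar covariates, m grid points on T = [0, 1].
n, p, m = 200, 50, 101
t = np.linspace(0.0, 1.0, m)

# Trapezoidal quadrature weights, so that sum(w * f(t)) approximates the integral of f.
w = np.full(m, t[1] - t[0])
w[[0, -1]] *= 0.5

def functional_term(Z, xi, w):
    """Approximate the integral of Z_i(t) * xi(t) over T for every row Z_i of Z."""
    return Z @ (xi * w)

# Scalar covariates with a sparse coefficient vector (assumed values).
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [1.5, -2.0, 1.0]

# Functional predictors: random combinations of five cosine basis functions (assumed).
basis = np.sqrt(2.0) * np.cos(np.pi * np.outer(np.arange(1, 6), t))  # shape (5, m)
Z = rng.standard_normal((n, 5)) @ basis                              # shape (n, m)

# Assumed slope function, intercept, and noise.
xi = np.sin(2.0 * np.pi * t)
alpha = 0.5
eps = 0.1 * rng.standard_normal(n)

# Response generated according to model (1).
y = alpha + X @ beta + functional_term(Z, xi, w) + eps
```

For a 2D imaging predictor such as a surface stored as a matrix, the same discretization applies after vectorizing the grid, with one quadrature weight per grid cell.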
There is scarce literature on PFLM with high dimensional scalar covariates with
a few exceptions. Kong et al. (2016) studied PFLM in high dimension, in which
the dimension of scalar covariates was allowed to diverge with n. Yao et al. (2017)
developed a regularized partially functional quantile regression model, while allowing
the number of scalar predictors to increase with the sample size. Ma et al. (2019)
focused on the partial functional partial regression model in ultra-high dimensions
with a diverging number of scalar predictors. All of the above three methods consist
of three steps: representing the functional predictors by using their leading functional
principal components (FPCs), reducing PFLM to a standard high dimensional linear
regression model, and selecting important features through the smoothly clipped ab-
solute deviation (SCAD) penalty (Fan and Li, 2001). Therefore, existing approaches
rely heavily on the success of the FPCA approach (Wang et al., 2016).
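The FPCA-based reduction shared by these methods can be sketched as follows. This is a minimal Python illustration under stated assumptions, not code from any of the cited works: the function names `fpca_scores` and `reduce_pflm_design` and the toy sizes are hypothetical, the curves are assumed to be observed on a common equally spaced grid, and the penalized selection step (SCAD in the cited works) is omitted. The grid-spacing factor that would normalize the eigenfunctions in $L^2$ is absorbed into the regression coefficients.

```python
import numpy as np

def fpca_scores(Z, n_components):
    """Leading principal component scores of curves Z (n x m) on a common grid."""
    Zc = Z - Z.mean(axis=0)                      # center each grid point
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    components = Vt[:n_components]               # (k, m) discretized eigenfunctions
    scores = Zc @ components.T                   # (n, k) projections onto them
    return scores, components

def reduce_pflm_design(X, Z, n_components):
    """Replace the functional predictor by its leading scores, giving a
    standard linear-model design matrix [X, scores]."""
    scores, _ = fpca_scores(Z, n_components)
    return np.hstack([X, scores])

# Toy data with hypothetical sizes.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))
Z = rng.standard_normal((30, 40))
scores, components = fpca_scores(Z, n_components=3)
design = reduce_pflm_design(X, Z, n_components=3)
```

The truncation level `n_components` is the discrete tuning parameter of this approach; a sparse penalized regression would then be fitted on `design`.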
In this paper, we focus on the high dimensional PFLM (1), develop a
method for model selection and estimation, investigate theoretical properties of both
the functional and scalar estimators, and apply the proposed method to analyze the
ADNI dataset. We use the RKHS framework (Yuan and Cai, 2010; Cai and Yuan,
2012; Li and Zhu, 2020) and impose the roughness penalty on the functional coeffi-
cient. The success of the existing FPCA-based methods relies on the availability of
a good estimate of the functional principal components for the functional parameter,
and may not be appropriate if the functional parameter cannot be represented effectively
by the leading principal components of the functional covariates (Yuan and Cai, 2010). On
the other hand, the truncation parameter in the FPCA changes in a discrete manner,
which may yield an imprecise control on the model complexity, as pointed out in
Ramsay and Silverman (2005). Furthermore, we impose the $\ell_0$ penalty on the scalar
predictors because the $\ell_0$ penalty function is usually a desired choice
among the penalty functions as it directly penalizes the cardinality of a model and
seeks the most parsimonious model explaining the data. However, it is nonconvex
and solving an exact $\ell_0$-penalized nonconvex optimization problem involves
exhaustive combinatorial best subset search, which is NP-hard and computationally
challenging (Zhao et al., 2019). We modify the computational algorithm in Huang
et al. (2018) to deal with the above difficulty and to accommodate the functional
predictor. Specifically, we proceed in three steps: (i) profiling out the functional
part by using the Representer theorem; (ii) simultaneously identifying the important
features and obtaining scalar estimates; and (iii) plugging the scalar estimates into
the loss function to derive the functional estimate. Meanwhile, we adapt the test
statistic in Li and Zhu (2020) to test the significance of the functional variable. The
implementation R code with its documentation is available as an online supplement.
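The three steps above can be illustrated with a simplified Python sketch. It is not the authors' R implementation and makes several assumptions that should be read as such: the RKHS is generated by a Gaussian kernel with bandwidth `h` (so the null space of the penalty is empty), the smoothing parameter `lam` and the sparsity level `s` are fixed rather than tuned, the intercept is removed by centering, and step (ii) uses plain iterative hard thresholding as a stand-in for the modified algorithm of Huang et al. (2018). With $\Sigma_{ij} = \int\!\!\int Z_i(t) K(t,u) Z_j(u)\,dt\,du$, profiling out the functional part under these assumptions turns the loss into $n^{-1} r^T P r$ with $r = Y - X\beta$ and $P = n\lambda(\Sigma + n\lambda I)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m, s = 300, 40, 60, 3          # hypothetical sizes; s = assumed sparsity level
t = np.linspace(0.0, 1.0, m)
w = np.full(m, t[1] - t[0])
w[[0, -1]] *= 0.5                    # trapezoidal quadrature weights

# Toy data from model (1); all generating values are assumptions.
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = [3.0, -2.5, 2.0]
basis = np.sqrt(2.0) * np.cos(np.pi * np.outer(np.arange(1, 6), t))
Z = rng.standard_normal((n, 5)) @ basis
xi = np.sin(2.0 * np.pi * t)
y = 0.5 + X @ beta + Z @ (xi * w) + 0.1 * rng.standard_normal(n)

# Centering removes the intercept from the fitting problem.
yc, Xc, Zc = y - y.mean(), X - X.mean(axis=0), Z - Z.mean(axis=0)

# Step (i): profile out the functional part via the representer theorem.
lam, h = 1e-3, 0.2                   # assumed smoothing parameter and kernel bandwidth
Kmat = np.exp(-(t[:, None] - t[None, :]) ** 2 / (2.0 * h ** 2))
ZW = Zc * w                          # curves times quadrature weights, shape (n, m)
Sigma = ZW @ Kmat @ ZW.T             # n x n matrix of double integrals
sig, V = np.linalg.eigh((Sigma + Sigma.T) / 2.0)
sig = np.clip(sig, 0.0, None)        # guard against tiny negative eigenvalues
P_half = (V * np.sqrt(n * lam / (sig + n * lam))) @ V.T   # symmetric square root of P
yt, Xt = P_half @ yc, P_half @ Xc    # profiled response and design

# Step (ii): iterative hard thresholding keeping s coefficients, then refit.
L = np.linalg.norm(Xt, 2) ** 2       # squared spectral norm gives a safe step size
b = np.zeros(p)
for _ in range(200):
    b = b + Xt.T @ (yt - Xt @ b) / L
    keep = np.argsort(np.abs(b))[-s:]
    mask = np.zeros(p, dtype=bool)
    mask[keep] = True
    b[~mask] = 0.0
support = np.flatnonzero(b)
beta_hat = np.zeros(p)
beta_hat[support] = np.linalg.lstsq(Xt[:, support], yt, rcond=None)[0]

# Step (iii): plug the scalar estimate back in to recover the slope function.
r = yc - Xc @ beta_hat
c = V @ ((V.T @ r) / (sig + n * lam))    # c = (Sigma + n*lam*I)^{-1} r
xi_hat = Kmat @ (ZW.T @ c)               # slope function estimate on the grid t
```

In practice the smoothing parameter and the sparsity level would be chosen by a data-driven criterion, and a kernel with a nontrivial null space would require an additional unpenalized component in step (i).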
Numerically, the proposed method is tested carefully on the simulated data. We
also provide theoretical properties of the estimators, including error bounds,
the asymptotic normality of the estimates of the nonzero scalar coefficients, and the
null limit distribution of the test statistic designed to test the nullity of the functional
variable. We apply PFLM to the ADNI dataset and carry out a thorough association
analysis between genetics, hippocampus and cognitive deficits. Different from the
existing analysis targeted to one or several cognitive measures, the proposed method
examines the joint effects of genetics and hippocampus on 13 cognitive variables observed
12 months after the baseline measurements, which measure different aspects of
cognitive function, and explores the shared and different heritability patterns of the
13 cognitive scores. We also investigate the effect of the baseline diagnosis information
on future cognitive outcomes, denoted by the yellow arrow in Fig 1. Analysis results
suggest that both the hippocampal and genetic data have heterogeneous effects on