

A Partially Functional Linear Modeling Framework for Integrating Genetic, Imaging, and Clinical Data

Ting Li∗1, Yang Yu∗2, J. S. Marron2,3, and Hongtu Zhu2,3,4,5,6

1School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China

Departments of 2Statistics, 3Biostatistics, 4Genetics, and 5Computer Science and 6Biomedical Research Imaging Center, University of North Carolina at Chapel Hill, Chapel Hill

∗These authors contributed equally: Ting Li and Yang Yu. Address for correspondence: Hongtu Zhu, Ph.D., Email: htzhu@email.unc.edu. Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf.

arXiv:2210.01084v2 [stat.ME] 22 Feb 2023

Abstract

This paper is motivated by the joint analysis of genetic, imaging, and clinical (GIC) data collected in the Alzheimer’s Disease Neuroimaging Initiative (ADNI) study. We propose a regression framework based on partially functional linear regression models to map high-dimensional GIC-related pathways for Alzheimer’s Disease (AD). We develop a joint model selection and estimation procedure by embedding imaging data in the reproducing kernel Hilbert space and imposing the $\ell_0$ penalty on the coefficients of genetic variables. We apply the proposed method to the ADNI dataset to identify important features from tens of thousands of genetic polymorphisms (reduced from millions using a preprocessing step) and study the effects of a certain set of informative genetic variants and the baseline hippocampus surface on thirteen future cognitive scores measuring different aspects of cognitive function. We explore the shared and different heritability patterns of these cognitive scores. Analysis results suggest that both the hippocampal and genetic data have heterogeneous effects on different scores, with the trend that the value of both hippocampi is negatively associated with the severity of cognitive deficits. Polygenic effects are observed for all thirteen cognitive scores. The well-known APOE4 genotype explains only a small part of the cognitive function. Shared genetic etiology exists; however, greater genetic heterogeneity exists within disease classifications after accounting for the baseline diagnosis status. These analyses are useful in further investigation of functional mechanisms for AD evolution.

Keywords: Clinical; Genetics; Imaging; Non-asymptotic error bounds; Partially functional linear regression; Reproducing kernel Hilbert space; Sparsity.

1 Introduction

Alzheimer’s disease (AD) is a chronic neurodegenerative disease that causes degeneration of brain cells and a decline in thinking, behavioral, and social skills. It involves cognitive impairment with substantial between-patient variability in clinical presentation as well as in the burden and distribution of pathology. Such clinicopathologic heterogeneity presents both challenges and opportunities for carrying out systematic and biomarker-based studies to refine our understanding of AD biology, diagnosis, and management (Duong et al., 2022). AD has complex pathophysiological mechanisms that are not completely understood. Advances in biomarker identification, including genetic and imaging data, may improve the identification of individuals at risk for AD before symptom onset.

The primary aim of this study is to use Genetic, Imaging, and Clinical (GIC) variables from the ADNI study to map the biological pathways of AD-related phenotypes of interest (e.g., cognition, intelligence, disease stage, impairment score, and progression status) (Sudlow et al., 2015; Elliott et al., 2018). Such mapping may provide insights into the biological processes of brain development, healthy aging, and disease progression. For instance, it is of great interest to integrate GIC data to elucidate the environmental, social, and genetic etiologies of intelligence and to delineate the foundation of intelligence differences in brain structure and functioning (Deary et al., 2022). Moreover, many brain-related disorders, including AD, are often caused by a combination of multiple genetic and environmental factors, while being the endpoints of abnormalities of brain structure and function (Shen and Thompson, 2019; Zhao et al., 2019; Knutson et al., 2020). A thorough understanding of such neuro-biological pathways may lead to the identification of possibly hundreds of risk genes, environmental risk factors, and brain structures and functions that underlie brain disorders. Once such identification has been accomplished, it may be possible to detect these risk genes, risk factors, and brain abnormalities early enough to make a real difference in outcome and to develop related treatments, ultimately preventing the onset of brain-related disorders and reducing their severity.

We extract clinical, imaging, and genetic variables from the ADNI study. These include cognitive scores for quantifying behavioral deficits, ultra-high dimensional genetic covariates, other demographic covariates at baseline, and brain structures obtained from brain imaging. As previous studies have shown that the hippocampus is particularly vulnerable to AD pathology and has become a major focus in AD (Braak and Braak, 1998), we characterize the exposure of interest, hippocampal shape, by the left/right hippocampal morphometry surface data, each represented as a 100×150 matrix. We give a detailed data description in Section 2. Exploring how human brains and genetics connect to human behavior is a central goal in medical studies. We are interested in how hippocampal shape and genetics are associated with future cognitive deficits in the study of Alzheimer’s disease. The special data structure of these GIC variables presents new challenges for mapping the GIC pathway. First, conventional statistical tools that deal with scalar exposures are not applicable to 2D high-dimensional hippocampal imaging measures. Second, the dimension of the genetic covariates is much larger than the sample size. An effective statistical method that can exploit the 2D hippocampal surface data and the ultra-high dimensional genetic data to map the GIC pathway is urgently needed.

The literature on the analysis of imaging genetics has proliferated over the past decade. There are roughly four categories of statistical methods for such analysis. The first identifies genetic risk factors for a scalar phenotype of interest through genome-wide association studies, such as Carrasquillo et al. (2009), Bertram and Tanzi (2012), and Lo et al. (2019). The second is the analysis of neuroimaging data, ranging from acquiring raw neuroimaging data and locating brain activity to predicting psychological, psychiatric, or cognitive states (Lindquist, 2008). The statistical tools for detecting associations between a scalar phenotype of interest and imaging data include voxelwise regression (Zhou et al., 2014), the functional data analysis approach (Reiss and Ogden, 2010), and tensor regression models that exploit the array structure in imaging data (Zhou et al., 2013; Wang et al., 2017; Li and Zhang, 2021). The third investigates the effects of genetic variations on imaging phenotypes. Blokland et al. (2012) and Zhao et al. (2019) used imaging traits as phenotypes and quantified the effects of genetics on the structure and function of the human brain. The fourth maps biological pathways linking genetics and imaging data to neuropsychiatric disorders and examines the joint effects of both genetic risk factors and imaging data, which remains challenging and has not been studied systematically compared to the above three categories (Zhu et al., 2022). Most of the existing methods first extract features from the imaging data and then focus on the effects of the obtained features and genetic data; see Dukart et al. (2016), Ossenkoppele et al. (2021), Cruciani et al. (2022), and references therein. Such approaches ignore the rich smoothness information in the imaging data.

Figure 1: Directed acyclic graph showing potential relationships between the genetic data, the imaging data, and the future outcome. The colored arrows denote the associations of interest.

To map GIC-related pathways, we consider a high-dimensional Partially Functional Linear Model (PFLM) as follows:

$$Y_i = \alpha + X_i^T \beta + \int_{\mathcal{T}} Z_i(t)\,\xi(t)\,dt + \epsilon_i \quad \text{for } i = 1, \ldots, n, \tag{1}$$

where $Y_i$ is a continuous phenotype of interest for subject $i$, $X_i \in \mathcal{X}$ is a $p \times 1$ vector of genetic and environmental variables, and $Z_i(t) \in L_2(\mathcal{T})$ is an imaging (or functional) predictor over a compact set $\mathcal{T}$. Moreover, $\alpha$ is the intercept term, $\beta$ is a $p \times 1$ vector of coefficients, $\xi(t)$ is an unknown slope function, which is assumed to be in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, and the $\epsilon_i$ are measurement errors. We consider the case in which the dimension of $\beta$ is either comparable to or much larger than the sample size $n$ and $\xi(t)$ is an infinite-dimensional function. Our statistical problem of interest is to make statistical inference on $\beta$ and $\{\xi(t) : t \in \mathcal{T}\}$. As illustrated in Fig. 1, there exist genetic and clinical confounders that affect both hippocampal shape and behavioral deficits (Selkoe and Hardy, 2016). Compared to the classical high-dimensional linear model that only considers the genetic data, the inclusion of the imaging exposure $\xi(t)$ has two important implications for the ADNI dataset. First, model (1) is able to quantify the direct effects of the confounders by controlling for the imaging exposure. Second, model (1) investigates the influence of the imaging exposure while controlling for the confounders and preserving the structure of the imaging data.
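To make model (1) concrete, the following Python sketch simulates data from a discretized version of it. It is an illustration under stated assumptions and not the authors' simulation design: the domain $\mathcal{T} = [0, 1]$, the cosine basis used to generate $Z_i(t)$, the slope $\xi(t) = \sin(2\pi t)$, the sparsity pattern of $\beta$, and the function name `simulate_pflm` are all hypothetical choices. The integral is approximated by the trapezoidal rule on an equally spaced grid.

```python
import numpy as np


def simulate_pflm(n=200, p=50, m=101, s=3, alpha=1.0, noise_sd=0.1, seed=0):
    """Simulate from a discretized version of model (1).

    All design choices (grid, basis, slope function, sparsity) are
    illustrative assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)

    # Equally spaced grid on the assumed domain T = [0, 1].
    t = np.linspace(0.0, 1.0, m)
    dt = t[1] - t[0]
    # Trapezoidal quadrature weights: half weight at the two end points.
    w = np.full(m, dt)
    w[0] = w[-1] = dt / 2.0

    # Scalar covariates and a sparse coefficient vector (first s entries nonzero).
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:s] = 1.0

    # Functional predictors: random combinations of five cosine basis functions.
    basis = np.stack([np.sqrt(2.0) * np.cos(k * np.pi * t) for k in range(1, 6)])
    Z = rng.standard_normal((n, 5)) @ basis  # shape (n, m)

    # Illustrative slope function xi(t).
    xi = np.sin(2.0 * np.pi * t)

    # Integral of Z_i(t) * xi(t) over T, approximated by quadrature.
    functional_part = Z @ (w * xi)

    eps = noise_sd * rng.standard_normal(n)
    Y = alpha + X @ beta + functional_part + eps
    return {"Y": Y, "X": X, "Z": Z, "t": t, "w": w,
            "alpha": alpha, "beta": beta, "xi": xi}
```

With `noise_sd=0`, the returned `Y` equals `alpha + X @ beta` plus the quadrature approximation of the integral term, which gives a quick check of the discretization.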

The literature on PFLM with high-dimensional scalar covariates is scarce, with a few exceptions. Kong et al. (2016) studied PFLM in high dimensions, in which the dimension of the scalar covariates was allowed to diverge with $n$. Yao et al. (2017) developed a regularized partially functional quantile regression model, while allowing the number of scalar predictors to increase with the sample size. Ma et al. (2019) focused on the partial functional partial regression model in ultra-high dimensions with a diverging number of scalar predictors. All three of the above methods consist of three steps: representing the functional predictors by their leading functional principal components (FPCs), reducing PFLM to a standard high-dimensional linear regression model, and selecting important features through the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001). Therefore, existing approaches rely heavily on the success of the functional principal component analysis (FPCA) approach (Wang et al., 2016).
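The FPCA-based reduction described above can be sketched as follows. This is a minimal illustration, assuming curves observed on a common equally spaced grid (so quadrature weights are ignored); the helper names `fpc_scores` and `reduce_to_linear_design` are hypothetical, and the subsequent SCAD selection step is not shown.

```python
import numpy as np


def fpc_scores(Z, K):
    """Scores of the discretized curves on their K leading principal components.

    Z has shape (n, m): n curves observed on a common grid of m points.
    """
    Zc = Z - Z.mean(axis=0)  # centre each grid point across subjects
    # Rows of Vt are the (discretized) principal component directions.
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    return Zc @ Vt[:K].T  # shape (n, K)


def reduce_to_linear_design(X, Z, K):
    """Replace the functional predictor by K FPC scores and append them to X.

    The result is the design matrix of a standard high-dimensional linear
    model, to which a penalized selection method could then be applied.
    """
    return np.hstack([X, fpc_scores(Z, K)])
```

Note that the truncation level `K` can only be varied in integer steps.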

In this paper, we focus on the high-dimensional PFLM (1), develop a procedure for joint model selection and estimation, investigate theoretical properties of both the functional and scalar estimators, and apply the proposed method to analyze the ADNI dataset. We use the RKHS framework (Yuan and Cai, 2010; Cai and Yuan, 2012; Li and Zhu, 2020) and impose the roughness penalty on the functional coefficient. The success of the existing FPCA-based methods relies on the availability of a good estimate of the functional principal components for the functional parameter, and these methods may not be appropriate if the functional parameter cannot be represented effectively by the leading principal components of the functional covariates (Yuan and Cai, 2010). Moreover, the truncation parameter in FPCA changes in a discrete manner, which may yield imprecise control of the model complexity, as pointed out in Ramsay and Silverman (2005). Furthermore, we impose the $\ell_0$ penalty on the scalar predictors because the $\ell_0$ penalty is usually a desirable choice among penalty functions: it directly penalizes the cardinality of a model and seeks the most parsimonious model explaining the data. However, it is nonconvex, and solving an exact $\ell_0$-penalized nonconvex optimization problem involves an exhaustive combinatorial best subset search, which is NP-hard and computationally challenging (Zhao et al., 2019). We modify the computational algorithm in Huang et al. (2018) to deal with this difficulty and to accommodate the functional predictor. Specifically, we proceed in three steps: (i) profiling out the functional part by using the representer theorem; (ii) simultaneously identifying the important features and obtaining scalar estimates; and (iii) plugging the scalar estimates into the loss function to derive the functional estimate. Meanwhile, we adapt the test statistic in Li and Zhu (2020) to test the significance of the functional variable. The R code implementing the method, together with its documentation, is available as an online supplement.
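The three steps above can be sketched in Python for a discretized functional predictor. This is a hedged illustration and not the authors' implementation (their R code is in the online supplement). It assumes centred data so that the intercept is dropped, the criterion $n^{-1}\|Y - X\beta - \Sigma c\|^2 + \lambda c^T \Sigma c$ subject to $\|\beta\|_0 \le s$ with the full RKHS norm as roughness penalty, a user-supplied kernel matrix `Kmat` evaluated on the grid, and a simple hard-thresholding loop used as a stand-in for the modified algorithm of Huang et al. (2018). The function name `fit_pflm_sketch` and all argument names are hypothetical.

```python
import numpy as np


def fit_pflm_sketch(Y, X, Z, w, Kmat, s, lam, n_iter=50):
    """Illustrative three-step fit for a discretized PFLM (assumes lam > 0).

    Y: (n,) centred responses; X: (n, p) centred scalar covariates;
    Z: (n, m) centred curves on a grid; w: (m,) quadrature weights;
    Kmat: (m, m) reproducing kernel evaluated on the grid (assumed given);
    s: number of scalar coefficients allowed to be nonzero.
    """
    n, p = X.shape
    ZW = Z * w                      # quadrature-weighted curves
    # Sigma[i, j] approximates the double integral of Z_i(t) K(t, u) Z_j(u).
    Sigma = ZW @ Kmat @ ZW.T
    evals, V = np.linalg.eigh(Sigma)
    evals = np.clip(evals, 0.0, None)

    # Step (i): profile out the functional part. By the representer theorem
    # xi(t) = sum_i c_i * integral of K(t, u) Z_i(u) du. For fixed beta the
    # optimal c is (Sigma + n*lam*I)^{-1} (Y - X beta), and the profiled loss
    # is r' M r with r = Y - X beta and M = lam * (Sigma + n*lam*I)^{-1}.
    half = V * np.sqrt(lam / (evals + n * lam))   # M^{1/2} = half @ V.T
    Yt = half @ (V.T @ Y)
    Xt = half @ (V.T @ X)

    # Step (ii): hard-thresholding loop (a stand-in, not the authors' method).
    # Keep the s largest entries of beta + d, then refit by least squares.
    col_sq = np.maximum((Xt ** 2).sum(axis=0), 1e-12)
    beta = np.zeros(p)
    support = np.array([], dtype=int)
    for _ in range(n_iter):
        d = Xt.T @ (Yt - Xt @ beta) / col_sq
        new_support = np.sort(np.argsort(-np.abs(beta + d))[:s])
        if np.array_equal(new_support, support):
            break
        support = new_support
        beta = np.zeros(p)
        beta[support] = np.linalg.lstsq(Xt[:, support], Yt, rcond=None)[0]

    # Step (iii): plug beta back in to recover the functional estimate.
    c = V @ ((V.T @ (Y - X @ beta)) / (evals + n * lam))
    xi_hat = Kmat @ (ZW.T @ c)      # xi evaluated on the grid
    return beta, xi_hat
```

The profiling step turns the joint problem into an ordinary sparse least-squares problem in the transformed pair `(Yt, Xt)`, so any subset-selection routine could replace the loop in step (ii). The values of `s` and `lam` would need to be tuned in practice, for example by cross-validation; that choice is not specified here.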

Numerically, the proposed method is tested carefully on simulated data. We also provide theoretical properties of the estimators, including error bounds, the asymptotic normality of the estimates of the nonzero scalar coefficients, and the null limit distribution of the test statistic designed to test the nullity of the functional variable. We apply PFLM to the ADNI dataset and carry out a thorough association analysis between genetics, the hippocampus, and cognitive deficits. Different from existing analyses targeted at one or several cognitive measures, the proposed method examines the joint effects of genetics and the hippocampus on 13 cognitive variables observed at 12 months after baseline measurements, which measure different aspects of cognitive function, and explores the shared and different heritability patterns of the 13 cognitive scores. We also investigate the effect of the baseline diagnosis information on future cognitive outcome, denoted by the yellow arrow in Fig. 1. Analysis results suggest that both the hippocampal and genetic data have heterogeneous effects on
