Fused mean structure learning in data
integration with dependence
Emily C. Hector
Department of Statistics, North Carolina State University
Abstract
Motivated by image-on-scalar regression with data aggregated across multiple sites, we consider a setting in which multiple independent studies each collect multiple dependent vector outcomes, with potential mean model parameter homogeneity between studies and outcome vectors. To determine the validity of jointly analyzing these data sources, we must learn which of these data sources share mean model parameters. We propose a new model fusion approach that delivers improved flexibility, statistical performance and computational speed over existing methods. Our proposed approach specifies a quadratic inference function within each data source and fuses mean model parameter vectors in their entirety based on a new formulation of a pairwise fusion penalty. We establish theoretical properties of our estimator and propose an asymptotically equivalent weighted oracle meta-estimator that is more computationally efficient. Simulations and application to the ABIDE neuroimaging consortium highlight the flexibility of the proposed approach. An R package is provided for ease of implementation.
Keywords: Alternating direction method of multipliers, Generalized method of moments,
Homogeneity pursuit, Scalable computing.
The author thanks Dr. Andrew Whiteman for helpful discussions, Drs. Marie Davidian and Ryan Martin
for reading early manuscript drafts, and Dr. Lan Luo for R code implementing the quadratic inference
function sub-routine. The author is grateful to the participants of the ABIDE study, and the ABIDE study
organizers and members who aggregated, preprocessed and shared the ABIDE data.
1 Introduction
The development of methods to integrate mean regression models is crucial to unlocking
the scientific benefits expected from the analysis of massive data collected from multiple
sources. The utility of these methods, however, depends on first determining the validity of
joint mean regression analysis of multiple data sources (Sutton and Higgins, 2008; Liu et al.,
2015). Determining mean model parameter homogeneity, which we term mean homogeneity
structure, is of fundamental importance to generating meaningful results from data integration. Indeed, substantially erroneous conclusions may ensue from integrating data sources
that do not have homogeneous mean structures (Higgins and Thompson, 2002). We propose
a new fusion method to learn the mean homogeneity structure of multiple data sources and
determine the validity of data integration that delivers two key contributions to the existing
literature: (i) the generalization to multivariate generalized linear models from dependent
data sources and (ii) a new pairwise fusion penalty that estimates the homogeneity of data
sources rather than individual covariates from each data source.
This paper is motivated by the Autism Brain Imaging Data Exchange (ABIDE), a consortium of imaging sites across the USA and Europe that aggregated and openly shared neuroimaging data in participants with autism spectrum disorder (ASD) and neurotypical controls (Di Martino et al., 2014). For each participant in the USA and Europe, summary resting state functional Magnetic Resonance Imaging (rfMRI) outcomes are observed in 15 dependent brain regions. For each group of participants $k \in \{1, 2\}$ ($k = 1$: USA; $k = 2$: Europe), and each brain region $j \in \{1, \ldots, 15\}$, denote by $y_{ir,jk}$ the $r$th neuroimaging outcome in brain region $j$ for participant $i$ in group $k$. The marginal regression model $E(Y_{ir,jk}) = x_{ir,jk}^{\top} \beta_{jk}$ describes the mean-covariate relationship of interest in brain region $j$ and study $k$, with covariates $x_{ir,jk}$ including ASD status. The two central analytic goals are to estimate $\beta_{jk}$ and to learn similarities and differences in how the covariates relate to different brain regions in different populations through the homogeneity structure of $\{\beta_{jk}\}_{j,k=1}^{15,2}$. An example homogeneity structure is illustrated in Figure 1. Learning this structure enables practitioners to leverage homogeneity for improved estimation, and informs whether estimating one model on the combined data, or one marginal model for each brain region and cohort, is appropriate.
Figure 1: Example schematic of 15 brain regions for USA (left) and Europe (right) populations. Regions with the same letter have homogeneous mean structure.
More generally in this paper, we consider a complex data integration setting in which
multiple independent studies each collect multiple dependent vector outcomes. Potential
shared population structures, study design and biological function induce unknown mean
structure homogeneity between studies and outcome vectors. Most existing data fusion
methods are developed for univariate outcomes (Tibshirani et al., 2005; Tang and Song, 2016;
Shen et al., 2019) and linear models (Li et al., 2015; Ma and Huang, 2017; Tang et al., 2020b)
with independent data sources (Ke et al., 2015; Wang et al., 2016). Approaches developed
specifically for longitudinal and spatial data assume working independence between outcomes
(Li et al., 2019). These approaches do not allow for nonlinear modeling, and result in loss
of efficiency because they do not incorporate dependence within or between data sources.
They also fuse scalar elements of the parameter vector $\beta_{jk}$, which results in elements of a parameter vector in a single model being estimated from different data, and fails to provide the desired insights into the shared mean structure of different data sources. There are no suitable fusion methods that can fuse entire mean model parameter vectors $\beta_{jk}$, handle multivariate nonlinear models or account for dependence between data sources.
Indeed, a key desired outcome of the ABIDE analysis is to determine the validity of
jointly analyzing brain regions and populations. In practice, each data source is traditionally
believed to have homogeneous mean across its participants and outcomes, and data sources
are integrated as whole units, e.g. Glass (1976); Xie et al. (2011). Existing methods, however,
induce a homogeneity partition of covariates that results in estimation of separate elements in $\beta_{jk}$ from different data sources. This does not give a clear picture of the validity of
integrating data sources. A more useful approach would yield a homogeneity partition of
data sources rather than of individual covariate effects. To achieve this, we propose a new
formulation of a pairwise fusion penalty that fuses mean model parameter vectors in their
entirety, a phenomenon we refer to as model fusion. The resulting estimated homogeneity
partition of data sources directly informs the validity of data integrative approaches.
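To make the distinction concrete, the two fusion strategies can be contrasted schematically; the penalty $p_{\lambda}$ and the pairing below are generic placeholders, and the precise model fusion objective is formulated in Section 2. Covariate-wise fusion, as in existing methods, penalizes scalar differences summed over pairs of data sources, so different coordinates of $\beta_{jk}$ may end up pooled with different data sources:
\[
\sum_{(j,k) \neq (j',k')} \sum_{r=1}^{q} p_{\lambda}\bigl( \lvert \beta_{r,jk} - \beta_{r,j'k'} \rvert \bigr).
\]
Model fusion instead penalizes differences between whole parameter vectors, so a pair of data sources is either fused in its entirety or not at all:
\[
\sum_{(j,k) \neq (j',k')} p_{\lambda}\bigl( \lVert \beta_{jk} - \beta_{j'k'} \rVert_{2} \bigr).
\]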
To enable estimation in nonlinear models, we propose to estimate data source-specific
mean parameters using a quadratic inference function (QIF) (Qu et al., 2000). To leverage dependence between data sources, we propose to combine data source-specific QIF
and the new pairwise fusion penalty to form a penalized generalized method of moments
(GMM) objective function (Hansen, 1982; Caner, 2009; Caner and Zhang, 2014) that non-parametrically estimates dependence between data sources for optimal estimation efficiency.
This non-trivial extension requires careful theoretical consideration. Finally, we propose an
Alternating Direction Method of Multipliers (ADMM) (Boyd et al., 2010) implementation
with an integrated meta-estimator of the fused means in the spirit of Hector and Song (2021)
that optimally weights individual data source estimators. This weighted meta-estimator is
asymptotically equivalent to the penalized GMM estimator but more computationally effi-
cient. The resulting fusion method is flexible, efficient and computationally appealing, and
can be used, for example, to deliver new insights from massive biomedical studies or as a
substitute for meta-analysis in the presence of heterogeneity.
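Schematically, and only as a hedged sketch anticipating Section 2 (the moment functions $\bar{g}_N$, the weighting matrix $W_N$ and the penalty $p_{\lambda}$ below are placeholders whose exact forms are defined there), the penalized GMM criterion combines the stacked data source-specific QIF moment conditions with the pairwise model fusion penalty:
\[
\widehat{\beta} = \arg\min_{\beta} \; \bar{g}_N(\beta)^{\top} W_N \, \bar{g}_N(\beta) + \sum_{(j,k) \neq (j',k')} p_{\lambda}\bigl( \lVert \beta_{jk} - \beta_{j'k'} \rVert_{2} \bigr),
\]
where $\bar{g}_N(\beta)$ stacks the QIF moment vectors from the $JK$ data sources and $W_N$ is a data-driven weight matrix whose blocks reflect the dependence between the $J$ outcome vectors within each study, which is how dependence between data sources enters the estimation nonparametrically.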
The rest of the paper is organized as follows. Section 2 establishes the formal problem
setup, describes the QIF construction in each data source and formulates the model fusion
objective function. Section 3 discusses large sample properties. Section 4 presents the
ADMM implementation details and the integrated meta-estimator. Section 5 evaluates the
proposed methods with simulations. Section 6 presents the ABIDE data analysis. Proofs,
implementation details, additional simulations, ABIDE information and an R package are
provided in the Supplementary Material.
2 Joint Integrative Analysis of Multiple Data Sources
2.1 Notation and Problem Setup
Define $a^{\otimes 2}$ the outer product of a vector $a$ with itself, namely $a^{\otimes 2} = a a^{\top}$. Let $(x)_{+} = x$ if $x > 0$ and $(x)_{+} = 0$ otherwise. We consider $K$ independent studies with respective sample sizes $\{n_k\}_{k=1}^{K}$. In each study we observe $J$ dependent $m_j$-dimensional vector outcomes $y_{i,jk} = (y_{i1,jk}, \ldots, y_{im_j,jk})$ and covariates $x_{ir,jk} \in \mathbb{R}^{q}$, $r = 1, \ldots, m_j$, $j = 1, \ldots, J$, for each participant $i = 1, \ldots, n_k$ in study $k$, $k = 1, \ldots, K$. Here, $x_{i,jk} = (x_{ir,jk}^{\top})_{r=1}^{m_j}$ is a $q \times m_j$ covariate matrix assumed to be the study- and outcome-specific observations on the same variables across outcomes and studies. Generalization to participant-specific response dimensions $m_{i,j}$ of $y_{i,jk}$ is straightforward but omitted for clarity. Participants are assumed independent. This results in a collection of $JK$ data sources that are independent across index $k = 1, \ldots, K$ but dependent across index $j = 1, \ldots, J$. Such a collection arises, for example, when multiple studies collect multiple dependent outcomes on participants, such as high-dimensional longitudinal phenotypes, pathway-networked omics biomarkers or brain imaging measurements, which collectively form one high-dimensional dependent response vector.
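For concreteness, a minimal R sketch of this data layout with small, hypothetical dimensions (the numbers and the list-of-lists representation are illustrative choices, not part of the paper's notation) shows the $JK$ data sources and their indexing, independent across studies $k$ and dependent across outcomes $j$ within a study:

# Hypothetical dimensions: K = 2 studies, J = 3 outcome vectors, q = 4 covariates.
K <- 2; J <- 3; q <- 4
n <- c(100, 120)   # study sample sizes n_k
m <- c(5, 6, 7)    # outcome dimensions m_j
set.seed(1)
# y[[k]][[j]] : n_k x m_j matrix whose i-th row is y_{i,jk}
y <- lapply(seq_len(K), function(k) lapply(seq_len(J), function(j)
  matrix(rnorm(n[k] * m[j]), n[k], m[j])))
# x[[k]][[j]] : n_k x m_j x q array whose (i, r, ) slice is x_{ir,jk}
x <- lapply(seq_len(K), function(k) lapply(seq_len(J), function(j)
  array(rnorm(n[k] * m[j] * q), dim = c(n[k], m[j], q))))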
Consider the generalized linear model for the mean response-covariate relationship of interest, $E(Y_{ir,jk}) = \mu_{ir,jk} = h(x_{ir,jk}^{\top} \beta_{jk})$, $r = 1, \ldots, m_j$, where $\beta_{jk} \in \mathbb{R}^{q}$ is the parameter vector of interest. Partial homogeneity of the mean structures of different outcomes is common, for example because of shared biological function (e.g. metabolic pathways) (Hector and Song, 2021). Similarly, partial homogeneity of the mean structures of different studies is common, for example because of similar populations, study designs and protocols (Liu et al., 2015). We posit that there is an unknown partition $\mathcal{P} = \{\mathcal{P}_g\}_{g=1}^{G}$ of $\{(j,k)\}_{j,k=1}^{J,K}$ such that $\beta_{jk} \equiv \theta_g$ for all $(j,k) \in \mathcal{P}_g$, for some parameter $\theta = (\theta_g)_{g=1}^{G} \in \mathbb{R}^{Gq}$. Let $\beta = (\beta_{jk})_{j,k=1}^{J,K} \in \mathbb{R}^{JKq}$, and denote by $\theta_{g0} = (\theta_{r,g0})_{r=1}^{q} \in \mathbb{R}^{q}$ the true value of $\theta_g$. Let $|\mathcal{P}_{\max}| = \max_{g=1,\ldots,G} |\mathcal{P}_g|$ and $|\mathcal{P}_{\min}| = \min_{g=1,\ldots,G} |\mathcal{P}_g|$. When $\mathcal{P}$ is known, we define $\Pi \in \mathbb{R}^{JKq \times Gq}$ in the Appendix such that $\beta = \Pi\theta$ and $\beta_0 = \Pi\theta_0$, with $\theta_0 = (\theta_{g0})_{g=1}^{G} \in \mathbb{R}^{Gq}$ denoting the true value of $\theta$. Letting $\Pi_{jk}$ be the $q$ rows of $\Pi$ corresponding to data source $(j,k)$, we can also rewrite $\beta_{jk} = \Pi_{jk}\theta$. Letting $\Pi_{r,jk}$ be the $r$th row of $\Pi_{jk}$, finally we have $\beta_{r,jk} = \Pi_{r,jk}\theta$.
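As a small worked example (with a hypothetical partition and an ordering of the data sources chosen purely for illustration; the formal definition of $\Pi$ is in the Appendix), the expansion matrix can be built from a known partition so that $\beta = \Pi\theta$ holds by construction:

# Build Pi in R^{JKq x Gq} from a known partition of the JK data sources.
# 'partition' assigns each data source (enumerated in some fixed order over
# (j, k), a labeling choice made here for illustration) to its group g.
build_Pi <- function(partition, q) {
  G <- max(partition)
  JK <- length(partition)
  Pi <- matrix(0, JK * q, G * q)
  for (s in seq_len(JK)) {
    rows <- ((s - 1) * q + 1):(s * q)
    cols <- ((partition[s] - 1) * q + 1):(partition[s] * q)
    Pi[rows, cols] <- diag(q)  # identity block encodes beta_{jk} = theta_g
  }
  Pi
}
# Example: JK = 4 data sources in G = 2 groups, q = 3 covariates.
Pi <- build_Pi(partition = c(1, 1, 2, 2), q = 3)
theta <- rnorm(2 * 3)
beta <- Pi %*% theta  # each beta_{jk} block equals its group's theta_g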