Fused mean structure learning in data
integration with dependence
Emily C. Hector
Department of Statistics, North Carolina State University
Abstract
Motivated by image-on-scalar regression with data aggregated across multiple sites, we consider a setting in which multiple independent studies each collect multiple dependent vector outcomes, with potential mean model parameter homogeneity between studies and outcome vectors. To determine the validity of jointly analyzing these data sources, we must learn which of these data sources share mean model parameters. We propose a new model fusion approach that delivers improved flexibility, statistical performance and computational speed over existing methods. Our proposed approach specifies a quadratic inference function within each data source and fuses mean model parameter vectors in their entirety based on a new formulation of a pairwise fusion penalty. We establish theoretical properties of our estimator and propose an asymptotically equivalent weighted oracle meta-estimator that is more computationally efficient. Simulations and application to the ABIDE neuroimaging consortium highlight the flexibility of the proposed approach. An R package is provided for ease of implementation.
Keywords: Alternating direction method of multipliers, Generalized method of moments,
Homogeneity pursuit, Scalable computing.
The author thanks Dr. Andrew Whiteman for helpful discussions, Drs. Marie Davidian and Ryan Martin
for reading early manuscript drafts, and Dr. Lan Luo for R code implementing the quadratic inference
function sub-routine. The author is grateful to the participants of the ABIDE study, and the ABIDE study
organizers and members who aggregated, preprocessed and shared the ABIDE data.
1 Introduction
The development of methods to integrate mean regression models is crucial to unlocking
the scientific benefits expected from the analysis of massive data collected from multiple
sources. The utility of these methods, however, depends on first determining the validity of
joint mean regression analysis of multiple data sources (Sutton and Higgins, 2008; Liu et al.,
2015). Determining mean model parameter homogeneity, which we term mean homogeneity
structure, is of fundamental importance to generating meaningful results from data integration. Indeed, substantially erroneous conclusions may ensue from integrating data sources
that do not have homogeneous mean structures (Higgins and Thompson, 2002). We propose
a new fusion method to learn the mean homogeneity structure of multiple data sources and
determine the validity of data integration that delivers two key contributions to the existing
literature: (i) the generalization to multivariate generalized linear models from dependent
data sources and (ii) a new pairwise fusion penalty that estimates the homogeneity of data
sources rather than individual covariates from each data source.
This paper is motivated by the Autism Brain Imaging Data Exchange (ABIDE), a consortium of imaging sites across the USA and Europe that aggregated and openly shared neuroimaging data in participants with autism spectrum disorder (ASD) and neurotypical controls (Di Martino et al., 2014). For each participant in the USA and Europe, summary resting state functional Magnetic Resonance Imaging (rfMRI) outcomes are observed in 15 dependent brain regions. For each group of participants $k \in \{1, 2\}$ ($k = 1$: USA; $k = 2$: Europe), and each brain region $j \in \{1, \ldots, 15\}$, denote by $y_{ir,jk}$ the $r$th neuroimaging outcome in brain region $j$ for participant $i$ in group $k$. The marginal regression model $E(Y_{ir,jk}) = x_{ir,jk}^{\top} \beta_{jk}$ describes the mean-covariate relationship of interest in brain region $j$ and study $k$, with covariates $x_{ir,jk}$ including ASD status. The two central analytic goals are to estimate $\beta_{jk}$ and to learn similarities and differences in how the covariates relate to different brain regions in different populations through the homogeneity structure of $\{\beta_{jk}\}_{j,k=1}^{15,2}$. An example homogeneity structure is illustrated in Figure 1. Learning this structure enables practitioners to leverage homogeneity for improved estimation, and informs whether estimating one model on the combined data, or one marginal model for each brain region and cohort, is appropriate.
Figure 1: Example schematic of 15 brain regions for USA (left) and Europe (right) populations. Regions with the same letter have homogeneous mean structure.
More generally in this paper, we consider a complex data integration setting in which
multiple independent studies each collect multiple dependent vector outcomes. Potential
shared population structures, study design and biological function induce unknown mean
structure homogeneity between studies and outcome vectors. Most existing data fusion
methods are developed for univariate outcomes (Tibshirani et al., 2005; Tang and Song, 2016;
Shen et al., 2019) and linear models (Li et al., 2015; Ma and Huang, 2017; Tang et al., 2020b)
with independent data sources (Ke et al., 2015; Wang et al., 2016). Approaches developed
specifically for longitudinal and spatial data assume working independence between outcomes
(Li et al., 2019). These approaches do not allow for nonlinear modeling, and result in loss
of efficiency because they do not incorporate dependence within or between data sources.
They also fuse scalar elements of the parameter vector $\beta_{jk}$, which results in elements of a parameter vector in a single model being estimated from different data, and fails to provide the desired insights into the shared mean structure of different data sources. There are no suitable fusion methods that can fuse entire mean model parameter vectors $\beta_{jk}$, handle multivariate nonlinear models or account for dependence between data sources.
Indeed, a key desired outcome of the ABIDE analysis is to determine the validity of
jointly analyzing brain regions and populations. In practice, each data source is traditionally
believed to have homogeneous mean across its participants and outcomes, and data sources
are integrated as whole units, e.g. Glass (1976); Xie et al. (2011). Existing methods, however,
induce a homogeneity partition of covariates that results in estimation of separate elements in $\beta_{jk}$ from different data sources. This does not give a clear picture of the validity of
integrating data sources. A more useful approach would yield a homogeneity partition of
data sources rather than of individual covariate effects. To achieve this, we propose a new
formulation of a pairwise fusion penalty that fuses mean model parameter vectors in their
entirety, a phenomenon we refer to as model fusion. The resulting estimated homogeneity
partition of data sources directly informs the validity of data integrative approaches.
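To make the distinction concrete, the two fusion strategies can be contrasted schematically; the penalty $p_{\lambda}$ and the pairing below are generic placeholders, and the precise model fusion objective is formulated in Section 2. Covariate-wise fusion, as in existing methods, penalizes scalar differences summed over pairs of data sources, so different coordinates of $\beta_{jk}$ may end up pooled with different data sources:
\[
\sum_{(j,k) \neq (j',k')} \sum_{r=1}^{q} p_{\lambda}\bigl( \lvert \beta_{r,jk} - \beta_{r,j'k'} \rvert \bigr).
\]
Model fusion instead penalizes differences between whole parameter vectors, so a pair of data sources is either fused in its entirety or not at all:
\[
\sum_{(j,k) \neq (j',k')} p_{\lambda}\bigl( \lVert \beta_{jk} - \beta_{j'k'} \rVert_{2} \bigr).
\]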
To enable estimation in nonlinear models, we propose to estimate data source-specific
mean parameters using a quadratic inference function (QIF) (Qu et al., 2000). To leverage dependence between data sources, we propose to combine data source-specific QIF
and the new pairwise fusion penalty to form a penalized generalized method of moments
(GMM) objective function (Hansen, 1982; Caner, 2009; Caner and Zhang, 2014) that non-parametrically estimates dependence between data sources for optimal estimation efficiency.
This non-trivial extension requires careful theoretical consideration. Finally, we propose an
Alternating Direction Method of Multipliers (ADMM) (Boyd et al., 2010) implementation
with an integrated meta-estimator of the fused means in the spirit of Hector and Song (2021)
that optimally weights individual data source estimators. This weighted meta-estimator is
asymptotically equivalent to the penalized GMM estimator but more computationally effi-
cient. The resulting fusion method is flexible, efficient and computationally appealing, and
can be used, for example, to deliver new insights from massive biomedical studies or as a
substitute for meta-analysis in the presence of heterogeneity.
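Schematically, and only as a hedged sketch anticipating Section 2 (the moment functions $\bar{g}_N$, the weighting matrix $W_N$ and the penalty $p_{\lambda}$ below are placeholders whose exact forms are defined there), the penalized GMM criterion combines the stacked data source-specific QIF moment conditions with the pairwise model fusion penalty:
\[
\widehat{\beta} = \arg\min_{\beta} \; \bar{g}_N(\beta)^{\top} W_N \, \bar{g}_N(\beta) + \sum_{(j,k) \neq (j',k')} p_{\lambda}\bigl( \lVert \beta_{jk} - \beta_{j'k'} \rVert_{2} \bigr),
\]
where $\bar{g}_N(\beta)$ stacks the QIF moment vectors from the $JK$ data sources and $W_N$ is a data-driven weight matrix whose blocks reflect the dependence between the $J$ outcome vectors within each study, which is how dependence between data sources enters the estimation nonparametrically.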
The rest of the paper is organized as follows. Section 2 establishes the formal problem
setup, describes the QIF construction in each data source and formulates the model fusion
objective function. Section 3 discusses large sample properties. Section 4 presents the
ADMM implementation details and the integrated meta-estimator. Section 5 evaluates the
proposed methods with simulations. Section 6 presents the ABIDE data analysis. Proofs,
implementation details, additional simulations, ABIDE information and an R package are
provided in the Supplementary Material.
2 Joint Integrative Analysis of Multiple Data Sources
2.1 Notation and Problem Setup
Define $a^{\otimes 2}$ the outer product of a vector $a$ with itself, namely $a^{\otimes 2} = a a^{\top}$. Let $(x)_{+} = x$ if $x > 0$ and $(x)_{+} = 0$ otherwise. We consider $K$ independent studies with respective sample sizes $\{n_k\}_{k=1}^{K}$. In each study we observe $J$ dependent $m_j$-dimensional vector outcomes $y_{i,jk} = (y_{i1,jk}, \ldots, y_{im_j,jk})$ and covariates $x_{ir,jk} \in \mathbb{R}^{q}$, $r = 1, \ldots, m_j$, $j = 1, \ldots, J$, for each participant $i = 1, \ldots, n_k$ in study $k$, $k = 1, \ldots, K$. Here, $x_{i,jk} = (x_{ir,jk}^{\top})_{r=1}^{m_j}$ is a $q \times m_j$ covariate matrix assumed to be the study- and outcome-specific observations on the same variables across outcomes and studies. Generalization to participant-specific response dimensions $m_{i,j}$ of $y_{i,jk}$ is straightforward but omitted for clarity. Participants are assumed independent. This results in a collection of $JK$ data sources that are independent across index $k = 1, \ldots, K$ but dependent across index $j = 1, \ldots, J$. Such a collection arises, for example, when multiple studies collect multiple dependent outcomes on participants, such as high-dimensional longitudinal phenotypes, pathway-networked omics biomarkers or brain imaging measurements, which collectively form one high-dimensional dependent response vector.
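For concreteness, a minimal R sketch of this data layout with small, hypothetical dimensions (the numbers and the list-of-lists representation are illustrative choices, not part of the paper's notation) shows the $JK$ data sources and their indexing, independent across studies $k$ and dependent across outcomes $j$ within a study:

# Hypothetical dimensions: K = 2 studies, J = 3 outcome vectors, q = 4 covariates.
K <- 2; J <- 3; q <- 4
n <- c(100, 120)   # study sample sizes n_k
m <- c(5, 6, 7)    # outcome dimensions m_j
set.seed(1)
# y[[k]][[j]] : n_k x m_j matrix whose i-th row is y_{i,jk}
y <- lapply(seq_len(K), function(k) lapply(seq_len(J), function(j)
  matrix(rnorm(n[k] * m[j]), n[k], m[j])))
# x[[k]][[j]] : n_k x m_j x q array whose (i, r, ) slice is x_{ir,jk}
x <- lapply(seq_len(K), function(k) lapply(seq_len(J), function(j)
  array(rnorm(n[k] * m[j] * q), dim = c(n[k], m[j], q))))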
Consider the generalized linear model for the mean response-covariate relationship of interest, $E(Y_{ir,jk}) = \mu_{ir,jk} = h(x_{ir,jk}^{\top} \beta_{jk})$, $r = 1, \ldots, m_j$, where $\beta_{jk} \in \mathbb{R}^{q}$ is the parameter vector of interest. Partial homogeneity of the mean structures of different outcomes is common, for example because of shared biological function (e.g. metabolic pathways) (Hector and Song, 2021). Similarly, partial homogeneity of the mean structures of different studies is common, for example because of similar populations, study designs and protocols (Liu et al., 2015). We posit that there is an unknown partition $\mathcal{P} = \{\mathcal{P}_g\}_{g=1}^{G}$ of $\{(j,k)\}_{j,k=1}^{J,K}$ such that $\beta_{jk} \equiv \theta_g$ for all $(j,k) \in \mathcal{P}_g$, for some parameter $\theta = (\theta_g)_{g=1}^{G} \in \mathbb{R}^{Gq}$. Let $\beta = (\beta_{jk})_{j,k=1}^{J,K} \in \mathbb{R}^{JKq}$, and denote by $\theta_{g0} = (\theta_{r,g0})_{r=1}^{q} \in \mathbb{R}^{q}$ the true value of $\theta_g$. Let $|\mathcal{P}_{\max}| = \max_{g=1,\ldots,G} |\mathcal{P}_g|$ and $|\mathcal{P}_{\min}| = \min_{g=1,\ldots,G} |\mathcal{P}_g|$. When $\mathcal{P}$ is known, we define $\Pi \in \mathbb{R}^{JKq \times Gq}$ in the Appendix such that $\beta = \Pi\theta$ and $\beta_0 = \Pi\theta_0$, with $\theta_0 = (\theta_{g0})_{g=1}^{G} \in \mathbb{R}^{Gq}$ denoting the true value of $\theta$. Letting $\Pi_{jk}$ be the $q$ rows of $\Pi$ corresponding to data source $(j,k)$, we can also rewrite $\beta_{jk} = \Pi_{jk}\theta$. Letting $\Pi_{r,jk}$ be the $r$th row of $\Pi_{jk}$, finally we have $\beta_{r,jk} = \Pi_{r,jk}\theta$.
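As a small worked example (with a hypothetical partition and an ordering of the data sources chosen purely for illustration; the formal definition of $\Pi$ is in the Appendix), the expansion matrix can be built from a known partition so that $\beta = \Pi\theta$ holds by construction:

# Build Pi in R^{JKq x Gq} from a known partition of the JK data sources.
# 'partition' assigns each data source (enumerated in some fixed order over
# (j, k), a labeling choice made here for illustration) to its group g.
build_Pi <- function(partition, q) {
  G <- max(partition)
  JK <- length(partition)
  Pi <- matrix(0, JK * q, G * q)
  for (s in seq_len(JK)) {
    rows <- ((s - 1) * q + 1):(s * q)
    cols <- ((partition[s] - 1) * q + 1):(partition[s] * q)
    Pi[rows, cols] <- diag(q)  # identity block encodes beta_{jk} = theta_g
  }
  Pi
}
# Example: JK = 4 data sources in G = 2 groups, q = 3 covariates.
Pi <- build_Pi(partition = c(1, 1, 2, 2), q = 3)
theta <- rnorm(2 * 3)
beta <- Pi %*% theta  # each beta_{jk} block equals its group's theta_g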