Graphical Model Inference with Erosely Measured Data

Lili Zheng

Department of Electrical and Computer Engineering, Rice University

and

Genevera I. Allen

Department of Electrical and Computer Engineering, Rice University,

Department of Computer Science, Rice University,

Department of Statistics, Rice University,

Department of Pediatrics-Neurology, Baylor College of Medicine,

Jan and Dan Duncan Neurological Research Institute, Texas Children’s Hospital

Abstract

In this paper, we investigate the Gaussian graphical model inference problem in a novel setting that we call erose measurements, referring to irregularly measured or observed data. For graphs, this results in different node pairs having vastly different sample sizes, a situation that frequently arises in data integration, genomics, neuroscience, and sensor networks. Existing works characterize the graph selection performance using the minimum pairwise sample size, which provides little insight for erosely measured data, and no existing inference method is applicable. We aim to fill this gap by proposing the first inference method that characterizes the different uncertainty levels over the graph caused by the erose measurements, named GI-JOE (Graph Inference when Joint Observations are Erose). Specifically, we develop an edge-wise inference method and an affiliated FDR control procedure, where the variance of each edge depends on the sample sizes associated with corresponding neighbors. We prove statistical validity under erose measurements, thanks to careful localized edge-wise analysis and disentangling the dependencies across the graph. Finally, through simulation studies and a real neuroscience data example, we demonstrate the advantages of our inference methods for graph selection from erosely measured data.

Keywords: Uneven measurements, missing data, graph structure inference, FDR control, graph selection

arXiv:2210.11625v2 [stat.ME] 14 May 2023

1 Introduction

Graphical models have been powerful and ubiquitous tools for understanding connection and interaction patterns hidden in large-scale data [30], by exploiting the conditional dependence relationships among a large number of variables. For instance, graphical models have been applied to learn the connectivity among tens of thousands of neurons [48], gene expression networks [2, 17], and sensor networks [15, 14], among many others. The last decade has witnessed a plethora of new statistical methods and theory proposed for various types of models in this area, including Gaussian graphical models [57, 39, 19, 37, 8], graphical models for exponential families and mixed variables [55, 54, 11], and Gaussian copula models [33, 34, 18].

Despite the abundant literature in this area, most existing methods and theory for graphical models assume even measurements over the graph, where either all variables are measured simultaneously, or they are missing with similar probabilities. However, many real large-scale data sets take the form of erose measurements, which are irregular over the graph, so that different pairs of variables may have drastically different sample sizes. Such data sets frequently arise in genetics, neuroscience, and sensor networks, among many others, due to various technological limits.

1.1 Problem Setting and Motivating Applications

Consider the following sparse Gaussian graphical model: $x \sim \mathcal{N}(0, \Sigma^*)$, $\Theta^* = (\Sigma^*)^{-1}$, where $\Theta^* \in \mathbb{R}^{p \times p}$ is the sparse precision matrix. The graph structure is dictated by the nonzero patterns in $\Theta^*$: $G = (V, E)$, $V = \{1, \dots, p\}$, $E = \{(i, j) : \Theta^*_{ij} \neq 0\}$, where the unknown edge set $E$ is of primary interest. Suppose that we only have access to the following observations: $\{x_{i, V_i} : V_i \subseteq [p]\}_{i=1}^{n}$, where $V_i$ is the observed index set of data point $i$. Then the joint observation set for node pair $(j, k)$ is $O_{jk} = \{i : j, k \in V_i\}$ of size $n_{jk} = |O_{jk}|$. There are a number of applications where $n_{jk}$ can be drastically different.
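As a small illustration of the definitions above, the pairwise sample sizes $n_{jk}$ can be computed from a binary observation mask. The sketch below is not part of the paper: the function name `pairwise_sample_sizes`, the mask layout (rows are data points, columns are nodes), and the toy mask are all assumptions chosen for illustration.

```python
import numpy as np

def pairwise_sample_sizes(observed):
    """Return the p x p matrix of joint observation counts n_jk.

    `observed` is an (n, p) array whose entry (i, j) is nonzero when node j
    belongs to the observed index set V_i of data point i (assumed layout).
    """
    mask = (np.asarray(observed) != 0).astype(np.int64)
    # Entry (j, k) sums mask[i, j] * mask[i, k] over i, i.e. |{i : j, k in V_i}|.
    return mask.T @ mask

# Toy mask (hypothetical): 5 data points, 3 nodes.
observed = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
])
n_pairs = pairwise_sample_sizes(observed)
print(n_pairs)
```

In this toy mask, nodes 0 and 1 are jointly observed three times, while node 2 is jointly observed with either of them only once, which is the kind of imbalance across node pairs that the term erose refers to.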

Heterogeneous missingness: In a variety of biological experiments, some variables could be missing or have erroneous zero reads (dropouts) much more often than others, e.g., the expression levels of certain genes [20, 24, 21], or the abundance of some microbes [52]. Figure 1 shows the observational patterns and pairwise sample sizes of two real single-cell RNA sequencing (scRNA-seq) data sets, which are far from uniform.

[Figure 1 panel titles: (a) Real observational patterns; (b) Real pairwise sample sizes; "... with minimum sample size"; (e) Baseline estimator]

Figure 1: Two erose measurement patterns in real scRNA-seq data sets [12, 13] are presented in (a) and (b), including the top 100 genes with the highest variances. The pairwise sample sizes range from 0 to 1018 (chu data, left) and from 12 to 366 (darmanis data, right). Panels (c)-(e) present the graph selection and inference results for a chain graph when the data have the darmanis measurement pattern. (c) is selected by our GI-JOE (FDR) approach and is the most accurate; (d) is obtained by an ad hoc implementation of the debiased graphical lasso [25] that plugs in the minimum pairwise sample size, which is too conservative and identifies no edge at all; (e) is the graph estimated by a baseline approach [29], which plugs a covariance estimate into the graphical lasso, and the many false positives suggest that the graph selection problem with such a data set is non-trivial.

Data integration / size-constrained measurements: Non-simultaneous and uneven measurements also frequently arise from data integration and size-constrained measurements. For instance, to better understand the neuronal circuits from neuronal functional activities, one promising strategy is to estimate a large neuronal network [48, 10] from in vivo calcium imaging data sets. However, to ensure a sufficient temporal resolution of the recording, the spatial resolution is limited, putting a constraint on the number of neurons simultaneously measured [3, 61], and neuron pairs that are further from each other are less likely to be measured together. In genome-wide association studies (GWAS), it is also desirable to integrate genomic data across multiple sources due to the limited sample sizes of each data set, while these different sources might have different genomic coverage [7]. Similar measurement constraints also arise in sensor networks, where it is extremely expensive to synchronize a large number of sensors [15, 14].
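The size-constrained setting described above can be mimicked with a toy simulation. The following sketch is purely illustrative and rests on assumptions that are not in the paper: nodes are placed on a line, each recording session observes a contiguous window of nodes with a uniformly random start, and the function name `simulate_windowed_mask` and all parameter values are hypothetical.

```python
import numpy as np

def simulate_windowed_mask(n_sessions, p, window, seed=0):
    """Hypothetical recording scheme: each session observes `window` adjacent nodes."""
    rng = np.random.default_rng(seed)
    # Window start positions, uniform over all positions that fit inside 0..p-1.
    starts = rng.integers(0, p - window + 1, size=n_sessions)
    mask = np.zeros((n_sessions, p), dtype=np.int64)
    for i, s in enumerate(starts):
        mask[i, s:s + window] = 1
    return mask

p, window, n_sessions = 10, 4, 200
mask = simulate_windowed_mask(n_sessions, p, window)

# n_pairs[j, k] counts the sessions in which nodes j and k are observed together.
n_pairs = mask.T @ mask
distance = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))

print("pairs never observed together:", int((n_pairs == 0).sum()) // 2)
print("largest off-diagonal pairwise sample size:", int(n_pairs[distance > 0].max()))
```

Under this scheme, any two nodes at distance `window` or more are never observed together, so their pairwise sample size is zero, while nearby nodes tend to share many sessions.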

1.2 Limitations of Existing Works for Erose Measurements

To learn graphical models from erosely measured data, one might want to leverage the current literature on graphical models with missing data [42, 29, 50, 38]. However, most of these works assume the variables are missing independently with the same missing probability. While [38] allows for arbitrary missing probabilities and dependency in their problem formulation, their theoretical guarantees still hinge on the minimum observational probability. Using the minimum pairwise sample size over the whole graph to characterize the performance of the graph learning result can be too coarse and provides little insight into erosely measured data sets. Interestingly, one recent work [60] provides a localized theoretical guarantee for neighborhood selection consistency, requiring only sample size conditions imposed upon the corresponding neighbors instead of all node pairs. Such theoretical results suggest that the estimation accuracy should vary over the graph when measurements are erose, and a coarse characterization based on the minimum sample size would only provide insights for the worst part of the graph estimate.

Inspired by this intuition, one natural question arises: can we develop a statistical inference method that quantifies the different uncertainty levels over the graph arising from the erose measurements? Over the last decade, significant efforts have been devoted to statistical inference in high-dimensional settings, including techniques such as the debiased Lasso [45, 59, 28], post-selection inference approaches [31, 44], knockoff methods [4, 9], and various other FDR control methods [27, 36]. These techniques have been applied in regression or classification problems, as well as in graphical models. However, these prior works mainly consider simultaneous measurements across all variables [25, 40, 22, 56, 36, 26], which, in the context of graphical models, would result in the same sample size across the entire graph; or they consider the missing data setting where all variables are missing independently with the same missing probability [5], still leading to approximately the same sample sizes. To the best of our knowledge, there is no applicable statistical inference method for the general observational patterns and erose measurements that we are considering. If practitioners want to apply these existing inference methods to erosely measured data, they have to come up with one single sample size quantity $n$ to determine the uncertainty levels for each edge. To ensure the validity of the test, one ad hoc way might be to plug in the minimum pairwise sample size, which can be extremely conservative and have no power (see Figure 1(d)).
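To give a rough sense of why plugging in the minimum pairwise sample size can be so conservative, consider a back-of-the-envelope calculation. It assumes, purely for illustration, that the standard error of an edge statistic scales as $1/\sqrt{n}$ in the plugged-in sample size $n$; this is a simplification and not the variance formula of GI-JOE, whose edge-wise variance depends on the sample sizes associated with the neighbors of the node pair. The function name `se_inflation` is hypothetical; the sample sizes 12 and 366 are the range reported for the darmanis data in Figure 1.

```python
import math

def se_inflation(n_pair, n_min):
    """Ratio of the standard error computed with n_min to the one computed with n_pair.

    Assumes, for illustration only, that the standard error scales as 1/sqrt(n).
    """
    if n_pair <= 0 or n_min <= 0:
        raise ValueError("sample sizes must be positive")
    return math.sqrt(n_pair / n_min)

# Best-observed pair (366 joint samples) treated as if it had only the minimum (12).
ratio = se_inflation(n_pair=366, n_min=12)
print(f"standard error inflated by a factor of about {ratio:.2f}")
```

Under this simplified scaling, the best-observed node pair would be assigned a standard error more than five times larger than its own sample size warrants, which is consistent with the loss of power seen in Figure 1(d).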

The rest of the paper is organized as follows. We first review the set-ups and neighborhood selection results from [60] in Section 2, which serve as an inspiration and basis for our graph inference method under erose measurements. Our key contribution, the GI-JOE approach, is introduced in Sections 3 and 4. In particular, Section 3 is devoted to the edge-wise inference method, and for any node pair, we characterize its type I error and power based on the sample sizes involving the node pair's neighbors. Section 4 focuses on the FDR control procedure, also shown to be theoretically valid under appropriate conditions. The synthetic and real data experiments are included in Section 5. We conclude with a discussion of some open questions in Section 6.

Notations: For any matrix $A \in \mathbb{R}^{p_1 \times p_2}$, let $\|A\|_\infty = \max_{j,k} |A_{j,k}|$, $\|A\| = \sup_{\|u\|_2 = 1} \|Au\|_2$
