
Graphical Model Inference with Erosely
Measured Data
Lili Zheng
Department of Electrical and Computer Engineering, Rice University
and
Genevera I. Allen
Department of Electrical and Computer Engineering, Rice University,
Department of Computer Science, Rice University,
Department of Statistics, Rice University,
Department of Pediatrics-Neurology, Baylor College of Medicine,
Jan and Dan Duncan Neurological Research Institute, Texas Children’s Hospital
Abstract
In this paper, we investigate the Gaussian graphical model inference problem in
a novel setting that we call erose measurements, referring to irregularly measured or
observed data. For graphs, this results in different node pairs having vastly different
sample sizes, a situation that frequently arises in data integration, genomics, neuroscience, and
sensor networks. Existing works characterize the graph selection performance using the
minimum pairwise sample size, which provides little insight for erosely measured data,
and no existing inference method is applicable. We aim to fill in this gap by propos-
ing the first inference method that characterizes the different uncertainty levels over
the graph caused by the erose measurements, named GI-JOE (Graph Inference when
Joint Observations are Erose). Specifically, we develop an edge-wise inference method
and an affiliated FDR control procedure, where the variance of each edge depends on
the sample sizes associated with corresponding neighbors. We prove statistical validity
under erose measurements, thanks to careful localized edge-wise analysis and disentan-
gling the dependencies across the graph. Finally, through simulation studies and a real
neuroscience data example, we demonstrate the advantages of our inference methods
for graph selection from erosely measured data.
Keywords: Uneven measurements, missing data, graph structure inference, FDR control,
graph selection
arXiv:2210.11625v2 [stat.ME] 14 May 2023
1 Introduction
Graphical models have been powerful and ubiquitous tools for understanding connection and
interaction patterns hidden in large-scale data [30], by exploiting the conditional dependence
relationships among a large number of variables. For instance, graphical models have been
applied to learn the connectivity among tens of thousands of neurons [48], gene expression
networks [2, 17], sensor networks [15, 14], among many others. The last decade has witnessed
a plethora of new statistical methods and theory proposed for various types of models in
this area, including the Gaussian graphical models [57, 39, 19, 37, 8], graphical models for
exponential families and mixed variables [55, 54, 11], Gaussian copula models [33, 34, 18],
etc.
Despite the abundant literature in this area, most existing methods and theory for graph-
ical models assume even measurements over the graph, where either all variables are mea-
sured simultaneously, or they are missing with similar probabilities. However, many real
large-scale data sets usually take the form of erose measurements, which are irregular over
the graph, and different pairs of variables may have drastically different sample sizes. Such
data sets frequently arise in genetics, neuroscience, sensor networks, among many others,
due to various technological limits.
1.1 Problem Setting and Motivating Applications
Consider the following sparse Gaussian graphical model: $x \sim \mathcal{N}(0, \Sigma^*)$, $\Theta^* = (\Sigma^*)^{-1}$,
where $\Theta^* \in \mathbb{R}^{p \times p}$ is the sparse precision matrix. The graph structure is dictated by the
nonzero patterns in $\Theta^*$: $G = (V, E)$, $V = \{1, \dots, p\}$, $E = \{(i, j) : \Theta^*_{ij} \neq 0\}$, where the
unknown edge set $E$ is of primary interest. Suppose that we only have access to the following
observations: $\{x_{i, V_i} : V_i \subseteq [p]\}_{i=1}^{n}$, where $V_i$ is the observed index set of data point $i$. Then
the joint observation set for node pair $(j, k)$ is $O_{jk} = \{i : j, k \in V_i\}$, of size $n_{jk} = |O_{jk}|$. There
are a number of applications where $n_{jk}$ can be drastically different.
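To make the definition of the pairwise sample sizes concrete, the following is a minimal sketch (not part of the GI-JOE method itself) that computes n_jk = |O_jk| for all node pairs from a binary observation pattern. The function name `pairwise_sample_sizes`, the array name `observed`, and the toy pattern are illustrative assumptions; row i of `observed` encodes the index set V_i.

```python
import numpy as np

def pairwise_sample_sizes(observed):
    """Return the p x p matrix of joint observation counts.

    observed: (n, p) boolean array; observed[i, j] is True when
    variable j belongs to the observed index set V_i (assumed encoding).
    """
    mask = np.asarray(observed, dtype=np.int64)
    # Entry (j, k) of mask.T @ mask counts the data points i
    # for which both j and k are observed, i.e. |O_jk|.
    return mask.T @ mask

# Hypothetical toy pattern: n = 4 data points, p = 3 variables.
observed = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
    [1, 0, 0],
], dtype=bool)

N = pairwise_sample_sizes(observed)
print(N)
```

In this toy pattern, variables 0 and 2 are observed together in only one data point while every other pair shares at least two, which is a small-scale version of the unevenness described above. The diagonal entries give the number of observations of each single variable.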
Heterogeneous missingness: In a variety of biological experiments, some variables could
be missing or have erroneous zero reads (dropouts) much more often than others, e.g., the
expression levels of certain genes [20, 24, 21], or the abundance of some microbes [52]. Figure
1 shows the observational patterns and pairwise sample sizes of two real single-cell RNA
sequencing (scRNA-seq) data sets, which are far from uniform.
[Figure 1 panels: (a) Real observational patterns; (b) Real pairwise sample sizes; (c) GI-JOE (FDR); (d) Naive FDR control with minimum sample size; (e) Baseline estimator]
Figure 1: Two erose measurement patterns in real scRNA-seq data sets [12, 13] are presented in
(a), (b), including the top 100 genes with the highest variances. The pairwise sample sizes range
from 0 to 1018 (chu data, left) and from 12 to 366 (darmanis, right). (c)-(e) present the graph
selection and inference results for a chain graph, when the data has the darmanis measurement
pattern. (c) is selected by our GI-JOE (FDR) approach and is the most accurate; (d) is obtained by
an ad hoc implementation of the debiased graphical lasso [25] that plugs in the minimum pairwise
sample size, which is too conservative and identifies no edge at all; (e) is the graph estimated by a
baseline approach [29], which plugs a covariance estimate into the graphical lasso, and the many
false positives suggest that the graph selection problem with such a data set is non-trivial.
Data integration / size-constrained measurements: Non-simultaneous and uneven
measurements also frequently arise from data integration and size-constrained measurements.
For instance, to better understand the neuronal circuits from neuronal functional activities,
one promising strategy is to estimate a large neuronal network [48, 10] from in vivo calcium
imaging data sets. However, to ensure a sufficient temporal resolution of the recording, the
spatial resolution is limited, putting a constraint on the number of neurons simultaneously
measured [3, 61], and neuron pairs that are further from each other are less likely to be
measured together. In genome-wide association studies (GWAS), it is also desirable to integrate
genomic data across multiple sources due to the limited sample size of each data set,
while these different sources might have different genomic coverage [7]. Similar measurement
constraints also arise in sensor networks where it is extremely expensive to synchronize a
large number of sensors [15, 14].
1.2 Limitations of Existing Works for Erose Measurements
To learn graphical models from erosely measured data, one might want to leverage the cur-
rent literature on graphical models with missing data [42, 29, 50, 38]. However, most of
these works assume the variables are missing independently with the same missing proba-
bility. While [38] allows for arbitrary missing probabilities and dependency in their problem
formulation, their theoretical guarantees still hinge on the minimum observational probability.
Using the minimum pairwise sample size over the whole graph to characterize the
performance of the graph learning result can be too coarse and provides little insight into
erosely measured data sets. Interestingly, one recent work [60] provides a localized theoretical
guarantee for neighborhood selection consistency, requiring only sample size conditions
imposed upon the corresponding neighbors instead of all node pairs. Such theoretical results
suggest that the estimation accuracy should vary over the graph when measurements are
erose, and a coarse characterization based on the minimum sample size would only provide
insights for the worst part of the graph estimate.
Inspired by this intuition, here arises one natural question: can we develop a statistical
inference method that quantifies the different uncertainty levels over the graph arising from
the erose measurements? Over the last decade, significant efforts have been devoted to the
statistical inference in high-dimensional settings, including techniques such as the debiased
Lasso [45, 59, 28], post-selection inference approaches [31, 44], knockoff methods [4, 9], and
various other FDR control methods [27, 36]. These techniques have been applied in regression
or classification problems, as well as in graphical models. However, these prior works mainly
consider simultaneous measurements across all variables [25, 40, 22, 56, 36, 26], which, in the
context of graphical models, would result in the same sample size across the entire graph;
or they consider the missing data setting where all variables are missing independently with
the same missing probability [5], still leading to approximately the same sample sizes. To
the best of our knowledge, there is no applicable statistical inference method for the general
observational patterns and erose measurements that we are considering. If practitioners want
to apply these existing inference methods with erosely measured data, they have to come up
with one single sample size quantity $n$ to determine the uncertainty levels for each edge. To
ensure the validity of the test, one ad hoc way might be to plug in the minimum pairwise
sample size, which can be extremely conservative and have no power (see Figure 1(d)).
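A rough numerical illustration of why plugging in the minimum pairwise sample size is conservative is sketched below. It relies on a simplifying assumption that is not taken from this paper: the standard error of an edge-wise statistic is taken to scale as 1/sqrt(n) in a single effective sample size n. The function name `relative_se` is hypothetical, and the two sample sizes are the extremes reported for the darmanis data in Figure 1.

```python
import math

def relative_se(n):
    # Heuristic assumption only: standard error proportional to 1 / sqrt(n).
    # This is not the edge-wise variance derived for GI-JOE.
    return 1.0 / math.sqrt(n)

# Range of pairwise sample sizes reported for the darmanis data (Figure 1).
n_min, n_max = 12, 366

# Factor by which the uncertainty of a well-measured edge is overstated
# when the minimum pairwise sample size is plugged in for every edge.
inflation = relative_se(n_min) / relative_se(n_max)
print(f"{inflation:.2f}")
```

Under this heuristic, an edge whose relevant node pairs are all measured 366 times would be assigned an uncertainty level roughly 5.5 times larger than warranted, which is consistent with the loss of power seen in Figure 1(d).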
The rest of the paper is organized as follows. We first review the set-ups and neighborhood
selection results from [60] in Section 2, which serve as an inspiration and basis for our graph
inference method under erose measurements. Our key contribution, the GI-JOE approach, is
introduced in Sections 3 and 4. In particular, Section 3 is devoted to the edge-wise inference
method: for any node pair, we characterize its type I error and power based on the
sample sizes involving the node pair’s neighbors. Section 4 focuses on the FDR control
procedure, which is also shown to be theoretically valid under appropriate conditions. The synthetic
and real data experiments are included in Section 5. We conclude with a discussion of some
open questions in Section 6.
Notations: For any matrix $A \in \mathbb{R}^{p_1 \times p_2}$, let $\|A\|_\infty = \max_{j,k} |A_{j,k}|$, $\|A\| = \sup_{\|u\|_2 = 1} \|Au\|_2$