A flexible model for correlated count data with application to multi-condition differential expression analyses of single-cell RNA sequencing data

2025-04-27 0 0 7.49MB 56 页 10玖币
侵权投诉
A flexible model for correlated count data,
with application to multi-condition differential expression analyses
of single-cell RNA sequencing data
Yusha Liu1, Peter Carbonetto2,3, Michihiro Takahama4,5, Adam Gruenbaum6, Dongyue
Xie7, Nicolas Chevrier4, and Matthew Stephens2,7
1Department of Biostatistics, The University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
2Department of Human Genetics, The University of Chicago, Chicago, IL, USA
3Research Computing Center, The University of Chicago, Chicago, IL, USA
4Pritzker School of Molecular Engineering, The University of Chicago, Chicago, IL, USA
5Graduate School of Pharmaceutical Sciences, Osaka University, Osaka, Japan
6Institute for Health Metrics and Evaluation, University of Washington, Seattle, WA, USA
7Department of Statistics, The University of Chicago, Chicago, IL, USA
Abstract
Detecting differences in gene expression is an important part of single-cell RNA sequencing
experiments, and many statistical methods have been developed for this aim. Most differential
expression analyses focus on comparing expression between two groups (e.g., treatment vs.
control). But there is increasing interest in multi-condition differential expression analyses in
which expression is measured in many conditions and the aim is to accurately detect and estimate
expression differences in all conditions. We show that directly modeling single-cell RNA-seq
counts in all conditions simultaneously, while also inferring how expression differences are shared
across conditions, leads to greatly improved performance for detecting and estimating expression
differences compared to existing methods. We illustrate the potential of this new approach by
analyzing data from a single-cell experiment studying the effects of cytokine stimulation on
gene expression. We call our new method “Poisson multivariate adaptive shrinkage”, and it is
implemented in an R package available at https://github.com/stephenslab/poisson.mash.
alpha.
1 Introduction
Detecting differences in gene expression — that is, “differential expression” (DE) analysis — has
been a fundamental analysis aim ever since the introduction of technologies to measure gene expres-
sion (Soneson and Delorenzi,2013). As measurement technologies have improved, gene expression
data sets have increased in size and resolution, bringing new analysis challenges. The development
of RNA sequencing (RNA-seq) technologies (Wang et al.,2009) has greatly facilitated the mea-
surement of gene expression in “bulk” samples. More recently, the development of single-cell RNA
sequencing technologies (scRNA-seq) has allowed for rapid, high-throughput measurement of gene
mstephens@uchicago.edu
1
arXiv:2210.00697v3 [stat.ME] 26 May 2024
expression in individual cells, resulting in large datasets profiling gene expression in thousands of
cells (e.g., Zheng et al.,2017).
Increasingly, biologists are developing scRNA-seq experiments in which gene expression is as-
sayed in many experimental conditions. For example, in the motivating data set for this paper,
gene expression data were obtained for approximately 142,000 cells under 45 different treatment
conditions. In multi-condition data such as these, we seek to understand which changes in expres-
sion are specific to certain conditions (“condition-specific effects”), and which changes are shared
among two or more conditions (“shared effects”). In this paper, we develop methods to tackle these
aims — specifically, to detect which genes are differentially expressed, and to estimate the log-fold
changes (LFCs) among multiple conditions. While many methods exist for performing differential
expression analysis of scRNA-seq data, analyzing multi-condition scRNA-seq data raises at least
two key challenges that are not adequately addressed by existing methods.
First, when assessing expression across multiple conditions, many different patterns of differ-
ential expression are possible. For example, some genes may be differentially expressed in a single
condition (relative to all other conditions), while other genes may show similar expression differ-
ences in subsets of conditions. Typically, these patterns are unknown in advance, but one would
like to identify and exploit them to improve accuracy of the LFC estimates, and to improve power
to detect differentially expressed genes. To address this first challenge, we build on the empirical
Bayes (EB) method developed in Urbut et al. (2019), “multivariate adaptive shrinkage” (“mash” for
short), which is designed to model and adapt to effect-sharing patterns among conditions present
in the data.
Second, the data from scRNA-seq experiments are molecular counts, which are most naturally
modeled using count models such as Poisson measurement models (Townes et al.,2019;Sarkar
and Stephens,2021). However, there is no straightforward way to integrate a Poisson model with
mash because the Poisson model does not naturally provide summary statistics — effect estimates
and standard errors — that can be used by mash; in particular, estimates of standard errors are
unreliable in Poisson models (Robinson and Smyth,2008). An alternative would be to combine
mash with a Gaussian measurement model for log-transformed scRNA-seq counts (e.g., Finak et al.,
2015). However, as has been repeatedly pointed out (e.g., Lun,2018;Townes et al.,2019;Crowell
et al.,2020), this data transformation can lead to severe bias in the LFC estimates. This is
particularly an issue when many of the counts are zero or small, which is a common feature of
scRNA-seq data sets. This suggests that it would be desirable to combine mash with a model of
the scRNA-seq counts.
Therefore, to get the best of both worlds — improved accuracy achieved by exploiting patterns
of effect-sharing across conditions and the advantages of directly modeling the scRNA-seq counts
without first transforming them — we pursue an approach that models the scRNA-seq count data
jointly in all conditions. We call this new approach “Poisson mash” because it is based on a
Poisson model of the data, and, like mash, it improves accuracy in the effect estimates by flexibly
modeling the sharing of effects across conditions. Since the gains in accuracy will be greater as more
conditions with shared effects are included in the analysis, in this paper we focus on scRNA-seq
experiments in which gene expression is measured in many (e.g., dozens) conditions. Although its
development has been motivated by our interest in analyzing multi-condition scRNA-seq data sets,
the Poisson mash model can also be viewed as a general model of multivariate count data, so the
ideas contained in this paper may be useful in other settings where multivariate count data occur.
The structure of the paper is as follows. First, in Section 2, we define “multi-condition differ-
2
ential expression analysis” more formally, and explain the underlying assumptions about the data.
Next, we introduce the core Poisson mash model (Section 3), discuss related methods (Section 4),
then describe several enhancements to the model that improve its performance in more realistic
settings (Section 5). In Section 6, we evaluate the benefits of Poisson mash approach compared
with existing methods in simulated scRNA-seq data sets. To illustrate how Poisson mash can be
used to gain biological insights from multi-condition gene expression data, we apply Poisson mash
to the scRNA-seq data set mentioned above (Section 7). Finally, we wrap up with a discussion
(Section 8).
1.1 Software availability
The Poisson mash methods are implemented in the R package “poisson.mash.alpha”, which is
available at https://github.com/stephenslab/poisson.mash.alpha.
2 Problem setup
In a multi-condition differential expression analysis, the aim is to compare expression for each of J
genes across Rconditions from multi-condition count data X,
X=
Jgenes
x11 x12 ··· x1R
x21 x22 ··· x2R
.
.
..
.
.....
.
.
xJ1xJ2··· xJR
Rconditions
.(2.1)
This matrix can be obtained by summing, for each gene j, the unique molecular identifier (UMI)
counts from all cells in the same condition (see Section 2.1). The special case of R= 2 — that is,
when Xis a J×2 matrix — corresponds to the standard setup for DE analysis in which the aim is
to compare expression between two conditions (e.g., treatment vs. control). By contrast, we focus
on multi-condition experiments with R2; for example, in the cytokines data, R= 45.
Next, we assume a Poisson measurement model for the counts,
xjr Pois(srλjr),(2.2)
independently for each gene jand condition r, in which sr>0 denotes a “size factor”.1The
parameter λjr in (2.2) represents a relative rate, specifically the relative expression level for gene j
in condition r.
To analyze differences in expression, the log-relative expression is decomposed into a baseline
level of expression, µj, and a condition-specific expression difference βjr relative to the baseline:
log λjr =µj+βjr.(2.3)
Our main aim is to accurately estimate the expression differences βjr and other statistical quantities
1In DE analyses of scRNA-seq data, the size factors srare often defined as the sum of the counts in each condition,
sr=PJ
j=1 xjr . This is the default setting for our analyses, noting that other definitions are possible (e.g., Bullard
et al.,2010).
3
involving βjr. With these modeling assumptions, the µjand βjr are not individually identifiable
from the count data, X. However, they become identifiable once one introduces priors for βjr; this
is described in the next section where we introduce the Poisson mash model.
The assumption that the data are in the form of a J×Rmatrix Ximplies that there is only
asingle independent observation per condition. This assumption is not critical; when multiple
replicates are available per condition, we can aggregate cells by replicate rather than by condition
go that there are multiple independent observations per condition. This extension is described in
Appendix C. But the single-observation assumption simplifies the description of the method, and
furthermore the extension to multiple observations is not essential for understanding of the method
nor appreciating its benefits.
2.1 Obtaining Xfrom multi-condition scRNA-seq data
In multi-condition scRNA-seq data, we observe the UMI counts yji for genes j∈ {1, . . . , J}in cells
i∈ {1, . . . , N}in which each cell iis measured in one of Rconditions. To analyze the differences in
expression across Rconditions following the setup described above, we summarize the UMI counts
in each condition by summing them across cells from the same condition; that is, for each gene j
and condition r, we set xjr =Pi∈ Sryji, where Sr⊂ {1, . . . , N}denotes the indices of the cells
that are measured in condition r.
The idea of summing the UMI counts from the individual cells is commonly described as “pseu-
dobulk analysis”, and its benefits were noted in several recent papers (Lun and Marioni,2017;
Ahlmann-Eltze and Huber,2020;Erdmann-Pham et al.,2021;Murphy and Skene,2022;Crowell
et al.,2020;Squair et al.,2021). In the context of multi-condition DE analysis, forming pseudobulk
data has the twin advantages of simplifying modeling and reducing computation, and for these
reasons we take this approach here. One possible concern with a pseudobulk analysis of single-cell
RNA-seq data is that one may need to correct for unwanted variation (known or unknown) that
must be accounted for at the single-cell level (Risso et al.,2014;Leek and Storey,2007;Leek,2014;
Gerard and Stephens,2020). We address this concern in Section 5.2.
3 The basic Poisson mash model
We now give the minimum details needed to understand Poisson mash. Enhancements to the basic
model are described in Section 5.
3.1 The multivariate adaptive shrinkage prior
Our main aim is to detect and estimate expression differences among conditions. For example, if
some subset of conditions involves treatments with similar biological effects, then the expression
differences βjr are expected to be similar to one another; on the other hand, if one treatment has
a very different biological effect from other treatments, it may show a “condition-specific” effect
in which βjr is nonzero only in that condition. Furthermore, the patterns of DE may vary across
genes; for example, genes in the same pathway may show more similar patterns of DE than genes
in different pathways. In summary, different data sets will likely exhibit different patterns of DE
among conditions, and multiple patterns of DE may be present within a single data set.
4
A B
Figure 1: Directed acyclic graphs (DAGs) showing the independence structure of the basic Poisson mash model (A)
and the Poisson mash model augmented with random effects (B).
To capture heterogeneous DE patterns, and adapt these patterns to the data, we use the
multivariate adaptive shrinkage (“mash”) prior introduced in Urbut et al. (2019),
p(βj;π,U) =
K
X
k=1
L
X
l=1
πklNR(βj;0, wlUk),(3.1)
where βj:= (βj1, . . . , βjR), and NR(·;µ,Σ) denotes the density of the R-variate normal distribu-
tion with mean µand covariance matrix Σ. Here, wl>0, l = 1, . . . , L, is a pre-specified “grid”
of scaling coefficients, spanning from very small to very large, to capture the full range of possible
effect sizes, and the πkl are mixture weights, πkl 0, PK
k=1 PL
l=1 πkl = 1. Each Ukis an R×R
covariance matrix that captures a pattern of covariation of effects across conditions. We refer to
the Poisson measurement model (2.22.3) together with the mash prior (3.1) as “Poisson mash”.
The graphical model representation of the Poisson mash model is shown in Panel A of Figure 1.
The covariance matrices Ukcan include both pre-specified “canonical” covariance matrices that
represent, for example, DE specific to a condition, and “data-driven” covariance matrices that are
estimated from the data, and can capture arbitrary patterns of DE among the conditions. Each
weight πkl should capture the relative frequency of each combination of scaling coefficient wland
cross-condition pattern Uk. These weights will be estimated from the data.
The mash prior (3.1) is centered on zero, which helps address the nonidentifiability of µjand
βjr in (2.3); specifically, centering the prior on zero encourages the average βjr to be near zero, and
is analogous to the (non-Bayesian) approach of identifying parameters by imposing the constraint
PR
r=1 βjr = 0. In addition, we note that typically the prior will place considerable weight near
zero, reflecting the fact that many expression differences are zero or small. This has the effect of
“shrinking” many of the estimated βjr towards zero.
Our model is closely connected to multivariate Poisson log-normal (MPLN) distributions, origi-
nally proposed by Aitchison and Ho (1989) as a flexible approach to modeling dependencies among
Poisson variables. Indeed, integrating out the effects βj, the marginal distribution of the counts for
gene j,xj:= (xj1, . . . , xjR), is a mixture of MPLNs. Similar MPLN mixture models have recently
5
摘要:

Aflexiblemodelforcorrelatedcountdata,withapplicationtomulti-conditiondifferentialexpressionanalysesofsingle-cellRNAsequencingdataYushaLiu1,PeterCarbonetto2,3,MichihiroTakahama4,5,AdamGruenbaum6,DongyueXie7,NicolasChevrier4,andMatthewStephens∗2,71DepartmentofBiostatistics,TheUniversityofNorthCarolina...

展开>> 收起<<
A flexible model for correlated count data with application to multi-condition differential expression analyses of single-cell RNA sequencing data.pdf

共56页,预览5页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:56 页 大小:7.49MB 格式:PDF 时间:2025-04-27

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 56
客服
关注