A flexible model for correlated count data with application to multi-condition differential expression analyses of single-cell RNA sequencing data

2025-04-27 0 0 7.49MB 56 页 10玖币

侵权投诉

A ﬂexible model for correlated count data,

with application to multi-condition diﬀerential expression analyses

of single-cell RNA sequencing data

Yusha Liu1, Peter Carbonetto2,3, Michihiro Takahama4,5, Adam Gruenbaum6, Dongyue

Xie7, Nicolas Chevrier4, and Matthew Stephens∗2,7

1Department of Biostatistics, The University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

2Department of Human Genetics, The University of Chicago, Chicago, IL, USA

3Research Computing Center, The University of Chicago, Chicago, IL, USA

4Pritzker School of Molecular Engineering, The University of Chicago, Chicago, IL, USA

5Graduate School of Pharmaceutical Sciences, Osaka University, Osaka, Japan

6Institute for Health Metrics and Evaluation, University of Washington, Seattle, WA, USA

7Department of Statistics, The University of Chicago, Chicago, IL, USA

Abstract

Detecting diﬀerences in gene expression is an important part of single-cell RNA sequencing

experiments, and many statistical methods have been developed for this aim. Most diﬀerential

expression analyses focus on comparing expression between two groups (e.g., treatment vs.

control). But there is increasing interest in multi-condition diﬀerential expression analyses in

which expression is measured in many conditions and the aim is to accurately detect and estimate

expression diﬀerences in all conditions. We show that directly modeling single-cell RNA-seq

counts in all conditions simultaneously, while also inferring how expression diﬀerences are shared

across conditions, leads to greatly improved performance for detecting and estimating expression

diﬀerences compared to existing methods. We illustrate the potential of this new approach by

analyzing data from a single-cell experiment studying the eﬀects of cytokine stimulation on

gene expression. We call our new method “Poisson multivariate adaptive shrinkage”, and it is

implemented in an R package available at https://github.com/stephenslab/poisson.mash.

alpha.

1 Introduction

Detecting diﬀerences in gene expression — that is, “diﬀerential expression” (DE) analysis — has

been a fundamental analysis aim ever since the introduction of technologies to measure gene expres-

sion (Soneson and Delorenzi,2013). As measurement technologies have improved, gene expression

data sets have increased in size and resolution, bringing new analysis challenges. The development

of RNA sequencing (RNA-seq) technologies (Wang et al.,2009) has greatly facilitated the mea-

surement of gene expression in “bulk” samples. More recently, the development of single-cell RNA

sequencing technologies (scRNA-seq) has allowed for rapid, high-throughput measurement of gene

∗mstephens@uchicago.edu

arXiv:2210.00697v3 [stat.ME] 26 May 2024

expression in individual cells, resulting in large datasets proﬁling gene expression in thousands of

cells (e.g., Zheng et al.,2017).

Increasingly, biologists are developing scRNA-seq experiments in which gene expression is as-

sayed in many experimental conditions. For example, in the motivating data set for this paper,

gene expression data were obtained for approximately 142,000 cells under 45 diﬀerent treatment

conditions. In multi-condition data such as these, we seek to understand which changes in expres-

sion are speciﬁc to certain conditions (“condition-speciﬁc eﬀects”), and which changes are shared

among two or more conditions (“shared eﬀects”). In this paper, we develop methods to tackle these

aims — speciﬁcally, to detect which genes are diﬀerentially expressed, and to estimate the log-fold

changes (LFCs) among multiple conditions. While many methods exist for performing diﬀerential

expression analysis of scRNA-seq data, analyzing multi-condition scRNA-seq data raises at least

two key challenges that are not adequately addressed by existing methods.

First, when assessing expression across multiple conditions, many diﬀerent patterns of diﬀer-

ential expression are possible. For example, some genes may be diﬀerentially expressed in a single

condition (relative to all other conditions), while other genes may show similar expression diﬀer-

ences in subsets of conditions. Typically, these patterns are unknown in advance, but one would

like to identify and exploit them to improve accuracy of the LFC estimates, and to improve power

to detect diﬀerentially expressed genes. To address this ﬁrst challenge, we build on the empirical

Bayes (EB) method developed in Urbut et al. (2019), “multivariate adaptive shrinkage” (“mash” for

short), which is designed to model and adapt to eﬀect-sharing patterns among conditions present

in the data.

Second, the data from scRNA-seq experiments are molecular counts, which are most naturally

modeled using count models such as Poisson measurement models (Townes et al.,2019;Sarkar

and Stephens,2021). However, there is no straightforward way to integrate a Poisson model with

mash because the Poisson model does not naturally provide summary statistics — eﬀect estimates

and standard errors — that can be used by mash; in particular, estimates of standard errors are

unreliable in Poisson models (Robinson and Smyth,2008). An alternative would be to combine

mash with a Gaussian measurement model for log-transformed scRNA-seq counts (e.g., Finak et al.,

2015). However, as has been repeatedly pointed out (e.g., Lun,2018;Townes et al.,2019;Crowell

et al.,2020), this data transformation can lead to severe bias in the LFC estimates. This is

particularly an issue when many of the counts are zero or small, which is a common feature of

scRNA-seq data sets. This suggests that it would be desirable to combine mash with a model of

the scRNA-seq counts.

Therefore, to get the best of both worlds — improved accuracy achieved by exploiting patterns

of eﬀect-sharing across conditions and the advantages of directly modeling the scRNA-seq counts

without ﬁrst transforming them — we pursue an approach that models the scRNA-seq count data

jointly in all conditions. We call this new approach “Poisson mash” because it is based on a

Poisson model of the data, and, like mash, it improves accuracy in the eﬀect estimates by ﬂexibly

modeling the sharing of eﬀects across conditions. Since the gains in accuracy will be greater as more

conditions with shared eﬀects are included in the analysis, in this paper we focus on scRNA-seq

experiments in which gene expression is measured in many (e.g., dozens) conditions. Although its

development has been motivated by our interest in analyzing multi-condition scRNA-seq data sets,

the Poisson mash model can also be viewed as a general model of multivariate count data, so the

ideas contained in this paper may be useful in other settings where multivariate count data occur.

The structure of the paper is as follows. First, in Section 2, we deﬁne “multi-condition diﬀer-

ential expression analysis” more formally, and explain the underlying assumptions about the data.

Next, we introduce the core Poisson mash model (Section 3), discuss related methods (Section 4),

then describe several enhancements to the model that improve its performance in more realistic

settings (Section 5). In Section 6, we evaluate the beneﬁts of Poisson mash approach compared

with existing methods in simulated scRNA-seq data sets. To illustrate how Poisson mash can be

used to gain biological insights from multi-condition gene expression data, we apply Poisson mash

to the scRNA-seq data set mentioned above (Section 7). Finally, we wrap up with a discussion

(Section 8).

1.1 Software availability

The Poisson mash methods are implemented in the R package “poisson.mash.alpha”, which is

available at https://github.com/stephenslab/poisson.mash.alpha.

2 Problem setup

In a multi-condition diﬀerential expression analysis, the aim is to compare expression for each of J

genes across Rconditions from multi-condition count data X,

Jgenes







x11 x12 ··· x1R

x21 x22 ··· x2R

.....

xJ1xJ2··· xJR







Rconditions

.(2.1)

This matrix can be obtained by summing, for each gene j, the unique molecular identiﬁer (UMI)

counts from all cells in the same condition (see Section 2.1). The special case of R= 2 — that is,

when Xis a J×2 matrix — corresponds to the standard setup for DE analysis in which the aim is

to compare expression between two conditions (e.g., treatment vs. control). By contrast, we focus

on multi-condition experiments with R≫2; for example, in the cytokines data, R= 45.

Next, we assume a Poisson measurement model for the counts,

xjr ∼Pois(srλjr),(2.2)

independently for each gene jand condition r, in which sr>0 denotes a “size factor”.1The

parameter λjr in (2.2) represents a relative rate, speciﬁcally the relative expression level for gene j

in condition r.

To analyze diﬀerences in expression, the log-relative expression is decomposed into a baseline

level of expression, µj, and a condition-speciﬁc expression diﬀerence βjr relative to the baseline:

log λjr =µj+βjr.(2.3)

Our main aim is to accurately estimate the expression diﬀerences βjr and other statistical quantities

1In DE analyses of scRNA-seq data, the size factors srare often deﬁned as the sum of the counts in each condition,

sr=PJ

j=1 xjr . This is the default setting for our analyses, noting that other deﬁnitions are possible (e.g., Bullard

et al.,2010).

involving βjr. With these modeling assumptions, the µjand βjr are not individually identiﬁable

from the count data, X. However, they become identiﬁable once one introduces priors for βjr; this

is described in the next section where we introduce the Poisson mash model.

The assumption that the data are in the form of a J×Rmatrix Ximplies that there is only

asingle independent observation per condition. This assumption is not critical; when multiple

replicates are available per condition, we can aggregate cells by replicate rather than by condition

go that there are multiple independent observations per condition. This extension is described in

Appendix C. But the single-observation assumption simpliﬁes the description of the method, and

furthermore the extension to multiple observations is not essential for understanding of the method

nor appreciating its beneﬁts.

2.1 Obtaining Xfrom multi-condition scRNA-seq data

In multi-condition scRNA-seq data, we observe the UMI counts yji for genes j∈ {1, . . . , J}in cells

i∈ {1, . . . , N}in which each cell iis measured in one of Rconditions. To analyze the diﬀerences in

expression across Rconditions following the setup described above, we summarize the UMI counts

in each condition by summing them across cells from the same condition; that is, for each gene j

and condition r, we set xjr =Pi∈ Sryji, where Sr⊂ {1, . . . , N}denotes the indices of the cells

that are measured in condition r.

The idea of summing the UMI counts from the individual cells is commonly described as “pseu-

dobulk analysis”, and its beneﬁts were noted in several recent papers (Lun and Marioni,2017;

Ahlmann-Eltze and Huber,2020;Erdmann-Pham et al.,2021;Murphy and Skene,2022;Crowell

et al.,2020;Squair et al.,2021). In the context of multi-condition DE analysis, forming pseudobulk

data has the twin advantages of simplifying modeling and reducing computation, and for these

reasons we take this approach here. One possible concern with a pseudobulk analysis of single-cell

RNA-seq data is that one may need to correct for unwanted variation (known or unknown) that

must be accounted for at the single-cell level (Risso et al.,2014;Leek and Storey,2007;Leek,2014;

Gerard and Stephens,2020). We address this concern in Section 5.2.

3 The basic Poisson mash model

We now give the minimum details needed to understand Poisson mash. Enhancements to the basic

model are described in Section 5.

3.1 The multivariate adaptive shrinkage prior

Our main aim is to detect and estimate expression diﬀerences among conditions. For example, if

some subset of conditions involves treatments with similar biological eﬀects, then the expression

diﬀerences βjr are expected to be similar to one another; on the other hand, if one treatment has

a very diﬀerent biological eﬀect from other treatments, it may show a “condition-speciﬁc” eﬀect

in which βjr is nonzero only in that condition. Furthermore, the patterns of DE may vary across

genes; for example, genes in the same pathway may show more similar patterns of DE than genes

in diﬀerent pathways. In summary, diﬀerent data sets will likely exhibit diﬀerent patterns of DE

among conditions, and multiple patterns of DE may be present within a single data set.

A B

Figure 1: Directed acyclic graphs (DAGs) showing the independence structure of the basic Poisson mash model (A)

and the Poisson mash model augmented with random eﬀects (B).

To capture heterogeneous DE patterns, and adapt these patterns to the data, we use the

multivariate adaptive shrinkage (“mash”) prior introduced in Urbut et al. (2019),

p(βj;π,U) =

k=1

l=1

πklNR(βj;0, wlUk),(3.1)

where βj:= (βj1, . . . , βjR)′, and NR(·;µ,Σ) denotes the density of the R-variate normal distribu-

tion with mean µand covariance matrix Σ. Here, wl>0, l = 1, . . . , L, is a pre-speciﬁed “grid”

of scaling coeﬃcients, spanning from very small to very large, to capture the full range of possible

eﬀect sizes, and the πkl are mixture weights, πkl ≥0, PK

k=1 PL

l=1 πkl = 1. Each Ukis an R×R

covariance matrix that captures a pattern of covariation of eﬀects across conditions. We refer to

the Poisson measurement model (2.2–2.3) together with the mash prior (3.1) as “Poisson mash”.

The graphical model representation of the Poisson mash model is shown in Panel A of Figure 1.

The covariance matrices Ukcan include both pre-speciﬁed “canonical” covariance matrices that

represent, for example, DE speciﬁc to a condition, and “data-driven” covariance matrices that are

estimated from the data, and can capture arbitrary patterns of DE among the conditions. Each

weight πkl should capture the relative frequency of each combination of scaling coeﬃcient wland

cross-condition pattern Uk. These weights will be estimated from the data.

The mash prior (3.1) is centered on zero, which helps address the nonidentiﬁability of µjand

βjr in (2.3); speciﬁcally, centering the prior on zero encourages the average βjr to be near zero, and

is analogous to the (non-Bayesian) approach of identifying parameters by imposing the constraint

r=1 βjr = 0. In addition, we note that typically the prior will place considerable weight near

zero, reﬂecting the fact that many expression diﬀerences are zero or small. This has the eﬀect of

“shrinking” many of the estimated βjr towards zero.

Our model is closely connected to multivariate Poisson log-normal (MPLN) distributions, origi-

nally proposed by Aitchison and Ho (1989) as a ﬂexible approach to modeling dependencies among

Poisson variables. Indeed, integrating out the eﬀects βj, the marginal distribution of the counts for

gene j,xj:= (xj1, . . . , xjR)′, is a mixture of MPLNs. Similar MPLN mixture models have recently

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

Aflexiblemodelforcorrelatedcountdata,withapplicationtomulti-conditiondifferentialexpressionanalysesofsingle-cellRNAsequencingdataYushaLiu1,PeterCarbonetto2,3,MichihiroTakahama4,5,AdamGruenbaum6,DongyueXie7,NicolasChevrier4,andMatthewStephens∗2,71DepartmentofBiostatistics,TheUniversityofNorthCarolina...

展开>> 收起<<

A flexible model for correlated count data with application to multi-condition differential expression analyses of single-cell RNA sequencing data.pdf

共56页,预览5页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

A flexible model for correlated count data with application to multi-condition differential expression analyses of single-cell RNA sequencing data

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: