expression in individual cells, resulting in large datasets profiling gene expression in thousands of
cells (e.g., Zheng et al., 2017).
Increasingly, biologists are developing scRNA-seq experiments in which gene expression is
assayed in many experimental conditions. For example, in the motivating data set for this paper,
gene expression data were obtained for approximately 142,000 cells under 45 different treatment
conditions. In multi-condition data such as these, we seek to understand which changes in
expression are specific to certain conditions (“condition-specific effects”), and which changes are shared
among two or more conditions (“shared effects”). In this paper, we develop methods to tackle these
aims — specifically, to detect which genes are differentially expressed, and to estimate the log-fold
changes (LFCs) among multiple conditions. While many methods exist for performing differential
expression analysis of scRNA-seq data, analyzing multi-condition scRNA-seq data raises at least
two key challenges that are not adequately addressed by existing methods.
First, when assessing expression across multiple conditions, many different patterns of
differential expression are possible. For example, some genes may be differentially expressed in a single
condition (relative to all other conditions), while other genes may show similar expression
differences in subsets of conditions. Typically, these patterns are unknown in advance, but one would
like to identify and exploit them to improve accuracy of the LFC estimates, and to improve power
to detect differentially expressed genes. To address this first challenge, we build on the empirical
Bayes (EB) method developed in Urbut et al. (2019), “multivariate adaptive shrinkage” (“mash” for
short), which is designed to model and adapt to effect-sharing patterns among conditions present
in the data.
Second, the data from scRNA-seq experiments are molecular counts, which are most naturally
modeled using count models such as Poisson measurement models (Townes et al., 2019; Sarkar
and Stephens, 2021). However, there is no straightforward way to integrate a Poisson model with
mash because the Poisson model does not naturally provide summary statistics — effect estimates
and standard errors — that can be used by mash; in particular, estimates of standard errors are
unreliable in Poisson models (Robinson and Smyth, 2008). An alternative would be to combine
mash with a Gaussian measurement model for log-transformed scRNA-seq counts (e.g., Finak et al.,
2015). However, as has been repeatedly pointed out (e.g., Lun, 2018; Townes et al., 2019; Crowell
et al., 2020), this data transformation can lead to severe bias in the LFC estimates. This is
particularly an issue when many of the counts are zero or small, which is a common feature of
scRNA-seq data sets. This suggests that it would be desirable to combine mash with a model of
the scRNA-seq counts.
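
The bias from log-transforming small counts can be illustrated with a short calculation. The sketch below is not the method of this paper or of any cited work. It assumes a simplified stand-in for a Gaussian model on transformed counts, in which the LFC is estimated as the difference in the mean of log(1 + count) between two conditions, with a pseudocount of 1 chosen here for illustration. The function names and the Poisson rates are hypothetical. The expectation of log(1 + X) is computed exactly from the Poisson probabilities, so no simulation is involved.

```python
import math


def poisson_pmf(k, mu):
    """Probability that a Poisson(mu) variable equals k (computed on the log scale)."""
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))


def expected_log1p(mu, kmax=400):
    """E[log(1 + X)] for X ~ Poisson(mu), using a sum truncated at kmax."""
    return sum(poisson_pmf(k, mu) * math.log1p(k) for k in range(kmax + 1))


def lfc_comparison(mu_a, mu_b):
    """Return (true LFC, LFC implied by the means of log(1 + count))."""
    true_lfc = math.log(mu_b / mu_a)
    transformed_lfc = expected_log1p(mu_b) - expected_log1p(mu_a)
    return true_lfc, transformed_lfc


# Hypothetical rates: a two-fold change at low expression and at high expression.
low_true, low_transformed = lfc_comparison(0.1, 0.2)
high_true, high_transformed = lfc_comparison(50.0, 100.0)

print(f"low counts:  true LFC = {low_true:.3f}, transformed LFC = {low_transformed:.3f}")
print(f"high counts: true LFC = {high_true:.3f}, transformed LFC = {high_transformed:.3f}")
```

Under these assumptions, the transformed estimate is shrunk strongly toward zero when most counts are zero, while it stays close to the true LFC when counts are large. This is consistent with the concern that the transformation is most problematic for sparse scRNA-seq data.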
Therefore, to get the best of both worlds — improved accuracy achieved by exploiting patterns
of effect-sharing across conditions and the advantages of directly modeling the scRNA-seq counts
without first transforming them — we pursue an approach that models the scRNA-seq count data
jointly in all conditions. We call this new approach “Poisson mash” because it is based on a
Poisson model of the data, and, like mash, it improves accuracy in the effect estimates by flexibly
modeling the sharing of effects across conditions. Since the gains in accuracy will be greater as more
conditions with shared effects are included in the analysis, in this paper we focus on scRNA-seq
experiments in which gene expression is measured in many (e.g., dozens of) conditions. Although its
development has been motivated by our interest in analyzing multi-condition scRNA-seq data sets,
the Poisson mash model can also be viewed as a general model of multivariate count data, so the
ideas contained in this paper may be useful in other settings where multivariate count data occur.
The structure of the paper is as follows. First, in Section 2, we define “multi-condition differ-