Deep Learning in Single-Cell Analysis
Dylan Molho1, Jiayuan Ding2, Zhaoheng Li4, Hongzhi Wen2, Wenzhuo
Tang3, Yixin Wang5, Julian Venegas1, Wei Jin2, Renming Liu1, Runze Su1,3,
Patrick Danaher8, Robert Yang9, Yu Leo Lei6,7, Yuying Xie1,3, and Jiliang
Tang2
1Department of Computational Mathematics, Science and Engineering, Michigan State
University, East Lansing, USA
2Department of Computer Science and Engineering, Michigan State University, East Lansing,
USA
3Department of Statistics and Probability, Michigan State University, East Lansing, USA
4Department of Biostatistics, University of Washington, Seattle, USA
5Department of Bioengineering, Stanford University, Palo Alto, USA
6Department of Periodontics and Oral Medicine, University of Michigan School of Dentistry, Ann
Arbor, USA
7University of Michigan Rogel Cancer Center, Ann Arbor, USA
8NanoString Technologies, Seattle, USA
9Johnson & Johnson, Boston, USA
November 8, 2022
Abstract
Single-cell technologies are revolutionizing the entire field of biology. The large
volumes of data generated by single-cell technologies are high-dimensional, sparse,
heterogeneous, and have complicated dependency structures, making analyses using
conventional machine learning approaches challenging and impractical. In tackling
these challenges, deep learning often demonstrates superior performance compared to
traditional machine learning methods. In this work, we give a comprehensive survey
on deep learning in single-cell analysis. We first introduce background on single-cell
technologies and their development, as well as fundamental concepts of deep learning
including the most popular deep architectures. We present an overview of the single-
cell analytic pipeline pursued in research applications while noting divergences due to
data sources or specific applications. We then review seven popular tasks spanning
different stages of the single-cell analysis pipeline, including multimodal integration,
imputation, clustering, spatial domain identification, cell-type deconvolution,
cell segmentation, and cell-type annotation. Under each task, we describe the most re-
cent developments in classical and deep learning methods and discuss their advantages
and disadvantages. Deep learning tools and benchmark datasets are also summarized
for each task. Finally, we discuss the future directions and the most recent challenges.
This survey will serve as a reference for biologists and computer scientists, encouraging
collaborations.
Indicates equal contributions.
molhodyl@msu.edu
dingjia5@msu.edu
arXiv:2210.12385v2 [q-bio.QM] 5 Nov 2022
1 Introduction
As the basic building block of life, cells assume dynamic and complex functional states to
inform higher-order structures [1, 2]. Towards that end, the advance of single-cell sequencing
and imaging technologies has revolutionized the investigation of the gene-expression
behaviors of cells. Single-cell sequencing technology first appeared in the early 1990s for
complementary DNA (cDNA) [3, 4]. However, it was the creation of the first single-cell RNA
sequencing (scRNA-seq) method in 2009 [5] that marked a true paradigm shift
in the field. Since then, steady progress in the creation of new next-generation sequencing
platforms has led to over one hundred currently existing techniques for single-cell sequenc-
ing [6, 7, 8]. These technologies measure a diverse collection of cell features including DNA
sequences and epigenetic features, RNA expression, and profiles of surface proteins. Recent
technological advances have also enabled the augmentation of these features with additional
data, e.g., multimodal sequencing platforms and spatial transcriptomic technology.
This paradigm shift comes from the quantity of data made available by high-throughput
methods [9, 10]. For example, a single bulk tissue RNA-seq dataset [11] can only quantify the
average gene expression of a group of cells, ignoring cellular heterogeneity, and hence
serves as just one sample for downstream analyses. In contrast, single-cell sequencing
technologies generate tens of thousands to millions of samples (cells) in a given experiment.
Deep learning methods, which have consistently shown cutting-edge performance in various big
data applications [12, 13], have found fertile new ground for research that pushes the
frontiers of biological science. Studies of single-cell data continue to expand exponentially
and yield new insights into immunology, oncology, developmental biology, pharmacology, and
many other disciplines [14, 15, 16].
Despite the success of single-cell data in numerous applications, difficulties arise due to
the complexity of the data which requires advanced analysis pipelines with a number of
steps. Single-cell data preprocessing includes many stages of data pruning, normalization,
and often challenging machine learning tasks like batch effect correction, data imputation,
or dimensionality reduction. Moreover, specialized types of single-cell data require fur-
ther processing such as multimodal data integration and cell-type deconvolution for spatial
transcriptomics. These steps are crucial to facilitate downstream tasks ranging from cluster-
ing and cell annotation, disease prediction, identifying gene coexpression networks, to the
identification of developmental trajectories of cells transitioning between states [17]. For
tasks with clear evaluation metrics, deep learning often achieves top performance against
other classical machine learning techniques [18]. Deep learning can uniquely leverage its di-
verse architectures to capture networks of interdependencies between genes that alter other
genes’ expression levels [19], and cells that communicate with other cells through mech-
anisms like ligand-receptor pairs [20]. Due to the richness of deep learning architectures
and the customization of hyper-parameters and loss functions, deep learning models can be
more readily tailored to particular tasks in single-cell analysis compared to other machine
learning methods. Deep learning has already rapidly proliferated throughout the field, but
due to the multidisciplinary nature of the work, many remain unaware of this burgeoning
area of research. We write this survey as a bridge between two large research communities:
single-cell biology and computer science. We provide background on deep learning for
biologists who are less familiar with machine learning modeling, and we provide some
history and a summary of single-cell data for computer scientists who are looking for novel
applications for their methods.
In this survey, we review methods in the emerging use of deep learning for single-cell
biology applications. In Section (2), we discuss the history and major technologies for single-
cell sequencing. In Section (3), we give an overview of deep learning concepts and popular
deep architectures. Due to the categorization of the tasks involved in single-cell analysis,
we group our review of deep learning methods by tasks. We first give an overview of the
pipeline in Section (4). Then, we describe the individual task objectives and highlight the
alternative machine learning methods used for the task before detailing the deep learning
methods.
2 Single-Cell Technologies
Figure 1: Timeline of major developments in single-cell technologies
Figure 2: An illustration of data matrices produced by single-cell technologies
The goal of mapping genotypes to phenotypes presents a multitude of challenges to
biologists performing transcriptome analysis [21]. The cells of an organism have nearly
the same genotype, but the transcriptome is the result of gene regulatory networks in cells
expressing only a subset of the total genes at any given time. With the advent of single-cell
technologies, researchers have access to not just transcriptomic data at the cellular level,
but also genomics and epigenomics data as shown in Figure 1. Compared to bulk sampling
technologies which measure the average transcriptome profiles of a group of cells, single-
cell technologies provide a higher resolution of cell differences and can attribute biological
behaviors to individual cells [22, 23, 24, 25]. We briefly discuss the history of single-cell
technologies and the main technologies that are used in the applications we review. We
summarize a timeline of their development in Figure 1.
2.1 Single Modality Profiling
Single-cell sequencing technology was first developed by Eberwine et al. [3] and Iscove et al. [4],
by expanding the complementary DNAs (cDNAs). However, it wasn’t until the creation of
single-cell RNA sequencing (scRNA-seq) in 2009 [5] that single-cell methods truly gained
major traction. Since then, a few major branches in single-cell technologies have emerged,
targeting different aspects of cells, such as RNA in 2009 [5], DNA methylation in 2013 [26],
protein in 2015, DNA accessibility in 2015, and histone modifications in 2021 [27]. Single-cell
data is often given in the form of a matrix, with features (e.g., genes, proteins, or DNA
intervals) corresponding to the columns and each cell corresponding to a row, as shown in Figure 2.
Since its creation, scRNA-seq has had remarkable success in a number of different
applications, such as studying cell development, classifying cell types, and investigating
gene regulation. For scRNA-seq, isolation of the cell is the first step for obtaining
transcriptome information. Many technologies are differentiated according to their means of
cell isolation before sequencing occurs. Earlier methods using serial dilution or robotic
micromanipulation have low efficiency and throughput [28] compared to more recent methods
using microfluidic technologies [29]. One promising microfluidic technique for single-cell
isolation uses microdroplets [30]: a uniform dispersion of water droplets in a medium of
oil allows the separation of cells into individual droplets. While commercial microfluidic
platforms like Fluidigm C1, ICELL8, and Chromium benefit from high throughput, they
face the challenges of high cost and, often, the requirement of uniform cell size in the sample.
Once a cell is separated and lysed, the messenger RNAs in the cell are reverse transcribed
into more stable cDNAs tagged with a unique cell ‘barcode’. The cDNAs are then amplified via
Polymerase Chain Reaction (PCR) for better data capture before sequencing, but amplification
tends to introduce bias because its efficiency is uneven. Therefore, besides the unique
barcodes, the cDNA molecules in a cell are also given a Unique Molecular Identifier (UMI)
to correct the amplification bias by collapsing the reads with the same UMI into one read.
After debiasing, sequence reads are mapped to the genome and are grouped into genes for
the creation of a count matrix [6].
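The UMI-collapsing step described above can be illustrated with a minimal Python sketch. It is a simplification rather than the procedure of any particular tool: it assumes each read has already been assigned a cell barcode, a UMI, and a gene, and it ignores sequencing errors in barcodes and UMIs. The function name `build_count_matrix` and the toy reads are hypothetical.

```python
def build_count_matrix(reads):
    """Collapse PCR duplicates and count molecules per cell and gene.

    reads: iterable of (cell_barcode, umi, gene) tuples, one per read.
    Assumption: reads are already assigned to genes and contain no
    sequencing errors in the barcode or UMI.
    """
    # Reads sharing the same barcode, UMI, and gene come from one original
    # molecule, so a set keeps exactly one copy of each.
    unique_molecules = set(reads)

    cells = sorted({cell for cell, _, _ in unique_molecules})
    genes = sorted({gene for _, _, gene in unique_molecules})
    cell_index = {cell: i for i, cell in enumerate(cells)}
    gene_index = {gene: j for j, gene in enumerate(genes)}

    # Rows are cells and columns are genes, matching the layout in the text.
    counts = [[0] * len(genes) for _ in cells]
    for cell, _, gene in unique_molecules:
        counts[cell_index[cell]][gene_index[gene]] += 1
    return cells, genes, counts


# Toy input: the second read is a PCR duplicate of the first.
reads = [
    ("AAAC", "UMI1", "GeneA"),
    ("AAAC", "UMI1", "GeneA"),
    ("AAAC", "UMI2", "GeneA"),
    ("AAAC", "UMI3", "GeneB"),
    ("TTTG", "UMI1", "GeneB"),
    ("TTTG", "UMI1", "GeneB"),
]
cells, genes, counts = build_count_matrix(reads)
print(cells, genes, counts)
```

In this toy input, cell AAAC ends up with two GeneA molecules rather than three reads, because the duplicated read is counted once.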
Beyond recording RNA expression levels in a cell, technology may also capture infor-
mation about the chromatin accessibility of a cell’s chromosome. Eukaryotic genomes are
hierarchically packaged into chromatin [31], and this packaging plays a central role in gene
regulation [32]. Buenrostro et al. created a means for sampling the epigenome at the
single-cell level through the Assay for Transposase Accessible Chromatin using sequencing
(ATAC-seq) [33] in 2013. ATAC-seq allows the identification of accessible DNA, i.e., the
nucleosome-free regions of the genome [34]. DNA accessibility within the genome can be
used to identify regulatory elements in different cell types which cause the activation or
repression of gene expression [35]. scATAC-seq produces a count matrix holding the number
of reads per open chromatin region, which leads to very large matrices with hundreds of
thousands of regions. Furthermore, the data is known to be very sparse: it is common
for non-zero entries to make up less than 3% of the data [36].
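To make the sparsity figure concrete, the following sketch measures the fraction of non-zero entries in a small cell-by-region matrix and stores only those entries in a coordinate-style dictionary. The helper names `nonzero_fraction` and `to_sparse` and the toy matrix are illustrative assumptions. Real scATAC-seq matrices are far larger and would normally be handled with a dedicated sparse-matrix library.

```python
def nonzero_fraction(matrix):
    """Return the share of entries in a dense matrix that are non-zero."""
    total = sum(len(row) for row in matrix)
    nonzero = sum(1 for row in matrix for value in row if value != 0)
    return nonzero / total if total else 0.0


def to_sparse(matrix):
    """Keep only non-zero entries as a {(row, column): value} mapping."""
    return {
        (i, j): value
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    }


# Toy matrix: 2 cells by 50 regions, with reads in only two regions.
toy = [[0] * 50 for _ in range(2)]
toy[0][7] = 1
toy[1][31] = 2

fraction = nonzero_fraction(toy)
sparse = to_sparse(toy)
print(fraction, sparse)
```

Storing only the non-zero coordinates keeps memory proportional to the number of observed reads rather than to the number of cells times the number of regions.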
Gene expression can also be affected by a number of additional factors that are investigated
under the umbrella of epigenetics, which studies mechanisms like DNA methylation
and histone modification that do not change the DNA sequence but can change gene
activity and expression [37]. DNA methylation occurs when methyl groups are bonded to
the DNA molecule, which can repress gene transcription, and it is associated with a number
of key biological processes [38]. In mammals, DNA methylation occurs most often in
particular portions of the base pair sequence, namely CG (denoted CpG) portions where
a cytosine is followed by a guanine [39]. New technologies developed in the past decade
for the profiling of DNA methylation use bisulfite sequencing (scBS-seq) [40, 41] or reduced
representation bisulfite sequencing (scRRBS-seq) [42, 43] at a single-cell resolution. The
output data for these are binary: a 1 indicates a methylated region, while a 0
indicates no methylation.
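As a small illustration of the CpG definition and the binary encoding above, the sketch below locates CpG sites in a DNA string and builds a 0/1 methylation vector for one cell. The function names, the example sequence, and the set of methylated positions are hypothetical. The sketch also assumes every CpG site is observed, which real single-cell data generally does not guarantee.

```python
def cpg_positions(sequence):
    """Return the 0-based indices where a cytosine is followed by a guanine."""
    seq = sequence.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def methylation_vector(sequence, methylated_positions):
    """Encode each CpG site as 1 (methylated) or 0 (not methylated).

    Assumption: every CpG site in the sequence was observed for this cell.
    """
    methylated = set(methylated_positions)
    return [1 if pos in methylated else 0 for pos in cpg_positions(sequence)]


# Hypothetical sequence with CpG sites at indices 1, 5, and 9.
sequence = "ACGTTCGGACGA"
sites = cpg_positions(sequence)
vector = methylation_vector(sequence, methylated_positions=[1, 9])
print(sites, vector)
```

Stacking one such vector per cell gives a binary cell-by-site matrix in the same row and column layout used for the other modalities.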
2.2 Multi-Modality Profiling
In addition to the cell transcriptome and epigenome, another focus of single-cell
technologies is the cell proteome, which consists of the proteins that are encoded by the
mRNA of the cell. Comprehensive measurements of a cell’s proteome are integral to understanding
how the genes respond to environmental changes, as well as for predicting cellular behavior,
since proteins are the functional units responsible for most of the cellular processes. While
single-cell sequencing techniques for transcriptome measurements have widely proliferated,
single-cell proteomics methods have made slower progress. Unlike most of the sequencing
technologies, which have a standard process, proteomic measurements are often bespoke and
designed for specific applications [44]. However, some technologies have made
significant strides in not only capturing protein information of cells but also combining this
with mRNA measurements. Specifically, Cellular Indexing of Transcriptomes and Epitopes
by Sequencing (CITE-seq), a technology introduced in 2017, simultaneously sequences
mRNA and measures the surface proteins on a cell [45]. The method can sample over 1,000
genes and 80 proteins per cell but, like many other sequencing techniques, suffers from high
noise. In addition, CITE-seq cannot detect intracellular proteins [46].
The repertoire of multi-modal single-cell technologies bridges RNA expression not only
to protein but also to DNA methylation, chromatin accessibility, and histone modifications.
One of the first methods to simultaneously sequence RNA and chromatin accessibility is
a droplet-based method named SNARE-seq. Published in 2019, it uses Tn5 transposase
to capture accessible chromatin and creates shared barcodes between RNA and accessible
regions [47]. The same year, Paired-seq raised the throughput by two orders of magnitude by
combining a ligation-based combinatorial indexing strategy and an amplify-and-split library
dedicating method [48]. SHARE-seq [49] further increased the throughput and resolution by adapting
Paired-seq and SPLiT-seq [50], a scRNA-seq technology. In 2020, 10X also released 10X
Multiome, a commercialized product for the joint profiling of RNA and chromatin accessibility.
Beyond the co-profiling of RNA and chromatin accessibility, scM&T-seq [51], based on
G&T-seq [52], allows for parallel analysis of single-cell RNA and DNA methylation. Another
emerging field is the joint profiling of RNA and histone modifications. Example technologies
in this category include Paired-Tag [53] and CoTECH [54], both of which became available in 2021.
2.3 Single-Cell Spatial Transcriptomics
Single-cell technologies that capture transcriptomic, proteomic, or epigenetic information
do so with great precision but lose the spatial information of the cells within the
tissues. However, the cells’ relative locations within a tissue are critical to understanding normal
development and disease pathology. With spatial transcriptomic technologies, researchers
are able to measure transcriptomics and leverage the spatial information or relative locations
of cells in a tissue to improve performance on downstream tasks [55, 56, 57, 58, 59, 60, 61]. For
example, motivated by the fact that a pair of ligand and receptor with closer distance