

Deep Learning in Single-Cell Analysis

Dylan Molho∗†1, Jiayuan Ding∗‡2, Zhaoheng Li4, Hongzhi Wen2, Wenzhuo

Tang3, Yixin Wang5, Julian Venegas1, Wei Jin2, Renming Liu1, Runze Su1,3,

Patrick Danaher8, Robert Yang9, Yu Leo Lei6,7, Yuying Xie1,3, and Jiliang

Tang2

1Department of Computational Mathematics, Science and Engineering, Michigan State

University, East Lansing, USA

2Department of Computer Science and Engineering, Michigan State University, East Lansing,

USA

3Department of Statistics and Probability, Michigan State University, East Lansing, USA

4Department of Biostatistics, University of Washington, Seattle, USA

5Department of Bioengineering, Stanford University, Palo Alto, USA

6Department of Periodontics and Oral Medicine, University of Michigan School of Dentistry, Ann

Arbor, USA

7University of Michigan Rogel Cancer Center, Ann Arbor, USA

8NanoString Technologies, Seattle, USA

9Johnson & Johnson, Boston, USA

November 8, 2022

Abstract

Single-cell technologies are revolutionizing the entire ﬁeld of biology. The large

volumes of data generated by single-cell technologies are high-dimensional, sparse,

heterogeneous, and have complicated dependency structures, making analyses using

conventional machine learning approaches challenging and impractical. In tackling

these challenges, deep learning often demonstrates superior performance compared to

traditional machine learning methods. In this work, we give a comprehensive survey

on deep learning in single-cell analysis. We ﬁrst introduce background on single-cell

technologies and their development, as well as fundamental concepts of deep learning

including the most popular deep architectures. We present an overview of the single-

cell analytic pipeline pursued in research applications while noting divergences due to

data sources or specific applications. We then review seven popular tasks spanning

different stages of the single-cell analysis pipeline, including multimodal integration,

imputation, clustering, spatial domain identification, cell-type deconvolution,

cell segmentation, and cell-type annotation. Under each task, we describe the most re-

cent developments in classical and deep learning methods and discuss their advantages

and disadvantages. Deep learning tools and benchmark datasets are also summarized

for each task. Finally, we discuss the future directions and the most recent challenges.

This survey will serve as a reference for biologists and computer scientists, encouraging

collaborations.

∗Indicates equal contributions.

†molhodyl@msu.edu

‡dingjia5@msu.edu

arXiv:2210.12385v2 [q-bio.QM] 5 Nov 2022

1 Introduction

As the basic building block of life, cells assume dynamic and complex functional states to

inform higher-order structures [1, 2]. Towards that end, the advance of single-cell sequencing

and imaging technologies has revolutionized the investigation of the gene-expression behav-

iors of cells. The advent of single-cell sequencing technology occurred in the early 1990s for

complementary DNA (cDNA) [3, 4]. However, it was the creation of the first single-cell

RNA sequencing (scRNA-seq) method in 2009 [5] that marked a true paradigm shift

in the field. Since then, steady progress in the creation of new next-generation sequencing

platforms has led to over one hundred currently existing techniques for single-cell sequenc-

ing [6, 7, 8]. These technologies measure a diverse collection of cell features including DNA

sequences and epigenetic features, RNA expression, and proﬁles of surface proteins. Recent

technological advances have also enabled the augmentation of these features with additional

data, e.g., multimodal sequencing platforms and spatial transcriptomic technology.

This paradigm shift comes from the quantity of data made available by high-throughput

methods [9, 10]. For example, a bulk tissue RNA-seq experiment [11] can only quantify the

average gene expression of a group of cells, ignoring cellular heterogeneity, and hence

serves as a single sample for downstream analyses. In contrast, single-cell sequencing

technologies generate tens of thousands to millions of samples (cells) in a given experiment.

Deep learning methods, which have consistently shown cutting-edge performance in various

big data applications [12, 13], have found fertile new ground here for research that pushes

the frontiers of biological science. Studies of single-cell data continue to expand exponentially,

yielding new insights into immunology, oncology, developmental biology, pharmacology,

and many other disciplines [14, 15, 16].

Despite the success of single-cell data in numerous applications, diﬃculties arise due to

the complexity of the data which requires advanced analysis pipelines with a number of

steps. Single-cell data preprocessing includes many stages of data pruning, normalization,

and often challenging machine learning tasks like batch eﬀect correction, data imputation,

or dimensionality reduction. Moreover, specialized types of single-cell data require further

processing, such as multimodal data integration and cell-type deconvolution for spatial

transcriptomics. These steps are crucial to facilitate downstream tasks ranging from clustering,

cell annotation, disease prediction, and the identification of gene coexpression networks to the

identification of developmental trajectories of cells transitioning between states [17]. For

tasks with clear evaluation metrics, deep learning often achieves top performance against

other classical machine learning techniques [18]. Deep learning can uniquely leverage its di-

verse architectures to capture networks of interdependencies between genes that alter other

genes’ expression levels [19], and cells that communicate with other cells through mech-

anisms like ligand-receptor pairs [20]. Due to the richness of deep learning architectures

and the customization of hyper-parameters and loss functions, deep learning models can be

more readily tailored to particular tasks in single-cell analysis compared to other machine

learning methods. Deep learning has already rapidly proliferated throughout the ﬁeld, but

due to the multidisciplinary nature of the work, many remain unaware of this burgeoning

area of research. We write this survey as a bridge between two large research communities

in single-cell biology and computer science. We provide background on deep learning for

those in biology who are less familiar with machine learning modeling, and also provide some

history and a summary of single-cell data for computer scientists who are looking for novel

applications for their methods.

In this survey, we review methods in the emerging use of deep learning for single-cell

biology applications. In Section (2), we discuss the history and major technologies for single-

cell sequencing. In Section (3), we give an overview of deep learning concepts and popular

deep architectures. Due to the categorization of the tasks involved in single-cell analysis,

we group our review of deep learning methods by tasks. We ﬁrst give an overview of the

pipeline in Section (4). Then, we describe the individual task objectives and highlight the

alternative machine learning methods used for the task before detailing the deep learning

methods.

2 Single-Cell Technologies

Figure 1: Timeline of major developments in single-cell technologies

Figure 2: An illustration of data matrices produced by single-cell technologies

The goal of mapping genotypes to phenotypes presents a multitude of challenges to

biologists performing transcriptome analysis [21]. The cells of an organism have nearly

the same genotype, but the transcriptome is the result of gene regulatory networks in cells

expressing only a subset of the total genes at any given time. With the advent of single-cell

technologies, researchers have access to not just transcriptomic data at the cellular level,

but also genomic and epigenomic data, as shown in Figure 1. Compared to bulk sampling

technologies which measure the average transcriptome proﬁles of a group of cells, single-

cell technologies provide a higher resolution of cell diﬀerences and can attribute biological

behaviors to individual cells [22, 23, 24, 25]. We brieﬂy discuss the history of single-cell

technologies and the main technologies that are used in the applications we review. We

summarize a timeline of their development in Figure 1.
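
To make the contrast with bulk measurements concrete, the following minimal sketch uses made-up expression values for a single gene (the numbers and variable names are illustrative assumptions, not data from any cited study). Two populations with very different cell-level behavior yield the same bulk average, and only the per-cell values reveal the heterogeneity.

```python
from statistics import mean, pvariance

# Hypothetical expression of one gene in four cells from two tissues.
homogeneous = [5.0, 5.0, 5.0, 5.0]      # every cell expresses the gene equally
heterogeneous = [10.0, 10.0, 0.0, 0.0]  # two distinct subpopulations

# A bulk measurement reports only the average over the group of cells.
bulk_homogeneous = mean(homogeneous)
bulk_heterogeneous = mean(heterogeneous)

# Single-cell resolution keeps the per-cell values, so the spread is visible.
spread_homogeneous = pvariance(homogeneous)
spread_heterogeneous = pvariance(heterogeneous)

print(bulk_homogeneous, bulk_heterogeneous)      # identical bulk profiles
print(spread_homogeneous, spread_heterogeneous)  # different cell-level spread
```
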

2.1 Single Modality Proﬁling

Single-cell sequencing technology was first developed by Eberwine et al. [3] and Iscove et al. [4]

by amplifying complementary DNAs (cDNAs). However, it wasn't until the creation of

single-cell RNA sequencing (scRNA-seq) in 2009 [5] that single-cell methods truly gained

major traction. Since then, a few major branches of single-cell technologies have emerged,

targeting different aspects of cells, such as RNA in 2009 [5], DNA methylation in 2013 [26],

protein in 2015, DNA accessibility in 2015, and histone modifications in 2021 [27]. Single-cell

data is often given in the form of a matrix, with each feature (e.g., a gene, protein, or DNA

interval) corresponding to a column and each cell to a row, as shown in Figure 2.
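
The cell-by-feature layout can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `build_count_matrix`, the cell labels, and the gene names are hypothetical, and real pipelines typically store such matrices in sparse array formats rather than nested lists.

```python
def build_count_matrix(records):
    """Arrange (cell, feature, count) records into a cells x features matrix."""
    cells = sorted({cell for cell, _, _ in records})
    features = sorted({feature for _, feature, _ in records})
    row = {cell: i for i, cell in enumerate(cells)}
    col = {feature: j for j, feature in enumerate(features)}
    # Rows are cells, columns are features; unobserved pairs stay at zero.
    matrix = [[0] * len(features) for _ in cells]
    for cell, feature, count in records:
        matrix[row[cell]][col[feature]] += count
    return cells, features, matrix


# Hypothetical measurements for two cells and two genes.
records = [
    ("cell_A", "GeneX", 3),
    ("cell_A", "GeneY", 1),
    ("cell_B", "GeneY", 4),
]
cells, features, matrix = build_count_matrix(records)
print(cells)     # row labels
print(features)  # column labels
print(matrix)
```
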

Since its creation, scRNA-seq has had remarkable success in a number of diﬀerent ap-

plications, such as cell developmental studies, classifying cell types, and gene regulation.

For scRNA-seq, isolation of the cell is the ﬁrst step for obtaining transcriptome informa-

tion. Many technologies are diﬀerentiated according to their means of cell isolation before

sequencing occurs. Earlier methods using serial dilution or robotic micromanipulation have

low efficiency and throughput [28] compared to more recent methods using microfluidic

technologies [29]. One promising microfluidic technique for single-cell isolation uses

microdroplets [30]: a uniform dispersion of water droplets is created in a medium of

oil, allowing the separation of cells into individual droplets. While commercial microfluidic

platforms like Fluidigm C1, ICELL8, and Chromium offer high throughput, they

face the challenges of high cost and, often, a requirement for uniform cell size in the sample.

Once a cell is separated and lysed, its messenger RNAs are reverse transcribed

into more stable cDNAs tagged with a unique cell "barcode". The cDNAs are then amplified via

Polymerase Chain Reaction (PCR) for better data capture before sequencing, a step that tends

to introduce bias due to uneven amplification efficiency. Therefore, besides the unique

barcodes, the cDNA molecules in a cell are also given a Unique Molecular Identifier (UMI)

to correct the amplification bias by collapsing reads with the same UMI into one read.

After debiasing, sequence reads are mapped to the genome and grouped into genes for

the creation of a count matrix [6].
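
The UMI-based debiasing step can be illustrated with a small sketch. It assumes reads have already been mapped to genes and are given as (cell barcode, UMI, gene) triples; the function name `umi_counts` and all barcodes, UMIs, and gene names are hypothetical. Production tools additionally handle sequencing errors within barcodes and UMIs, which this sketch ignores.

```python
from collections import defaultdict


def umi_counts(reads):
    """Collapse reads sharing a (cell barcode, gene, UMI) into one molecule.

    Returns a dict mapping (cell barcode, gene) to the number of distinct UMIs,
    i.e. the entries of a count matrix after PCR duplicates are removed.
    """
    molecules = defaultdict(set)
    for barcode, umi, gene in reads:
        molecules[(barcode, gene)].add(umi)  # duplicates of a UMI add nothing
    return {key: len(umis) for key, umis in molecules.items()}


# Hypothetical mapped reads: (cell barcode, UMI, gene).
reads = [
    ("AAAC", "UMI1", "GeneX"),
    ("AAAC", "UMI1", "GeneX"),  # PCR duplicate of the read above
    ("AAAC", "UMI2", "GeneX"),
    ("AAAC", "UMI3", "GeneY"),
    ("TTTG", "UMI1", "GeneX"),  # same UMI, different cell: a separate molecule
]
counts = umi_counts(reads)
print(counts)
```

Counting distinct UMIs rather than raw reads means that a molecule amplified many times contributes the same single count as one amplified only once.
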

Beyond recording RNA expression levels in a cell, technology may also capture infor-

mation about the chromatin accessibility of a cell’s chromosome. Eukaryotic genomes are

hierarchically packaged into chromatin [31], and this packaging plays a central role in gene

regulation [32]. Buenrostro et al. created a means for sampling the epigenome at the

single-cell level through the Assay for Transposase-Accessible Chromatin using sequencing

(ATAC-seq) [33] in 2013. ATAC-seq allows the identification of accessible DNA, i.e., the

nucleosome-free regions of the genome [34]. DNA accessibility within the genome can be

used to identify regulatory elements in different cell types that cause the activation or

repression of gene expression [35]. scATAC-seq produces a count matrix holding the number

of reads per open chromatin region, which leads to very large matrices with hundreds of

thousands of regions. Furthermore, the data is known to be very sparse: it is common

for the non-zero entries to make up less than 3% of the data [36].
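
The sparsity figure above is simply the fraction of non-zero entries in the cell-by-region matrix. The sketch below computes it for a tiny made-up matrix and shows one common way to exploit sparsity, storing only the non-zero coordinates; the helper names `nonzero_fraction` and `to_coordinate_form` are illustrative assumptions, and the toy matrix is far denser than real scATAC-seq data.

```python
def nonzero_fraction(matrix):
    """Fraction of entries in a dense cells x regions matrix that are non-zero."""
    total = sum(len(row) for row in matrix)
    nonzero = sum(1 for row in matrix for value in row if value != 0)
    return nonzero / total


def to_coordinate_form(matrix):
    """Keep only non-zero entries as {(row, column): value}."""
    return {
        (i, j): value
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    }


# Hypothetical read counts for 4 cells over 10 open chromatin regions.
peaks = [
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 2, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]
fraction = nonzero_fraction(peaks)
sparse = to_coordinate_form(peaks)
print(fraction)
print(sparse)
```
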

Gene expression can also be affected by a number of additional factors that are investigated

under the umbrella of epigenetics, which studies mechanisms like DNA methylation

and histone modification that do not change the DNA sequence but can change gene

activity and expression [37]. DNA methylation occurs when methyl groups are bonded to

the DNA molecule, which can repress gene transcription, and is associated with a number

of key biological processes [38]. In mammals, DNA methylation occurs most often at

particular portions of the base pair sequence, namely CG (denoted CpG) sites where

a cytosine is followed by a guanine [39]. New technologies developed in the past decade

for the profiling of DNA methylation use bisulfite sequencing (scBS-seq) [40, 41] or reduced

representation bisulfite sequencing (scRRBS-seq) [42, 43] at single-cell resolution. The

output data for these are binary: 1 indicates a methylated region, while 0

indicates no methylation.
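
As a small illustration of the CpG definition and the binary output format, the sketch below locates CpG sites in a made-up sequence and encodes a hypothetical set of methylated positions as a 0/1 vector. The function names, the sequence, and the methylation calls are all assumptions for illustration; in particular, real data also contain sites with no coverage, which this sketch does not distinguish from unmethylated sites.

```python
def cpg_positions(sequence):
    """Return 0-based positions where a cytosine is directly followed by a guanine."""
    seq = sequence.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def encode_methylation(sites, methylated):
    """Binary vector over CpG sites: 1 if methylated, 0 otherwise."""
    return [1 if site in methylated else 0 for site in sites]


# Hypothetical DNA fragment and hypothetical methylation calls for one cell.
sequence = "ACGTTCGGACG"
sites = cpg_positions(sequence)
methylated_sites = {1, 9}
encoding = encode_methylation(sites, methylated_sites)
print(sites)
print(encoding)
```
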

2.2 Multi-Modality Proﬁling

In addition to the cell transcriptome and epigenome, another focus of single-cell technologies

is the cell proteome, which consists of the proteins that are encoded by the mRNA of

the cell. Comprehensive measurements of a cell’s proteome are integral to understanding

how the genes respond to environmental changes, as well as for predicting cellular behavior

since proteins are the functional units responsible for most of the cellular processes. While

single-cell sequencing techniques for transcriptome measurements have widely proliferated,

single-cell proteomics methods have made slower progress. Unlike most of the sequencing

technologies which have a standard process, proteomic measurements are often bespoke and

designed for speciﬁc applications [44]. However, some technologies developed have made

signiﬁcant strides in not only capturing protein information of cells but combining this

with mRNA measurements. Specifically, Cellular Indexing of Transcriptomes and Epitopes

by Sequencing (CITE-seq), a technology introduced in 2017, simultaneously sequences

mRNA and measures the surface proteins on a cell [45]. The method can sample over 1,000

genes and 80 proteins per cell but, like many other sequencing techniques, suffers from high

noise. In addition, CITE-seq is incapable of detecting intracellular proteins [46].

The repertoire of multi-modal single-cell technologies bridges RNA expression not only

to protein but also to DNA methylation, chromatin accessibility, and histone modiﬁcations.

One of the ﬁrst methods to simultaneously sequence RNA and chromatin accessibility is

a droplet-based method named SNARE-seq. Published in 2019, it uses Tn5 transposase

to capture accessible chromatin and creates shared barcodes between RNA and accessible

regions [47]. The same year, Paired-seq raised the throughput by two orders of magnitude by combining a

ligation-based combinatorial indexing strategy and an amplify-and-split library dedicating

method [48]. SHARE-seq [49] further increased the throughput and resolution by adapting

Paired-seq and SPLiT-seq [50], a scRNA-seq technology. In 2020, 10X also released 10X

Multiome, a commercialized product for the joint proﬁling of RNA and chromatin accessibil-

ity. Beyond the co-proﬁling of RNA and chromatin accessibility, scM&T-seq [51], based on

G&T-seq [52], allows for parallel analysis of single-cell RNA and DNA methylation. Another

emerging ﬁeld is the joint proﬁling of RNA and histone modiﬁcations. Example technologies

in this category include Paired-Tag [53] and CoTECH [54], both of which became available in 2021.
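
The role of shared barcodes in joint profiling can be sketched as a simple join: measurements from two modalities are paired whenever they carry the same cell barcode. The function name `pair_modalities`, the barcodes, and the feature names below are hypothetical, and this sketch does not reflect the internals of any specific protocol named above.

```python
def pair_modalities(rna, atac):
    """Pair RNA and chromatin accessibility profiles that share a cell barcode."""
    shared = sorted(rna.keys() & atac.keys())
    return {barcode: (rna[barcode], atac[barcode]) for barcode in shared}


# Hypothetical per-cell profiles keyed by cell barcode.
rna = {"BC1": {"GeneX": 3}, "BC2": {"GeneX": 0}, "BC3": {"GeneX": 5}}
atac = {"BC2": {"peak_1": 1}, "BC3": {"peak_1": 0}, "BC4": {"peak_1": 2}}

paired = pair_modalities(rna, atac)
print(sorted(paired))  # barcodes observed in both modalities
print(paired["BC3"])
```

Cells seen in only one modality (here BC1 and BC4) are dropped, so every retained cell has both an RNA and an accessibility profile.
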

2.3 Single-Cell Spatial Transcriptomics

Single-cell technologies that capture transcriptomic, proteomic, or epigenetic information

do so with great precision but with the loss of spatial information of the cells within the

tissues. However, the cells’ relative locations within a tissue are critical to understanding normal

development and disease pathology. With spatial transcriptomic technologies, researchers

are able to measure transcriptomics and leverage the spatial information or relative locations

of cells in a tissue to improve performance on downstream tasks [55, 56, 57, 58, 59, 60, 61]. For

example, motivated by the fact that a pair of ligand and receptor with closer distance
