A single-cell gene expression language model
William Connell
Department of Pharmaceutical Chemistry
Institute for Neurodegenerative Diseases
University of California, San Francisco
San Francisco, CA 94143
connell@keiserlab.org
Umair Khan
Department of Pharmaceutical Chemistry
Institute for Neurodegenerative Diseases
University of California, San Francisco
San Francisco, CA 94143
ukhan@keiserlab.org
Michael J. Keiser
Department of Pharmaceutical Chemistry
Institute for Neurodegenerative Diseases
University of California, San Francisco
San Francisco, CA 94143
keiser@keiserlab.org
Abstract
Gene regulation is a dynamic process that connects genotype and phenotype. Given
the difficulty of physically mapping mammalian gene circuitry, we require new
computational methods to learn regulatory rules. Natural language is a valuable
analogy to the communication of regulatory control. Machine learning systems
model natural language by explicitly learning context dependencies between words.
We propose a similar system applied to single-cell RNA expression profiles to learn
context dependencies between genes. Our model, Exceiver, is trained across a
diversity of cell types using a self-supervised task formulated for discrete count data,
accounting for feature sparsity. We found agreement between the similarity profiles
of latent sample representations and learned gene embeddings with respect to
biological annotations. We evaluated Exceiver on a new dataset and a downstream
prediction task and found that pretraining supports transfer learning. Our work
provides a framework to model gene regulation on a single-cell level and transfer
knowledge to downstream tasks.
1 Introduction
Many biological processes regulate the relationship between genotype and phenotype. On one
hand, classical genetics defines simple hereditary rules. On the other hand, complex regulatory
networks mediate response to the environment. Eventually, we may hope to model molecular circuitry
comprehensively to accurately predict phenotypes. In this direction, learned generalizations of
biological processes may support medical interventions such as individual disease risk prediction and
patient therapy selection.
Large-scale cellular assays capture snapshots of complex and dynamic biological processes such as
gene regulation, with RNA abundances being one measurable outcome. Single-cell RNA sequencing
(scRNA-seq) observations can relate cellular states and mRNA expression relationships, revealing
gene programs corresponding to disease processes, genetic perturbations, and therapeutic interventions [4]. Given the difficulty of physically mapping regulatory circuitry explicitly, we hypothesized that a model trained on a large volume of transcriptomic profiles would instead implicitly learn RNA expression dependencies that reflect regulatory logic.
Accepted at Learning Meaningful Representations of Life Workshop, 36th Conference on Neural Information
Processing Systems (NeurIPS 2022).
arXiv:2210.14330v1 [q-bio.QM] 25 Oct 2022
Figure 1: Exceiver learns cell embeddings reflecting tissue and compartment. (a) Architectural overview and pretraining strategy. UMAP of (b) original data PCA embeddings and (c) Exceiver sample embeddings colored by tissue type and compartment.
Pretrained models in natural language processing, computer vision, and protein modeling motivate a similar approach in systems biology [2, 3, 8]. Pretrained models that transfer to downstream tasks
share three components leveraging domain-specific inductive biases. First, sufficient unlabeled data
volumes provide enough information for highly parameterized models to learn complex relationships
between features. Second, models learn from unlabeled data in a self-supervised manner, often by feature masking, in which the model predicts a masked fraction of feature values from the remaining unmasked features.
Third, an attention mechanism learns the dependencies between features. Traditionally, a transformer
applies self-attention to learn context-dependent feature representations. Given the success of this
recipe across various domains, we propose to model gene regulation similarly.
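The masking step of this recipe can be sketched for sparse count profiles. In the sketch below, `mask_profile` and its `mask_frac` default are illustrative names, and restricting masking to expressed genes is one simple way to respect the feature sparsity of scRNA-seq data, not necessarily Exceiver's exact scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_profile(counts, mask_frac=0.5):
    """Randomly mask a fraction of the *expressed* genes in one
    expression profile; the masked positions become the prediction
    targets. (Illustrative sketch, not the paper's exact scheme.)"""
    expressed = np.flatnonzero(counts > 0)           # skip dropout zeros
    n_mask = max(1, int(mask_frac * expressed.size))
    masked_idx = rng.choice(expressed, size=n_mask, replace=False)
    inputs = counts.astype(float).copy()
    inputs[masked_idx] = 0.0                         # hide masked values
    targets = counts[masked_idx]                     # values to reconstruct
    return inputs, masked_idx, targets

counts = np.array([0, 3, 0, 7, 1, 0, 2, 5])
inputs, idx, targets = mask_profile(counts)
```

Excluding unexpressed genes from masking avoids conflating technical dropout (zeros that may be measurement artifacts) with the self-supervised reconstruction target.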
Building on sequence modeling, Exceiver (Expression-Perceiver) is a single-cell gene expression
language model pretrained on an atlas of transcriptomic data. We leveraged the Perceiver IO
framework to train a long-context sequence model on all protein-coding genes in a self-supervised
manner [5, 6]. We evaluated latent sample representations with respect to metadata labels including
cell compartment, tissue, and cell ontology. We analyzed the similarity of learned gene embeddings
relative to known molecular interactions. Finally, we assessed pretrained Exceiver models on new
datasets and a downstream task. Exceiver provides a framework to learn gene regulatory logic from
unlabeled single-cell transcriptomes and transfer knowledge to new domains.
2 Exceiver accounts for discrete features and technical dropout in
scRNA-seq self-supervised pretraining
Exceiver builds on Perceiver IO to encode single-cell transcriptomic profiles. Perceiver IO scales
linearly with the size of inputs and outputs, allowing tractable transformer-based encoding and
decoding of long-context sequences. Exceiver retains the core Perceiver IO architectural components:
a cross-attention encoder, a self-attention latent process module, and a cross-attention decoder (Figure
1a; Methods).
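The three components named above can be illustrated with a minimal PyTorch sketch. `TinyPerceiverIO`, its dimensions, and the choice to scale gene embeddings by expression values are illustrative assumptions, not Exceiver's published configuration; the comments mark why cost stays linear in the number of genes.

```python
import torch
import torch.nn as nn

class TinyPerceiverIO(nn.Module):
    """Minimal sketch of the three Perceiver IO components: a
    cross-attention encoder, a self-attention latent module, and a
    cross-attention decoder. Sizes are illustrative only."""
    def __init__(self, n_genes=512, d=64, n_latents=32):
        super().__init__()
        self.gene_emb = nn.Parameter(torch.randn(n_genes, d) * 0.02)
        self.latents = nn.Parameter(torch.randn(n_latents, d) * 0.02)
        self.encoder = nn.MultiheadAttention(d, 4, batch_first=True)
        self.process = nn.MultiheadAttention(d, 4, batch_first=True)
        self.decoder = nn.MultiheadAttention(d, 4, batch_first=True)
        self.head = nn.Linear(d, 1)

    def forward(self, expr):                          # expr: (batch, n_genes)
        b = expr.shape[0]
        # One assumed input scheme: scale each gene embedding by expression.
        tokens = self.gene_emb[None] * expr[..., None]
        lat = self.latents[None].expand(b, -1, -1)
        lat, _ = self.encoder(lat, tokens, tokens)    # cross-attn: O(n_genes)
        lat, _ = self.process(lat, lat, lat)          # self-attn: O(n_latents^2)
        out, _ = self.decoder(tokens, lat, lat)       # query back per gene
        return self.head(out).squeeze(-1)             # (batch, n_genes)

model = TinyPerceiverIO()
pred = model(torch.rand(2, 512))
```

Because self-attention runs only over the small fixed set of latents, the quadratic cost is decoupled from sequence length, which is what makes long-context encoding of all protein-coding genes tractable.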
Exceiver extends this general architecture to accommodate various biological and experimental
priors. To account for discrete RNA abundances, Exceiver represents individual genes as learnable embeddings.