A single-cell gene expression language model
William Connell
Department of Pharmaceutical Chemistry
Institute for Neurodegenerative Diseases
University of California, San Francisco
San Francisco, CA 94143
connell@keiserlab.org
Umair Khan
Department of Pharmaceutical Chemistry
Institute for Neurodegenerative Diseases
University of California, San Francisco
San Francisco, CA 94143
ukhan@keiserlab.org
Michael J. Keiser
Department of Pharmaceutical Chemistry
Institute for Neurodegenerative Diseases
University of California, San Francisco
San Francisco, CA 94143
keiser@keiserlab.org
Abstract
Gene regulation is a dynamic process that connects genotype and phenotype. Given
the difficulty of physically mapping mammalian gene circuitry, we require new
computational methods to learn regulatory rules. Natural language is a valuable
analogy to the communication of regulatory control. Machine learning systems
model natural language by explicitly learning context dependencies between words.
We propose a similar system applied to single-cell RNA expression profiles to learn
context dependencies between genes. Our model, Exceiver, is trained across a
diversity of cell types using a self-supervised task formulated for discrete count data,
accounting for feature sparsity. We found agreement between the similarity profiles
of latent sample representations and learned gene embeddings with respect to
biological annotations. We evaluated Exceiver on a new dataset and a downstream
prediction task and found that pretraining supports transfer learning. Our work
provides a framework to model gene regulation on a single-cell level and transfer
knowledge to downstream tasks.
1 Introduction
Many biological processes regulate the relationship between genotype and phenotype. On one
hand, classical genetics defines simple hereditary rules. On the other hand, complex regulatory
networks mediate response to the environment. Eventually, we may hope to model molecular circuitry
comprehensively to accurately predict phenotypes. In this direction, learned generalizations of
biological processes may support medical interventions such as individual disease risk prediction and
patient therapy selection.
Large-scale cellular assays capture snapshots of complex and dynamic biological processes such as
gene regulation, with RNA abundances being one measurable outcome. Single-cell RNA sequencing
(scRNA-seq) observations can relate cellular states and mRNA expression relationships, revealing
gene programs corresponding to disease processes, genetic perturbations, and therapeutic interventions [4]. Given the difficulty of physically mapping regulatory circuitry explicitly, we hypothesized that a model trained on a large volume of transcriptomic profiles would instead implicitly learn RNA expression dependencies that reflect regulatory logic.
Accepted at Learning Meaningful Representations of Life Workshop, 36th Conference on Neural Information
Processing Systems (NeurIPS 2022).
arXiv:2210.14330v1 [q-bio.QM] 25 Oct 2022
Figure 1: Exceiver learns cell embeddings reflecting tissue and compartment. (a) Architectural overview and pretraining strategy. UMAP of (b) original data PCA embeddings and (c) Exceiver sample embeddings colored by tissue type and compartment.
Pretrained models in natural language processing, computer vision, and protein modeling motivate a similar approach in systems biology [2, 3, 8]. Pretrained models that transfer to downstream tasks
share three components leveraging domain-specific inductive biases. First, sufficient unlabeled data
volumes provide enough information for highly parameterized models to learn complex relationships
between features. Second, models learn from unlabeled data in a self-supervised manner, often by feature masking, in which the model predicts a masked fraction of feature values from the remaining unmasked features.
Third, an attention mechanism learns the dependencies between features. Traditionally, a transformer
applies self-attention to learn context-dependent feature representations. Given the success of this
recipe across various domains, we propose to model gene regulation similarly.
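The masking step of this recipe can be sketched for sparse count profiles. In the sketch below, `mask_profile` and its `mask_frac` default are illustrative names, and restricting masking to expressed genes is one simple way to respect the feature sparsity of scRNA-seq data, not necessarily Exceiver's exact scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_profile(counts, mask_frac=0.5):
    """Randomly mask a fraction of the *expressed* genes in one
    expression profile; the masked positions become the prediction
    targets. (Illustrative sketch, not the paper's exact scheme.)"""
    expressed = np.flatnonzero(counts > 0)           # skip dropout zeros
    n_mask = max(1, int(mask_frac * expressed.size))
    masked_idx = rng.choice(expressed, size=n_mask, replace=False)
    inputs = counts.astype(float).copy()
    inputs[masked_idx] = 0.0                         # hide masked values
    targets = counts[masked_idx]                     # values to reconstruct
    return inputs, masked_idx, targets

counts = np.array([0, 3, 0, 7, 1, 0, 2, 5])
inputs, idx, targets = mask_profile(counts)
```

Excluding unexpressed genes from masking avoids conflating technical dropout (zeros that may be measurement artifacts) with the self-supervised reconstruction target.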
Building on sequence modeling, Exceiver (Expression-Perceiver) is a single-cell gene expression
language model pretrained on an atlas of transcriptomic data. We leveraged the Perceiver IO
framework to train a long-context sequence model on all protein-coding genes in a self-supervised
manner [5, 6]. We evaluated latent sample representations with respect to metadata labels including
cell compartment, tissue, and cell ontology. We analyzed the similarity of learned gene embeddings
relative to known molecular interactions. Finally, we assessed pretrained Exceiver models on new
datasets and a downstream task. Exceiver provides a framework to learn gene regulatory logic from
unlabeled single-cell transcriptomes and transfer knowledge to new domains.
2 Exceiver accounts for discrete features and technical dropout in
scRNA-seq self-supervised pretraining
Exceiver builds on Perceiver IO to encode single-cell transcriptomic profiles. Perceiver IO scales
linearly with the size of inputs and outputs, allowing tractable transformer-based encoding and
decoding of long-context sequences. Exceiver retains the core Perceiver IO architectural components:
a cross-attention encoder, a self-attention latent process module, and a cross-attention decoder (Figure
1a; Methods).
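The three components named above can be illustrated with a minimal PyTorch sketch. `TinyPerceiverIO`, its dimensions, and the choice to scale gene embeddings by expression values are illustrative assumptions, not Exceiver's published configuration; the comments mark why cost stays linear in the number of genes.

```python
import torch
import torch.nn as nn

class TinyPerceiverIO(nn.Module):
    """Minimal sketch of the three Perceiver IO components: a
    cross-attention encoder, a self-attention latent module, and a
    cross-attention decoder. Sizes are illustrative only."""
    def __init__(self, n_genes=512, d=64, n_latents=32):
        super().__init__()
        self.gene_emb = nn.Parameter(torch.randn(n_genes, d) * 0.02)
        self.latents = nn.Parameter(torch.randn(n_latents, d) * 0.02)
        self.encoder = nn.MultiheadAttention(d, 4, batch_first=True)
        self.process = nn.MultiheadAttention(d, 4, batch_first=True)
        self.decoder = nn.MultiheadAttention(d, 4, batch_first=True)
        self.head = nn.Linear(d, 1)

    def forward(self, expr):                          # expr: (batch, n_genes)
        b = expr.shape[0]
        # One assumed input scheme: scale each gene embedding by expression.
        tokens = self.gene_emb[None] * expr[..., None]
        lat = self.latents[None].expand(b, -1, -1)
        lat, _ = self.encoder(lat, tokens, tokens)    # cross-attn: O(n_genes)
        lat, _ = self.process(lat, lat, lat)          # self-attn: O(n_latents^2)
        out, _ = self.decoder(tokens, lat, lat)       # query back per gene
        return self.head(out).squeeze(-1)             # (batch, n_genes)

model = TinyPerceiverIO()
pred = model(torch.rand(2, 512))
```

Because self-attention runs only over the small fixed set of latents, the quadratic cost is decoupled from sequence length, which is what makes long-context encoding of all protein-coding genes tractable.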
Exceiver extends this general architecture to accommodate various biological and experimental
priors. To account for discrete RNA abundances, Exceiver represents individual genes as learnable embeddings.