The current state of single-cell proteomics data analysis
Christophe Vanderaa1 and Laurent Gatto1
1Computational Biology and Bioinformatics Unit (CBIO), de Duve Institute,
UCLouvain, Belgium
Email: laurent.gatto@uclouvain.be
Abstract
Sound data analysis is essential to retrieve meaningful biological information from single-
cell proteomics experiments. This analysis is carried out by computational methods that are
assembled into workflows, and their implementations influence the conclusions that can be drawn
from the data. In this work, we explore and compare the computational workflows that have been
used over the last four years and identify a profound lack of consensus on how to analyze single-
cell proteomics data. We highlight the need for benchmarking of computational workflows,
standardization of computational tools and data, as well as carefully designed experiments.
Finally, we cover the current standardization efforts that aim to fill this gap, list the remaining
missing pieces, and conclude with lessons learned from the replication of published single-cell
proteomics analyses.
Keywords: mass spectrometry, proteomics, single-cell, data analysis, reproducible research.
1 Introduction
Conducting a principled data analysis is not trivial, especially when technologies and the data they
generate increase in complexity at a fast pace. This is particularly true for mass spectrometry (MS)-
based single-cell proteomics (SCP) data analysis. Several hurdles need to be overcome in order
to extract biologically meaningful information from these complex data [61]. Numerous methods
exist to correct for technical issues, and each method has its respective advantages and drawbacks.
In this review article, we show that the variety of available methods to process proteomics data
and the current lack of computational standards have led to a great heterogeneity in SCP data
analysis practices. This computational heterogeneity reflects the technical heterogeneity of the
field, as MS-based SCP has undergone many improvements. For instance, two sample preparation
strategies currently co-exist: SCP by label-free quantification (LFQ) and multiplexed SCP [31, 47,
13]. Multiplexing strategies include isobaric labelling, using tandem mass tags (TMT), or non-
isobaric labelling, using mass differential tags for relative and absolute quantification (mTRAQ)
Several chips have been developed, starting with the nanoPOTS chip [70], followed by the
N2 chip [65], the proteoCHIP [27], and the microfluidic SciProChip [24]. Efforts have also focused
on automating sample processing, with the successful integration of robot handlers such as the
Mantis [41], the OT-2 [35], or the CellenOne [27, 33, 65] dispensing devices. Several MS
instruments have been used, such as Orbitrap or time-of-flight instruments [40, 6, 17]. Furthermore,
new acquisition strategies have been implemented, such as data-independent acquisition [14, 17, 15,
6, 24], prioritized data acquisition [26], or increased precursor sampling and identification transfer
[62, 66], all of which reduce the number of missing values. Finally, several groups reported the
acquisition of post-translational modifications, further increasing the biological resolution of the
technology [40?].
This technical heterogeneity is thoroughly justified and benchmarked; each publication demonstrates
the added value of its experimental workflow. As the field demonstrates its potential, efforts are made
to make the technology broadly accessible and standardized through detailed protocols [41, 33, 26, 17]
or by replacing custom-built material with commercially available devices [35, 57]. Several groups
performed a thorough fine-tuning of experimental and instrumental parameters to better understand
their impact on analytical performance [56, 9, 51]. The current state of the field and the opportunities
to push the SCP technology to its full potential are regularly being discussed, sparking the interest
of a growing community [34, 52, 45, 31, 47, 13, 49, 48, 46]. These efforts, however, mostly focus
on the technical aspects of the technology and overlook current computational practices.
In this review, we provide a computational perspective on the discussion and examine the current
approaches and practices for analysing SCP data, specifically focusing on quantitative data
processing. The first section highlights the current heterogeneity in SCP data processing. The next
section covers the existing tools that address the current hurdles. Finally, the last section
provides several guidelines on how to improve SCP data analysis practices.
2 Quantitative data processing lacks consensus
Proteomics data analysis encompasses three main tasks: spectral data processing, quantitative data
processing, and downstream data analysis. Spectral data processing identifies and quantifies the
peptides from the acquired MS spectra. Assigning peptide sequences to MS spectra was spotlighted
as an important challenge for SCP data analysis [46], and several groups have contributed to
methodological and software improvements. For instance, Yu et al. extended the match-between-runs
(MBR) algorithm from MaxQuant to TMT data, taking advantage of the quantification data present
in unidentified MS2 spectra [67]. The iceR package also propagates information across runs; the
algorithm dramatically improves peptide identification and outperforms MBR [30]. Unfortunately,
iceR is only applicable to label-free data.
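To make the core idea behind MBR concrete, the toy sketch below transfers a peptide
identification from one run to a co-eluting, unidentified feature in another run when their
precursor m/z and retention times agree within tolerances. The data, column names, and tolerance
values are hypothetical assumptions for illustration only; actual implementations such as
MaxQuant's MBR or iceR additionally align retention times across runs and control the error rate
of transferred identifications.

# Toy sketch of match-between-runs (MBR); data and tolerances are illustrative.
import pandas as pd

identified = pd.DataFrame({            # features identified in run A
    "peptide": ["PEPTIDEK", "SAMPLER"],
    "mz": [800.4123, 409.2215],        # precursor m/z
    "rt": [35.2, 48.7],                # retention time, in minutes
})
unidentified = pd.DataFrame({          # quantified but unidentified features in run B
    "mz": [800.4125, 512.3010],
    "rt": [35.5, 20.1],
})

MZ_TOL, RT_TOL = 0.005, 1.0            # illustrative tolerances (Da, minutes)
for i, feat in unidentified.iterrows():
    hits = identified[
        (identified["mz"].sub(feat["mz"]).abs() < MZ_TOL)
        & (identified["rt"].sub(feat["rt"]).abs() < RT_TOL)
    ]
    if len(hits) == 1:                 # unambiguous match: transfer the identification
        unidentified.loc[i, "peptide"] = hits["peptide"].iloc[0]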
Another approach to improve peptide identification is to increase the confidence of matching by
re-scoring. Re-scoring uses the annotations generated by the search engines, such as the deviation
between expected and measured elution times or m/z, the peptide length, or the ion charge [59], to
update the score or probability that measured spectra correctly match spectra from a theoretical
or empirical spectral library. DART-ID, a Bayesian framework that updates posterior error
probabilities based on an accurate estimation of elution times, has been applied to SCP data and
showed a significant increase in the number of identified spectra [8]. Others have also improved
the Percolator re-scoring algorithm for SCP experiments [20, 19], although the measured
improvements were subtle. While these developments considerably improve the quality of spectrum
identification, no dedicated developments in quantitative data processing have been reported.
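As a deliberately simplified illustration of feature-based re-scoring, the sketch below fits a
classifier on PSM-level features to produce an updated match probability. The feature names and
simulated data are hypothetical placeholders, and this plain supervised fit is not the algorithm
used by Percolator or DART-ID, which rely on a semi-supervised target/decoy scheme and many more
features.

# Minimal re-scoring sketch on simulated PSM features; all values are placeholders.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
is_target = rng.integers(0, 2, n)                 # decoy (0) vs target (1) labels
psms = pd.DataFrame({
    # Correct (target) matches tend to elute closer to their predicted time
    "delta_rt": np.abs(rng.normal(0, np.where(is_target == 1, 0.5, 2.0))),
    "delta_mz": np.abs(rng.normal(0, 0.01, n)),   # precursor m/z error
    "pep_length": rng.integers(7, 30, n),
    "charge": rng.choice([2, 3, 4], n),
    "is_target": is_target,
})

features = ["delta_rt", "delta_mz", "pep_length", "charge"]
model = LogisticRegression().fit(psms[features], psms["is_target"])

# Updated score: probability that a PSM is a correct, target-like match
psms["rescored"] = model.predict_proba(psms[features])[:, 1]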
Quantitative data processing plays a critical role to overcome many technical artefacts and to
satisfy downstream analysis requirements. It consists of several steps. Quality controls ensure the
analysed data are composed of reliable information and remove features of low quality that could
otherwise compromise the validity of the results. Aggregation combines peptide level data into
protein level data. Log-transformation shapes the data so that the quantitative values more
closely follow a normal distribution. Imputation generates estimates for missing values. Finally,
normalization and batch
correction aim to remove technical differences between samples and are essential to avoid biased
results. Each of these steps can be implemented using different methods. For instance, many
methods exist for missing value imputation: replacing missing values by zero, by random values
sampled from an estimated background distribution, or by values estimated from the K-nearest
neighbours (KNN), among others. These imputation methods rely on different underlying assumptions
that have been extensively reviewed in the bulk proteomics field [5], but further research is
required to assess whether these assumptions remain valid for SCP data. Besides choosing the right
method, finding a correct sequence of steps is another challenge. For instance, batch effects
influence missing data and vice versa [61]. It has been suggested to correct for batch effects
before imputation [16], but batch correction methods such as ComBat [29] break down in the
presence of the high proportions of missing values found in SCP data.
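To make these steps concrete, the sketch below strings together a minimal quantitative processing
workflow on a hypothetical peptides-by-cells intensity matrix. The file names, QC thresholds, and
the choice and ordering of steps are illustrative assumptions only; as discussed above, selecting
and ordering these steps (for example, whether batch correction should precede imputation) is
precisely where consensus is lacking.

# Minimal sketch of quantitative data processing; inputs and thresholds are hypothetical.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical inputs: a peptides x cells intensity matrix and a mapping
# from each peptide to its parent protein.
quant = pd.read_csv("peptide_intensities.csv", index_col=0)
pep2prot = pd.read_csv("peptide_to_protein.csv", index_col=0)["protein"]

# 1. Quality control: remove cells with too few detected peptides and
#    peptides detected in too few cells (thresholds are illustrative).
quant = quant.loc[:, quant.notna().sum(axis=0) >= 500]
quant = quant.loc[quant.notna().sum(axis=1) >= 3]

# 2. Log-transformation, so intensities more closely follow a normal distribution
logq = np.log2(quant)

# 3. Normalization: centre each cell (column), then each peptide (row),
#    on its median to remove global technical differences
logq = logq.sub(logq.median(axis=0), axis=1)
logq = logq.sub(logq.median(axis=1), axis=0)

# 4. Imputation with K-nearest neighbours, one of many possible choices
imputed = pd.DataFrame(
    KNNImputer(n_neighbors=3).fit_transform(logq),
    index=logq.index,
    columns=logq.columns,
)

# 5. Aggregation: combine peptide-level values into protein-level values.
#    Batch correction (e.g. ComBat) is omitted here; where it belongs in
#    the sequence is itself an open question.
proteins = imputed.groupby(pep2prot).median()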
As of today, developing computational workflows for SCP quantitative data processing requires
expert knowledge. We use "computational workflow" or "computational pipeline" to refer to the
sequence of steps and methods that process quantification data for downstream statistical testing
or visualization. Computational workflows are built from scratch and their development often lacks
an explicit rationale. Since the field lacks systematic comparisons, benchmarks, or guidelines,
processing approaches differ fundamentally between publications. To illustrate our claim, we
review the computational approaches from several studies that shaped the SCP landscape since 2018
(Table 1). These studies present significant contributions to the field and showcase applications on
actual single cells (as opposed to bulk lysate dilutions). Five studies supplemented their
publication with material that allows their computational analysis to be repeated, at least
partially. Three studies from the Slavov Lab provide the R code and the data required to fully
repeat their results [53, 33, 17]. The code is, however, poorly documented and difficult for other
labs to reuse. Schoof et al. also offer the data needed to repeat their study and distribute their
computational workflow as a documented Python library, sceptre [44]. Their library heavily relies
on scanpy, a popular Python library for scRNA-Seq analysis [64]. Finally, Brunner et al. provide a
Python script that also relies on scanpy, but it lacks an explicit link to the input data [6].
Based on the available material (scripts for [53, 44, 6, 33, 17] or the methods section for the
others), we constructed Figure 1. We divide the workflow steps into 7 general categories and
further group the different steps depending on whether they are applied at the
precursor/peptide-to-spectrum match (PSM) level, the peptide level, or the protein level, or are
implicitly embedded in MS data preprocessing software.
Several conclusions can be drawn from Figure 1. First, each publication corresponds to its own
workflow. This variability cannot be explained solely by different experimental protocols: the
computational pipelines by Schoof et al. and Specht et al. differ substantially even though their
TMT-based acquisition protocols are closely related [53, 44], and the computational pipeline by
Liang et al. for processing LFQ data [35] is more similar to the TMT processing workflow of
Williams et al. than to its LFQ alternative. Moreover, some publications provide a minimalistic
computational workflow, with only 3 steps, while others perform extensive processing, with 20
steps. These observations highlight the lack of consensus and the need to identify the critical
steps in computational pipelines.
Table 1: Overview of influential SCP studies. These studies were published between 2018 and 2022.
MaxQuant, FragPipe, Proteome Discoverer (PD), and DIA-NN are software tools for peptide
identification and quantification. Peptide identification is performed by underlying search
engines such as Andromeda, MS-GF+, MSFragger, or SEQUEST. Multiplexing relies on TMT or mTRAQ
labelling, while no labelling implies an LFQ approach. Some publications link to associated
computational scripts, written in Python or R, to reproduce the analysis. The throughput is
expressed as the number of cells retained after sample quality control, if any (Figure 1A).
Study | Publication date | Raw data analysis | Labeling | Script | Throughput | Reference
Zhu et al. 2018 | Sep 2018 | MaxQuant/Andromeda | - | - | 6 | [69]
Budnik et al. 2018 | Oct 2018 | MaxQuant/Andromeda | TMT-10 | - | 190 | [7]
Dou et al. 2019 | Oct 2019 | MS-GF+, MASIC | TMT-10 | - | 72 | [18]
Zhu et al. 2019 | Nov 2019 | MaxQuant/Andromeda | - | - | 28 | [71]
Cong et al. 2020 | Jan 2020 | MaxQuant/Andromeda | - | - | 4 | [10]
Tsai et al. 2020 | May 2020 | MaxQuant/Andromeda | TMT-11 | - | 104 | [56]
Williams et al. 2020, LFQ | Aug 2020 | MaxQuant/Andromeda | - | - | 17 | [63]
Williams et al. 2020, TMT | Aug 2020 | MaxQuant/Andromeda | TMT-11 | - | 152 | [63]
Liang et al. 2020 | Dec 2020 | FragPipe/MSFragger | - | - | 3 | [35]
Specht et al. 2021 | Jan 2021 | MaxQuant/Andromeda | TMT-11, TMT-16 | R | 1,490 | [53]
Cong et al. 2021 | Feb 2021 | PD/SEQUEST | - | - | 6 | [11]
Schoof et al. 2021 | Jun 2021 | PD/SEQUEST | TMT-16 | Python | 2,025 | [44]
Woo et al. 2021 | Oct 2021 | MaxQuant/Andromeda | TMT-16 | - | 108 | [65]
Brunner et al. 2022 | Feb 2022 | DIA-NN | - | Python | 231 | [6]
Leduc et al. 2022 | Mar 2022 | MaxQuant/Andromeda | TMT-18 | R | 1,556 | [33]
Woo et al. 2022 | Mar 2022 | MaxQuant/Andromeda | - | - | 155 | [66]
Webber et al. 2022 | Apr 2022 | PD/SEQUEST | - | - | 28 | [62]
Derks et al. 2022 | Jul 2022 | DIA-NN | mTRAQ-3 | R | 155 | [17]