The current state of single-cell proteomics data analysis
Christophe Vanderaa1 and Laurent Gatto1
1Computational Biology and Bioinformatics Unit (CBIO), de Duve Institute,
UCLouvain, Belgium
Email: laurent.gatto@uclouvain.be
Abstract
Sound data analysis is essential to retrieve meaningful biological information from single-
cell proteomics experiments. This analysis is carried out by computational methods that are
assembled into workflows, and their implementations influence the conclusions that can be drawn
from the data. In this work, we explore and compare the computational workflows that have been
used over the last four years and identify a profound lack of consensus on how to analyze single-
cell proteomics data. We highlight the need for benchmarking of computational workflows,
standardization of computational tools and data, as well as carefully designed experiments.
Finally, we cover the current standardization efforts that aim to fill this gap, list the remaining
missing pieces, and conclude with lessons learned from the replication of published single-cell
proteomics analyses.
Keywords: mass spectrometry, proteomics, single-cell, data analysis, reproducible research.
1 Introduction
Conducting a principled data analysis is not trivial, especially when technologies and the data they
generate increase in complexity at a fast pace. This is particularly true for mass spectrometry (MS)-
based single-cell proteomics (SCP) data analysis. Several hurdles need to be overcome in order
to extract biologically meaningful information from these complex data [61]. Numerous methods
exist to correct for technical issues, and each method has its respective advantages and drawbacks.
In this review article, we show that the variety of available methods to process proteomics data
and the current lack of computational standards have led to a great heterogeneity in SCP data
analysis practices. This computational heterogeneity reflects the technical heterogeneity of the
field, as MS-based SCP has undergone many improvements. For instance, two sample preparation
strategies currently co-exist: SCP by label-free quantification (LFQ) and multiplexed SCP [31, 47,
13]. Multiplexing strategies include isobaric labelling, using tandem mass tags (TMT), or non-
isobaric labelling, using mass differential tags for relative and absolute quantification (mTRAQ)
Several chips have been developed, starting with the nanoPOTS chip [70], followed by the
N2 chip [65], the proteoCHIP [27], and the microfluidic SciProChip [24]. Efforts have also focused
on automating sample processing, with the successful integration of robot handlers such as the
Mantis [41], the OT-2 [35], or the CellenOne [27, 33, 65] dispensing devices. Several MS
instruments have been used, such as Orbitrap or time-of-flight instruments [40, 6, 17]. Furthermore,
new acquisition strategies have been implemented, such as data-independent acquisition [14, 17, 15,
6, 24], prioritized data acquisition [26], or increased precursor sampling and identification transfer
[62, 66], all of which reduce the number of missing values. Finally, several groups reported the
acquisition of post-translational modifications, further increasing the biological resolution of the
technology [40?].
This technical heterogeneity is thoroughly justified and benchmarked; each publication demonstrates
the added value of its experimental workflow. As the field demonstrates its potential, efforts are made
to make the technology broadly accessible and standardized through detailed protocols [41, 33, 26, 17]
or by replacing custom-built material with commercially available devices [35, 57]. Several groups
performed a thorough fine-tuning of experimental and instrumental parameters to better understand
their impact on analytical performance [56, 9, 51]. The current state of the field and the opportunities
to push the SCP technology to its full potential are regularly being discussed, sparking the interest
of a growing community [34, 52, 45, 31, 47, 13, 49, 48, 46]. These efforts, however, mostly focus
on the technical aspects of the technology and overlook current computational practices.
In this review, we provide a computational perspective on the discussion and examine the current
approaches and practices for analysing SCP data, specifically focusing on quantitative data
processing. The first section highlights the current heterogeneity in SCP data processing. The next
section covers the existing tools that address the current hurdles. Finally, the last section
provides several guidelines on how to improve SCP data analysis practices.
2 Quantitative data processing lacks consensus
Proteomics data analysis encompasses three main tasks: spectral data processing, quantitative data
processing, and downstream data analysis. Spectral data processing identifies and quantifies the
peptides from the acquired MS spectra. Assigning peptide sequences to MS spectra was spotlighted
as an important challenge for SCP data analysis [46], and several groups have contributed to
methodological and software improvements. For instance, Yu et al. extended the match-between-runs
(MBR) algorithm from MaxQuant to TMT data, taking advantage of the quantification data present
in unidentified MS2 spectra [67]. The iceR package also propagates information across runs; the
algorithm dramatically improves peptide identification and outperforms MBR [30]. Unfortunately,
iceR is only applicable to label-free data.
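To make the core idea behind MBR concrete, the toy sketch below transfers a peptide
identification from one run to a co-eluting, unidentified feature in another run when their
precursor m/z and retention times agree within tolerances. The data, column names, and tolerance
values are hypothetical assumptions for illustration only; actual implementations such as
MaxQuant's MBR or iceR additionally align retention times across runs and control the error rate
of transferred identifications.

# Toy sketch of match-between-runs (MBR); data and tolerances are illustrative.
import pandas as pd

identified = pd.DataFrame({            # features identified in run A
    "peptide": ["PEPTIDEK", "SAMPLER"],
    "mz": [800.4123, 409.2215],        # precursor m/z
    "rt": [35.2, 48.7],                # retention time, in minutes
})
unidentified = pd.DataFrame({          # quantified but unidentified features in run B
    "mz": [800.4125, 512.3010],
    "rt": [35.5, 20.1],
})

MZ_TOL, RT_TOL = 0.005, 1.0            # illustrative tolerances (Da, minutes)
for i, feat in unidentified.iterrows():
    hits = identified[
        (identified["mz"].sub(feat["mz"]).abs() < MZ_TOL)
        & (identified["rt"].sub(feat["rt"]).abs() < RT_TOL)
    ]
    if len(hits) == 1:                 # unambiguous match: transfer the identification
        unidentified.loc[i, "peptide"] = hits["peptide"].iloc[0]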
Another approach to improve peptide identification is to increase the confidence of matching by
re-scoring. Re-scoring uses the annotations generated by the search engines, such as the deviation
between expected and measured elution times or m/z, the peptide length, or the ion charge [59], to
update the score or probability that measured spectra correctly match spectra from a theoretical
or empirical spectral library. DART-ID, a Bayesian framework that updates posterior error
probabilities based on an accurate estimation of elution times, has been applied to SCP data and
showed a significant increase in the number of identified spectra [8]. Others have also improved
the Percolator re-scoring algorithm for SCP experiments [20, 19], although the measured
improvements were subtle. While these developments considerably improve the quality of spectrum
identification, no dedicated developments in quantitative data processing have been reported.
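As a deliberately simplified illustration of feature-based re-scoring, the sketch below fits a
classifier on PSM-level features to produce an updated match probability. The feature names and
simulated data are hypothetical placeholders, and this plain supervised fit is not the algorithm
used by Percolator or DART-ID, which rely on a semi-supervised target/decoy scheme and many more
features.

# Minimal re-scoring sketch on simulated PSM features; all values are placeholders.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
is_target = rng.integers(0, 2, n)                 # decoy (0) vs target (1) labels
psms = pd.DataFrame({
    # Correct (target) matches tend to elute closer to their predicted time
    "delta_rt": np.abs(rng.normal(0, np.where(is_target == 1, 0.5, 2.0))),
    "delta_mz": np.abs(rng.normal(0, 0.01, n)),   # precursor m/z error
    "pep_length": rng.integers(7, 30, n),
    "charge": rng.choice([2, 3, 4], n),
    "is_target": is_target,
})

features = ["delta_rt", "delta_mz", "pep_length", "charge"]
model = LogisticRegression().fit(psms[features], psms["is_target"])

# Updated score: probability that a PSM is a correct, target-like match
psms["rescored"] = model.predict_proba(psms[features])[:, 1]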
Quantitative data processing plays a critical role to overcome many technical artefacts and to
satisfy downstream analysis requirements. It consists of several steps. Quality controls ensure the
analysed data are composed of reliable information and remove features of low quality that could
otherwise compromise the validity of the results. Aggregation combines peptide level data into
protein level data. Log-transformation shapes the data so that the quantitative values more
closely follow a normal distribution. Imputation generates estimates for missing values. Finally,
normalization and batch
correction aim to remove technical differences between samples and are essential to avoid biased
results. Each of these steps can be implemented using different methods. For instance, many
methods exist for missing value imputation: replacing missing values by zero, by random values
sampled from an estimated background distribution, or by values estimated from the K-nearest
neighbours (KNN), among others. These imputation methods rely on different underlying assumptions
that have been extensively reviewed in the bulk proteomics field [5], but further research is
required to assess whether these assumptions remain valid for SCP data. Besides choosing the right
method, finding a correct sequence of steps is another challenge. For instance, batch effects
influence missing data and vice versa [61]. It has been suggested to correct for batch effects
before imputation [16], but batch correction methods such as ComBat [29] break down in the
presence of the high proportions of missing values found in SCP data.
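To make these steps concrete, the sketch below strings together a minimal quantitative processing
workflow on a hypothetical peptides-by-cells intensity matrix. The file names, QC thresholds, and
the choice and ordering of steps are illustrative assumptions only; as discussed above, selecting
and ordering these steps (for example, whether batch correction should precede imputation) is
precisely where consensus is lacking.

# Minimal sketch of quantitative data processing; inputs and thresholds are hypothetical.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical inputs: a peptides x cells intensity matrix and a mapping
# from each peptide to its parent protein.
quant = pd.read_csv("peptide_intensities.csv", index_col=0)
pep2prot = pd.read_csv("peptide_to_protein.csv", index_col=0)["protein"]

# 1. Quality control: remove cells with too few detected peptides and
#    peptides detected in too few cells (thresholds are illustrative).
quant = quant.loc[:, quant.notna().sum(axis=0) >= 500]
quant = quant.loc[quant.notna().sum(axis=1) >= 3]

# 2. Log-transformation, so intensities more closely follow a normal distribution
logq = np.log2(quant)

# 3. Normalization: centre each cell (column), then each peptide (row),
#    on its median to remove global technical differences
logq = logq.sub(logq.median(axis=0), axis=1)
logq = logq.sub(logq.median(axis=1), axis=0)

# 4. Imputation with K-nearest neighbours, one of many possible choices
imputed = pd.DataFrame(
    KNNImputer(n_neighbors=3).fit_transform(logq),
    index=logq.index,
    columns=logq.columns,
)

# 5. Aggregation: combine peptide-level values into protein-level values.
#    Batch correction (e.g. ComBat) is omitted here; where it belongs in
#    the sequence is itself an open question.
proteins = imputed.groupby(pep2prot).median()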
As of today, developing computational workflows for SCP quantitative data processing requires
expert knowledge. We use "computational workflow" or "computational pipeline" to refer to the
sequence of steps and methods that process quantification data for downstream statistical testing
or visualization. Computational workflows are built from scratch and their development often lacks
an explicit rationale. Since the field lacks systematic comparisons, benchmarks, or guidelines,
processing approaches differ fundamentally between publications. To illustrate our claim, we
review the computational approaches from several studies that shaped the SCP landscape since 2018
(Table 1). These studies present significant contributions to the field and showcase applications on
actual single cells (as opposed to bulk lysate dilutions). Five studies supplemented their
publication with material that allows their computational analysis to be repeated, at least
partially. Three studies from the Slavov Lab provide the R code and the data required to fully
repeat their results [53, 33, 17]. The code is, however, poorly documented and difficult for other
labs to reuse. Schoof et al. also offer the data needed to repeat their study and distribute their
computational workflow as a documented Python library, sceptre [44]. Their library heavily relies
on scanpy, a popular Python library for scRNA-Seq analysis [64]. Finally, Brunner et al. provide a
Python script that also relies on scanpy, but it lacks an explicit link to the input data [6].
Based on the available material (scripts for [53, 44, 6, 33, 17] or the methods section for the
others), we constructed Figure 1. We divide the workflow steps into 7 general categories and
further group the different steps depending on whether they are applied at the
precursor/peptide-to-spectrum match (PSM) level, the peptide level, or the protein level, or are
implicitly embedded in MS data preprocessing software.
Several conclusions can be drawn from Figure 1. First, each publication corresponds to its own
workflow. This variability cannot be explained solely by different experimental protocols: the
computational pipelines by Schoof et al. and Specht et al. differ substantially even though their
TMT-based acquisition protocols are closely related [53, 44], and the computational pipeline by
Liang et al. for processing LFQ data [35] is more similar to the TMT processing workflow of
Williams et al. than to its LFQ alternative. Moreover, some publications provide a minimalistic
computational workflow, with only 3 steps, while others perform extensive processing, with 20
steps. These observations highlight the lack of consensus and the need to identify the critical
steps in computational pipelines.
Table 1: Overview of influential SCP studies. These studies were published between 2018 and 2022.
MaxQuant, FragPipe, Proteome Discoverer (PD), and DIA-NN are software tools for peptide
identification and quantification. Peptide identification is performed by underlying search
engines such as Andromeda, MS-GF+, MSFragger, or SEQUEST. Multiplexing relies on TMT or mTRAQ
labelling, while no labelling implies an LFQ approach. Some publications link to associated
computational scripts, written in Python or R, to reproduce the analysis. The throughput is
expressed as the number of cells retained after sample quality control, if any (Figure 1A).
Study | Publication date | Raw data analysis | Labeling | Script | Throughput | Reference
Zhu et al. 2018 | Sep 2018 | MaxQuant/Andromeda | - | - | 6 | [69]
Budnik et al. 2018 | Oct 2018 | MaxQuant/Andromeda | TMT-10 | - | 190 | [7]
Dou et al. 2019 | Oct 2019 | MS-GF+, MASIC | TMT-10 | - | 72 | [18]
Zhu et al. 2019 | Nov 2019 | MaxQuant/Andromeda | - | - | 28 | [71]
Cong et al. 2020 | Jan 2020 | MaxQuant/Andromeda | - | - | 4 | [10]
Tsai et al. 2020 | May 2020 | MaxQuant/Andromeda | TMT-11 | - | 104 | [56]
Williams et al. 2020, LFQ | Aug 2020 | MaxQuant/Andromeda | - | - | 17 | [63]
Williams et al. 2020, TMT | Aug 2020 | MaxQuant/Andromeda | TMT-11 | - | 152 | [63]
Liang et al. 2020 | Dec 2020 | FragPipe/MSFragger | - | - | 3 | [35]
Specht et al. 2021 | Jan 2021 | MaxQuant/Andromeda | TMT-11, TMT-16 | R | 1,490 | [53]
Cong et al. 2021 | Feb 2021 | PD/SEQUEST | - | - | 6 | [11]
Schoof et al. 2021 | Jun 2021 | PD/SEQUEST | TMT-16 | Python | 2,025 | [44]
Woo et al. 2021 | Oct 2021 | MaxQuant/Andromeda | TMT-16 | - | 108 | [65]
Brunner et al. 2022 | Feb 2022 | DIA-NN | - | Python | 231 | [6]
Leduc et al. 2022 | Mar 2022 | MaxQuant/Andromeda | TMT-18 | R | 1,556 | [33]
Woo et al. 2022 | Mar 2022 | MaxQuant/Andromeda | - | - | 155 | [66]
Webber et al. 2022 | Apr 2022 | PD/SEQUEST | - | - | 28 | [62]
Derks et al. 2022 | Jul 2022 | DIA-NN | mTRAQ-3 | R | 155 | [17]