Challenging targets or describing mismatches?
A comment on Common Decoy Distribution by Madej et al.
Lucas Etourneau and Thomas Burger*
Univ. Grenoble Alpes, CNRS, CEA, Inserm, ProFI FR2048, Grenoble, France
* thomas.burger@cea.fr
Abstract: In their recent article, Madej et al.1 proposed an original way to solve the recurrent issue of
controlling for the false discovery rate (FDR) in peptide-spectrum-match (PSM) validation. Briefly,
they proposed to derive a single precise distribution of decoy matches termed the Common Decoy
Distribution (CDD) and to use it to control for FDR during a target-only search. Conceptually, this
approach is appealing as it takes the best of both worlds, i.e., decoy-based approaches (which
leverage a large-scale collection of empirical mismatches) and decoy-free approaches (which are not
subject to the randomness of decoy generation while sparing an additional database search).
Interestingly, CDD also corresponds to a middle-of-the-road approach in statistics with respect to the
two main families of FDR control procedures: Although historically based on estimating the false-
positive distribution, FDR control has recently been demonstrated to be possible thanks to
competition between the original variables (in proteomics, target sequences) and their fictional
counterparts (in proteomics, decoys). Discriminating between these two theoretical trends is of
prime importance for computational proteomics. In addition to highlighting why proteomics was a
source of inspiration for theoretical biostatistics, it provides practical insights into the improvements
that can be made to FDR control methods used in proteomics, including CDD.
1. A short history of FDR in biostatistics and proteomics
A False Discovery Rate (FDR) is a statistical estimate of the expected proportion of features that pass
a significance threshold by chance (a.k.a., false discoveries). With the advent of high-throughput
analyses, the number of measurable features has skyrocketed. To avoid producing a proportional
increase in false discoveries, it has become essential to control for the FDR (i.e., to conservatively
select features based on the FDR). Although the starting point of FDR theory is unquestionably dated
to 1995 with the publication of the seminal article by Benjamini and Hochberg (BH)2, a few later
publications acknowledge the importance of pre-existing work.3,4 After a few technical
improvements5,6 between 1995 and 2000, the subject really gained momentum with the publication
of the human genome7, which revealed how high-throughput biology could dramatically take
advantage of these hitherto purely theoretical advances. The early 2000s thus saw the emergence of
several innovations. On the theoretical side, a group of researchers from Stanford reformulated the
BH framework to better fit applications in biostatistics8–12. This notably led to the now well-
established concepts of q-value (or adjusted p-value) and empirical null estimation.
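In the notation that has since become standard (which is ours here, not that of the commented article), with V false discoveries among R selected features, m tested features, ordered p-values p_(1) ≤ … ≤ p_(m), and a target level α, the controlled quantity and the BH selection rule read:

    \mathrm{FDR} \;=\; \mathbb{E}\!\left[\frac{V}{\max(R,\,1)}\right],
    \qquad
    \hat{k} \;=\; \max\Bigl\{\, i \;:\; p_{(i)} \le \tfrac{i}{m}\,\alpha \Bigr\},

the BH procedure selecting the features attached to the k̂ smallest p-values; the q-value of a feature is then the smallest α at which it would still be selected.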
Meanwhile, in the proteomics community, questions akin to FDR estimation showed up under
several names (e.g., "false identification error rates"13 in 2002 and "false-positive identification
rate"14 in 2003). It also coincided with the moment when Elias and Gigy15 formulated their intuitions
about false positive simulation through decoy permutations, preceding what is now known as Target-
Decoy Competition16 (TDC). It should be noted that this was a complete conceptual breakthrough at
the time, as there was no statistical theory to support the idea that fictional variables (i.e., decoy
sequences) created from the original variables (i.e., target sequences) could be used to control for FDR.
This is also why decoy databases were soon proposed for use in ways that were more compliant with
the pre-existing theory of FDR control. Notably, in 2007-2008, two groups independently proposed
that target and decoy searches be performed on separate databases, i.e., without organizing a
competition between them (hereafter referred to as TDwoC, to emphasize the absence of
competition).17,18 The first, shorter and more conceptual, article17 became the benchmark (despite
the fact that the estimator proposed was far from optimal).19 This article notably established the
theoretical exactness of TDwoC by linking the approach to empirical null estimation,9,12 a concept to
which one of the co-authors had also contributed. In addition, they raised concerns about TDC in the
conclusions of the article, as in their opinion, the additional competition procedure made it difficult
to derive the distribution of target mismatch scores (a.k.a. target null PSMs). However, despite this
warning as well as rare voices pointing out the apparent inaccuracy of TDC20, the TDC approach
progressively became the reference method over the course of the following decade.
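To make the competition mechanism concrete, a minimal sketch of a TDC-style estimate is given below. It is not the code of any of the cited tools; the function and variable names are ours, and we assume each spectrum has already been searched against the concatenated target-decoy database, so that only the best match (target or decoy) per spectrum is kept:

    import numpy as np

    def tdc_fdr_curve(scores, is_decoy):
        """Estimate the FDR at each score threshold after target-decoy competition.

        scores   : best-PSM score per spectrum (winner of the target vs. decoy competition).
        is_decoy : boolean array, True if the winning match is a decoy.
        """
        scores = np.asarray(scores, dtype=float)
        is_decoy = np.asarray(is_decoy, dtype=bool)
        order = np.argsort(scores)[::-1]           # sort PSMs from best to worst score
        decoy_wins = np.cumsum(is_decoy[order])    # decoys accepted at each threshold
        target_wins = np.cumsum(~is_decoy[order])  # targets accepted at each threshold
        # Decoy wins serve as a proxy for the number of incorrect target PSMs above
        # the threshold; the +1 is the conservative correction used in
        # competition-based procedures.
        fdr = (decoy_wins + 1) / np.maximum(target_wins, 1)
        # Enforce monotonicity so that relaxing the threshold never lowers the
        # estimate (yielding q-value-like quantities).
        qvals = np.minimum.accumulate(fdr[::-1])[::-1]
        return order, qvals

    # Usage: keep target PSMs whose estimated q-value is below 1%.
    # order, qvals = tdc_fdr_curve(scores, is_decoy)
    # accepted = [i for i, q in zip(order, qvals) if q <= 0.01 and not is_decoy[i]]

The +1 correction in this sketch is precisely what makes such competition procedures provably conservative in the knockoff framework discussed next.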
This gap between practical approaches to FDR in proteomics and theoretical background in
biostatistics was tentatively filled by He et al. (in works that remained largely unpublished21,22 until
recently23). Briefly, these authors demonstrated that FDR could be controlled (at the peptide-only
level, as opposed to the more classically-considered PSM level) using decoy sequences. They
connected their demonstration to simultaneously emerging studies from the Candès group24.
Although Barber and Candès’ seminal work unleashed an important and on-going renewal of FDR
theory in the statistics community25–28, it may seem familiar to proteomics researchers, as its core
idea is to fabricate uninteresting putative biomarkers in silico (i.e., fictional variables referred to as
"knockoffs") and to use them to challenge each real putative biomarker through pairwise
competition.
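For readers wishing to see the parallel with TDC, the knockoff selection step can be sketched as follows. This is a simplified illustration (names and the α value are ours) that assumes a pairwise statistic W_j, positive when the real variable beats its knockoff, has already been computed:

    import numpy as np

    def knockoff_select(W, alpha=0.1):
        """Barber-Candes style selection from pairwise competition statistics W.

        W[j] > 0 means real variable j beat its knockoff, W[j] < 0 the opposite.
        """
        W = np.asarray(W, dtype=float)
        candidates = np.sort(np.abs(W[W != 0]))    # possible thresholds, smallest first
        for t in candidates:
            # Losses to knockoffs (+1) estimate the number of false positives
            # among the variables selected at threshold t.
            fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
            if fdp_hat <= alpha:
                return np.flatnonzero(W >= t)      # selected variables
        return np.array([], dtype=int)             # nothing passes at this level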
2. Two distinct approaches to FDR
Today, the FDR can be controlled in two ways, both in theoretical statistics and in proteomics: based
on a description of how false-positives distribute; or based on competition, by challenging the
variables of interest with fictional ones (a.k.a. knockoffs or decoys). We hereafter summarize the two
trends, along with their specificities.
2.1. "Describing decoys" or the null-based approach
The oldest approach is based on a simple rationale: The scores of observations we are not interested
in (spectrum/peptide mismatches) form the so-called null distribution in statistics. If enough is
known about the null distribution, then it is possible to "subtract" it from the observed distribution.
We are left with observations that lie beyond the null distribution, which can therefore be
considered of significant interest (discoveries); in other words, correct PSMs. Despite the complex
mathematical machinery needed for statistical guarantees, the original BH procedure is the first and
simplest implementation of this approach. However, this procedure relies on a strong assumption:
that the scores it processes are p-values, as only for such values is the null distribution known to be
uniform29, at least in theory30,31. As such, the BH procedure is the natural tool to control for the FDR
when analyzing differential expression, where statistical tests are applied to all putative biomarkers.
However, it can also be applied to peptide identification, provided PSM scores can be converted into
p-values32,33.
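As an illustration of this null-based route, a minimal sketch of the BH selection step is given below; it assumes that PSM scores have already been converted into p-values (as in refs. 32,33) and uses variable names of our own choosing:

    import numpy as np

    def bh_select(pvalues, alpha=0.01):
        """Benjamini-Hochberg selection at FDR level alpha.

        Returns the indices of the PSMs whose p-value passes the BH threshold.
        """
        p = np.asarray(pvalues, dtype=float)
        m = p.size
        order = np.argsort(p)                          # ascending p-values
        thresholds = alpha * np.arange(1, m + 1) / m   # i/m * alpha, i = 1..m
        passed = p[order] <= thresholds
        if not passed.any():
            return np.array([], dtype=int)             # no discovery at this level
        k = int(np.flatnonzero(passed).max())          # largest rank with p_(i) <= i*alpha/m
        return order[:k + 1]                           # accept all PSMs up to that rank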
If no p-value can be determined from the PSM scores, the approach remains valid, but an additional
preliminary step is necessary. The purpose of this step is to estimate how PSM scores distribute
under the null hypothesis (to keep the subsequent subtraction from the observed distribution
feasible). This extension of the BH framework is naturally referred to as “empirical null estimation”
(or “Empirical Bayes estimation” when the alternative hypothesis is also accounted for). Related
approaches have been used in proteomics for two decades13,34, and are still under investigation35.
TDwoC is their quintessence, as it provides a universal, conceptually simple, and easy-to-implement
means to derive the distribution of random matches.
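With the same caveats (illustrative sketch only, our own notation, not the estimator of refs. 17,18, and the assumption of target and decoy databases of comparable size), the counting estimate underlying this empirical null description could look as follows:

    import numpy as np

    def tdwoc_fdr(target_scores, decoy_scores, threshold, pi0=1.0):
        """Estimate the FDR among target PSMs with score >= threshold.

        target_scores : scores from the target-only search.
        decoy_scores  : scores from a separate decoy-only search, used as an
                        empirical model of the null (mismatch) distribution.
        pi0           : estimated proportion of incorrect target PSMs overall
                        (1.0 gives the most conservative estimate).
        """
        n_target = np.sum(np.asarray(target_scores) >= threshold)
        n_decoy = np.sum(np.asarray(decoy_scores) >= threshold)
        if n_target == 0:
            return 0.0
        # The decoy counts describe how many null (mismatch) scores exceed the
        # threshold; scaling by pi0 accounts for the fraction of target PSMs
        # that are actually incorrect.
        return min(1.0, pi0 * n_decoy / n_target)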
To summarize, when decoy sequences are used for empirical null modelling, they must be considered
as a whole, essentially as a means to describe the data under the null hypothesis. As this distribution