Challenging targets or describing mismatches?
A comment on Common Decoy Distribution by Madej et al.
Lucas Etourneau and Thomas Burger*
Univ. Grenoble Alpes, CNRS, CEA, Inserm, ProFI FR2048, Grenoble, France
* thomas.burger@cea.fr
Abstract: In their recent article, Madej et al.1 proposed an original way to solve the recurrent issue of
controlling for the false discovery rate (FDR) in peptide-spectrum-match (PSM) validation. Briefly,
they proposed to derive a single precise distribution of decoy matches termed the Common Decoy
Distribution (CDD) and to use it to control for FDR during a target-only search. Conceptually, this
approach is appealing as it takes the best of both worlds, i.e., decoy-based approaches (which
leverage a large-scale collection of empirical mismatches) and decoy-free approaches (which are not
subject to the randomness of decoy generation while sparing an additional database search).
Interestingly, CDD also corresponds to a middle-of-the-road approach in statistics with respect to the
two main families of FDR control procedures: Although historically based on estimating the false-
positive distribution, FDR control has recently been demonstrated to be possible thanks to
competition between the original variables (in proteomics, target sequences) and their fictional
counterparts (in proteomics, decoys). Discriminating between these two theoretical trends is of
prime importance for computational proteomics. In addition to highlighting why proteomics was a
source of inspiration for theoretical biostatistics, it provides practical insights into the improvements
that can be made to FDR control methods used in proteomics, including CDD.
1. A short history of FDR in biostatistics and proteomics
A False Discovery Rate (FDR) is a statistical estimate of the expected proportion of features that pass
a significance threshold by chance (a.k.a., false discoveries). With the advent of high-throughput
analyses, the number of measurable features has skyrocketed. To avoid producing a proportional
increase in false discoveries, it has become essential to control for the FDR (i.e., to conservatively
select features based on the FDR). Although the starting point of FDR theory is unquestionably dated
to 1995 with the publication of the seminal article by Benjamini and Hochberg (BH)2, a few later
publications acknowledge the importance of pre-existing work.3,4 After a few technical
improvements5,6 between 1995 and 2000, the subject really gained momentum with the publication
of the human genome7, which revealed how high-throughput biology could dramatically take
advantage of these hitherto purely theoretical advances. The early 2000s thus saw the emergence of
several innovations. On the theoretical side, a group of researchers from Stanford reformulated the
BH framework to better fit applications in biostatistics8–12. This notably led to the now well-
established concepts of q-value (or adjusted p-value) and empirical null estimation.
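In the notation that has since become standard (which is ours here, not that of the commented article), with V false discoveries among R selected features, m tested features, ordered p-values p_(1) ≤ … ≤ p_(m), and a target level α, the controlled quantity and the BH selection rule read:

    \mathrm{FDR} \;=\; \mathbb{E}\!\left[\frac{V}{\max(R,\,1)}\right],
    \qquad
    \hat{k} \;=\; \max\Bigl\{\, i \;:\; p_{(i)} \le \tfrac{i}{m}\,\alpha \Bigr\},

the BH procedure selecting the features attached to the k̂ smallest p-values; the q-value of a feature is then the smallest α at which it would still be selected.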
Meanwhile, in the proteomics community, questions akin to FDR estimation showed up under
several names (e.g., "false identification error rates"13 in 2002 and "false-positive identification
rate"14 in 2003). It also coincided with the moment when Elias and Gigy15 formulated their intuitions
about false positive simulation through decoy permutations, preceding what is now known as Target-
Decoy Competition16 (TDC). It should be noted that this was a complete conceptual breakthrough at
the time, as there was no statistical theory to support the idea that fictional variables (i.e., decoy
sequences) created from the original variables (i.e., target sequences) could be used to control for FDR.
This is also why decoy databases were soon proposed for use in ways that were more compliant with
the pre-existing theory of FDR control. Notably, in 2007-2008, two groups independently proposed
that target and decoy searches be performed on separate databases, i.e., without organizing a
competition between them (hereafter referred to as TDwoC, to emphasize the absence of
competition).17,18 The first, shorter and more conceptual, article17 became the benchmark (despite
the fact that the estimator proposed was far from optimal).19 This article notably established the
theoretical exactness of TDwoC by linking the approach to empirical null estimation,9,12 a concept to
which one of the co-authors had also contributed. In addition, they raised concerns about TDC in the
conclusions of the article, as in their opinion, the additional competition procedure made it difficult
to derive the distribution of target mismatch scores (a.k.a. target null PSMs). However, despite this
warning as well as rare voices pointing out the apparent inaccuracy of TDC20, the TDC approach
progressively became the reference method over the course of the following decade.
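To make the competition mechanism concrete, a minimal sketch of a TDC-style estimate is given below. It is not the code of any of the cited tools; the function and variable names are ours, and we assume each spectrum has already been searched against the concatenated target-decoy database, so that only the best match (target or decoy) per spectrum is kept:

    import numpy as np

    def tdc_fdr_curve(scores, is_decoy):
        """Estimate the FDR at each score threshold after target-decoy competition.

        scores   : best-PSM score per spectrum (winner of the target vs. decoy competition).
        is_decoy : boolean array, True if the winning match is a decoy.
        """
        scores = np.asarray(scores, dtype=float)
        is_decoy = np.asarray(is_decoy, dtype=bool)
        order = np.argsort(scores)[::-1]           # sort PSMs from best to worst score
        decoy_wins = np.cumsum(is_decoy[order])    # decoys accepted at each threshold
        target_wins = np.cumsum(~is_decoy[order])  # targets accepted at each threshold
        # Decoy wins serve as a proxy for the number of incorrect target PSMs above
        # the threshold; the +1 is the conservative correction used in
        # competition-based procedures.
        fdr = (decoy_wins + 1) / np.maximum(target_wins, 1)
        # Enforce monotonicity so that relaxing the threshold never lowers the
        # estimate (yielding q-value-like quantities).
        qvals = np.minimum.accumulate(fdr[::-1])[::-1]
        return order, qvals

    # Usage: keep target PSMs whose estimated q-value is below 1%.
    # order, qvals = tdc_fdr_curve(scores, is_decoy)
    # accepted = [i for i, q in zip(order, qvals) if q <= 0.01 and not is_decoy[i]]

The +1 correction in this sketch is precisely what makes such competition procedures provably conservative in the knockoff framework discussed next.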
This gap between practical approaches to FDR in proteomics and theoretical background in
biostatistics was tentatively filled by He et al. (in works that remained largely unpublished21,22 until
recently23). Briefly, these authors demonstrated that FDR could be controlled (at the peptide-only
level, as opposed to the more classically-considered PSM level) using decoy sequences. They
connected their demonstration to simultaneously emerging studies from the Candès group24.
Although Barber and Candès’ seminal work unleashed an important and on-going renewal of FDR
theory in the statistics community25–28, it may seem familiar to proteomics researchers, as its core
idea is to fabricate uninteresting putative biomarkers in silico (i.e., fictional variables referred to as
"knockoffs") and to use them to challenge each real putative biomarker through pairwise
competition.
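For readers wishing to see the parallel with TDC, the knockoff selection step can be sketched as follows. This is a simplified illustration (names and the α value are ours) that assumes a pairwise statistic W_j, positive when the real variable beats its knockoff, has already been computed:

    import numpy as np

    def knockoff_select(W, alpha=0.1):
        """Barber-Candes style selection from pairwise competition statistics W.

        W[j] > 0 means real variable j beat its knockoff, W[j] < 0 the opposite.
        """
        W = np.asarray(W, dtype=float)
        candidates = np.sort(np.abs(W[W != 0]))    # possible thresholds, smallest first
        for t in candidates:
            # Losses to knockoffs (+1) estimate the number of false positives
            # among the variables selected at threshold t.
            fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
            if fdp_hat <= alpha:
                return np.flatnonzero(W >= t)      # selected variables
        return np.array([], dtype=int)             # nothing passes at this level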
2. Two distinct approaches to FDR
Today, the FDR can be controlled in two ways, both in theoretical statistics and in proteomics: based
on a description of how false-positives distribute; or based on competition, by challenging the
variables of interest with fictional ones (a.k.a. knockoffs or decoys). We hereafter summarize the two
trends, along with their specificities.
2.1. "Describing decoys" or the null-based approach
The oldest approach is based on a simple rationale: The scores of observations we are not interested
in (spectrum/peptide mismatches) form the so-called null distribution in statistics. If enough is
known about the null distribution, then it is possible to "subtract" it from the observed distribution.
We are left with observations that lie beyond the null distribution, which can therefore be
considered of significant interest (discoveries); in other words, correct PSMs. Despite the complex
mathematical machinery needed for statistical guarantees, the original BH procedure is the first and
simplest implementation of this approach. However, this procedure relies on a strong assumption:
that the scores it processes are p-values, as only for such values is the null distribution known to be
uniform29, at least in theory30,31. As such, the BH procedure is the natural tool to control for the FDR
when analyzing differential expression, where statistical tests are applied to all putative biomarkers.
However, it can also be applied to peptide identification, provided PSM scores can be converted into
p-values32,33.
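As an illustration of this null-based route, a minimal sketch of the BH selection step is given below; it assumes that PSM scores have already been converted into p-values (as in refs. 32,33) and uses variable names of our own choosing:

    import numpy as np

    def bh_select(pvalues, alpha=0.01):
        """Benjamini-Hochberg selection at FDR level alpha.

        Returns the indices of the PSMs whose p-value passes the BH threshold.
        """
        p = np.asarray(pvalues, dtype=float)
        m = p.size
        order = np.argsort(p)                          # ascending p-values
        thresholds = alpha * np.arange(1, m + 1) / m   # i/m * alpha, i = 1..m
        passed = p[order] <= thresholds
        if not passed.any():
            return np.array([], dtype=int)             # no discovery at this level
        k = int(np.flatnonzero(passed).max())          # largest rank with p_(i) <= i*alpha/m
        return order[:k + 1]                           # accept all PSMs up to that rank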
If no p-value can be determined from the PSM scores, the approach remains valid, but an additional
preliminary step is necessary. The purpose of this step is to estimate how PSM scores distribute
under the null hypothesis (to keep the subsequent subtraction from the observed distribution
feasible). This extension of the BH framework is naturally referred to as “empirical null estimation”
(or “Empirical Bayes estimation” when the alternative hypothesis is also accounted for). Related
approaches have been used in proteomics for two decades13,34, and are still under investigation35.
TDwoC is their quintessence, as it provides a universal, conceptually simple, and easy-to-implement
means to derive the distribution of random matches.
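With the same caveats (illustrative sketch only, our own notation, not the estimator of refs. 17,18, and the assumption of target and decoy databases of comparable size), the counting estimate underlying this empirical null description could look as follows:

    import numpy as np

    def tdwoc_fdr(target_scores, decoy_scores, threshold, pi0=1.0):
        """Estimate the FDR among target PSMs with score >= threshold.

        target_scores : scores from the target-only search.
        decoy_scores  : scores from a separate decoy-only search, used as an
                        empirical model of the null (mismatch) distribution.
        pi0           : estimated proportion of incorrect target PSMs overall
                        (1.0 gives the most conservative estimate).
        """
        n_target = np.sum(np.asarray(target_scores) >= threshold)
        n_decoy = np.sum(np.asarray(decoy_scores) >= threshold)
        if n_target == 0:
            return 0.0
        # The decoy counts describe how many null (mismatch) scores exceed the
        # threshold; scaling by pi0 accounts for the fraction of target PSMs
        # that are actually incorrect.
        return min(1.0, pi0 * n_decoy / n_target)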
To summarize, when decoy sequences are used for empirical null modelling, they must be considered
as a whole, essentially as a means to describe the data under the null hypothesis. As this distribution