competition).17,18 The first article17, shorter and more conceptual, became the benchmark (even though the estimator it proposed was far from optimal).19 This article notably established the
theoretical exactness of TDwoC by linking the approach to empirical null estimation,9,12 a concept to
which one of the co-authors had also contributed. In the article's conclusions, the authors also raised concerns about TDC: in their opinion, the additional competition step made it difficult to derive the distribution of target mismatch scores (a.k.a. target null PSMs). However, despite this warning, as well as occasional voices pointing out the apparent inaccuracy of TDC20, the TDC approach progressively became the reference method over the following decade.
This gap between the practical approaches to FDR used in proteomics and their theoretical background in biostatistics was tentatively bridged by He et al. (in works that remained largely unpublished21,22 until recently23). Briefly, these authors demonstrated that FDR could be controlled (at the peptide level only, as opposed to the more classically considered PSM level) using decoy sequences. They
connected their demonstration to simultaneously emerging studies from the Candès group24.
Although Barber and Candès’ seminal work triggered an important and ongoing renewal of FDR theory in the statistics community25–28, its core idea may sound familiar to proteomics researchers: fabricating uninteresting putative biomarkers in silico (i.e., fictional variables referred to as “knockoffs”) and using them to challenge each real putative biomarker through pairwise competition.
2. Two distinct approaches to FDR
Today, the FDR can be controlled in two ways, both in theoretical statistics and in proteomics: based on a description of how false positives are distributed, or based on competition, by challenging the variables of interest with fictional ones (a.k.a. knockoffs or decoys). We hereafter summarize these two approaches, along with their specificities.
2.1. “Describing decoys” or the null-based approach
The oldest approach is based on a simple rationale: the scores of the observations we are not interested in (spectrum/peptide mismatches) form what statisticians call the null distribution. If enough is known about the null distribution, it is possible to “subtract” it from the observed distribution. We are then left with observations lying beyond the null distribution, which can therefore be considered significant discoveries, that is, correct PSMs. Despite the complex mathematical machinery required for its statistical guarantees, the original BH procedure is the first and simplest implementation of this approach. However, it relies on a strong assumption: that the processed scores are p-values, as p-values are the only scores whose null distribution is known to be uniform29, at least in theory30,31. As such, the BH procedure is the natural tool to control the FDR
when analyzing differential expression, where statistical tests are applied to all putative biomarkers.
However, it can also be applied to peptide identification, provided PSM scores can be converted into
p-values32,33.
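To make the procedure concrete, here is a minimal sketch of the BH step-up rule in Python (the function name is ours, and the p-values are assumed to be valid, i.e., uniform under the null, and independent; this is an illustration, not any specific tool's implementation):

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.01):
    """Step-up BH procedure: return a boolean mask of rejected hypotheses (discoveries).

    Sketch only: assumes valid, independent p-values.
    """
    pvalues = np.asarray(pvalues, dtype=float)
    m = pvalues.size
    order = np.argsort(pvalues)                    # rank p-values from smallest to largest
    thresholds = np.arange(1, m + 1) * alpha / m   # BH critical values k * alpha / m
    below = pvalues[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()             # largest rank meeting its critical value
        rejected[order[:k + 1]] = True             # reject all hypotheses up to that rank
    return rejected

# Example: report discoveries at 1% FDR from p-values derived from PSM scores
# discoveries = benjamini_hochberg(psm_pvalues, alpha=0.01)
```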
If no p-value can be determined from the PSM scores, the approach remains valid, but an additional
preliminary step is necessary. The purpose of this step is to estimate how PSM scores are distributed under the null hypothesis (so that the subsequent subtraction from the observed distribution remains
feasible). This extension of the BH framework is naturally referred to as “empirical null estimation”
(or “Empirical Bayes estimation” when the alternative hypothesis is also accounted for). Related
approaches have been used in proteomics for two decades13,34, and are still under investigation35.
TDwoC is the quintessence of these approaches, as it provides a universal, conceptually simple, and easy-to-implement
means to derive the distribution of random matches.
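As an illustration of this empirical null step (again a sketch under our own assumptions: separate target and decoy searches, higher scores indicating better matches, and a hypothetical function name), the decoy scores can be turned into empirical p-values for the target PSMs, which can then be fed to a BH-type procedure such as the one sketched above:

```python
import numpy as np

def empirical_pvalues(target_scores, decoy_scores):
    """Empirical p-value of each target PSM score, estimated from the decoy
    score distribution used as an empirical null: the fraction of decoy scores
    at least as high. Sketch only; the +1 avoids p-values of exactly zero.
    """
    decoys = np.sort(np.asarray(decoy_scores, dtype=float))
    n = decoys.size
    # For each target score, count how many decoy scores are greater or equal
    n_ge = n - np.searchsorted(decoys, np.asarray(target_scores, dtype=float), side="left")
    return (n_ge + 1) / (n + 1)

# Example: empirical null from decoys, then BH at 1% FDR
# discoveries = benjamini_hochberg(empirical_pvalues(target_scores, decoy_scores), alpha=0.01)
```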
To summarize, when decoy sequences are used for empirical null modelling, they must be considered
as a whole, essentially as a means to describe the data under the null hypothesis. As this distribution