a docking model that estimates the binding affinity of candidate small molecules to a receptor of interest by performing a constrained minimization of the model’s scoring function[1]. Many scoring functions have been proposed[1][2][3][4][5][6][7], ranging from force field calculations to knowledge-based methods and even machine learning methods.
The quality of a docking model is measured by how well the model can distinguish ligands from non-ligands[1][19]. This is done by examining how the model scores a dataset of small molecules consisting of (1) known ligands (called actives) and (2) molecules known or expected not to bind to the receptor (known as decoys). The practice of using a dataset of actives and decoys to measure how well a docking model can distinguish between ligands and non-ligands is known as retrospective docking[8]. This is the principal model validation technique available for docking models. Once properly validated, a docking model may be used to perform prospective docking, which means scoring molecules of unknown activity[8]. A prospective docking screen usually involves the scoring of many millions or even billions of molecules[19].
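As a concrete illustration, a retrospective evaluation amounts to pooling the actives and decoys, ranking them by docking score, and tracing how quickly the actives are recovered. The following minimal sketch is our own illustration rather than code from any cited work; it assumes NumPy, a lower-is-better score convention, and no tied scores.

import numpy as np

def roc_points(active_scores, decoy_scores):
    # Pool actives (label 1) and decoys (label 0) and rank them by
    # docking score, best (most negative) scores first.
    scores = np.concatenate([active_scores, decoy_scores])
    labels = np.concatenate([np.ones(len(active_scores)),
                             np.zeros(len(decoy_scores))])
    labels = labels[np.argsort(scores)]
    # Fraction of actives recovered (TPR) and decoys passed (FPR)
    # after each position in the ranked list.
    tpr = np.cumsum(labels) / len(active_scores)
    fpr = np.cumsum(1 - labels) / len(decoy_scores)
    return fpr, tpr

A model that ranks most actives ahead of most decoys yields a curve that rises steeply at small false positive rates, which is precisely the regime emphasized by the enrichment metrics discussed below.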
Decoys can be generated for a given receptor in a number of ways[19][11], but several established datasets of actives and decoys are also available as benchmarks[8][9][10]. Decoy sets are often designed to be particularly difficult to discriminate from actives for certain targets; if a given docking model can nevertheless discriminate between them in the setting of retrospective docking, this warrants a stronger belief in the model’s usefulness as a tool for enrichment in the setting of prospective docking[19].
An important goal in the field of computational drug discovery is to quantitatively measure the capacity of a given docking model to enrich a set of molecules by reliably predicting mostly favorable interactions for actives and mostly unfavorable interactions for decoys[14]. A number of enrichment metrics have been developed for this purpose[14][16][18]. For example, the metric known as enrichment factor (EF) captures the idea of enrichment by equating it to the proportion of actives present in some top fraction of the best-scoring molecules[15]. Quantitative metrics like EF also allow researchers to compare the performance of different docking models on the same dataset of actives and decoys[16].
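For concreteness, one common formulation of EF (notation ours; conventions vary slightly across the literature) normalizes the proportion of actives found among the top fraction χ of best-scoring molecules by the proportion of actives in the whole dataset:

\[
\mathrm{EF}_{\chi} = \frac{n_{\mathrm{act}}(\chi)\,/\,(\chi N)}{N_{\mathrm{act}}\,/\,N},
\]

where \(n_{\mathrm{act}}(\chi)\) is the number of actives among the \(\chi N\) best-scoring molecules, \(N_{\mathrm{act}}\) is the total number of actives, and \(N\) is the total number of molecules. Under this convention, a value of 1 corresponds to random selection and larger values indicate enrichment.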
The metric known as LogAUC[13] is one of the most popular metrics for evaluating the quality of molecular docking models[9][10][19][21]. However, it comes with a significant drawback: it depends on a cutoff parameter[13][17]. This parameter sets the minimum value of the log-scaled x-axis, a lower bound that is unavoidable once a logarithmic scale is used, since the logarithm diverges as the false positive rate approaches zero. Unless this parameter is chosen carefully for the dataset at hand, one of two problems arises: either (1) some fraction of the first inter-decoy intervals of the ROC curve is simply thrown away and does not contribute to the metric at all, or (2) the very first inter-decoy interval contributes too much to the metric (even compared to the contribution of the second inter-decoy interval), at the expense of all inter-decoy intervals that follow it.
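To make the role of the cutoff concrete, one common way of writing LogAUC (normalization conventions vary across implementations; we write λ for the cutoff, notation ours) integrates the true positive rate over a base-10 log-scaled false positive rate axis that starts at λ:

\[
\mathrm{LogAUC}_{\lambda} = \frac{1}{\log_{10}(1/\lambda)} \int_{\lambda}^{1} \mathrm{TPR}(x)\, \mathrm{d}\bigl(\log_{10} x\bigr),
\]

where \(\mathrm{TPR}(x)\) is the true positive rate at false positive rate \(x\). With \(N_{\mathrm{dec}}\) decoys, the first decoy appears at \(x = 1/N_{\mathrm{dec}}\), so when \(\lambda < 1/N_{\mathrm{dec}}\) the region before the first decoy spans \(\log_{10}(1/(\lambda N_{\mathrm{dec}}))\) units of the log-scaled axis. Choosing λ larger than \(1/N_{\mathrm{dec}}\) discards part or all of the earliest intervals (situation 1), while choosing λ much smaller than \(1/N_{\mathrm{dec}}\) lets the first interval dominate the integral (situation 2).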
We fix this problem with LogAUC by showing a simple way to choose the cutoff parameter based on the number of decoys, which forces the first inter-decoy interval to always make a stable, sensible contribution to the total value. Moreover, we introduce a normalized version of LogAUC known as enrichment score, which (1) enforces stability by selecting the cutoff parameter in the man-