Candidate screening. Companies are turning to machine learning to support recruitment decisions [Faliagka
et al., 2012, Shehu and Saeed, 2016]. Predictions using automatic resume screening [Amdouni and Abdessalem
Karaa, 2010, Faliagka et al., 2014], semantic matching [Mochol et al., 2007] or AI-assisted interviews are used
to screen and select candidates from a large pool. Related tasks include talent sourcing, e.g., finding people
who are likely to search for new opportunities, and candidate screening, i.e., selecting qualified applicants
before further human evaluation [Heaslip, 2022]. One may be interested in controlling the FDR (1) for
resource efficiency: each shortlisted candidate incurs comparable costs, such as outreach in talent sourcing
and interviews before the hiring decision. In candidate screening, controlling the FDR ensures that most of
these costs are devoted to evaluating and ranking qualified candidates. Job recruitment also raises fairness
concerns; an alternative goal is to ensure that qualified candidates are not screened out before human
evaluation. To this end, one can flip the sign of the outcomes, so that (1) represents the proportion of
qualified candidates among those filtered out.
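To make the two readings of (1) concrete, the following sketch (with hypothetical data; the function name `fdp` is illustrative, not from the paper) computes the false discovery proportion of a shortlist, and then applies the sign-flip trick by treating the filtered-out set as the "selection" and unqualified as the "positive" label:

```python
def fdp(selected, qualified):
    """False discovery proportion: fraction of selected items that are
    not qualified (0.0 by convention if nothing is selected)."""
    picked = [i for i, s in enumerate(selected) if s]
    if not picked:
        return 0.0
    false_discoveries = sum(1 for i in picked if not qualified[i])
    return false_discoveries / len(picked)

# Hypothetical applicant pool: True = truly qualified.
qualified = [True, True, False, True, False, False]
shortlist = [True, True, True, False, False, False]

# Resource-efficiency view: unqualified candidates among the shortlisted.
eff = fdp(shortlist, qualified)

# Fairness view, after flipping signs: qualified candidates among the
# filtered-out ones (here the "selection" is the complement of the shortlist).
fair = fdp([not s for s in shortlist], [not q for q in qualified])
```

In this toy pool both proportions equal 1/3: one unqualified candidate (index 2) was shortlisted, and one qualified candidate (index 3) was filtered out.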
Drug discovery. Machine learning is playing a similar role in accelerating the drug discovery pipeline.
Early stages of drug discovery aim at finding molecules or compounds (from a diverse library [Szymański
et al., 2011] developed by institutions across the world [Kim et al., 2021]) with strong effects such as high
binding affinity to a specific target. The activity of drug candidates can be evaluated by high-throughput
screening (HTS) [Macarron et al., 2011]. However, the capacity of this approach is quite limited in practice,
and it is generally infeasible to screen the whole library of readily synthesized compounds. Instead, virtual
screening [Huang, 2007] by machine learning models has enabled the automatic search of promising drugs.
Often, a representative (ideally diverse) subset of the whole library is evaluated by HTS; machine learning
models are then trained on these data to predict other candidates' activity based on their physical, geometric
and chemical features [Carracedo-Reboredo et al., 2021, Koutsoukas et al., 2017, Vamathevan et al., 2019,
Dara et al., 2021] and to select promising ones for further HTS and/or clinical trials. Given the cost of
subsequent investigation, false positives in this process are a major concern [Sink et al., 2010]. Ensuring that
a sufficiently large proportion of resources is devoted to promising drugs is thus important for the efficiency
of the whole pipeline.
In these two examples, the FDR quantifies a trade-off between the resources devoted to shortlisted
candidates (the selection set) and the benefits from finding interesting candidates (the true positives). This
interpretation is similar to the justification of FDR in multiple testing [Benjamini and Hochberg, 1995,
1997, Benjamini and Yekutieli, 2001]: when evaluating a large number of hypotheses, the FDR measures
the proportion of "false leads" passed on to follow-up confirmatory studies. However, in our prediction
problem, the affinity of a new drug is inferred not from observations of that drug itself, but from similar
compounds, i.e., other drugs in the training data. This perspective also blurs the distinction between
statistical inference and prediction; we will draw more connections between these sub-fields later.
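To make the multiple-testing analogy concrete, here is a minimal sketch of the Benjamini-Hochberg step-up procedure, which controls the FDR at level alpha for independent p-values; the p-values below are purely illustrative:

```python
def benjamini_hochberg(pvals, alpha=0.1):
    """Benjamini-Hochberg step-up procedure: reject the k smallest
    p-values, where k is the largest rank with p_(k) <= k * alpha / m."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank  # step-up: keep the largest rank passing the threshold
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected

# Illustrative p-values for m = 6 hypotheses.
rejections = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.27, 0.6], alpha=0.1)
```

Here the four smallest p-values are rejected: the fourth-smallest, 0.041, lies below its threshold 4 x 0.1 / 6 ≈ 0.067, while 0.27 exceeds 5 x 0.1 / 6 ≈ 0.083. In the prediction setting of this paper, the role of these p-values is played by evidence about each candidate derived from the training data.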
The FDR need not be interpreted as a resource-efficiency measure. In the next two examples,
controlling the FDR, which limits the error in inferring the direction of outcomes, is relevant to monitoring
risk in healthcare and to counterfactual inference.
Healthcare. With increasingly available patient data, machine learning is widely adopted to assist human
decisions in healthcare. For example, many works use machine learning predictions for large-scale early disease
diagnosis [Shen et al., 2019, Richens et al., 2020] and patient risk prediction [Rahimi et al., 2014, Corey et al.,
2018, Jalali et al., 2020]. Calibrating black-box models is important in such high-stakes problems. When
limiting false negatives matters more than limiting false positives, machine learning predictions might be
used to filter out low-risk cases, leaving the remaining cases for careful human evaluation. It is then sensible
to control the proportion of high-risk cases among all filtered-out samples.
Counterfactual inference. In randomized experiments that run over a period of time, inferring whether
patients have benefited from the treatment option compared to an alternative might inform decisions
such as early stopping of the trial for some patients. More generally, inferring the benefit for certain patients
also provides evidence on treatment effect heterogeneity. This is a counterfactual inference problem [Lei
and Candès, 2021, Jin et al., 2023] in which one could predict the counterfactuals, i.e., what would happen
should one take an alternative option, by learning from the outcomes of patients under that option, and