Selection by Prediction with Conformal p-values
Ying Jin^1 and Emmanuel J. Candès^{1,2}
^1 Department of Statistics, Stanford University
^2 Department of Mathematics, Stanford University
Abstract
Decision making or scientific discovery pipelines such as job hiring and drug discovery often involve
multiple stages: before any resource-intensive step, there is often an initial screening that uses predictions
from a machine learning model to shortlist a few candidates from a large pool. We study screening procedures that aim to select candidates whose unobserved outcomes exceed user-specified values. We develop
a method that wraps around any prediction model to produce a subset of candidates while controlling
the proportion of falsely selected units. Building upon the conformal inference framework, our method
first constructs p-values that quantify the statistical evidence for large outcomes; it then determines the
shortlist by comparing the p-values to a threshold introduced in the multiple testing literature. In many
cases, the procedure selects candidates whose predictions are above a data-dependent threshold. Our
theoretical guarantee holds under mild exchangeability conditions on the samples, generalizing existing
results on multiple conformal p-values. We demonstrate the empirical performance of our method via
simulations, and apply it to job hiring and drug discovery datasets.
1 Introduction
Decision making and scientific discovery are resource intensive tasks: human evaluation is needed before
high-stakes decisions such as job hiring [Shen et al.,2019] and disease diagnosis [Etzioni et al.,2003]; several
rounds of expensive clinical trials are required before a drug can receive FDA approval [FDA,2018]. Early
on, we often hope to identify viable candidates from a very large pool—consider hundreds of applicants to a
position or hundreds of thousands of potential compounds that may bind to the target. In such problems,
machine learning prediction is useful for an initial screening step to shortlist a few candidates; in later, more
costly stages, only these shortlisted candidates are carefully investigated to confirm the interesting cases.
This paper concerns scenarios where outcomes taking on higher values are of interest. Formally, suppose
we have access to a set of training data $\{(X_i, Y_i)\}_{i=1}^{n}$ and a set of test samples $\{X_{n+j}\}_{j=1}^{m}$ whose outcomes $\{Y_{n+j}\}_{j=1}^{m}$ are unobserved, all $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ pairs being i.i.d. from some arbitrary and unknown distribution.^1 Given some thresholds $\{c_j\}_{j=1}^{m}$, our goal is to find as many test units with $Y_{n+j} > c_j$ as possible, while ensuring the false discovery rate (FDR), the expected proportion of errors ($Y_{n+j} \le c_j$) among all shortlisted candidates, is controlled. To be specific, letting $\mathcal{R} \subseteq \{1, \dots, m\}$ be the selection set, we define FDR as the expectation of the false discovery proportion (FDP), so that
$$\mathrm{FDR} = \mathbb{E}[\mathrm{FDP}], \qquad \mathrm{FDP} = \frac{\sum_{j=1}^{m} \mathbf{1}\{j \in \mathcal{R},\, Y_{n+j} \le c_j\}}{1 \vee |\mathcal{R}|}, \tag{1}$$
where we denote $a \vee b = \max\{a, b\}$ for any $a, b \in \mathbb{R}$, and the expectation is over the randomness of all training
data and all test samples. The FDR is a natural measure of type-I error for binary classification [Hastie
et al., 2009]. For regression problems with a continuous response, counting errors in this fashion is reasonable if each selected candidate incurs a similar cost. We discuss below potential applications with binary or quantitative
outcomes.
^1 Later on, we will relax the i.i.d. assumption to exchangeability conditions.
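For concreteness, the following sketch (in Python, with illustrative variable names of our own choosing) computes the FDP of a given selection set as in (1); the FDR is its expectation over repeated draws of the training and test data.

```python
import numpy as np

def false_discovery_proportion(selected, y_test, c):
    """FDP of a selection set R, following Eq. (1).

    selected : indices j of the shortlisted test units (the set R)
    y_test   : test outcomes Y_{n+j}, observable only in hindsight
    c        : per-unit thresholds c_j
    """
    selected = np.asarray(selected, dtype=int)
    if selected.size == 0:
        return 0.0                                        # convention: 0 / (1 v 0) = 0
    errors = np.sum(y_test[selected] <= c[selected])      # false selections: Y_{n+j} <= c_j
    return errors / selected.size                         # denominator 1 v |R| equals |R| here

# Toy illustration: select units whose noisy prediction exceeds 1.
rng = np.random.default_rng(0)
y_test = rng.normal(size=1000)
c = np.zeros(1000)
selected = np.where(y_test + rng.normal(scale=0.5, size=1000) > 1.0)[0]
print(false_discovery_proportion(selected, y_test, c))
```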
Candidate screening. Companies are turning to machine learning to support recruitment decisions [Faliagka
et al.,2012,Shehu and Saeed,2016]. Predictions using automatic resume screening [Amdouni and abdessalem
Karaa,2010,Faliagka et al.,2014], semantic matching [Mochol et al.,2007] or AI-assisted interviews are used
to screen and select candidates from a large pool. Related tasks include talent sourcing, e.g., finding people
who are likely to search for new opportunities, and candidate screening, i.e., selecting qualified applicants
before further human evaluation [Heaslip,2022]. One may be interested in controlling the FDR (1) for
resource efficiency: each shortlisted candidate incurs similar costs such as communication for talent sourcing
and interviews before the hiring decisions. In candidate screening, controlling FDR ensures most of the costs
are devoted to evaluating and ranking qualified candidates. Job recruitment also has fairness concerns; an
alternative goal is to ensure that qualified candidates do not get screened out before human evaluation. To
this end, one can flip the sign of the outcomes, so that (1) represents the proportion of qualified candidates among those filtered out.
Drug discovery. Machine learning is playing a similar role in accelerating the drug discovery pipeline.
Early stages of drug discovery aim at finding molecules or compounds—from a diverse library [Szymański
et al.,2011] developed by institutions across the world [Kim et al.,2021]—with strong effects such as high
binding affinity to a specific target. The activity of drug candidates can be evaluated by high-throughput
screening (HTS) [Macarron et al.,2011]. However, the capacity of this approach is quite limited in practice,
and it is generally infeasible to screen the whole library of readily synthesized compounds. Instead, virtual
screening [Huang,2007] by machine learning models has enabled the automatic search of promising drugs.
Often, a representative (ideally diverse) subset of the whole library is evaluated by HTS; machine learning
models are then trained on these data to predict other candidates’ activity based on their physical, geometric and chemical features [Carracedo-Reboredo et al., 2021, Koutsoukas et al., 2017, Vamathevan et al.,
2019,Dara et al.,2021] and select promising ones for further HTS and/or clinical trials. Given the cost of
subsequent investigation, false positives in this process are a major concern [Sink et al.,2010]. Ensuring that
a sufficiently large proportion of resources is devoted to promising drugs is thus important for the efficiency
of the whole pipeline.
In these two examples, the FDR quantifies a trade-off between the resources devoted to shortlisted
candidates (the selection set) and the benefits from finding interesting candidates (the true positives). This
interpretation is similar to the justification of FDR in multiple testing [Benjamini and Hochberg,1995,
1997,Benjamini and Yekutieli,2001]: when evaluating a large number of hypotheses, the FDR measures
the proportion of “false leads” for follow-up confirmatory studies. However, in our prediction problem, the
affinity of a new drug is inferred not from the observations, but from other similar compounds, i.e., other
drugs in the training data. This perspective also blurs the distinction between statistical inference and
prediction; we will draw more connections between these sub-fields later.
The FDR may not necessarily be interpreted as a resource-efficiency measure. In the next two examples,
controlling the FDR, which limits the error in inferring the direction of outcomes, is relevant to monitoring
risk in healthcare and counterfactual inference.
Healthcare. With increasingly available patient data, machine learning is widely adopted to assist human
decisions in healthcare. For example, many works use machine learning prediction for large-scale early disease
diagnosis [Shen et al.,2019,Richens et al.,2020] and patient risk prediction [Rahimi et al.,2014,Corey et al.,
2018, Jalali et al., 2020]. Calibrating black-box models is important in such high-stakes problems. When limiting false negatives is more important than limiting false positives, machine learning predictions might be used to filter out low-risk cases, leaving the remaining cases for careful human evaluation. It is then sensible to control the proportion of high-risk cases among all filtered-out samples.
Counterfactual inference. In randomized experiments that run over a period of time, inferring whether patients have benefited from the treatment option compared to an alternative might inform decisions such as early stopping of the trial for some patients. More generally, inferring the benefit of certain patients also provides evidence on treatment effect heterogeneity. This is a counterfactual inference problem [Lei and Candès, 2021, Jin et al., 2023] in which one could predict the counterfactuals, i.e., what would have happened had one taken the alternative option, by learning from the outcomes of patients under that option, and then compare the prediction to the realized outcomes. In this case, the set of those declared as having benefited from the treatment is informative if the FDR is controlled.
The generic task underlying these applications is to find a subset of candidates whose not-yet-observed
outcomes are of interest (e.g., qualification or high binding affinity to the target) from a potentially enormous
pool of test samples. This is often achieved by thresholding their test scores—the model prediction on the
test samples—from models built on a set of training data that are assumed to be from the same distribution.
However, controlling the error in the selected set is a nontrivial task.
1.1 Why calibrated predictive inference is insufficient
We consider a binary example to fix ideas, so that $\mathcal{Y} = \{0,1\}$. Our goal is to find test samples with $Y_{n+j} = 1$. A natural starting point is to train a machine learning model that predicts (classifies) $Y$ given $X$, with the hope that test samples with higher predicted values are more promising. To achieve valid prediction, one could calibrate the model [Vovk et al., 2005] to output a prediction set $\widehat{C}_{1-\alpha}(X)$ taking the form $\emptyset$, $\{0\}$, $\{1\}$ or $\{0,1\}$, with the prescription that $\widehat{C}_{1-\alpha}(X)$ must cover the outcome $Y$ with probability at least $1-\alpha$ for some user-specified $\alpha \in (0,1)$. The probability is averaged over the randomness in the test sample and the training process.
However, a prediction set with marginal coverage guarantees is insufficient for selection. For instance, one might consider selecting all test samples $j$ with $\widehat{C}_{1-\alpha}(X_{n+j}) = \{1\}$. The FDR of the selected set would then be below $\alpha$ if $\widehat{C}_{1-\alpha}(X_{n+j})$ covers $1$ with probability at least $1-\alpha$ for the selected units, that is, conditional on selection. However, this is clearly a false statement because predictive inference only ensures $(1-\alpha)$ coverage averaged over all test samples. In fact, no matter how large we set the coverage $(1-\alpha)$, such a naive approach might still return a selection set that contains too many uninteresting candidates.
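To make the naive approach concrete, here is a minimal sketch of a generic split-conformal construction for binary labels, followed by the naive selection rule; the inverse-probability nonconformity score and all names are illustrative assumptions, not the exact one-sided construction of Appendix C.1.

```python
import numpy as np

def split_conformal_sets(prob_cal, y_cal, prob_test, alpha):
    """Split-conformal prediction sets for binary labels.

    Nonconformity score: V(x, y) = 1 - p_hat(y | x), where p_hat(1 | x) is the
    model's predicted probability of class 1 on held-out calibration/test points.
    A candidate label y stays in C_hat_{1-alpha}(x) if its (non-randomized)
    conformal p-value exceeds alpha, which gives marginal coverage >= 1 - alpha.
    Returns an (m, 2) boolean array; column y flags whether label y is in the set.
    """
    n = len(y_cal)
    v_cal = np.where(y_cal == 1, 1.0 - prob_cal, prob_cal)   # scores at observed labels
    sets = np.zeros((len(prob_test), 2), dtype=bool)
    for y in (0, 1):
        v_test = 1.0 - prob_test if y == 1 else prob_test    # score of candidate label y
        pvals = (1.0 + (v_cal[None, :] >= v_test[:, None]).sum(axis=1)) / (n + 1.0)
        sets[:, y] = pvals > alpha
    return sets

# Naive selection: keep the test points whose prediction set is exactly {1}.
rng = np.random.default_rng(1)
prob_cal, prob_test = rng.uniform(size=500), rng.uniform(size=200)  # stand-ins for model outputs
y_cal = rng.binomial(1, prob_cal)
sets = split_conformal_sets(prob_cal, y_cal, prob_test, alpha=0.1)
naive_selection = np.where(sets[:, 1] & ~sets[:, 0])[0]
```

As the experiment below illustrates on the HIV data, this rule can exhibit a very high FDR even though each set covers its own label with probability at least $1-\alpha$ marginally.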
It might be best to preview our results on a real-world drug discovery dataset properly introduced and studied later in the paper (Section 4.2.1). In short, the goal is to find promising drug candidates, among thousands of molecules, that are active ($Y = 1$) for the HIV target. This dataset is highly imbalanced in the sense that only 3% of the drugs are active, as is often the case in studies on drug discovery. Our main purpose here is to rapidly demonstrate that a straightforward application of conformal prediction methods, selecting those leads with $\widehat{C}_{1-\alpha}(X_{n+j}) = \{1\}$, results in over-confident predictions in a sense described below.
We use a deep learning model (this is introduced in Section 4.2.1) to construct conformity scores, and ultimately, conformal prediction sets that are one-sided in the sense that they only take on three possible values: $\emptyset$, $\{1\}$ and $\{0,1\}$ (see Appendix C.1 for details). The left panel of Figure 1 shows the FDR of the naive approach as a function of the confidence level $1-\alpha \in \{0.99, 0.98, \dots, 0.70\}$, along with the marginal miscoverage of conformal prediction sets and the proportion of cases in the test set for which $\widehat{C}_{1-\alpha}(X) = \{1\}$. While conformal prediction always achieves nearly exact marginal validity (brown), it is overconfident for seemingly promising candidates, as the error rate among the selected (FDR), namely those with $\widehat{C}_{1-\alpha}(X) = \{1\}$, is very high (orange). When $1-\alpha = 0.90$, we witness an error rate of about 80%, meaning that 4 out of 5 ‘discoveries’ are false. Even in the extremely conservative case where $1-\alpha = 0.99$, the FDR exceeds 35%. Note that this phenomenon is independent of the target FDR level. We can thus see that the selection issue would be especially pressing if, say, we aim for a small FDR level. In fact, conformal prediction outputs a large proportion of uninformative sets: as seen from the light bars, about $1-\alpha$ of the prediction sets are $\widehat{C}_{1-\alpha}(X) = \{0,1\}$ (we observe no empty prediction sets for this data). Thus, conformal prediction ensures valid marginal coverage even though the sets $\widehat{C}_{1-\alpha}(X) = \{1\}$ seldom cover the true label.
To make sure the FDR falls below a user-specified tolerance $q \in (0,1)$, one might want to employ a Bonferroni correction. To do this we would pick test cases for which $\widehat{C}_{1-q/m}(X_{n+j}) = \{1\}$, where $m$ is the number of test samples. That is, we apply a Bonferroni correction to the marginal coverage, and this ensures that the probability of making a single false selection, which upper bounds the FDR, is below $q$. In the right panel of Figure 1, we compare the FDR and power of our approach and Bonferroni's method applied to a range of nominal FDR levels $q \in \{0.11, 0.15, \dots, 0.3\}$.^2 Our approach yields almost exact FDR control and much higher power than Bonferroni's.
^2 Here we take a subset of $m = 1000$, as otherwise $q/m$ exceeds the resolution of conformal prediction.
Figure 1: Left: FDR (set-conditional miscoverage) of the naive approach and marginal miscoverage as a function of the parameter $\alpha$; the light blue bars are the proportion of cases among all test samples for which $\widehat{C}_{1-\alpha}(X) = \{1\}$. Right: FDR (curve) and power (bar) of our selective inference approach and of Bonferroni's method as a function of the nominal FDR target $q$. The FDR (resp. power) is computed by averaging the FDP (resp. proportion of true positives) in $N = 100$ independent splits of training, calibration, and test data.
To ensure calibration on the selected units, we will bridge conformal inference and selective inference and devise cfBH, an algorithm that turns any prediction model into a screening mechanism. In a nutshell, instead of
calibrating to a fixed confidence level α, we will use tools from conformal inference to quantify the model
confidence in outcomes with larger values, and then employ multiple testing ideas to construct a shortlist of
candidates with statistical guarantees.
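As a schematic of this recipe (and not the paper's exact cfBH construction, whose p-values for unobserved outcomes are defined later), the sketch below applies the Benjamini-Hochberg step-up rule to a vector of per-candidate p-values, with a Bonferroni baseline for comparison; all names are illustrative.

```python
import numpy as np

def benjamini_hochberg(pvals, q):
    """BH step-up rule: indices selected at nominal FDR level q."""
    m = len(pvals)
    order = np.argsort(pvals)                              # sort p-values in increasing order
    passes = pvals[order] <= q * np.arange(1, m + 1) / m   # compare p_(k) to q*k/m
    if not passes.any():
        return np.array([], dtype=int)
    k = np.nonzero(passes)[0].max() + 1                    # largest k with p_(k) <= q*k/m
    return order[:k]                                       # reject the k smallest p-values

def bonferroni(pvals, q):
    """Bonferroni baseline: select j with p_j <= q / m."""
    return np.nonzero(pvals <= q / len(pvals))[0]

# Hypothetical usage: pvals[j] would quantify the evidence that Y_{n+j} > c_j
# (small values = strong evidence for a large outcome).
pvals = np.random.default_rng(2).uniform(size=2000)
print(len(benjamini_hochberg(pvals, q=0.1)), len(bonferroni(pvals, q=0.1)))
```

Because the rejection region of BH is determined by the data, in many cases this amounts to selecting the candidates whose predictions exceed a data-dependent threshold, the behavior described in the abstract.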
Returning to the drug discovery application, we acknowledge a substantial literature using conformal
inference for uncertainty quantification in compound activity prediction, see Lampa et al. [2018], Eklund
et al. [2015], Svensson et al. [2018, 2017], Lindh et al. [2017], and Cortés-Ciriano and Bender [2019] for a
recent review. Whether explicitly stated or not, the goal is eventually to select or prioritize compounds
that progress to later stages of drug discovery [Ahlberg et al.,2017a,b] after constructing valid prediction
intervals. That said, current tools for selection are all heuristic, e.g., picking cases with a high predicted
value and a relatively short prediction interval. As already mentioned, a marginally valid prediction set does
not necessarily imply reliable selection. The method from this paper fills this gap, and can wrap around the
predictions from the literature to produce reliable selection rules for drug discovery.
1.2 Hypothesis testing and conformal p-values
One may view our problem as testing the random hypotheses
$$H_j \colon Y_{n+j} \le c_j, \qquad j = 1, \dots, m. \tag{2}$$
From now on, we denote $\mathcal{H}_0 = \{j \colon Y_{n+j} \le c_j\}$ as the set of null hypotheses. That is, we define a hypothesis $H_j$ for each test sample $j$, and we say $H_j$ is non-null if $Y_{n+j}$ exceeds the threshold $c_j$. This is perhaps non-classical since the hypothesis $H_j$ is random: it concerns a random variable rather than a model parameter. However, we show that we can still use p-values and rely on multiple hypothesis testing ideas to construct the “rejection” set $\mathcal{R}$.
We start by introducing the tool we use to quantify model confidence: conformal p-values; as their name suggests, these p-values build upon the conformal inference framework [Vovk et al., 2005, 1999]. Suppose we are given any prediction model from a training process that is independent of the calibration and test samples. We condition on the training process and view the prediction model as given. We first define a nonconformity score $V \colon \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ based on the prediction model. Intuitively, $V(x, y)$ measures how well a value $y$ conforms to the prediction of the model at $x$. For example, given a prediction $\widehat{\mu} \colon \mathcal{X} \to \mathbb{R}$, one could use $V(x, y) = |y - \widehat{\mu}(x)|$; other popular choices in the literature include ideas based on quantile regression [Romano et al., 2019] and conditional density estimation [Chernozhukov et al., 2021]. Should $Y_{n+j}$ be observed, one could compute the nonconformity scores $V_i = V(X_i, Y_i)$ for $i = 1, \dots, n$ and $V_{n+j} = V(X_{n+j}, Y_{n+j})$. The corresponding conformal p-value [Vovk et al., 2005, 1999, Bates et al., 2021] is defined as
$$p_j = \frac{\sum_{i=1}^{n} \mathbf{1}\{V_i < V_{n+j}\} + U_j \cdot \left(1 + \sum_{i=1}^{n} \mathbf{1}\{V_i = V_{n+j}\}\right)}{n + 1}, \tag{3}$$
where $U_j \sim \mathrm{Unif}[0,1]$ are i.i.d. random variables to break ties. If the test sample $(X_{n+j}, Y_{n+j})$ follows the same distribution as the training data, then $p_j \sim \mathrm{Unif}[0,1]$. However, the mutual dependence among $\{p_j\}$ is complicated as they all depend on the same calibration data. A recent paper [Bates et al., 2021] used conformal p-values for outlier detection; in their setting, observations $\{(X_{n+j}, Y_{n+j})\}_{j=1}^{m}$ are available, and the null set $\{j \colon H_j \text{ is true}\}$ is deterministic since the null hypothesis $H_j$ posits that $(X_{n+j}, Y_{n+j})$ follows the same distribution as the training samples. In our setting, the response $Y_{n+j}$ is not observed. This leads us to introduce a different set of conformal p-values. Our analysis also generalizes Bates et al. [2021] to exchangeable data.
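For concreteness, here is a minimal sketch of the p-value in (3), written for the oracle case in which $Y_{n+j}$ is observed; the absolute-residual score and the toy model below are purely illustrative assumptions.

```python
import numpy as np

def conformal_pvalue(v_cal, v_test, rng=None):
    """Conformal p-value of Eq. (3) with uniform tie-breaking.

    v_cal  : calibration scores V_i = V(X_i, Y_i), i = 1, ..., n
    v_test : score V_{n+j} = V(X_{n+j}, Y_{n+j}) of one test point
    """
    rng = rng or np.random.default_rng()
    u = rng.uniform()                         # U_j ~ Unif[0, 1]
    n_below = np.sum(v_cal < v_test)          # calibration scores strictly below V_{n+j}
    n_ties = np.sum(v_cal == v_test)          # calibration scores tied with V_{n+j}
    return (n_below + u * (1 + n_ties)) / (len(v_cal) + 1)

# Illustration with the score V(x, y) = |y - mu_hat(x)| and a toy stand-in for a fitted model.
rng = np.random.default_rng(3)
mu_hat = lambda x: 2.0 * x
x_cal = rng.uniform(size=300)
y_cal = 2.0 * x_cal + rng.normal(size=300)
v_cal = np.abs(y_cal - mu_hat(x_cal))
x_new, y_new = 0.5, 1.3
print(conformal_pvalue(v_cal, abs(y_new - mu_hat(x_new)), rng))
```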
1.3 Related work
This work concerns calibrating prediction models to obtain correct directional conclusions on the outcomes.
In situations where one cares more about the mistakes on the selected subset, our error notion, the FDR,
might be more relevant than average prediction errors. That said, several works have studied FDR control in
prediction problems, especially in binary classification. Among them, Dickhaus [2014] connects classification
to multiple testing, showing that controlling type-I error (FDR) at certain levels by thresholding an oracle
classifier asymptotically achieves the optimal (Bayes) classification risk; Scott et al. [2009] provides high-
probability bounds for estimating the FDR achieved by classification rules, rather than adaptively controlling
it at a specific level.
Our problem setup is close to several recent works on calibrated screening or thresholding [Wang et al.,
2022, Sahoo et al., 2021] in classification or regression problems. These works, however, focus on different targets: Wang et al. [2022] focuses on selecting a subset with a prescribed expected number of qualified
candidates; Sahoo et al. [2021] focuses on the calibration of the predicted score itself to achieve a similar
notion of error control as ours, but at varying levels for all thresholds. The difference is that our method
rigorously controls FDR in finite samples, while it might be difficult to obtain such guarantees for the targets
in Wang et al. [2022], Sahoo et al. [2021].
Our methods build upon the conformal inference framework [Vovk et al.,2005,1999]. Although conformal-
inference-based methods have been developed for reliable uncertainty quantification in various problems [Lei
and Candès, 2021, Candès et al., 2021, Jin et al., 2023, Tibshirani et al., 2019], the theoretical guarantee
usually concerns a single test point. However, in many applications, one might be interested in a batch of
individuals and desire uncertainty quantification for multiple test samples simultaneously; in such situations,
these methods are insufficient due to the complex dependence structure of test scores and p-values as well
as multiplicity issues.
This work is closely related to Bates et al. [2021], in which the authors use conformal p-values (3) to test
for multiple outliers. Our conformal p-values differ from theirs as the outcomes are not observed. A few
works [Mary and Roquain,2021,Roquain and Verzelen,2022] are parallel to Bates et al. [2021], studying
multiple testing in a setting where all null hypotheses specify an identical null distribution; they are further
generalized by Rava et al. [2021] to achieve subgroup FDR control in classification. Our method is similar
to this line of work in constructing a threshold for certain “scores” and selecting candidates with scores
above that threshold. However, we work with random hypotheses and propose distinct procedures, whereas
in their works, the hypotheses are deterministic (or conditioned on). We will discuss these distinctions in
more detail as we present our results.
Our perspective on the problem is also generally related to the multiple hypothesis testing literature
where the FDR is a popular notion of type-I error. Since we pay more attention to one particular direction
(e.g., we are interested in finding those $Y_{n+j} > c_j$), our work is related to testing the signs of statistical
parameters [Bohrer,1979,Bohrer and Schervish,1980,Hochberg,1986,Guo et al.,2010,Weinstein and
Ramdas,2020]. Our framework differs from the existing directional testing literature in important ways.
Firstly, we test for the direction of a random outcome instead of a model parameter. This leads to random