Selection by Prediction with Conformal p-values
Ying Jin^1 and Emmanuel J. Candès^{1,2}
^1 Department of Statistics, Stanford University
^2 Department of Mathematics, Stanford University
Abstract
Decision making or scientific discovery pipelines such as job hiring and drug discovery often involve
multiple stages: before any resource-intensive step, there is often an initial screening that uses predictions
from a machine learning model to shortlist a few candidates from a large pool. We study screening procedures that aim to select candidates whose unobserved outcomes exceed user-specified values. We develop
a method that wraps around any prediction model to produce a subset of candidates while controlling
the proportion of falsely selected units. Building upon the conformal inference framework, our method
first constructs p-values that quantify the statistical evidence for large outcomes; it then determines the
shortlist by comparing the p-values to a threshold introduced in the multiple testing literature. In many
cases, the procedure selects candidates whose predictions are above a data-dependent threshold. Our
theoretical guarantee holds under mild exchangeability conditions on the samples, generalizing existing
results on multiple conformal p-values. We demonstrate the empirical performance of our method via
simulations, and apply it to job hiring and drug discovery datasets.
1 Introduction
Decision making and scientific discovery are resource intensive tasks: human evaluation is needed before
high-stakes decisions such as job hiring [Shen et al.,2019] and disease diagnosis [Etzioni et al.,2003]; several
rounds of expensive clinical trials are required before a drug can receive FDA approval [FDA,2018]. Early
on, we often hope to identify viable candidates from a very large pool—consider hundreds of applicants to a
position or hundreds of thousands of potential compounds that may bind to the target. In such problems,
machine learning prediction is useful for an initial screening step to shortlist a few candidates; in later, more
costly stages, only these shortlisted candidates are carefully investigated to confirm the interesting cases.
This paper concerns scenarios where outcomes taking on higher values are of interest. Formally, suppose
we have access to a set of training data $\{(X_i, Y_i)\}_{i=1}^{n}$ and a set of test samples $\{X_{n+j}\}_{j=1}^{m}$ whose outcomes $\{Y_{n+j}\}_{j=1}^{m}$ are unobserved, all $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ pairs being i.i.d. from some arbitrary and unknown distribution.^1 Given some thresholds $\{c_j\}_{j=1}^{m}$, our goal is to find as many test units with $Y_{n+j} > c_j$ as possible, while ensuring the false discovery rate (FDR), the expected proportion of errors ($Y_{n+j} \le c_j$) among all shortlisted candidates, is controlled. To be specific, letting $\mathcal{R} \subseteq \{1, \dots, m\}$ be the selection set, we define FDR as the expectation of the false discovery proportion (FDP), so that
$$\mathrm{FDR} = \mathbb{E}[\mathrm{FDP}], \qquad \mathrm{FDP} = \frac{\sum_{j=1}^{m} \mathbf{1}\{j \in \mathcal{R},\, Y_{n+j} \le c_j\}}{1 \vee |\mathcal{R}|}, \tag{1}$$
where we denote $a \vee b = \max\{a, b\}$ for any $a, b \in \mathbb{R}$, and the expectation is over the randomness of all training
data and all test samples. The FDR is a natural measure of type-I error for binary classification [Hastie
et al., 2009]. For regression problems with a continuous response, counting errors in this fashion is reasonable if each selected candidate incurs a similar cost. We discuss below potential applications with binary or quantitative
outcomes.
^1 Later on, we will relax the i.i.d. assumption to exchangeability conditions.
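For concreteness, the following sketch (in Python, with illustrative variable names of our own choosing) computes the FDP of a given selection set as in (1); the FDR is its expectation over repeated draws of the training and test data.

```python
import numpy as np

def false_discovery_proportion(selected, y_test, c):
    """FDP of a selection set R, following Eq. (1).

    selected : indices j of the shortlisted test units (the set R)
    y_test   : test outcomes Y_{n+j}, observable only in hindsight
    c        : per-unit thresholds c_j
    """
    selected = np.asarray(selected, dtype=int)
    if selected.size == 0:
        return 0.0                                        # convention: 0 / (1 v 0) = 0
    errors = np.sum(y_test[selected] <= c[selected])      # false selections: Y_{n+j} <= c_j
    return errors / selected.size                         # denominator 1 v |R| equals |R| here

# Toy illustration: select units whose noisy prediction exceeds 1.
rng = np.random.default_rng(0)
y_test = rng.normal(size=1000)
c = np.zeros(1000)
selected = np.where(y_test + rng.normal(scale=0.5, size=1000) > 1.0)[0]
print(false_discovery_proportion(selected, y_test, c))
```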
Candidate screening. Companies are turning to machine learning to support recruitment decisions [Faliagka
et al.,2012,Shehu and Saeed,2016]. Predictions using automatic resume screening [Amdouni and abdessalem
Karaa,2010,Faliagka et al.,2014], semantic matching [Mochol et al.,2007] or AI-assisted interviews are used
to screen and select candidates from a large pool. Related tasks include talent sourcing, e.g., finding people
who are likely to search for new opportunities, and candidate screening, i.e., selecting qualified applicants
before further human evaluation [Heaslip,2022]. One may be interested in controlling the FDR (1) for
resource efficiency: each shortlisted candidate incurs similar costs such as communication for talent sourcing
and interviews before the hiring decisions. In candidate screening, controlling FDR ensures most of the costs
are devoted to evaluating and ranking qualified candidates. Job recruitment also has fairness concerns; an
alternative goal is to ensure that qualified candidates do not get screened out before human evaluation. To
this end, one can flip the sign of the outcomes, so that (1) represents the proportion of qualified candidates among those filtered out.
Drug discovery. Machine learning is playing a similar role in accelerating the drug discovery pipeline.
Early stages of drug discovery aim at finding molecules or compounds—from a diverse library [Szymański
et al.,2011] developed by institutions across the world [Kim et al.,2021]—with strong effects such as high
binding affinity to a specific target. The activity of drug candidates can be evaluated by high-throughput
screening (HTS) [Macarron et al.,2011]. However, the capacity of this approach is quite limited in practice,
and it is generally infeasible to screen the whole library of readily synthesized compounds. Instead, virtual
screening [Huang,2007] by machine learning models has enabled the automatic search of promising drugs.
Often, a representative (ideally diverse) subset of the whole library is evaluated by HTS; machine learning
models are then trained on these data to predict other candidates’ activity based on their physical, geometric and chemical features [Carracedo-Reboredo et al., 2021, Koutsoukas et al., 2017, Vamathevan et al.,
2019,Dara et al.,2021] and select promising ones for further HTS and/or clinical trials. Given the cost of
subsequent investigation, false positives in this process are a major concern [Sink et al.,2010]. Ensuring that
a sufficiently large proportion of resources is devoted to promising drugs is thus important for the efficiency
of the whole pipeline.
In these two examples, the FDR quantifies a trade-off between the resources devoted to shortlisted
candidates (the selection set) and the benefits from finding interesting candidates (the true positives). This
interpretation is similar to the justification of FDR in multiple testing [Benjamini and Hochberg,1995,
1997,Benjamini and Yekutieli,2001]: when evaluating a large number of hypotheses, the FDR measures
the proportion of “false leads” for follow-up confirmatory studies. However, in our prediction problem, the
affinity of a new drug is inferred not from the observations, but from other similar compounds, i.e., other
drugs in the training data. This perspective also blurs the distinction between statistical inference and
prediction; we will draw more connections between these sub-fields later.
The FDR may not necessarily be interpreted as a resource-efficiency measure. In the next two examples,
controlling the FDR, which limits the error in inferring the direction of outcomes, is relevant to monitoring
risk in healthcare and counterfactual inference.
Healthcare. With increasingly available patient data, machine learning is widely adopted to assist human
decisions in healthcare. For example, many works use machine learning prediction for large-scale early disease
diagnosis [Shen et al.,2019,Richens et al.,2020] and patient risk prediction [Rahimi et al.,2014,Corey et al.,
2018, Jalali et al., 2020]. Calibrating black-box models is important in such high-stakes problems. When limiting false negatives is more important than limiting false positives, machine learning predictions might be used to filter out low-risk cases, leaving the remaining cases for careful human evaluation. It is then sensible to control the proportion of high-risk cases among all filtered-out samples.
Counterfactual inference. In randomized experiments that run over a period of time, inferring whether patients have benefited from the treatment option compared to an alternative might inform decisions such as early stopping of the trial for some patients. More generally, inferring the benefit of certain patients also provides evidence on treatment effect heterogeneity. This is a counterfactual inference problem [Lei and Candès, 2021, Jin et al., 2023] in which one could predict the counterfactuals, i.e., what would have happened had one taken the alternative option, by learning from the outcomes of patients under that option, and then compare the prediction to the realized outcomes. In this case, the set of those declared as having benefited from the treatment is informative if the FDR is controlled.
The generic task underlying these applications is to find a subset of candidates whose not-yet-observed
outcomes are of interest (e.g., qualification or high binding affinity to the target) from a potentially enormous
pool of test samples. This is often achieved by thresholding their test scores—the model prediction on the
test samples—from models built on a set of training data that are assumed to be from the same distribution.
However, controlling the error in the selected set is a nontrivial task.
1.1 Why calibrated predictive inference is insufficient
We consider a binary example to fix ideas, so that $\mathcal{Y} = \{0,1\}$. Our goal is to find test samples with $Y_{n+j} = 1$. A natural starting point is to train a machine learning model that predicts (classifies) $Y$ given $X$, with the hope that test samples with higher predicted values are more promising. To achieve valid prediction, one could calibrate the model [Vovk et al., 2005] to output a prediction set $\widehat{C}_{1-\alpha}(X)$ taking the form $\emptyset$, $\{0\}$, $\{1\}$ or $\{0,1\}$, with the prescription that $\widehat{C}_{1-\alpha}(X)$ must cover the outcome $Y$ with probability at least $1-\alpha$ for some user-specified $\alpha \in (0,1)$. The probability is averaged over the randomness in the test sample and the training process.
However, a prediction set with marginal coverage guarantees is insufficient for selection. For instance, one might consider selecting all test samples $j$ with $\widehat{C}_{1-\alpha}(X_{n+j}) = \{1\}$. The FDR of the selected set would then be below $\alpha$ if $\widehat{C}_{1-\alpha}(X_{n+j})$ covers $1$ with probability at least $1-\alpha$ for the selected units, that is, conditional on selection. However, this is clearly a false statement because predictive inference only ensures $(1-\alpha)$ coverage averaged over all test samples. In fact, no matter how large we set the coverage $(1-\alpha)$, such a naive approach might still return a selection set that contains too many uninteresting candidates.
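To make the naive approach concrete, here is a minimal sketch of a generic split-conformal construction for binary labels, followed by the naive selection rule; the inverse-probability nonconformity score and all names are illustrative assumptions, not the exact one-sided construction of Appendix C.1.

```python
import numpy as np

def split_conformal_sets(prob_cal, y_cal, prob_test, alpha):
    """Split-conformal prediction sets for binary labels.

    Nonconformity score: V(x, y) = 1 - p_hat(y | x), where p_hat(1 | x) is the
    model's predicted probability of class 1 on held-out calibration/test points.
    A candidate label y stays in C_hat_{1-alpha}(x) if its (non-randomized)
    conformal p-value exceeds alpha, which gives marginal coverage >= 1 - alpha.
    Returns an (m, 2) boolean array; column y flags whether label y is in the set.
    """
    n = len(y_cal)
    v_cal = np.where(y_cal == 1, 1.0 - prob_cal, prob_cal)   # scores at observed labels
    sets = np.zeros((len(prob_test), 2), dtype=bool)
    for y in (0, 1):
        v_test = 1.0 - prob_test if y == 1 else prob_test    # score of candidate label y
        pvals = (1.0 + (v_cal[None, :] >= v_test[:, None]).sum(axis=1)) / (n + 1.0)
        sets[:, y] = pvals > alpha
    return sets

# Naive selection: keep the test points whose prediction set is exactly {1}.
rng = np.random.default_rng(1)
prob_cal, prob_test = rng.uniform(size=500), rng.uniform(size=200)  # stand-ins for model outputs
y_cal = rng.binomial(1, prob_cal)
sets = split_conformal_sets(prob_cal, y_cal, prob_test, alpha=0.1)
naive_selection = np.where(sets[:, 1] & ~sets[:, 0])[0]
```

As the experiment below illustrates on the HIV data, this rule can exhibit a very high FDR even though each set covers its own label with probability at least $1-\alpha$ marginally.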
It might be best to preview our results on a real-world drug discovery dataset properly introduced and studied later in the paper (Section 4.2.1). In short, the goal is to find promising drug candidates, among thousands of molecules, that are active ($Y = 1$) for the HIV target. This dataset is highly imbalanced in the sense that only 3% of the drugs are active, as is often the case in studies on drug discovery. Our main purpose here is to rapidly demonstrate that a straightforward application of conformal prediction methods, selecting those leads with $\widehat{C}_{1-\alpha}(X_{n+j}) = \{1\}$, results in over-confident predictions in a sense described below.
We use a deep learning model (this is introduced in Section 4.2.1) to construct conformity scores, and ultimately, conformal prediction sets that are one-sided in the sense that they only take on three possible values: $\emptyset$, $\{1\}$ and $\{0,1\}$ (see Appendix C.1 for details). The left panel of Figure 1 shows the FDR of the naive approach as a function of the confidence level $1-\alpha \in \{0.99, 0.98, \dots, 0.70\}$, along with the marginal miscoverage of conformal prediction sets and the proportion of cases in the test set for which $\widehat{C}_{1-\alpha}(X) = \{1\}$. While conformal prediction always achieves nearly exact marginal validity (brown), it is overconfident for seemingly promising candidates, as the error rate among the selected (FDR), namely those with $\widehat{C}_{1-\alpha}(X) = \{1\}$, is very high (orange). When $1-\alpha = 0.90$, we witness an error rate of about 80%, meaning that 4 out of 5 ‘discoveries’ are false. Even in the extremely conservative case where $1-\alpha = 0.99$, the FDR exceeds 35%. Note that this phenomenon is independent of the target FDR level. We can thus see that the selection issue would be especially pressing if, say, we aim for a small FDR level. In fact, conformal prediction outputs a large proportion of uninformative sets: as seen from the light bars, about $1-\alpha$ of the prediction sets are $\widehat{C}_{1-\alpha}(X) = \{0,1\}$ (we observe no empty prediction sets for this data). Thus, conformal prediction ensures valid marginal coverage even though the sets $\widehat{C}_{1-\alpha}(X) = \{1\}$ seldom cover the true label.
To make sure the FDR falls below a user-specified tolerance $q \in (0,1)$, one might want to employ a Bonferroni correction. To do this we would pick test cases for which $\widehat{C}_{1-q/m}(X_{n+j}) = \{1\}$, where $m$ is the number of test samples. That is, we apply a Bonferroni correction to the marginal coverage, and this ensures that the probability of making a single false selection, which upper bounds the FDR, is below $q$. In the right panel of Figure 1, we compare the FDR and power of our approach and Bonferroni's method applied to a range of nominal FDR levels $q \in \{0.11, 0.15, \dots, 0.3\}$.^2 Our approach yields almost exact FDR control and much higher power than Bonferroni's.
^2 Here we take a subset of $m = 1000$, as otherwise $q/m$ exceeds the resolution of conformal prediction.
Figure 1: Left: FDR (set-conditional miscoverage) of the naive approach and marginal miscoverage as a function of the parameter $\alpha$; the light blue bars are the proportion of cases among all test samples for which $\widehat{C}_{1-\alpha}(X) = \{1\}$. Right: FDR (curve) and power (bar) of our selective inference approach and of Bonferroni's method as a function of the nominal FDR target $q$. The FDR (resp. power) is computed by averaging the FDP (resp. proportion of true positives) in $N = 100$ independent splits of training, calibration, and test data.
To ensure calibration on the selected units, we will bridge conformal inference and selective inference and devise cfBH, an algorithm that turns any prediction model into a screening mechanism. In a nutshell, instead of
calibrating to a fixed confidence level α, we will use tools from conformal inference to quantify the model
confidence in outcomes with larger values, and then employ multiple testing ideas to construct a shortlist of
candidates with statistical guarantees.
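As a schematic of this recipe (and not the paper's exact cfBH construction, whose p-values for unobserved outcomes are defined later), the sketch below applies the Benjamini-Hochberg step-up rule to a vector of per-candidate p-values, with a Bonferroni baseline for comparison; all names are illustrative.

```python
import numpy as np

def benjamini_hochberg(pvals, q):
    """BH step-up rule: indices selected at nominal FDR level q."""
    m = len(pvals)
    order = np.argsort(pvals)                              # sort p-values in increasing order
    passes = pvals[order] <= q * np.arange(1, m + 1) / m   # compare p_(k) to q*k/m
    if not passes.any():
        return np.array([], dtype=int)
    k = np.nonzero(passes)[0].max() + 1                    # largest k with p_(k) <= q*k/m
    return order[:k]                                       # reject the k smallest p-values

def bonferroni(pvals, q):
    """Bonferroni baseline: select j with p_j <= q / m."""
    return np.nonzero(pvals <= q / len(pvals))[0]

# Hypothetical usage: pvals[j] would quantify the evidence that Y_{n+j} > c_j
# (small values = strong evidence for a large outcome).
pvals = np.random.default_rng(2).uniform(size=2000)
print(len(benjamini_hochberg(pvals, q=0.1)), len(bonferroni(pvals, q=0.1)))
```

Because the rejection region of BH is determined by the data, in many cases this amounts to selecting the candidates whose predictions exceed a data-dependent threshold, the behavior described in the abstract.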
Returning to the drug discovery application, we acknowledge a substantial literature using conformal
inference for uncertainty quantification in compound activity prediction, see Lampa et al. [2018], Eklund
et al. [2015], Svensson et al. [2018, 2017], Lindh et al. [2017], and Cortés-Ciriano and Bender [2019] for a
recent review. Whether explicitly stated or not, the goal is eventually to select or prioritize compounds
that progress to later stages of drug discovery [Ahlberg et al.,2017a,b] after constructing valid prediction
intervals. That said, current tools for selection are all heuristic, e.g., picking cases with a high predicted
value and a relatively short prediction interval. As already mentioned, a marginally valid prediction set does
not necessarily imply reliable selection. The method from this paper fills this gap, and can wrap around the
predictions from the literature to produce reliable selection rules for drug discovery.
1.2 Hypothesis testing and conformal p-values
One may view our problem as testing the random hypotheses
$$H_j \colon Y_{n+j} \le c_j, \qquad j = 1, \dots, m. \tag{2}$$
From now on, we denote $\mathcal{H}_0 = \{j \colon Y_{n+j} \le c_j\}$ as the set of null hypotheses. That is, we define a hypothesis $H_j$ for each test sample $j$, and we say $H_j$ is non-null if $Y_{n+j}$ exceeds the threshold $c_j$. This is perhaps non-classical since the hypothesis $H_j$ is random: it concerns a random variable rather than a model parameter. However, we show that we can still use p-values and rely on multiple hypothesis testing ideas to construct the “rejection” set $\mathcal{R}$.
We start by introducing the tool we use to quantify model confidence: conformal p-values; as their name suggests, these p-values build upon the conformal inference framework [Vovk et al., 2005, 1999]. Suppose we are given any prediction model from a training process that is independent of the calibration and test samples. We condition on the training process and view the prediction model as given. We first define a nonconformity score $V \colon \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ based on the prediction model. Intuitively, $V(x, y)$ measures how well a value $y$ conforms to the prediction of the model at $x$. For example, given a prediction $\widehat{\mu} \colon \mathcal{X} \to \mathbb{R}$, one could use $V(x, y) = |y - \widehat{\mu}(x)|$; other popular choices in the literature include ideas based on quantile regression [Romano et al., 2019] and conditional density estimation [Chernozhukov et al., 2021]. Should $Y_{n+j}$ be observed, one could compute the nonconformity scores $V_i = V(X_i, Y_i)$ for $i = 1, \dots, n$ and $V_{n+j} = V(X_{n+j}, Y_{n+j})$. The corresponding conformal p-value [Vovk et al., 2005, 1999, Bates et al., 2021] is defined as
$$p_j = \frac{\sum_{i=1}^{n} \mathbf{1}\{V_i < V_{n+j}\} + U_j \cdot \left(1 + \sum_{i=1}^{n} \mathbf{1}\{V_i = V_{n+j}\}\right)}{n + 1}, \tag{3}$$
where $U_j \sim \mathrm{Unif}[0,1]$ are i.i.d. random variables to break ties. If the test sample $(X_{n+j}, Y_{n+j})$ follows the same distribution as the training data, then $p_j \sim \mathrm{Unif}[0,1]$. However, the mutual dependence among $\{p_j\}$ is complicated as they all depend on the same calibration data. A recent paper [Bates et al., 2021] used conformal p-values for outlier detection; in their setting, observations $\{(X_{n+j}, Y_{n+j})\}_{j=1}^{m}$ are available, and the null set $\{j \colon H_j \text{ is true}\}$ is deterministic since the null hypothesis $H_j$ posits that $(X_{n+j}, Y_{n+j})$ follows the same distribution as the training samples. In our setting, the response $Y_{n+j}$ is not observed. This leads us to introduce a different set of conformal p-values. Our analysis also generalizes Bates et al. [2021] to exchangeable data.
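For concreteness, here is a minimal sketch of the p-value in (3), written for the oracle case in which $Y_{n+j}$ is observed; the absolute-residual score and the toy model below are purely illustrative assumptions.

```python
import numpy as np

def conformal_pvalue(v_cal, v_test, rng=None):
    """Conformal p-value of Eq. (3) with uniform tie-breaking.

    v_cal  : calibration scores V_i = V(X_i, Y_i), i = 1, ..., n
    v_test : score V_{n+j} = V(X_{n+j}, Y_{n+j}) of one test point
    """
    rng = rng or np.random.default_rng()
    u = rng.uniform()                         # U_j ~ Unif[0, 1]
    n_below = np.sum(v_cal < v_test)          # calibration scores strictly below V_{n+j}
    n_ties = np.sum(v_cal == v_test)          # calibration scores tied with V_{n+j}
    return (n_below + u * (1 + n_ties)) / (len(v_cal) + 1)

# Illustration with the score V(x, y) = |y - mu_hat(x)| and a toy stand-in for a fitted model.
rng = np.random.default_rng(3)
mu_hat = lambda x: 2.0 * x
x_cal = rng.uniform(size=300)
y_cal = 2.0 * x_cal + rng.normal(size=300)
v_cal = np.abs(y_cal - mu_hat(x_cal))
x_new, y_new = 0.5, 1.3
print(conformal_pvalue(v_cal, abs(y_new - mu_hat(x_new)), rng))
```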
1.3 Related work
This work concerns calibrating prediction models to obtain correct directional conclusions on the outcomes.
In situations where one cares more about the mistakes on the selected subset, our error notion, the FDR,
might be more relevant than average prediction errors. That said, several works have studied FDR control in
prediction problems, especially in binary classification. Among them, Dickhaus [2014] connects classification
to multiple testing, showing that controlling type-I error (FDR) at certain levels by thresholding an oracle
classifier asymptotically achieves the optimal (Bayes) classification risk; Scott et al. [2009] provides high-
probability bounds for estimating the FDR achieved by classification rules, rather than adaptively controlling
it at a specific level.
Our problem setup is close to several recent works on calibrated screening or thresholding [Wang et al.,
2022, Sahoo et al., 2021] in classification or regression problems. These works, however, focus on different targets: Wang et al. [2022] focuses on selecting a subset with a prescribed expected number of qualified
candidates; Sahoo et al. [2021] focuses on the calibration of the predicted score itself to achieve a similar
notion of error control as ours, but at varying levels for all thresholds. The difference is that our method
rigorously controls FDR in finite samples, while it might be difficult to obtain such guarantees for the targets
in Wang et al. [2022], Sahoo et al. [2021].
Our methods build upon the conformal inference framework [Vovk et al.,2005,1999]. Although conformal-
inference-based methods have been developed for reliable uncertainty quantification in various problems [Lei
and Candès, 2021, Candès et al., 2021, Jin et al., 2023, Tibshirani et al., 2019], the theoretical guarantee
usually concerns a single test point. However, in many applications, one might be interested in a batch of
individuals and desire uncertainty quantification for multiple test samples simultaneously; in such situations,
these methods are insufficient due to the complex dependence structure of test scores and p-values as well
as multiplicity issues.
This work is closely related to Bates et al. [2021], in which the authors use conformal p-values (3) to test
for multiple outliers. Our conformal p-values differ from theirs as the outcomes are not observed. A few
works [Mary and Roquain,2021,Roquain and Verzelen,2022] are parallel to Bates et al. [2021], studying
multiple testing in a setting where all null hypotheses specify an identical null distribution; they are further
generalized by Rava et al. [2021] to achieve subgroup FDR control in classification. Our method is similar
to this line of work in constructing a threshold for certain “scores” and selecting candidates with scores
above that threshold. However, we work with random hypotheses and propose distinct procedures, whereas
in their works, the hypotheses are deterministic (or conditioned on). We will discuss these distinctions in
more detail as we present our results.
Our perspective on the problem is also generally related to the multiple hypothesis testing literature
where the FDR is a popular notion of type-I error. Since we pay more attention to one particular direction
(e.g., we are interested in finding those $Y_{n+j} > c_j$), our work is related to testing the signs of statistical
parameters [Bohrer,1979,Bohrer and Schervish,1980,Hochberg,1986,Guo et al.,2010,Weinstein and
Ramdas,2020]. Our framework differs from the existing directional testing literature in important ways.
Firstly, we test for the direction of a random outcome instead of a model parameter. This leads to random