POST-SELECTION CONFIDENCE BOUNDS FOR
PREDICTION PERFORMANCE
Pascal Rink
Institute for Statistics and
Competence Center for
Clinical Trials Bremen
University of Bremen
Bremen, Germany
Werner Brannath
Institute for Statistics and
Competence Center for
Clinical Trials Bremen
University of Bremen
Bremen, Germany
February 6, 2023
ABSTRACT
In machine learning, the selection of a promising model from a potentially large number of competing models and the assessment of its generalization performance are critical tasks that need careful consideration. Typically, model selection and evaluation are strictly separated endeavors: the sample at hand is split into a training, validation, and evaluation set, and only a single confidence interval is computed for the prediction performance of the final selected model. We, however, propose an algorithm that computes valid lower confidence bounds for multiple models that have been selected based on their prediction performances in the evaluation set, by interpreting the selection problem as a simultaneous inference problem. We use bootstrap tilting and a maxT-type multiplicity correction. The approach is universally applicable for any combination of prediction models, any model selection strategy, and any prediction performance measure that accepts weights. We conducted various simulation experiments which show that our proposed approach yields lower confidence bounds that are at least as good as bounds from standard approaches and that reliably attain the nominal coverage probability. In addition, especially when the sample size is small, our proposed approach yields better-performing prediction models than the default selection of only one model for evaluation.
Keywords: bootstrap tilting, machine learning, multiple testing, performance evaluation, post-selection inference
1 Introduction
Many machine learning applications involve both model selection and the assessment of the selected model's prediction performance on future observations. This is particularly challenging when only little data is available to perform both tasks. By allocating a greater fraction of the data towards model selection, the goodness assessment becomes less reliable, while allocating a greater fraction towards goodness assessment poses the risk of selecting a sub-par prediction model. In such situations it is desirable to have a procedure at hand that resolves this trade-off reliably.
Recent work by Westphal and Brannath [2020] showed that it is beneficial, in terms of final model performance and statistical power, to select multiple models for goodness assessment, in spite of the need to then correct for multiplicity. While Westphal and Brannath [2020] proposed a multiple test for such cases, we here propose a way to compute valid lower confidence bounds for the conditional prediction performance of the final selected model. Note that reporting a confidence interval makes perfect sense here, since a point estimate for the performance does not incorporate the estimation uncertainty from the evaluation set at all.
Correspondence to: Pascal Rink, p.rink@uni-bremen.de
arXiv:2210.13206v3 [stat.ML] 3 Feb 2023
We follow the idea of Berk et al. [2013] and interpret this post-selection inference problem as a simultaneous inference problem, controlling the family-wise error rate

$P_\theta\big(\theta_i < \theta_{i,L} \text{ for any } i \in \{1, \ldots, m\}\big) \le \alpha,$   (1)

where $\theta_i$ denotes the performance of prediction model $i$, $\theta_{i,L}$ denotes the corresponding lower confidence bound at significance level $\alpha > 0$, and $\theta$ is the true predictive performance. With this type 1 error control we are therefore able, in practice, to answer with high confidence the question whether there is a model among the competition whose prediction performance $\theta_i$ is at least as large as a reference performance $\theta_0$, no matter how and which subset of the initial competition has been selected for evaluation. In particular, since

$\alpha \ge P_\theta\Big(\bigcup_{i=1}^{m} \{\theta_i < \theta_{i,L}\}\Big) \ge P_\theta(\theta_s < \theta_{s,L}),$   (2)

for any $s \in \{1, \ldots, m\}$, this coverage guarantee carries over to whichever model $s$ is finally selected.
This might be an overly eager requirement to meet in certain cases and might lead to somewhat conservative decisions. Therefore, in order to increase power, we do not evaluate all of the candidate models, but only a promising selection of them. Or, to put it the other way around: we exclude models from evaluation that are likely not going to be among the best performing ones. In principle, however, it is also possible to report lower confidence bounds for the prediction performance of all the competing models. The proposed bounds are universally valid in the sense that they work with any measure of prediction performance (as long as it accepts weights), with any combination of prediction models, even from different model classes, and with any model selection strategy; they are also computationally undemanding, as no additional model training is involved. While Berk et al. [2013] proposed universally valid post-selection confidence bounds for regression coefficients, we are interested in a post-selection lower confidence bound for the conditional prediction performance of a model selected based on its evaluation performance.
1.1 Conditional vs Unconditional Performance
We are particularly interested in the conditional prediction performance, that is, the generalization performance of the model trained on the present sample. For this, in a model selection and performance estimation regime, the prevailing recommendation in the literature is to split the sample at hand into three parts, a training, validation, and evaluation set [Goodfellow et al., 2016, Hastie et al., 2009, Japkowicz and Shah, 2011, Murphy, 2012, Raschka, 2018]; see Figure 1. Depending on the specific selection rule, the training and validation sets can sometimes be combined to form a learning set. For instance, this is the case when cross-validation is used to identify promising models from a competition of models based on their cross-validated prediction performance. Cross-validation using the entire sample at hand, however, is not a solution to our problem, since it actually estimates the unconditional prediction performance, that is, the average prediction performance of a model fit on other training data from the same distribution as the original data [Bates et al., 2021, Hastie et al., 2009]. There are examples in the literature of how to correct for this unwanted behavior and report an estimate of the conditional performance [Bates et al., 2021, Tsamardinos et al., 2018]. Yet, we choose an approach that directly and inherently estimates the conditional performance.
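For concreteness, the default pipeline in Figure 1 can be set up with two successive splits. The sketch below assumes scikit-learn is available; the split fractions, the stratification, and the seed are illustrative choices for the example, not values prescribed in this paper.

    # Minimal sketch of the three-way split into training, validation, and
    # evaluation sets (Figure 1); fractions and seed are illustrative only.
    from sklearn.model_selection import train_test_split

    def three_way_split(X, y, eval_frac=0.25, val_frac=0.25, seed=0):
        # First set aside the hold-out evaluation set ...
        X_learn, X_eval, y_learn, y_eval = train_test_split(
            X, y, test_size=eval_frac, random_state=seed, stratify=y)
        # ... then split the remaining learning data into training and validation.
        X_train, X_val, y_train, y_val = train_test_split(
            X_learn, y_learn, test_size=val_frac, random_state=seed, stratify=y_learn)
        return (X_train, y_train), (X_val, y_val), (X_eval, y_eval)

Depending on the selection rule, the training and validation parts may later be merged again into the learning set on which the preselected models are retrained.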
1.2 Bootstrap Tilting Confidence Intervals
Our proposed confidence bounds are obtained using bootstrap resampling. In particular, we use bootstrap tilting (BT), introduced by Efron [1981], which is a general approach to estimating confidence intervals for some statistic $\theta = \theta(F)$ using an i.i.d. sample $(y_1, y_2, \ldots, y_n)$ from an unknown distribution $F$. This statistic $\theta$ will later be our performance estimate of choice. Unlike many other bootstrap confidence intervals, BT estimates the distribution of $\hat{\theta}$ under some test value $\theta_0$. The lower confidence bound is then formed from those values of $\theta_0$ that could not be rejected in a test of the null hypothesis $H_0\colon \theta \le \theta_0$. This way, the distribution to resample from is consistent with the null distribution. In particular, this is achieved by reducing the problem to a one-parameter family $(F_\tau)_\tau$ of distributions, where $\tau$ is called the tilting parameter and $F_\tau$ has support on the observed data $\{y_1, y_2, \ldots, y_n\}$. A specific value of $\tau$ induces nonnegative sampling weights $p_\tau = (p_1(\tau), p_2(\tau), \ldots, p_n(\tau))$ such that $\sum_{i=1}^{n} p_i(\tau) = 1$. The tilting parameter $\tau$ is monotonically related to $\theta$, such that a specific value of $\tau$ corresponds to a specific value of $\theta_0$.
In order to find a lower confidence bound $\theta_L$, we find the largest value of $\tau < 0$ such that the corresponding level-$\alpha$ test still rejects $H_0$; this means $\theta_L$ is the largest value of $\theta_0$ such that, if the sample came from a distribution with parameter $\theta_0$, the probability of observing $\hat{\theta}$ or an even larger value is $\alpha$,

$P_{F_\tau}(\theta \ge \hat{\theta}) = \alpha.$   (3)

Conceptually, for any given value of $\tau$, we would need to sample from $F_\tau$ and check whether equation (3) holds true. This is both expensive and exposed to the randomness of repeated sampling.
Figure 1: Default evaluation pipeline, as predominantly recommended in the literature. Only a single model $\hat{\beta}_s$ is selected for evaluation based on its validation performance $\hat{\eta}_s$; $\hat{\theta}_{s,L}$ is the lower confidence bound for that model's evaluation performance.
What we actually do is to employ an importance sampling reweighting approach as proposed by Efron [1981]. This allows us to find the lower bound $\theta_L$ using only bootstrap resamples from the observed empirical distribution $\hat{F}$. We reweight each resample $b = 1, \ldots, B$ with the relative likelihood $W_b(\tau) = \prod_{i=1}^{n} p_i(\tau) / \prod_{i=1}^{n} n^{-1}$ of the resample under $p_\tau$-weighted sampling relative to ordinary sampling with equal weights $n^{-1}$, and calibrate the tilting parameter $\tau < 0$ such that the estimated probability of observing at least $\hat{\theta}$ under the tilted distribution $F_\tau$ is $\alpha$,

$\alpha = P_{F_\tau}\big(\theta(\hat{F}^*) \ge \hat{\theta}\big) = B^{-1} \sum_{b=1}^{B} W_b(\tau)\, I\{\hat{\theta}^*_b \ge \hat{\theta}\},$

where $\hat{F}^*$ is the resampling empirical distribution. Then the value of the statistic $\theta$ that corresponds to that calibrated value of $\tau$ and the respective sampling weights $p_\tau$ is the desired lower confidence bound $\theta_L = \theta(p_\tau)$. Figure 2 illustrates this idea.
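To make the calibration step concrete, the following sketch computes a BT lower confidence bound for a single accuracy-type statistic (the mean of 0/1 correctness indicators). It is a minimal illustration of the importance-reweighting idea described above, not the authors' implementation: the exponential tilting family, the root-finding bracket, the function name, and all default settings are assumptions made for the example.

    import numpy as np
    from scipy.optimize import brentq

    def bt_lower_bound(correct, alpha=0.025, B=2000, seed=0):
        # Bootstrap tilting (BT) lower confidence bound for the mean of 0/1
        # correctness indicators, via the importance-reweighting idea above.
        # Minimal sketch under exponential tilting, not the authors' code.
        rng = np.random.default_rng(seed)
        y = np.asarray(correct, dtype=float)
        n = len(y)
        if np.all(y == y[0]):
            raise ValueError("constant data cannot be tilted; use a conservative fallback")
        theta_hat = y.mean()                    # observed evaluation performance

        # Ordinary bootstrap resamples from the empirical distribution F_hat.
        idx = rng.integers(0, n, size=(B, n))
        theta_star = y[idx].mean(axis=1)        # resampled performance estimates
        hit = theta_star >= theta_hat           # resamples at least as large as observed

        def tilt_weights(tau):
            # Exponential tilting: p_i(tau) proportional to exp(tau * y_i).
            w = np.exp(tau * (y - y.mean()))    # centring only for numerical stability
            return w / w.sum()

        def tail_prob(tau):
            # Estimated probability under F_tau of observing at least theta_hat,
            # using the relative likelihood W_b(tau) of each qualifying resample.
            p = tilt_weights(tau)
            logW = np.log(n * p[idx[hit]]).sum(axis=1)
            return np.exp(logW).sum() / B

        # Calibrate tau < 0 so that the tilted tail probability equals alpha.
        tau_cal = brentq(lambda t: tail_prob(t) - alpha, -50.0, -1e-8)
        p_cal = tilt_weights(tau_cal)
        return float(p_cal @ y)                 # theta_L = theta(p_tau)

For prediction accuracy, the argument would be the vector of indicators of whether each evaluation prediction matches the true label. Note that this sketch yields an unadjusted bound for a single model only, without the multiplicity correction introduced below.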
The tilting approach does not work if the data $y_1 = \ldots = y_n$ to resample from are constant, because then

$p_1(\tau) = \ldots = p_n(\tau)$ for any $\tau$,   (4)

and the empirical distribution cannot be tilted. This can, for instance, be an issue in binary classification when the model perfectly predicts the true class labels. One option to deal with this issue is to switch to another (conservative) interval estimation method. In the aforementioned example this could, for instance, be a Clopper-Pearson lower confidence bound.
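For prediction accuracy, such a conservative fallback is the exact one-sided Clopper-Pearson bound for a binomial proportion. A minimal sketch (the function name is ours):

    from scipy.stats import beta

    def clopper_pearson_lower(k, n, alpha=0.025):
        # One-sided lower Clopper-Pearson bound for a binomial proportion with
        # k successes out of n trials: the alpha-quantile of Beta(k, n - k + 1),
        # defined as 0 when k = 0.
        return 0.0 if k == 0 else float(beta.ppf(alpha, k, n - k + 1))

For a perfect classifier ($k = n$) this gives the familiar bound $\alpha^{1/n}$.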
BT is known to be second-order correct and to work well for a single model when no model selection is involved [DiCiccio and Romano, 1990, Hesterberg, 1999]. However, in our proposed pipeline, multiple models are being evaluated; see Figure 3. Thus, we modify the BT routine and incorporate a maxT-type multiplicity control, which is a well-known standard approach in simultaneous inference [Dickhaus, 2014]. To the best of our knowledge, this is the first time that BT is extended to simultaneous inference and applied in a machine learning evaluation setup.
Figure 2: BT confidence bound estimation. The solid-line distribution on the right represents $\hat{F}$, while the dashed-line distribution on the left represents $\hat{F}_\tau$. BT finds a value for $\tau$ such that the probability under $\hat{F}_\tau$ of observing at least $\hat{\theta}$ is $\alpha$; this means the mass of the dashed-line distribution to the right of $\hat{\theta}$ equals $\alpha$. The associated value $\hat{\theta}_L$ of $\theta$ under $\hat{F}_\tau$ is the desired lower confidence bound.
Our approach enables us to simultaneously evaluate the conditional performances of multiple models and to provide valid confidence bounds for them, in particular one for the final selected model.
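To convey what a maxT-type correction does, the sketch below illustrates the generic single-step maxT idea on bootstrapped accuracy statistics: the critical value is taken from the joint bootstrap distribution of the maximum studentized deviation across models. This is only a generic illustration of the principle, not the MABT procedure developed in Section 2, which combines the maxT idea with bootstrap tilting.

    import numpy as np

    def maxt_lower_bounds(correct, alpha=0.05, B=2000, seed=0):
        # Generic single-step maxT simultaneous lower bounds for m models;
        # `correct` is an (n, m) 0/1 matrix of per-observation correctness on
        # the evaluation set.  Illustrative sketch only, not MABT.
        rng = np.random.default_rng(seed)
        n, m = correct.shape
        theta_hat = correct.mean(axis=0)                       # per-model accuracies
        se = correct.std(axis=0, ddof=1) / np.sqrt(n) + 1e-12  # per-model standard errors

        # Bootstrap the maximum studentized deviation jointly over all m models,
        # so the correlation between the models' estimates is respected.
        max_t = np.empty(B)
        for b in range(B):
            boot = correct[rng.integers(0, n, size=n)]
            boot_theta = boot.mean(axis=0)
            boot_se = boot.std(axis=0, ddof=1) / np.sqrt(n) + 1e-12
            max_t[b] = np.max((boot_theta - theta_hat) / boot_se)

        q = np.quantile(max_t, 1 - alpha)   # critical value of the max statistic
        return theta_hat - q * se           # simultaneous lower confidence bounds

Because the maximum is taken jointly across models within each resample, the resulting bounds account for the dependence between the models' performance estimates.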
In the following, we consider a binary classification problem where a potentially large number $r$ of candidate models have already been trained, and a number of promising models $s_1, \ldots, s_m$ have already been selected for evaluation, based on their validation performances $\hat{\eta}_{s_1}, \ldots, \hat{\eta}_{s_m}$ and following some selection rule. We call this multitude of models selected for evaluation the set of preselected models. In addition, we suppose that retraining of the preselected models on the entire learning data has already been performed, yielding models $\hat{\beta}_{s_1}, \ldots, \hat{\beta}_{s_m}$. Also, suppose that the associated performance estimates $\hat{\theta}_{s_1}, \ldots, \hat{\theta}_{s_m}$ have been obtained based on the predictions from the hold-out evaluation set, and that a final model $s \in \{s_1, \ldots, s_m\}$ has been selected due to its evaluation performance $\hat{\theta}_s$, following some (other) selection rule.
Section 2 gives the details of our proposed method. In Section 3 we show a selection of results from our simulation experiments; a complete presentation can be found in the Supplementary Information. We apply our proposed approach to a real data set in Section 4. Our presentation ends with a discussion in Section 5.
2 Method
For brevity, let $j = 1, \ldots, m$ denote the preselected models instead of $s_1, \ldots, s_m$, and let $s \in \{1, \ldots, m\}$ denote the final selected model, that is, the model with the most promising evaluation performance $\hat{\theta}_s$, which is a function of that model's evaluation predictions $\hat{y}_{1s}, \hat{y}_{2s}, \ldots, \hat{y}_{ns}$, where $n$ is the size of the evaluation set at hand. Note that this estimate $\hat{\theta}_s$ of the generalizing prediction performance is subject to selection bias and therefore overly optimistic. To compute our proposed multiplicity-adjusted bootstrap tilting (MABT) lower confidence bound, we only need these predictions $\hat{y}_{ij}$ from all of the competing preselected models $j = 1, \ldots, m$ in the evaluation set and the associated true class labels $y_i$, $i = 1, \ldots, n$. For instance, in case the performance measure of interest is prediction accuracy, the relevant evaluation data reduce to the 0/1 indicators of whether each prediction $\hat{y}_{ij}$ agrees with the true label $y_i$.
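As a small illustration of these inputs, the sketch below builds the matrix of agreement indicators from the evaluation predictions and true labels; all names and shapes are assumptions for the example.

    import numpy as np

    def agreement_indicators(y_pred, y_true):
        # (n, m) matrix of class-label predictions from the m preselected models
        # and the n true labels; returns the 0/1 matrix I{y_hat_ij = y_i}.
        y_pred = np.asarray(y_pred)
        y_true = np.asarray(y_true).reshape(-1, 1)
        return (y_pred == y_true).astype(float)

    # Per-model evaluation accuracies are the column means of this matrix; its
    # columns are also the data that a tilting routine such as bt_lower_bound
    # above would resample from.
    # accuracies = agreement_indicators(y_pred, y_true).mean(axis=0)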