POST-SELECTION CONFIDENCE BOUNDS FOR
PREDICTION PERFORMANCE
Pascal Rink
Institute for Statistics and
Competence Center for
Clinical Trials Bremen
University of Bremen
Bremen, Germany
Werner Brannath
Institute for Statistics and
Competence Center for
Clinical Trials Bremen
University of Bremen
Bremen, Germany
February 6, 2023
ABSTRACT
In machine learning, the selection of a promising model from a potentially large number of competing models and the assessment of its generalization performance are critical tasks that need careful consideration. Typically, model selection and evaluation are strictly separated endeavors: the sample at hand is split into a training, validation, and evaluation set, and only a single confidence interval is computed for the prediction performance of the final selected model. We, however, propose an algorithm that computes valid lower confidence bounds for multiple models that have been selected based on their prediction performances in the evaluation set, by interpreting the selection problem as a simultaneous inference problem. We use bootstrap tilting and a maxT-type multiplicity correction. The approach is universally applicable for any combination of prediction models, any model selection strategy, and any prediction performance measure that accepts weights. We conducted various simulation experiments which show that our proposed approach yields lower confidence bounds that are at least as good as bounds from standard approaches and that reliably attain the nominal coverage probability. In addition, especially when the sample size is small, our proposed approach yields better-performing prediction models than the default selection of only one model for evaluation.
Keywords: bootstrap tilting, machine learning, multiple testing, performance evaluation, post-selection inference
1 Introduction
Many machine learning applications involve both model selection and the assessment of the selected model's prediction performance on future observations. This is particularly challenging when only little data is available to perform both tasks. By allocating a greater fraction of the data towards model selection, the goodness assessment becomes less reliable, while allocating a greater fraction towards goodness assessment poses the risk of selecting a sub-par prediction model. In such situations it is desirable to have a procedure at hand that resolves this trade-off reliably.
Recent work by Westphal and Brannath [2020] showed that it is beneficial, in terms of final model performance and statistical power, to select multiple models for goodness assessment, in spite of the need to then correct for multiplicity. While Westphal and Brannath [2020] proposed a multiple test for such cases, we here propose a way to compute valid lower confidence bounds for the conditional prediction performance of the final selected model. Note that reporting a confidence interval makes perfect sense here, since a point estimate for the performance does not incorporate the estimation uncertainty from the evaluation set at all.
Correspondence to: Pascal Rink, p.rink@uni-bremen.de
arXiv:2210.13206v3 [stat.ML] 3 Feb 2023
We follow the idea of Berk et al. [2013] and interpret this post-selection inference problem as a simultaneous inference problem, controlling the family-wise error rate

$P_\theta\big(\theta_i < \theta_{i,L} \text{ for any } i \in \{1, \ldots, m\}\big) \le \alpha,$   (1)

where $\theta_i$ denotes the performance of prediction model $i$, $\theta_{i,L}$ denotes the corresponding lower confidence bound at significance level $\alpha > 0$, and $\theta$ is the true predictive performance. With this type 1 error control we are therefore able, in practice, to answer with high confidence the question whether there is a model among the competition whose prediction performance $\theta_i$ is at least as large as a reference performance $\theta_0$, no matter how and which subset of the initial competition has been selected for evaluation. In particular, since

$\alpha \ge P_\theta\Big(\bigcup_{i=1}^{m} \{\theta_i < \theta_{i,L}\}\Big) \ge P_\theta(\theta_s < \theta_{s,L}),$   (2)

for any $s \in \{1, \ldots, m\}$, this coverage guarantee carries over to whichever model $s$ is finally selected.
This might be an overly eager requirement to meet in certain cases and might lead to somewhat conservative decisions. Therefore, in order to increase power, we do not evaluate all of the candidate models, but only a promising selection of them. Or, to put it the other way around: we exclude models from evaluation that are likely not going to be among the best performing ones. In principle, however, it is also possible to report lower confidence bounds for the prediction performance of all the competing models. The proposed bounds are universally valid in the sense that they work with any measure of prediction performance (as long as it accepts weights), with any combination of prediction models, even from different model classes, and with any model selection strategy; they are also computationally undemanding, as no additional model training is involved. While Berk et al. [2013] proposed universally valid post-selection confidence bounds for regression coefficients, we are interested in a post-selection lower confidence bound for the conditional prediction performance of a model selected based on its evaluation performance.
1.1 Conditional vs Unconditional Performance
We are particularly interested in the conditional prediction performance, that is, the generalization performance of the model trained on the present sample. For this, in a model selection and performance estimation regime, the prevailing recommendation in the literature is to split the sample at hand into three parts, a training, validation, and evaluation set [Goodfellow et al., 2016, Hastie et al., 2009, Japkowicz and Shah, 2011, Murphy, 2012, Raschka, 2018]; see Figure 1. Depending on the specific selection rule, the training and validation sets can sometimes be combined to form a learning set. For instance, this is the case when cross-validation is used to identify promising models from a competition of models based on their cross-validated prediction performance. Cross-validation using the entire sample at hand, however, is not a solution to our problem, since it actually estimates the unconditional prediction performance, that is, the average prediction performance of a model fit on other training data from the same distribution as the original data [Bates et al., 2021, Hastie et al., 2009]. There are examples in the literature of how to correct for this unwanted behavior and report an estimate of the conditional performance [Bates et al., 2021, Tsamardinos et al., 2018]. Yet, we choose an approach that directly and inherently estimates the conditional performance.
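For concreteness, the default pipeline in Figure 1 can be set up with two successive splits. The sketch below assumes scikit-learn is available; the split fractions, the stratification, and the seed are illustrative choices for the example, not values prescribed in this paper.

    # Minimal sketch of the three-way split into training, validation, and
    # evaluation sets (Figure 1); fractions and seed are illustrative only.
    from sklearn.model_selection import train_test_split

    def three_way_split(X, y, eval_frac=0.25, val_frac=0.25, seed=0):
        # First set aside the hold-out evaluation set ...
        X_learn, X_eval, y_learn, y_eval = train_test_split(
            X, y, test_size=eval_frac, random_state=seed, stratify=y)
        # ... then split the remaining learning data into training and validation.
        X_train, X_val, y_train, y_val = train_test_split(
            X_learn, y_learn, test_size=val_frac, random_state=seed, stratify=y_learn)
        return (X_train, y_train), (X_val, y_val), (X_eval, y_eval)

Depending on the selection rule, the training and validation parts may later be merged again into the learning set on which the preselected models are retrained.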
1.2 Bootstrap Tilting Confidence Intervals
Our proposed confidence bounds are obtained using bootstrap resampling. In particular, we use bootstrap tilting (BT), introduced by Efron [1981], which is a general approach to estimating confidence intervals for some statistic $\theta = \theta(F)$ using an i.i.d. sample $(y_1, y_2, \ldots, y_n)$ from an unknown distribution $F$. This statistic $\theta$ will later be our performance estimate of choice. Unlike many other bootstrap confidence intervals, BT estimates the distribution of $\hat{\theta}$ under some test value $\theta_0$. The lower confidence bound is then formed from those values of $\theta_0$ that could not be rejected in a test of the null hypothesis $H_0\colon \theta \le \theta_0$. This way, the distribution to resample from is consistent with the null distribution. In particular, this is achieved by reducing the problem to a one-parameter family $(F_\tau)_\tau$ of distributions, where $\tau$ is called the tilting parameter and $F_\tau$ has support on the observed data $\{y_1, y_2, \ldots, y_n\}$. A specific value of $\tau$ induces nonnegative sampling weights $p_\tau = (p_1(\tau), p_2(\tau), \ldots, p_n(\tau))$ such that $\sum_{i=1}^{n} p_i(\tau) = 1$. The tilting parameter $\tau$ is monotonically related to $\theta$, such that a specific value of $\tau$ corresponds to a specific value of $\theta_0$.
In order to find a lower confidence bound $\theta_L$, we find the largest value of $\tau < 0$ such that the corresponding level-$\alpha$ test still rejects $H_0$; this means $\theta_L$ is the largest value of $\theta_0$ such that, if the sample came from a distribution with parameter $\theta_0$, the probability of observing $\hat{\theta}$ or an even larger value is $\alpha$,

$P_{F_\tau}(\theta \ge \hat{\theta}) = \alpha.$   (3)

Conceptually, for any given value of $\tau$, we would need to sample from $F_\tau$ and check whether equation (3) holds true. This is both expensive and exposed to the randomness of repeated sampling.
Figure 1: Default evaluation pipeline, as predominantly recommended in the literature. Only a single model $\hat{\beta}_s$ is selected for evaluation based on its validation performance $\hat{\eta}_s$; $\hat{\theta}_{s,L}$ is the lower confidence bound for that model's evaluation performance.
What we actually do is to employ an importance sampling reweighting approach as proposed by Efron [1981]. This allows us to find the lower bound $\theta_L$ using only bootstrap resamples from the observed empirical distribution $\hat{F}$. We reweight each resample $b = 1, \ldots, B$ with the relative likelihood $W_b(\tau) = \prod_{i=1}^{n} p_i(\tau) / \prod_{i=1}^{n} n^{-1}$ of the resample under $p_\tau$-weighted sampling relative to ordinary sampling with equal weights $n^{-1}$, and calibrate the tilting parameter $\tau < 0$ such that the estimated probability of observing at least $\hat{\theta}$ under the tilted distribution $F_\tau$ is $\alpha$,

$\alpha = P_{F_\tau}\big(\theta(\hat{F}^*) \ge \hat{\theta}\big) = B^{-1} \sum_{b=1}^{B} W_b(\tau)\, I\{\hat{\theta}^*_b \ge \hat{\theta}\},$

where $\hat{F}^*$ is the resampling empirical distribution. Then the value of the statistic $\theta$ that corresponds to that calibrated value of $\tau$ and the respective sampling weights $p_\tau$ is the desired lower confidence bound $\theta_L = \theta(p_\tau)$. Figure 2 illustrates this idea.
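To make the calibration step concrete, the following sketch computes a BT lower confidence bound for a single accuracy-type statistic (the mean of 0/1 correctness indicators). It is a minimal illustration of the importance-reweighting idea described above, not the authors' implementation: the exponential tilting family, the root-finding bracket, the function name, and all default settings are assumptions made for the example.

    import numpy as np
    from scipy.optimize import brentq

    def bt_lower_bound(correct, alpha=0.025, B=2000, seed=0):
        # Bootstrap tilting (BT) lower confidence bound for the mean of 0/1
        # correctness indicators, via the importance-reweighting idea above.
        # Minimal sketch under exponential tilting, not the authors' code.
        rng = np.random.default_rng(seed)
        y = np.asarray(correct, dtype=float)
        n = len(y)
        if np.all(y == y[0]):
            raise ValueError("constant data cannot be tilted; use a conservative fallback")
        theta_hat = y.mean()                    # observed evaluation performance

        # Ordinary bootstrap resamples from the empirical distribution F_hat.
        idx = rng.integers(0, n, size=(B, n))
        theta_star = y[idx].mean(axis=1)        # resampled performance estimates
        hit = theta_star >= theta_hat           # resamples at least as large as observed

        def tilt_weights(tau):
            # Exponential tilting: p_i(tau) proportional to exp(tau * y_i).
            w = np.exp(tau * (y - y.mean()))    # centring only for numerical stability
            return w / w.sum()

        def tail_prob(tau):
            # Estimated probability under F_tau of observing at least theta_hat,
            # using the relative likelihood W_b(tau) of each qualifying resample.
            p = tilt_weights(tau)
            logW = np.log(n * p[idx[hit]]).sum(axis=1)
            return np.exp(logW).sum() / B

        # Calibrate tau < 0 so that the tilted tail probability equals alpha.
        tau_cal = brentq(lambda t: tail_prob(t) - alpha, -50.0, -1e-8)
        p_cal = tilt_weights(tau_cal)
        return float(p_cal @ y)                 # theta_L = theta(p_tau)

For prediction accuracy, the argument would be the vector of indicators of whether each evaluation prediction matches the true label. Note that this sketch yields an unadjusted bound for a single model only, without the multiplicity correction introduced below.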
The tilting approach does not work if the data $y_1 = \ldots = y_n$ to resample from are constant, because then

$p_1(\tau) = \ldots = p_n(\tau)$ for any $\tau$,   (4)

and the empirical distribution cannot be tilted. This can, for instance, be an issue in binary classification when the model perfectly predicts the true class labels. One option to deal with this issue is to switch to another (conservative) interval estimation method. In the aforementioned example this could, for instance, be a Clopper-Pearson lower confidence bound.
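For prediction accuracy, such a conservative fallback is the exact one-sided Clopper-Pearson bound for a binomial proportion. A minimal sketch (the function name is ours):

    from scipy.stats import beta

    def clopper_pearson_lower(k, n, alpha=0.025):
        # One-sided lower Clopper-Pearson bound for a binomial proportion with
        # k successes out of n trials: the alpha-quantile of Beta(k, n - k + 1),
        # defined as 0 when k = 0.
        return 0.0 if k == 0 else float(beta.ppf(alpha, k, n - k + 1))

For a perfect classifier ($k = n$) this gives the familiar bound $\alpha^{1/n}$.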
BT is known to be second-order correct and to work well for a single model when no model selection is involved [DiCiccio and Romano, 1990, Hesterberg, 1999]. However, in our proposed pipeline, multiple models are being evaluated; see Figure 3. Thus, we modify the BT routine and incorporate a maxT-type multiplicity control, which is a well-known standard approach in simultaneous inference [Dickhaus, 2014]. To the best of our knowledge, this is the first time that BT is extended to simultaneous inference and applied in a machine learning evaluation setup.
Figure 2: BT confidence bound estimation. The solid-line distribution on the right represents $\hat{F}$, while the dashed-line distribution on the left represents $\hat{F}_\tau$. BT finds a value for $\tau$ such that the probability under $\hat{F}_\tau$ of observing at least $\hat{\theta}$ is $\alpha$; this means the mass of the dashed-line distribution to the right of $\hat{\theta}$ equals $\alpha$. The associated value $\hat{\theta}_L$ of $\theta$ under $\hat{F}_\tau$ is the desired lower confidence bound.
Our approach enables us to simultaneously evaluate the conditional performances of multiple models and to provide valid confidence bounds for them, in particular one for the final selected model.
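To convey what a maxT-type correction does, the sketch below illustrates the generic single-step maxT idea on bootstrapped accuracy statistics: the critical value is taken from the joint bootstrap distribution of the maximum studentized deviation across models. This is only a generic illustration of the principle, not the MABT procedure developed in Section 2, which combines the maxT idea with bootstrap tilting.

    import numpy as np

    def maxt_lower_bounds(correct, alpha=0.05, B=2000, seed=0):
        # Generic single-step maxT simultaneous lower bounds for m models;
        # `correct` is an (n, m) 0/1 matrix of per-observation correctness on
        # the evaluation set.  Illustrative sketch only, not MABT.
        rng = np.random.default_rng(seed)
        n, m = correct.shape
        theta_hat = correct.mean(axis=0)                       # per-model accuracies
        se = correct.std(axis=0, ddof=1) / np.sqrt(n) + 1e-12  # per-model standard errors

        # Bootstrap the maximum studentized deviation jointly over all m models,
        # so the correlation between the models' estimates is respected.
        max_t = np.empty(B)
        for b in range(B):
            boot = correct[rng.integers(0, n, size=n)]
            boot_theta = boot.mean(axis=0)
            boot_se = boot.std(axis=0, ddof=1) / np.sqrt(n) + 1e-12
            max_t[b] = np.max((boot_theta - theta_hat) / boot_se)

        q = np.quantile(max_t, 1 - alpha)   # critical value of the max statistic
        return theta_hat - q * se           # simultaneous lower confidence bounds

Because the maximum is taken jointly across models within each resample, the resulting bounds account for the dependence between the models' performance estimates.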
In the following, we consider a binary classification problem where a potentially large number $r$ of candidate models have already been trained, and a number of promising models $s_1, \ldots, s_m$ have already been selected for evaluation, based on their validation performances $\hat{\eta}_{s_1}, \ldots, \hat{\eta}_{s_m}$ and following some selection rule. We call this multitude of models selected for evaluation the set of preselected models. In addition, we suppose that retraining of the preselected models on the entire learning data has already been performed, yielding models $\hat{\beta}_{s_1}, \ldots, \hat{\beta}_{s_m}$. Also, suppose that the associated performance estimates $\hat{\theta}_{s_1}, \ldots, \hat{\theta}_{s_m}$ have been obtained based on the predictions from the hold-out evaluation set, and that a final model $s \in \{s_1, \ldots, s_m\}$ has been selected due to its evaluation performance $\hat{\theta}_s$, following some (other) selection rule.
Section 2 gives the details of our proposed method. In Section 3 we show a selection of results from our simulation experiments; a complete presentation can be found in the Supplementary Information. We apply our proposed approach to a real data set in Section 4. Our presentation ends with a discussion in Section 5.
2 Method
For brevity, let $j = 1, \ldots, m$ denote the preselected models instead of $s_1, \ldots, s_m$, and let $s \in \{1, \ldots, m\}$ denote the final selected model, that is, the model with the most promising evaluation performance $\hat{\theta}_s$, which is a function of that model's evaluation predictions $\hat{y}_{1s}, \hat{y}_{2s}, \ldots, \hat{y}_{ns}$, where $n$ is the size of the evaluation set at hand. Note that this estimate $\hat{\theta}_s$ of the generalizing prediction performance is subject to selection bias and therefore overly optimistic. To compute our proposed multiplicity-adjusted bootstrap tilting (MABT) lower confidence bound, we only need these predictions $\hat{y}_{ij}$ from all of the competing preselected models $j = 1, \ldots, m$ in the evaluation set and the associated true class labels $y_i$, $i = 1, \ldots, n$. For instance, in case the performance measure of interest is prediction accuracy, the relevant evaluation data reduce to the 0/1 indicators of whether each prediction $\hat{y}_{ij}$ agrees with the true label $y_i$.
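As a small illustration of these inputs, the sketch below builds the matrix of agreement indicators from the evaluation predictions and true labels; all names and shapes are assumptions for the example.

    import numpy as np

    def agreement_indicators(y_pred, y_true):
        # (n, m) matrix of class-label predictions from the m preselected models
        # and the n true labels; returns the 0/1 matrix I{y_hat_ij = y_i}.
        y_pred = np.asarray(y_pred)
        y_true = np.asarray(y_true).reshape(-1, 1)
        return (y_pred == y_true).astype(float)

    # Per-model evaluation accuracies are the column means of this matrix; its
    # columns are also the data that a tilting routine such as bt_lower_bound
    # above would resample from.
    # accuracies = agreement_indicators(y_pred, y_true).mean(axis=0)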