We follow the idea of Berk et al. [2013] and interpret this post-selection inference problem as a simultaneous inference problem, controlling the family-wise error rate
\[
P_\theta\bigl(\theta_i < \theta_{i,L} \ \text{for any} \ i \in \{1, \ldots, m\}\bigr) \leq \alpha, \tag{1}
\]
where $\theta_i$ denotes the performance of prediction model $i$, $\theta_{i,L}$ denotes the corresponding lower confidence bound at significance level $\alpha > 0$, and $\theta$ denotes the true predictive performance. With this type 1 error control, we are therefore able, in practice, to answer with high confidence the question of whether there is a model among the competition whose prediction performance $\theta_i$ is at least as large as a reference performance $\theta_0$, no matter how, and which, subset of the initial competition has been selected for evaluation. In particular, since
\[
\alpha \;\geq\; P_\theta\Bigl(\bigcup_{i=1}^{m} \{\theta_i < \theta_{i,L}\}\Bigr) \;\geq\; P_\theta(\theta_s < \theta_{s,L}), \tag{2}
\]
for any $s \in \{1, \ldots, m\}$, this coverage guarantee carries over to whichever model $s$ is finally selected.
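As an aside, the logic behind (1) and (2) can be checked with a small simulation. The sketch below uses Gaussian toy scores and plain Bonferroni-adjusted lower bounds purely as a stand-in for the bounds developed in this paper; all numerical choices are illustrative.

```python
# Minimal Monte Carlo sketch (illustrative only, not the proposed BT bounds):
# simultaneous Bonferroni-adjusted lower bounds for m toy performance estimates.
# Checks that the family-wise error rate in (1), and hence the miscoverage of
# the bound for the *selected* model s as in (2), stays below alpha.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
m, n, alpha, n_sim = 5, 100, 0.05, 2000
theta = np.linspace(0.6, 0.8, m)                   # hypothetical true performances

fwer_hits, selected_hits = 0, 0
for _ in range(n_sim):
    scores = rng.normal(theta, 0.2, size=(n, m))   # toy per-observation scores
    est = scores.mean(axis=0)
    se = scores.std(axis=0, ddof=1) / np.sqrt(n)
    lower = est - norm.ppf(1 - alpha / m) * se     # simultaneous lower bounds
    fwer_hits += np.any(theta < lower)             # event in (1)
    s = int(np.argmax(est))                        # select the apparently best model
    selected_hits += theta[s] < lower[s]           # event in (2)

print("estimated FWER:                ", fwer_hits / n_sim)
print("estimated selected miscoverage:", selected_hits / n_sim)
```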
This can be an overly strict requirement in certain cases and may lead to somewhat conservative decisions. Therefore, in order to increase power, we do not evaluate all of the candidate models, but only a promising selection of them. Put the other way around, we exclude from evaluation those models that are unlikely to be among the best performing ones. In principle, however, it is also possible to report lower confidence bounds for the prediction performance of all competing models. These bounds remain universally valid in the sense that they work with any measure of prediction performance (as long as it accepts weights), with any combination of prediction models, even from different model classes, and with any model selection strategy, and they are computationally undemanding since no additional model training is involved. While Berk et al. [2013] proposed universally valid post-selection confidence bounds for regression coefficients, we are interested in a post-selection lower confidence bound for the conditional prediction performance of a model selected based on its evaluation performance.
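To clarify the requirement above that the performance measure "accepts weights": such a measure simply lets each observation enter with its own weight, as in the following toy example (the data and the accuracy metric are hypothetical and only serve to illustrate the idea).

```python
# Minimal illustration of a performance measure that "accepts weights":
# an accuracy in which observation i contributes with weight p_i, sum(p) = 1.
import numpy as np

def weighted_accuracy(y_true, y_pred, p):
    return float(np.dot(p, y_true == y_pred))

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])
p = np.full(5, 0.2)                            # uniform weights recover plain accuracy
print(weighted_accuracy(y_true, y_pred, p))    # 0.6
```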
1.1 Conditional vs Unconditional Performance
We are particularly interested in the conditional prediction performance, that is, the generalization performance of the model trained on the present sample. For that, in a model selection and performance estimation regime, the prevailing recommendation in the literature is to split the sample at hand into three parts: a training, a validation, and an evaluation set [Goodfellow et al., 2016, Hastie et al., 2009, Japkowicz and Shah, 2011, Murphy, 2012, Raschka, 2018], see Figure 1. Depending on the specific selection rule, the training and validation set can sometimes be combined to form a learning set. For instance, this is the case when cross-validation is used to identify promising models from a competition of models, based on their cross-validated prediction performance. Cross-validation using the entire sample at hand, however, is not a solution to our problem, since it actually estimates the unconditional prediction performance, that is, the average prediction performance of a model fit on other training data from the same distribution as the original data [Bates et al., 2021, Hastie et al., 2009]. There are examples in the literature of how to correct for this unwanted behavior and report an estimate of the conditional performance [Bates et al., 2021, Tsamardinos et al., 2018]. Yet, we choose an approach that directly and inherently estimates the conditional performance.
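As a concrete illustration of this splitting scheme, the following short sketch (with an arbitrary synthetic data set, two arbitrary candidate models, and accuracy as the metric, none of which are the paper's setup) selects a model on the validation set and then reports the performance of that already fitted model on the held-out evaluation set, i.e., an estimate of its conditional performance.

```python
# Minimal sketch of the three-way split from Figure 1 on synthetic data;
# the data set, candidate models, and metric are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, random_state=0)
# split off the evaluation set, then split the rest into training and validation
X_learn, X_eval, y_learn, y_eval = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_learn, y_learn, test_size=0.25, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=0),
}
for model in candidates.values():
    model.fit(X_train, y_train)

# model selection on the validation set ...
best = max(candidates, key=lambda name: accuracy_score(y_val, candidates[name].predict(X_val)))
# ... and the conditional performance of exactly this fitted model on the evaluation set
print(best, accuracy_score(y_eval, candidates[best].predict(X_eval)))
```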
1.2 Bootstrap Tilting Confidence Intervals
Our proposed confidence bounds are obtained using bootstrap resampling. In particular, we use bootstrap tilting (BT), introduced by Efron [1981], which is a general approach to estimating confidence intervals for some statistic $\theta = \theta(F)$ using an i.i.d. sample $(y_1, y_2, \ldots, y_n)$ from an unknown distribution $F$. This statistic $\theta$ will later be our performance estimate of choice. Unlike many other bootstrap confidence intervals, BT estimates the distribution of $\hat{\theta} - \theta_0$ for some test value $\theta_0$. The lower confidence bound is then formed from those values of $\theta_0$ that could not be rejected in a test of the null hypothesis $H_0\colon \theta \leq \theta_0$. This way, the distribution to resample from is consistent with the null distribution. In particular, this is achieved by reducing the problem to a one-parametric family $(F_\tau)_\tau$ of distributions, where $\tau$ is called the tilting parameter and $F_\tau$ has support on the observed data $\{y_1, y_2, \ldots, y_n\}$. A specific value of $\tau$ induces nonnegative sampling weights $p_\tau = (p_1(\tau), p_2(\tau), \ldots, p_n(\tau))$ such that $\sum_{i=1}^{n} p_i(\tau) = 1$. The tilting parameter $\tau$ is monotonically related to $\theta$, so that a specific value of $\tau$ corresponds to a specific value of $\theta_0$.
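As one concrete, hypothetical instance of such a family, exponential tilting of the empirical distribution with the sample mean as the statistic yields weights $p_i(\tau) \propto \exp(\tau y_i)$; the exact tilting used for our performance measures may differ. The sketch below shows that the induced value $\theta(F_\tau) = \sum_{i=1}^{n} p_i(\tau)\, y_i$ moves monotonically with $\tau$.

```python
# Sketch of an exponentially tilted family (F_tau) supported on the observed data,
# assuming the statistic of interest is a plain mean; the exact tilting used for
# our performance measures may differ.
import numpy as np

def tilting_weights(y, tau):
    """Nonnegative weights p_i(tau) proportional to exp(tau * y_i), summing to one."""
    w = np.exp(tau * (y - y.mean()))    # centring only for numerical stability
    return w / w.sum()

rng = np.random.default_rng(1)
y = rng.normal(0.7, 0.1, size=50)       # toy performance scores
for tau in (-2.0, -0.5, 0.0, 0.5, 2.0):
    p = tilting_weights(y, tau)
    # the tilted parameter theta(F_tau) = sum_i p_i(tau) * y_i grows with tau
    print(f"tau = {tau:+.1f}   theta(F_tau) = {np.dot(p, y):.4f}")
```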
In order to find a lower confidence bound $\theta_L$, we find the largest value of $\tau < 0$ such that the corresponding level-$\alpha$ test still rejects $H_0$; this means $\theta_L$ is the largest value of $\theta_0$ such that, if the sample came from a distribution with parameter $\theta_0$, the probability of observing $\hat{\theta}$ or an even larger value is $\alpha$,
\[
P_{F_\tau}\bigl(\theta \geq \hat{\theta}\bigr) = \alpha. \tag{3}
\]
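To make the search concrete, the following naive sketch solves (3) by a grid search over $\tau < 0$ on toy data, resampling directly from $F_\tau$ for every candidate value (again assuming exponential tilting of a mean, as in the previous sketch). As explained next, this brute-force approach is expensive and is shown only for illustration.

```python
# Naive grid search for theta_L: for each candidate tau < 0, resample from F_tau
# and estimate the upper-tail probability in (3); keep the tilted value
# theta(F_tau) for the largest tau at which the level-alpha test still rejects.
# Purely conceptual and expensive; not the shortcut actually used in the paper.
import numpy as np

def tilting_weights(y, tau):
    w = np.exp(tau * (y - y.mean()))
    return w / w.sum()

rng = np.random.default_rng(2)
y = rng.normal(0.7, 0.1, size=50)               # toy performance scores
theta_hat, alpha, n, B = y.mean(), 0.05, len(y), 1000

theta_L = None
for tau in np.linspace(-5.0, -0.05, 50):        # increasing values of tau < 0
    p = tilting_weights(y, tau)
    idx = rng.choice(n, size=(B, n), p=p)       # B resamples drawn from F_tau
    tail = np.mean(y[idx].mean(axis=1) >= theta_hat)
    if tail <= alpha:                           # level-alpha test still rejects H0
        theta_L = float(np.dot(p, y))           # candidate bound theta(F_tau)

print("theta_hat =", round(theta_hat, 4), "  theta_L =", round(theta_L, 4))
```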
Conceptually, for any given value of $\tau$, we need to sample from $F_\tau$ and check whether equation (3) holds true. This is both expensive and exposed to the randomness of repeated sampling. What we actually do is to employ an importance