Inverse Set Estimation
In terms of statistical inference, the existing approaches require the investigator to set a fixed excursion threshold level,
for example 2°C in the climate change data. This threshold depends on the context, yet setting a good threshold is
difficult even for domain experts [4]. Why is 2°C important but not 1.5°C? It is natural, and almost unavoidable, for
investigators to try different thresholds and choose those that give the most meaningful results. An example analysis using
multiple thresholds is shown in the upper panel of Figure 1 or in Figure 2. Therefore, to ensure valid inference with
control of the type I error rate, the coverage of the CSs should be simultaneous over all thresholds.
1.2 Contributions
This paper proposes a solution that overcomes the limitations of the previous methods. The answer, it turns
out, is to construct confidence sets by inverting pre-built simultaneous confidence intervals (SCIs), which are widely
applicable across data modalities. In this paper, we underscore the broad applicability of our method, primarily
concentrating on the construction of CSs for two prevalent but distinct data modalities: dense functional data and
multiple regression data. The performance of various algorithms in constructing SCIs for dense functional data has been
rigorously evaluated in prior work [5]. For multiple regression data, although the non-parametric bootstrap algorithm
has been validated as a method for capturing the asymptotic distribution of multiple linear regression coefficients [6],
its efficacy in constructing SCIs in a finite-sample setting remains largely unexplored. Consequently, we introduce
a non-parametric bootstrap algorithm, supplemented by R code, for constructing SCIs in multiple regression and
provide a comprehensive evaluation of its performance. Our simulation results show that this approach not only
controls the predetermined Type I error rate effectively but also remains robust at finite sample sizes and
does not require the covariates to be continuous. Furthermore, our method of inverting pre-built SCIs ensures that the
coverage probability of the confidence sets, for any given threshold c ∈ ℝ, matches the SCI coverage rate exactly,
as corroborated by our theorems. Inspired by Goeman [7, 8], this property safeguards against “data peeking” in exploratory
data analysis, thereby enabling researchers to construct confidence sets for any threshold c without concerns about
compromising control of the Type I error rate. The algorithm, simulation, and data application code associated
with our study are available online at https://github.com/junting-ren/inverse_set_SCI.
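The inversion step described above can be illustrated with a minimal Python sketch. It assumes that simultaneous lower and upper bounds are already available on a finite set of locations; the function name `invert_sci` and the numeric bounds are hypothetical and are not taken from the accompanying repository.

```python
def invert_sci(lower, upper, c):
    """Invert SCI bounds into inner and outer sets for {s : mu(s) >= c}.

    lower, upper: sequences of simultaneous lower/upper bounds, one per location.
    Returns (inner, outer) as sets of location indices.
    """
    # Locations whose lower bound clears c: confidently inside the excursion set.
    inner = {s for s, lo in enumerate(lower) if lo >= c}
    # Locations whose upper bound clears c: cannot be ruled out of the excursion set.
    outer = {s for s, up in enumerate(upper) if up >= c}
    return inner, outer


# Hypothetical SCI bounds at five locations (illustrative numbers only).
lower = [0.2, 1.1, 1.9, 2.4, 0.8]
upper = [1.0, 2.3, 3.0, 3.5, 1.7]

# The same pre-built SCIs are reused for every threshold; nothing is recomputed.
for c in (1.0, 1.5, 2.0):
    inner, outer = invert_sci(lower, upper, c)
    print(c, sorted(inner), sorted(outer))
```

Because the bounds are computed once and only compared against c afterwards, trying several thresholds does not require any further adjustment of the intervals.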
1.3 Other existing inverse set estimation methodology
In addition to the application to climate change [1], inverse set estimation methods are applied in many other fields,
such as astronomy [9], medical imaging [10, 11, 12], dose-effect finding [13], and geoscience [14]. Furthermore, there
is a growing trend to quantify the effect size for genomic regions rather than just testing the null hypothesis [15, 16],
where inverse set estimation methods can be used to quantify the uncertainty of genomic regions with effects greater
than a certain threshold.
However, just like the aforementioned method of [1], existing inverse set estimation methods are only applicable to
specific kinds of data and require strict assumptions. Other methods are specifically designed for scenarios where the
function µ is a density function [17, 18, 19]. Inverse set methods have also been developed for stochastic processes
(random functions), but they require the process itself to be Gaussian and the data to be observed on a fixed grid
[20, 21, 22]. An additional significant issue with all the inverse set estimation methods above is that the estimated
confidence sets are only valid for a single threshold c, for example, estimating the set µ⁻¹[c, +∞) for a fixed threshold c.
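For reference, the target set and the reason simultaneous bounds extend to every threshold can be written out as follows. This is a sketch under assumed notation: the domain S and the bound functions \hat{B}_\ell and \hat{B}_u are symbols introduced here for illustration.

```latex
% Inverse (excursion) set of \mu at threshold c, with S the assumed domain of \mu:
\[
  \mu^{-1}[c, +\infty) \;=\; \{\, s \in S : \mu(s) \ge c \,\}.
\]
% Suppose the event
%   \hat{B}_\ell(s) \le \mu(s) \le \hat{B}_u(s) \quad \text{for all } s \in S
% holds. Then \hat{B}_\ell(s) \ge c implies \mu(s) \ge c, and \mu(s) \ge c implies
% \hat{B}_u(s) \ge c, so on that same event, for every c \in \mathbb{R},
\[
  \{\, s : \hat{B}_\ell(s) \ge c \,\}
  \;\subseteq\; \mu^{-1}[c, +\infty)
  \;\subseteq\; \{\, s : \hat{B}_u(s) \ge c \,\}.
\]
```

A single-threshold method guarantees a containment of this kind only for the one value of c chosen in advance, whereas the containment above holds for all c at once whenever the bounds hold simultaneously over S.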
1.4 Existing simultaneous confidence interval methods
Since the proposed inverse set estimation method is based on SCIs, it is worth reviewing existing SCI methods, to which
our method would be applicable. For dense functional data, researchers have constructed SCIs based on functional central
limit theorems in the Banach space using Monte-Carlo simulations with an estimate of the limiting covariance structure
[23, 24, 25], based on the bootstrap [26, 27], and based on the Gaussian Kinematic formula [5]. For sparse functional data,
SCIs are built using functional principal component analysis [28, 29]. For high-dimensional data with discrete indexing,
such as genomics data, valid SCIs are built for a high-dimensional but finite number of parameters before selection
[30] or after selection [31, 32]. For survival data, SCIs for survival functions are built using Greenwood’s variance
formula under large sample sizes [33], as are SCIs for the difference or ratio of two survival functions [34, 35]. For
regression problems, researchers are often interested in how the response y changes with a vector of predictors x, or in the
magnitude of the regression coefficients. Accordingly, SCIs can be constructed for y over the range of x for simple linear
regression [36] and for multiple regression on a dense compact subset of continuous covariates in ℝ^d [37]. However, to
the best of our knowledge, there is neither a practical bootstrap algorithm nor accessible online code that constructs SCIs for
linear combinations of multiple regression coefficients and is valid under finite sample sizes; the current paper addresses
this gap with our algorithm and code.
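To make the idea of bootstrap SCIs for linear combinations of regression coefficients concrete, the following Python sketch uses a generic pairs bootstrap with a max-|t| critical value. It is an illustration of a standard technique under stated assumptions, not the algorithm or the R code of this paper; the function name `bootstrap_sci`, the contrast matrix `C`, and the simulated data are all hypothetical.

```python
import numpy as np


def bootstrap_sci(X, y, C, alpha=0.05, n_boot=1000, seed=0):
    """Pairs-bootstrap simultaneous intervals for the rows of C @ beta (sketch)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    est = C @ beta_hat
    boot = np.empty((n_boot, C.shape[0]))
    for b in range(n_boot):
        # Resample (x_i, y_i) pairs with replacement and refit least squares.
        idx = rng.integers(0, n, size=n)
        beta_b, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        boot[b] = C @ beta_b
    # Bootstrap standard error of each linear combination.
    se = boot.std(axis=0, ddof=1)
    # Maximum absolute standardized deviation across all combinations per replicate.
    max_t = np.max(np.abs(boot - est) / se, axis=1)
    # One shared critical value gives intervals intended to hold jointly.
    q = np.quantile(max_t, 1 - alpha)
    return est - q * se, est + q * se


# Simulated data with one continuous and one binary covariate (illustrative only).
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.integers(0, 2, size=n).astype(float)
X = np.column_stack([np.ones(n), x1, x2])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

# Each row of C defines one linear combination of (intercept, beta1, beta2).
C = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
lower, upper = bootstrap_sci(X, y, C)
```

The single quantile `q` of the maximum statistic is what distinguishes these intervals from pointwise ones: every linear combination shares the same critical value, so the intervals are widened jointly rather than one at a time.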