INVERSE SET ESTIMATION AND INVERSION OF SIMULTANEOUS
CONFIDENCE INTERVALS
Junting Ren
Division of Biostatistics
University of California San Diego
j5ren@ucsd.edu
Fabian J.E. Telschow
Institute of Mathematics
Humboldt-Universität zu Berlin
fabian.telschow@hu-berlin.de
Armin Schwartzman
Division of Biostatistics and Halıcıoğlu Data Science Institute
University of California San Diego
armins@ucsd.edu
ABSTRACT
Motivated by the questions of risk assessment in climatology (temperature change in North America)
and medicine (impact of statin usage and COVID-19 on hospitalized patients), we address the problem
of estimating the set in the domain of a function whose image equals a predefined subset. Existing
methods that construct confidence sets require strict assumptions. We generalize the estimation of
such sets to dense and non-dense domains with protection against “data peeking” by proving that
confidence sets of multiple levels can be simultaneously constructed with the desired confidence
non-asymptotically through inverting simultaneous confidence bands. A non-parametric bootstrap
algorithm and code are provided.
Keywords: Inverse set · Simultaneous confidence bands · Bootstrap · Non-parametric
1 Introduction
1.1 Motivation
One motivating problem for our work comes from the data analysis in [1]. The data used here is obtained from the
North American Regional Climate Change Assessment Program (NARCCAP) project [2]. This data comprises two
sets of 29 geographically registered arrays of average seasonal temperatures for summer (June–August) and winter
(December–February), during two time frames: the late 20th century (1971–1999) and the mid-21st century (2041–2069).
The aim of the analysis is to identify specific geographical regions where the difference in average summer or winter
temperatures between these two periods exceeds a certain benchmark, with the intention of helping policymakers focus
on regions that are at higher risk for effects of climate change.
Mathematically, the regions with temperature difference exceeding $c$ degrees can be defined as
$\mu^{-1}(U) = \{s \in \mathcal{S} : \mu(s) \in U\}$: the set in the closed domain $\mathcal{S}$ such that the function
output $\mu(s)$ (true difference in temperature) is in the half interval $U = [c, \infty)$. We call $\mu^{-1}(U)$
inverse set because it is the preimage or inverse image of a set $U \subset \mathbb{R}$ under a deterministic function
$\mu : \mathcal{S} \mapsto \mathbb{R}$. Suppose $\mu$ is unknown but data is available to construct an estimator
$\hat{\mu}_n$ where $n$ is the sample size. A point estimate of the inverse set $\mu^{-1}(U)$ can be constructed as
$\hat{\mu}_n^{-1}(U)$, indicated by the inside of the green contours in the middle lower and upper panels of Figure 1
for $U = [2, \infty)$. But how do we assess the spatial uncertainty of such an estimate?
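As a minimal sketch of the plug-in estimate described above, assume the domain is a finite grid of locations and that the estimator is the sample mean at each location. The function name `plug_in_inverse_set` and the toy numbers are hypothetical and only illustrate the definition of $\hat{\mu}_n^{-1}(U)$ for $U = [c, \infty)$.

```python
from statistics import fmean


def plug_in_inverse_set(samples, c):
    """Plug-in estimate of the inverse set for U = [c, inf) on a finite grid.

    samples: list of n replicates, each a list of values at the same K locations.
    Returns the set of location indices whose sample mean is >= c, plus the means.
    """
    n_locations = len(samples[0])
    # Sample mean at each location plays the role of the estimator mu_hat_n(s).
    mu_hat = [fmean([row[s] for row in samples]) for s in range(n_locations)]
    estimated_set = {s for s, m in enumerate(mu_hat) if m >= c}
    return estimated_set, mu_hat


# Hypothetical toy data: 3 replicates observed at 4 locations.
toy = [
    [1.0, 2.5, 3.0, 1.5],
    [1.2, 2.1, 2.8, 1.7],
    [0.8, 2.3, 3.2, 1.6],
]
estimated_set, mu_hat = plug_in_inverse_set(toy, c=2.0)
print(sorted(estimated_set))
```

The sketch returns only a point estimate; it says nothing about how far the estimated set may be from the true inverse set, which is the question the confidence sets address.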
To assess this uncertainty, [1] introduced Coverage Probability Excursion (CoPE) sets, here called inner and outer
confidence sets (CSs), that are sub- and super-sets of the target inverse set, i.e.
$$\mathrm{CS}_{\mathrm{in}}(U) \subseteq \mu^{-1}(U) \subseteq \mathrm{CS}_{\mathrm{out}}(U)$$
arXiv:2210.03933v3 [stat.ME] 11 Jul 2023
Inverse Set Estimation
Figure 1: Confidence sets for the increase of the mean summer temperature (June–August) in North America between
the 20th and 21st centuries according to the specific climate model analyzed in [1]. Heat maps show the estimate of the
mean difference. The first row displays the contours of the outer confidence sets, estimated inverse set, and the inner
confidence sets, for various levels. The three plots in the second row display the confidence sets for the inverse sets,
where the estimated mean difference is greater than or equal to the individual level 1.5, 2.0, or 2.5 respectively. In the
second row, the blue line is the contour of the outer confidence set, the green line is the contour of the estimated
inverse set and the red line is the contour of the inner confidence set.
with a certain pre-specified probability, say 95%. We call these CSs due to their analogy to confidence intervals, with the
“lower bound” being $\mathrm{CS}_{\mathrm{in}}(U)$ and the “upper bound” being $\mathrm{CS}_{\mathrm{out}}(U)$. In the
NARCCAP application, for $U = [2, \infty)$, $\mathrm{CS}_{\mathrm{in}}(U)$ and $\mathrm{CS}_{\mathrm{out}}(U)$ are
indicated respectively by the inside of the red and blue contours in the middle lower panel of Figure 1.
In order to have a precise control of
$P\big[\mathrm{CS}_{\mathrm{in}}(U) \subseteq \mu^{-1}(U) \subseteq \mathrm{CS}_{\mathrm{out}}(U)\big]$, [1] assume that
the domain $\mathcal{S}$ is a dense subset of $\mathbb{R}^d$, that both $\mu(s)$ and $\hat{\mu}_n(s)$ are continuous
whenever they have values close to $c = 2$, and the desired coverage is only guaranteed asymptotically as the sample
size $n$ goes to infinity. These assumptions lead to a rather complicated proof and limit its applicability and generality.
Even for the NARCCAP climate change data that is used as the main illustration in [1], the assumptions are not strictly
satisfied. The data consists of only 29 samples per location, at which the algorithm in [1] fails to construct CSs that
achieve the correct coverage in simulations, as we show in Section 3. In addition, the temperature data is only observed
on a finite set of locations, so it is not strictly dense in $\mathbb{R}^2$ [2].
Due to its required assumptions, the original approach [1] is designed for dense functional data and cannot be applied to
other data types such as multiple regression data. For instance, using the data in [3], consider the problem of identifying
patient characteristics that lead to a risk of having a severe outcome which is higher than a certain threshold. Figure 2
shows the estimated probability of hospitalized patients having a severe outcome, depending on age, COVID status,
and statin medication status, obtained using multiple logistic regression. The use of statin [adjusted odds ratio (aOR)
0.78, confidence interval (CI) 0.66 to 0.93] is associated with decreased probability of severe outcome. Using CSs, we
can visualize the protective effect of statin for better interpretation as detailed in Section 4. However, since categorical
covariates are discrete, the domain is not a dense subset of $\mathbb{R}^d$, where $d$ is the number of covariates.
Therefore, the original method or other existing methods for constructing CSs are not applicable in this scenario, but the
method we propose here is.
In terms of statistical inference, the existing approaches require the investigator to set a fixed excursion threshold level,
for example 2°C in the climate change data. This threshold depends on the context. Yet, setting a good threshold is
difficult even for domain experts [4]. Why is 2°C important but not 1.5°C? It is natural, and almost unavoidable, for
investigators to try different thresholds and choose those that give the most meaningful results. An analysis example using
multiple thresholds is shown in the upper panel of Figure 1 or in Figure 2. Therefore, to assure valid inference with
control of the type I error rate, the coverage of the CSs should be simultaneous over all thresholds.
1.2 Contributions
This paper proposes an elegant solution to overcome the limitations of the previous methods. The answer, it turns
out, is to construct confidence sets by inverting pre-built simultaneous confidence intervals (SCIs) which are widely
applicable in different data modalities. In this paper, we underscore the broad applicability of our method, primarily
concentrating on the construction of CSs for two prevalent but distinct data modalities: dense functional data and
multiple regression data. The performance of various algorithms in constructing SCIs for dense functional data has been
rigorously evaluated in prior work [5]. For multiple regression data, although the non-parametric bootstrap algorithm
has been validated as a method for capturing the asymptotic distribution of multiple linear regression coefficients [6],
its efficacy in constructing SCIs within a finite sample setup remains largely unexplored. Consequently, we introduce
a non-parametric bootstrap algorithm, supplemented by R code, for constructing SCIs in multiple regression and
provide a comprehensive evaluation of its performance. Our simulation results reveal that this approach not only
controls the predetermined Type I error rate effectively but also maintains robustness despite finite sample sizes and
does not necessitate the continuity of covariates. Furthermore, our method of inverting pre-built SCIs ensures that the
coverage probability of the confidence sets, for any given threshold $c \in \mathbb{R}$, aligns precisely with the SCI
coverage rate, as corroborated by our theorems. Inspired by Goeman [7, 8], this safeguards against “data peeking” in
exploratory data analysis, thereby enabling researchers to construct confidence sets for any threshold $c$ without
concerns about compromising the control over the Type I error rate. The algorithm, simulation, and data application code
associated with our study is accessible online at https://github.com/junting-ren/inverse_set_SCI.
1.3 Other existing inverse set estimation methodology
In addition to application to climate change [1], inverse set estimation methods are applied in many other different fields,
such as astronomy [9], medical imaging [10, 11, 12], dose-effect finding [13], and geoscience [14]. Furthermore, there
is a growing trend to quantify the effect size for genomic regions rather than just testing the null hypothesis [15, 16],
where inverse set estimation methods can be utilized to quantify the uncertainty of genomic regions with effects greater
than a certain threshold.
However, just like the aforementioned method of [1], existing inverse set estimation methods are only applicable to
specific kinds of data and require strict assumptions. Other methods are specifically designed for scenarios where the
function $\mu$ is a density function [17, 18, 19]. Inverse set methods have been also developed for stochastic processes
(random functions), but they require the process itself to be Gaussian and data must be observed on a fixed grid
[20, 21, 22]. The additional significant issue with all the inverse set estimation methods above is that the estimated
confidence sets are only valid for a single threshold $c$, for example, estimating the set $\mu^{-1}[c, +\infty)$ for a
fixed threshold $c$.
1.4 Existing simultaneous confidence interval methods
Since the proposed inverse set estimation method is based on SCIs, it is worth reviewing existing SCI methods, to which
our method would be applicable. For dense functional data, researchers constructed SCIs based on functional central
limit theorems in the Banach space using Monte-Carlo simulations with an estimate of the limiting covariance structure
[23, 24, 25], based on bootstrap [26, 27], and based on the Gaussian Kinematic formula [5]. For sparse functional data,
SCIs are built using functional principal component analysis [28, 29]. For high dimensional data such as genomics data
with discrete indexing, valid SCIs are built for high dimensional but a finite number of parameters before selection
[30] or after selection [31, 32]. For survival data, SCIs for survival functions are built using Greenwood’s variance
formula under large sample sizes [33], as well as SCIs for the difference or ratio of two survival functions [34, 35]. For
regression problems, researchers are often interested in how the response $y$ changes with a vector of predictors $x$,
or the magnitude of the regression coefficients. Therefore, SCIs can be constructed for $y$ on the range of $x$ for simple
linear regression [36] and multiple regression on the dense compact subset of continuous covariates in $\mathbb{R}^d$
[37]. However, to the best of our knowledge, there is no practical bootstrap algorithm nor accessible code online that
constructs SCIs for linear combinations of coefficients of multiple regression that is valid under finite sample size, which
is addressed in the current paper with our algorithm and code.
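To make the bootstrap route to SCIs concrete, here is a minimal sketch of a generic max-|t| non-parametric bootstrap band for the mean at a finite number of locations. It is not the regression algorithm of this paper: the function name `max_t_bootstrap_sci`, the toy data, and the choice to studentize with the original-sample standard error are all assumptions made for illustration.

```python
import math
import random
from statistics import fmean, stdev


def max_t_bootstrap_sci(samples, alpha=0.05, n_boot=500, seed=0):
    """Generic max-|t| bootstrap band for the mean at K locations (illustrative only).

    samples: list of n replicates, each a list of K values.
    Returns (lower, mu_hat, upper, q) where q is the bootstrap critical value.
    """
    rng = random.Random(seed)
    n, k = len(samples), len(samples[0])
    columns = [[row[s] for row in samples] for s in range(k)]
    mu_hat = [fmean(col) for col in columns]
    # Standard error of the mean at each location, from the original sample.
    se = [stdev(col) / math.sqrt(n) for col in columns]

    max_stats = []
    for _ in range(n_boot):
        # Resample whole replicates so dependence across locations is preserved.
        boot = [samples[rng.randrange(n)] for _ in range(n)]
        boot_mu = [fmean([row[s] for row in boot]) for s in range(k)]
        # Largest studentized deviation over all locations in this resample.
        max_stats.append(max(abs(boot_mu[s] - mu_hat[s]) / se[s] for s in range(k)))

    max_stats.sort()
    # Empirical (1 - alpha) quantile of the bootstrap maximum statistic.
    q = max_stats[min(n_boot - 1, math.ceil((1 - alpha) * n_boot) - 1)]
    lower = [m - q * e for m, e in zip(mu_hat, se)]
    upper = [m + q * e for m, e in zip(mu_hat, se)]
    return lower, mu_hat, upper, q


# Hypothetical toy data: 40 replicates at 5 locations with unit-variance noise.
data_rng = random.Random(1)
true_mean = [1.0, 1.5, 2.0, 2.5, 3.0]
toy = [[m + data_rng.gauss(0.0, 1.0) for m in true_mean] for _ in range(40)]
lower, mu_hat, upper, q = max_t_bootstrap_sci(toy)
print(round(q, 2))
```

Using one critical value `q` for all locations is what makes the band simultaneous rather than pointwise. Whether such a band attains its nominal coverage in small samples is an empirical question of the kind the paper examines in simulations.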
Figure 2: Simultaneous confidence set for the probability of severe outcome. We fixed other variables at ACE = 0, ARB
= 0, sex = Male, CKD = 1, hypertension = 1, CVD = 1, diabetes = 1, obesity = 1. The gray shaded area shows the 95% SCIs;
the solid black line is the estimated probability. The red horizontal lines show the inner confidence sets (where the lower
SCIs are greater than the corresponding levels), which are contained in the estimated inverse upper excursion sets, colored
as the green and red horizontal lines (where the estimated means are greater than the corresponding levels); the outer
confidence sets are colored by the blue, green and red lines (where the upper SCIs are greater than the corresponding
levels) and contain both the estimated inverse sets and the inner confidence sets.
1.5 Outline
After stating and proving the main theorem and corollaries in Section 2, we present the results of simulation studies that
validate our method for continuous domains using dense functional data and regression mean prediction on a fine grid
of predictors. For discrete domains, confidence sets for regression coefficients are constructed using simulated datasets,
and the results are shown in Section 3. The non-parametric bootstrap algorithm for constructing SCIs for regression
coefficients and linear combinations of the coefficients is provided in Section 3. In addition, for different correlation
structures between the estimated means $\hat{\mu}(s)$ in the domain $\mathcal{S}$, we demonstrate how conservative the
method is when only a finite number of confidence sets are constructed, compared to the SCI nominal coverage rate. We
showcase the advantages of our method over the previous approach [1] in both the simulations and the real data application.
Following the simulations, we exhibit two motivating applications in two distinct domains in Section 4: probability
contours for the mean temperature difference map for climate change, and logistic regression for determining whether
statin is protective against the severe outcome of Coronavirus disease 2019 (COVID-19) patients. We conclude with a brief
discussion in Section 5.
2 Theory
2.1 Setup
The goal of inverse set estimation is to estimate the set $\mu^{-1}(U) = \{s \in \mathcal{S} : \mu(s) \in U\}$ where
$\mu : \mathcal{S} \mapsto \mathbb{R}$ is an unknown deterministic function, $U$ is a fixed subset of $\mathbb{R}$, and
$\mathcal{S}$ is a closed indexing set. The “point estimator” of the true inverse set is:
$$\hat{\mu}^{-1}(U) = \{s \in \mathcal{S} : \hat{\mu}(s) \in U\}.$$
Similar to the point estimate of a scalar parameter, we need a “lower bound” and an “upper bound” for the estimated
inverse set. Therefore, we introduce the data-dependent outer confidence set $\mathrm{CS}_{\mathrm{out}}(U)$ and the
data-dependent inner confidence set $\mathrm{CS}_{\mathrm{in}}(U)$ with the goal that the true inverse set
$\mu^{-1}(U) = \{s \in \mathcal{S} : \mu(s) \in U\}$ is “sandwiched” within them:
$$\mathrm{CS}_{\mathrm{in}}(U) \subseteq \mu^{-1}(U) \subseteq \mathrm{CS}_{\mathrm{out}}(U).$$
2.2 Estimating inverse upper excursion sets
The central idea of this article is that such confidence sets can be obtained by inverting SCIs. Let $\hat{B}_l(s)$ and
$\hat{B}_u(s)$ denote the estimated lower and upper SCI functions at pre-specified level $\alpha$ such that:
$$P\big[\forall s \in \mathcal{S} : \hat{B}_l(s) \le \mu(s) \le \hat{B}_u(s)\big] = 1 - \alpha$$
Because the function $\mu$ and the SCIs are generally not one-to-one functions, the inversion can get complicated depending
on the interval $U$. We simplify this issue by setting $U$ as half of the real line, and this is often the set that
researchers are interested in. We can define the following inverse upper excursion set at level $c$ as:
$$\mu^{-1}[c, +\infty) = \{s \in \mathcal{S} \mid \mu(s) \ge c\}$$
In addition, we define the following sets as the inner and outer confidence sets for the inverse upper excursion set
$\mu^{-1}[c, +\infty)$ for a single level $c$:
$$\mathrm{CS}_{\mathrm{in}}[c, +\infty) := \hat{B}_l^{-1}[c, +\infty) = \{s \in \mathcal{S} \mid \hat{B}_l(s) \ge c\}$$
$$\mathrm{CS}_{\mathrm{out}}[c, +\infty) := \hat{B}_u^{-1}[c, +\infty) = \{s \in \mathcal{S} \mid \hat{B}_u(s) \ge c\}$$
In Figure 3, the red horizontal lines are the $\mathrm{CS}_{\mathrm{in}}[c, +\infty)$, whereas the union of red, green
and blue horizontal lines are the $\mathrm{CS}_{\mathrm{out}}[c, +\infty)$. Henceforth, we distinguish between the
inference when $c$ is a single level and when the inference is simultaneous over multiple choices of the level $c$.
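The inversion defined above amounts to thresholding the lower band, the estimate, and the upper band at the same level. The following is a minimal sketch on a finite domain, assuming the SCI bounds are already available as lists indexed by location; the function name `invert_sci` and the toy numbers are hypothetical.

```python
def invert_sci(lower, estimate, upper, c):
    """Invert simultaneous confidence bounds at level c on a finite domain.

    Returns (cs_in, est_set, cs_out) as sets of location indices:
      cs_in   = {s : lower[s]    >= c}   (inner confidence set)
      est_set = {s : estimate[s] >= c}   (plug-in inverse set)
      cs_out  = {s : upper[s]    >= c}   (outer confidence set)
    """
    cs_in = {s for s, b in enumerate(lower) if b >= c}
    est_set = {s for s, m in enumerate(estimate) if m >= c}
    cs_out = {s for s, b in enumerate(upper) if b >= c}
    return cs_in, est_set, cs_out


# Hypothetical band over 5 locations (lower <= estimate <= upper everywhere).
lower = [0.5, 1.3, 1.9, 2.6, 1.7]
estimate = [1.0, 1.8, 2.4, 3.1, 2.2]
upper = [1.5, 2.3, 2.9, 3.6, 2.7]

# The same band can be inverted at several thresholds without recomputing it.
results = {c: invert_sci(lower, estimate, upper, c) for c in (1.5, 2.0, 2.5)}
for c, (cs_in, est_set, cs_out) in results.items():
    print(c, sorted(cs_in), sorted(est_set), sorted(cs_out))
```

Because the lower bound never exceeds the estimate and the estimate never exceeds the upper bound, the three sets are nested at every threshold, which is why one band serves all levels $c$.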
2.2.1 Single level confidence set from SCI
In [38, 39], after constructing a bootstrap percentile SCI, the authors claim that the true mean is greater than $c = 0$
in the region where the estimated lower interval is greater than $0$. However, no probability or confidence statement
is given. This is one of the ad-hoc examples of using SCI for inverse set estimation in applications. The following
Proposition 1 provides a formal justification for the procedure above, stating that for a single level $c$,
$\hat{B}_l^{-1}[c, +\infty)$ is a set such that we are at least 95% confident that
$\forall s \in \hat{B}_l^{-1}[c, +\infty)$, $\mu(s) \ge c$.
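Proposition 1. For a fixed level $c \in \mathbb{R}$, and SCIs with $\alpha$ type I family-wise error rate, we have
$$P\big[\hat{B}_l^{-1}[c, +\infty) \subseteq \mu^{-1}[c, +\infty)\big] \ge P\big[\forall s \in \mathcal{S} : \hat{B}_l(s) \le \mu(s) \le \hat{B}_u(s)\big] = 1 - \alpha$$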
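Proof. Define the following events
$$E := \big\{\hat{B}_l^{-1}[c, +\infty) \subseteq \mu^{-1}[c, +\infty)\big\},$$
and
$$E_{\mathrm{SCI}} := \big\{\forall s \in \mathcal{S} : \hat{B}_l(s) \le \mu(s) \le \hat{B}_u(s)\big\}.$$
We want to show:
$$E_{\mathrm{SCI}} \implies E.$$
Conditioning on the event $E_{\mathrm{SCI}}$, assume that for a fixed $s \in \mathcal{S}$ we have $\hat{B}_l(s) \ge c$.
Then $\mu(s) \ge \hat{B}_l(s) \ge c$ by $E_{\mathrm{SCI}}$. This means that for all $s \in \mathcal{S}$ with
$\hat{B}_l(s) \ge c$ we also have $\mu(s) \ge c$, which is equivalent to the statement
$\hat{B}_l^{-1}[c, +\infty) \subseteq \mu^{-1}[c, +\infty)$.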
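The step from this implication between events to the probability bound in Proposition 1 can be written out as follows. This is a sketch of the standard monotonicity argument and uses only the events $E$ and $E_{\mathrm{SCI}}$ defined in the proof and the SCI coverage assumption.

```latex
\begin{align*}
E_{\mathrm{SCI}} \subseteq E
\;\Longrightarrow\;
P(E) \;\ge\; P(E_{\mathrm{SCI}})
  &= P\big[\forall s \in \mathcal{S} : \hat{B}_l(s) \le \mu(s) \le \hat{B}_u(s)\big] \\
  &= 1 - \alpha .
\end{align*}
```

The inequality can be strict because the inner set may lie inside the true excursion set even when the band misses $\mu(s)$ at some location.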