INVERSE SET ESTIMATION AND INVERSION OF SIMULTANEOUS CONFIDENCE INTERVALS

Junting Ren
Division of Biostatistics
University of California San Diego
j5ren@ucsd.edu

Fabian J.E. Telschow
Institute of Mathematics
Humboldt Universität zu Berlin
fabian.telschow@hu-berlin.de

Armin Schwartzman
Division of Biostatistics and Halıcıoğlu Data Science Institute
University of California San Diego
armins@ucsd.edu

ABSTRACT

Motivated by the questions of risk assessment in climatology (temperature change in North America) and medicine (impact of statin usage and COVID-19 on hospitalized patients), we address the problem of estimating the set in the domain of a function whose image equals a predefined subset. Existing methods that construct confidence sets require strict assumptions. We generalize the estimation of such sets to dense and non-dense domains with protection against “data peeking” by proving that confidence sets of multiple levels can be simultaneously constructed with the desired confidence non-asymptotically through inverting simultaneous confidence bands. A non-parametric bootstrap algorithm and code are provided.

Keywords: Inverse set · Simultaneous confidence bands · Bootstrap · Non-parametric

1 Introduction

1.1 Motivation

One motivating problem for our work comes from the data analysis in [ ]. The data used here are obtained from the North American Regional Climate Change Assessment Program (NARCCAP) project [ ]. The data comprise two sets of 29 geographically registered arrays of average seasonal temperatures for summer (June-August) and winter (December-February), during two time frames: the late 20th century (1971–1999) and the mid-21st century (2041–2069). The aim of the analysis is to identify specific geographical regions where the difference in average summer or winter temperatures between these two periods exceeds a certain benchmark, with the intention of helping policymakers focus on regions that are at higher risk for effects of climate change.

Mathematically, the regions with temperature difference exceeding $c$ degrees can be defined as $\mu^{-1}(U) = \{s \in \mathcal{S} : \mu(s) \in U\}$: the set in the closed domain $\mathcal{S}$ such that the function output $\mu(s)$ (true difference in temperature) is in the half interval $U = [c, \infty)$. We call $\mu^{-1}(U)$ the inverse set because it is the preimage or inverse image of a set $U \subset \mathbb{R}$ under a deterministic function $\mu : \mathcal{S} \to \mathbb{R}$. Suppose $\mu$ is unknown but data are available to construct an estimator $\hat{\mu}_n$, where $n$ is the sample size. A point estimate of the inverse set $\mu^{-1}(U)$ can be constructed as $\hat{\mu}_n^{-1}(U)$, indicated by the inside of the green contours in the middle lower and upper panels of Figure 1 for $U = [2, \infty)$. But how do we assess the spatial uncertainty of such an estimate?
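As a minimal sketch of the plug-in estimate described above, the following Python snippet evaluates a fitted function on a finite grid and keeps the points where it reaches the threshold. The grid, the function `mu_hat`, and the helper name `plug_in_inverse_set` are illustrative assumptions, not part of the NARCCAP analysis.

```python
from typing import Callable, Iterable, List, Tuple

Point = Tuple[int, int]


def plug_in_inverse_set(
    mu_hat: Callable[[Point], float], domain: Iterable[Point], c: float
) -> List[Point]:
    """Return the plug-in estimate {s in domain : mu_hat(s) >= c}."""
    return [s for s in domain if mu_hat(s) >= c]


# Hypothetical 3 x 3 grid of locations (not real climate data).
grid = [(x, y) for x in range(3) for y in range(3)]


def mu_hat(s: Point) -> float:
    """Hypothetical estimated temperature difference at location s."""
    x, y = s
    return 1.0 + 0.5 * x + 0.25 * y


# Estimated region where the difference is at least 2 degrees.
region = plug_in_inverse_set(mu_hat, grid, c=2.0)
print(region)
```

The estimate is a set of domain points, so its uncertainty is spatial: a small change in `mu_hat` near the threshold can add or remove whole locations.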

To assess this uncertainty, [ ] introduced Coverage Probability Excursion (CoPE) sets, here called inner and outer confidence sets (CSs), that are sub- and super-sets of the target inverse set, i.e.

$$\mathrm{CS}_{\mathrm{in}}(U) \subseteq \mu^{-1}(U) \subseteq \mathrm{CS}_{\mathrm{out}}(U),$$

with a certain pre-specified probability, say 95%. We call these CSs due to their analogy to confidence intervals, with the “lower bound” being $\mathrm{CS}_{\mathrm{in}}(U)$ and the “upper bound” being $\mathrm{CS}_{\mathrm{out}}(U)$. In the NARCCAP application, for $U = [2, \infty)$, $\mathrm{CS}_{\mathrm{in}}(U)$ and $\mathrm{CS}_{\mathrm{out}}(U)$ are indicated respectively by the inside of the red and blue contours in the middle lower panel of Figure 1.

arXiv:2210.03933v3 [stat.ME] 11 Jul 2023

Inverse Set Estimation

Figure 1: Confidence sets for the increase of the mean summer temperature (June–August) in North America between the 20th and 21st centuries according to the specific climate model analyzed in [ ]. Heat maps show the estimate of the mean difference. The first row displays the contours of the outer confidence sets, estimated inverse set, and the inner confidence sets, for various levels. The three plots in the second row display the confidence sets for the inverse sets, where the estimated mean difference is greater than or equal to the individual level 1.5, 2.0, or 2.5, respectively. In the second row, the blue line is the contour of the outer confidence set, the green line is the contour of the estimated inverse set, and the red line is the contour of the inner confidence set.

In order to have precise control of $P\left[\mathrm{CS}_{\mathrm{in}}(U) \subseteq \mu^{-1}(U) \subseteq \mathrm{CS}_{\mathrm{out}}(U)\right]$, [ ] assume that the domain $\mathcal{S}$ is a dense subset of Euclidean space and that both $\mu(s)$ and $\hat{\mu}_n(s)$ are continuous whenever they have values close to $c = 2$. In addition, the desired coverage is only guaranteed asymptotically as the sample size $n$ goes to infinity. These assumptions lead to a rather complicated proof and limit the applicability and generality of the method.

Even for the NARCCAP climate change data that is used as the main illustration in [ ], the assumptions are not strictly satisfied. The data consist of only 29 samples per location, a sample size at which the algorithm in [ ] fails to construct CSs that achieve the correct coverage in simulations, as we show in Section 3. In addition, the temperature data is only observed on a finite set of locations, so it is not strictly dense in $\mathbb{R}^2$ [2].

Due to its required assumptions, the original approach [ ] is designed for dense functional data and cannot be applied to other data types such as multiple regression data. For instance, using the data in [ ], consider the problem of identifying patient characteristics that lead to a risk of a severe outcome that is higher than a certain threshold. Figure 2 shows the estimated probability of hospitalized patients having a severe outcome, depending on age, COVID status, and statin medication status, obtained using multiple logistic regression. The use of statins [adjusted odds ratio (aOR) 0.78, confidence interval (CI) 0.66 to 0.93] is associated with decreased probability of a severe outcome. Using CSs, we can visualize the protective effect of statins for better interpretation, as detailed in Section 4. However, since categorical covariates are discrete, the domain is not a dense subset of the Euclidean space whose dimension is the number of covariates. Therefore, the original method and other existing methods for constructing CSs are not applicable in this scenario, but the method we propose here is.

In terms of statistical inference, the existing approaches require the investigator to set a fixed excursion threshold level, for example 2°C in the climate change data. This threshold depends on the context. Yet, setting a good threshold is difficult even for domain experts [ ]. Why is 2°C important but not 1.5°C? It is natural, and almost unavoidable, for investigators to try different thresholds and choose those that give the most meaningful results. An analysis example using multiple thresholds is shown in the upper panel of Figure 1 or in Figure 2. Therefore, to ensure valid inference with control of the type I error rate, the coverage of the CSs should be simultaneous over all thresholds.

1.2 Contributions

This paper proposes a solution that overcomes the limitations of the previous methods: construct confidence sets by inverting pre-built simultaneous confidence intervals (SCIs), which are widely applicable across data modalities. We underscore the broad applicability of our method by concentrating on the construction of CSs for two prevalent but distinct data modalities: dense functional data and multiple regression data. The performance of various algorithms in constructing SCIs for dense functional data has been rigorously evaluated in prior work [ ]. For multiple regression data, although the non-parametric bootstrap algorithm has been validated as a method for capturing the asymptotic distribution of multiple linear regression coefficients [ ], its efficacy in constructing SCIs in a finite-sample setup remains largely unexplored. Consequently, we introduce a non-parametric bootstrap algorithm, supplemented by R code, for constructing SCIs in multiple regression and provide a comprehensive evaluation of its performance. Our simulation results show that this approach controls the predetermined type I error rate effectively, remains robust at finite sample sizes, and does not require the covariates to be continuous. Furthermore, our method of inverting pre-built SCIs ensures that the coverage probability of the confidence sets, for any given threshold $c \in \mathbb{R}$, aligns precisely with the SCI coverage rate, as corroborated by our theorems. Inspired by Goeman [ ], this safeguards against “data peeking” in exploratory data analysis, thereby enabling researchers to construct confidence sets for any threshold $c$ without concerns about compromising the control over the type I error rate. The algorithm, simulation, and data application code associated with our study is accessible online at https://github.com/junting-ren/inverse_set_SCI.

1.3 Other existing inverse set estimation methodology

In addition to the application to climate change [ ], inverse set estimation methods are applied in many other fields, such as astronomy [ ], medical imaging [ ], dose-effect finding [ ], and geoscience [ ]. Furthermore, there is a growing trend to quantify the effect size for genomic regions rather than just testing the null hypothesis [ ], where inverse set estimation methods can be utilized to quantify the uncertainty of genomic regions with effects greater than a certain threshold.

However, just like the aforementioned method of [ ], existing inverse set estimation methods are only applicable to specific kinds of data and require strict assumptions. Other methods are specifically designed for scenarios where the function $\mu$ is a density function [ ]. Inverse set methods have also been developed for stochastic processes (random functions), but they require the process itself to be Gaussian and the data to be observed on a fixed grid [ ]. An additional significant issue with all the inverse set estimation methods above is that the estimated confidence sets are only valid for a single threshold $c$, for example, when estimating the set $\mu^{-1}[c, +\infty)$ for a fixed threshold $c$.

1.4 Existing simultaneous confidence interval methods

Since the proposed inverse set estimation method is based on SCIs, it is worth reviewing existing SCI methods, to which our method would be applicable. For dense functional data, researchers have constructed SCIs based on functional central limit theorems in the Banach space using Monte-Carlo simulations with an estimate of the limiting covariance structure [ ], based on the bootstrap [ ], and based on the Gaussian Kinematic formula [ ]. For sparse functional data, SCIs are built using functional principal component analysis [ ]. For high dimensional data such as genomics data with discrete indexing, valid SCIs are built for a high dimensional but finite number of parameters before selection [ ] or after selection [ ]. For survival data, SCIs for survival functions are built using Greenwood’s variance formula under large sample sizes [ ], as well as SCIs for the difference or ratio of two survival functions [ ]. For regression problems, researchers are often interested in how the response changes with a vector of predictors, or in the magnitude of the regression coefficients. Therefore, SCIs can be constructed for the response over the range of the predictor for simple linear regression [ ] and for multiple regression on a dense compact subset of continuous covariates [ ]. However, to the best of our knowledge, there is neither a practical bootstrap algorithm nor accessible code online that constructs SCIs for linear combinations of coefficients of multiple regression and is valid under finite sample sizes. The current paper addresses this gap with our algorithm and code.

Figure 2: Simultaneous confidence set for the probability of severe outcome. We fixed other variables at ACE = 0, ARB = 0, sex = Male, CKD = 1, hypertension = 1, CVD = 1, diabetes = 1, obesity = 1. The gray shaded area is the 95% SCIs; the solid black line is the estimated probability. The red horizontal line shows the inner confidence sets (where the lower SCIs are greater than the corresponding level), which are contained in the estimated inverse upper excursion set, colored as the green and red horizontal lines (where the estimated means are greater than the corresponding levels). The outer confidence sets are colored by the blue, green, and red lines (where the upper SCIs are greater than the corresponding levels) and contain both the estimated inverse sets and the inner confidence sets.

1.5 Outline

After stating and proving the main theorem and corollaries in Section 2, we present the results of simulation studies that validate our method for continuous domains using dense functional data and regression mean prediction on a fine grid of predictors. For discrete domains, confidence sets for regression coefficients are constructed using simulated datasets, and the results are shown in Section 3. The non-parametric bootstrap algorithm for constructing SCIs for regression coefficients and linear combinations of the coefficients is provided in Section 3. In addition, for different correlation structures between the estimated means $\hat{\mu}(s)$ in the domain $\mathcal{S}$, we demonstrate how conservative the method is when only a finite number of confidence sets are constructed, compared to the SCI nominal coverage rate. We showcase the advantages of our method over the previous approach [ ] in both the simulations and the real data application. Following the simulations, Section 4 presents two motivating applications in two distinct domains: probability contours for the mean temperature difference map for climate change, and logistic regression for determining whether statin use is protective against the severe outcome of Coronavirus disease 2019 (COVID) patients. We conclude with a brief discussion in Section 5.

2 Theory

2.1 Setup

The goal of inverse set estimation is to estimate the set $\mu^{-1}(U) = \{s \in \mathcal{S} : \mu(s) \in U\}$, where $\mu : \mathcal{S} \to \mathbb{R}$ is an unknown deterministic function, $U$ is a fixed subset of $\mathbb{R}$, and $\mathcal{S}$ is a closed indexing set. The “point estimator” of the true inverse set is:

$$\hat{\mu}^{-1}(U) = \{s \in \mathcal{S} : \hat{\mu}(s) \in U\}.$$

Similar to the point estimate of a scalar parameter, we need a “lower bound” and an “upper bound” for the estimated inverse set. Therefore, we introduce the data-dependent outer confidence set $\mathrm{CS}_{\mathrm{out}}(U)$ and the data-dependent inner confidence set $\mathrm{CS}_{\mathrm{in}}(U)$ with the goal that the true inverse set $\mu^{-1}(U) = \{s \in \mathcal{S} : \mu(s) \in U\}$ is “sandwiched” within them:

$$\mathrm{CS}_{\mathrm{in}}(U) \subseteq \mu^{-1}(U) \subseteq \mathrm{CS}_{\mathrm{out}}(U).$$

2.2 Estimating inverse upper excursion sets

The central idea of this article is that such confidence sets can be obtained by inverting SCIs. Let $\hat{B}_l(s)$ and $\hat{B}_u(s)$ denote the estimated lower and upper SCI functions at pre-specified level $\alpha$ such that:

$$P\left[\forall s \in \mathcal{S} : \hat{B}_l(s) \le \mu(s) \le \hat{B}_u(s)\right] = 1 - \alpha.$$
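To make the notion of lower and upper SCI functions concrete, here is a minimal Python sketch of one common way to build such a band on a finite domain: a non-parametric bootstrap of the maximum absolute standardized deviation. It is an assumption-laden illustration, not the authors' algorithm (which is given in Section 3 of the paper). The function name `bootstrap_sci`, the data layout, and the toy data are hypothetical, and the coverage of this band is only approximate.

```python
import math
import random
import statistics


def bootstrap_sci(samples, alpha=0.05, n_boot=1000, seed=0):
    """Max-|t| bootstrap band for the mean at each of m domain points.

    samples[i][j] is observation i at domain point j (hypothetical layout).
    Returns (mean, lower, upper), each a list of length m.
    """
    rng = random.Random(seed)
    n, m = len(samples), len(samples[0])
    cols = [[row[j] for row in samples] for j in range(m)]
    mean = [statistics.fmean(col) for col in cols]
    # Simplification: the original-sample standard error is reused in
    # every bootstrap replicate instead of being re-estimated.
    se = [statistics.stdev(col) / math.sqrt(n) for col in cols]

    max_stats = []
    for _ in range(n_boot):
        # Resample whole rows so dependence across domain points is kept.
        idx = [rng.randrange(n) for _ in range(n)]
        stat = max(
            abs(sum(cols[j][i] for i in idx) / n - mean[j]) / se[j]
            for j in range(m)
        )
        max_stats.append(stat)

    max_stats.sort()
    k = min(n_boot - 1, math.ceil((1 - alpha) * n_boot) - 1)
    q = max_stats[k]  # empirical (1 - alpha) quantile of the max statistic
    lower = [mu - q * s for mu, s in zip(mean, se)]
    upper = [mu + q * s for mu, s in zip(mean, se)]
    return mean, lower, upper


# Toy data: 30 hypothetical subjects observed at 5 domain points.
data_rng = random.Random(42)
toy = [[0.5 * j + data_rng.gauss(0.0, 1.0) for j in range(5)] for _ in range(30)]
mean, lower, upper = bootstrap_sci(toy)
```

A single quantile `q` is shared by all domain points, which is what makes the band simultaneous rather than pointwise.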

Because the function $\mu$ and the SCIs are generally not one-to-one functions, the inversion can get complicated depending on the interval $U$. We simplify this issue by setting $U$ to be half of the real line, which is often the set that researchers are interested in. We can define the following inverse upper excursion set at level $c$ as:

$$\mu^{-1}[c, +\infty) = \{s \in \mathcal{S} \mid \mu(s) \ge c\}.$$

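In addition, we define the following sets as the inner and outer confidence sets for the inverse upper excursion set $\mu^{-1}[c, +\infty)$ for a single level $c$:

$$\mathrm{CS}_{\mathrm{in}}[c, +\infty) := \hat{B}_l^{-1}[c, +\infty) = \left\{s \in \mathcal{S} \mid \hat{B}_l(s) \ge c\right\},$$

$$\mathrm{CS}_{\mathrm{out}}[c, +\infty) := \hat{B}_u^{-1}[c, +\infty) = \left\{s \in \mathcal{S} \mid \hat{B}_u(s) \ge c\right\}.$$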

In Figure 3, the red horizontal lines are the $\mathrm{CS}_{\mathrm{in}}[c, +\infty)$, whereas the union of the red, green, and blue horizontal lines is the $\mathrm{CS}_{\mathrm{out}}[c, +\infty)$. Henceforth, we distinguish between the inference when $c$ is a single level and the inference that is simultaneous over multiple choices of the level $c$.
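The inversion itself amounts to thresholding the lower band, the estimate, and the upper band. The sketch below illustrates this on a finite domain, under the assumption that the band has already been computed. The function name `invert_sci` and all numbers (ages and probabilities) are made up for illustration.

```python
def invert_sci(domain, lower, estimate, upper, c):
    """Threshold the lower band, the estimate, and the upper band at level c.

    Returns (inner, estimated, outer) as sets of domain points.
    """
    inner = {s for s, lo in zip(domain, lower) if lo >= c}
    estimated = {s for s, m in zip(domain, estimate) if m >= c}
    outer = {s for s, up in zip(domain, upper) if up >= c}
    return inner, estimated, outer


# Hypothetical band for the probability of a severe outcome by age.
ages = [30, 40, 50, 60, 70]
lower = [0.04, 0.12, 0.21, 0.32, 0.44]
estimate = [0.10, 0.18, 0.27, 0.38, 0.50]
upper = [0.16, 0.24, 0.33, 0.44, 0.56]

# The same band can be inverted at several thresholds.
results = {c: invert_sci(ages, lower, estimate, upper, c) for c in (0.23, 0.40)}
for c, (inner, estimated, outer) in results.items():
    print(c, sorted(inner), sorted(estimated), sorted(outer))
```

Because the lower band never exceeds the estimate and the estimate never exceeds the upper band, the three sets are nested by construction at every threshold.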

2.2.1 Single level confidence set from SCI

In [ ], after constructing a bootstrap percentile SCI, the authors claim that the true mean is greater than $c = 0$ in the region where the estimated lower interval is greater than $c$. However, no probability or confidence statement is given. This is one of the ad-hoc examples of using SCIs for inverse set estimation in applications. The following Proposition 1 provides a formal justification for the procedure above, stating that for a single level $c$, $\hat{B}_l^{-1}[c, +\infty)$ is a set such that we are at least 95% confident (when $\alpha = 0.05$) that $\mu(s) \ge c$ for all $s \in \hat{B}_l^{-1}[c, +\infty)$.

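Proposition 1. For a fixed level $c \in \mathbb{R}$, and SCIs with type I family-wise error rate $\alpha$, we have

$$P\left[\hat{B}_l^{-1}[c, +\infty) \subseteq \mu^{-1}[c, +\infty)\right] \ge P\left[\forall s \in \mathcal{S} : \hat{B}_l(s) \le \mu(s) \le \hat{B}_u(s)\right] = 1 - \alpha.$$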

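Proof. Define the following events

$$E := \left\{\hat{B}_l^{-1}[c, +\infty) \subseteq \mu^{-1}[c, +\infty)\right\}$$

and

$$E_{\mathrm{SCI}} := \left\{\forall s \in \mathcal{S} : \hat{B}_l(s) \le \mu(s) \le \hat{B}_u(s)\right\}.$$

We want to show:

$$E_{\mathrm{SCI}} \implies E.$$

On the event $E_{\mathrm{SCI}}$, take any fixed $s' \in \mathcal{S}$ with $\hat{B}_l(s') \ge c$. Then

$$\mu(s') \ge \hat{B}_l(s') \ge c,$$

where the first inequality holds by $E_{\mathrm{SCI}}$. This means that every $s \in \mathcal{S}$ with $\hat{B}_l(s) \ge c$ also satisfies $\mu(s) \ge c$, which is equivalent to the statement $\hat{B}_l^{-1}[c, +\infty) \subseteq \mu^{-1}[c, +\infty)$. Since $E_{\mathrm{SCI}}$ implies $E$, it follows that $P(E) \ge P(E_{\mathrm{SCI}}) = 1 - \alpha$.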
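The implication in the proof is deterministic, so it can be checked numerically. The following Python sketch, with an arbitrary made-up mean function and a made-up fixed-width band (neither taken from the paper), counts how often the band covers $\mu$ everywhere and how often the inner set is contained in the true excursion set at several thresholds at once.

```python
import random


def covers(lower, mu, upper):
    """Event E_SCI: the band contains mu at every domain point."""
    return all(lo <= m <= up for lo, m, up in zip(lower, mu, upper))


def inner_contained(lower, mu, c):
    """Event E: every point with lower band >= c also has mu >= c."""
    return all(m >= c for lo, m in zip(lower, mu) if lo >= c)


rng = random.Random(1)
mu = [0.0, 0.5, 1.0, 1.5, 2.0]          # hypothetical true function on 5 points
thresholds = [0.25, 0.75, 1.25, 1.75]   # several levels c checked at once
n_rep = 2000
n_sci = n_contained = violations = 0

for _ in range(n_rep):
    # Hypothetical noisy band: centre = mu + noise, fixed half-width 0.6.
    centre = [m + rng.gauss(0.0, 0.3) for m in mu]
    lower = [x - 0.6 for x in centre]
    upper = [x + 0.6 for x in centre]
    e_sci = covers(lower, mu, upper)
    e_all = all(inner_contained(lower, mu, c) for c in thresholds)
    n_sci += e_sci
    n_contained += e_all
    if e_sci and not e_all:
        violations += 1  # would contradict the implication E_SCI => E

print(n_sci / n_rep, n_contained / n_rep, violations)
```

The containment frequency can never fall below the band's coverage frequency, which mirrors the inequality in Proposition 1. Since the argument in the proof does not depend on the value of $c$, the same holds when all thresholds are checked jointly.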
