ON THE TESTING OF MULTIPLE HYPOTHESIS IN
SLICED INVERSE REGRESSION
By Zhigen Zhao and Xin Xing
We consider the multiple testing problem in the general regression framework, aiming at studying the relationship between a univariate response and a p-dimensional predictor. To test the hypothesis on the effect of each predictor, we construct an Angular Balanced Statistic (ABS) based on the estimator of the sliced inverse regression, without assuming a model for the conditional distribution of the response. Based on the limiting distribution results developed in this paper, we show that ABS is asymptotically symmetric with respect to zero under the null hypothesis. We then propose a Model-free multiple Testing procedure using Angular balanced statistics (MTA) and show theoretically that the false discovery rate of this method is asymptotically less than or equal to a designated level. Numerical evidence shows that the MTA method is much more powerful than its alternatives, subject to the control of the false discovery rate.
Keywords: model-free, FDR, sufficient dimension reduction
1. Introduction. In the general framework of regression analysis, the goal is to infer the relation between a univariate response variable $y$ and a $p \times 1$ vector $x$. One would like to know $y \mid x$, namely, how the distribution of $y$ depends on the value of $x$. In the literature on sufficient dimension reduction [25, 8, 9, 28, 29], the fundamental idea is to replace the predictor by its projection onto a subspace without loss of information. In other words, we seek a subspace $\mathcal{S}_{y|x}$ of the predictor space such that

(1)  $y \perp\!\!\!\perp x \mid P_{\mathcal{S}_{y|x}} x$.

Here $\perp\!\!\!\perp$ indicates independence, and $P_{(\cdot)}$ stands for a projection operator. The subspace $\mathcal{S}_{y|x}$ is called the central subspace. Let $d$ be the dimension of this central subspace. Let $B$, a $p \times d$ matrix, be a basis of the central subspace $\mathcal{S}_{y|x}$. Then equation (1) is equivalent to

(2)  $y \perp\!\!\!\perp x \mid B^{\tau} x$.
Zhigen Zhao is Associate Professor in the Department of Statistics, Operations, and Data Science, Temple University.
Xin Xing is Assistant Professor in the Department of Statistics, Virginia Tech University.
To further reduce the dimensionality, especially when the number of predictors $p$ diverges with respect to $n$, it is commonly assumed that $y$ depends on $x$ through a subset of $x$, known as the Markov blanket and denoted as $\mathcal{MB}(y, x)$ [33, 36, 5], such that

$y \perp\!\!\!\perp x \mid \mathcal{MB}(y, x)$.

For each predictor, one would like to know whether $x_j \in \mathcal{MB}(y, x)$, which can be formulated as a multiple testing problem. The null hypothesis stating that $x_j \notin \mathcal{MB}(y, x)$ is equivalent to

(3)  $H_j \colon P_{\mathrm{span}(x_j)}(\mathcal{S}_{y|x}) = O_p$,

where $O_p$ is the origin of the $p$-dimensional space [10]. In other words, it is equivalent to saying that the $j$-th row of the matrix $B$ consists of all zeros.
There have been many attempts at estimating the central subspace in the existing literature on sufficient dimension reduction. The most widely used method is the sliced inverse regression (SIR), first introduced in [25]. Later, many extensions of SIR were proposed, including, but not limited to, sliced average variance estimation [14, 8], directional regression [24], and constructive estimation [41]. Nevertheless, most of the existing methods and theories in sufficient dimension reduction focus on the estimation of the central subspace $\mathcal{S}_{y|x}$. Results on statistical inference are very limited when $p$ diverges, not to mention procedures for controlling the false discovery rate (FDR) when testing these hypotheses simultaneously.
The challenge arises from two perspectives. First, in the literature on sufficient dimension reduction, results on the limiting distribution of the estimator of the central subspace are very limited when $p$ diverges. When $p$ is fixed, [20, 46, 30, 23] have derived the asymptotic distribution of the sliced inverse regression. To the best of the authors' knowledge, there are no results on the limiting distribution when $p$ diverges unless one assumes that the signal is strong and the total number of false hypotheses is fixed [40].

Second, after the test statistic is determined for each hypothesis, it is challenging to combine these test statistics to derive a method that controls the false discovery rate. Many existing procedures, such as [2, 3, 42, 35], work on (generalized) linear regression models. In [5], the authors considered an arbitrary joint distribution of $y$ and $x$ and proposed the model-X knockoff to control the FDR. However, this method requires that the distribution of the design matrix be known, which is not feasible in many practical settings.
The study conducted by [21] explored variable selection for the linear regression model under the condition of weak and rare signals. It is noted in that paper that selection consistency is not possible and that allowing for false positives is necessary. While several existing penalization-based methods in sufficient dimension reduction require imposing a uniform signal strength condition to achieve consistent results ([27, 29, 38, 44]), this work tackles a more challenging scenario. Specifically, we develop the central limit theorem of SIR, utilizing recent theories on Gaussian approximation [6, 7], without relying on uniform signal strength conditions. This theoretical result is the first of its kind in the literature on sufficient dimension reduction and is a necessity for simultaneous inference when the effects of some relevant predictors are either moderate or weak.
We proceed by constructing a statistic for each hypothesis based on the sliced inverse regression. Applying Gaussian approximation theory, we demonstrate that this statistic is asymptotically symmetric about zero when the null hypothesis holds. We refer to it as an Angular Balanced Statistic (ABS). We then develop a single-step procedure that rejects a hypothesis when its ABS exceeds a certain threshold. Additionally, we provide an estimator of the false discovery proportion. For a designated FDR level $q$, we adaptively select a threshold such that the estimated false discovery proportion is no greater than $q$. This method is referred to as the Model-free multiple Testing procedure using Angular balanced statistics (MTA). Theoretical analysis confirms that MTA asymptotically controls the FDR at level $q$ under regularity conditions. Simulation results and data analysis demonstrate that MTA significantly outperforms its competitors in terms of power while controlling the FDR.
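To fix ideas before the formal development, the following is a minimal sketch of the adaptive thresholding step, assuming the standard false discovery proportion estimate for sign-symmetric statistics, $\#\{j: W_j \le -t\}/\max(\#\{j: W_j \ge t\}, 1)$. The function name `mta_threshold` and this particular estimator are illustrative assumptions; the exact form used by MTA is developed in Section 3.

```python
import numpy as np

def mta_threshold(W, q=0.1):
    """Adaptive threshold for sign-symmetric statistics (illustrative).

    W : array of per-hypothesis statistics that are (asymptotically)
        symmetric about zero under the null, with large positive
        values indicating evidence against the null.
    q : designated FDR level.

    Returns the selected threshold and the indices of the rejected
    hypotheses.
    """
    candidates = np.sort(np.abs(W[W != 0]))
    for t in candidates:  # smallest t whose estimated FDP is <= q
        fdp_hat = np.sum(W <= -t) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return t, np.where(W >= t)[0]
    return np.inf, np.array([], dtype=int)
```

Because the statistics are asymptotically symmetric about zero under the null, the count of statistics below $-t$ serves as a conservative proxy for the number of null statistics above $t$.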
The paper is organized as follows. In Section 2, we derive the central limit theorem of SIR when both the dimension $p$ and the number of important predictors $s$ diverge with $n$, using the recently developed Gaussian approximation theory. In Section 3, we construct ABS based on the estimator of SIR and propose the MTA method. It is shown that the FDR of the MTA method is asymptotically less than or equal to a designated level. In Sections 4 and 5, we provide numerical evidence, including extensive simulations and a real data analysis, to demonstrate the advantages of MTA. We conclude the paper in Section 6 and include all technical details in the appendix.
Notation. We adopt the following notation throughout this paper. For a matrix $A$, we call the space generated by its column vectors the column space and denote it by $\mathrm{col}(A)$. The element at the $i$-th row and $j$-th column of a matrix $A$ is denoted as $A_{ij}$ or $a_{ij}$. The $i$-th row and $j$-th column of the matrix are denoted by $A_{i\cdot}$ and $A_{\cdot j}$, respectively. The minimum and maximum eigenvalues of $A$ are denoted as $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$, respectively. For two positive numbers $a, b$, we use $a \vee b$ and $a \wedge b$ to denote $\max\{a, b\}$ and $\min\{a, b\}$, respectively. We use $c$, $C$, $C'$, $C_1$, and $C_2$ to denote generic absolute constants, whose actual values may vary from case to case.
2. Gaussian Approximation of SIR. Recall that $y$ is the response and $x$ is a $p$-dimensional vector. In the literature on sufficient dimension reduction, one aims to find the central subspace $\mathcal{S}_{y|x}$ defined in (1). The sliced inverse regression (SIR), introduced in [25], is the first and the most popular method among many existing ones.

Assume that the covariance matrix of $x$ is $\Sigma$. Let $\Omega = \Sigma^{-1}$ be the precision matrix of $x$. Let $s_j = \#\{k : \Omega_{jk} \neq 0\}$ and $s = \max_j s_j$. When the distribution of $x$ is elliptically symmetric, it is shown in [25] that

(4)  $\Sigma\,\mathcal{S}_{y|x} = \mathrm{col}(\Lambda)$,

where $\Lambda = \mathrm{var}(E[x \mid y])$ and $\mathrm{col}(\Lambda)$ is the column space spanned by $\Lambda$.
Given $n$ i.i.d. samples $(y_i, x_i)$, $i = 1, \cdots, n$, we estimate $\Lambda$ by dividing the data into $H$ slices according to the order statistics $y_{(i)}$, $i = 1, \ldots, n$. Let $x_{(i)}$ be the concomitant associated with $y_{(i)}$. Note that slicing the data naturally forms a partition of the support of the response variable, denoted as $\mathcal{H}$. Let $\mathcal{P}_h$ be the $h$-th slice in the partition $\mathcal{H}$. Here we let $\mathcal{P}_1 = (-\infty, y_{(\lceil n/H \rceil)}]$ and $\mathcal{P}_H = (y_{(\lceil n/H \rceil (H-1)+1)}, +\infty)$. Let $\bar{x}$ be the mean of all the $x$'s and $\bar{x}_{h,\cdot}$ be the sample mean of the vectors $x_{(j)}$ whose concomitants $y_{(j)} \in \mathcal{P}_h$, and estimate $\Lambda = \mathrm{var}(E[x \mid y])$ by

(5)  $\hat{\Lambda}_H = \frac{1}{H} \sum_{h=1}^{H} (\bar{x}_{h,\cdot} - \bar{x})(\bar{x}_{h,\cdot} - \bar{x})^{\tau}$.

The estimator $\hat{\Lambda}_H$ was shown to be consistent for $\Lambda$ under some technical conditions [18, 20, 46, 25, 28].
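For concreteness, the following is a minimal sketch of the slice-mean estimator (5); the function name `sir_lambda_hat` and the equal-size slicing of the sorted responses are our own illustrative choices.

```python
import numpy as np

def sir_lambda_hat(x, y, H=10):
    """Slice-mean SIR estimator of Lambda = var(E[x|y]), as in (5).

    x : (n, p) array of predictors; y : (n,) array of responses.
    The data are sorted by y and cut into H slices of (roughly)
    equal size; Lambda_hat is the average outer product of the
    centered slice means.
    """
    n, p = x.shape
    order = np.argsort(y)          # order statistics y_(1) <= ... <= y_(n)
    x_sorted = x[order]            # concomitants x_(i)
    x_bar = x.mean(axis=0)         # overall mean of the x's
    lambda_hat = np.zeros((p, p))
    for block in np.array_split(x_sorted, H):   # the H slices P_1, ..., P_H
        diff = block.mean(axis=0) - x_bar       # centered slice mean
        lambda_hat += np.outer(diff, diff)
    return lambda_hat / H
```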
Alternatively, we could view SIR through a sequence of ordinary least squares regressions. Let $f_h(y)$, $h = 1, 2, \cdots, H$, be a sequence of transformations of $y$. Following the proof of [43, 40], one knows that under the linearity condition [25],

$E[f_h(y)\,\phi(y)] \in \mathcal{S}_{y|x}$,

where $\phi(y) = \Sigma^{-1} E(x \mid y)$. Let $\beta_h \in \mathbb{R}^{p \times 1}$ be defined as

$\beta_h = \operatorname{argmin}_{\beta_h} E\big(f_h(y) - x^{\tau}\beta_h\big)^2.$

Assuming the coverage condition [32, 13, 40], then

$\mathrm{Span}(B) = \mathcal{S}_{y|x}$,

where $B = (\beta_1, \cdots, \beta_H) \in \mathbb{R}^{p \times H}$.

Note that different choices of $f_h(y)$ lead to different methods [40, 17]. To name a few, [43] suggested $f_h(y) = y^h$ with $h \leq H$. After slicing the data into $H$ slices according to the value of the response variable $y$, [13] suggested $f_h(y) = y$ if $y$ is in the $h$-th slice and $0$ otherwise. If we choose $f_h(y) = \mathbf{1}(y \in \mathcal{P}_h)$, this leads to SIR, which is the main focus of this paper [43, 40].
After obtaining data $(x_i, y_i)$ based on a sample of $n$ subjects, let

$\mathbf{f}_h(\mathbf{y}) = \mathbf{1}(\mathbf{y} \in \mathcal{P}_h) = \big(\mathbf{1}(y_1 \in \mathcal{P}_h), \mathbf{1}(y_2 \in \mathcal{P}_h), \cdots, \mathbf{1}(y_n \in \mathcal{P}_h)\big)^{T}.$

Let $\hat{\beta}_h$ be defined as

$\hat{\beta}_h = \operatorname{argmin}_{\beta_h} \|\mathbf{f}_h(\mathbf{y}) - \mathbf{x}^{T}\beta_h\|^2 = (\mathbf{x}\mathbf{x}^{T})^{-1}\mathbf{x}\,\mathbf{f}_h(\mathbf{y}),$

or, in a general form,

(6)  $\hat{\beta}_h = \operatorname{argmin}_{\beta_h} \|\mathbf{f}_h(\mathbf{y}) - \mathbf{x}^{T}\beta_h\|^2 = \frac{1}{n}\,\hat{\Omega}\,\mathbf{x}\,\mathbf{f}_h(\mathbf{y}),$

where $\hat{\Omega}$ is a suitable approximation of the inverse of the Gram matrix $\hat{\Sigma} = \mathbf{x}\mathbf{x}^{T}/n$. Let

(7)  $\hat{B} = (\hat{\beta}_1, \cdots, \hat{\beta}_H).$
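A minimal sketch of the construction in (6) and (7) follows, assuming for illustration that $\hat{\Omega}$ is a ridge-regularized inverse of the Gram matrix; in high dimensions a sparse precision estimator, such as the node-wise Lasso discussed next, would be used instead. The function name `sir_beta_hat` and the ridge parameter are our own choices.

```python
import numpy as np

def sir_beta_hat(x, y, H=10, ridge=1e-6):
    """SIR coefficients via slicewise least squares, as in (6)-(7).

    x : (p, n) centered design matrix; y : (n,) responses.
    Returns B_hat of shape (p, H) whose h-th column is
    (1/n) * Omega_hat @ x @ f_h(y), where f_h(y) is the indicator
    vector of the h-th slice.  Omega_hat is a ridge-regularized
    inverse Gram matrix purely for illustration.
    """
    p, n = x.shape
    # equal-size slices P_1, ..., P_H from the sorted responses
    slices = np.array_split(np.argsort(y), H)
    F = np.zeros((n, H))
    for h, idx in enumerate(slices):
        F[idx, h] = 1.0                    # f_h(y_i) = 1(y_i in P_h)
    gram = x @ x.T / n                     # Sigma_hat
    omega_hat = np.linalg.inv(gram + ridge * np.eye(p))
    return omega_hat @ x @ F / n           # B_hat = (beta_1, ..., beta_H)
```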
There are many methods to estimate the precision matrix. As an example, we could consider the one given by the Lasso for the node-wise regression on the design matrix $\mathbf{x}$ [31]. Next, we derive the central limit theorem of SIR when $p$ diverges to infinity.
Our derivation is built upon the Gaussian approximation (GAR) theory recently developed in [6, 7]. Let $\{\mathcal{P}_h\}_{h=1}^{H}$ be a partition of the sample space of $y$ and let $p_h = P(y \in \mathcal{P}_h)$. Define

(8)  $\tilde{\beta}_h = \frac{1}{n}\,\Omega\,\mathbf{x}\,\mathbf{f}_h(\mathbf{y}).$

Let $\tilde{B} = (\tilde{\beta}_1, \cdots, \tilde{\beta}_H)$. For $i = 1, 2, \cdots, n$, $j = 1, 2, \cdots, p$, and $h = 1, 2, \cdots, H$, let the $z_{ijh}$'s be normal random variables such that

$E z_{ijh} = p_h\, \Omega_{\cdot j}^{T} E(x \mid y \in \mathcal{P}_h);$

$V(z_{ijh}) = p_h\, \Omega_{\cdot j}^{T} E(x x^{T} \mid y \in \mathcal{P}_h)\,\Omega_{\cdot j} - p_h^2 \big(\Omega_{\cdot j}^{T} E(x \mid y \in \mathcal{P}_h)\big)^2;$

$\mathrm{Cov}(z_{ijh}, z_{ikh}) = p_h\, \Omega_{\cdot j}^{T} E(x x^{T} \mid y \in \mathcal{P}_h)\,\Omega_{\cdot k} - p_h^2\, \Omega_{\cdot j}^{T} E(x \mid y \in \mathcal{P}_h)\; \Omega_{\cdot k}^{T} E(x \mid y \in \mathcal{P}_h);$

$\mathrm{Cov}(z_{ijh_1}, z_{ikh_2}) = -p_{h_1} p_{h_2}\, \Omega_{\cdot j}^{T} E(x \mid y \in \mathcal{P}_{h_1})\; \Omega_{\cdot k}^{T} E(x \mid y \in \mathcal{P}_{h_2}), \quad h_1 \neq h_2.$
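As an illustration of these moment formulas, the sketch below computes their plug-in analogs from data, treating $\Omega$ as known and estimating $p_h$, $E(x \mid y \in \mathcal{P}_h)$, and $E(xx^{T} \mid y \in \mathcal{P}_h)$ by slice frequencies and slice moments. The function `z_moments` and the empirical plug-in are our own illustrative choices, not part of the paper's procedure.

```python
import numpy as np

def z_moments(x, y, omega, H=10):
    """Plug-in (empirical) version of the z_{ijh} moments above.

    x : (n, p) predictors; y : (n,) responses; omega : (p, p)
    precision matrix, treated as known purely for illustration.
    Returns mean_z[h, j] and cov_z[h1, j, h2, k]; feasible only for
    small p and H since the covariance is a (pH) x (pH) object.
    """
    n, p = x.shape
    slices = np.array_split(np.argsort(y), H)        # equal-size slices
    ph = np.array([len(idx) / n for idx in slices])  # p_h = P(y in P_h)
    # a[h, j] = Omega_{.j}^T E(x | y in P_h); Q[h] = Omega^T E(xx^T | P_h) Omega
    a = np.stack([omega.T @ x[idx].mean(axis=0) for idx in slices])
    Q = np.stack([omega.T @ (x[idx].T @ x[idx] / len(idx)) @ omega
                  for idx in slices])
    mean_z = ph[:, None] * a
    # Cov = delta_{h1 h2} p_h Q_h[j, k] - p_{h1} p_{h2} a[h1, j] a[h2, k]
    cov_z = -np.einsum('g,h,gj,hk->gjhk', ph, ph, a, a)
    for h in range(H):
        cov_z[h, :, h, :] += ph[h] * Q[h]
    return mean_z, cov_z
```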