Model-free controlled variable selection via data splitting

Yixin Han1, Xu Guo2 & Changliang Zou1

1 School of Statistics and Data Science, LPMC & KLMDASR, Nankai University
2 Department of Mathematical Statistics, Beijing Normal University
Abstract
Addressing the simultaneous identification of contributory variables while controlling the false discovery rate (FDR) in high-dimensional data is a crucial statistical challenge. In this paper, we propose a novel model-free variable selection procedure within the sufficient dimension reduction framework via a data splitting technique. The variable selection problem is first converted to a least squares procedure with several response transformations. We construct a series of statistics with a global symmetry property and leverage the symmetry to derive a data-driven threshold aimed at error rate control. Our approach achieves finite-sample and asymptotic FDR control under mild theoretical conditions. Numerical experiments confirm that our procedure has satisfactory FDR control and higher power compared with existing methods.

Keywords: Data splitting; False discovery rate; Model-free; Sufficient dimension reduction; Symmetry
arXiv:2210.12382v3 [stat.ME] 22 Apr 2024
1 Introduction
Sufficient dimension reduction (SDR) is a powerful technique for extracting relevant information from high-dimensional data (Li, 1991; Cook and Weisberg, 1991; Xia et al., 2002; Li and Wang, 2007). We use $Y$ with support $\Omega_Y$ to denote the univariate response, and let $X = (X_1, \ldots, X_p) \in \mathbb{R}^p$ be the $p$-dimensional vector of all covariates. The basic idea of SDR is to replace the predictor vector with its projection onto a subspace of the predictor space without loss of information on the conditional distribution of $Y$ given $X$. In practice, a large number of features are typically collected in high-dimensional data, but only a small portion of them are truly associated with the response variable. However, while SDR captures important features or patterns in the data, the resulting reduction subspace usually involves all of the original variables, which makes it difficult to interpret. Therefore, in this paper, we aim to develop a model-free variable selection procedure that screens out truly non-contributing variables with certain error rate control, thereby making subsequent model building feasible or simplified and helping to reduce the computational cost caused by high-dimensional data.
Let $F(Y \mid X)$ denote the conditional distribution function of $Y$ given $X$. The index sets of the active and inactive variables are defined respectively as
$$\mathcal{A} = \{j : F(Y \mid X) \text{ functionally depends on } X_j,\ j = 1, \ldots, p\},$$
$$\mathcal{A}^c = \{j : F(Y \mid X) \text{ does not functionally depend on } X_j,\ j = 1, \ldots, p\}.$$
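For instance (an illustrative example, not one taken from this paper), if $Y = g(X_1 + 2X_3) + \varepsilon$ for some unknown link function $g$ and noise $\varepsilon$ independent of $X$, then $F(Y \mid X)$ depends on $X$ only through $X_1$ and $X_3$, so $\mathcal{A} = \{1, 3\}$ and all remaining indices belong to $\mathcal{A}^c$.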
Many prevalent variable selection procedures have been developed under the paradigm of linear models or generalized linear models, such as the LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001), or the adaptive LASSO (Zou, 2006); see the review of Fan and Lv (2010) and the book of Fan et al. (2020) for a fuller list of references. In contrast, model-free variable selection can be achieved by SDR since it does not require complete knowledge of the underlying model, so researchers can avoid the risk of model misspecification.
SDR methods with variable selection aim to find the active set $\mathcal{A}$ such that
$$Y \perp\!\!\!\perp X_{\mathcal{A}^c} \mid X_{\mathcal{A}}, \qquad (1)$$
where "$\perp\!\!\!\perp$" stands for independence, $X_{\mathcal{A}} = \{X_j : j \in \mathcal{A}\}$ denotes the vector containing all active variables, and $X_{\mathcal{A}^c}$ is the complementary set of $X_{\mathcal{A}}$. Condition (1) implies that $X_{\mathcal{A}}$ contains all the relevant information for predicting $Y$. Li et al. (2005) proposed to combine sufficient dimension reduction and variable selection. Chen et al. (2010) proposed a coordinate-independent sparse estimation method that can simultaneously achieve sparse SDR and screen out irrelevant variables efficiently. Wu and Li (2011) focused on model-free variable selection with a diverging number of predictors. A marginal coordinate hypothesis test was proposed by Cook (2004) for model-free variable selection in low-dimensional settings, and was further developed by Shao et al. (2007) and Yu and Dong (2016). Yu et al. (2016a) constructed marginal coordinate tests for sliced inverse regression (SIR), and Yu et al. (2016b) suggested a trace-pursuit-based utility for ultrahigh-dimensional feature selection. See Li et al. (2020) and Zhu (2020) for a comprehensive review.
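To make the SDR background concrete, the following is a minimal sketch of sliced inverse regression (SIR; Li, 1991), one of the methods cited above. The function name, slicing scheme, and use of numpy are our own illustrative choices and are not part of this paper's procedure.

```python
import numpy as np

def sir_directions(X, y, n_slices=10, n_dirs=2):
    """Minimal sliced inverse regression (SIR) sketch.

    Estimates directions spanning the central subspace by eigen-decomposing
    the between-slice covariance of the standardized predictors.
    Illustrative only; not the estimator proposed in this paper.
    """
    n, p = X.shape
    # Standardize the predictors: Z = (X - mean) Sigma^{-1/2}
    Xc = X - X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = Xc @ Sigma_inv_sqrt
    # Slice the observations by the order of the response
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)
    # Weighted covariance of the within-slice means of Z
    M = np.zeros((p, p))
    for idx in slices:
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)
    # Leading eigenvectors of M, mapped back to the original X scale
    w, v = np.linalg.eigh(M)
    return Sigma_inv_sqrt @ v[:, ::-1][:, :n_dirs]
```

Covariates whose loadings are (near) zero in every estimated direction are natural candidates for the inactive set $\mathcal{A}^c$; controlling the error rate of such a selection is the problem addressed below.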
However, those existing approaches do not account for uncertainty quantification of the variable selection, i.e., global error rate control over the selected subset of important covariates in high-dimensional situations. In general high-dimensional nonlinear settings, Candès et al. (2018) developed the Model-X Knockoff framework for controlling the false discovery rate (FDR; Benjamini and Hochberg, 1995), motivated by the pioneering Knockoff filter (Barber and Candès, 2015). Their statistics, constructed via "Knockoff copies", satisfy joint exchangeability exactly or approximately and thus can yield finite-sample FDR control. However, the Model-X Knockoff requires knowledge of the joint distribution of the covariates, which is typically difficult to obtain in high-dimensional settings. Recently, Guo et al. (2024) improved the line of marginal tests (Cook, 2004; Yu and Dong, 2016) by using decorrelated score-type statistics to make inference on a specific predictor that is of interest in advance. They further applied the standard Benjamini and Hochberg (1995) procedure to the resulting p-values to control the FDR, but the intensive computation of the decorrelation process may limit its application to high-dimensional situations.
In a different direction, Du et al. (2023) proposed a data splitting strategy, named symmetrized data aggregation (SDA), which constructs a series of statistics with a global symmetry property and then utilizes the symmetry to derive a data-driven threshold for error rate control. Specifically, Du et al. (2023) aggregated the dependence structure into a linear model with a pseudo response and a fixed covariate, turning the dependence structure into a blessing for power improvement. Similar to the Knockoff method, the SDA is free of p-values and its construction does not rely on stringent assumptions, which motivates us to employ it in sufficient dimension reduction problems.
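To illustrate how symmetry yields a data-driven threshold, the sketch below implements a knockoff/SDA-style rule under our own simplifying assumptions (statistics $W_1, \ldots, W_p$ that are roughly symmetric about zero for null coordinates, and a conservative offset of one); the exact statistics and threshold used in this paper may differ.

```python
import numpy as np

def symmetry_threshold(W, q=0.1, offset=1):
    """Data-driven threshold exploiting symmetry of the null statistics.

    Because inactive coordinates give W_j that are (roughly) symmetric
    about zero, #{j : W_j <= -t} estimates the number of false discoveries
    among {j : W_j >= t}.  This is a knockoff/SDA-style sketch; the paper's
    exact construction may differ.
    """
    W = np.asarray(W, dtype=float)
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = (offset + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return t
    return np.inf  # no feasible threshold: select nothing

# Selection rule: keep covariate j whenever W_j >= symmetry_threshold(W, q).
```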
In this paper, we propose a model-free variable selection procedure that achieves effective FDR control. We first recast the problem of variable selection in sufficient dimension reduction as making inferences on regression coefficients in a set of linear regressions with several response transformations. A variable selection procedure is subsequently developed via error rate control for low-dimensional and high-dimensional settings, respectively. Our main contributions include: (1) The novel data-driven selection procedure can control the FDR while being combined with different existing SDR methods for model-free variable selection by choosing different response transformation functions. (2) Our method does not need to estimate any nuisance parameters such as the structural dimension in SDR. (3) Notably, the proposed procedure is computationally efficient and easy to implement since it only involves a one-time split of the data and the calculation of the product of two dimension reduction matrices obtained from the two splits. (4) Furthermore, the method achieves finite-sample and asymptotic FDR control under mild conditions. (5) Numerical experiments indicate that our procedure exhibits satisfactory FDR control and higher power compared with existing methods.
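As a schematic companion to point (3), the sketch below performs a one-time split and combines covariate-wise scores from the two halves by taking their product. The helper fit_scores is hypothetical, standing in for the split-wise dimension reduction estimates, and the product form is a common mirror-statistic choice rather than this paper's exact definition; it reuses symmetry_threshold from the earlier sketch.

```python
import numpy as np

def split_and_select(X, y, fit_scores, q=0.1, rng=None):
    """One-time data split followed by symmetry-based selection.

    `fit_scores(X, y)` is a hypothetical user-supplied routine returning a
    p-vector of covariate scores (e.g. derived from an estimated dimension
    reduction matrix).  The product of the two split-wise score vectors
    serves as the mirror statistic: large positive W_j suggests an active
    covariate, while null W_j are roughly symmetric about zero.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    idx = rng.permutation(n)
    half = n // 2
    i1, i2 = idx[:half], idx[half:]
    beta1 = np.asarray(fit_scores(X[i1], y[i1]))
    beta2 = np.asarray(fit_scores(X[i2], y[i2]))
    W = beta1 * beta2                   # roughly symmetric under the null
    t = symmetry_threshold(W, q=q)      # threshold from the previous sketch
    return np.where(W >= t)[0]
```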
The rest of this paper is organized as follows. In Section 2, we present the problem and model formulation. In Section 3, we propose a low-dimensional variable selection procedure with error rate control and then discuss its extension to high-dimensional situations. The finite-sample and asymptotic theories for controlling the FDR are developed in Section 4. Simulation studies and a real-data investigation are conducted in Section 5 to demonstrate the superior performance of the proposed method. Section 6 concludes the paper with several further topics. The main theoretical proofs are given in the Appendix; more detailed proofs and additional numerical results are delineated in the Supplementary Material.
Notations. Let $\lambda_{\min}(B)$ and $\lambda_{\max}(B)$ denote the smallest and largest eigenvalues of a square matrix $B = (b_{ij})$. Write $\|B\|_2 = \big(\sum_i \sum_j b_{ij}^2\big)^{1/2}$ and $\|B\|_\infty = \max_i \sum_j |b_{ij}|$. Denote $\mu_1 =$