Model-free controlled variable selection via data splitting

Yixin Han1, Xu Guo2 & Changliang Zou1

1 School of Statistics and Data Science, LPMC & KLMDASR, Nankai University
2 Department of Mathematical Statistics, Beijing Normal University
Abstract
Addressing the simultaneous identification of contributory variables while controlling the false discovery rate (FDR) in high-dimensional data is a crucial statistical challenge. In this paper, we propose a novel model-free variable selection procedure within the sufficient dimension reduction framework via a data splitting technique. The variable selection problem is first converted to a least squares procedure with several response transformations. We construct a series of statistics with a global symmetry property and leverage the symmetry to derive a data-driven threshold aimed at error rate control. Our approach achieves finite-sample and asymptotic FDR control under mild theoretical conditions. Numerical experiments confirm that our procedure has satisfactory FDR control and higher power compared with existing methods.

Keywords: Data splitting; False discovery rate; Model-free; Sufficient dimension reduction; Symmetry
arXiv:2210.12382v3 [stat.ME] 22 Apr 2024
1 Introduction
Sufficient dimension reduction (SDR) is a powerful technique for extracting relevant information from high-dimensional data (Li, 1991; Cook and Weisberg, 1991; Xia et al., 2002; Li and Wang, 2007). We use $Y$ with support $\Omega_Y$ to denote the univariate response, and let $X = (X_1, \ldots, X_p) \in \mathbb{R}^p$ be the $p$-dimensional vector of all covariates. The basic idea of SDR is to replace the predictor vector with its projection onto a subspace of the predictor space without loss of information on the conditional distribution of $Y$ given $X$. In practice, a large number of features are typically collected in high-dimensional data, but only a small portion of them are truly associated with the response variable. However, while SDR captures important features or patterns in the data, the resulting reduction subspace usually involves all of the original variables, which makes it difficult to interpret. Therefore, in this paper, we aim to develop a model-free variable selection procedure that screens out truly non-contributing variables with certain error rate control, thereby making subsequent model building feasible or simplified and helping to reduce the computational cost caused by high-dimensional data.
Let $F(Y \mid X)$ denote the conditional distribution function of $Y$ given $X$. The index sets of the active and inactive variables are defined respectively as
$$\mathcal{A} = \{j : F(Y \mid X) \text{ functionally depends on } X_j,\ j = 1, \ldots, p\},$$
$$\mathcal{A}^c = \{j : F(Y \mid X) \text{ does not functionally depend on } X_j,\ j = 1, \ldots, p\}.$$
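For instance (an illustrative example, not one taken from this paper), if $Y = g(X_1 + 2X_3) + \varepsilon$ for some unknown link function $g$ and noise $\varepsilon$ independent of $X$, then $F(Y \mid X)$ depends on $X$ only through $X_1$ and $X_3$, so $\mathcal{A} = \{1, 3\}$ and all remaining indices belong to $\mathcal{A}^c$.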
Many prevalent variable selection procedures have been developed under the paradigm of linear models or generalized linear models, such as the LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001), or the adaptive LASSO (Zou, 2006); see the review of Fan and Lv (2010) and the book of Fan et al. (2020) for a fuller list of references. In contrast, model-free variable selection can be achieved by SDR since it does not require complete knowledge of the underlying model, so researchers can avoid the risk of model misspecification.
SDR methods with variable selection aim to find the active set $\mathcal{A}$ such that
$$Y \perp\!\!\!\perp X_{\mathcal{A}^c} \mid X_{\mathcal{A}}, \qquad (1)$$
where "$\perp\!\!\!\perp$" stands for independence, $X_{\mathcal{A}} = \{X_j : j \in \mathcal{A}\}$ denotes the vector containing all active variables, and $X_{\mathcal{A}^c}$ is the complementary set of $X_{\mathcal{A}}$. Condition (1) implies that $X_{\mathcal{A}}$ contains all the relevant information for predicting $Y$. Li et al. (2005) proposed to combine sufficient dimension reduction and variable selection. Chen et al. (2010) proposed a coordinate-independent sparse estimation method that can simultaneously achieve sparse SDR and screen out irrelevant variables efficiently. Wu and Li (2011) focused on model-free variable selection with a diverging number of predictors. A marginal coordinate hypothesis test was proposed by Cook (2004) for model-free variable selection in low-dimensional settings, and was further developed by Shao et al. (2007) and Yu and Dong (2016). Yu et al. (2016a) constructed marginal coordinate tests for sliced inverse regression (SIR), and Yu et al. (2016b) suggested a trace-pursuit-based utility for ultrahigh-dimensional feature selection. See Li et al. (2020) and Zhu (2020) for a comprehensive review.
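To make the SDR background concrete, the following is a minimal sketch of sliced inverse regression (SIR; Li, 1991), one of the methods cited above. The function name, slicing scheme, and use of numpy are our own illustrative choices and are not part of this paper's procedure.

```python
import numpy as np

def sir_directions(X, y, n_slices=10, n_dirs=2):
    """Minimal sliced inverse regression (SIR) sketch.

    Estimates directions spanning the central subspace by eigen-decomposing
    the between-slice covariance of the standardized predictors.
    Illustrative only; not the estimator proposed in this paper.
    """
    n, p = X.shape
    # Standardize the predictors: Z = (X - mean) Sigma^{-1/2}
    Xc = X - X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = Xc @ Sigma_inv_sqrt
    # Slice the observations by the order of the response
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)
    # Weighted covariance of the within-slice means of Z
    M = np.zeros((p, p))
    for idx in slices:
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)
    # Leading eigenvectors of M, mapped back to the original X scale
    w, v = np.linalg.eigh(M)
    return Sigma_inv_sqrt @ v[:, ::-1][:, :n_dirs]
```

Covariates whose loadings are (near) zero in every estimated direction are natural candidates for the inactive set $\mathcal{A}^c$; controlling the error rate of such a selection is the problem addressed below.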
However, those existing approaches do not account for uncertainty quantification of the variable selection, i.e., global error rate control over the selected subset of important covariates in high-dimensional situations. In general high-dimensional nonlinear settings, Candès et al. (2018) developed the Model-X Knockoff framework for controlling the false discovery rate (FDR; Benjamini and Hochberg, 1995), motivated by the pioneering Knockoff filter (Barber and Candès, 2015). Their statistics, constructed via "Knockoff copies", satisfy joint exchangeability exactly or approximately and thus can yield finite-sample FDR control. However, the Model-X Knockoff requires knowledge of the joint distribution of the covariates, which is typically difficult to obtain in high-dimensional settings. Recently, Guo et al. (2024) improved the line of marginal tests (Cook, 2004; Yu and Dong, 2016) by using decorrelated score-type statistics to make inference on a specific predictor that is of interest in advance. They further applied the standard Benjamini and Hochberg (1995) procedure to the resulting p-values to control the FDR, but the intensive computation of the decorrelation process may limit its application to high-dimensional situations.
In a different direction, Du et al. (2023) proposed a data splitting strategy, named symmetrized data aggregation (SDA), which constructs a series of statistics with a global symmetry property and then utilizes the symmetry to derive a data-driven threshold for error rate control. Specifically, Du et al. (2023) aggregated the dependence structure into a linear model with a pseudo response and a fixed covariate, turning the dependence structure into a blessing for power improvement. Similar to the Knockoff method, the SDA is free of p-values and its construction does not rely on stringent assumptions, which motivates us to employ it in sufficient dimension reduction problems.
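To illustrate how symmetry yields a data-driven threshold, the sketch below implements a knockoff/SDA-style rule under our own simplifying assumptions (statistics $W_1, \ldots, W_p$ that are roughly symmetric about zero for null coordinates, and a conservative offset of one); the exact statistics and threshold used in this paper may differ.

```python
import numpy as np

def symmetry_threshold(W, q=0.1, offset=1):
    """Data-driven threshold exploiting symmetry of the null statistics.

    Because inactive coordinates give W_j that are (roughly) symmetric
    about zero, #{j : W_j <= -t} estimates the number of false discoveries
    among {j : W_j >= t}.  This is a knockoff/SDA-style sketch; the paper's
    exact construction may differ.
    """
    W = np.asarray(W, dtype=float)
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = (offset + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return t
    return np.inf  # no feasible threshold: select nothing

# Selection rule: keep covariate j whenever W_j >= symmetry_threshold(W, q).
```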
In this paper, we propose a model-free variable selection procedure that achieves effective FDR control. We first recast the problem of variable selection in sufficient dimension reduction as making inferences on regression coefficients in a set of linear regressions with several response transformations. A variable selection procedure is subsequently developed via error rate control for low-dimensional and high-dimensional settings, respectively. Our main contributions include: (1) The novel data-driven selection procedure can control the FDR while being combined with different existing SDR methods for model-free variable selection by choosing different response transformation functions. (2) Our method does not need to estimate any nuisance parameters such as the structural dimension in SDR. (3) Notably, the proposed procedure is computationally efficient and easy to implement since it only involves a one-time split of the data and the calculation of the product of two dimension reduction matrices obtained from the two splits. (4) Furthermore, the method achieves finite-sample and asymptotic FDR control under mild conditions. (5) Numerical experiments indicate that our procedure exhibits satisfactory FDR control and higher power compared with existing methods.
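As a schematic companion to point (3), the sketch below performs a one-time split and combines covariate-wise scores from the two halves by taking their product. The helper fit_scores is hypothetical, standing in for the split-wise dimension reduction estimates, and the product form is a common mirror-statistic choice rather than this paper's exact definition; it reuses symmetry_threshold from the earlier sketch.

```python
import numpy as np

def split_and_select(X, y, fit_scores, q=0.1, rng=None):
    """One-time data split followed by symmetry-based selection.

    `fit_scores(X, y)` is a hypothetical user-supplied routine returning a
    p-vector of covariate scores (e.g. derived from an estimated dimension
    reduction matrix).  The product of the two split-wise score vectors
    serves as the mirror statistic: large positive W_j suggests an active
    covariate, while null W_j are roughly symmetric about zero.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    idx = rng.permutation(n)
    half = n // 2
    i1, i2 = idx[:half], idx[half:]
    beta1 = np.asarray(fit_scores(X[i1], y[i1]))
    beta2 = np.asarray(fit_scores(X[i2], y[i2]))
    W = beta1 * beta2                   # roughly symmetric under the null
    t = symmetry_threshold(W, q=q)      # threshold from the previous sketch
    return np.where(W >= t)[0]
```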
The rest of this paper is organized as follows. In Section 2, we present the problem and model formulation. In Section 3, we propose a low-dimensional variable selection procedure with error rate control and then discuss its extension to high-dimensional situations. The finite-sample and asymptotic theories for controlling the FDR are developed in Section 4. Simulation studies and a real-data investigation are conducted in Section 5 to demonstrate the superior performance of the proposed method. Section 6 concludes the paper with several further topics. The main theoretical proofs are given in the Appendix; more detailed proofs and additional numerical results are delineated in the Supplementary Material.
Notations. Let $\lambda_{\min}(B)$ and $\lambda_{\max}(B)$ denote the smallest and largest eigenvalues of a square matrix $B = (b_{ij})$. Write $\|B\|_2 = \big(\sum_i \sum_j b_{ij}^2\big)^{1/2}$ and $\|B\|_\infty = \max_i \sum_j |b_{ij}|$. Denote $\mu_1 =$