Adaptive greedy forward variable selection for linear regression models
with incomplete data using multiple imputation
Yong-Shiuan Lee
Abstract
Variable selection is crucial for sparse modeling in this age of big data. Missing values are common in data,
and make variable selection more complicated. The approach of multiple imputation (MI) results in multiply im-
puted datasets for missing values, and has been widely applied in various variable selection procedures. However,
directly performing variable selection on the whole MI data or bootstrapped MI data may not be worthwhile in terms
of computational cost. To quickly identify the active variables in the linear regression model, we propose the adaptive
grafting procedure with three pooling rules on MI data. The proposed methods proceed iteratively, starting from the
active variables found on the complete-case subset and then expanding the working data matrix with both the number
of active variables and the available observations. A comprehensive simulation study demonstrates the selection accuracy
in different aspects and the computational efficiency of the proposed methods. Two real-life examples
illustrate the strength of the proposed methods.
Keywords: Fraction of missing information; gradient feature testing; greedy forward selection; missing data;
missing mechanism; multiple imputation; variable selection.
1 Introduction
Variable selection is an important issue for sparse modeling, a rapidly developing area including sparse regression and
classification, sparse signal recovery, along with others. Statistical sparsity can be expressed as a limiting property,
which covers all sparsity models in the signal processing area (McCullagh and Polson 2018). Sparsity corresponds to
performing variable selection in the model from high-dimensional datasets, and provides interpretability of the model. In
traditional regression analysis, variable selection can be done by subset selection, shrinkage methods, and so on (Hastie
et al. 2009). Subset selection includes best-subset selection, forward selection, backward selection, and stepwise selection,
while well-known shrinkage methods include ridge regression (Hoerl and Kennard 1970), least absolute shrinkage and
selection operator (lasso) (Tibshirani 1996) and some variants of lasso (Hastie, Tibshirani, and Wainwright 2015).
Some methods such as the best-subset selection and lasso have the sparse property, that is, they have sparse solutions
to the corresponding problems.
Greedy algorithms are widely used approaches in sparse modeling. The greedy algorithm always makes a locally
optimal choice and never revisits former decisions (Cormen et al. 2009). Forward selection is an example of a
greedy algorithm. Perkins et al. (2003) propose a greedy forward selection approach, called the grafting (gradient
feature testing) algorithm, for a variety of predictor models. Designed to be fast for high dimensional problems, the
grafting operates in an incremental iterative fashion and selects one variable at a time into the active set. Therefore,
it diminishes the computations greatly. Zhang (2009) proposes the greedy least squares regression algorithm and
establishes the consistency of its greedy feature selection under the irrepresentable condition of Zhao and Yu
(2006), the same condition required for the lasso.
Missing values are common in data, and make statistical analysis complicated. As a matter of fact, statistical methods
of dealing with missing data have a long history (Little and Rubin 2002). These methods can be divided into three
categories, i.e. deletion, single imputation, and multiple imputation. The deletion method includes listwise deletion
(also known as complete case analysis) and pairwise deletion. However, deletion may lead to loss of precision and
bias if the missing mechanism depends on the data values, especially the missing values themselves. In extreme cases,
deletion may even render the analysis infeasible when too few complete cases are left. Single imputation, such as
mean/median/mode imputation, nearest neighbor imputation, and regression imputation, is prevalent due to its simplicity
and applicability. However, single imputation tends to underestimate the standard errors. Multiple imputation (MI) is a
more flexible method proposed by Rubin (1978). MI allows and quantifies variation across multiply imputed datasets,
which reflects the uncertainty of the imputed values. When its assumptions hold, MI provides accurate
estimates of the parameters of interest and the related standard errors, as well as improved statistical power (Enders
2010).
Three variable selection strategies for MI are distinguished by van Buuren (2018). The first is Majority, which
selects variables into the final model if they appear in at least half of the models from MI data. The second, Stack, is
to apply a weighted regression method on a single dataset stacked by all multiply imputed datasets to select variables.
The third, called Wald, is stepwise selection performed with the Wald statistic based on Rubin’s rule (Rubin 1987).
Graduate Institute of Applied Physics, National Chengchi University, Taipei 11605, Taiwan. E-mail: dianalee@nccu.edu.tw
Brand (1999) first proposes a procedure for variable selection on missing data, which selects a set of candidate variables
on each of the MI datasets and determines the final model by selecting the variables with inclusion frequency larger than
a pre-specified threshold. Wood et al. (2008) and Vergouwe et al. (2010) investigate the variable selection strategies
with missing predictor values.
Apart from performing variable selection on MI data, other approaches for variable selection on incomplete data
are available. Yang et al. (2005) propose two Bayesian variable selection methods, “impute, then select” (ITS) and
“simultaneously impute and select” (SIAS), to impute and select variables for linear regression models. SIAS, which
embeds imputation and the stochastic search variable selection (SSVS) method of George and McCulloch (1993) in a
Gibbs sampling process, slightly outperforms ITS in providing smaller Monte Carlo standard errors. Their simulation
studies show that both ITS and SIAS outperform the stepwise selection using only complete cases. Zhao and Long
(2017) review approaches of variable selection on missing data.
Among the aforementioned approaches, MI is prevalent thanks to several user-friendly software packages,
and all the existing variable selection methods can be readily applied to each multiply imputed dataset. However, if
a relatively high proportion of the variables are not active for the response variable, imputing all the missing entries
may not be necessary for variable selection. Our goal in this study is to identify active variables more efficiently for
data with missing values. To achieve this goal, we extend the greedy forward selection, namely the grafting for least
squares, to MI data. We adapt the greedy least squares regression algorithm
(Zhang 2009) for variable selection in linear regression models with missing data. The most important characteristic
of MI is that it takes into account the uncertainty of the estimates due to the variation among imputed datasets. This
accordingly yields diverse selection results when variable selection is performed separately from multiple imputation. Hence, we propose three
pooling rules for the adaptive grafting in the subsequent study.
The proposed procedure starts with the complete cases, and the initial set of active variables is obtained from the
MI data, which only includes variables selected by lasso regression in the imputation model. Our proposed algorithm
quickly expands the working data in two ways. The adaptive grafting adds one active variable with corresponding
available observations into the data matrix. Moreover, the MI data expand by including one active variable from
the adaptive grafting into the imputation model, which results in updated imputed values. This approach is more
efficient than existing variable selection methods concerned with MI, since the adaptive grafting quickly identifies the
active features and helps conduct MI on merely a subset of variables instead of the whole dataset.
The proposed algorithm initializes from, but does not confine the variable selection to, the complete cases, since
listwise deletion may cause bias and hence incorrectly identify the set of active variables. Applying MI on incomplete
data with valid assumptions should improve the performance of any variable selection method. However, conducting
MI on data with noise variables is computationally intensive. Our proposed procedure incrementally selects variables
into the active set, and expands the MI data accordingly. Therefore, the procedure benefits from MI in a more
computationally efficient way.
A further challenge that emerges from the various feature selection approaches produced by MI is their
assessment, particularly in real data analysis. While some strategies take the missing proportion into account when
applying existing approaches for variable selection with missing data, Madley-Dowd et al. (2019) indicate that adding
auxiliary variables to the imputation model does not always ensure efficiency gains from MI. We then employ the fraction of
missing information (fmi) to evaluate the available methods and the proposed procedure. The fmi is a useful concept in missing
data analysis to quantify the uncertainty about imputed values accounting for the amount of information retained by
other variables within a data set (e.g. Savalei and Rhemtulla 2012; Madley-Dowd et al. 2019). This quantity depends
on the type of missing data mechanism, the model parameterization, and the degree of interrelationship among the
variables. We refit the data to assess the information retained by the selected variables in the presence of incomplete
data, in order to compare the efficiency gains for the estimates obtained by the examined methods and the proposed
procedure.
The remainder of this paper is organized as follows. Section 2 briefly reviews the related work on variable selection
and multiple imputation, together with some variable selection approaches using MI. Section 3 presents the proposed
methods in detail, which extend the gradient feature testing to multiply imputed datasets together with their pooling strategies.
Section 4 conducts the simulation study to evaluate the proposed methods and compares them with some common approaches
in the literature. The evaluation includes criteria reflecting the selection performance, parameter estimation, and
prediction. The simulation study shows that the proposed methods attain preferable accuracy of variable selection
within a significantly shorter execution time compared with some common methods. Section 5 illustrates the proposed
methods with two real-world datasets. We assess the information retained by all the examined methods by refitting
the corresponding dataset with the selected active variables. Section 6 concludes the paper with some discussion.
2 Variable selection
We consider the linear regression model
\[
y = X\beta + \varepsilon, \tag{1}
\]
where $y$ is an $n \times 1$ vector of the response variable, $X$ is an $n \times p$ matrix of explanatory variables, and $\beta$ is the parameter
vector. We denote the data matrix as $Z = (y, X)$. Assuming $\varepsilon \sim N(0, \sigma^2 I_n)$ for the error term, the log-likelihood
function for model (1) is
\[
l(\beta, \sigma^2 \mid Z) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}(y - X\beta)^T(y - X\beta). \tag{2}
\]
For data containing redundant predictors in (1), both model selection and parameter estimation can be laborious.
We briefly review some related methods of variable selection for model (1) in this section.
2.1 Lasso
Tibshirani (1996) proposes the lasso, or $\ell_1$-regularized regression, for estimation in (1) to produce interpretable models.
The lasso estimator is the solution to the following optimization problem for model (1):
\[
\hat{\beta}^{(\mathrm{lasso})} = \operatorname*{argmin}_{\beta} \left( \|y - X\beta\|_2^2 + \lambda \sum_{v=1}^{p} |\beta_v| \right), \tag{3}
\]
where $\lambda \ge 0$ is the regularization parameter, which controls the amount of shrinkage. The tuning parameter $\lambda$ is
chosen by cross validation (cv). The 5-fold or 10-fold cv is recommended to choose the optimal $\lambda$ (Breiman and
Spector, 1992; Tibshirani, 1996).
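As an illustration (not part of the paper), the following sketch fits the lasso in (3) with the penalty chosen by 10-fold cross validation, using scikit-learn's `LassoCV` on simulated data; the variable names and settings are for demonstration only, and scikit-learn calls the regularization parameter `alpha` rather than $\lambda$.

```python
# A minimal sketch: lasso with the penalty tuned by 10-fold cross validation.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]              # only three active variables
y = X @ beta + rng.normal(scale=1.0, size=n)

# 10-fold cv over a grid of regularization parameters
fit = LassoCV(cv=10, random_state=0).fit(X, y)
active = np.flatnonzero(fit.coef_ != 0)  # indices with nonzero coefficients
print("selected variables:", active)
print("chosen regularization parameter:", fit.alpha_)
```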
The lasso is a modification, with an absolute value constraint on the regression coefficients, of the approach of
Breiman (1995), in which the least squares estimates are scaled by nonnegative constants. Further generalizations of the lasso include the adaptive
lasso (Zou 2006), group lasso (Yuan and Lin 2006), elastic net (Zou and Hastie 2005), etc.
2.2 The gradient feature testing
Fitting model (1) with a large number of redundant variables needs intensive computations. The greedy forward
selection reduces the computational cost and is suggested by Efron and Hastie (2016). Perkins et al. (2003) propose
grafting for feature selection in a regularized learning framework. Grafting is applicable to a variety of models.
Although only the application to classification problems is considered in Perkins et al. (2003), grafting is easily
extended to regression models.
The grafting quickly obtains the optimal or approximately optimal solution, which minimizes the general criterion, a
combination of the empirical risk and a regularization term,
\[
C(\beta) = \frac{1}{n}\sum_{i=1}^{n} L(f(x_i), y_i) + \Omega_2 + \Omega_1 + \Omega_0,
\]
where $x_i$ is the $i$th observation, $f$ is the predictor function, $L$ is a loss function, and $\Omega_q = \lambda_q \sum_{v=1}^{p} \alpha_q |\beta_v|^q$ is the $\ell_q$
regularizer. The best performance of the grafting in the experiments on classification problems in Perkins
et al. (2003) is achieved by using both an $\ell_1$ and an $\ell_0$ regularizer. The choice among various loss functions and the
modified $\ell_0$ regularizer for more complex models can be found in Perkins et al. (2003).
The extension of grafting to regression problems is straightforward by using a linear predictor model of the form
$f(x) = \beta_0 + \sum_{v=1}^{p} \beta_v x_v$ and the squared error loss function. For the regression model (1), the grafting minimizes
\[
C_{\mathrm{reg}}(\beta) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - f(x_i)\right)^2 + \lambda_2 \sum_{v=1}^{p} \alpha_2 |\beta_v|^2 + \lambda_1 \sum_{v=1}^{p} \alpha_1 |\beta_v| + \lambda_0 \sum_{v=1}^{p} \alpha_0 \beta_v^0, \tag{4}
\]
where the $\ell_0$ regularizer corresponds to a constraint on the maximum number of features in $f$ with a fixed penalty $\alpha_0$
for all parameters, provided $0^0$ is defined as 0 as in Perkins et al. (2003).
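To make the criterion concrete, here is a minimal sketch (an illustration under stated assumptions, not the authors' code) that evaluates a penalized criterion of the form (4); it takes $0^0 = 0$ so the $\ell_0$ term simply counts nonzero coefficients, and the weights `lam*`/`a*` are placeholder values.

```python
# A minimal sketch of the penalized criterion in (4) on complete data.
import numpy as np

def grafting_criterion(beta, X, y, lam2=0.0, lam1=0.1, lam0=0.0,
                       a2=1.0, a1=1.0, a0=1.0):
    """Empirical squared-error risk plus l2, l1, and l0 regularizers."""
    risk = np.mean((y - X @ beta) ** 2)
    pen2 = lam2 * a2 * np.sum(beta ** 2)
    pen1 = lam1 * a1 * np.sum(np.abs(beta))
    pen0 = lam0 * a0 * np.count_nonzero(beta)   # 0**0 defined as 0
    return risk + pen2 + pen1 + pen0
```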
The grafting is a stagewise optimization procedure. At each stage, a gradient-based heuristic, which measures
the magnitude of $\partial C_{\mathrm{reg}}(\beta)/\partial\beta_v$, is employed to decide which variable is most likely to improve the current model,
and the model is incrementally optimized using gradient descent. To evaluate the effect of a variable on reducing the
optimization criterion (4), the grafting can be simplified as the greedy least squares regression algorithm described in
Zhang (2009). At each stage, measuring the magnitude of $\partial C_{\mathrm{reg}}(\beta)/\partial\beta_v$ is equivalent to calculating the gradient of
the log-likelihood function (2) with respect to $\beta_v$ for a variable not yet selected into the current model.
Suppose that the index set of variables, denoted as $V_r = \{v(1), \cdots, v(r)\}$, includes the $r$ variables already selected into
the current model. We evaluate the remaining variables in the set $V_r^C$, where $C$ indicates the complementary set, and
select the one with the maximum magnitude of gradient into $V_r$, updating it to $V_{r+1}$. That is, the newly chosen variable
will be the one with the largest gradient of the log-likelihood (2) based on the data matrix $Z_{V_r}$ corresponding to the
$r$ variables in the current model, denoted as
\[
x_{v(r+1)} = \operatorname*{argmax}_{v' \in V_r^C} \left| \frac{\partial\, l(\beta, \sigma^2 \mid Z_{V_r})}{\partial \beta_{v'}} \right|
= \operatorname*{argmax}_{v' \in V_r^C} \left| \sum_{i=1}^{n} x_{i,v'} \left( y_i - \beta_0 - \sum_{j=1}^{r} \beta_{v(j)} x_{i,v(j)} \right) \right|. \tag{5}
\]
The grafting procedure (5) only involves the inner products of the variables in $V_r^C$ with the residuals of the current
model, and hence reduces the computational complexity. Moreover, since the approach adds one variable into the current
model sequentially, it typically produces a model which consists of a small number of variables.
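The following is a minimal sketch of the greedy step in (5) on complete data: at each stage the inner products between the unselected variables and the current residuals are computed, the variable with the largest magnitude is added, and the model is refit by least squares. This illustrates the greedy least squares idea of Zhang (2009), not the proposed adaptive grafting for MI data; all names are illustrative.

```python
# A minimal sketch of greedy forward selection via gradients (inner products
# with the current residuals), refitting by least squares after each addition.
import numpy as np

def greedy_forward_selection(X, y, max_vars):
    n, p = X.shape
    active = []                      # V_r: indices of selected variables
    residual = y - y.mean()          # start from the intercept-only model
    for _ in range(max_vars):
        remaining = [v for v in range(p) if v not in active]
        # magnitude of the log-likelihood gradient for each unselected variable
        scores = [abs(X[:, v] @ residual) for v in remaining]
        active.append(remaining[int(np.argmax(scores))])
        # refit the current model by least squares and update the residuals
        Xa = np.column_stack([np.ones(n), X[:, active]])
        coef, *_ = np.linalg.lstsq(Xa, y, rcond=None)
        residual = y - Xa @ coef
    return active
```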
The grafting is guaranteed to find the global optimum in some cases with a convex loss function and only the $\ell_2$
and/or $\ell_1$ regularizers (Perkins et al. 2003). Furthermore, Zhang (2009) shows that the greedy least squares regression
algorithm identifies the correct set of variables with a suitable choice of stopping criterion, which is equivalent to selecting
an appropriate regularization parameter $\lambda$ in the lasso regression in (3), i.e. the grafting using only the $\ell_1$ regularizer.
Nevertheless, the approach only applies to data without missing values. The optimality of grafting on multiply
imputed data is more complicated and beyond the scope of this study; it can be further investigated in the future.
2.3 Variable selection with missing values
There are three main approaches for variable selection in the presence of missing data (Zhao and Long 2017). The first
approach comprises likelihood-based methods. The joint distribution of the observed and missing data is specified, and by
integrating out the missing data the marginal distribution of the observed data is obtained; variable selection is then
performed with appropriate accommodations. Claeskens and Consentino (2008) propose some variants of the Akaike
information criterion (AIC) to perform variable selection with missing covariates based on the expectation-maximization
(EM) algorithm. Ibrahim et al. (2008) also propose a class of model selection criteria for missing-data problems using
the EM algorithm. Mitra and Dunson (2010) propose a two-level stochastic search variable selection (SSVS) algorithm,
which addresses a broader class of models involving mixed categorical and continuous variables. Liang et al. (2018)
propose two algorithms, the imputation–regularized optimization (IRO) algorithm and the imputation–conditional
regularized optimization (ICRO) algorithm, which incorporate imputation and regularization for high dimensional
variable selection with missing values. Zambom and Matthews (2019) propose two methods, maximum likelihood sure
independence screening (ML-SIS) and two-stage screening (TS-SIS), for ultrahigh dimensional problems with the sure
screening property.
The second approach is through inverse probability weighting (IPW) and its extensions. Johnson et al. (2008)
propose a general strategy based on penalized weighted estimating equations for variable selection. Wolfson (2011)
develops the EEBoost algorithm for variable selection based on estimating equations (EEs). Brown et al. (2017)
extend EEBoost and propose ThrEEBoost (Thresholded EEBoost) algorithm. However, this approach is difficult to
extend to general missing data patterns that are not monotone (Long and Johnson 2015; Liu et al. 2016).
The third approach is performing variable selection on multiply imputed datasets. Brand (1999) selects the
variables into the final model if the inclusion frequency is higher than the pre-specified threshold. Chen and Wang
(2013) propose to ensure the consistency of selection across multiply imputed data by using group lasso regularization.
Long and Johnson (2015) generate bootstrapped samples to conduct imputation, and determine the final selection
through stability selection. Similar to the method proposed in Long and Johnson (2015), Liu et al. (2016) propose
multiple imputation random lasso, which also involves bootstrapping and stability selection, to perform variable
selection depending on the importance measure of each variable. Thao and Geskus (2019) categorize variable
selection using MI into three classes; further discussion is presented in Sect. 3.3.
An alternative to MI is to impute missing values through matrix completion, for example the structured matrix completion
(SMC) of Cai et al. (2016) and the softImpute-ALS procedure of Hastie et al. (2015). Gao and Lee (2017)
apply softImpute-ALS and the adaptive lasso to missing data, and the final model is selected by the lowest
value of the modified Rank Selection Criterion (RSC). Nevertheless, Xue and Qu (2020) argue that the approach of
Gao and Lee (2017) does not guarantee estimation consistency.
3 Missing values in data
Missing values complicate the data analysis. It is important to understand the reasons, patterns, and mechanisms
of missing values when dealing with their analysis. The methods for missing values are developed depending on the
missing mechanisms (Little and Rubin 2002; Rubin 1978). The missing-data mechanisms can be categorized into three
types. The first one is missing completely at random (MCAR), which assumes the missingness does not depend on
the data values. The second is missing at random (MAR), where the missingness depends only on the observed
values in the data. The third is missing not at random (MNAR), where there is a relationship
between the propensity of missingness and the unobserved data.
3.1 Multiple imputation
Among the techniques that handle missing values, multiple imputation is more flexible, as it allows imputation variation,
and is widely applied owing to advances in computational capability (van Buuren 2018). Rubin (1978) proposes
this Bayesian approach to relate the observed-data posterior distribution to the complete data posterior distribution
which would have been obtained if the missing data had been observed under the assumption of MCAR or MAR. MI
allows variation of imputed values and provides a quantification of the uncertainty.
Considering the regression model (1), if the response $y$ is completely observed and there are missing values in the
predictors, then an $n \times p$ missing-data indicator matrix, $\Delta = (\delta_{iv})$, is defined as
\[
\delta_{iv} =
\begin{cases}
1, & \text{if } x_{iv} \text{ is missing},\\
0, & \text{if } x_{iv} \text{ is observed},
\end{cases}
\]
where $x_{iv}$ is the $(i, v)$th element of $X$. We also denote the imputed data matrix as $\tilde{Z} = (y, \tilde{X})$ such that
\[
\tilde{x}_{iv} =
\begin{cases}
x^{(\mathrm{imp})}_{iv}, & \text{if } \delta_{iv} = 1,\\
x_{iv}, & \text{if } \delta_{iv} = 0,
\end{cases}
\]
where $x^{(\mathrm{imp})}_{iv}$ is a filled-in value produced by some imputation approach when $x_{iv}$ is missing.
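For illustration only (not from the paper), the following sketch constructs the indicator matrix and a singly filled-in matrix, using column means as the imputed values; the data are arbitrary.

```python
# A minimal sketch: missing-data indicator matrix and a singly imputed matrix.
import numpy as np

X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0]])

delta = np.isnan(X).astype(int)       # delta_iv = 1 where x_iv is missing
col_means = np.nanmean(X, axis=0)     # one simple choice of imputed values
X_tilde = np.where(delta == 1, col_means, X)
```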
MI involves a three-stage process (Rubin 1978). First, $M$ imputed datasets are created, in each of which the
missing entries may have different imputed values, while all other non-missing entries are identical. The differences
among the imputed values reflect the uncertainty of the imputation. The second step is to apply the same statistical
analysis to each imputed dataset and obtain $M$ estimates of interest. Finally, one pools the $M$ estimates, together
with the variance of the pooled estimate, by Rubin's rule (Rubin 1987).
Once the $M$ imputed datasets have been created, we use model (1) to illustrate Rubin's rule. MI obtains the OLS estimates
\[
\hat{\beta}^{(m)} = \left(\tilde{X}^{(m)T} \tilde{X}^{(m)}\right)^{-1} \tilde{X}^{(m)T} y
\]
and
\[
\hat{\sigma}^2_{(m)} = \frac{1}{n - p - 1}\left(y - \tilde{X}^{(m)}\hat{\beta}^{(m)}\right)^T\left(y - \tilde{X}^{(m)}\hat{\beta}^{(m)}\right)
\]
based on each imputed dataset $\tilde{Z}^{(m)} = (y, \tilde{X}^{(m)})$, where $m = 1, \cdots, M$. The pooled estimate for $\beta$ is the mean
of the estimates across the $M$ datasets,
\[
\bar{\beta} = \frac{1}{M}\sum_{m=1}^{M}\hat{\beta}^{(m)}. \tag{6}
\]
The variance of $\bar{\beta}$ is
\[
Var(\bar{\beta}) = T = U + \frac{M+1}{M}B, \tag{7}
\]
where
\[
U = \frac{1}{M}\sum_{m=1}^{M}\hat{\sigma}^2_{(m)}\left(\tilde{X}^{(m)T}\tilde{X}^{(m)}\right)^{-1} \tag{8}
\]
is the within-imputation variance and
\[
B = \frac{1}{M-1}\sum_{m=1}^{M}\left(\hat{\beta}^{(m)} - \bar{\beta}\right)\left(\hat{\beta}^{(m)} - \bar{\beta}\right)^T \tag{9}
\]
is the between-imputation variance.
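A minimal sketch of Rubin's rule in (6)-(9) is given below, assuming the $M$ per-imputation coefficient estimates and their covariance matrices have already been computed; the function and variable names are illustrative, not from the paper.

```python
# A minimal sketch of Rubin's rule: pool M per-imputation estimates into a
# single estimate and a total variance T = U + (M + 1) / M * B.
import numpy as np

def rubin_pool(betas, covs):
    """betas: list of M coefficient vectors; covs: list of M covariance matrices."""
    M = len(betas)
    betas = np.asarray(betas)
    beta_bar = betas.mean(axis=0)            # pooled estimate (6)
    U = np.mean(covs, axis=0)                # within-imputation variance (8)
    diffs = betas - beta_bar
    B = diffs.T @ diffs / (M - 1)            # between-imputation variance (9)
    T = U + (M + 1) / M * B                  # total variance (7)
    return beta_bar, T
```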
Formulating a set of conditional distributions relating each variable to a set of the other variables instead of joint
modelling is more practical for real multivariate data sets. It is easier to take one conditional distribution at a time,
obtaining imputed values for a variable by regressing it on the other variables, and then to impute another variable
conditioned on the observed and currently imputed values, and so on. Cycling this procedure through every variable with missing values
and repeating it for a number of iterations produces the imputed values. van Buuren and Groothuis-Oudshoorn (2011)
employ this concept to develop an R package mice (multivariate imputation by chained equations).
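As a rough Python analogue of the chained-equations idea (mice itself is an R package), the sketch below uses scikit-learn's experimental IterativeImputer with posterior sampling to generate several imputed datasets; the data, the number of imputations, and the settings are illustrative, and the imputation models differ from those in mice.

```python
# A minimal sketch of chained-equations imputation producing M imputed datasets.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0],
              [np.nan, 5.0, 3.0]])

imputations = []
for m in range(5):  # M = 5 imputed datasets
    # sample_posterior=True draws imputed values rather than using point predictions
    imputer = IterativeImputer(sample_posterior=True, random_state=m, max_iter=10)
    imputations.append(imputer.fit_transform(X))
```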
3.2 Fraction of missing information
The fraction of missing information (fmi) quantifies the uncertainty about imputed values accounting for the amount
of information retained by other variables within a data set (Wagner 2010; White et al. 2011; Madley-Dowd et al.
2019). The fmi measures the proportion of a parameter’s sampling error that is due to missing data, thus it reveals