Adaptive greedy forward variable selection for linear regression models
with incomplete data using multiple imputation
Yong-Shiuan Lee
Abstract
Variable selection is crucial for sparse modeling in this age of big data. Missing values are common in data,
and make variable selection more complicated. The approach of multiple imputation (MI) results in multiply im-
puted datasets for missing values, and has been widely applied in various variable selection procedures. However,
directly performing variable selection on the whole MI data or bootstrapped MI data may not be worthwhile in terms
of computational cost. To quickly identify the active variables in the linear regression model, we propose the adaptive
grafting procedure with three pooling rules on MI data. The proposed methods proceed iteratively, starting from the
active variables found on the complete-case subset and then expanding the working data matrix with both the number
of active variables and the available observations. A comprehensive simulation study demonstrates the selection accuracy
in different aspects and the computational efficiency of the proposed methods. Two real-life examples
illustrate the strength of the proposed methods.
Keywords: Fraction of missing information; gradient feature testing; greedy forward selection; missing data;
missing mechanism; multiple imputation; variable selection.
1 Introduction
Variable selection is an important issue for sparse modeling, a rapidly developing area including sparse regression and
classification, sparse signal recovery, along with others. Statistical sparsity can be expressed as a limiting property,
which covers all sparsity models in the signal processing area (McCullagh and Polson 2018). Sparsity corresponds to
performing variable selection in the model from high-dimensional datasets, and provides interpretability of the model. In
traditional regression analysis, variable selection can be done by subset selection, shrinkage methods, and so on (Hastie
et al. 2009). Subset selection includes best-subset selection, forward selection, backward selection, and stepwise selection,
while well-known shrinkage methods include ridge regression (Hoerl and Kennard 1970), least absolute shrinkage and
selection operator (lasso) (Tibshirani 1996) and some variants of lasso (Hastie, Tibshirani, and Wainwright 2015).
Some methods such as the best-subset selection and lasso have the sparse property, that is, they have sparse solutions
to the corresponding problems.
Greedy algorithms are widely used approaches in sparse modeling. The greedy algorithm always makes a locally
optimal choice and never revisits former decisions (Cormen et al. 2009). Forward selection is an example of a
greedy algorithm. Perkins et al. (2003) propose a greedy forward selection approach, called the grafting (gradient
feature testing) algorithm, for a variety of predictor models. Designed to be fast for high dimensional problems, the
grafting operates in an incremental iterative fashion and selects one variable at a time into the active set. Therefore,
it diminishes the computations greatly. Zhang (2009) proposes the greedy least squares regression algorithm and
establishes the consistency of its greedy feature selection under the irrepresentable condition of Zhao and Yu
(2006), the same condition required for the lasso.
Missing values are common in data, and make statistical analysis complicated. As a matter of fact, statistical methods
of dealing with missing data have a long history (Little and Rubin 2002). These methods can be divided into three
categories, i.e. deletion, single imputation, and multiple imputation. The deletion method includes listwise deletion
(also known as complete case analysis) and pairwise deletion. However, deletion may lead to loss of precision and
bias if the missing mechanism depends on the data values, especially the missing values themselves. In extreme cases,
deletion may even render the analysis infeasible when too few complete cases are left. Single imputation, such as
mean/median/mode imputation, nearest neighbor imputation, and regression imputation, is prevalent due to its simplicity
and applicability. However, single imputation tends to underestimate the standard errors. Multiple imputation (MI) is a
more flexible method proposed by Rubin (1978). MI allows and quantifies variation across multiply imputed datasets,
which reflects the uncertainty of the imputed values. When its assumptions hold, MI provides accurate
estimates of the parameters of interest and the related standard errors, as well as improved statistical power (Enders
2010).
Three variable selection strategies for MI are distinguished by van Buuren (2018). The first is Majority, which
selects variables into the final model if they appear in at least half of the models from MI data. The second, Stack, is
to apply a weighted regression method on a single dataset stacked by all multiply imputed datasets to select variables.
The third, called Wald, is stepwise selection performed with the Wald statistic based on Rubin’s rule (Rubin 1987).
Graduate Institute of Applied Physics, National Chengchi University, Taipei 11605, Taiwan. E-mail: dianalee@nccu.edu.tw
Brand (1999) first proposes a procedure for variable selection on missing data, which selects a set of candidate variables
on each of the MI datasets and determines the final model by selecting the variables with inclusion frequency larger than
a pre-specified threshold. Wood et al. (2008) and Vergouwe et al. (2010) investigate the variable selection strategies
with missing predictor values.
Apart from performing variable selection on MI data, other approaches for variable selection on incomplete data
are available. Yang et al. (2005) propose two Bayesian variable selection methods, “impute, then select” (ITS) and
“simultaneously impute and select” (SIAS), to impute and select variables for linear regression models. SIAS, which
embeds imputation and the stochastic search variable selection (SSVS) method of George and McCulloch (1993) in a
Gibbs sampling process, slightly outperforms ITS in providing smaller Monte Carlo standard errors. Their simulation
studies show that both ITS and SIAS outperform the stepwise selection using only complete cases. Zhao and Long
(2017) review approaches of variable selection on missing data.
Among the aforementioned approaches, MI is prevalent thanks to several user-friendly software packages,
and all the existing variable selection methods can be readily applied to each multiply imputed dataset. However, if
a relatively high proportion of the variables are not active for the response variable, imputing all the missing entries
may not be necessary for variable selection. Our goal in this study is to identify active variables more efficiently for
data with missing values. To achieve this goal, we extend the greedy forward selection, namely the grafting for least
squares, to MI data. We adapt the greedy least squares regression algorithm
(Zhang 2009) for variable selection in linear regression models with missing data. The most important characteristic
of MI is that it takes into account the uncertainty of the estimates due to the variation among imputed datasets. This
accordingly yields diverse selection results when variable selection is performed separately from multiple imputation. Hence, we propose three
pooling rules for the adaptive grafting in the subsequent study.
The proposed procedure starts with the complete cases, and the initial set of active variables is obtained from the
MI data, which only includes variables selected by lasso regression in the imputation model. Our proposed algorithm
quickly expands the working data in two ways. The adaptive grafting adds one active variable with corresponding
available observations into the data matrix. Moreover, the MI data expand by including one active variable from
the adaptive grafting into the imputation model, which results in updated imputed values. This approach is more
efficient than existing variable selection methods concerned with MI, since the adaptive grafting quickly identifies the
active features and helps conduct MI on merely a subset of variables instead of the whole dataset.
The proposed algorithm initializes from, but does not confine the variable selection to, the complete cases, since
listwise deletion may cause bias and hence incorrectly identify the set of active variables. Applying MI on incomplete
data with valid assumptions should improve the performance of any variable selection method. However, conducting
MI on data with noise variables is computationally intensive. Our proposed procedure incrementally selects variables
into the active set, and expands the MI data accordingly. Therefore, the procedure benefits from MI in a more
computationally efficient way.
A further challenge that emerges from the various feature selection approaches produced by MI is their
assessment, particularly in real data analysis. While some strategies take the missing proportion into account when
applying existing approaches for variable selection with missing data, Madley-Dowd et al. (2019) indicate that adding
auxiliary variables to the imputation model does not always ensure efficiency gains from MI. We then employ the fraction of
missing information (fmi) to evaluate the available methods and the proposed procedure. The fmi is a useful concept in missing
data analysis to quantify the uncertainty about imputed values accounting for the amount of information retained by
other variables within a data set (e.g. Savalei and Rhemtulla 2012; Madley-Dowd et al. 2019). This quantity depends
on the type of missing data mechanism, the model parameterization, and the degree of interrelationship among the
variables. We refit the data to assess the information retained by the selected variables in the presence of incomplete
data, in order to compare the efficiency gains for the estimates obtained by the examined methods and the proposed
procedure.
The remainder of this paper is organized as follows. Section 2 briefly reviews the related work on variable selection
and multiple imputation, together with some variable selection approaches using MI. Section 3 presents the proposed
methods in detail, which extend the gradient feature testing to multiply imputed datasets together with their pooling strategies.
Section 4 conducts the simulation study to evaluate the proposed methods and compares them with some common approaches
in the literature. The evaluation includes criteria reflecting the selection performance, parameter estimation, and
prediction. The simulation study shows that the proposed methods attain preferable accuracy of variable selection
within a significantly shorter execution time compared with some common methods. Section 5 illustrates the proposed
methods with two real-world datasets. We assess the information retained by all the examined methods by refitting
the corresponding dataset with the selected active variables. Section 6 concludes the paper with some discussion.
2 Variable selection
We consider the linear regression model
\[
y = X\beta + \varepsilon, \tag{1}
\]
where $y$ is an $n \times 1$ vector of the response variable, $X$ is an $n \times p$ matrix of explanatory variables, and $\beta$ is the parameter
vector. We denote the data matrix as $Z = (y, X)$. Assuming $\varepsilon \sim N(0, \sigma^2 I_n)$ for the error term, the log-likelihood
function for model (1) is
\[
l(\beta, \sigma^2 \mid Z) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}(y - X\beta)^T(y - X\beta). \tag{2}
\]
For data containing redundant predictors in (1), both model selection and parameter estimation can be laborious.
We briefly review some related methods of variable selection for model (1) in this section.
2.1 Lasso
Tibshirani (1996) proposes the lasso, or $\ell_1$-regularized regression, for estimation in (1) to produce interpretable models.
The lasso estimator is the solution to the following optimization problem for model (1):
\[
\hat{\beta}^{(\mathrm{lasso})} = \operatorname*{argmin}_{\beta} \left( \|y - X\beta\|_2^2 + \lambda \sum_{v=1}^{p} |\beta_v| \right), \tag{3}
\]
where $\lambda \ge 0$ is the regularization parameter, which controls the amount of shrinkage. The tuning parameter $\lambda$ is
chosen by cross validation (cv). The 5-fold or 10-fold cv is recommended to choose the optimal $\lambda$ (Breiman and
Spector, 1992; Tibshirani, 1996).
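As an illustration (not part of the paper), the following sketch fits the lasso in (3) with the penalty chosen by 10-fold cross validation, using scikit-learn's `LassoCV` on simulated data; the variable names and settings are for demonstration only, and scikit-learn calls the regularization parameter `alpha` rather than $\lambda$.

```python
# A minimal sketch: lasso with the penalty tuned by 10-fold cross validation.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]              # only three active variables
y = X @ beta + rng.normal(scale=1.0, size=n)

# 10-fold cv over a grid of regularization parameters
fit = LassoCV(cv=10, random_state=0).fit(X, y)
active = np.flatnonzero(fit.coef_ != 0)  # indices with nonzero coefficients
print("selected variables:", active)
print("chosen regularization parameter:", fit.alpha_)
```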
The lasso is a modification, with an absolute value constraint on the regression coefficients, of the approach of
Breiman (1995), in which the least squares estimates are scaled by nonnegative constants. Further generalizations of the lasso include the adaptive
lasso (Zou 2006), group lasso (Yuan and Lin 2006), elastic net (Zou and Hastie 2005), etc.
2.2 The gradient feature testing
Fitting model (1) with a large number of redundant variables needs intensive computations. The greedy forward
selection reduces the computational cost and is suggested by Efron and Hastie (2016). Perkins et al. (2003) propose
grafting for feature selection in a regularized learning framework. Grafting is applicable to a variety of models.
Although only the application to classification problems is considered in Perkins et al. (2003), grafting is easily
extended to regression models.
The grafting quickly obtains the optimal or approximately optimal solution, which minimizes the general criterion, a
combination of the empirical risk and a regularization term,
\[
C(\beta) = \frac{1}{n}\sum_{i=1}^{n} L(f(x_i), y_i) + \Omega_2 + \Omega_1 + \Omega_0,
\]
where $x_i$ is the $i$th observation, $f$ is the predictor function, $L$ is a loss function, and $\Omega_q = \lambda_q \sum_{v=1}^{p} \alpha_q |\beta_v|^q$ is the $\ell_q$
regularizer. The best performance of the grafting in the experiments on classification problems in Perkins
et al. (2003) is achieved by using both an $\ell_1$ and an $\ell_0$ regularizer. The choice among various loss functions and the
modified $\ell_0$ regularizer for more complex models can be found in Perkins et al. (2003).
The extension of grafting to regression problems is straightforward by using a linear predictor model of the form
$f(x) = \beta_0 + \sum_{v=1}^{p} \beta_v x_v$ and the squared error loss function. For the regression model (1), the grafting minimizes
\[
C_{\mathrm{reg}}(\beta) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - f(x_i)\right)^2 + \lambda_2 \sum_{v=1}^{p} \alpha_2 |\beta_v|^2 + \lambda_1 \sum_{v=1}^{p} \alpha_1 |\beta_v| + \lambda_0 \sum_{v=1}^{p} \alpha_0 \beta_v^0, \tag{4}
\]
where the $\ell_0$ regularizer corresponds to a constraint on the maximum number of features in $f$ with a fixed penalty $\alpha_0$
for all parameters, provided $0^0$ is defined as 0 as in Perkins et al. (2003).
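To make the criterion concrete, here is a minimal sketch (an illustration under stated assumptions, not the authors' code) that evaluates a penalized criterion of the form (4); it takes $0^0 = 0$ so the $\ell_0$ term simply counts nonzero coefficients, and the weights `lam*`/`a*` are placeholder values.

```python
# A minimal sketch of the penalized criterion in (4) on complete data.
import numpy as np

def grafting_criterion(beta, X, y, lam2=0.0, lam1=0.1, lam0=0.0,
                       a2=1.0, a1=1.0, a0=1.0):
    """Empirical squared-error risk plus l2, l1, and l0 regularizers."""
    risk = np.mean((y - X @ beta) ** 2)
    pen2 = lam2 * a2 * np.sum(beta ** 2)
    pen1 = lam1 * a1 * np.sum(np.abs(beta))
    pen0 = lam0 * a0 * np.count_nonzero(beta)   # 0**0 defined as 0
    return risk + pen2 + pen1 + pen0
```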
The grafting is a stagewise optimization procedure. At each stage, a gradient-based heuristic, which measures
the magnitude of $\partial C_{\mathrm{reg}}(\beta)/\partial\beta_v$, is employed to decide which variable is most likely to improve the current model,
and the model is incrementally optimized using gradient descent. To evaluate the effect of a variable on reducing the
optimization criterion (4), the grafting can be simplified as the greedy least squares regression algorithm described in
Zhang (2009). At each stage, measuring the magnitude of $\partial C_{\mathrm{reg}}(\beta)/\partial\beta_v$ is equivalent to calculating the gradient of
the log-likelihood function (2) with respect to $\beta_v$ for a variable not yet selected into the current model.
Suppose that the index set of variables, denoted as $V_r = \{v(1), \cdots, v(r)\}$, includes the $r$ variables already selected into
the current model. We evaluate the remaining variables in the set $V_r^C$, where $C$ indicates the complementary set, and
select the one with the maximum magnitude of gradient into $V_r$, updating it to $V_{r+1}$. That is, the newly chosen variable
will be the one with the largest gradient of the log-likelihood (2) based on the data matrix $Z_{V_r}$ corresponding to the
$r$ variables in the current model, denoted as
\[
x_{v(r+1)} = \operatorname*{argmax}_{v' \in V_r^C} \left| \frac{\partial\, l(\beta, \sigma^2 \mid Z_{V_r})}{\partial \beta_{v'}} \right|
= \operatorname*{argmax}_{v' \in V_r^C} \left| \sum_{i=1}^{n} x_{i,v'} \left( y_i - \beta_0 - \sum_{j=1}^{r} \beta_{v(j)} x_{i,v(j)} \right) \right|. \tag{5}
\]
The grafting procedure (5) only involves the inner products of the variables in $V_r^C$ with the residuals of the current
model, and hence reduces the computational complexity. Moreover, since the approach adds one variable into the current
model sequentially, it typically produces a model which consists of a small number of variables.
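The following is a minimal sketch of the greedy step in (5) on complete data: at each stage the inner products between the unselected variables and the current residuals are computed, the variable with the largest magnitude is added, and the model is refit by least squares. This illustrates the greedy least squares idea of Zhang (2009), not the proposed adaptive grafting for MI data; all names are illustrative.

```python
# A minimal sketch of greedy forward selection via gradients (inner products
# with the current residuals), refitting by least squares after each addition.
import numpy as np

def greedy_forward_selection(X, y, max_vars):
    n, p = X.shape
    active = []                      # V_r: indices of selected variables
    residual = y - y.mean()          # start from the intercept-only model
    for _ in range(max_vars):
        remaining = [v for v in range(p) if v not in active]
        # magnitude of the log-likelihood gradient for each unselected variable
        scores = [abs(X[:, v] @ residual) for v in remaining]
        active.append(remaining[int(np.argmax(scores))])
        # refit the current model by least squares and update the residuals
        Xa = np.column_stack([np.ones(n), X[:, active]])
        coef, *_ = np.linalg.lstsq(Xa, y, rcond=None)
        residual = y - Xa @ coef
    return active
```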
The grafting is guaranteed to find the global optimum in some cases with a convex loss function and only the $\ell_2$
and/or $\ell_1$ regularizers (Perkins et al. 2003). Furthermore, Zhang (2009) shows that the greedy least squares regression
algorithm identifies the correct set of variables with a suitable choice of stopping criterion, which is equivalent to selecting
an appropriate regularization parameter $\lambda$ in the lasso regression in (3), i.e. the grafting using only the $\ell_1$ regularizer.
Nevertheless, the approach only applies to data without missing values. The optimality of grafting on multiply
imputed data is more complicated and beyond the scope of this study; it can be further investigated in the future.
2.3 Variable selection with missing values
There are three main approaches for variable selection in the presence of missing data (Zhao and Long 2017). The first
approach comprises likelihood-based methods. The joint distribution of the observed and missing data is specified, and by
integrating out the missing data the marginal distribution of the observed data is obtained; variable selection is then
performed with appropriate accommodations. Claeskens and Consentino (2008) propose some variants of the Akaike
information criterion (AIC) to perform variable selection with missing covariates based on the expectation-maximization
(EM) algorithm. Ibrahim et al. (2008) also propose a class of model selection criteria for missing-data problems using
the EM algorithm. Mitra and Dunson (2010) propose a two-level stochastic search variable selection (SSVS) algorithm,
which addresses a broader class of models involving mixed categorical and continuous variables. Liang et al. (2018)
propose two algorithms, the imputation–regularized optimization (IRO) algorithm and the imputation–conditional
regularized optimization (ICRO) algorithm, which incorporate imputation and regularization for high dimensional
variable selection with missing values. Zambom and Matthews (2019) propose two methods, maximum likelihood sure
independence screening (ML-SIS) and two-stage screening (TS-SIS), for ultrahigh dimensional problems with the sure
screening property.
The second approach is through inverse probability weighting (IPW) and its extensions. Johnson et al. (2008)
propose a general strategy based on penalized weighted estimating equations for variable selection. Wolfson (2011)
develops the EEBoost algorithm for variable selection based on estimating equations (EEs). Brown et al. (2017)
extend EEBoost and propose ThrEEBoost (Thresholded EEBoost) algorithm. However, this approach is difficult to
extend to general missing data patterns that are not monotone (Long and Johnson 2015; Liu et al. 2016).
The third approach is performing variable selection on multiply imputed datasets. Brand (1999) selects the
variables into the final model if the inclusion frequency is higher than the pre-specified threshold. Chen and Wang
(2013) propose to ensure the consistency of selection across multiply imputed data by using group lasso regularization.
Long and Johnson (2015) generate bootstrapped samples to conduct imputation, and determine the final selection
through stability selection. Similar to the method proposed in Long and Johnson (2015), Liu et al. (2016) propose
multiple imputation random lasso, which also involves bootstrapping and stability selection, to perform variable
selection depending on the importance measure of each variable. Thao and Geskus (2019) categorize variable
selection using MI into three classes; further discussion is presented in Sect. 3.3.
An alternative to MI is to impute missing values through matrix completion, for example the structured matrix completion
(SMC) of Cai et al. (2016) and the softImpute-ALS procedure of Hastie et al. (2015). Gao and Lee (2017)
apply softImpute-ALS and the adaptive lasso to missing data, and the final model is selected by the lowest
value of the modified Rank Selection Criterion (RSC). Nevertheless, Xue and Qu (2020) argue that the approach of
Gao and Lee (2017) does not guarantee estimation consistency.
3 Missing values in data
Missing values complicate the data analysis. It is important to understand the reasons, patterns, and mechanisms
of missing values when dealing with their analysis. The methods for missing values are developed depending on the
missing mechanisms (Little and Rubin 2002; Rubin 1978). The missing-data mechanisms can be categorized into three
types. The first one is missing completely at random (MCAR), which assumes the missingness does not depend on
the data values. The second is missing at random (MAR), where the missingness depends only on the observed
values in the data. The third is missing not at random (MNAR), where there is a relationship
between the propensity of missingness and the unobserved data.
3.1 Multiple imputation
Among the techniques that handle missing values, multiple imputation is more flexible, as it allows imputation variation,
and is widely applied owing to advances in computational capability (van Buuren 2018). Rubin (1978) proposes
this Bayesian approach to relate the observed-data posterior distribution to the complete data posterior distribution
which would have been obtained if the missing data had been observed under the assumption of MCAR or MAR. MI
allows variation of imputed values and provides a quantification of the uncertainty.
Considering the regression model (1), if the response $y$ is completely observed and there are missing values in the
predictors, then an $n \times p$ missing-data indicator matrix, $\Delta = (\delta_{iv})$, is defined as
\[
\delta_{iv} =
\begin{cases}
1, & \text{if } x_{iv} \text{ is missing},\\
0, & \text{if } x_{iv} \text{ is observed},
\end{cases}
\]
where $x_{iv}$ is the $(i, v)$th element of $X$. We also denote the imputed data matrix as $\tilde{Z} = (y, \tilde{X})$ such that
\[
\tilde{x}_{iv} =
\begin{cases}
x^{(\mathrm{imp})}_{iv}, & \text{if } \delta_{iv} = 1,\\
x_{iv}, & \text{if } \delta_{iv} = 0,
\end{cases}
\]
where $x^{(\mathrm{imp})}_{iv}$ is a filled-in value produced by some imputation approach when $x_{iv}$ is missing.
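For illustration only (not from the paper), the following sketch constructs the indicator matrix and a singly filled-in matrix, using column means as the imputed values; the data are arbitrary.

```python
# A minimal sketch: missing-data indicator matrix and a singly imputed matrix.
import numpy as np

X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0]])

delta = np.isnan(X).astype(int)       # delta_iv = 1 where x_iv is missing
col_means = np.nanmean(X, axis=0)     # one simple choice of imputed values
X_tilde = np.where(delta == 1, col_means, X)
```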
MI involves a three-stage process (Rubin 1978). First, $M$ imputed datasets are created, in each of which the
missing entries may have different imputed values, while all other non-missing entries are identical. The differences
among the imputed values reflect the uncertainty of the imputation. The second step is to apply the same statistical
analysis to each imputed dataset and obtain $M$ estimates of interest. Finally, one pools the $M$ estimates, together
with the variance of the pooled estimate, by Rubin's rule (Rubin 1987).
Once the $M$ imputed datasets have been created, we use model (1) to illustrate Rubin's rule. MI obtains the OLS estimates
\[
\hat{\beta}^{(m)} = \left(\tilde{X}^{(m)T} \tilde{X}^{(m)}\right)^{-1} \tilde{X}^{(m)T} y
\]
and
\[
\hat{\sigma}^2_{(m)} = \frac{1}{n - p - 1}\left(y - \tilde{X}^{(m)}\hat{\beta}^{(m)}\right)^T\left(y - \tilde{X}^{(m)}\hat{\beta}^{(m)}\right)
\]
based on each imputed dataset $\tilde{Z}^{(m)} = (y, \tilde{X}^{(m)})$, where $m = 1, \cdots, M$. The pooled estimate for $\beta$ is the mean
of the estimates across the $M$ datasets,
\[
\bar{\beta} = \frac{1}{M}\sum_{m=1}^{M}\hat{\beta}^{(m)}. \tag{6}
\]
The variance of $\bar{\beta}$ is
\[
Var(\bar{\beta}) = T = U + \frac{M+1}{M}B, \tag{7}
\]
where
\[
U = \frac{1}{M}\sum_{m=1}^{M}\hat{\sigma}^2_{(m)}\left(\tilde{X}^{(m)T}\tilde{X}^{(m)}\right)^{-1} \tag{8}
\]
is the within-imputation variance and
\[
B = \frac{1}{M-1}\sum_{m=1}^{M}\left(\hat{\beta}^{(m)} - \bar{\beta}\right)\left(\hat{\beta}^{(m)} - \bar{\beta}\right)^T \tag{9}
\]
is the between-imputation variance.
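A minimal sketch of Rubin's rule in (6)-(9) is given below, assuming the $M$ per-imputation coefficient estimates and their covariance matrices have already been computed; the function and variable names are illustrative, not from the paper.

```python
# A minimal sketch of Rubin's rule: pool M per-imputation estimates into a
# single estimate and a total variance T = U + (M + 1) / M * B.
import numpy as np

def rubin_pool(betas, covs):
    """betas: list of M coefficient vectors; covs: list of M covariance matrices."""
    M = len(betas)
    betas = np.asarray(betas)
    beta_bar = betas.mean(axis=0)            # pooled estimate (6)
    U = np.mean(covs, axis=0)                # within-imputation variance (8)
    diffs = betas - beta_bar
    B = diffs.T @ diffs / (M - 1)            # between-imputation variance (9)
    T = U + (M + 1) / M * B                  # total variance (7)
    return beta_bar, T
```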
Formulating a set of conditional distributions relating each variable to a set of the other variables instead of joint
modelling is more practical for real multivariate data sets. It is easier to take one conditional distribution at a time,
obtaining imputed values for a variable by regressing it on the other variables, and then to impute another variable
conditioned on the observed and currently imputed values, and so on. Cycling this procedure through every variable with missing values
and repeating it for a number of iterations produces the imputed values. van Buuren and Groothuis-Oudshoorn (2011)
employ this concept to develop an R package mice (multivariate imputation by chained equations).
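As a rough Python analogue of the chained-equations idea (mice itself is an R package), the sketch below uses scikit-learn's experimental IterativeImputer with posterior sampling to generate several imputed datasets; the data, the number of imputations, and the settings are illustrative, and the imputation models differ from those in mice.

```python
# A minimal sketch of chained-equations imputation producing M imputed datasets.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0],
              [np.nan, 5.0, 3.0]])

imputations = []
for m in range(5):  # M = 5 imputed datasets
    # sample_posterior=True draws imputed values rather than using point predictions
    imputer = IterativeImputer(sample_posterior=True, random_state=m, max_iter=10)
    imputations.append(imputer.fit_transform(X))
```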
3.2 Fraction of missing information
The fraction of missing information (fmi) quantifies the uncertainty about imputed values accounting for the amount
of information retained by other variables within a data set (Wagner 2010; White et al. 2011; Madley-Dowd et al.
2019). The fmi measures the proportion of a parameter’s sampling error that is due to missing data, thus it reveals