Brand (1999) first proposes a procedure for variable selection with missing data, which selects a set of candidate variables from each of the MI datasets and determines the final model by retaining the variables whose inclusion frequency exceeds a pre-specified threshold. Wood et al. (2008) and Vergouwe et al. (2010) investigate variable selection strategies with missing predictor values.
Apart from performing variable selection on MI data, other approaches for variable selection on incomplete data
are available. Yang et al. (2005) propose two Bayesian variable selection methods, “impute, then select” (ITS) and
“simultaneously impute and select” (SIAS), to impute and select variables for linear regression models. SIAS, which
embeds imputation and the stochastic search variable selection (SSVS) method of George and McCulloch (1993) in a
Gibbs sampling process, slightly outperforms ITS by yielding smaller Monte Carlo standard errors. Their simulation studies show that both ITS and SIAS outperform stepwise selection using only the complete cases. Zhao and Long (2017) review approaches to variable selection with missing data.
Among these approaches, MI is prevalent owing to the availability of user-friendly software packages, and the existing variable selection methods can be readily applied to each imputed dataset. However, if a relatively high proportion of the variables are inactive for the response variable, imputing all the missing entries may not be necessary for variable selection. Our goal in this study is to identify active variables more efficiently for data with missing values. To this end, we extend a greedy forward selection method, grafting for least squares, to MI data. Specifically, we adapt the greedy least squares regression algorithm (Zhang 2009) for variable selection in linear regression models with missing data. The most important characteristic of MI is that it accounts for the uncertainty of the estimates due to the variation among imputed datasets. This variation, however, yields diverse selection results when variable selection is performed separately from the imputation. Hence, we propose three pooling rules for the adaptive grafting in the subsequent study.
The proposed procedure starts with the complete cases, and the initial set of active variables is obtained from the MI data, whose imputation model includes only the variables selected by lasso regression. Our proposed algorithm then quickly expands the working data in two ways. First, the adaptive grafting adds one active variable, together with its available observations, to the data matrix. Second, the MI data expand by including one active variable identified by the adaptive grafting in the imputation model, which updates the imputed values. This approach is more efficient than existing MI-based variable selection methods because the adaptive grafting quickly identifies the active features and allows MI to be conducted on merely a subset of variables instead of the whole dataset, as illustrated by the sketch below.
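To make the grafting step concrete, the following is a minimal sketch of a single grafting update on one fully observed (or imputed) dataset, using a plain gradient-magnitude criterion for least squares. It omits the pooling across multiply imputed datasets and the handling of partially observed rows performed by the adaptive grafting; the function name grafting_step and its interface are illustrative rather than the implementation used in this paper.

import numpy as np

def grafting_step(y, X, active):
    """One grafting update on a single fully observed (or imputed) dataset.

    Fits least squares on the current active set, then returns the excluded
    variable whose squared-error loss gradient has the largest magnitude,
    i.e., the most promising variable to add next.
    """
    n, p = X.shape
    candidates = [j for j in range(p) if j not in active]
    if not candidates:
        raise ValueError("All variables are already active.")
    # Residuals from the least squares fit on the active variables
    # (intercept omitted for brevity; assume y and X are centered).
    if active:
        Xa = X[:, list(active)]
        beta, *_ = np.linalg.lstsq(Xa, y, rcond=None)
        resid = y - Xa @ beta
    else:
        resid = y.copy()
    # Gradient of the mean squared-error loss with respect to each
    # coefficient; only the entries of excluded variables matter here.
    grad = X.T @ resid / n
    best = max(candidates, key=lambda j: abs(grad[j]))
    return best, abs(grad[best])

# Example usage on synthetic complete data:
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = 2.0 * X[:, 3] - 1.5 * X[:, 7] + rng.standard_normal(100)
print(grafting_step(y, X, active=[3]))  # expected to pick variable 7

In the proposed procedure, a step of this kind would be applied to each imputed dataset and the resulting candidates combined through one of the pooling rules introduced later.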
The proposed algorithm initializes from, but does not confine variable selection to, the complete cases, since listwise deletion may cause bias and hence incorrectly identify the set of active variables. Applying MI to incomplete data under valid assumptions should improve the performance of any variable selection method. However, conducting MI on data containing noise variables is computationally intensive. Our proposed procedure incrementally selects variables into the active set and expands the MI data accordingly. Therefore, the procedure benefits from MI in a more computationally efficient way.
A further challenge arising from the various feature selection results produced by MI is their assessment, particularly in real data analysis. While some strategies consider the missing proportion when applying the existing approaches for variable selection with missing data, Madley-Dowd et al. (2019) indicate that adding auxiliary variables to the imputation model does not always ensure efficiency gains from MI. We therefore employ the fraction of missing information (fmi) to evaluate the available methods and the proposed procedure. fmi is a useful concept in missing data analysis that quantifies the uncertainty about imputed values while accounting for the amount of information retained by other variables within a dataset (e.g., Savalei and Rhemtulla 2012; Madley-Dowd et al. 2019). This quantity depends on the type of missing data mechanism, the model parameterization, and the degree of interrelationship among the variables. We refit the data to assess the information retained by the selected variables with incomplete data, which allows us to compare the efficiency gains in the estimates obtained by the examined methods and the proposed procedure.
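For reference, a common estimate of fmi is based on Rubin's combining rules; the following is a minimal sketch in which, for brevity, the small-sample degrees-of-freedom adjustment often applied in practice is omitted. With $m$ imputations and per-imputation estimates $\hat{Q}_l$ of a scalar estimand with variances $U_l$,
\[
\bar{Q} = \frac{1}{m}\sum_{l=1}^{m}\hat{Q}_l, \qquad
\bar{U} = \frac{1}{m}\sum_{l=1}^{m}U_l, \qquad
B = \frac{1}{m-1}\sum_{l=1}^{m}\bigl(\hat{Q}_l-\bar{Q}\bigr)^2,
\]
\[
T = \bar{U} + \Bigl(1+\frac{1}{m}\Bigr)B, \qquad
\widehat{\mathrm{fmi}} \approx \frac{(1+1/m)\,B}{T},
\]
so that fmi is large when the between-imputation variance $B$ dominates the total variance $T$.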
The remainder of this paper is organized as follows. Section 2 briefly reviews related work on variable selection and multiple imputation, together with some variable selection approaches using MI. Section 3 presents the proposed methods in detail, which extend the gradient feature testing to multiply imputed datasets and introduce their pooling strategies. Section 4 conducts a simulation study to evaluate the proposed methods and compare them with some common approaches in the literature. The evaluation includes criteria reflecting selection performance, parameter estimation, and prediction. The simulation study shows that the proposed methods attain favorable variable selection accuracy within a substantially shorter execution time than some common methods. Section 5 illustrates the proposed methods with two real-world datasets. We assess the information retained by all the examined methods by refitting the corresponding dataset with the selected active variables. Section 6 concludes the paper with some discussion.
2 Variable selection
We consider the linear regression model
\[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad (1)
\]
where $\mathbf{y}$ is an $n \times 1$ vector of the response variable, $\mathbf{X}$ is an $n \times p$ matrix of explanatory variables, and $\boldsymbol{\beta}$ is the parameter vector. We denote the data matrix as $\mathbf{Z} = (\mathbf{y}, \mathbf{X})$. Assuming $\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma^{2}\mathbf{I}_{n})$ for the error term, the log likelihood