Ensembling improves stability and power of feature
selection for deep learning models
Prashnna K Gyawali, Xiaoxia Liu, James Zou*, Zihuai He*
Stanford University
* co-corresponding authors.
Abstract
With the growing adoption of deep learning models in different real-world domains,
including computational biology, it is often necessary to understand which data
features are essential for the model’s decision. Despite extensive recent efforts to
define different feature importance metrics for deep learning models, we identified that inherent stochasticity in the design and training of deep learning models
makes commonly used feature importance scores unstable. This results in varied
explanations or selections of different features across different runs of the model.
We demonstrate how the signal strength of features and correlation among features
directly contribute to this instability. To address this instability, we explore the ensembling of feature importance scores of models across different epochs and find that this simple approach can substantially mitigate the issue. For example, we consider knockoff inference, as it allows feature selection with statistical guarantees. We discover considerable variability in the selected features across different epochs
of deep learning training, and the best selection of features does not necessarily occur at the lowest validation loss, the conventional criterion for determining the best model. As such, we present a framework to combine the feature importance of
trained models across different hyperparameter settings and epochs, and instead
of selecting features from one best model, we perform an ensemble of feature
importance scores from numerous good models. Across the range of experiments
in simulated and various real-world datasets, we demonstrate that the proposed
framework consistently improves the power of feature selection.
1 Introduction
Machine learning (ML) algorithms, especially deep learning models, are being used extensively to make important decisions in real-world domains, including medicine and biology.
These algorithms are often required to be interpretable as this allows us to understand which input
features are used to make the decision. For some applications, knowing which features are selected
enhances trust in the model, while for other applications, selected features may help in scientific
discovery, e.g., drug discovery problems. Overall, interpretability and explainability are critical with
the growing usage of ML algorithms. However, although recent efforts have given rise to various
interpretability methods Sundararajan et al. [2017], Shrikumar et al. [2017], Binder et al. [2016],
Kim et al. [2018] and have explained why an algorithm made certain decisions, these interpretations
remain fragile Ghorbani et al. [2019], leading to different explanations for the same data instance
and, eventually, contributing toward mistrust in the models.
The primary reason for such varied explanations can be attributed to stochasticity in the design and
training of deep neural networks (DNNs), such as weight initialization, dropout, and stochastic gradient descent. For instance, the unstable behavior of DNNs when varying only the initialization of the network weights has been reported before Mehrer et al. [2020]. Further, with
the objective function of neural networks being highly non-convex, the resulting model may end up
in different local minima. Although such local minima result in models with similar generalization performance (as measured by validation loss), they yield varied explanations or feature importance scores for the same data. This instability in feature attribution is aggravated when the dataset has features with low signal amplitude or high correlation among features, which is quite common in real-world data analysis. More importantly, this instability in feature importance scores directly impacts the stability of the selected features.
In this work, we first demonstrate this issue of instability in feature importance and feature selection
for standard benchmarking datasets and interpretability metrics. We also provide evidence of how
data properties like signal strength and correlation aggravate instability. We then demonstrate how
simple averaging of feature importance scores from models at different training epochs helps address
this instability. Motivated by the effectiveness of such averaging, we propose a framework for stabilizing the feature importance and feature selection of deep neural networks. Our proposed framework first performs hyperparameter optimization of deep learning models. Then, instead of the conventional practice of selecting a single best model, we identify numerous good models and create an ensemble of their feature importance scores, which, as we show later, helps select robust features. To determine good models, we consider two strategies. First, we propose using the top-performing models as determined by cross-validation (CV) loss. Second, we propose statistical leveraging to find the models that are most influential for feature importance. In this work, we consider the knockoff framework for feature selection as it selects features with statistical guarantees. Across a range of experiments
in simulation settings and real-world feature selection problems, we find that the existing approach of
selecting features from the best model across different hyperparameter settings and epochs doesn’t
necessarily result in stable or improved feature selection. Instead, we achieve stable and improved
feature selection with the presented framework. Overall, our contributions are as follows:
• We demonstrate the instability in DNN interpretations for widely used interpretability metrics (Grad, DeepLift, and Lime) across two benchmarking datasets (MNIST and CIFAR-10).
• We propose a framework that ensembles the feature importance scores obtained along the training path of a deep neural network in order to stabilize them.
• We demonstrate the applicability of such an ensemble in the task of feature selection with knockoff inference.
• Across simulation studies and three real-data applications for feature selection, we demonstrate the efficacy of the proposed framework in improving both the stability and the power of feature selection for deep learning models.
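To make the first strategy concrete, the sketch below illustrates it in PyTorch: a gradient-based feature importance score is computed for each of the top-k checkpoints ranked by validation loss, and the scores are then averaged. This is a minimal illustration rather than our exact implementation; the checkpoint bookkeeping (a list of dicts with 'val_loss' and 'state_dict' entries), the choice of k, and the use of the plain gradient metric are assumptions made for brevity.

```python
import torch

def gradient_importance(model, inputs, targets):
    """'Grad' metric: mean absolute gradient of the predicted-class score
    with respect to each input feature, averaged over the sampled inputs."""
    model.eval()
    inputs = inputs.clone().requires_grad_(True)
    class_scores = model(inputs).gather(1, targets.view(-1, 1)).sum()
    class_scores.backward()
    return inputs.grad.abs().mean(dim=0)  # one score per input feature

def topk_ensemble_importance(checkpoints, model, inputs, targets, k=10):
    """Average the FI score over the k checkpoints with the lowest
    validation loss, instead of relying on the single best model.
    `checkpoints` is assumed (hypothetical format) to be a list of dicts
    with keys 'val_loss' and 'state_dict'."""
    top_k = sorted(checkpoints, key=lambda c: c["val_loss"])[:k]
    scores = []
    for ckpt in top_k:
        model.load_state_dict(ckpt["state_dict"])
        scores.append(gradient_importance(model, inputs, targets))
    return torch.stack(scores).mean(dim=0)
```

Averaging over several good checkpoints rather than relying on the single best one is what provides the stabilization discussed in the following sections.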
2 Related works
In recent times, several works have carefully studied the fragility of neural network interpretations Ghorbani et al. [2019], Slack et al. [2020]. These works demonstrate that explanation approaches are fragile to adversarial perturbations, where perceptually indistinguishable inputs can receive very different interpretations despite being assigned the same predicted label. Although our work is also about
the instability in interpretations of the neural network, unlike them, we study this problem without
relying on adversarial inputs. Further, we focus on the impact of this instability on the downstream
application of feature selection, which hasn’t been considered before.
The primary strategy in our framework, i.e., ensembling feature importance scores from models at different training stages, has some similarities with recent works on deep learning generalization Li et al. [2022], Izmailov et al. [2018]. These works study weight averaging as a means of improving generalization. However, unlike these works, we form an ensemble of feature importance scores obtained from the individual model weights at different stages of deep learning training, and most importantly, unlike all previous works, we use such an ensemble to improve the stability and power of feature selection.
Feature selection (or variable selection) has been extensively studied in machine learning and statistics
Saeys et al. [2007], Mares et al. [2016]. Selecting features while controlling the false discovery rate is an attractive property, and different feature selection methods exist that provide such statistical
guarantees Meinshausen and Bühlmann [2010], Barber and Candès [2015]. Although our framework
Figure 1: (Left) Box plot of correlations between the feature importance scores of models trained separately with five random initializations. This analysis considers two datasets (MNIST and CIFAR-10) and three different feature importance metrics (Grad, DeepLift, and Lime). The Best feature
importance score (obtained from the model with the lowest validation loss) is compared with the Avg
feature importance score obtained from the ensemble of feature importance scores from all training
epochs. (Middle) Instability of individual features across different randomly initialized models
in relation to their signal strength and signal correlation with all other features. (Right) Average
instability across all the features for three feature importance measures considered in this study.
applies to any such feature selection method, we consider knockoff inference for feature selection
Barber and Candès [2015].
Since we consider knockoff inference to demonstrate how our proposal helps to stabilize and improve
feature selection in deep learning, our work is also related to research on the intersection of deep
learning and knockoff inference. Although knockoff inference is model-free and selects features with statistical guarantees, its use with deep learning has been quite limited Lu et al. [2018], Zhu et al. [2021]. Our work is complementary to these efforts: we use them to demonstrate the instability issues in feature selection, and our proposed framework improves the power of feature selection relative to them. Furthermore, within the knockoff framework, the idea
of constructing multiple knockoffs has been studied in detail to improve feature selection He et al.
[2021], Gimenez and Zou [2019]. Our work utilizes single and multiple knockoffs depending on the
dataset’s complexity and demonstrates the utility of the presented framework for both scenarios.
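For completeness, the selection step of the knockoff filter Barber and Candès [2015], which we later apply to (possibly ensembled) feature importance scores, can be sketched as follows. The specific feature statistic (difference of absolute importance scores for original and knockoff features) and the target FDR level are illustrative choices, not the only ones compatible with our framework.

```python
import numpy as np

def knockoff_select(z_orig, z_knock, fdr=0.1, offset=1):
    """Knockoff filter (Barber and Candès, 2015): select features whose
    statistic W_j = |Z_j| - |Z~_j| exceeds the data-dependent threshold
    that controls the FDR at level `fdr` (offset=1 gives knockoff+)."""
    w = np.abs(z_orig) - np.abs(z_knock)
    thresholds = np.sort(np.abs(w[w != 0]))   # candidate thresholds t
    tau = np.inf
    for t in thresholds:
        fdp_hat = (offset + np.sum(w <= -t)) / max(np.sum(w >= t), 1)
        if fdp_hat <= fdr:                    # estimated FDP is under control
            tau = t
            break
    return np.where(w >= tau)[0]              # indices of selected features
```

In practice, `z_orig` and `z_knock` would be the (ensembled) importance scores assigned by the trained network to the original features and to their knockoff copies, respectively.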
3 Instability in models’ interpretability
In this section, we explore instability in the interpretations of deep neural networks. Toward this end, we consider two benchmarking datasets: MNIST LeCun et al. [1998] and CIFAR-10 Krizhevsky
et al. [2009]. We train the models for the standard image classification task and record their feature
importance (FI) score (details in Appendix A). In particular, for a given dataset, we train a deep
neural network five times with random initialization and record the FI for the best model (as defined
by the lowest validation loss in the training epochs) in each run. We randomly sample (stratified
sampling to cover all the classes in the dataset) images from the test dataset to calculate the FI and
average their absolute values to obtain a single FI score $Z$. For FI, we consider three separate and widely used feature importance metrics: Grad, gradients with respect to inputs; DeepLift, a back-propagation-based approach that attributes a change to the inputs based on the differences between the inputs and corresponding references (or baselines) for non-linear activations Shrikumar et al. [2017]; and Lime, an interpretability method that samples data points around a specified input example and uses model evaluations at these points to train a simpler, interpretable ‘surrogate’ model Ribeiro et al. [2016]. Ideally, to provide robust interpretability,
we expect the correlation between the different $Z$ from these randomly initialized models to be close to 1.
However, as we see in Fig. 1 (left), for both datasets and across all the FI metrics, the best FI scores of the same model trained with different random initializations are not well correlated. Instead, when we create an ensemble from the $Z$ obtained across $E$ epochs, i.e., $Z_{\text{avg}} = \frac{1}{E}\sum_{i=1}^{E} Z_i$, we find that such an ensemble helps stabilize the interpretations, with increased correlation between these random runs. As we can see in Fig. 1 (left), for MNIST, the ensemble (represented by Avg) increases the correlation significantly, and for CIFAR-10, although the ensemble does not increase the correlation by a considerable margin, it still performs better than the best model.
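The epoch-wise ensemble itself is a simple average, after which stability can be quantified by the correlation between runs. A minimal sketch is given below, assuming the per-epoch scores $Z_i$ have already been computed with one of the FI metrics above.

```python
import numpy as np

def epoch_ensemble(z_per_epoch):
    """Z_avg = (1/E) * sum_{i=1}^{E} Z_i, where `z_per_epoch` is an
    (E, p) array of FI scores recorded at each of the E training epochs."""
    return np.asarray(z_per_epoch).mean(axis=0)

def interpretation_stability(z_run_a, z_run_b):
    """Pearson correlation between the FI scores of two independently
    initialized runs; values close to 1 indicate stable interpretations."""
    return np.corrcoef(np.ravel(z_run_a), np.ravel(z_run_b))[0, 1]
```

In Fig. 1, Best corresponds to the score of a single (lowest-validation-loss) epoch per run, whereas Avg corresponds to `epoch_ensemble` applied to all recorded epochs before computing the correlation between runs.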