the objective function of neural networks being highly non-convex, the resulting model may end up
in different local minima. Although such local minima yield models with similar generalization
performance (as measured by validation loss), they produce different explanations, or feature
importance scores, for the same data. This instability in feature attribution is aggravated when the
dataset contains low-signal-amplitude features or highly correlated features, both of which are quite
common in real-world data analysis. More importantly, this instability in feature importance scores
directly impacts the stability of the selected features.
In this work, we first demonstrate this instability in feature importance and feature selection on
standard benchmark datasets with standard interpretability metrics. We also provide evidence of how
data properties such as signal strength and correlation aggravate the instability. We then demonstrate
how simple averaging of feature importance scores from models at different training epochs helps
address this instability. Motivated by the effectiveness of such averaging, we propose a framework for
stabilizing the feature importance and feature selection of deep neural networks. Our proposed
framework first performs hyperparameter optimization of the deep learning models. Then, instead of
the conventional practice of selecting a single best model, we identify multiple good models and
ensemble their feature importance scores, which, as we show later, helps select robust features. To
determine good models, we consider two strategies: first, we propose using the top-performing models
as determined by cross-validation (CV) loss; second, we propose statistical leveraging to find the
models most influential for feature importance. For feature selection, we adopt the knockoff
framework, as it selects features with statistical guarantees. Across a range of experiments in
simulation settings and real-world feature selection problems, we find that the existing approach of
selecting features from the single best model across hyperparameter settings and epochs does not
necessarily result in stable or improved feature selection. Instead, the presented framework achieves
stable and improved feature selection. Overall, our contributions are as follows:
• We demonstrate the instability in DNN interpretations for widely used interpretability metrics
(Grad, DeepLIFT, and LIME) across two benchmark datasets (MNIST and CIFAR-10).
• We propose a framework that ensembles feature importance scores obtained along the training
path of a deep neural network to stabilize its feature importance scores (see the sketch following
this list).
• We demonstrate the applicability of such an ensemble to the task of feature selection with
knockoff inference.
• Across simulation studies and three real-data feature selection applications, we demonstrate
the efficacy of the proposed framework in improving both the stability and the power of feature
selection for deep learning models.
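To make the core averaging step concrete, below is a minimal PyTorch sketch. The function names and the use of a simple gradient-based attribution (Grad) are illustrative choices, not the paper's exact implementation; any attribution method such as DeepLIFT or LIME could be substituted, and `checkpoint_paths` is assumed to hold state dicts saved at different training epochs, e.g., the top performers under CV loss.

```python
import torch
import torch.nn.functional as F

def saliency_scores(model, x, y):
    """Per-feature importance as the mean absolute input gradient (Grad)."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return x.grad.abs().mean(dim=0)  # average over the batch: one score per feature

def ensembled_importance(model, checkpoint_paths, x, y):
    """Average importance scores over checkpoints saved along the training path."""
    scores = []
    for path in checkpoint_paths:
        model.load_state_dict(torch.load(path))  # one architecture, many weight settings
        model.eval()
        scores.append(saliency_scores(model, x, y))
    return torch.stack(scores).mean(dim=0)  # ensemble by simple averaging
```

The same averaging applies unchanged whether the checkpoints are the top-performing models under CV loss or the influential models identified via statistical leveraging.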
2 Related works
Recent works have carefully studied the fragility of neural network interpretations Ghorbani et al.
[2019], Slack et al. [2020]. These works demonstrate that explanation approaches are fragile to
adversarial perturbations: perceptually indistinguishable inputs can receive very different
interpretations despite being assigned the same predicted label. Although our work also concerns
the instability of neural network interpretations, unlike these works, we study this problem without
relying on adversarial inputs. Further, we focus on the impact of this instability on the downstream
application of feature selection, which has not been considered before.
The primary strategy in our framework, i.e., ensembling feature importance scores from models at
different training stages, has some similarities with recent works on deep learning generalization
Li et al. [2022], Izmailov et al. [2018]. These works study averaging the model's weights as a way
to improve generalization. However, unlike these works, we form an ensemble of feature importance
scores obtained from the individual model weights at different stages of deep learning training,
and most importantly, unlike all previous works, we use such an ensemble to improve the stability
and power of feature selection.
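Our framework couples this ensemble with knockoff inference. For concreteness, below is a minimal NumPy sketch of the knockoff+ selection rule of Barber and Candès [2015]; `W` is assumed to hold one knockoff statistic per feature, e.g., a contrast between the ensembled importance of a feature and that of its knockoff copy (that construction is an illustrative assumption, not necessarily the exact statistic used here).

```python
import numpy as np

def knockoff_select(W, q=0.1):
    """Features selected by the knockoff+ filter at target FDR level q.

    W: one statistic per feature; large positive values favour the original
       feature over its knockoff copy.
    """
    # Candidate thresholds are the nonzero magnitudes of the statistics.
    for t in np.sort(np.abs(W[W != 0])):
        # Conservative estimate of the false discovery proportion at threshold t.
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return np.where(W >= t)[0]  # smallest threshold meeting the target
    return np.array([], dtype=int)  # no threshold achieves FDR q: select nothing
```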
Feature selection (or variable selection) has been extensively studied in machine learning and statistics
Saeys et al. [2007], Mares et al. [2016]. Selecting features while controlling false discoveries is
an attractive property, and several feature selection methods provide such statistical
guarantees Meinshausen and Bühlmann [2010], Barber and Candès [2015]. Although our framework