* All correspondence to Amir.moslemi@ryerson.ca
** Email address: arash.ahmadian@mail.utoronto.ca
Subspace Learning for Feature Selection via Rank Revealing
QR Factorization: Unsupervised and Hybrid Approaches
with Non-negative Matrix Factorization and Evolutionary
Algorithm
Amir Moslemi a,b,*, Arash Ahmadian c,**
a Toronto Metropolitan University, Department of Physics, Toronto, Ontario, Canada
b University of Toronto, Department of Medical Biophysics, Toronto, Ontario, Canada
c University of Toronto, Edward S. Rogers Sr. Department of Electrical and Computer Engineering, Toronto,
Ontario, Canada
Abstract
The selection of the most informative and discriminative features from high-dimensional data is an important topic in machine learning and data engineering. Matrix factorization-based techniques, such as non-negative matrix factorization, have recently emerged as a popular approach to feature selection. The main goal of feature selection using matrix factorization is to extract a subspace that approximates the original space in a lower dimension. In this study, rank revealing QR (RRQR) factorization, which is computationally cheaper than singular value decomposition (SVD), is leveraged as a novel unsupervised feature selection technique to obtain the most informative features. This technique uses the permutation matrix of QR for feature selection, a property unique to this factorization method. Moreover, QR factorization is embedded into the non-negative matrix factorization (NMF) objective function as a new unsupervised feature selection method. Lastly, a hybrid feature selection algorithm is proposed by coupling RRQR, as a filter-based technique, with a genetic algorithm as a wrapper-based technique. In this method, redundant features are removed using RRQR factorization, and the most discriminative subset of features is selected using the genetic algorithm. The proposed algorithms are shown to be dependable and robust when compared against state-of-the-art feature selection algorithms in supervised, unsupervised, and semi-supervised settings. All methods are tested on seven publicly available microarray datasets using KNN, SVM, and C4.5 classifiers. The experimental results show that the proposed methods are comparable with state-of-the-art feature selection techniques on the evaluation metrics.
Key words: Feature selection, Rank Revealing QR factorization, Genetic Algorithm
1. Introduction
Challenges surrounding high-dimensional data have been strikingly noticeable since the emergence of the internet and the rapid development of information technology. The curse of dimensionality has become prominent in machine learning, pattern recognition, data mining, computer vision, and recommender systems. Consequently, dimensionality reduction techniques have developed rapidly. Feature selection and feature extraction are the two main categories of dimensionality reduction. Feature extraction aims to map data to a lower-dimensional representation, whereas feature selection aims to obtain the most discriminative features of high-dimensional data, which can be used to interpret medical, industrial, or financial data.
Feature selection is particularly important for medical data, since the selected features can be used in disease diagnosis or treatment optimization. To this end, a line of research in bioinformatics and pattern recognition is devoted to high-dimensional DNA microarray datasets. These datasets have significantly more features than sample points and thus must be treated as underdetermined systems [1,2]. Although thousands of genes from different samples are stored in a DNA microarray, only a small portion of the encoded information is related to disease [3]. The gene selection problem in bioinformatics is equivalent to the feature selection task in machine learning.
In the context of disease diagnosis and progression analysis using feature selection, demographic, spirometry, and computed tomography (CT) features have been collected and combined to predict chronic obstructive pulmonary disease (COPD) progression, exacerbation, and hospitalization. Extracting the most prominent and discriminative features improved the accuracy of hospitalization prediction and the discrimination between COPD and asthma [4-6].
Features with no effect on the output are considered irrelevant, while redundant features are combinations of other features that add no information. Including irrelevant and redundant features increases the complexity of the system, which degrades the performance of the learning algorithm and increases computation time [7].
Feature selection is utilized for dimensionality reduction and to obtain the most informative subset of features by minimizing redundancy and maximizing relevancy, with the aim of increasing accuracy and decreasing computation time. Feature selection decreases task complexity and thus decreases the probability of overfitting [8]. The peaking phenomenon states that, for a fixed sample size, the error of a classifier first decreases as features are added but increases again beyond an optimal number of features [9]. There are three strategies for feature selection: A) filter-based, B) wrapper-based, and C) embedded techniques. In filter-based methods, features are sorted and assessed by some criterion derived from the statistical and intrinsic characteristics of the dataset; no classifier is involved. The Laplacian score [10], mutual information, and minimum redundancy maximum relevance (mRMR) are among the most well-known filter-based techniques [11-15]. In wrapper-based methods, features are coupled to the learning algorithm, so the best subset of features is selected based on the output of the model, as sketched below. Sequential forward selection, sequential backward selection, floating search [16], and recursive SVM [17] are conventional wrapper-based feature selection techniques.
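To make the wrapper idea concrete, the following is a minimal sketch of sequential forward feature selection, assuming a generic dataset (X, y); the KNN classifier, five-fold cross-validation, and the helper name `forward_selection` are illustrative assumptions, not the exact setup of the cited methods.

```python
# Minimal sketch of wrapper-based sequential forward feature selection.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, n_select):
    """Greedily add the feature that most improves CV accuracy."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_select):
        best_feat, best_score = None, -np.inf
        for f in remaining:
            # Evaluate the candidate subset with the learning algorithm:
            # this coupling is what makes the method wrapper-based.
            score = cross_val_score(
                KNeighborsClassifier(n_neighbors=5),
                X[:, selected + [f]], y, cv=5
            ).mean()
            if score > best_score:
                best_feat, best_score = f, score
        selected.append(best_feat)
        remaining.remove(best_feat)
    return selected
```

Because every candidate subset is re-evaluated by the classifier, wrapper methods are more expensive than filter methods but tailor the subset to the model.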
Based on the availability of label information, feature selection can be categorized into three types: supervised [18], semi-supervised [19], and unsupervised [20] methods. In the supervised setting, the relevance between features and the target is available, so the goal is to exploit the interaction between features and classes to find the most discriminative features. In the semi-supervised setting, only part of the data is labeled, and information is extracted from both the labeled and unlabeled data. Most semi-supervised techniques leverage a similarity matrix and select features through Laplacian graph (graph structure) construction [20]. In the unsupervised case there is no label information, so feature selection must be driven by the relevancy among the features themselves. This type of feature selection is most often cast as subspace learning, with the aim of finding a subspace that spans the original data [21]. The notations and the list of abbreviations that are frequently used in this study are shown in Table 1 and Table 2.
Table 1. Notation used in the study.

| Notation | Representation |
| --- | --- |
| $m$ | Number of instances |
| $n$ | Number of features |
| $\sigma_i$ | Singular value of a matrix |
| $I$ | Identity matrix |
| $u$ | Lower case letters denote vectors |
| $A_{ij}$ | The $(i, j)$-th entry of matrix $A$ |
| $A^T$ | The transpose of matrix $A$ |
| $\mathrm{Diag}(A)$ | The vector of all diagonal elements of matrix $A$ |
| $\mathrm{Tr}(A)$ | The trace of square matrix $A$ |
| $\det(A)$ | Determinant of a matrix |
| $\|A\|_F$ | Frobenius norm of matrix $A$ |
| $\|u\|_p$ | Vector norm ($\ell_p$) |
| $\|A\|_{2,1}$ | Mixed matrix norm ($\ell_{2,1}$) |
Recently, feature selection using matrix decomposition and subspace learning has received significant attention. Dealing with noisy information and redundant features is the main challenge for subspace learning. To circumvent this challenge, different regularization frameworks have been proposed to decrease the effect of noise and redundant information. Non-negative matrix factorization (NMF) combined with additional regularization terms is a well-known technique in this domain. Sparse regularization terms such as $\ell_1$, $\ell_{2,1}$, and $\ell_{2,p}$ $(0 < p \le 1)$ were added to increase the sparsity of the solution [28,30,54,55]. However, information can be degraded or lost after projecting to the subspace. To ameliorate this problem, local structure learning regularization functions, based on the graph Laplacian concept, were introduced. These functions preserve the geometrical information of the data in the subspace under the assumption that samples that are close in the original space remain close in the subspace [31, 32, 53, 54]; a sketch of such a regularizer follows.
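As an illustration, the following is a minimal sketch that builds a k-nearest-neighbor graph Laplacian and evaluates the local-structure penalty $\mathrm{Tr}(HLH^T)$ on a low-dimensional representation; the neighborhood size `k` and heat-kernel width `sigma` are hypothetical choices for illustration, not values used in the cited works.

```python
# Minimal sketch of a graph-Laplacian local-structure regularizer:
# Tr(H L H^T) is small when samples that are neighbors in the original
# space have similar columns in the low-dimensional representation H.
import numpy as np
from sklearn.neighbors import kneighbors_graph

def laplacian_regularizer(X, H, k=5, sigma=1.0):
    """X: (m samples x n features); H: (d x m), one column per sample."""
    # k-NN graph with heat-kernel affinities on neighbor distances.
    D = kneighbors_graph(X, n_neighbors=k, mode='distance').toarray()
    W = np.where(D > 0, np.exp(-D**2 / (2 * sigma**2)), 0.0)
    W = np.maximum(W, W.T)                  # symmetrize the graph
    L = np.diag(W.sum(axis=1)) - W          # unnormalized Laplacian
    return np.trace(H @ L @ H.T)
```

Adding this term to a factorization objective penalizes subspace representations that tear apart neighborhoods of the original data.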
Since feature selection is a combinatorial NP-hard problem, evolutionary computation techniques can be utilized to tackle it. To this end, genetic algorithms [38,46], particle swarm optimization (PSO) [39,47], ant colony optimization [40,48], and grey wolf optimization [41] have been applied to the feature selection task to obtain the most informative features (genes). The definition of the cost function is one of the most important issues in feature selection using evolutionary techniques and requires careful consideration. The vast majority of feature selection cost functions are constructed from two well-known objectives: the number of selected features and the classification error (a sketch follows this paragraph). Furthermore, many-objective feature selection methods have been proposed to enhance classification accuracy [56]; such a cost function encompasses four objectives: the number of features, the classification error, the correlation between features and labels, and the computational complexity of the features.
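A minimal sketch of such a two-objective fitness function is given below; the SVM classifier, cross-validation setup, and weight `beta` are illustrative assumptions rather than the exact configuration of the cited methods.

```python
# Minimal sketch of a two-objective GA fitness for feature selection:
# classification error plus a penalty on the size of the feature subset.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def fitness(chromosome, X, y, beta=0.01):
    """chromosome: binary mask over features (1 = selected). Lower is better."""
    mask = np.asarray(chromosome, dtype=bool)
    if not mask.any():                      # empty subsets are invalid
        return np.inf
    error = 1.0 - cross_val_score(SVC(), X[:, mask], y, cv=5).mean()
    return error + beta * mask.sum() / mask.size
```

The GA then evolves a population of binary masks, selecting and recombining chromosomes with low fitness values.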
Table 2. List of abbreviations used throughout this article.

| Complete Form | Abbreviation |
| --- | --- |
| Non-negative matrix factorization | NMF |
| Maximum projection and minimum redundancy feature selection | MPMR |
| Rescaled linear square regression | RLSR |
| Dual regularized unsupervised feature selection based on matrix factorization and minimum redundancy | DR-FS-MFMR |
| Concave-convex procedure | CCCP |
| Dual-graph sparse non-negative matrix factorization | DSNMF |
| Hesitant fuzzy set | HFS |
| Reduced row echelon | RRE |
| Singular value decomposition | SVD |
| Particle swarm optimization | PSO |
| Genetic algorithm | GA |
| Rank revealing QR | RRQR |
| Rank revealing QR - genetic algorithm | QR-GA |
| Principal component analysis | PCA |
| Non-negative matrix factorization QR | NMF-QR |
In all feature selection techniques, including NMF-based and combinatorial methods, computational complexity is a major consideration. For instance, the complexity of SVD is $O(\min(mn^2, m^2n))$, while NMF costs $O(mnk)$ per iteration ($k$ is the number of selected features). To this end, a
comparison between SVD-based principal component analysis (PCA) and QR-based PCA was performed, and the results showed that QR-based PCA is more efficient than SVD-based PCA in terms of computational complexity [57]. Additionally, these feature selection techniques have many hyper-parameters which must be tuned to achieve the best performance. Consequently, the main motivation behind this work is a matrix-based technique with low computational complexity and a minimum of hyper-parameters; to the best of our knowledge, this is the first work to utilize QR factorization for feature selection. Following RRQR, two hybrid techniques are proposed: A) an unsupervised NMF-QR feature selection method, which embeds QR into NMF, and B) a supervised hybrid feature selection method in which a genetic algorithm, as an evolutionary algorithm, is combined with rank revealing QR factorization, coupling a filter policy with a wrapper policy. A sketch of the core RRQR ranking idea follows.
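As a preview of that core idea, the following is a minimal sketch of RRQR-style feature ranking using column-pivoted QR as provided by `scipy.linalg.qr`; the helper `rrqr_rank_features` is a hypothetical interface for illustration, not the exact algorithm detailed in Section 3.

```python
# Minimal sketch of RRQR-style feature ranking via column-pivoted QR.
# Column pivoting greedily orders columns (features) by how much new
# subspace energy each contributes; the permutation is the ranking.
import numpy as np
from scipy.linalg import qr

def rrqr_rank_features(X, n_select):
    """X: (samples x features). Returns indices of the top features."""
    # pivoting=True returns a permutation P such that X[:, P] = Q @ R
    # with the diagonal of R decreasing in magnitude, i.e. the most
    # informative columns come first.
    _, _, P = qr(X, mode='economic', pivoting=True)
    return P[:n_select]
```

Because only one pivoted QR factorization of the data matrix is required and no hyper-parameters are introduced, this ranking is cheap compared with SVD- or NMF-based alternatives.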
The contributions of this paper can be summarized as follows.
• For the first time, rank revealing QR matrix factorization is proposed to find the most informative features of high-dimensional data.
• NMF and QR are combined as an unsupervised feature selection method.
• A genetic algorithm is combined with rank revealing QR matrix factorization to obtain the optimum number of features and to tune the hyper-parameters of the classifier simultaneously with feature selection. Moreover, to prevent premature convergence to local optima, the quality of the search space for the genetic algorithm is considerably enhanced by using rank revealing QR matrix factorization as the filter phase of the hybrid feature selection.
The remainder of this paper is arranged as follows. Several feature selection methods are surveyed in Section 2. Section 3 briefly reviews matrix decomposition; Sections 3.1.2-3.1.3 present RRQR and NMF-QR, and Section 3.1.4 covers the hybrid QR-GA feature selection. Results of the proposed methods and their comparison with state-of-the-art methods are shown in Section 5. Finally, Sections 7 and 8 draw the discussion and conclusion.
2. Related work
Matrix-based techniques, which are deeply rooted in linear algebra, have played an important role in the feature selection task. Subspace learning plays a significant role in obtaining a low-dimensional representation of the original data while preserving maximum information. To this end, non-negative matrix factorization (NMF) was formulated for feature selection by minimizing the distance between the subspace and the original space of the matrix [22]. In the theory of NMF, a non-negative matrix $X \in \mathbb{R}^{m \times n}$ can be approximated by two non-negative matrices $W$ and $H$ such that $X \approx WH$. Accordingly, NMF feature selection was defined as follows:

$$\min_{W, H} \; \|X - XWH\|_F^2 + \frac{\alpha}{2}\,\|W^T W - I\|_F^2 \quad \text{s.t.} \quad W \ge 0, \; H \ge 0 \qquad (1)$$

where $\alpha$ is a positive penalty coefficient, and this cost function is solved by alternately fixing $W$ and $H$ in an iterative algorithm. $W \in \mathbb{R}^{n \times k}$ is an indicator matrix (feature weight matrix), and the importance of each feature can be obtained from the Euclidean norm of the corresponding row of $W$; $H \in \mathbb{R}^{k \times n}$ is the representation matrix of the data. A sketch of this scheme is given below.
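The following is a minimal sketch of feature selection in the spirit of Eq. (1) using standard multiplicative updates; the iteration count, penalty `alpha`, and the small constant `eps` added for numerical stability are illustrative choices, not the exact settings of [22].

```python
# Minimal sketch of NMF-based feature selection per Eq. (1):
# min ||X - XWH||_F^2 + (alpha/2)||W^T W - I||_F^2, with W, H >= 0,
# solved by alternating multiplicative updates.
import numpy as np

def nmf_feature_select(X, k, alpha=1.0, n_iter=200, eps=1e-10):
    """X: (m samples x n features). Returns the top-k feature indices."""
    m, n = X.shape
    rng = np.random.default_rng(0)
    W = rng.random((n, k))                  # feature weight (indicator) matrix
    H = rng.random((k, n))                  # representation matrix
    XtX = X.T @ X
    for _ in range(n_iter):
        # Multiplicative updates derived from the gradients of Eq. (1);
        # eps keeps the denominators strictly positive.
        W *= (XtX @ H.T + alpha * W) / (XtX @ W @ H @ H.T
                                        + alpha * W @ W.T @ W + eps)
        H *= (W.T @ XtX) / (W.T @ XtX @ W @ H + eps)
    # Feature importance = Euclidean norm of the rows of W.
    scores = np.linalg.norm(W, axis=1)
    return np.argsort(scores)[::-1][:k]
```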
A new term, which guarantees minimum redundancy, was added to NMF feature selection to