Spectrally-Corrected and Regularized LDA for Spiked Model

Hua Li^a, Wenya Luo^b, Zhidong Bai^c, Huanchao Zhou^c, Zhangni Pu^c
^a School of Science, Changchun University, China
^b School of Data Sciences, Zhejiang University of Finance and Economics, China
^c KLAS MOE and School of Mathematics and Statistics, Northeast Normal University, China
Abstract
This paper proposes an improved linear discriminant analysis called spectrally-corrected
and regularized LDA (SRLDA). This method integrates the design ideas of the sample
spectrally-corrected covariance matrix and regularized discriminant analysis. With
the support of a large-dimensional random matrix analysis framework, it is proved that
SRLDA has a globally optimal solution among linear classifiers under the spiked
model assumption. According to simulation data analysis, the SRLDA classifier performs
better than RLDA and ILDA and is closer to the theoretical classifier. Experiments
on different data sets show that the SRLDA algorithm performs better in classification
and dimensionality reduction than currently used tools.
Keywords: Spectrally-corrected method, Regularized technology
1. Introduction
Linear Discriminant Analysis (LDA) is a very common technique for dimensional-
ity reduction problems as a preprocessing step for machine learning and pattern classifi-
cation applications. As data collection and storage capacity improves, there has been an
increasing prevalence of high-dimensional data sets, including microarray gene expres-
sion data [47], gene expression pattern images [23], text documents [43], face images
[55], etc. Learning in high-dimensional spaces is challenging because data points are
far from each other in such spaces, and the similarities between data points are difficult
to compare and analyze [13]. This phenomenon is traditionally known as the curse of
dimensionality [60], which states that an enormous number of samples is required to
perform accurate predictions on problems with high dimensionality.
On the other hand, large-dimensional random matrix theory (LRMT) analysis tools
have gradually developed and matured, and their applications in fields such as statistics,
economics, computer science, signal processing, and so on, have gained increasing
recognition [22, 48, 4]. In LRMT, researchers study the asymptotic properties of random
matrices in the high-dimensional regime, where the dimensions of the random matrices
are extremely large or even infinite. The limiting results obtained in the infinite-dimensional
case approximate the more realistic finite-dimensional scenario well,
and this has been verified by many empirical results. LRMT has proven to be a valu-
able tool for analyzing high-dimensional systems in a variety of applications, including
signal processing, data analysis, and machine learning. In this paper, we provide an im-
proved linear discriminant analysis for spiked models with LRMT.
The statistics problem treated here is assigning a $p$-dimensional observation $x =
(x_1, x_2, \ldots, x_p)^T$ into one of two classes or groups $\Pi_i$ ($i = 0, 1$). The classes are assumed
to have Gaussian distributions with the same covariance matrix. Since R.A. Fisher
[19] originally proposed Linear Discriminant Analysis (LDA) based classifiers in 1936,
Fisher’s LDA has been one of the most classical techniques for classification tasks
and is still used routinely in applications. We employ separate (stratified) sampling:
$n = n_0 + n_1$ sample points are collected to constitute the sample $C$ in $\mathbb{R}^p$, where, given
$n$, the sizes $n_0$ and $n_1$ are determined (not random), and where $C_0 = \{x_1, x_2, \ldots, x_{n_0}\}$ and
$C_1 = \{x_{n_0+1}, x_{n_0+2}, \ldots, x_n\}$ are randomly selected from populations $\Pi_0$ and $\Pi_1$, respectively.
Under the support of the labeled sample sets, Fisher's discriminant rule [28] with two
multivariate normal populations directs us to allocate $x$ into $\Pi_0$ if
$$W_{\mathrm{LDA}}(\bar{x}_0, \bar{x}_1, S, x) = \left(x - \frac{\bar{x}_0 + \bar{x}_1}{2}\right)^T S^{-1} (\bar{x}_0 - \bar{x}_1) - \log\frac{\pi_1}{\pi_0} > 0, \qquad (1.1)$$
and into $\Pi_1$ otherwise, where $\pi_i$ is the prior probability corresponding to $\Pi_i$ ($i = 0, 1$).
Here $\bar{x}_0$ and $\bar{x}_1$ are the sample mean vectors of the two classes, and $S = \frac{n_0}{n} S_0 + \frac{n_1}{n} S_1$, in
which $S_i$ is the sample covariance matrix of $C_i$. LDA has a long and successful history.
Since its first use in taxonomic classification [19], Fisher-LDA-based classification
and recognition systems have been used in many applications, including detection [58],
speech recognition [57], cancer genomics [32], and face recognition [55].
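To make rule (1.1) concrete, the following is a minimal sketch (not the paper's code) that evaluates the discriminant $W_{\mathrm{LDA}}$ on simulated Gaussian data; the means, priors, and sample sizes below are illustrative assumptions.

```python
import numpy as np

def w_lda(xbar0, xbar1, S, x, pi0=0.5, pi1=0.5):
    """Fisher discriminant statistic of (1.1); allocate x to Pi_0 when W > 0."""
    centered = x - (xbar0 + xbar1) / 2.0
    # (x - (xbar0 + xbar1)/2)^T S^{-1} (xbar0 - xbar1) - log(pi1 / pi0)
    return centered @ np.linalg.solve(S, xbar0 - xbar1) - np.log(pi1 / pi0)

rng = np.random.default_rng(0)
p, n0, n1 = 5, 200, 200
mu0, mu1 = np.zeros(p), np.full(p, 1.5)
X0 = rng.normal(size=(n0, p)) + mu0          # class Pi_0 sample, N(mu0, I)
X1 = rng.normal(size=(n1, p)) + mu1          # class Pi_1 sample, N(mu1, I)
xbar0, xbar1 = X0.mean(axis=0), X1.mean(axis=0)
n = n0 + n1
# Pooled covariance S = (n0/n) S0 + (n1/n) S1
S = (n0 * np.cov(X0, rowvar=False) + n1 * np.cov(X1, rowvar=False)) / n
print(w_lda(xbar0, xbar1, S, mu0) > 0)   # a point at mu0 is allocated to Pi_0
print(w_lda(xbar0, xbar1, S, mu1) > 0)   # a point at mu1 is not (W < 0)
```

With well-separated means and ample samples, the statistic is positive near $\mu_0$ and negative near $\mu_1$, as the rule requires.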
In the classical asymptotic regime, $W_{\mathrm{LDA}}(\bar{x}_0, \bar{x}_1, S, x)$ is a consistent estimate
of Fisher's LDA function. However, it is generally unhelpful in situations where the
dimensionality of the observations is of the same order of magnitude as the sample
size ($n \to \infty$, $p \to \infty$, and $p/n \to J > 0$). In this asymptotic scenario, the Fisher
discriminant rule performs poorly because the sample covariance matrix diverges severely
from the population covariance matrix. Papers developing different approaches to handling the
high-dimensionality issue in estimating the covariance matrix can be divided into two
schools. The first suggests building additional knowledge into the estimation
process, such as sparseness, a graph model, or a factor model [8, 9, 16, 31, 49, 51, 52].
The second recommends correcting the spectrum of the sample covariance matrix [37], such
as the shrinkage estimators in [14, 35, 36] and the regularization techniques in [6, 15, 54, 62].
The spectrally-corrected and regularized LDA (SRLDA) given in this paper belongs to
the second school; it integrates the design ideas of the sample spectrally-corrected
covariance matrix [37] and regularized discriminant analysis [21] to improve
LDA estimation in high-dimensional settings.
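To fix ideas, the second school can be illustrated by a generic linear shrinkage toward a scaled identity, in the spirit of [14, 35, 36]; the specific form below is a common textbook choice assumed for illustration, not an estimator from this paper.

```python
import numpy as np

def linear_shrinkage(S, gamma):
    """Shrink the sample covariance toward a scaled identity:
    S(gamma) = (1 - gamma) S + gamma (tr(S)/p) I_p.
    This pulls extreme sample eigenvalues toward their average and
    stabilizes S^{-1} when p is comparable to n."""
    p = S.shape[0]
    return (1 - gamma) * S + gamma * (np.trace(S) / p) * np.eye(p)

rng = np.random.default_rng(1)
p, n = 100, 120                      # p comparable to n: S is ill-conditioned
X = rng.normal(size=(n, p))          # true covariance is I_p
S = np.cov(X, rowvar=False)
S_shrunk = linear_shrinkage(S, 0.5)
print(np.linalg.cond(S), np.linalg.cond(S_shrunk))  # shrinkage improves conditioning
```

The condition number of the shrunk matrix is far smaller, which is exactly why inverting a spectrum-corrected estimate (rather than $S$ itself) helps the discriminant rule when $p/n$ is not small.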
This paper proposes a novel approach under the assumption of the spiked covari-
ance model in [29] that all but a finite number of eigenvalues of the population co-
variance matrix are the same. This model could be, and has been, used in many real
applications, such as detection and speech recognition [24, 29], mathematical finance
[33, 34, 38, 44, 46], wireless communication [56], the physics of mixtures [53], EEG signals
[12, 18], and data analysis and statistical learning [25]. Based on theoretical
and applied results on the spiked covariance model ([3, 5, 7, 30, 45]), we suppose the
population eigenvalues are estimated reasonably. We then consider a class of covariance matrix
estimators that follow the same spiked structure, written as a finite-rank perturbation of
a scaled identity matrix. The sample eigenvectors of the extreme sample eigenvalues
provide the directions of the low-rank perturbation, and the corresponding eigenvalues
are corrected to the population ones and regularized by a common parameter. In this way,
compared with [54], we not only retain the spiked structure as much as possible but also
reduce the number of undetermined parameters. In addition, fewer parameters make it
possible for this algorithm to be designed for data dimensionality reduction.
This paper selects design parameters that minimize the limit of the misclassification
rate; they can be obtained in closed form without standard
cross-validation methods, thus providing lower complexity and higher classification
performance. Furthermore, the computational cost is also reduced compared to the classic
R-LDA in [15]. Through a comparative analysis of different data sets, our proposed
classifier outperforms other popular classification techniques, such as the improved LDA
(I-LDA) in [54], the support vector machine (SVM), k-nearest neighbors (KNN), and a
fully connected neural network.
The remainder of this article is organized as follows: Section 2 introduces the binary
classification problem and the basic form of the SRLDA classifier. Section 3
presents a consistent estimate of the true error and a parameter optimization method.
The SRLDA classifier with an optimal intercept is developed in Section 4. Section 5
extends the spectrally-corrected and regularized classification method to multi-class
problems. Section 6 compares the algorithms on simulated and real data.
The final section concludes.
2. Binary Classification and SRLDA Classifier
On the basis of the analysis framework in [62], some modeling assumptions are
restated and made throughout the paper. First, we have separately sampled binary
classification sample sets $C_0$ and $C_1$ as given in Section 1. Separate sampling is very common
in biomedical applications, where the data from each class are collected without reference
to the other, for instance, to discriminate two types of tumors or to distinguish a normal
from a pathological phenotype.
A second assumption is that the classes possess a common covariance matrix: $\Pi_i$
follows a multivariate Gaussian distribution $N(\mu_i, \Sigma)$ for $i = 0, 1$, where $\Sigma$ is nonsingular.
Although this does not fully correspond to reality, LDA still often performs
better than quadratic discriminant analysis (QDA), even with different
covariances, because of the advantage of its estimation method [50].
In this paper, we improve the LDA classifier in a particular scenario, as the third
assumption, wherein $\Sigma$ takes the following particular form [3]:
$$\Sigma = \sigma^2 \Big( I_p + \sum_{j \in I} \lambda_j v_j v_j^T \Big), \qquad (2.1)$$
where $I := I_1 \cup I_2$, $I_1 := \{1, \ldots, r_1\}$, $I_2 := \{-r_2, \ldots, -1\}$, $\sigma^2 > 0$, $r = r_1 + r_2$, $\lambda_1 \geq \cdots \geq
\lambda_{r_1} > 0 > \lambda_{-r_2} \geq \cdots \geq \lambda_{-1} > -1$ (with $\lambda_j = \lambda_{p+j+1}$ for $j \in I_2$), $I_p$ is the $p \times p$ identity matrix,
and the $v_i$ ($i \in I$) are orthonormal. This is the spiked model defined in this article.
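A small numerical sketch of the structure in (2.1) follows; the spike values, $r_1 = 2$, $r_2 = 1$, and $\sigma^2 = 1$ are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Spiked covariance as in (2.1): Sigma = sigma^2 (I_p + sum_j lambda_j v_j v_j^T),
# with r1 = 2 large spikes (lambda > 0) and r2 = 1 small spike (-1 < lambda < 0).
rng = np.random.default_rng(2)
p, sigma2 = 50, 1.0
spikes = np.array([8.0, 3.0, -0.6])
# Orthonormal directions v_j: columns of a random semi-orthogonal matrix
V, _ = np.linalg.qr(rng.normal(size=(p, len(spikes))))
Sigma = sigma2 * (np.eye(p) + (V * spikes) @ V.T)   # V * spikes scales column j by lambda_j

# Eigenvalues of Sigma are sigma^2 (1 + lambda_j) in the spiked directions
# and sigma^2 in the remaining p - r directions.
eigs = np.sort(np.linalg.eigvalsh(Sigma))[::-1]
print(eigs[:2])    # the two large spikes: 1 + 8 = 9 and 1 + 3 = 4
print(eigs[-1])    # the small spike: 1 - 0.6 = 0.4, hidden below the bulk at 1
```

The small spike sits below the common eigenvalue $\sigma^2$, which is exactly the "hidden in the noise" case discussed below.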
The spiked model is often used to detect the number of signals embedded in noise
[40, 41, 61, 18], or the number of factors in financial econometrics [2, 42], where the
number of signals (or factors) corresponds to the number of large spiked eigenvalues,
while small spiked eigenvalues are often ignored. It should be pointed out that there
are currently few research papers discussing small spiked eigenvalues.
We believe this is mainly due to two reasons. First, small spiked eigenvalues
have no actual physical meaning in signal detection problems, so they are often ignored;
second, there are not many pioneering contributions to the theory of small spiked
eigenvalues, so the topic has not yet aroused widespread interest among scholars.
However, in real data analysis, we found that small spiked eigenvalues can play
a unique role in algorithm improvement, because they are more deeply hidden in the noise.
By introducing the assumption of small spiked eigenvalues in (2.1), the modeling
effect can be greatly improved.
Under the high-dimensional random matrix theory framework, population parameters
such as the number of spiked eigenvalues and the estimators of the spiked eigenvalues,
the main parameters of this model, have been studied extensively and in depth
[3, 17, 30, 27, 18]. These works have greatly improved the ability to estimate the
population covariance matrix and to adapt to different data backgrounds. For the sake of
simplicity, we assume that $\sigma^2$, $r_1$, $r_2$, and $\lambda_i$ ($i \in I$) are perfectly known. In our
numerical simulations, we have used the methods of [3, 27, 30] to estimate the unknown
parameters in (2.1).
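As a rough illustration of why such estimation is feasible, large spiked eigenvalues separate from the Marchenko-Pastur bulk of the sample spectrum; the edge-thresholding rule and the noise-level estimate `sigma2_hat` below are simplifications assumed for this sketch, not the estimators of [3, 27, 30].

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 200, 800
c = p / n

# Simulated spiked population: sigma^2 = 1 with one large spike lambda = 10.
v, _ = np.linalg.qr(rng.normal(size=(p, 1)))
Sigma = np.eye(p) + 10.0 * (v @ v.T)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
S = np.cov(X, rowvar=False)
samp_eigs = np.sort(np.linalg.eigvalsh(S))[::-1]

# Rough noise-level estimate and the upper Marchenko-Pastur bulk edge
# sigma^2 (1 + sqrt(c))^2; eigenvalues beyond the edge are flagged as spikes.
sigma2_hat = np.trace(S) / p
bulk_edge = sigma2_hat * (1 + np.sqrt(c)) ** 2
n_spikes = int(np.sum(samp_eigs > bulk_edge))
print(n_spikes)  # the single planted spike escapes the bulk
```

The sample eigenvalue of the planted spike lands well above the bulk edge, while the noise eigenvalues stay inside the bulk, so counting edge-crossers recovers the number of (large) spikes in this simple setting.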
In this paper, the spectrally-corrected method corrects the spectral elements of
the sample covariance matrix to those of $\Sigma$. Start with the eigendecomposition of the pooled