Spectrally-Corrected and Regularized LDA for Spiked Model

Hua Li^a, Wenya Luo^b, Zhidong Bai^c, Huanchao Zhou^c, Zhangni Pu^c
^a School of Science, Changchun University, China
^b School of Data Sciences, Zhejiang University of Finance and Economics, China
^c KLAS MOE and School of Mathematics and Statistics, Northeast Normal University, China
Abstract
This paper proposes an improved linear discriminant analysis called spectrally-corrected
and regularized LDA (SRLDA). This method integrates the design ideas of the sample
spectrally-corrected covariance matrix and regularized discriminant analysis. With
the support of a large-dimensional random matrix analysis framework, it is proved that
SRLDA has a globally optimal solution among linear classifiers under the spiked
model assumption. According to simulation data analysis, the SRLDA classifier performs
better than RLDA and ILDA and is closer to the theoretical classifier. Experiments
on different data sets show that the SRLDA algorithm performs better in classification
and dimensionality reduction than currently used tools.
Keywords: Spectrally-corrected method, Regularized technology
1. Introduction
Linear Discriminant Analysis (LDA) is a very common technique for dimensional-
ity reduction problems as a preprocessing step for machine learning and pattern classifi-
cation applications. As data collection and storage capacity improves, there has been an
increasing prevalence of high-dimensional data sets, including microarray gene expres-
sion data [47], gene expression pattern images [23], text documents [43], face images
[55], etc. Learning in high-dimensional spaces is challenging because data points are
far from each other in such spaces, and the similarities between data points are difficult
to compare and analyze [13]. This phenomenon is traditionally known as the curse of
dimensionality [60], which states that an enormous number of samples is required to
perform accurate predictions on problems with high dimensionality.
On the other hand, large-dimensional random matrix theory (LRMT) analysis tools
have gradually developed and matured, and their applications in fields such as statistics,
economics, computer science, signal processing, and so on, have gained increasing
recognition [22, 48, 4]. In LRMT, researchers study the asymptotic properties of random
matrices in the high-dimensional regime, where the dimensions of the random matrices
are extremely large or even infinite. The limiting results obtained in the infinite-dimensional
case approximate the more realistic finite-dimensional scenario well,
and this has been verified by many empirical results. LRMT has proven to be a valu-
able tool for analyzing high-dimensional systems in a variety of applications, including
signal processing, data analysis, and machine learning. In this paper, we provide an im-
proved linear discriminant analysis for spiked models with LRMT.
The statistics problem treated here is assigning a $p$-dimensional observation $x =
(x_1, x_2, \ldots, x_p)^T$ into one of two classes or groups $\Pi_i$ ($i = 0, 1$). The classes are assumed
to have Gaussian distributions with the same covariance matrix. Since R.A. Fisher
[19] originally proposed Linear Discriminant Analysis (LDA) based classifiers in 1936,
Fisher’s LDA has been one of the most classical techniques for classification tasks
and is still used routinely in applications. We employ separate (stratified) sampling:
$n = n_0 + n_1$ sample points are collected to constitute the sample $C$ in $\mathbb{R}^p$, where, given
$n$, the sizes $n_0$ and $n_1$ are determined (not random), and where $C_0 = \{x_1, x_2, \ldots, x_{n_0}\}$ and
$C_1 = \{x_{n_0+1}, x_{n_0+2}, \ldots, x_n\}$ are randomly selected from populations $\Pi_0$ and $\Pi_1$, respectively.
Under the support of the labeled sample sets, Fisher's discriminant rule [28] with two
multivariate normal populations directs us to allocate $x$ into $\Pi_0$ if
$$W_{\mathrm{LDA}}(\bar{x}_0, \bar{x}_1, S, x) = \left(x - \frac{\bar{x}_0 + \bar{x}_1}{2}\right)^T S^{-1} (\bar{x}_0 - \bar{x}_1) - \log\frac{\pi_1}{\pi_0} > 0, \qquad (1.1)$$
and into $\Pi_1$ otherwise, where $\pi_i$ is the prior probability corresponding to $\Pi_i$ ($i = 0, 1$).
Here $\bar{x}_0$ and $\bar{x}_1$ are the sample mean vectors of the two classes, and $S = \frac{n_0}{n} S_0 + \frac{n_1}{n} S_1$, in
which $S_i$ is the sample covariance matrix of $C_i$. LDA has a long and successful history.
Since its first use in taxonomic classification [19], Fisher-LDA-based classification
and recognition systems have been used in many applications, including detection [58],
speech recognition [57], cancer genomics [32], and face recognition [55].
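To make rule (1.1) concrete, the following is a minimal sketch (not the paper's code) that evaluates the discriminant $W_{\mathrm{LDA}}$ on simulated Gaussian data; the means, priors, and sample sizes below are illustrative assumptions.

```python
import numpy as np

def w_lda(xbar0, xbar1, S, x, pi0=0.5, pi1=0.5):
    """Fisher discriminant statistic of (1.1); allocate x to Pi_0 when W > 0."""
    centered = x - (xbar0 + xbar1) / 2.0
    # (x - (xbar0 + xbar1)/2)^T S^{-1} (xbar0 - xbar1) - log(pi1 / pi0)
    return centered @ np.linalg.solve(S, xbar0 - xbar1) - np.log(pi1 / pi0)

rng = np.random.default_rng(0)
p, n0, n1 = 5, 200, 200
mu0, mu1 = np.zeros(p), np.full(p, 1.5)
X0 = rng.normal(size=(n0, p)) + mu0          # class Pi_0 sample, N(mu0, I)
X1 = rng.normal(size=(n1, p)) + mu1          # class Pi_1 sample, N(mu1, I)
xbar0, xbar1 = X0.mean(axis=0), X1.mean(axis=0)
n = n0 + n1
# Pooled covariance S = (n0/n) S0 + (n1/n) S1
S = (n0 * np.cov(X0, rowvar=False) + n1 * np.cov(X1, rowvar=False)) / n
print(w_lda(xbar0, xbar1, S, mu0) > 0)   # a point at mu0 is allocated to Pi_0
print(w_lda(xbar0, xbar1, S, mu1) > 0)   # a point at mu1 is not (W < 0)
```

With well-separated means and ample samples, the statistic is positive near $\mu_0$ and negative near $\mu_1$, as the rule requires.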
In the classical asymptotic regime, $W_{\mathrm{LDA}}(\bar{x}_0, \bar{x}_1, S, x)$ is a consistent estimate
of Fisher's LDA function. However, it is generally unhelpful in situations where the
dimensionality of the observations is of the same order of magnitude as the sample
size ($n \to \infty$, $p \to \infty$, and $p/n \to J > 0$). In this asymptotic scenario, the Fisher
discriminant rule performs poorly because the sample covariance matrix diverges severely
from the population covariance matrix. Papers developing different approaches to handling the
high-dimensionality issue in estimating the covariance matrix can be divided into two
schools. The first suggests building additional knowledge into the estimation
process, such as sparseness, a graph model, or a factor model [8, 9, 16, 31, 49, 51, 52].
The second recommends correcting the spectrum of the sample covariance matrix [37], such
as the shrinkage estimators in [14, 35, 36] and the regularization techniques in [6, 15, 54, 62].
The spectrally-corrected and regularized LDA (SRLDA) given in this paper belongs to
the second school; it integrates the design ideas of the sample spectrally-corrected
covariance matrix [37] and regularized discriminant analysis [21] to improve
LDA estimation in high-dimensional settings.
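To fix ideas, the second school can be illustrated by a generic linear shrinkage toward a scaled identity, in the spirit of [14, 35, 36]; the specific form below is a common textbook choice assumed for illustration, not an estimator from this paper.

```python
import numpy as np

def linear_shrinkage(S, gamma):
    """Shrink the sample covariance toward a scaled identity:
    S(gamma) = (1 - gamma) S + gamma (tr(S)/p) I_p.
    This pulls extreme sample eigenvalues toward their average and
    stabilizes S^{-1} when p is comparable to n."""
    p = S.shape[0]
    return (1 - gamma) * S + gamma * (np.trace(S) / p) * np.eye(p)

rng = np.random.default_rng(1)
p, n = 100, 120                      # p comparable to n: S is ill-conditioned
X = rng.normal(size=(n, p))          # true covariance is I_p
S = np.cov(X, rowvar=False)
S_shrunk = linear_shrinkage(S, 0.5)
print(np.linalg.cond(S), np.linalg.cond(S_shrunk))  # shrinkage improves conditioning
```

The condition number of the shrunk matrix is far smaller, which is exactly why inverting a spectrum-corrected estimate (rather than $S$ itself) helps the discriminant rule when $p/n$ is not small.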
This paper proposes a novel approach under the assumption of the spiked covari-
ance model in [29] that all but a finite number of eigenvalues of the population co-
variance matrix are the same. This model could be, and has been, used in many real
applications, such as detection and speech recognition [24, 29], mathematical finance
[33, 34, 38, 44, 46], wireless communication [56], the physics of mixtures [53], EEG signals
[12, 18], and data analysis and statistical learning [25]. Based on theoretical
and applied results on the spiked covariance model ([3, 5, 7, 30, 45]), we suppose the
population eigenvalues are estimated reasonably. We then consider a class of covariance matrix
estimators that follow the same spiked structure, written as a finite-rank perturbation of
a scaled identity matrix. The sample eigenvectors of the extreme sample eigenvalues
provide the directions of the low-rank perturbation, and the corresponding eigenvalues
are corrected to the population ones and regularized by a common parameter. In this way,
compared with [54], we not only retain the spiked structure as much as possible but also
reduce the number of undetermined parameters. In addition, fewer parameters make it
possible for this algorithm to be designed for data dimensionality reduction.
This paper selects design parameters that minimize the limit of the misclassification
rate; they can be obtained in closed form without standard
cross-validation methods, thus providing lower complexity and higher classification
performance. Furthermore, the computational cost is also reduced compared to the classic
R-LDA in [15]. Through a comparative analysis of different data sets, our proposed
classifier outperforms other popular classification techniques, such as the improved LDA
(I-LDA) in [54], the support vector machine (SVM), k-nearest neighbors (KNN), and a
fully connected neural network.
The remainder of this article is organized as follows: Section 2 introduces the binary
classification problem and the basic form of the SRLDA classifier. Section 3
presents a consistent estimate of the true error and a parameter optimization method.
The SRLDA classifier with an optimal intercept is developed in Section 4. Section 5
extends the spectrally-corrected and regularized classification method to multi-class
problems. Section 6 compares the algorithms on simulated and real data.
The final section concludes.
2. Binary Classification and SRLDA Classifier
On the basis of the analysis framework in [62], some modeling assumptions are
restated and made throughout the paper. First, we have separately sampled binary
classification sample sets $C_0$ and $C_1$ as given in Section 1. Separate sampling is very common
in biomedical applications, where the data from each class are collected without reference
to the other, for instance, to discriminate two types of tumors or to distinguish a normal
from a pathological phenotype.
A second assumption is that the classes possess a common covariance matrix: $\Pi_i$
follows a multivariate Gaussian distribution $N(\mu_i, \Sigma)$ for $i = 0, 1$, where $\Sigma$ is nonsingular.
Although this does not fully correspond to reality, LDA still often performs
better than quadratic discriminant analysis (QDA), even with different
covariances, because of the advantage of its estimation method [50].
In this paper, we improve the LDA classifier in a particular scenario, as the third
assumption, wherein $\Sigma$ takes the following particular form [3]:
$$\Sigma = \sigma^2 \Big( I_p + \sum_{j \in I} \lambda_j v_j v_j^T \Big), \qquad (2.1)$$
where $I := I_1 \cup I_2$, $I_1 := \{1, \ldots, r_1\}$, $I_2 := \{-r_2, \ldots, -1\}$, $\sigma^2 > 0$, $r = r_1 + r_2$, $\lambda_1 \geq \cdots \geq
\lambda_{r_1} > 0 > \lambda_{-r_2} \geq \cdots \geq \lambda_{-1} > -1$ (with $\lambda_j = \lambda_{p+j+1}$ for $j \in I_2$), $I_p$ is the $p \times p$ identity matrix,
and the $v_i$ ($i \in I$) are orthonormal. This is the spiked model defined in this article.
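A small numerical sketch of the structure in (2.1) follows; the spike values, $r_1 = 2$, $r_2 = 1$, and $\sigma^2 = 1$ are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Spiked covariance as in (2.1): Sigma = sigma^2 (I_p + sum_j lambda_j v_j v_j^T),
# with r1 = 2 large spikes (lambda > 0) and r2 = 1 small spike (-1 < lambda < 0).
rng = np.random.default_rng(2)
p, sigma2 = 50, 1.0
spikes = np.array([8.0, 3.0, -0.6])
# Orthonormal directions v_j: columns of a random semi-orthogonal matrix
V, _ = np.linalg.qr(rng.normal(size=(p, len(spikes))))
Sigma = sigma2 * (np.eye(p) + (V * spikes) @ V.T)   # V * spikes scales column j by lambda_j

# Eigenvalues of Sigma are sigma^2 (1 + lambda_j) in the spiked directions
# and sigma^2 in the remaining p - r directions.
eigs = np.sort(np.linalg.eigvalsh(Sigma))[::-1]
print(eigs[:2])    # the two large spikes: 1 + 8 = 9 and 1 + 3 = 4
print(eigs[-1])    # the small spike: 1 - 0.6 = 0.4, hidden below the bulk at 1
```

The small spike sits below the common eigenvalue $\sigma^2$, which is exactly the "hidden in the noise" case discussed below.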
The spiked model is often used to detect the number of signals embedded in noise
[40, 41, 61, 18], or the number of factors in financial econometrics [2, 42], where the
number of signals (or factors) corresponds to the number of large spiked eigenvalues,
while small spiked eigenvalues are often ignored. It should be pointed out that there
are currently few research papers discussing small spiked eigenvalues.
We believe this is mainly due to two reasons. First, small spiked eigenvalues
have no actual physical meaning in signal detection problems, so they are often ignored;
second, there are not many pioneering contributions to the theory of small spiked
eigenvalues, so the topic has not yet aroused widespread interest among scholars.
However, in real data analysis, we found that small spiked eigenvalues can play
a unique role in algorithm improvement, because they are more deeply hidden in the noise.
By introducing the assumption of small spiked eigenvalues in (2.1), the modeling
effect can be greatly improved.
Under the high-dimensional random matrix theory framework, population parameters
such as the number of spiked eigenvalues and the estimators of the spiked eigenvalues,
the main parameters of this model, have been studied extensively and in depth
[3, 17, 30, 27, 18]. These works have greatly improved the ability to estimate the
population covariance matrix and to adapt to different data backgrounds. For the sake of
simplicity, we assume that $\sigma^2$, $r_1$, $r_2$, and $\lambda_i$ ($i \in I$) are perfectly known. In our
numerical simulations, we have used the methods of [3, 27, 30] to estimate the unknown
parameters in (2.1).
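As a rough illustration of why such estimation is feasible, large spiked eigenvalues separate from the Marchenko-Pastur bulk of the sample spectrum; the edge-thresholding rule and the noise-level estimate `sigma2_hat` below are simplifications assumed for this sketch, not the estimators of [3, 27, 30].

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 200, 800
c = p / n

# Simulated spiked population: sigma^2 = 1 with one large spike lambda = 10.
v, _ = np.linalg.qr(rng.normal(size=(p, 1)))
Sigma = np.eye(p) + 10.0 * (v @ v.T)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
S = np.cov(X, rowvar=False)
samp_eigs = np.sort(np.linalg.eigvalsh(S))[::-1]

# Rough noise-level estimate and the upper Marchenko-Pastur bulk edge
# sigma^2 (1 + sqrt(c))^2; eigenvalues beyond the edge are flagged as spikes.
sigma2_hat = np.trace(S) / p
bulk_edge = sigma2_hat * (1 + np.sqrt(c)) ** 2
n_spikes = int(np.sum(samp_eigs > bulk_edge))
print(n_spikes)  # the single planted spike escapes the bulk
```

The sample eigenvalue of the planted spike lands well above the bulk edge, while the noise eigenvalues stay inside the bulk, so counting edge-crossers recovers the number of (large) spikes in this simple setting.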
In this paper, the spectrally-corrected method corrects the spectral elements of
the sample covariance matrix to those of $\Sigma$. Start with the eigendecomposition of the pooled