Optimal Discriminant Analysis in High-Dimensional Latent
Factor Models
Xin Bing and Marten Wegkamp
Abstract
In high-dimensional classification problems, a commonly used approach is to first project
the high-dimensional features into a lower dimensional space, and base the classification on
the resulting lower dimensional projections. In this paper, we formulate a latent-variable
model with a hidden low-dimensional structure to justify this two-step procedure and to
guide which projection to choose. We propose a computationally efficient classifier that
takes certain principal components (PCs) of the observed features as projections, with the
number of retained PCs selected in a data-driven way. A general theory is established for
analyzing such two-step classifiers based on any projections. We derive explicit rates of
convergence of the excess risk of the proposed PC-based classifier. The obtained rates are
further shown to be optimal up to logarithmic factors in the minimax sense. Our theory
allows the lower dimension to grow with the sample size and is also valid even when the
feature dimension (greatly) exceeds the sample size. Extensive simulations corroborate our
theoretical findings. The proposed method also performs favorably relative to other existing
discriminant methods on three real data examples.
Keywords: High-dimensional classification, latent factor model, principal component regression,
dimension reduction, discriminant analysis, optimal rate of convergence.
1 Introduction
In high-dimensional classification problems, a widely used technique is to first project the high-
dimensional features into a lower dimensional space, and base the classification on the resulting
lower dimensional projections (Ghosh,2001;Nguyen and Rocke,2002;Chiaromonte and Mar-
tinelli,2002;Antoniadis et al.,2003;Biau et al.,2003;Boulesteix,2004;Dai et al.,2006;Li,
2016;Hadef and Djebabra,2019;Jin et al.,2021;Ma et al.,2020;Mallary et al.,2022). Despite
having been widely used for years, theoretical understanding of this approach is scarce, and
what kind of low-dimensional projection to choose remains unknown. In this paper we formulate a latent-variable model with a hidden low-dimensional structure to justify the two-step
procedure that takes leading principal components of the observed features as projections.
Concretely, suppose our data consists of independent copies of the pair (X, Y) with features X ∈ R^p generated according to

X = AZ + W    (1.1)

and labels Y ∈ {0, 1}. Here A is a deterministic, unknown p × K loading matrix, Z ∈ R^K are unobserved, latent factors and W is random noise. We assume that
(i) W is independent of both Z and Y,
Department of Statistical Sciences, University of Toronto. E-mail: xin.bing@utoronto.ca
Department of Mathematics and Department of Statistics and Data Science, Cornell University, Ithaca, NY.
E-mail: mhw73@cornell.edu.
(ii) E[W] = 0_p,
(iii) A has rank K.
This mathematical framework allows for a substantial dimension reduction in classification when K ≪ p. Indeed, in terms of the Bayes misclassification errors, we prove in Lemma 1 of Section 2.1 the inequality

R_x^* := inf_g P{g(X) ≠ Y}  ≥  R_z^* := inf_h P{h(Z) ≠ Y},    (1.2)

that is, it is easier to classify in the latent space R^K than in the observed feature space R^p. In
this work, we further assume that
(iv) Z is a mixture of two Gaussians

Z | Y = k ∼ N_K(α_k, Σ_{Z|Y}),   P(Y = k) = π_k,   k ∈ {0, 1}    (1.3)

with different means α_0 := E[Z | Y = 0] and α_1 := E[Z | Y = 1], but with the same covariance matrix

Σ_{Z|Y} := Cov(Z | Y = 0) = Cov(Z | Y = 1),    (1.4)

assumed to be strictly positive definite.
We emphasize that the distributions of X given Y are not necessarily Gaussian, as the distribution of W could be arbitrary.
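To fix ideas, here is a minimal simulation sketch of data drawn from model (1.1) under assumption (1.3). The dimensions, mixing proportion and parameter values are illustrative choices of ours, not quantities from the paper, and the noise W is deliberately taken non-Gaussian to emphasize that only Z given Y is assumed Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 500, 1000, 3            # sample size, feature dimension, latent dimension (illustrative)
pi1 = 0.5                         # P(Y = 1)

A = rng.normal(size=(p, K))       # unknown p x K loading matrix, rank K almost surely
alpha0, alpha1 = np.zeros(K), np.ones(K)    # class-conditional means of Z
Sigma_Z_given_Y = np.eye(K)       # common conditional covariance of Z given Y

Y = rng.binomial(1, pi1, size=n)
# Z | Y = k ~ N_K(alpha_k, Sigma_{Z|Y})  -- assumption (1.3)
L = np.linalg.cholesky(Sigma_Z_given_Y)
Z = np.where(Y[:, None] == 1, alpha1, alpha0) + rng.normal(size=(n, K)) @ L.T
# W: mean-zero noise, independent of (Z, Y); it need not be Gaussian
W = rng.laplace(scale=1.0, size=(n, p))
X = Z @ A.T + W                   # model (1.1), written row-wise: X_i = A Z_i + W_i
```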
Within the above modelling framework, parameters related to the moments of X and Y, such as π_k, E[X | Y] and Cov(X | Y), are identifiable, while A, Σ_{Z|Y}, α_k, and Σ_W := Cov(W) are not. For instance, we can always replace Z by Z′ = QZ for any invertible K × K matrix Q and write α′_k = Qα_k, Σ′_{Z|Y} = QΣ_{Z|Y}Q^⊤ and A′ = AQ^{-1}. Since we focus on classification, there is no need to impose any conditions on the latter group of parameters that render them identifiable. Although our discussion throughout this paper is based on a fixed choice of A, Σ_{Z|Y}, Σ_W and α_k, it should be understood that our results are valid for all possible choices of these parameters such that models (1.1) and (1.3) hold, including sub-models under which such parameters are (partially) identifiable.
Our goal is to construct a classification rule ĝ_x : R^p → {0, 1}, based on the training data D := {X, Y} consisting of independent pairs (X_1, Y_1), . . . , (X_n, Y_n) from models (1.1) and (1.3), such that the resulting rule has small misclassification error P{ĝ_x(X) ≠ Y} for a new pair (X, Y) from the same model that is independent of D. In this paper, we are particularly interested in rules ĝ_x that are linear in X, motivated by the fact that the restriction of equal covariances in (1.4) leads to a Bayes rule that is linear in Z when we observe Z (see display (1.6) below).
Linear classifiers have been popular for decades, especially in high-dimensional classification
problems, due to their interpretability and computational simplicity. One strand of the existing
literature imposes sparsity on the coefficients β ∈ R^p in linear classifiers g(x) = 1{β^⊤x + β_0 ≥ 0} for large p (p ≫ n); see, for instance, Tibshirani et al. (2002); Fan and Fan (2008); Witten and Tibshirani (2011); Shao et al. (2011); Cai and Liu (2011); Mai et al. (2012); Cai and Zhang (2019a) for sparse linear discriminant analysis (LDA), and Tarigan and Van de Geer (2006); Wegkamp and Yuan (2011) for sparse support vector machines. For instance, in the classical LDA setting, when X itself is a mixture of Gaussians

X | Y = k ∼ N_p(µ_k, Σ),   P(Y = k) = π_k,   k ∈ {0, 1}    (1.5)
with Σ strictly positive definite, the Bayes classifier is linear with p-dimensional vector β = Σ^{-1}(µ_1 − µ_0). Sparsity of β is then a reasonable assumption when Σ is close to diagonal, so that sparsity of β translates into sparsity of the difference µ_1 − µ_0 between the mean vectors. However, in the high-dimensional regime, many features are highly correlated, and any sparsity assumption on β is no longer intuitive and becomes in fact questionable. This serves as a main motivation for this work, in which we study a class of linear classifiers that no longer requires a sparsity assumption on β, neither for the construction of the classifier nor for its analysis.
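A toy numerical check of this point, using illustrative numbers of our own: with a strongly correlated (non-diagonal) Σ, even a perfectly sparse mean difference produces a dense Bayes direction β.

```python
import numpy as np

# Toy illustration: with a non-diagonal Sigma, the Bayes direction
# beta = Sigma^{-1}(mu1 - mu0) is dense even though mu1 - mu0 is sparse.
p = 6
mu_diff = np.zeros(p)
mu_diff[0] = 1.0                                  # sparse mean difference
Sigma = 0.8 * np.ones((p, p)) + 0.2 * np.eye(p)   # equicorrelated, far from diagonal
beta = np.linalg.solve(Sigma, mu_diff)
print(np.round(beta, 3))    # all entries are non-zero
```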
1.1 Contributions
We summarize our contributions below.
1.1.1 Minimax lower bounds on the rate of convergence of the excess risk
Our first contribution in this paper is to establish minimax lower bounds on the rate of convergence of the excess risk of any classifier under models (1.1) and (1.3). The excess risk is defined relative to R_z^* in (1.2), which we view as a more natural benchmark than R_x^* because our proposed classifier is designed to adapt to the underlying low-dimensional structure in (1.1). The relation in (1.2) suggests that R_z^* is also a more ambitious benchmark than R_x^*.
Since the gap between R_x^* and R_z^* quantifies the irreducible error for not observing Z, we start in Lemma 2 of Section 2.1 by characterizing how R_x^* − R_z^* depends on ξ = λ_K(AΣ_{Z|Y}A^⊤)/λ_1(Σ_W), the signal-to-noise ratio for predicting Z from X (conditioned on Y), and on ∆^2 = (α_1 − α_0)^⊤Σ_{Z|Y}^{-1}(α_1 − α_0), the squared Mahalanobis distance between the random vectors Z | Y = 1 and Z | Y = 0. Interestingly, it turns out that R_x^* − R_z^* is small when either ξ or ∆ is large, a phenomenon that differs from the setting in which Y is linear in Z. Indeed, in the latter case, the excess risk of predicting Y by using the best linear predictor of X, relative to the risk of predicting Y from E[Y | Z], is small only when ξ is large (Bing et al., 2021).
In Theorem 3 of Section 2.2, we derive the minimax lower bounds on the excess risk for any classifier, with explicit dependence on the signal-to-noise ratio ξ, the separation distance ∆, the dimensions K and p, and the sample size n. Our results also fully capture the phase transition of the excess risk as the magnitude of ∆ varies. Specifically, when ∆ is of constant order, the established lower bounds are
(ω_n^*)^2 = K/n + ∆^2/ξ + (∆^2/ξ) · p/(ξn).
The first term is the optimal rate of the excess risk even when Z were observable; the second term corresponds to the irreducible error of not observing Z in R_x^* − R_z^*; and the last term reflects the minimal price to pay for estimating the column space of A. When ∆ → ∞ as n → ∞, the lower bounds become (ω_n^*)^2 ≍ exp(−∆^2/8) and get exponentially faster in ∆^2. When ∆ → 0 as n → ∞, the lower bounds slow down to ω_n^* min{ω_n^*/∆, 1}, implying a more difficult scenario for classification. In Section 5.3, the lower bounds are further shown to be tight in the sense that the excess risk of the proposed PC-based classifier has a matching upper bound, up to some logarithmic factors.
To the best of our knowledge, our minimax lower bounds are new in both the factor model literature and the classical LDA literature. In the factor model literature, even for linear factor regression models, there is no known minimax lower bound on the prediction risk with respect to the quadratic loss function. In the LDA literature, our results cover the minimax lower bound on the excess risk in the classical LDA as a special case and are the first to fully characterize the phase transition in ∆ (see Remark 5 for details). The analysis establishing Theorem 3 is highly non-trivial and encounters several challenges. Specifically, since the excess risk is not a semi-distance, as required by the standard techniques for proving minimax lower bounds, the first challenge is to develop a reduction scheme based on a surrogate loss function that satisfies a local triangle-inequality-type bound. The second challenge of our analysis is to allow a fully non-diagonal structure of Cov(X | Y) under model (1.1), in contrast to the existing literature on the classical LDA, which assumes Cov(X | Y) to be diagonal or even proportional to the identity matrix. To characterize the effect of estimating the column space of A on the excess risk, which yields the third term of the lower bounds, our proof constructs a suitable subset of the parameter space via the hypercube construction used for proving the optimal rates of sparse PCA (Vu and Lei, 2013) (see the paragraph after Theorem 3 for a full discussion). Since the statistical distance (such as the KL-divergence) between the hypotheses thus constructed can diverge as p/n → ∞, the third challenge is to provide a meaningful and sharp lower bound that is valid for both p < n and p > n.
1.1.2 A general two-step classification approach and the PC-based classifier
Our second contribution in this paper is to propose a computationally efficient linear classifier
in Section 3.2 that uses leading principal components (PCs) of the high-dimensional feature,
with the number of retained PCs selected in a data-driven way. This PC-based classifier is one
instance of a general two-step classification approach proposed in Section 3.1. To be clear, it
differs from naively applying standard LDA, using plug-in estimates of the Bayes rule, to the leading PCs.
To motivate our approach, suppose that the factors Z were observable. Then the optimal Bayes rule is to classify a new point z ∈ R^K as

g_z^*(z) = 1{z^⊤η + η_0 ≥ 0}    (1.6)

where

η = Σ_{Z|Y}^{-1}(α_1 − α_0),    η_0 = −(1/2)(α_0 + α_1)^⊤η + log(π_1/π_0).    (1.7)
This rule is optimal in the sense that it has the smallest possible misclassification error. Our approach in Section 3.1 utilizes an intimate connection between linear discriminant analysis and regression to reformulate the Bayes rule g_z^*(z) as 1{z^⊤β + β_0 ≥ 0} with β = Σ_Z^{-1} Cov(Z, Y) (and β_0 given in (3.1) of Section 3). The key difference is the use of the unconditional covariance matrix Σ_Z, as opposed to the conditional one Σ_{Z|Y} in (1.7). As a result, β can be interpreted as the coefficient of regressing Y on Z, suggesting to estimate z^⊤β by z^⊤(Z^⊤Π_n Z)^+ Z^⊤Π_n Y via the method of least squares, again, in case Z = (Z_1, . . . , Z_n)^⊤ ∈ R^{n×K} and z ∈ R^K had been observed. Here Y = (Y_1, . . . , Y_n)^⊤ ∈ {0, 1}^n, Π_n = I_n − n^{-1} 1_n 1_n^⊤ is the centering projection matrix, and M^+ denotes the Moore-Penrose inverse of any matrix M throughout this paper.
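To illustrate this reformulation, the sketch below computes the oracle rule (1.6)–(1.7) and, pretending the factors were observed, the least-squares quantity (Z^⊤Π_n Z)^+ Z^⊤Π_n Y. The function names are ours, and the code is only a schematic rendering of the displays above.

```python
import numpy as np

def bayes_rule_latent(z, alpha0, alpha1, Sigma_Z_given_Y, pi1):
    """Oracle Bayes rule (1.6)-(1.7), usable only when Z is observed."""
    eta = np.linalg.solve(Sigma_Z_given_Y, alpha1 - alpha0)
    eta0 = -0.5 * (alpha0 + alpha1) @ eta + np.log(pi1 / (1 - pi1))
    return (z @ eta + eta0 >= 0).astype(int)

def least_squares_direction(Z, Y):
    """Regression reformulation: estimate beta by least squares of Y on the
    centered factors, i.e. (Z^T Pi_n Z)^+ Z^T Pi_n Y."""
    n = len(Y)
    Pi_n = np.eye(n) - np.ones((n, n)) / n      # centering projection matrix
    return np.linalg.pinv(Z.T @ Pi_n @ Z) @ (Z.T @ Pi_n @ Y)
```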
Since we only have access to x ∈ R^p, a realization of X, to X = [X_1 ··· X_n]^⊤ ∈ R^{n×p}, and to Y ∈ {0, 1}^n, it is natural to estimate the span of z by B^⊤x and to predict the span of Π_n Z by Π_n X B, for some appropriate matrix B. This motivates us to estimate the inner product z^⊤β by

(B^⊤x)^⊤(B^⊤X^⊤Π_n X B)^+ B^⊤X^⊤Π_n Y := x^⊤θ̂.    (1.8)

By using a plug-in estimator β̂_0 of β_0, the resulting rule ĝ_x(x) = 1{x^⊤θ̂ + β̂_0 ≥ 0} is a general two-step, regression-based classifier and the choice of B is up to the practitioner.
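A minimal sketch of this two-step rule for a generic projection matrix B could read as follows; the midpoint intercept used in place of β̂_0 is an illustrative stand-in of ours for the estimator defined in Section 3.

```python
import numpy as np

def two_step_classifier(X, Y, B):
    """General two-step, regression-based rule built on a projection matrix B,
    following display (1.8).  The intercept below is an illustrative midpoint
    plug-in, not the estimator beta_0_hat defined in Section 3."""
    Xc = X - X.mean(axis=0)                       # Pi_n X
    Yc = Y - Y.mean()                             # Pi_n Y
    G = B.T @ Xc.T @ Xc @ B                       # B^T X^T Pi_n X B
    theta_hat = B @ np.linalg.pinv(G) @ (B.T @ Xc.T @ Yc)
    scores = X @ theta_hat
    beta0_hat = -0.5 * (scores[Y == 0].mean() + scores[Y == 1].mean())
    def predict(x_new):
        return (x_new @ theta_hat + beta0_hat >= 0).astype(int)
    return predict
```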
In this paper, we advocate the choice B = U_r ∈ R^{p×r}, where U_r contains the first r right singular vectors of Π_n X, so that the projections Π_n X B become the first r principal components of X. Intuitively, this method has promise, as Stock and Watson (2002a) prove that when r is chosen as K, the projection Π_n X U_K accurately predicts the span of Π_n Z under model (1.1). Since in practice K is oftentimes unknown, we further use a data-driven selection of K in Section 3.3 to construct our final PC-based classifier. The proposed procedure is computationally efficient: its only computational burden is that of computing the singular value decomposition (SVD) of X. Guided by our theory, we also discuss a cross-fitting strategy in Section 3.2 that improves the PC-based classifier by removing the dependence induced by using the data twice (once for constructing U_r and once for computing θ̂ in (1.8)) when p > n and the signal-to-noise ratio ξ is weak.
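Concretely, the PC-based classifier amounts to running the two_step_classifier sketch above with B equal to the leading right singular vectors of the centered data matrix. In the sketch below the number of components r is supplied by hand, whereas the paper selects K data-adaptively (Section 3.3) and additionally allows the cross-fitting refinement; both are omitted here.

```python
import numpy as np

def pc_based_classifier(X, Y, r):
    """PC-based instance of the two-step rule: take B = U_r, the first r right
    singular vectors of the centered design Pi_n X, so that Pi_n X B are the
    first r principal components.  Here r is user-supplied; the paper instead
    selects K in a data-driven way."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    U_r = Vt[:r].T                   # p x r matrix of leading right singular vectors
    return two_step_classifier(X, Y, U_r)

# e.g., on the simulated data from the earlier sketch:
# predict = pc_based_classifier(X, Y, r=3)
# y_hat = predict(X_new)
```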
Retaining only a few principal components of the observed features and using them in subsequent regressions is known as principal component regression (PCR) (Stock and Watson, 2002a). It is a popular method for predicting Y ∈ R from a high-dimensional feature vector X ∈ R^p when both X and Y are generated via a low-dimensional latent factor Z. Most of the existing literature analyzes the performance of PCR when both Y and X are linear in Z; see, for instance, Stock and Watson (2002a,b); Bair et al. (2006); Bai and Ng (2008); Hahn et al. (2013); Bing et al. (2021), just to name a few. When Y is not linear in Z, little is known. An exception is Fan et al. (2017), which studies the model Y = h(ξ_1^⊤Z, ··· , ξ_q^⊤Z; ε) and X = AZ + W for some unknown general link function h(·). Their focus is only on estimation of ξ_1, . . . , ξ_q, the sufficient predictive indices of Y, rather than on analysis of the risk of predicting Y. As E[Y | Z] is not linear in Z under our models (1.1) and (1.3), to the best of our knowledge the misclassification error of a general linear classifier under these models has not been analyzed elsewhere.
1.1.3 A general strategy for analyzing the excess risk of ĝ_x based on any matrix B
Our third contribution in this paper is to provide a general theory for analyzing the excess risk of classifiers of the type ĝ_x that use a generic matrix B in (1.8). In Section 4 we state our result in Theorem 5, a general bound on the excess risk of the classifier ĝ_x based on a generic matrix B. It depends on how well we estimate z^⊤β + β_0 and on a margin condition on the conditional distributions Z | Y = k, k ∈ {0, 1}, near the hyperplane {z : z^⊤β + β_0 = 0}. This bound is different from the usual approach, which bounds the excess risk P{ĝ(X) ≠ Y} − R_z^* of a classifier ĝ : R^p → {0, 1} by 2E[|η(Z) − 1/2| 1{ĝ(X) ≠ g_z^*(Z)}], with η(z) = P(Y = 1 | Z = z), and involves analyzing the behavior of η(Z) near 1/2 (see our detailed discussion in Remark 7). The analysis of Theorem 5 is powerful in that it can easily be generalized to any distribution of Z | Y, as explained in Remark 8. Our second main result in Theorem 7 of Section 4 provides explicit rates of convergence of the excess risk of ĝ_x for a generic B and clearly delineates the three key quantities, introduced therein, that need to be controlled. The established rates of convergence reveal the same phase transition in ∆ as the lower bounds. It is worth mentioning that the analysis of Theorem 7 is more challenging under models (1.1) and (1.3) than in the classical LDA setting (1.5), in which the excess risk of any linear classifier in X has a closed-form expression.
1.1.4 Optimal rates of convergence of the PC-based classifier
Our fourth contribution is to apply the general theory in Section 4 to analyze the PC-based classifiers. Consistency of our proposed estimator of K is established in Theorem 8 of Section 5.1. In Theorem 9 of Section 5.2, we derive explicit rates of convergence of the excess risk of the PC-based classifier that uses B = U_K. The obtained rate of convergence exhibits an interesting interplay between the sample size n and the dimensions K and p through the quantities K/n, ξ and ∆. Our analysis also covers the low-signal setting ∆ = o(1), a regime that has not been analyzed even in the existing literature on the classical LDA. Our theoretical results are valid for both fixed and growing K and remain valid even when p is much larger than n. In Theorem