Optimal Discriminant Analysis in High-Dimensional Latent
Factor Models
Xin Bing and Marten Wegkamp
Abstract
In high-dimensional classification problems, a commonly used approach is to first project
the high-dimensional features into a lower dimensional space, and base the classification on
the resulting lower dimensional projections. In this paper, we formulate a latent-variable
model with a hidden low-dimensional structure to justify this two-step procedure and to
guide which projection to choose. We propose a computationally efficient classifier that
takes certain principal components (PCs) of the observed features as projections, with the
number of retained PCs selected in a data-driven way. A general theory is established for
analyzing such two-step classifiers based on any projections. We derive explicit rates of
convergence of the excess risk of the proposed PC-based classifier. The obtained rates are
further shown to be optimal up to logarithmic factors in the minimax sense. Our theory
allows the lower dimension to grow with the sample size and is also valid even when the
feature dimension (greatly) exceeds the sample size. Extensive simulations corroborate our
theoretical findings. The proposed method also performs favorably relative to other existing
discriminant methods on three real data examples.
Keywords: High-dimensional classification, latent factor model, principal component regression,
dimension reduction, discriminant analysis, optimal rate of convergence.
1 Introduction
In high-dimensional classification problems, a widely used technique is to first project the high-
dimensional features into a lower dimensional space, and base the classification on the resulting
lower dimensional projections (Ghosh,2001;Nguyen and Rocke,2002;Chiaromonte and Mar-
tinelli,2002;Antoniadis et al.,2003;Biau et al.,2003;Boulesteix,2004;Dai et al.,2006;Li,
2016;Hadef and Djebabra,2019;Jin et al.,2021;Ma et al.,2020;Mallary et al.,2022). Despite
having been widely used for years, theoretical understanding of this approach is scarce, and
what kind of low-dimensional projection to choose remains unknown. In this paper we formulate a latent-variable model with a hidden low-dimensional structure to justify the two-step
procedure that takes leading principal components of the observed features as projections.
Concretely, suppose our data consists of independent copies of the pair (X, Y) with features X ∈ R^p generated according to

X = AZ + W    (1.1)

and labels Y ∈ {0, 1}. Here A is a deterministic, unknown p × K loading matrix, Z ∈ R^K are unobserved, latent factors and W is random noise. We assume that
(i) W is independent of both Z and Y,
Department of Statistical Sciences, University of Toronto. E-mail: xin.bing@utoronto.ca
Department of Mathematics and Department of Statistics and Data Science, Cornell University, Ithaca, NY.
E-mail: mhw73@cornell.edu.
(ii) E[W] = 0_p,
(iii) A has rank K.
This mathematical framework allows for a substantial dimension reduction in classification when K ≪ p. Indeed, in terms of the Bayes misclassification errors, we prove in Lemma 1 of Section 2.1 the inequality

R_x^* := inf_g P{g(X) ≠ Y}  ≥  R_z^* := inf_h P{h(Z) ≠ Y},    (1.2)

that is, it is easier to classify in the latent space R^K than in the observed feature space R^p. In
this work, we further assume that
(iv) Z is a mixture of two Gaussians

Z | Y = k ∼ N_K(α_k, Σ_{Z|Y}),   P(Y = k) = π_k,   k ∈ {0, 1}    (1.3)

with different means α_0 := E[Z | Y = 0] and α_1 := E[Z | Y = 1], but with the same covariance matrix

Σ_{Z|Y} := Cov(Z | Y = 0) = Cov(Z | Y = 1),    (1.4)

assumed to be strictly positive definite.
We emphasize that the distributions of X given Y are not necessarily Gaussian, as the distribution of W could be arbitrary.
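To fix ideas, here is a minimal simulation sketch of data drawn from model (1.1) under assumption (1.3). The dimensions, mixing proportion and parameter values are illustrative choices of ours, not quantities from the paper, and the noise W is deliberately taken non-Gaussian to emphasize that only Z given Y is assumed Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 500, 1000, 3            # sample size, feature dimension, latent dimension (illustrative)
pi1 = 0.5                         # P(Y = 1)

A = rng.normal(size=(p, K))       # unknown p x K loading matrix, rank K almost surely
alpha0, alpha1 = np.zeros(K), np.ones(K)    # class-conditional means of Z
Sigma_Z_given_Y = np.eye(K)       # common conditional covariance of Z given Y

Y = rng.binomial(1, pi1, size=n)
# Z | Y = k ~ N_K(alpha_k, Sigma_{Z|Y})  -- assumption (1.3)
L = np.linalg.cholesky(Sigma_Z_given_Y)
Z = np.where(Y[:, None] == 1, alpha1, alpha0) + rng.normal(size=(n, K)) @ L.T
# W: mean-zero noise, independent of (Z, Y); it need not be Gaussian
W = rng.laplace(scale=1.0, size=(n, p))
X = Z @ A.T + W                   # model (1.1), written row-wise: X_i = A Z_i + W_i
```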
Within the above modelling framework, parameters related to the moments of X and Y, such as π_k, E[X | Y] and Cov(X | Y), are identifiable, while A, Σ_{Z|Y}, α_k, and Σ_W := Cov(W) are not. For instance, we can always replace Z by Z′ = QZ for any invertible K × K matrix Q and write α′_k = Qα_k, Σ′_{Z|Y} = QΣ_{Z|Y}Q^⊤ and A′ = AQ^{-1}. Since we focus on classification, there is no need to impose any conditions on the latter group of parameters that render them identifiable. Although our discussion throughout this paper is based on a fixed choice of A, Σ_{Z|Y}, Σ_W and α_k, it should be understood that our results are valid for all possible choices of these parameters such that models (1.1) and (1.3) hold, including sub-models under which such parameters are (partially) identifiable.
Our goal is to construct a classification rule ĝ_x : R^p → {0, 1}, based on the training data D := {X, Y} consisting of independent pairs (X_1, Y_1), . . . , (X_n, Y_n) from models (1.1) and (1.3), such that the resulting rule has small misclassification error P{ĝ_x(X) ≠ Y} for a new pair (X, Y) from the same model that is independent of D. In this paper, we are particularly interested in rules ĝ_x that are linear in X, motivated by the fact that the restriction of equal covariances in (1.4) leads to a Bayes rule that is linear in Z when we observe Z (see display (1.6) below).
Linear classifiers have been popular for decades, especially in high-dimensional classification
problems, due to their interpretability and computational simplicity. One strand of the existing
literature imposes sparsity on the coefficients β ∈ R^p in linear classifiers g(x) = 1{β^⊤x + β_0 ≥ 0} for large p (p ≫ n); see, for instance, Tibshirani et al. (2002); Fan and Fan (2008); Witten and Tibshirani (2011); Shao et al. (2011); Cai and Liu (2011); Mai et al. (2012); Cai and Zhang (2019a) for sparse linear discriminant analysis (LDA), and Tarigan and Van de Geer (2006); Wegkamp and Yuan (2011) for sparse support vector machines. For instance, in the classical LDA setting, when X itself is a mixture of Gaussians

X | Y = k ∼ N_p(µ_k, Σ),   P(Y = k) = π_k,   k ∈ {0, 1}    (1.5)
with Σ strictly positive definite, the Bayes classifier is linear with p-dimensional vector β = Σ^{-1}(µ_1 − µ_0). Sparsity of β is then a reasonable assumption when Σ is close to diagonal, so that sparsity of β translates into sparsity of the difference µ_1 − µ_0 between the mean vectors. However, in the high-dimensional regime, many features are highly correlated, and any sparsity assumption on β is no longer intuitive and becomes in fact questionable. This serves as a main motivation for this work, in which we study a class of linear classifiers that no longer requires a sparsity assumption on β, neither for the construction of the classifier nor for its analysis.
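A toy numerical check of this point, using illustrative numbers of our own: with a strongly correlated (non-diagonal) Σ, even a perfectly sparse mean difference produces a dense Bayes direction β.

```python
import numpy as np

# Toy illustration: with a non-diagonal Sigma, the Bayes direction
# beta = Sigma^{-1}(mu1 - mu0) is dense even though mu1 - mu0 is sparse.
p = 6
mu_diff = np.zeros(p)
mu_diff[0] = 1.0                                  # sparse mean difference
Sigma = 0.8 * np.ones((p, p)) + 0.2 * np.eye(p)   # equicorrelated, far from diagonal
beta = np.linalg.solve(Sigma, mu_diff)
print(np.round(beta, 3))    # all entries are non-zero
```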
1.1 Contributions
We summarize our contributions below.
1.1.1 Minimax lower bounds on the rate of convergence of the excess risk
Our first contribution in this paper is to establish minimax lower bounds on the rate of convergence of the excess risk of any classifier under models (1.1) and (1.3). The excess risk is defined relative to R_z^* in (1.2), which we view as a more natural benchmark than R_x^* because our proposed classifier is designed to adapt to the underlying low-dimensional structure in (1.1). The relation in (1.2) suggests that R_z^* is also a more ambitious benchmark than R_x^*.
Since the gap between R_x^* and R_z^* quantifies the irreducible error for not observing Z, we start in Lemma 2 of Section 2.1 by characterizing how R_x^* − R_z^* depends on ξ = λ_K(AΣ_{Z|Y}A^⊤)/λ_1(Σ_W), the signal-to-noise ratio for predicting Z from X (conditioned on Y), and on ∆^2 = (α_1 − α_0)^⊤Σ_{Z|Y}^{-1}(α_1 − α_0), the squared Mahalanobis distance between the random vectors Z | Y = 1 and Z | Y = 0. Interestingly, it turns out that R_x^* − R_z^* is small when either ξ or ∆ is large, a phenomenon that differs from the setting in which Y is linear in Z. Indeed, in the latter case, the excess risk of predicting Y by using the best linear predictor of X, relative to the risk of predicting Y from E[Y | Z], is small only when ξ is large (Bing et al., 2021).
In Theorem 3 of Section 2.2, we derive the minimax lower bounds on the excess risk for any classifier, with explicit dependence on the signal-to-noise ratio ξ, the separation distance ∆, the dimensions K and p, and the sample size n. Our results also fully capture the phase transition of the excess risk as the magnitude of ∆ varies. Specifically, when ∆ is of constant order, the established lower bounds are
(ω_n^*)^2 = K/n + ∆^2/ξ + (∆^2/ξ) · p/(ξn).
The first term is the optimal rate of the excess risk even when Z were observable; the second term corresponds to the irreducible error of not observing Z in R_x^* − R_z^*; and the last term reflects the minimal price to pay for estimating the column space of A. When ∆ → ∞ as n → ∞, the lower bounds become (ω_n^*)^2 ≍ exp(−∆^2/8) and get exponentially faster in ∆^2. When ∆ → 0 as n → ∞, the lower bounds slow down to ω_n^* min{ω_n^*/∆, 1}, implying a more difficult scenario for classification. In Section 5.3, the lower bounds are further shown to be tight in the sense that the excess risk of the proposed PC-based classifier has a matching upper bound, up to some logarithmic factors.
To the best of our knowledge, our minimax lower bounds are new in both the factor model literature and the classical LDA literature. In the factor model literature, even for linear factor regression models, there is no known minimax lower bound on the prediction risk with respect to the quadratic loss function. In the LDA literature, our results cover the minimax lower bound on the excess risk in the classical LDA as a special case and are the first to fully characterize the phase transition in ∆ (see Remark 5 for details). The analysis establishing Theorem 3 is highly non-trivial and encounters several challenges. Specifically, since the excess risk is not a semi-distance, as required by the standard techniques for proving minimax lower bounds, the first challenge is to develop a reduction scheme based on a surrogate loss function that satisfies a local triangle-inequality-type bound. The second challenge of our analysis is to allow a fully non-diagonal structure of Cov(X | Y) under model (1.1), in contrast to the existing literature on the classical LDA, which assumes Cov(X | Y) to be diagonal or even proportional to the identity matrix. To characterize the effect of estimating the column space of A on the excess risk, which yields the third term of the lower bounds, our proof constructs a suitable subset of the parameter space via the hypercube construction used for proving the optimal rates of sparse PCA (Vu and Lei, 2013) (see the paragraph after Theorem 3 for a full discussion). Since the statistical distance (such as the KL-divergence) between the hypotheses thus constructed can diverge as p/n → ∞, the third challenge is to provide a meaningful and sharp lower bound that is valid for both p < n and p > n.
1.1.2 A general two-step classification approach and the PC-based classifier
Our second contribution in this paper is to propose a computationally efficient linear classifier
in Section 3.2 that uses leading principal components (PCs) of the high-dimensional feature,
with the number of retained PCs selected in a data-driven way. This PC-based classifier is one
instance of a general two-step classification approach proposed in Section 3.1. To be clear, it
differs from naively applying standard LDA, using plug-in estimates of the Bayes rule, to the leading PCs.
To motivate our approach, suppose that the factors Z were observable. Then the optimal Bayes rule is to classify a new point z ∈ R^K as

g_z^*(z) = 1{z^⊤η + η_0 ≥ 0}    (1.6)

where

η = Σ_{Z|Y}^{-1}(α_1 − α_0),    η_0 = −(1/2)(α_0 + α_1)^⊤η + log(π_1/π_0).    (1.7)
This rule is optimal in the sense that it has the smallest possible misclassification error. Our approach in Section 3.1 utilizes an intimate connection between linear discriminant analysis and regression to reformulate the Bayes rule g_z^*(z) as 1{z^⊤β + β_0 ≥ 0} with β = Σ_Z^{-1} Cov(Z, Y) (and β_0 given in (3.1) of Section 3). The key difference is the use of the unconditional covariance matrix Σ_Z, as opposed to the conditional one Σ_{Z|Y} in (1.7). As a result, β can be interpreted as the coefficient of regressing Y on Z, suggesting to estimate z^⊤β by z^⊤(Z^⊤Π_n Z)^+ Z^⊤Π_n Y via the method of least squares, again, in case Z = (Z_1, . . . , Z_n)^⊤ ∈ R^{n×K} and z ∈ R^K had been observed. Here Y = (Y_1, . . . , Y_n)^⊤ ∈ {0, 1}^n, Π_n = I_n − n^{-1} 1_n 1_n^⊤ is the centering projection matrix, and M^+ denotes the Moore-Penrose inverse of any matrix M throughout this paper.
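To illustrate this reformulation, the sketch below computes the oracle rule (1.6)–(1.7) and, pretending the factors were observed, the least-squares quantity (Z^⊤Π_n Z)^+ Z^⊤Π_n Y. The function names are ours, and the code is only a schematic rendering of the displays above.

```python
import numpy as np

def bayes_rule_latent(z, alpha0, alpha1, Sigma_Z_given_Y, pi1):
    """Oracle Bayes rule (1.6)-(1.7), usable only when Z is observed."""
    eta = np.linalg.solve(Sigma_Z_given_Y, alpha1 - alpha0)
    eta0 = -0.5 * (alpha0 + alpha1) @ eta + np.log(pi1 / (1 - pi1))
    return (z @ eta + eta0 >= 0).astype(int)

def least_squares_direction(Z, Y):
    """Regression reformulation: estimate beta by least squares of Y on the
    centered factors, i.e. (Z^T Pi_n Z)^+ Z^T Pi_n Y."""
    n = len(Y)
    Pi_n = np.eye(n) - np.ones((n, n)) / n      # centering projection matrix
    return np.linalg.pinv(Z.T @ Pi_n @ Z) @ (Z.T @ Pi_n @ Y)
```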
Since we only have access to x ∈ R^p, a realization of X, to X = [X_1 ··· X_n]^⊤ ∈ R^{n×p}, and to Y ∈ {0, 1}^n, it is natural to estimate the span of z by B^⊤x and to predict the span of Π_n Z by Π_n X B, for some appropriate matrix B. This motivates us to estimate the inner product z^⊤β by

(B^⊤x)^⊤(B^⊤X^⊤Π_n X B)^+ B^⊤X^⊤Π_n Y := x^⊤θ̂.    (1.8)

By using a plug-in estimator β̂_0 of β_0, the resulting rule ĝ_x(x) = 1{x^⊤θ̂ + β̂_0 ≥ 0} is a general two-step, regression-based classifier and the choice of B is up to the practitioner.
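A minimal sketch of this two-step rule for a generic projection matrix B could read as follows; the midpoint intercept used in place of β̂_0 is an illustrative stand-in of ours for the estimator defined in Section 3.

```python
import numpy as np

def two_step_classifier(X, Y, B):
    """General two-step, regression-based rule built on a projection matrix B,
    following display (1.8).  The intercept below is an illustrative midpoint
    plug-in, not the estimator beta_0_hat defined in Section 3."""
    Xc = X - X.mean(axis=0)                       # Pi_n X
    Yc = Y - Y.mean()                             # Pi_n Y
    G = B.T @ Xc.T @ Xc @ B                       # B^T X^T Pi_n X B
    theta_hat = B @ np.linalg.pinv(G) @ (B.T @ Xc.T @ Yc)
    scores = X @ theta_hat
    beta0_hat = -0.5 * (scores[Y == 0].mean() + scores[Y == 1].mean())
    def predict(x_new):
        return (x_new @ theta_hat + beta0_hat >= 0).astype(int)
    return predict
```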
In this paper, we advocate the choice B = U_r ∈ R^{p×r}, where U_r contains the first r right singular vectors of Π_n X, so that the projections Π_n X B become the first r principal components of X. Intuitively, this method has promise, as Stock and Watson (2002a) prove that when r is chosen as K, the projection Π_n X U_K accurately predicts the span of Π_n Z under model (1.1). Since in practice K is oftentimes unknown, we further use a data-driven selection of K in Section 3.3 to construct our final PC-based classifier. The proposed procedure is computationally efficient: its only computational burden is that of computing the singular value decomposition (SVD) of X. Guided by our theory, we also discuss a cross-fitting strategy in Section 3.2 that improves the PC-based classifier by removing the dependence induced by using the data twice (once for constructing U_r and once for computing θ̂ in (1.8)) when p > n and the signal-to-noise ratio ξ is weak.
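Concretely, the PC-based classifier amounts to running the two_step_classifier sketch above with B equal to the leading right singular vectors of the centered data matrix. In the sketch below the number of components r is supplied by hand, whereas the paper selects K data-adaptively (Section 3.3) and additionally allows the cross-fitting refinement; both are omitted here.

```python
import numpy as np

def pc_based_classifier(X, Y, r):
    """PC-based instance of the two-step rule: take B = U_r, the first r right
    singular vectors of the centered design Pi_n X, so that Pi_n X B are the
    first r principal components.  Here r is user-supplied; the paper instead
    selects K in a data-driven way."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    U_r = Vt[:r].T                   # p x r matrix of leading right singular vectors
    return two_step_classifier(X, Y, U_r)

# e.g., on the simulated data from the earlier sketch:
# predict = pc_based_classifier(X, Y, r=3)
# y_hat = predict(X_new)
```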
Retaining only a few principal components of the observed features and using them in subsequent regressions is known as principal component regression (PCR) (Stock and Watson, 2002a). It is a popular method for predicting Y ∈ R from a high-dimensional feature vector X ∈ R^p when both X and Y are generated via a low-dimensional latent factor Z. Most of the existing literature analyzes the performance of PCR when both Y and X are linear in Z; see, for instance, Stock and Watson (2002a,b); Bair et al. (2006); Bai and Ng (2008); Hahn et al. (2013); Bing et al. (2021), just to name a few. When Y is not linear in Z, little is known. An exception is Fan et al. (2017), which studies the model Y = h(ξ_1^⊤Z, ··· , ξ_q^⊤Z; ε) and X = AZ + W for some unknown general link function h(·). Their focus is only on estimation of ξ_1, . . . , ξ_q, the sufficient predictive indices of Y, rather than on analysis of the risk of predicting Y. As E[Y | Z] is not linear in Z under our models (1.1) and (1.3), to the best of our knowledge the misclassification error of a general linear classifier under these models has not been analyzed elsewhere.
1.1.3 A general strategy for analyzing the excess risk of ĝ_x based on any matrix B
Our third contribution in this paper is to provide a general theory for analyzing the excess risk of classifiers of the type ĝ_x that use a generic matrix B in (1.8). In Section 4 we state our result in Theorem 5, a general bound on the excess risk of the classifier ĝ_x based on a generic matrix B. It depends on how well we estimate z^⊤β + β_0 and on a margin condition on the conditional distributions Z | Y = k, k ∈ {0, 1}, near the hyperplane {z : z^⊤β + β_0 = 0}. This bound is different from the usual approach, which bounds the excess risk P{ĝ(X) ≠ Y} − R_z^* of a classifier ĝ : R^p → {0, 1} by 2E[|η(Z) − 1/2| 1{ĝ(X) ≠ g_z^*(Z)}], with η(z) = P(Y = 1 | Z = z), and involves analyzing the behavior of η(Z) near 1/2 (see our detailed discussion in Remark 7). The analysis of Theorem 5 is powerful in that it can easily be generalized to any distribution of Z | Y, as explained in Remark 8. Our second main result in Theorem 7 of Section 4 provides explicit rates of convergence of the excess risk of ĝ_x for a generic B and clearly delineates the three key quantities, introduced therein, that need to be controlled. The established rates of convergence reveal the same phase transition in ∆ as the lower bounds. It is worth mentioning that the analysis of Theorem 7 is more challenging under models (1.1) and (1.3) than in the classical LDA setting (1.5), in which the excess risk of any linear classifier in X has a closed-form expression.
1.1.4 Optimal rates of convergence of the PC-based classifier
Our fourth contribution is to apply the general theory in Section 4 to analyze the PC-based classifiers. Consistency of our proposed estimator of K is established in Theorem 8 of Section 5.1. In Theorem 9 of Section 5.2, we derive explicit rates of convergence of the excess risk of the PC-based classifier that uses B = U_K. The obtained rate of convergence exhibits an interesting interplay between the sample size n and the dimensions K and p through the quantities K/n, ξ and ∆. Our analysis also covers the low-signal setting ∆ = o(1), a regime that has not been analyzed even in the existing literature on the classical LDA. Our theoretical results are valid for both fixed and growing K and remain valid even when p is much larger than n. In Theorem