
when $r$ is chosen as $K$, the projection $\Pi_n X U_K$ accurately predicts the span of $\Pi_n Z$ under
model (1.1). Since in practice $K$ is often unknown, we further use a data-driven selection
of $K$ in Section 3.3 to construct our final PC-based classifier. The proposed procedure is
computationally efficient: its only computational burden is that of computing the singular value
decomposition (SVD) of $X$. Guided by our theory, we also discuss a cross-fitting strategy in
Section 3.2 that improves the PC-based classifier by removing the dependence induced by using the
data twice (once for constructing $U_r$ and once for computing $\hat\theta$ in (1.8)) when $p > n$ and the
signal-to-noise ratio $\xi^*$ is weak.
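To fix ideas, the following is a minimal numerical sketch of the two computational steps just described: an SVD of the data matrix followed by an LDA-type rule in the projected space. It is only an illustration of the generic recipe under an assumed setup; the function name, the plug-in LDA rule, and the sample-splitting comment are ours and do not reproduce the exact estimator $\hat\theta$ in (1.8).

```python
import numpy as np

def pc_based_classifier(X, y, K):
    """Illustrative PC-based classifier: project the features onto the top-K
    right singular vectors of X, then apply a standard LDA rule in R^K.
    (Sketch only, not the paper's exact estimator. For a cross-fitted variant,
    one would compute U_K on one half of the sample and the LDA quantities on
    the other half.)"""
    # SVD of the n x p data matrix; keep the top-K right singular vectors.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    U_K = Vt[:K].T                      # p x K
    S = X @ U_K                         # n x K projected features

    # Pooled-covariance LDA in the projected space.
    S0, S1 = S[y == 0], S[y == 1]
    mu0, mu1 = S0.mean(axis=0), S1.mean(axis=0)
    R = np.vstack([S0 - mu0, S1 - mu1])
    Sigma = R.T @ R / (len(y) - 2)      # pooled within-class covariance
    theta = np.linalg.solve(Sigma, mu1 - mu0)
    theta0 = -0.5 * (mu0 + mu1) @ theta + np.log(S1.shape[0] / S0.shape[0])

    def predict(X_new):
        return ((X_new @ U_K) @ theta + theta0 > 0).astype(int)

    return predict
```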
Retaining only a few principal components of the observed features and using them in
subsequent regressions is known as principal component regression (PCR) (Stock and Watson,
2002a). It is a popular method for predicting $Y \in \mathbb{R}$ from a high-dimensional feature vector
$X \in \mathbb{R}^p$ when both $X$ and $Y$ are generated via a low-dimensional latent factor $Z$. Most of the
existing literature analyzes the performance of PCR when both $Y$ and $X$ are linear in $Z$; see, for
instance, Stock and Watson (2002a,b); Bair et al. (2006); Bai and Ng (2008); Hahn et al. (2013);
Bing et al. (2021), just to name a few. When $Y$ is not linear in $Z$, little is known. An exception
is Fan et al. (2017), which studies the model $Y = h(\xi_1^\top Z, \cdots, \xi_q^\top Z; \varepsilon)$ and $X = AZ + W$ for
some unknown general link function $h(\cdot)$. Their focus is on estimating $\xi_1, \ldots, \xi_q$, the
sufficient predictive indices of $Y$, rather than on analyzing the risk of predicting $Y$. Since $\mathbb{E}[Y \mid Z]$ is
not linear in $Z$ under our models (1.1) and (1.3), to the best of our knowledge, the misclassification
error of a general linear classifier under models (1.1) and (1.3) has not been analyzed elsewhere.
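As a point of reference, a generic PCR fit in the sense described above (regress $Y$ on the top-$r$ principal component scores of $X$) can be sketched as follows; this is a textbook version under assumed centering conventions, not the procedure of any particular reference cited here.

```python
import numpy as np

def pcr_fit_predict(X, y, X_new, r):
    """Generic principal component regression sketch: least squares of y on
    the top-r principal component scores of the column-centered X."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    V_r = Vt[:r].T                                    # p x r loadings
    scores = (X - mu) @ V_r                           # n x r PC scores
    design = np.column_stack([np.ones(len(y)), scores])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    scores_new = (X_new - mu) @ V_r
    return np.column_stack([np.ones(len(X_new)), scores_new]) @ coef
```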
1.1.3 A general strategy for analyzing the excess risk of $\hat g_x$ based on any matrix $B$
Our third contribution in this paper is to provide a general theory for analyzing the excess risk
of the type of classifiers $\hat g_x$ that use a generic matrix $B$ in (1.8). In Section 4 we state our result
in Theorem 5, a general bound for the excess risk of the classifier $\hat g_x$ based on a generic matrix
$B$. It depends on how well we estimate $z^\top \beta + \beta_0$ and on a margin condition on the conditional
distributions $Z \mid Y = k$, $k \in \{0, 1\}$, near the hyperplane $\{z \mid z^\top \beta + \beta_0 = 0\}$. This bound
differs from the usual approach, which bounds the excess risk $\mathbb{P}\{\hat g(X) \ne Y\} - R^*_z$ of a classifier
$\hat g: \mathbb{R}^p \to \{0, 1\}$ by $2\,\mathbb{E}\big[\,|\eta(Z) - 1/2|\,\mathbb{1}\{\hat g(X) \ne g^*_z(Z)\}\big]$, with $\eta(z) = \mathbb{P}(Y = 1 \mid Z = z)$, and involves
analyzing the behavior of $\eta(Z)$ near $1/2$ (see our detailed discussion in Remark 7). The analysis
of Theorem 5 is powerful in that it can easily be generalized to any distribution of $Z \mid Y$, as
explained in Remark 8. Our second main result, Theorem 7 of Section 4, provides explicit rates
of convergence of the excess risk of $\hat g_x$ for a generic $B$ and clearly delineates the three key quantities
that need to be controlled, as introduced therein. The established rates of convergence reveal
the same phase transition in $\Delta$ as the lower bounds. It is worth mentioning that the analysis
of Theorem 7 is more challenging under models (1.1) and (1.3) than in the classical LDA setting
(1.5), in which the excess risk of any linear classifier in $X$ has a closed-form expression.
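For context, the closed form alluded to in the last sentence is the standard Gaussian one: under a two-class model with common covariance $\Sigma$, class means $\mu_0, \mu_1$ and priors $\pi_0, \pi_1$ (the classical LDA setting; the notation here is the textbook one rather than the paper's display (1.5)), any linear rule $g(x) = \mathbb{1}\{a^\top x + b > 0\}$ has risk

```latex
\mathbb{P}\{g(X) \neq Y\}
  \;=\; \pi_0\,\Phi\!\left(\frac{a^\top \mu_0 + b}{\sqrt{a^\top \Sigma\, a}}\right)
  \;+\; \pi_1\,\Phi\!\left(\frac{-\,(a^\top \mu_1 + b)}{\sqrt{a^\top \Sigma\, a}}\right),
```

with $\Phi$ the standard normal c.d.f.; subtracting the Bayes risk (equal to $\Phi(-\Delta/2)$ when $\pi_0 = \pi_1 = 1/2$, with $\Delta^2 = (\mu_1 - \mu_0)^\top \Sigma^{-1} (\mu_1 - \mu_0)$) then gives the excess risk explicitly. No such expression is available under models (1.1) and (1.3).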
1.1.4 Optimal rates of convergence of the PC-based classifier
Our fourth contribution is to apply the general theory in Section 4 to analyze the PC-based
classifiers. Consistency of our proposed estimator of $K$ is established in Theorem 8 of Section
5.1. In Theorem 9 of Section 5.2, we derive explicit rates of convergence of the excess risk of the
PC-based classifier that uses $B = U_K$. The obtained rate of convergence exhibits an interesting
interplay between the sample size $n$ and the dimensions $K$ and $p$ through the quantities $K/n$,
$\xi^*$ and $\Delta$. Our analysis also covers the low signal setting $\Delta = o(1)$, a regime that has not been
analyzed even in the existing literature on classical LDA. Our theoretical results are valid for
both fixed and growing $K$ and remain valid even when $p$ is much larger than $n$. In Theorem