Interpolating Discriminant Functions in
High-Dimensional Gaussian Latent Mixtures
Xin Bing and Marten Wegkamp
Abstract
This paper considers binary classification of high-dimensional features under a
postulated model with a low-dimensional latent Gaussian mixture structure and
non-vanishing noise. A generalized least squares estimator is used to estimate the
direction of the optimal separating hyperplane. The estimated hyperplane is shown
to interpolate on the training data. While the direction vector can be consistently
estimated, as could be expected from recent results in linear regression, a naive plug-
in estimate fails to consistently estimate the intercept. A simple correction that
requires an independent hold-out sample renders the procedure minimax optimal in
many scenarios. The interpolation property of the latter procedure can be retained,
but surprisingly depends on the way the labels are encoded.
Keywords: High-dimensional classification, latent factor model, generalized least squares,
interpolation, benign overfitting, overparametrization, discriminant analysis, minimax
optimal rate of convergence.
1 Introduction
We consider binary classification of a high-dimensional feature vector. That is, we are
given an $n \times p$ data matrix $X$ with $n$ independent $p$-dimensional rows, with $p \gg n$, and
a response vector $y \in \{0,1\}^n$ of corresponding labels, and the task is to predict the
labels of new $p$-dimensional features. Complex models such as kernel support vector
machines (SVM) and deep neural networks have been observed to have surprisingly good
generalization performance despite overfitting the training data. Towards understanding
this benign overfitting phenomenon, one popular example that has gained increasing
attention is the estimator $\hat\theta = X^+ y$ in the regression context. Here $X^+$ represents the
Moore-Penrose inverse of $X$. It can be shown that $\hat\theta$ has the minimal $\ell_2$-norm among all
least squares estimators, that is,
$$\|\hat\theta\|_2 = \min\Big\{ \|\theta\|_2 \,:\, \|y - X\theta\|_2 = \min_{u \in \mathbb{R}^p} \|y - Xu\|_2 \Big\}.$$
Department of Statistical Sciences, University of Toronto. E-mail: xin.bing@toronto.ca
Department of Mathematics and Department of Statistics and Data Science, Cornell University.
E-mail: marten.wegkamp@cornell.edu.
Since often $X\hat\theta = y$ holds, for instance, when $X$ has full row rank $n < p$, the estimator
$\hat\theta$ is referred to as the minimum-norm interpolator. It serves as the prime example to
illustrate the phenomenon that overfitting in linear regression can be benign, in that $\hat\theta$
can still lead to good prediction results. See, for instance, Belkin et al. (2018); Bartlett
et al. (2020); Hastie et al. (2022); Bunea et al. (2022) and the references therein.
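To make the variational characterization and the interpolation property above concrete, here is a minimal numerical sketch (in Python with NumPy; purely illustrative and not part of the paper, with arbitrary dimensions). It computes $\hat\theta = X^+ y$ for an over-parametrized design, checks $X\hat\theta = y$, and verifies that perturbing $\hat\theta$ by a null-space direction of $X$ yields another least squares solution with a strictly larger $\ell_2$-norm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                      # over-parametrized: p > n
X = rng.standard_normal((n, p))     # full row rank with probability one
y = rng.integers(0, 2, size=n).astype(float)

# Minimum-norm least squares solution via the Moore-Penrose inverse.
theta_hat = np.linalg.pinv(X) @ y

# Interpolation: the fitted values match the labels exactly.
assert np.allclose(X @ theta_hat, y)

# Any other least squares solution differs from theta_hat by an element of the
# null space of X and therefore has a strictly larger l2-norm.
_, _, Vt = np.linalg.svd(X)         # rows n..p-1 of Vt span the null space of X
theta_other = theta_hat + 2.0 * Vt[-1]
assert np.allclose(X @ theta_other, y)
assert np.linalg.norm(theta_other) > np.linalg.norm(theta_hat)
```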
More recently, the minimum-norm interpolator further finds its importance in binary
classification problems. Specifically, $\hat\theta$ is shown to coincide with the solution of the
hard margin SVM under over-parametrized Gaussian mixture models (Muthukumar
et al., 2019; Wang and Thrampoulidis, 2021) and beyond (Hsu et al., 2021). In the
over-parametrized logistic regression model, $\hat\theta$ is also closely connected to the solution of
maximizing the log-likelihood, obtained by gradient descent with sufficiently small step
size (Soudry et al., 2018; Cao et al., 2021). In the over-parametrized setting $p \gg n$, the
hyperplane $\{x \mid x^\top \hat\theta = 0\}$ separates the training data perfectly, leading to a classifier
that has zero training error. There is a growing literature (Cao et al., 2021; Wang and
Thrampoulidis, 2021; Chatterji and Long, 2021; Minsker et al., 2021) showing that
interpolating classifiers $\bar g(x) = 1\{x^\top \hat\theta > 0\}$ can also have vanishing misclassification error
$\mathbb{P}\{\bar g(X) \neq Y\}$ in (sub-)Gaussian mixture models. We extend these works in this paper,
motivated by the following observations.
First, we notice that the separating hyperplane $\{x \mid x^\top \hat\theta = 0\}$ considered in the above-
mentioned literature has no intercept. Under symmetric Gaussian mixture models, that
is, $\mathbb{P}\{Y = 1\} = \mathbb{P}\{Y = 0\}$ and $X \mid Y = k \sim N_p((2k-1)\mu, \Sigma)$ for $k \in \{0,1\}$, the
optimal (Bayes) rule is indeed based on a hyperplane through the origin (no intercept).
However, this is no longer true in the asymmetric setting where the class probabilities
differ, $\mathbb{P}\{Y = 1\} \neq \mathbb{P}\{Y = 0\}$, rendering the usage of $\bar g(x)$ questionable. Although Wang
and Thrampoulidis (2021) shows that in the asymmetric setting $\mathbb{P}\{\bar g(X) \neq Y\}$ still tends
to zero if the separation between $X \mid Y = 0$ and $X \mid Y = 1$ diverges, the rate of this
convergence is unfortunately exponentially slower than the optimal rate, in part due to
not using an intercept. This motivates us to propose an improved linear classifier based
on $\hat\theta$ that includes an intercept, formally introduced in (1.1) below. Finding a meaningful
intercept under the interpolation of $\hat\theta$ requires extra care, as standard approaches, such
as empirical risk minimization, cannot be used when the hyperplane $\{x \mid x^\top \hat\theta = 0\}$
separates the training data perfectly.
Second, the aforementioned works all focus on the misclassification risk $\mathbb{P}\{\bar g(X) \neq Y\}$.
In particular, they show that $\mathbb{P}\{\bar g(X) \neq Y\}$ vanishes only if the separation between the
two mixture distributions diverges. In general, the excess risk, the difference between
the misclassification error and the Bayes error, is a more meaningful criterion, because
the Bayes error $\inf_h \mathbb{P}\{h(X) \neq Y\}$ is the smallest possible misclassification error among
all classifiers and generally does not vanish. For this reason, we focus on analyzing the
excess risk of the proposed classifier, and our results are informative even if the separation
between the two mixture distributions does not diverge (and therefore the Bayes risk does
not vanish).
Summarizing, the existing results on interpolating classifiers are not satisfactory, as
they only consider stylized examples that do not address the more realistic scenarios where
the mixture probabilities are asymmetric and the Bayes error does not vanish. In fact, we
will argue that these interpolation methods without an intercept in the literature actually
fail in the asymmetric setting when the conditional distributions are not asymptotically
distinguishable, which is the statistically more challenging case. In this work, we instead
analyze the interpolating classifier with a judiciously chosen intercept under a recently
proposed latent, low-dimensional statistical model (Bing and Wegkamp, 2022), and our
results reveal that its excess risk has a minimax-optimal rate of convergence in the over-
parametrized setting even when the separation between the two mixture distributions
does not diverge. Together with the interpolation property of the proposed classifier,
we thus provide a concrete instance of the interesting phenomenon that overfitting and
minimax-optimal generalization performance can coexist in a more realistic statistical
setting, against traditional statistical belief.
1.1 Our contributions
Concretely, in this paper we study the linear classifier
$$\tilde g(x) = 1\{x^\top \hat\theta + \tilde\beta_0 > 0\}, \quad \text{for any } x \in \mathbb{R}^p, \qquad (1.1)$$
based on the minimum-norm interpolator $\hat\theta$ and some estimated intercept $\tilde\beta_0 \in \mathbb{R}$. Following
Bing and Wegkamp (2022), we assume that each feature consists of linear combinations
of hidden low-dimensional Gaussian components, obscured by independent,
possibly non-Gaussian, noise. The low-dimensional Gaussian component suggests taking
a Linear Discriminant Analysis (LDA) approach, reducing the problem to finding (i) the
unknown low-dimensional space, (ii) its dimension $K$, with $K \ll n$, and (iii) the optimal
hyperplane $\{z \in \mathbb{R}^K \mid z^\top \beta + \beta_0 = 0\}$ in this latent space. The formal model and expression
of the optimal hyperplane are described in Section 2. Existing literature on LDA
in high-dimensional classification problems often imposes sparsity on the coefficients of
the hyperplane (Tibshirani et al., 2002; Fan and Fan, 2008; Witten and Tibshirani, 2011;
Shao et al., 2011; Cai and Liu, 2011; Mai et al., 2012; Cai and Zhang, 2019). In this
work, we take a different route and do not assume that the high-dimensional features are
Gaussian, nor rely on any sparsity assumption.
The recent work Bing and Wegkamp (2022) successfully utilized Principal Component
Regression (PCR) to estimate $z^\top \beta$, with the latent dimension estimated via the method
developed in Bing and Wegkamp (2019). The classifier in (1.1) instead estimates $z^\top \beta$ by
$x^\top \hat\theta$ via the minimum-norm interpolator $\hat\theta$. This estimator can be viewed as a limiting
case of PCR, with the number of retained principal components equal to $p$ (not $K$),
and as such is an extension of Bing and Wegkamp (2022). The practical advantage of this
extension is to avoid estimation of the latent dimension $K$, while it also sheds light
on the robustness of the PCR-based classifiers of Bing and Wegkamp (2022) against
misspecification of $K$. From a theoretical perspective, it is surprising to see that $x^\top \hat\theta$
adapts to the low-dimensional structure in $z^\top \beta$, as explained below.
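For concreteness, once the direction $\hat\theta = X^+ y$ and an intercept estimate $\tilde\beta_0$ are available (the latter assumed here to come from the hold-out correction described in Section 3.2), the decision rule (1.1) amounts to a single thresholding step. The sketch below illustrates only this rule, not the paper's full procedure.

```python
import numpy as np

def classify(x: np.ndarray, theta_hat: np.ndarray, beta0_tilde: float) -> int:
    """Linear rule (1.1): predict label 1 iff x^T theta_hat + beta0_tilde > 0."""
    return int(x @ theta_hat + beta0_tilde > 0)

def classify_batch(X_new: np.ndarray, theta_hat: np.ndarray, beta0_tilde: float) -> np.ndarray:
    """Apply the same rule to each row of an m x p matrix of new features."""
    return (X_new @ theta_hat + beta0_tilde > 0).astype(int)
```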
Section 4 provides theoretical guarantees for our proposed classifier. Theorem 8 in
Section 4 states that $x^\top \hat\theta$ consistently estimates $z^\top \beta$ by adapting to the low-dimensional
structure, and the rate of this convergence is often minimax-optimal in the over-parametrized
setting. Establishing Theorem 8 is our main technical challenge; its proof occupies
a large part of the paper and is delegated to Section 6.1. Although this convergence is
in line with current developments in regression showing that $\hat\theta$ surprisingly succeeds, our
analysis is more complicated as $y$ is no longer linearly related to $X$. In Section 4.1 we also
explain that arguments similar to those explored in Bing and Wegkamp (2022) cannot be
used here. A key step in our proof is to recognize and characterize the implicit regularization of
$\hat\theta = X^+ y$ in high-dimensional factor models. In our proof, we generalize existing analyses
of factor models (Bai, 2003; Bai and Ng, 2008; Fan et al., 2013; Stock and Watson,
2002) by relaxing the stringent conditions that require all singular values of the latent
components of $X$ to grow at the same rate, proportional to the dimension $p$.
Given the success in estimating $z^\top \beta$ via $x^\top \hat\theta$, a rather difficult $p$-dimensional problem,
one would expect that consistent estimation of $\beta_0$ is much easier. Surprisingly, Proposition
4 in Section 3.2 shows that this is not the case. The natural plug-in estimate of $\beta_0$, based on
$\hat\theta$ and standard non-parametric estimates of the conditional means and label probabilities,
always takes the value $1/2$, regardless of the true value of $\beta_0$. The same is true for an
estimate based on empirical risk minimization. Simulations confirm that this problematic
behavior leads to an inferior classifier. In Section 3.2 we offer a simple rectification and
propose to estimate $\beta_0$ using an independent hold-out sample. In Proposition 11 of
Section 4, we derive the consistency of our proposed estimator of $\beta_0$. Finally, in Theorem
12 we establish the rate of convergence of the excess misclassification risk of our proposed
classifier and discuss its minimax-optimal properties in Remark 3.
In view of the optimal guarantees of the proposed classifier in the over-parametrized
setting, we also make an interesting observation about its interpolation property. Specifically,
Lemma 3 in Section 3.1 and our discussion in Section 3.3 reveal that the interpolation
property crucially depends on the way we encode the labels. For instance, interpolation
always happens if we encode $Y$ as $\{-1, 1\}$, whereas this is not always the case for the
$\{0, 1\}$ encoding scheme unless the majority class is encoded as $0$. This suggests that
interpolation is a rather arbitrary property.
The paper is organized as follows. In Section 2 we formally introduce the statistical
model. We discuss the interpolation property of the proposed classifier and introduce
the estimator of the intercept in Section 3. Section 4 is devoted to studying the rate of
convergence of the excess risk of the proposed classifier. Section 5 contains simulation
studies. The main proofs are deferred to Section 6, while auxiliary lemmas are stated in
Section 7.
1.2 Notation
We use the common notation $\varphi(x) = \exp(-x^2/2)/\sqrt{2\pi}$ for the standard normal density,
and denote by $\Phi(x) = \int \varphi(t)\, 1\{t \le x\}\, dt$ its c.d.f.
For any positive integer $d$, we write $[d] := \{1, \ldots, d\}$. For two numbers $a$ and $b$, we
write $a \wedge b = \min\{a, b\}$ and $a \vee b = \max\{a, b\}$. For any two sequences $a_n$ and $b_n$, we
write $a_n \lesssim b_n$ if there exists some constant $C$ such that $a_n \le C b_n$. The notation $a_n \asymp b_n$
stands for $a_n \lesssim b_n$ and $b_n \lesssim a_n$. We often write $a_n \ll b_n$ if $a_n = o(b_n)$ and $a_n \gg b_n$ for
$b_n = o(a_n)$.
For any vector $v$, we use $\|v\|_q$ to denote its $\ell_q$ norm for $0 \le q \le \infty$. We also write
$\|v\|^2_Q = v^\top Q v$ for any commensurate, square matrix $Q$. For any real-valued matrix $M \in
\mathbb{R}^{r \times q}$, we use $M^+$ to denote the Moore-Penrose inverse of $M$, and $\sigma_1(M) \ge \sigma_2(M) \ge \cdots \ge
\sigma_{\min(r,q)}(M)$ to denote the singular values of $M$ in non-increasing order. We define the
operator norm $\|M\|_{\mathrm{op}} = \sigma_1(M)$. For a symmetric positive semi-definite matrix $Q \in \mathbb{R}^{p \times p}$,
we use $\lambda_1(Q) \ge \lambda_2(Q) \ge \cdots \ge \lambda_p(Q)$ to denote the eigenvalues of $Q$ in non-increasing
order. We use $I_d$ to denote the $d \times d$ identity matrix and use $\mathbf{1}_d$ ($\mathbf{0}_d$) to denote the vector
with all ones (zeroes). For $d_1 \ge d_2$, we use $\mathcal{O}_{d_1 \times d_2}$ to denote the set of all $d_1 \times d_2$ matrices
with orthonormal columns.
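As a quick numerical companion to these definitions (purely illustrative and not part of the paper), the matrix quantities above map onto standard linear-algebra routines:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 3))              # a generic r x q matrix

M_plus = np.linalg.pinv(M)                   # Moore-Penrose inverse M^+
sigma = np.linalg.svd(M, compute_uv=False)   # singular values in non-increasing order
op_norm = sigma[0]                           # operator norm ||M||_op = sigma_1(M)

Q = M.T @ M                                  # symmetric positive semi-definite matrix
lam = np.linalg.eigvalsh(Q)[::-1]            # eigenvalues of Q in non-increasing order

v = rng.standard_normal(3)
l2_norm = np.linalg.norm(v, 2)               # l_2 norm of v
v_Q_sq = float(v @ Q @ v)                    # ||v||_Q^2 = v^T Q v
```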
2 Background
Suppose our training data consists of independent copies of the pair $(X, Y)$ with features
$X \in \mathbb{R}^p$ generated according to
$$X = AZ + W \qquad (2.1)$$
and labels $Y \in \{0, 1\}$. Here $A$ is a deterministic, unknown $p \times K$ loading matrix, $Z \in \mathbb{R}^K$
are unobserved, latent factors and $W$ is random noise. We assume throughout this study
the following set of assumptions (a small simulation sketch of the resulting data-generating process is given after the list):
(i) $W$ is independent of both $Z$ and $Y$
(ii) $\mathbb{E}[Z] = \mathbf{0}_K$, $\mathbb{E}[W] = \mathbf{0}_p$
(iii) $A$ has rank $K$
(iv) $Z$ is a mixture of two Gaussians,
$$Z \mid Y = k \sim N_K(\alpha_k, \Sigma_{Z|Y}), \qquad \mathbb{P}(Y = k) = \pi_k, \quad k \in \{0, 1\}, \qquad (2.2)$$
with different means $\alpha_0 := \mathbb{E}[Z \mid Y = 0]$ and $\alpha_1 := \mathbb{E}[Z \mid Y = 1]$, but with the same,
strictly positive definite covariance matrix
$$\Sigma_{Z|Y} := \mathrm{Cov}(Z \mid Y = 0) = \mathrm{Cov}(Z \mid Y = 1) \qquad (2.3)$$
(v) $W = \Sigma_W^{1/2} V$, for some positive semi-definite matrix $\Sigma_W$, with $\mathbb{E}[V] = \mathbf{0}_p$, $\mathbb{E}[VV^\top] = I_p$ and $\mathbb{E}[\exp(u^\top V)] \le \exp(\sigma^2/2)$ for all $\|u\|_2 = 1$, for some $0 < \sigma < \infty$
(vi) $\pi_0 = \mathbb{P}\{Y = 0\}$ and $\pi_1 = \mathbb{P}\{Y = 1\}$ are fixed and strictly positive.
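The snippet below is a minimal simulation of this data-generating process (Python/NumPy, purely illustrative and not part of the paper; all dimensions and parameter values are arbitrary choices, and Gaussian noise is used as one admissible instance of the sub-Gaussian noise in (v)).

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, K = 200, 1000, 3              # over-parametrized regime: p >> n >> K
pi1 = 0.3                           # P(Y = 1); pi0 = 1 - pi1

# Model ingredients (arbitrary illustrative choices).
A = rng.standard_normal((p, K))     # deterministic loading matrix, rank K almost surely
delta = 2.0 * np.ones(K)            # separation alpha1 - alpha0 between the two classes
alpha1 = (1 - pi1) * delta          # chosen so that E[Z] = pi0*alpha0 + pi1*alpha1 = 0_K
alpha0 = -pi1 * delta
Sigma_ZgY = np.eye(K)               # common conditional covariance Sigma_{Z|Y}
Sigma_W = 0.5 * np.eye(p)           # noise covariance Sigma_W

# Draw labels, latent Gaussian factors, and (Gaussian, hence sub-Gaussian) noise.
Y = rng.binomial(1, pi1, size=n)
means = np.where(Y[:, None] == 1, alpha1, alpha0)
Z = means + rng.multivariate_normal(np.zeros(K), Sigma_ZgY, size=n)
W = rng.multivariate_normal(np.zeros(p), Sigma_W, size=n)
X = Z @ A.T + W                     # n x p feature matrix, X = A Z + W row-wise

# Minimum-norm interpolator used by the classifier in (1.1).
theta_hat = np.linalg.pinv(X) @ Y.astype(float)
```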
We emphasize that the distributions of $X$ given $Y$ are not necessarily Gaussian. This
mathematical framework allows for a substantial dimension reduction in classification for
$K \ll p$, as is evident from the inequality
$$R^*_x := \inf_g \mathbb{P}\{g(X) \neq Y\} \;\ge\; R^*_z := \inf_h \mathbb{P}\{h(Z) \neq Y\} \qquad (2.4)$$
in terms of the Bayes misclassification errors, see Bing and Wegkamp (2022, Lemma
1). As in Bing and Wegkamp (2022), we first change the classification problem into a
regression problem by drawing a connection between the Bayes rule and a quantity that can
be identified via regressing $Y$ onto $Z$. We denote by $\Sigma_Z = \mathbb{E}[ZZ^\top]$ the unconditional
covariance matrix of $Z$ and we define
$$\beta = \pi_0 \pi_1 \Sigma_Z^{-1}(\alpha_1 - \alpha_0), \qquad (2.5)$$
$$\beta_0 = -\tfrac{1}{2}(\alpha_0 + \alpha_1)^\top \beta + \big[1 - (\alpha_1 - \alpha_0)^\top \beta\big]\, \pi_0 \pi_1 \log\frac{\pi_1}{\pi_0}. \qquad (2.6)$$
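To make (2.5) and (2.6) concrete, the short sketch below (illustrative only; it reuses the hypothetical parameter values from the earlier simulation sketch rather than anything prescribed by the paper) computes the population quantities $\beta$ and $\beta_0$ and applies the resulting linear rule $1\{z^\top \beta + \beta_0 > 0\}$ in the latent space.

```python
import numpy as np

# Population parameters (same illustrative values as in the simulation sketch above).
K = 3
pi1, pi0 = 0.3, 0.7
delta = 2.0 * np.ones(K)
alpha1, alpha0 = pi0 * delta, -pi1 * delta
Sigma_ZgY = np.eye(K)

# Unconditional covariance of Z: Sigma_Z = Sigma_{Z|Y} + pi0*pi1*(alpha1-alpha0)(alpha1-alpha0)^T,
# using E[Z] = 0 and the mixture structure in (2.2).
Sigma_Z = Sigma_ZgY + pi0 * pi1 * np.outer(delta, delta)

# Direction and intercept of the optimal hyperplane in the latent space, (2.5)-(2.6).
beta = pi0 * pi1 * np.linalg.solve(Sigma_Z, alpha1 - alpha0)
beta0 = (-0.5 * (alpha0 + alpha1) @ beta
         + (1.0 - (alpha1 - alpha0) @ beta) * pi0 * pi1 * np.log(pi1 / pi0))

def latent_rule(z: np.ndarray) -> int:
    """Linear rule 1{z^T beta + beta0 > 0} in the latent K-dimensional space."""
    return int(z @ beta + beta0 > 0)

# Example: classify the two class-conditional means.
print(latent_rule(alpha0), latent_rule(alpha1))   # expected: 0 and 1
```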