Interpolating Discriminant Functions in
High-Dimensional Gaussian Latent Mixtures
Xin Bing and Marten Wegkamp
Abstract
This paper considers binary classification of high-dimensional features under a
postulated model with a low-dimensional latent Gaussian mixture structure and
non-vanishing noise. A generalized least squares estimator is used to estimate the
direction of the optimal separating hyperplane. The estimated hyperplane is shown
to interpolate on the training data. While the direction vector can be consistently
estimated, as could be expected from recent results in linear regression, a naive plug-
in estimate fails to consistently estimate the intercept. A simple correction that
requires an independent hold-out sample renders the procedure minimax optimal in
many scenarios. The interpolation property of the latter procedure can be retained,
but surprisingly depends on the way the labels are encoded.
Keywords: High-dimensional classification, latent factor model, generalized least squares,
interpolation, benign overfitting, overparametrization, discriminant analysis, minimax
optimal rate of convergence.
1 Introduction
We consider binary classification of a high-dimensional feature vector. That is, we are
given an $n \times p$ data matrix $X$ with $n$ independent $p$-dimensional rows, with $p \gg n$, and
a response vector $y \in \{0,1\}^n$ of corresponding labels, and the task is to predict the
labels of new $p$-dimensional features. Complex models such as kernel support vector
machines (SVM) and deep neural networks have been observed to have surprisingly good
generalization performance despite overfitting the training data. Towards understanding
this benign overfitting phenomenon, one popular example that has gained increasing
attention is the estimator $\hat\theta = X^+ y$ in the regression context. Here $X^+$ represents the
Moore-Penrose inverse of $X$. It can be shown that $\hat\theta$ has the minimal $\ell_2$-norm among all
least squares estimators, that is,
$$\|\hat\theta\|_2 = \min\Big\{ \|\theta\|_2 \,:\, \|y - X\theta\|_2 = \min_{u \in \mathbb{R}^p} \|y - Xu\|_2 \Big\}.$$
Department of Statistical Sciences, University of Toronto. E-mail: xin.bing@toronto.ca
Department of Mathematics and Department of Statistics and Data Science, Cornell University.
E-mail: marten.wegkamp@cornell.edu.
Since often $X\hat\theta = y$ holds, for instance, when $X$ has full row rank $n < p$, the estimator
$\hat\theta$ is referred to as the minimum-norm interpolator. It serves as the prime example to
illustrate the phenomenon that overfitting in linear regression can be benign, in that $\hat\theta$
can still lead to good prediction results. See, for instance, Belkin et al. (2018); Bartlett
et al. (2020); Hastie et al. (2022); Bunea et al. (2022) and the references therein.
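To make the variational characterization and the interpolation property above concrete, here is a minimal numerical sketch (in Python with NumPy; purely illustrative and not part of the paper, with arbitrary dimensions). It computes $\hat\theta = X^+ y$ for an over-parametrized design, checks $X\hat\theta = y$, and verifies that perturbing $\hat\theta$ by a null-space direction of $X$ yields another least squares solution with a strictly larger $\ell_2$-norm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                      # over-parametrized: p > n
X = rng.standard_normal((n, p))     # full row rank with probability one
y = rng.integers(0, 2, size=n).astype(float)

# Minimum-norm least squares solution via the Moore-Penrose inverse.
theta_hat = np.linalg.pinv(X) @ y

# Interpolation: the fitted values match the labels exactly.
assert np.allclose(X @ theta_hat, y)

# Any other least squares solution differs from theta_hat by an element of the
# null space of X and therefore has a strictly larger l2-norm.
_, _, Vt = np.linalg.svd(X)         # rows n..p-1 of Vt span the null space of X
theta_other = theta_hat + 2.0 * Vt[-1]
assert np.allclose(X @ theta_other, y)
assert np.linalg.norm(theta_other) > np.linalg.norm(theta_hat)
```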
More recently, the minimum-norm interpolator further finds its importance in binary
classification problems. Specifically, $\hat\theta$ is shown to coincide with the solution of the
hard margin SVM under over-parametrized Gaussian mixture models (Muthukumar
et al., 2019; Wang and Thrampoulidis, 2021) and beyond (Hsu et al., 2021). In the
over-parametrized logistic regression model, $\hat\theta$ is also closely connected to the solution of
maximizing the log-likelihood, obtained by gradient descent with sufficiently small step
size (Soudry et al., 2018; Cao et al., 2021). In the over-parametrized setting $p \gg n$, the
hyperplane $\{x \mid x^\top \hat\theta = 0\}$ separates the training data perfectly, leading to a classifier
that has zero training error. There is a growing literature (Cao et al., 2021; Wang and
Thrampoulidis, 2021; Chatterji and Long, 2021; Minsker et al., 2021) showing that
interpolating classifiers $\bar g(x) = 1\{x^\top \hat\theta > 0\}$ can also have vanishing misclassification error
$\mathbb{P}\{\bar g(X) \neq Y\}$ in (sub-)Gaussian mixture models. We extend these works in this paper,
motivated by the following observations.
First, we notice that the separating hyperplane $\{x \mid x^\top \hat\theta = 0\}$ considered in the above-
mentioned literature has no intercept. Under symmetric Gaussian mixture models, that
is, $\mathbb{P}\{Y = 1\} = \mathbb{P}\{Y = 0\}$ and $X \mid Y = k \sim N_p((2k-1)\mu, \Sigma)$ for $k \in \{0,1\}$, the
optimal (Bayes) rule is indeed based on a hyperplane through the origin (no intercept).
However, this is no longer true in the asymmetric setting where the class probabilities
differ, $\mathbb{P}\{Y = 1\} \neq \mathbb{P}\{Y = 0\}$, rendering the usage of $\bar g(x)$ questionable. Although Wang
and Thrampoulidis (2021) shows that in the asymmetric setting $\mathbb{P}\{\bar g(X) \neq Y\}$ still tends
to zero if the separation between $X \mid Y = 0$ and $X \mid Y = 1$ diverges, the rate of this
convergence is unfortunately exponentially slower than the optimal rate, in part due to
not using an intercept. This motivates us to propose an improved linear classifier based
on $\hat\theta$ that includes an intercept, formally introduced in (1.1) below. Finding a meaningful
intercept under the interpolation of $\hat\theta$ requires extra care, as standard approaches, such
as empirical risk minimization, cannot be used when the hyperplane $\{x \mid x^\top \hat\theta = 0\}$
separates the training data perfectly.
Second, the aforementioned works all focus on the misclassification risk $\mathbb{P}\{\bar g(X) \neq Y\}$.
In particular, they show that $\mathbb{P}\{\bar g(X) \neq Y\}$ vanishes only if the separation between the
two mixture distributions diverges. In general, the excess risk, the difference between
the misclassification error and the Bayes error, is a more meaningful criterion, because
the Bayes error $\inf_h \mathbb{P}\{h(X) \neq Y\}$ is the smallest possible misclassification error among
all classifiers and generally does not vanish. For this reason, we focus on analyzing the
excess risk of the proposed classifier, and our results are informative even if the separation
between the two mixture distributions does not diverge (and therefore the Bayes risk does
not vanish).
Summarizing, the existing results on interpolating classifiers are not satisfactory, as
they only consider stylized examples that do not address the more realistic scenarios where
the mixture probabilities are asymmetric and the Bayes error does not vanish. In fact, we
will argue that these interpolation methods without an intercept in the literature actually
fail in the asymmetric setting when the conditional distributions are not asymptotically
distinguishable, which is the statistically more challenging case. In this work, we instead
analyze the interpolating classifier with a judiciously chosen intercept under a recently
proposed latent, low-dimensional statistical model (Bing and Wegkamp, 2022), and our
results reveal that its excess risk has a minimax-optimal rate of convergence in the over-
parametrized setting even when the separation between the two mixture distributions
does not diverge. Together with the interpolation property of the proposed classifier,
we thus provide a concrete instance of the interesting phenomenon that overfitting and
minimax-optimal generalization performance can coexist in a more realistic statistical
setting, against traditional statistical belief.
1.1 Our contributions
Concretely, in this paper we study the linear classifier
$$\tilde g(x) = 1\{x^\top \hat\theta + \tilde\beta_0 > 0\}, \quad \text{for any } x \in \mathbb{R}^p, \qquad (1.1)$$
based on the minimum-norm interpolator $\hat\theta$ and some estimated intercept $\tilde\beta_0 \in \mathbb{R}$. Following
Bing and Wegkamp (2022), we assume that each feature consists of linear combinations
of hidden low-dimensional Gaussian components, obscured by independent,
possibly non-Gaussian, noise. The low-dimensional Gaussian component suggests taking
a Linear Discriminant Analysis (LDA) approach, reducing the problem to finding (i) the
unknown low-dimensional space, (ii) its dimension $K$, with $K \ll n$, and (iii) the optimal
hyperplane $\{z \in \mathbb{R}^K \mid z^\top \beta + \beta_0 = 0\}$ in this latent space. The formal model and expression
of the optimal hyperplane are described in Section 2. Existing literature on LDA
in high-dimensional classification problems often imposes sparsity on the coefficients of
the hyperplane (Tibshirani et al., 2002; Fan and Fan, 2008; Witten and Tibshirani, 2011;
Shao et al., 2011; Cai and Liu, 2011; Mai et al., 2012; Cai and Zhang, 2019). In this
work, we take a different route and do not assume that the high-dimensional features are
Gaussian, nor rely on any sparsity assumption.
The recent work Bing and Wegkamp (2022) successfully utilized Principal Component
Regression (PCR) to estimate $z^\top \beta$, with the latent dimension estimated via the method
developed in Bing and Wegkamp (2019). The classifier in (1.1) instead estimates $z^\top \beta$ by
$x^\top \hat\theta$ via the minimum-norm interpolator $\hat\theta$. This estimator can be viewed as a limiting
case of PCR, with the number of retained principal components equal to $p$ (not $K$),
and as such is an extension of Bing and Wegkamp (2022). The practical advantage of this
extension is to avoid estimation of the latent dimension $K$, while it also sheds light
on the robustness of the PCR-based classifiers of Bing and Wegkamp (2022) against
misspecification of $K$. From a theoretical perspective, it is surprising to see that $x^\top \hat\theta$
adapts to the low-dimensional structure in $z^\top \beta$, as explained below.
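For concreteness, once the direction $\hat\theta = X^+ y$ and an intercept estimate $\tilde\beta_0$ are available (the latter assumed here to come from the hold-out correction described in Section 3.2), the decision rule (1.1) amounts to a single thresholding step. The sketch below illustrates only this rule, not the paper's full procedure.

```python
import numpy as np

def classify(x: np.ndarray, theta_hat: np.ndarray, beta0_tilde: float) -> int:
    """Linear rule (1.1): predict label 1 iff x^T theta_hat + beta0_tilde > 0."""
    return int(x @ theta_hat + beta0_tilde > 0)

def classify_batch(X_new: np.ndarray, theta_hat: np.ndarray, beta0_tilde: float) -> np.ndarray:
    """Apply the same rule to each row of an m x p matrix of new features."""
    return (X_new @ theta_hat + beta0_tilde > 0).astype(int)
```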
Section 4 provides theoretical guarantees for our proposed classifier. Theorem 8 in
Section 4 states that $x^\top \hat\theta$ consistently estimates $z^\top \beta$ by adapting to the low-dimensional
structure, and the rate of this convergence is often minimax-optimal in the over-parametrized
setting. Establishing Theorem 8 is our main technical challenge; its proof occupies
a large part of the paper and is delegated to Section 6.1. Although this convergence is
in line with current developments in regression showing that $\hat\theta$ surprisingly succeeds, our
analysis is more complicated as $y$ is no longer linearly related to $X$. In Section 4.1 we also
explain that arguments similar to those explored in Bing and Wegkamp (2022) cannot be
used here. A key step in our proof is to recognize and characterize the implicit regularization of
$\hat\theta = X^+ y$ in high-dimensional factor models. In our proof, we generalize existing analyses
of factor models (Bai, 2003; Bai and Ng, 2008; Fan et al., 2013; Stock and Watson,
2002) by relaxing the stringent conditions that require all singular values of the latent
components of $X$ to grow at the same rate, proportional to the dimension $p$.
Given the success in estimating $z^\top \beta$ via $x^\top \hat\theta$, a rather difficult $p$-dimensional problem,
one would expect that consistent estimation of $\beta_0$ is much easier. Surprisingly, Proposition
4 in Section 3.2 shows that this is not the case. The natural plug-in estimate of $\beta_0$, based on
$\hat\theta$ and standard non-parametric estimates of the conditional means and label probabilities,
always takes the value $1/2$, regardless of the true value of $\beta_0$. The same is true for an
estimate based on empirical risk minimization. Simulations confirm that this problematic
behavior leads to an inferior classifier. In Section 3.2 we offer a simple rectification and
propose to estimate $\beta_0$ using an independent hold-out sample. In Proposition 11 of
Section 4, we derive the consistency of our proposed estimator of $\beta_0$. Finally, in Theorem
12 we establish the rate of convergence of the excess misclassification risk of our proposed
classifier and discuss its minimax-optimal properties in Remark 3.
In view of the optimal guarantees of the proposed classifier in the over-parametrized
setting, we also make an interesting observation about its interpolation property. Specifically,
Lemma 3 in Section 3.1 and our discussion in Section 3.3 reveal that the interpolation
property crucially depends on the way we encode the labels. For instance, interpolation
always happens if we encode $Y$ as $\{-1, 1\}$, whereas this is not always the case for the
$\{0, 1\}$ encoding scheme unless the majority class is encoded as $0$. This suggests that
interpolation is a rather arbitrary property.
The paper is organized as follows. In Section 2 we formally introduce the statistical
model. We discuss the interpolation property of the proposed classifier and introduce
the estimator of the intercept in Section 3. Section 4 is devoted to studying the rate of
convergence of the excess risk of the proposed classifier. Section 5 contains simulation
studies. The main proofs are deferred to Section 6, while auxiliary lemmas are stated in
Section 7.
1.2 Notation
We use the common notation $\varphi(x) = \exp(-x^2/2)/\sqrt{2\pi}$ for the standard normal density,
and denote by $\Phi(x) = \int \varphi(t)\, 1\{t \le x\}\, dt$ its c.d.f.
For any positive integer $d$, we write $[d] := \{1, \ldots, d\}$. For two numbers $a$ and $b$, we
write $a \wedge b = \min\{a, b\}$ and $a \vee b = \max\{a, b\}$. For any two sequences $a_n$ and $b_n$, we
write $a_n \lesssim b_n$ if there exists some constant $C$ such that $a_n \le C b_n$. The notation $a_n \asymp b_n$
stands for $a_n \lesssim b_n$ and $b_n \lesssim a_n$. We often write $a_n \ll b_n$ if $a_n = o(b_n)$ and $a_n \gg b_n$ for
$b_n = o(a_n)$.
For any vector $v$, we use $\|v\|_q$ to denote its $\ell_q$ norm for $0 \le q \le \infty$. We also write
$\|v\|^2_Q = v^\top Q v$ for any commensurate, square matrix $Q$. For any real-valued matrix $M \in
\mathbb{R}^{r \times q}$, we use $M^+$ to denote the Moore-Penrose inverse of $M$, and $\sigma_1(M) \ge \sigma_2(M) \ge \cdots \ge
\sigma_{\min(r,q)}(M)$ to denote the singular values of $M$ in non-increasing order. We define the
operator norm $\|M\|_{\mathrm{op}} = \sigma_1(M)$. For a symmetric positive semi-definite matrix $Q \in \mathbb{R}^{p \times p}$,
we use $\lambda_1(Q) \ge \lambda_2(Q) \ge \cdots \ge \lambda_p(Q)$ to denote the eigenvalues of $Q$ in non-increasing
order. We use $I_d$ to denote the $d \times d$ identity matrix and use $\mathbf{1}_d$ ($\mathbf{0}_d$) to denote the vector
with all ones (zeroes). For $d_1 \ge d_2$, we use $\mathcal{O}_{d_1 \times d_2}$ to denote the set of all $d_1 \times d_2$ matrices
with orthonormal columns.
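As a quick numerical companion to these definitions (purely illustrative and not part of the paper), the matrix quantities above map onto standard linear-algebra routines:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 3))              # a generic r x q matrix

M_plus = np.linalg.pinv(M)                   # Moore-Penrose inverse M^+
sigma = np.linalg.svd(M, compute_uv=False)   # singular values in non-increasing order
op_norm = sigma[0]                           # operator norm ||M||_op = sigma_1(M)

Q = M.T @ M                                  # symmetric positive semi-definite matrix
lam = np.linalg.eigvalsh(Q)[::-1]            # eigenvalues of Q in non-increasing order

v = rng.standard_normal(3)
l2_norm = np.linalg.norm(v, 2)               # l_2 norm of v
v_Q_sq = float(v @ Q @ v)                    # ||v||_Q^2 = v^T Q v
```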
2 Background
Suppose our training data consists of independent copies of the pair $(X, Y)$ with features
$X \in \mathbb{R}^p$ generated according to
$$X = AZ + W \qquad (2.1)$$
and labels $Y \in \{0, 1\}$. Here $A$ is a deterministic, unknown $p \times K$ loading matrix, $Z \in \mathbb{R}^K$
are unobserved, latent factors and $W$ is random noise. We assume throughout this study
the following set of assumptions (a small simulation sketch of the resulting data-generating process is given after the list):
(i) $W$ is independent of both $Z$ and $Y$
(ii) $\mathbb{E}[Z] = \mathbf{0}_K$, $\mathbb{E}[W] = \mathbf{0}_p$
(iii) $A$ has rank $K$
(iv) $Z$ is a mixture of two Gaussians,
$$Z \mid Y = k \sim N_K(\alpha_k, \Sigma_{Z|Y}), \qquad \mathbb{P}(Y = k) = \pi_k, \quad k \in \{0, 1\}, \qquad (2.2)$$
with different means $\alpha_0 := \mathbb{E}[Z \mid Y = 0]$ and $\alpha_1 := \mathbb{E}[Z \mid Y = 1]$, but with the same,
strictly positive definite covariance matrix
$$\Sigma_{Z|Y} := \mathrm{Cov}(Z \mid Y = 0) = \mathrm{Cov}(Z \mid Y = 1) \qquad (2.3)$$
(v) $W = \Sigma_W^{1/2} V$, for some positive semi-definite matrix $\Sigma_W$, with $\mathbb{E}[V] = \mathbf{0}_p$, $\mathbb{E}[VV^\top] = I_p$ and $\mathbb{E}[\exp(u^\top V)] \le \exp(\sigma^2/2)$ for all $\|u\|_2 = 1$, for some $0 < \sigma < \infty$
(vi) $\pi_0 = \mathbb{P}\{Y = 0\}$ and $\pi_1 = \mathbb{P}\{Y = 1\}$ are fixed and strictly positive.
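The snippet below is a minimal simulation of this data-generating process (Python/NumPy, purely illustrative and not part of the paper; all dimensions and parameter values are arbitrary choices, and Gaussian noise is used as one admissible instance of the sub-Gaussian noise in (v)).

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, K = 200, 1000, 3              # over-parametrized regime: p >> n >> K
pi1 = 0.3                           # P(Y = 1); pi0 = 1 - pi1

# Model ingredients (arbitrary illustrative choices).
A = rng.standard_normal((p, K))     # deterministic loading matrix, rank K almost surely
delta = 2.0 * np.ones(K)            # separation alpha1 - alpha0 between the two classes
alpha1 = (1 - pi1) * delta          # chosen so that E[Z] = pi0*alpha0 + pi1*alpha1 = 0_K
alpha0 = -pi1 * delta
Sigma_ZgY = np.eye(K)               # common conditional covariance Sigma_{Z|Y}
Sigma_W = 0.5 * np.eye(p)           # noise covariance Sigma_W

# Draw labels, latent Gaussian factors, and (Gaussian, hence sub-Gaussian) noise.
Y = rng.binomial(1, pi1, size=n)
means = np.where(Y[:, None] == 1, alpha1, alpha0)
Z = means + rng.multivariate_normal(np.zeros(K), Sigma_ZgY, size=n)
W = rng.multivariate_normal(np.zeros(p), Sigma_W, size=n)
X = Z @ A.T + W                     # n x p feature matrix, X = A Z + W row-wise

# Minimum-norm interpolator used by the classifier in (1.1).
theta_hat = np.linalg.pinv(X) @ Y.astype(float)
```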
We emphasize that the distributions of $X$ given $Y$ are not necessarily Gaussian. This
mathematical framework allows for a substantial dimension reduction in classification for
$K \ll p$, as is evident from the inequality
$$R^*_x := \inf_g \mathbb{P}\{g(X) \neq Y\} \;\ge\; R^*_z := \inf_h \mathbb{P}\{h(Z) \neq Y\} \qquad (2.4)$$
in terms of the Bayes misclassification errors, see Bing and Wegkamp (2022, Lemma
1). As in Bing and Wegkamp (2022), we first change the classification problem into a
regression problem by drawing a connection between the Bayes rule and a quantity that can
be identified via regressing $Y$ onto $Z$. We denote by $\Sigma_Z = \mathbb{E}[ZZ^\top]$ the unconditional
covariance matrix of $Z$ and we define
$$\beta = \pi_0 \pi_1 \Sigma_Z^{-1}(\alpha_1 - \alpha_0), \qquad (2.5)$$
$$\beta_0 = -\tfrac{1}{2}(\alpha_0 + \alpha_1)^\top \beta + \big[1 - (\alpha_1 - \alpha_0)^\top \beta\big]\, \pi_0 \pi_1 \log\frac{\pi_1}{\pi_0}. \qquad (2.6)$$
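To make (2.5) and (2.6) concrete, the short sketch below (illustrative only; it reuses the hypothetical parameter values from the earlier simulation sketch rather than anything prescribed by the paper) computes the population quantities $\beta$ and $\beta_0$ and applies the resulting linear rule $1\{z^\top \beta + \beta_0 > 0\}$ in the latent space.

```python
import numpy as np

# Population parameters (same illustrative values as in the simulation sketch above).
K = 3
pi1, pi0 = 0.3, 0.7
delta = 2.0 * np.ones(K)
alpha1, alpha0 = pi0 * delta, -pi1 * delta
Sigma_ZgY = np.eye(K)

# Unconditional covariance of Z: Sigma_Z = Sigma_{Z|Y} + pi0*pi1*(alpha1-alpha0)(alpha1-alpha0)^T,
# using E[Z] = 0 and the mixture structure in (2.2).
Sigma_Z = Sigma_ZgY + pi0 * pi1 * np.outer(delta, delta)

# Direction and intercept of the optimal hyperplane in the latent space, (2.5)-(2.6).
beta = pi0 * pi1 * np.linalg.solve(Sigma_Z, alpha1 - alpha0)
beta0 = (-0.5 * (alpha0 + alpha1) @ beta
         + (1.0 - (alpha1 - alpha0) @ beta) * pi0 * pi1 * np.log(pi1 / pi0))

def latent_rule(z: np.ndarray) -> int:
    """Linear rule 1{z^T beta + beta0 > 0} in the latent K-dimensional space."""
    return int(z @ beta + beta0 > 0)

# Example: classify the two class-conditional means.
print(latent_rule(alpha0), latent_rule(alpha1))   # expected: 0 and 1
```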