
Since often $X\hat\theta = y$ holds, for instance, when $X$ has full row rank $n < p$, the estimator $\hat\theta$ is referred to as the minimum-norm interpolator. It serves as the prime example to illustrate the phenomenon that overfitting in linear regression can be benign, in that $\hat\theta$ can still lead to good prediction results. See, for instance, Belkin et al. (2018); Bartlett et al. (2020); Hastie et al. (2022); Bunea et al. (2022) and the references therein.
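When $X \in \mathbb{R}^{n\times p}$ has full row rank, $\hat\theta$ admits the familiar closed form recalled below; this display is only a sketch for the reader's convenience, stated in the notation of the surrounding text, and the formal definition of $\hat\theta$ is the one given earlier in the paper:
\[
\hat\theta \;=\; \arg\min_{\theta \in \mathbb{R}^p}\Big\{\|\theta\|_2 \,:\, X\theta = y\Big\} \;=\; X^\top (XX^\top)^{-1} y,
\qquad \operatorname{rank}(X) = n < p,
\]
so that $X\hat\theta = y$ and, among all interpolating solutions, $\hat\theta$ has the smallest Euclidean norm.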
More recently, the minimum-norm interpolator further finds its importance in binary classification problems. Specifically, $\hat\theta$ is shown to coincide with the solution of the hard-margin SVM under over-parametrized Gaussian mixture models (Muthukumar et al., 2019; Wang and Thrampoulidis, 2021) and beyond (Hsu et al., 2021). In the over-parametrized logistic regression model, $\hat\theta$ is also closely connected to the solution of maximizing the log-likelihood, obtained by gradient descent with sufficiently small step size (Soudry et al., 2018; Cao et al., 2021). In the over-parametrized setting $p \gg n$, the hyperplane $\{x \mid x^\top\hat\theta = 0\}$ separates the training data perfectly, leading to a classifier that has zero training error. There is a growing literature (Cao et al., 2021; Wang and Thrampoulidis, 2021; Chatterji and Long, 2021; Minsker et al., 2021) showing that interpolating classifiers $\bar g(x) = \mathbb{1}\{x^\top\hat\theta > 0\}$ can also have vanishing misclassification error $P\{\bar g(X) \neq Y\}$ in (sub-)Gaussian mixture models. We extend these works in this paper, motivated by the following observations.
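As a brief aside before stating these observations, the zero training error claim is an immediate consequence of interpolation. Assuming, for illustration only, that the responses used in the regression step are coded as $y_i \in \{-1, +1\}$ (the paper fixes its own coding earlier), interpolation gives
\[
x_i^\top \hat\theta = y_i \in \{-1, +1\} \quad \text{for all } i = 1, \dots, n,
\qquad\text{hence}\qquad
\mathbb{1}\{x_i^\top\hat\theta > 0\} = \mathbb{1}\{y_i = +1\},
\]
so $\bar g$ reproduces every training label and has zero training error.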
First, we notice that the separating hyperplane $\{x \mid x^\top\hat\theta = 0\}$ considered in the above-mentioned literature has no intercept. Under symmetric Gaussian mixture models, that is, $P\{Y=1\} = P\{Y=0\}$ and $X \mid Y = k \sim N_p((2k-1)\mu, \Sigma)$ for $k \in \{0, 1\}$, the optimal (Bayes) rule is indeed based on a hyperplane through the origin (no intercept). However, this is no longer true in the asymmetric setting where the class probabilities differ, $P\{Y=1\} \neq P\{Y=0\}$, rendering the usage of $\bar g(x)$ questionable. Although Wang and Thrampoulidis (2021) show that in the asymmetric setting $P\{\bar g(X) \neq Y\}$ still tends to zero if the separation between $X \mid Y=0$ and $X \mid Y=1$ diverges, the rate of this convergence is unfortunately exponentially slower than the optimal rate, in part due to not using an intercept. This motivates us to propose an improved linear classifier based on $\hat\theta$ that includes an intercept, formally introduced in (1.1) below. Finding a meaningful intercept under the interpolation of $\hat\theta$ requires extra care, as standard approaches, such as empirical risk minimization, cannot be used when the hyperplane $\{x \mid x^\top\hat\theta = 0\}$ separates the training data perfectly.
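To make the role of the intercept explicit, consider the Gaussian mixture model above with class probabilities $\pi_k = P\{Y = k\}$ and $X \mid Y = k \sim N_p((2k-1)\mu, \Sigma)$. A standard likelihood-ratio calculation (sketched here only for intuition; this is not the classifier analyzed in this paper) yields the Bayes rule
\[
g^*(x) \;=\; \mathbb{1}\Big\{ x^\top \Sigma^{-1}\mu + \tfrac{1}{2}\log\tfrac{\pi_1}{\pi_0} > 0 \Big\},
\]
whose intercept $\tfrac{1}{2}\log(\pi_1/\pi_0)$ is zero precisely when $\pi_1 = \pi_0$. Any asymmetry in the class probabilities therefore shifts the optimal separating hyperplane away from the origin, which an intercept-free classifier cannot capture.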
Second, the aforementioned works all focus on the misclassification risk $P\{\bar g(X) \neq Y\}$. In particular, they show that $P\{\bar g(X) \neq Y\}$ vanishes only if the separation between the two mixture distributions diverges. In general, the excess risk, that is, the difference between the misclassification error and the Bayes error, is a more meaningful criterion, because the Bayes error $\inf_h P\{h(X) \neq Y\}$ is the smallest possible misclassification error among all classifiers and generally does not vanish. For this reason, we focus on analyzing the excess risk of the proposed classifier, and our results are informative even if the separation between the two mixture distributions does not diverge (and therefore the Bayes risk does not vanish).
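In symbols, the quantity studied for a generic classifier $g$ is the excess risk
\[
P\{g(X) \neq Y\} \;-\; \inf_h P\{h(X) \neq Y\},
\]
where the infimum is taken over all measurable classifiers $h$; this display simply restates the criterion described in the paragraph above.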
Summarizing, the existing results on interpolating classifiers are not satisfactory, as they only consider stylized examples that do not address the more realistic scenarios in which the mixture probabilities are asymmetric and the Bayes error does not vanish. In fact, we will argue that these interpolation methods without intercept in the literature actually fail in the asymmetric setting when the conditional distributions are not asymptotically