
arbitrary data covariance and norm, recovering the benign overfitting conditions of Bartlett et al. (2020) as well as proving novel consistency results for the minimal-$\ell_1$-norm interpolator. Based on this uniform convergence framework, G. Wang et al. (2021) establish tight bounds for the minimal-$\ell_1$-norm interpolator under a sparse signal with isotropic data. Earlier work (Ju et al. 2020; Chinot et al. 2020; Li and Wei 2021) also studied the minimal-$\ell_1$-norm interpolator, without showing consistency. Though the minimal-$\ell_1$-norm interpolator suffers from an exponentially slow rate, Donhauser et al. (2022) show that the minimal-$\ell_p$-norm interpolator can achieve faster rates with $p$ close to 1. Zhou et al. (2021) show a risk-dependent (“localized”) bound that extends the uniform convergence guarantee for interpolators to predictors with arbitrary training loss, and use it to establish generalization for regularized estimators such as Ridge and LASSO. Our Moreau envelope theory builds on the techniques developed in this line of work to apply uniform convergence in the interpolation regime.
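For concreteness, the minimal-$\ell_p$-norm interpolator discussed above can be written as the standard constrained program below (stated here without an intercept; the exact formulations in the cited works may differ in minor details):
$$
\hat{w}_p \;=\; \operatorname*{arg\,min}_{w \in \mathbb{R}^d} \|w\|_p \quad \text{subject to} \quad \langle w, x_i \rangle = y_i \ \text{ for all } i = 1, \dots, n.
$$
Taking $p = 2$ gives the minimum-$\ell_2$-norm (ridgeless) interpolator studied by Bartlett et al. (2020), while $p = 1$ gives the basis pursuit solution analyzed in the works cited above.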
In terms of requirements on the data distribution, Bartlett et al. (2020) and Tsigler and Bartlett
(2020) only require the feature vector to be sub-Gaussian, but assume a well-specified linear model
for the conditional distribution of the label. The uniform-convergence-based works also assume a well-specified linear model, but their assumptions are more restrictive in the sense that the marginal distribution of the feature vector needs to be exactly Gaussian, because their proof techniques rely on the Gaussian Minimax Theorem (GMT). Our Moreau envelope theory’s application to linear regression significantly relaxes the assumption on the label-generating process, though it is still constrained by
the Gaussian data assumption. Shamir (2022) also studies model misspecification in linear regression,
but allows non-Gaussian features, and shows that benign overfitting does not necessarily occur in the
most general setting, even with a spiked-covariance structure (see his Example 1).
For linear classification, Muthukumar et al. (2021) analyze the $\ell_2$ max-margin classifier by connecting it to minimal-norm interpolation in regression. Similarly, our analysis in the classification case depends on the fact that the squared hinge loss goes through the same transformation as the square loss under smoothing by the Moreau envelope (a computation illustrating this is sketched after this paragraph). Donhauser et al. (2022) prove generalization bounds for $\ell_p$ max-margin classifiers in the isotropic setting and do not consider the spiked-covariance case. Deng et al. (2021), Montanari et al. (2019), and Liang and Sur (2020) derive exact expressions for the asymptotic prediction risk of the $\ell_2$ and $\ell_p$ (with $p \in [1, 2)$) max-margin classifiers. Though their proof techniques also rely on the GMT, our approach is drastically different: we use the GMT to show uniform convergence for a class of predictors and establish a non-asymptotic bound, whereas their results are asymptotic and assume a proportional scaling limit. This is a key distinction, because overfitting usually cannot be benign with proportional scaling (e.g., Donhauser et al. 2022, Proposition 1). Similar lower bounds have also been shown in the context of linear regression (Muthukumar et al. 2020; G. Wang et al. 2021; Zhou et al. 2020).
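To make the parallel between the square loss and the squared hinge loss concrete, consider the standard Moreau envelope $f_\lambda(u) = \min_v \{ f(v) + \tfrac{1}{2\lambda}(u - v)^2 \}$ with smoothing parameter $\lambda > 0$ (this particular normalization is used only for illustration here). For the square loss $f(u) = (u - y)^2$, the inner minimizer is $v = (2\lambda y + u)/(2\lambda + 1)$, giving
$$
f_\lambda(u) \;=\; \frac{(u - y)^2}{1 + 2\lambda},
$$
and for the squared hinge loss $f(u) = \max(0, 1 - yu)^2$ with $y \in \{\pm 1\}$, the same computation gives
$$
f_\lambda(u) \;=\; \frac{\max(0, 1 - yu)^2}{1 + 2\lambda}.
$$
In both cases smoothing simply rescales the loss by $1/(1 + 2\lambda)$, which is the sense in which the two losses undergo the same transformation.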
Some concurrent works have obtained consistency results for max-margin classification in the spiked
covariance setting. In particular, the May 2022 version of the work by Shamir (2022) also studies
convergence to the minimum of the squared hinge loss, and obtains consistency under conditions
similar to the benign covariance condition of Bartlett et al. (2020). During preparation of this
manuscript we learned of concurrent work by Montanari et al., not yet publicly available, which
also studies consistency results for classification. Comparing our Corollary 3 to Shamir (2022), their result applies to some non-Gaussian settings, but in the Gaussian setting their result is not as general as ours. (Combining Assumptions 1 and 2 of Theorem 7 there, they require the norm of the data to be bounded, whereas our Corollary 3 applies even if $o(n)$ eigenvalues of $\Sigma$ grow arbitrarily quickly with $n$.) More conceptually, our result follows from a norm-based generalization bound that applies to all predictors and holds outside of the “benign overfitting” conditions, generalizing the result of Koehler et al. (2021), in contrast to the analyses in prior work.
3 Problem Formulation
GLM setting. Given a continuous loss function $f : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ and i.i.d. sample pairs $(x_i, y_i)$ from some data distribution $\mathcal{D}$, we can learn a linear model $(\hat{w}, \hat{b})$ by minimizing the empirical loss $\hat{L}_f$ with the goal of achieving small population loss $L_f$:
$$
\hat{L}_f(w, b) = \frac{1}{n} \sum_{i=1}^{n} f(\langle w, x_i \rangle + b, \, y_i), \qquad L_f(w, b) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\, f(\langle w, x \rangle + b, \, y). \tag{1}
$$
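As illustrative instances of (1) (the general setup does not restrict $f$ to these choices), taking
$$
f(\hat{y}, y) = (\hat{y} - y)^2 \qquad \text{or} \qquad f(\hat{y}, y) = \max(0, 1 - y\hat{y})^2
$$
makes $\hat{L}_f$ the usual least-squares objective or the squared hinge (soft-margin) objective, respectively; these are the square loss and squared hinge loss referred to in the related work discussion above.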
Multi-index model. We assume that the data distribution $\mathcal{D}$ over $(x, y)$ is such that