A Non-Asymptotic Moreau Envelope Theory for
High-Dimensional Generalized Linear Models
Lijia Zhou
University of Chicago
zlj@uchicago.edu
Frederic Koehler
Stanford University
fkoehler@stanford.edu
Pragya Sur
Harvard University
pragya@fas.harvard.edu
Danica J. Sutherland
University of British Columbia & Amii
dsuth@cs.ubc.ca
Nathan Srebro
Toyota Technological Institute at Chicago
nati@ttic.edu
Collaboration on the Theoretical Foundations of Deep Learning (deepfoundations.ai)
These authors contributed equally.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.12082v1 [stat.ML] 21 Oct 2022
Abstract
We prove a new generalization bound that shows for any class of linear predictors in Gaussian space, the Rademacher complexity of the class and the training error under any continuous loss $\ell$ can control the test error under all Moreau envelopes of the loss $\ell$. We use our finite-sample bound to directly recover the “optimistic rate” of Zhou et al. (2021) for linear regression with the square loss, which is known to be tight for minimal-$\ell_2$-norm interpolation, but we also handle more general settings where the label is generated by a potentially misspecified multi-index model. The same argument can analyze noisy interpolation of max-margin classifiers through the squared hinge loss, and establishes consistency results in spiked-covariance settings. More generally, when the loss is only assumed to be Lipschitz, our bound effectively improves Talagrand’s well-known contraction lemma by a factor of two, and we prove uniform convergence of interpolators (Koehler et al. 2021) for all smooth, non-negative losses. Finally, we show that application of our generalization bound using localized Gaussian width will generally be sharp for empirical risk minimizers, establishing a non-asymptotic Moreau envelope theory for generalization that applies outside of proportional scaling regimes, handles model misspecification, and complements existing asymptotic Moreau envelope theories for M-estimation.
1 Introduction
Modern machine learning models often contain more parameters than the number of training samples. Despite the capacity to overfit, training these models without explicit regularization has been
empirically shown to achieve good generalization performance (Neyshabur et al. 2015; C. Zhang
et al. 2017; Belkin et al. 2019). On the theoretical side, the study of minimal-norm interpolation has
revealed fascinating phenomena that challenge traditional understandings of machine learning.
We now have a better understanding of how properties of the data distribution and algorithmic bias
can affect generalization in high-dimensional linear regression. For example, data with a spiked
covariance structure can ensure that the test error of ridge regression will be approximately constant
once the regularization strength is small enough for the model to fit the signal (Zhou et al. 2021;
Tsigler and Bartlett 2020), contradicting the classical U-shaped curve expected from arguments about
the bias-variance tradeoff. Surprisingly, even when the signal is sparse, the risk of the minimal-$\ell_1$-norm interpolator can be shown to converge much more slowly to the Bayes risk than the minimal-$\ell_2$-norm interpolator in the junk feature setting (Chatterji and Long 2021; Koehler et al. 2021). In contrast, the minimal-$\ell_2$-norm interpolator fails to achieve consistency in the isotropic setting, while the minimal-$\ell_1$-norm interpolator is consistent under sparsity but suffers from an exponentially slow rate in the number of parameters $d$ (G. Wang et al. 2021; Muthukumar et al. 2020). However, we can still achieve the minimax rate with minimal-$\ell_p$-norm interpolators for $p$ extremely close to 1 (Donhauser et al. 2022).
In fact, many of the intriguing phenomena from the work above may be understood using the norm
of a predictor; localized notions of uniform convergence have emerged as essential tools for doing
so. Compared to other techniques, uniform convergence analyses can have the benefit of requiring
neither particular proportional scaling regimes nor closed-form expressions for the learned model,
since only an approximate estimate of its complexity is needed. Despite uniform convergence’s
potential for wider applicability, though, work in this area has mostly focused on linear regression
settings with strong assumptions: that the conditional expectation of the label is linear with respect
to the features, and that the residual has constant variance. In contrast, classical agnostic learning
guarantees established by uniform convergence usually need only much weaker assumptions on the
data distribution, and apply to a broader range of losses and function classes. For example, Srebro et al.
(2010) show that bounds with an “optimistic rate” hold generally for any smooth, nonnegative loss,
though the hidden logarithmic factor in their result is too loose for explaining noisy interpolation; this
was recently addressed by Zhou et al. (2021) in the special case of well-specified linear regression.
In this work, we take a step further towards agnostic interpolation learning and consider a high-dimensional generalized linear model (GLM) setting where the label is generated by a potentially misspecified model. We show a new generalization bound that allows us to use the Moreau envelopes of any continuous loss function as an intermediate tool. By optimizing over the smoothing parameter to balance the approximation and generalization errors, our general Moreau envelope theory yields sharp non-asymptotic generalization bounds in a wide variety of settings. Applied to linear regression with the square loss, we recover the optimistic rate of Zhou et al. (2021) and show that it extends to handle model misspecification, such as nonlinear trends and heteroskedasticity. The generality of our result comes from the fact that taking the Moreau envelope of the square loss only rescales it by a constant; this property alone suffices to obtain a generalization guarantee in terms of the original square loss. The squared hinge loss enjoys the same property, and hence a completely analogous argument shows an optimistic rate in that setting. Combined with an analysis of the margin, we show a novel consistency result for max-margin classifiers.
More generally, we apply the Moreau envelope theory to obtain a generic bound for any Lipschitz
loss and smooth, nonnegative loss with sharp constants. Looking specifically at the test error of an
Empirical Risk Minimizer (ERM), we show our generalization bound with localized Gaussian width
will be asymptotically sharp even when overfitting is not necessarily benign, yielding a version of the
asymptotic Moreau envelope framework for analyzing M-estimators (El Karoui et al. 2013; Bean
et al. 2013; Donoho and Montanari 2016; Thrampoulidis et al. 2018; Sur and Candès 2019) but for
the problem of generalization. Numerical simulations on a variety of feature distributions and label
generating processes confirm the wide applicability of our theory.
2 Related Work
The Moreau envelope has been useful in characterizing asymptotic properties of M-estimators in
linear models (Bean et al. 2013; El Karoui et al. 2013; Donoho and Montanari 2016; El Karoui 2018;
Thrampoulidis et al. 2018) and logistic regression (Sur and Candès 2019; Sur et al. 2019; Candès
and Sur 2020; Salehi et al. 2019; Zhao et al. 2022). This theory focuses on estimation and inference
under proportional asymptotics, rather than generalization, and does not provide any non-asymptotic
results.
For linear regression, Bartlett et al. (2020) identify nearly-matching necessary and sufficient conditions for the consistency of minimal-$\ell_2$-norm interpolation; their subsequent work (Tsigler and Bartlett 2020) shows generalization bounds for overparametrized ridge regression. Following their work, Negrea et al. (2020) and Zhou et al. (2020) explore the role of uniform convergence, including showing that uniformly bounding the difference between training error and test error fails to explain interpolation learning. Zhou et al. (2020) argue, however, that uniform convergence of interpolators is sufficient to establish consistency in a toy example. Koehler et al. (2021) extend their result to arbitrary data covariance and norm, recovering the benign overfitting conditions of Bartlett et al. (2020) as well as proving novel consistency results for the minimal-$\ell_1$-norm interpolator. Based on this uniform convergence framework, G. Wang et al. (2021) establish tight bounds for the minimal-$\ell_1$-norm interpolator under a sparse signal with isotropic data. Earlier work (Ju et al. 2020; Chinot et al. 2020; Li and Wei 2021) also studied the minimal-$\ell_1$-norm interpolator, without showing consistency. Though the minimal-$\ell_1$-norm interpolator suffers from an exponentially slow rate, Donhauser et al. (2022) show the minimal-$\ell_p$-norm interpolator can achieve faster rates with $p$ close to 1. Zhou et al. (2021) show a risk-dependent (“localized”) bound that extends the uniform convergence of interpolators guarantee to predictors with arbitrary training loss, and use it to establish generalization for regularized estimators such as ridge and LASSO. Our Moreau envelope theory builds on the techniques developed in this line of work to apply uniform convergence in the interpolation regime.
In terms of requirements on the data distribution, Bartlett et al. (2020) and Tsigler and Bartlett
(2020) only require the feature vector to be sub-Gaussian, but assume a well-specified linear model
for the conditional distribution of the label. The uniform convergence-based works also assume a
well-specified linear model, but the assumptions are more restrictive in the sense that the marginal
distribution of the feature needs to be exactly Gaussian because their proof techniques rely on the
Gaussian Minimax Theorem (GMT). Our Moreau envelope theory’s application to linear regression
significantly relaxes the assumption on the label generating process, though it is still constrained by
the Gaussian data assumption. Shamir (2022) also studies model misspecification in linear regression,
but allows non-Gaussian features, and shows that benign overfitting does not necessarily occur in the
most general setting, even with a spiked-covariance structure (see his Example 1).
For linear classification, Muthukumar et al. (2021) analyze the $\ell_2$ max-margin classifier by connecting it to minimal-norm interpolation in regression. Similarly, our analysis in the classification case depends on the fact that the squared hinge loss goes through the same transformation as the square loss under smoothing by the Moreau envelope. Donhauser et al. (2022) prove generalization bounds for $\ell_p$ max-margin classifiers in the isotropic setting and do not consider the spiked-covariance case. Deng et al. (2021), Montanari et al. (2019), and Liang and Sur (2020) derive exact expressions for the asymptotic prediction risk of the $\ell_2$ and $\ell_p$ (with $p \in [1, 2)$) max-margin classifiers. Though their proof techniques also rely on the GMT, our approaches are drastically different. We use the GMT to show uniform convergence for a class of predictors and establish a non-asymptotic bound, whereas their results are asymptotic and assume a proportional scaling limit. This is a key distinction, because overfitting usually cannot be benign with proportional scaling (e.g. Donhauser et al. 2022, Proposition 1). Similar lower bounds have also been shown in the context of linear regression (Muthukumar et al. 2020; G. Wang et al. 2021; Zhou et al. 2020).
Some concurrent works have obtained consistency results for max-margin classification in the spiked
covariance setting. In particular, the May 2022 version of the work by Shamir (2022) also studies
convergence to the minimum of the squared hinge loss, and obtains consistency under conditions
similar to the benign covariance condition of Bartlett et al. (2020). During preparation of this
manuscript we learned of concurrent work by Montanari et al., not yet publicly available, which
also studies consistency results for classification. Comparing our Corollary 3 to Shamir (2022), their result applies to some non-Gaussian settings, but in the Gaussian setting their result is not as general as ours. (Combining Assumptions 1 and 2 of Theorem 7 there, they require the norm of the data to be bounded, whereas our Corollary 3 applies even if $o(n)$ eigenvalues of $\Sigma$ grow arbitrarily quickly with $n$.) More conceptually, our result follows from a norm-based generalization bound that applies to all predictors and outside of the “benign overfitting” conditions, generalizing the result of Koehler et al. (2021), unlike the analysis of prior work.
3 Problem Formulation
GLM setting. Given a continuous loss function $f: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ and i.i.d. sample pairs $(x_i, y_i)$ from some data distribution $\mathcal{D}$, we can learn a linear model $(\hat w, \hat b)$ by minimizing the empirical loss $\hat L_f$ with the goal of achieving small population loss $L_f$:
\[
\hat L_f(w, b) = \frac{1}{n} \sum_{i=1}^{n} f(\langle w, x_i \rangle + b, y_i), \qquad L_f(w, b) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\, f(\langle w, x \rangle + b, y). \tag{1}
\]
Multi-index model. We assume that the data distribution $\mathcal{D}$ over $(x, y)$ is such that

1. $x \sim \mathcal{N}(0, \Sigma)$ is a centered Gaussian with unknown covariance matrix $\Sigma$.
2. There are unknown weight vectors $w_1, \ldots, w_k \in \mathbb{R}^d$ such that the $\Sigma^{1/2} w_i$ are orthonormal, a function $g: \mathbb{R}^{k+1} \to \mathbb{R}$, and a random variable $\xi \sim \mathcal{D}_\xi$ independent of $x$ (not necessarily Gaussian) such that
\[
\eta_i = \langle w_i, x \rangle, \qquad y = g(\eta_1, \ldots, \eta_k, \xi). \tag{2}
\]
We can assume that the distribution of $x$ is centered without loss of generality, since the presence of a mean term simply corresponds to changing the bias term $b$: $\langle w, x \rangle + b = \langle w, x - \mu \rangle + (b + \langle w, \mu \rangle)$. We can also assume that $\Sigma^{1/2} w_1, \ldots, \Sigma^{1/2} w_k$ are orthonormal without loss of generality, since we have not imposed any assumption on the link function $g$. The multi-index model includes well-specified linear regression, by setting $k = 1$ and $g(\eta, \xi) = \eta + \xi$. It also allows nonlinear trends and heteroskedasticity (such as the model in Figure 1) by changing the definition of $g$. Since $g$ need not be continuous, the label $y$ can be binary, as in linear classification.
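As a concrete illustration of (2), the sketch below samples from one hypothetical instance with $k = 1$: the spiked covariance, the noise distribution, and the nonlinear, heteroskedastic link $g$ are our own choices and are not prescribed by the model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 500, 200

# Hypothetical spiked covariance: a few large eigenvalues plus many unit "junk" directions.
eigs = np.concatenate([np.full(5, 50.0), np.ones(d - 5)])

# One index direction w_1 (k = 1), normalized so that Sigma^{1/2} w_1 has unit norm.
w1 = rng.normal(size=d)
w1 /= np.linalg.norm(np.sqrt(eigs) * w1)

def sample(m, rng):
    x = rng.normal(size=(m, d)) * np.sqrt(eigs)   # x ~ N(0, Sigma) with Sigma = diag(eigs)
    eta = x @ w1                                  # eta_1 = <w_1, x>
    xi = rng.normal(size=m)                       # xi ~ D_xi, independent of x
    # Hypothetical link g: nonlinear trend plus heteroskedastic noise.
    y = eta + 0.3 * np.sin(eta) + 0.5 * (1.0 + np.abs(eta)) * xi
    return x, y

X, y = sample(n, rng)
```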
4 Moreau Envelope Generalization Theory
Our theory vitally depends on the Moreau envelope, defined as follows.
Definition 1. The Moreau envelope of $f: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ with parameter $\lambda \in \mathbb{R}^+$ is defined as the function $f_\lambda: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ given by
\[
f_\lambda(\hat y, y) = \inf_{u}\, f(u, y) + \lambda (u - \hat y)^2. \tag{3}
\]

The Moreau envelope can be viewed as a smooth approximation to the original function $f$: in our parameterization, smaller $\lambda$ corresponds to more smoothing. The map that outputs the minimizer $u$, known as the proximal operator, plays an important role in convex analysis (Parikh and Boyd 2014; Bauschke, Combettes, et al. 2011).
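As a quick sanity check on Definition 1 (our own illustration, not code from the paper), the envelope can be evaluated by one-dimensional minimization. For the square loss and the squared hinge loss it reduces to the $\frac{\lambda}{1+\lambda}$ rescaling used later on.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def moreau_envelope(f, yhat, y, lam):
    """f_lambda(yhat, y) = inf_u f(u, y) + lam * (u - yhat)^2, found numerically."""
    res = minimize_scalar(lambda u: f(u, y) + lam * (u - yhat) ** 2)
    return res.fun

square = lambda u, y: (u - y) ** 2
sq_hinge = lambda u, y: max(0.0, 1.0 - y * u) ** 2

lam, yhat = 3.0, 0.2
# Square loss: the envelope equals (lam / (1 + lam)) * (yhat - y)^2.
print(moreau_envelope(square, yhat, 1.5, lam),
      lam / (1 + lam) * (yhat - 1.5) ** 2)
# Squared hinge (y in {-1, +1}): the same lam / (1 + lam) rescaling.
print(moreau_envelope(sq_hinge, yhat, 1.0, lam),
      lam / (1 + lam) * max(0.0, 1.0 - 1.0 * yhat) ** 2)
```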
Our general theory, as stated in Theorem 1 below, essentially upper bounds the generalization gap between the population Moreau envelope $L_{f_\lambda}$ and the original training loss $\hat L_f$ by the sum of two parts: a parametric component that can be controlled by the dimension $k$ of the “meaningful” part of $x$, and a non-parametric component that can be controlled by a dimension-free complexity measure such as the Euclidean norm of the predictor. Typically, the first term is negligible since $k$ is small, and the complexity of fitting all the noise is absorbed into the second term. More precisely, we introduce the following definitions to formalize separating out a low-dimensional component:
Definition 2. Under the model assumptions (2), define a (possibly oblique) projection matrix $Q$ onto the space orthogonal to $w_1, \ldots, w_k$ and a mapping $\phi$ from $\mathbb{R}^d$ to $\mathbb{R}^{k+1}$ by
\[
Q = I_d - \sum_{i=1}^{k} w_i w_i^\top \Sigma, \qquad \phi(w) = \big(\langle w, \Sigma w_1 \rangle, \ldots, \langle w, \Sigma w_k \rangle, \|\Sigma^{1/2} Q w\|_2\big)^\top. \tag{4}
\]
We let $\Sigma_\perp = Q^\top \Sigma Q$ denote the covariance matrix of $Q^\top x$. We also define a low-dimensional surrogate distribution $\tilde{\mathcal{D}}$ over $\mathbb{R}^{k+1} \times \mathbb{R}$ by
\[
\tilde x \sim \mathcal{N}(0, I_{k+1}), \quad \tilde \xi \sim \mathcal{D}_\xi, \quad \text{and} \quad \tilde y = g(\tilde x_1, \ldots, \tilde x_k, \tilde \xi). \tag{5}
\]

This surrogate distribution compresses the “meaningful part” of $x$ while maintaining the test loss, as shown by our main result, Theorem 1 (proved in Appendix D). Note that, as a non-asymptotic statement, the functions $\epsilon_{\lambda,\delta}$ and $C_\delta$ only need to hold for a specific choice of $n$ and $\mathcal{D}$.
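The objects in Definition 2 are straightforward to compute; the following sketch (our own illustration, with an arbitrary hypothetical $\Sigma$ and index directions) builds $Q$, $\phi(w)$, and a sampler for the surrogate distribution $\tilde{\mathcal{D}}$ for a user-supplied link $g$.

```python
import numpy as np

def build_projection(Sigma, W):
    """Q = I_d - sum_i w_i w_i^T Sigma, for columns w_i of W with Sigma^{1/2} w_i orthonormal."""
    d = Sigma.shape[0]
    return np.eye(d) - W @ W.T @ Sigma

def phi(w, Sigma, W, Q):
    """phi(w) = (<w, Sigma w_1>, ..., <w, Sigma w_k>, ||Sigma^{1/2} Q w||_2)."""
    L = np.linalg.cholesky(Sigma)          # L L^T = Sigma, so ||L^T v||_2 = ||Sigma^{1/2} v||_2
    head = W.T @ (Sigma @ w)
    tail = np.linalg.norm(L.T @ (Q @ w))
    return np.concatenate([head, [tail]])

def sample_surrogate(n, k, g, rng):
    """Surrogate D~: x~ ~ N(0, I_{k+1}), xi~ ~ D_xi, y~ = g(x~_1, ..., x~_k, xi~)."""
    x_t = rng.normal(size=(n, k + 1))
    xi_t = rng.normal(size=n)              # hypothetical choice of D_xi
    y_t = g(x_t[:, :k], xi_t)              # g consumes the first k surrogate coordinates and xi
    return x_t, y_t

# Example usage with k = 1 and a hypothetical diagonal Sigma.
rng = np.random.default_rng(0)
d, k = 10, 1
Sigma = np.diag(np.linspace(1.0, 5.0, d))
W = rng.normal(size=(d, k))
W /= np.linalg.norm(np.linalg.cholesky(Sigma).T @ W, axis=0)  # make Sigma^{1/2} w_1 unit norm
Q = build_projection(Sigma, W)
print(phi(rng.normal(size=d), Sigma, W, Q))
```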
Theorem 1. Suppose $\lambda \in \mathbb{R}^+$ satisfies that for any $\delta \in (0, 1)$, there exists a continuous function $\epsilon_{\lambda,\delta}: \mathbb{R}^{k+1} \times \mathbb{R} \to \mathbb{R}$ such that with probability at least $1 - \delta/4$ over independent draws $(\tilde x_i, \tilde y_i)$ from the surrogate distribution $\tilde{\mathcal{D}}$ defined in (5), we have uniformly over all $(\tilde w, \tilde b) \in \mathbb{R}^{k+2}$ that
\[
\frac{1}{n} \sum_{i=1}^{n} f_\lambda(\langle \tilde w, \tilde x_i \rangle + \tilde b, \tilde y_i) \ge \mathbb{E}_{(\tilde x, \tilde y) \sim \tilde{\mathcal{D}}}\big[f_\lambda(\langle \tilde w, \tilde x \rangle + \tilde b, \tilde y)\big] - \epsilon_{\lambda,\delta}(\tilde w, \tilde b). \tag{6}
\]
Further, assume that for any $\delta \in (0, 1)$, there exists a continuous function $C_\delta: \mathbb{R}^d \to [0, \infty]$ such that with probability at least $1 - \delta/4$ over $x \sim \mathcal{N}(0, \Sigma)$, uniformly over all $w \in \mathbb{R}^d$,
\[
\langle Q w, x \rangle \le C_\delta(w). \tag{7}
\]
Then it holds with probability at least $1 - \delta$ that uniformly over all $(w, b) \in \mathbb{R}^{d+1}$, we have
\[
L_{f_\lambda}(w, b) \le \hat L_f(w, b) + \epsilon_{\lambda,\delta}(\phi(w), b) + \frac{\lambda\, C_\delta(w)^2}{n}. \tag{8}
\]
If we additionally assume that (6) holds uniformly for all $\lambda \in \mathbb{R}^+$, then (8) does as well.
As we will see, we can generally bound the difference between $L_{f_\lambda}$ and $L_f$ when the loss is assumed to be Lipschitz. If $f$ is not Lipschitz but smooth (i.e. $f'$ is Lipschitz, as for the squared loss), we can always write it as the Moreau envelope of another function $\tilde f$. In the special case of the square loss or the squared hinge loss, the Moreau envelope $f_\lambda$ is proportional to $f$, meaning that (8) becomes a generalization guarantee in terms of $L_f$. Optimizing over $\lambda$ will establish optimal bounds that recover the results of Koehler et al. (2021) and Zhou et al. (2021), and lead to other novel results.
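To make the last point concrete, here is a sketch of the square-loss calculation (our rendering, writing $\epsilon$ for $\epsilon_{\lambda,\delta}(\phi(w), b)$ and treating it as a bound available uniformly in $\lambda$; the precise statements appear in later sections). For $f(\hat y, y) = (\hat y - y)^2$, the infimum in (3) is attained at $u = (y + \lambda \hat y)/(1 + \lambda)$, giving
\[
f_\lambda(\hat y, y) = \frac{\lambda}{1+\lambda}(\hat y - y)^2, \qquad \text{so} \qquad L_{f_\lambda} = \frac{\lambda}{1+\lambda} L_f.
\]
Substituting into (8) and rearranging,
\[
L_f(w, b) \le \Big(1 + \tfrac{1}{\lambda}\Big)\big(\hat L_f(w, b) + \epsilon\big) + (1 + \lambda)\,\frac{C_\delta(w)^2}{n},
\]
and choosing $\lambda = \sqrt{n(\hat L_f(w, b) + \epsilon)}\,/\,C_\delta(w)$ yields the optimistic-rate form
\[
L_f(w, b) \le \left(\sqrt{\hat L_f(w, b) + \epsilon} + \frac{C_\delta(w)}{\sqrt{n}}\right)^2.
\]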
Remark 1. The complexity functional $C_\delta(w)$ should be thought of as a localized, high-probability version of Rademacher complexity. This is because the Gaussian width of a convex set $\mathcal{K}$, $\mathbb{E} \sup_{w \in \mathcal{K}} \langle w, x \rangle$, is the same as the Rademacher complexity of the class of linear functions $\{x \mapsto \langle w, x \rangle : w \in \mathcal{K}\}$ (Zhou et al. 2021, Proposition 1). A somewhat similar complexity functional appears in Panchenko (2003). Also, note that (6) requires only one-sided concentration — see Remark 3.
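As one concrete (hypothetical) instantiation of (7): for a Euclidean-norm constraint, Cauchy–Schwarz gives $\langle Qw, x \rangle \le \|w\|_2 \|Q^\top x\|_2$, and $\|Q^\top x\|_2$ concentrates around $\sqrt{\operatorname{tr}(\Sigma_\perp)}$, so a norm-proportional $C_\delta$ is available. The sketch below checks this numerically; the spiked covariance is our own choice, and this is only one possible choice of complexity functional.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 400
eigs = np.concatenate([np.full(5, 50.0), np.ones(d - 5)])   # hypothetical spiked Sigma
Sigma = np.diag(eigs)

# One index direction (k = 1), normalized so Sigma^{1/2} w_1 has unit norm.
w1 = rng.normal(size=d)
w1 /= np.linalg.norm(np.sqrt(eigs) * w1)
Q = np.eye(d) - np.outer(w1, w1) @ Sigma
Sigma_perp = Q.T @ Sigma @ Q

# sup over the unit ball: sup_{||w||_2 <= 1} <Qw, x> = ||Q^T x||_2.
X = rng.normal(size=(5_000, d)) * np.sqrt(eigs)             # rows are draws of x ~ N(0, Sigma)
sup_vals = np.linalg.norm(X @ Q, axis=1)                    # ||Q^T x||_2 for each sample
print(np.sqrt(np.trace(Sigma_perp)), np.quantile(sup_vals, [0.5, 0.99]))
```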
4.1 VC Theory for Low-dimensional Concentration
To apply our generalization result (Theorem 1), we should check the low-dimensional concentration assumption (6). The quantitative bounds in the low-dimensional concentration (i.e. the precise form of the error term $\epsilon_{\lambda,\delta}$) will inevitably depend on the exact setting we consider (see e.g. Vapnik 1982; Koltchinskii and Mendelson 2015; Lugosi and Mendelson 2019 for discussion).
First, we recall the following result from VC theory.
Theorem 2 (Special case of Assertion 4 of Vapnik (1982), Chapter 7.8; see also Theorem 7.6). Let $\mathcal{K} \subset \mathbb{R}^d$ and $\mathcal{B} \subseteq \mathbb{R}$. Suppose that a distribution $\mathcal{D}$ over $(x, y) \in \mathbb{R}^d \times \mathbb{R}$ satisfies that for some $\tau > 0$, it holds uniformly over all $(w, b) \in \mathcal{K} \times \mathcal{B}$ that
\[
\mathbb{E}\big[f(\langle w, x \rangle + b, y)^4\big]^{1/4} \le \tau\, \mathbb{E}\, f(\langle w, x \rangle + b, y). \tag{9}
\]
Also suppose the class of functions $\{(x, y) \mapsto \mathbb{1}\{f(\langle w, x \rangle + b, y) > t\} : w \in \mathcal{K}, b \in \mathcal{B}, t \in \mathbb{R}\}$ has VC-dimension at most $h$. Then for any $n > h$, with probability at least $1 - \delta$ over the choice of $((x_1, y_1), \ldots, (x_n, y_n)) \sim \mathcal{D}^n$, it holds uniformly over all $w \in \mathcal{K}, b \in \mathcal{B}$ that
\[
\frac{1}{n} \sum_{i=1}^{n} f(\langle w, x_i \rangle + b, y_i) \ge \left(1 - 18\tau \sqrt{\frac{h(\log(2n/h) + 1) + \log(12/\delta)}{n}}\right) \mathbb{E}\, f(\langle w, x \rangle + b, y).
\]
The assumption (9) is standard (indeed, this is the setting primarily focused on in Vapnik 1982) and is sometimes referred to as hypercontractivity or norm equivalence in the literature; a variant of the result holds with 4 replaced by $1 + \varepsilon$. In many settings of interest, this can be directly checked using the fact that $x$ is Gaussian (for instance, see Theorem 9 and Appendix E.3). Of course, our general result can be applied without this assumption, by using low-dimensional concentration under an alternative assumption: Vapnik (1982), Panchenko (2002), Panchenko (2003), and Mendelson (2017) have further discussion and alternative results; in particular, Assertion 3 of Vapnik (1982, Chapter 7.8) gives a bound based on a fourth-moment assumption, and Panchenko (2003, Theorem 3) gives one based on a version of Rademacher complexity.
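For intuition about how (9) can be checked under Gaussian data, here is a small Monte Carlo sketch (our own illustration for the square loss in a well-specified surrogate model with Gaussian noise; it is not the general verification used in the appendix). The residual of a linear predictor is then Gaussian, possibly mean-shifted by the bias term, and for a centered Gaussian residual $z$ the ratio is $\mathbb{E}[z^8]^{1/4}/\mathbb{E}[z^2] = 105^{1/4} \approx 3.2$.

```python
import numpy as np

rng = np.random.default_rng(0)

def norm_equiv_ratio(z):
    """E[f^4]^{1/4} / E[f] for the square loss f = z^2, estimated from samples of the residual z."""
    f = z ** 2
    return np.mean(f ** 4) ** 0.25 / np.mean(f)

# Check a few (mean, std) choices for a Gaussian residual z = mu + sigma * g.
g = rng.normal(size=1_000_000)
for mu, sigma in [(0.0, 1.0), (1.0, 1.0), (3.0, 0.5)]:
    print(mu, sigma, norm_equiv_ratio(mu + sigma * g))
# The mu = 0 case approaches 105**0.25 ~ 3.20; the shifted cases are smaller in these examples,
# so a constant tau around 3.2 suffices for this particular well-specified setting.
```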
Combining Theorems 1 and 2 yields the following.
Corollary 1. Under the model assumptions (2), suppose that $C_\delta$ satisfies condition (7). Also suppose that for some fixed $\lambda \ge 0$, $\mathcal{K} \subseteq \mathbb{R}^d$, and $\mathcal{B} \subseteq \mathbb{R}$, the surrogate distribution $\tilde{\mathcal{D}}$ satisfies assumption (9) under $f_\lambda$ uniformly over $\phi(\mathcal{K}) \times \mathcal{B}$, and that the class $\{(x, y) \mapsto \mathbb{1}\{f_\lambda(\langle \tilde w, \phi(x) \rangle + \tilde b, y) > t\} : \tilde w \in \phi(\mathcal{K}), \tilde b \in \mathcal{B}, t \in \mathbb{R}\}$ has VC-dimension at most $h$. Then with probability at least $1 - \delta$, uniformly over all $(w, b) \in \mathcal{K} \times \mathcal{B}$,
\[
\left(1 - 18\tau \sqrt{\frac{h(\log(2n/h) + 1) + \log(48/\delta)}{n}}\right) L_{f_\lambda}(w, b) \le \hat L_f(w, b) + \frac{\lambda\, C_\delta(w)^2}{n}.
\]