A Non-Asymptotic Moreau Envelope Theory for
High-Dimensional Generalized Linear Models
Lijia Zhou
University of Chicago
zlj@uchicago.edu
Frederic Koehler
Stanford University
fkoehler@stanford.edu
Pragya Sur
Harvard University
pragya@fas.harvard.edu
Danica J. Sutherland
University of British Columbia & Amii
dsuth@cs.ubc.ca
Nathan Srebro
Toyota Technological Institute at Chicago
nati@ttic.edu
Collaboration on the Theoretical Foundations of Deep Learning (deepfoundations.ai)
These authors contributed equally.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.12082v1 [stat.ML] 21 Oct 2022
Abstract
We prove a new generalization bound that shows for any class of linear predictors in Gaussian space, the Rademacher complexity of the class and the training error under any continuous loss $\ell$ can control the test error under all Moreau envelopes of the loss $\ell$. We use our finite-sample bound to directly recover the “optimistic rate” of Zhou et al. (2021) for linear regression with the square loss, which is known to be tight for minimal-$\ell_2$-norm interpolation, but we also handle more general settings where the label is generated by a potentially misspecified multi-index model. The same argument can analyze noisy interpolation of max-margin classifiers through the squared hinge loss, and establishes consistency results in spiked-covariance settings. More generally, when the loss is only assumed to be Lipschitz, our bound effectively improves Talagrand’s well-known contraction lemma by a factor of two, and we prove uniform convergence of interpolators (Koehler et al. 2021) for all smooth, non-negative losses. Finally, we show that application of our generalization bound using localized Gaussian width will generally be sharp for empirical risk minimizers, establishing a non-asymptotic Moreau envelope theory for generalization that applies outside of proportional scaling regimes, handles model misspecification, and complements existing asymptotic Moreau envelope theories for M-estimation.
1 Introduction
Modern machine learning models often contain more parameters than the number of training samples. Despite the capacity to overfit, training these models without explicit regularization has been
empirically shown to achieve good generalization performance (Neyshabur et al. 2015; C. Zhang
et al. 2017; Belkin et al. 2019). On the theoretical side, the study of minimal-norm interpolation has
revealed fascinating phenomena that challenge traditional understandings of machine learning.
We now have a better understanding of how properties of the data distribution and algorithmic bias
can affect generalization in high-dimensional linear regression. For example, data with a spiked
covariance structure can ensure that the test error of ridge regression will be approximately constant
once the regularization strength is small enough for the model to fit the signal (Zhou et al. 2021;
Tsigler and Bartlett 2020), contradicting the classical U-shaped curve expected from arguments about
the bias-variance tradeoff. Surprisingly, even when the signal is sparse, the risk of the minimal-$\ell_1$-norm interpolator can be shown to converge much more slowly to the Bayes risk than the minimal-$\ell_2$-norm interpolator in the junk feature setting (Chatterji and Long 2021; Koehler et al. 2021). In contrast, the minimal-$\ell_2$-norm interpolator fails to achieve consistency in the isotropic setting, while the minimal-$\ell_1$-norm interpolator is consistent under sparsity but suffers from an exponentially slow rate in the number of parameters $d$ (G. Wang et al. 2021; Muthukumar et al. 2020). However, we can still achieve the minimax rate with minimal-$\ell_p$-norm interpolators for $p$ extremely close to 1 (Donhauser et al. 2022).
In fact, many of the intriguing phenomena from the work above may be understood using the norm
of a predictor; localized notions of uniform convergence have emerged as essential tools for doing
so. Compared to other techniques, uniform convergence analyses can have the benefit of requiring
neither particular proportional scaling regimes nor closed-form expressions for the learned model,
since only an approximate estimate of its complexity is needed. Despite uniform convergence’s
potential for wider applicability, though, work in this area has mostly focused on linear regression
settings with strong assumptions: that the conditional expectation of the label is linear with respect
to the features, and that the residual has constant variance. In contrast, classical agnostic learning
guarantees established by uniform convergence usually need only much weaker assumptions on the
data distribution, and apply to a broader range of losses and function classes. For example, Srebro et al.
(2010) show that bounds with an “optimistic rate” hold generally for any smooth, nonnegative loss,
though the hidden logarithmic factor in their result is too loose for explaining noisy interpolation; this
was recently addressed by Zhou et al. (2021) in the special case of well-specified linear regression.
In this work, we take a step further towards agnostic interpolation learning and consider a high-dimensional generalized linear model (GLM) setting where the label is generated by a potentially misspecified model. We show a new generalization bound that allows us to use the Moreau envelopes of any continuous loss function as an intermediate tool. By optimizing over the smoothing parameter to balance the approximation and generalization errors, our general Moreau envelope theory yields sharp non-asymptotic generalization bounds in a wide variety of settings. Applied to linear regression with the square loss, we recover the optimistic rate of Zhou et al. (2021) and show that it extends to handle model misspecification, such as nonlinear trends and heteroskedasticity. The generality of our result comes from the fact that taking the Moreau envelope of the square loss only rescales it by a constant; this property alone suffices to obtain a generalization guarantee in terms of the original square loss. The squared hinge loss enjoys the same property, and hence a completely analogous argument shows an optimistic rate in that setting. Combined with an analysis of the margin, we show a novel consistency result for max-margin classifiers.
More generally, we apply the Moreau envelope theory to obtain a generic bound for any Lipschitz
loss and smooth, nonnegative loss with sharp constants. Looking specifically at the test error of an
Empirical Risk Minimizer (ERM), we show our generalization bound with localized Gaussian width
will be asymptotically sharp even when overfitting is not necessarily benign, yielding a version of the
asymptotic Moreau envelope framework for analyzing M-estimators (El Karoui et al. 2013; Bean
et al. 2013; Donoho and Montanari 2016; Thrampoulidis et al. 2018; Sur and Candès 2019) but for
the problem of generalization. Numerical simulations on a variety of feature distributions and label
generating processes confirm the wide applicability of our theory.
2 Related Work
The Moreau envelope has been useful in characterizing asymptotic properties of M-estimators in
linear models (Bean et al. 2013; El Karoui et al. 2013; Donoho and Montanari 2016; El Karoui 2018;
Thrampoulidis et al. 2018) and logistic regression (Sur and Candès 2019; Sur et al. 2019; Candès
and Sur 2020; Salehi et al. 2019; Zhao et al. 2022). This theory focuses on estimation and inference
under proportional asymptotics, rather than generalization, and does not provide any non-asymptotic
results.
For linear regression, Bartlett et al. (2020) identify nearly-matching necessary and sufficient conditions for the consistency of minimal-$\ell_2$-norm interpolation; their subsequent work (Tsigler and Bartlett 2020) shows generalization bounds for overparametrized ridge regression. Following their work, Negrea et al. (2020) and Zhou et al. (2020) explore the role of uniform convergence, including showing that uniformly bounding the difference between training error and test error fails to explain interpolation learning. Zhou et al. (2020) argue, however, that uniform convergence of interpolators is sufficient to establish consistency in a toy example. Koehler et al. (2021) extend their result to arbitrary data covariance and norm, recovering the benign overfitting conditions of Bartlett et al. (2020) as well as proving novel consistency results for the minimal-$\ell_1$-norm interpolator. Based on this uniform convergence framework, G. Wang et al. (2021) establish tight bounds for the minimal-$\ell_1$-norm interpolator under a sparse signal with isotropic data. Earlier work (Ju et al. 2020; Chinot et al. 2020; Li and Wei 2021) also studied the minimal-$\ell_1$-norm interpolator, without showing consistency. Though the minimal-$\ell_1$-norm interpolator suffers from an exponentially slow rate, Donhauser et al. (2022) show the minimal-$\ell_p$-norm interpolator can achieve faster rates with $p$ close to 1. Zhou et al. (2021) show a risk-dependent (“localized”) bound that extends the uniform convergence of interpolators guarantee to predictors with arbitrary training loss, and use it to establish generalization for regularized estimators such as ridge and LASSO. Our Moreau envelope theory builds on the techniques developed in this line of work to apply uniform convergence in the interpolation regime.
In terms of requirements on the data distribution, Bartlett et al. (2020) and Tsigler and Bartlett
(2020) only require the feature vector to be sub-Gaussian, but assume a well-specified linear model
for the conditional distribution of the label. The uniform convergence-based works also assume a
well-specified linear model, but the assumptions are more restrictive in the sense that the marginal
distribution of the feature needs to be exactly Gaussian because their proof techniques rely on the
Gaussian Minimax Theorem (GMT). Our Moreau envelope theory’s application to linear regression
significantly relaxes the assumption on the label generating process, though it is still constrained by
the Gaussian data assumption. Shamir (2022) also studies model misspecification in linear regression,
but allows non-Gaussian features, and shows that benign overfitting does not necessarily occur in the
most general setting, even with a spiked-covariance structure (see his Example 1).
For linear classification, Muthukumar et al. (2021) analyze the $\ell_2$ max-margin classifier by connecting it to minimal-norm interpolation in regression. Similarly, our analysis in the classification case depends on the fact that the squared hinge loss goes through the same transformation as the square loss under smoothing by the Moreau envelope. Donhauser et al. (2022) prove generalization bounds for $\ell_p$ max-margin classifiers in the isotropic setting and do not consider the spiked-covariance case. Deng et al. (2021), Montanari et al. (2019), and Liang and Sur (2020) derive exact expressions for the asymptotic prediction risk of the $\ell_2$ and $\ell_p$ (with $p \in [1, 2)$) max-margin classifiers. Though their proof techniques also rely on the GMT, our approaches are drastically different. We use the GMT to show uniform convergence for a class of predictors and establish a non-asymptotic bound, whereas their results are asymptotic and assume a proportional scaling limit. This is a key distinction, because overfitting usually cannot be benign with proportional scaling (e.g. Donhauser et al. 2022, Proposition 1). Similar lower bounds have also been shown in the context of linear regression (Muthukumar et al. 2020; G. Wang et al. 2021; Zhou et al. 2020).
Some concurrent works have obtained consistency results for max-margin classification in the spiked
covariance setting. In particular, the May 2022 version of the work by Shamir (2022) also studies
convergence to the minimum of the squared hinge loss, and obtains consistency under conditions
similar to the benign covariance condition of Bartlett et al. (2020). During preparation of this
manuscript we learned of concurrent work by Montanari et al., not yet publicly available, which
also studies consistency results for classification. Comparing our Corollary 3 to Shamir (2022), their result applies to some non-Gaussian settings, but in the Gaussian setting their result is not as general as ours. (Combining Assumptions 1 and 2 of Theorem 7 there, they require the norm of the data to be bounded, whereas our Corollary 3 applies even if $o(n)$ eigenvalues of $\Sigma$ grow arbitrarily quickly with $n$.) More conceptually, our result follows from a norm-based generalization bound that applies to all predictors and outside of the “benign overfitting” conditions, generalizing the result of Koehler et al. (2021), unlike the analysis of prior work.
3 Problem Formulation
GLM setting. Given a continuous loss function $f: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ and i.i.d. sample pairs $(x_i, y_i)$ from some data distribution $\mathcal{D}$, we can learn a linear model $(\hat w, \hat b)$ by minimizing the empirical loss $\hat L_f$ with the goal of achieving small population loss $L_f$:
\[
\hat L_f(w, b) = \frac{1}{n} \sum_{i=1}^{n} f(\langle w, x_i \rangle + b, y_i), \qquad L_f(w, b) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\, f(\langle w, x \rangle + b, y). \tag{1}
\]
Multi-index model. We assume that the data distribution $\mathcal{D}$ over $(x, y)$ is such that

1. $x \sim \mathcal{N}(0, \Sigma)$ is a centered Gaussian with unknown covariance matrix $\Sigma$.
2. There are unknown weight vectors $w_1, \ldots, w_k \in \mathbb{R}^d$ such that the $\Sigma^{1/2} w_i$ are orthonormal, a function $g: \mathbb{R}^{k+1} \to \mathbb{R}$, and a random variable $\xi \sim \mathcal{D}_\xi$ independent of $x$ (not necessarily Gaussian) such that
\[
\eta_i = \langle w_i, x \rangle, \qquad y = g(\eta_1, \ldots, \eta_k, \xi). \tag{2}
\]
We can assume that the distribution of $x$ is centered without loss of generality, since the presence of a mean term simply corresponds to changing the bias term $b$: $\langle w, x \rangle + b = \langle w, x - \mu \rangle + (b + \langle w, \mu \rangle)$. We can also assume that $\Sigma^{1/2} w_1, \ldots, \Sigma^{1/2} w_k$ are orthonormal without loss of generality, since we have not imposed any assumption on the link function $g$. The multi-index model includes well-specified linear regression, by setting $k = 1$ and $g(\eta, \xi) = \eta + \xi$. It also allows nonlinear trends and heteroskedasticity (such as the model in Figure 1) by changing the definition of $g$. Since $g$ need not be continuous, the label $y$ can be binary, as in linear classification.
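As a concrete illustration of (2), the sketch below samples from one hypothetical instance with $k = 1$: the spiked covariance, the noise distribution, and the nonlinear, heteroskedastic link $g$ are our own choices and are not prescribed by the model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 500, 200

# Hypothetical spiked covariance: a few large eigenvalues plus many unit "junk" directions.
eigs = np.concatenate([np.full(5, 50.0), np.ones(d - 5)])

# One index direction w_1 (k = 1), normalized so that Sigma^{1/2} w_1 has unit norm.
w1 = rng.normal(size=d)
w1 /= np.linalg.norm(np.sqrt(eigs) * w1)

def sample(m, rng):
    x = rng.normal(size=(m, d)) * np.sqrt(eigs)   # x ~ N(0, Sigma) with Sigma = diag(eigs)
    eta = x @ w1                                  # eta_1 = <w_1, x>
    xi = rng.normal(size=m)                       # xi ~ D_xi, independent of x
    # Hypothetical link g: nonlinear trend plus heteroskedastic noise.
    y = eta + 0.3 * np.sin(eta) + 0.5 * (1.0 + np.abs(eta)) * xi
    return x, y

X, y = sample(n, rng)
```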
4 Moreau Envelope Generalization Theory
Our theory vitally depends on the Moreau envelope, defined as follows.
Definition 1. The Moreau envelope of $f: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ with parameter $\lambda \in \mathbb{R}^+$ is defined as the function $f_\lambda: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ given by
\[
f_\lambda(\hat y, y) = \inf_{u}\, f(u, y) + \lambda (u - \hat y)^2. \tag{3}
\]

The Moreau envelope can be viewed as a smooth approximation to the original function $f$: in our parameterization, smaller $\lambda$ corresponds to more smoothing. The map that outputs the minimizer $u$, known as the proximal operator, plays an important role in convex analysis (Parikh and Boyd 2014; Bauschke, Combettes, et al. 2011).
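As a quick sanity check on Definition 1 (our own illustration, not code from the paper), the envelope can be evaluated by one-dimensional minimization. For the square loss and the squared hinge loss it reduces to the $\frac{\lambda}{1+\lambda}$ rescaling used later on.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def moreau_envelope(f, yhat, y, lam):
    """f_lambda(yhat, y) = inf_u f(u, y) + lam * (u - yhat)^2, found numerically."""
    res = minimize_scalar(lambda u: f(u, y) + lam * (u - yhat) ** 2)
    return res.fun

square = lambda u, y: (u - y) ** 2
sq_hinge = lambda u, y: max(0.0, 1.0 - y * u) ** 2

lam, yhat = 3.0, 0.2
# Square loss: the envelope equals (lam / (1 + lam)) * (yhat - y)^2.
print(moreau_envelope(square, yhat, 1.5, lam),
      lam / (1 + lam) * (yhat - 1.5) ** 2)
# Squared hinge (y in {-1, +1}): the same lam / (1 + lam) rescaling.
print(moreau_envelope(sq_hinge, yhat, 1.0, lam),
      lam / (1 + lam) * max(0.0, 1.0 - 1.0 * yhat) ** 2)
```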
Our general theory, as stated in Theorem 1 below, essentially upper bounds the generalization gap between the population Moreau envelope $L_{f_\lambda}$ and the original training loss $\hat L_f$ by the sum of two parts: a parametric component that can be controlled by the dimension $k$ of the “meaningful” part of $x$, and a non-parametric component that can be controlled by a dimension-free complexity measure such as the Euclidean norm of the predictor. Typically, the first term is negligible since $k$ is small, and the complexity of fitting all the noise is absorbed into the second term. More precisely, we introduce the following definitions to formalize separating out a low-dimensional component:
Definition 2. Under the model assumptions (2), define a (possibly oblique) projection matrix $Q$ onto the space orthogonal to $w_1, \ldots, w_k$ and a mapping $\phi$ from $\mathbb{R}^d$ to $\mathbb{R}^{k+1}$ by
\[
Q = I_d - \sum_{i=1}^{k} w_i w_i^\top \Sigma, \qquad \phi(w) = \big(\langle w, \Sigma w_1 \rangle, \ldots, \langle w, \Sigma w_k \rangle, \|\Sigma^{1/2} Q w\|_2\big)^\top. \tag{4}
\]
We let $\Sigma_\perp = Q^\top \Sigma Q$ denote the covariance matrix of $Q^\top x$. We also define a low-dimensional surrogate distribution $\tilde{\mathcal{D}}$ over $\mathbb{R}^{k+1} \times \mathbb{R}$ by
\[
\tilde x \sim \mathcal{N}(0, I_{k+1}), \quad \tilde \xi \sim \mathcal{D}_\xi, \quad \text{and} \quad \tilde y = g(\tilde x_1, \ldots, \tilde x_k, \tilde \xi). \tag{5}
\]

This surrogate distribution compresses the “meaningful part” of $x$ while maintaining the test loss, as shown by our main result, Theorem 1 (proved in Appendix D). Note that, as a non-asymptotic statement, the functions $\epsilon_{\lambda,\delta}$ and $C_\delta$ only need to hold for a specific choice of $n$ and $\mathcal{D}$.
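The objects in Definition 2 are straightforward to compute; the following sketch (our own illustration, with an arbitrary hypothetical $\Sigma$ and index directions) builds $Q$, $\phi(w)$, and a sampler for the surrogate distribution $\tilde{\mathcal{D}}$ for a user-supplied link $g$.

```python
import numpy as np

def build_projection(Sigma, W):
    """Q = I_d - sum_i w_i w_i^T Sigma, for columns w_i of W with Sigma^{1/2} w_i orthonormal."""
    d = Sigma.shape[0]
    return np.eye(d) - W @ W.T @ Sigma

def phi(w, Sigma, W, Q):
    """phi(w) = (<w, Sigma w_1>, ..., <w, Sigma w_k>, ||Sigma^{1/2} Q w||_2)."""
    L = np.linalg.cholesky(Sigma)          # L L^T = Sigma, so ||L^T v||_2 = ||Sigma^{1/2} v||_2
    head = W.T @ (Sigma @ w)
    tail = np.linalg.norm(L.T @ (Q @ w))
    return np.concatenate([head, [tail]])

def sample_surrogate(n, k, g, rng):
    """Surrogate D~: x~ ~ N(0, I_{k+1}), xi~ ~ D_xi, y~ = g(x~_1, ..., x~_k, xi~)."""
    x_t = rng.normal(size=(n, k + 1))
    xi_t = rng.normal(size=n)              # hypothetical choice of D_xi
    y_t = g(x_t[:, :k], xi_t)              # g consumes the first k surrogate coordinates and xi
    return x_t, y_t

# Example usage with k = 1 and a hypothetical diagonal Sigma.
rng = np.random.default_rng(0)
d, k = 10, 1
Sigma = np.diag(np.linspace(1.0, 5.0, d))
W = rng.normal(size=(d, k))
W /= np.linalg.norm(np.linalg.cholesky(Sigma).T @ W, axis=0)  # make Sigma^{1/2} w_1 unit norm
Q = build_projection(Sigma, W)
print(phi(rng.normal(size=d), Sigma, W, Q))
```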
Theorem 1. Suppose $\lambda \in \mathbb{R}^+$ satisfies that for any $\delta \in (0, 1)$, there exists a continuous function $\epsilon_{\lambda,\delta}: \mathbb{R}^{k+1} \times \mathbb{R} \to \mathbb{R}$ such that with probability at least $1 - \delta/4$ over independent draws $(\tilde x_i, \tilde y_i)$ from the surrogate distribution $\tilde{\mathcal{D}}$ defined in (5), we have uniformly over all $(\tilde w, \tilde b) \in \mathbb{R}^{k+2}$ that
\[
\frac{1}{n} \sum_{i=1}^{n} f_\lambda(\langle \tilde w, \tilde x_i \rangle + \tilde b, \tilde y_i) \ge \mathbb{E}_{(\tilde x, \tilde y) \sim \tilde{\mathcal{D}}}\big[f_\lambda(\langle \tilde w, \tilde x \rangle + \tilde b, \tilde y)\big] - \epsilon_{\lambda,\delta}(\tilde w, \tilde b). \tag{6}
\]
Further, assume that for any $\delta \in (0, 1)$, there exists a continuous function $C_\delta: \mathbb{R}^d \to [0, \infty]$ such that with probability at least $1 - \delta/4$ over $x \sim \mathcal{N}(0, \Sigma)$, uniformly over all $w \in \mathbb{R}^d$,
\[
\langle Q w, x \rangle \le C_\delta(w). \tag{7}
\]
Then it holds with probability at least $1 - \delta$ that uniformly over all $(w, b) \in \mathbb{R}^{d+1}$, we have
\[
L_{f_\lambda}(w, b) \le \hat L_f(w, b) + \epsilon_{\lambda,\delta}(\phi(w), b) + \frac{\lambda\, C_\delta(w)^2}{n}. \tag{8}
\]
If we additionally assume that (6) holds uniformly for all $\lambda \in \mathbb{R}^+$, then (8) does as well.
As we will see, we can generally bound the difference between $L_{f_\lambda}$ and $L_f$ when the loss is assumed to be Lipschitz. If $f$ is not Lipschitz but smooth (i.e. $f'$ is Lipschitz, as for the squared loss), we can always write it as the Moreau envelope of another function $\tilde f$. In the special case of the square loss or the squared hinge loss, the Moreau envelope $f_\lambda$ is proportional to $f$, meaning that (8) becomes a generalization guarantee in terms of $L_f$. Optimizing over $\lambda$ will establish optimal bounds that recover the results of Koehler et al. (2021) and Zhou et al. (2021), and lead to other novel results.
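To make the last point concrete, here is a sketch of the square-loss calculation (our rendering, writing $\epsilon$ for $\epsilon_{\lambda,\delta}(\phi(w), b)$ and treating it as a bound available uniformly in $\lambda$; the precise statements appear in later sections). For $f(\hat y, y) = (\hat y - y)^2$, the infimum in (3) is attained at $u = (y + \lambda \hat y)/(1 + \lambda)$, giving
\[
f_\lambda(\hat y, y) = \frac{\lambda}{1+\lambda}(\hat y - y)^2, \qquad \text{so} \qquad L_{f_\lambda} = \frac{\lambda}{1+\lambda} L_f.
\]
Substituting into (8) and rearranging,
\[
L_f(w, b) \le \Big(1 + \tfrac{1}{\lambda}\Big)\big(\hat L_f(w, b) + \epsilon\big) + (1 + \lambda)\,\frac{C_\delta(w)^2}{n},
\]
and choosing $\lambda = \sqrt{n(\hat L_f(w, b) + \epsilon)}\,/\,C_\delta(w)$ yields the optimistic-rate form
\[
L_f(w, b) \le \left(\sqrt{\hat L_f(w, b) + \epsilon} + \frac{C_\delta(w)}{\sqrt{n}}\right)^2.
\]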
Remark 1. The complexity functional $C_\delta(w)$ should be thought of as a localized, high-probability version of Rademacher complexity. This is because the Gaussian width of a convex set $\mathcal{K}$, $\mathbb{E} \sup_{w \in \mathcal{K}} \langle w, x \rangle$, is the same as the Rademacher complexity of the class of linear functions $\{x \mapsto \langle w, x \rangle : w \in \mathcal{K}\}$ (Zhou et al. 2021, Proposition 1). A somewhat similar complexity functional appears in Panchenko (2003). Also, note that (6) requires only one-sided concentration — see Remark 3.
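As one concrete (hypothetical) instantiation of (7): for a Euclidean-norm constraint, Cauchy–Schwarz gives $\langle Qw, x \rangle \le \|w\|_2 \|Q^\top x\|_2$, and $\|Q^\top x\|_2$ concentrates around $\sqrt{\operatorname{tr}(\Sigma_\perp)}$, so a norm-proportional $C_\delta$ is available. The sketch below checks this numerically; the spiked covariance is our own choice, and this is only one possible choice of complexity functional.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 400
eigs = np.concatenate([np.full(5, 50.0), np.ones(d - 5)])   # hypothetical spiked Sigma
Sigma = np.diag(eigs)

# One index direction (k = 1), normalized so Sigma^{1/2} w_1 has unit norm.
w1 = rng.normal(size=d)
w1 /= np.linalg.norm(np.sqrt(eigs) * w1)
Q = np.eye(d) - np.outer(w1, w1) @ Sigma
Sigma_perp = Q.T @ Sigma @ Q

# sup over the unit ball: sup_{||w||_2 <= 1} <Qw, x> = ||Q^T x||_2.
X = rng.normal(size=(5_000, d)) * np.sqrt(eigs)             # rows are draws of x ~ N(0, Sigma)
sup_vals = np.linalg.norm(X @ Q, axis=1)                    # ||Q^T x||_2 for each sample
print(np.sqrt(np.trace(Sigma_perp)), np.quantile(sup_vals, [0.5, 0.99]))
```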
4.1 VC Theory for Low-dimensional Concentration
To apply our generalization result (Theorem 1), we should check the low-dimensional concentration assumption (6). The quantitative bounds in the low-dimensional concentration (i.e. the precise form of the error term $\epsilon_{\lambda,\delta}$) will inevitably depend on the exact setting we consider (see e.g. Vapnik 1982; Koltchinskii and Mendelson 2015; Lugosi and Mendelson 2019 for discussion).
First, we recall the following result from VC theory.
Theorem 2 (Special case of Assertion 4 of Vapnik (1982), Chapter 7.8; see also Theorem 7.6). Let $\mathcal{K} \subset \mathbb{R}^d$ and $\mathcal{B} \subseteq \mathbb{R}$. Suppose that a distribution $\mathcal{D}$ over $(x, y) \in \mathbb{R}^d \times \mathbb{R}$ satisfies that for some $\tau > 0$, it holds uniformly over all $(w, b) \in \mathcal{K} \times \mathcal{B}$ that
\[
\mathbb{E}\big[f(\langle w, x \rangle + b, y)^4\big]^{1/4} \le \tau\, \mathbb{E}\, f(\langle w, x \rangle + b, y). \tag{9}
\]
Also suppose the class of functions $\{(x, y) \mapsto \mathbb{1}\{f(\langle w, x \rangle + b, y) > t\} : w \in \mathcal{K}, b \in \mathcal{B}, t \in \mathbb{R}\}$ has VC-dimension at most $h$. Then for any $n > h$, with probability at least $1 - \delta$ over the choice of $((x_1, y_1), \ldots, (x_n, y_n)) \sim \mathcal{D}^n$, it holds uniformly over all $w \in \mathcal{K}, b \in \mathcal{B}$ that
\[
\frac{1}{n} \sum_{i=1}^{n} f(\langle w, x_i \rangle + b, y_i) \ge \left(1 - 18\tau \sqrt{\frac{h(\log(2n/h) + 1) + \log(12/\delta)}{n}}\right) \mathbb{E}\, f(\langle w, x \rangle + b, y).
\]
The assumption (9) is standard (indeed, this is the setting primarily focused on in Vapnik 1982) and is sometimes referred to as hypercontractivity or norm equivalence in the literature; a variant of the result holds with 4 replaced by $1 + \varepsilon$. In many settings of interest, this can be directly checked using the fact that $x$ is Gaussian (for instance, see Theorem 9 and Appendix E.3). Of course, our general result can be applied without this assumption, by using low-dimensional concentration under an alternative assumption: Vapnik (1982), Panchenko (2002), Panchenko (2003), and Mendelson (2017) have further discussion and alternative results; in particular, Assertion 3 of Vapnik (1982, Chapter 7.8) gives a bound based on a fourth-moment assumption, and Panchenko (2003, Theorem 3) gives one based on a version of Rademacher complexity.
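For intuition about how (9) can be checked under Gaussian data, here is a small Monte Carlo sketch (our own illustration for the square loss in a well-specified surrogate model with Gaussian noise; it is not the general verification used in the appendix). The residual of a linear predictor is then Gaussian, possibly mean-shifted by the bias term, and for a centered Gaussian residual $z$ the ratio is $\mathbb{E}[z^8]^{1/4}/\mathbb{E}[z^2] = 105^{1/4} \approx 3.2$.

```python
import numpy as np

rng = np.random.default_rng(0)

def norm_equiv_ratio(z):
    """E[f^4]^{1/4} / E[f] for the square loss f = z^2, estimated from samples of the residual z."""
    f = z ** 2
    return np.mean(f ** 4) ** 0.25 / np.mean(f)

# Check a few (mean, std) choices for a Gaussian residual z = mu + sigma * g.
g = rng.normal(size=1_000_000)
for mu, sigma in [(0.0, 1.0), (1.0, 1.0), (3.0, 0.5)]:
    print(mu, sigma, norm_equiv_ratio(mu + sigma * g))
# The mu = 0 case approaches 105**0.25 ~ 3.20; the shifted cases are smaller in these examples,
# so a constant tau around 3.2 suffices for this particular well-specified setting.
```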
Combining Theorems 1 and 2 yields the following.
Corollary 1. Under the model assumptions (2), suppose that $C_\delta$ satisfies condition (7). Also suppose that for some fixed $\lambda \ge 0$, $\mathcal{K} \subseteq \mathbb{R}^d$, and $\mathcal{B} \subseteq \mathbb{R}$, the surrogate distribution $\tilde{\mathcal{D}}$ satisfies assumption (9) under $f_\lambda$ uniformly over $\phi(\mathcal{K}) \times \mathcal{B}$, and that the class $\{(x, y) \mapsto \mathbb{1}\{f_\lambda(\langle \tilde w, \phi(x) \rangle + \tilde b, y) > t\} : \tilde w \in \phi(\mathcal{K}), \tilde b \in \mathcal{B}, t \in \mathbb{R}\}$ has VC-dimension at most $h$. Then with probability at least $1 - \delta$, uniformly over all $(w, b) \in \mathcal{K} \times \mathcal{B}$,
\[
\left(1 - 18\tau \sqrt{\frac{h(\log(2n/h) + 1) + \log(48/\delta)}{n}}\right) L_{f_\lambda}(w, b) \le \hat L_f(w, b) + \frac{\lambda\, C_\delta(w)^2}{n}.
\]