SECOND-ORDER REGRESSION MODELS EXHIBIT PROGRESSIVE SHARPENING TO THE EDGE OF STABILITY
Atish Agarwala, Fabian Pedregosa & Jeffrey Pennington
Google Research, Brain Team
{thetish, pedregosa,jpennin}@google.com
ABSTRACT
Recent studies of gradient descent with large step sizes have shown that there is often a regime with an initial increase in the largest eigenvalue of the loss Hessian (progressive sharpening), followed by a stabilization of the eigenvalue near the maximum value which allows convergence (edge of stability). These phenomena are intrinsically non-linear and do not happen for models in the constant Neural Tangent Kernel (NTK) regime, for which the predictive function is approximately linear in the parameters. As such, we consider the next simplest class of predictive models, namely those that are quadratic in the parameters, which we call second-order regression models. For quadratic objectives in two dimensions, we prove that this second-order regression model exhibits progressive sharpening of the NTK eigenvalue towards a value that differs slightly from the edge of stability, which we explicitly compute. In higher dimensions, the model generically shows similar behavior, even without the specific structure of a neural network, suggesting that progressive sharpening and edge-of-stability behavior aren't unique features of neural networks, and could be a more general property of discrete learning algorithms in high-dimensional non-linear models.
1 INTRODUCTION
A recent trend in the theoretical understanding of deep learning has focused on the linearized regime, where the Neural Tangent Kernel (NTK) controls the learning dynamics (Jacot et al., 2018; Lee et al., 2019). The NTK describes the learning dynamics of all networks over short enough time horizons, and can describe the dynamics of wide networks over large time horizons. In the NTK regime, there is a function-space ODE which allows for explicit characterization of the network outputs (Jacot et al., 2018; Lee et al., 2019; Yang, 2021). This approach has been used across the board to gain insights into wide neural networks, but it suffers from a major limitation: the model is linear in the parameters, so it describes a regime with relatively trivial dynamics that cannot capture feature learning and cannot accurately represent the types of complex training phenomena often observed in practice.
While other large-width scaling regimes can preserve some non-linearity and allow for certain types of feature learning (Bordelon & Pehlevan, 2022; Yang et al., 2022), such approaches tend to focus on the small learning-rate or continuous-time dynamics. In contrast, recent empirical work has highlighted a number of important phenomena arising from the non-linear discrete dynamics in training practical networks with large learning rates (Neyshabur et al., 2017; Gilmer et al., 2022; Ghorbani et al., 2019; Foret et al., 2022). In particular, many experiments have shown the tendency for networks to display progressive sharpening of the curvature towards the edge of stability, in which the maximum eigenvalue of the loss Hessian increases over the course of training until it stabilizes at a value equal to roughly two divided by the learning rate, corresponding to the largest eigenvalue for which gradient descent would converge in a quadratic potential (Wu et al., 2018; Giladi et al., 2020; Cohen et al., 2022b;a).
In order to build a better understanding of this behavior, we introduce a class of models which display all the relevant phenomenology, yet are simple enough to admit numerical and analytic understanding. In particular, we propose a simple quadratic regression model and corresponding quartic loss function which fulfills both these goals. We prove that under the right conditions, this simple model shows both progressive sharpening and edge-of-stability behavior. We then empirically analyze a
more general model which shows these behaviors generically in the large datapoint, large model
limit. Finally, we conduct a numerical analysis on the properties of a real neural network and use
tools from our theoretical analysis to show that edge-of-stability behavior “in the wild” shows some
of the same patterns as the theoretical models.
2 BASIC QUARTIC LOSS FUNCTION
2.1 MODEL DEFINITION
We consider the optimization of the quadratic loss function L(θ) = z²/2, where z is a quadratic function of the P×1-dimensional parameter vector θ and Q is a P×P symmetric matrix:

$$z = \frac{1}{2}\left(\theta^{\top} Q\, \theta - E\right). \qquad (1)$$

This can be interpreted either as a model in which the predictive function is quadratic in the input parameters, or as a second-order approximation to a more complicated non-linear function such as a deep network. In this objective, the gradient flow (GF) dynamics with scaling factor η is given by

$$\dot{\theta} = -\eta \nabla_{\theta} L = -\eta z \frac{\partial z}{\partial \theta} = -\frac{\eta}{2}\left(\theta^{\top} Q\, \theta - E\right) Q\theta. \qquad (2)$$

It is useful to re-write the dynamics in terms of z and the 1×P-dimensional Jacobian J = ∂z/∂θ:

$$\dot{z} = -\eta\, (J J^{\top})\, z, \qquad \dot{J} = -\eta z\, J Q. \qquad (3)$$

We note that in this case the neural tangent kernel (NTK) is the scalar JJ^T. In these coordinates, we have E = J Q^+ J^T − 2z, where Q^+ denotes the Moore-Penrose pseudoinverse.
The GF equations can be simplified by two transformations. First, we transform to z̃ = ηz and J̃ = η^{1/2} J. Next, we rotate θ so that Q is diagonal. This is always possible since Q is symmetric. Since the NTK is given by JJ^T, this rotation preserves the dynamics of the curvature. Let ω₁, ..., ω_P be the eigenvalues of Q, and v_i be the associated eigenvectors (in case of degeneracy, one can pick any basis). We define J̃(ω_i) = J̃ v_i, the projection of J̃ onto the i-th eigenvector. Then the gradient flow equations can be written as:

$$\frac{d\tilde{z}}{dt} = -\tilde{z} \sum_{i=1}^{P} \tilde{J}(\omega_i)^2, \qquad \frac{d\tilde{J}(\omega_i)^2}{dt} = -2\tilde{z}\,\omega_i\, \tilde{J}(\omega_i)^2. \qquad (4)$$

The first equation implies that z̃ does not change sign under GF dynamics. Modes with positive ω_i z̃ decrease the curvature, and those with negative ω_i z̃ increase the curvature.
In order to study edge-of-stability behavior, we need initializations which allow the curvature (JJ^T in this case) to increase over time, a phenomenon known as progressive sharpening. Progressive sharpening has been shown to be ubiquitous in machine learning models (Cohen et al., 2022a), so any useful phenomenological model should show it as well. One such initialization for this quadratic regression model is ω₁ = −ω, ω₂ = ω, J̃(ω₁) = J̃(ω₂). This initialization (and others) shows progressive sharpening at all times.
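To make the gradient-flow picture concrete, the short sketch below (not code from the paper; the eigenvalues, initial values, and time step are illustrative assumptions) integrates Equation 4 with a forward-Euler scheme from an initialization of the type just described. The rescaled NTK J̃J̃^T = Σ_i J̃(ω_i)² increases over time while z̃ shrinks toward 0 without changing sign.

```python
import numpy as np

# Forward-Euler integration of the gradient-flow equations (Eq. 4).
# Eigenvalues, initial values, and the time step are illustrative choices.
omega = np.array([-1.0, 1.0])    # eigenvalues of Q: omega_1 = -omega, omega_2 = +omega
J2 = np.array([1.0, 1.0])        # J~(omega_i)^2, equal at initialization
z = 0.5                          # z~ at initialization
dt = 1e-3

for step in range(5001):
    if step % 1000 == 0:
        print(f"t = {step * dt:4.1f}   z~ = {z:+.4f}   rescaled NTK = {J2.sum():.4f}")
    dz = -z * J2.sum()                # d z~/dt
    dJ2 = -2.0 * z * omega * J2       # d J~(omega_i)^2 / dt
    z += dt * dz
    J2 += dt * dJ2
# The rescaled NTK (the curvature) rises over time while z~ decays toward 0
# without changing sign: progressive sharpening under gradient flow.
```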
2.2 GRADIENT DESCENT
We are interested in understanding the edge-of-stability (EOS) behavior in this model: gradient descent (GD) trajectories where the maximum eigenvalue of the NTK, JJ^T, remains close to the critical value 2/η. (Note: we define edge of stability with respect to the maximum NTK eigenvalue; for any twice-differentiable model trained with squared loss, this is equivalent to the maximum eigenvalue of the loss Hessian used in Cohen et al. (2022a) as the model converges to a stationary point (Jacot et al., 2020).)
When Q has both positive and negative eigenvalues, the loss landscape is the square of a hyperbolic paraboloid (Figure 1, left). As suggested by the gradient flow analysis, this causes some trajectories to increase their curvature before convergence, so the final curvature depends on both the initialization and the learning rate. One of the challenges in analyzing the GD dynamics is that, for large learning rates, they oscillate rapidly around minima. One way to mitigate this issue is to consider only every other step (Figure 1, right). We will use this observation to analyze the GD dynamics directly and find configurations where the trajectories show edge-of-stability behavior.

Figure 1: Quartic loss landscape L(·) as a function of the parameters θ, where D = 2, E = 0, and Q has eigenvalues 1 and −0.1. The GD trajectories converge to minima with larger curvature than at initialization and therefore show progressive sharpening (left). The two-step dynamics, in which we consider only even iteration numbers, exhibit fewer oscillations near the edge of stability (right).
In the eigenbasis coordinates, the gradient descent equations are

$$\tilde{z}_{t+1} - \tilde{z}_t = -\tilde{z}_t \sum_{i=1}^{P} \tilde{J}(\omega_i)_t^2 + \frac{1}{2}\tilde{z}_t^2 \sum_{i=1}^{P} \omega_i\, \tilde{J}(\omega_i)_t^2 \qquad (5)$$

$$\tilde{J}(\omega_i)_{t+1}^2 - \tilde{J}(\omega_i)_t^2 = -\tilde{z}_t\, \omega_i\, (2 - \tilde{z}_t\omega_i)\, \tilde{J}(\omega_i)_t^2 \quad \text{for all } 1 \le i \le P. \qquad (6)$$
We will find it convenient in the following to write the dynamics in terms of weighted averages of the J̃(ω_i)² instead of the individual modes J̃(ω_i):

$$T(\alpha) = \sum_{i=1}^{P} \omega_i^{\alpha}\, \tilde{J}(\omega_i)^2. \qquad (7)$$
The dynamical equations become:

$$\tilde{z}_{t+1} - \tilde{z}_t = -\tilde{z}_t T_t(0) + \frac{1}{2}\tilde{z}_t^2\, T_t(1) \qquad (8)$$

$$T_{t+1}(k) - T_t(k) = -\tilde{z}_t \left(2 T_t(k+1) - \tilde{z}_t T_t(k+2)\right). \qquad (9)$$

If Q is invertible, then we have E = T_t(−1) − 2z̃_t. Note that by definition T_t(0) = η J_t J_t^T is the (rescaled) NTK. Edge-of-stability behavior corresponds to dynamics which keep T_t(0) near the value 2 as z̃_t goes to 0.
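As a quick sanity check on this change of variables (a sketch with arbitrary illustrative values, not an experiment from the paper), the snippet below iterates the exact per-mode updates of Equations 5 and 6 and verifies that the moments T(α) of Equation 7 follow the recursion of Equation 9.

```python
import numpy as np

rng = np.random.default_rng(0)
P = 5
omega = rng.uniform(-0.5, 1.0, size=P)   # eigenvalues of Q with mixed signs (illustrative)
J2 = rng.uniform(0.1, 0.5, size=P)       # J~(omega_i)^2 at initialization
z = 0.1                                  # z~ at initialization

def T(alpha):
    """Weighted averages of Eq. 7: T(alpha) = sum_i omega_i^alpha J~(omega_i)^2."""
    return np.sum(omega**alpha * J2)

for t in range(10):
    # Changes predicted by the moment recursion (Eq. 9) for k = 0 and k = 1.
    dT0_pred = -z * (2.0 * T(1) - z * T(2))
    dT1_pred = -z * (2.0 * T(2) - z * T(3))
    # Exact gradient descent updates in the eigenbasis (Eqs. 5 and 6).
    z_new = z - z * T(0) + 0.5 * z**2 * T(1)        # Eq. 5 (same as Eq. 8)
    J2_new = (1.0 - z * omega)**2 * J2              # Eq. 6
    T0_old, T1_old = T(0), T(1)
    z, J2 = z_new, J2_new
    assert np.isclose(T(0) - T0_old, dT0_pred)
    assert np.isclose(T(1) - T1_old, dT1_pred)

print("Eq. 9 reproduces the per-mode updates of Eq. 6 for k = 0 and k = 1.")
```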
2.2.1 REDUCTION TO CATAPULT DYNAMICS
If the eigenvalues of Q are {−ω, ω} and E = 0, the model becomes equivalent to a single-hidden-layer linear network with one training datapoint (Appendix A.1), also known as the catapult phase dynamics. This model does not exhibit sharpening or edge-of-stability behavior (Lewkowycz et al., 2020). We will analyze this model in our z̃-T(0) variables as a warmup, with an eye towards analyzing a different parameter setting which does show sharpening and edge of stability.

We assume without loss of generality that the eigenvalues are {−1, 1}, which can be accomplished by rescaling z̃. The loss function is then the square of a hyperbolic paraboloid. Since there are only 2 variables, we can rewrite the dynamics in terms of z̃ and the curvature T(0) only (Appendix B.1):

$$\tilde{z}_{t+1} - \tilde{z}_t = -\tilde{z}_t T_t(0) + \frac{1}{2}\tilde{z}_t^2\,(2\tilde{z}_t + E) \qquad (10)$$

$$T_{t+1}(0) - T_t(0) = -2\tilde{z}_t(2\tilde{z}_t + E) + \tilde{z}_t^2\, T_t(0). \qquad (11)$$
For E = 0, we can see that sign(ΔT(0)) = sign(T_t(0) − 4), as in Lewkowycz et al. (2020), so convergence requires strictly decreasing curvature. For E ≠ 0, there is a region where the curvature can increase (Appendix B.1). However, there is still no edge-of-stability behavior: there is no set of initializations which starts with λ_max far from 2/η and ends up near 2/η. In contrast, we will show that asymmetric eigenvalues can lead to EOS behavior.
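A minimal numerical illustration of the catapult map (hand-picked starting values, assumed rather than taken from the paper) makes this concrete: the trajectory converges, but the curvature only decreases along the way and never stabilizes near the edge of stability.

```python
# Iterate the reduced catapult dynamics (Eqs. 10-11) with E = 0.
# The starting point is an illustrative choice inside the convergent region T(0) < 4.
E = 0.0
z, T0 = 0.1, 3.0

for t in range(40):
    dz = -z * T0 + 0.5 * z**2 * (2.0 * z + E)
    dT0 = -2.0 * z * (2.0 * z + E) + z**2 * T0   # equals z^2 (T0 - 4) when E = 0
    z, T0 = z + dz, T0 + dT0
    if t % 5 == 0:
        print(f"step {t:2d}   z~ = {z:+.4f}   T(0) = {T0:.4f}")
# Since sign(dT0) = sign(T0 - 4) when E = 0, the curvature decreases at every step
# here, and the trajectory converges with T(0) far below 2: no edge-of-stability
# behavior.
```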
2.2.2 EDGE OF STABILITY REGIME
In this section, we consider the case in which Q has two eigenvalues, one of which is large and positive and the other small and negative. Without loss of generality, we assume that the largest eigenvalue of Q is 1. We denote the second eigenvalue by −ε, for 0 < ε ≪ 1. With this notation we can write the dynamical equations (Appendix B.1) as

$$\tilde{z}_{t+1} - \tilde{z}_t = -\tilde{z}_t T_t(0) + \frac{1}{2}\tilde{z}_t^2\left[(1-\epsilon)T_t(0) + \epsilon(2\tilde{z}_t + E)\right] \qquad (12)$$

$$T_{t+1}(0) - T_t(0) = -2\tilde{z}_t\left[\epsilon(2\tilde{z}_t + E) + (1-\epsilon)T_t(0)\right] + \tilde{z}_t^2\left[T_t(0) - \epsilon(1-\epsilon)\left(T_t(0) - E - 2\tilde{z}_t\right)\right]. \qquad (13)$$
For small ε, there are trajectories where λ_max is initially away from 2/η but converges towards it (Figure 2, left); in other words, EOS behavior. We used a variety of step sizes η but initialized at the same rescaled pair (z̃₀, T₀(0)) to show the universality of the z̃-T(0) coordinates.
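The sketch below iterates Equations 12 and 13 directly (E = 0 and ε = 5·10⁻³ as quoted for Figure 2; the starting point is a hand-picked illustrative value, not one from the paper). After a transient, the printed (even) iterates show the rescaled curvature T(0) settling just below 2 while z̃ slowly decays to 0, which is the edge-of-stability behavior described above.

```python
# Reduced dynamics of Eqs. 12-13 for a small negative eigenvalue -eps.
# eps matches the value quoted for Figure 2; the initial point is illustrative.
eps, E = 5e-3, 0.0
z, T0 = 0.3, 2.6

for t in range(20001):
    if t % 2000 == 0:
        print(f"step {t:5d}   z~ = {z:+.3e}   T(0) = {T0:.4f}")
    dz = -z * T0 + 0.5 * z**2 * ((1 - eps) * T0 + eps * (2 * z + E))
    dT0 = (-2 * z * (eps * (2 * z + E) + (1 - eps) * T0)
           + z**2 * (T0 - eps * (1 - eps) * (T0 - E - 2 * z)))
    z, T0 = z + dz, T0 + dT0
# The even iterates of T(0) approach a value just below 2 as z~ -> 0:
# the trajectory converges to the edge of stability.
```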
In order to quantitatively understand the progressive sharpening and edge of stability, it is useful to look at the two-step dynamics. One additional motivation for studying the two-step dynamics follows from the analysis of gradient descent on linear least squares (i.e., a linear model) with a large step size λ. For every coordinate θ̃, the one-step and two-step dynamics are

$$\tilde{\theta}_{t+1} - \tilde{\theta}_t = -\lambda\tilde{\theta}_t \quad \text{and} \quad \tilde{\theta}_{t+2} - \tilde{\theta}_t = \left((1-\lambda)^2 - 1\right)\tilde{\theta}_t \quad \text{(GD in a quadratic potential)}. \qquad (14)$$

While the dynamics converge for λ < 2, if λ > 1 the one-step dynamics oscillate when approaching the minimum, whereas the two-step dynamics maintain the sign of θ̃ and the trajectories exhibit no oscillations.
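Equation 14 can be illustrated in a few lines (the step size below is an arbitrary choice with 1 < λ < 2):

```python
lam = 1.8        # effective step size with 1 < lam < 2 (illustrative)
theta = 1.0
iterates = [theta]
for _ in range(8):
    theta = (1.0 - lam) * theta          # one-step GD map for a quadratic potential
    iterates.append(theta)

print("one-step :", [f"{x:+.3f}" for x in iterates])
print("two-step :", [f"{x:+.3f}" for x in iterates[::2]])
# The one-step iterates flip sign every step while slowly shrinking; the
# every-other-step iterates keep their sign and contract monotonically by the
# factor (1 - lam)^2 = 0.64.
```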
Likewise, plotting every other iterate in the two-parameter model more clearly demonstrates the phenomenology. For small ε, the dynamics show the distinct phases described in (Li et al., 2022): an initial increase in T(0), a slow increase in z̃, then a decrease in T(0), and finally a slow decrease of z̃ while T(0) remains near 2 (Figure 2, middle).
Unfortunately, the two-step version of the dynamics defined by Equations 12 and 13 is more complicated: it is 3rd order in T(0) and 9th order in z̃; see Appendix B.2 for a more detailed discussion. However, we can still analyze the dynamics as z̃ goes to 0. In order to understand the mechanisms of the EOS behavior, it is useful to understand the nullclines of the two-step dynamics. The nullcline f_z̃(z̃) of z̃ and f_T(z̃) of T(0) are defined implicitly by

$$(\tilde{z}_{t+2} - \tilde{z}_t)(\tilde{z}, f_{\tilde{z}}(\tilde{z})) = 0, \qquad (T_{t+2}(0) - T_t(0))(\tilde{z}, f_T(\tilde{z})) = 0, \qquad (15)$$

where z̃_{t+2} − z̃_t and T_{t+2}(0) − T_t(0) are the aforementioned high-order polynomials in z̃ and T(0). Since these polynomials are cubic in T(0), there are three possible solutions as z̃ goes to 0. We are particularly interested in the solution that goes through z̃ = 0, T(0) = 2, that is, the critical point corresponding to EOS.
Calculations detailed in Appendix B.2 show that the distance between the two nullclines is linear in ε, so they become close as ε goes to 0 (Figure 2, middle). In addition, the trajectories stay near f_z̃, which gives rise to EOS behavior. This suggests that the dynamics are slow near the nullclines, and trajectories appear to be approaching an attractor. We can find the structure of the attractor by changing variables to y_t ≡ T_t(0) − f_z̃(z̃_t), the distance from the z̃ nullcline. To lowest order in z̃ and y, the two-step dynamical equations become (Appendix B.3):

$$\tilde{z}_{t+2} - \tilde{z}_t = 2 y_t \tilde{z}_t + O(y_t^2 \tilde{z}_t) + O(y_t \tilde{z}_t^2) \qquad (16)$$

$$y_{t+2} - y_t = -2(4 - 3\epsilon + 4\epsilon^2)\, y_t \tilde{z}_t^2 - 4\epsilon \tilde{z}_t^2 + O(\tilde{z}_t^3) + O(y^2 \tilde{z}_t^2). \qquad (17)$$

We immediately see that z̃ changes slowly for small y, since we chose coordinates where z̃_{t+2} − z̃_t = 0 when y = 0. We can also see that y_{t+2} − y_t is O(ε) for y_t = 0, so for small ε the y dynamics is slow too. Moreover, we see that the coefficient of the z̃_t² term is negative: the changes in z̃ tend to drive y (and therefore T(0)) to decrease. The coefficient of the y_t term is negative as well; the dynamics of y tends to be contractive. The key is that the contractive behavior takes y to an O(ε) fixed point at a rate proportional to z̃², while the dynamics of z̃ are proportional to ε. This suggests a separation of timescales when ε ≪ z̃², where y first equilibrates to a fixed value, and then z̃ converges to 0 (Figure 2, right). This intuition for the lowest-order terms can be formalized, and gives us a prediction of lim_{t→∞} y_t = −ε/2, confirmed numerically in the full model (Appendix B.5).
Figure 2: For small ε, the two-eigenvalue model shows EOS behavior for various step sizes (ε = 5·10⁻³; left). Trajectories are the same up to scaling because the corresponding rescaled coordinates z̃ and T(0) are the same at initialization. Plotting every other iterate, we see that trajectories in z̃-T(0) space stay near the nullcline (z̃, f_z̃(z̃)), the curve where z̃_{t+2} − z̃_t = 0 (middle). Changing variables to y = T(0) − f_z̃(z̃) shows quick concentration onto a curve of near-constant, small, negative y (right).
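The limiting prediction can be spot-checked with a rough numerical sketch (an illustrative check, not from the paper, which reports its own confirmation for the full model in Appendix B.5). The idea: locate the z̃ nullcline at a moderate z̃₀ by bisection, start there, run the reduced map of Equations 12 and 13 until z̃ is negligible, and compare the limiting gap 2 − T(0) with ε/2; since the nullcline passes through (0, 2), lim y = −ε/2 translates into lim T(0) ≈ 2 − ε/2.

```python
def step(z, T, eps, E=0.0):
    """One step of the reduced dynamics, Eqs. 12-13."""
    dz = -z * T + 0.5 * z**2 * ((1 - eps) * T + eps * (2 * z + E))
    dT = (-2 * z * (eps * (2 * z + E) + (1 - eps) * T)
          + z**2 * (T - eps * (1 - eps) * (T - E - 2 * z)))
    return z + dz, T + dT

def nullcline(z0, eps):
    """Solve (z_{t+2} - z_t)(z0, f(z0)) = 0 for the branch near T(0) = 2 by bisection."""
    def two_step_change(T):
        z1, T1 = step(z0, T, eps)
        z2, _ = step(z1, T1, eps)
        return z2 - z0
    lo, hi = 2.0, 2.0 + 4.0 * z0          # bracket for the branch through (0, 2)
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if two_step_change(lo) * two_step_change(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

for eps in (2e-3, 5e-3, 1e-2):
    z, T = 0.2, nullcline(0.2, eps)       # start on the z~ nullcline (y = 0)
    for _ in range(int(60 / eps)):        # iterate until z~ is negligible
        z, T = step(z, T, eps)
    print(f"eps = {eps:.0e}   2 - lim T(0) = {2 - T:.2e}   eps/2 = {eps / 2:.2e}")
# Starting on the nullcline, the limiting gap 2 - T(0) should come out close to
# eps/2, in line with the lowest-order prediction lim y = -eps/2; the exact number
# depends mildly on the initialization.
```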
We can prove the following theorem about the long-time dynamics of z̃ and y when the higher-order terms are included (Appendix B.4):

Theorem 2.1. There exists an ε_c > 0 such that for a quadratic regression model with E = 0 and eigenvalues {−ε, 1}, ε ≤ ε_c, there exists a neighborhood U ⊂ R² and an interval [η₁, η₂] such that for initial θ ∈ U and learning rate η ∈ [η₁, η₂], the model displays edge-of-stability behavior:

$$2/\eta - \delta_{\lambda} \le \lim_{t\to\infty} \lambda_{max} \le 2/\eta \qquad (18)$$

for δ_λ of O(ε).

Therefore, unlike the catapult phase model, the small-ε model provably displays EOS behavior, whose mechanism is made clear by the z̃-y coordinate transformation.
3 QUADRATIC REGRESSION MODEL
3.1 GENERAL MODEL
While the model defined in Equation 1 provably displays edge-of-stability behavior, it required tuning of the eigenvalues of Q to demonstrate it. We can define a more general model which exhibits edge-of-stability behavior with less tuning. We define the quadratic regression model as follows.
Given a P-dimensional parameter vector θ, the D-dimensional output vector z is given by

$$z = y + G\theta + \frac{1}{2}Q(\theta, \theta). \qquad (19)$$

Here y is a D-dimensional vector, G is a D×P-dimensional matrix, and Q is a D×P×P-dimensional tensor symmetric in the last two indices; that is, Q(·,·) takes two P-dimensional vectors as input and outputs a D-dimensional vector satisfying Q(θ, θ)_α = θ^T Q_α θ. If Q = 0, the model corresponds to linearized learning (as in the NTK regime). When Q ≠ 0, we obtain the first correction to the NTK regime. We note that

$$G_{\alpha i} = \left.\frac{\partial z_{\alpha}}{\partial \theta_i}\right|_{\theta=0}, \qquad Q_{\alpha i j} = \frac{\partial^2 z_{\alpha}}{\partial \theta_i \partial \theta_j}, \qquad J = G + Q(\theta, \cdot), \qquad (20)$$

for the D×P-dimensional Jacobian J. For D = 1, we recover the model of Equation 1. In the remainder of this section, we will study the limit as D and P increase with fixed ratio D/P.
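To make the definition concrete, here is a minimal sketch of the model in Equations 19 and 20 (the sizes, the random scaling of G and Q, and the test point θ are illustrative assumptions, and G is taken as a D×P matrix so that Gθ is D-dimensional, matching the Jacobian convention above). It evaluates z, the Jacobian J = G + Q(θ, ·), and the D×D NTK JJ^T, whose dependence on θ (absent when Q = 0) is what allows the curvature to move during training.

```python
import numpy as np

rng = np.random.default_rng(0)
D, P = 4, 16                      # illustrative sizes; the paper studies D, P large with D/P fixed

y = rng.normal(size=D)            # D-dimensional offset
G = rng.normal(size=(D, P)) / np.sqrt(P)
Q = rng.normal(size=(D, P, P)) / P
Q = 0.5 * (Q + np.transpose(Q, (0, 2, 1)))   # symmetrize in the last two indices

def model(theta):
    """Quadratic regression model of Eq. 19: z = y + G theta + (1/2) Q(theta, theta)."""
    return y + G @ theta + 0.5 * np.einsum('aij,i,j->a', Q, theta, theta)

def jacobian(theta):
    """Jacobian of Eq. 20: J = G + Q(theta, .), a D x P matrix."""
    return G + np.einsum('aij,i->aj', Q, theta)

theta = rng.normal(size=P)
J = jacobian(theta)
ntk = J @ J.T                     # D x D NTK; for D = 1 this reduces to the scalar JJ^T of Section 2
print("top NTK eigenvalue at theta:", np.linalg.eigvalsh(ntk)[-1])
print("top NTK eigenvalue at 0    :", np.linalg.eigvalsh(G @ G.T)[-1])
# When Q = 0 the Jacobian (and hence the NTK) is independent of theta, recovering
# linearized (NTK-regime) learning; with Q != 0 the NTK moves as theta moves, which
# is what makes progressive sharpening possible.
```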