SECOND-ORDER REGRESSION MODELS EXHIBIT PROGRESSIVE SHARPENING TO THE EDGE OF STABILITY
Atish Agarwala, Fabian Pedregosa & Jeffrey Pennington
Google Research, Brain Team
{thetish, pedregosa,jpennin}@google.com
ABSTRACT
Recent studies of gradient descent with large step sizes have shown that there is often a regime with an initial increase in the largest eigenvalue of the loss Hessian (progressive sharpening), followed by a stabilization of the eigenvalue near the maximum value which allows convergence (edge of stability). These phenomena are intrinsically non-linear and do not happen for models in the constant Neural Tangent Kernel (NTK) regime, for which the predictive function is approximately linear in the parameters. As such, we consider the next simplest class of predictive models, namely those that are quadratic in the parameters, which we call second-order regression models. For quadratic objectives in two dimensions, we prove that this second-order regression model exhibits progressive sharpening of the NTK eigenvalue towards a value that differs slightly from the edge of stability, which we explicitly compute. In higher dimensions, the model generically shows similar behavior, even without the specific structure of a neural network, suggesting that progressive sharpening and edge-of-stability behavior aren't unique features of neural networks, and could be a more general property of discrete learning algorithms in high-dimensional non-linear models.
1 INTRODUCTION
A recent trend in the theoretical understanding of deep learning has focused on the linearized regime, where the Neural Tangent Kernel (NTK) controls the learning dynamics (Jacot et al., 2018; Lee et al., 2019). The NTK describes the learning dynamics of all networks over short enough time horizons, and can describe the dynamics of wide networks over large time horizons. In the NTK regime, there is a function-space ODE which allows for explicit characterization of the network outputs (Jacot et al., 2018; Lee et al., 2019; Yang, 2021). This approach has been used across the board to gain insights into wide neural networks, but it suffers from a major limitation: the model is linear in the parameters, so it describes a regime with relatively trivial dynamics that cannot capture feature learning and cannot accurately represent the types of complex training phenomena often observed in practice.
While other large-width scaling regimes can preserve some non-linearity and allow for certain types of feature learning (Bordelon & Pehlevan, 2022; Yang et al., 2022), such approaches tend to focus on the small learning-rate or continuous-time dynamics. In contrast, recent empirical work has highlighted a number of important phenomena arising from the non-linear discrete dynamics in training practical networks with large learning rates (Neyshabur et al., 2017; Gilmer et al., 2022; Ghorbani et al., 2019; Foret et al., 2022). In particular, many experiments have shown the tendency for networks to display progressive sharpening of the curvature towards the edge of stability, in which the maximum eigenvalue of the loss Hessian increases over the course of training until it stabilizes at a value equal to roughly two divided by the learning rate, corresponding to the largest eigenvalue for which gradient descent would converge in a quadratic potential (Wu et al., 2018; Giladi et al., 2020; Cohen et al., 2022b;a).
In order to build a better understanding of this behavior, we introduce a class of models which display all the relevant phenomenology, yet are simple enough to admit numerical and analytic understanding. In particular, we propose a simple quadratic regression model and corresponding quartic loss function which fulfills both these goals. We prove that under the right conditions, this simple model shows both progressive sharpening and edge-of-stability behavior. We then empirically analyze a
more general model which shows these behaviors generically in the large datapoint, large model
limit. Finally, we conduct a numerical analysis on the properties of a real neural network and use
tools from our theoretical analysis to show that edge-of-stability behavior “in the wild” shows some
of the same patterns as the theoretical models.
2 BASIC QUARTIC LOSS FUNCTION
2.1 MODEL DEFINITION
We consider the optimization of the quadratic loss function L(θ) = z²/2, where z is a quadratic function of the P×1-dimensional parameter vector θ and Q is a P×P symmetric matrix:

$$z = \frac{1}{2}\left(\theta^{\top} Q\, \theta - E\right). \qquad (1)$$

This can be interpreted either as a model in which the predictive function is quadratic in the input parameters, or as a second-order approximation to a more complicated non-linear function such as a deep network. In this objective, the gradient flow (GF) dynamics with scaling factor η is given by

$$\dot{\theta} = -\eta \nabla_{\theta} L = -\eta z \frac{\partial z}{\partial \theta} = -\frac{\eta}{2}\left(\theta^{\top} Q\, \theta - E\right) Q\theta. \qquad (2)$$

It is useful to re-write the dynamics in terms of z and the 1×P-dimensional Jacobian J = ∂z/∂θ:

$$\dot{z} = -\eta\, (J J^{\top})\, z, \qquad \dot{J} = -\eta z\, J Q. \qquad (3)$$

We note that in this case the neural tangent kernel (NTK) is the scalar JJ^T. In these coordinates, we have E = J Q^+ J^T − 2z, where Q^+ denotes the Moore-Penrose pseudoinverse.
The GF equations can be simplified by two transformations. First, we transform to z̃ = ηz and J̃ = η^{1/2} J. Next, we rotate θ so that Q is diagonal. This is always possible since Q is symmetric. Since the NTK is given by JJ^T, this rotation preserves the dynamics of the curvature. Let ω₁, ..., ω_P be the eigenvalues of Q, and v_i be the associated eigenvectors (in case of degeneracy, one can pick any basis). We define J̃(ω_i) = J̃ v_i, the projection of J̃ onto the i-th eigenvector. Then the gradient flow equations can be written as:

$$\frac{d\tilde{z}}{dt} = -\tilde{z} \sum_{i=1}^{P} \tilde{J}(\omega_i)^2, \qquad \frac{d\tilde{J}(\omega_i)^2}{dt} = -2\tilde{z}\,\omega_i\, \tilde{J}(\omega_i)^2. \qquad (4)$$

The first equation implies that z̃ does not change sign under GF dynamics. Modes with positive ω_i z̃ decrease the curvature, and those with negative ω_i z̃ increase the curvature.
In order to study edge-of-stability behavior, we need initializations which allow the curvature (JJ^T in this case) to increase over time, a phenomenon known as progressive sharpening. Progressive sharpening has been shown to be ubiquitous in machine learning models (Cohen et al., 2022a), so any useful phenomenological model should show it as well. One such initialization for this quadratic regression model is ω₁ = −ω, ω₂ = ω, J̃(ω₁) = J̃(ω₂). This initialization (and others) shows progressive sharpening at all times.
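To make the gradient-flow picture concrete, the short sketch below (not code from the paper; the eigenvalues, initial values, and time step are illustrative assumptions) integrates Equation 4 with a forward-Euler scheme from an initialization of the type just described. The rescaled NTK J̃J̃^T = Σ_i J̃(ω_i)² increases over time while z̃ shrinks toward 0 without changing sign.

```python
import numpy as np

# Forward-Euler integration of the gradient-flow equations (Eq. 4).
# Eigenvalues, initial values, and the time step are illustrative choices.
omega = np.array([-1.0, 1.0])    # eigenvalues of Q: omega_1 = -omega, omega_2 = +omega
J2 = np.array([1.0, 1.0])        # J~(omega_i)^2, equal at initialization
z = 0.5                          # z~ at initialization
dt = 1e-3

for step in range(5001):
    if step % 1000 == 0:
        print(f"t = {step * dt:4.1f}   z~ = {z:+.4f}   rescaled NTK = {J2.sum():.4f}")
    dz = -z * J2.sum()                # d z~/dt
    dJ2 = -2.0 * z * omega * J2       # d J~(omega_i)^2 / dt
    z += dt * dz
    J2 += dt * dJ2
# The rescaled NTK (the curvature) rises over time while z~ decays toward 0
# without changing sign: progressive sharpening under gradient flow.
```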
2.2 GRADIENT DESCENT
We are interested in understanding the edge-of-stability (EOS) behavior in this model: gradient descent (GD) trajectories where the maximum eigenvalue of the NTK, JJ^T, remains close to the critical value 2/η. (Note: we define edge of stability with respect to the maximum NTK eigenvalue; for any twice-differentiable model trained with squared loss, this is equivalent to the maximum eigenvalue of the loss Hessian used in Cohen et al. (2022a) as the model converges to a stationary point (Jacot et al., 2020).)
When Q has both positive and negative eigenvalues, the loss landscape is the square of a hyperbolic paraboloid (Figure 1, left). As suggested by the gradient flow analysis, this causes some trajectories to increase their curvature before convergence, so the final curvature depends on both the initialization and the learning rate. One of the challenges in analyzing the GD dynamics is that, for large learning rates, they oscillate rapidly around minima. One way to mitigate this issue is to consider only every other step (Figure 1, right). We will use this observation to analyze the GD dynamics directly and find configurations where the trajectories show edge-of-stability behavior.

Figure 1: Quartic loss landscape L(·) as a function of the parameters θ, where D = 2, E = 0, and Q has eigenvalues 1 and −0.1. The GD trajectories converge to minima with larger curvature than at initialization and therefore show progressive sharpening (left). The two-step dynamics, in which we consider only even iteration numbers, exhibit fewer oscillations near the edge of stability (right).
In the eigenbasis coordinates, the gradient descent equations are

$$\tilde{z}_{t+1} - \tilde{z}_t = -\tilde{z}_t \sum_{i=1}^{P} \tilde{J}(\omega_i)_t^2 + \frac{1}{2}\tilde{z}_t^2 \sum_{i=1}^{P} \omega_i\, \tilde{J}(\omega_i)_t^2 \qquad (5)$$

$$\tilde{J}(\omega_i)_{t+1}^2 - \tilde{J}(\omega_i)_t^2 = -\tilde{z}_t\, \omega_i\, (2 - \tilde{z}_t\omega_i)\, \tilde{J}(\omega_i)_t^2 \quad \text{for all } 1 \le i \le P. \qquad (6)$$
We will find it convenient in the following to write the dynamics in terms of weighted averages of the J̃(ω_i)² instead of the individual modes J̃(ω_i):

$$T(\alpha) = \sum_{i=1}^{P} \omega_i^{\alpha}\, \tilde{J}(\omega_i)^2. \qquad (7)$$
The dynamical equations become:

$$\tilde{z}_{t+1} - \tilde{z}_t = -\tilde{z}_t T_t(0) + \frac{1}{2}\tilde{z}_t^2\, T_t(1) \qquad (8)$$

$$T_{t+1}(k) - T_t(k) = -\tilde{z}_t \left(2 T_t(k+1) - \tilde{z}_t T_t(k+2)\right). \qquad (9)$$

If Q is invertible, then we have E = T_t(−1) − 2z̃_t. Note that by definition T_t(0) = η J_t J_t^T is the (rescaled) NTK. Edge-of-stability behavior corresponds to dynamics which keep T_t(0) near the value 2 as z̃_t goes to 0.
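As a quick sanity check on this change of variables (a sketch with arbitrary illustrative values, not an experiment from the paper), the snippet below iterates the exact per-mode updates of Equations 5 and 6 and verifies that the moments T(α) of Equation 7 follow the recursion of Equation 9.

```python
import numpy as np

rng = np.random.default_rng(0)
P = 5
omega = rng.uniform(-0.5, 1.0, size=P)   # eigenvalues of Q with mixed signs (illustrative)
J2 = rng.uniform(0.1, 0.5, size=P)       # J~(omega_i)^2 at initialization
z = 0.1                                  # z~ at initialization

def T(alpha):
    """Weighted averages of Eq. 7: T(alpha) = sum_i omega_i^alpha J~(omega_i)^2."""
    return np.sum(omega**alpha * J2)

for t in range(10):
    # Changes predicted by the moment recursion (Eq. 9) for k = 0 and k = 1.
    dT0_pred = -z * (2.0 * T(1) - z * T(2))
    dT1_pred = -z * (2.0 * T(2) - z * T(3))
    # Exact gradient descent updates in the eigenbasis (Eqs. 5 and 6).
    z_new = z - z * T(0) + 0.5 * z**2 * T(1)        # Eq. 5 (same as Eq. 8)
    J2_new = (1.0 - z * omega)**2 * J2              # Eq. 6
    T0_old, T1_old = T(0), T(1)
    z, J2 = z_new, J2_new
    assert np.isclose(T(0) - T0_old, dT0_pred)
    assert np.isclose(T(1) - T1_old, dT1_pred)

print("Eq. 9 reproduces the per-mode updates of Eq. 6 for k = 0 and k = 1.")
```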
2.2.1 REDUCTION TO CATAPULT DYNAMICS
If the eigenvalues of Q are {−ω, ω} and E = 0, the model becomes equivalent to a single-hidden-layer linear network with one training datapoint (Appendix A.1), also known as the catapult phase dynamics. This model does not exhibit sharpening or edge-of-stability behavior (Lewkowycz et al., 2020). We will analyze this model in our z̃-T(0) variables as a warmup, with an eye towards analyzing a different parameter setting which does show sharpening and edge of stability.

We assume without loss of generality that the eigenvalues are {−1, 1}, which can be accomplished by rescaling z̃. The loss function is then the square of a hyperbolic paraboloid. Since there are only 2 variables, we can rewrite the dynamics in terms of z̃ and the curvature T(0) only (Appendix B.1):

$$\tilde{z}_{t+1} - \tilde{z}_t = -\tilde{z}_t T_t(0) + \frac{1}{2}\tilde{z}_t^2\,(2\tilde{z}_t + E) \qquad (10)$$

$$T_{t+1}(0) - T_t(0) = -2\tilde{z}_t(2\tilde{z}_t + E) + \tilde{z}_t^2\, T_t(0). \qquad (11)$$
For E = 0, we can see that sign(ΔT(0)) = sign(T_t(0) − 4), as in Lewkowycz et al. (2020), so convergence requires strictly decreasing curvature. For E ≠ 0, there is a region where the curvature can increase (Appendix B.1). However, there is still no edge-of-stability behavior: there is no set of initializations which starts with λ_max far from 2/η and ends up near 2/η. In contrast, we will show that asymmetric eigenvalues can lead to EOS behavior.
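A minimal numerical illustration of the catapult map (hand-picked starting values, assumed rather than taken from the paper) makes this concrete: the trajectory converges, but the curvature only decreases along the way and never stabilizes near the edge of stability.

```python
# Iterate the reduced catapult dynamics (Eqs. 10-11) with E = 0.
# The starting point is an illustrative choice inside the convergent region T(0) < 4.
E = 0.0
z, T0 = 0.1, 3.0

for t in range(40):
    dz = -z * T0 + 0.5 * z**2 * (2.0 * z + E)
    dT0 = -2.0 * z * (2.0 * z + E) + z**2 * T0   # equals z^2 (T0 - 4) when E = 0
    z, T0 = z + dz, T0 + dT0
    if t % 5 == 0:
        print(f"step {t:2d}   z~ = {z:+.4f}   T(0) = {T0:.4f}")
# Since sign(dT0) = sign(T0 - 4) when E = 0, the curvature decreases at every step
# here, and the trajectory converges with T(0) far below 2: no edge-of-stability
# behavior.
```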
2.2.2 EDGE OF STABILITY REGIME
In this section, we consider the case in which Q has two eigenvalues, one of which is large and positive and the other small and negative. Without loss of generality, we assume that the largest eigenvalue of Q is 1. We denote the second eigenvalue by −ε, for 0 < ε ≪ 1. With this notation we can write the dynamical equations (Appendix B.1) as

$$\tilde{z}_{t+1} - \tilde{z}_t = -\tilde{z}_t T_t(0) + \frac{1}{2}\tilde{z}_t^2\left[(1-\epsilon)T_t(0) + \epsilon(2\tilde{z}_t + E)\right] \qquad (12)$$

$$T_{t+1}(0) - T_t(0) = -2\tilde{z}_t\left[\epsilon(2\tilde{z}_t + E) + (1-\epsilon)T_t(0)\right] + \tilde{z}_t^2\left[T_t(0) - \epsilon(1-\epsilon)\left(T_t(0) - E - 2\tilde{z}_t\right)\right]. \qquad (13)$$
For small ε, there are trajectories where λ_max is initially away from 2/η but converges towards it (Figure 2, left); in other words, EOS behavior. We used a variety of step sizes η but initialized at the same rescaled pair (z̃₀, T₀(0)) to show the universality of the z̃-T(0) coordinates.
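The sketch below iterates Equations 12 and 13 directly (E = 0 and ε = 5·10⁻³ as quoted for Figure 2; the starting point is a hand-picked illustrative value, not one from the paper). After a transient, the printed (even) iterates show the rescaled curvature T(0) settling just below 2 while z̃ slowly decays to 0, which is the edge-of-stability behavior described above.

```python
# Reduced dynamics of Eqs. 12-13 for a small negative eigenvalue -eps.
# eps matches the value quoted for Figure 2; the initial point is illustrative.
eps, E = 5e-3, 0.0
z, T0 = 0.3, 2.6

for t in range(20001):
    if t % 2000 == 0:
        print(f"step {t:5d}   z~ = {z:+.3e}   T(0) = {T0:.4f}")
    dz = -z * T0 + 0.5 * z**2 * ((1 - eps) * T0 + eps * (2 * z + E))
    dT0 = (-2 * z * (eps * (2 * z + E) + (1 - eps) * T0)
           + z**2 * (T0 - eps * (1 - eps) * (T0 - E - 2 * z)))
    z, T0 = z + dz, T0 + dT0
# The even iterates of T(0) approach a value just below 2 as z~ -> 0:
# the trajectory converges to the edge of stability.
```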
In order to quantitatively understand the progressive sharpening and edge of stability, it is useful to look at the two-step dynamics. One additional motivation for studying the two-step dynamics follows from the analysis of gradient descent on linear least squares (i.e., a linear model) with a large step size λ. For every coordinate θ̃, the one-step and two-step dynamics are

$$\tilde{\theta}_{t+1} - \tilde{\theta}_t = -\lambda\tilde{\theta}_t \quad \text{and} \quad \tilde{\theta}_{t+2} - \tilde{\theta}_t = \left((1-\lambda)^2 - 1\right)\tilde{\theta}_t \quad \text{(GD in a quadratic potential)}. \qquad (14)$$

While the dynamics converge for λ < 2, if λ > 1 the one-step dynamics oscillate when approaching the minimum, whereas the two-step dynamics maintain the sign of θ̃ and the trajectories exhibit no oscillations.
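Equation 14 can be illustrated in a few lines (the step size below is an arbitrary choice with 1 < λ < 2):

```python
lam = 1.8        # effective step size with 1 < lam < 2 (illustrative)
theta = 1.0
iterates = [theta]
for _ in range(8):
    theta = (1.0 - lam) * theta          # one-step GD map for a quadratic potential
    iterates.append(theta)

print("one-step :", [f"{x:+.3f}" for x in iterates])
print("two-step :", [f"{x:+.3f}" for x in iterates[::2]])
# The one-step iterates flip sign every step while slowly shrinking; the
# every-other-step iterates keep their sign and contract monotonically by the
# factor (1 - lam)^2 = 0.64.
```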
Likewise, plotting every other iterate in the two-parameter model more clearly demonstrates the phenomenology. For small ε, the dynamics show the distinct phases described in (Li et al., 2022): an initial increase in T(0), a slow increase in z̃, then a decrease in T(0), and finally a slow decrease of z̃ while T(0) remains near 2 (Figure 2, middle).
Unfortunately, the two-step version of the dynamics defined by Equations 12 and 13 is more complicated: it is 3rd order in T(0) and 9th order in z̃; see Appendix B.2 for a more detailed discussion. However, we can still analyze the dynamics as z̃ goes to 0. In order to understand the mechanisms of the EOS behavior, it is useful to understand the nullclines of the two-step dynamics. The nullcline f_z̃(z̃) of z̃ and f_T(z̃) of T(0) are defined implicitly by

$$(\tilde{z}_{t+2} - \tilde{z}_t)(\tilde{z}, f_{\tilde{z}}(\tilde{z})) = 0, \qquad (T_{t+2}(0) - T_t(0))(\tilde{z}, f_T(\tilde{z})) = 0, \qquad (15)$$

where z̃_{t+2} − z̃_t and T_{t+2}(0) − T_t(0) are the aforementioned high-order polynomials in z̃ and T(0). Since these polynomials are cubic in T(0), there are three possible solutions as z̃ goes to 0. We are particularly interested in the solution that goes through z̃ = 0, T(0) = 2, that is, the critical point corresponding to EOS.
Calculations detailed in Appendix B.2 show that the distance between the two nullclines is linear in ε, so they become close as ε goes to 0 (Figure 2, middle). In addition, the trajectories stay near f_z̃, which gives rise to EOS behavior. This suggests that the dynamics are slow near the nullclines, and trajectories appear to be approaching an attractor. We can find the structure of the attractor by changing variables to y_t ≡ T_t(0) − f_z̃(z̃_t), the distance from the z̃ nullcline. To lowest order in z̃ and y, the two-step dynamical equations become (Appendix B.3):

$$\tilde{z}_{t+2} - \tilde{z}_t = 2 y_t \tilde{z}_t + O(y_t^2 \tilde{z}_t) + O(y_t \tilde{z}_t^2) \qquad (16)$$

$$y_{t+2} - y_t = -2(4 - 3\epsilon + 4\epsilon^2)\, y_t \tilde{z}_t^2 - 4\epsilon \tilde{z}_t^2 + O(\tilde{z}_t^3) + O(y^2 \tilde{z}_t^2). \qquad (17)$$

We immediately see that z̃ changes slowly for small y, since we chose coordinates where z̃_{t+2} − z̃_t = 0 when y = 0. We can also see that y_{t+2} − y_t is O(ε) for y_t = 0, so for small ε the y dynamics is slow too. Moreover, we see that the coefficient of the z̃_t² term is negative: the changes in z̃ tend to drive y (and therefore T(0)) to decrease. The coefficient of the y_t term is negative as well; the dynamics of y tends to be contractive. The key is that the contractive behavior takes y to an O(ε) fixed point at a rate proportional to z̃², while the dynamics of z̃ are proportional to ε. This suggests a separation of timescales when ε ≪ z̃², where y first equilibrates to a fixed value, and then z̃ converges to 0 (Figure 2, right). This intuition for the lowest-order terms can be formalized, and gives us a prediction of lim_{t→∞} y_t = −ε/2, confirmed numerically in the full model (Appendix B.5).
Figure 2: For small ε, the two-eigenvalue model shows EOS behavior for various step sizes (ε = 5·10⁻³; left). Trajectories are the same up to scaling because the corresponding rescaled coordinates z̃ and T(0) are the same at initialization. Plotting every other iterate, we see that trajectories in z̃-T(0) space stay near the nullcline (z̃, f_z̃(z̃)), the curve where z̃_{t+2} − z̃_t = 0 (middle). Changing variables to y = T(0) − f_z̃(z̃) shows quick concentration onto a curve of near-constant, small, negative y (right).
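The limiting prediction can be spot-checked with a rough numerical sketch (an illustrative check, not from the paper, which reports its own confirmation for the full model in Appendix B.5). The idea: locate the z̃ nullcline at a moderate z̃₀ by bisection, start there, run the reduced map of Equations 12 and 13 until z̃ is negligible, and compare the limiting gap 2 − T(0) with ε/2; since the nullcline passes through (0, 2), lim y = −ε/2 translates into lim T(0) ≈ 2 − ε/2.

```python
def step(z, T, eps, E=0.0):
    """One step of the reduced dynamics, Eqs. 12-13."""
    dz = -z * T + 0.5 * z**2 * ((1 - eps) * T + eps * (2 * z + E))
    dT = (-2 * z * (eps * (2 * z + E) + (1 - eps) * T)
          + z**2 * (T - eps * (1 - eps) * (T - E - 2 * z)))
    return z + dz, T + dT

def nullcline(z0, eps):
    """Solve (z_{t+2} - z_t)(z0, f(z0)) = 0 for the branch near T(0) = 2 by bisection."""
    def two_step_change(T):
        z1, T1 = step(z0, T, eps)
        z2, _ = step(z1, T1, eps)
        return z2 - z0
    lo, hi = 2.0, 2.0 + 4.0 * z0          # bracket for the branch through (0, 2)
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if two_step_change(lo) * two_step_change(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

for eps in (2e-3, 5e-3, 1e-2):
    z, T = 0.2, nullcline(0.2, eps)       # start on the z~ nullcline (y = 0)
    for _ in range(int(60 / eps)):        # iterate until z~ is negligible
        z, T = step(z, T, eps)
    print(f"eps = {eps:.0e}   2 - lim T(0) = {2 - T:.2e}   eps/2 = {eps / 2:.2e}")
# Starting on the nullcline, the limiting gap 2 - T(0) should come out close to
# eps/2, in line with the lowest-order prediction lim y = -eps/2; the exact number
# depends mildly on the initialization.
```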
We can prove the following theorem about the long-time dynamics of z̃ and y when the higher-order terms are included (Appendix B.4):

Theorem 2.1. There exists an ε_c > 0 such that for a quadratic regression model with E = 0 and eigenvalues {−ε, 1}, ε ≤ ε_c, there exists a neighborhood U ⊂ R² and an interval [η₁, η₂] such that for initial θ ∈ U and learning rate η ∈ [η₁, η₂], the model displays edge-of-stability behavior:

$$2/\eta - \delta_{\lambda} \le \lim_{t\to\infty} \lambda_{max} \le 2/\eta \qquad (18)$$

for δ_λ of O(ε).

Therefore, unlike the catapult phase model, the small-ε model provably displays EOS behavior, whose mechanism is made clear by the z̃-y coordinate transformation.
3 QUADRATIC REGRESSION MODEL
3.1 GENERAL MODEL
While the model defined in Equation 1 provably displays edge-of-stability behavior, it required tuning of the eigenvalues of Q to demonstrate it. We can define a more general model which exhibits edge-of-stability behavior with less tuning. We define the quadratic regression model as follows.
Given a P-dimensional parameter vector θ, the D-dimensional output vector z is given by

$$z = y + G\theta + \frac{1}{2}Q(\theta, \theta). \qquad (19)$$

Here y is a D-dimensional vector, G is a D×P-dimensional matrix, and Q is a D×P×P-dimensional tensor symmetric in the last two indices; that is, Q(·,·) takes two P-dimensional vectors as input and outputs a D-dimensional vector satisfying Q(θ, θ)_α = θ^T Q_α θ. If Q = 0, the model corresponds to linearized learning (as in the NTK regime). When Q ≠ 0, we obtain the first correction to the NTK regime. We note that

$$G_{\alpha i} = \left.\frac{\partial z_{\alpha}}{\partial \theta_i}\right|_{\theta=0}, \qquad Q_{\alpha i j} = \frac{\partial^2 z_{\alpha}}{\partial \theta_i \partial \theta_j}, \qquad J = G + Q(\theta, \cdot), \qquad (20)$$

for the D×P-dimensional Jacobian J. For D = 1, we recover the model of Equation 1. In the remainder of this section, we will study the limit as D and P increase with fixed ratio D/P.
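To make the definition concrete, here is a minimal sketch of the model in Equations 19 and 20 (the sizes, the random scaling of G and Q, and the test point θ are illustrative assumptions, and G is taken as a D×P matrix so that Gθ is D-dimensional, matching the Jacobian convention above). It evaluates z, the Jacobian J = G + Q(θ, ·), and the D×D NTK JJ^T, whose dependence on θ (absent when Q = 0) is what allows the curvature to move during training.

```python
import numpy as np

rng = np.random.default_rng(0)
D, P = 4, 16                      # illustrative sizes; the paper studies D, P large with D/P fixed

y = rng.normal(size=D)            # D-dimensional offset
G = rng.normal(size=(D, P)) / np.sqrt(P)
Q = rng.normal(size=(D, P, P)) / P
Q = 0.5 * (Q + np.transpose(Q, (0, 2, 1)))   # symmetrize in the last two indices

def model(theta):
    """Quadratic regression model of Eq. 19: z = y + G theta + (1/2) Q(theta, theta)."""
    return y + G @ theta + 0.5 * np.einsum('aij,i,j->a', Q, theta, theta)

def jacobian(theta):
    """Jacobian of Eq. 20: J = G + Q(theta, .), a D x P matrix."""
    return G + np.einsum('aij,i->aj', Q, theta)

theta = rng.normal(size=P)
J = jacobian(theta)
ntk = J @ J.T                     # D x D NTK; for D = 1 this reduces to the scalar JJ^T of Section 2
print("top NTK eigenvalue at theta:", np.linalg.eigvalsh(ntk)[-1])
print("top NTK eigenvalue at 0    :", np.linalg.eigvalsh(G @ G.T)[-1])
# When Q = 0 the Jacobian (and hence the NTK) is independent of theta, recovering
# linearized (NTK-regime) learning; with Q != 0 the NTK moves as theta moves, which
# is what makes progressive sharpening possible.
```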