
SECOND-ORDER REGRESSION MODELS EXHIBIT PROGRESSIVE SHARPENING TO THE EDGE OF STABILITY
Atish Agarwala, Fabian Pedregosa & Jeffrey Pennington
Google Research, Brain Team
{thetish, pedregosa, jpennin}@google.com
ABSTRACT
Recent studies of gradient descent with large step sizes have shown that there is
often a regime with an initial increase in the largest eigenvalue of the loss Hessian
(progressive sharpening), followed by a stabilization of the eigenvalue near the
maximum value which allows convergence (edge of stability). These phenomena
are intrinsically non-linear and do not happen for models in the constant Neural
Tangent Kernel (NTK) regime, for which the predictive function is approximately
linear in the parameters. As such, we consider the next simplest class of predictive models, namely those that are quadratic in the parameters, which we call
second-order regression models. For quadratic objectives in two dimensions, we
prove that this second-order regression model exhibits progressive sharpening of
the NTK eigenvalue towards a value that differs slightly from the edge of stability,
which we explicitly compute. In higher dimensions, the model generically shows
similar behavior, even without the specific structure of a neural network, suggesting that progressive sharpening and edge-of-stability behavior aren't unique features of neural networks, and could be a more general property of discrete learning
algorithms in high-dimensional non-linear models.
1 INTRODUCTION
A recent trend in the theoretical understanding of deep learning has focused on the linearized regime,
where the Neural Tangent Kernel (NTK) controls the learning dynamics (Jacot et al., 2018; Lee et al., 2019). The NTK describes the learning dynamics of all networks over short enough time horizons, and can describe the dynamics of wide networks over large time horizons. In the NTK regime, there is a function-space ODE which allows for explicit characterization of the network outputs (Jacot et al., 2018; Lee et al., 2019; Yang, 2021). This approach has been used across the board to gain insights into wide neural networks, but it suffers from a major limitation: the model is linear in the parameters, so it describes a regime with relatively trivial dynamics that cannot capture feature learning and cannot accurately represent the types of complex training phenomena often observed in practice.
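As a concrete reminder of this standard result (the notation below is ours for illustration, with training inputs $X$, targets $y$, learning rate $\eta$, and empirical NTK $\Theta$; it is not taken verbatim from this paper): for the squared loss under gradient flow with a constant NTK, the function-space dynamics on the training set are
$$\frac{d f_t(X)}{dt} = -\eta\,\Theta(X,X)\big(f_t(X) - y\big), \qquad f_t(X) = y + e^{-\eta\,\Theta(X,X)\,t}\big(f_0(X) - y\big),$$
so the outputs relax linearly toward the targets while the kernel, and hence the learned features, remain fixed.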
While other large-width scaling regimes can preserve some non-linearity and allow for certain types
of feature learning (Bordelon & Pehlevan, 2022; Yang et al., 2022), such approaches tend to focus
on the small learning-rate or continuous-time dynamics. In contrast, recent empirical work has
highlighted a number of important phenomena arising from the non-linear discrete dynamics in
training practical networks with large learning rates (Neyshabur et al., 2017; Gilmer et al., 2022; Ghorbani et al., 2019; Foret et al., 2022). In particular, many experiments have shown the tendency
for networks to display progressive sharpening of the curvature towards the edge of stability, in
which the maximum eigenvalue of the loss Hessian increases over the course of training until it
stabilizes at a value of roughly $2/\eta$ for learning rate $\eta$, corresponding to the largest eigenvalue for which gradient descent would converge in a quadratic potential (Wu et al., 2018; Giladi et al., 2020; Cohen et al., 2022b;a).
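To make this threshold concrete (a standard one-dimensional calculation, not an analysis specific to this paper): for a quadratic potential $L(\theta) = \tfrac{\lambda}{2}\theta^2$, gradient descent with learning rate $\eta$ updates
$$\theta_{t+1} = \theta_t - \eta\lambda\,\theta_t = (1 - \eta\lambda)\,\theta_t,$$
which converges if and only if $|1 - \eta\lambda| < 1$, i.e. $\lambda < 2/\eta$. The edge of stability thus corresponds to the largest Hessian eigenvalue hovering near $2/\eta$.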
In order to build a better understanding of this behavior, we introduce a class of models which display all the relevant phenomenology, yet are simple enough to admit numerical and analytic understanding. In particular, we propose a simple quadratic regression model and a corresponding quartic loss function which together fulfill both of these goals. We prove that under the right conditions, this simple model
shows both progressive sharpening and edge-of-stability behavior. We then empirically analyze a