SGD with Large Step Sizes Learns Sparse Features

Maksym Andriushchenko¹, Aditya Varre¹, Loucas Pillaud-Vivien¹, Nicolas Flammarion¹

¹EPFL. Correspondence to: Maksym Andriushchenko <maksym.andriushchenko@epfl.ch>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
Abstract
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks. We present empirical observations that commonly used large step sizes (i) may lead the iterates to jump from one side of a valley to the other, causing loss stabilization, and (ii) this stabilization induces a hidden stochastic dynamics that implicitly biases the iterates toward sparse predictors. Furthermore, we show empirically that the longer large step sizes keep SGD high in the loss landscape valleys, the better the implicit regularization can operate and find sparse representations. Notably, no explicit regularization is used: the regularization effect comes solely from the SGD dynamics influenced by the large step size schedule. Therefore, these observations unveil how, through the step size schedule, both gradient and noise drive together the SGD dynamics through the loss landscape of neural networks. We justify these findings theoretically through the study of simple neural network models as well as qualitative arguments inspired by stochastic processes. This analysis allows us to shed new light on some common practices and observed phenomena when training deep networks. The code of our experiments is available at https://github.com/tml-epfl/sgd-sparse-features.
1. Introduction
Deep neural networks have accomplished remarkable achievements on a wide variety of tasks. Yet, the understanding of their effectiveness remains incomplete. From an optimization perspective, stochastic training procedures challenge many insights drawn from convex models. Notably, large step size schedules used in practice lead to unexpected patterns of stabilization and sudden drops in
the training loss (He et al., 2016). From a generalization perspective, overparametrized deep nets generalize well while fitting the data perfectly and without any explicit regularizers (Zhang et al., 2017). This suggests that optimization and generalization are tightly intertwined: neural networks find solutions that generalize well thanks to the optimization procedure used to train them. This property, known as implicit bias or algorithmic regularization, has been studied both for regression (Li et al., 2018; Woodworth et al., 2020) and classification (Soudry et al., 2018; Lyu & Li, 2020; Chizat & Bach, 2020). However, for these theoretical results, it is also shown that the typical timescales needed to enter the beneficial feature learning regimes are prohibitively long (Woodworth et al., 2020; Moroshko et al., 2020).
In this paper, we aim at staying closer to the experimental practice and consider the SGD schedules from the ResNet paper (He et al., 2016), where the large step size is first kept constant and then decayed, potentially multiple times. We illustrate this behavior in Fig. 1, where we reproduce a minimal setting without data augmentation or momentum, and with only one step size decrease. We draw attention to two key observations regarding the large step size phase: (a) quickly after the start of training, the loss remains approximately constant on average, and (b) despite no progress on the training loss, running this phase for longer leads to better generalization. We refer to such a large step size phase as loss stabilization. The better generalization hints at some hidden dynamics in the parameter space not captured by the loss curves in Fig. 1. Our main contribution is to unveil the hidden dynamics behind this phase: loss stabilization helps to amplify the noise of SGD, which drives the network towards a solution with sparser features, meaning that for a feature vector ψ(x), only a few unique features are active for a given input x.
1.1. Our Contributions
The effective dynamics behind loss stabilization. We characterize two main components of the SGD dynamics with large step sizes: (i) a fast movement determined by the bouncing directions, causing loss stabilization, and (ii) a slow dynamics driven by the combination of the gradient and the multiplicative noise, which is non-vanishing due to the loss stabilization.
Figure 1. Typical training dynamics of a ResNet-18 trained on CIFAR-10. We use weight decay but no momentum or data augmentation for this experiment. We see a substantial difference in generalization (as large as 12% vs. 35% test error) depending on the step size η and its schedule. When the training loss stabilizes, there is hidden progress occurring, which we aim to characterize.
SDE model and sparse feature learning. We model the effective slow dynamics during loss stabilization by a stochastic differential equation (SDE) whose multiplicative noise is related to the neural tangent kernel features, and validate this modeling experimentally. Building on the existing theory on diagonal linear networks, which shows that this noise structure leads to sparse predictors, we conjecture a similar "sparsifying" effect on the features of more complex architectures. We experimentally confirm this on neural networks of increasing complexity.
Insights from our understanding. We draw a clear general
picture: the hidden optimization dynamics induced by large
step sizes and loss stabilization enable the transition to a
sparse feature learning regime. We argue that after a short
initial phase of training, SGD first identifies sparse features
of the training data and eventually fits the data when the
step size is decreased. Finally, we discuss informally how
many deep learning regularization methods (weight decay,
BatchNorm, SAM) may also fit into the same picture.
1.2. Related Work
He et al. (2016) popularized the piecewise constant step size schedule, which often exhibits a clear loss stabilization pattern, later characterized theoretically by Li et al. (2020) from the optimization point of view. However, the regularization effect of this phase, induced by the underlying hidden stochastic dynamics, is still unclear. Li et al. (2019b) analyzed the role of loss stabilization for a synthetic distribution containing different patterns, but it is not clear how this analysis can be extended to general problems. Jastrzebski et al. (2021) suggest that large step sizes prevent the increase of local curvature during the early phase of training. However, they do not provide an explanation for this phenomenon.
The importance of large step sizes for generalization has been investigated with diverse motivations. Many works conjectured that large step sizes induce the minimization of some complexity measures related to the flatness of minima (Keskar et al., 2016; Smith & Le, 2018; Smith et al., 2021; Yang et al., 2022). Notably, Xing et al. (2018) point out that SGD moves through the loss landscape by bouncing between the walls of a valley, where the role of large step sizes is to guide the SGD iterates towards a flatter minimum. However, the correct flatness measure is often disputed (Dinh et al., 2017), and its role in understanding generalization is questionable since full-batch GD with large step sizes (unlike SGD) can lead to flat solutions which do not generalize well (Kaur et al., 2022).
The attempts to explain the effect of large step sizes on strongly convex models (Nakkiran, 2020; Wu et al., 2021; Beugnot et al., 2022) are inherently incomplete since the phenomenon is tied to the existence of many zero-loss solutions with very different generalization properties. Works based on stability analysis characterize the properties of the minimum to which SGD or GD can potentially converge, depending on the step size (Wu et al., 2018; Mulayoff et al., 2021; Ma & Ying, 2021; Nacson et al., 2022). However, these approaches do not capture the entire training dynamics, such as the large step size phase that we consider, where SGD converges only after the step size is decayed.
To grasp the generalization of SGD, research has focused on SGD augmented with label noise, due to its beneficial regularization properties and resemblance to the standard noise of SGD. Its implicit bias was first characterized by Blanc et al. (2020) and extended by Li et al. (2022). However, their analysis only holds in the final phase of training, close to a zero-loss manifold. Our work instead is closer in spirit to Pillaud-Vivien et al. (2022), where the label noise dynamics is analyzed in the central phase of training, i.e., when the loss is still substantially above zero.
The dynamics of GD with large step sizes have received a lot of attention recently, particularly the edge-of-stability phenomenon (Cohen et al., 2021) and the catapult mechanism (Lewkowycz et al., 2020; Wang et al., 2022). However, the lack of stochastic noise in these analyses renders them incapable of capturing stochastic training. Note
that it is possible to bridge the gap between GD and SGD by using explicit regularization, as in Geiping et al. (2022). We instead focus on the implicit regularization of SGD, which remains the most practical approach for training deep nets.
Finally, sparse features and low-rank structures in deep networks have been commonly used for model compression, knowledge distillation, and the lottery ticket hypothesis (Denton et al., 2014; Hinton et al., 2015; Frankle & Carbin, 2018). A common theme of all these works is the presence of hidden structure in the networks learned by SGD, which allows one to come up with a much smaller network that approximates the original one well. In particular, Hoefler et al. (2021) note that ReLU activations in deep networks trained with SGD are typically much sparser than 50%. Our findings suggest that the step size schedule can be the key component behind the emergence of such sparsity.
2. The Effective Dynamics of SGD with Large
Step Size: Sparse Feature Learning
In this section, we show that large step sizes may lead the loss to stabilize by making SGD bounce above a valley. We then unveil the effective dynamics induced by this loss stabilization. To clarify our exposition, we showcase our results for the mean square error, but other losses like the cross-entropy carry the same key properties in terms of the noise covariance (Wojtowytsch, 2021b, Lemma 2.14). We consider a generic parameterized family of prediction functions $\mathcal{H} := \{x \mapsto h_\theta(x),\ \theta \in \mathbb{R}^p\}$, a setting which encompasses neural networks. In this case, the training loss on input/output samples $(x_i, y_i)_{1\le i\le n} \in \mathbb{R}^d \times \mathbb{R}$ is equal to
$$\mathcal{L}(\theta) := \frac{1}{2n} \sum_{i=1}^{n} \big(h_\theta(x_i) - y_i\big)^2. \qquad (1)$$
We consider the overparameterized setting, i.e., $p \gg n$; hence, there exist many parameters $\theta$ that lead to zero loss, i.e., that perfectly interpolate the dataset. Therefore, the question of which interpolator the algorithm converges to is of paramount importance in terms of generalization. We focus on the SGD recursion with step size $\eta > 0$, initialized at $\theta_0 \in \mathbb{R}^p$: for all $t \in \mathbb{N}$,
$$\theta_{t+1} = \theta_t - \eta\,\big(h_{\theta_t}(x_{i_t}) - y_{i_t}\big)\,\nabla_\theta h_{\theta_t}(x_{i_t}), \qquad (2)$$
where $i_t \sim \mathcal{U}(\{1, \dots, n\})$ is drawn uniformly at random over the sample indices. In the following, note that SGD with mini-batches of size $B > 1$ would lead to a similar analysis but with $\eta/B$ instead of $\eta$.
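As an illustration, the recursion of Eq. (2) can be written in a few lines. The following sketch (in PyTorch) uses an arbitrary small network, synthetic data, and an illustrative step size, all of which are placeholders rather than the settings used in the experiments.

```python
# Minimal sketch of the single-sample SGD recursion of Eq. (2) on the loss of Eq. (1).
# The model, data, and step size are illustrative placeholders.
import torch

n, d, eta = 64, 10, 0.5                      # samples, input dimension, (large) step size
X, y = torch.randn(n, d), torch.randn(n)
h = torch.nn.Sequential(torch.nn.Linear(d, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))

for t in range(1000):
    i = torch.randint(n, (1,)).item()        # i_t drawn uniformly from {1, ..., n}
    residual = h(X[i]).squeeze() - y[i]      # h_theta(x_i) - y_i
    loss_i = 0.5 * residual ** 2
    h.zero_grad()
    loss_i.backward()
    with torch.no_grad():                    # theta <- theta - eta * residual * grad_theta h(x_i)
        for p in h.parameters():
            p -= eta * p.grad
```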
2.1. Background: SGD is GD with Specific Label Noise
To emphasize the combined roles of gradient and noise, we highlight the connection between the SGD dynamics and that of full-batch GD plus a specific label noise. Such a reformulation of the dynamics has already been used in previous works attempting to understand the specificity of the SGD noise (HaoChen et al., 2021; Ziyin et al., 2022). We formalize it in the following proposition.
Proposition 2.1. Let $(\theta_t)_{t\ge 0}$ follow the SGD dynamics of Eq. (2) with the random sampling function $(i_t)_{t\ge 0}$. For $t \ge 0$, define the random vector $\xi_t \in \mathbb{R}^n$ such that
$$[\xi_t]_i := \big(h_{\theta_t}(x_i) - y_i\big)\big(1 - n\,\mathbf{1}_{i = i_t}\big), \qquad (3)$$
for $i \in \{1, \dots, n\}$ and where $\mathbf{1}_A$ is the indicator of the event $A$. Then $(\theta_t)_{t\ge 0}$ follows the full-batch gradient dynamics on $\mathcal{L}$ with label noise $(\xi_t)_{t\ge 0}$, that is,
$$\theta_{t+1} = \theta_t - \frac{\eta}{n} \sum_{i=1}^{n} \big(h_{\theta_t}(x_i) - y_i^t\big)\,\nabla_\theta h_{\theta_t}(x_i), \qquad (4)$$
where we define the random labels $y^t := y + \xi_t$. Furthermore, $\xi_t$ is a mean-zero random vector with variance such that
$$\frac{1}{n(n-1)}\,\mathbb{E}\,\|\xi_t\|^2 = 2\,\mathcal{L}(\theta_t).$$
This reformulation shows two crucial aspects of the SGD noise: (i) the noisy part at state $\theta$ always belongs to the linear space spanned by $\{\nabla_\theta h_\theta(x_1), \dots, \nabla_\theta h_\theta(x_n)\}$, and (ii) it scales as the training loss. Going further on (ii), we highlight in the following section that the loss can stabilize because of large step sizes: this may lead to a constant effective scale of label noise. These two features are of paramount importance when modelling the effective dynamics that take place during loss stabilization.
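The reformulation of Proposition 2.1 can also be checked numerically. The sketch below uses a linear model $h_\theta(x) = \langle\theta, x\rangle$ purely for illustration and verifies that the single-sample step of Eq. (2) coincides with the full-batch step of Eq. (4) on the noisy labels $y + \xi_t$.

```python
# Numerical check of Proposition 2.1 on a toy linear model (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n, d, eta = 8, 5, 0.3
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
theta = rng.normal(size=d)

i_t = rng.integers(n)
residuals = X @ theta - y                          # h_theta(x_i) - y_i
xi = residuals * (1 - n * (np.arange(n) == i_t))   # label noise of Eq. (3)

sgd_step = theta - eta * residuals[i_t] * X[i_t]                # Eq. (2)
gd_noisy_labels = theta - eta / n * X.T @ (residuals - xi)      # Eq. (4) with y^t = y + xi
assert np.allclose(sgd_step, gd_noisy_labels)
```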
2.2. The Effective Dynamics Behind Loss Stabilization
On loss stabilization. For generic quadratic costs, e.g., $F(\beta) := \|X\beta - y\|^2$, gradient descent with step size $\eta$ is convergent for $\eta < 2/\lambda_{\max}$, divergent for $\eta > 2/\lambda_{\max}$, and converges to a bouncing $2$-periodic dynamics for $\eta = 2/\lambda_{\max}$, where $\lambda_{\max}$ is the largest eigenvalue of the Hessian.
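These three regimes are easy to visualize on a one-dimensional quadratic, for which the Hessian is simply the curvature $\lambda$; the values below are illustrative.

```python
# GD on F(beta) = 0.5 * lam * beta^2: convergent, 2-periodic, and divergent regimes.
import numpy as np

lam, beta0 = 4.0, 1.0                         # curvature (Hessian) and initialization
for eta in (0.4, 2 / lam, 0.6):               # below, at, and above the threshold 2/lam
    beta = beta0
    for _ in range(50):
        beta -= eta * lam * beta              # gradient descent step
    print(f"eta = {eta:.2f}: |beta_50| = {abs(beta):.3g}")
# eta = 0.40 -> ~1e-11 (convergent), eta = 0.50 -> 1 (bouncing), eta = 0.60 -> ~2e7 (divergent)
```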
However, the practitioner is unlikely to hit this unstable step size exactly and, almost surely, the dynamics shall either converge or diverge. Yet, non-quadratic costs bring a particular complexity to this picture: it has been shown that, even for non-convex toy models, there exists an open interval of step sizes for which gradient descent neither converges nor diverges (Ma et al., 2022; Chen & Bruna, 2022). As we are interested in SGD, we complement this result by presenting an example in which loss stabilization occurs almost surely in the case of stochastic updates. Indeed, consider a regression problem with quadratic parameterization on one-dimensional data inputs $x_i$, coming from a distribution $\hat\rho$, and outputs generated by the linear model $y_i = x_i \theta_*^2$. The loss writes $F(\theta) := \frac{1}{4}\,\mathbb{E}_{\hat\rho}\big[(y - x\theta^2)^2\big]$, and the SGD iterates with step size $\eta > 0$ follow, for any $t \in \mathbb{N}$,
$$\theta_{t+1} = \theta_t + \eta\,\theta_t\, x_{i_t}\big(y_{i_t} - x_{i_t}\theta_t^2\big), \quad \text{where } x_{i_t} \sim \hat\rho. \qquad (5)$$
For the sake of clarity, suppose that $\theta_* = 1$ and $\mathrm{supp}(\hat\rho) = [a, b]$; we have the following proposition (a more general result is presented in Proposition B.1 of the Appendix).

Proposition 2.2. For any $\eta \in (a^{-2},\ 1.25\, b^{-2})$ and initialization $\theta_0 \in (0, 1)$, for all $t > 0$,
$$\delta_1 < F(\theta_t) < \delta_2 \quad \text{almost surely, and} \qquad (6)$$
$$\exists\, T > 0,\ \forall k > T, \quad \theta_{t+2k} < 1 < \theta_{t+2k+1} \quad \text{almost surely,} \qquad (7)$$
where $\delta_1, \delta_2, T > 0$ are constants given in the Appendix.
The proposition is divided into two parts: if the step size is large enough, Eq. (6) shows that the loss stabilizes between the level sets $\delta_1$ and $\delta_2$, and Eq. (7) shows that, after some initial phase, the iterates bounce from one side of the loss valley to the other. Note that, despite the stochasticity of the process, the results hold almost surely.
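A short simulation of the recursion of Eq. (5) displays exactly this behaviour. The values of $a$, $b$, $\eta$, and $\theta_0$ below are illustrative choices that produce bouncing and are not taken from the Appendix.

```python
# Simulation of the toy model of Eq. (5) with theta_* = 1 and x ~ Uniform([a, b]).
import numpy as np

rng = np.random.default_rng(1)
a, b, eta, theta = 1.0, 1.05, 1.05, 0.5          # supp(rho_hat) = [a, b], theta_0 in (0, 1)
Ex2 = (a**2 + a * b + b**2) / 3                  # E[x^2] under Uniform([a, b])
trajectory = []
for t in range(200):
    x = rng.uniform(a, b)                        # x_{i_t} ~ rho_hat, so y_{i_t} = x (theta_* = 1)
    theta += eta * theta * x * (x - x * theta**2)    # SGD step of Eq. (5)
    trajectory.append(theta)

losses = [0.25 * Ex2 * (1 - th**2) ** 2 for th in trajectory[100:]]   # F(theta)
print("last iterates:", np.round(trajectory[-4:], 3))    # they alternate around theta_* = 1
print("loss range over the last 100 steps:", round(min(losses), 4), "-", round(max(losses), 4))
```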
The effective dynamics. As observed in the prototypical SGD training dynamics of Fig. 1 and proved in the non-convex toy model of Proposition 2.2, large step sizes lead the loss to stabilize around some level set. To further understand the effect of this loss stabilization in parameter space, we shall assume perfect stabilization. Then, from Proposition 2.1, we conjecture the following behaviour:

During loss stabilization, SGD is well modelled by GD with constant label noise.
Label noise dynamics have been studied recently (Blanc et al., 2020; Damian et al., 2021; Li et al., 2022) thanks to their connection with Stochastic Differential Equations (SDEs). To properly write an SDE model, the drift should match the gradient descent and the noise should have the correct covariance structure (Li et al., 2019a; Wojtowytsch, 2021a). Proposition 2.1 implies that the noise at state $\theta$ is spanned by the gradient vectors $\{\nabla_\theta h_\theta(x_1), \dots, \nabla_\theta h_\theta(x_n)\}$ and has a constant intensity corresponding to the loss stabilization at a level $\delta > 0$. Hence, we propose the following SDE model
$$\mathrm{d}\theta_t = -\nabla_\theta \mathcal{L}(\theta_t)\,\mathrm{d}t + \sqrt{\eta\delta}\ \phi_{\theta_t}(X)^\top \mathrm{d}B_t, \qquad (8)$$
where $(B_t)_{t\ge 0}$ is a standard Brownian motion in $\mathbb{R}^n$ and $\phi_\theta(X) := [\nabla_\theta h_\theta(x_i)]_{i=1}^{n} \in \mathbb{R}^{n\times p}$ is referred to as the Jacobian (which is also the Neural Tangent Kernel (NTK) feature matrix (Jacot et al., 2018)). This SDE can be seen as the effective slow dynamics that drives the iterates while they bounce rapidly in some directions at the level set $\delta$. It highlights the combination of the deterministic part of the full-batch gradient and the noise induced by SGD. Beyond the theoretical justification and consistency of this SDE model, we validate it empirically in Sec. C, showing that it indeed captures the dynamics of large step size SGD. In the next section, we leverage the SDE (8) to understand the implicit bias of such learning dynamics.
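For completeness, a minimal Euler-Maruyama discretization of Eq. (8) reads as follows. The network, data, loss level $\delta$, and discretization step are illustrative placeholders; the Jacobian is recomputed row by row for clarity rather than efficiency.

```python
# Euler-Maruyama sketch of the effective SDE of Eq. (8); all values are illustrative.
import torch

n, d, eta, delta, dt = 32, 5, 0.5, 0.1, 1e-2
X, y = torch.randn(n, d), torch.randn(n)
h = torch.nn.Sequential(torch.nn.Linear(d, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))
params = list(h.parameters())

def flat_grad(scalar):
    grads = torch.autograd.grad(scalar, params, retain_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

for step in range(100):
    preds = h(X).squeeze(-1)
    loss = 0.5 * ((preds - y) ** 2).mean()                       # loss of Eq. (1)
    drift = flat_grad(loss)                                      # nabla_theta L(theta_t)
    jac = torch.stack([flat_grad(preds[i]) for i in range(n)])   # phi_theta(X), shape n x p
    dB = dt**0.5 * torch.randn(n)                                # Brownian increments
    update = -drift * dt + (eta * delta) ** 0.5 * (jac.T @ dB)
    with torch.no_grad():                                        # theta_{t+dt} = theta_t + update
        offset = 0
        for p in params:
            k = p.numel()
            p += update[offset:offset + k].reshape(p.shape)
            offset += k
```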
2.3. Sparse Feature Learning
We begin with a simple model, diagonal linear networks, that showcases a sparsity-inducing dynamics, and then disclose our general message about the overall implicit bias promoted by the effective dynamics.
2.3.1. A WARM-UP: DIAGONAL LINEAR NETWORKS
An appealing example of simple non-linear networks that helps in forging an intuition for more complicated architectures is diagonal linear networks (Vaskevicius et al., 2019; Woodworth et al., 2020; HaoChen et al., 2021; Pesme et al., 2021). They are two-layer linear networks with only diagonal connections: the prediction function writes $h_{u,v}(x) = \langle u, v \odot x\rangle = \langle u \odot v, x\rangle$, where $\odot$ denotes elementwise multiplication. Even though the loss is convex in the associated linear predictor $\beta := u \odot v \in \mathbb{R}^d$, it is not convex in $(u, v)$; hence, the training of such simple models already exhibits a rich non-convex dynamics. In this case, $\nabla_u h_{u,v}(x) = v \odot x$, and the SDE model of Eq. (8) writes
$$\mathrm{d}u_t = -\nabla_u \mathcal{L}(u_t, v_t)\,\mathrm{d}t + \sqrt{\eta\delta}\ v_t \odot \big(X^\top \mathrm{d}B_t\big), \qquad (9)$$
where $(B_t)_{t\ge 0}$ is a standard Brownian motion in $\mathbb{R}^n$. The equations are symmetric for $(v_t)_{t\ge 0}$.
What is the behaviour of this effective dynamics? Pillaud-Vivien et al. (2022) answered this question by analyzing a similar stochastic dynamics and unveiled the sparse nature of the resulting solutions. Indeed, under sparse recovery assumptions, denoting by $\beta^*$ the sparsest linear predictor that interpolates the data, it is shown that the associated linear predictor $\beta_t = u_t \odot v_t$: (i) converges exponentially fast to zero outside of the support of $\beta^*$, and (ii) is with high probability in an $O(\eta\delta)$ neighborhood of $\beta^*$ on its support after a time $O(\delta^{-1})$.
Overall conclusion on the model. During a first phase, SGD with large step size $\eta$ decreases the training loss until stabilization at some level set $\delta > 0$. During this loss stabilization, an effective noise-driven dynamics takes place. It shrinks the coordinates outside of the support of the sparsest signal and oscillates in parameter space at level $O(\eta\delta)$ on its support. Hence, decreasing the step size later leads to perfect recovery of the sparsest predictor. This behaviour is illustrated in our experiments in Figure 2.
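This behaviour can also be reproduced with a direct Euler-Maruyama simulation of Eq. (9) (and its symmetric counterpart for $v_t$) on a synthetic sparse regression problem. The problem sizes, noise level $\eta\delta$, and number of steps below are illustrative; the printed outcome is what the theory recalled above predicts, namely shrinkage off the support and fitted coordinates on it.

```python
# Euler-Maruyama simulation of Eq. (9) for a diagonal linear network (illustrative values).
import numpy as np

rng = np.random.default_rng(0)
n, d, dt, steps, eta_delta = 20, 30, 1e-2, 10_000, 2e-2
beta_star = np.zeros(d); beta_star[:2] = 1.0             # sparse teacher vector
X = rng.normal(size=(n, d))
y = X @ beta_star
u, v = np.full(d, 0.5), np.full(d, 0.5)

for _ in range(steps):
    r = X @ (u * v) - y                                   # residuals X beta_t - y
    grad_u, grad_v = v * (X.T @ r) / n, u * (X.T @ r) / n
    dB = np.sqrt(dt) * rng.normal(size=n)                 # shared Brownian increments
    noise = np.sqrt(eta_delta) * (X.T @ dB)
    u, v = u - dt * grad_u + v * noise, v - dt * grad_v + u * noise

beta = u * v
print("on-support coordinates:", np.round(beta[:2], 2))
print("largest off-support coordinate:", np.abs(beta[2:]).max())
```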
2.3.2. THE SPARSE FEATURE LEARNING CONJECTURE
FOR MORE GENERAL MODELS
Results for diagonal linear nets recalled in the previous paragraph show that the noisy dynamics (9) induces a sparsity bias. As emphasized in HaoChen et al. (2021), this effect is largely due to the multiplicative structure of the noise $v \odot [X^\top \mathrm{d}B_t]$, which, in this case, has a shrinking effect on the coordinates (because of the coordinate-wise multiplication with $v$). In the general case, we see, thanks to Eq. (8), that the same multiplicative structure of the noise is still present, but this time with respect to the Jacobian $\phi_\theta(X)$. Hence, this suggests that, similarly to the diagonal linear network case, the implicit bias of the noise can lead to a shrinkage effect applied to $\phi_\theta(X)$. This effect depends on the noise intensity $\delta$ and the step size of SGD. Indeed, an interesting property of Brownian motion is that, for $v \in \mathbb{R}^p$, $\langle v, B_t\rangle = \|v\|_2\, W_t$, where the equality holds in law and $(W_t)_{t\ge 0}$ is a one-dimensional Brownian motion. Hence, the process of Eq. (8) is equivalent to a process whose $i$-th coordinate is driven by a noise proportional to $\|\phi_i\|\,\mathrm{d}W^i_t$, where $\phi_i$ is the $i$-th column of $\phi_\theta(X)$ and $(W^i_t)_{t\ge 0}$ is a Brownian motion. This SDE structure, similar to the geometric Brownian motion, is expected to induce the shrinkage of each multiplicative factor (Oksendal, 2013, Section 5.1), i.e., in our case $(\|\nabla_\theta h_\theta(x_i)\|)_{i=1}^{n}$. Thus, we conjecture:

The noise part of Eq. (8) seeks to minimize the $\ell_2$-norm of the columns of $\phi_\theta(X)$.
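The shrinkage alluded to above can be isolated in its purest form: with the gradient term removed, a purely multiplicative noise $\mathrm{d}X_t = \sigma X_t\,\mathrm{d}W_t$ drives $X_t$ to zero almost surely even though $\mathbb{E}[X_t] = X_0$. The following toy simulation (with illustrative values) shows this collapse of the typical trajectory.

```python
# Almost-sure shrinkage under purely multiplicative noise (geometric Brownian motion).
import numpy as np

rng = np.random.default_rng(0)
sigma, dt, steps = 1.0, 1e-2, 2_000                # total time T = 20
X = np.ones(10_000)                                # 10,000 independent trajectories, X_0 = 1
for _ in range(steps):
    X *= 1.0 + sigma * rng.normal(scale=np.sqrt(dt), size=X.size)
print("median X_T:", np.median(X))                 # ~ exp(-sigma^2 T / 2), i.e., essentially zero
```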
Note that the fitting part of the dynamics prevents the Jacobian from collapsing totally to zero, but as soon as they are not needed to fit the signal, columns can be reduced to zero. Remarkably, from a stability perspective, Blanc et al. (2020) showed a similar bias: locally around a minimum, the SGD dynamics implicitly tries to minimize the squared Frobenius norm $\|\phi_\theta(X)\|_F^2 = \sum_{i=1}^{n} \|\nabla_\theta h_\theta(x_i)\|^2$. Resolving the above conjecture and characterizing the implicit bias along the trajectory of SGD remains an exciting avenue for future work. Now, we provide a specification of this implicit bias for different architectures:
Diagonal linear networks: For $h_{u,v}(x) = \langle u \odot v, x\rangle$, we have $\nabla_{u,v} h_{u,v}(x) = [v \odot x,\ u \odot x]$. Thus, for a generic data matrix $X$, minimizing the norm of each column of $\phi_{u,v}(X)$ amounts to putting the maximal number of zero coordinates and hence to minimizing $\|u \odot v\|_0$.
ReLU networks: We take the prototypical one-hidden-layer network to exhibit the sparsification effect. Let $h_{a,W}(x) = \langle a, \sigma(Wx)\rangle$; then $\nabla_a h_{a,W}(x) = \sigma(Wx)$ and $\nabla_{w_j} h_{a,W}(x) = a_j\, x\, \mathbf{1}_{\langle w_j, x\rangle > 0}$. Note that the $\ell_2$-norm of the column corresponding to a neuron is reduced when it is activated at a minimal number of training points; hence, the implicit bias enables the learning of sparse data-active features. Finally, when some directions are needed to fit the data, similarly activated neurons align to fit them, reducing the rank of $\phi_\theta(X)$.
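The structure of these gradients is easy to inspect directly. In the sketch below (random weights and data, used only for illustration), the block of columns of $\phi_\theta(X)$ attached to a neuron $w_j$ stacks $a_j\, x_i\, \mathbf{1}_{\langle w_j, x_i\rangle > 0}$ over the training points, so its norm shrinks with the number of points that activate the neuron.

```python
# Jacobian structure of a one-hidden-layer ReLU network h_{a,W}(x) = <a, relu(Wx)>.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 100, 10, 8                                      # points, input dim, hidden neurons
X = rng.normal(size=(n, d))
W, a = rng.normal(size=(m, d)), rng.normal(size=m)

active = X @ W.T > 0                                      # n x m activation pattern
grad_a = np.maximum(X @ W.T, 0)                           # rows grad_a h(x_i) = relu(W x_i)
for j in range(m):                                        # norm of the columns attached to w_j
    block_norm = np.linalg.norm(a[j] * X * active[:, [j]])
    print(f"neuron {j}: active on {active[:, j].sum():3d}/{n} points, block norm = {block_norm:.2f}")
print("overall fraction of non-zero activations:", active.mean())
```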
Feature sparsity. Our main insight is that the Jacobian could be significantly simplified during the loss stabilization phase. Indeed, while the gradient part tries to fit the data and align neurons (see, e.g., Fig. 10), the noise part of Eq. (8) seeks to minimize the $\ell_2$-norm of the columns of $\phi_\theta(X)$. Hence, in combination, this motivates us to count the average number of distinct (i.e., counting a group of aligned neurons as one), non-zero activations over the training set. We refer to this as the feature sparsity coefficient (see the next section for a detailed description). Note that the aforementioned sparsity comes both in the number of distinct neurons and in their activation.

We show next that the conjectured sparsity is indeed observed empirically for a variety of models. Remark that both the feature sparsity coefficient and the rank of $\phi_\theta(X)$ can be used as a good proxy to track the hidden progress during the loss stabilization phase.
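As a concrete (and simplified) illustration of the feature sparsity coefficient, the following sketch counts, for each training point of a one-hidden-layer ReLU network, the number of active neurons after merging neurons whose incoming weight vectors are nearly aligned; the alignment threshold and the random model are illustrative choices, not the exact procedure of our experiments.

```python
# Hedged sketch of a feature sparsity coefficient: average number of distinct active neurons.
import numpy as np

def feature_sparsity(W, X, cos_thresh=0.99):
    """Average number of distinct (non-aligned) active ReLU units per training point."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)     # unit-norm neuron directions
    cos = Wn @ Wn.T
    group = np.arange(len(W))                             # merge nearly aligned neurons
    for j in range(len(W)):
        for k in range(j):
            if cos[j, k] > cos_thresh:
                group[j] = group[k]
                break
    active = X @ W.T > 0                                  # n x m activation pattern
    counts = [len(set(group[np.flatnonzero(row)])) for row in active]
    return float(np.mean(counts))

rng = np.random.default_rng(0)
X, W = rng.normal(size=(200, 10)), rng.normal(size=(16, 10))
print("feature sparsity coefficient:", feature_sparsity(W, X))
```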
3. Empirical Evidence of Sparse Feature
Learning Driven by SGD
Here we present empirical results for neural networks of increasing complexity: from diagonal linear networks to deep DenseNets on CIFAR-10, CIFAR-100, and Tiny ImageNet. We make the following common observations for all these networks trained using SGD schedules with large step sizes:

(O1) Loss stabilization: the training loss stabilizes around a high level set until the step size is decayed,
(O2) Generalization benefit: longer loss stabilization leads to better generalization,
(O3) Sparse feature learning: longer loss stabilization leads to sparser features.
Importantly, we use no explicit regularization (in particular, no weight decay) in our experiments so that the training dynamics is driven purely by SGD and the step size schedule. Additionally, in some cases, we cannot find a single large step size that would lead to loss stabilization. In such cases, whenever explicitly mentioned, we use a warmup step size schedule, i.e., increasing step sizes according to some schedule, to make sure that the loss stabilizes around some level set. Warmup is commonly used in practice (He et al., 2016; Devlin et al., 2018) and is often motivated purely from the optimization perspective as a way to accelerate training (Agarwal et al., 2021), but we suggest that it is also a way to amplify the regularization effect of the SGD noise, which is proportional to the step size.
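For reference, the two kinds of schedules discussed above can be sketched as simple functions of the epoch; the concrete values are illustrative and not those of our experiments.

```python
# Illustrative step size schedules: large constant step size with one decay, and linear warmup.
def piecewise_constant_lr(epoch, eta_large=0.1, decay_epoch=150, decay_factor=10.0):
    """Large step size kept constant, then decayed once (as in the schedule of Fig. 1)."""
    return eta_large if epoch < decay_epoch else eta_large / decay_factor

def warmup_lr(epoch, eta_start=0.01, eta_max=0.2, warmup_epochs=50):
    """Linearly increase the step size until the loss stabilizes, then keep it constant."""
    if epoch < warmup_epochs:
        return eta_start + (eta_max - eta_start) * epoch / warmup_epochs
    return eta_max

# usage with torch.optim.SGD:
#   for g in optimizer.param_groups:
#       g["lr"] = piecewise_constant_lr(epoch)
```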
Measuring sparse feature learning. We track the simplification of the Jacobian by measuring both the feature sparsity and the rank of $\phi_\theta(X)$. We compute the rank over iterations for each model (except deep networks, for which it is prohibitively expensive) by using a fixed threshold on the singular values of $\phi_\theta(X)$ normalized by the largest singular value. In this way, we ensure that the difference in rank that we detect is not simply due to different scales of $\phi_\theta(X)$. Moreover, we always compute $\phi_\theta(X)$ on a number of fresh samples equal to the number of parameters $|\theta|$ to make sure that the rank deficiency does not come from $n \ll |\theta|$, which is the case in the overparametrized settings we consider. To compute the feature sparsity coefficient,
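The rank measurement described above can be sketched as follows; the model, threshold, and per-sample loop are illustrative placeholders rather than the actual implementation.

```python
# Sketch of the thresholded rank of phi_theta(X), normalized by the largest singular value.
import torch

def jacobian_rank(model, X, threshold=1e-3):
    params = [p for p in model.parameters() if p.requires_grad]
    rows = []
    for x in X:                                           # one row nabla_theta h_theta(x) per sample
        out = model(x.unsqueeze(0)).squeeze()
        grads = torch.autograd.grad(out, params)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    phi = torch.stack(rows)                               # Jacobian phi_theta(X), shape n x p
    s = torch.linalg.svdvals(phi)                         # singular values, descending
    return int((s / s[0] > threshold).sum())

model = torch.nn.Sequential(torch.nn.Linear(10, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
n_params = sum(p.numel() for p in model.parameters())
X_fresh = torch.randn(n_params, 10)                       # as many fresh samples as parameters
print("thresholded rank of phi_theta(X):", jacobian_rank(model, X_fresh))
```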