
with $v$). In the general case, we see, thanks to Eq. (8), that the same multiplicative structure of the noise is still present, but this time with respect to the Jacobian $\phi_\theta(X)$. Hence, this suggests that, similarly to the diagonal linear network case, the implicit bias of the noise can lead to a shrinkage effect applied to $\phi_\theta(X)$. This effect depends on the noise intensity $\delta$ and the step size of SGD. Indeed, an interesting property of Brownian motion is that, for $v \in \mathbb{R}^p$, $\langle v, B_t \rangle = \|v\|_2 W_t$, where the equality holds in law and $(W_t)_{t \ge 0}$ is a one-dimensional Brownian motion. Hence, the process in Eq. (8) is equivalent to a process whose $i$-th coordinate is driven by a noise proportional to $\|\phi_i\| \, dW_t^i$, where $\phi_i$ is the $i$-th column of $\phi_\theta(X)$ and $(W_t^i)_{t \ge 0}$ is a Brownian motion. This SDE structure, similar to a geometric Brownian motion, is expected to induce the shrinkage of each multiplicative factor (Oksendal, 2013, Section 5.1), i.e., in our case $(\|\nabla_\theta h_\theta(x_i)\|)_{i=1}^{n}$. Thus, we conjecture:
The noise part of Eq. (8) seeks to minimize the $\ell_2$-norm of the columns of $\phi_\theta(X)$.
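To build intuition for this shrinkage mechanism, the following minimal sketch (our illustration, not part of the paper; the noise intensity, horizon, and Euler-Maruyama discretization are arbitrary choices) simulates the purely multiplicative scalar SDE $dX_t = \delta X_t \, dW_t$ and shows that the typical realization of the multiplicative factor collapses toward zero, matching the closed-form median $\exp(-\delta^2 T/2)$, even though the mean stays constant.

```python
import numpy as np

rng = np.random.default_rng(0)

delta, dt, T, n_paths = 1.0, 1e-3, 5.0, 1000    # noise intensity, time step, horizon, sample paths
n_steps = int(T / dt)

X = np.ones(n_paths)                             # every path starts at X_0 = 1
for _ in range(n_steps):
    dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)
    X += delta * X * dW                          # Euler-Maruyama step for dX_t = delta * X_t * dW_t

# Ito's formula gives log X_T = delta * W_T - delta^2 * T / 2, so the median of |X_T|
# is exp(-delta^2 * T / 2) and the typical path shrinks to zero, while E[X_T] = 1.
print("median |X_T|         :", np.median(np.abs(X)))
print("exp(-delta^2 * T / 2):", np.exp(-delta**2 * T / 2))
print("sample mean of X_T   :", X.mean())        # noisy estimate of the constant mean 1
```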
Note that the fitting part of the dynamics prevents the Jacobian from collapsing totally to zero, but columns that are not needed to fit the signal can be reduced to zero. Remarkably, from a stability perspective, Blanc et al. (2020) showed a similar bias: locally around a minimum, the SGD dynamics implicitly tries to minimize the Frobenius norm $\|\phi_\theta(X)\|_F^2 = \sum_{i=1}^{n} \|\nabla_\theta h_\theta(x_i)\|_2^2$. Resolving the above conjecture and characterizing the implicit bias along the trajectory of SGD remains an exciting avenue for future work. Now, we provide a specification of this implicit bias for different architectures:
• Diagonal linear networks: For $h_{u,v}(x) = \langle u \odot v, x \rangle$, we have $\nabla_{u,v} h_{u,v}(x) = [v \odot x,\, u \odot x]$. Thus, for a generic data matrix $X$, minimizing the norm of each column of $\phi_{u,v}(X)$ amounts to putting the maximal number of zero coordinates and hence to minimizing $\|u \odot v\|_0$.
• ReLU networks: We take the prototypical one-hidden-layer network to exhibit the sparsification effect. Let $h_{a,W}(x) = \langle a, \sigma(Wx) \rangle$; then $\nabla_a h_{a,W}(x) = \sigma(Wx)$ and $\nabla_{w_j} h_{a,W}(x) = a_j x \, \mathbb{1}_{\langle w_j, x \rangle > 0}$. Note that the $\ell_2$-norm of the column entries corresponding to a given neuron is reduced when that neuron is activated at a minimal number of training points, hence the implicit bias enables the learning of sparse data-active features. Finally, when some directions are needed to fit the data, similarly activated neurons align to fit, reducing the rank of $\phi_\theta(X)$ (both gradient formulas are checked numerically in the sketch below).
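The following short sketch (our illustrative code, not the paper's; network sizes and data are random) assembles $\phi_\theta(X)$ column by column with PyTorch autograd for the two architectures above and prints the per-column $\ell_2$-norms that the noise is conjectured to shrink.

```python
import torch

torch.manual_seed(0)
n, d, m = 8, 5, 6                          # training points, input dimension, hidden width
X = torch.randn(n, d)

def jacobian_columns(h, params):
    """Return phi_theta(X): the (p, n) matrix whose i-th column is grad_theta h(x_i)."""
    cols = []
    for i in range(n):
        grads = torch.autograd.grad(h(X[i]), params)
        cols.append(torch.cat([g.reshape(-1) for g in grads]))
    return torch.stack(cols, dim=1)

# Diagonal linear network: h_{u,v}(x) = <u * v, x>, so grad_{u,v} h = [v * x, u * x].
u = torch.randn(d, requires_grad=True)
v = torch.randn(d, requires_grad=True)
phi_diag = jacobian_columns(lambda x: torch.dot(u * v, x), [u, v])

# One-hidden-layer ReLU network: h_{a,W}(x) = <a, relu(W x)>, so grad_a h = relu(W x)
# and grad_{w_j} h = a_j * x * 1[<w_j, x> > 0].
a = torch.randn(m, requires_grad=True)
W = torch.randn(m, d, requires_grad=True)
phi_relu = jacobian_columns(lambda x: torch.dot(a, torch.relu(W @ x)), [a, W])

# Column i has norm ||grad_theta h(x_i)||; these are the quantities the noise shrinks.
print("diagonal network column norms:", phi_diag.norm(dim=0))
print("ReLU network column norms    :", phi_relu.norm(dim=0))
```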
Feature sparsity. Our main insight is that the Jacobian could be significantly simplified during the loss stabilization phase. Indeed, while the gradient part tries to fit the data and align neurons (see, e.g., Fig. 10), the noise part of Eq. (8) intends to minimize the $\ell_2$-norm of the columns of $\phi_\theta(X)$. Hence, in combination, this motivates us to count the average number of distinct (i.e., counting a group of aligned neurons as one) non-zero activations over the training set. We refer to this as the feature sparsity coefficient (see the next section for a detailed description). Note that the aforementioned sparsity comes both in the number of distinct neurons and their activation.
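The precise definition follows in the next section; as a rough illustration of the idea, the hypothetical sketch below computes one plausible version of the coefficient for a one-hidden-layer ReLU network: neurons whose incoming weight vectors are nearly aligned are merged into one group (the cosine threshold of 0.99 is our assumption), and we then average, over the training set, the number of groups with a non-zero activation.

```python
import numpy as np

def feature_sparsity(W, X, align_thresh=0.99):
    """Average number of distinct, non-zero hidden activations per training point,
    counting a group of aligned neurons (cosine similarity > align_thresh) as one.

    W: (m, d) hidden-layer weights, X: (n, d) training inputs.
    """
    # Greedily group neurons by the direction of their incoming weight vector.
    dirs = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-12)
    group = np.full(len(W), -1)
    for j in range(len(W)):
        if group[j] == -1:
            aligned = (dirs @ dirs[j] > align_thresh) & (group == -1)
            group[aligned] = j

    # A group counts as active at x_i if at least one of its neurons fires there.
    active = (X @ W.T) > 0                               # (n, m) activation pattern
    counts = [len(set(group[active[i]])) for i in range(len(X))]
    return float(np.mean(counts))

# Purely illustrative numbers: a random network on random data.
rng = np.random.default_rng(0)
W, X = rng.normal(size=(32, 10)), rng.normal(size=(200, 10))
print("feature sparsity coefficient:", feature_sparsity(W, X))
```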
We show next that the conjectured sparsity is indeed observed empirically for a variety of models. Remark that both the feature sparsity coefficient and the rank of $\phi_\theta(X)$ can be used as good proxies to track the hidden progress during the loss stabilization phase.
3. Empirical Evidence of Sparse Feature Learning Driven by SGD
Here we present empirical results for neural networks of increasing complexity: from diagonal linear networks to deep DenseNets on CIFAR-10, CIFAR-100, and Tiny ImageNet.
We make the following common observations for all these
networks trained using SGD schedules with large step sizes:
(O1) Loss stabilization: the training loss stabilizes around a high level set until the step size is decayed,
(O2) Generalization benefit: longer loss stabilization leads to better generalization,
(O3) Sparse feature learning: longer loss stabilization leads to sparser features.
Importantly, we use no explicit regularization (in particular, no weight decay) in our experiments so that the training dynamics is driven purely by SGD and the step size schedule. Additionally, in some cases, we cannot find a single large step size that would lead to loss stabilization. In such cases, whenever explicitly mentioned, we use a warmup step size schedule, i.e., increasing step sizes according to some schedule, to make sure that the loss stabilizes around some level set. Warmup is commonly used in practice (He et al., 2016; Devlin et al., 2018) and is often motivated purely from the optimization perspective as a way to accelerate training (Agarwal et al., 2021), but we suggest that it is also a way to amplify the regularization effect of the SGD noise, which is proportional to the step size.
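For concreteness, a warmup schedule of the kind referred to here can be as simple as the following sketch (the linear shape, peak step size, and warmup length are illustrative choices on our part, not the paper's exact protocol):

```python
def warmup_step_size(t, peak_lr=0.1, warmup_steps=1000):
    """Increase the step size linearly from ~0 to peak_lr over warmup_steps iterations,
    then keep it constant so that the training loss can stabilize around a level set."""
    return peak_lr * min(1.0, (t + 1) / warmup_steps)

# Example: step sizes used at iterations 0, 500, 1000, and 5000.
print([warmup_step_size(t) for t in (0, 500, 1000, 5000)])
```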
Measuring sparse feature learning. We track the simplification of the Jacobian by measuring both the feature sparsity and the rank of $\phi_\theta(X)$. We compute the rank over iterations for each model (except deep networks, for which it is prohibitively expensive) by using a fixed threshold on the singular values of $\phi_\theta(X)$ normalized by the largest singular value. In this way, we ensure that the difference in the rank that we detect is not simply due to different scales of $\phi_\theta(X)$. Moreover, we always compute $\phi_\theta(X)$ on a number of fresh samples equal to the number of parameters $|\theta|$ to make sure that rank deficiency is not coming from $n \ll |\theta|$, which is the case in the overparametrized settings we consider.
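A minimal sketch of this thresholded rank computation (our code; the relative threshold value of $10^{-3}$ is an assumed placeholder, and $\phi$ denotes the $p \times n$ matrix of per-sample gradients $\nabla_\theta h_\theta(x_i)$ evaluated on the fresh samples):

```python
import numpy as np

def thresholded_rank(phi, rel_threshold=1e-3):
    """Number of singular values of phi_theta(X) above a fixed fraction of the largest one.

    Normalizing by the top singular value makes the rank estimate insensitive to the
    overall scale of the Jacobian, which can change across training iterations.
    """
    s = np.linalg.svd(phi, compute_uv=False)     # singular values in descending order
    if s[0] == 0:
        return 0
    return int(np.sum(s > rel_threshold * s[0]))
```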
To compute the feature sparsity coefficient,