
exclusive and equivalent sets, such that these sets have a one-to-one mapping with the local minima of the corresponding $L_{L_1}$. We are the first to prove this one-to-one mapping relation for a general loss function. This proposition thus offers a partial theoretical explanation for our empirical observation that optimizing Eq. (2) is no more difficult (and often much easier) than optimizing the original $L_1$-regularized loss. A corollary of this theorem reduces to the main theorem of Poon and Peyré (2021), which states that if $L$ is convex (as in Lasso), then every local minimum of $L_{rs}$ is global. A crucial new insight we offer is that one can still converge to a bad minimum for a general landscape, but this happens only because the original $L_{L_1}$ has bad minima, not because of the reparametrization trick.
Still, this alone is insufficient to imply that GD can navigate this landscape easily, because gradient descent can easily get stuck on saddle points (Du et al., 2017; Ziyin et al., 2021). In particular, GD often has trouble escaping higher-order saddle points, where the Hessian eigenvalues along the escaping directions vanish. The following theorem shows that this is also not a problem for the reparametrization trick, because the gradient is as strong as in the original $L_{L_1}$.
Theorem 3. Let $U = W$, $V = U \odot W$, and $L$ be everywhere differentiable. Then, for every infinitesimal variation $\delta V$,
1. if $L_{L_1}(V)$ is directionally differentiable in $\delta V$, there exist variations $\delta W, \delta U \in \Theta(\delta V)$ such that $L_{L_1}(V + \delta V) = L_{rs}(U + \delta U, W + \delta W)$;
2. if $L_{L_1}(V)$ is not directionally differentiable in $\delta V$, there exist variations $\delta W, \delta U \in \Theta(\delta V^{0.5})$ such that $L_{L_1}(V + \delta V) = L_{rs}(U + \delta U, W + \delta W)$.
Namely, away from the nondifferentiable points of $L_{L_1}$, the reparametrized landscape is qualitatively the same as the original landscape, and escaping the saddles of the reparametrized landscape is no harder than escaping the original saddles. If GD finds it difficult to escape a saddle point, it must be because the original $L_{L_1}$ contains a difficult saddle. All nondifferentiable points of $L_{L_1}$ occur at sparse solutions where some parameters are zero. Here, the first-order derivative is discontinuous, and the variation of $L_{L_1}$ is thus first order in $\delta V$. This implies that the variation of the corresponding $L_{rs}$ is second order in $\delta U$ and $\delta W$, and that the Hessian of $L_{rs}$ should have at least one negative eigenvalue, which implies that escaping these points poses no problem for gradient descent (Jin et al., 2017). Combined, Theorems 2 and 3 directly motivate the application of stochastic gradient descent to any problem for which SGD has been demonstrated to be efficient, an important example being a neural network.
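To make the negative-eigenvalue claim concrete, consider the following worked sketch (ours, not part of the formal development) at a coordinate with $u_i = w_i = 0$, using the objective of Algorithm 1 below, $L_{rs}(U, W) = L(U \odot W) + \kappa(\|U\|_2^2 + \|W\|_2^2)$, and writing $\partial_i L$ for the derivative of $L$ with respect to $v_i = u_i w_i$. The restriction of the Hessian of $L_{rs}$ to the pair $(u_i, w_i)$ is
$$\nabla^2_{(u_i, w_i)} L_{rs}\Big|_{u_i = w_i = 0} = \begin{pmatrix} 2\kappa & \partial_i L \\ \partial_i L & 2\kappa \end{pmatrix}, \qquad \lambda_\pm = 2\kappa \pm |\partial_i L|,$$
so the smaller eigenvalue is negative exactly when $|\partial_i L| > 2\kappa$, i.e., exactly when keeping $v_i = 0$ is not optimal for the $L_1$-regularized loss of strength $2\kappa$. The spurious stationary point at the origin is then a strict (second-order) saddle rather than a flat higher-order one, which is the benign case for gradient descent.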
In more general scenarios, one is interested in structured sparsity, where a group of parameters is encouraged to become sparse simultaneously. It suffices to consider the case of a single group, because one can add the $L_1$ penalty recursively to prove the general multigroup case:
$$L(V_s, V_d) + \kappa \|V_s\|_2. \quad (4)$$
The following theorem gives the equivalent redundant form.
Theorem 4. Let $\alpha\beta = \kappa^2$ and
$$L_{sr}(u, W, V_d) := L(uW, V_d) + \alpha u^2 + \beta \|W\|_2^2. \quad (5)$$
Then, $(u, W, V_d)$ is a global minimum of Eq. (5) if and only if (a) $u = \|W\|_2$ and (b) $(uW, V_d)$ is a global minimum of Eq. (4).
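The mechanism behind this equivalence is the AM-GM inequality; the following one-line derivation is our sketch of the key step, not a quotation of the formal proof. For any factorization with $uW = V_s$ held fixed,
$$\alpha u^2 + \beta \|W\|_2^2 \;\ge\; 2\sqrt{\alpha\beta}\, |u|\, \|W\|_2 \;=\; 2\sqrt{\alpha\beta}\, \|uW\|_2, \qquad \text{with equality iff } \alpha u^2 = \beta \|W\|_2^2.$$
Minimizing over the redundant factorization therefore turns the weight decay of Eq. (5) into a group-lasso-type penalty proportional to $\|V_s\|_2$ (the proportionality constant depends on the convention for $\kappa$; compare the $2\kappa$ strength in Algorithms 1 and 2 below), and the equality condition enforces the norm balance between $u$ and $W$ in condition (a).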
Namely, every $L_1$ group requires only one additional parameter to sparsify. Note that recursively applying Theorem 4 and setting $W$ to have dimension $1$ allows us to recover Theorem 1.²
The above theory justifies the application of the reparametrization trick to any sparsity-related task in deep learning. For completeness, we give explicit algorithms in Algorithms 1 and 2. Let $m$ be the number of groups; the algorithm then adds $m$ parameters to the training process. Consequently, it has the same complexity as standard deep learning training algorithms such as SGD, because it at most doubles the memory and computation cost of training and incurs no additional cost at inference. For the ResNet18/CIFAR10 experiment we performed, each iteration of training with spred takes less than 5% more time than standard training, far below the worst-case upper bound of 100%.
Algorithm 1 spred algorithm for parameter sparsity
Input: loss function $L(V_s, V_d)$, parameters $V_s, V_d$, $L_1$ regularization strength $2\kappa$
Initialize $W$, $U$
Solve (with SGD, Adam, L-BFGS, etc.) $\min_{W, U, V_d} L(U \odot W, V_d) + \kappa(\|W\|_2^2 + \|U\|_2^2)$
Output: $V^* = U \odot W$
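As an illustration, the following is a minimal PyTorch sketch of Algorithm 1 for a single linear layer; the module name SpredLinear, the hyperparameter values, and the choice of adding the penalty directly to the loss (rather than via optimizer weight decay) are ours.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpredLinear(nn.Module):
    # Linear layer whose weight is reparametrized as V = U * W (elementwise).
    def __init__(self, in_features, out_features):
        super().__init__()
        self.W = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.W)                                # W: Kaiming init
        self.U = nn.Parameter(torch.randn(out_features, in_features))  # U: variance ~1
        self.bias = nn.Parameter(torch.zeros(out_features))

    def effective_weight(self):
        return self.U * self.W                                          # V = U * W

    def forward(self, x):
        return F.linear(x, self.effective_weight(), self.bias)

    def l2_penalty(self):
        # ||W||_2^2 + ||U||_2^2; scaled by kappa in the loss, this acts as an
        # L1 penalty of strength 2*kappa on the effective weight V.
        return (self.W ** 2).sum() + (self.U ** 2).sum()

kappa = 1e-4                                    # placeholder regularization strength
model = SpredLinear(784, 10)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

def training_step(x, y):
    opt.zero_grad()
    loss = F.cross_entropy(model(x), y) + kappa * model.l2_penalty()
    loss.backward()
    opt.step()
    return loss

After training, the sparse solution is read off as $V^* = U \odot W$ via effective_weight(), matching the output line of Algorithm 1.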
Algorithm 2 spred algorithm for structured sparsity
Input: loss function $L(V_s, V_d)$, parameters $V_s, V_d$, $L_1$ regularization strength $2\kappa$
Initialize $W$, $u$
Solve $\min_{W, u, V_d} L(uW, V_d) + \kappa(\|W\|_2^2 + u^2)$
Output: $V^* = uW$
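For the structured case, each group needs only one extra scalar. The following sketch (ours; treating each output row of a linear layer as a group is an arbitrary illustrative choice) scales every row of the weight matrix by its own redundant scalar $u_i$ and reuses the same training loop as above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupSpredLinear(nn.Module):
    # Linear layer with one redundant scalar u_i per output row (one group per row).
    def __init__(self, in_features, out_features):
        super().__init__()
        self.W = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.W)
        self.u = nn.Parameter(torch.ones(out_features))   # one extra parameter per group
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # Effective weight of group i is u_i * W[i, :].
        return F.linear(x, self.u[:, None] * self.W, self.bias)

    def l2_penalty(self):
        # ||W||_2^2 + ||u||_2^2; scaled by kappa, this acts as a group-lasso
        # penalty of strength 2*kappa on the rows of the effective weight.
        return (self.W ** 2).sum() + (self.u ** 2).sum()

At a sparse solution, an entire row of the effective weight $uW$ vanishes jointly with its group scalar.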
Implementation and practical remarks. First, multiple ways exist to initialize the redundant parameters $W$ and $U$. One way is to initialize $W$ with, say, the Kaiming initialization, and $U$ with variance $1$. The other way is to give both variables
² Note that when $L$ is a linear regression objective, the loss function is equivalent to the group lasso.