spred: Solving $L_1$ Penalty with SGD

Liu Ziyin*1  Zihao Wang*2

*Equal contribution. 1The University of Tokyo. 2HKUST. Correspondence to: Liu Ziyin <liu.ziyin.p@gmail.com>, Zihao Wang <zwanggc@cse.ust.hk>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
Abstract

We propose to minimize a generic differentiable objective with an $L_1$ constraint using a simple reparametrization and straightforward stochastic gradient descent. Our proposal is the direct generalization of previous ideas that the $L_1$ penalty may be equivalent to a differentiable reparametrization with weight decay. We prove that the proposed method, spred, is an exact differentiable solver of $L_1$ and that the reparametrization trick is completely "benign" for a generic nonconvex function. Practically, we demonstrate the usefulness of the method in (1) training sparse neural networks to perform gene selection tasks, which involve finding relevant features in a very high-dimensional space, and (2) the neural network compression task, for which previous attempts at applying the $L_1$ penalty have been unsuccessful. Conceptually, our result bridges the gap between sparsity in deep learning and conventional statistical learning.
1. Introduction
In many problems, optimization of an objective function under an $L_1$ constraint is of fundamental importance (Santosa and Symes, 1986; Tibshirani, 1996; Donoho, 2006; Sun et al., 2015; Candes et al., 2008). The advantage of $L_1$-penalized solutions is that they are sparse and thus highly interpretable, and it would be of great use if we could broadly apply the $L_1$ penalty to general problems. However, $L_1$ has only seen limited use in the case of simple models such as linear regression, logistic regression, or dictionary learning, where effective optimization methods are known to exist. As soon as the model becomes as complicated as a neural network, it is unknown how to optimize an $L_1$ penalty.
In contrast, with complicated models like neural networks, gradient descent (GD) has been the favored method of optimization because of its scalability to large-scale problems and its simplicity of implementation. However, gradient descent has yet to be shown to work well for solving the $L_1$ penalty because the $L_1$ penalty is not differentiable at zero, precisely where the model becomes sparse. In fact, there is a large gap between the conventional $L_1$ learning and deep learning literature. Many tasks, such as feature selection, on which $L_1$-based methods work well cannot be tackled by deep learning, and achieving sparsity in deep learning is almost never based on $L_1$. This gap between conventional statistics and deep learning exists perhaps because no method has been demonstrated to efficiently solve $L_1$-penalized objectives in general nonlinear settings, not to mention incorporating such methods within standard backpropagation-based neural network training pipelines. Thus, optimizing a general nonconvex objective with $L_1$ regularization remains an important open problem.
The foremost contribution of our work is to theoretically prove and empirically demonstrate that a reparametrization trick, also called the Hadamard parametrization, allows arbitrary nonconvex objectives with $L_1$ regularization to be solved with gradient descent. The method is simple and takes only a few lines to implement in any modern deep-learning framework. Furthermore, we demonstrate that the proposed method is compatible with, and can be boosted by, common training tricks in deep learning, such as minibatch training, adaptive learning rates, and pretraining. See Figure 1 for an illustration.

Figure 1: Illustration of the spred algorithm for achieving parameter sparsity (left) and feature selection (right). Essentially, the proposed algorithm creates redundant parameters and does not change the original architecture or training protocol. Therefore, the algorithm is compatible with pretraining.
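To make "a few lines" concrete, the following is a minimal PyTorch-style sketch of the reparametrization (our illustration, not the authors' reference implementation; the class name SpredLinear and the constants are ours). The weight of a linear layer is stored as the elementwise product of two factors, and ordinary weight decay on the factors plays the role of the $L_1$ penalty on the product.

import torch
import torch.nn as nn

class SpredLinear(nn.Module):
    # Linear layer whose effective weight is V = U * W (elementwise, "Hadamard").
    # With weight decay of strength kappa applied to both U and W, the global
    # minima coincide with those of an L1 penalty of strength 2*kappa on V.
    def __init__(self, in_features, out_features):
        super().__init__()
        self.U = nn.Parameter(0.1 * torch.randn(out_features, in_features))
        self.W = nn.Parameter(0.1 * torch.randn(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        return nn.functional.linear(x, self.U * self.W, self.bias)

layer = SpredLinear(100, 10)
kappa = 5e-4  # hypothetical penalty strength; the L1 objective carries 2*kappa
# PyTorch's weight_decay=d adds d*theta to the gradient, i.e. penalizes (d/2)*||theta||^2,
# so weight_decay = 2*kappa implements the kappa*(||U||^2 + ||W||^2) terms.
# (The bias also receives decay in this sketch; it can be excluded via parameter groups.)
optimizer = torch.optim.SGD(layer.parameters(), lr=1e-2, weight_decay=2 * kappa)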
2. Related Works
$L_1$ Penalty. It is well-known that the $L_1$ penalty leads to a sparse solution (Wasserman, 2013). For linear models, the objectives with $L_1$ regularization are usually convex, but they are challenging to solve because the objective becomes non-differentiable precisely at the point where sparsity is achieved (namely, the origin). The mainstream literature often proposes special algorithms for solving the $L_1$ penalty for a specific task. For example, the original lasso paper suggests a method based on quadratic programming algorithms (Tibshirani, 1996). Later, algorithms such as coordinate descent (Friedman et al., 2010) and least-angle
regression (LARS) (Efron et al., 2004) have been proposed as more efficient alternatives. The same problem also exists in the sparse multinomial logistic regression task (Cawley et al., 2006), which relies on a diagonal second-order coordinate descent algorithm. Another line of work proposes to use iterative shrinkage-thresholding algorithms (ISTA) for solving lasso (Beck and Teboulle, 2009), but it is unclear how ISTA-type algorithms could be generalized to solve general nonconvex problems. Instead of finding an efficient algorithm for a special $L_1$ problem, our strategy is to transform an $L_1$ problem into a differentiable problem for which the simplest gradient descent algorithms can be efficient.
Redundant Parameterization. The method we propose is based on a reparametrization trick of the $L_1$ loss function. The idea that a redundant parametrization with an $L_2$ penalty has some resemblance to an $L_1$ penalty has a rather long history, and this resemblance has been utilized in various limited settings to solve an $L_1$ problem. Grandvalet (1998) is one of the earliest to suggest an equivalence between $L_1$ and a redundant parametrization; however, this equivalence is only approximate. Hoff (2017) theoretically studies the Hadamard parametrization in the context of generalized linear models and proposes to minimize the loss function by alternately applying the solution of the ridge regression problem; notably, this work is the first to prove that not only are the global minima of the redundant parametrization equivalent to the $L_1$ global minima, but all the local minima of the redundant parametrization are also local minima of the original $L_1$ objective, although only in the case of linear models. Poon and Peyré (2021) studied the redundant parametrization in the case of a convex loss function and showed that all local minima of the redundant loss function are global and that the saddles are strict. In follow-up work, Poon and Peyré (2022) analyzed the optimization properties of these convex loss functions.
Compared to previous results, our result comprehensively characterizes all the saddle points and local minima of the loss landscape of the redundant parametrization for a generic and nonconvex loss function. Our theoretical result, in turn, justifies the application of simple SGD to solve this problem and makes it possible to apply this method to highly complicated and practical problems, such as training a sparse neural network. Our motivation is also different from that of previous works. Previous works motivate the reparametrization trick from the viewpoint of solving the original lasso problem, whereas our focus is on solving and understanding problems in deep learning. Application-wise, Hoff (2017) applied the method to linear logistic regression; Poon and Peyré (2021) applied the method to lasso regression and optimal transport. In contrast, our work is also the first to identify and demonstrate its usage in contemporary deep learning.
Sparsity in Deep Learning. One important application of our theory is understanding and achieving any type of parameter sparsity in deep learning. There are two main reasons for introducing sparsity to the model. The first is that some level of sparsity often leads to better generalization performance; the second is that compressing the models can lead to more memory/computation-efficient deployment of the models (Gale et al., 2019; Blalock et al., 2020). However, none of the popular methods for sparsity in deep learning is based on the $L_1$ penalty, which is the favored method in conventional statistics. For example, pruning-based methods are the dominant strategies in deep learning (LeCun et al., 1989). However, such methods are not satisfactory from a principled perspective because the pruning part is separated from the training, and it is hard to understand what these pruning procedures are optimizing.
3. Algorithm and Theory

In this section, we first introduce the reparametrization trick. We then present our theoretical results, which establish that the reparametrization trick does not make the landscape more complicated. All the proofs are presented in Appendix B.
3.1. Landscape of the Reparametrization Trick

Consider a generic objective function $L(V_s, V_d)$ that depends on two sets of learnable parameters $V_s$ and $V_d$, where the subscript $s$ stands for "sparse" and $d$ stands for "dense." Often, we want to find a sparse set of parameters $V_s$ that minimizes $L$. The conventional way to achieve this is to minimize the loss function with an $L_1$ penalty of strength $2\kappa$:

$$\min_{V_s, V_d} \; L(V_s, V_d) + 2\kappa \|V_s\|_1. \tag{1}$$

We will refer to $L(V_s, V_d) + 2\kappa \|V_s\|_1$ as $L_{L_1}$. Under suitable conditions on $L$, the solutions of Eq. (1) will feature both (1) sparsity and (2) shrinkage of the norm of the solution $V_s$, and thus one can perform variable selection and overfitting avoidance at the same time. A primary obstacle that has prevented a scalable optimization of Eq. (1) with gradient descent algorithms is that it is non-differentiable at the points where sparsity is achieved. The optimization problem only has efficient algorithms when the loss function belongs to a restrictive set of families. See Figure 2.

Figure 2: Loss landscape of the original $L_1$-regularized loss and the equivalent $L_2$-regularized redundant parametrization. With the redundant parametrization, the loss becomes smooth and differentiable. The reparametrization introduces one additional minimum but is entirely benign because the two minima are identical, and converging to either achieves an equivalent performance. Left: the original 1d $L_1$ loss $L_{L_1} = (w - c)^2 + |w|$. Mid: reparametrized loss with $c = 0.5$. Right: $c = 1.5$.
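For intuition (a standard lasso fact we add here for concreteness, not a claim taken from the paper), the one-dimensional case of Eq. (1) with $L(w) = (w - c)^2$ has the closed-form soft-thresholding solution

$$\arg\min_{w} \, (w - c)^2 + 2\kappa\,|w| \;=\; \operatorname{sign}(c)\,\max\bigl(|c| - \kappa,\; 0\bigr),$$

which is exactly zero whenever $|c| \le \kappa$. This is the quantity that the smooth reparametrized landscapes in Figure 2 must also recover.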
Let $\odot$ denote the element-wise product. The following theorem derives a precise equivalence of Eq. (1) with a redundantly parameterized objective.

Theorem 1. Let $\alpha\beta = \kappa^2$ and

$$L_{sr}(U, W, V_d) = L(U \odot W, V_d) + \alpha \|U\|_2^2 + \beta \|W\|_2^2. \tag{2}$$

Then, $(U, W, V_d)$ is a global minimum of Eq. (2) if and only if (a) $|U_i| = |W_i|$ for all $i$ and (b) $(U \odot W, V_d)$ is a global minimum of Eq. (1).¹

¹ In this work, we use the letter $L$ exclusively for the part of the loss function that does not contain the $L_1$ or $L_2$ penalty.
Because having $V_d$ in the loss function or not does not change the proof, we omit writing $V_d$ from this point on. We note that the suggestion that this reparametrization trick is equivalent to the $L_1$ penalty at global minima appeared in previous works under various restricted settings. A limited version of this theorem appeared in Hoff (2017) in the context of a linear model. Poon and Peyré (2021) proved this equivalence at the global minimum when the landscape is convex.
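As a quick numerical sanity check of Theorem 1 on a one-dimensional instance (our own sketch, not code from the paper), plain gradient descent on the reparametrized loss $(uw - c)^2 + \kappa(u^2 + w^2)$ drives the product $uw$ to the soft-thresholded lasso solution:

import numpy as np

def soft_threshold(c, kappa):
    # closed-form minimizer of (w - c)^2 + 2*kappa*|w|
    return np.sign(c) * max(abs(c) - kappa, 0.0)

def spred_1d(c, kappa, lr=0.01, steps=20000):
    u, w = 0.7, -0.3  # arbitrary nonzero initialization
    for _ in range(steps):
        r = u * w - c                      # L(uw) = (uw - c)^2, so L'(uw) = 2r
        gu = 2 * r * w + 2 * kappa * u     # d/du of the reparametrized loss
        gw = 2 * r * u + 2 * kappa * w     # d/dw of the reparametrized loss
        u, w = u - lr * gu, w - lr * gw
    return u * w

for c in (0.4, 1.5):
    print(c, spred_1d(c, kappa=0.5), soft_threshold(c, kappa=0.5))
# Expected: the two values agree, roughly 0.0 for c = 0.4 and 1.0 for c = 1.5.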
The subscript $sr$ stands for "sparsity by redundancy." When $L$ is $n$-times differentiable, the objective $L_{sr}$ is also $n$-times differentiable. It is thus tempting to apply simple gradient-based optimization methods to optimize this alternative objective when $L$ itself is differentiable. When $L$ is twice differentiable, one can also apply second-order methods for acceleration. As an example of $L$, consider the case when $L$ is a training-set-dependent loss function (such as in deep learning), and the parameters $V_s$ and $V_d$ are learnable weights of a nonlinear neural network. In this case, one can write $L_{sr}$ as

$$\frac{1}{N}\sum_{i=1}^{N} \ell(f_w(x_i), y_i) + \alpha \|U\|_2^2 + \beta \|W\|_2^2, \tag{3}$$

where $w = (U, W, V_d)$ denotes the total set of parameters we want to minimize, and $(x_i, y_i)$ are data point pairs of an empirical dataset. For a deep learning practitioner, it feels intuitive to solve this loss function with popular deep learning training methods. Additionally, the $L_2$ regularization can be implemented efficiently as weight decay in standard deep learning frameworks. Section 3.2 provides several specific examples of this redundant parametrization.
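As a concrete sketch of how Eq. (3) maps onto a standard training loop (our illustration with a toy synthetic model, not the paper's experimental code): the cross-entropy average is the $\frac{1}{N}\sum_i \ell(f_w(x_i), y_i)$ term, and the $\alpha\|U\|_2^2 + \beta\|W\|_2^2$ terms are implemented through the optimizer's weight decay, applied only to the redundant factors so that the dense parameters $V_d$ stay unregularized.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy setup: the first-layer weight is the redundant product U * W (the V_s part);
# the output head plays the role of the dense parameters V_d.
U = nn.Parameter(0.1 * torch.randn(32, 20))
W = nn.Parameter(0.1 * torch.randn(32, 20))
head = nn.Linear(32, 2)

kappa = 1e-3  # alpha = beta = kappa
optimizer = torch.optim.SGD(
    [
        # weight_decay = 2*kappa penalizes kappa*(||U||^2 + ||W||^2), the alpha/beta terms of Eq. (3)
        {"params": [U, W], "weight_decay": 2 * kappa},
        {"params": head.parameters(), "weight_decay": 0.0},  # V_d left unregularized
    ],
    lr=0.1, momentum=0.9,
)

x, y = torch.randn(256, 20), torch.randint(0, 2, (256,))    # synthetic data
for _ in range(100):                                         # standard minibatch-style loop
    logits = head(torch.relu(F.linear(x, U * W)))            # f_w(x) with V_s = U * W
    loss = F.cross_entropy(logits, y)                        # the (1/N) sum term of Eq. (3)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()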
However, the equivalence at the global minimum is insufficient to motivate an application of SGD because gradient descent is local, and if this parametrization induces many bad minima, SGD can still fail badly. An important question is thus whether this redundant parametrization makes the optimization process more difficult for SGD. We now show that it does not, in the sense that all local minima of Eq. (2) faithfully reproduce the local minima of the original loss and vice versa. Thus, the redundant parametrization cannot introduce new bad minima into the loss landscape.
Theorem 2. All stationary points of Eq. (2) satisfy $|U_i| = |W_i|$. Additionally, $(U, W)$ is a local minimum of Eq. (2) if and only if (a) $V = U \odot W$ is a local minimum of Eq. (1) and (b) $|U_i| = |W_i|$.
Namely, one can partition all of the local minima of $L_{sr}$ into
exclusive and equivalent sets, such that these sets have a one-to-one mapping with the local minima of the corresponding $L_{L_1}$. We are the first to prove this one-to-one mapping relation for a general loss function. This proposition thus offers a partial theoretical explanation of our empirical observation that optimizing Eq. (2) is no more difficult (and often much easier) than optimizing the original $L_1$-regularized loss. A corollary of this theorem reduces to the main theorem of Poon and Peyré (2021), which states that if $L$ is convex (such as in lasso), then every local minimum of $L_{sr}$ is global. A crucial new insight we offer is that one can still converge to a bad minimum for a general landscape, but this happens only because the original $L_{L_1}$ has bad minima, not because of the reparametrization trick.

Still, this alone is insufficient to imply that GD can navigate this landscape easily, because gradient descent can get stuck at saddle points (Du et al., 2017; Ziyin et al., 2021). In particular, GD often has a problem escaping higher-order saddle points, where the Hessian eigenvalues along the escaping directions vanish. The following theorem shows that this is also not a problem for the reparametrization trick because the strength of the gradient is as strong as in the original $L_{L_1}$.
Theorem 3. Let $U = W$, $V = U \odot W$, and let $L$ be everywhere differentiable. Then, for every infinitesimal variation $\delta V$,

1. if $L_{L_1}(V)$ is directionally differentiable in $\delta V$, there exist variations $\delta W, \delta U \in \Theta(\delta V)$ such that $L_{L_1}(V + \delta V) = L_{sr}(U + \delta U, W + \delta W)$;

2. if $L_{L_1}(V)$ is not directionally differentiable in $\delta V$, there exist variations $\delta W, \delta U \in \Theta(\delta V^{0.5})$ such that $L_{L_1}(V + \delta V) = L_{sr}(U + \delta U, W + \delta W)$.
Namely, away from the nondifferentiable points of $L_{L_1}$, the reparametrized landscape is qualitatively the same as the original landscape, and escaping the saddles in the reparametrized landscape must be no harder than escaping the original saddles. If GD finds it difficult to escape a saddle point, it must be because the original $L_{L_1}$ contains a difficult saddle. All nondifferentiable points of $L_{L_1}$ occur at a sparse solution where some parameters are zero. Here, the first-order derivative is discontinuous, and the variation of $L_{L_1}$ is thus first-order in $\delta V$. This implies that the variation of the corresponding $L_{sr}$ is second-order in $\delta U$ and $\delta W$ and that the Hessian of $L_{sr}$ should have at least one negative eigenvalue, which implies that escaping from these points should be no problem for gradient descent (Jin et al., 2017). Combined, Theorems 2 and 3 directly motivate the application of stochastic gradient descent to any problem for which SGD has been demonstrated to be efficient, an important example being a neural network.
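To make this Hessian argument concrete in the simplest case (a one-dimensional computation we add for illustration; it is not reproduced from the paper's proof), take a single parameter and write $L_{sr}(u, w) = L(uw) + \alpha u^2 + \beta w^2$. At the sparse point $u = w = 0$,

$$\nabla^2 L_{sr}(0, 0) \;=\; \begin{pmatrix} 2\alpha & L'(0) \\ L'(0) & 2\beta \end{pmatrix}, \qquad \det \nabla^2 L_{sr}(0, 0) \;=\; 4\alpha\beta - L'(0)^2 \;=\; 4\kappa^2 - L'(0)^2.$$

Hence, whenever $|L'(0)| > 2\kappa$, i.e., exactly when $w = 0$ violates the first-order condition $|L'(0)| \le 2\kappa$ for being a minimum of $L(w) + 2\kappa|w|$, the origin of the reparametrized landscape is a strict saddle with a negative Hessian eigenvalue, which gradient methods escape.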
In more general scenarios, one is interested in a structured sparsity, where a group of parameters is encouraged to be sparse simultaneously. It suffices to consider the case when there is a single group because one can add the $L_1$ penalty recursively to prove the general multigroup case:

$$L(V_s, V_d) + \kappa \|V_s\|_2. \tag{4}$$
The following theorem gives the equivalent redundant form.
Theorem 4. Let $\alpha\beta = \kappa^2$ and

$$L_{sr}(u, W, V_d) = L(u W, V_d) + \alpha u^2 + \beta \|W\|_2^2. \tag{5}$$

Then, $(u, W, V_d)$ is a global minimum of Eq. (5) if and only if (a) $|u| = \|W\|_2$ and (b) $(u W, V_d)$ is a global minimum of Eq. (4).
Namely, every $L_1$ group only requires one additional parameter to sparsify. Note that recursively applying Theorem 4 and setting $W$ to have dimension 1 allows us to recover Theorem 1.² The above theory justifies the application of the reparametrization trick to any sparsity-related task in deep learning. For completeness, we give explicit algorithms in Algorithms 1 and 2. Let $m$ be the number of groups. This algorithm adds $m$ parameters to the training process. Consequently, it has the same complexity as standard deep learning training algorithms such as SGD because it at most doubles the memory and computation cost of training and does not incur additional costs for inference. For the ResNet18/CIFAR10 experiment we performed, each iteration of training with spred takes less than 5% more time than standard training, much lower than the worst-case upper bound of 100%.

² Note that when $L$ is a linear regression objective, the loss function is equivalent to the group lasso.
Algorithm 1 spred algorithm for parameter sparsity
Input: loss function $L(V_s, V_d)$, parameters $V_s, V_d$, $L_1$ regularization strength $2\kappa$
Initialize $W, U$
Solve (with SGD, Adam, L-BFGS, etc.): $\min_{W, U, V_d} L(U \odot W, V_d) + \kappa(\|W\|_2^2 + \|U\|_2^2)$
Output: $V = U \odot W$

Algorithm 2 spred algorithm for structured sparsity
Input: loss function $L(V_s, V_d)$, parameters $V_s, V_d$, $L_1$ regularization strength $2\kappa$
Initialize $W, u$
Solve: $\min_{W, u, V_d} L(u W, V_d) + \kappa(\|W\|_2^2 + u^2)$
Output: $V = u W$
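For concreteness, here is a minimal sketch of how Algorithm 2 might look for input-feature selection (our illustration; GatedFeatureSelector and its methods are hypothetical names). Each input feature receives one redundant scalar gate that multiplies the corresponding column of the first-layer weight, and weight decay on both the gates and the weights plays the role of a group-$L_1$ penalty over features.

import torch
import torch.nn as nn

class GatedFeatureSelector(nn.Module):
    # Structured sparsity as in Algorithm 2, one group per input feature:
    # the effective first-layer column j is gates[j] * weight[:, j].
    # Weight decay of strength set by kappa on `gates` and on `linear.weight`
    # plays the role of a group-L1 penalty on each feature's effective column.
    def __init__(self, in_features, out_features):
        super().__init__()
        self.gates = nn.Parameter(torch.ones(in_features))
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x):
        return self.linear(x * self.gates)

    def selected_features(self, tol=1e-3):
        # indices of features whose effective column norm is numerically nonzero
        col_norm = self.gates.abs() * self.linear.weight.norm(dim=0)
        return (col_norm > tol).nonzero().flatten()

model = GatedFeatureSelector(in_features=1000, out_features=2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-3)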
Implementation and practical remarks. First, multiple ways exist to initialize the redundant parameters $W$ and $U$. One way is to initialize $W$ with, say, the Kaiming init., and $U$ to be of variance 1. The other way is to give both variables