
exclusive and equivalent sets, such that these sets have a one-to-one mapping with the local minima of the corresponding $L_{L_1}$. We are the first to prove this one-to-one mapping relation for a general loss function. This proposition thus offers a partial theoretical explanation for our empirical observation that optimizing Eq. (2) is no more difficult (and often much easier) than optimizing the original $L_1$-regularized loss. A corollary of this theorem reduces to the main theorem of Poon and Peyré (2021), which states that if $L$ is convex (as in Lasso), then every local minimum of $L_{rs}$ is global. A crucial new insight we offer is that one can still converge to a bad minimum for a general landscape, but this happens only because the original $L_{L_1}$ has bad minima, not because of the reparametrization trick.
Still, this alone is insufficient to imply that GD can navigate this landscape easily, because gradient descent can easily get stuck on saddle points (Du et al., 2017; Ziyin et al., 2021). In particular, GD often has trouble escaping higher-order saddle points, where the Hessian eigenvalues along the escaping directions vanish. The following theorem shows that this is also not a problem for the reparametrization trick, because the gradient is as strong as in the original $L_{L_1}$.
Theorem 3. Let $U = W$, $V = U \odot W$, and $L$ be everywhere differentiable. Then, for every infinitesimal variation $\delta V$,
1. if $L_{L_1}(V)$ is directionally differentiable in $\delta V$, there exist variations $\delta W, \delta U \in \Theta(\delta V)$ such that $L_{L_1}(V + \delta V) = L_{rs}(U + \delta U, W + \delta W)$;
2. if $L_{L_1}(V)$ is not directionally differentiable in $\delta V$, there exist variations $\delta W, \delta U \in \Theta(\delta V^{0.5})$ such that $L_{L_1}(V + \delta V) = L_{rs}(U + \delta U, W + \delta W)$.
Namely, away from the nondifferentiable points of $L_{L_1}$, the reparametrized landscape is qualitatively the same as the original landscape, and escaping the saddles of the reparametrized landscape is no harder than escaping the original saddles. If GD finds it difficult to escape a saddle point, it must be because the original $L_{L_1}$ contains a difficult saddle. All nondifferentiable points of $L_{L_1}$ occur at sparse solutions where some parameters are zero. Here, the first-order derivative is discontinuous, and the variation of $L_{L_1}$ is thus first order in $\delta V$. This implies that the variation of the corresponding $L_{rs}$ is second order in $\delta U$ and $\delta W$, and that the Hessian of $L_{rs}$ should have at least one negative eigenvalue, which implies that escaping these points poses no problem for gradient descent (Jin et al., 2017). Combined, Theorems 2 and 3 directly motivate the application of stochastic gradient descent to any problem for which SGD has been demonstrated to be efficient, an important example being a neural network.
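To make the negative-eigenvalue claim concrete, consider the following worked sketch (ours, not part of the formal development) at a coordinate with $u_i = w_i = 0$, using the objective of Algorithm 1 below, $L_{rs}(U, W) = L(U \odot W) + \kappa(\|U\|_2^2 + \|W\|_2^2)$, and writing $\partial_i L$ for the derivative of $L$ with respect to $v_i = u_i w_i$. The restriction of the Hessian of $L_{rs}$ to the pair $(u_i, w_i)$ is
$$\nabla^2_{(u_i, w_i)} L_{rs}\Big|_{u_i = w_i = 0} = \begin{pmatrix} 2\kappa & \partial_i L \\ \partial_i L & 2\kappa \end{pmatrix}, \qquad \lambda_\pm = 2\kappa \pm |\partial_i L|,$$
so the smaller eigenvalue is negative exactly when $|\partial_i L| > 2\kappa$, i.e., exactly when keeping $v_i = 0$ is not optimal for the $L_1$-regularized loss of strength $2\kappa$. The spurious stationary point at the origin is then a strict (second-order) saddle rather than a flat higher-order one, which is the benign case for gradient descent.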
In more general scenarios, one is interested in structured sparsity, where a group of parameters is encouraged to become sparse simultaneously. It suffices to consider the case of a single group, because one can add the $L_1$ penalty recursively to prove the general multigroup case:
$$L(V_s, V_d) + \kappa \|V_s\|_2. \quad (4)$$
The following theorem gives the equivalent redundant form.
Theorem 4. Let $\alpha\beta = \kappa^2$ and
$$L_{sr}(u, W, V_d) := L(uW, V_d) + \alpha u^2 + \beta \|W\|_2^2. \quad (5)$$
Then, $(u, W, V_d)$ is a global minimum of Eq. (5) if and only if (a) $u = \|W\|_2$ and (b) $(uW, V_d)$ is a global minimum of Eq. (4).
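The mechanism behind this equivalence is the AM-GM inequality; the following one-line derivation is our sketch of the key step, not a quotation of the formal proof. For any factorization with $uW = V_s$ held fixed,
$$\alpha u^2 + \beta \|W\|_2^2 \;\ge\; 2\sqrt{\alpha\beta}\, |u|\, \|W\|_2 \;=\; 2\sqrt{\alpha\beta}\, \|uW\|_2, \qquad \text{with equality iff } \alpha u^2 = \beta \|W\|_2^2.$$
Minimizing over the redundant factorization therefore turns the weight decay of Eq. (5) into a group-lasso-type penalty proportional to $\|V_s\|_2$ (the proportionality constant depends on the convention for $\kappa$; compare the $2\kappa$ strength in Algorithms 1 and 2 below), and the equality condition enforces the norm balance between $u$ and $W$ in condition (a).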
Namely, every $L_1$ group requires only one additional parameter to sparsify. Note that recursively applying Theorem 4 and setting $W$ to have dimension $1$ allows us to recover Theorem 1.²
The above theory justifies the application of the reparametrization trick to any sparsity-related task in deep learning. For completeness, we give explicit algorithms in Algorithms 1 and 2. Let $m$ be the number of groups; the algorithm then adds $m$ parameters to the training process. Consequently, it has the same complexity as standard deep learning training algorithms such as SGD, because it at most doubles the memory and computation cost of training and incurs no additional cost at inference. For the ResNet18/CIFAR10 experiment we performed, each iteration of training with spred takes less than 5% more time than standard training, far below the worst-case upper bound of 100%.
Algorithm 1 spred algorithm for parameter sparsity
Input: loss function $L(V_s, V_d)$, parameters $V_s, V_d$, $L_1$ regularization strength $2\kappa$
Initialize $W$, $U$
Solve (with SGD, Adam, L-BFGS, etc.) $\min_{W, U, V_d} L(U \odot W, V_d) + \kappa(\|W\|_2^2 + \|U\|_2^2)$
Output: $V^* = U \odot W$
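As an illustration, the following is a minimal PyTorch sketch of Algorithm 1 for a single linear layer; the module name SpredLinear, the hyperparameter values, and the choice of adding the penalty directly to the loss (rather than via optimizer weight decay) are ours.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpredLinear(nn.Module):
    # Linear layer whose weight is reparametrized as V = U * W (elementwise).
    def __init__(self, in_features, out_features):
        super().__init__()
        self.W = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.W)                                # W: Kaiming init
        self.U = nn.Parameter(torch.randn(out_features, in_features))  # U: variance ~1
        self.bias = nn.Parameter(torch.zeros(out_features))

    def effective_weight(self):
        return self.U * self.W                                          # V = U * W

    def forward(self, x):
        return F.linear(x, self.effective_weight(), self.bias)

    def l2_penalty(self):
        # ||W||_2^2 + ||U||_2^2; scaled by kappa in the loss, this acts as an
        # L1 penalty of strength 2*kappa on the effective weight V.
        return (self.W ** 2).sum() + (self.U ** 2).sum()

kappa = 1e-4                                    # placeholder regularization strength
model = SpredLinear(784, 10)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

def training_step(x, y):
    opt.zero_grad()
    loss = F.cross_entropy(model(x), y) + kappa * model.l2_penalty()
    loss.backward()
    opt.step()
    return loss

After training, the sparse solution is read off as $V^* = U \odot W$ via effective_weight(), matching the output line of Algorithm 1.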
Algorithm 2 spred algorithm for structured sparsity
Input: loss function $L(V_s, V_d)$, parameters $V_s, V_d$, $L_1$ regularization strength $2\kappa$
Initialize $W$, $u$
Solve $\min_{W, u, V_d} L(uW, V_d) + \kappa(\|W\|_2^2 + u^2)$
Output: $V^* = uW$
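For the structured case, each group needs only one extra scalar. The following sketch (ours; treating each output row of a linear layer as a group is an arbitrary illustrative choice) scales every row of the weight matrix by its own redundant scalar $u_i$ and reuses the same training loop as above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupSpredLinear(nn.Module):
    # Linear layer with one redundant scalar u_i per output row (one group per row).
    def __init__(self, in_features, out_features):
        super().__init__()
        self.W = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.W)
        self.u = nn.Parameter(torch.ones(out_features))   # one extra parameter per group
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # Effective weight of group i is u_i * W[i, :].
        return F.linear(x, self.u[:, None] * self.W, self.bias)

    def l2_penalty(self):
        # ||W||_2^2 + ||u||_2^2; scaled by kappa, this acts as a group-lasso
        # penalty of strength 2*kappa on the rows of the effective weight.
        return (self.W ** 2).sum() + (self.u ** 2).sum()

At a sparse solution, an entire row of the effective weight $uW$ vanishes jointly with its group scalar.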
Implementation and practical remarks. First, multiple ways exist to initialize the redundant parameters $W$ and $U$. One way is to initialize $W$ with, say, the Kaiming initialization, and $U$ with variance $1$. The other way is to give both variables
² Note that when $L$ is a linear regression objective, the loss function is equivalent to the group lasso.