Advancing Model Pruning via Bi-level Optimization
Yihua Zhang1,*  Yuguang Yao1,*  Parikshit Ram2  Pu Zhao3  Tianlong Chen4
Mingyi Hong5  Yanzhi Wang3  Sijia Liu1,2
1Michigan State University, 2IBM Research, 3Northeastern University,
4University of Texas at Austin, 5University of Minnesota, Twin Cities
*Equal contribution
Abstract
The deployment constraints in practical applications necessitate the pruning of large-scale deep learning models, i.e., promoting their weight sparsity. As illustrated by the Lottery Ticket Hypothesis (LTH), pruning also has the potential of improving their generalization ability. At the core of LTH, iterative magnitude pruning (IMP) is the predominant pruning method used to successfully find ‘winning tickets’. Yet, the computation cost of IMP grows prohibitively as the targeted pruning ratio increases. To reduce this overhead, various efficient ‘one-shot’ pruning methods have been developed, but these schemes are usually unable to find winning tickets as good as those of IMP. This raises the question of how to close the gap between pruning accuracy and pruning efficiency. To tackle it, we pursue the algorithmic advancement of model pruning. Specifically, we formulate the pruning problem from a fresh viewpoint, bi-level optimization (BLO). We show that the BLO interpretation provides a technically-grounded optimization basis for an efficient implementation of the pruning-retraining learning paradigm used in IMP. We also show that the proposed bi-level optimization-oriented pruning method (termed BIP) is a special class of BLO problems with a bi-linear problem structure. By leveraging such bi-linearity, we theoretically show that BIP can be solved as easily as first-order optimization, thus inheriting its computational efficiency. Through extensive experiments on both structured and unstructured pruning with 5 model architectures and 4 datasets, we demonstrate that BIP can find better winning tickets than IMP in most cases, and is computationally as efficient as one-shot pruning schemes, demonstrating 2-7× speedup over IMP for the same level of model accuracy and sparsity. Codes are available at https://github.com/OPTML-Group/BiP.
1 Introduction
While over-parameterized structures are key to the improved generalization of deep neural networks (DNNs) [1–3], they create new problems: the millions or even billions of parameters not only increase computational costs during inference, but also pose serious deployment challenges on resource-limited devices [4]. As a result, model pruning has seen a lot of research interest in recent years, focusing on reducing model sizes by removing (or pruning) redundant parameters [4–8]. Model sparsity (achieved by pruning) also benefits adversarial robustness [9], out-of-distribution generalization [10], and transfer learning [11]. Some pruning methods (towards structured sparsity) facilitate model deployment on hardware [12, 13].
Among various proposed model pruning algorithms [5, 9, 11, 14–27], the heuristics-based Iterative Magnitude Pruning (IMP) is the current dominant approach to achieving model sparsity without suffering performance loss, as suggested and empirically justified by the Lottery Ticket Hypothesis (LTH) [17]. The LTH hypothesizes the existence of a subnetwork (the so-called ‘winning ticket’) that, when trained in isolation (e.g., from either a random initialization [17] or an early-rewinding point of dense model training [19]), can match the performance of the original dense model [5, 11, 17–20]. The core idea of IMP is to iteratively prune and retrain the model, progressively pruning a small ratio of the remaining weights in each iteration and continuing until the desired pruning ratio has been reached. While IMP often finds the winning ticket, it incurs the cost of repeated model retraining from scratch, making it prohibitively expensive for large datasets or large model architectures [28, 29]. To improve the efficiency of model pruning, numerous heuristics-based one-shot pruning methods [17, 21–25], e.g., one-shot magnitude pruning (OMP), have been proposed. These schemes directly prune the model to the target sparsity and are significantly more efficient than IMP. Yet, the promise of current one-shot pruning methods is dataset/model-specific [22, 30] and mostly lies in the low pruning ratio regime [25]. As systematically studied in [22, 31], there exists a clear performance gap between one-shot pruning and IMP. As an alternative to heuristics-based schemes, optimization-based pruning methods [9, 15, 16, 26, 27] still follow the pruning-retraining paradigm and adopt sparsity regularization [32, 33] or parameterized masking [9, 16, 27] to prune models efficiently. However, these methods do not always match the accuracy of IMP and thus have not been widely used to find winning tickets [17, 21, 23–25, 29]. The empirical observation that optimization underperforms heuristics motivates us to revisit the algorithmic fundamentals of pruning.
Figure 1: A performance snapshot of the proposed BIP method vs. the IMP baseline and the original dense model across three pruned ResNet models (ResNet-20, ResNet-56, and ResNet-18) with 74% sparsity on CIFAR-10. The marker size indicates the relative model size, and each uni-color region corresponds to the same model type used by different pruning methods.
To this end, we put forth a novel perspective of model pruning as a bi-level optimization (BLO) problem. In this new formulation, we show that BLO provides a technically-grounded optimization basis for an efficient implementation of the pruning-retraining paradigm, the key algorithmic component used in IMP. To the best of our knowledge, we make the first rigorous connection between model pruning and BLO. Technically, we propose a novel bi-level optimization-enabled pruning method (termed BIP). We further show how BIP takes advantage of the bi-linearity of the pruning problem to avoid the computational challenges of common BLO methods, and is as efficient as any first-order alternating optimization scheme. Practically, we demonstrate the superiority of the proposed BIP in terms of accuracy, sparsity, and computation efficiency through extensive experiments. BIP finds the best winning ticket in nearly all settings while taking time comparable to the one-shot OMP. In Fig. 1, we present a snapshot of our empirical results for CIFAR-10 with 3 ResNet architectures at a 74% pruning ratio. In all cases, BIP finds winning tickets, improving accuracy over the dense model and matching IMP, while being up to 5× faster than IMP.
Our contributions can be summarized as follows:
(Formulation) We rethink the algorithmic foundation of model pruning through the lens of BLO (bi-level optimization). The new BLO-oriented formulation disentangles pruning and retraining variables, providing the flexibility to design the interface between pruning and retraining.

(Algorithm) We propose the new bi-level pruning (BIP) algorithm, which is built upon the aforementioned BLO formulation and the implicit gradient-based optimization theory. Unlike computationally intensive standard BLO solvers, we theoretically show that BIP is as efficient as any first-order optimization by taking advantage of the bi-linear nature of the pruning variables.

(Experiments) We conduct extensive experiments across 4 datasets (CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet), 5 model architectures, and 3 pruning settings (unstructured pruning, filter-wise structured pruning, and channel-wise structured pruning). We show that (i) BIP achieves higher test accuracy than IMP and finds the best winning tickets in nearly all settings, (ii) BIP is highly efficient (comparable to one-shot pruning schemes), achieving a 2-7× speedup over IMP for the same level of model accuracy and sparsity, and (iii) BIP is able to find subnetworks that achieve better performance than the dense model regardless of initialization rewinding.
2 Related Work and Open Question
Neural network pruning.
As neural networks become deeper and more sophisticated, model pruning technology has gained increasing attention over the last decade, since pruned models are necessary for the deployment of deep networks in practical applications [4, 34, 35]. With the goal of finding highly-sparse and highly-accurate subnetworks from original dense models, a variety of pruning methods have been developed, such as heuristics-based pruning [17, 21, 23–25, 29, 36] and optimization-based pruning [9, 16, 26, 27]. The former identifies redundant model weights by leveraging heuristics-based metrics such as weight magnitudes [6, 17, 19, 11, 22, 37, 31, 36, 38], gradient magnitudes [21, 23, 24, 39, 40], and Hessian statistics [41–46]. The latter is typically built on: 1) sparsity-promoting optimization [15, 33, 47–50], where model weights are trained by penalizing their sparsity-inducing norms, such as the $\ell_0$ and $\ell_1$ norms for irregular weight pruning and the $\ell_2$ norm for structured pruning; 2) parameterized masking [16, 9, 51–55], where model weight scores are optimized to filter the most important weights and achieve better performance.
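To make these two optimization-based families concrete, the snippet below gives a minimal PyTorch-style sketch of (1) a sparsity-promoting $\ell_1$ penalty added to the training loss and (2) a parameterized-masking step that keeps the top-$k$ weight scores. The function and variable names are illustrative assumptions, not code from any of the cited methods.

```python
import torch

def l1_penalized_loss(task_loss, weights, lam=1e-4):
    # Sparsity-promoting optimization: training minimizes the task loss plus an
    # l1 penalty, so many weight entries are driven toward zero.
    return task_loss + lam * sum(w.abs().sum() for w in weights)

def topk_mask(scores, k):
    # Parameterized masking: `scores` is a trainable importance value per weight
    # (flattened); the k highest-scoring positions are kept, the rest pruned.
    mask = torch.zeros_like(scores)
    mask[scores.topk(k).indices] = 1.0
    return mask
```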
Iterative vs. one-shot pruning, and motivation.
Existing schemes can be further categorized into one-shot or iterative pruning based on the pruning schedule employed for achieving the targeted model sparsity. Among the iterative schemes, IMP (the Iterative Magnitude Pruning scheme) [17, 20, 56–65, 36] has played a significant role in identifying high-quality ‘winning tickets’, as postulated by the LTH (Lottery Ticket Hypothesis) [18, 19]. To enable consistent comparisons among different methods, we extend the original definition of winning tickets in [17] to ‘matching subnetworks’ [20] so as to cover different implementations of winning tickets, e.g., the use of early-epoch rewinding for model re-initialization [18] and the no-rewinding (i.e., fine-tuning) variant [66]. Briefly, a matching subnetwork should match or surpass the performance of the original dense model [20]. In this work, if a matching subnetwork is found that is better than the winning ticket obtained by the same method under the original LTH setup [18, 19], we also call such a matching subnetwork a winning ticket throughout the paper.
Figure 2: Illustration of the pros and cons of different pruning methods executed over (ResNet-20, CIFAR-10). Left: test accuracy vs. different pruning ratios for IMP and one-shot pruning methods (OMP [17], GRASP [23], SNIP [21], SYNFLOW [24]). Right: comparison of efficiency at different sparsity levels.
For example, the current state-of-the-art (SOTA) implementation of IMP in [22] can lead to a pruned ResNet-20 on CIFAR-10 with 74% sparsity and 92.12% test accuracy, matching the performance of the original dense model (marked in red in Fig. 2-Left). The IMP algorithm typically contains two key ingredients: (i) a temporally-evolving pruning schedule to progressively increase model sparsity over pruning iterations, and (ii) the pruning-retraining learning mechanism applied at each pruning iteration. With a target pruning ratio of $p\%$ and $T$ pruning iterations, an example pruning schedule in (i) could be as follows: each iteration prunes $(p\%)^{1/T}$ of the currently unpruned model weights, progressively pruning fewer weights in each iteration. For (ii), the unpruned weights in each pruning iteration are reset to the weights at initialization or at an early-training epoch [18], and retrained till convergence. In brief, IMP repeatedly prunes, resets, and trains the network over multiple iterations.
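To make ingredients (i)-(ii) concrete, below is a minimal sketch of the IMP loop in PyTorch-style pseudocode. The helpers `rewind` and `train_to_convergence`, and the geometric schedule that keeps a constant fraction of the remaining weights per round, are illustrative assumptions rather than the exact schedule of any particular implementation.

```python
import torch

def imp(model, init_state, target_sparsity, T):
    # `model` is the trained dense network; `init_state` is the checkpoint used
    # for rewinding (random init or an early-training epoch). Biases/BN params
    # would typically be excluded from pruning; they are included here for brevity.
    mask = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    keep_per_round = (1.0 - target_sparsity) ** (1.0 / T)   # geometric schedule
    for _ in range(T):
        # Score surviving weights by magnitude and keep only the largest ones.
        surviving = torch.cat([(p * mask[n]).abs().flatten()
                               for n, p in model.named_parameters()])
        n_keep = int(keep_per_round * int(sum(m.sum() for m in mask.values())))
        threshold = surviving[surviving > 0].topk(n_keep).values.min()
        for n, p in model.named_parameters():
            mask[n] = ((p * mask[n]).abs() >= threshold).float()
        rewind(model, init_state)           # (ii) reset the unpruned weights
        train_to_convergence(model, mask)   # (ii) retrain with the mask applied
    return mask
```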
However, winning tickets found by IMP incur significant computational costs. The sparsest winning ticket found by IMP in Fig. 2-Left (in red) uses T = 6 pruning iterations. As shown in Fig. 2-Right, this takes 3× more time than the original training of the dense model. To avoid the computational cost of IMP, different kinds of ‘accelerated’ pruning methods were developed [17, 21, 23–25, 29], and many fall into the one-shot pruning category: the network is directly pruned to the target sparsity and retrained once. In particular, OMP (one-shot magnitude pruning) is an important baseline that simplifies IMP [17]. It follows the pruning-retraining paradigm, but prunes the model to the target ratio in a single pruning iteration. Although one-shot pruning schemes are computationally cheap (Fig. 2-Right), they incur a significant accuracy drop compared to IMP (Fig. 2-Left). Even if IMP is customized with a reduced number of training epochs per pruning round, the pruning accuracy drops substantially (see Fig. A2). Hence, there is a need for advanced model pruning techniques that find winning tickets like IMP while being efficient like one-shot pruning.
Different from the unstructured weight pruning described above, structured pruning takes into consideration the sparse patterns of the model, such as filter and channel-wise pruning [13, 48, 51, 67–71]. Structured pruning is desirable for deep model deployment in the presence of hardware constraints [4, 34]. However, compared to unstructured pruning, it is usually more challenging to maintain the performance and find structure-aware winning tickets [28, 29].
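As a small illustration of the difference, the sketch below contrasts an unstructured (per-weight) mask with a filter-wise structured mask on a convolutional weight tensor; the shapes, the magnitude threshold, and the 50% filter cutoff are arbitrary assumptions.

```python
import torch

w = torch.randn(64, 32, 3, 3)            # conv weight: (out_ch, in_ch, kH, kW)

# Unstructured pruning: zero out individual low-magnitude weights.
unstructured_mask = (w.abs() > 0.05).float()

# Filter-wise structured pruning: score each output filter as a whole and
# remove entire filters, a pattern that maps more directly onto hardware speedups.
filter_scores = w.abs().mean(dim=(1, 2, 3))                      # one score per filter
filter_mask = (filter_scores > filter_scores.median()).float().view(-1, 1, 1, 1)
structured_w = w * filter_mask
```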
Open question.
As discussed above, one-shot pruning methods are unable to match the predictive performance of IMP, and structure-aware winning tickets are hard to find. Clearly, the optimization foundation of model pruning is underdeveloped. Thus, we ask:

Is there an optimization basis for a successful pruning algorithm that can attain high pruned model accuracy (like IMP) and high computational efficiency (like one-shot pruning)?

The model pruning problem has a natural hierarchical structure: we need to find the best mask to prune model parameters, and then, given the mask, find the best model weights for the unpruned model parameters. Given this hierarchical structure, we believe that the bi-level optimization (BLO) framework is a promising optimization basis for a successful pruning algorithm.
Bi-level optimization (BLO).
BLO is a general hierarchical optimization framework in which the upper-level problem depends on the solution to the lower-level problem. Such a nested structure makes BLO in its most generic form very difficult to solve, since it is hard to characterize the influence of the lower-level optimization on the upper-level problem. A large body of existing literature focuses on the design and analysis of algorithms for special cases of BLO, ranging from the classical approximate descent methods [72–74] and penalty-based methods [75, 76] to the recent BigSAM [77] and its extensions [78, 79]. BLO has also been studied in the context of stochastic algorithms [80–82] and back-propagation based algorithms [83–85], and has advanced adversarial training [86], meta-learning [87], data poisoning attack generation [88], neural architecture search [89], as well as reinforcement learning [90]. Although BLO was referred to in [50] for model pruning, that work actually used an ordinary alternating optimization procedure, without taking the hierarchical learning structure of BLO into consideration. To the best of our knowledge, the BLO framework has not been considered for model pruning in depth and systematically. We will show that model pruning yields a special class of BLO problems with bi-linear optimization variables. We will also theoretically show that this specialized BLO problem for model pruning can be solved as efficiently as first-order optimization. This is in sharp contrast to existing BLO applications that rely on heuristics-based BLO solvers (e.g., gradient unrolling in meta-learning [87] and neural architecture search [89, 91]).
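For reference, the heuristic gradient-unrolling approach mentioned above approximates the lower-level solution by a small number of differentiable gradient steps and back-propagates through them; a one-step sketch in PyTorch is shown below, with hypothetical scalar loss functions `g_lower` (lower level) and `f_upper` (upper level) and a continuous upper-level variable. BIP will avoid this kind of unrolling by exploiting the implicit gradient and the bi-linear structure of pruning.

```python
import torch

def one_step_unrolled_grad(f_upper, g_lower, m, theta, lr=0.1):
    # Approximate the lower-level solution by a single differentiable SGD step,
    # then differentiate the upper-level objective through that step w.r.t. m.
    # (Assumes m and theta are continuous tensors with requires_grad=True.)
    grad_theta = torch.autograd.grad(g_lower(m, theta), theta, create_graph=True)[0]
    theta_hat = theta - lr * grad_theta
    return torch.autograd.grad(f_upper(m, theta_hat), m)[0]
```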
3 BIP: Model Pruning via Bi-level Optimization
In this section, we re-investigate the problem of model pruning through the lens of BLO and develop the bi-level pruning (BIP) algorithm. We theoretically show that BIP can be solved as easily as first-order alternating optimization by taking advantage of the bi-linearity of the pruning variables.
A BLO viewpoint on model pruning.
As described in the previous section, the pruning-retraining learning paradigm covers two kinds of tasks: ① pruning, which determines the sparse pattern of model weights, and ② training the remaining non-zero weights to recover the model accuracy. In existing optimization-based pruning methods [92–95], the tasks ①-② are typically achieved by optimizing the model weights while penalizing their sparsity-inducing norms, e.g., the $\ell_1$ and $\ell_2$ norms [96]. Different from the above formulation, we propose to separate the optimization variables involved in the pruning tasks ① and ②. This leads to the (binary) pruning mask variable $m \in \{0,1\}^n$ and the model weight variable $\theta \in \mathbb{R}^n$, where $n$ denotes the total number of model parameters. Accordingly, the pruned model is given by $(m \odot \theta)$, where $\odot$ denotes element-wise multiplication. As will be evident later, this form of variable disentanglement enables us to explicitly depict how the pruning and retraining processes co-evolve, and helps customize the pruning task with high flexibility.
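A minimal sketch of this disentangled parameterization is given below; the names are hypothetical, and `forward_fn` is assumed to map a flat weight vector and an input batch to logits.

```python
import torch
import torch.nn.functional as F

def masked_loss(theta, m, x, y, forward_fn):
    # theta: flat weight vector of length n; m: mask of the same length.
    # The network is evaluated at the element-wise product m * theta, so the
    # pruning variable m and the retraining variable theta remain decoupled.
    logits = forward_fn(m * theta, x)
    return F.cross_entropy(logits, y)
```

Keeping $m$ and $\theta$ as separate arguments is what allows them to be assigned to different optimization levels below.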
We next elaborate on how BLO can be established to co-optimize the pruning mask $m$ and the retrained sparse model weights $\theta$. Given the pruning ratio $p\%$, the sparsity constraint is given by $m \in \mathcal{S}$, where $\mathcal{S} = \{\, m \mid m \in \{0,1\}^n, \; \mathbf{1}^\top m \le k \,\}$ and $k = (1 - p\%)\, n$. Our goal is to prune the original model directly to the targeted pruning ratio $p\%$ (i.e., without calling for the IMP-like sparsity schedule described in Sec. 2) and obtain the optimized sparse model $(m \odot \theta)$. To this end, we interpret the pruning task (i.e., ①) and the model retraining task (i.e., ②) as two optimization levels, where the former is formulated as an upper-level optimization problem that relies on the solution of the lower-level retraining task. We thus cast the model pruning problem as the following BLO problem (with ② nested inside ①):

$$
\underset{m \in \mathcal{S}}{\text{minimize}} \;\; \underbrace{\ell\big(m \odot \theta(m)\big)}_{\text{①: pruning task}}
\quad \text{subject to} \quad
\theta(m) = \underset{\theta \in \mathbb{R}^n}{\arg\min} \;\; \underbrace{\overbrace{\ell(m \odot \theta) + \tfrac{\gamma}{2}\,\|\theta\|_2^2}^{:=\, g(m,\theta)}}_{\text{②: sparsity-fixed model retraining}},
\qquad (1)
$$
where $\ell$ denotes the training loss (e.g., cross-entropy), $m$ and $\theta$ are the upper-level and lower-level optimization variables respectively, $\theta(m)$ signifies the lower-level solution obtained by minimizing the objective function $g$ given the pruning mask $m$, and $\gamma > 0$ is a regularization parameter introduced to convexify the lower-level optimization so as to stabilize the gradient flow from $\theta(m)$ to $m$ and thus the convergence of BLO [82, 97]. In sharp contrast to existing single-level optimization-based model pruning methods [92–95], the BLO formulation (1) brings in two advantages.
First, BLO has the flexibility to use mismatched pruning and retraining objectives at the upper and lower optimization levels, respectively. This flexibility allows us to regularize the lower-level training objective in (1) and customize the implemented optimization methods at both levels. To be more specific, one can update the upper-level pruning mask $m$ using a data batch (called $\mathcal{B}_2$) distinct from the one (called $\mathcal{B}_1$) used for obtaining the lower-level solution $\theta(m)$. The resulting BLO procedure can then mimic the idea of meta-learning to improve model generalization [98], where the lower-level problem fine-tunes $\theta$ using $\mathcal{B}_1$, and the upper-level problem validates the generalization of the sparsity-aware fine-tuned model $(m \odot \theta(m))$ using $\mathcal{B}_2$.
Second, BLO enables us to explicitly model and optimize the coupling between the retrained model weights $\theta(m)$ and the pruning mask $m$ through an implicit gradient (IG)-based optimization routine. Here IG refers to the gradient of the lower-level solution $\theta(m)$ with respect to (w.r.t.) the upper-level variable $m$, and its derivation calls for the implicit function theory [76]. The use of IG makes our proposed BLO-oriented pruning (1) significantly different from greedy alternating minimization [99], which learns the upper-level and lower-level variables independently (i.e., minimizes over one variable while fixing the other). We refer readers to the following section for the detailed IG theory. We will also show in Sec. 4 that the pruning strategy from (1) can outperform IMP in many pruning scenarios while being much more efficient, as it does not call for a schedule of iterative pruning ratios.
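Before deriving the implicit gradient, the two levels of (1) and the optional batch split $\mathcal{B}_1/\mathcal{B}_2$ can be summarized by the following sketch of the two objective functions (illustrative names only; this depicts the formulation, not the BIP update, which additionally uses the IG correction derived next).

```python
import torch
import torch.nn.functional as F

def lower_objective(m, theta, batch_B1, forward_fn, gamma):
    # g(m, theta): training loss of the masked model on B1 plus the quadratic
    # regularizer; minimized over theta with the mask m held fixed.
    x, y = batch_B1
    return (F.cross_entropy(forward_fn(m * theta, x), y)
            + 0.5 * gamma * theta.pow(2).sum())

def upper_objective(m, theta_of_m, batch_B2, forward_fn):
    # l(m * theta(m)): the pruning loss, evaluated with the retrained weights
    # theta(m) on a (possibly distinct) batch B2.
    x, y = batch_B2
    return F.cross_entropy(forward_fn(m * theta_of_m, x), y)
```

Because $\theta(m)$ is itself defined through the minimization of $g$, differentiating the upper-level objective w.r.t. $m$ involves the implicit gradient discussed below; simply alternating independent gradient steps on these two functions would ignore that coupling.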
Optimization foundation of BIP.
The key optimization challenge in solving the BIP problem (1) lies in the computation of the IG (implicit gradient). Prior to developing an effective solution, we first elaborate on the IG challenge, the unique characteristic of BLO. In the context of gradient descent, the gradient of the objective function in (1) yields

$$
\underbrace{\frac{d\, \ell\big(m \odot \theta(m)\big)}{d m}}_{\text{gradient of objective}}
= \nabla_{m}\, \ell\big(m \odot \theta(m)\big)
+ \underbrace{\frac{d\, \theta(m)^{\top}}{d m}}_{\text{IG}} \, \nabla_{\theta}\, \ell\big(m \odot \theta(m)\big),
\qquad (2)
$$

where $\nabla_{m}$ and $\nabla_{\theta}$ denote the partial derivatives of the bi-variate function $\ell(m \odot \theta)$ w.r.t. the variables $m$ and $\theta$ respectively, $d\theta^{\top}/dm \in \mathbb{R}^{n \times n}$ denotes the vector-wise full derivative, and for ease of notation we omit the transpose $\top$ when the context is clear. In (2), the IG challenge refers to the demand for computing the full gradient of the implicit function $\theta(m) = \arg\min_{\theta} g(m, \theta)$ w.r.t. $m$, where we recall from (1) that $g(m, \theta) := \ell(m \odot \theta) + \frac{\gamma}{2}\|\theta\|_2^2$.
Next, we derive the IG formula following the rigorous implicit function theory [76, 82, 87]. Based on the fact that $\theta(m)$ satisfies the stationarity condition of the lower-level objective function $g$ in (1), it is not difficult to obtain (see the derivation in Appendix A)

$$
\frac{d\, \theta(m)}{d m}
= -\nabla_{m\theta}^{2}\, \ell(m \odot \theta)\,
\big[\, \nabla_{\theta}^{2}\, \ell(m \odot \theta) + \gamma I \,\big]^{-1},
\qquad (3)
$$

where $\nabla_{m\theta}^{2}\ell$ and $\nabla_{\theta}^{2}\ell$ denote the second-order partial derivatives of $\ell$, and $(\cdot)^{-1}$ denotes the matrix inversion operation.
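As a side note on computation, the action of the mixed second-derivative term in (3) on a vector, which is what the IG term in (2) ultimately requires, can be obtained by double back-propagation without ever materializing the $n \times n$ matrix. A minimal PyTorch sketch, assuming $m$ is relaxed to a continuous tensor with `requires_grad=True` (hypothetical names, not the authors' implementation):

```python
import torch

def cross_second_derivative_vp(loss, theta, m, v):
    # Computes (d^2 loss / dm dtheta) @ v via double backprop. `loss` must be
    # built from m and theta so that both participate in the autograd graph.
    g_theta = torch.autograd.grad(loss, theta, create_graph=True)[0]
    return torch.autograd.grad((g_theta * v.detach()).sum(), m)[0]
```

This handles the mixed-derivative factor; the inverse term in (3) is what the simplification below removes.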
Yet, the exact IG formula (3) remains difficult to calculate due to the presence of matrix inversion and second-order partial derivatives. To simplify it, we impose the Hessian-free assumption, $\nabla_{\theta}^{2}\ell = \mathbf{0}$, which is mild in general; for example, the decision boundaries of neural networks with ReLU