Advancing Model Pruning via Bi-level Optimization
Yihua Zhang1,*  Yuguang Yao1,*  Parikshit Ram2  Pu Zhao3  Tianlong Chen4
Mingyi Hong5  Yanzhi Wang3  Sijia Liu1,2
1Michigan State University, 2IBM Research, 3Northeastern University,
4University of Texas at Austin, 5University of Minnesota, Twin Cities
*Equal contribution
Abstract
The deployment constraints in practical applications necessitate the pruning of large-scale deep learning models, i.e., promoting their weight sparsity. As illustrated by the Lottery Ticket Hypothesis (LTH), pruning also has the potential of improving their generalization ability. At the core of LTH, iterative magnitude pruning (IMP) is the predominant pruning method used to successfully find ‘winning tickets’. Yet, the computation cost of IMP grows prohibitively as the targeted pruning ratio increases. To reduce this overhead, various efficient ‘one-shot’ pruning methods have been developed, but these schemes are usually unable to find winning tickets as good as those of IMP. This raises the question of how to close the gap between pruning accuracy and pruning efficiency. To tackle it, we pursue the algorithmic advancement of model pruning. Specifically, we formulate the pruning problem from a fresh viewpoint, bi-level optimization (BLO). We show that the BLO interpretation provides a technically-grounded optimization basis for an efficient implementation of the pruning-retraining learning paradigm used in IMP. We also show that the proposed bi-level optimization-oriented pruning method (termed BIP) is a special class of BLO problems with a bi-linear problem structure. By leveraging such bi-linearity, we theoretically show that BIP can be solved as easily as first-order optimization, thus inheriting its computational efficiency. Through extensive experiments on both structured and unstructured pruning with 5 model architectures and 4 datasets, we demonstrate that BIP can find better winning tickets than IMP in most cases, and is computationally as efficient as one-shot pruning schemes, demonstrating 2-7× speedup over IMP for the same level of model accuracy and sparsity. Codes are available at https://github.com/OPTML-Group/BiP.
1 Introduction
While over-parameterized structures are key to the improved generalization of deep neural networks (DNNs) [1–3], they create new problems: the millions or even billions of parameters not only increase computational costs during inference, but also pose serious deployment challenges on resource-limited devices [4]. As a result, model pruning has seen a lot of research interest in recent years, focusing on reducing model sizes by removing (or pruning) redundant parameters [4–8]. Model sparsity (achieved by pruning) also benefits adversarial robustness [9], out-of-distribution generalization [10], and transfer learning [11]. Some pruning methods (towards structured sparsity) facilitate model deployment on hardware [12, 13].
Among various proposed model pruning algorithms [5, 9, 11, 14–27], the heuristics-based Iterative Magnitude Pruning (IMP) is the current dominant approach to achieving model sparsity without suffering performance loss, as suggested and empirically justified by the Lottery Ticket Hypothesis (LTH) [17]. The LTH hypothesizes the existence of a subnetwork (the so-called ‘winning ticket’) that, when trained in isolation (e.g., from either a random initialization [17] or an early-rewinding point of dense model training [19]), can match the performance of the original dense model [5, 11, 17–20]. The core idea of IMP is to iteratively prune and retrain the model, progressively pruning a small ratio of the remaining weights in each iteration and continuing until the desired pruning ratio has been reached. While IMP often finds the winning ticket, it incurs the cost of repeated model retraining from scratch, making it prohibitively expensive for large datasets or large model architectures [28, 29]. To improve the efficiency of model pruning, numerous heuristics-based one-shot pruning methods [17, 21–25], e.g., one-shot magnitude pruning (OMP), have been proposed. These schemes directly prune the model to the target sparsity and are significantly more efficient than IMP. Yet, the promise of current one-shot pruning methods is dataset/model-specific [22, 30] and mostly lies in the low pruning ratio regime [25]. As systematically studied in [22, 31], there exists a clear performance gap between one-shot pruning and IMP. As an alternative to heuristics-based schemes, optimization-based pruning methods [9, 15, 16, 26, 27] still follow the pruning-retraining paradigm and adopt sparsity regularization [32, 33] or parameterized masking [9, 16, 27] to prune models efficiently. However, these methods do not always match the accuracy of IMP and thus have not been widely used to find winning tickets [17, 21, 23–25, 29]. The empirical observation that optimization underperforms heuristics motivates us to revisit the algorithmic fundamentals of pruning.
Figure 1: A performance snapshot of the proposed BIP method vs. the IMP baseline and the original dense model across three pruned ResNet models (ResNet-20, ResNet-56, and ResNet-18) with 74% sparsity on CIFAR-10. The marker size indicates the relative model size, and each uni-color region corresponds to the same model type used by different pruning methods.
To this end, we put forth a novel perspective of model pruning as a bi-level optimization (BLO) problem. In this new formulation, we show that BLO provides a technically-grounded optimization basis for an efficient implementation of the pruning-retraining paradigm, the key algorithmic component used in IMP. To the best of our knowledge, we make the first rigorous connection between model pruning and BLO. Technically, we propose a novel bi-level optimization-enabled pruning method (termed BIP). We further show how BIP takes advantage of the bi-linearity of the pruning problem to avoid the computational challenges of common BLO methods, and is as efficient as any first-order alternating optimization scheme. Practically, we demonstrate the superiority of the proposed BIP in terms of accuracy, sparsity, and computation efficiency through extensive experiments. BIP finds the best winning ticket in nearly all settings while taking time comparable to the one-shot OMP. In Fig. 1, we present a snapshot of our empirical results for CIFAR-10 with 3 ResNet architectures at a 74% pruning ratio. In all cases, BIP finds winning tickets, improving accuracy over the dense model and matching IMP, while being up to 5× faster than IMP.
Our contributions can be summarized as follows:
(Formulation) We rethink the algorithmic foundation of model pruning through the lens of BLO (bi-level optimization). The new BLO-oriented formulation disentangles pruning and retraining variables, providing the flexibility to design the interface between pruning and retraining.

(Algorithm) We propose the new bi-level pruning (BIP) algorithm, which is built upon the aforementioned BLO formulation and the implicit gradient-based optimization theory. Unlike computationally intensive standard BLO solvers, we theoretically show that BIP is as efficient as any first-order optimization by taking advantage of the bi-linear nature of the pruning variables.

(Experiments) We conduct extensive experiments across 4 datasets (CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet), 5 model architectures, and 3 pruning settings (unstructured pruning, filter-wise structured pruning, and channel-wise structured pruning). We show that (i) BIP achieves higher test accuracy than IMP and finds the best winning tickets in nearly all settings, (ii) BIP is highly efficient (comparable to one-shot pruning schemes), achieving a 2-7× speedup over IMP for the same level of model accuracy and sparsity, and (iii) BIP is able to find subnetworks that achieve better performance than the dense model regardless of initialization rewinding.
2 Related Work and Open Question
Neural network pruning.
As neural networks become deeper and more sophisticated, model pruning technology has gained increasing attention over the last decade, since pruned models are necessary for the deployment of deep networks in practical applications [4, 34, 35]. With the goal of finding highly-sparse and highly-accurate subnetworks from original dense models, a variety of pruning methods have been developed, such as heuristics-based pruning [17, 21, 23–25, 29, 36] and optimization-based pruning [9, 16, 26, 27]. The former identifies redundant model weights by leveraging heuristics-based metrics such as weight magnitudes [6, 17, 19, 11, 22, 37, 31, 36, 38], gradient magnitudes [21, 23, 24, 39, 40], and Hessian statistics [41–46]. The latter is typically built on: 1) sparsity-promoting optimization [15, 33, 47–50], where model weights are trained by penalizing their sparsity-inducing norms, such as the $\ell_0$ and $\ell_1$ norms for irregular weight pruning and the $\ell_2$ norm for structured pruning; 2) parameterized masking [16, 9, 51–55], where model weight scores are optimized to filter the most important weights and achieve better performance.
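To make these two optimization-based families concrete, the snippet below gives a minimal PyTorch-style sketch of (1) a sparsity-promoting $\ell_1$ penalty added to the training loss and (2) a parameterized-masking step that keeps the top-$k$ weight scores. The function and variable names are illustrative assumptions, not code from any of the cited methods.

```python
import torch

def l1_penalized_loss(task_loss, weights, lam=1e-4):
    # Sparsity-promoting optimization: training minimizes the task loss plus an
    # l1 penalty, so many weight entries are driven toward zero.
    return task_loss + lam * sum(w.abs().sum() for w in weights)

def topk_mask(scores, k):
    # Parameterized masking: `scores` is a trainable importance value per weight
    # (flattened); the k highest-scoring positions are kept, the rest pruned.
    mask = torch.zeros_like(scores)
    mask[scores.topk(k).indices] = 1.0
    return mask
```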
Iterative vs. one-shot pruning, and motivation.
Existing schemes can be further categorized into one-shot or iterative pruning based on the pruning schedule employed for achieving the targeted model sparsity. Among the iterative schemes, IMP (the Iterative Magnitude Pruning scheme) [17, 20, 56–65, 36] has played a significant role in identifying high-quality ‘winning tickets’, as postulated by the LTH (Lottery Ticket Hypothesis) [18, 19]. To enable consistent comparisons among different methods, we extend the original definition of winning tickets in [17] to ‘matching subnetworks’ [20] so as to cover different implementations of winning tickets, e.g., the use of early-epoch rewinding for model re-initialization [18] and the no-rewinding (i.e., fine-tuning) variant [66]. Briefly, a matching subnetwork should match or surpass the performance of the original dense model [20]. In this work, if a matching subnetwork is found that is better than the winning ticket obtained by the same method under the original LTH setup [18, 19], we also call such a matching subnetwork a winning ticket throughout the paper.
Figure 2: Illustration of the pros and cons of different pruning methods executed over (ResNet-20, CIFAR-10). Left: test accuracy vs. different pruning ratios for IMP and one-shot pruning methods (OMP [17], GRASP [23], SNIP [21], SYNFLOW [24]). Right: comparison of efficiency at different sparsity levels.
For example, the current state-of-the-art (SOTA) implementation of IMP in [22] can lead to a pruned ResNet-20 on CIFAR-10 with 74% sparsity and 92.12% test accuracy, matching the performance of the original dense model (marked in red in Fig. 2-Left). The IMP algorithm typically contains two key ingredients: (i) a temporally-evolving pruning schedule to progressively increase model sparsity over pruning iterations, and (ii) the pruning-retraining learning mechanism applied at each pruning iteration. With a target pruning ratio of $p\%$ and $T$ pruning iterations, an example pruning schedule in (i) could be as follows: each iteration prunes $(p\%)^{1/T}$ of the currently unpruned model weights, progressively pruning fewer weights in each iteration. For (ii), the unpruned weights in each pruning iteration are reset to the weights at initialization or at an early-training epoch [18], and retrained till convergence. In brief, IMP repeatedly prunes, resets, and trains the network over multiple iterations.
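To make ingredients (i)-(ii) concrete, below is a minimal sketch of the IMP loop in PyTorch-style pseudocode. The helpers `rewind` and `train_to_convergence`, and the geometric schedule that keeps a constant fraction of the remaining weights per round, are illustrative assumptions rather than the exact schedule of any particular implementation.

```python
import torch

def imp(model, init_state, target_sparsity, T):
    # `model` is the trained dense network; `init_state` is the checkpoint used
    # for rewinding (random init or an early-training epoch). Biases/BN params
    # would typically be excluded from pruning; they are included here for brevity.
    mask = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    keep_per_round = (1.0 - target_sparsity) ** (1.0 / T)   # geometric schedule
    for _ in range(T):
        # Score surviving weights by magnitude and keep only the largest ones.
        surviving = torch.cat([(p * mask[n]).abs().flatten()
                               for n, p in model.named_parameters()])
        n_keep = int(keep_per_round * int(sum(m.sum() for m in mask.values())))
        threshold = surviving[surviving > 0].topk(n_keep).values.min()
        for n, p in model.named_parameters():
            mask[n] = ((p * mask[n]).abs() >= threshold).float()
        rewind(model, init_state)           # (ii) reset the unpruned weights
        train_to_convergence(model, mask)   # (ii) retrain with the mask applied
    return mask
```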
However, winning tickets found by IMP incur significant computational costs. The sparsest winning ticket found by IMP in Fig. 2-Left (in red) uses T = 6 pruning iterations. As shown in Fig. 2-Right, this takes 3× more time than the original training of the dense model. To avoid the computational cost of IMP, different kinds of ‘accelerated’ pruning methods were developed [17, 21, 23–25, 29], and many fall into the one-shot pruning category: the network is directly pruned to the target sparsity and retrained once. In particular, OMP (one-shot magnitude pruning) is an important baseline that simplifies IMP [17]. It follows the pruning-retraining paradigm, but prunes the model to the target ratio in a single pruning iteration. Although one-shot pruning schemes are computationally cheap (Fig. 2-Right), they incur a significant accuracy drop compared to IMP (Fig. 2-Left). Even if IMP is customized with a reduced number of training epochs per pruning round, the pruning accuracy drops substantially (see Fig. A2). Hence, there is a need for advanced model pruning techniques that find winning tickets like IMP while being efficient like one-shot pruning.
Different from the unstructured weight pruning described above, structured pruning takes into consideration the sparse patterns of the model, such as filter and channel-wise pruning [13, 48, 51, 67–71]. Structured pruning is desirable for deep model deployment in the presence of hardware constraints [4, 34]. However, compared to unstructured pruning, it is usually more challenging to maintain the performance and find structure-aware winning tickets [28, 29].
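As a small illustration of the difference, the sketch below contrasts an unstructured (per-weight) mask with a filter-wise structured mask on a convolutional weight tensor; the shapes, the magnitude threshold, and the 50% filter cutoff are arbitrary assumptions.

```python
import torch

w = torch.randn(64, 32, 3, 3)            # conv weight: (out_ch, in_ch, kH, kW)

# Unstructured pruning: zero out individual low-magnitude weights.
unstructured_mask = (w.abs() > 0.05).float()

# Filter-wise structured pruning: score each output filter as a whole and
# remove entire filters, a pattern that maps more directly onto hardware speedups.
filter_scores = w.abs().mean(dim=(1, 2, 3))                      # one score per filter
filter_mask = (filter_scores > filter_scores.median()).float().view(-1, 1, 1, 1)
structured_w = w * filter_mask
```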
Open question.
As discussed above, one-shot pruning methods are unable to match the predictive performance of IMP, and structure-aware winning tickets are hard to find. Clearly, the optimization foundation of model pruning is underdeveloped. Thus, we ask:

Is there an optimization basis for a successful pruning algorithm that can attain high pruned model accuracy (like IMP) and high computational efficiency (like one-shot pruning)?

The model pruning problem has a natural hierarchical structure: we need to find the best mask to prune model parameters, and then, given the mask, find the best model weights for the unpruned model parameters. Given this hierarchical structure, we believe that the bi-level optimization (BLO) framework is a promising optimization basis for a successful pruning algorithm.
Bi-level optimization (BLO).
BLO is a general hierarchical optimization framework in which the upper-level problem depends on the solution to the lower-level problem. Such a nested structure makes BLO in its most generic form very difficult to solve, since it is hard to characterize the influence of the lower-level optimization on the upper-level problem. A large body of existing literature focuses on the design and analysis of algorithms for special cases of BLO, ranging from the classical approximate descent methods [72–74] and penalty-based methods [75, 76] to the recent BigSAM [77] and its extensions [78, 79]. BLO has also been studied in the context of stochastic algorithms [80–82] and back-propagation based algorithms [83–85], and has advanced adversarial training [86], meta-learning [87], data poisoning attack generation [88], neural architecture search [89], as well as reinforcement learning [90]. Although BLO was referred to in [50] for model pruning, that work actually used an ordinary alternating optimization procedure, without taking the hierarchical learning structure of BLO into consideration. To the best of our knowledge, the BLO framework has not been considered for model pruning in depth and systematically. We will show that model pruning yields a special class of BLO problems with bi-linear optimization variables. We will also theoretically show that this specialized BLO problem for model pruning can be solved as efficiently as first-order optimization. This is in sharp contrast to existing BLO applications that rely on heuristics-based BLO solvers (e.g., gradient unrolling in meta-learning [87] and neural architecture search [89, 91]).
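For reference, the heuristic gradient-unrolling approach mentioned above approximates the lower-level solution by a small number of differentiable gradient steps and back-propagates through them; a one-step sketch in PyTorch is shown below, with hypothetical scalar loss functions `g_lower` (lower level) and `f_upper` (upper level) and a continuous upper-level variable. BIP will avoid this kind of unrolling by exploiting the implicit gradient and the bi-linear structure of pruning.

```python
import torch

def one_step_unrolled_grad(f_upper, g_lower, m, theta, lr=0.1):
    # Approximate the lower-level solution by a single differentiable SGD step,
    # then differentiate the upper-level objective through that step w.r.t. m.
    # (Assumes m and theta are continuous tensors with requires_grad=True.)
    grad_theta = torch.autograd.grad(g_lower(m, theta), theta, create_graph=True)[0]
    theta_hat = theta - lr * grad_theta
    return torch.autograd.grad(f_upper(m, theta_hat), m)[0]
```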
3 BIP: Model Pruning via Bi-level Optimization
In this section, we re-investigate the problem of model pruning through the lens of BLO and develop the bi-level pruning (BIP) algorithm. We theoretically show that BIP can be solved as easily as first-order alternating optimization by taking advantage of the bi-linearity of the pruning variables.
A BLO viewpoint on model pruning.
As described in the previous section, the pruning-retraining learning paradigm covers two kinds of tasks: ① pruning, which determines the sparse pattern of model weights, and ② training the remaining non-zero weights to recover the model accuracy. In existing optimization-based pruning methods [92–95], the tasks ①-② are typically achieved by optimizing the model weights while penalizing their sparsity-inducing norms, e.g., the $\ell_1$ and $\ell_2$ norms [96]. Different from the above formulation, we propose to separate the optimization variables involved in the pruning tasks ① and ②. This leads to the (binary) pruning mask variable $m \in \{0,1\}^n$ and the model weight variable $\theta \in \mathbb{R}^n$, where $n$ denotes the total number of model parameters. Accordingly, the pruned model is given by $(m \odot \theta)$, where $\odot$ denotes element-wise multiplication. As will be evident later, this form of variable disentanglement enables us to explicitly depict how the pruning and retraining processes co-evolve, and helps customize the pruning task with high flexibility.
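A minimal sketch of this disentangled parameterization is given below; the names are hypothetical, and `forward_fn` is assumed to map a flat weight vector and an input batch to logits.

```python
import torch
import torch.nn.functional as F

def masked_loss(theta, m, x, y, forward_fn):
    # theta: flat weight vector of length n; m: mask of the same length.
    # The network is evaluated at the element-wise product m * theta, so the
    # pruning variable m and the retraining variable theta remain decoupled.
    logits = forward_fn(m * theta, x)
    return F.cross_entropy(logits, y)
```

Keeping $m$ and $\theta$ as separate arguments is what allows them to be assigned to different optimization levels below.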
We next elaborate on how BLO can be established to co-optimize the pruning mask $m$ and the retrained sparse model weights $\theta$. Given the pruning ratio $p\%$, the sparsity constraint is given by $m \in \mathcal{S}$, where $\mathcal{S} = \{\, m \mid m \in \{0,1\}^n, \; \mathbf{1}^\top m \le k \,\}$ and $k = (1 - p\%)\, n$. Our goal is to prune the original model directly to the targeted pruning ratio $p\%$ (i.e., without calling for the IMP-like sparsity schedule described in Sec. 2) and obtain the optimized sparse model $(m \odot \theta)$. To this end, we interpret the pruning task (i.e., ①) and the model retraining task (i.e., ②) as two optimization levels, where the former is formulated as an upper-level optimization problem that relies on the solution of the lower-level retraining task. We thus cast the model pruning problem as the following BLO problem (with ② nested inside ①):

$$
\underset{m \in \mathcal{S}}{\text{minimize}} \;\; \underbrace{\ell\big(m \odot \theta(m)\big)}_{\text{①: pruning task}}
\quad \text{subject to} \quad
\theta(m) = \underset{\theta \in \mathbb{R}^n}{\arg\min} \;\; \underbrace{\overbrace{\ell(m \odot \theta) + \tfrac{\gamma}{2}\,\|\theta\|_2^2}^{:=\, g(m,\theta)}}_{\text{②: sparsity-fixed model retraining}},
\qquad (1)
$$
where $\ell$ denotes the training loss (e.g., cross-entropy), $m$ and $\theta$ are the upper-level and lower-level optimization variables respectively, $\theta(m)$ signifies the lower-level solution obtained by minimizing the objective function $g$ given the pruning mask $m$, and $\gamma > 0$ is a regularization parameter introduced to convexify the lower-level optimization so as to stabilize the gradient flow from $\theta(m)$ to $m$ and thus the convergence of BLO [82, 97]. In sharp contrast to existing single-level optimization-based model pruning methods [92–95], the BLO formulation (1) brings in two advantages.
First, BLO has the flexibility to use mismatched pruning and retraining objectives at the upper and lower optimization levels, respectively. This flexibility allows us to regularize the lower-level training objective in (1) and customize the implemented optimization methods at both levels. To be more specific, one can update the upper-level pruning mask $m$ using a data batch (called $\mathcal{B}_2$) distinct from the one (called $\mathcal{B}_1$) used for obtaining the lower-level solution $\theta(m)$. The resulting BLO procedure can then mimic the idea of meta-learning to improve model generalization [98], where the lower-level problem fine-tunes $\theta$ using $\mathcal{B}_1$, and the upper-level problem validates the generalization of the sparsity-aware fine-tuned model $(m \odot \theta(m))$ using $\mathcal{B}_2$.
Second, BLO enables us to explicitly model and optimize the coupling between the retrained model weights $\theta(m)$ and the pruning mask $m$ through an implicit gradient (IG)-based optimization routine. Here IG refers to the gradient of the lower-level solution $\theta(m)$ with respect to (w.r.t.) the upper-level variable $m$, and its derivation calls for the implicit function theory [76]. The use of IG makes our proposed BLO-oriented pruning (1) significantly different from greedy alternating minimization [99], which learns the upper-level and lower-level variables independently (i.e., minimizes over one variable while fixing the other). We refer readers to the following section for the detailed IG theory. We will also show in Sec. 4 that the pruning strategy from (1) can outperform IMP in many pruning scenarios while being much more efficient, as it does not call for a schedule of iterative pruning ratios.
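Before deriving the implicit gradient, the two levels of (1) and the optional batch split $\mathcal{B}_1/\mathcal{B}_2$ can be summarized by the following sketch of the two objective functions (illustrative names only; this depicts the formulation, not the BIP update, which additionally uses the IG correction derived next).

```python
import torch
import torch.nn.functional as F

def lower_objective(m, theta, batch_B1, forward_fn, gamma):
    # g(m, theta): training loss of the masked model on B1 plus the quadratic
    # regularizer; minimized over theta with the mask m held fixed.
    x, y = batch_B1
    return (F.cross_entropy(forward_fn(m * theta, x), y)
            + 0.5 * gamma * theta.pow(2).sum())

def upper_objective(m, theta_of_m, batch_B2, forward_fn):
    # l(m * theta(m)): the pruning loss, evaluated with the retrained weights
    # theta(m) on a (possibly distinct) batch B2.
    x, y = batch_B2
    return F.cross_entropy(forward_fn(m * theta_of_m, x), y)
```

Because $\theta(m)$ is itself defined through the minimization of $g$, differentiating the upper-level objective w.r.t. $m$ involves the implicit gradient discussed below; simply alternating independent gradient steps on these two functions would ignore that coupling.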
Optimization foundation of BIP.
The key optimization challenge in solving the BIP problem (1) lies in the computation of the IG (implicit gradient). Prior to developing an effective solution, we first elaborate on the IG challenge, the unique characteristic of BLO. In the context of gradient descent, the gradient of the objective function in (1) yields

$$
\underbrace{\frac{d\, \ell\big(m \odot \theta(m)\big)}{d m}}_{\text{gradient of objective}}
= \nabla_{m}\, \ell\big(m \odot \theta(m)\big)
+ \underbrace{\frac{d\, \theta(m)^{\top}}{d m}}_{\text{IG}} \, \nabla_{\theta}\, \ell\big(m \odot \theta(m)\big),
\qquad (2)
$$

where $\nabla_{m}$ and $\nabla_{\theta}$ denote the partial derivatives of the bi-variate function $\ell(m \odot \theta)$ w.r.t. the variables $m$ and $\theta$ respectively, $d\theta^{\top}/dm \in \mathbb{R}^{n \times n}$ denotes the vector-wise full derivative, and for ease of notation we omit the transpose $\top$ when the context is clear. In (2), the IG challenge refers to the demand for computing the full gradient of the implicit function $\theta(m) = \arg\min_{\theta} g(m, \theta)$ w.r.t. $m$, where we recall from (1) that $g(m, \theta) := \ell(m \odot \theta) + \frac{\gamma}{2}\|\theta\|_2^2$.
Next, we derive the IG formula following the rigorous implicit function theory [76, 82, 87]. Based on the fact that $\theta(m)$ satisfies the stationarity condition of the lower-level objective function $g$ in (1), it is not difficult to obtain (see the derivation in Appendix A)

$$
\frac{d\, \theta(m)}{d m}
= -\nabla_{m\theta}^{2}\, \ell(m \odot \theta)\,
\big[\, \nabla_{\theta}^{2}\, \ell(m \odot \theta) + \gamma I \,\big]^{-1},
\qquad (3)
$$

where $\nabla_{m\theta}^{2}\ell$ and $\nabla_{\theta}^{2}\ell$ denote the second-order partial derivatives of $\ell$, and $(\cdot)^{-1}$ denotes the matrix inversion operation.
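As a side note on computation, the action of the mixed second-derivative term in (3) on a vector, which is what the IG term in (2) ultimately requires, can be obtained by double back-propagation without ever materializing the $n \times n$ matrix. A minimal PyTorch sketch, assuming $m$ is relaxed to a continuous tensor with `requires_grad=True` (hypothetical names, not the authors' implementation):

```python
import torch

def cross_second_derivative_vp(loss, theta, m, v):
    # Computes (d^2 loss / dm dtheta) @ v via double backprop. `loss` must be
    # built from m and theta so that both participate in the autograd graph.
    g_theta = torch.autograd.grad(loss, theta, create_graph=True)[0]
    return torch.autograd.grad((g_theta * v.detach()).sum(), m)[0]
```

This handles the mixed-derivative factor; the inverse term in (3) is what the simplification below removes.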
Yet, the exact IG formula (3) remains difficult to calculate due to the presence of matrix inversion and second-order partial derivatives. To simplify it, we impose the Hessian-free assumption, $\nabla_{\theta}^{2}\ell = \mathbf{0}$, which is mild in general; for example, the decision boundaries of neural networks with ReLU