
Among various proposed model pruning algorithms [5, 9, 11, 14–27], the heuristics-based Iterative
Magnitude Pruning (IMP) is the current dominant approach to achieving model sparsity without
suffering performance loss, as suggested and empirically justified by the Lottery Ticket Hypothesis
(LTH) [17]. The LTH hypothesizes the existence of a subnetwork (the so-called ‘winning ticket’) that, when trained in isolation (e.g., from either a random initialization [17] or an early-rewinding point of dense model training [19]), can match the performance of the original dense model [5, 11, 17–20].
The core idea of IMP is to alternate pruning and retraining, removing a small fraction of the remaining weights in each iteration until the desired pruning ratio is reached. While IMP often finds the winning ticket, it incurs the cost of repeated model retraining from scratch, making it prohibitively expensive for large datasets or large model
architectures [28, 29]. To improve the efficiency of model pruning, numerous heuristics-based one-shot pruning methods [17, 21–25], e.g., one-shot magnitude pruning (OMP), have been proposed.
These schemes directly prune the model to the target sparsity and are significantly more efficient than
IMP. Yet, the promise of current one-shot pruning methods is dataset/model-specific [22, 30] and mostly lies in the low pruning ratio regime [25]. As systematically studied in [22, 31], there exists
a clear performance gap between one-shot pruning and IMP. As an alternative to heuristics-based
schemes, optimization-based pruning methods [9, 15, 16, 26, 27] still follow the pruning-retraining paradigm and adopt sparsity regularization [32, 33] or parameterized masking [9, 16, 27] to prune models efficiently. However, these methods do not always match the accuracy of IMP and thus have not been widely used to find winning tickets [17, 21, 23–25, 29]. The empirical results showing that
optimization underperforms heuristics motivate us to revisit the algorithmic fundamentals of pruning.
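To make the pruning-retraining loop concrete, below is a minimal sketch of IMP and OMP under simplifying assumptions: global magnitude pruning over a list of NumPy weight tensors, a hypothetical train(init_weights, masks) callable that retrains the model from the chosen rewind point under a fixed mask, and a fixed per-round pruning ratio; the exact schedules in the cited works may differ.

import numpy as np

def magnitude_mask(weights, sparsity):
    # Global magnitude pruning: zero out the `sparsity` fraction of
    # smallest-magnitude entries across all weight tensors.
    flat = np.abs(np.concatenate([w.ravel() for w in weights]))
    k = int(sparsity * flat.size)                      # number of entries to remove
    if k == 0:
        return [np.ones_like(w) for w in weights]
    threshold = np.sort(flat)[k - 1]                   # magnitude of the k-th smallest entry
    return [(np.abs(w) > threshold).astype(w.dtype) for w in weights]

def imp(init_weights, train, target_sparsity, per_round=0.2):
    # Iterative Magnitude Pruning: prune a small fraction of the remaining
    # weights per round, retraining from the rewind point after every round.
    masks = [np.ones_like(w) for w in init_weights]
    sparsity = 0.0
    while sparsity < target_sparsity:
        weights = train(init_weights, masks)           # repeated, costly retraining
        sparsity = min(1.0 - (1.0 - sparsity) * (1.0 - per_round), target_sparsity)
        masks = magnitude_mask([w * m for w, m in zip(weights, masks)], sparsity)
    return train(init_weights, masks), masks           # final retraining at target sparsity

def omp(init_weights, train, target_sparsity):
    # One-shot magnitude pruning: train once, prune to the target, retrain once.
    weights = train(init_weights, [np.ones_like(w) for w in init_weights])
    masks = magnitude_mask(weights, target_sparsity)
    return train(init_weights, masks), masks

The cost gap discussed above follows directly from this structure: OMP calls train twice, whereas IMP calls it once per pruning round, and many rounds are needed to reach high sparsity when only a small fraction of the remaining weights is removed per round.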
Figure 1: A performance snapshot of the proposed BIP method vs. the IMP baseline and the original dense model across three pruned ResNet models (ResNet-20, ResNet-56, and ResNet-18) with 74% sparsity on CIFAR-10. The marker size indicates the relative model size. The uni-color region corresponds to the same model type used by different pruning methods.
To this end, we put forth a novel perspective of model
pruning as a bi-level optimization (BLO) problem. In
this new formulation, we show that BLO provides a
technically-grounded optimization basis for an efficient
implementation of the pruning-retraining paradigm, the
key algorithmic component used in IMP. To the best
of our knowledge, we make the first rigorous connection between model pruning and BLO. Technically, we propose a novel bi-level optimization-enabled pruning method (termed BIP). We further show how BIP takes
advantage of the bi-linearity of the pruning problem to
avoid the computational challenges of common BLO
methods, and is as efficient as any first-order alternating
optimization scheme. Practically, we demonstrate the
superiority of the proposed BIP in terms of accuracy,
sparsity, and computation efficiency through extensive
experiments. BIP finds the best winning ticket in nearly all settings while taking time comparable
to the one-shot OMP. In Fig. 1, we present a snapshot of our empirical results for CIFAR-10 with
3 ResNet architectures at a 74% pruning ratio. In all cases, BIP finds winning tickets, improving accuracy over the dense model and matching IMP, while being up to 5× faster than IMP.
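As a rough illustration of this perspective (a sketch only; the precise constraint sets and any lower-level regularization used by BIP may differ), let ℓ denote the training loss, θ the model weights, m a binary pruning mask with at most k nonzero entries, and ⊙ the elementwise product. Pruning can then be cast as

\[
\min_{\mathbf{m} \in \{0,1\}^{n},\ \|\mathbf{m}\|_{0} \le k}\ \ell\bigl(\mathbf{m} \odot \boldsymbol{\theta}^{*}(\mathbf{m})\bigr)
\quad \text{subject to} \quad
\boldsymbol{\theta}^{*}(\mathbf{m}) \in \operatorname*{arg\,min}_{\boldsymbol{\theta}}\ \ell\bigl(\mathbf{m} \odot \boldsymbol{\theta}\bigr),
\]

where the upper level searches for the pruning mask m and the lower level retrains the weights θ under that mask, which is exactly the disentangled pruning-retraining interface referred to above.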
Our contributions can be summarized as follows:
• (Formulation) We rethink the algorithmic foundation of model pruning through the lens of BLO (bi-level optimization). The new BLO-oriented formulation disentangles the pruning and retraining variables, providing the flexibility to design the interface between pruning and retraining.
• (Algorithm) We propose the new bi-level pruning (BIP) algorithm, built upon the aforementioned BLO formulation and implicit gradient-based optimization theory (a generic alternating-update sketch is given after this list). Unlike computationally intensive standard BLO solvers, we theoretically show that BIP is as efficient as first-order optimization by taking advantage of the bi-linear nature of the pruning variables.
• (Experiments) We conduct extensive experiments across 4 datasets (CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet), 5 model architectures, and 3 pruning settings (unstructured pruning, filter-wise structured pruning, and channel-wise structured pruning). We show that (i) BIP achieves higher test accuracy than IMP and finds the best winning tickets in nearly all settings, (ii) BIP is highly efficient (comparable to one-shot pruning schemes), achieving a 2–7× speedup over IMP at the same level of model accuracy and sparsity, and (iii) BIP is able to find subnetworks that outperform the dense model regardless of initialization rewinding.
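To illustrate the efficiency claim in the Algorithm bullet, the following is a generic first-order alternating sketch, not BIP's exact update rule: the lower level takes a gradient step on the weights with the mask fixed, and the upper level takes a gradient step on continuous mask scores followed by a hard top-k projection. The callables grad_lower and grad_upper and the helper top_k_mask are hypothetical placeholders; in particular, BIP's upper-level gradient also involves an implicit-gradient term, which the bi-linear structure keeps cheap to compute.

import numpy as np

def top_k_mask(scores, k):
    # Hard projection: keep the k largest scores as a binary mask.
    mask = np.zeros_like(scores)
    mask[np.argsort(-scores)[:k]] = 1.0
    return mask

def alternating_step(theta, scores, grad_lower, grad_upper, k,
                     lr_theta=0.1, lr_scores=0.01):
    # One alternating update: a lower-level weight step with the mask fixed,
    # then an upper-level score step, then re-projection to a binary mask.
    mask = top_k_mask(scores, k)
    theta = theta - lr_theta * mask * grad_lower(mask * theta)   # lower level (retraining)
    scores = scores - lr_scores * grad_upper(mask, theta)        # upper level (pruning)
    return theta, scores, top_k_mask(scores, k)

Each iteration requires only first-order information, so the overall cost is comparable to a single training run rather than the repeated retraining loop of IMP.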