
Our results are summarized in Figure 1, which presents the performance of state-of-the-art unstructured pruning techniques on two benchmarks. Specifically, we compare GMP⋆ with the Lottery Ticket approach (Chen et al., 2020), Movement Pruning (MvP) (Sanh et al., 2020) (as well as its GMP baseline, GMP_MvP), upstream Prune OFA (Zafrir et al., 2021), as well as the recently-proposed second-order pruning method oBERT (Kurtic et al., 2022). We observe that: 1) for both benchmarks, GMP⋆ is second only to the more complex oBERT method; 2) GMP⋆ in fact outperforms the highly competitive Prune OFA and MvP methods; and 3) GMP⋆ outperforms both Lottery Tickets and GMP_MvP by extremely wide margins.
Prior Work.
Following the vast BERT-pruning literature, we focus on unstructured pruning of the BERT-base model (Devlin et al., 2019). As previously noted, upstream and downstream pruning paradigms exist, and methods are usually developed and specialized for only one of the two: for example, Movement Pruning (MvP) (Sanh et al., 2020; Lagunas et al., 2021) for downstream pruning, and Prune Once for All (Prune OFA) (Zafrir et al., 2021) for upstream pruning. The simplicity and generality of GMP make it suitable for both paradigms, without any regime-specific modifications. Newer and more advanced pruning techniques, which, contrary to GMP, leverage gradients (Sanh et al., 2020; Lagunas et al., 2021), loss curvature (Kurtic et al., 2022), or a compute-intensive pre-training setup (Zafrir et al., 2021), are built on the premise that the simple magnitude-based GMP method falters when applied to BERT pruning. In this work, contrary to what is currently available in the literature, we present empirical evidence that GMP, when tuned carefully, can produce very accurate sparse models that are competitive with or even better than most state-of-the-art pruning techniques across both regimes (upstream and downstream). As can be seen from Figure 1 and our later results, we massively improve upon existing GMP-based pruning baselines, in some cases by more than 20 accuracy points.
2 Competitive Gradual Magnitude Pruning (GMP⋆)
Experimental setup.
We focus our attention on the standard BERT-base model, composed of embedding and encoder layers, which has approximately 110M parameters. All methods focus on pruning the approximately 85M weights of the encoder layers and report sparsities with respect to that number. We evaluate models on the validation split of the respective dataset, and to improve confidence in the obtained results we perform multiple runs with different seeds and report mean performance.

Figure 2: Learning rate and sparsity schedules for the proposed gradual pruning framework.
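To make the sparsity accounting concrete, the snippet below is a minimal sketch of how the roughly 85M prunable encoder weights can be counted and how sparsity is measured relative to them; it assumes the HuggingFace transformers library and the bert-base-uncased checkpoint, which are illustrative choices rather than details specified here.

# Minimal sketch: count the ~85M prunable encoder weight matrices of BERT-base
# and report sparsity relative to that count (embeddings are excluded).
# Assumes the HuggingFace `transformers` library and `bert-base-uncased`.
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")

# All 2D weight matrices inside the Transformer encoder: Q, K, V, attention
# output, and the two feed-forward projections in each of the 12 layers.
encoder_weights = [
    p for name, p in model.named_parameters()
    if name.startswith("encoder.") and name.endswith(".weight") and p.dim() == 2
]

total = sum(p.numel() for p in encoder_weights)             # ~85M parameters
zeros = sum((p == 0).sum().item() for p in encoder_weights)
print(f"prunable encoder weights: {total / 1e6:.1f}M, "
      f"sparsity: {zeros / total:.2%}")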
2.1 Downstream pruning
Following the literature, we consider three popular tasks: question answering on SQuADv1.1 (Rajpurkar et al., 2016), recognition of textual entailment on MNLI (Williams et al., 2017), and duplicate question detection on QQP (Iyer et al., 2017). We now reflect upon the most important constituents of the gradual pruning framework that enabled us to attain these massive improvements.
Sparsity schedule.
In all of our gradual runs, there is no pruning during the first two and the last two epochs: the former fine-tune the pre-trained model, and the latter fine-tune the sparse model with the fixed mask. In between the two, GMP⋆ follows the cubic sparsity schedule (Zhu and Gupta, 2017) and prunes weights ten times per epoch. Motivated by the fact that BERT-base is heavily overparametrized for downstream tasks, we deviate from the standard cubic schedule by introducing a large first pruning step. This proved to be of crucial importance when pruning the model to high target sparsities (e.g. 97%), as it leaves more time to recover from the later pruning steps, which are much more difficult. In Table 8 we report results from an ablation study with respect to the size of the initial step. For convenience, we visualize the sparsity schedule in Figure 2. Our preliminary experiments showed similar performance between uniform and global sparsity distributions, so we use the former.
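As an illustration, the following sketch implements the modified cubic schedule described above: no pruning in the first and last two epochs, a large first pruning step, and ten discrete pruning steps per epoch in between. The number of epochs and the initial and target sparsities are illustrative placeholders, not our exact configuration.

# Sketch of the gradual sparsity schedule: dense fine-tuning for the first two
# epochs, a large first pruning step, cubic interpolation to the target sparsity
# with ten pruning steps per epoch, and fine-tuning with a fixed mask at the end.
# All default values below are illustrative.
def sparsity_at(epoch, num_epochs=30, s_first=0.70, s_final=0.97,
                warmup_epochs=2, cooldown_epochs=2, steps_per_epoch=10):
    start, end = warmup_epochs, num_epochs - cooldown_epochs
    if epoch < start:            # dense fine-tuning, no pruning yet
        return 0.0
    if epoch >= end:             # fixed mask, fine-tune the sparse model
        return s_final
    # Pruning is applied at discrete steps, ten times per epoch.
    total_steps = (end - start) * steps_per_epoch
    step = int((epoch - start) * steps_per_epoch)
    frac = step / max(total_steps - 1, 1)   # 0 at the first step, 1 at the last
    # Cubic schedule (Zhu and Gupta, 2017), starting from the large first
    # pruning step s_first instead of zero sparsity.
    return s_final + (s_first - s_final) * (1.0 - frac) ** 3

For example, sparsity_at(2.0) returns the large first step (70% here), after which the schedule increases smoothly to the 97% target before the final fine-tuning phase with a fixed mask.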
Learning rate schedule.
Our goal is to provide
a simple baseline setup that works well across a wide range of datasets without any additional task-
dependent tuning. Currently, papers either report
best results following an extensive hyperparameter