GMPF: Well-Tuned Gradual Magnitude Pruning Can Outperform Most
BERT-Pruning Methods
Eldar Kurtic1 and Dan Alistarh1,2
1Institute of Science and Technology Austria
2Neural Magic Inc.
Abstract
We revisit the performance of the classic gradual magnitude pruning (GMP) baseline for large language models, focusing on the classic BERT benchmark on various popular tasks. Despite existing evidence in the literature that GMP performs poorly, we show that a simple and general variant, which we call GMPF, can match and sometimes outperform more complex state-of-the-art methods. Our results provide a simple yet strong baseline for future work, highlight the importance of parameter tuning for baselines, and even improve the performance of the state-of-the-art second-order pruning method in this setting.
1 Introduction
The massive recent growth of the computational cost of accurate deep learning models, in particular large language models (LLMs), has motivated the development of several advanced model compression techniques (Hoefler et al., 2021; Gholami et al., 2021), encompassing unstructured and structured pruning, quantization, and knowledge distillation. In this paper, we focus on unstructured pruning, for which we follow the standard pipeline: models are first pre-trained on a large upstream corpus of unlabelled text, and then fine-tuned in a supervised manner on a smaller downstream task, such as question answering or text classification. In the context of compression, this pipeline led to two paradigms: 1) upstream pruning, followed by fine-tuning of the remaining weights on a downstream task, and 2) downstream pruning, i.e., pruning and fine-tuning directly on the downstream task.
A tempting baseline approach in most settings is gradual magnitude pruning (GMP) (Hagiwara, 1994; Zhu and Gupta, 2017), that is, periodically removing the smallest fraction of weights during training, possibly interspersed with fine-tuning steps designed to recover accuracy. GMP has been shown to be an extremely strong baseline in the context of computer vision (Gale et al., 2019; Hoefler et al., 2021). However, the literature on pruning LLMs, and in particular BERT models (Sanh et al., 2020; Chen et al., 2020; Zafrir et al., 2021), clearly suggests that GMP does not perform well.

Corresponding author: eldar.kurtic@ist.ac.at.

Figure 1: Performance of state-of-the-art unstructured pruning methods relative to the dense BERTBASE model at high sparsities on two tasks, SQuADv1.1 and MNLI.
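To make the pruning step concrete, below is a minimal sketch of one global magnitude-pruning step in PyTorch. The function name, the restriction to 2D weight matrices, and the global (rather than per-layer) threshold are our own illustrative assumptions, not details taken from any of the methods compared here.

```python
import torch

def magnitude_prune_step(model, sparsity):
    """Zero out the smallest-magnitude weights so that `sparsity`
    (a fraction in [0, 1]) of all prunable weights becomes zero."""
    # Treat all 2D weight matrices as prunable (dense/attention layers).
    prunable = [p for p in model.parameters() if p.dim() == 2]
    scores = torch.cat([p.detach().abs().flatten() for p in prunable])
    # Global threshold: the k-th smallest absolute weight value.
    k = max(1, int(sparsity * scores.numel()))
    threshold = torch.kthvalue(scores, k).values
    masks = []
    for p in prunable:
        mask = (p.detach().abs() > threshold).to(p.dtype)
        p.data.mul_(mask)   # remove the pruned weights
        masks.append(mask)  # keep masks to re-apply after optimizer steps
    return masks
```

In gradual pruning, a step like this is applied at regular intervals with an increasing sparsity target, while training continues in between so the network can recover.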
Contribution.
In this paper, we re-examine this conclusion and investigate whether GMP can be a competitive baseline, once carefully tuned. Specifically, we show that a well-tuned variant, which we call GMPF, can produce highly accurate and sparse language models in both upstream and downstream pruning regimes, matching or even outperforming more complex methods. We explore the effects of the crucial parameters of gradual pruning, and provide simple and intuitive guidelines on how to integrate them in a principled manner.
Our results are summarized in Figure 1, which presents the performance of state-of-the-art unstructured pruning techniques on two benchmarks. Specifically, we compare GMPF with the Lottery Ticket approach (Chen et al., 2020), Movement Pruning (MvP) (Sanh et al., 2020) (as well as its GMP baseline GMPMvP), upstream Prune OFA (Zafrir et al., 2021), as well as the recently-proposed second-order pruning method oBERT (Kurtic et al., 2022). We observe that: 1) for both benchmarks, GMPF is second only to the more complex oBERT method; 2) GMPF in fact outperforms the highly competitive Prune OFA and MvP methods; and 3) GMPF outperforms both Lottery Tickets and GMPMvP by extremely wide margins.
Prior Work.
Following the vast BERT-pruning literature, we focus on unstructured pruning of the BERTBASE model (Devlin et al., 2019). As previously noted, upstream and downstream pruning paradigms exist, and methods are usually developed and specialized for only one of the two: for example, Movement Pruning (MvP) (Sanh et al., 2020; Lagunas et al., 2021) for downstream pruning and Prune Once for All (Prune OFA) (Zafrir et al., 2021) for upstream pruning. The simplicity and generality of GMP make it suitable for both paradigms, without any regime-specific modifications. Newer and more advanced pruning techniques, which, contrary to GMP, are able to leverage gradients (Sanh et al., 2020; Lagunas et al., 2021), loss curvature (Kurtic et al., 2022), or a compute-intensive pre-training setup (Zafrir et al., 2021), are built on the premise that the simple magnitude-based GMP method falters when applied to BERT pruning. In this work, contrary to what is currently available in the literature, we present empirical evidence that GMP, when tuned carefully, can produce very accurate sparse models which are competitive with or even better than most state-of-the-art pruning techniques across both regimes (upstream and downstream). As can be seen from Figure 1 and our later results, we massively improve upon existing GMP-based pruning baselines, in some cases by more than 20 accuracy points.
2 Competitive Gradual Magnitude Pruning (GMPF)
Figure 2: Learning rate and sparsity schedules for the proposed gradual pruning framework.

Experimental setup.
We focus our attention on the standard BERTBASE model, composed of embedding and encoder layers, which has approximately 110M parameters. All methods focus on pruning the approximately 85M weights of the encoder layers, and report sparsities with respect to that number. We evaluate models on the validation split of the respective dataset, and, to improve confidence in the obtained results, we perform multiple runs with different seeds and report mean performance.
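As a quick sanity check of these numbers, the snippet below counts the encoder weight matrices of BERTBASE with the HuggingFace transformers library; restricting the count to 2D encoder weights is our own assumption about what "prunable encoder weights" means here, not a detail specified in the paper.

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")

# 2D weight matrices inside the 12 Transformer encoder layers (~85M weights).
encoder_weights = sum(
    p.numel() for name, p in model.named_parameters()
    if name.startswith("encoder.") and p.dim() == 2
)
# All parameters, including embeddings, biases, and LayerNorms (~110M).
total_params = sum(p.numel() for p in model.parameters())

print(f"encoder weight matrices: {encoder_weights / 1e6:.1f}M")
print(f"total parameters:        {total_params / 1e6:.1f}M")
```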
2.1 Downstream pruning
Following the literature, we consider three popular tasks: question answering on SQuADv1.1 (Rajpurkar et al., 2016), recognition of textual entailment on MNLI (Williams et al., 2017), and duplicate question detection on QQP (Iyer et al., 2017). We now describe the most important components of the gradual pruning framework that enabled us to attain these large improvements.
Sparsity schedule.
In all of our gradual runs, there is no pruning during the first two and the last two epochs: the former fine-tunes the pre-trained model, and the latter fine-tunes the sparse model with the fixed mask. In between, GMPF follows the cubic sparsity schedule (Zhu and Gupta, 2017) and prunes weights ten times per epoch. Motivated by the fact that BERTBASE is heavily overparametrized for downstream tasks, we deviate from the standard cubic schedule by introducing a large first pruning step. This proved to be of crucial importance when pruning the model to high target sparsities (e.g. 97%), as it leaves more time to recover from the later pruning steps, which are much more difficult. In Table 8 we report results from an ablation study with respect to the size of the initial step. For convenience, we visualize the sparsity schedule in Figure 2. Our preliminary experiments showed similar performance between uniform and global sparsity distributions, so we use the former.
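For illustration, here is a minimal sketch of such a schedule as a function of the training step, assuming the standard cubic interpolation of Zhu and Gupta (2017) modified with a large first pruning step; the default initial_sparsity value and the parameter names are our own illustrative choices, not the exact settings from the ablation in Table 8.

```python
def target_sparsity(step, total_steps, steps_per_epoch,
                    final_sparsity, initial_sparsity=0.5,
                    warmup_epochs=2, cooldown_epochs=2):
    """Cubic sparsity schedule with a large first pruning step.

    No pruning during the first `warmup_epochs` (dense fine-tuning) and
    the last `cooldown_epochs` (sparse fine-tuning with a fixed mask)."""
    prune_start = warmup_epochs * steps_per_epoch
    prune_end = total_steps - cooldown_epochs * steps_per_epoch
    if step < prune_start:
        return 0.0
    if step >= prune_end:
        return final_sparsity
    # Fraction of the gradual-pruning phase completed so far.
    progress = (step - prune_start) / (prune_end - prune_start)
    # At progress = 0 this jumps directly to `initial_sparsity` (the large
    # first step), then closes the remaining gap cubically.
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3
```

With ten pruning steps per epoch, the mask would be recomputed (e.g., with a step like `magnitude_prune_step` above) roughly every `steps_per_epoch // 10` training steps, using the value returned by this function as the target sparsity.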
Learning rate schedule.
Our goal is to provide a simple baseline setup that works well across a wide range of datasets without any additional task-dependent tuning. Currently, papers either report best results following an extensive hyperparameter