
Our results are summarized in Figure 1, which presents the performance of state-of-the-art unstructured pruning techniques on two benchmarks. Specifically, we compare GMP⋆ with the Lottery Ticket approach (Chen et al., 2020), Movement Pruning (MvP) (Sanh et al., 2020) (as well as its GMP baseline, GMP_MvP), upstream Prune OFA (Zafrir et al., 2021), as well as the recently-proposed second-order pruning method oBERT (Kurtic et al., 2022). We observe that: 1) for both benchmarks, GMP⋆ is second only to the more complex oBERT method; 2) GMP⋆ in fact outperforms the highly competitive Prune OFA and MvP methods; and 3) GMP⋆ outperforms both Lottery Tickets and GMP_MvP by extremely wide margins.
Prior Work.
Following the vast BERT-pruning literature, we focus on unstructured pruning of the BERT-base model (Devlin et al., 2019). As previously noted, upstream and downstream pruning paradigms exist, and methods are usually developed and specialized for only one of the two: for example, Movement Pruning (MvP) (Sanh et al., 2020; Lagunas et al., 2021) for downstream pruning, and Prune Once for All (Prune OFA) (Zafrir et al., 2021) for upstream pruning. The simplicity and generality of GMP make it suitable for both paradigms, without any regime-specific modifications. Newer and more advanced pruning techniques, which, contrary to GMP, leverage gradients (Sanh et al., 2020; Lagunas et al., 2021), loss curvature (Kurtic et al., 2022), or a compute-intensive pre-training setup (Zafrir et al., 2021), are built on the premise that the simple magnitude-based GMP method falters when applied to BERT pruning. In this work, contrary to what is currently available in the literature, we present empirical evidence that GMP, when tuned carefully, can produce very accurate sparse models that are competitive with or even better than most state-of-the-art pruning techniques across both regimes (upstream and downstream). As can be seen from Figure 1 and our later results, we massively improve upon existing GMP-based pruning baselines, in some cases by more than 20 accuracy points.
2 Competitive Gradual Magnitude Pruning (GMP⋆)
Experimental setup.
We focus our attention on the standard BERT-base model, composed of embedding and encoder layers, which has approximately 110M parameters. All methods focus on pruning the approximately 85M weights of the encoder layers and report sparsities with respect to that number. We evaluate models on the validation split of the respective dataset, and to improve confidence in the obtained results we perform multiple runs with different seeds and report mean performance.

Figure 2: Learning rate and sparsity schedules for the proposed gradual pruning framework.
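To make the sparsity accounting concrete, the snippet below is a minimal sketch of how the roughly 85M prunable encoder weights can be counted and how sparsity is measured relative to them; it assumes the HuggingFace transformers library and the bert-base-uncased checkpoint, which are illustrative choices rather than details specified here.

# Minimal sketch: count the ~85M prunable encoder weight matrices of BERT-base
# and report sparsity relative to that count (embeddings are excluded).
# Assumes the HuggingFace `transformers` library and `bert-base-uncased`.
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")

# All 2D weight matrices inside the Transformer encoder: Q, K, V, attention
# output, and the two feed-forward projections in each of the 12 layers.
encoder_weights = [
    p for name, p in model.named_parameters()
    if name.startswith("encoder.") and name.endswith(".weight") and p.dim() == 2
]

total = sum(p.numel() for p in encoder_weights)             # ~85M parameters
zeros = sum((p == 0).sum().item() for p in encoder_weights)
print(f"prunable encoder weights: {total / 1e6:.1f}M, "
      f"sparsity: {zeros / total:.2%}")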
2.1 Downstream pruning
Following the literature, we consider three popular tasks: question answering on SQuADv1.1 (Rajpurkar et al., 2016), recognition of textual entailment on MNLI (Williams et al., 2017), and duplicate question detection on QQP (Iyer et al., 2017). We now reflect upon the most important constituents of the gradual pruning framework that enabled us to attain these massive improvements.
Sparsity schedule.
In all of our gradual runs, there is no pruning during the first two and the last two epochs: the former fine-tune the pre-trained model, and the latter fine-tune the sparse model with the fixed mask. In between the two, GMP⋆ follows the cubic sparsity schedule (Zhu and Gupta, 2017) and prunes weights ten times per epoch. Motivated by the fact that BERT-base is heavily overparametrized for downstream tasks, we deviate from the standard cubic schedule by introducing a large first pruning step. This proved to be of crucial importance when pruning the model to high target sparsities (e.g. 97%), as it leaves more time to recover from the later pruning steps, which are much more difficult. In Table 8 we report results from an ablation study with respect to the size of the initial step. For convenience, we visualize the sparsity schedule in Figure 2. Our preliminary experiments showed similar performance between uniform and global sparsity distributions, so we use the former.
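As an illustration, the following sketch implements the modified cubic schedule described above: no pruning in the first and last two epochs, a large first pruning step, and ten discrete pruning steps per epoch in between. The number of epochs and the initial and target sparsities are illustrative placeholders, not our exact configuration.

# Sketch of the gradual sparsity schedule: dense fine-tuning for the first two
# epochs, a large first pruning step, cubic interpolation to the target sparsity
# with ten pruning steps per epoch, and fine-tuning with a fixed mask at the end.
# All default values below are illustrative.
def sparsity_at(epoch, num_epochs=30, s_first=0.70, s_final=0.97,
                warmup_epochs=2, cooldown_epochs=2, steps_per_epoch=10):
    start, end = warmup_epochs, num_epochs - cooldown_epochs
    if epoch < start:            # dense fine-tuning, no pruning yet
        return 0.0
    if epoch >= end:             # fixed mask, fine-tune the sparse model
        return s_final
    # Pruning is applied at discrete steps, ten times per epoch.
    total_steps = (end - start) * steps_per_epoch
    step = int((epoch - start) * steps_per_epoch)
    frac = step / max(total_steps - 1, 1)   # 0 at the first step, 1 at the last
    # Cubic schedule (Zhu and Gupta, 2017), starting from the large first
    # pruning step s_first instead of zero sparsity.
    return s_final + (s_first - s_final) * (1.0 - frac) ** 3

For example, sparsity_at(2.0) returns the large first step (70% here), after which the schedule increases smoothly to the 97% target before the final fine-tuning phase with a fixed mask.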
Learning rate schedule.
Our goal is to provide
a simple baseline setup that works well across a wide range of datasets without any additional task-
dependent tuning. Currently, papers either report
best results following an extensive hyperparameter