Pruning Pre-trained Language Models Without Fine-Tuning
Ting Jiang1, Deqing Wang1,3†, Fuzhen Zhuang1,2,3, Ruobing Xie4, Feng Xia4
1SKLSDE Lab, School of Computer, Beihang University, Beijing, China
2Institute of Artificial Intelligence, Beihang University, Beijing, China
3Zhongguancun Laboratory, Beijing, China
4WeChat, Tencent, Beijing, China
{royokong, dqwang, zhuangfuzhen}@buaa.edu.cn
†Corresponding Author
Abstract
To overcome the overparameterization problem in Pre-trained Language Models (PLMs), pruning is widely used as a simple and straightforward compression method that directly removes unimportant weights. Previous first-order methods successfully compress PLMs to extremely high sparsity with little performance drop. These methods, such as movement pruning, use first-order information to prune PLMs while fine-tuning the remaining weights. In this work, we argue that fine-tuning is redundant for first-order pruning, since first-order pruning alone is sufficient to converge PLMs to downstream tasks without fine-tuning. Under this motivation, we propose Static Model Pruning (SMP), which uses only first-order pruning to adapt PLMs to downstream tasks while achieving the target sparsity level. In addition, we design a new masking function and training objective to further improve SMP. Extensive experiments at various sparsity levels show that SMP yields significant improvements over first-order and zeroth-order methods. Unlike previous first-order methods, SMP is also applicable to low sparsity and outperforms zeroth-order methods. Meanwhile, SMP is more parameter efficient than other methods because it does not require fine-tuning. Our code is available at https://github.com/kongds/SMP.
1 Introduction
Pre-trained Language Models (PLMs) like BERT (Devlin et al., 2019) have shown powerful performance in natural language processing by transferring knowledge from large-scale corpora to downstream tasks. These models also require large-scale parameters to cope with the large-scale corpora used in pre-training. However, these large-scale parameters are overwhelming for most downstream tasks (Chen et al., 2020), which results in significant overhead for transferring and storing them.
To compress PLMs, pruning is widely used: it removes unimportant weights by setting them to zero. By using sparse subnetworks instead of the original complete network, existing pruning methods can maintain the original accuracy while removing most weights. Magnitude pruning (Han et al., 2015), a common method, uses zeroth-order information to make pruning decisions based on the absolute value of weights. However, in the process of adapting to downstream tasks, the weight values in PLMs are largely predetermined by their original pre-trained values. To overcome this shortcoming, movement pruning (Sanh et al., 2020) uses first-order information to select weights based on how they change during training rather than on their absolute values. To adapt PLMs to downstream tasks, most methods like movement pruning perform pruning and fine-tuning together by gradually increasing the sparsity during training. With the development of the Lottery Ticket Hypothesis (LTH) (Frankle and Carbin, 2018) in PLMs, some methods (Chen et al., 2020; Liang et al., 2021) find certain subnetworks of the PLM by pruning and then fine-tune these subnetworks from the pre-trained weights. Moreover, if a fine-tuned subnetwork can match the performance of the full PLM, it is called a winning ticket (Chen et al., 2020).
In this work, we propose a simple but efficient first-order method. Contrary to previous pruning methods, our method adapts PLMs by pruning alone, without fine-tuning. It makes pruning decisions based on the movement trend of weights rather than their actual movement as in movement pruning. To improve the performance of our method, we propose a new masking function that better aligns the remaining weights with the architecture of PLMs. We also avoid fine-tuning the weights in the task-specific head by using our head initialization method. By keeping the PLM frozen, we save half of the trainable parameters compared to other first-order methods and introduce only a binary mask as the new parameter for each downstream task at various sparsity levels. Extensive experiments across a wide range of sparsity levels demonstrate that our method strongly outperforms state-of-the-art pruning methods. Contrary to previous first-order methods (Sanh et al., 2020), which show poor performance at low sparsity, our method is also applicable to low sparsity and achieves better performance than zeroth-order methods.
2 Related Work
Compressing PLMs for transfer learning is a popular area of research. Many compression methods have been proposed to address the overparameterization problem in PLMs, such as model pruning (Han et al., 2015; Molchanov et al., 2017; Xia et al., 2022), knowledge distillation (Jiao et al., 2020; Wang et al., 2020), quantization (Shen et al., 2020; Qin et al., 2022), and matrix decomposition (Lan et al., 2020). Among them, pruning methods have been widely studied as the most intuitive approach.
Pruning methods focus on identifying and removing unimportant weights from the model. Both zeroth-order and first-order methods are widely used to prune PLMs. Among zeroth-order methods, magnitude pruning (Han et al., 2015) simply prunes weights based on their absolute values. Among first-order methods, which use a first-order Taylor expansion to make pruning decisions, $L_0$ regularization (Louizos et al., 2017) adds an $L_0$-norm penalty to reduce the number of remaining weights by sampling them from a hard-concrete distribution. Movement pruning (Sanh et al., 2020) uses the straight-through estimator (Bengio et al., 2013) to compute first-order information.
Building on pruning methods, Frankle and Carbin (2018) propose the Lottery Ticket Hypothesis (LTH). LTH states that there exist sparse subnetworks (i.e., winning tickets) that can achieve almost the same performance as the full model when trained individually. With the development of LTH, many works focusing on PLMs have emerged. Chen et al. (2020) find that BERT contains winning tickets at sparsities of 40% to 90%, and that the winning ticket found on the masked language modeling task can be transferred to other downstream tasks. Recent works also try to leverage LTH to improve the performance and efficiency of PLMs. Liang et al. (2021) find that the generalization performance of winning tickets first improves and then deteriorates beyond a certain sparsity threshold. By leveraging this phenomenon, they show that LTH can successfully improve performance on downstream tasks.
3 Background
Let $a = Wx$ refer to a fully-connected layer in PLMs, where $W \in \mathbb{R}^{n \times n}$ is the weight matrix, and $x \in \mathbb{R}^{n}$ and $a \in \mathbb{R}^{n}$ are the input and output, respectively. Pruning can be represented by $a = (W \odot M)x$, where $M \in \{0, 1\}^{n \times n}$ is the binary mask.
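For concreteness, such a masked layer can be written as a small PyTorch module. This is a minimal sketch under our own naming (the paper does not prescribe this implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    """A fully-connected layer computing a = (W ⊙ M) x with a binary mask M."""

    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(dim, dim))
        nn.init.xavier_uniform_(self.weight)
        # M has the same shape as W; entries set to 0 correspond to pruned weights.
        self.register_buffer("mask", torch.ones(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise product W ⊙ M, followed by the usual matrix multiplication.
        return F.linear(x, self.weight * self.mask)
```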
We first review two common pruning methods for PLMs: magnitude pruning (Han et al., 2015) and movement pruning (Sanh et al., 2020). Magnitude pruning relies on zeroth-order information to decide $M$ by keeping the top $v$ percent of weights according to their absolute value, $M = \mathrm{Top}_v(S)$. The importance scores $S \in \mathbb{R}^{n \times n}$ are given by
$$
S^{(T)}_{i,j} = \left| W^{(T)}_{i,j} \right| = \left| W^{(0)}_{i,j} - \alpha_W \sum_{t<T} \left( \frac{\partial \mathcal{L}}{\partial W_{i,j}} \right)^{(t)} \right| \tag{1}
$$
where $S^{(T)}_{i,j}$ is the importance score corresponding to $W^{(T)}_{i,j}$ after $T$ update steps, and $\mathcal{L}$ and $\alpha_W$ are the learning objective and the learning rate of $W_{i,j}$, respectively. Magnitude pruning thus selects weights with high absolute values during fine-tuning.
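A minimal sketch of this zeroth-order selection, assuming a PyTorch weight tensor and using keep_ratio (our name) for the kept fraction $v$:

```python
import torch

def magnitude_mask(weight: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """M = Top_v(S) with S = |W|: keep the top fraction of weights by
    absolute value and set the rest of the mask to zero."""
    scores = weight.abs().flatten()
    k = max(1, int(keep_ratio * scores.numel()))
    _, kept = torch.topk(scores, k)          # indices of the largest |W|
    mask = torch.zeros_like(scores)
    mask[kept] = 1.0
    return mask.view_as(weight)
```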
Movement pruning relies on first-order information by learning the importance scores $S$ with gradient descent. The gradient of $S$ is approximated with the straight-through estimator (Bengio et al., 2013), which directly reuses the gradient from $M$. According to Sanh et al. (2020), the importance scores $S$ are given by
$$
S^{(T)}_{i,j} = -\alpha_S \sum_{t<T} \left( \frac{\partial \mathcal{L}}{\partial W_{i,j}} \right)^{(t)} W^{(t)}_{i,j} \tag{2}
$$
where $\alpha_S$ is the learning rate of $S$. Compared to magnitude pruning, movement pruning selects the weights that are increasing their absolute value.
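One way to realize the straight-through estimator in PyTorch is a custom autograd function that binarizes the scores in the forward pass and passes the incoming gradient through unchanged in the backward pass. The class and variable names below are ours, not the paper's:

```python
import torch

class TopKBinarizer(torch.autograd.Function):
    """Produce the binary mask M = Top_v(S) in the forward pass and pass the
    gradient with respect to M straight through to S in the backward pass."""

    @staticmethod
    def forward(ctx, scores: torch.Tensor, keep_ratio: float) -> torch.Tensor:
        k = max(1, int(keep_ratio * scores.numel()))
        _, kept = torch.topk(scores.flatten(), k)   # indices of the largest scores
        mask = torch.zeros_like(scores.flatten())
        mask[kept] = 1.0
        return mask.view_as(scores)

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        # Straight-through estimator: treat dM/dS as the identity;
        # keep_ratio receives no gradient.
        return grad_output, None
```

Inside a masked layer, one would compute mask = TopKBinarizer.apply(scores, keep_ratio) and multiply it with the weights in the forward pass; optimizing the scores with plain SGD then accumulates a quantity of the form of Eq. (2).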
To achieve the target sparsity, one common approach is automated gradual pruning (Zhu and Gupta, 2018). The sparsity level $v$ is gradually increased with a cubic sparsity scheduler starting from training step $t_0$:
$$
v^{(t)} = v_f + (v_0 - v_f) \left( 1 - \frac{t - t_0}{N \Delta t} \right)^{3}
$$
where $v_0$ and $v_f$ are the initial and target sparsity, $N$ is the overall number of pruning steps, and $\Delta t$ is the pruning frequency.
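A small Python sketch of this scheduler (function and argument names are ours):

```python
def cubic_sparsity(step: int, t0: int, num_prune_steps: int, dt: int,
                   v0: float, vf: float) -> float:
    """Cubic sparsity schedule: ramp sparsity from v0 to vf, starting at
    training step t0 and pruning every dt steps for num_prune_steps steps."""
    if step < t0:
        return v0
    if step >= t0 + num_prune_steps * dt:
        return vf
    progress = (step - t0) / (num_prune_steps * dt)
    return vf + (v0 - vf) * (1.0 - progress) ** 3
```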