Pruning Pre-trained Language Models Without Fine-Tuning
Ting Jiang1, Deqing Wang1,3†, Fuzhen Zhuang1,2,3, Ruobing Xie4, Feng Xia4
1SKLSDE Lab, School of Computer, Beihang University, Beijing, China
2Institute of Artificial Intelligence, Beihang University, Beijing, China
3Zhongguancun Laboratory, Beijing, China
4WeChat, Tencent, Beijing, China
{royokong, dqwang, zhuangfuzhen}@buaa.edu.cn
†Corresponding Author
Abstract
To overcome the overparameterization problem in Pre-trained Language Models (PLMs), pruning is widely used as a simple and straightforward compression method that directly removes unimportant weights. Previous first-order methods successfully compress PLMs to extremely high sparsity with little performance drop. These methods, such as movement pruning, use first-order information to prune PLMs while fine-tuning the remaining weights. In this work, we argue that fine-tuning is redundant for first-order pruning, since first-order pruning alone is sufficient to converge PLMs to downstream tasks without fine-tuning. Under this motivation, we propose Static Model Pruning (SMP), which uses only first-order pruning to adapt PLMs to downstream tasks while achieving the target sparsity level. In addition, we design a new masking function and training objective to further improve SMP. Extensive experiments at various sparsity levels show that SMP yields significant improvements over first-order and zeroth-order methods. Unlike previous first-order methods, SMP is also applicable to low sparsity and outperforms zeroth-order methods. Meanwhile, SMP is more parameter efficient than other methods because it does not require fine-tuning. Our code is available at https://github.com/kongds/SMP.
1 Introduction
Pre-trained Language Models (PLMs) like BERT (Devlin et al., 2019) have shown powerful performance in natural language processing by transferring knowledge from large-scale corpora to downstream tasks. These models also require large-scale parameters to cope with the large-scale corpora used in pre-training. However, these large-scale parameters are overwhelming for most downstream tasks (Chen et al., 2020), which results in significant overhead for transferring and storing them.
To compress PLMs, pruning is widely used: it removes unimportant weights by setting them to zero. By using sparse subnetworks instead of the original complete network, existing pruning methods can maintain the original accuracy while removing most weights. Magnitude pruning (Han et al., 2015), a common method, uses zeroth-order information to make pruning decisions based on the absolute value of weights. However, in the process of adapting to downstream tasks, the weight values in PLMs are largely predetermined by their original pre-trained values. To overcome this shortcoming, movement pruning (Sanh et al., 2020) uses first-order information to select weights based on how they change during training rather than on their absolute values. To adapt PLMs to downstream tasks, most methods like movement pruning perform pruning and fine-tuning together by gradually increasing the sparsity during training. With the development of the Lottery Ticket Hypothesis (LTH) (Frankle and Carbin, 2018) in PLMs, some methods (Chen et al., 2020; Liang et al., 2021) find certain subnetworks of the PLM by pruning and then fine-tune these subnetworks from the pre-trained weights. Moreover, if a fine-tuned subnetwork can match the performance of the full PLM, it is called a winning ticket (Chen et al., 2020).
In this work, we propose a simple but efficient first-order method. Contrary to previous pruning methods, our method adapts PLMs by pruning alone, without fine-tuning. It makes pruning decisions based on the movement trend of weights rather than their actual movement as in movement pruning. To improve the performance of our method, we propose a new masking function that better aligns the remaining weights with the architecture of PLMs. We also avoid fine-tuning the weights in the task-specific head by using our head initialization method. By keeping the PLM frozen, we save half of the trainable parameters compared to other first-order methods and introduce only a binary mask as the new parameter for each downstream task at various sparsity levels. Extensive experiments across a wide range of sparsity levels demonstrate that our method strongly outperforms state-of-the-art pruning methods. Contrary to previous first-order methods (Sanh et al., 2020), which show poor performance at low sparsity, our method is also applicable to low sparsity and achieves better performance than zeroth-order methods.
2 Related Work
Compressing PLMs for transfer learning is a popular area of research. Many compression methods have been proposed to address the overparameterization problem in PLMs, such as model pruning (Han et al., 2015; Molchanov et al., 2017; Xia et al., 2022), knowledge distillation (Jiao et al., 2020; Wang et al., 2020), quantization (Shen et al., 2020; Qin et al., 2022), and matrix decomposition (Lan et al., 2020). Among them, pruning methods have been widely studied as the most intuitive approach.
Pruning methods focus on identifying and removing unimportant weights from the model. Both zeroth-order and first-order methods are widely used to prune PLMs. Among zeroth-order methods, magnitude pruning (Han et al., 2015) simply prunes weights based on their absolute values. Among first-order methods, which use a first-order Taylor expansion to make pruning decisions, $L_0$ regularization (Louizos et al., 2017) adds an $L_0$-norm penalty to reduce the number of remaining weights by sampling them from a hard-concrete distribution. Movement pruning (Sanh et al., 2020) uses the straight-through estimator (Bengio et al., 2013) to compute first-order information.
Building on pruning methods, Frankle and Carbin (2018) propose the Lottery Ticket Hypothesis (LTH). LTH states that there exist sparse subnetworks (i.e., winning tickets) that can achieve almost the same performance as the full model when trained individually. With the development of LTH, many works focusing on PLMs have emerged. Chen et al. (2020) find that BERT contains winning tickets at sparsities of 40% to 90%, and that the winning ticket found on the masked language modeling task can be transferred to other downstream tasks. Recent works also try to leverage LTH to improve the performance and efficiency of PLMs. Liang et al. (2021) find that the generalization performance of winning tickets first improves and then deteriorates beyond a certain sparsity threshold. By leveraging this phenomenon, they show that LTH can successfully improve performance on downstream tasks.
3 Background
Let $a = Wx$ refer to a fully-connected layer in PLMs, where $W \in \mathbb{R}^{n \times n}$ is the weight matrix, and $x \in \mathbb{R}^{n}$ and $a \in \mathbb{R}^{n}$ are the input and output, respectively. Pruning can be represented by $a = (W \odot M)x$, where $M \in \{0, 1\}^{n \times n}$ is the binary mask.
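For concreteness, such a masked layer can be written as a small PyTorch module. This is a minimal sketch under our own naming (the paper does not prescribe this implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    """A fully-connected layer computing a = (W ⊙ M) x with a binary mask M."""

    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(dim, dim))
        nn.init.xavier_uniform_(self.weight)
        # M has the same shape as W; entries set to 0 correspond to pruned weights.
        self.register_buffer("mask", torch.ones(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise product W ⊙ M, followed by the usual matrix multiplication.
        return F.linear(x, self.weight * self.mask)
```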
We first review two common pruning methods for PLMs: magnitude pruning (Han et al., 2015) and movement pruning (Sanh et al., 2020). Magnitude pruning relies on zeroth-order information to decide $M$ by keeping the top $v$ percent of weights according to their absolute value, $M = \mathrm{Top}_v(S)$. The importance scores $S \in \mathbb{R}^{n \times n}$ are given by
$$
S^{(T)}_{i,j} = \left| W^{(T)}_{i,j} \right| = \left| W^{(0)}_{i,j} - \alpha_W \sum_{t<T} \left( \frac{\partial \mathcal{L}}{\partial W_{i,j}} \right)^{(t)} \right| \tag{1}
$$
where $S^{(T)}_{i,j}$ is the importance score corresponding to $W^{(T)}_{i,j}$ after $T$ update steps, and $\mathcal{L}$ and $\alpha_W$ are the learning objective and the learning rate of $W_{i,j}$, respectively. Magnitude pruning thus selects weights with high absolute values during fine-tuning.
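A minimal sketch of this zeroth-order selection, assuming a PyTorch weight tensor and using keep_ratio (our name) for the kept fraction $v$:

```python
import torch

def magnitude_mask(weight: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """M = Top_v(S) with S = |W|: keep the top fraction of weights by
    absolute value and set the rest of the mask to zero."""
    scores = weight.abs().flatten()
    k = max(1, int(keep_ratio * scores.numel()))
    _, kept = torch.topk(scores, k)          # indices of the largest |W|
    mask = torch.zeros_like(scores)
    mask[kept] = 1.0
    return mask.view_as(weight)
```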
Movement pruning relies on first-order information by learning the importance scores $S$ with gradient descent. The gradient of $S$ is approximated with the straight-through estimator (Bengio et al., 2013), which directly reuses the gradient from $M$. According to Sanh et al. (2020), the importance scores $S$ are given by
$$
S^{(T)}_{i,j} = -\alpha_S \sum_{t<T} \left( \frac{\partial \mathcal{L}}{\partial W_{i,j}} \right)^{(t)} W^{(t)}_{i,j} \tag{2}
$$
where $\alpha_S$ is the learning rate of $S$. Compared to magnitude pruning, movement pruning selects the weights that are increasing their absolute value.
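One way to realize the straight-through estimator in PyTorch is a custom autograd function that binarizes the scores in the forward pass and passes the incoming gradient through unchanged in the backward pass. The class and variable names below are ours, not the paper's:

```python
import torch

class TopKBinarizer(torch.autograd.Function):
    """Produce the binary mask M = Top_v(S) in the forward pass and pass the
    gradient with respect to M straight through to S in the backward pass."""

    @staticmethod
    def forward(ctx, scores: torch.Tensor, keep_ratio: float) -> torch.Tensor:
        k = max(1, int(keep_ratio * scores.numel()))
        _, kept = torch.topk(scores.flatten(), k)   # indices of the largest scores
        mask = torch.zeros_like(scores.flatten())
        mask[kept] = 1.0
        return mask.view_as(scores)

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        # Straight-through estimator: treat dM/dS as the identity;
        # keep_ratio receives no gradient.
        return grad_output, None
```

Inside a masked layer, one would compute mask = TopKBinarizer.apply(scores, keep_ratio) and multiply it with the weights in the forward pass; optimizing the scores with plain SGD then accumulates a quantity of the form of Eq. (2).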
To achieve the target sparsity, one common approach is automated gradual pruning (Zhu and Gupta, 2018). The sparsity level $v$ is gradually increased with a cubic sparsity scheduler starting from training step $t_0$:
$$
v^{(t)} = v_f + (v_0 - v_f) \left( 1 - \frac{t - t_0}{N \Delta t} \right)^{3}
$$
where $v_0$ and $v_f$ are the initial and target sparsity, $N$ is the overall number of pruning steps, and $\Delta t$ is the pruning frequency.
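A small Python sketch of this scheduler (function and argument names are ours):

```python
def cubic_sparsity(step: int, t0: int, num_prune_steps: int, dt: int,
                   v0: float, vf: float) -> float:
    """Cubic sparsity schedule: ramp sparsity from v0 to vf, starting at
    training step t0 and pruning every dt steps for num_prune_steps steps."""
    if step < t0:
        return v0
    if step >= t0 + num_prune_steps * dt:
        return vf
    progress = (step - t0) / (num_prune_steps * dt)
    return vf + (v0 - vf) * (1.0 - progress) ** 3
```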