
Pruning Pre-trained Language Models Without Fine-Tuning
Ting Jiang1, Deqing Wang1,3†, Fuzhen Zhuang1,2,3, Ruobing Xie4, Feng Xia4
1SKLSDE Lab, School of Computer, Beihang University, Beijing, China
2Institute of Artificial Intelligence, Beihang University, Beijing, China
3Zhongguancun Laboratory, Beijing, China
4WeChat, Tencent, Beijing, China
{royokong, dqwang, zhuangfuzhen}@buaa.edu.cn
†Corresponding Author.
Abstract
To overcome the overparameterization problem in Pre-trained Language Models (PLMs), pruning is widely used as a simple and straightforward compression method that directly removes unimportant weights. Previous first-order methods successfully compress PLMs to extremely high sparsity with little performance drop. These methods, such as movement pruning, use first-order information to prune PLMs while fine-tuning the remaining weights. In this work, we argue that fine-tuning is redundant for first-order pruning, since first-order pruning alone is sufficient to converge PLMs to downstream tasks without fine-tuning. Under this motivation, we propose Static Model Pruning (SMP), which uses only first-order pruning to adapt PLMs to downstream tasks while achieving the target sparsity level. In addition, we design a new masking function and training objective to further improve SMP. Extensive experiments at various sparsity levels show that SMP achieves significant improvements over first-order and zeroth-order methods. Unlike previous first-order methods, SMP is also applicable at low sparsity and outperforms zeroth-order methods. Meanwhile, SMP is more parameter efficient than other methods because it does not require fine-tuning. Our code is available at https://github.com/kongds/SMP.
1 Introduction
Pre-trained Language Models (PLMs) like BERT (Devlin et al., 2019) have shown powerful performance in natural language processing by transferring knowledge from large-scale corpora to downstream tasks. These models also require a large number of parameters to cope with the large-scale corpora used in pre-training. However, these large-scale parameters are overwhelming for most downstream tasks (Chen et al., 2020), which results in significant overhead for transferring and storing them.
To compress PLMs, pruning is widely used: it removes unimportant weights by setting them to zero. By using sparse subnetworks instead of the original complete network, existing pruning methods can maintain the original accuracy even after removing most weights. Magnitude pruning (Han et al., 2015), a common method, uses zeroth-order information to make pruning decisions based on the absolute value of weights. However, in the process of adapting to downstream tasks, the weight values in PLMs are largely predetermined by their original pre-trained values. To overcome this shortcoming, movement pruning (Sanh et al., 2020) uses first-order information to select weights based on how they change during training rather than their absolute values. To adapt PLMs to downstream tasks, most methods like movement pruning perform pruning and fine-tuning together by gradually increasing the sparsity during training. With the development of the Lottery Ticket Hypothesis (LTH) (Frankle and Carbin, 2018) in PLMs, some methods (Chen et al., 2020; Liang et al., 2021) find certain subnetworks from the PLM by pruning, and then fine-tune these subnetworks from the pre-trained weights. Moreover, if the fine-tuned subnetwork can match the performance of the full PLM, this subnetwork is called a winning ticket (Chen et al., 2020).
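To make the contrast between the two criteria concrete, the following is a minimal PyTorch-style sketch (our own illustration, not the exact formulation of either method): a zeroth-order criterion scores weights by magnitude, while a first-order, movement-style criterion accumulates how training moves each weight; the helper names and the simple top-k masking are assumptions for illustration only.

```python
import torch

def magnitude_scores(weight: torch.Tensor) -> torch.Tensor:
    # Zeroth-order criterion: importance is the absolute value of each weight.
    return weight.abs()

def movement_scores(scores: torch.Tensor, weight: torch.Tensor,
                    grad: torch.Tensor, lr: float) -> torch.Tensor:
    # First-order, movement-style criterion: accumulate -lr * W * dL/dW.
    # The score grows when training pushes a weight away from zero and
    # shrinks when the weight is pushed toward zero.
    return scores - lr * weight * grad

def topk_mask(scores: torch.Tensor, sparsity: float) -> torch.Tensor:
    # Keep the (1 - sparsity) fraction of weights with the highest scores.
    k = max(1, int(scores.numel() * (1.0 - sparsity)))
    threshold = torch.topk(scores.flatten(), k).values.min()
    return (scores >= threshold).float()
```

For example, pruning a layer to 90% sparsity under the zeroth-order criterion would amount to `topk_mask(magnitude_scores(W), 0.9)`, whereas a movement-style criterion would update `scores` at every training step and only then threshold them.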
In this work, we propose a simple but efficient first-order method. Contrary to previous pruning methods, our method adapts PLMs by pruning alone, without fine-tuning. It makes pruning decisions based on the movement trend of weights, rather than the actual movement used in movement pruning. To improve the performance of our method, we propose a new masking function to better align the remaining weights with the architecture of PLMs. We also avoid fine-tuning weights in the task-specific head by using our head initialization method. By keeping the PLM frozen, we also save a large number of trainable parameters compared with other first-order methods.
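Before the detailed method, the general idea of adapting a frozen PLM by pruning alone can be sketched as follows. This is a minimal PyTorch-style illustration under our own assumptions: the class name `FrozenMaskedLinear`, the score initialization, the fixed per-layer sparsity, and the straight-through top-k mask are placeholders, not the actual masking function or training objective of SMP, which are introduced later.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenMaskedLinear(nn.Module):
    """Sketch of pruning without fine-tuning: the pre-trained weight is
    frozen and only a per-weight importance score is trained. The binary
    mask comes from thresholding the scores, with a straight-through
    estimator so the task loss can still update the scores."""

    def __init__(self, linear: nn.Linear, sparsity: float):
        super().__init__()
        # Frozen pre-trained parameters: never updated by the optimizer.
        self.weight = nn.Parameter(linear.weight.detach().clone(), requires_grad=False)
        self.bias = nn.Parameter(linear.bias.detach().clone(), requires_grad=False)
        # Learnable importance scores (small random init is an arbitrary choice).
        self.scores = nn.Parameter(torch.randn_like(self.weight) * 0.01)
        self.sparsity = sparsity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k = max(1, int(self.scores.numel() * (1.0 - self.sparsity)))
        threshold = torch.topk(self.scores.flatten(), k).values.min()
        hard_mask = (self.scores >= threshold).float()
        # Straight-through estimator: forward uses the hard 0/1 mask, while
        # gradients flow to the scores as if the mask were the scores themselves.
        mask = hard_mask + self.scores - self.scores.detach()
        return F.linear(x, self.weight * mask, self.bias)
```

Under this sketch, only `scores` receives gradients; at each position the gradient is proportional to the weight times the gradient of the loss with respect to the masked weight, i.e. the movement-style signal. Pruning decisions therefore reflect where weights would move if they were trained, while the weights themselves never change.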