
PATS: Sensitivity-aware Noisy Learning for Pretrained Language Models
Yupeng Zhang1∗, Hongzhi Zhang2∗, Sirui Wang2, Wei Wu2 and Zhoujun Li1†
1Beihang University, Beijing, China  2Meituan Inc., Beijing, China
{G0vi_qyx, lizj}@buaa.edu.cn
{zhanghongzhi03, wangsirui, wuwei30}@meituan.com
∗Work done during internship at Meituan Inc. The first two authors have equal contributions.
†Corresponding author.
Abstract
A wide range of NLP tasks benefit from the fine-tuning of pretrained language models (PLMs). However, a number of redundant parameters which contribute little to the downstream task are observed in a directly fine-tuned model. We argue that the gap between pretraining and downstream tasks hinders the training of these redundant parameters and results in suboptimal performance of the overall model. In this paper, we present PATS (Perturbation According To Sensitivity), a noisy training mechanism which considers each parameter's importance in the downstream task to help fine-tune PLMs. The main idea of PATS is to add bigger noise to parameters with lower sensitivity and vice versa, in order to activate the contributions of more parameters to downstream tasks without much affecting the sensitive ones. Extensive experiments on different tasks of the GLUE benchmark show that PATS can consistently empower the fine-tuning of PLMs of different sizes, and that the parameters of well-performing models always exhibit more concentrated sensitivity distributions, which experimentally confirms the effectiveness of our method.
1 Introduction
With a huge number of model parameters and well-designed training objectives, pretrained language models (PLMs) have brought a new era to NLP (Guu et al., 2020; Liu, 2019; Zhu et al., 2020b; Qiu et al., 2020). Fine-tuning PLMs such as BERT (Devlin et al., 2019) has become a basic and effective approach for many downstream tasks (Wadden et al., 2019; Sun et al., 2019; Howard and Ruder, 2018).
However, recent studies have shown that aggressive fine-tuning can induce unstable and suboptimal performance, especially with insufficient data (Dodge et al., 2020; Raffel et al., 2019), which has motivated researchers to identify the culprits and explore effective remedies (Peters et al., 2019; Houlsby et al., 2019; Mosbach et al., 2020). For example, regularization methods such as RecAdam (Chen et al., 2020) and Mixout (Lee et al., 2020), as well as adversarial training techniques such as SMART (Jiang et al., 2020) and FreeLB (Zhu et al., 2020a), aim to alleviate overfitting of the downstream-task data. Beyond that, Wu et al. (2022) proposed NoisyTune, arguing that in addition to overfitting of the limited downstream data, there could also be overfitting of the pretraining tasks, which could result in enormous gaps between pretraining and downstream-task data. To overcome these gaps, NoisyTune simply adds some noise to the parameters of the PLM before fine-tuning. Besides, it has also been demonstrated that the presence of a large number of redundant parameters can be another factor in the suboptimal performance of aggressively fine-tuned PLMs (Fan et al., 2019; Sanh et al., 2020; Dalvi et al., 2020). Considering that the redundant parameters in a model are not sufficiently trained, Liang et al. (2022) proposed a learning rate scheduler named SAGE, in which larger learning rates are assigned to parameters of low sensitivity (a measure of a parameter's importance to the downstream task).
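For concreteness, the following is a minimal PyTorch-style sketch (not the authors' released code) of the two ingredients discussed above: NoisyTune-style uniform perturbation of PLM parameters before fine-tuning, and a per-parameter sensitivity score, assumed here to be the magnitude of the parameter-gradient product |θ · ∂L/∂θ| as in SAGE-style approaches; the exact formulations in the cited papers may differ in details such as noise distributions or moving-average smoothing.

```python
import torch


def add_noisytune_noise(model, noise_lambda=0.15):
    """NoisyTune-style perturbation (sketch): before fine-tuning, add uniform
    noise to each parameter matrix, scaled by that matrix's standard deviation."""
    with torch.no_grad():
        for param in model.parameters():
            if param.numel() <= 1:
                continue  # skip scalar parameters, whose std is undefined
            scale = noise_lambda * param.data.std()
            # noise drawn from U(-scale, +scale), matrix-wise
            param.add_((torch.rand_like(param) - 0.5) * 2.0 * scale)


def parameter_sensitivity(param):
    """Sensitivity proxy (assumption): |theta * dL/dtheta|, a first-order
    estimate of how much the loss would change if the parameter were zeroed."""
    if param.grad is None:
        return torch.zeros_like(param)
    return (param.data * param.grad).abs()
```

PATS builds on the second ingredient: as stated in the abstract, the noise magnitude is modulated inversely with this sensitivity, so insensitive parameters are perturbed more strongly while sensitive ones are largely left untouched.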
There could be some connection between the gaps caused by overfitting of the pretraining tasks and the redundancy of parameters: we conjecture that it is the gap between pretraining and downstream tasks that hinders the training of these redundant parameters. SAGE enlarges the learning rates of insensitive parameters to help their training. However, given how sensitivity is measured, insensitive parameters usually have small gradients, so enlarged learning rates alone may do little to help them escape suboptimal regions compared with injecting additional noise. One noisy training