PATS: Sensitivity-aware Noisy Learning for Pretrained Language Models
Yupeng Zhang1∗, Hongzhi Zhang2, Sirui Wang2, Wei Wu2 and Zhoujun Li1†
1Beihang University, Beijing, China  2Meituan Inc., Beijing, China
{G0vi_qyx, lizj}@buaa.edu.cn
{zhanghongzhi03, wangsirui, wuwei30}@meituan.com
∗Work done during internship at Meituan Inc. The first two authors have equal contributions.
†Corresponding author.
Abstract
A wide range of NLP tasks benefit from fine-tuning pretrained language models (PLMs). However, a directly fine-tuned model contains many redundant parameters that contribute little to the downstream task. We argue that the gap between pretraining and downstream tasks hinders the training of these redundant parameters and leads to suboptimal performance of the overall model. In this paper, we present PATS (Perturbation According To Sensitivity), a noisy training mechanism that takes each parameter's importance to the downstream task into account when fine-tuning PLMs. The main idea of PATS is to add larger noise to parameters with lower sensitivity and vice versa, so that more parameters are activated to contribute to the downstream task while the sensitive ones are barely affected. Extensive experiments on different tasks of the GLUE benchmark show that PATS consistently improves the fine-tuning of PLMs of different sizes, and that the parameters of the well-performing models always have more concentrated sensitivity distributions, which empirically confirms the effectiveness of our method.
1 Introduction
With a huge number of model parameters and well-designed training objectives, pretrained language models (PLMs) have brought a new era to NLP (Guu et al., 2020; Liu, 2019; Zhu et al., 2020b; Qiu et al., 2020). Fine-tuning PLMs such as BERT (Devlin et al., 2019) has become a basic and effective approach for many downstream tasks (Wadden et al., 2019; Sun et al., 2019; Howard and Ruder, 2018).
However, recent studies have shown that aggressive fine-tuning can lead to unstable and suboptimal model performance, especially when data is insufficient (Dodge et al., 2020; Raffel et al., 2019), which has prompted researchers to identify the culprits and explore effective remedies (Peters et al., 2019; Houlsby et al., 2019; Mosbach et al., 2020). For example, regularization methods such as RecAdam (Chen et al., 2020) and Mixout (Lee et al., 2020) and adversarial training techniques such as SMART (Jiang et al., 2020) and FreeLB (Zhu et al., 2020a) alleviate overfitting to the downstream data. Beyond that, Wu et al. (2022) proposed NoisyTune, arguing that in addition to overfitting the limited downstream data, there can also be overfitting to the pretraining tasks, which results in large gaps between pretraining and downstream task data. To overcome these gaps, NoisyTune simply adds noise to the parameters of the PLM before fine-tuning. Besides, it has also been demonstrated that a large number of redundant parameters can be a factor in the suboptimal performance of aggressively fine-tuned PLMs (Fan et al., 2019; Sanh et al., 2020; Dalvi et al., 2020). Considering that the redundant parameters in a model are not sufficiently trained, Liang et al. (2022) proposed a learning rate scheduler named SAGE, in which larger learning rates are assigned to parameters of low sensitivity (a measure of a parameter's importance to the downstream task).
There could be a connection between the gaps caused by overfitting the pretraining tasks and the redundancy of parameters: we consider that it is the gap between pretraining and downstream tasks that hinders the training of these redundant parameters. SAGE enlarges the learning rates of insensitive parameters to help their training. However, by the sensitivity measure, insensitive parameters usually have small gradients, so enlarged learning rates may help them little to escape suboptimal regions compared with injecting additional noise. One noisy training method that alleviates the gaps is NoisyTune, in which the parameters of each matrix in a PLM are perturbed, before fine-tuning, with noise scaled by the standard deviation of that matrix. Nevertheless, there is little explanation of why, or whether, the parameters of the same matrix should be perturbed with the same intensity. Since different parameters make different contributions to the model, noise drawn from a single shared distribution may disturb the knowledge stored in some sensitive parameters and hurt performance. Moreover, since each task needs to capture an appropriate textual pattern and its data usually comes from a particular domain, different downstream tasks can exhibit different kinds of gaps with respect to pretraining. The noise added to overcome these gaps should therefore also be related to the downstream task data.
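As described above (and by Wu et al., 2022), NoisyTune perturbs every matrix of a PLM once, before fine-tuning begins, with noise whose scale depends only on the standard deviation of that matrix. The following PyTorch sketch is a minimal rendering of this matrix-wise scheme; the value of `lam` and the exact noise range are illustrative assumptions rather than NoisyTune's precise settings, and the sketch is included only to make the contrast with our parameter-wise approach concrete.

```python
import torch

def noisytune_perturb(model, lam=0.15):
    """Matrix-wise perturbation before fine-tuning, in the spirit of
    NoisyTune (Wu et al., 2022): every entry of a parameter matrix gets
    uniform noise scaled by that matrix's standard deviation. `lam` is a
    relative noise intensity; the value here is illustrative only."""
    with torch.no_grad():
        for param in model.parameters():
            if param.numel() < 2:
                continue  # skip scalars, where std() is undefined
            # Same noise distribution for every entry of this matrix:
            # U(-lam, lam) * std(matrix).
            noise = (torch.rand_like(param) - 0.5) * 2.0 * lam * param.std()
            param.add_(noise)
```

Every entry of a given matrix is thus drawn from the same noise distribution, independent of the task data, which is precisely the property questioned above.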
In this paper, we propose a novel parameter-wise noisy fine-tuning method called PATS (Perturbation According To Sensitivity) to make full use of parameter perturbation in addressing the problems above. We focus on balancing the contributions of all parameters in the model by activating the insensitive ones so that they play a bigger role in downstream tasks. The main idea of our method is therefore to add noise of different intensities to parameters according to their sensitivity during fine-tuning, unlike NoisyTune (Fig. 1(b)), where the noise added to a parameter matrix comes from a single shared distribution and is unrelated to the downstream task data. Specifically, during fine-tuning with PATS (Fig. 1(c)), larger noise is added to parameters with lower sensitivity (such as the parameter shown in red), while sensitive parameters (such as the parameter shown in purple) are barely perturbed.
Our contributions can be summarized as follows: 1) We propose a simple but effective method that helps all parameters be trained sufficiently when fine-tuning PLMs on downstream tasks. 2) To the best of our knowledge, among noisy training methods, PATS is the first sensitivity-aware one, perturbing models with noise from different distributions according to the parameters' sensitivity. 3) Extensive experiments on the GLUE benchmark show that PATS consistently boosts the performance of PLMs on downstream NLP tasks.
2 Approach
In this section, we present our PATS for PLM fine-tuning.
[Figure 1: Different schemata of fine-tuning PLMs. (a) Common fine-tuning; (b) NoisyTune: fine-tuning a perturbed PLM; (c) PATS: fine-tuning a PLM with sensitivity-based noise. In NoisyTune, a matrix of parameters in a PLM is perturbed with the same intensity before fine-tuning; in PATS, parameters with lower sensitivity to the downstream data (the "red" parameter) receive larger noise, and vice versa (the "purple" parameter).]
Previous matrix-wise noisy methods perturb a PLM by adding noise from a single uniform distribution to an entire matrix of parameters. In contrast, PATS treats every parameter individually, even within the same matrix, paying each a different amount of attention according to its sensitivity. It is also worth noting that in PATS the PLM is not perturbed in advance as in NoisyTune; instead, the perturbation happens during training, as the task data comes in. In the following sections, we first introduce the calculation of parameter sensitivity and then present the noisy learning mechanism in detail.
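Before detailing the mechanism, the sketch below illustrates, under stated assumptions, where this training-time perturbation sits in an ordinary PyTorch fine-tuning step: unlike NoisyTune, noise is injected at every step, after the gradients on the current batch are available. The helper `noise_for(param)` is a hypothetical placeholder for the sensitivity-dependent noise specified in the remainder of this section, and `model(**batch).loss` assumes a HuggingFace-style model; neither name is prescribed by PATS itself.

```python
import torch

def pats_style_step(model, optimizer, batch, noise_for):
    """One illustrative training step with training-time, parameter-wise
    perturbation. `noise_for(param)` is a hypothetical hook returning a
    noise tensor shaped like `param`, with larger magnitude for less
    sensitive parameters (the actual scheme is defined later in this section)."""
    optimizer.zero_grad()
    loss = model(**batch).loss   # task loss on the current batch
    loss.backward()              # gradients are also what Eq. (1) needs
    with torch.no_grad():
        for param in model.parameters():
            if param.grad is None:
                continue         # frozen or unused parameters
            param.add_(noise_for(param))  # sensitivity-aware perturbation
    optimizer.step()
    return loss.item()
```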
2.1 Sensitivity Measurement
The sensitivity of a parameter measures the change of the output or the loss after setting that parameter to zero (Molchanov et al., 2017, 2019; Ding et al., 2019; Xiao et al., 2019; Lee et al., 2019). To be specific, given a BERT-like pretrained language model $\mathcal{M}$ with parameters $\Theta = \{\theta_1, \theta_2, \cdots, \theta_n\} \in \mathbb{R}^n$, the sensitivity of the $j$-th parameter $\theta_j$, written $s_j$, is defined as

$$s_j = \big|\mathcal{L}(\Theta) - \mathcal{L}(\theta_1, \cdots, \theta_{j-1}, 0, \theta_{j+1}, \cdots, \theta_n)\big| \approx \big|\theta_j \nabla_{\theta_j}\mathcal{L}(\Theta)\big|, \qquad (1)$$

where $\mathcal{L}$ is the loss function. The approximation is the first-degree Taylor expansion of $\mathcal{L}$ around $\Theta$ with the higher-order remainder ignored; it lets $s_j$ be computed from a single backward pass instead of re-evaluating the loss with each parameter zeroed out.
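In code, the first-order approximation in Eq. (1) requires nothing beyond the gradients that a backward pass already provides. The PyTorch sketch below computes $s_j = |\theta_j \nabla_{\theta_j}\mathcal{L}(\Theta)|$ for every parameter of the model after `loss.backward()`; the function name is ours, and any smoothing or aggregation of this per-batch estimate is omitted.

```python
import torch

def parameter_sensitivity(model):
    """Per-parameter sensitivity from Eq. (1): s_j = |theta_j * dL/dtheta_j|.
    Call after loss.backward(), so that .grad holds the gradient of the task
    loss on the current batch. Returns a dict mapping parameter names to
    tensors of the same shape as the parameters."""
    sensitivity = {}
    with torch.no_grad():
        for name, param in model.named_parameters():
            if param.grad is None:
                continue  # e.g. frozen or unused parameters
            sensitivity[name] = (param * param.grad).abs()
    return sensitivity
```

A parameter whose product $|\theta_j \nabla_{\theta_j}\mathcal{L}|$ is small barely changes the loss when zeroed out; these are exactly the parameters PATS perturbs more strongly.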