PATS: Sensitivity-aware Noisy Learning for Pretrained Language Models
Yupeng Zhang1∗, Hongzhi Zhang2, Sirui Wang2, Wei Wu2 and Zhoujun Li1†
1Beihang University, Beijing, China  2Meituan Inc., Beijing, China
{G0vi_qyx, lizj}@buaa.edu.cn
{zhanghongzhi03, wangsirui, wuwei30}@meituan.com
∗Work done during internship at Meituan Inc. The first two authors have equal contributions.
†Corresponding author.
Abstract
A wide range of NLP tasks benefit from fine-tuning pretrained language models (PLMs). However, a directly fine-tuned model contains many redundant parameters that contribute little to the downstream task. We argue that the gap between pretraining and downstream tasks hinders the training of these redundant parameters and leads to suboptimal performance of the overall model. In this paper, we present PATS (Perturbation According To Sensitivity), a noisy training mechanism that takes each parameter's importance to the downstream task into account when fine-tuning PLMs. The main idea of PATS is to add larger noise to parameters with lower sensitivity and vice versa, so that more parameters are activated to contribute to the downstream task while the sensitive ones are barely affected. Extensive experiments on different tasks of the GLUE benchmark show that PATS consistently improves the fine-tuning of PLMs of different sizes, and that the parameters of the well-performing models always have more concentrated sensitivity distributions, which empirically confirms the effectiveness of our method.
1 Introduction
With a huge number of model parameters and well-designed training objectives, pretrained language models (PLMs) have brought a new era to NLP (Guu et al., 2020; Liu, 2019; Zhu et al., 2020b; Qiu et al., 2020). Fine-tuning PLMs such as BERT (Devlin et al., 2019) has become a basic and effective approach for many downstream tasks (Wadden et al., 2019; Sun et al., 2019; Howard and Ruder, 2018).
However, recent studies have shown that aggressive fine-tuning can lead to unstable and suboptimal model performance, especially when data is insufficient (Dodge et al., 2020; Raffel et al., 2019), which has prompted researchers to identify the culprits and explore effective remedies (Peters et al., 2019; Houlsby et al., 2019; Mosbach et al., 2020). For example, regularization methods such as RecAdam (Chen et al., 2020) and Mixout (Lee et al., 2020) and adversarial training techniques such as SMART (Jiang et al., 2020) and FreeLB (Zhu et al., 2020a) alleviate overfitting to the downstream data. Beyond that, Wu et al. (2022) proposed NoisyTune, arguing that in addition to overfitting the limited downstream data, there can also be overfitting to the pretraining tasks, which results in large gaps between pretraining and downstream task data. To overcome these gaps, NoisyTune simply adds noise to the parameters of the PLM before fine-tuning. Besides, it has also been demonstrated that a large number of redundant parameters can be a factor in the suboptimal performance of aggressively fine-tuned PLMs (Fan et al., 2019; Sanh et al., 2020; Dalvi et al., 2020). Considering that the redundant parameters in a model are not sufficiently trained, Liang et al. (2022) proposed a learning rate scheduler named SAGE, in which larger learning rates are assigned to parameters of low sensitivity (a measure of a parameter's importance to the downstream task).
There could be a connection between the gaps caused by overfitting the pretraining tasks and the redundancy of parameters: we consider that it is the gap between pretraining and downstream tasks that hinders the training of these redundant parameters. SAGE enlarges the learning rates of insensitive parameters to help their training. However, by the sensitivity measure, insensitive parameters usually have small gradients, so enlarged learning rates may help them little to escape suboptimal regions compared with injecting additional noise. One noisy training method that alleviates the gaps is NoisyTune, in which the parameters of each matrix in a PLM are perturbed, before fine-tuning, with noise scaled by the standard deviation of that matrix. Nevertheless, there is little explanation of why, or whether, the parameters of the same matrix should be perturbed with the same intensity. Since different parameters make different contributions to the model, noise drawn from a single shared distribution may disturb the knowledge stored in some sensitive parameters and hurt performance. Moreover, since each task needs to capture an appropriate textual pattern and its data usually comes from a particular domain, different downstream tasks can exhibit different kinds of gaps with respect to pretraining. The noise added to overcome these gaps should therefore also be related to the downstream task data.
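As described above (and by Wu et al., 2022), NoisyTune perturbs every matrix of a PLM once, before fine-tuning begins, with noise whose scale depends only on the standard deviation of that matrix. The following PyTorch sketch is a minimal rendering of this matrix-wise scheme; the value of `lam` and the exact noise range are illustrative assumptions rather than NoisyTune's precise settings, and the sketch is included only to make the contrast with our parameter-wise approach concrete.

```python
import torch

def noisytune_perturb(model, lam=0.15):
    """Matrix-wise perturbation before fine-tuning, in the spirit of
    NoisyTune (Wu et al., 2022): every entry of a parameter matrix gets
    uniform noise scaled by that matrix's standard deviation. `lam` is a
    relative noise intensity; the value here is illustrative only."""
    with torch.no_grad():
        for param in model.parameters():
            if param.numel() < 2:
                continue  # skip scalars, where std() is undefined
            # Same noise distribution for every entry of this matrix:
            # U(-lam, lam) * std(matrix).
            noise = (torch.rand_like(param) - 0.5) * 2.0 * lam * param.std()
            param.add_(noise)
```

Every entry of a given matrix is thus drawn from the same noise distribution, independent of the task data, which is precisely the property questioned above.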
In this paper, we propose a novel parameter-wise noisy fine-tuning method called PATS (Perturbation According To Sensitivity) to make full use of parameter perturbation in addressing the problems above. We focus on balancing the contributions of all parameters in the model by activating the insensitive ones so that they play a bigger role in downstream tasks. The main idea of our method is therefore to add noise of different intensities to parameters according to their sensitivity during fine-tuning, unlike NoisyTune (Fig. 1(b)), where the noise added to a parameter matrix comes from a single shared distribution and is unrelated to the downstream task data. Specifically, during fine-tuning with PATS (Fig. 1(c)), larger noise is added to parameters with lower sensitivity (such as the parameter shown in red), while sensitive parameters (such as the parameter shown in purple) are barely perturbed.
Our contributions can be summarized as follows: 1) We propose a simple but effective method that helps all parameters be trained sufficiently when fine-tuning PLMs on downstream tasks. 2) To the best of our knowledge, among noisy training methods, PATS is the first sensitivity-aware one, perturbing models with noise from different distributions according to the parameters' sensitivity. 3) Extensive experiments on the GLUE benchmark show that PATS consistently boosts the performance of PLMs on downstream NLP tasks.
2 Approach
In this section, we present our PATS for PLM fine-tuning.
[Figure 1: Different schemata of fine-tuning PLMs. (a) Common fine-tuning; (b) NoisyTune: fine-tuning a perturbed PLM; (c) PATS: fine-tuning a PLM with sensitivity-based noise. In NoisyTune, a matrix of parameters in a PLM is perturbed with the same intensity before fine-tuning; in PATS, parameters with lower sensitivity to the downstream data (the "red" parameter) receive larger noise, and vice versa (the "purple" parameter).]
Previous matrix-wise noisy methods perturb a PLM by adding noise from a single uniform distribution to an entire matrix of parameters. In contrast, PATS treats every parameter individually, even within the same matrix, paying each a different amount of attention according to its sensitivity. It is also worth noting that in PATS the PLM is not perturbed in advance as in NoisyTune; instead, the perturbation happens during training, as the task data comes in. In the following sections, we first introduce the calculation of parameter sensitivity and then present the noisy learning mechanism in detail.
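Before detailing the mechanism, the sketch below illustrates, under stated assumptions, where this training-time perturbation sits in an ordinary PyTorch fine-tuning step: unlike NoisyTune, noise is injected at every step, after the gradients on the current batch are available. The helper `noise_for(param)` is a hypothetical placeholder for the sensitivity-dependent noise specified in the remainder of this section, and `model(**batch).loss` assumes a HuggingFace-style model; neither name is prescribed by PATS itself.

```python
import torch

def pats_style_step(model, optimizer, batch, noise_for):
    """One illustrative training step with training-time, parameter-wise
    perturbation. `noise_for(param)` is a hypothetical hook returning a
    noise tensor shaped like `param`, with larger magnitude for less
    sensitive parameters (the actual scheme is defined later in this section)."""
    optimizer.zero_grad()
    loss = model(**batch).loss   # task loss on the current batch
    loss.backward()              # gradients are also what Eq. (1) needs
    with torch.no_grad():
        for param in model.parameters():
            if param.grad is None:
                continue         # frozen or unused parameters
            param.add_(noise_for(param))  # sensitivity-aware perturbation
    optimizer.step()
    return loss.item()
```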
2.1 Sensitivity Measurement
The sensitivity of a parameter measures the change of the output or the loss after setting that parameter to zero (Molchanov et al., 2017, 2019; Ding et al., 2019; Xiao et al., 2019; Lee et al., 2019). To be specific, given a BERT-like pretrained language model $\mathcal{M}$ with parameters $\Theta = \{\theta_1, \theta_2, \cdots, \theta_n\} \in \mathbb{R}^n$, the sensitivity of the $j$-th parameter $\theta_j$, written $s_j$, is defined as

$$s_j = \big|\mathcal{L}(\Theta) - \mathcal{L}(\theta_1, \cdots, \theta_{j-1}, 0, \theta_{j+1}, \cdots, \theta_n)\big| \approx \big|\theta_j \nabla_{\theta_j}\mathcal{L}(\Theta)\big|, \qquad (1)$$

where $\mathcal{L}$ is the loss function. The approximation is the first-degree Taylor expansion of $\mathcal{L}$ around $\Theta$ with the higher-order remainder ignored; it lets $s_j$ be computed from a single backward pass instead of re-evaluating the loss with each parameter zeroed out.
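In code, the first-order approximation in Eq. (1) requires nothing beyond the gradients that a backward pass already provides. The PyTorch sketch below computes $s_j = |\theta_j \nabla_{\theta_j}\mathcal{L}(\Theta)|$ for every parameter of the model after `loss.backward()`; the function name is ours, and any smoothing or aggregation of this per-batch estimate is omitted.

```python
import torch

def parameter_sensitivity(model):
    """Per-parameter sensitivity from Eq. (1): s_j = |theta_j * dL/dtheta_j|.
    Call after loss.backward(), so that .grad holds the gradient of the task
    loss on the current batch. Returns a dict mapping parameter names to
    tensors of the same shape as the parameters."""
    sensitivity = {}
    with torch.no_grad():
        for name, param in model.named_parameters():
            if param.grad is None:
                continue  # e.g. frozen or unused parameters
            sensitivity[name] = (param * param.grad).abs()
    return sensitivity
```

A parameter whose product $|\theta_j \nabla_{\theta_j}\mathcal{L}|$ is small barely changes the loss when zeroed out; these are exactly the parameters PATS perturbs more strongly.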