
Figure 1: Attention maps of vanilla dropout and our AD-DROP. Darker attention positions indicate higher attribution scores, and crossed circles mean dropped attention positions. Red-dotted boxes refer to candidate discard regions with high attribution scores. Unlike vanilla dropout which randomly discards attention positions, AD-DROP focuses on dropping high-attribution positions in candidate discard regions.
Motivated by the above, we propose Attribution-Driven Dropout (AD-DROP) to better fine-tune PrLMs based on self-attention attribution. The general idea of AD-DROP is to drop a set of self-attention positions with high attribution scores. We illustrate the difference between vanilla dropout and AD-DROP by their attention maps in Figure 1. When fine-tuning a PrLM on a batch of training samples, AD-DROP involves four steps. First, predictions are made through a forward computation without dropping any attention position. Second, we compute the attribution score of each attention position by gradient [25] or integrated gradient [26] attribution methods. Third, we sample a set of positions with high attribution scores and generate a mask matrix for each attention map. Finally, the mask matrices are applied to the next forward computation to make predictions for backpropagation. AD-DROP can be regarded as a strategic dropout regularizer that forces the model to make predictions by relying more on low-attribution positions to reduce overfitting. Nevertheless, excessive neglect of high-attribution positions would leave insufficient information for training. Hence, we further propose a cross-tuning strategy that performs fine-tuning and AD-DROP alternately to improve the training stability.
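As a rough, non-authoritative illustration of steps 2-4, the sketch below builds the mask matrix for a single attention map in PyTorch. It assumes that attribution is approximated by the element-wise product of the attention map and the gradient of the prediction logit with respect to that map, and that the candidate discard region is the top fraction of positions per row; the function name build_ad_drop_mask and the hyperparameters candidate_ratio and drop_prob are illustrative, not the authors' released implementation.

import torch

# Sketch: construct an AD-DROP mask for one attention head (illustrative only).
def build_ad_drop_mask(attn_map: torch.Tensor,   # (n, n) attention map from the first forward pass
                       attn_grad: torch.Tensor,  # (n, n) gradient of the prediction logit w.r.t. attn_map
                       candidate_ratio: float = 0.3,
                       drop_prob: float = 0.3) -> torch.Tensor:
    # Step 2: gradient-based attribution score of each attention position.
    attribution = attn_map * attn_grad
    # Step 3a: candidate discard region = highest-attribution positions in each row.
    k = max(1, int(candidate_ratio * attribution.size(-1)))
    top_idx = attribution.topk(k, dim=-1).indices
    candidate = torch.zeros_like(attribution).scatter_(-1, top_idx, 1.0)
    # Step 3b: randomly drop only a portion of the candidates, so some
    # high-attribution positions remain visible to the model.
    dropped = candidate * torch.bernoulli(torch.full_like(attribution, drop_prob))
    # Step 4 (input): -inf entries zero out the corresponding attention weights
    # when this mask is added before the softmax in the next forward pass.
    return torch.zeros_like(attribution).masked_fill(dropped.bool(), float("-inf"))

Under the cross-tuning strategy, such masks would be applied in some training epochs and skipped in others (i.e., plain fine-tuning), so that high-attribution positions are not neglected throughout training.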
To verify the effectiveness of AD-DROP, we conduct extensive experiments with different PrLMs (i.e., BERT [1], RoBERTa [2], ELECTRA [29], and OPUS-MT [30]) on various datasets (i.e., GLUE [31], CoNLL-2003 [32], WMT 2016 EN-RO and TR-EN [33], HANS [34], and PAWS-X [35]).
Experimental results show that models tuned with AD-DROP obtain remarkable improvements over those tuned with the original fine-tuning approach. For example, on the GLUE benchmark, BERT achieves an average improvement of 1.98/0.87 points on the dev/test sets, while RoBERTa achieves an average improvement of 1.29/0.62 points. Moreover, ablation studies and analysis demonstrate that gradient-based attribution [25, 26] is a more suitable saliency measure for implementing AD-DROP than directly using attention weights or simple random sampling, and that the cross-tuning strategy plays a crucial role in improving training stability.
To sum up, this work reveals that self-attention positions are not equally important for dropout when fine-tuning PrLMs. Arguably, low-attribution positions are more difficult to optimize than high-attribution positions, and dropping these positions tends not to relieve but to accelerate overfitting. This observation leads to a novel dropout regularizer, AD-DROP, driven by self-attention attribution. Although proposed for self-attention units, AD-DROP can potentially be extended to other units as a dropout regularizer.
2 Methodology
2.1 Preliminaries
Since Transformers [4] are the backbone of PrLMs, we first review the details of self-attention in Transformers and self-attention attribution [23]. Let $X \in \mathbb{R}^{n \times d}$ be the input of a Transformer block, where $n$ is the sequence length and $d$ is the embedding size. Self-attention in this block first maps $X$ into three matrices $Q_h$, $K_h$, and $V_h$ via linear projections as query, key, and value, respectively, for the $h$-th head. Then, the attention output of this head is calculated as:
\[
\mathrm{Attention}(Q_h, K_h, V_h) = A_h V_h = \mathrm{softmax}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{d_k}} + M_h\right) V_h, \tag{1}
\]
where $\sqrt{d_k}$ is a scaling factor, and $M_h$ is the mask matrix used to apply dropout in self-attention, whose elements are $-\infty$ if the corresponding positions in the attention map are masked and $0$ otherwise.
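For concreteness, the following single-head sketch shows where $M_h$ enters the computation of Eq. (1). It is a generic re-implementation under the notation above rather than the authors' code; the projection weights W_q, W_k, and W_v are assumed to be given.

import math
import torch

# Sketch: single-head self-attention with an additive mask M_h, following Eq. (1).
def masked_self_attention(X: torch.Tensor,     # (n, d) input of the Transformer block
                          W_q: torch.Tensor,   # (d, d_k) query projection
                          W_k: torch.Tensor,   # (d, d_k) key projection
                          W_v: torch.Tensor,   # (d, d_v) value projection
                          M_h: torch.Tensor):  # (n, n) mask: -inf = dropped, 0 = kept
    Q_h, K_h, V_h = X @ W_q, X @ W_k, X @ W_v
    scores = Q_h @ K_h.transpose(-2, -1) / math.sqrt(Q_h.size(-1)) + M_h
    A_h = torch.softmax(scores, dim=-1)        # dropped positions receive zero weight
    return A_h @ V_h, A_h                      # attention output and attention map A_h

In AD-DROP, the first forward pass uses $M_h = 0$; the resulting attention map $A_h$ and its gradient then provide the attribution scores used to fill $M_h$ for the second pass.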