AD-DROP: Attribution-Driven Dropout for Robust Language Model Fine-Tuning

Tao Yang1, Jinghao Deng1, Xiaojun Quan1∗, Qifan Wang2, Shaoliang Nie2
1School of Computer Science and Engineering, Sun Yat-sen University  2Meta AI
1{yangt225,dengjh27}@mail2.sysu.edu.cn, quanxj3@mail.sysu.edu.cn
2{wqfcr, snie}@fb.com
Abstract
Fine-tuning large pre-trained language models on downstream tasks is apt to suffer
from overfitting when limited training data is available. While dropout proves
to be an effective antidote by randomly dropping a proportion of units, existing
research has not examined its effect on the self-attention mechanism. In this paper,
we investigate this problem through self-attention attribution and find that dropping
attention positions with low attribution scores can accelerate training and increase
the risk of overfitting. Motivated by this observation, we propose Attribution-
Driven Dropout (AD-DROP), which randomly discards some high-attribution
positions to encourage the model to make predictions by relying more on low-
attribution positions to reduce overfitting. We also develop a cross-tuning strategy
to alternate fine-tuning and AD-DROP to avoid dropping high-attribution positions
excessively. Extensive experiments on various benchmarks show that AD-DROP
yields consistent improvements over baselines. Analysis further confirms that
AD-DROP serves as a strategic regularizer to prevent overfitting during fine-tuning.
1 Introduction
Pre-training large language models (PrLMs) on massive unlabeled corpora and fine-tuning them on downstream tasks has become a new paradigm [1-3]. Their success can be partly attributed to the self-attention mechanism [4], yet these self-attention networks are often redundant [5, 6] and tend to cause overfitting when fine-tuned on downstream tasks due to the mismatch between their overparameterization and the limited annotated data [7-13]. To address this issue, various regularization techniques such as data augmentation [14, 15], adversarial training [16, 17], and dropout-based methods [11, 13, 18] have been developed. Among them, dropout-based methods are widely adopted for their simplicity and effectiveness. Dropout [19], which randomly discards a proportion of units, is at the core of dropout-based methods. Recently, several variants of dropout have been proposed, such as Concrete Dropout [20], DropBlock [21], and AutoDropout [22]. However, these variants generally follow vanilla dropout in randomly dropping units during training and pay little attention to the effect of dropout on self-attention. In this paper, we seek to fill this gap from the perspective of self-attention attribution [23] and aim to reduce overfitting when fine-tuning PrLMs.
Attribution [24] is an interpretability method that attributes model predictions to input features via saliency measures such as gradients [25, 26]. It has also been used to explain the influence patterns of self-attention in recent literature [23, 27, 28]. Our prior experiment on self-attention attribution (Section 2.2) reveals that attention positions are not equally important in preventing overfitting, and dropping low-attribution positions is more likely to cause overfitting than discarding high-attribution positions. This observation suggests that attention positions should not be treated the same in dropout.
∗Corresponding author.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
Figure 1: Attention maps of vanilla dropout (left) and our AD-DROP (right). Darker attention positions indicate higher attribution scores, and crossed circles mean dropped attention positions. Red-dotted boxes refer to candidate discard regions with high attribution scores. Unlike vanilla dropout, which randomly discards attention positions, AD-DROP focuses on dropping high-attribution positions in candidate discard regions.
Motivated by the above, we propose Attribution-Driven Dropout (AD-DROP) to better fine-tune PrLMs based on self-attention attribution. The general idea of AD-DROP is to drop a set of self-attention positions with high attribution scores. We illustrate the difference between vanilla dropout and AD-DROP by their attention maps in Figure 1. When fine-tuning a PrLM on a batch of training samples, AD-DROP involves four steps. First, predictions are made through a forward computation without dropping any attention position. Second, we compute the attribution score of each attention position by the gradient [25] or integrated gradient [26] attribution method. Third, we sample a set of positions with high attribution scores and generate a mask matrix for each attention map. Finally, the mask matrices are applied to the next forward computation to make predictions for backpropagation. AD-DROP can be regarded as a strategic dropout regularizer that forces the model to make predictions by relying more on low-attribution positions, thereby reducing overfitting. Nevertheless, excessive neglect of high-attribution positions would leave insufficient information for training. Hence, we further propose a cross-tuning strategy that performs fine-tuning and AD-DROP alternately to improve training stability.
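To make the cross-tuning schedule concrete, the following is a minimal Python sketch. It assumes a simple 1:1 alternation between ordinary fine-tuning epochs and AD-DROP epochs; the two epoch routines are placeholders supplied by the caller, not part of the method's definition.

```python
from typing import Callable

def cross_tuning(finetune_epoch: Callable[[], None],
                 addrop_epoch: Callable[[], None],
                 num_epochs: int) -> None:
    """Alternate ordinary fine-tuning with AD-DROP, epoch by epoch.

    A 1:1 schedule is assumed here purely for illustration; the point is
    that high-attribution positions are not dropped in every update.
    """
    for epoch in range(num_epochs):
        if epoch % 2 == 0:
            finetune_epoch()  # standard fine-tuning, no attribution-driven mask
        else:
            addrop_epoch()    # AD-DROP: attribute, mask, second forward pass
```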
To verify the effectiveness of AD-DROP, we conduct extensive experiments with different PrLMs (i.e., BERT [1], RoBERTa [2], ELECTRA [29], and OPUS-MT [30]) on various datasets (i.e., GLUE [31], CoNLL-2003 [32], WMT 2016 EN-RO and TR-EN [33], HANS [34], and PAWS-X [35]).
Experimental results show that models tuned with AD-DROP obtain remarkable improvements over those tuned with the original fine-tuning approach. For example, on the GLUE benchmark, BERT achieves an average improvement of 1.98/0.87 points on the dev/test sets, while RoBERTa achieves an average improvement of 1.29/0.62 points. Moreover, ablation studies and analysis demonstrate that gradient-based attribution [25, 26] is a more suitable saliency measure for implementing AD-DROP than directly using attention weights or simple random sampling, and that the cross-tuning strategy plays a crucial role in improving training stability.
To sum up, this work reveals that self-attention positions are not equally important for dropout when fine-tuning PrLMs. Arguably, low-attribution positions are more difficult to optimize than high-attribution positions, and dropping them tends not to relieve but to accelerate overfitting. This leads to a novel dropout regularizer, AD-DROP, driven by self-attention attribution. Although proposed for self-attention units, AD-DROP can potentially be extended to other types of units, just as dropout is.
2 Methodology
2.1 Preliminaries
Since Transformers [4] are the backbone of PrLMs, we first review the details of self-attention in Transformers and self-attention attribution [23]. Let $X \in \mathbb{R}^{n \times d}$ be the input of a Transformer block, where $n$ is the sequence length and $d$ is the embedding size. Self-attention in this block first maps $X$ into three matrices $Q_h$, $K_h$, and $V_h$ via linear projections as query, key, and value, respectively, for the $h$-th head. Then, the attention output of this head is calculated as:

$$\mathrm{Attention}(Q_h, K_h, V_h) = A_h V_h = \mathrm{softmax}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{d_k}} + M_h\right) V_h, \qquad (1)$$

where $d_k$ is a scaling factor and $M_h$ is the mask matrix used to apply dropout in self-attention; elements of $M_h$ are $-\infty$ if the corresponding positions in the attention map are masked and $0$ otherwise.
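For reference, the following PyTorch sketch implements Eq. (1) for a single head. The function name and tensor shapes are ours, and batch and head dimensions are omitted for brevity.

```python
import math
import torch

def masked_attention(Q_h: torch.Tensor, K_h: torch.Tensor,
                     V_h: torch.Tensor, M_h: torch.Tensor):
    """Single-head self-attention with an additive mask, as in Eq. (1).

    Q_h, K_h, V_h: (n, d_k) query/key/value projections of head h.
    M_h: (n, n) mask whose entries are -inf at dropped attention
         positions and 0 elsewhere; softmax maps -inf scores to weight 0.
    Returns the attention output A_h @ V_h and the attention map A_h.
    """
    d_k = Q_h.size(-1)
    scores = Q_h @ K_h.transpose(-2, -1) / math.sqrt(d_k) + M_h
    A_h = torch.softmax(scores, dim=-1)
    return A_h @ V_h, A_h
```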
Based on the attention maps $A = [A_1, A_2, \cdots, A_H]$ for $H$ attention heads, gradient attribution [25, 36] directly produces an attribution matrix $B_h$ by computing the following partial derivative:

$$B_h = \frac{\partial F_c(A)}{\partial A_h}, \qquad (2)$$

where $F_c(\cdot)$ denotes the logit output of the Transformer for class $c$.
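A minimal sketch of Eq. (2) with PyTorch autograd, assuming the per-head attention maps $A_h$ are exposed by the forward pass as tensors that are still part of the computation graph:

```python
import torch

def gradient_attribution(logit_c: torch.Tensor, attention_maps):
    """Eq. (2): B_h = dF_c(A) / dA_h for every head.

    logit_c: scalar logit of class c (before softmax), produced by a
        forward pass in which `attention_maps` participated.
    attention_maps: sequence of (n, n) attention maps A_h.
    Returns one attribution matrix B_h per head.
    """
    grads = torch.autograd.grad(logit_c, list(attention_maps),
                                retain_graph=True)
    return [g.detach() for g in grads]
```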
To provide a theoretically more sound attribution method, Sundararajan et al. [26] propose integrated gradient, which is employed by Hao et al. [23] as a saliency measure for self-attention attribution. Specifically, Hao et al. [23] compute the attribution matrix $B_h$ as:

$$B_h = \frac{A_h}{m} \odot \sum_{k=1}^{m} \frac{\partial F_c\!\left(\frac{k}{m} A\right)}{\partial A_h}, \qquad (3)$$

where $m$ is the number of steps for approximating the integration in integrated gradient, and $\odot$ is the element-wise multiplication operator. Despite its theoretical advantage over gradient attribution, integrated gradient requires $m$ times more computational effort, which is especially expensive when applied to all the attention heads in a Transformer. Moreover, our experiments in Section 3.4 show that gradient attribution achieves performance comparable to integrated gradient at a much lower computational cost, suggesting that gradient attribution is more desirable for AD-DROP.
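For completeness, here is a sketch of the integrated-gradient variant in Eq. (3). It assumes the model can be re-run with an externally scaled attention map for the head of interest; the `forward_logit_fn` callable below is a placeholder for that plumbing.

```python
import torch

def integrated_gradient_attribution(forward_logit_fn, A_h: torch.Tensor,
                                    m: int = 20) -> torch.Tensor:
    """Eq. (3): B_h = (A_h / m) * sum_k dF_c((k/m) A) / dA_h.

    forward_logit_fn: maps a scaled attention map for head h back to the
        scalar class logit F_c(.); how the map is injected into the model
        is left abstract here.
    A_h: (n, n) attention map from the first forward pass.
    m:   number of steps approximating the integral.
    """
    total_grad = torch.zeros_like(A_h)
    for k in range(1, m + 1):
        scaled = (k / m) * A_h.detach()
        scaled.requires_grad_(True)
        logit = forward_logit_fn(scaled)
        grad, = torch.autograd.grad(logit, scaled)
        total_grad += grad
    return A_h.detach() / m * total_grad  # element-wise product with A_h / m
```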
2.2 A Prior Attribution Experiment
Figure 2: Training loss (a) and validation loss (b) when fine-tuning RoBERTa with different dropping strategies on MRPC. The dropping rate is set to 0.3 if it applies.
To better motivate our work, we first conduct a prior experiment on MRPC [37] to investigate how different positions in self-attention maps affect fine-tuning performance based on attribution results. RoBERTa-base [2] is used as the base model. To begin with, we perform a forward computation of the model on each batch of training samples to obtain the logit output of each sample corresponding to the gold label. Then, we obtain an attribution matrix $B_h$ for the self-attention positions in the first layer² by gradient attribution with Eq. (2) and sort each row of the matrix. Finally, we sample a set of self-attention positions with high or low attribution scores in each row to generate a mask matrix $M_h$, which is fed into Eq. (1) to make the final predictions. After each epoch of training, we evaluate the model on the development set. Two baseline dropping strategies (i.e., dropping by random sampling and not dropping any position) are employed for comparison. We plot the loss curves of the model with these dropping strategies on both the training and development sets in Figure 2. The observations are threefold. First, dropping low-attribution positions makes the model fit the training data rapidly, whereas it performs poorly on the development set, indicating that the model is not properly trained. Second, compared with the other dropping strategies, dropping high-attribution positions reduces the fitting speed significantly. Third, random dropping only slightly reduces overfitting compared to training without dropping. These observations suggest that attention positions are of different importance in preventing overfitting. We conjecture that low-attribution positions are more difficult to optimize than high-attribution positions: while dropping low-attribution positions tends to accelerate overfitting, discarding high-attribution positions helps reduce it.
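The dropping strategies compared in this experiment could be realized with a small helper like the one below (a sketch under our own naming). Per row of the attribution matrix it masks the `int(rate * n)` positions with the highest scores, the lowest scores, or a random subset; the no-dropping baseline corresponds to an all-zero mask.

```python
import torch

def make_drop_mask(B_h: torch.Tensor, rate: float = 0.3,
                   strategy: str = "high") -> torch.Tensor:
    """Additive attention mask for the prior experiment.

    For each row of the attribution matrix B_h, mark int(rate * n)
    positions with -inf: the highest-attribution ones ("high"),
    the lowest ("low"), or a random subset ("random").
    """
    n = B_h.size(-1)
    k = int(n * rate)
    mask = torch.zeros_like(B_h)
    if k == 0:
        return mask  # degenerate case: nothing to drop
    if strategy == "random":
        idx = torch.rand_like(B_h).topk(k, dim=-1).indices
    else:
        idx = B_h.topk(k, dim=-1, largest=(strategy == "high")).indices
    mask.scatter_(-1, idx, float("-inf"))
    return mask
```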
2.3 Attribution-Driven Dropout
Inspired by the observations in Section 2.2, we propose a novel regularizer, AD-DROP, to better prevent overfitting when adapting PrLMs to downstream tasks. The motivation of AD-DROP is to reduce these models' over-reliance on particular features, which may hurt their generalization.
²We provide more results and discussions in Appendix B.
Formally, given a training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ of $N$ samples, where $x_i$ is the $i$-th sample and $y_i$ is its label, the goal of AD-DROP is to fine-tune a PrLM $F(\cdot)$ of $L$ layers on $\mathcal{D}$. As with vanilla dropout [19], AD-DROP is only applied in the training phase.
Figure 3: Illustration of AD-DROP in four steps. (1) Conduct the first forward computation to obtain the pseudo label $\tilde{c}$. (2) Generate attribution matrices $B$ by computing the gradient of the logit output $F_{\tilde{c}}(A)$ with respect to each attention head. (3) Sort $B$ and strategically drop some positions to produce mask matrices $M$. (4) Feed $M$ into the next forward computation to compute the final loss.
As shown in Figure 3, the idea of AD-DROP can be described in four steps. First, we conduct a forward computation of the model to obtain the label with the highest probability as the pseudo label. The reason we adopt pseudo labels rather than gold labels for attribution will be explained shortly. Specifically, for the input $x_i$ with $n$ tokens, we apply $F(\cdot)$ to encode it and obtain its pseudo label $\tilde{c}$:

$$\tilde{c} = \arg\max_{c} \left( P_F(c \mid x_i) \right), \qquad (4)$$

where $P_F(c \mid x_i)$ is the probability of class $c$ for $x_i$. After the forward computation, we also obtain a set of attention maps $A = [A_1, A_2, \cdots, A_H]$ for each layer according to Eq. (1).
Second, we compute the attribution matrices $B = [B_1, B_2, \cdots, B_H]$ for the $H$ heads according to Eq. (2). Specifically, the attribution matrix $B_h$ for the $h$-th head is computed as:

$$B_h = \frac{\partial F_{\tilde{c}}(A)}{\partial A_h}, \qquad (5)$$

where $F_{\tilde{c}}(A)$ is the logit output for the pseudo label $\tilde{c}$ before softmax.³
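Steps 1 and 2 can be sketched as follows, again assuming the attention maps are kept in the autograd graph during the first forward pass; names and shapes are illustrative and refer to a single example.

```python
import torch

def pseudo_label_attribution(logits: torch.Tensor, attention_maps):
    """Eqs. (4)-(5) for a single example.

    logits: (num_classes,) pre-softmax outputs of the first forward pass.
    attention_maps: per-head attention maps A_h from that same pass.
    Returns the pseudo label c~ and one attribution matrix B_h per head.
    """
    c_tilde = int(torch.argmax(logits))           # Eq. (4)
    logit_c = logits[c_tilde]                     # F_c~(A), before softmax
    grads = torch.autograd.grad(logit_c, list(attention_maps),
                                retain_graph=True)
    return c_tilde, [g.detach() for g in grads]   # Eq. (5)
```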
Third, we generate a mask matrix $M_h$ based on $B_h$. To this end, we first sort each row of $B_h$ in ascending order and obtain a sorted attribution matrix $\hat{B}_h$. Then, we define a candidate discard region $S_h$, in which each element $s_{i,j}$ is defined as:

$$s_{i,j} = \begin{cases} 1, & b_{i,j} < \hat{b}_{i,\,\mathrm{int}(n(1-p))} \\ 0, & \text{otherwise,} \end{cases} \qquad (6)$$

where $b_{i,j}$ and $\hat{b}_{i,j}$ are elements of $B_h$ and $\hat{B}_h$, respectively, $\mathrm{int}(\cdot)$ is an integer function, and $p \in (0,1)$ is used to control the size of the candidate discard region. Next, we apply dropout in the region to produce the mask matrix $M_h$ as:

$$m_{i,j} = \begin{cases} -\infty, & (s_{i,j} + u_{i,j}) = 0 \\ 0, & \text{otherwise,} \end{cases} \qquad (7)$$

where $u_{i,j} \sim \mathrm{Bernoulli}(1-q)$ is an element of the matrix $U_h \in \mathbb{R}^{n \times n}$ and $q$ is the dropout rate.
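A direct transcription of Eqs. (6) and (7) for one head might look as follows (batch and head dimensions omitted; the function name is ours). Note that positions with $s_{i,j} = 1$ are never masked, so dropout with rate $q$ effectively acts only on the top-$p$ fraction of each row, i.e., the highest-attribution positions.

```python
import torch

def addrop_mask(B_h: torch.Tensor, p: float = 0.3, q: float = 0.3) -> torch.Tensor:
    """Eqs. (6)-(7): build the AD-DROP mask M_h for one attention head.

    B_h: (n, n) attribution matrix of head h.
    p:   controls the size of the candidate discard region (fraction of
         each row with the highest attribution scores).
    q:   dropout rate applied inside that region.
    """
    n = B_h.size(-1)
    sorted_b, _ = torch.sort(B_h, dim=-1)                    # ascending, row-wise
    threshold = sorted_b[..., int(n * (1 - p))].unsqueeze(-1)
    S_h = (B_h < threshold).float()                          # Eq. (6)
    U_h = torch.bernoulli(torch.full_like(B_h, 1.0 - q))     # u_ij ~ Bernoulli(1-q)
    M_h = torch.where(S_h + U_h == 0,                        # Eq. (7)
                      torch.full_like(B_h, float("-inf")),
                      torch.zeros_like(B_h))
    return M_h
```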
Finally, $M_h$ is fed into the self-attention of Eq. (1) for the second forward computation, and the final output is used to calculate the loss for backpropagation.
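Putting the four steps together, one AD-DROP update could be sketched as below. The `model(inputs, attn_masks=...)` interface is hypothetical: it is assumed to return the task logits together with the per-head attention maps and, when masks are provided, to add each mask to the corresponding head's attention scores as in Eq. (1). The sketch reuses `addrop_mask` from the previous snippet and glosses over per-sample masking details within a batch.

```python
import torch
import torch.nn.functional as F

def addrop_training_step(model, batch, optimizer, p=0.3, q=0.3):
    """One AD-DROP update (classification case), following Figure 3."""
    inputs, labels = batch

    # Step 1: first forward pass without attribution-driven masking.
    logits, attention_maps = model(inputs)
    pseudo = logits.argmax(dim=-1)                              # Eq. (4)

    # Step 2: gradient of the pseudo-label logits w.r.t. attention maps.
    selected = logits.gather(-1, pseudo.unsqueeze(-1)).sum()    # Eq. (5)
    grads = torch.autograd.grad(selected, list(attention_maps))

    # Step 3: one mask per layer/head from the attribution matrices.
    attn_masks = [addrop_mask(B_h, p=p, q=q) for B_h in grads]  # Eqs. (6)-(7)

    # Step 4: second forward pass with masks; backpropagate the task loss.
    logits, _ = model(inputs, attn_masks=attn_masks)
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```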
³The negative loss will be used for both regression and token-level tasks, as introduced in Appendix A.