
Figure 1: Attention maps of vanilla dropout and our AD-DROP. Darker attention positions indicate higher attribution scores, and crossed circles mean dropped attention positions. Red-dotted boxes refer to candidate discard regions with high attribution scores. Unlike vanilla dropout which randomly discards attention positions, AD-DROP focuses on dropping high-attribution positions in candidate discard regions.
Motivated by the above, we propose Attribution-Driven Dropout (AD-DROP) to better fine-tune PrLMs based on self-attention attribution. The general idea of AD-DROP is to drop a set of self-attention positions with high attribution scores. We illustrate the difference between vanilla dropout and AD-DROP by their attention maps in Figure 1. When fine-tuning a PrLM on a batch of training samples, AD-DROP involves four steps. First, predictions are made through a forward computation without dropping any attention position. Second, we compute the attribution score of each attention position by gradient [25] or integrated gradient [26] attribution methods. Third, we sample a set of positions with high attribution scores and generate a mask matrix for each attention map. Finally, the mask matrices are applied to the next forward computation to make predictions for backpropagation. AD-DROP can be regarded as a strategic dropout regularizer that forces the model to make predictions by relying more on low-attribution positions to reduce overfitting. Nevertheless, excessive neglect of high-attribution positions would leave insufficient information for training. Hence, we further propose a cross-tuning strategy that performs fine-tuning and AD-DROP alternately to improve the training stability.
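As a rough, non-authoritative illustration of steps 2-4, the sketch below builds the mask matrix for a single attention map in PyTorch. It assumes that attribution is approximated by the element-wise product of the attention map and the gradient of the prediction logit with respect to that map, and that the candidate discard region is the top fraction of positions per row; the function name build_ad_drop_mask and the hyperparameters candidate_ratio and drop_prob are illustrative, not the authors' released implementation.

import torch

# Sketch: construct an AD-DROP mask for one attention head (illustrative only).
def build_ad_drop_mask(attn_map: torch.Tensor,   # (n, n) attention map from the first forward pass
                       attn_grad: torch.Tensor,  # (n, n) gradient of the prediction logit w.r.t. attn_map
                       candidate_ratio: float = 0.3,
                       drop_prob: float = 0.3) -> torch.Tensor:
    # Step 2: gradient-based attribution score of each attention position.
    attribution = attn_map * attn_grad
    # Step 3a: candidate discard region = highest-attribution positions in each row.
    k = max(1, int(candidate_ratio * attribution.size(-1)))
    top_idx = attribution.topk(k, dim=-1).indices
    candidate = torch.zeros_like(attribution).scatter_(-1, top_idx, 1.0)
    # Step 3b: randomly drop only a portion of the candidates, so some
    # high-attribution positions remain visible to the model.
    dropped = candidate * torch.bernoulli(torch.full_like(attribution, drop_prob))
    # Step 4 (input): -inf entries zero out the corresponding attention weights
    # when this mask is added before the softmax in the next forward pass.
    return torch.zeros_like(attribution).masked_fill(dropped.bool(), float("-inf"))

Under the cross-tuning strategy, such masks would be applied in some training epochs and skipped in others (i.e., plain fine-tuning), so that high-attribution positions are not neglected throughout training.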
To verify the effectiveness of AD-DROP, we conduct extensive experiments with different PrLMs (i.e., BERT [1], RoBERTa [2], ELECTRA [29], and OPUS-MT [30]) on various datasets (i.e., GLUE [31], CoNLL-2003 [32], WMT 2016 EN-RO and TR-EN [33], HANS [34], and PAWS-X [35]).
Experimental results show that models tuned with AD-DROP obtain remarkable improvements over those tuned with the original fine-tuning approach. For example, on the GLUE benchmark, BERT achieves an average improvement of 1.98/0.87 points on the dev/test sets, while RoBERTa achieves an average improvement of 1.29/0.62 points. Moreover, ablation studies and analysis demonstrate that gradient-based attribution [25, 26] is a more suitable saliency measure for implementing AD-DROP than directly using attention weights or simple random sampling, and that the cross-tuning strategy plays a crucial role in improving training stability.
To sum up, this work reveals that self-attention positions are not equally important for dropout when fine-tuning PrLMs. Arguably, low-attribution positions are more difficult to optimize than high-attribution positions, and dropping these positions tends not to relieve but to accelerate overfitting. This observation leads to a novel dropout regularizer, AD-DROP, driven by self-attention attribution. Although proposed for self-attention units, AD-DROP can potentially be extended to other units as a dropout regularizer.
2 Methodology
2.1 Preliminaries
Since Transformers [4] are the backbone of PrLMs, we first review the details of self-attention in Transformers and self-attention attribution [23]. Let $X \in \mathbb{R}^{n \times d}$ be the input of a Transformer block, where $n$ is the sequence length and $d$ is the embedding size. Self-attention in this block first maps $X$ into three matrices $Q_h$, $K_h$, and $V_h$ via linear projections as query, key, and value, respectively, for the $h$-th head. Then, the attention output of this head is calculated as:
\[
\mathrm{Attention}(Q_h, K_h, V_h) = A_h V_h = \mathrm{softmax}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{d_k}} + M_h\right) V_h, \tag{1}
\]
where $\sqrt{d_k}$ is a scaling factor, and $M_h$ is the mask matrix used to apply dropout in self-attention, whose elements are $-\infty$ if the corresponding positions in the attention map are masked and $0$ otherwise.
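For concreteness, the following single-head sketch shows where $M_h$ enters the computation of Eq. (1). It is a generic re-implementation under the notation above rather than the authors' code; the projection weights W_q, W_k, and W_v are assumed to be given.

import math
import torch

# Sketch: single-head self-attention with an additive mask M_h, following Eq. (1).
def masked_self_attention(X: torch.Tensor,     # (n, d) input of the Transformer block
                          W_q: torch.Tensor,   # (d, d_k) query projection
                          W_k: torch.Tensor,   # (d, d_k) key projection
                          W_v: torch.Tensor,   # (d, d_v) value projection
                          M_h: torch.Tensor):  # (n, n) mask: -inf = dropped, 0 = kept
    Q_h, K_h, V_h = X @ W_q, X @ W_k, X @ W_v
    scores = Q_h @ K_h.transpose(-2, -1) / math.sqrt(Q_h.size(-1)) + M_h
    A_h = torch.softmax(scores, dim=-1)        # dropped positions receive zero weight
    return A_h @ V_h, A_h                      # attention output and attention map A_h

In AD-DROP, the first forward pass uses $M_h = 0$; the resulting attention map $A_h$ and its gradient then provide the attribution scores used to fill $M_h$ for the second pass.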