Breaking BERT: Evaluating and Optimizing Sparsified Attention
Siddhartha Brahma, Polina Zablotskaia, David Mimno
Google Research
{sidbrahma,polinaz,mimno}@google.com
Equal contribution.
Abstract
Transformers allow attention between all pairs of tokens, but there is reason to believe that most of these connections—and their quadratic time and memory—may not be necessary. But which ones? We evaluate the impact of sparsification patterns with a series of ablation experiments. First, we compare masks based on syntax, lexical similarity, and token position to random connections, and measure which patterns reduce performance the least. We find that on three common fine-tuning tasks even using attention that is at least 78% sparse can have little effect on performance if applied at later transformer layers, but that applying sparsity throughout the network reduces performance significantly. Second, we vary the degree of sparsity for three patterns supported by previous work, and find that connections to neighbouring tokens are the most significant. Finally, we treat sparsity as an optimizable parameter, and present an algorithm to learn degrees of neighbouring connections that gives fine-grained control over the accuracy-sparsity trade-off while approaching the performance of existing methods.
1 Introduction
BERT (Devlin et al., 2019) models achieve high performance at the cost of increased computation and decreased interpretability. The combination of multiple learned self-attention heads has the capacity to carry a significant amount of information between all pairs of tokens, but it appears that many of these potential connections are either unused or unnecessary (Lu et al., 2021). For example, many attention heads empirically correspond to relatively simple patterns such as “attend to the following token” or “attend to the first token” (Clark et al., 2019), and many can be removed without affecting performance significantly (Michel et al., 2019). One state-of-the-art model for long sequences (Zaheer et al., 2020) uses a specific combination of three such patterns: globally connected tokens, neighbour connections, and random connections.

The success of these sparsity patterns raises several questions. While probing studies have provided insight into the behavior of individual heads, the aggregate effect of attention is less clear, even in models that induce sparsity (Meister et al., 2021). Which patterns are most important? Are there other patterns that could perform equally well or better? And do specific parameterizations of patterns affect performance? Is it possible to learn these sparse patterns while optimizing for sparsity and accuracy? We address these questions with three experiments.
First, rather than trying to explain what attention is doing, we attempt to rule out what attention is not doing by applying aggressive sparsity patterns while finetuning from a standard pre-trained model. We apply a series of fixed masks with fixed sparsity to each input text based on an instance-specific graph structure, including parse trees, semantic similarity networks, linear-chain neighbours, and random graphs. Our goal in such drastic model alterations is not to improve performance but to measure which interventions damage performance the least: if fine-tuning cannot “rewire” an existing model to accommodate a sparser network of interactions, the missing interactions must have been necessary in some way. We find that most patterns are similar, but linear-chain neighbour connections are the most robust over three commonly used fine-tuning tasks.
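As a concrete illustration of two of these fixed patterns, the sketch below builds a syntax (parse-tree) mask and a linear-chain neighbour mask as binary matrices. The helper names, the toy head indices, and the window size are hypothetical choices for illustration, not the paper's actual configuration.

```python
import numpy as np

def parse_tree_mask(heads):
    """Syntax pattern: each token may attend to itself, its dependency head,
    and its children. `heads` holds the head index of every token (-1 for the
    root), as produced by an external dependency parser."""
    n = len(heads)
    mask = np.eye(n, dtype=np.float32)
    for i, h in enumerate(heads):
        if h >= 0:
            mask[i, h] = 1.0  # child attends to head
            mask[h, i] = 1.0  # head attends to child
    return mask

def neighbour_mask(n, window=2):
    """Linear-chain neighbour pattern: each token may attend to tokens within
    `window` positions of itself (a band-diagonal mask)."""
    idx = np.arange(n)
    return (np.abs(idx[:, None] - idx[None, :]) <= window).astype(np.float32)

# Toy example: "the cat sat on the mat" with head indices from a hypothetical parse.
heads = [1, 2, -1, 2, 5, 3]
for name, m in [("parse tree", parse_tree_mask(heads)),
                ("neighbour", neighbour_mask(len(heads)))]:
    print(f"{name}: sparsity = {1.0 - m.mean():.2f}")  # fraction of zero entries
```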
In addition, we find that sparse attention masks can be applied to the final layers of a BERT model with little impact on performance, indicating that these layers can be sparsified or even eliminated. In contrast, applying the same masks to earlier layers has a more significant effect on performance, indicating that dense connections are more important at the early stages of encoding.
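A minimal sketch of how such a layer-wise schedule might look: dense masks for the early layers and a band-diagonal neighbour mask from some cutoff layer onward. The cutoff, window size, and function name are assumptions for illustration only.

```python
import numpy as np

def layerwise_masks(n, num_layers=12, sparse_from=8, window=2):
    """Per-layer mask schedule: keep full (dense) attention in the early
    layers and switch to a band-diagonal neighbour mask from layer
    `sparse_from` onward."""
    idx = np.arange(n)
    neigh = (np.abs(idx[:, None] - idx[None, :]) <= window).astype(np.float32)
    dense = np.ones((n, n), dtype=np.float32)
    return [dense if layer < sparse_from else neigh for layer in range(num_layers)]

masks = layerwise_masks(n=128)
print([round(1.0 - float(m.mean()), 2) for m in masks])  # per-layer sparsity
```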
Second, we investigate the effect of varying the degree of sparsity of individual patterns when they are used in combination, by exploring the space of combinations of global, random, and neighbour connections used in BigBird (Zaheer et al., 2020). Again, we find that eliminating neighbour connections has the most drastic effect among these patterns.
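For intuition, the following sketch composes the three BigBird-style patterns into a single binary mask with independently adjustable degrees. The default degrees and the function name are illustrative and do not reproduce BigBird's block-sparse implementation.

```python
import numpy as np

def combined_mask(n, n_global=2, window=3, n_random=3, seed=0):
    """Union of the three patterns: a few globally connected tokens, a sliding
    window of neighbour connections, and per-row random connections. Each
    degree can be varied independently."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=np.float32)
    # Global tokens attend to everything and are attended to by everything.
    mask[:n_global, :] = 1.0
    mask[:, :n_global] = 1.0
    # Neighbour (sliding-window) connections.
    idx = np.arange(n)
    mask[np.abs(idx[:, None] - idx[None, :]) <= window] = 1.0
    # Random connections for each query position.
    for i in range(n):
        mask[i, rng.choice(n, size=n_random, replace=False)] = 1.0
    return mask

m = combined_mask(128)
print("sparsity:", round(1.0 - float(m.mean()), 2))  # fraction of masked-out pairs
```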
Finally, having established the importance of neighbour connections, we present a method inspired by LDPC codes from coding theory that learns the degree of connectivity in the neighbour connections while optimizing for sparsity. We show that our method gives more fine-grained control over the trade-off between accuracy and sparsity than methods using fixed sparsity patterns like BigBird. It also achieves higher accuracies than BigBird in the high-sparsity regime. To summarize, our contributions are:
- We investigate the attention mechanism in BERT by applying aggressive sparsity patterns. The Neighbour and Random patterns are least damaging to the model, indicating that syntactic or semantic meaning is not the only information the model learns.
- We further explore the best patterns with an ablation study centered on the BigBird model. We try various combinations of sparsity patterns and node degrees, again confirming the importance of the Neighbour pattern.
- We propose a method to learn the degrees of sparse Neighbour patterns that allows for more fine-grained control over models with desired accuracy and sparsity requirements.
- Our results provide guidance to developers of transformer-based models to anticipate the effect of any desired degree of sparsity.
2 Related Work
There has been considerable work on sparsifying attention. Attention heads are known to “specialize” (Clark et al., 2019), and removing heads that were initially available can have surprisingly little effect relative to using fewer heads from the start (Voita et al., 2019; Michel et al., 2019). Attention can also be reformulated as conditional expectations filtered through sparse networks of tokens (Ren et al., 2021), resulting in reductions in time and memory. Other approaches include the Star Transformer (Guo et al., 2019), Sparse Transformer (Child et al., 2019), Log-Sparse Transformer (Li et al., 2019) and the Block Sparse Transformer (Qiu et al., 2019). All these methods use specific predefined sparsity patterns to mask the full attention matrix.
In searching for effective sparse patterns, we perform ablation experiments in which we remove all connections except those that correspond to a specific pattern. We compare networks based on syntactic structure (Tenney et al., 2019; Hewitt and Manning, 2019; Coenen et al., 2019; Zanzotto et al., 2020; Zhang et al., 2016) as well as semantic similarity (Jawahar et al., 2019; Kelly et al., 2020) and positional neighbours (Clark et al., 2019; Zaheer et al., 2020).
There is a strong connection between graphs and attention. Graph neural networks (GNNs) have become increasingly popular, and have been used as an alternative to transformers for language modeling (Bastings et al., 2017; Marcheggiani and Titov, 2017). It has been noted that attention weights in transformer models can be thought of as a “soft” graphical structure where all token nodes are fully connected, but the edge weights vary based on learned patterns. In both the transformer and GNN context, performance and scalability depend critically on finding a simple attention matrix, equivalent to applying a sparse graph between tokens. But because there are combinatorially many graph topologies, searching over all possible graphs to find the optimal sparsity structure would be prohibitively difficult.
In contrast to predefined sparsity patterns, a second line of work tries to derive adaptive sparsity patterns based on data. Adaptive Sparse Transformer (Correia et al., 2019) uses Entmax instead of Softmax; Routing Transformer (Roy et al., 2021) uses non-negative matrix factorization on the attention matrix. Another recent paper, Reformer (Kitaev et al., 2020), optimizes attention by using locality-sensitive hashing (LSH) to only compute a few dominant dot products rather than materializing the whole attention matrix. These approaches usually learn sparser structures more adapted to the characteristics of a particular task. They also allow for a higher level of sparsity control, responsive to a learning objective. Using an approach like this can help us better understand how the model learns, especially in combination with a static pattern. Our third experiment on learning patterns while optimizing sparsity is most directly comparable to Adaptive Attention Span (Sukhbaatar et al., 2019), but we learn global sparsity patterns specific to the task rather than being specific to each instance.
3 Transformer sparsity and LDPC codes
Our objective is to find well-performing sparse connectivity patterns between tokens. The quadratic scaling of attention produces blowup in time and memory that can be burdensome even for short sequences. We know that much of this work is not needed, but which parts? The best pattern would be simple, requiring little to no information about the specific text sequence; as sparse as possible; but maintaining as much of the important information flow as possible.
As a Transformer-based (Vaswani et al., 2017) model, BERT uses the self-attention mechanism (Cheng et al., 2016; Parikh et al., 2016; Paulus et al., 2017) as a key component to transform a sequence of representations $X = [x_1, x_2, \ldots, x_n]$ of length $n$. If $d$ is the dimension of each of the representations, the following function computes a new set of representations:

$$\mathrm{Attention}(Q, K, V) = \sigma\!\left(\frac{QK^T}{\sqrt{d}}\right) V \qquad (1)$$

Here each of the matrices $Q, K, V$ are linear transformations (each of dimension $n \times d$) of the matrix of representations $X$, respectively called the query, key and value matrices. The time complexity of computing the attention matrix $QK^T$ is $O(n^2 d)$. If the attention matrix is sparsified by the Hadamard product between a binary sparsification mask $M$ (of dimension $n \times n$, specifying the sparse pattern) and $QK^T$ (Fig. 1), then the complexity of the attention mechanism reduces to $O((1 - \varsigma(M))\, n^2 d)$, where $\varsigma(M)$ is the sparsity of $M$, i.e., the fraction of entries in $M$ equal to 0.
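To make the masking operation concrete, here is a minimal NumPy sketch of Eq. (1) with a binary sparsification mask applied to the score matrix. It only illustrates the semantics: the full $n \times n$ score matrix is still materialized, masked positions are suppressed with a large negative value before the softmax (the usual way to realize such a mask in practice), and all names and shapes are illustrative rather than the implementation used in the paper.

```python
import numpy as np

def masked_attention(X, Wq, Wk, Wv, M):
    """Eq. (1) with a binary sparsification mask M: disallowed pairs get a
    large negative score so they receive (effectively) zero attention weight
    after the softmax. The dense n x n score matrix is still computed here,
    so this sketch shows the semantics, not the O((1 - ς(M)) n^2 d) cost."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d)
    scores = np.where(M > 0, scores, -1e9)             # Hadamard-style masking
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def sparsity(M):
    """The paper's sparsity measure ς(M): the fraction of entries of M equal to 0."""
    return float((M == 0).mean())

n, d = 8, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
Wq, Wk, Wv = [rng.normal(size=(d, d)) for _ in range(3)]
M = (np.abs(np.subtract.outer(np.arange(n), np.arange(n))) <= 1).astype(float)
print(masked_attention(X, Wq, Wk, Wv, M).shape, "sparsity:", sparsity(M))
```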
Figure 1: Hadamard product between a sparsity pattern $M$ and the full attention matrix $QK^T$.

One can think of $QK^T$ as a matrix encoding information about the interaction of all the tokens in a sequence as defined by the $n^2$ pairwise dot-products. The goal of discovering sparse attention masks $M$ then is to find more compact encodings of this interaction using fewer pairwise terms. But the search space for $M$ is exponentially large; since each position is a binary variable it is $O(2^{n^2})$. To make the problem more tractable, we take inspiration from coding theory and in particular from LDPC (low density parity check) codes (Richardson and Urbanke, 2008). For LDPC codes, the task is to discover sparse $M$ that can be used as parity check matrices for encoding data. Using theoretical results that show that, given a specific degree distribution, random matrices have mostly equivalent performance, the search then reduces to finding sparse degree distributions and then sampling random matrices from an optimal degree distribution. We apply the same reasoning to search for good sparse $M$ (random or otherwise) to act as “encoders” of the attention information in transformers. Note that, in contrast to LDPC codes where each of the $n$ positions is equivalent, positional information in a sequence of tokens of natural language (and hence their transformer representations) is important. This observation leads to a series of research questions:
- Are sparse random graphs optimal for language, which exhibits sequential and hierarchical structure? We address this issue by comparing a random graph to a selection of structured sparsity patterns, based on syntax, lexical similarity, and positional proximity.
- Can we measure differences between specific sparsity patterns, and are they sensitive to degree? We select three patterns used in the BigBird model, enumerate combinations of node-degrees and analyze their performance.
- Within the best set of patterns, is the optimal pattern learnable? Similar to LDPC codes, we formulate a problem of learning degree distributions on these patterns and fine-tune models with an objective that optimizes for sparsity (which is a function of the degree distribution) as well as accuracy; a minimal sketch of this degree-distribution view follows this list.
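As a rough illustration of the degree-distribution view (not the paper's exact algorithm), the sketch below parameterizes a distribution over candidate neighbour degrees with learnable logits, computes the expected sparsity of the induced mask, and shows where a sparsity term would enter the training objective. The candidate degrees, penalty form, and variable names are all assumptions.

```python
import numpy as np

# Candidate neighbour degrees (window half-widths) and learnable logits over them.
degrees = np.array([0, 1, 2, 4, 8])                 # hypothetical support
logits = np.zeros(len(degrees))                     # learned jointly with the task

def degree_distribution(logits):
    """Softmax over candidate degrees: the learned degree distribution."""
    z = np.exp(logits - logits.max())
    return z / z.sum()

def expected_sparsity(p, degrees, n):
    """Expected fraction of zero mask entries when each row's neighbour
    half-width is drawn from the degree distribution p (edge effects ignored)."""
    expected_nonzero_per_row = float(np.sum(p * (2 * degrees + 1)))
    return 1.0 - expected_nonzero_per_row / n

p = degree_distribution(logits)
n, lam = 512, 0.1                                   # lam weights the sparsity term
density_penalty = 1.0 - expected_sparsity(p, degrees, n)
# total_loss = task_loss + lam * density_penalty    # optimize accuracy and sparsity
print("E[sparsity] =", round(expected_sparsity(p, degrees, n), 3))
```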
4 Data and Sparsity patterns
Coding theory implies that random graphs have good performance for coding, but does this re-