Breaking BERT: Evaluating and Optimizing Sparsified Attention
Siddhartha Brahma, Polina Zablotskaia, David Mimno
Google Research
{sidbrahma,polinaz,mimno}@google.com
Equal contribution.
Abstract
Transformers allow attention between all pairs of tokens, but there is reason to believe that most of these connections—and their quadratic time and memory—may not be necessary. But which ones? We evaluate the impact of sparsification patterns with a series of ablation experiments. First, we compare masks based on syntax, lexical similarity, and token position to random connections, and measure which patterns reduce performance the least. We find that on three common fine-tuning tasks even using attention that is at least 78% sparse can have little effect on performance if applied at later transformer layers, but that applying sparsity throughout the network reduces performance significantly. Second, we vary the degree of sparsity for three patterns supported by previous work, and find that connections to neighbouring tokens are the most significant. Finally, we treat sparsity as an optimizable parameter, and present an algorithm to learn degrees of neighbouring connections that gives fine-grained control over the accuracy-sparsity trade-off while approaching the performance of existing methods.
1 Introduction
BERT (Devlin et al., 2019) models achieve high performance at the cost of increased computation and decreased interpretability. The combination of multiple learned self-attention heads has the capacity to carry a significant amount of information between all pairs of tokens, but it appears that many of these potential connections are either unused or unnecessary (Lu et al., 2021). For example, many attention heads empirically correspond to relatively simple patterns such as “attend to the following token” or “attend to the first token” (Clark et al., 2019), and many can be removed without affecting performance significantly (Michel et al., 2019). One state-of-the-art model for long sequences (Zaheer et al., 2020) uses a specific combination of three such patterns: globally connected tokens, neighbour connections, and random connections.

The success of these sparsity patterns raises several questions. While probing studies have provided insight into the behavior of individual heads, the aggregate effect of attention is less clear, even in models that induce sparsity (Meister et al., 2021). Which patterns are most important? Are there other patterns that could perform equally well or better? And do specific parameterizations of patterns affect performance? Is it possible to learn these sparse patterns while optimizing for sparsity and accuracy? We address these questions with three experiments.
First, rather than trying to explain what attention is doing, we attempt to rule out what attention is not doing by applying aggressive sparsity patterns while finetuning from a standard pre-trained model. We apply a series of fixed masks with fixed sparsity to each input text based on an instance-specific graph structure, including parse trees, semantic similarity networks, linear-chain neighbours, and random graphs. Our goal in such drastic model alterations is not to improve performance but to measure which interventions damage performance the least: if fine-tuning cannot “rewire” an existing model to accommodate a sparser network of interactions, the missing interactions must have been necessary in some way. We find that most patterns are similar, but linear-chain neighbour connections are the most robust over three commonly used fine-tuning tasks.
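As a concrete illustration of two of these fixed patterns, the sketch below builds a syntax (parse-tree) mask and a linear-chain neighbour mask as binary matrices. The helper names, the toy head indices, and the window size are hypothetical choices for illustration, not the paper's actual configuration.

```python
import numpy as np

def parse_tree_mask(heads):
    """Syntax pattern: each token may attend to itself, its dependency head,
    and its children. `heads` holds the head index of every token (-1 for the
    root), as produced by an external dependency parser."""
    n = len(heads)
    mask = np.eye(n, dtype=np.float32)
    for i, h in enumerate(heads):
        if h >= 0:
            mask[i, h] = 1.0  # child attends to head
            mask[h, i] = 1.0  # head attends to child
    return mask

def neighbour_mask(n, window=2):
    """Linear-chain neighbour pattern: each token may attend to tokens within
    `window` positions of itself (a band-diagonal mask)."""
    idx = np.arange(n)
    return (np.abs(idx[:, None] - idx[None, :]) <= window).astype(np.float32)

# Toy example: "the cat sat on the mat" with head indices from a hypothetical parse.
heads = [1, 2, -1, 2, 5, 3]
for name, m in [("parse tree", parse_tree_mask(heads)),
                ("neighbour", neighbour_mask(len(heads)))]:
    print(f"{name}: sparsity = {1.0 - m.mean():.2f}")  # fraction of zero entries
```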
In addition, we find that sparse attention masks can be applied to the final layers of a BERT model with little impact on performance, indicating that these layers can be sparsified or even eliminated. In contrast, applying the same masks to earlier layers has a more significant effect on performance, indicating that dense connections are more important at the early stages of encoding.
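A minimal sketch of how such a layer-wise schedule might look: dense masks for the early layers and a band-diagonal neighbour mask from some cutoff layer onward. The cutoff, window size, and function name are assumptions for illustration only.

```python
import numpy as np

def layerwise_masks(n, num_layers=12, sparse_from=8, window=2):
    """Per-layer mask schedule: keep full (dense) attention in the early
    layers and switch to a band-diagonal neighbour mask from layer
    `sparse_from` onward."""
    idx = np.arange(n)
    neigh = (np.abs(idx[:, None] - idx[None, :]) <= window).astype(np.float32)
    dense = np.ones((n, n), dtype=np.float32)
    return [dense if layer < sparse_from else neigh for layer in range(num_layers)]

masks = layerwise_masks(n=128)
print([round(1.0 - float(m.mean()), 2) for m in masks])  # per-layer sparsity
```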
Second, we investigate the effect of varying the degree of sparsity of individual patterns when they are used in combination, by exploring the space of combinations of global, random, and neighbour connections used in BigBird (Zaheer et al., 2020). Again, we find that eliminating neighbour connections has the most drastic effect among these patterns.
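For intuition, the following sketch composes the three BigBird-style patterns into a single binary mask with independently adjustable degrees. The default degrees and the function name are illustrative and do not reproduce BigBird's block-sparse implementation.

```python
import numpy as np

def combined_mask(n, n_global=2, window=3, n_random=3, seed=0):
    """Union of the three patterns: a few globally connected tokens, a sliding
    window of neighbour connections, and per-row random connections. Each
    degree can be varied independently."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=np.float32)
    # Global tokens attend to everything and are attended to by everything.
    mask[:n_global, :] = 1.0
    mask[:, :n_global] = 1.0
    # Neighbour (sliding-window) connections.
    idx = np.arange(n)
    mask[np.abs(idx[:, None] - idx[None, :]) <= window] = 1.0
    # Random connections for each query position.
    for i in range(n):
        mask[i, rng.choice(n, size=n_random, replace=False)] = 1.0
    return mask

m = combined_mask(128)
print("sparsity:", round(1.0 - float(m.mean()), 2))  # fraction of masked-out pairs
```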
Finally, having established the importance of neighbour connections, we present a method inspired by LDPC codes from coding theory that learns the degree of connectivity in the neighbour connections while optimizing for sparsity. We show that our method gives more fine-grained control over the trade-off between accuracy and sparsity than methods using fixed sparsity patterns like BigBird. It also achieves higher accuracies than BigBird in the high-sparsity regime. To summarize, our contributions are:
- We investigate the attention mechanism in BERT by applying aggressive sparsity patterns. The Neighbour and Random patterns are least damaging to the model, indicating that syntactic or semantic meaning is not the only information the model learns.
- We further explore the best patterns with an ablation study centered on the BigBird model. We try various combinations of sparsity patterns and node degrees, again confirming the importance of the Neighbour pattern.
- We propose a method to learn the degrees of sparse Neighbour patterns that allows for more fine-grained control over models with desired accuracy and sparsity requirements.
- Our results provide guidance to developers of transformer-based models to anticipate the effect of any desired degree of sparsity.
2 Related Work
There has been considerable work on sparsifying attention. Attention heads are known to “specialize” (Clark et al., 2019), and removing heads that were initially available can have surprisingly little effect relative to using fewer heads from the start (Voita et al., 2019; Michel et al., 2019). Attention can also be reformulated as conditional expectations filtered through sparse networks of tokens (Ren et al., 2021), resulting in reductions in time and memory. Other approaches include the Star Transformer (Guo et al., 2019), Sparse Transformer (Child et al., 2019), Log-Sparse Transformer (Li et al., 2019) and the Block Sparse Transformer (Qiu et al., 2019). All these methods use specific predefined sparsity patterns to mask the full attention matrix.
In searching for effective sparse patterns, we perform ablation experiments in which we remove all connections except those that correspond to a specific pattern. We compare networks based on syntactic structure (Tenney et al., 2019; Hewitt and Manning, 2019; Coenen et al., 2019; Zanzotto et al., 2020; Zhang et al., 2016) as well as semantic similarity (Jawahar et al., 2019; Kelly et al., 2020) and positional neighbours (Clark et al., 2019; Zaheer et al., 2020).
There is a strong connection between graphs and attention. Graph neural networks (GNNs) have become increasingly popular, and have been used as an alternative to transformers for language modeling (Bastings et al., 2017; Marcheggiani and Titov, 2017). It has been noted that attention weights in transformer models can be thought of as a “soft” graphical structure where all token nodes are fully connected, but the edge weights vary based on learned patterns. In both the transformer and GNN context, performance and scalability depend critically on finding a simple attention matrix, equivalent to applying a sparse graph between tokens. But because there are combinatorially many graph topologies, searching over all possible graphs to find the optimal sparsity structure would be prohibitively difficult.
In contrast to predefined sparsity patterns, a second line of work tries to derive adaptive sparsity patterns based on data. Adaptive Sparse Transformer (Correia et al., 2019) uses Entmax instead of Softmax; Routing Transformer (Roy et al., 2021) uses non-negative matrix factorization on the attention matrix. Another recent paper, Reformer (Kitaev et al., 2020), optimizes attention by using locality-sensitive hashing (LSH) to only compute a few dominant dot products rather than materializing the whole attention matrix. These approaches usually learn sparser structures more adapted to the characteristics of a particular task. They also allow for a higher level of sparsity control, responsive to a learning objective. Using an approach like this can help us better understand how the model learns, especially in combination with a static pattern. Our third experiment on learning patterns while optimizing sparsity is most directly comparable to Adaptive Attention Span (Sukhbaatar et al., 2019), but we learn global sparsity patterns specific to the task rather than being specific to each instance.
3 Transformer sparsity and LDPC codes
Our objective is to find well-performing sparse connectivity patterns between tokens. The quadratic scaling of attention produces blowup in time and memory that can be burdensome even for short sequences. We know that much of this work is not needed, but which parts? The best pattern would be simple, requiring little to no information about the specific text sequence; as sparse as possible; but maintaining as much of the important information flow as possible.
As a Transformer-based (Vaswani et al., 2017) model, BERT uses the self-attention mechanism (Cheng et al., 2016; Parikh et al., 2016; Paulus et al., 2017) as a key component to transform a sequence of representations $X = [x_1, x_2, \ldots, x_n]$ of length $n$. If $d$ is the dimension of each of the representations, the following function computes a new set of representations:

$$\mathrm{Attention}(Q, K, V) = \sigma\!\left(\frac{QK^T}{\sqrt{d}}\right) V \qquad (1)$$

Here each of the matrices $Q, K, V$ are linear transformations (each of dimension $n \times d$) of the matrix of representations $X$, respectively called the query, key and value matrices. The time complexity of computing the attention matrix $QK^T$ is $O(n^2 d)$. If the attention matrix is sparsified by the Hadamard product between a binary sparsification mask $M$ (of dimension $n \times n$, specifying the sparse pattern) and $QK^T$ (Fig. 1), then the complexity of the attention mechanism reduces to $O((1 - \varsigma(M))\, n^2 d)$, where $\varsigma(M)$ is the sparsity of $M$, i.e., the fraction of entries in $M$ equal to 0.
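To make the masking operation concrete, here is a minimal NumPy sketch of Eq. (1) with a binary sparsification mask applied to the score matrix. It only illustrates the semantics: the full $n \times n$ score matrix is still materialized, masked positions are suppressed with a large negative value before the softmax (the usual way to realize such a mask in practice), and all names and shapes are illustrative rather than the implementation used in the paper.

```python
import numpy as np

def masked_attention(X, Wq, Wk, Wv, M):
    """Eq. (1) with a binary sparsification mask M: disallowed pairs get a
    large negative score so they receive (effectively) zero attention weight
    after the softmax. The dense n x n score matrix is still computed here,
    so this sketch shows the semantics, not the O((1 - ς(M)) n^2 d) cost."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d)
    scores = np.where(M > 0, scores, -1e9)             # Hadamard-style masking
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def sparsity(M):
    """The paper's sparsity measure ς(M): the fraction of entries of M equal to 0."""
    return float((M == 0).mean())

n, d = 8, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
Wq, Wk, Wv = [rng.normal(size=(d, d)) for _ in range(3)]
M = (np.abs(np.subtract.outer(np.arange(n), np.arange(n))) <= 1).astype(float)
print(masked_attention(X, Wq, Wk, Wv, M).shape, "sparsity:", sparsity(M))
```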
Figure 1: Hadamard product between a sparsity pattern $M$ and the full attention matrix $QK^T$.

One can think of $QK^T$ as a matrix encoding information about the interaction of all the tokens in a sequence as defined by the $n^2$ pairwise dot-products. The goal of discovering sparse attention masks $M$ then is to find more compact encodings of this interaction using fewer pairwise terms. But the search space for $M$ is exponentially large; since each position is a binary variable it is $O(2^{n^2})$. To make the problem more tractable, we take inspiration from coding theory and in particular from LDPC (low density parity check) codes (Richardson and Urbanke, 2008). For LDPC codes, the task is to discover sparse $M$ that can be used as parity check matrices for encoding data. Using theoretical results that show that, given a specific degree distribution, random matrices have mostly equivalent performance, the search then reduces to finding sparse degree distributions and then sampling random matrices from an optimal degree distribution. We apply the same reasoning to search for good sparse $M$ (random or otherwise) to act as “encoders” of the attention information in transformers. Note that, in contrast to LDPC codes where each of the $n$ positions is equivalent, positional information in a sequence of tokens of natural language (and hence their transformer representations) is important. This observation leads to a series of research questions:
- Are sparse random graphs optimal for language, which exhibits sequential and hierarchical structure? We address this issue by comparing a random graph to a selection of structured sparsity patterns, based on syntax, lexical similarity, and positional proximity.
- Can we measure differences between specific sparsity patterns, and are they sensitive to degree? We select three patterns used in the BigBird model, enumerate combinations of node-degrees and analyze their performance.
- Within the best set of patterns, is the optimal pattern learnable? Similar to LDPC codes, we formulate a problem of learning degree distributions on these patterns and fine-tune models with an objective that optimizes for sparsity (which is a function of the degree distribution) as well as accuracy; a minimal sketch of this degree-distribution view follows this list.
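As a rough illustration of the degree-distribution view (not the paper's exact algorithm), the sketch below parameterizes a distribution over candidate neighbour degrees with learnable logits, computes the expected sparsity of the induced mask, and shows where a sparsity term would enter the training objective. The candidate degrees, penalty form, and variable names are all assumptions.

```python
import numpy as np

# Candidate neighbour degrees (window half-widths) and learnable logits over them.
degrees = np.array([0, 1, 2, 4, 8])                 # hypothetical support
logits = np.zeros(len(degrees))                     # learned jointly with the task

def degree_distribution(logits):
    """Softmax over candidate degrees: the learned degree distribution."""
    z = np.exp(logits - logits.max())
    return z / z.sum()

def expected_sparsity(p, degrees, n):
    """Expected fraction of zero mask entries when each row's neighbour
    half-width is drawn from the degree distribution p (edge effects ignored)."""
    expected_nonzero_per_row = float(np.sum(p * (2 * degrees + 1)))
    return 1.0 - expected_nonzero_per_row / n

p = degree_distribution(logits)
n, lam = 512, 0.1                                   # lam weights the sparsity term
density_penalty = 1.0 - expected_sparsity(p, degrees, n)
# total_loss = task_loss + lam * density_penalty    # optimize accuracy and sparsity
print("E[sparsity] =", round(expected_sparsity(p, degrees, n), 3))
```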
4 Data and Sparsity patterns
Coding theory implies that random graphs have good performance for coding, but does this re-