at the early stages of encoding.
Second, we investigate the effect of varying the degree of sparsity of individual patterns when they are used in combination, by exploring the space of combinations of global, random, and neighbour connections used in BigBird (Zaheer et al., 2020). Again, we find that eliminating neighbour connections has the most drastic effect among these patterns.
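To make these patterns concrete, the sketch below builds a combined attention mask in the spirit of BigBird's three components; the helper name, default parameters, and dense boolean construction are ours for illustration only (BigBird's released implementation uses block-sparse kernels rather than a dense mask).

```python
import torch

def bigbird_style_mask(seq_len, num_global=2, window=3, num_random=2, seed=0):
    """Boolean mask combining the three BigBird components: global tokens,
    a sliding neighbour window, and random connections. Illustrative only."""
    gen = torch.Generator().manual_seed(seed)
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)

    # Global: the first `num_global` tokens attend to, and are attended by,
    # every other token.
    mask[:num_global, :] = True
    mask[:, :num_global] = True

    # Neighbour: each token attends to `window` tokens on either side of itself.
    idx = torch.arange(seq_len)
    mask |= (idx[:, None] - idx[None, :]).abs() <= window

    # Random: each query row keeps `num_random` extra randomly chosen keys.
    for i in range(seq_len):
        cols = torch.randperm(seq_len, generator=gen)[:num_random]
        mask[i, cols] = True

    return mask

print(bigbird_style_mask(8).int())  # 1 = connection kept, 0 = pruned
```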
Finally, having established the importance of neighbour connections, we present a method inspired by LDPC codes from coding theory that learns the degree of connectivity in the neighbour connections while optimizing for sparsity. We show that our method gives more fine-grained control over the trade-off between accuracy and sparsity than methods using fixed sparsity patterns like BigBird. It also achieves higher accuracy than BigBird in the high-sparsity regime. To summarize, our contributions are:
• We investigate the attention mechanism in BERT by applying aggressive sparsity patterns. The Neighbour and Random patterns are least damaging to the model, indicating that syntactic or semantic meaning is not the only information the model learns.
• We further explore the best patterns with an ablation study centered on the BigBird model. We try various combinations of sparsity patterns and node degrees, again confirming the importance of the Neighbour pattern.
• We propose a method to learn the degrees of sparse Neighbour patterns, allowing finer-grained control over models that must meet given accuracy and sparsity requirements (a schematic sketch follows this list). Our results help developers of transformer-based models anticipate the effect of any desired degree of sparsity.
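As a rough illustration of the idea only (this is a hypothetical sketch, not our LDPC-derived formulation; the module name, the per-offset sigmoid gates, and the L1-style penalty are assumptions made for exposition), the neighbour degree can be exposed as a learnable quantity:

```python
import torch
import torch.nn as nn

class LearnedNeighbourDegree(nn.Module):
    """Hypothetical sketch: one learnable gate per neighbour offset, with an
    L1-style penalty that pushes unneeded offsets towards zero, so the
    effective neighbour degree is learned rather than fixed in advance."""

    def __init__(self, max_offset=8):
        super().__init__()
        # One logit per relative offset 1..max_offset (offset 0, the token
        # itself, is always kept).
        self.offset_logits = nn.Parameter(torch.zeros(max_offset))

    def gate_matrix(self, seq_len):
        gates = torch.sigmoid(self.offset_logits)          # each gate in (0, 1)
        # Lookup table indexed by |i - j|: offset 0 kept, offsets 1..max gated,
        # anything farther away pruned.
        table = torch.cat([torch.ones(1), gates, torch.zeros(1)])
        idx = torch.arange(seq_len)
        dist = (idx[:, None] - idx[None, :]).abs().clamp(max=len(gates) + 1)
        return table[dist]

    def sparsity_penalty(self):
        return torch.sigmoid(self.offset_logits).sum()

# Usage: multiply the attention scores elementwise by gate_matrix(seq_len)
# and add lambda * sparsity_penalty() to the task loss.
```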
2 Related Work
There has been considerable work on sparsifying attention. Attention heads are known to “specialize” (Clark et al., 2019), and pruning heads from a trained model can have surprisingly little effect compared with training with fewer heads from the start (Voita et al., 2019; Michel et al., 2019). Attention can also be reformulated as conditional expectations filtered through sparse networks of tokens (Ren et al., 2021), yielding reductions in both time and memory. Other approaches include the Star Transformer (Guo et al., 2019), Sparse Transformer (Child et al., 2019), Log-Sparse Transformer (Li et al., 2019), and the Block Sparse Transformer (Qiu et al., 2019). All of these methods use specific predefined sparsity patterns to mask the full attention matrix.
In searching for effective sparse patterns, we perform ablation experiments in which we remove all connections except those that correspond to a specific pattern. We compare networks based on syntactic structure (Tenney et al., 2019; Hewitt and Manning, 2019; Coenen et al., 2019; Zanzotto et al., 2020; Zhang et al., 2016) as well as semantic similarity (Jawahar et al., 2019; Kelly et al., 2020) and positional neighbours (Clark et al., 2019; Zaheer et al., 2020).
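Concretely, each ablation keeps only the connections allowed by one pattern and blocks everything else before the softmax; the function below is a minimal sketch of this masking step (the function name and shapes are ours):

```python
import torch

def ablate_attention(scores, pattern_mask):
    """Keep only the connections allowed by `pattern_mask` (True = keep);
    all other attention scores are pushed to -inf before the softmax."""
    neg_inf = torch.finfo(scores.dtype).min
    return torch.softmax(scores.masked_fill(~pattern_mask, neg_inf), dim=-1)

# Example: a purely positional Neighbour pattern of width 1.
seq_len = 6
idx = torch.arange(seq_len)
neighbour = (idx[:, None] - idx[None, :]).abs() <= 1
probs = ablate_attention(torch.randn(seq_len, seq_len), neighbour)
```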
There is a strong connection between graphs and attention. Graph neural networks (GNNs) have become increasingly popular and have been used as an alternative to transformers for language modeling (Bastings et al., 2017; Marcheggiani and Titov, 2017). It has been noted that the attention weights in a transformer can be thought of as a “soft” graphical structure in which all token nodes are fully connected but the edge weights vary according to learned patterns. In both the transformer and GNN settings, performance and scalability depend critically on finding a simpler attention matrix, which is equivalent to applying a sparse graph over the tokens. But because there are combinatorially many graph topologies, searching over all possible graphs to find the optimal sparsity structure would be prohibitively difficult.
In contrast to predefined sparsity patterns, a second line of work derives adaptive sparsity patterns from the data. The Adaptive Sparse Transformer (Correia et al., 2019) uses Entmax instead of Softmax; the Routing Transformer (Roy et al., 2021) uses non-negative matrix factorization on the attention matrix. Another recent paper, Reformer (Kitaev et al., 2020), optimizes attention by using locality-sensitive hashing (LSH) to compute only a few dominant dot products rather than materializing the whole attention matrix. These approaches usually learn sparser structures that are better adapted to the characteristics of a particular task. They also allow for a higher level of sparsity control, responsive to a learning objective. Using an approach like this can help us better understand how the model learns, especially in combination with a static pat-