
Four main approaches can be distinguished in the literature: (i) recurrent models such as Transformer-XL [7] and Compressive Transformers [8], which maintain a memory of past activations at each layer to preserve long-range contextual information; (ii) models based on factorization or kernels aiming at compressing attention score matrices, such as Linformer [9] or Performer [10]; (iii) models based on clustering, such as Reformer [11], which dynamically define eligible attention patterns (i.e. where attention may be computed); and (iv) models based on fixed or adaptive attention patterns, e.g. Longformer [6] or Big Bird [12].
Recurrent approaches iteratively process the sequence while maintaining a memory to enable long-range dependencies. They generally suffer from limitations induced by specific, slow, and difficult-to-implement forward and backpropagation procedures. Alternatively, one of the main lines of study for reducing the complexity of Attention is thus to enforce sparsity by limiting the number of elements on which new representations will be based, i.e. reducing the number of elements with non-null attention scores. This approach is motivated by the observation of global or data-dependent positional patterns of non-null attention scores depending on the task [13].
The sparsity of attention scores in the traditional Attention mechanism is indeed documented in the literature. It has for instance been shown that, in practice, full attention tends to overweight close elements on average, in particular for MLM, machine translation, and seq-to-seq tasks in general [14]. Moreover, according to analyses of multi-head full attention on specific tasks, e.g. machine translation, numerous heads learn similar simple patterns [15]. Such redundant patterns may be hardcoded by implementing fixed positional patterns, possibly in a task-dependent manner.
Two main approaches are discussed in the literature for implementing sparsity, fixed or adaptive patterns, based on whether attention scores are computed considering (1) predefined fixed elements based on their location in the sequence, or (2) elements selected by a given procedure. As an example, [16] have shown that fixed O(n) convolutions can perform competitively on machine translation. Longformer proposes an alternative O(n) approach based on sliding and global patterns [6]. In the context of image, audio, and text processing, [13] propose the Sparse Transformer, an O(n√n) model based on a sparse factorization of the attention matrix relying on specific 2D factorized attention schemes. These approaches however prevent the use of task-dependent dynamic patterns. Considering adaptive patterns, [16] also introduced dynamic convolutions as an O(n)-complexity substitute for self-attention: kernels defining the importance of context elements are specified at inference time rather than fixed after training. Another example is Reformer [11], an O(n log n) approach relying on locality-sensitive hashing (LSH) with random projections.
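To make the adaptive-pattern idea more concrete, the following minimal Python/NumPy sketch shows how random-projection LSH buckets can define which positions are eligible to attend to each other. The number of buckets, the function names, and the single hashing round are illustrative assumptions; the actual Reformer additionally shares queries and keys, sorts and chunks positions by bucket, and uses several hashing rounds.

import numpy as np

def lsh_buckets(x, n_buckets, seed=0):
    # Angular LSH: project onto n_buckets // 2 random directions and
    # take the argmax over the concatenation [proj, -proj].
    rng = np.random.default_rng(seed)
    proj = x @ rng.standard_normal((x.shape[-1], n_buckets // 2))
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

def lsh_attention_mask(x, n_buckets=8):
    # Boolean mask: attention is only allowed within the same bucket.
    b = lsh_buckets(x, n_buckets)
    return b[:, None] == b[None, :]

x = np.random.randn(16, 32)          # 16 positions, hidden size 32
mask = lsh_attention_mask(x)
print(mask.sum(axis=-1))             # number of eligible keys per query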
In a transverse manner, several authors, explicitly or implicitly motivated by the compositional nature of language, have studied structured approaches in which subsequences (i.e. blocks) are processed independently and then aggregated. This aims at implementing a local or global dynamic memory for capturing short- to long-range dependencies. [17] introduce a blockwise approach to reduce the quadratic complexity induced by large sequences in encoder-decoder architectures. [18] propose a chunkwise attention in which the sequence is adaptively split into small chunks over which soft attention is computed. This idea is also used in Transformer-XL [7]. [19] propose a masked block self-attention mechanism in which the entire sequence is divided into blocks, so as to 1) apply intra-block self-attention to model local contexts, and 2) apply inter-block self-attention to capture long-range dependencies. Such an approach enables some form of connectivity between all positions over several steps without being restricted by the limitations of full attention. This can also be achieved by factorization techniques, e.g. [13]. More recently, authors have proposed global attention mechanisms encoding information related to the blocks on which attention is based [20, 21, 22].
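As an illustration of the two-step blockwise idea sketched above, the following Python/NumPy snippet computes intra-block attention over fixed-size chunks and then lets mean-pooled block summaries attend to each other. The block size, the mean-pooled summaries, and the way local and long-range outputs are combined are simplifying choices of ours, not the exact formulation of the cited works.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def blockwise_attention(x, block):
    # x: (seq_len, dim). Step 1: dense attention restricted to each block.
    # Step 2: mean-pooled block summaries attend to all other summaries,
    # and the resulting long-range context is broadcast back to tokens.
    n, d = x.shape
    assert n % block == 0, "sequence length must be a multiple of block"
    xb = x.reshape(n // block, block, d)

    scores = xb @ xb.transpose(0, 2, 1) / np.sqrt(d)   # intra-block scores
    local = softmax(scores) @ xb                       # (n_blocks, block, d)

    summaries = xb.mean(axis=1)                        # (n_blocks, d)
    ctx = softmax(summaries @ summaries.T / np.sqrt(d)) @ summaries

    return (local + ctx[:, None, :]).reshape(n, d)

y = blockwise_attention(np.random.randn(32, 8), block=4)
print(y.shape)                                         # (32, 8)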
This paper presents LSG (Local, Sparse and Global) attention, which relies on block local attention to capture local context, sparse attention to capture extended context, and global attention to improve information flow. Contrary to prior work, which mostly focuses on defining new models, the proposed LSG Attention mechanism is model-agnostic and aims at facilitating the adaptation of existing (pretrained) models so that they can be used on long sequences.
3 LSG Attention
LSG attention relies on two main assumptions. It is assumed that, locally, a token needs to capture low-level information, thus dense attention is preferred. On the other hand, as the context grows, higher-level information is sufficient; this translates into connections to a limited number of tokens following specific selection and computation rules. The LSG approach relies on three components: block local attention to capture local context, sparse attention to capture extended context, and global attention to improve information flow. A comparison to the Big Bird and Longformer attention patterns is shown in Figure 1.
Figure 1: Attention patterns
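To illustrate how the three components can be combined, the following Python/NumPy sketch builds a single boolean mask with block-local, sparse, and global connections. The fixed-stride sparse selection, the parameter values, and the function name are simplifying assumptions for illustration only and do not reproduce the exact LSG computation.

import numpy as np

def lsg_style_mask(seq_len, block, stride, n_global):
    # Boolean mask combining local, sparse and global connections.
    idx = np.arange(seq_len)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # Local: tokens attend within their own block.
    mask |= (idx[:, None] // block) == (idx[None, :] // block)

    # Sparse: tokens additionally attend to every stride-th position,
    # a simple stand-in for a rule-based sparse selection.
    mask |= (idx[None, :] % stride) == 0

    # Global: dedicated tokens attend to, and are attended by, everyone.
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    return mask

m = lsg_style_mask(seq_len=16, block=4, stride=4, n_global=1)
print(m.astype(int))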
3.1 Local Attention
Longformer depends on a fixed-length sliding window to perform local attention. However, this approach is difficult to optimize and must rely on a custom CUDA kernel to be computationally efficient. To improve overall training and inference speed, we take advantage of a block-based process similar to Big Bird. The sequence