LSG Attention: Extrapolation of pretrained
Transformers to long sequences
Charles Condevaux
CHROME
University of Nîmes, France
charles.condevaux@unimes.fr
Sébastien Harispe
EuroMov Digital Health in Motion,
Univ Montpellier, IMT Mines Ales, France
sebastien.harispe@mines-ales.fr
Abstract
Transformer models achieve state-of-the-art performance on a wide range of NLP tasks. They however suffer from a prohibitive limitation due to the self-attention mechanism, inducing $O(n^2)$ complexity with regard to sequence length. To answer this limitation we introduce the LSG architecture which relies on Local, Sparse and Global attention. We show that LSG attention is fast, efficient and competitive in classification and summarization tasks on long documents. Interestingly, it can also be used to adapt existing pretrained models to efficiently extrapolate to longer sequences with no additional training. Along with the introduction of the LSG attention mechanism, we propose tools to train new models and adapt existing ones based on this mechanism.
1 Introduction
Transformer models [1] are nowadays state-of-the-art in numerous domains, and in particular in NLP where they are used in general language models, and to successfully tackle several specific tasks such as document summarization, machine translation and speech processing to cite a few [2, 3]. The cornerstone of Transformer models is the Attention mechanism used to iteratively build complex context-dependent representations of sequence elements, e.g. tokens, by dynamically aggregating prior representations of these elements. Using self-attention, a popular Attention flavour, this is done by computing full attention scores defining how each prior element representation will contribute to building the new representation of an element. Considering a sequence of $n$ elements, the computation of the attention scores is therefore of complexity $O(n^2)$, which is prohibitive when large sequences have to be processed.
Furthermore, in the current context where a large number of models based on full attention have been trained on various datasets and tasks, we are also interested in extrapolating those models to longer sequences by simply substituting full attention with new attention mechanisms post training. Common pretrained models (e.g. BERT, RoBERTa) are indeed known to underperform when extrapolated to sequences of length exceeding the 512 tokens considered during training. This is due to the nature of the attention mechanism, which largely impacts extrapolation capabilities: full attention usually fails to extrapolate, even considering post hoc adaptations, e.g. adding constants in the score matrix [4], using a relative positional embedding [5] or duplicating the positional embedding [6]. Defining new attention mechanisms that can efficiently substitute full attention in pretrained models that are not originally capable of handling long sequences would avoid the costs induced by training large language models from scratch.
The main contributions of this paper are:
1. LSG (Local Sparse Global) attention, an efficient $O(n)$ approach to approximate self-attention for processing long sequences, is introduced.1
2. Results demonstrating that LSG is fast, efficient and competitive on classification and summarization tasks applied to long documents are presented. It is also shown that LSG can be used to adapt and extrapolate existing pretrained models not based on LSG, with minimal to no additional training.
3. A procedure and companion tools are proposed to convert various existing models and checkpoints (BERT, RoBERTa, DistilBERT, BART) from HuggingFace to their LSG variant.2
Compared to several contributions aiming at reducing the complexity of self-attention, introduced hereafter, our work focuses specifically on the extrapolation of existing Transformer models, i.e. their reuse, to longer sequences.
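As an illustration of contribution 3, the sketch below shows how a converted LSG checkpoint could be loaded through the standard transformers API. The model identifier is a hypothetical placeholder (the actual hub page and conversion tool are linked in footnotes 1 and 2), and `trust_remote_code=True` is shown under the assumption that the LSG variant is distributed as custom modeling code.

```python
# Minimal sketch of loading an LSG-converted checkpoint with transformers.
# The model identifier below is a hypothetical placeholder; actual LSG
# checkpoints and the conversion script are linked in footnotes 1 and 2.
from transformers import AutoModel, AutoTokenizer

model_id = "ccdv/lsg-example-4096"  # hypothetical identifier

# trust_remote_code=True assumes the LSG variant ships custom modeling code.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("A very long document ...", return_tensors="pt",
                   truncation=True, max_length=4096)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```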
1 https://huggingface.co/ccdv
2 https://github.com/ccdv-ai/convert_checkpoint_to_lsg

2 Related works

Several contributions have been devoted to the optimization of the Attention mechanism. Four categories of approaches can be distinguished in the literature: (i) recurrent models such as Transformer-XL [7] and Compressive Transformers [8], which maintain a memory of past activations at each layer to preserve long-range contextual information; (ii) models based on factorization or kernels aiming at compressing attention score matrices, such as Linformer [9] or Performer [10]; (iii) models based on clustering such as Reformer [11], which dynamically define eligible attention patterns (i.e. where attention may be computed); and (iv) models based on fixed or adaptive attention patterns, e.g. Longformer [6] or Big Bird [12].
Recurrent approaches iteratively process the sequence by maintaining a memory to enable long-range dependencies. They generally suffer from limitations induced by specific, slow, and difficult-to-implement forward and back propagation procedures. Alternatively, one of the main lines of study for reducing the complexity of Attention is thus to enforce sparsity by limiting the number of elements on which new representations will be based, i.e. reducing the number of elements with non-null attention scores. This approach is motivated by the observation of global or data-dependent positional patterns of non-null attention scores depending on the task [13]. The sparsity of attention scores in the traditional Attention mechanism is indeed documented in the literature. It has for instance been shown that in practice, full attention tends to overweight close elements on average, in particular for MLM, machine translation, and seq-to-seq tasks in general [14]. Moreover, according to analyses on the use of multi-head full attention on specific tasks, e.g. machine translation, numerous heads learn similar simple patterns [15]. Such redundant patterns may be hardcoded by implementing fixed positional patterns, possibly in a task-dependent manner.
Two main approaches are discussed in the literature for implementing sparsity: fixed or adaptive patterns, based on whether attention scores are computed considering (1) predefined fixed elements based on their location in the sequence, or (2) elements selected by a given procedure. As an example, [16] have shown that fixed $O(n)$ convolutions can perform competitively on machine translation. Longformer proposes an alternative $O(n)$ approach based on sliding and global patterns [6]. In the context of image, audio, and text processing, [13] propose the sparse Transformer, an $O(n\sqrt{n})$ model based on sparse factorization of the attention matrix relying on specific 2D factorized attention schemes. Those approaches however prevent the use of task-dependent dynamic patterns. Considering adaptive patterns, [16] also introduced dynamic convolutions as an $O(n)$ complexity substitute to self-attention. Kernels defining the importance of context elements are specified at inference time rather than fixed after training. Another example is Reformer [11], an $O(n \log n)$ approach based on locality-sensitive hashing (LSH) relying on random projections.
In a transverse manner, several authors, explicitly or implicitly motivated by the compositional nature of language, have studied structured approaches in which subsequences (i.e. blocks) are processed independently and then aggregated. This aims at implementing a local or global dynamic memory for considering close- to long-range dependencies. [17] introduce a blockwise approach to reduce the quadratic complexity induced by large sequences in encoder-decoder architectures. [18] propose a chunkwise attention in which attention is performed in a blockwise manner, adaptively splitting the sequence into small chunks over which soft attention is computed. This idea is also used in Transformer-XL [7]. [19] propose a masked block self-attention mechanism in which the entire sequence is divided into blocks, first applying intra-block self-attention to model local contexts, then applying inter-block self-attention to capture long-range dependencies. Such an approach enables implementing some form of connectivity between all positions over several steps without being restricted by full attention limitations. This can also be achieved by factorization techniques, e.g. [13]. More recently, authors have proposed global attention mechanisms encoding information related to the blocks on which attention is based [20, 21, 22].
This paper presents the LSG (Local, Sparse and Global) attention based on block local attention to capture local context, sparse attention to capture extended context and global attention to improve information flow. Contrary to prior work mostly focusing on defining new models, the proposed LSG Attention mechanism is model agnostic and aims at facilitating the adaptation of existing (pretrained) models so that they can be used on long sequences.
3 LSG Attention
LSG attention relies on two main points. It is assumed that locally, a token needs to capture low-level information, thus dense attention is preferred. On the other hand, as the context grows, higher-level information is sufficient. This translates into the need for connections to a limited number of tokens following specific selection and computation rules. The LSG approach relies on 3 components: block local attention to capture local context, sparse attention to capture extended context and global attention to improve information flow. A comparison to Big Bird and Longformer attention patterns is shown in Figure 1.
Figure 1: Attention patterns of LSG, Big Bird and Longformer.
3.1 Local Attention
Longformer depends on a fixed length sliding window to perform local attention. However, this approach is difficult to optimize and must rely on a custom CUDA kernel to be computationally efficient. To improve overall training and inference speed, we take advantage of a block-based process similar to Big Bird. The sequence is split into $n_b$ non-overlapping chunks of size $b_t$. For a given block, each token attends to the tokens inside the block, as well as to those in the previous and next blocks. In this configuration, the local attention window is asymmetrical since a token can connect to up to $2 \times b_t - 1$ tokens on the left or on the right.
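A minimal PyTorch sketch of this block local pattern, under our own assumptions about tensor layout (batch, sequence, dimension) and with edge blocks handled by wrapping rather than padding for brevity; it is not the authors' implementation:

```python
import torch

def block_local_keys(keys: torch.Tensor, block_size: int) -> torch.Tensor:
    """Gather, for each block of queries, the keys of the previous,
    current and next blocks. keys: (batch, seq_len, dim); seq_len is
    assumed to be a multiple of block_size (padding is omitted here)."""
    b, n, d = keys.shape
    n_blocks = n // block_size
    blocks = keys.view(b, n_blocks, block_size, d)    # (b, nb, bt, d)
    prev = torch.roll(blocks, shifts=1, dims=1)       # previous block (wraps at the edges)
    nxt = torch.roll(blocks, shifts=-1, dims=1)       # next block (wraps at the edges)
    # (b, nb, 3*bt, d): each query block attends to 3 local key blocks
    return torch.cat([prev, blocks, nxt], dim=2)

# usage: with bt = 4 and a sequence of 16 tokens, every query sees up to
# 2*bt - 1 = 7 tokens on each side (edge blocks wrap here for simplicity).
local_keys = block_local_keys(torch.randn(1, 16, 64), block_size=4)
print(local_keys.shape)  # torch.Size([1, 4, 12, 64])
```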
3.2 Sparse Attention
Sparse connections are used to expand the local con-
text by selecting an additional set of tokens following
a set of rules. These tokens can be directly selected
based on a specific metric or using some computation
such as a pooling method. In the proposed approach,
each attention head can process different sparse tokens
independently. Sparse attention also relies on a block
structure where the sparse selection is done inside each
block. Five alternative criteria can be used in LSG.
Head-wise strided
Inspired by the Hepos model [23], a fixed selection pattern is defined. Each attention head will attend to a set of tokens following a specific stride defined as the sparsify factor $f$. Figure 2 shows the selection pattern.
Figure 2: Head-wise strided selection with a stride of 2.
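A possible PyTorch sketch of such a head-wise strided selection; the per-head offsets and tensor layout are our own illustrative assumptions rather than the exact pattern used in LSG:

```python
import torch

def headwise_strided_select(keys: torch.Tensor, block_size: int, f: int) -> torch.Tensor:
    """keys: (batch, heads, seq_len, dim). Inside each block of size block_size,
    head h keeps the tokens at positions h % f, h % f + f, ... (block_size // f
    tokens per block), so different heads see different strided subsets."""
    b, h, n, d = keys.shape
    n_blocks = n // block_size
    blocks = keys.view(b, h, n_blocks, block_size, d)
    offsets = torch.arange(h) % f                                      # per-head offset
    idx = offsets[:, None] + f * torch.arange(block_size // f)[None]   # (h, bt/f)
    idx = idx.view(1, h, 1, -1, 1).expand(b, h, n_blocks, -1, d)
    return blocks.gather(3, idx)                                       # (b, h, nb, bt/f, d)
```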
Head-wise block strided
This selection pattern is similar to the previous one but selects consecutive tokens instead. Figure 3 shows the selection pattern.
Figure 3: Block strided selection with a stride of 2.
Average pooling
A simple way to reduce sequence length. After chunking the sequence into blocks, sparse tokens are computed using average pooling. For a block of size $b_t$ and a sparsify factor $f$, we pool inside each block with a window of $f$ and a stride of $f$ to produce $b_t/f$ tokens.
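A minimal sketch of this pooling step, assuming keys are stored as a (batch, sequence, dimension) tensor and the sequence length is a multiple of $f$:

```python
import torch
import torch.nn.functional as F

def avg_pool_sparse_tokens(keys: torch.Tensor, f: int) -> torch.Tensor:
    """keys: (batch, seq_len, dim). Pool with window f and stride f,
    producing seq_len / f sparse tokens, i.e. bt / f tokens per block."""
    pooled = F.avg_pool1d(keys.transpose(1, 2), kernel_size=f, stride=f)
    return pooled.transpose(1, 2)  # (batch, seq_len / f, dim)
```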
Max norm
The objective of a norm-based approach is to select tokens that are most likely highly weighted in the score matrix. Finding those keys efficiently is difficult in practice, so we use a simple and deterministic metric. For a query and a key $q, k \in \mathbb{R}^d$, we can write:

$$q k^\top = \cos(\theta)\,\|q\|\,\|k\|$$

In this situation the sign of $\cos(\theta)$ is unknown. However, if it is positive and $\|k\|$ is high, the key will likely dominate the softmax regardless of the query. After chunking the sequence into blocks, we select inside each block and each head the $b_t/f$ tokens with the highest key norm.
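A sketch of this criterion under the same assumed (batch, heads, sequence, dimension) layout; padding and tie-breaking are ignored:

```python
import torch

def max_norm_select(keys: torch.Tensor, block_size: int, f: int) -> torch.Tensor:
    """keys: (batch, heads, seq_len, dim). Keep, per block and per head,
    the block_size // f keys with the highest L2 norm."""
    b, h, n, d = keys.shape
    blocks = keys.view(b, h, n // block_size, block_size, d)
    norms = blocks.norm(dim=-1)                               # (b, h, nb, bt)
    top = norms.topk(block_size // f, dim=-1).indices         # (b, h, nb, bt/f)
    top = top.unsqueeze(-1).expand(-1, -1, -1, -1, d)
    return blocks.gather(3, top)                              # (b, h, nb, bt/f, d)
```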
LSH Clustering
This approach is non-deterministic since it relies on the LSH algorithm [24]. For each block, $b_t/f$ clusters are built using a single round of LSH. To get $c = b_t/f$ hashes, for an input $x \in \mathbb{R}^d$ a random matrix $R \in \mathbb{R}^{d \times c/2}$ is generated, such that

$$h(x) = \arg\max([xR; -xR])$$

with $[a; b]$ the concatenation of two vectors. Using the key matrix as input, each token inside the block gets a cluster index from $h(x)$. Tokens inside a cluster are averaged.
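A sketch of the hashing step described above; drawing the random projection inside the function is an illustrative simplification (a fixed or seeded projection would be used in practice):

```python
import torch

def lsh_cluster_ids(x: torch.Tensor, n_clusters: int) -> torch.Tensor:
    """x: (..., d). Returns a cluster index in [0, n_clusters) per token,
    using h(x) = argmax([xR; -xR]) with R in R^{d x n_clusters/2}."""
    d = x.shape[-1]
    R = torch.randn(d, n_clusters // 2, device=x.device, dtype=x.dtype)
    xR = x @ R
    return torch.cat([xR, -xR], dim=-1).argmax(dim=-1)

# usage: 8 keys of dimension 64 hashed into bt/f = 4 clusters;
# keys falling in the same cluster would then be averaged.
ids = lsh_cluster_ids(torch.randn(8, 64), n_clusters=4)
```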
Computation
To reduce the computational cost, the attention pattern is designed to compute each connection only once. For this, the local and sparse tokens are selected such that there is no overlap between them during attention computation. Each query is connected to 3 local blocks and 2 sparse blocks of keys. The maximum context length (distance between two keys) is then equal to $3 \times b_t + 2 \times b_t \times f$. The concatenation of local and sparse keys is shown in Figure 4. For causal attention, the third local block and the second sparse block can be ignored during computation.
Figure 4: Local and sparse contexts with a block size of 2 and a sparsity factor of 4. Queries $a$ and $b$ will attend to 6 local keys and 4 sparse keys.
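As a concrete instance of the formula above (values chosen purely for illustration), with a block size $b_t = 128$ and a sparsity factor $f = 4$, the maximum context length is $3 \times 128 + 2 \times 128 \times 4 = 1408$ tokens, while the number of keys attended per query stays constant and independent of the total sequence length, which is what keeps the overall attention cost linear in $n$.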
3.3 Global Attention
Global tokens improve the flow of information inside the model. They attend to every token across the sequence and all tokens attend to them. Rather than picking a subset of tokens and defining them as global, global tokens are prepended to the sequence and trained using their own embedding matrix; their number is thus an additional hyperparameter. When a model is converted to its LSG version, the first global token is initialized as the sum of the [CLS] (or <s>) token embedding and the first position from the positional embedding. The other global tokens are initialized as the sum of the [MASK] (or <mask>) token embedding and the other positions from the positional embedding.
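A sketch of this initialization; the way the embedding matrices are passed in is an assumption for illustration, since the actual tensors live at architecture-specific paths inside the HuggingFace checkpoint:

```python
import torch

def init_global_embeddings(word_emb: torch.Tensor,
                           pos_emb: torch.Tensor,
                           cls_id: int,
                           mask_id: int,
                           num_global: int) -> torch.Tensor:
    """word_emb: (vocab, dim), pos_emb: (max_pos, dim).
    Global token 0 = [CLS] embedding + position 0;
    global token i > 0 = [MASK] embedding + position i."""
    glob = torch.empty(num_global, word_emb.shape[1])
    glob[0] = word_emb[cls_id] + pos_emb[0]
    glob[1:] = word_emb[mask_id] + pos_emb[1:num_global]
    return glob
```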
3.4 Positional Embedding
It is necessary to modify the positional embedding ma-
trix to reuse existing models to process long sequences.