LSG Attention: Extrapolation of pretrained
Transformers to long sequences
Charles Condevaux
CHROME
University of Nîmes, France
charles.condevaux@unimes.fr
Sébastien Harispe
EuroMov Digital Health in Motion,
Univ Montpellier, IMT Mines Ales, France
sebastien.harispe@mines-ales.fr
Abstract
Transformer models achieve state-of-the-art performance on a wide range of NLP tasks. They however suffer from a prohibitive limitation due to the self-attention mechanism, inducing $O(n^2)$ complexity with regard to sequence length. To answer this limitation we introduce the LSG architecture which relies on Local, Sparse and Global attention. We show that LSG attention is fast, efficient and competitive in classification and summarization tasks on long documents. Interestingly, it can also be used to adapt existing pretrained models to efficiently extrapolate to longer sequences with no additional training. Along with the introduction of the LSG attention mechanism, we propose tools to train new models and adapt existing ones based on this mechanism.
1 Introduction
Transformer models [1] are nowadays state-of-the-art in numerous domains, and in particular in NLP where they are used in general language models, and to successfully tackle several specific tasks such as document summarization, machine translation and speech processing to cite a few [2, 3]. The cornerstone of Transformer models is the Attention mechanism used to iteratively build complex context-dependent representations of sequence elements, e.g. tokens, by dynamically aggregating prior representations of these elements. Using self-attention, a popular Attention flavour, this is done by computing full attention scores defining how each prior element representation will contribute to building the new representation of an element. Considering a sequence of $n$ elements, the computation of the attention scores is therefore of complexity $O(n^2)$, which is prohibitive when large sequences have to be processed.
Furthermore, in the current context where a large number of models based on full attention have been trained on various datasets and tasks, we are also interested in extrapolating those models to longer sequences by simply substituting full attention with new attention mechanisms post training. Common pretrained models (e.g. BERT, RoBERTa) are indeed known to underperform when extrapolated to sequences of length exceeding the 512 tokens considered during training. This is due to the nature of the attention mechanism, which largely impacts extrapolation capabilities: full attention usually fails to extrapolate, even considering post hoc adaptations, e.g. adding constants in the score matrix [4], using a relative positional embedding [5] or duplicating the positional embedding [6]. Defining new attention mechanisms that can efficiently substitute full attention in pretrained models that are not originally capable of handling long sequences would avoid the costs induced by training large language models from scratch.
The main contributions of this paper are:
1. LSG (Local Sparse Global) attention, an efficient $O(n)$ approach to approximate self-attention for processing long sequences, is introduced.1
2. Results demonstrating that LSG is fast, efficient and competitive on classification and summarization tasks applied to long documents are presented. It is also shown that LSG can be used to adapt and extrapolate existing pretrained models not based on LSG, with minimal to no additional training.
3. A procedure and companion tools are proposed to convert various existing models and checkpoints (BERT, RoBERTa, DistilBERT, BART) from HuggingFace to their LSG variant.2
Compared to several contributions aiming at reducing the complexity of self-attention, introduced hereafter, our work focuses specifically on the extrapolation of existing Transformer models, i.e. their reuse, to longer sequences.
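As an illustration of contribution 3, the sketch below shows how a converted LSG checkpoint could be loaded through the standard transformers API. The model identifier is a hypothetical placeholder (the actual hub page and conversion tool are linked in footnotes 1 and 2), and `trust_remote_code=True` is shown under the assumption that the LSG variant is distributed as custom modeling code.

```python
# Minimal sketch of loading an LSG-converted checkpoint with transformers.
# The model identifier below is a hypothetical placeholder; actual LSG
# checkpoints and the conversion script are linked in footnotes 1 and 2.
from transformers import AutoModel, AutoTokenizer

model_id = "ccdv/lsg-example-4096"  # hypothetical identifier

# trust_remote_code=True assumes the LSG variant ships custom modeling code.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("A very long document ...", return_tensors="pt",
                   truncation=True, max_length=4096)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```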
1 https://huggingface.co/ccdv
2 https://github.com/ccdv-ai/convert_checkpoint_to_lsg

2 Related works

Several contributions have been devoted to the optimization of the Attention mechanism. Four categories of approaches can be distinguished in the literature: (i) recurrent models such as Transformer-XL [7] and Compressive Transformers [8], which maintain a memory of past activations at each layer to preserve long-range contextual information; (ii) models based on factorization or kernels aiming at compressing attention score matrices, such as Linformer [9] or Performer [10]; (iii) models based on clustering such as Reformer [11], which dynamically define eligible attention patterns (i.e. where attention may be computed); and (iv) models based on fixed or adaptive attention patterns, e.g. Longformer [6] or Big Bird [12].
Recurrent approaches iteratively process the sequence by maintaining a memory to enable long-range dependencies. They generally suffer from limitations induced by specific, slow, and difficult-to-implement forward and back propagation procedures. Alternatively, one of the main lines of study for reducing the complexity of Attention is thus to enforce sparsity by limiting the number of elements on which new representations will be based, i.e. reducing the number of elements with non-null attention scores. This approach is motivated by the observation of global or data-dependent positional patterns of non-null attention scores depending on the task [13]. The sparsity of attention scores in the traditional Attention mechanism is indeed documented in the literature. It has for instance been shown that in practice, full attention tends to overweight close elements on average, in particular for MLM, machine translation, and seq-to-seq tasks in general [14]. Moreover, according to analyses on the use of multi-head full attention on specific tasks, e.g. machine translation, numerous heads learn similar simple patterns [15]. Such redundant patterns may be hardcoded by implementing fixed positional patterns, possibly in a task-dependent manner.
Two main approaches are discussed in the literature for implementing sparsity: fixed or adaptive patterns, based on whether attention scores are computed considering (1) predefined fixed elements based on their location in the sequence, or (2) elements selected by a given procedure. As an example, [16] have shown that fixed $O(n)$ convolutions can perform competitively on machine translation. Longformer proposes an alternative $O(n)$ approach based on sliding and global patterns [6]. In the context of image, audio, and text processing, [13] propose the sparse Transformer, an $O(n\sqrt{n})$ model based on sparse factorization of the attention matrix relying on specific 2D factorized attention schemes. Those approaches however prevent the use of task-dependent dynamic patterns. Considering adaptive patterns, [16] also introduced dynamic convolutions as an $O(n)$ complexity substitute to self-attention. Kernels defining the importance of context elements are specified at inference time rather than fixed after training. Another example is Reformer [11], an $O(n \log n)$ approach based on locality-sensitive hashing (LSH) relying on random projections.
In a transverse manner, several authors, explicitly or implicitly motivated by the compositional nature of language, have studied structured approaches in which subsequences (i.e. blocks) are processed independently and then aggregated. This aims at implementing a local or global dynamic memory for considering close- to long-range dependencies. [17] introduce a blockwise approach to reduce the quadratic complexity induced by large sequences in encoder-decoder architectures. [18] propose a chunkwise attention in which attention is performed in a blockwise manner, adaptively splitting the sequence into small chunks over which soft attention is computed. This idea is also used in Transformer-XL [7]. [19] propose a masked block self-attention mechanism in which the entire sequence is divided into blocks, first applying intra-block self-attention to model local contexts, then applying inter-block self-attention to capture long-range dependencies. Such an approach enables implementing some form of connectivity between all positions over several steps without being restricted by full attention limitations. This can also be achieved by factorization techniques, e.g. [13]. More recently, authors have proposed global attention mechanisms encoding information related to the blocks on which attention is based [20, 21, 22].
This paper presents the LSG (Local, Sparse and Global) attention based on block local attention to capture local context, sparse attention to capture extended context and global attention to improve information flow. Contrary to prior work mostly focusing on defining new models, the proposed LSG Attention mechanism is model agnostic and aims at facilitating the adaptation of existing (pretrained) models so that they can be used on long sequences.
3 LSG Attention
LSG attention relies on two main points. It is assumed that locally, a token needs to capture low-level information, thus dense attention is preferred. On the other hand, as the context grows, higher-level information is sufficient. This translates into the need for connections to a limited number of tokens following specific selection and computation rules. The LSG approach relies on 3 components: block local attention to capture local context, sparse attention to capture extended context and global attention to improve information flow. A comparison to Big Bird and Longformer attention patterns is shown in Figure 1.
Figure 1: Attention patterns of LSG, Big Bird and Longformer.
3.1 Local Attention
Longformer depends on a fixed length sliding window to perform local attention. However, this approach is difficult to optimize and must rely on a custom CUDA kernel to be computationally efficient. To improve overall training and inference speed, we take advantage of a block-based process similar to Big Bird. The sequence is split into $n_b$ non-overlapping chunks of size $b_t$. For a given block, each token attends to the tokens inside the block, as well as to those in the previous and next blocks. In this configuration, the local attention window is asymmetrical since a token can connect to up to $2 \times b_t - 1$ tokens on the left or on the right.
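A minimal PyTorch sketch of this block local pattern, under our own assumptions about tensor layout (batch, sequence, dimension) and with edge blocks handled by wrapping rather than padding for brevity; it is not the authors' implementation:

```python
import torch

def block_local_keys(keys: torch.Tensor, block_size: int) -> torch.Tensor:
    """Gather, for each block of queries, the keys of the previous,
    current and next blocks. keys: (batch, seq_len, dim); seq_len is
    assumed to be a multiple of block_size (padding is omitted here)."""
    b, n, d = keys.shape
    n_blocks = n // block_size
    blocks = keys.view(b, n_blocks, block_size, d)    # (b, nb, bt, d)
    prev = torch.roll(blocks, shifts=1, dims=1)       # previous block (wraps at the edges)
    nxt = torch.roll(blocks, shifts=-1, dims=1)       # next block (wraps at the edges)
    # (b, nb, 3*bt, d): each query block attends to 3 local key blocks
    return torch.cat([prev, blocks, nxt], dim=2)

# usage: with bt = 4 and a sequence of 16 tokens, every query sees up to
# 2*bt - 1 = 7 tokens on each side (edge blocks wrap here for simplicity).
local_keys = block_local_keys(torch.randn(1, 16, 64), block_size=4)
print(local_keys.shape)  # torch.Size([1, 4, 12, 64])
```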
3.2 Sparse Attention
Sparse connections are used to expand the local con-
text by selecting an additional set of tokens following
a set of rules. These tokens can be directly selected
based on a specific metric or using some computation
such as a pooling method. In the proposed approach,
each attention head can process different sparse tokens
independently. Sparse attention also relies on a block
structure where the sparse selection is done inside each
block. Five alternative criteria can be used in LSG.
Head-wise strided
Inspired by the Hepos model [23], a fixed selection pattern is defined. Each attention head will attend to a set of tokens following a specific stride defined as the sparsify factor $f$. Figure 2 shows the selection pattern.
Figure 2: Head-wise strided selection with a stride of 2.
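A possible PyTorch sketch of such a head-wise strided selection; the per-head offsets and tensor layout are our own illustrative assumptions rather than the exact pattern used in LSG:

```python
import torch

def headwise_strided_select(keys: torch.Tensor, block_size: int, f: int) -> torch.Tensor:
    """keys: (batch, heads, seq_len, dim). Inside each block of size block_size,
    head h keeps the tokens at positions h % f, h % f + f, ... (block_size // f
    tokens per block), so different heads see different strided subsets."""
    b, h, n, d = keys.shape
    n_blocks = n // block_size
    blocks = keys.view(b, h, n_blocks, block_size, d)
    offsets = torch.arange(h) % f                                      # per-head offset
    idx = offsets[:, None] + f * torch.arange(block_size // f)[None]   # (h, bt/f)
    idx = idx.view(1, h, 1, -1, 1).expand(b, h, n_blocks, -1, d)
    return blocks.gather(3, idx)                                       # (b, h, nb, bt/f, d)
```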
Head-wise block strided
This selection pattern is similar to the previous one but selects consecutive tokens instead. Figure 3 shows the selection pattern.
Figure 3: Block strided selection with a stride of 2.
Average pooling
A simple way to reduce sequence length. After chunking the sequence into blocks, sparse tokens are computed using average pooling. For a block of size $b_t$ and a sparsify factor $f$, we pool inside each block with a window of $f$ and a stride of $f$ to produce $b_t/f$ tokens.
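A minimal sketch of this pooling step, assuming keys are stored as a (batch, sequence, dimension) tensor and the sequence length is a multiple of $f$:

```python
import torch
import torch.nn.functional as F

def avg_pool_sparse_tokens(keys: torch.Tensor, f: int) -> torch.Tensor:
    """keys: (batch, seq_len, dim). Pool with window f and stride f,
    producing seq_len / f sparse tokens, i.e. bt / f tokens per block."""
    pooled = F.avg_pool1d(keys.transpose(1, 2), kernel_size=f, stride=f)
    return pooled.transpose(1, 2)  # (batch, seq_len / f, dim)
```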
Max norm
The objective of a norm-based approach is to select tokens that are most likely highly weighted in the score matrix. Finding those keys efficiently is difficult in practice, so we use a simple and deterministic metric. For a query and a key $q, k \in \mathbb{R}^d$, we can write:

$$q k^\top = \cos(\theta)\,\|q\|\,\|k\|$$

In this situation the sign of $\cos(\theta)$ is unknown. However, if it is positive and $\|k\|$ is high, the key will likely dominate the softmax regardless of the query. After chunking the sequence into blocks, we select inside each block and each head the $b_t/f$ tokens with the highest key norm.
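A sketch of this criterion under the same assumed (batch, heads, sequence, dimension) layout; padding and tie-breaking are ignored:

```python
import torch

def max_norm_select(keys: torch.Tensor, block_size: int, f: int) -> torch.Tensor:
    """keys: (batch, heads, seq_len, dim). Keep, per block and per head,
    the block_size // f keys with the highest L2 norm."""
    b, h, n, d = keys.shape
    blocks = keys.view(b, h, n // block_size, block_size, d)
    norms = blocks.norm(dim=-1)                               # (b, h, nb, bt)
    top = norms.topk(block_size // f, dim=-1).indices         # (b, h, nb, bt/f)
    top = top.unsqueeze(-1).expand(-1, -1, -1, -1, d)
    return blocks.gather(3, top)                              # (b, h, nb, bt/f, d)
```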
LSH Clustering
This approach is non-deterministic since it relies on the LSH algorithm [24]. For each block, $b_t/f$ clusters are built using a single round of LSH. To get $c = b_t/f$ hashes, for an input $x \in \mathbb{R}^d$ a random matrix $R \in \mathbb{R}^{d \times c/2}$ is generated, such that

$$h(x) = \arg\max([xR; -xR])$$

with $[a; b]$ the concatenation of two vectors. Using the key matrix as input, each token inside the block gets a cluster index from $h(x)$. Tokens inside a cluster are averaged.
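A sketch of the hashing step described above; drawing the random projection inside the function is an illustrative simplification (a fixed or seeded projection would be used in practice):

```python
import torch

def lsh_cluster_ids(x: torch.Tensor, n_clusters: int) -> torch.Tensor:
    """x: (..., d). Returns a cluster index in [0, n_clusters) per token,
    using h(x) = argmax([xR; -xR]) with R in R^{d x n_clusters/2}."""
    d = x.shape[-1]
    R = torch.randn(d, n_clusters // 2, device=x.device, dtype=x.dtype)
    xR = x @ R
    return torch.cat([xR, -xR], dim=-1).argmax(dim=-1)

# usage: 8 keys of dimension 64 hashed into bt/f = 4 clusters;
# keys falling in the same cluster would then be averaged.
ids = lsh_cluster_ids(torch.randn(8, 64), n_clusters=4)
```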
Computation
To reduce the computational cost, the attention pattern is designed to compute each connection only once. For this, the local and sparse tokens are selected such that there is no overlap between them during attention computation. Each query is connected to 3 local blocks and 2 sparse blocks of keys. The maximum context length (distance between two keys) is then equal to $3 \times b_t + 2 \times b_t \times f$. The concatenation of local and sparse keys is shown in Figure 4. For causal attention, the third local block and the second sparse block can be ignored during computation.
Figure 4: Local and sparse contexts with a block size of 2 and a sparsity factor of 4. Queries $a$ and $b$ will attend to 6 local keys and 4 sparse keys.
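As a concrete instance of the formula above (values chosen purely for illustration), with a block size $b_t = 128$ and a sparsity factor $f = 4$, the maximum context length is $3 \times 128 + 2 \times 128 \times 4 = 1408$ tokens, while the number of keys attended per query stays constant and independent of the total sequence length, which is what keeps the overall attention cost linear in $n$.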
3.3 Global Attention
Global tokens improve the flow of information inside the model. They attend to every token across the sequence and all tokens attend to them. Rather than picking a subset of tokens and defining them as global, global tokens are prepended to the sequence and trained using their own embedding matrix; their number is thus an additional hyperparameter. When a model is converted to its LSG version, the first global token is initialized as the sum of the [CLS] (or <s>) token embedding and the first position from the positional embedding. The other global tokens are initialized as the sum of the [MASK] (or <mask>) token embedding and the other positions from the positional embedding.
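A sketch of this initialization; the way the embedding matrices are passed in is an assumption for illustration, since the actual tensors live at architecture-specific paths inside the HuggingFace checkpoint:

```python
import torch

def init_global_embeddings(word_emb: torch.Tensor,
                           pos_emb: torch.Tensor,
                           cls_id: int,
                           mask_id: int,
                           num_global: int) -> torch.Tensor:
    """word_emb: (vocab, dim), pos_emb: (max_pos, dim).
    Global token 0 = [CLS] embedding + position 0;
    global token i > 0 = [MASK] embedding + position i."""
    glob = torch.empty(num_global, word_emb.shape[1])
    glob[0] = word_emb[cls_id] + pos_emb[0]
    glob[1:] = word_emb[mask_id] + pos_emb[1:num_global]
    return glob
```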
3.4 Positional Embedding
It is necessary to modify the positional embedding ma-
trix to reuse existing models to process long sequences.