LittleBird: Efficient Faster & Longer Transformer for Question Answering
Minchul Lee and Kijong Han and Myeong Cheol Shin
Kakao Enterprise Corp., South Korea
{phil.i,mat.h,index.sh}@kakaoenterprise.com
Abstract
BERT has shown a lot of success in a wide variety of NLP tasks, but its attention mechanism limits its ability to deal with long inputs. Longformer, ETC and BigBird addressed this issue and effectively solved the quadratic dependency problem. However, we find that these models are not sufficient, and propose LittleBird, a novel model based on BigBird with improved speed and memory footprint while maintaining accuracy. In particular, we devise a more flexible and efficient position representation method based on Attention with Linear Biases (ALiBi). We also show that replacing the method of representing global information in BigBird with pack and unpack attention is more effective. The proposed model can work on long inputs even after being pre-trained on short inputs, and can be trained efficiently by reusing an existing pre-trained language model for short inputs. This is a significant benefit for low-resource languages, where large amounts of long text data are difficult to obtain. As a result, our experiments show that LittleBird works very well in a variety of languages, achieving high performance in question answering tasks, particularly on KorQuAD 2.0, a Korean question answering dataset for long paragraphs.
1 Introduction
Transformer (Vaswani et al., 2017) and the pre-trained language models based on it (Devlin et al., 2019; Liu et al., 2019) have shown a lot of success in a wide variety of NLP tasks. However, the quadratic dependency problem that comes from the attention mechanism makes it impractical to process long documents. Many techniques have been studied to overcome this problem, and BigBird (Zaheer et al., 2020) showed robust and state-of-the-art performance on various NLP downstream tasks.
In this study, we propose a new model, LittleBird, by analyzing and improving the shortcomings of BigBird. LittleBird shows improved speed and memory footprint compared to BigBird while maintaining the overall accuracy on question answering (QA) benchmarks and showing better accuracy on some of them.
In this study, we propose three major improvements compared to BigBird. The first is the method for position representation. In BigBird, trainable positional embedding is used, similar to BERT (Devlin et al., 2019), and in ETC, relative positional encoding is used, similar to T5 (Raffel et al., 2020). However, trainable positional embedding cannot handle inputs longer than those used for training, and relative position encoding is relatively slow and uses extra memory and parameters (Ma et al., 2021). Press et al. (2021) introduced the attention with linear biases (ALiBi) method, which resolves these problems, but it was designed for causal language modeling, not autoencoding language modeling, which is typically useful for QA tasks. Thus, we devise a new method based on ALiBi that is fast, flexible, and also effective for QA tasks.

The second is the method of capturing global information. BigBird introduces two ways of capturing global information: random sparse attention and global tokens (Ainslie et al., 2020), which attend to and are attended by all other tokens. However, the random attention method is slow in practice relative to its time complexity because it requires repeated gather and scatter operations at random positions. In addition, a relatively large number (hundreds) of global tokens are required to achieve the reported performance when using only global tokens without random attention, as in ETC. We show that replacing them with a modified pack and unpack attention (Ma et al., 2021) is more effective.
The last is an efficient way to train a model for long sequences. We introduce a simple but effective method, Padding Insertion, which makes the model robust to long inputs while training on short inputs. We also propose a distillation method that
can maximize the reuse of an existing pre-trained model for short inputs, and show that our model can be effectively pre-trained using these methods.
Our model shows a 12-29% reduction in peak memory usage and a 6-46% reduction in latency compared to the various BigBird and ETC model settings reported in their papers for 4K-length document inference, while showing better accuracy on the dev sets of several English QA benchmarks (Kwiatkowski et al., 2019; Welbl et al., 2018). Our model achieves new state-of-the-art performance on KorQuAD 2.0 (Kim et al., 2019), a Korean long-document QA benchmark. In addition, these results are obtained with LittleBird pre-trained with only a 2K sequence length. This shows that our novel positional representation method works well when the model is applied to a QA downstream task with documents longer than the sequences used in the pre-training phase.
2 Background and Related Work
2.1 Transformers for Longer Input
Various methods have been studied to maintain reasonable performance without using the quadratic attention operation of the Transformer to handle long documents. Child et al. (2019) introduced sparse factorizations of the attention matrix, which reduce the complexity to $O(n\sqrt{n})$. Reformer (Kitaev et al., 2019) reduced the complexity to $O(n \log n)$ using locality-sensitive hashing.

Longformer (Beltagy et al., 2020) and ETC (Ainslie et al., 2020) proposed a method that utilizes several global attention tokens and local windowed attention, reducing the complexity to $O(n)$. In addition, these works showed performance improvements on downstream tasks. BigBird (Zaheer et al., 2020), an extended study related to ETC, proposes a random sparse attention method and provides detailed theoretical background and experimental results for more downstream tasks.
Recently, LUNA (Ma et al., 2021) introduced a method that approximates softmax attention with two nested linear attention functions, called Pack and Unpack Attention, which has only linear time complexity. This method improved both speed and score on the Long Range Arena (LRA) benchmark (Tay et al., 2020).
2.2 Positional Encoding of Transformers
The attention mechanism of Transformers is defined as:

$$\mathrm{Attn}(X, C) = \sigma\!\left(\frac{Q(X)K(C)^{\top}}{\sqrt{d}}\right)V(C)$$

where $X \in \mathbb{R}^{l \times d}$ is the query sequence with length $l$, $C \in \mathbb{R}^{m \times d}$ is the context sequence with length $m$, $\sigma(\cdot)$ is a softmax activation, and $Q, K, V: \mathbb{R}^{d} \to \mathbb{R}^{d}$ are linear transformation functions projecting inputs into the query, key and value spaces, respectively. Since the attention function is ignorant of the position information of the sequence, the Transformer model injects position information by adding a special embedding, called a positional embedding, to the token embeddings at the input of the first layer. Vaswani et al. (2017) proposed the Sinusoidal Positional Embedding, a non-trainable constant embedding computed from trigonometric functions.
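As a point of reference, here is a minimal PyTorch sketch of this attention function. The function and weight names (attn, Wq, Wk, Wv) are our own illustrative choices, and the optional additive bias argument is a hypothetical hook anticipating the position-bias methods discussed below; it is not part of the formula itself.

```python
import math
import torch

def attn(X, C, Wq, Wk, Wv, bias=None):
    """Scaled dot-product attention Attn(X, C).

    X: (l, d) query sequence, C: (m, d) context sequence.
    Wq, Wk, Wv: (d, d) matrices standing in for the projections Q(.), K(.), V(.).
    bias: optional (l, m) additive matrix (hypothetical hook for the
          position-bias methods described below).
    """
    d = X.shape[-1]
    scores = (X @ Wq) @ (C @ Wk).transpose(-1, -2) / math.sqrt(d)  # (l, m)
    if bias is not None:
        scores = scores + bias
    return torch.softmax(scores, dim=-1) @ (C @ Wv)                # (l, d)

# Toy shapes: a length-4 query sequence attending over a length-6 context.
l, m, d = 4, 6, 8
X, C = torch.randn(l, d), torch.randn(m, d)
Wq, Wk, Wv = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)
out = attn(X, C, Wq, Wk, Wv)  # shape (4, 8)
```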
On the other hand, BERT (Devlin et al., 2019) uses trainable positional embeddings instead of constant embeddings. These are adaptable to the training data, but have limitations: they cannot handle inputs longer than those used for training and are not translation-invariant.

Relative position methods have been studied to solve these problems (Shaw et al., 2018; Raffel et al., 2020). They learn parameters representing the relative distance between tokens and utilize them to calculate the attention score. However, this approach is slower than the sinusoidal approach and uses extra memory and parameters (Press et al., 2021).
Press et al. (2021) pointed out that previous methods perform poorly at extrapolation and proposed ALiBi, a modified attention function for self-attention, as follows:

$$\mathrm{ALiBi}(X) = \sigma\!\left(\frac{Q(X)K(X)^{\top}}{\sqrt{d}} - D\right)V(X)$$

$$D_{i,j} = \begin{cases} m \times (i-j), & \text{for } i \ge j \\ \infty, & \text{for } i < j \end{cases}$$

where $m$ is a head-specific positive real-number hyperparameter and $D \in \mathbb{R}^{l \times l}$ is a distance matrix.
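A minimal sketch of how this distance matrix can be materialized for one head, assuming the reconstruction above (the bias is subtracted from the attention scores, so the ∞ entries double as a causal mask); the function name and tensor layout are illustrative, not taken from the ALiBi implementation.

```python
import torch

def alibi_distance_matrix(l: int, m: float) -> torch.Tensor:
    """Distance matrix D for one ALiBi head with slope m, as defined above:
    D[i, j] = m * (i - j) for i >= j and +inf for i < j.  Subtracting D from
    the attention scores penalizes distant keys on the left and sends keys
    on the right (future positions) to -inf, i.e. a causal mask."""
    i = torch.arange(l).unsqueeze(1)  # (l, 1) query positions
    j = torch.arange(l).unsqueeze(0)  # (1, l) key positions
    D = m * (i - j).float()
    D[i < j] = float("inf")
    return D

# scores = Q @ K.T / sqrt(d) - alibi_distance_matrix(l, m), then softmax.
```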
2.3 Pretraining objectives for Question Answering
To pretrain a language model, an appropriate training objective that fully exploits language understanding should be defined. Masked LM (Devlin et al., 2019), for example, replaces 15% of the input tokens with a mask token or a random token and forces the model to denoise them. After pre-training is complete, the last hidden representations of the model contain the information needed to restore the replaced tokens to the originals, and it is useful to transfer this information to other NLP tasks as well, such as question answering.
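For illustration, a simplified sketch of this corruption step is shown below; the helper names and the exact replacement ratio between the mask token and random tokens are our own assumptions in the spirit of BERT-style masking, not values taken from the papers cited here.

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, random_prob=0.1):
    """Simplified Masked LM corruption: select roughly 15% of the positions
    and replace each with [MASK] or, occasionally, a random vocabulary token;
    the model is trained to recover the originals at those positions."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for pos in range(len(tokens)):
        if random.random() < mask_prob:
            labels[pos] = tokens[pos]                  # reconstruction target
            if random.random() < random_prob:
                corrupted[pos] = random.choice(vocab)  # random-token noise
            else:
                corrupted[pos] = mask_token
    return corrupted, labels

# corrupted, labels = mask_tokens("the cat sat on the mat".split(), vocab=["dog", "ran"])
```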
However, Masked LM (MLM) is suboptimal for extractive QA tasks. Joshi et al. (2020) proposed SpanBERT, which is pretrained with a span-level masking scheme whose span lengths follow a geometric distribution, and it outperformed BERT with MLM on most tasks, especially extractive QA. They showed that a training objective that predicts spans rather than tokens generates better representations, especially for span selection tasks.
Ram et al. (2021) introduced Recurring Span Selection (RSS), a novel pre-training objective that is better aligned with QA tasks. In RSS, each occurrence of a recurring text span, except for one used as the golden answer span, is masked with a special token, [QUESTION], and the model is trained to point to the position of the golden answer span using the representations of each [QUESTION] token. Because this pre-training task is very similar to the real QA task, a model trained with this objective outperforms models with other pre-training objectives in both the few-shot and high-resource settings for QA.
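A rough, simplified sketch of this data-creation step follows; the fixed span length, helper names and selection heuristics are our own simplifications, and the actual RSS procedure of Ram et al. (2021) is more involved.

```python
import random
from collections import defaultdict

def rss_mask(tokens, span_len=2, q_token="[QUESTION]"):
    """Simplified Recurring Span Selection example construction: pick a span
    that occurs more than once, keep one occurrence as the golden answer, and
    collapse every other occurrence into a single [QUESTION] token.  A model
    is then trained to point from each [QUESTION] to the kept occurrence."""
    occurrences = defaultdict(list)
    for i in range(len(tokens) - span_len + 1):
        occurrences[tuple(tokens[i:i + span_len])].append(i)
    recurring = [(span, pos) for span, pos in occurrences.items() if len(pos) > 1]
    if not recurring:
        return list(tokens), None
    span, positions = random.choice(recurring)
    golden = random.choice(positions)             # occurrence left unmasked
    masked = list(tokens)
    for p in sorted(positions, reverse=True):     # right-to-left keeps indices valid
        if abs(p - golden) >= span_len:           # skip the golden (and overlapping) span
            masked[p:p + span_len] = [q_token]
    return masked, list(span)                     # corrupted tokens + golden answer span
```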
2.4 Datasets of Question Answering for Longer Documents
The most widely used English QA dataset is SQuAD (Rajpurkar et al., 2016), but it is insufficient for testing the understanding of long contexts because of its short paragraphs. Thus, for QA over longer documents, other datasets are considered. Typical examples are Natural Questions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017), which provide a whole Wikipedia page as the document. NarrativeQA (Kočiský et al., 2018), whose documents consist of movie scripts and books, is another example. Recently, Pang et al. (2022) introduced QuALITY, a multiple-choice QA dataset whose documents, gathered from sources such as Project Gutenberg and the Open American National Corpus, are around 5,000 tokens long.
For Korean QA datasets, the most standard are KorQuAD 1.0 and KorQuAD 2.0, which are comparable to SQuAD in English. The construction and characteristics of KorQuAD 1.0 (Lim et al., 2019) are nearly identical to those of SQuAD, except that it is in Korean. Therefore, like SQuAD, KorQuAD 1.0 is not suitable for evaluating QA on long documents. To evaluate the understanding of longer documents, KorQuAD 2.0 (Kim et al., 2019) is often used. Since it provides a whole Wikipedia page as a single context without trimming, and the page includes not only text but also HTML components such as tables and lists, structural understanding of long HTML documents is required to do well on it.
3 LittleBird Architecture
In this section, we describe the architecture of the LittleBird model. Basically, the model can be viewed as a composition of several key ideas: sliding window attention from BigBird (Zaheer et al., 2020), linear biases added to attention from ALiBi (Press et al., 2021), and pack and unpack attention from LUNA (Ma et al., 2021).
3.1 Bidirectional ALiBi
Since pre-trained language models (PLMs) generally perform best on data of the same length as that used for pretraining, a new PLM suitable for the target length must be built to perform inference on longer data, which is inefficient. To avoid this, we consider the main idea of ALiBi (Press et al., 2021), which is more efficient than the relative positional encoding used in T5. However, because ALiBi was designed for causal language modeling, not autoencoding language modeling, in ALiBi each query can attend only to keys to the left of itself, not to keys further away or to the right.
Therefore, we devised BiALiBi (Bidirectional ALiBi), an improved version of ALiBi suited to the autoencoding language model. BiALiBi uses the same attention function as ALiBi but differs only in how the distance matrix is calculated:

$$D_{i,j} = \begin{cases} 0, & \text{for } i = j \\ \alpha, & \text{for } i = 0 \text{ or } j = 0 \\ \beta(i-j), & \text{for } i > j \\ \gamma(j-i), & \text{for } i < j \end{cases}$$

where α, β and γ are head-specific slopes like m in ALiBi. α is the value for the [CLS] token, which usually appears at position 0; because this token should be global, it gets the same bias regardless of distance. β and γ control the attention intensity for tokens on the left and right, respectively. Unlike ALiBi, we set α, β and γ as learnable parameters to have more flexibility.

Figure 1: LittleBird Layer

Figure 2: Unpack & Sliding Window Attention of LittleBird
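A minimal sketch, in PyTorch, of how the BiALiBi distance matrix above can be constructed for a single head. In the model, α, β and γ are learnable per-head parameters; plain floats are used here only to illustrate the matrix itself, and the function name is our own.

```python
import torch

def bialibi_distance_matrix(l: int, alpha: float, beta: float, gamma: float) -> torch.Tensor:
    """BiALiBi distance matrix for one head, following the definition above:
    0 on the diagonal, a constant alpha on the [CLS] row/column (position 0),
    slope beta for keys to the left and gamma for keys to the right."""
    i = torch.arange(l).unsqueeze(1).float()  # (l, 1) query positions
    j = torch.arange(l).unsqueeze(0).float()  # (1, l) key positions
    D = torch.where(i > j, beta * (i - j), gamma * (j - i))
    D = torch.where((i == 0) | (j == 0), torch.full_like(D, alpha), D)
    return torch.where(i == j, torch.zeros_like(D), D)
```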
3.2 Sliding Window Attention
The attention module of BigBird (Zaheer et al., 2020) consists of three types of attention: global, window and random. Global tokens can attend to all other tokens and can also be attended by all other tokens. Non-global tokens, on the other hand, can only attend to the global tokens, some nearby tokens (window), and a few random tokens (random).

For efficient computation, this attention module is implemented using a blocked sliding window. However, there is still an overhead: random attention requires repeated gather and scatter operations at random positions. Since it is known that full attention can be substituted well without random attention when there are sufficient global tokens (Ainslie et al., 2020), we completely eliminated random attention from our model. We also reduced the number of global tokens and removed global-local attention; they were replaced with pack and unpack attention, as explained in the following subsection.
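As a sketch of the attention pattern that remains after these changes, the following builds a block-level mask in which each block attends to its neighboring blocks, optionally plus the first block (the local-to-global attention described in the next subsection). The function name, window size and boolean-mask representation are illustrative assumptions, not the blocked implementation used in the actual model.

```python
import torch

def sliding_window_block_mask(num_blocks: int, window: int = 1,
                              global_first_block: bool = True) -> torch.Tensor:
    """Block-level attention mask: entry [i, j] is True if query block i may
    attend to key block j.  Each block sees blocks within `window` of itself;
    optionally every block also sees block 0, where the [CLS] token and the
    question usually sit (local-to-global attention)."""
    i = torch.arange(num_blocks).unsqueeze(1)
    j = torch.arange(num_blocks).unsqueeze(0)
    mask = (i - j).abs() <= window     # local sliding window
    if global_first_block:
        mask[:, 0] = True              # every block may attend to the first block
    return mask

# Expand to token level for block size b:
#   token_mask = mask.repeat_interleave(b, dim=0).repeat_interleave(b, dim=1)
```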
3.3 Pack & Unpack Attention
To effectively replace random and global attention, we employ pack and unpack attention (Ma et al., 2021). However, in the original pack and unpack attention, information loss is unavoidable because all sequence information is packed into a small capacity. To improve this, we propose adding sliding window attention to the unpacking step. Figure 1 depicts the entire architecture of the LittleBird layer.
$$\begin{aligned}
C_P &= \mathrm{Attn}(P, X) \\
P' &= \mathrm{LayerNorm}(C_P + P) \\
C_X &= \mathrm{USWAttn}(X, C_P) \\
A &= \mathrm{LayerNorm}(C_X + X) \\
X' &= \mathrm{LayerNorm}(\mathrm{FFN}(A) + A)
\end{aligned}$$

$$\mathrm{USWAttn}(X, C_P) = \sigma\!\left(\frac{Q(X)\,[K(C_P); K(X)]^{\top}}{\sqrt{d}} - [D^P; D]^{\top}\right) \cdot [V(C_P); V(X)]$$

$$D^P = \frac{\beta + \gamma}{2}\, b\, J_{s,l}$$

where $X \in \mathbb{R}^{l \times d}$ is the input sequence with length $l$, $P \in \mathbb{R}^{s \times d}$ is an extra sequence for packing contextual information with length $s$, $[A; B]$ denotes concatenation of matrices $A$ and $B$ along the row axis, $D \in \mathbb{R}^{l \times l}$ is the distance matrix from BiALiBi, $D^P \in \mathbb{R}^{s \times l}$ is a distance matrix for the packing tokens, $b$ is the block size, and $J_{s,l}$ is an all-ones matrix with shape $(s, l)$.
The overall structure is the same as pack and unpack attention (Ma et al., 2021); only one part, USWAttn (Unpack & Sliding Window Attention), is different. In this step, we split $X$ into blocks of size $b$ and perform block-level attention as in Zaheer et al. (2020), which is illustrated in Figure 2. We set only the first block as global and allow local-to-global attention. This is because in most QA tasks the [CLS] token and the question are placed at the front of the input sequence, and we believe it is important to allow the rest of the input sequence to access the information of these tokens directly.
Also, we apply different distance matrices depending on the type of attention: BiALiBi's distance matrix $D$ is applied to the $X$-to-$X$ attention, while the constant matrix $D^P$ is used for the $X$-to-$C_P$ attention.
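To make the data flow concrete, the following is a dense, single-head sketch of one LittleBird layer following the equations above. It replaces the blocked sliding-window attention with full attention plus the additive biases, treats $P$ as a learned initial sequence (in the stacked model, $P'$ from the previous layer would be used instead), and all class, parameter and helper names are our own; it illustrates the computation, not the efficient implementation.

```python
import math
import torch
import torch.nn as nn

class LittleBirdLayerSketch(nn.Module):
    """Dense single-head sketch of the LittleBird layer equations:
    pack attention (P attends over X), then unpack attention in which X
    attends over [C_P; X] with the biases [D_P; D] subtracted, followed by
    a feed-forward block.  Full attention stands in here for the blocked
    sliding-window attention of the real model."""

    def __init__(self, d: int, s: int, block_size: int):
        super().__init__()
        self.d, self.s, self.b = d, s, block_size
        self.P = nn.Parameter(torch.randn(s, d))      # packed sequence (first layer)
        self.q1, self.k1, self.v1 = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.q2, self.k2, self.v2 = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.alpha = nn.Parameter(torch.tensor(1.0))  # BiALiBi slopes (learnable)
        self.beta = nn.Parameter(torch.tensor(1.0))
        self.gamma = nn.Parameter(torch.tensor(1.0))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def bialibi(self, l: int) -> torch.Tensor:
        # BiALiBi distance matrix D (Section 3.1).
        i = torch.arange(l).unsqueeze(1).float()
        j = torch.arange(l).unsqueeze(0).float()
        D = torch.where(i > j, self.beta * (i - j), self.gamma * (j - i))
        D = torch.where((i == 0) | (j == 0), self.alpha, D)
        return torch.where(i == j, torch.zeros_like(D), D)

    def forward(self, X: torch.Tensor):
        l, d = X.shape
        # Pack: C_P = Attn(P, X), P' = LayerNorm(C_P + P)
        pack_scores = self.q1(self.P) @ self.k1(X).T / math.sqrt(d)
        C_P = torch.softmax(pack_scores, dim=-1) @ self.v1(X)
        P_out = self.norm1(C_P + self.P)
        # Unpack: X attends over [C_P; X] with the biases [D_P; D] subtracted.
        K = torch.cat([self.k2(C_P), self.k2(X)], dim=0)    # (s + l, d)
        V = torch.cat([self.v2(C_P), self.v2(X)], dim=0)    # (s + l, d)
        D = self.bialibi(l)                                  # (l, l)
        D_P = (self.beta + self.gamma) / 2 * self.b * torch.ones(self.s, l)
        bias = torch.cat([D_P, D], dim=0).T                  # (l, s + l)
        unpack_scores = self.q2(X) @ K.T / math.sqrt(d) - bias
        C_X = torch.softmax(unpack_scores, dim=-1) @ V
        A = self.norm2(C_X + X)
        X_out = self.norm3(self.ffn(A) + A)                  # X', passed to the next layer
        return X_out, P_out

layer = LittleBirdLayerSketch(d=64, s=8, block_size=32)
X_out, P_out = layer(torch.randn(128, 64))  # shapes (128, 64) and (8, 64)
```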