LittleBird: Efficient Faster & Longer Transformer for Question Answering
Minchul Lee and Kijong Han and Myeong Cheol Shin
Kakao Enterprise Corp., South Korea
{phil.i,mat.h,index.sh}@kakaoenterprise.com
Abstract
BERT has shown a lot of success in a wide variety of NLP tasks, but its attention mechanism limits its ability to deal with long inputs. Longformer, ETC and BigBird addressed this issue and effectively solved the quadratic dependency problem. However, we find that these models are not sufficient, and propose LittleBird, a novel model based on BigBird with improved speed and memory footprint while maintaining accuracy. In particular, we devise a more flexible and efficient position representation method based on Attention with Linear Biases (ALiBi). We also show that replacing the method of representing global information in BigBird with pack and unpack attention is more effective. The proposed model can work on long inputs even after being pre-trained on short inputs, and can be trained efficiently by reusing an existing pre-trained language model for short inputs. This is a significant benefit for low-resource languages, where large amounts of long text data are difficult to obtain. As a result, our experiments show that LittleBird works very well in a variety of languages, achieving high performance in question answering tasks, particularly on KorQuAD 2.0, a Korean question answering dataset for long paragraphs.
1 Introduction
Transformer (Vaswani et al., 2017) and the pre-trained language models based on it (Devlin et al., 2019; Liu et al., 2019) have shown a lot of success in a wide variety of NLP tasks. However, the quadratic dependency problem that comes from the attention mechanism makes it impractical to process long documents. Many techniques have been studied to overcome this problem, and BigBird (Zaheer et al., 2020) showed robust and state-of-the-art performance on various NLP downstream tasks.
In this study, we propose a new model, LittleBird, by analyzing and improving the shortcomings of BigBird. LittleBird shows improved speed and memory footprint compared to BigBird while maintaining the overall accuracy on question answering (QA) benchmarks and showing better accuracy on some of them.
In this study, we propose three major improvements compared to BigBird. The first is the method for position representation. In BigBird, trainable positional embedding is used, similar to BERT (Devlin et al., 2019), and in ETC, relative positional encoding is used, similar to T5 (Raffel et al., 2020). However, trainable positional embedding cannot handle inputs longer than those used for training, and relative position encoding is relatively slow and uses extra memory and parameters (Ma et al., 2021). Press et al. (2021) introduced the attention with linear biases (ALiBi) method, which resolves these problems, but it was designed for causal language modeling, not autoencoding language modeling, which is typically useful for QA tasks. Thus, we devise a new method based on ALiBi that is fast, flexible, and also effective for QA tasks.

The second is the method of capturing global information. BigBird introduces two ways of capturing global information: random sparse attention and global tokens (Ainslie et al., 2020), which attend to and are attended by all other tokens. However, the random attention method is slow in practice relative to its time complexity because it requires repeated gather and scatter operations at random positions. In addition, a relatively large number (hundreds) of global tokens are required to achieve the reported performance when using only global tokens without random attention, as in ETC. We show that replacing them with a modified pack and unpack attention (Ma et al., 2021) is more effective.
The last is an efficient way to train a model for long sequences. We introduce a simple but effective method, Padding Insertion, which makes the model robust to long inputs while training on short inputs. We also propose a distillation method that
can maximize the reuse of an existing pre-trained model for short inputs, and show that our model can be effectively pre-trained using these methods.
Our model shows a 12-29% reduction in peak memory usage and a 6-46% reduction in latency compared to the various BigBird and ETC model settings reported in their papers for 4K-length document inference, while showing better accuracy on the dev sets of several English QA benchmarks (Kwiatkowski et al., 2019; Welbl et al., 2018). Our model achieves new state-of-the-art performance on KorQuAD 2.0 (Kim et al., 2019), a Korean long-document QA benchmark. In addition, these results are obtained with LittleBird pre-trained with only a 2K sequence length. This shows that our novel positional representation method works well when the model is applied to a QA downstream task with documents longer than the sequences used in the pre-training phase.
2 Background and Related Work
2.1 Transformers for Longer Input
Various methods have been studied to maintain reasonable performance without using the quadratic attention operation of the Transformer to handle long documents. Child et al. (2019) introduced sparse factorizations of the attention matrix, which reduce the complexity to $O(n\sqrt{n})$. Reformer (Kitaev et al., 2019) reduced the complexity to $O(n \log n)$ using locality-sensitive hashing.

Longformer (Beltagy et al., 2020) and ETC (Ainslie et al., 2020) proposed a method that utilizes several global attention tokens and local windowed attention, reducing the complexity to $O(n)$. In addition, these works showed performance improvements on downstream tasks. BigBird (Zaheer et al., 2020), an extended study related to ETC, proposes a random sparse attention method and provides detailed theoretical background and experimental results for more downstream tasks.
Recently, LUNA (Ma et al., 2021) introduced a method that approximates softmax attention with two nested linear attention functions, called Pack and Unpack Attention, which has only linear time complexity. This method improved both speed and score on the Long Range Arena (LRA) benchmark (Tay et al., 2020).
2.2 Positional Encoding of Transformers
The attention mechanism of Transformers is defined as:

$$\mathrm{Attn}(X, C) = \sigma\!\left(\frac{Q(X)K(C)^{\top}}{\sqrt{d}}\right)V(C)$$

where $X \in \mathbb{R}^{l \times d}$ is the query sequence with length $l$, $C \in \mathbb{R}^{m \times d}$ is the context sequence with length $m$, $\sigma(\cdot)$ is a softmax activation, and $Q, K, V: \mathbb{R}^{d} \to \mathbb{R}^{d}$ are linear transformation functions projecting inputs into the query, key and value spaces, respectively. Since the attention function is ignorant of the position information of the sequence, the Transformer model injects position information by adding a special embedding, called a positional embedding, to the token embeddings at the input of the first layer. Vaswani et al. (2017) proposed the Sinusoidal Positional Embedding, a non-trainable constant embedding computed from trigonometric functions.
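As a point of reference, here is a minimal PyTorch sketch of this attention function. The function and weight names (attn, Wq, Wk, Wv) are our own illustrative choices, and the optional additive bias argument is a hypothetical hook anticipating the position-bias methods discussed below; it is not part of the formula itself.

```python
import math
import torch

def attn(X, C, Wq, Wk, Wv, bias=None):
    """Scaled dot-product attention Attn(X, C).

    X: (l, d) query sequence, C: (m, d) context sequence.
    Wq, Wk, Wv: (d, d) matrices standing in for the projections Q(.), K(.), V(.).
    bias: optional (l, m) additive matrix (hypothetical hook for the
          position-bias methods described below).
    """
    d = X.shape[-1]
    scores = (X @ Wq) @ (C @ Wk).transpose(-1, -2) / math.sqrt(d)  # (l, m)
    if bias is not None:
        scores = scores + bias
    return torch.softmax(scores, dim=-1) @ (C @ Wv)                # (l, d)

# Toy shapes: a length-4 query sequence attending over a length-6 context.
l, m, d = 4, 6, 8
X, C = torch.randn(l, d), torch.randn(m, d)
Wq, Wk, Wv = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)
out = attn(X, C, Wq, Wk, Wv)  # shape (4, 8)
```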
On the other hand, BERT (Devlin et al., 2019) uses trainable positional embeddings instead of constant embeddings. These are adaptable to the training data, but have limitations: they cannot handle inputs longer than those used for training and are not translation-invariant.

Relative position methods have been studied to solve these problems (Shaw et al., 2018; Raffel et al., 2020). They learn parameters representing the relative distance between tokens and utilize them to calculate the attention score. However, this approach is slower than the sinusoidal approach and uses extra memory and parameters (Press et al., 2021).
Press et al. (2021) pointed out that previous methods perform poorly at extrapolation and proposed ALiBi, a modified attention function for self-attention, as follows:

$$\mathrm{ALiBi}(X) = \sigma\!\left(\frac{Q(X)K(X)^{\top}}{\sqrt{d}} - D\right)V(X)$$

$$D_{i,j} = \begin{cases} m \times (i-j), & \text{for } i \ge j \\ \infty, & \text{for } i < j \end{cases}$$

where $m$ is a head-specific positive real-number hyperparameter and $D \in \mathbb{R}^{l \times l}$ is a distance matrix.
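A minimal sketch of how this distance matrix can be materialized for one head, assuming the reconstruction above (the bias is subtracted from the attention scores, so the ∞ entries double as a causal mask); the function name and tensor layout are illustrative, not taken from the ALiBi implementation.

```python
import torch

def alibi_distance_matrix(l: int, m: float) -> torch.Tensor:
    """Distance matrix D for one ALiBi head with slope m, as defined above:
    D[i, j] = m * (i - j) for i >= j and +inf for i < j.  Subtracting D from
    the attention scores penalizes distant keys on the left and sends keys
    on the right (future positions) to -inf, i.e. a causal mask."""
    i = torch.arange(l).unsqueeze(1)  # (l, 1) query positions
    j = torch.arange(l).unsqueeze(0)  # (1, l) key positions
    D = m * (i - j).float()
    D[i < j] = float("inf")
    return D

# scores = Q @ K.T / sqrt(d) - alibi_distance_matrix(l, m), then softmax.
```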
2.3 Pretraining objectives for Question Answering
To pretrain a language model, an appropriate training objective that fully exploits language understanding should be defined. Masked LM (Devlin et al., 2019), for example, replaces 15% of the input tokens with a mask token or a random token and forces the model to denoise them. After pre-training is complete, the last hidden representations of the model contain the information needed to restore the replaced tokens to the originals, and it is useful to transfer this information to other NLP tasks as well, such as question answering.
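For illustration, a simplified sketch of this corruption step is shown below; the helper names and the exact replacement ratio between the mask token and random tokens are our own assumptions in the spirit of BERT-style masking, not values taken from the papers cited here.

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, random_prob=0.1):
    """Simplified Masked LM corruption: select roughly 15% of the positions
    and replace each with [MASK] or, occasionally, a random vocabulary token;
    the model is trained to recover the originals at those positions."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for pos in range(len(tokens)):
        if random.random() < mask_prob:
            labels[pos] = tokens[pos]                  # reconstruction target
            if random.random() < random_prob:
                corrupted[pos] = random.choice(vocab)  # random-token noise
            else:
                corrupted[pos] = mask_token
    return corrupted, labels

# corrupted, labels = mask_tokens("the cat sat on the mat".split(), vocab=["dog", "ran"])
```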
However, Masked LM (MLM) is suboptimal for extractive QA tasks. Joshi et al. (2020) proposed SpanBERT, which is pretrained with a span-level masking scheme whose span lengths follow a geometric distribution, and it outperformed BERT with MLM on most tasks, especially extractive QA. They showed that a training objective that predicts spans rather than tokens generates better representations, especially for span selection tasks.
Ram et al. (2021) introduced Recurring Span Selection (RSS), a novel pre-training objective that is better aligned with QA tasks. In RSS, each occurrence of a recurring text span, except for one used as the golden answer span, is masked with a special token, [QUESTION], and the model is trained to point to the position of the golden answer span using the representations of each [QUESTION] token. Because this pre-training task is very similar to the real QA task, a model trained with this objective outperforms models with other pre-training objectives in both the few-shot and high-resource settings for QA.
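A rough, simplified sketch of this data-creation step follows; the fixed span length, helper names and selection heuristics are our own simplifications, and the actual RSS procedure of Ram et al. (2021) is more involved.

```python
import random
from collections import defaultdict

def rss_mask(tokens, span_len=2, q_token="[QUESTION]"):
    """Simplified Recurring Span Selection example construction: pick a span
    that occurs more than once, keep one occurrence as the golden answer, and
    collapse every other occurrence into a single [QUESTION] token.  A model
    is then trained to point from each [QUESTION] to the kept occurrence."""
    occurrences = defaultdict(list)
    for i in range(len(tokens) - span_len + 1):
        occurrences[tuple(tokens[i:i + span_len])].append(i)
    recurring = [(span, pos) for span, pos in occurrences.items() if len(pos) > 1]
    if not recurring:
        return list(tokens), None
    span, positions = random.choice(recurring)
    golden = random.choice(positions)             # occurrence left unmasked
    masked = list(tokens)
    for p in sorted(positions, reverse=True):     # right-to-left keeps indices valid
        if abs(p - golden) >= span_len:           # skip the golden (and overlapping) span
            masked[p:p + span_len] = [q_token]
    return masked, list(span)                     # corrupted tokens + golden answer span
```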
2.4 Datasets of Question Answering for Longer Documents
The most widely used English QA dataset is SQuAD (Rajpurkar et al., 2016), but it is insufficient for testing the understanding of long contexts because of its short paragraphs. Thus, for QA over longer documents, other datasets are considered. Typical examples are Natural Questions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017), which provide a whole Wikipedia page as the document. NarrativeQA (Kočiský et al., 2018), whose documents consist of movie scripts and books, is another example. Recently, Pang et al. (2022) introduced QuALITY, a multiple-choice QA dataset whose documents, gathered from sources such as Project Gutenberg and the Open American National Corpus, are around 5,000 tokens long.
For Korean QA datasets, the most standard are KorQuAD 1.0 and KorQuAD 2.0, which are comparable to SQuAD in English. The construction and characteristics of KorQuAD 1.0 (Lim et al., 2019) are nearly identical to those of SQuAD, except that it is in Korean. Therefore, like SQuAD, KorQuAD 1.0 is not suitable for evaluating QA on long documents. To evaluate the understanding of longer documents, KorQuAD 2.0 (Kim et al., 2019) is often used. Since it provides a whole Wikipedia page as a single context without trimming, and the page includes not only text but also HTML components such as tables and lists, structural understanding of long HTML documents is required to do well on it.
3 LittleBird Architecture
In this section, we describe the architecture of the LittleBird model. Basically, the model can be viewed as a composition of several key ideas: sliding window attention from BigBird (Zaheer et al., 2020), linear biases added to attention from ALiBi (Press et al., 2021), and pack and unpack attention from LUNA (Ma et al., 2021).
3.1 Bidirectional ALiBi
Since pre-trained language models (PLMs) generally perform best on data of the same length as that used for pretraining, a new PLM suitable for the target length must be built to perform inference on longer data, which is inefficient. To avoid this, we consider the main idea of ALiBi (Press et al., 2021), which is more efficient than the relative positional encoding used in T5. However, because ALiBi was designed for causal language modeling, not autoencoding language modeling, in ALiBi each query can attend only to keys to the left of itself, not to keys further away or to the right.
Therefore, we devised BiALiBi (Bidirectional ALiBi), an improved version of ALiBi suited to the autoencoding language model. BiALiBi uses the same attention function as ALiBi but differs only in how the distance matrix is calculated:

$$D_{i,j} = \begin{cases} 0, & \text{for } i = j \\ \alpha, & \text{for } i = 0 \text{ or } j = 0 \\ \beta(i-j), & \text{for } i > j \\ \gamma(j-i), & \text{for } i < j \end{cases}$$

where α, β and γ are head-specific slopes like m in ALiBi. α is the value for the [CLS] token, which usually appears at position 0; because this token should be global, it gets the same bias regardless of distance. β and γ control the attention intensity for tokens on the left and right, respectively. Unlike ALiBi, we set α, β and γ as learnable parameters to have more flexibility.

Figure 1: LittleBird Layer

Figure 2: Unpack & Sliding Window Attention of LittleBird
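A minimal sketch, in PyTorch, of how the BiALiBi distance matrix above can be constructed for a single head. In the model, α, β and γ are learnable per-head parameters; plain floats are used here only to illustrate the matrix itself, and the function name is our own.

```python
import torch

def bialibi_distance_matrix(l: int, alpha: float, beta: float, gamma: float) -> torch.Tensor:
    """BiALiBi distance matrix for one head, following the definition above:
    0 on the diagonal, a constant alpha on the [CLS] row/column (position 0),
    slope beta for keys to the left and gamma for keys to the right."""
    i = torch.arange(l).unsqueeze(1).float()  # (l, 1) query positions
    j = torch.arange(l).unsqueeze(0).float()  # (1, l) key positions
    D = torch.where(i > j, beta * (i - j), gamma * (j - i))
    D = torch.where((i == 0) | (j == 0), torch.full_like(D, alpha), D)
    return torch.where(i == j, torch.zeros_like(D), D)
```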
3.2 Sliding Window Attention
The attention module of BigBird (Zaheer et al., 2020) consists of three types of attention: global, window and random. Global tokens can attend to all other tokens and can also be attended by all other tokens. Non-global tokens, on the other hand, can only attend to the global tokens, some nearby tokens (window), and a few random tokens (random).

For efficient computation, this attention module is implemented using a blocked sliding window. However, there is still an overhead: random attention requires repeated gather and scatter operations at random positions. Since it is known that full attention can be substituted well without random attention when there are sufficient global tokens (Ainslie et al., 2020), we completely eliminated random attention from our model. We also reduced the number of global tokens and removed global-local attention; they were replaced with pack and unpack attention, as explained in the following subsection.
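As a sketch of the attention pattern that remains after these changes, the following builds a block-level mask in which each block attends to its neighboring blocks, optionally plus the first block (the local-to-global attention described in the next subsection). The function name, window size and boolean-mask representation are illustrative assumptions, not the blocked implementation used in the actual model.

```python
import torch

def sliding_window_block_mask(num_blocks: int, window: int = 1,
                              global_first_block: bool = True) -> torch.Tensor:
    """Block-level attention mask: entry [i, j] is True if query block i may
    attend to key block j.  Each block sees blocks within `window` of itself;
    optionally every block also sees block 0, where the [CLS] token and the
    question usually sit (local-to-global attention)."""
    i = torch.arange(num_blocks).unsqueeze(1)
    j = torch.arange(num_blocks).unsqueeze(0)
    mask = (i - j).abs() <= window     # local sliding window
    if global_first_block:
        mask[:, 0] = True              # every block may attend to the first block
    return mask

# Expand to token level for block size b:
#   token_mask = mask.repeat_interleave(b, dim=0).repeat_interleave(b, dim=1)
```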
3.3 Pack & Unpack Attention
To effectively replace random and global attention, we employ pack and unpack attention (Ma et al., 2021). However, in the original pack and unpack attention, information loss is unavoidable because all sequence information is packed into a small capacity. To improve this, we propose adding sliding window attention to the unpacking step. Figure 1 depicts the entire architecture of the LittleBird layer.
$$\begin{aligned}
C_P &= \mathrm{Attn}(P, X) \\
P' &= \mathrm{LayerNorm}(C_P + P) \\
C_X &= \mathrm{USWAttn}(X, C_P) \\
A &= \mathrm{LayerNorm}(C_X + X) \\
X' &= \mathrm{LayerNorm}(\mathrm{FFN}(A) + A)
\end{aligned}$$

$$\mathrm{USWAttn}(X, C_P) = \sigma\!\left(\frac{Q(X)\,[K(C_P); K(X)]^{\top}}{\sqrt{d}} - [D^P; D]^{\top}\right) \cdot [V(C_P); V(X)]$$

$$D^P = \frac{\beta + \gamma}{2}\, b\, J_{s,l}$$

where $X \in \mathbb{R}^{l \times d}$ is the input sequence with length $l$, $P \in \mathbb{R}^{s \times d}$ is an extra sequence for packing contextual information with length $s$, $[A; B]$ denotes concatenation of matrices $A$ and $B$ along the row axis, $D \in \mathbb{R}^{l \times l}$ is the distance matrix from BiALiBi, $D^P \in \mathbb{R}^{s \times l}$ is a distance matrix for the packing tokens, $b$ is the block size, and $J_{s,l}$ is an all-ones matrix with shape $(s, l)$.
The overall structure is the same as pack and unpack attention (Ma et al., 2021); only one part, USWAttn (Unpack & Sliding Window Attention), is different. In this step, we split $X$ into blocks of size $b$ and perform block-level attention as in Zaheer et al. (2020), which is illustrated in Figure 2. We set only the first block as global and allow local-to-global attention. This is because in most QA tasks the [CLS] token and the question are placed at the front of the input sequence, and we believe it is important to allow the rest of the input sequence to access the information of these tokens directly.
Also, we apply different distance matrices depending on the type of attention: BiALiBi's distance matrix $D$ is applied to the $X$-to-$X$ attention, while the constant matrix $D^P$ is used for the $X$-to-$C_P$ attention.
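To make the data flow concrete, the following is a dense, single-head sketch of one LittleBird layer following the equations above. It replaces the blocked sliding-window attention with full attention plus the additive biases, treats $P$ as a learned initial sequence (in the stacked model, $P'$ from the previous layer would be used instead), and all class, parameter and helper names are our own; it illustrates the computation, not the efficient implementation.

```python
import math
import torch
import torch.nn as nn

class LittleBirdLayerSketch(nn.Module):
    """Dense single-head sketch of the LittleBird layer equations:
    pack attention (P attends over X), then unpack attention in which X
    attends over [C_P; X] with the biases [D_P; D] subtracted, followed by
    a feed-forward block.  Full attention stands in here for the blocked
    sliding-window attention of the real model."""

    def __init__(self, d: int, s: int, block_size: int):
        super().__init__()
        self.d, self.s, self.b = d, s, block_size
        self.P = nn.Parameter(torch.randn(s, d))      # packed sequence (first layer)
        self.q1, self.k1, self.v1 = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.q2, self.k2, self.v2 = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.alpha = nn.Parameter(torch.tensor(1.0))  # BiALiBi slopes (learnable)
        self.beta = nn.Parameter(torch.tensor(1.0))
        self.gamma = nn.Parameter(torch.tensor(1.0))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def bialibi(self, l: int) -> torch.Tensor:
        # BiALiBi distance matrix D (Section 3.1).
        i = torch.arange(l).unsqueeze(1).float()
        j = torch.arange(l).unsqueeze(0).float()
        D = torch.where(i > j, self.beta * (i - j), self.gamma * (j - i))
        D = torch.where((i == 0) | (j == 0), self.alpha, D)
        return torch.where(i == j, torch.zeros_like(D), D)

    def forward(self, X: torch.Tensor):
        l, d = X.shape
        # Pack: C_P = Attn(P, X), P' = LayerNorm(C_P + P)
        pack_scores = self.q1(self.P) @ self.k1(X).T / math.sqrt(d)
        C_P = torch.softmax(pack_scores, dim=-1) @ self.v1(X)
        P_out = self.norm1(C_P + self.P)
        # Unpack: X attends over [C_P; X] with the biases [D_P; D] subtracted.
        K = torch.cat([self.k2(C_P), self.k2(X)], dim=0)    # (s + l, d)
        V = torch.cat([self.v2(C_P), self.v2(X)], dim=0)    # (s + l, d)
        D = self.bialibi(l)                                  # (l, l)
        D_P = (self.beta + self.gamma) / 2 * self.b * torch.ones(self.s, l)
        bias = torch.cat([D_P, D], dim=0).T                  # (l, s + l)
        unpack_scores = self.q2(X) @ K.T / math.sqrt(d) - bias
        C_X = torch.softmax(unpack_scores, dim=-1) @ V
        A = self.norm2(C_X + X)
        X_out = self.norm3(self.ffn(A) + A)                  # X', passed to the next layer
        return X_out, P_out

layer = LittleBirdLayerSketch(d=64, s=8, block_size=32)
X_out, P_out = layer(torch.randn(128, 64))  # shapes (128, 64) and (8, 64)
```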