et al., 2019), for example, replaces 15% of the input tokens with a mask token or a random token and forces the model to denoise them. After pre-training is complete, the last hidden representations of the model contain the information needed to restore the replaced tokens to the originals, and this information transfers usefully to other NLP tasks as well, such as question answering.
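To make the MLM objective concrete, its corruption step can be sketched roughly as follows (a minimal illustration rather than BERT's exact recipe; the token names and probabilities here are placeholders):

```python
import random

def mlm_corrupt(tokens, vocab, mask_token="[MASK]", mask_prob=0.15):
    """Minimal sketch of MLM corruption: select roughly 15% of the positions
    and replace each with [MASK] or a random vocabulary token. Labels keep
    the original token at corrupted positions (None elsewhere)."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok  # the model is trained to restore this token
            corrupted[i] = mask_token if random.random() < 0.8 else random.choice(vocab)
    return corrupted, labels
```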
However, Masked LM (MLM) is suboptimal for the extractive QA task. Joshi et al. (2020) proposed SpanBERT, which is pre-trained with a span-level masking scheme whose span lengths follow a geometric distribution; it outperformed BERT with MLM on most tasks, especially extractive QA. They showed that a training objective that predicts spans rather than individual tokens produces better representations, especially for span selection tasks.
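As an illustration, this kind of span-level masking can be sketched as follows (a simplified version; SpanBERT clips the geometric span lengths and operates on whole words, details omitted here, and the parameter values are placeholders):

```python
import numpy as np

def span_corrupt(tokens, mask_token="[MASK]", mask_budget=0.15, p=0.2, max_len=10):
    """Simplified span masking: sample span lengths from a clipped geometric
    distribution and mask contiguous spans until roughly `mask_budget`
    of the tokens are masked."""
    corrupted, masked = list(tokens), set()
    target = int(len(tokens) * mask_budget)
    while len(masked) < target:
        length = min(np.random.geometric(p), max_len)        # span length ~ Geo(p), clipped
        start = np.random.randint(0, max(1, len(tokens) - length + 1))
        for i in range(start, min(start + length, len(tokens))):
            masked.add(i)
            corrupted[i] = mask_token
    return corrupted, sorted(masked)
```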
Ram et al. (2021) introduced Recurring Span Selection (RSS), a novel pre-training objective that is better aligned with QA tasks. In RSS, each recurring text span, except for one kept as the golden answer span, is masked with a special token, [QUESTION], and the model is trained to point to the position of the golden answer span using the representation of each [QUESTION] token. Because this pre-training task closely resembles the real QA task, a model trained with this objective outperforms models with other pre-training objectives in both the few-shot and high-resource settings for QA.
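A rough sketch of this masking scheme is given below; it is a simplification of the procedure of Ram et al. (2021), treating exact repeated n-grams as recurring spans, and the function name and parameters are illustrative:

```python
import random
from collections import defaultdict

def rss_corrupt(tokens, n=4, question_token="[QUESTION]"):
    """Simplified Recurring Span Selection: pick an n-gram that occurs more
    than once, keep one randomly chosen occurrence as the golden answer span,
    and replace every other occurrence with a single [QUESTION] token.
    Returns the corrupted sequence and the (start, end) indices of the
    golden span within it."""
    occurrences = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        occurrences[tuple(tokens[i:i + n])].append(i)
    recurring = [starts for starts in occurrences.values() if len(starts) > 1]
    if not recurring:
        return list(tokens), None
    starts = random.choice(recurring)
    golden = random.choice(starts)
    corrupted, answer, i = [], None, 0
    while i < len(tokens):
        if i == golden:
            answer = (len(corrupted), len(corrupted) + n - 1)  # golden span is kept
            corrupted.extend(tokens[i:i + n])
            i += n
        elif i in starts:
            corrupted.append(question_token)                   # masked occurrence
            i += n
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted, answer
```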
2.4 Datasets of Question Answering for
Longer Documents
The most widely used English QA dataset is SQuAD (Rajpurkar et al., 2016), but it is insufficient for testing the understanding of long contexts because of its short paragraphs. Thus, for QA over longer documents, other datasets are considered. Typical examples are Natural Questions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017), which provide a whole Wikipedia page as the document. NarrativeQA (Kočiský et al., 2018), whose documents consist of movie scripts and books, is another example. Recently, Pang et al. (2022) introduced QuALITY, a multiple-choice QA dataset whose documents of around 5,000 tokens are gathered from various sources such as Project Gutenberg and the Open American National Corpus.
For Korean QA, the most standard datasets are KorQuAD 1.0 and KorQuAD 2.0, which are comparable to SQuAD in English. The construction and characteristics of KorQuAD 1.0 (Lim et al., 2019) are nearly identical to those of SQuAD, except that it is in Korean. Therefore, like SQuAD, KorQuAD 1.0 is not suitable for evaluating QA on long documents. To evaluate the understanding of longer documents, KorQuAD 2.0 (Kim et al., 2019) is often used. Since it provides a whole Wikipedia page as a single context without trimming, and the page includes not only text but also HTML components such as tables and lists, structural understanding of long HTML documents is required to perform well on it.
3 LittleBird Architecture
In this section, we describe the architecture of the LittleBird model. The model can be viewed as a composition of several key ideas, including the sliding window attention from BigBird (Zaheer et al., 2020), the linear attention bias from ALiBi (Press et al., 2021), and the pack and unpack attention from LUNA (Ma et al., 2021).
3.1 Bidirectional ALiBi
Since pre-trained language models (PLMs) generally perform best on inputs of the same length as the data used for pre-training, a new PLM suited to the target length must be built to perform inference on longer data, which is inefficient. To avoid this, we adopt the main idea of ALiBi (Press et al., 2021), which is more efficient than the relative positional encoding used in T5. However, because ALiBi was designed for causal language modeling rather than autoencoding language modeling, each query in ALiBi can attend only to keys to its left, with increasingly distant keys penalized more heavily, and cannot attend to keys to its right.
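For reference, the causal ALiBi bias for a single head can be sketched as follows (a minimal illustration; the slope value is a placeholder):

```python
import torch

def alibi_bias(seq_len, m=0.5):
    """Causal ALiBi bias for one head: a query at position i receives a
    penalty of -m * (i - j) for a key at position j <= i, while keys to the
    right (j > i) are masked out entirely and can never be attended."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    bias = -m * (i - j).clamp(min=0).float()
    bias.masked_fill_(j > i, float("-inf"))  # causal restriction
    return bias                              # added to attention scores before softmax
```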
Therefore, we devised BiALiBi (Bidirectional ALiBi), an improved version of ALiBi suited to autoencoding language models. BiALiBi uses the same attention function as ALiBi but differs only in how the distance matrix is calculated:
$$
D_{i,j} =
\begin{cases}
0, & \text{for } i = j \\
\alpha, & \text{for } i = 0 \text{ or } j = 0 \\
\beta(i - j), & \text{for } i > j \\
\gamma(j - i), & \text{for } i < j
\end{cases}
$$
where $\alpha$, $\beta$, and $\gamma$ are head-specific slopes like $m$ in ALiBi. $\alpha$ is a value for the [CLS] token, which usually appears at position 0. Because this token should be global, it has the same bias regardless of distance.
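A minimal sketch of this distance matrix for a single head is shown below; the slope values are placeholders, and in practice $\alpha$, $\beta$, and $\gamma$ would be chosen per head, analogously to $m$ in ALiBi:

```python
import torch

def bialibi_distance(seq_len, alpha=1.0, beta=0.5, gamma=0.25):
    """BiALiBi distance matrix D for one head, following the definition above:
    0 on the diagonal, alpha on the row and column of the global [CLS] token
    at position 0, beta * (i - j) for keys to the left of the query, and
    gamma * (j - i) for keys to the right."""
    i = torch.arange(seq_len).unsqueeze(1).float()  # query positions
    j = torch.arange(seq_len).unsqueeze(0).float()  # key positions
    D = torch.where(i > j, beta * (i - j), gamma * (j - i))
    D[0, :] = alpha          # global [CLS] row: constant bias regardless of distance
    D[:, 0] = alpha          # global [CLS] column
    D.fill_diagonal_(0.0)    # i == j takes precedence, including position (0, 0)
    return D                 # plays the role of the distance term in the ALiBi-style bias
```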
$\beta$ and $\gamma$ are involved in the attention