original model are kept fixed, and for each new
task only the adapters are fine-tuned. This only
adds a small number of parameters to the overall
architecture and allows for a much faster and more
efficient fine-tuning on different downstream tasks.
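As a minimal sketch of this idea (in PyTorch), the snippet below inserts a bottleneck adapter with a residual connection and trains only its parameters; the module layout, the GELU activation, the 64-dimensional bottleneck and the helper name mark_only_adapters_trainable are illustrative assumptions rather than the exact configuration used in this work.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual add."""

    def __init__(self, hidden_size: int, bottleneck_size: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.act = nn.GELU()          # activation choice is illustrative
        self.up = nn.Linear(bottleneck_size, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the frozen backbone's representation intact.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

def mark_only_adapters_trainable(model: nn.Module) -> None:
    """Freeze every parameter, then unfreeze only the adapter modules."""
    for p in model.parameters():
        p.requires_grad = False
    for module in model.modules():
        if isinstance(module, BottleneckAdapter):
            for p in module.parameters():
                p.requires_grad = True
```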
Another approach to improving the efficiency of LM-based transformers is shared parameterisation,
which was popularised by ALBERT (Lan et al.,
2019). While the original formulation of trans-
formers (Vaswani et al., 2017) employs full parameterisation, wherein each parameter belongs to a single module and is used only once in the forward pass, shared parameterisation allows
different modules of the network to share parame-
ters, resulting in a more efficient use of resources
given the same parameterisation budget. However,
a common downside of this approach is slower in-
ference time and reduced performance. Ge and Wei (2022) propose two different parameterisation methods to address the compute and memory challenges of transformer models, and explore layer-wise adaptation in an encoder-decoder architecture. These methods exploit cross-layer parameter sharing in a way that allows the model to be used on mobile devices with strict memory constraints while achieving state-of-the-art results on two seq2seq tasks for English.
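A minimal sketch of such cross-layer parameter sharing is given below: a single transformer block is applied repeatedly, so the parameter count is that of one block regardless of depth. The hidden size, head count and iteration count are illustrative assumptions, not the settings of Ge and Wei (2022) or of this work.

```python
import torch
import torch.nn as nn

class RecursiveEncoder(nn.Module):
    """One transformer block whose weights are reused at every layer of the stack."""

    def __init__(self, hidden_size: int = 256, num_heads: int = 4,
                 ffn_size: int = 1024, num_iterations: int = 12):
        super().__init__()
        # A single set of block parameters, shared across all iterations.
        self.shared_block = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads,
            dim_feedforward=ffn_size, batch_first=True)
        self.num_iterations = num_iterations

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.num_iterations):
            x = self.shared_block(x)  # same parameters applied at each step
        return x

# Parameter count equals that of one block, not num_iterations blocks.
model = RecursiveEncoder()
out = model(torch.randn(2, 16, 256))  # (batch, sequence, hidden)
```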
In this work, we exploit some of the above ap-
proaches to create a number of compact and ef-
ficient encoder-only models distilled from much
larger language models. The contributions of this
work are as follows:
• To the best of our knowledge, we are the first to compress fully parameterised large language models using recursive transformers (i.e. ALBERT-like models that employ full parameter sharing).
• We demonstrate the effectiveness of our pre-trained bottleneck adapters by merely fine-tuning them on downstream tasks while still achieving competitive results.
• We present several light-weight transformers with parameters ranging from 12M for the smallest to 32M for the largest. These models are shown to perform at the same level as their fully parameterised versions.
• Finally, we evaluate our models on a wide range of general and biomedical NLP tasks and datasets.
2 Background
2.1 LM-based Transformers and
Computational Complexity
Ever since the introduction of the transformer ar-
chitecture (Vaswani et al., 2017), large LM-based transformers such as BERT (Devlin et al., 2019) have become increasingly popular in NLP
and lie at the heart of most state-of-the-art models.
A transformer is primarily composed of a number
of transformer blocks stacked on top of one another.
BERT-Base, for instance, consists of 12 of these blocks. The most important component in a block
is the multi-head self-attention module. To be use-
ful for language tasks, transformers are pre-trained
using a number of self-supervised auxiliary tasks
(Xia et al., 2020); these usually include some varia-
tion of Language Modelling (LM) and an optional
sentence-level prediction task. Examples of the for-
mer include Masked Language Modelling (MLM)
and Causal Language Modelling (CLM). For the
latter, BERT uses Next Sentence Prediction (NSP)
and ALBERT (Lan et al., 2019) employs Sentence
Order Prediction (SOP).
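To make the MLM objective concrete, the sketch below implements the standard BERT-style masking scheme, in which roughly 15% of positions are selected and, of those, 80% are replaced by [MASK], 10% by a random token and 10% are left unchanged; the ratios and the function name are assumptions for illustration rather than the exact recipe of any particular model discussed here.

```python
import torch

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, vocab_size: int,
                mlm_probability: float = 0.15):
    """BERT-style masking: of the selected positions, 80% -> [MASK],
    10% -> random token, 10% left unchanged."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Select ~15% of positions as prediction targets.
    masked = torch.bernoulli(torch.full(labels.shape, mlm_probability)).bool()
    labels[~masked] = -100  # ignore index: loss is computed only on masked positions

    # 80% of the selected positions are replaced with [MASK].
    replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
    input_ids[replaced] = mask_token_id

    # Half of the remainder (10% overall) becomes a random token.
    randomised = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~replaced
    input_ids[randomised] = torch.randint(vocab_size, labels.shape)[randomised]

    # The final 10% keeps the original token.
    return input_ids, labels
```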
The standard approach to utilise these pre-
trained models is to fine-tune them on a target
task. Given a sequence length of N, the computational and time complexity of self-attention is O(N²) (Keles et al., 2022). In recent years, different approaches have appeared in the literature
to address this bottleneck by modifying the self-
attention operation in order to improve the general
efficiency of transformers (with different perfor-
mance trade-offs). Tay et al. (2020) survey the
most common approaches to develop what is re-
ferred to as ‘efficient transformers’.
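As a naive illustration of where the quadratic cost arises, the sketch below materialises the full N x N attention score matrix for a single head; the tensor shapes are arbitrary and no efficiency modification of the kind surveyed by Tay et al. (2020) is applied.

```python
import math
import torch

def scaled_dot_product_attention(q: torch.Tensor, k: torch.Tensor,
                                 v: torch.Tensor) -> torch.Tensor:
    """Naive single-head attention; the N x N score matrix is the quadratic bottleneck."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # shape (..., N, N)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

# Doubling N quadruples the score matrix: 512 tokens already yield 512 x 512 scores.
q = k = v = torch.randn(1, 512, 64)
out = scaled_dot_product_attention(q, k, v)
```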
The sheer number of parameters in LM-based transformers is another significant issue that restricts their use. With new releases like GPT-3
and MT-NLG (Smith et al., 2022) that feature
hundreds of billions of parameters, these models
have become increasingly overparameterised due
to the large number of layers and embedding sizes
(Rogers et al., 2020).
2.2 Model Distillation
The overparameterisation issue has motivated re-
search into methods for compressing large models into smaller and faster versions that perform reasonably close to their larger counterparts.
Knowledge distillation (Hinton et al., 2015) is a prominent method intended to distill a