MiniALBERT: Model Distillation via Parameter-Efficient Recursive
Transformers
Mohammadmahdi Nouriborji2†, Omid Rohanian1,2†, Samaneh Kouchaki3,
David A. Clifton1,4
1Department of Engineering Science, University of Oxford, Oxford, UK
2NLPie Research, Oxford, UK
3Dept. Electrical and Electronic Engineering, University of Surrey, Guildford, UK
4Oxford-Suzhou Centre for Advanced Research, Suzhou, China
{m.nouriborji,omid}@nlpie.com
samaneh.kouchaki@surrey.ac.uk
{omid.rohanian,david.clifton}@eng.ox.ac.uk
Abstract
Pre-trained Language Models (LMs) have
become an integral part of Natural Language
Processing (NLP) in recent years, due to
their superior performance in downstream
applications. In spite of this resounding
success, the usability of LMs is constrained
by computational and time complexity, along
with their increasing size; an issue that has
been referred to as ‘overparameterisation’.
Different strategies have been proposed in the literature to alleviate these problems, with the aim of creating effective compact models that match the performance of their bloated counterparts with only negligible losses. One of the most popular techniques
in this area of research is model distillation.
Another potent but underutilised technique is
cross-layer parameter sharing. In this work,
we combine these two strategies and present
MiniALBERT, a technique for converting the
knowledge of fully parameterised LMs (such
as BERT) into a compact recursive student.
In addition, we investigate the application of
bottleneck adapters for layer-wise adaptation
of our recursive student, and also explore the
efficacy of adapter tuning for fine-tuning of
compact models. We test our proposed models
on a number of general and biomedical NLP
tasks to demonstrate their viability and compare them with the state of the art and other existing compact models. All the code used in the experiments is available at https://github.com/nlpie-research/MiniALBERT, and our pre-trained compact models can be accessed at https://huggingface.co/nlpie.
†The two authors contributed equally to this work.
1 Introduction
Following the introduction of BERT (Devlin et al.,
2019), generic pre-trained Language Models (LMs)
have started to dominate the field of NLP. Virtu-
ally all state-of-the-art NLP models are built on
top of some large pre-trained transformer as a back-
bone and are subsequently fine-tuned on their target
dataset. While this pre-train and fine-tune approach
has resulted in significant improvements across a
wide range of NLP tasks, the widespread use of
resource-intensive and overparameterised trans-
formers has also raised concerns among researchers
about their energy consumption, environmental im-
pact, and ethical implications (Strubell et al.,2019;
Bender et al.,2021).
As a response to this, different approaches have
appeared with the aim to make large LMs more
efficient, accessible, and environmentally friendly.
Model compression is a line of research that has re-
cently received considerable attention. It involves
encoding a larger and slower but more performant
model into a smaller and faster one with the aim to
retain much of the former’s performance capability
(Bucilua et al.,2006). Knowledge distillation (Hin-
ton et al.,2015), quantisation (Shen et al.,2020),
and pruning (Ganesh et al.,2021) are three exam-
ples of such methods.
Adapter modules (Bapna and Firat, 2019; He et al., 2021) have recently been introduced as an effective
mechanism to address the parameter inefficiency of
large pre-trained models. In this approach, several
‘bottleneck adapters’(Houlsby et al.,2019a) are em-
bedded inside different locations within the original
network. During fine-tuning, the parameters of the
original model are kept fixed, and for each new
task only the adapters are fine-tuned. This only
adds a small number of parameters to the overall
architecture and allows for a much faster and more
efficient fine-tuning on different downstream tasks.
Another approach to improve efficiency of LM-
based transformers is shared parameterisation,
which was popularised by ALBERT (Lan et al.,
2019). While the original formulation of trans-
formers (Vaswani et al.,2017) employs full pa-
rameterisation, wherein each parameter belongs to a single module and is used only once in the forward pass, shared parameterisation allows
different modules of the network to share parame-
ters, resulting in a more efficient use of resources
given the same parameterisation budget. However,
a common downside of this approach is slower in-
ference time and reduced performance. Ge and Wei (2022) posit two different parameterisation methods in an attempt to address the compute and
memory challenges of transformer models and ex-
plores layer-wise adaptation in an encoder-decoder
architecture. These methods exploit cross-layer pa-
rameter sharing in a way that would allow for the
model to be utilised on mobile devices with strict
memory constraints while achieving state-of-the-
art results on two seq2seq tasks for English.
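To make the idea of cross-layer parameter sharing concrete, the following PyTorch sketch (illustrative only; it is not the MiniALBERT architecture, and the layer sizes and iteration count are placeholder values) reuses a single transformer layer for every step of the forward pass:

import torch
import torch.nn as nn

class RecursiveEncoder(nn.Module):
    """ALBERT-style encoder: one set of layer weights reused across the depth."""

    def __init__(self, hidden_size=768, num_heads=12, num_iterations=12):
        super().__init__()
        # A single shared transformer layer instead of a stack of independent ones.
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True
        )
        self.num_iterations = num_iterations

    def forward(self, x):
        # The output of each pass is fed back into the same layer, so the
        # parameter count stays constant regardless of the effective depth.
        for _ in range(self.num_iterations):
            x = self.shared_layer(x)
        return x

# Example: 12 effective layers, but only one layer's worth of parameters.
encoder = RecursiveEncoder()
hidden_states = encoder(torch.randn(2, 128, 768))  # (batch, seq_len, hidden)

With twelve iterations, such a model has the effective depth of BERT-Base while storing only a single layer's parameters, which is the trade-off that ALBERT-style recursion exploits.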
In this work, we exploit some of the above ap-
proaches to create a number of compact and ef-
ficient encoder-only models distilled from much
larger language models. The contributions of this
work are as follows:
• To the best of our knowledge, we are the first to compress fully parameterised large language models using recursive transformers (i.e. ALBERT-like models that employ full parameter sharing).

• We demonstrate the effectiveness of our pre-trained bottleneck adapters by merely fine-tuning them on downstream tasks while still achieving competitive results.

• We present several light-weight transformers with parameters ranging from 12M for the smallest to 32M for the largest. These models are shown to perform at the same level as their fully parameterised versions.

• Finally, we evaluate our models on a wide range of general and biomedical NLP tasks and datasets.
2 Background
2.1 LM-based Transformers and
Computational Complexity
Ever since the introduction of the transformer ar-
chitecture (Vaswani et al.,2017), large LM-based
transformers such as BERT (Devlin et al.,2019)
have become increasingly more popular in NLP
and lie at the heart of most state-of-the-art models.
A transformer is primarily composed of a number
of transformer blocks stacked on top of one another.
BERT-Base, for instance, consists of 12 of these
blocks. The most important component in a block
is the multi-head self-attention module. To be use-
ful for language tasks, transformers are pre-trained
using a number of self-supervised auxiliary tasks
(Xia et al.,2020); these usually include some varia-
tion of Language Modelling (LM) and an optional
sentence-level prediction task. Examples of the for-
mer include Masked Language Modelling (MLM)
and Causal Language Modelling (CLM). For the
latter, BERT uses Next Sentence Prediction (NSP)
and ALBERT (Lan et al.,2019) employs Sentence
Order Prediction (SOP).
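As an illustration of the MLM objective, the sketch below masks a random subset of input tokens so the model can be trained to recover them; the 15% masking rate follows Devlin et al. (2019), while the replacement scheme is simplified (BERT additionally replaces some selected tokens with random tokens or leaves them unchanged):

import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels); label is None where no loss is computed."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)  # the model must predict the original token
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)        # unmasked positions contribute no MLM loss
    return masked, labels

print(mask_tokens("the quick brown fox jumps over the lazy dog".split(), seed=0))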
The standard approach to utilise these pre-
trained models is to fine-tune them on a target
task. Given N as the sequence length, the computational and time complexity of self-attention is O(N^2) (Keles et al., 2022). In recent years, different approaches have appeared in the literature
to address this bottleneck by modifying the self-
attention operation in order to improve the general
efficiency of transformers (with different perfor-
mance trade-offs). Tay et al. (2020) survey the
most common approaches to develop what is re-
ferred to as ‘efficient transformers’.
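For reference, the quadratic cost stems from the standard scaled dot-product attention of Vaswani et al. (2017), in which the score matrix has shape N x N:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
\qquad Q, K \in \mathbb{R}^{N \times d_k},\; V \in \mathbb{R}^{N \times d_v},
\]

so forming \(Q K^{\top}\) alone takes \(O(N^2 d_k)\) time and \(O(N^2)\) memory, which is precisely what efficient-transformer variants seek to reduce.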
The magnitude of the parameters of LM-based
transformers is another significant issue that re-
stricts their use. With new releases like GPT-3
and MT-NLG (Smith et al.,2022) that feature
hundreds of billions of parameters, these models
have become increasingly overparameterised due
to the large number of layers and embedding sizes
(Rogers et al.,2020).
2.2 Model Distillation
The overparameterisation issue has motivated re-
search into developing methods to compress larger
models into smaller and faster versions that per-
form reasonably close to their larger counterparts.
Figure 1: The layer-to-layer distillation procedure proposed for distilling the knowledge of a fully-parameterised teacher into a compact recursive student. While the teacher has fully parameterised layers, the recursive student has only one layer and the output is fed back into the same layer repeatedly. Despite this compact structure, our proposed distillation procedure is designed to align the output of each iteration of the recursive student with a particular layer of the fully-parameterised teacher, as if the student had fully parameterised layers. Additional losses, namely the Output Loss and the MLM Loss, are used for further knowledge distillation.

Knowledge distillation (Hinton et al., 2015) is a prominent method intended to distill a
lightweight ‘student’ model from a larger ‘teacher’
network by using the outputs of the teacher network
as soft labels. Distillation can either be done task-
specifically during fine-tuning, or task-agnostically
by mimicking the MLM outputs or the interme-
diate representations of the teacher prior to the
fine-tuning stage. The latter is more flexible and
computationally less expensive (Wang et al.,2020).
DistilBERT is a well-known example of a distilled model derived from BERT, which is claimed to be 40% smaller in terms of parameters and 60% faster while retaining 97% of BERT’s performance on a range of language understanding tasks (Sanh et al., 2019).
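To illustrate how task-agnostic distillation with soft labels can be assembled in practice (a minimal PyTorch sketch under generic assumptions; it is not the exact DistilBERT or MiniALBERT recipe, and the temperature and loss weights are placeholders), the objective below combines a temperature-scaled soft-label term, a hidden-state alignment term, and a hard MLM term:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden, mlm_labels,
                      temperature=2.0, w_soft=1.0, w_align=1.0, w_mlm=1.0):
    """Illustrative task-agnostic distillation objective.

    student_logits / teacher_logits: (batch, seq, vocab) MLM output scores.
    student_hidden / teacher_hidden: (batch, seq, dim) hidden states, assumed
        here to share the same dimensionality for simplicity.
    mlm_labels: (batch, seq) token ids, with -100 at unmasked positions.
    """
    # Soft-label term: match the teacher's output distribution
    # (temperature-scaled KL divergence, as in Hinton et al., 2015).
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Alignment term: pull the student's hidden states towards the teacher's.
    align_loss = F.mse_loss(student_hidden, teacher_hidden)

    # Hard MLM term, computed on the masked positions only.
    mlm_loss = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        mlm_labels.reshape(-1),
        ignore_index=-100,
    )

    return w_soft * soft_loss + w_align * align_loss + w_mlm * mlm_loss

In layer-to-layer distillation of the kind shown in Figure 1, the alignment term would be applied between each iteration of the recursive student and a chosen teacher layer rather than only at the final hidden state.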
2.3 Efficient Fine-tuning Approaches
As discussed in Sec 2.1, LM-based transformers
involve a large number of parameters and they are
often fine-tuned on a target dataset. However, fine-tuning can become time-consuming as dataset sizes grow. Different techniques exist
in the literature to alleviate this bottleneck during
fine-tuning. In this section we explore two of these
techniques, namely, prompt tuning and bottleneck
adapters.
Prompt tuning (Lester et al.,2021) is a technique
in which the weights of a language model are kept
frozen during the fine-tuning stage and fine-tuning
is reformulated as a cloze-style task. Similar to
T5, prompt tuning regards all tasks as a variation
of text generation and conditions the generation
using ‘soft prompts’. A typical prompt consists
of a text template with a masked token and a set
of candidate label words to fill the mask. This
turns the target task into another MLM objective
in which the right candidates are chosen and soft
prompts are learned. This method is especially
useful for few-shot learning scenarios where there
are not many target labels available for standard
fine-tuning.
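For instance (an illustrative template, not one used in this work), a binary sentiment task can be cast as a cloze query with a small verbalizer that maps candidate label words back to classes:

# Hypothetical cloze-style prompt for binary sentiment classification.
# The backbone LM stays frozen; only the (soft) prompt parameters are trained.
template = "{review} Overall, the movie was [MASK]."

# Verbalizer: candidate words that may fill [MASK], mapped to task labels.
verbalizer = {"great": "positive", "terrible": "negative"}

example = template.format(review="The plot was thin but the acting saved it.")
# The model scores 'great' vs. 'terrible' at the [MASK] position, and the
# highest-scoring candidate determines the predicted label.

With soft prompts, the discrete template above is replaced or augmented by trainable continuous embeddings prepended to the input, which are the only parameters updated during fine-tuning.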
Bottleneck Adapters (BAs) (Houlsby et al.,
2019b;Pfeiffer et al.,2021;Rücklé et al.,2020;
Pfeiffer et al.,2020) are another mechanism used
during fine-tuning to enhance efficiency of training.
Each BA block consists of a linear down-projection,
non-linearity, and up-projection along with residual
connections. Several of these adapters are placed
after the feed-forward or attention modules in a
transformer. Similar to prompts, only the BAs are
trained during fine-tuning.
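A bottleneck adapter block of the kind described above can be sketched as follows (a minimal PyTorch illustration; the hidden and bottleneck sizes are placeholder values rather than the configuration used in this work):

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-projection -> non-linearity -> up-projection, with a residual connection."""

    def __init__(self, hidden_size=768, bottleneck_size=64):
        super().__init__()
        self.down_proj = nn.Linear(hidden_size, bottleneck_size)
        self.activation = nn.GELU()
        self.up_proj = nn.Linear(bottleneck_size, hidden_size)

    def forward(self, hidden_states):
        # The residual connection keeps the adapter close to an identity mapping
        # at initialisation, so the frozen backbone is not disturbed.
        residual = hidden_states
        x = self.down_proj(hidden_states)
        x = self.activation(x)
        x = self.up_proj(x)
        return x + residual

During fine-tuning, only modules of this kind, inserted after the attention and feed-forward sub-layers, receive gradient updates while the backbone weights stay frozen.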
2.4 Parameter Sharing via Recursion
Weight sharing is a strategy intended to reduce the
overall number of parameters in a model. Lan et al.