original model are kept fixed, and for each new
task only the adapters are fine-tuned. This only
adds a small number of parameters to the overall
architecture and allows for a much faster and more
efficient fine-tuning on different downstream tasks.
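As a minimal sketch of this idea (in PyTorch), the snippet below inserts a bottleneck adapter with a residual connection and trains only its parameters; the module layout, the GELU activation, the 64-dimensional bottleneck and the helper name mark_only_adapters_trainable are illustrative assumptions rather than the exact configuration used in this work.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual add."""

    def __init__(self, hidden_size: int, bottleneck_size: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.act = nn.GELU()          # activation choice is illustrative
        self.up = nn.Linear(bottleneck_size, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the frozen backbone's representation intact.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

def mark_only_adapters_trainable(model: nn.Module) -> None:
    """Freeze every parameter, then unfreeze only the adapter modules."""
    for p in model.parameters():
        p.requires_grad = False
    for module in model.modules():
        if isinstance(module, BottleneckAdapter):
            for p in module.parameters():
                p.requires_grad = True
```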
Another approach to improving the efficiency of LM-based transformers is shared parameterisation,
which was popularised by ALBERT (Lan et al.,
2019). While the original formulation of trans-
formers (Vaswani et al., 2017) employs full parameterisation, wherein each parameter belongs to a single module and is used only once in the forward pass, shared parameterisation allows
different modules of the network to share parame-
ters, resulting in a more efficient use of resources
given the same parameterisation budget. However,
a common downside of this approach is slower in-
ference time and reduced performance. Ge and Wei (2022) propose two different parameterisation methods to address the compute and memory challenges of transformer models, and explore layer-wise adaptation in an encoder-decoder architecture. These methods exploit cross-layer parameter sharing in a way that allows the model to be used on mobile devices with strict memory constraints while achieving state-of-the-art results on two seq2seq tasks for English.
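A minimal sketch of such cross-layer parameter sharing is given below: a single transformer block is applied repeatedly, so the parameter count is that of one block regardless of depth. The hidden size, head count and iteration count are illustrative assumptions, not the settings of Ge and Wei (2022) or of this work.

```python
import torch
import torch.nn as nn

class RecursiveEncoder(nn.Module):
    """One transformer block whose weights are reused at every layer of the stack."""

    def __init__(self, hidden_size: int = 256, num_heads: int = 4,
                 ffn_size: int = 1024, num_iterations: int = 12):
        super().__init__()
        # A single set of block parameters, shared across all iterations.
        self.shared_block = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads,
            dim_feedforward=ffn_size, batch_first=True)
        self.num_iterations = num_iterations

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.num_iterations):
            x = self.shared_block(x)  # same parameters applied at each step
        return x

# Parameter count equals that of one block, not num_iterations blocks.
model = RecursiveEncoder()
out = model(torch.randn(2, 16, 256))  # (batch, sequence, hidden)
```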
In this work, we exploit some of the above ap-
proaches to create a number of compact and ef-
ficient encoder-only models distilled from much
larger language models. The contributions of this
work are as follows:
• To the best of our knowledge, we are the first to compress fully parameterised large language models using recursive transformers (i.e. ALBERT-like models that employ full parameter sharing).
• We demonstrate the effectiveness of our pre-trained bottleneck adapters by merely fine-tuning them on downstream tasks while still achieving competitive results.
• We present several light-weight transformers with parameters ranging from 12M for the smallest to 32M for the largest. These models are shown to perform at the same level as their fully parameterised versions.
• Finally, we evaluate our models on a wide range of general and biomedical NLP tasks and datasets.
2 Background
2.1 LM-based Transformers and
Computational Complexity
Ever since the introduction of the transformer ar-
chitecture (Vaswani et al., 2017), large LM-based transformers such as BERT (Devlin et al., 2019) have become increasingly popular in NLP
and lie at the heart of most state-of-the-art models.
A transformer is primarily composed of a number
of transformer blocks stacked on top of one another.
BERT-Base, for instance, consists of 12 of these blocks. The most important component in a block
is the multi-head self-attention module. To be use-
ful for language tasks, transformers are pre-trained
using a number of self-supervised auxiliary tasks
(Xia et al., 2020); these usually include some varia-
tion of Language Modelling (LM) and an optional
sentence-level prediction task. Examples of the for-
mer include Masked Language Modelling (MLM)
and Causal Language Modelling (CLM). For the
latter, BERT uses Next Sentence Prediction (NSP)
and ALBERT (Lan et al., 2019) employs Sentence
Order Prediction (SOP).
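To make the MLM objective concrete, the sketch below implements the standard BERT-style masking scheme, in which roughly 15% of positions are selected and, of those, 80% are replaced by [MASK], 10% by a random token and 10% are left unchanged; the ratios and the function name are assumptions for illustration rather than the exact recipe of any particular model discussed here.

```python
import torch

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, vocab_size: int,
                mlm_probability: float = 0.15):
    """BERT-style masking: of the selected positions, 80% -> [MASK],
    10% -> random token, 10% left unchanged."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Select ~15% of positions as prediction targets.
    masked = torch.bernoulli(torch.full(labels.shape, mlm_probability)).bool()
    labels[~masked] = -100  # ignore index: loss is computed only on masked positions

    # 80% of the selected positions are replaced with [MASK].
    replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
    input_ids[replaced] = mask_token_id

    # Half of the remainder (10% overall) becomes a random token.
    randomised = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~replaced
    input_ids[randomised] = torch.randint(vocab_size, labels.shape)[randomised]

    # The final 10% keeps the original token.
    return input_ids, labels
```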
The standard approach to utilise these pre-
trained models is to fine-tune them on a target
task. Given a sequence length of N, the computational and time complexity of self-attention is O(N²) (Keles et al., 2022). In recent years, different approaches have appeared in the literature
to address this bottleneck by modifying the self-
attention operation in order to improve the general
efficiency of transformers (with different perfor-
mance trade-offs). Tay et al. (2020) survey the
most common approaches to develop what is re-
ferred to as ‘efficient transformers’.
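As a naive illustration of where the quadratic cost arises, the sketch below materialises the full N x N attention score matrix for a single head; the tensor shapes are arbitrary and no efficiency modification of the kind surveyed by Tay et al. (2020) is applied.

```python
import math
import torch

def scaled_dot_product_attention(q: torch.Tensor, k: torch.Tensor,
                                 v: torch.Tensor) -> torch.Tensor:
    """Naive single-head attention; the N x N score matrix is the quadratic bottleneck."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # shape (..., N, N)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

# Doubling N quadruples the score matrix: 512 tokens already yield 512 x 512 scores.
q = k = v = torch.randn(1, 512, 64)
out = scaled_dot_product_attention(q, k, v)
```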
The sheer number of parameters in LM-based transformers is another significant issue that restricts their use. With new releases like GPT-3
and MT-NLG (Smith et al., 2022) that feature
hundreds of billions of parameters, these models
have become increasingly overparameterised due
to the large number of layers and embedding sizes
(Rogers et al., 2020).
2.2 Model Distillation
The overparameterisation issue has motivated re-
search into methods for compressing large models into smaller and faster versions that perform reasonably close to their larger counterparts.
Knowledge distillation (Hinton et al., 2015) is a prominent method intended to distill a