COST-EFF: Collaborative Optimization of Spatial and Temporal
Efficiency with Slenderized Multi-exit Language Models
Bowen Shen1,2, Zheng Lin1,2, Yuanxin Liu1,3, Zhengxiao Liu1, Lei Wang1, Weiping Wang1
1Institute of Information Engineering, Chinese Academy of Sciences
2School of Cyber Security, University of Chinese Academy of Sciences
3MOE Key Laboratory of Computational Linguistics, Peking University
{shenbowen, linzheng, liuzhengxiao, wanglei, wangweiping}@iie.ac.cn
liuyuanxin@stu.pku.edu.cn
Abstract
Transformer-based pre-trained language models (PLMs) mostly suffer from excessive overhead despite their advanced capacity. For resource-constrained devices, there is an urgent need for a spatially and temporally efficient model that retains the major capacity of PLMs. However, existing statically compressed models are unaware of the diverse complexities between input instances, potentially resulting in redundancy for simple inputs and inadequacy for complex ones. Also, miniature models with early exiting encounter challenges in the trade-off between making predictions and serving the deeper layers. Motivated by such considerations, we propose a collaborative optimization for PLMs that integrates static model compression and dynamic inference acceleration. Specifically, the PLM is slenderized in width while the depth remains intact, complementing layer-wise early exiting to speed up inference dynamically. To address the trade-off of early exiting, we propose a joint training approach that calibrates slenderization and preserves contributive structures to each exit instead of only the final layer. Experiments are conducted on the GLUE benchmark and the results verify the Pareto optimality of our approach at high compression and acceleration rates, with 1/8 of the parameters and 1/19 of the FLOPs of BERT.
1 Introduction
Pre-training generalized language models and fine-tuning them on specific downstream tasks has become the dominant paradigm in natural language processing (NLP) since the advent of Transformers (Vaswani et al., 2017) and BERT (Devlin et al., 2019). However, pre-trained language models (PLMs) are predominantly designed to be vast in the pursuit of model capacity and generalization. As a consequence, the model storage and inference time of PLMs are usually high, making them challenging to deploy on resource-constrained devices (Sun et al., 2020).

Zheng Lin and Lei Wang are the corresponding authors.

Figure 1: An illustration of the COST-EFF model structure and inference procedure. Emb, Tfm and Clf are abbreviations of embedding, Transformer and classifier, respectively. Blue bar charts denote the probability distributions output by the classifiers.
Recent studies indicate that Transformer-based PLMs bear redundancy spatially and temporally, which comes from the excessive width and depth of the model (Michel et al., 2019; Xin et al., 2021). With static compression methods including network pruning (Xia et al., 2022) and knowledge distillation (Jiao et al., 2020), the spatial overheads of PLMs (i.e., model parameters) can be reduced to a fixed setting. From the perspective of input instances rather than the model, early exiting without passing all the model layers enables dynamic acceleration at inference time and diminishes the temporal overheads (Zhou et al., 2020).
However, static compression can hardly find an optimal setting that is both efficient on simple input instances and accurate on complex ones, while early exiting cannot diminish the redundancy in model width and is unable to reduce the actual volume of the model. Further, interpretability studies indicate that the attention and semantic features across layers are different in BERT (Clark et al., 2019). Hence, deriving a multi-exit model from a pre-trained single-exit model like BERT incurs inconsistency in the training objective, where each layer simultaneously makes predictions and serves the deeper layers (Xin et al., 2021). Empirically, we find that the uncompressed BERT is not severely influenced by such inconsistency, whereas small-capacity models are not capable of balancing shallow and deep layers. Plugging in exits after compression thus leads to severe performance degradation, which hinders the combination of the two optimizations.
To fully exploit the efficiency of PLMs and mitigate the above-mentioned issues, we design a slenderized multi-exit model and propose a Collaborative Optimization approach of Spatial and Temporal EFFiciency (COST-EFF) as depicted in Figure 1. Unlike previous works, e.g., DynaBERT (Hou et al., 2020) and CoFi (Xia et al., 2022), which obtain a squat model, we keep the depth intact while slenderizing the PLM. The superiority of slender architectures over squat ones is supported by Bengio et al. (2007) and Turc et al. (2019) in generic machine learning and PLM design. To address the inconsistency in a compressed multi-exit model, we first distill a multi-exit BERT from the original PLM as both the teaching assistant (TA) and the slenderization backbone, which is more effective in balancing the trade-off between layers than compressed models. Then, we propose a collaborative approach that slenderizes the backbone with the calibration of exits. Such a slenderization removes structures that contribute less to each exit as well as the redundancy in width. After the slenderization, task-specific knowledge distillation is conducted for recovery, with objectives on the hidden representations and predictions of each layer. Specifically, the contributions of this paper are as follows.
• To comprehensively optimize the spatial and temporal efficiency of PLMs, we leverage both static slenderization and dynamic acceleration from the perspectives of model scale and variable computation.

• We propose a collaborative training approach that calibrates the slenderization under the guidance of intermediate exits and mitigates the inconsistency of early exiting.

• Experiments conducted on the GLUE benchmark verify the Pareto optimality of our approach. COST-EFF achieves 96.5% of the performance of fine-tuned BERT_Base with approximately 1/8 of the parameters and 1/19 of the FLOPs, without any form of data augmentation.¹
2 Related Work
The compression and acceleration of PLMs were
recently investigated to neutralize the overhead of
large models by various means.
The objects of structured pruning include, from small to large, hidden dimensions (Wang et al., 2020), attention heads (Michel et al., 2019), multi-head attention (MHA) and feed-forward network (FFN) modules (Xia et al., 2022), and entire Transformer layers (Fan et al., 2020). Considering the benefit of the overall structure, we keep all the modules while reducing their sizes. Besides pruning structures, a finer-grained approach is unstructured pruning, which removes individual weights. Unstructured pruning can achieve a high sparsity of 97% (Xu et al., 2022) but is not yet adaptable to general computing platforms and hardware.
During the recovery training of compressed models, knowledge distillation objectives include the predictions of classifiers (Sanh et al., 2020), features of intermediate representations (Jiao et al., 2020) and relations between samples (Tung and Mori, 2019). Also, the occasion of distillation varies between general pre-training and task-specific fine-tuning (Turc et al., 2019). Distillation enables training without ground-truth labels, which complements data augmentation. In this paper, data augmentation is not leveraged as it requires a long training time, but our approach is well adaptable to it if better performance is to be pursued.
Dynamic early exits originate from BranchyNet (Teerapittayanon et al., 2016), which introduces exit branches after specific convolution layers of a CNN model. The idea has been adapted to PLMs as Transformer layer-wise early exiting (Xin et al., 2021; Zhou et al., 2020; Liu et al., 2020). However, early exiting only accelerates inference but does not reduce the model size or the redundancy in width. Furthermore, owing to the inconsistency between shallow and deep layers, it is hard to achieve a high speedup using early exiting alone.
The prevailing PLMs, e.g., RoBERTa (Liu et al., 2019) and XLNet (Yang et al., 2019), are variants of the Transformer with similar overall structures, well adaptable to the optimizations that we propose. Apart from PLMs of increasing size, ALBERT (Lan et al., 2020) is distinctive with a small volume of 18M (million) parameters obtained by weight sharing across Transformer layers. Weight sharing allows the model to store the parameters only once, greatly reducing the storage overhead. However, the shared weights contribute nothing to inference speedup. Instead, the time required for ALBERT to achieve BERT-like accuracy increases.

¹Code is available at https://github.com/sbwww/COST-EFF.
3 Methodology
In this section, we analyze the major structures of Transformer-based PLMs and devise corresponding optimizations. The proposed COST-EFF has three key properties, namely static slenderization, dynamic acceleration and collaborative training.
3.1 Preliminaries
In this paper, we focus on optimizing the Transformer-based PLM, which mainly consists of embedding, MHA and FFN. Specifically, the embedding converts each input token to a tensor of size H (i.e., the hidden dimension). With a common vocabulary size of |V| = 30,522, the word embedding matrix accounts for <22% of BERT_Base parameters. Inside the Transformer, MHA has four matrices W_Q, W_K, W_V and W_O, all with input and output size H. FFN has two matrices W_FI and W_FO of size H × F. As the key components of the Transformer, MHA and FFN account for <26% and <52% of BERT_Base parameters, respectively.
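As a quick sanity check of these fractions, the following back-of-the-envelope sketch in Python uses the standard BERT_Base hyperparameters (H = 768, F = 3072, 12 layers, |V| = 30,522); it ignores biases, LayerNorm and position/segment embeddings, and the 110M total is the commonly cited figure for BERT_Base rather than a value taken from this paper.

```python
# Rough parameter accounting for BERT-Base; approximate, ignores biases,
# LayerNorm and position/segment embeddings.
H, F, L, V = 768, 3072, 12, 30522
emb = V * H            # word embedding matrix
mha = 4 * H * H * L    # W_Q, W_K, W_V, W_O in every layer
ffn = 2 * H * F * L    # W_FI and W_FO in every layer
total = 110e6          # commonly cited BERT-Base size (~110M parameters)
for name, n in [("embedding", emb), ("MHA", mha), ("FFN", ffn)]:
    print(f"{name}: {n / 1e6:.1f}M ({n / total:.1%} of BERT-Base)")
```

Running the sketch gives roughly 21%, 26% and 51%, consistent with the fractions quoted above.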
Based on the analysis, we have the following slenderization and acceleration schemes. (1) The word embedding matrix W_t is decomposed into the multiplication of two matrices following Lan et al. (2020). Thus, the vocabulary size |V| and hidden size H are not changed. (2) For the transformation matrices of MHA and FFN, structured pruning is adopted to reduce their input or output dimensions. (3) The inference is accelerated through early exiting as we retain the pre-trained model depth. To avoid introducing additional parameters, we remove the pre-trained pooler matrix before the classifiers. (4) Knowledge distillation on the prediction logits and hidden states of each layer is leveraged as a substitute for conventional fine-tuning. The overall architecture of COST-EFF is depicted in Figure 2.
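To make scheme (3) concrete, below is a minimal sketch of layer-wise early exiting with an entropy-based confidence criterion. The entropy threshold, the [CLS]-position classification and the function names are illustrative assumptions, not the exact COST-EFF exit criterion.

```python
import torch

def entropy(logits):
    """Entropy of the softmax distribution; low entropy means a confident prediction."""
    p = torch.softmax(logits, dim=-1)
    return -(p * torch.log(p + 1e-12)).sum(dim=-1)

def forward_with_early_exit(layers, classifiers, hidden, threshold=0.3):
    """Run Transformer layers one by one and stop once an exit is confident.
    Assumes batch size 1 and one classifier attached after every layer."""
    for i, (layer, clf) in enumerate(zip(layers, classifiers)):
        hidden = layer(hidden)
        logits = clf(hidden[:, 0])   # classify on the [CLS] position
        if entropy(logits).item() < threshold:
            return logits, i + 1     # exit early after layer i+1
    return logits, len(layers)       # fall through to the final exit
```

A lower threshold trades speed for accuracy: more instances run through deeper layers before an exit fires.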
3.2 Static Slenderization
3.2.1 Matrix Decomposition of Embedding
As mentioned before, the word embedding takes up more than 1/5 of BERT_Base parameters. Since the output dimension of the word embedding equals the hidden size, which we do not modify, we use truncated singular value decomposition (TSVD) to internally compress the word embedding matrix.
TSVD first decomposes a matrix as A_{m×n} = U_{m×m} Σ_{m×n} V_{n×n}, where Σ_{m×n} is the diagonal matrix of singular values. After that, the three matrices are truncated to a given rank R. Thus, the decomposition of the word embedding is

$$W_t^{|V| \times H} \approx W_{t1}^{|V| \times R} W_{t2}^{R \times H} = \tilde{U}^{|V| \times R} \tilde{\Sigma}^{R \times R} \tilde{V}^{R \times H}, \tag{1}$$

where we multiply the $\tilde{U}$ and $\tilde{\Sigma}$ matrices to form the first embedding matrix $W_{t1}^{|V| \times R}$, and $W_{t2}^{R \times H} = \tilde{V}$ is a linear transformation with no bias.
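A minimal PyTorch sketch of Eq. (1) follows: it factorizes the word embedding matrix with truncated SVD into two smaller matrices. The rank value and the module path in the usage comment are illustrative assumptions, not values prescribed by the paper.

```python
import torch

def tsvd_factorize(weight, rank):
    """Factorize a |V| x H matrix into W_t1 (|V| x R) and W_t2 (R x H) via truncated SVD."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    U_r, S_r, Vh_r = U[:, :rank], S[:rank], Vh[:rank, :]
    W_t1 = U_r * S_r   # fold the singular values into the first factor
    W_t2 = Vh_r        # acts as a linear transformation with no bias
    return W_t1, W_t2

# Usage (hypothetical module path and rank):
# W_t1, W_t2 = tsvd_factorize(model.embeddings.word_embeddings.weight.data, rank=128)
```

At inference, the embedding lookup uses W_t1 and is followed by a bias-free projection with W_t2, so |V| and H are unchanged while the stored parameters shrink from |V|·H to (|V| + H)·R.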
3.2.2 Structured Pruning of MHA and FFN
To compress the matrices in MHA and FFN, which contribute most of the PLM's parameters, we adopt structured pruning to compress one dimension of the matrices. As depicted in Figure 2, the pruning granularities of MHA and FFN are the attention head and the hidden dimension, respectively.
Following Molchanov et al. (2017), COST-EFF has the pruning objective of minimizing the difference between the pruned and original model, which is estimated by a first-order Taylor expansion:

$$|\Delta(S)| = \bigl|\mathcal{L}(X) - \mathcal{L}(X \mid h_i = 0,\ h_i \in S)\bigr| = \Bigl|\sum_{h_i \in S} \frac{\delta \mathcal{L}}{\delta h_i}(h_i - 0) + R^{(1)}\Bigr| \approx \Bigl|\sum_{h_i \in S} \frac{\delta \mathcal{L}}{\delta h_i} h_i\Bigr|, \tag{2}$$

where S denotes a specific structure, i.e., a set of weights, L(·) is the loss function and δL/δh_i is the gradient of the loss with respect to weight h_i. |Δ(S)| is the importance of structure S measured by the absolute value of the first-order term. For simplicity, we ignore the remainder R^{(1)} of the Taylor expansion.
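The per-structure score of Eq. (2) can be accumulated from a single backward pass over gradients and weights. The sketch below scores attention heads by grouping the input dimension of the MHA output projection; the BERT-style module path and the choice to score only W_O are simplifying assumptions for illustration, not necessarily how COST-EFF implements it.

```python
import torch

def head_importance(model, loss, num_layers, num_heads):
    """First-order Taylor importance |sum_i (dL/dh_i) * h_i| accumulated per attention head.
    Assumes BERT-style naming: model.encoder.layer[l].attention.output.dense.weight.
    For simplicity, only the output projection W_O is scored; a full implementation
    would also include the per-head slices of W_Q, W_K and W_V."""
    loss.backward()
    scores = torch.zeros(num_layers, num_heads)
    for l in range(num_layers):
        w = model.encoder.layer[l].attention.output.dense.weight   # shape (H, H)
        head_dim = w.size(1) // num_heads
        # group the input dimension of W_O by head and sum grad * weight within each group
        contrib = (w.grad * w).view(w.size(0), num_heads, head_dim)
        scores[l] = contrib.sum(dim=(0, 2)).abs()
    return scores   # heads with the smallest scores are candidates for pruning
```

The same scheme applies to FFN hidden dimensions by grouping the columns of W_FI (or rows of W_FO) instead of attention heads.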
In each Transformer layer, the structure S of MHA is the attention head while that of FFN is the hidden dimension, as depicted in the lower part of Figure 2. Specifically, the output dimensions