COST-EFF: Collaborative Optimization of Spatial and Temporal
Efficiency with Slenderized Multi-exit Language Models
Bowen Shen1,2, Zheng Lin1,2, Yuanxin Liu1,3, Zhengxiao Liu1, Lei Wang1, Weiping Wang1
1Institute of Information Engineering, Chinese Academy of Sciences
2School of Cyber Security, University of Chinese Academy of Sciences
3MOE Key Laboratory of Computational Linguistics, Peking University
{shenbowen, linzheng, liuzhengxiao, wanglei, wangweiping}@iie.ac.cn
liuyuanxin@stu.pku.edu.cn
Abstract
Transformer-based pre-trained language models (PLMs) mostly suffer from excessive overhead despite their advanced capacity. For resource-constrained devices, there is an urgent need for a spatially and temporally efficient model that retains the major capacity of PLMs. However, existing statically compressed models are unaware of the diverse complexities between input instances, potentially resulting in redundancy for simple inputs and inadequacy for complex ones. Also, miniature models with early exiting encounter challenges in the trade-off between making predictions and serving the deeper layers. Motivated by such considerations, we propose a collaborative optimization for PLMs that integrates static model compression and dynamic inference acceleration. Specifically, the PLM is slenderized in width while the depth remains intact, complementing layer-wise early exiting to speed up inference dynamically. To address the trade-off of early exiting, we propose a joint training approach that calibrates slenderization and preserves contributive structures to each exit instead of only the final layer. Experiments are conducted on the GLUE benchmark and the results verify the Pareto optimality of our approach at high compression and acceleration rates, with 1/8 of the parameters and 1/19 of the FLOPs of BERT.
1 Introduction
Pre-training generalized language models and fine-tuning them on specific downstream tasks has become the dominant paradigm in natural language processing (NLP) since the advent of Transformers (Vaswani et al., 2017) and BERT (Devlin et al., 2019). However, pre-trained language models (PLMs) are predominantly designed to be vast in the pursuit of model capacity and generalization. As a consequence, the model storage and inference time of PLMs are usually high, making them challenging to deploy on resource-constrained devices (Sun et al., 2020).

Zheng Lin and Lei Wang are the corresponding authors.

Figure 1: An illustration of the COST-EFF model structure and inference procedure. Emb, Tfm and Clf are abbreviations of embedding, Transformer and classifier, respectively. Blue bar charts denote the probability distributions output by the classifiers.
Recent studies indicate that Transformer-based PLMs bear redundancy spatially and temporally, which comes from the excessive width and depth of the model (Michel et al., 2019; Xin et al., 2021). With static compression methods including network pruning (Xia et al., 2022) and knowledge distillation (Jiao et al., 2020), the spatial overheads of PLMs (i.e., model parameters) can be reduced to a fixed setting. From the perspective of input instances rather than the model, early exiting without passing all the model layers enables dynamic acceleration at inference time and diminishes the temporal overheads (Zhou et al., 2020).
However, static compression can hardly find an optimal setting that is both efficient on simple input instances and accurate on complex ones, while early exiting cannot diminish the redundancy in model width and is unable to reduce the actual volume of the model. Further, interpretability studies indicate that the attention and semantic features across layers are different in BERT (Clark et al., 2019). Hence, deriving a multi-exit model from a pre-trained single-exit model like BERT incurs inconsistency in the training objective, where each layer simultaneously makes predictions and serves the deeper layers (Xin et al., 2021). Empirically, we find that the uncompressed BERT is not severely influenced by such inconsistency, whereas small-capacity models are not capable of balancing shallow and deep layers. Plugging in exits after compression thus leads to severe performance degradation, which hinders the combination of the two optimizations.
To fully exploit the efficiency of PLMs and mitigate the above-mentioned issues, we design a slenderized multi-exit model and propose a Collaborative Optimization approach of Spatial and Temporal EFFiciency (COST-EFF) as depicted in Figure 1. Unlike previous works, e.g., DynaBERT (Hou et al., 2020) and CoFi (Xia et al., 2022), which obtain a squat model, we keep the depth intact while slenderizing the PLM. The superiority of slender architectures over squat ones is supported by Bengio et al. (2007) and Turc et al. (2019) in generic machine learning and PLM design. To address the inconsistency in a compressed multi-exit model, we first distill a multi-exit BERT from the original PLM as both the teaching assistant (TA) and the slenderization backbone, which is more effective in balancing the trade-off between layers than compressed models. Then, we propose a collaborative approach that slenderizes the backbone with the calibration of exits. Such a slenderization removes structures that contribute less to each exit as well as the redundancy in width. After the slenderization, task-specific knowledge distillation is conducted for recovery, with objectives on the hidden representations and predictions of each layer. Specifically, the contributions of this paper are as follows.
• To comprehensively optimize the spatial and temporal efficiency of PLMs, we leverage both static slenderization and dynamic acceleration from the perspectives of model scale and variable computation.

• We propose a collaborative training approach that calibrates the slenderization under the guidance of intermediate exits and mitigates the inconsistency of early exiting.

• Experiments conducted on the GLUE benchmark verify the Pareto optimality of our approach. COST-EFF achieves 96.5% of the performance of fine-tuned BERT_Base with approximately 1/8 of the parameters and 1/19 of the FLOPs, without any form of data augmentation.¹
2 Related Work
The compression and acceleration of PLMs were
recently investigated to neutralize the overhead of
large models by various means.
The objects of structured pruning include, from small to large, hidden dimensions (Wang et al., 2020), attention heads (Michel et al., 2019), multi-head attention (MHA) and feed-forward network (FFN) modules (Xia et al., 2022), and entire Transformer layers (Fan et al., 2020). Considering the benefit of the overall structure, we keep all the modules while reducing their sizes. Besides pruning structures, a finer-grained approach is unstructured pruning, which removes individual weights. Unstructured pruning can achieve a high sparsity of 97% (Xu et al., 2022) but is not yet adaptable to general computing platforms and hardware.
During the recovery training of compressed models, knowledge distillation objectives include the predictions of classifiers (Sanh et al., 2020), features of intermediate representations (Jiao et al., 2020) and relations between samples (Tung and Mori, 2019). Also, the occasion of distillation varies between general pre-training and task-specific fine-tuning (Turc et al., 2019). Distillation enables training without ground-truth labels, which complements data augmentation. In this paper, data augmentation is not leveraged as it requires a long training time, but our approach is well adaptable to it if better performance is to be pursued.
Dynamic early exits originate from BranchyNet (Teerapittayanon et al., 2016), which introduces exit branches after specific convolution layers of a CNN model. The idea has been adapted to PLMs as Transformer layer-wise early exiting (Xin et al., 2021; Zhou et al., 2020; Liu et al., 2020). However, early exiting only accelerates inference but does not reduce the model size or the redundancy in width. Furthermore, owing to the inconsistency between shallow and deep layers, it is hard to achieve a high speedup using early exiting alone.
The prevailing PLMs, e.g., RoBERTa (Liu et al., 2019) and XLNet (Yang et al., 2019), are variants of the Transformer with similar overall structures, well adaptable to the optimizations that we propose. Apart from PLMs of increasing size, ALBERT (Lan et al., 2020) is distinctive with a small volume of 18M (million) parameters obtained by weight sharing across Transformer layers. Weight sharing allows the model to store the parameters only once, greatly reducing the storage overhead. However, the shared weights contribute nothing to inference speedup. Instead, the time required for ALBERT to achieve BERT-like accuracy increases.

¹Code is available at https://github.com/sbwww/COST-EFF.
3 Methodology
In this section, we analyze the major structures of Transformer-based PLMs and devise corresponding optimizations. The proposed COST-EFF has three key properties, namely static slenderization, dynamic acceleration and collaborative training.
3.1 Preliminaries
In this paper, we focus on optimizing the Transformer-based PLM, which mainly consists of embedding, MHA and FFN. Specifically, the embedding converts each input token to a tensor of size H (i.e., the hidden dimension). With a common vocabulary size of |V| = 30,522, the word embedding matrix accounts for <22% of BERT_Base parameters. Inside the Transformer, MHA has four matrices W_Q, W_K, W_V and W_O, all with input and output size H. FFN has two matrices W_FI and W_FO of size H × F. As the key components of the Transformer, MHA and FFN account for <26% and <52% of BERT_Base parameters, respectively.
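As a quick sanity check of these fractions, the following back-of-the-envelope sketch in Python uses the standard BERT_Base hyperparameters (H = 768, F = 3072, 12 layers, |V| = 30,522); it ignores biases, LayerNorm and position/segment embeddings, and the 110M total is the commonly cited figure for BERT_Base rather than a value taken from this paper.

```python
# Rough parameter accounting for BERT-Base; approximate, ignores biases,
# LayerNorm and position/segment embeddings.
H, F, L, V = 768, 3072, 12, 30522
emb = V * H            # word embedding matrix
mha = 4 * H * H * L    # W_Q, W_K, W_V, W_O in every layer
ffn = 2 * H * F * L    # W_FI and W_FO in every layer
total = 110e6          # commonly cited BERT-Base size (~110M parameters)
for name, n in [("embedding", emb), ("MHA", mha), ("FFN", ffn)]:
    print(f"{name}: {n / 1e6:.1f}M ({n / total:.1%} of BERT-Base)")
```

Running the sketch gives roughly 21%, 26% and 51%, consistent with the fractions quoted above.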
Based on the analysis, we have the following slenderization and acceleration schemes. (1) The word embedding matrix W_t is decomposed into the multiplication of two matrices following Lan et al. (2020). Thus, the vocabulary size |V| and hidden size H are not changed. (2) For the transformation matrices of MHA and FFN, structured pruning is adopted to reduce their input or output dimensions. (3) The inference is accelerated through early exiting as we retain the pre-trained model depth. To avoid introducing additional parameters, we remove the pre-trained pooler matrix before the classifiers. (4) Knowledge distillation on the prediction logits and hidden states of each layer is leveraged as a substitute for conventional fine-tuning. The overall architecture of COST-EFF is depicted in Figure 2.
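To make scheme (3) concrete, below is a minimal sketch of layer-wise early exiting with an entropy-based confidence criterion. The entropy threshold, the [CLS]-position classification and the function names are illustrative assumptions, not the exact COST-EFF exit criterion.

```python
import torch

def entropy(logits):
    """Entropy of the softmax distribution; low entropy means a confident prediction."""
    p = torch.softmax(logits, dim=-1)
    return -(p * torch.log(p + 1e-12)).sum(dim=-1)

def forward_with_early_exit(layers, classifiers, hidden, threshold=0.3):
    """Run Transformer layers one by one and stop once an exit is confident.
    Assumes batch size 1 and one classifier attached after every layer."""
    for i, (layer, clf) in enumerate(zip(layers, classifiers)):
        hidden = layer(hidden)
        logits = clf(hidden[:, 0])   # classify on the [CLS] position
        if entropy(logits).item() < threshold:
            return logits, i + 1     # exit early after layer i+1
    return logits, len(layers)       # fall through to the final exit
```

A lower threshold trades speed for accuracy: more instances run through deeper layers before an exit fires.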
3.2 Static Slenderization
3.2.1 Matrix Decomposition of Embedding
As mentioned before, the word embedding takes up more than 1/5 of BERT_Base parameters. Since the output dimension of the word embedding equals the hidden size, which we do not modify, we use truncated singular value decomposition (TSVD) to internally compress the word embedding matrix.
TSVD first decomposes a matrix as A_{m×n} = U_{m×m} Σ_{m×n} V_{n×n}, where Σ_{m×n} is the diagonal matrix of singular values. After that, the three matrices are truncated to a given rank R. Thus, the decomposition of the word embedding is

$$W_t^{|V| \times H} \approx W_{t1}^{|V| \times R} W_{t2}^{R \times H} = \tilde{U}^{|V| \times R} \tilde{\Sigma}^{R \times R} \tilde{V}^{R \times H}, \tag{1}$$

where we multiply the $\tilde{U}$ and $\tilde{\Sigma}$ matrices to form the first embedding matrix $W_{t1}^{|V| \times R}$, and $W_{t2}^{R \times H} = \tilde{V}$ is a linear transformation with no bias.
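A minimal PyTorch sketch of Eq. (1) follows: it factorizes the word embedding matrix with truncated SVD into two smaller matrices. The rank value and the module path in the usage comment are illustrative assumptions, not values prescribed by the paper.

```python
import torch

def tsvd_factorize(weight, rank):
    """Factorize a |V| x H matrix into W_t1 (|V| x R) and W_t2 (R x H) via truncated SVD."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    U_r, S_r, Vh_r = U[:, :rank], S[:rank], Vh[:rank, :]
    W_t1 = U_r * S_r   # fold the singular values into the first factor
    W_t2 = Vh_r        # acts as a linear transformation with no bias
    return W_t1, W_t2

# Usage (hypothetical module path and rank):
# W_t1, W_t2 = tsvd_factorize(model.embeddings.word_embeddings.weight.data, rank=128)
```

At inference, the embedding lookup uses W_t1 and is followed by a bias-free projection with W_t2, so |V| and H are unchanged while the stored parameters shrink from |V|·H to (|V| + H)·R.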
3.2.2 Structured Pruning of MHA and FFN
To compress the matrices in MHA and FFN, which contribute most of the PLM's parameters, we adopt structured pruning to compress one dimension of the matrices. As depicted in Figure 2, the pruning granularities of MHA and FFN are the attention head and the hidden dimension, respectively.
Following Molchanov et al. (2017), COST-EFF has the pruning objective of minimizing the difference between the pruned and original model, which is estimated by a first-order Taylor expansion:

$$|\Delta(S)| = \bigl|\mathcal{L}(X) - \mathcal{L}(X \mid h_i = 0,\ h_i \in S)\bigr| = \Bigl|\sum_{h_i \in S} \frac{\delta \mathcal{L}}{\delta h_i}(h_i - 0) + R^{(1)}\Bigr| \approx \Bigl|\sum_{h_i \in S} \frac{\delta \mathcal{L}}{\delta h_i} h_i\Bigr|, \tag{2}$$

where S denotes a specific structure, i.e., a set of weights, L(·) is the loss function and δL/δh_i is the gradient of the loss with respect to weight h_i. |Δ(S)| is the importance of structure S measured by the absolute value of the first-order term. For simplicity, we ignore the remainder R^{(1)} of the Taylor expansion.
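The per-structure score of Eq. (2) can be accumulated from a single backward pass over gradients and weights. The sketch below scores attention heads by grouping the input dimension of the MHA output projection; the BERT-style module path and the choice to score only W_O are simplifying assumptions for illustration, not necessarily how COST-EFF implements it.

```python
import torch

def head_importance(model, loss, num_layers, num_heads):
    """First-order Taylor importance |sum_i (dL/dh_i) * h_i| accumulated per attention head.
    Assumes BERT-style naming: model.encoder.layer[l].attention.output.dense.weight.
    For simplicity, only the output projection W_O is scored; a full implementation
    would also include the per-head slices of W_Q, W_K and W_V."""
    loss.backward()
    scores = torch.zeros(num_layers, num_heads)
    for l in range(num_layers):
        w = model.encoder.layer[l].attention.output.dense.weight   # shape (H, H)
        head_dim = w.size(1) // num_heads
        # group the input dimension of W_O by head and sum grad * weight within each group
        contrib = (w.grad * w).view(w.size(0), num_heads, head_dim)
        scores[l] = contrib.sum(dim=(0, 2)).abs()
    return scores   # heads with the smallest scores are candidates for pruning
```

The same scheme applies to FFN hidden dimensions by grouping the columns of W_FI (or rows of W_FO) instead of attention heads.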
In each Transformer layer, the structure S of MHA is the attention head while that of FFN is the hidden dimension, as depicted in the lower part of Figure 2. Specifically, the output dimensions