
a pre-trained single-exit model like BERT incurs an inconsistency in the training objective, where each layer simultaneously makes predictions and serves the deeper layers (Xin et al., 2021). Empirically, we find that the uncompressed BERT is not severely affected by this inconsistency, whereas small-capacity models are not able to balance shallow and deep layers. Plugging in exits after compression therefore leads to severe performance degradation, which prevents the two optimizations from complementing each other.
To fully exploit the efficiency of PLMs and miti-
gate the above-mentioned issues, we design a slen-
derized multi-exit model and propose a Collabo-
rative Optimization approach of Spatial and Tem-
poral EFFiciency (COST-EFF) as depicted in Fig-
ure 1. Unlike previous works, e.g., DynaBERT (Hou et al., 2020) and CoFi (Xia et al., 2022), which obtain a squat model, we keep the depth intact while slenderizing the PLM. The superiority of slender architectures over squat ones is supported by Bengio et al. (2007) and Turc et al. (2019) in generic machine learning and PLM design. To address the inconsistency in the compressed multi-exit model, we first distill a multi-exit BERT from the original PLM to serve as both the teaching assistant (TA) and the slenderization backbone, since it balances the trade-off between layers more effectively than compressed models. Then, we propose a collaborative approach that slenderizes the backbone under the calibration of the exits. Such slenderization removes the structures that contribute little to each exit as well as the redundancy in width. After the slenderization, task-specific knowledge distillation is conducted on the hidden representations and predictions of each layer to recover performance.
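As a rough illustration of this recovery objective, the sketch below (in PyTorch style) combines a hidden-representation matching term with a prediction matching term at every exit. The tensor names, the MSE and temperature-scaled KL choices, and the hyperparameters are illustrative assumptions, not the exact formulation of COST-EFF.

```python
import torch.nn.functional as F

def layerwise_recovery_loss(student_hiddens, teacher_hiddens,
                            student_exit_logits, teacher_exit_logits,
                            temperature=2.0):
    """Hypothetical layer-wise recovery objective: match each layer's hidden
    representation and each exit's prediction against the multi-exit TA."""
    loss = 0.0
    # Hidden-representation objective (a linear projection would be needed
    # first if the slenderized width differs from the teacher's; omitted here).
    for h_s, h_t in zip(student_hiddens, teacher_hiddens):
        loss = loss + F.mse_loss(h_s, h_t)
    # Prediction objective: temperature-scaled KL divergence at every exit.
    for z_s, z_t in zip(student_exit_logits, teacher_exit_logits):
        log_p_s = F.log_softmax(z_s / temperature, dim=-1)
        p_t = F.softmax(z_t / temperature, dim=-1)
        loss = loss + F.kl_div(log_p_s, p_t, reduction="batchmean") * temperature ** 2
    return loss
```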
Specifically, the contributions of this paper are as
follows.
• To comprehensively optimize the spatial and temporal efficiency of PLMs, we leverage both static slenderization and dynamic acceleration from the perspectives of model scale and variable computation.
• We propose a collaborative training approach that calibrates the slenderization under the guidance of intermediate exits and mitigates the inconsistency of early exiting.
• Experiments conducted on the GLUE benchmark verify the Pareto optimality of our approach. COST-EFF achieves 96.5% of the performance of fine-tuned BERT-Base with approximately 1/8 of the parameters and 1/19 of the FLOPs, without any form of data augmentation.¹
2 Related Work
The compression and acceleration of PLMs have recently been investigated by various means to reduce the overhead of large models.
The targets of structured pruning include, from small to large, hidden dimensions (Wang et al., 2020), attention heads (Michel et al., 2019), entire multi-head attention (MHA) and feed-forward network (FFN) modules (Xia et al., 2022), and whole Transformer layers (Fan et al., 2020). Considering the benefit of the overall structure, we keep all the modules while reducing their sizes. Besides pruning whole structures, a finer-grained alternative is unstructured pruning, which removes individual weights. Unstructured pruning can reach sparsity as high as 97% (Xu et al., 2022) but is not yet well supported by general computing platforms and hardware.
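As a schematic example of head-level structured pruning, in the spirit of (Michel et al., 2019) though not their exact procedure, the snippet below scores each attention head with a simple norm-based proxy and retains only the top-scoring heads of the output projection; the weight layout and the importance proxy are assumptions made for illustration.

```python
import torch

def prune_attention_heads(out_proj_weight, num_heads, keep_ratio=0.5):
    """Schematic head-level structured pruning of an attention output projection.

    out_proj_weight: tensor of shape [hidden_dim, num_heads * head_dim]
    (a torch.nn.Linear weight). Each head's importance is approximated by the
    L2 norm of its slice; gradient-based proxies could be substituted."""
    hidden_dim, total_dim = out_proj_weight.shape
    head_dim = total_dim // num_heads
    per_head = out_proj_weight.view(hidden_dim, num_heads, head_dim)
    importance = per_head.pow(2).sum(dim=(0, 2)).sqrt()   # one score per head
    num_keep = max(1, int(num_heads * keep_ratio))
    kept = torch.topk(importance, num_keep).indices.sort().values
    # Retain only the columns that belong to the kept heads.
    pruned = per_head[:, kept, :].reshape(hidden_dim, num_keep * head_dim)
    return pruned, kept
```

In practice, the query, key, and value projections would be pruned consistently with the same head indices.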
During the recovery training of compressed models, knowledge distillation objectives include the predictions of classifiers (Sanh et al., 2020), the features of intermediate representations (Jiao et al., 2020), and the relations between samples (Tung and Mori, 2019). The stage at which distillation is applied also varies between general pre-training and task-specific fine-tuning (Turc et al., 2019). Distillation enables training without ground-truth labels, which complements data augmentation. In this paper, data augmentation is not leveraged because it requires a long training time, but our approach is readily compatible with it if better performance is pursued.
Dynamic early exiting originates from BranchyNet (Teerapittayanon et al., 2016), which introduces exit branches after specific convolution layers of a CNN. The idea has been adapted to PLMs as Transformer layer-wise early exiting (Xin et al., 2021; Zhou et al., 2020; Liu et al., 2020). However, early exiting only accelerates inference; it reduces neither the model size nor the redundancy in width. Furthermore, owing to the inconsistency between shallow and deep layers, it is hard to achieve a high speedup with early exiting alone.
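To make the layer-wise early-exiting mechanism concrete, the sketch below runs Transformer layers one at a time and returns as soon as an internal classifier is sufficiently confident, measured here by prediction entropy. The `layers` and `exit_classifiers` modules and the entropy criterion are illustrative assumptions rather than the exact rule of any cited work.

```python
import torch.nn.functional as F

def early_exit_inference(hidden, layers, exit_classifiers, entropy_threshold=0.3):
    """Illustrative layer-wise early exiting (typically run with batch size 1)."""
    for depth, (layer, classifier) in enumerate(zip(layers, exit_classifiers), start=1):
        hidden = layer(hidden)                 # one Transformer layer
        logits = classifier(hidden[:, 0])      # classify on the [CLS] position
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        if entropy.max().item() < entropy_threshold:   # confident enough: exit here
            return logits, depth               # prediction and exit depth
    return logits, depth                       # fell through to the final exit
```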
The prevailing PLMs, e.g., RoBERTa (Liu et al., 2019) and XLNet (Yang et al., 2019), are variants of the Transformer with similar overall structures and are thus well adaptable to the optimizations that we propose.
¹ Code is available at https://github.com/sbwww/COST-EFF.