Model Cascading: Towards Jointly Improving Efficiency and Accuracy of
NLP Systems
Neeraj Varshney and Chitta Baral
Arizona State University
{nvarshn2, cbaral}@asu.edu
Abstract
Do all instances need inference through the big models for a correct prediction? Perhaps not; some instances are easy and can be answered correctly by even small capacity models. This provides opportunities for improving the computational efficiency of systems. In this work, we present an explorative study on 'model cascading', a simple technique that utilizes a collection of models of varying capacities to accurately yet efficiently output predictions. Through comprehensive experiments in multiple task settings that differ in the number of models available for cascading (K value), we show that cascading improves both the computational efficiency and the prediction accuracy. For instance, in the K=3 setting, cascading saves up to 88.93% computation cost and consistently achieves superior prediction accuracy with an improvement of up to 2.18%. We also study the impact of introducing additional models in the cascade and show that it further increases the efficiency improvements. Finally, we hope that our work will facilitate the development of efficient NLP systems, making their widespread adoption in real-world applications possible.
1 Introduction
Pre-trained language models such as RoBERTa (Liu et al., 2019), ELECTRA (Clark et al., 2020), and T5 (Raffel et al., 2020) have achieved remarkable performance on numerous natural language processing benchmarks (Wang et al., 2018, 2019; Talmor et al., 2019). However, these models have a large number of parameters, which makes them slow and computationally expensive; for instance, T5-11B requires $87 \times 10^{11}$ floating point operations (FLOPs) for an inference. This limits their widespread adoption in real-world applications that prefer computationally efficient systems in order to achieve low response times.
Figure 1: Illustration of a cascading approach with three models (Mini, Med, and Base) arranged in increasing order of capacity. An input is first passed through the smallest model (Mini), which fails to predict with sufficient confidence. It is therefore inferred using a bigger model (Med) that satisfies the confidence constraint, and the system outputs its prediction ('contradiction', as a dog has four legs). Thus, by avoiding inference through large/expensive models, the system saves computation cost without sacrificing accuracy.

The above concern has recently received considerable attention from the NLP community, leading
to the development of several techniques, such as (1) network pruning that progressively removes model weights from a big network (Wang et al., 2020; Guo et al., 2021), (2) early exiting that allows multiple exit paths in a model (Xin et al., 2020), (3) adaptive inference that adjusts model size by adaptively selecting its width and depth (Goyal et al., 2020; Kim and Cho, 2021), (4) knowledge distillation that transfers 'dark knowledge' from a large teacher model to a shallow student model (Jiao et al., 2020; Li et al., 2022), and (5) input reduction that eliminates less contributing tokens from the input text to speed up inference (Modarressi et al., 2022). These methods typically require architectural modifications, network manipulation,
saliency quantification, or even complex training procedures. Moreover, computational efficiency in these methods often comes with a compromise on accuracy. In contrast, model cascading, a simple technique that utilizes a collection of models of varying capacities to accurately yet efficiently output predictions, has remained underexplored.
In this work, we address the above limitation by first providing a mathematical formulation of model cascading and then exploring several approaches to do it. In its problem setup, a collection of models of different capacities (and hence performances) is provided and the system needs to output its prediction by leveraging one or more models. On one extreme, the system can use only the smallest model and, on the other extreme, it can use all the available models (ensembling). The former system would be highly efficient but usually poor in performance, while the latter system would be fairly accurate but expensive in computation. Model cascading strives to get the best of both worlds by allowing the system to efficiently utilize the available models while achieving high prediction accuracy. This is in line with the 'Efficient NLP' policy document put up by the ACL community (Arase et al., 2021).
Consider the case of the CommitmentBank (de Marneffe et al., 2019) dataset, on which the BERT-medium model, having just 41.7M parameters, achieves 75% accuracy while the bigger BERT-base model, having 110M parameters, achieves 82% accuracy. From this, it is clear that the performance of the bigger model can be matched by inferring a large number of instances using the smaller model and only a few instances using the bigger model. Thus, by carefully deciding when to use bigger/more expensive models, the computational efficiency of NLP systems can be improved. So, how should we decide which model(s) to use for a given test instance? Figure 1 illustrates an approach to achieve this; it infers an instance sequentially through the models (ordered in increasing order of capacity) and uses a threshold over the maximum softmax probability (MaxProb) to decide whether to output the prediction or pass it to the next model in the sequence. The intuition behind this approach is that MaxProb shows a positive correlation with predictive correctness. Thus, instances that are predicted with high MaxProb get answered at early stages, as their predictions are likely to be correct, and the remaining ones get passed to the larger models. Hence, by avoiding inference through large and expensive models (primarily for easy instances), cascading makes the system computationally efficient while maintaining high prediction performance.
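To make the procedure concrete, the following is a minimal sketch of this sequential MaxProb-threshold cascade; the `models` list, the `predict_probs` helper, and the 0.9 threshold are illustrative assumptions rather than the authors' released implementation.

```python
import numpy as np

def cascade_predict(x, models, predict_probs, threshold=0.9):
    """Run models (ordered smallest to largest) until one is confident enough.

    `models` is a list of classifiers ordered by increasing capacity,
    `predict_probs(model, x)` is assumed to return a softmax distribution
    over labels, and `threshold` is the MaxProb cutoff for early exit.
    """
    for i, model in enumerate(models):
        probs = predict_probs(model, x)      # softmax over classes
        max_prob = float(np.max(probs))      # MaxProb confidence
        if max_prob >= threshold or i == len(models) - 1:
            # Confident enough, or no bigger model left: output prediction.
            return int(np.argmax(probs)), i  # (predicted label, index of model used)
```

Lowering the threshold makes the system cheaper (more instances exit early), while raising it pushes more instances to the larger models; this is how a single cascade can be tuned to different computation budgets.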
We describe several such cascading methods in Section 3.2. Furthermore, cascading allows custom computation costs as different numbers of models can be used for inference. We compute accuracies for a range of costs and plot an accuracy-cost curve. Then, we calculate its area (AUC) to quantify the efficacy of the cascading method. The larger the AUC value, the better the method, as it implies higher accuracy on average across computation costs.
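The snippet below is a minimal sketch, under our own assumptions, of how such an area can be computed with the trapezoidal rule from (cost, accuracy) pairs collected at different operating points of the cascade; the function name, the toy numbers, and the normalization by the cost range are illustrative choices, not the paper's exact procedure.

```python
import numpy as np

def accuracy_cost_auc(costs, accuracies):
    """Area under the accuracy-cost curve (higher is better).

    `costs` and `accuracies` are paired measurements obtained by sweeping
    the cascade's confidence threshold(s); the area is normalized by the
    cost range so that it stays on the same scale as accuracy.
    """
    order = np.argsort(costs)
    c = np.asarray(costs, dtype=float)[order]
    a = np.asarray(accuracies, dtype=float)[order]
    # Trapezoidal rule: segment width times average segment height.
    area = float(np.sum((c[1:] - c[:-1]) * (a[1:] + a[:-1]) / 2.0))
    return area / (c[-1] - c[0])

# Illustrative numbers only: three operating points of a hypothetical cascade.
print(accuracy_cost_auc(costs=[0.2, 0.5, 1.0], accuracies=[0.75, 0.80, 0.82]))
```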
We conduct comprehensive experiments with 10 diverse NLU datasets in multiple task settings that differ in the number of models available for cascading (K value from Section 3). We first demonstrate that cascading achieves considerable improvement in computational efficiency. For example, in the case of the QQP dataset, the cascading system achieves 88.93% computation improvement over the largest model (M3) in the K=3 setting, i.e., it requires just 11.07% of the computation cost of model M3 to attain equal accuracy. Then, we show that cascading also achieves improvement in prediction accuracy. For example, on the CB dataset, the cascading system achieves a 2.18% accuracy improvement over M3 in the K=3 setting. Similar improvements are observed in settings with different values of K. Lastly, we show that introducing an additional model in the cascade further increases the efficiency benefits.
In summary, our contributions and findings are:
1. Model Cascading: We provide a mathematical formulation of model cascading, explore several methods, and systematically study its benefits.
2. Cascading Improves Efficiency: Using accuracy-cost curves, we show that cascading systems require much less computation cost to attain accuracies equal to that of big models.
3. Cascading Improves Accuracy: We show that cascading systems consistently achieve superior prediction performance compared to even the largest model available in the task setting.
4. Comparison of Cascading Methods: We compare the performance of our proposed cascading methods and find that DTU (Section 3.2) outperforms all others by achieving the highest AUC of accuracy-cost curves on average.

We note that model cascading is trivially easy to implement, can be applied to a variety of problems, and can have good practical value.
2 Related Work
In recent times, several techniques have been developed to improve the efficiency of NLP systems, such as network pruning (Wang et al., 2020; Guo et al., 2021; Chen et al., 2020), quantization (Shen et al., 2020; Zhang et al., 2020; Tao et al., 2022), knowledge distillation (Clark et al., 2019; Jiao et al., 2020; Li et al., 2022; Mirzadeh et al., 2020), and input reduction (Modarressi et al., 2022). Our work is more closely related to dynamic inference (Xin et al., 2020) and adaptive model size (Goyal et al., 2020; Kim and Cho, 2021; Hou et al., 2020; Soldaini and Moschitti, 2020).
Xin et al. (2020) proposed Dynamic early exiting for BERT (DeeBERT), which speeds up BERT inference by inserting extra classification layers between transformer layers. It allows an instance to choose a conditional exit from multiple exit paths. All the weights (including the newly introduced classification layers) are jointly learnt during training.
Goyal et al. (2020) proposed Progressive Word-vector Elimination (PoWER-BERT), which reduces the intermediate vectors computed along the encoder pipeline. They eliminate vectors based on significance computed using the self-attention mechanism. Kim and Cho (2021) extended PoWER-BERT to the Length-Adaptive Transformer, which adaptively determines the sequence length at each layer. Hou et al. (2020) proposed a dynamic BERT model (DynaBERT) that adjusts the size of the model by selecting adaptive width and depth. They first train a width-adaptive BERT and then distill knowledge from full-size models to small sub-models.
Lastly, cascading has been studied in machine learning and vision with approaches such as the Haar cascade (Soo, 2014) but is underexplored in NLP. We further note that cascading is non-trivially different from 'ensembling', as ensembling always uses all the available models instead of carefully selecting one or more models for inference.
Our work differs from existing methods in the following aspects: (1) Existing methods typically require architectural changes, network manipulation, saliency quantification, knowledge distillation, or complex training procedures. In contrast, cascading is a simple technique that is easy to implement and does not require such modifications. (2) The computational efficiency of existing methods often comes with a compromise on accuracy. Contrary to this, we show that model cascading surpasses the accuracy of even the largest models. (3) Existing methods typically require training a separate model for each computation budget; on the other hand, a single cascading system can be adjusted to meet all the computation constraints. (4) Finally, cascading does not require an instance to be passed sequentially through the model layers; approaches such as routing (Section 3) allow passing it directly to a suitable model.
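As a rough illustration of the routing alternative mentioned in point (4), the sketch below picks one model directly instead of passing the instance through the cascade; the length-based difficulty heuristic and all names are our own assumptions for illustration, not a method proposed in this paper.

```python
def length_based_router(x, num_models, max_len=256):
    """Toy difficulty proxy (an assumption for illustration): shorter inputs
    go to smaller models, longer inputs to bigger models."""
    bucket = len(x.split()) * num_models // max_len
    return min(bucket, num_models - 1)

def route_and_predict(x, models, predict, router=length_based_router):
    """Routing variant: choose one suitable model directly rather than
    inferring sequentially through the cascade."""
    j = router(x, len(models))       # index of the chosen model
    return predict(models[j], x), j  # (prediction, model index used)
```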
3 Model Cascading
We define model cascading as follows:
Given a collection of models of varying capacities, the system needs to leverage one or more models in a computationally efficient way to output accurate predictions.

As previously mentioned, a system using only the smallest model would be highly efficient but poor in accuracy, and a system using all the available models would be fairly accurate but expensive in computation. The goal of cascading is to achieve high prediction accuracy while efficiently leveraging the available models. The remainder of this section is organized as follows: we provide the mathematical formulation of cascading in Section 3.1 and describe its various approaches in Section 3.2.
3.1 Formulation
Consider a collection of $K$ trained models ($M_1, \ldots, M_K$) ordered in increasing order of their computation cost, i.e., for an instance $x$, $c^x_j < c^x_k$ for $j < k$, where $c$ corresponds to the cost of inference.
The system needs to output a prediction for each instance of the evaluation dataset $D$ by leveraging one or more models. Let $M^x_j$ be a function that indicates whether model $M_j$ is used by the system to make an inference for the instance $x$, i.e.,

$$M^x_j = \begin{cases} 1, & \text{if model } M_j \text{ is used for instance } x \\ 0, & \text{otherwise} \end{cases}$$

Thus, the average cost of the system for the entire evaluation dataset $D$ is calculated as:

$$\text{Cost}_D = \frac{\sum_{x_i \in D} \sum_{j=1}^{K} M^{x_i}_j \times c^{x_i}_j}{|D|}$$
In addition to this cost, we also measure accuracy, i.e., the percentage of correct predictions by the system. The goal is to achieve high prediction accuracy while being computationally efficient.
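As a small illustration of the cost metric above, the sketch below computes Cost_D from per-instance usage indicators and per-model inference costs; the function, variable names, and toy numbers are our own assumptions.

```python
def average_cost(usage, costs):
    """Average cost Cost_D over the evaluation dataset D.

    `usage[i][j]` is 1 if model M_j was used for instance x_i (else 0), and
    `costs[i][j]` is the inference cost c^{x_i}_j of model M_j on x_i
    (e.g., measured in FLOPs).
    """
    total = sum(
        u_ij * c_ij
        for usage_i, costs_i in zip(usage, costs)
        for u_ij, c_ij in zip(usage_i, costs_i)
    )
    return total / len(usage)

# Example: 2 instances, K=3 models; instance 1 exits after M1,
# instance 2 needs M1 and M2.
usage = [[1, 0, 0], [1, 1, 0]]
costs = [[1.0, 3.0, 10.0], [1.0, 3.0, 10.0]]
print(average_cost(usage, costs))  # (1.0 + 1.0 + 3.0) / 2 = 2.5
```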