Model Cascading: Towards Jointly Improving Efficiency and Accuracy of
NLP Systems
Neeraj Varshney and Chitta Baral
Arizona State University
{nvarshn2, cbaral}@asu.edu
Abstract
Do all instances need inference through the big models for a correct prediction? Perhaps not; some instances are easy and can be answered correctly by even small capacity models. This provides opportunities for improving the computational efficiency of systems. In this work, we present an explorative study on 'model cascading', a simple technique that utilizes a collection of models of varying capacities to accurately yet efficiently output predictions. Through comprehensive experiments in multiple task settings that differ in the number of models available for cascading (K value), we show that cascading improves both the computational efficiency and the prediction accuracy. For instance, in the K=3 setting, cascading saves up to 88.93% computation cost and consistently achieves superior prediction accuracy with an improvement of up to 2.18%. We also study the impact of introducing additional models in the cascade and show that it further increases the efficiency improvements. Finally, we hope that our work will facilitate the development of efficient NLP systems, making their widespread adoption in real-world applications possible.
1 Introduction
Pre-trained language models such as RoBERTa (Liu et al., 2019), ELECTRA (Clark et al., 2020), and T5 (Raffel et al., 2020) have achieved remarkable performance on numerous natural language processing benchmarks (Wang et al., 2018, 2019; Talmor et al., 2019). However, these models have a large number of parameters, which makes them slow and computationally expensive; for instance, T5-11B requires $87 \times 10^{11}$ floating point operations (FLOPs) for an inference. This limits their widespread adoption in real-world applications that prefer computationally efficient systems in order to achieve low response times.
Figure 1: Illustration of a cascading approach with three models (Mini, Med, and Base) arranged in increasing order of capacity. An input is first passed through the smallest model (Mini), which fails to predict with sufficient confidence. It is therefore inferred using a bigger model (Med) that satisfies the confidence constraint, and the system outputs its prediction ('contradiction', as a dog has four legs). Thus, by avoiding inference through large/expensive models, the system saves computation cost without sacrificing accuracy.

The above concern has recently received considerable attention from the NLP community, leading
to the development of several techniques, such as (1) network pruning that progressively removes model weights from a big network (Wang et al., 2020; Guo et al., 2021), (2) early exiting that allows multiple exit paths in a model (Xin et al., 2020), (3) adaptive inference that adjusts model size by adaptively selecting its width and depth (Goyal et al., 2020; Kim and Cho, 2021), (4) knowledge distillation that transfers 'dark knowledge' from a large teacher model to a shallow student model (Jiao et al., 2020; Li et al., 2022), and (5) input reduction that eliminates less contributing tokens from the input text to speed up inference (Modarressi et al., 2022). These methods typically require architectural modifications, network manipulation,
saliency quantification, or even complex training procedures. Moreover, computational efficiency in these methods often comes with a compromise on accuracy. In contrast, model cascading, a simple technique that utilizes a collection of models of varying capacities to accurately yet efficiently output predictions, has remained underexplored.
In this work, we address the above limitation by first providing a mathematical formulation of model cascading and then exploring several approaches to do it. In its problem setup, a collection of models of different capacities (and hence performances) is provided and the system needs to output its prediction by leveraging one or more models. On one extreme, the system can use only the smallest model and, on the other extreme, it can use all the available models (ensembling). The former system would be highly efficient but usually poor in performance, while the latter system would be fairly accurate but expensive in computation. Model cascading strives to get the best of both worlds by allowing the system to efficiently utilize the available models while achieving high prediction accuracy. This is in line with the 'Efficient NLP' policy document put up by the ACL community (Arase et al., 2021).
Consider the case of the CommitmentBank (de Marneffe et al., 2019) dataset, on which the BERT-medium model, having just 41.7M parameters, achieves 75% accuracy while the bigger BERT-base model, having 110M parameters, achieves 82% accuracy. From this, it is clear that the performance of the bigger model can be matched by inferring a large number of instances using the smaller model and only a few instances using the bigger model. Thus, by carefully deciding when to use bigger/more expensive models, the computational efficiency of NLP systems can be improved. So, how should we decide which model(s) to use for a given test instance? Figure 1 illustrates an approach to achieve this; it infers an instance sequentially through the models (ordered in increasing order of capacity) and uses a threshold over the maximum softmax probability (MaxProb) to decide whether to output the prediction or pass it to the next model in the sequence. The intuition behind this approach is that MaxProb shows a positive correlation with predictive correctness. Thus, instances that are predicted with high MaxProb get answered at early stages, as their predictions are likely to be correct, and the remaining ones get passed to the larger models. Hence, by avoiding inference through large and expensive models (primarily for easy instances), cascading makes the system computationally efficient while maintaining high prediction performance.
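To make the procedure concrete, the following is a minimal sketch of this sequential MaxProb-threshold cascade; the `models` list, the `predict_probs` helper, and the 0.9 threshold are illustrative assumptions rather than the authors' released implementation.

```python
import numpy as np

def cascade_predict(x, models, predict_probs, threshold=0.9):
    """Run models (ordered smallest to largest) until one is confident enough.

    `models` is a list of classifiers ordered by increasing capacity,
    `predict_probs(model, x)` is assumed to return a softmax distribution
    over labels, and `threshold` is the MaxProb cutoff for early exit.
    """
    for i, model in enumerate(models):
        probs = predict_probs(model, x)      # softmax over classes
        max_prob = float(np.max(probs))      # MaxProb confidence
        if max_prob >= threshold or i == len(models) - 1:
            # Confident enough, or no bigger model left: output prediction.
            return int(np.argmax(probs)), i  # (predicted label, index of model used)
```

Lowering the threshold makes the system cheaper (more instances exit early), while raising it pushes more instances to the larger models; this is how a single cascade can be tuned to different computation budgets.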
We describe several such cascading methods in Section 3.2. Furthermore, cascading allows custom computation costs as different numbers of models can be used for inference. We compute accuracies for a range of costs and plot an accuracy-cost curve. Then, we calculate its area (AUC) to quantify the efficacy of the cascading method. The larger the AUC value, the better the method, as it implies higher accuracy on average across computation costs.
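The snippet below is a minimal sketch, under our own assumptions, of how such an area can be computed with the trapezoidal rule from (cost, accuracy) pairs collected at different operating points of the cascade; the function name, the toy numbers, and the normalization by the cost range are illustrative choices, not the paper's exact procedure.

```python
import numpy as np

def accuracy_cost_auc(costs, accuracies):
    """Area under the accuracy-cost curve (higher is better).

    `costs` and `accuracies` are paired measurements obtained by sweeping
    the cascade's confidence threshold(s); the area is normalized by the
    cost range so that it stays on the same scale as accuracy.
    """
    order = np.argsort(costs)
    c = np.asarray(costs, dtype=float)[order]
    a = np.asarray(accuracies, dtype=float)[order]
    # Trapezoidal rule: segment width times average segment height.
    area = float(np.sum((c[1:] - c[:-1]) * (a[1:] + a[:-1]) / 2.0))
    return area / (c[-1] - c[0])

# Illustrative numbers only: three operating points of a hypothetical cascade.
print(accuracy_cost_auc(costs=[0.2, 0.5, 1.0], accuracies=[0.75, 0.80, 0.82]))
```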
We conduct comprehensive experiments with 10 diverse NLU datasets in multiple task settings that differ in the number of models available for cascading (K value from Section 3). We first demonstrate that cascading achieves considerable improvement in computational efficiency. For example, in the case of the QQP dataset, the cascading system achieves 88.93% computation improvement over the largest model (M3) in the K=3 setting, i.e., it requires just 11.07% of the computation cost of model M3 to attain equal accuracy. Then, we show that cascading also achieves improvement in prediction accuracy. For example, on the CB dataset, the cascading system achieves a 2.18% accuracy improvement over M3 in the K=3 setting. Similar improvements are observed in settings with different values of K. Lastly, we show that introducing an additional model in the cascade further increases the efficiency benefits.
In summary, our contributions and findings are:
1. Model Cascading: We provide a mathematical formulation of model cascading, explore several methods, and systematically study its benefits.
2. Cascading Improves Efficiency: Using accuracy-cost curves, we show that cascading systems require much less computation cost to attain accuracies equal to that of big models.
3. Cascading Improves Accuracy: We show that cascading systems consistently achieve superior prediction performance compared to even the largest model available in the task setting.
4. Comparison of Cascading Methods: We compare the performance of our proposed cascading methods and find that DTU (Section 3.2) outperforms all others by achieving the highest AUC of accuracy-cost curves on average.

We note that model cascading is trivially easy to implement, can be applied to a variety of problems, and can have good practical value.
2 Related Work
In recent times, several techniques have been developed to improve the efficiency of NLP systems, such as network pruning (Wang et al., 2020; Guo et al., 2021; Chen et al., 2020), quantization (Shen et al., 2020; Zhang et al., 2020; Tao et al., 2022), knowledge distillation (Clark et al., 2019; Jiao et al., 2020; Li et al., 2022; Mirzadeh et al., 2020), and input reduction (Modarressi et al., 2022). Our work is more closely related to dynamic inference (Xin et al., 2020) and adaptive model size (Goyal et al., 2020; Kim and Cho, 2021; Hou et al., 2020; Soldaini and Moschitti, 2020).
Xin et al. (2020) proposed Dynamic early exiting for BERT (DeeBERT), which speeds up BERT inference by inserting extra classification layers between transformer layers. It allows an instance to choose a conditional exit from multiple exit paths. All the weights (including the newly introduced classification layers) are jointly learnt during training.
Goyal et al. (2020) proposed Progressive Word-vector Elimination (PoWER-BERT), which reduces the intermediate vectors computed along the encoder pipeline. They eliminate vectors based on significance computed using the self-attention mechanism. Kim and Cho (2021) extended PoWER-BERT to the Length-Adaptive Transformer, which adaptively determines the sequence length at each layer. Hou et al. (2020) proposed a dynamic BERT model (DynaBERT) that adjusts the size of the model by selecting adaptive width and depth. They first train a width-adaptive BERT and then distill knowledge from full-size models to small sub-models.
Lastly, cascading has been studied in machine learning and vision with approaches such as the Haar cascade (Soo, 2014) but is underexplored in NLP. We further note that cascading is non-trivially different from 'ensembling', as ensembling always uses all the available models instead of carefully selecting one or more models for inference.
Our work differs from existing methods in the following aspects: (1) Existing methods typically require architectural changes, network manipulation, saliency quantification, knowledge distillation, or complex training procedures. In contrast, cascading is a simple technique that is easy to implement and does not require such modifications. (2) The computational efficiency of existing methods often comes with a compromise on accuracy. Contrary to this, we show that model cascading surpasses the accuracy of even the largest models. (3) Existing methods typically require training a separate model for each computation budget; on the other hand, a single cascading system can be adjusted to meet all the computation constraints. (4) Finally, cascading does not require an instance to be passed sequentially through the model layers; approaches such as routing (Section 3) allow passing it directly to a suitable model.
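As a rough illustration of the routing alternative mentioned in point (4), the sketch below picks one model directly instead of passing the instance through the cascade; the length-based difficulty heuristic and all names are our own assumptions for illustration, not a method proposed in this paper.

```python
def length_based_router(x, num_models, max_len=256):
    """Toy difficulty proxy (an assumption for illustration): shorter inputs
    go to smaller models, longer inputs to bigger models."""
    bucket = len(x.split()) * num_models // max_len
    return min(bucket, num_models - 1)

def route_and_predict(x, models, predict, router=length_based_router):
    """Routing variant: choose one suitable model directly rather than
    inferring sequentially through the cascade."""
    j = router(x, len(models))       # index of the chosen model
    return predict(models[j], x), j  # (prediction, model index used)
```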
3 Model Cascading
We define model cascading as follows:
Given a collection of models of varying capacities, the system needs to leverage one or more models in a computationally efficient way to output accurate predictions.

As previously mentioned, a system using only the smallest model would be highly efficient but poor in accuracy, and a system using all the available models would be fairly accurate but expensive in computation. The goal of cascading is to achieve high prediction accuracy while efficiently leveraging the available models. The remainder of this section is organized as follows: we provide the mathematical formulation of cascading in Section 3.1 and describe its various approaches in Section 3.2.
3.1 Formulation
Consider a collection of $K$ trained models ($M_1, \ldots, M_K$) ordered in increasing order of their computation cost, i.e., for an instance $x$, $c^x_j < c^x_k$ for $j < k$, where $c$ corresponds to the cost of inference.
The system needs to output a prediction for each instance of the evaluation dataset $D$ by leveraging one or more models. Let $M^x_j$ be a function that indicates whether model $M_j$ is used by the system to make an inference for the instance $x$, i.e.,

$$M^x_j = \begin{cases} 1, & \text{if model } M_j \text{ is used for instance } x \\ 0, & \text{otherwise} \end{cases}$$

Thus, the average cost of the system for the entire evaluation dataset $D$ is calculated as:

$$\text{Cost}_D = \frac{\sum_{x_i \in D} \sum_{j=1}^{K} M^{x_i}_j \times c^{x_i}_j}{|D|}$$
In addition to this cost, we also measure accuracy, i.e., the percentage of correct predictions by the system. The goal is to achieve high prediction accuracy while being computationally efficient.
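As a small illustration of the cost metric above, the sketch below computes Cost_D from per-instance usage indicators and per-model inference costs; the function, variable names, and toy numbers are our own assumptions.

```python
def average_cost(usage, costs):
    """Average cost Cost_D over the evaluation dataset D.

    `usage[i][j]` is 1 if model M_j was used for instance x_i (else 0), and
    `costs[i][j]` is the inference cost c^{x_i}_j of model M_j on x_i
    (e.g., measured in FLOPs).
    """
    total = sum(
        u_ij * c_ij
        for usage_i, costs_i in zip(usage, costs)
        for u_ij, c_ij in zip(usage_i, costs_i)
    )
    return total / len(usage)

# Example: 2 instances, K=3 models; instance 1 exits after M1,
# instance 2 needs M1 and M2.
usage = [[1, 0, 0], [1, 1, 0]]
costs = [[1.0, 3.0, 10.0], [1.0, 3.0, 10.0]]
print(average_cost(usage, costs))  # (1.0 + 1.0 + 3.0) / 2 = 2.5
```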