saliency quantification, or even complex training procedures. Moreover, computational efficiency in these methods often comes with a compromise on accuracy. In contrast, model cascading, a simple technique that utilizes a collection of models of varying capacities to output predictions accurately yet efficiently, has remained underexplored.
In this work, we address the above limitation by first providing a mathematical formulation of model cascading and then exploring several approaches to perform it. In this problem setup, a collection of models of different capacities (and hence performances) is provided, and the system needs to output its prediction by leveraging one or more of these models. On one extreme, the system can use only the smallest model; on the other, it can use all the available models (ensembling). The former system would be highly efficient but usually poor in performance, while the latter would be fairly accurate but computationally expensive. Model cascading strives to get the best of both worlds by allowing the system to utilize the available models efficiently while achieving high prediction accuracy. This is in line with the ‘Efficiency NLP’ policy document (Arase et al., 2021) put forward by the ACL community.
Consider the CommitmentBank (CB) dataset (de Marneffe et al., 2019), on which the BERT-medium model, with just 41.7M parameters, achieves 75% accuracy, while the bigger BERT-base model, with 110M parameters, achieves 82%. From this, it is clear that the performance of the bigger model can be matched by inferring a large number of instances using the smaller model and only a few using the bigger model. Thus, by carefully deciding when to use bigger, more expensive models, the computational efficiency of NLP systems can be improved.
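As a back-of-the-envelope illustration (the notation here is ours, introduced only for this example): suppose a two-model cascade answers a fraction $p$ of instances with the smaller model at per-instance cost $c_s$ and forwards the rest through the larger model at additional cost $c_l$. The expected per-instance cost is
$$\mathbb{E}[\text{cost}] = p\,c_s + (1-p)(c_s + c_l) = c_s + (1-p)\,c_l,$$
which approaches the small model's cost $c_s$ as $p$ grows, while accuracy on the hard remainder is preserved by the large model.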
So, how should we decide which model(s) to use for a given test instance? Figure 1 illustrates an approach to achieve this: it infers an instance sequentially through the models (ordered by increasing capacity) and uses a threshold over the maximum softmax probability (MaxProb) to decide whether to output the prediction or pass the instance to the next model in the sequence. The intuition behind this approach is that MaxProb is positively correlated with predictive correctness. Thus, instances predicted with high MaxProb are answered at early stages, as their predictions are likely to be correct, while the remaining ones are passed on to the larger models. Hence, by avoiding inference through large and expensive models (primarily for easy instances), cascading makes the system computationally efficient while maintaining high prediction performance.
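To make this concrete, the following minimal Python sketch implements the MaxProb-based cascade; the function name cascade_predict, the model interface (callables returning a softmax probability vector), and the per-model thresholds are illustrative stand-ins, not the paper's implementation.

import numpy as np

def cascade_predict(x, models, thresholds):
    # `models`: callables mapping an instance to a softmax probability
    # vector, ordered by increasing capacity (hypothetical interface).
    # `thresholds`: one MaxProb threshold per model except the last,
    # which always answers.
    for model, tau in zip(models[:-1], thresholds):
        probs = model(x)                  # softmax output of this model
        if np.max(probs) >= tau:          # confident: answer at this stage
            return int(np.argmax(probs))
    probs = models[-1](x)                 # largest model handles the rest
    return int(np.argmax(probs))

Note that thresholds of 0.0 reduce the cascade to always answering with the smallest model, while thresholds of 1.0 effectively forward everything to the largest model; sweeping the thresholds between these extremes traces out the accuracy-cost trade-off.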
We describe several such cascading methods in Section 3.2. Furthermore, cascading allows custom computation costs, as different numbers of models can be used for inference. We compute accuracies for a range of costs and plot an accuracy-cost curve; we then calculate the area under this curve (AUC) to quantify the efficacy of the cascading method. The larger the AUC value, the better the method, as it implies higher accuracy on average across computation costs.
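For instance, given accuracy measurements at a grid of computation costs, the AUC can be computed with a simple trapezoidal rule, as in the sketch below; it assumes costs have been normalized to a common range (our own convention for this illustration) so that AUC values are comparable across methods.

import numpy as np

def accuracy_cost_auc(costs, accuracies):
    # Sort by cost so the trapezoidal rule integrates left to right.
    order = np.argsort(costs)
    c = np.asarray(costs, dtype=float)[order]
    a = np.asarray(accuracies, dtype=float)[order]
    return np.trapz(a, c)  # area under the accuracy-cost curve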
We conduct comprehensive experiments with 10 diverse NLU datasets in multiple task settings that differ in the number of models available for cascading (the K value from Section 3). We first demonstrate that cascading achieves considerable improvements in computational efficiency. For example, on the QQP dataset, the cascading system achieves an 88.93% computation improvement over the largest model (M3) in the K=3 setting, i.e., it requires just 11.07% of the computation cost of model M3 to attain equal accuracy. Then, we show that cascading also improves prediction accuracy. For example, on the CB dataset, the cascading system achieves a 2.18% accuracy improvement over M3 in the K=3 setting. Similar improvements are observed in settings with different values of K. Lastly, we show that introducing an additional model in the cascade further increases the efficiency benefits.
In summary, our contributions and findings are:
1. Model Cascading: We provide a mathematical formulation of model cascading, explore several methods, and systematically study its benefits.
2. Cascading Improves Efficiency: Using accuracy-cost curves, we show that cascading systems require much lower computation cost to attain accuracies equal to those of the big models.
3. Cascading Improves Accuracy: We show that cascading systems consistently achieve prediction performance superior to even the largest model available in the task setting.
4. Comparison of Cascading Methods: We compare the performance of our proposed cascading methods and find that DTU (Section 3.2) outperforms all the others, achieving the highest AUC of the accuracy-cost curves on average.
We note that model cascading is trivially easy to implement, can be applied to a variety of problems, and has good practical value.