• We design a novel 2D query propagation pipeline that “unfolds” multi-stage prediction workflows by leveraging the “happens-before” relationship between the stages, and achieves a lower-cost prediction pipeline with minimal accuracy degradation.
• We propose two learning algorithms to efficiently navigate the cost/accuracy tradeoff space and search for an optimal set of policies for the designed 2D query propagation.
• We apply the proposed pipeline to two real-world applications and demonstrate that it reduces the spatio-temporal cost of inference by orders of magnitude.
2 Related Work
The most relevant work to our proposed method is the one-step IDK cascade (Wang et al., 2017), which incorporates prior work on “I don’t know” (IDK) classes (Trappenberg and Back, 2000; Khani et al., 2016) into cascade construction and introduces a latency-aware construction objective, in contrast to earlier cascaded prediction frameworks (Rowley et al., 1998; Viola and Jones, 2004; Angelova et al., 2015). Another line of work focuses on feature selection under the assumption that each feature can be acquired at a cost; a cascade of classifiers is trained to optimize the trade-off between the expected classification error and the feature cost. An early solution (Raykar et al., 2010) limits the cascade to a family of linear discriminant functions. Cai et al. (2015) apply boosting to cascade a set of weak learners. Recent methods (Trapeznikov and Saligrama, 2013; Clertant et al., 2019; Janisch et al., 2019) develop POMDP-based frameworks and incorporate deep Q-learning to train the cascades. In contrast to all of the above, which are 1D pipelines for a one-step prediction task (possibly multi-class classification), our method extends to a 2D pipeline that dynamically forwards examples to the next stage once they are confidently predicted to have passed the current one. Further, we develop a more efficient pipeline framework based on Mixture-of-Experts (MoE) modeling and knowledge distillation, whose parameters can be learned efficiently with gradient-descent algorithms.
The idea of MoE was originally introduced by Jacobs et al. (1991) to partition the training data and feed the partitions to separate neural networks during learning. This gated-decision design has been applied in many domains, such as language modeling (Ma et al., 2018), video captioning (Wang et al., 2019), and multi-task learning (Ma et al., 2018). It has also been used in network architecture search (Eigen et al., 2013) by placing gate activations on network layers. Sparse gates were introduced into MoE so that it can efficiently select from thousands of sub-networks (Shazeer et al., 2017) and can increase the representation power of large convolutional networks while using only a shallow embedding network to produce the mixture weights (Wang et al., 2020). We incorporate the idea of sparsely gated MoE (Shazeer et al., 2017; Wang et al., 2020) into our prediction framework and design a soft-gating training algorithm that uses ReLU as the sparse gating function and imposes L1-norm regularization on the gating weights for further sparsity.
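A minimal sketch of this soft-gating idea is shown below (illustrative PyTorch code, not our exact implementation; the module name `ReLUSparseGate` and the placement of the L1 penalty on the gate activations are assumptions for illustration):

```python
import torch
import torch.nn as nn

class ReLUSparseGate(nn.Module):
    """Sparse MoE gate: ReLU zeroes out many experts; an L1 term encourages further sparsity."""
    def __init__(self, in_dim: int, num_experts: int, l1_coef: float = 1e-3):
        super().__init__()
        self.proj = nn.Linear(in_dim, num_experts)
        self.l1_coef = l1_coef

    def forward(self, x: torch.Tensor):
        raw_gates = torch.relu(self.proj(x))          # (batch, num_experts), many exact zeros
        # L1 penalty on the un-normalized gate activations (it could instead be placed
        # on self.proj.weight); the caller adds it to the task loss.
        l1_penalty = self.l1_coef * raw_gates.sum(dim=-1).mean()
        gates = raw_gates / (raw_gates.sum(dim=-1, keepdim=True) + 1e-8)
        return gates, l1_penalty

# Usage: mix expert outputs with the sparse gates and train end-to-end by gradient descent.
batch, in_dim, num_experts = 4, 16, 8
x = torch.randn(batch, in_dim)
experts = nn.ModuleList([nn.Linear(in_dim, 2) for _ in range(num_experts)])
gate = ReLUSparseGate(in_dim, num_experts)

weights, l1_penalty = gate(x)                                        # (4, 8)
expert_out = torch.stack([expert(x) for expert in experts], dim=1)   # (4, 8, 2)
y = (weights.unsqueeze(-1) * expert_out).sum(dim=1)                  # (4, 2)
loss = y.pow(2).mean() + l1_penalty            # dummy task loss + sparsity term
loss.backward()
```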
A confidence criterion was incorporated into active learning by Li and Sethi (2006) and later extended by Zhu et al. (2010). Lei (2014) proposed confidence-based classifiers that identify a confident region (akin to our ICK class) and an uncertain region (akin to the IDK class) in predictions. Confidence has also been introduced into word embeddings (Vilnis and McCallum, 2015; Athiwaratkun and Wilson, 2018) and graph representations (Orbach and Crammer, 2012; Vashishth et al., 2019). Our method places thresholds on prediction confidence to activate the gates in pipeline expansion. Bayesian Prior Networks (BPNs) (Malinin and Gales, 2018) have been proposed to estimate the uncertainty distribution over model predictions and are more computationally efficient than traditional Bayesian approaches (MacKay, 1992; Mackay, 1992; Hinton and Van Camp, 1993). We propose Dirichlet Knowledge Distillation (DKD), built on BPNs, to distill the prediction uncertainty of large models, so that we only need to run a low-cost multi-head model to produce the MoE weights efficiently.
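A minimal sketch of one way such a Dirichlet-based distillation loss could be written is given below, assuming the low-cost student head outputs Dirichlet concentration parameters and is trained so that the large teacher's predictive distribution is likely under that Dirichlet; the function name and this particular form are illustrative, not necessarily the exact DKD objective:

```python
import torch
import torch.nn.functional as F

def dirichlet_distillation_loss(student_logits: torch.Tensor,
                                teacher_probs: torch.Tensor,
                                eps: float = 1e-6) -> torch.Tensor:
    """student_logits: (batch, C) raw outputs of the cheap student head.
    teacher_probs:  (batch, C) softmax outputs of the large teacher model."""
    # Map logits to positive Dirichlet concentrations alpha > 0.
    alpha = F.softplus(student_logits) + eps
    alpha0 = alpha.sum(dim=-1)                                   # Dirichlet precision
    # Negative Dirichlet log-likelihood of the teacher's probability vectors.
    log_norm = torch.lgamma(alpha).sum(-1) - torch.lgamma(alpha0)
    log_lik = ((alpha - 1.0) * torch.log(teacher_probs + eps)).sum(-1) - log_norm
    return -log_lik.mean()

# Usage: distill a frozen teacher's soft predictions into the student head.
student_logits = torch.randn(4, 3, requires_grad=True)
with torch.no_grad():
    teacher_probs = torch.softmax(torch.randn(4, 3), dim=-1)
loss = dirichlet_distillation_loss(student_logits, teacher_probs)
loss.backward()
```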
3 Multi-Stage Dynamic Prediction Pipeline
We introduce UnfoldML, a dynamic 2D prediction pipeline that learns the optimal policy for making “I confidently know” (ICK) predictions on sequential multi-stage classification tasks. An optimal policy effectively trades off prediction accuracy against spatio-temporal cost in order to maximize the overall system accuracy (AUC) while staying under user-imposed cost constraints.
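To make the setting concrete, the sketch below illustrates one simplified reading of such thresholded 2D propagation, under our own assumptions: stages are ordered horizontally, models within a stage are escalated from cheap to expensive vertically, and fixed per-stage ICK thresholds stand in for the policy that UnfoldML actually learns:

```python
from typing import Callable, List, Tuple

Model = Callable[[object], float]  # returns P(example passes this stage)

def propagate_2d(x: object,
                 stages: List[List[Model]],     # stages[s]: models ordered cheap -> expensive
                 ick_thresholds: List[float]) -> Tuple[int, float]:
    """Return (last stage reached, confidence there) for a single example."""
    stage_reached, conf = 0, 0.0
    for models, theta in zip(stages, ick_thresholds):
        forwarded = False
        for model in models:                    # escalate vertically until confident
            conf = model(x)
            if conf >= theta:                   # "I confidently know" it passes this stage
                forwarded = True
                break
            if conf <= 1.0 - theta:             # confidently does not pass: exit early
                return stage_reached, conf
        if not forwarded:                       # still uncertain after the priciest model
            return stage_reached, conf
        stage_reached += 1                      # horizontal move to the next stage
    return stage_reached, conf
```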