UnfoldML: Cost-Aware and Uncertainty-Based
Dynamic 2D Prediction for Multi-Stage Classification
Yanbo Xu1,∗, Alind Khare1,∗, Glenn Matlin1, Monish Ramadoss1,
Rishikesan Kamaleswaran2, Chao Zhang1, Alexey Tumanov1
1Georgia Institute of Technology,
2Emory University
Atlanta, GA
Abstract
Machine Learning (ML) research has focused on maximizing the accuracy of predictive tasks. ML models, however, are increasingly more complex, resource intensive, and costlier to deploy in resource-constrained environments. These issues are exacerbated for prediction tasks with sequential classification over progressively transitioned stages with a "happens-before" relation between them. We argue that it is possible to "unfold" a monolithic single multi-class classifier, typically trained for all stages using all data, into a series of single-stage classifiers. Each single-stage classifier can be cascaded gradually from cheaper to more expensive binary classifiers that are trained using only the data modalities or features required for that stage. UnfoldML is a cost-aware and uncertainty-based dynamic 2D prediction pipeline for multi-stage classification that enables (1) navigation of the accuracy/cost tradeoff space, (2) reduction of the spatio-temporal cost of inference by orders of magnitude, and (3) early prediction on subsequent stages. UnfoldML achieves orders-of-magnitude better cost in clinical settings while detecting multi-stage disease development in real time. It achieves accuracy within 0.1% of the highest-performing multi-class baseline, while saving close to 20X on the spatio-temporal cost of inference and predicting disease onset 3.5 hours earlier. We also show that UnfoldML generalizes to image classification, where it can predict labels at different levels (from coarse to fine) given different levels of abstraction of an image, saving close to 5X in cost with as little as a 0.4% reduction in accuracy.
1 Introduction
Machine Learning (ML) research has mostly focused on improving prediction accuracy for classification tasks, such as image classification (Foret et al., 2020; Xie et al., 2017), disease risk prediction (Feng et al., 2020; Xu et al., 2018), and pedestrian detection (Zhang et al., 2018; Cai et al., 2015). The understandable drive for high accuracy has often resulted in deeper, more complex neural networks, which can incur high memory (spatial cost) and high latency (temporal cost) at inference time. However, the deployment of ML applications must be cost-aware. Runtime environments like mobile devices or bedside patient monitors are commonly resource-constrained, and applications that can be offloaded to cloud computing still aim for a reduced cloud bill. In this paper, we focus on developing a pipeline that balances high prediction accuracy against low spatio-temporal cost when deploying ML classification models.
We consider a scenario of deploying ML classifiers for a multi-stage classification task in which one predicted class can progressively transition to a next stage of classes, characterized by a "happens-before" relationship between classes. This task is commonly observed in many real-world applications. For instance, in clinical settings, disease progression is often identified by a series of stage transitions (Sperling et al., 2011; Singer et al., 2016).
∗Authors contributed equally to this research.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
arXiv:2210.15056v2 [cs.LG] 28 Oct 2022
[Figure 1(a): The 2D query propagation mechanism in UnfoldML. Cascades upgrade to costlier models when lacking confidence (IDK); queries confidently transition to the next stage (ICK1) or exit the pipeline early (ICK0).]
[Figure 1(b): IDK and ICK classes, defined by thresholds α and β on the distribution of the predictive probability p̂(Y = 1 | x) under the true classes Y = 0 and Y = 1: ICK0 class (confidently low: return prediction and exit early), IDK class (uncertain: cascades upgrade to costlier models), ICK1 class (confidently high: prioritize and transition to the next prediction stage).]
Figure 1: 2D uncertainty-based propagation in UnfoldML: Queries that are confidently low risk return ICK0 and continue to be monitored by cheaper models; queries that are hard to predict return IDK and are advanced to costlier but more confident models; queries that are confidently high risk return ICK1 and are transitioned to the next stage.
Detection at an early stage of the disease allows doctors to take appropriate actions in time, before it enters late, severe stages. In image classification, recognition from super-classes to sub-classes in a coarse-to-fine manner has been shown to improve classification performance (Dutt et al., 2017; Lei et al., 2017). In all these applications, general multi-class classifiers (Liu et al., 2019; Fagerström et al., 2019; Foret et al., 2020) have been developed by treating all the stages as separate classes and have achieved state-of-the-art prediction performance, but at significantly high spatio-temporal cost. Prior work (Wang et al., 2017) trades off accuracy for low cost but still ignores the key relationship between the classes, so it fails to find the optimal trade-off.
We propose UnfoldML2: a cost-aware and uncertainty-based prediction pipeline for dynamic multi-stage classification. It "unfolds" a monolithic multi-class classifier into a series of single-stage classifiers, reducing its deployment cost. Each single-stage classifier is then cascaded gradually from cheaper to more expensive binary classifiers, further reducing the cost by dynamically selecting an appropriate classifier for an input query. Figure 1 summarizes the two-dimensional (2D) query propagation mechanism designed in UnfoldML: horizontally it allows a query to transition through multiple stages, and vertically it allows the query to progressively upgrade to costlier models, constrained by the pre-specified budget limit. It computes a classifier's prediction confidence on a query and then directs the query through one of the following three gates: 1) "I confidently know NO" (ICK0), which rejects the current query and exits the pipeline early (exit); 2) "I don't know" (IDK), which upgrades the query to a higher-accuracy but costlier model, producing a more confident result within the budget (vertical cascading); and 3) "I confidently know YES" (ICK1), which transitions the query to the next stage of the prediction task (horizontal forwarding). The design of the ICK0 gate allows queries to exit UnfoldML early, reducing the overall spatio-temporal cost of the pipeline. The ICK1 gate allows queries to transition faster to the next stages, enabling early prediction of late stages, which can be critical in clinical settings (Reyna et al., 2019). Overall, the combined 2D propagation mechanism uniquely enables navigation of the cost/accuracy tradeoff space to search for an optimal set of policies for dynamic model selection at inference time.
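To make the gating concrete, the following is a minimal sketch of how a single stage's prediction could be routed through the three gates using the thresholds α and β from Figure 1(b); the function name and the hard thresholds are illustrative assumptions, not the released implementation.

```python
def route_query(p_hat: float, alpha: float, beta: float) -> str:
    """Route one query from the probability p_hat = p(Y=1 | x) predicted at the current stage.

    alpha: below this, we are confident the query is negative (ICK0).
    beta : above this, we are confident the query is positive (ICK1).
    Assumes alpha < beta; the band in between is the uncertain (IDK) region.
    """
    if p_hat <= alpha:
        return "ICK0"  # confidently low risk: return the prediction and exit the pipeline early
    if p_hat >= beta:
        return "ICK1"  # confidently high risk: transition the query to the next stage
    return "IDK"       # uncertain: upgrade vertically to a costlier, more confident model
```

For example, with alpha = 0.2 and beta = 0.9, a query with p_hat = 0.95 would be forwarded to the next stage, while one with p_hat = 0.5 would be escalated to a costlier model at the same stage.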
We also propose two training algorithms for learning the optimal policies for the designed 2D query propagation in UnfoldML. The key idea of the training algorithms is to learn the optimal thresholds on the gating functions defined for ICK0, IDK and ICK1. The first proposed hard-gating algorithm assumes the gating functions to be step functions parameterized by deterministic confidence thresholds. To find the optimal thresholds, it performs a bottom-up grid search over a topologically sorted list of all the models and identifies the thresholds that minimize the prediction loss while satisfying a cost constraint. The second proposed soft-gating algorithm defines the gating functions to be probabilistic activation functions. It follows a Mixture-of-Experts (MoE) framework to adaptively determine a model's confidence threshold given all the other models' confidence in predicting the same query. To obviate the cost of running all models in MoE selection, we further propose Dirichlet Knowledge Distillation (DKD), which runs only a cheap multi-label classifier trained to distill the Bayesian predictive uncertainties of all models.
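As a rough illustration of the DKD idea, a single cheap multi-head student can output Dirichlet concentration parameters, one head per zoo model, and be trained so that each teacher's predictive distribution is likely under the corresponding Dirichlet. The architecture sizes, the softplus parameterization, and the negative Dirichlet log-likelihood objective below are illustrative assumptions, not necessarily the paper's exact DKD formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DKDStudent(nn.Module):
    """A cheap shared backbone with one Dirichlet head per zoo model (illustrative)."""
    def __init__(self, in_dim: int, num_models: int, num_classes: int = 2):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(64, num_classes) for _ in range(num_models))

    def forward(self, x):
        h = self.backbone(x)
        # Softplus keeps the Dirichlet concentration parameters strictly positive.
        return [F.softplus(head(h)) + 1e-3 for head in self.heads]

def dkd_loss(alphas, teacher_probs):
    """Negative Dirichlet log-likelihood of each teacher's predictive distribution."""
    loss = 0.0
    for alpha, p in zip(alphas, teacher_probs):
        p = p.clamp_min(1e-4)
        p = p / p.sum(dim=-1, keepdim=True)  # keep the distillation targets on the simplex
        loss = loss - torch.distributions.Dirichlet(alpha).log_prob(p).mean()
    return loss
```

At inference, only the cheap student is run; its per-head Dirichlet outputs stand in for the zoo models' predictive uncertainties when computing the mixture weights.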
We summarize the contributions of this paper as follows:
• We design a novel 2D query propagation pipeline that "unfolds" multi-stage prediction workflows by leveraging the "happens-before" relationship between the stages, achieving a lower-cost prediction pipeline with minimal accuracy degradation.
• We propose two learning algorithms to effectively navigate the cost/accuracy tradeoff space and search for an optimal set of policies for the designed 2D query propagation.
• We apply the proposed pipeline to two real-world applications and demonstrate that it reduces the spatio-temporal cost of inference by orders of magnitude.

2Code is available at https://github.com/gatech-sysml/unfoldml.
2 Related Work
The most relevant work to our proposed method is the one-step IDK cascade (Wang et al., 2017), which incorporates prior work on "I don't know" (IDK) classes (Trappenberg and Back, 2000; Khani et al., 2016) into cascade construction and introduces a latency-aware objective into the construction, in contrast to previous cascaded prediction frameworks (Rowley et al., 1998; Viola and Jones, 2004; Angelova et al., 2015). Another group of work focuses on the problem of feature selection, assuming each feature can be acquired for a cost. These methods train a cascade of classifiers to optimize the trade-off between the expected classification error and the feature cost. An early solution (Raykar et al., 2010) limits the cascade to a family of linear discriminating functions. Cai et al. (2015) apply a boosting method for cascading a set of weak learners. Recent methods (Trapeznikov and Saligrama, 2013; Clertant et al., 2019; Janisch et al., 2019) develop POMDP-based frameworks and incorporate deep Q-learning in training the cascades. In contrast to all of the above work, which are 1D pipelines for one-step prediction tasks (possibly multi-class classification), our method extends to a 2D pipeline that can dynamically forward examples to the next step after they are confidently predicted as having passed the current step. Further, we also develop a more efficient pipeline framework based on Mixture-of-Experts (MoE) modeling and knowledge distillation, which can apply gradient descent algorithms to learn the parameters efficiently.
The idea of MoE was originally introduced by Jacobs et al. (1991) for partitioning the training data and feeding the partitions into separate neural networks during the learning process. This gated decision design has been applied in many domains such as language modeling (Ma et al., 2018), video captioning (Wang et al., 2019), and multi-task learning (Ma et al., 2018). It is also used in network architecture search (Eigen et al., 2013) by setting gate activations on network layers. Sparse gates have been introduced into MoE so that it can efficiently select from thousands of sub-networks (Shazeer et al., 2017), as well as increase the representation power of large convolutional networks by using only a shallow embedding network to produce the mixture weights (Wang et al., 2020). We incorporate the idea of sparsely gated MoE (Shazeer et al., 2017; Wang et al., 2020) into our prediction framework, and design a soft-gating training algorithm that uses ReLU as the sparse gating function and imposes L1-norm regularization on the gating weights for further sparsity.
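A minimal sketch of this kind of sparsely gated mixture weighting, with ReLU gating and an L1 penalty, is shown below; the layer shapes and regularization strength are placeholder assumptions rather than the paper's actual configuration.

```python
import torch
import torch.nn as nn

class SparseGate(nn.Module):
    """Produce sparse, non-negative mixture weights over experts via ReLU gating."""
    def __init__(self, in_dim: int, num_experts: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, num_experts)

    def forward(self, x):
        w = torch.relu(self.proj(x))  # ReLU zeroes out low-scoring experts
        return w / (w.sum(dim=-1, keepdim=True) + 1e-8)  # normalize the surviving weights

def l1_penalty(gate: SparseGate, lam: float = 1e-3):
    # L1-norm regularization on the gating weights encourages further sparsity.
    return lam * gate.proj.weight.abs().sum()
```

The penalty is simply added to the task loss during training, so that only a few experts receive non-zero mixture weight per query.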
A confidence criterion was incorporated into active learning by Li and Sethi (2006) and later extended by Zhu et al. (2010). Lei (2014) proposed confidence-based classifiers that identify the confident region (like the ICK class) and the uncertain region (like the IDK class) in predictions. Confidence has also been introduced into word embeddings (Vilnis and McCallum, 2015; Athiwaratkun and Wilson, 2018) and graph representations (Orbach and Crammer, 2012; Vashishth et al., 2019). Our method posits thresholds on prediction confidence for activating the gates in pipeline expansion. Bayesian Prior Networks (BPNs) (Malinin and Gales, 2018) have been proposed to estimate the uncertainty distribution in model predictions, which is more computationally efficient than traditional Bayesian approaches (MacKay, 1992; Mackay, 1992; Hinton and Van Camp, 1993). We propose Dirichlet Knowledge Distillation (DKD) based on BPNs for distilling prediction uncertainty in large models, so that we only need to run a low-cost multi-head model to produce the weights in MoE efficiently.
3 Multi-Stage Dynamic Prediction Pipeline
We introduce a dynamic 2D prediction pipeline, UnfoldML, which learns the optimal policy for making "I confidently know" (ICK) predictions on sequential multi-stage classification tasks. An optimal policy effectively trades off prediction accuracy against spatio-temporal cost in order to maximize the overall system's accuracy (AUC) while staying under user-imposed cost constraints.
3.1 Problem Formulation
Given $x \in \mathcal{X}$ at time $t$, a multi-stage pipeline decides whether the individual should remain at the current stage $s$ or progress into the next stage $s+1$ at time $t+1$. If there are a total of $S$ stages that need to be detected, we train $K_s$ models for each specific stage $s$ to form a model zoo $\mathcal{M} = \{\{m_{11}, \cdots, m_{1K_1}\}, \cdots, \{m_{S1}, \cdots, m_{SK_S}\}\}$. We measure each model's spatio-temporal cost by multiplying the device cost per unit time by the serving time per prediction stage, denoted as $\mathrm{cost}(m_{sk})$.
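Under this cost definition, a model zoo's per-model costs could be tabulated as in the sketch below; the numbers, model names, and field names are made up for illustration and are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class ZooModel:
    name: str
    device_cost_per_sec: float   # e.g. dollars per second for the device the model runs on
    serving_time_sec: float      # average serving time per prediction

    def cost(self) -> float:
        # Spatio-temporal cost = device cost per unit time x serving time per prediction.
        return self.device_cost_per_sec * self.serving_time_sec

# Hypothetical stage-1 cascade, ordered from cheapest to costliest.
stage1 = [
    ZooModel("m_11_logreg", device_cost_per_sec=1e-5, serving_time_sec=0.002),
    ZooModel("m_12_gbm",    device_cost_per_sec=5e-5, serving_time_sec=0.010),
    ZooModel("m_13_lstm",   device_cost_per_sec=4e-4, serving_time_sec=0.050),
]
print([m.cost() for m in stage1])
```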
To optimize the limited system resources, we design the 2D UnfoldML in the following way: (1) start with the simplest model for prediction of the initial stage on incoming data; (2) upgrade vertically to costlier models for those samples on which "I don't know" (IDK); (3) transition horizontally to the next stage of the pipeline those samples for which "I confidently know YES" (ICK1) that we have correctly identified the sample at the current stage; otherwise (4) exit the pipeline for those "I confidently know NO" (ICK0) samples that have been identified and discarded. Figure 1(a) illustrates the proposed 2D architecture of UnfoldML. The central problem of UnfoldML is learning the optimal policy for each of the three classes of gating functions, with the objective of maximizing system-wide accuracy while minimizing the prediction cost.
We formulate UnfoldML as a decision rule mapping function $m^{\mathrm{casc}}: \mathcal{X} \times \mathcal{M} \to \mathcal{M}$, which takes an example $x_t$ arriving at time $t$ and the current model choice $m_{sk}$ as input to determine which of the aforementioned three actions the model for the query should take: upgrade vertically, transition horizontally, or exit the pipeline. These decision rules can be realized by two groups of parameters: a confidence criterion $q: \mathcal{X} \times \mathcal{M} \to [0,1]$ that measures the confidence score of a model's prediction on a data example, and two gate functions $G_{\mathrm{IDK}}, G_{\mathrm{ICK1}}: [0,1] \to \{\mathrm{True}, \mathrm{False}\}$ that are applied to the confidence score of each prediction made by model $m_{sk}$. The two exclusive gates IDK and ICK1 respectively decide whether the current prediction belongs to an IDK class, such that (s.t.) the system upgrades the query to a costlier model while remaining within the user-defined cost budget, or to an ICK1 class, s.t. the system transitions the query to the next stage in the pipeline. The third gate $G_{\mathrm{ICK0}}$ is determined by $\neg G_{\mathrm{IDK}} \wedge \neg G_{\mathrm{ICK1}}$. We formalize the decision rule used at each prediction stage as follows:
$$
m^{\mathrm{casc}}(x_t, m_{sk}; q, G) =
\begin{cases}
m_{s(k+1)}, & G_{\mathrm{IDK}}\big(q_{sk}(x_t)\big) \wedge \neg G_{\mathrm{ICK1}}\big(q_{sk}(x_t)\big), \\
m_{(s+1)1}, & \neg G_{\mathrm{IDK}}\big(q_{sk}(x_t)\big) \wedge G_{\mathrm{ICK1}}\big(q_{sk}(x_t)\big), \\
m_{sk}, & \neg G_{\mathrm{IDK}}\big(q_{sk}(x_t)\big) \wedge \neg G_{\mathrm{ICK1}}\big(q_{sk}(x_t)\big),
\end{cases}
$$
where $q_{sk}(x_t)$ is shorthand for $q(x_t, m_{sk})$, measuring the confidence of model $m_{sk}$'s prediction on data $x_t$. The goal of configuring an optimal pipeline given restricted computation resources can be formalized as the following optimization problem:
$$
\min_{G} \; \mathcal{L}\big(m^{\mathrm{casc}}; \mathcal{D}\big) \quad \text{s.t.} \quad \mathrm{cost}\big(m^{\mathrm{casc}}; \mathcal{D}\big) \le c, \qquad (1)
$$
where $G$ consists of the two gate functions $G_{\mathrm{IDK}}$ and $G_{\mathrm{ICK1}}$, $\mathcal{L}$ denotes the end-to-end prediction loss on data $\mathcal{D}$, and $c$ is a user-specified cost constraint for the system.
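As a concrete reading of this rule, the sketch below realizes the two gates as hard thresholds α (ICK0 boundary) and β (ICK1 boundary) from Figure 1(b) and returns the indices of the model that should handle the query next; the 1-based index conventions and the guard on the last model of a stage are illustrative assumptions, not the released implementation.

```python
def m_casc(q_sk, s, k, alpha, beta, num_models_per_stage, num_stages):
    """Hard-gated sketch of the decision rule above.

    q_sk        : confidence q_{sk}(x_t) of the current model m_{sk} on query x_t.
    s, k        : current stage index and model index within the stage (1-based).
    alpha, beta : per-(stage, model) thresholds realizing G_ICK0 and G_ICK1.
    Returns the (stage, model) pair that should serve the query next.
    """
    ick1 = q_sk >= beta[s][k]              # "I confidently know YES"
    idk = alpha[s][k] < q_sk < beta[s][k]  # uncertain band between the two thresholds

    if idk and k < num_models_per_stage[s]:
        return (s, k + 1)                  # vertical upgrade to a costlier model
    if ick1 and s < num_stages:
        return (s + 1, 1)                  # horizontal transition to the next stage
    return (s, k)                          # ICK0 (or no upgrade left): stay with m_{sk}
```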
3.2 Gate Parameters Learning
We are given a training data set $\mathcal{D} = \{\mathbf{x}_i, (y^1_i, t^1_i), \cdots, (y^S_i, t^S_i)\}_{i=1}^{N}$, which does not include any data used to train the models in our model zoo. Here $\mathbf{x}_i = (x_{i1}, \cdots, x_{iT_i})$ is an input sequence observed for individual $i$, $y^s_i \in \{0,1\}$ indicates whether the example entered stage $s$, and, if so, $t^s_i \in \emptyset \cup [1, T_i]$ records when it entered. We first partition the multi-stage data into $S$ one-stage data sets, $\mathcal{D}_s = \{(\mathbf{x}_i[t^{s-1}_i : t^s_i], \; y^s_i == 1)\} \cup \{(\mathbf{x}_i[t^{s-1}_i : T_i], \; y^s_i == 0)\}$, then divide the learning of the IDK and ICK gate parameters into two separable sub-problems:
$$
\text{Sub-Objective 1:} \quad \min_{G_{\mathrm{IDK}_s}} \mathcal{L}_s\big(m^{\mathrm{casc}}_s; \mathcal{D}_s\big) \;\; \text{s.t.} \;\; \mathrm{cost}\big(m^{\mathrm{casc}}_s; \mathcal{D}_s\big) \le c_s, \quad s = 1, \cdots, S \qquad (2)
$$
$$
\text{Sub-Objective 2:} \quad \min_{G_{\mathrm{ICK1}}} \mathcal{L}\big(m^{\mathrm{casc}}; \mathcal{D}, G_{\mathrm{IDK}}\big),
$$
where $\mathcal{L}_s$ is the one-stage prediction loss on data $\mathcal{D}_s$, and $c_s$ is the cost budget pre-allocated for stage $s$, satisfying $\sum_s c_s = c$.
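A simplified sketch of how Sub-Objective 1 might be solved for one stage with the hard-gating (grid search) variant is given below, for a two-model cascade; the grid resolution, loss, and cost accounting are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def fit_idk_threshold(confidences, correct, model_costs, budget_cs,
                      grid=np.linspace(0.0, 1.0, 101)):
    """Grid-search one IDK threshold for a two-model cascade within a single stage.

    confidences : (N, 2) array, confidence of the cheap and the costly model per example.
    correct     : (N, 2) boolean array, whether each model's prediction is correct.
    model_costs : length-2 sequence, per-query cost of the cheap and the costly model.
    budget_cs   : stage-level cost budget c_s.
    Returns (threshold, error, avg_cost) for the best feasible threshold, or None.
    """
    best = None
    for theta in grid:
        upgrade = confidences[:, 0] < theta  # IDK on the cheap model -> upgrade
        # Upgraded queries pay for both models; the rest use only the cheap one.
        avg_cost = model_costs[0] + upgrade.mean() * model_costs[1]
        errors = np.where(upgrade, ~correct[:, 1], ~correct[:, 0]).mean()
        if avg_cost <= budget_cs and (best is None or errors < best[1]):
            best = (theta, errors, avg_cost)
    return best
```

Sweeping the threshold from 0 to 1 trades off more upgrades (lower error, higher cost) against fewer upgrades (cheaper, possibly less accurate), and the budget $c_s$ decides which part of that curve is admissible.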