• We design a novel 2D query propagation pipeline that “unfolds” multi-stage prediction workflows by leveraging the “happens-before” relationship between the stages, and achieves a lower-cost prediction pipeline with minimal accuracy degradation.
• We propose two learning algorithms to efficiently navigate the cost/accuracy tradeoff space and search for an optimal set of policies for the designed 2D query propagation.
• We apply the proposed pipeline to two real-world applications and demonstrate that it reduces the spatio-temporal cost of inference by orders of magnitude.
2 Related Work
The most relevant work to our proposed method is the one-step IDK cascade (Wang et al., 2017), which incorporates prior work on “I don’t know” (IDK) classes (Trappenberg and Back, 2000; Khani et al., 2016) into cascade construction and introduces a latency-aware construction objective, in contrast to earlier cascaded prediction frameworks (Rowley et al., 1998; Viola and Jones, 2004; Angelova et al., 2015). Another line of work focuses on feature selection under the assumption that each feature can be acquired at a cost; a cascade of classifiers is trained to optimize the trade-off between the expected classification error and the feature cost. An early solution (Raykar et al., 2010) limits the cascade to a family of linear discriminant functions. Cai et al. (2015) apply boosting to cascade a set of weak learners. Recent methods (Trapeznikov and Saligrama, 2013; Clertant et al., 2019; Janisch et al., 2019) develop POMDP-based frameworks and incorporate deep Q-learning to train the cascades. In contrast to all of the above, which are 1D pipelines for a one-step prediction task (possibly multi-class classification), our method extends to a 2D pipeline that dynamically forwards examples to the next stage once they are confidently predicted to have passed the current one. Further, we develop a more efficient pipeline framework based on Mixture-of-Experts (MoE) modeling and knowledge distillation, whose parameters can be learned efficiently with gradient-descent algorithms.
The idea of MoE was originally introduced by Jacobs et al. (1991) to partition the training data and feed the partitions to separate neural networks during learning. This gated-decision design has been applied in many domains, such as language modeling (Ma et al., 2018), video captioning (Wang et al., 2019), and multi-task learning (Ma et al., 2018). It has also been used in network architecture search (Eigen et al., 2013) by placing gate activations on network layers. Sparse gates were introduced into MoE so that it can efficiently select from thousands of sub-networks (Shazeer et al., 2017) and can increase the representation power of large convolutional networks while using only a shallow embedding network to produce the mixture weights (Wang et al., 2020). We incorporate the idea of sparsely gated MoE (Shazeer et al., 2017; Wang et al., 2020) into our prediction framework and design a soft-gating training algorithm that uses ReLU as the sparse gating function and imposes L1-norm regularization on the gating weights for further sparsity.
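A minimal sketch of this soft-gating idea is shown below (illustrative PyTorch code, not our exact implementation; the module name `ReLUSparseGate` and the placement of the L1 penalty on the gate activations are assumptions for illustration):

```python
import torch
import torch.nn as nn

class ReLUSparseGate(nn.Module):
    """Sparse MoE gate: ReLU zeroes out many experts; an L1 term encourages further sparsity."""
    def __init__(self, in_dim: int, num_experts: int, l1_coef: float = 1e-3):
        super().__init__()
        self.proj = nn.Linear(in_dim, num_experts)
        self.l1_coef = l1_coef

    def forward(self, x: torch.Tensor):
        raw_gates = torch.relu(self.proj(x))          # (batch, num_experts), many exact zeros
        # L1 penalty on the un-normalized gate activations (it could instead be placed
        # on self.proj.weight); the caller adds it to the task loss.
        l1_penalty = self.l1_coef * raw_gates.sum(dim=-1).mean()
        gates = raw_gates / (raw_gates.sum(dim=-1, keepdim=True) + 1e-8)
        return gates, l1_penalty

# Usage: mix expert outputs with the sparse gates and train end-to-end by gradient descent.
batch, in_dim, num_experts = 4, 16, 8
x = torch.randn(batch, in_dim)
experts = nn.ModuleList([nn.Linear(in_dim, 2) for _ in range(num_experts)])
gate = ReLUSparseGate(in_dim, num_experts)

weights, l1_penalty = gate(x)                                        # (4, 8)
expert_out = torch.stack([expert(x) for expert in experts], dim=1)   # (4, 8, 2)
y = (weights.unsqueeze(-1) * expert_out).sum(dim=1)                  # (4, 2)
loss = y.pow(2).mean() + l1_penalty            # dummy task loss + sparsity term
loss.backward()
```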
A confidence criterion was incorporated into active learning by Li and Sethi (2006) and later extended by Zhu et al. (2010). Lei (2014) proposed confidence-based classifiers that identify a confident region (akin to our ICK class) and an uncertain region (akin to the IDK class) in predictions. Confidence has also been introduced into word embeddings (Vilnis and McCallum, 2015; Athiwaratkun and Wilson, 2018) and graph representations (Orbach and Crammer, 2012; Vashishth et al., 2019). Our method places thresholds on prediction confidence to activate the gates in pipeline expansion. Bayesian Prior Networks (BPNs) (Malinin and Gales, 2018) have been proposed to estimate the uncertainty distribution over model predictions and are more computationally efficient than traditional Bayesian approaches (MacKay, 1992; Mackay, 1992; Hinton and Van Camp, 1993). We propose Dirichlet Knowledge Distillation (DKD), built on BPNs, to distill the prediction uncertainty of large models, so that we only need to run a low-cost multi-head model to produce the MoE weights efficiently.
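A minimal sketch of one way such a Dirichlet-based distillation loss could be written is given below, assuming the low-cost student head outputs Dirichlet concentration parameters and is trained so that the large teacher's predictive distribution is likely under that Dirichlet; the function name and this particular form are illustrative, not necessarily the exact DKD objective:

```python
import torch
import torch.nn.functional as F

def dirichlet_distillation_loss(student_logits: torch.Tensor,
                                teacher_probs: torch.Tensor,
                                eps: float = 1e-6) -> torch.Tensor:
    """student_logits: (batch, C) raw outputs of the cheap student head.
    teacher_probs:  (batch, C) softmax outputs of the large teacher model."""
    # Map logits to positive Dirichlet concentrations alpha > 0.
    alpha = F.softplus(student_logits) + eps
    alpha0 = alpha.sum(dim=-1)                                   # Dirichlet precision
    # Negative Dirichlet log-likelihood of the teacher's probability vectors.
    log_norm = torch.lgamma(alpha).sum(-1) - torch.lgamma(alpha0)
    log_lik = ((alpha - 1.0) * torch.log(teacher_probs + eps)).sum(-1) - log_norm
    return -log_lik.mean()

# Usage: distill a frozen teacher's soft predictions into the student head.
student_logits = torch.randn(4, 3, requires_grad=True)
with torch.no_grad():
    teacher_probs = torch.softmax(torch.randn(4, 3), dim=-1)
loss = dirichlet_distillation_loss(student_logits, teacher_probs)
loss.backward()
```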
3 Multi-Stage Dynamic Prediction Pipeline
We introduce UnfoldML, a dynamic 2D prediction pipeline that learns the optimal policy for making “I confidently know” (ICK) predictions on sequential multi-stage classification tasks. An optimal policy effectively trades off prediction accuracy against spatio-temporal cost in order to maximize the overall system accuracy (AUC) while staying under user-imposed cost constraints.
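To make the setting concrete, the sketch below illustrates one simplified reading of such thresholded 2D propagation, under our own assumptions: stages are ordered horizontally, models within a stage are escalated from cheap to expensive vertically, and fixed per-stage ICK thresholds stand in for the policy that UnfoldML actually learns:

```python
from typing import Callable, List, Tuple

Model = Callable[[object], float]  # returns P(example passes this stage)

def propagate_2d(x: object,
                 stages: List[List[Model]],     # stages[s]: models ordered cheap -> expensive
                 ick_thresholds: List[float]) -> Tuple[int, float]:
    """Return (last stage reached, confidence there) for a single example."""
    stage_reached, conf = 0, 0.0
    for models, theta in zip(stages, ick_thresholds):
        forwarded = False
        for model in models:                    # escalate vertically until confident
            conf = model(x)
            if conf >= theta:                   # "I confidently know" it passes this stage
                forwarded = True
                break
            if conf <= 1.0 - theta:             # confidently does not pass: exit early
                return stage_reached, conf
        if not forwarded:                       # still uncertain after the priciest model
            return stage_reached, conf
        stage_reached += 1                      # horizontal move to the next stage
    return stage_reached, conf
```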