
then get a response. The inference pipeline can have multiple stages
(e.g., data pre-processing, model prediction, post-processing), and
they are typically packaged into a container image along with the
software dependencies. On the cloud, the inference service provider
can then allocate a set of compute instances and use a resource
manager like Kubernetes to deploy the service. In this work, we
focus on the potential of heterogeneous resource instance
allocation – how to efficiently distribute the inference queries and
find a good heterogeneous configuration quickly.
Inference serving with QoS constraints and cost budget. The
inference service has a QoS target, requiring the tail latency (e.g.,
the 99th percentile) of queries to be within a limit for a better user
experience. For flexibility reasons and the pay-as-you-go model,
businesses rent computing power from the cloud computing provider
to meet the QoS target, but they also have a budget constraint.
Each compute instance type, rented from the cloud, is associated
with a price ($/hr). Given a cost budget, one can only allocate a
limited number of instances to serve as many queries as possible –
that is, maximize the query throughput. The query throughput is
defined as queries served per second (QPS). Since QoS cannot be
violated, we use the allowable throughput, which is the maximum
throughput the allocated instances can serve without causing QoS
violations. In this work, we use allowable throughput, throughput,
and QPS interchangeably. All of them hold the implicit condition
that QoS is satisfied.
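The budget-constrained allocation described above can be sketched as follows. The instance price and per-instance allowable throughput used here are hypothetical illustrative numbers, not measurements from this work:

```python
# Sketch: budget-constrained homogeneous allocation (illustrative numbers).
# Each instance type has a price ($/hr) and a per-instance allowable
# throughput: the max QPS it sustains without violating the QoS target.
def homogeneous_allocation(budget, price, qps_per_instance):
    """Return (instance_count, allowable_throughput) under the budget."""
    count = int(budget // price)          # only whole instances can be rented
    return count, count * qps_per_instance

# Hypothetical example: a $10/hr budget and a $2.40/hr instance type
# that serves 25 QPS per instance within QoS.
count, qps = homogeneous_allocation(budget=10.0, price=2.4, qps_per_instance=25)
# 4 instances fit the budget, for an allowable throughput of 100 QPS.
```

Note that the division leaves unused budget ($0.40/hr here) whenever the budget is not an exact multiple of the price, which is why the homogeneous baseline in Sec. 4 is proportionally scaled up for a fair comparison.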
4 MOTIVATION
In this section, we first provide experimental evidence to demonstrate
that a heterogeneous configuration (a configuration can be
a mixture of a few GPU instances, a few instances of CPU type A,
and a few instances of CPU type B) can be better than a homogeneous
configuration under the same cost budget while respecting QoS.
However, this is not always true – a heterogeneous configuration is
not superior simply by virtue of its heterogeneity.
First, we note that given a certain cost budget, one can choose
to allocate the most cost-effective instances that can meet the QoS
for all queries. We denote such an instance type as the base instance,
and such a strategy as homogeneous serving or a homogeneous
configuration. However, since inference queries have highly diverse
batch sizes (or query sizes) [4, 16, 17], even though a cheaper
instance type with higher throughput-per-cost cannot meet the QoS
for all queries (so it cannot serve standalone, as its allowable
throughput is 0), it can still meet QoS for some smaller queries
(queries with smaller batch sizes) due to its lower latency. Another
choice is to replace some base instances with such cheaper instances
(denoted as auxiliary instances); we denote this as heterogeneous
serving or a heterogeneous configuration. Unlike the base instance,
which is the single instance type of the optimal homogeneous
configuration, multiple types of auxiliary instances can be used for
more flexibility and higher potential.
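The base/auxiliary split can be illustrated with a minimal size-aware dispatch rule. The linear latency model and all numbers below are hypothetical, chosen only to show why an auxiliary instance with zero standalone allowable throughput can still absorb small queries:

```python
# Sketch: an auxiliary instance can serve only queries whose batch size
# keeps its latency under the QoS limit; larger queries must go to the
# base instance. All numbers are illustrative, not measurements.
QOS_MS = 50.0  # hypothetical tail-latency limit

def latency_ms(batch_size, per_item_ms, fixed_ms):
    # Simple linear latency model: fixed overhead + per-item cost.
    return fixed_ms + per_item_ms * batch_size

def dispatch(batch_size):
    """Route a query to 'auxiliary' if the cheaper instance meets QoS for it."""
    if latency_ms(batch_size, per_item_ms=4.0, fixed_ms=10.0) <= QOS_MS:
        return "auxiliary"   # cheap CPU instance, fast enough for small batches
    return "base"            # base instance meets QoS for all batch sizes

# Batches of up to 10 items fit within 50 ms on the auxiliary instance;
# a batch of 32 would take 138 ms there, so it is routed to the base.
```

Under this toy model the auxiliary instance alone has an allowable throughput of 0 (it violates QoS for the largest batches), yet it can still offload every query of batch size 10 or less from the base instances.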
Are heterogeneous configurations always better? In Fig. 1, we
compare the throughput of homogeneous serving against three
different heterogeneous configurations on a Meta production model
RM2 [2] under a fixed cost budget (dashed line). All configurations
shown here respect the QoS target. We use three AWS EC2 instance
types, denoted as G1 for the base instance and C1, C2 for the auxiliary
instances (details in Sec. 7). The (4, 0, 0) homogeneous configuration
still has some unused budget, enough for 70% of one G1, so we
proportionally scale its throughput and cost up to the budget to give
it an advantage. We observe that heterogeneous serving outperforms
homogeneous serving, as (3, 1, 3) has 15% higher throughput than
(4, 0, 0). However, heterogeneity is not always better (e.g., (2, 0, 9)
and (1, 4, 2)). The (1, 4, 2) case in particular indicates that simply
raising the budget is not an ideal approach to gain throughput.
Therefore, being only heterogeneity-aware is not sufficient (as in
previous work). But how do we find an optimal configuration like
(3, 1, 3)?

Figure 1: Different heterogeneous configurations versus the best
homogeneous one. The number indicates the instance count of
each type.

Figure 2: Throughput improvement over homogeneous serving when
exploring using simulated annealing.
Finding a high-performing heterogeneous configuration is
expensive. First, the search space of possible heterogeneous
configurations is large: with more instance types, the space becomes
high-dimensional, and each instance type may have multiple
instances. Second, evaluating the throughput of a new configuration
is expensive and time-consuming because it requires service
reconfiguration; just allocating new cloud instances takes
significant time (tens of seconds). Also, during the online search,
each explored configuration may not yield enough throughput to
sustain all the queries – that is, lower throughput than the
homogeneous setting. Fig. 2 shows the limitation of heterogeneous
serving during online exploration using simulated annealing [48].
Although we have pre-filtered out configurations that yield less
than 20 QPS, the majority of explored configurations (about 70%)
are still worse than homogeneous serving, marked as the red line.
QoS violations will occur frequently if the allowable throughput is
below the target level. The high cost of exploring and evaluating
has prohibited previous works from finding a better heterogeneous
configuration. Kairos breaks this limitation by providing an
approximate method to quickly determine a promising configuration
without any online evaluation.
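For concreteness, online exploration in the spirit of Fig. 2 can be sketched with a standard simulated-annealing loop over instance-count tuples. The prices, budget, and `measure_qps` function below are hypothetical stand-ins; in a real deployment, each `measure_qps` call would require an expensive service reconfiguration, which is exactly the cost Kairos avoids:

```python
import math
import random

# Sketch: simulated annealing over heterogeneous configurations, where a
# configuration is a tuple of instance counts per type. All constants are
# illustrative assumptions, not values from this work.
PRICES = [2.4, 0.8, 0.6]   # $/hr per instance type (hypothetical)
BUDGET = 12.0              # cost budget in $/hr (hypothetical)

def within_budget(cfg):
    return sum(n * p for n, p in zip(cfg, PRICES)) <= BUDGET

def measure_qps(cfg):
    # Placeholder for a slow online measurement of allowable throughput;
    # here a toy linear model so the sketch runs standalone.
    return 25 * cfg[0] + 6 * cfg[1] + 4 * cfg[2]

def neighbor(cfg):
    # Perturb one instance count by +/-1, clamped at zero.
    i = random.randrange(len(cfg))
    new = list(cfg)
    new[i] = max(0, new[i] + random.choice([-1, 1]))
    return tuple(new)

def anneal(start, steps=200, temp=5.0, cooling=0.98):
    cur, cur_qps = start, measure_qps(start)
    best, best_qps = cur, cur_qps
    for _ in range(steps):
        cand = neighbor(cur)
        if not within_budget(cand):
            continue                      # skip over-budget configurations
        qps = measure_qps(cand)           # expensive reconfiguration in practice
        # Accept improvements; accept regressions with a cooling probability.
        if qps > cur_qps or random.random() < math.exp((qps - cur_qps) / temp):
            cur, cur_qps = cand, qps
            if qps > best_qps:
                best, best_qps = cand, qps
        temp *= cooling
    return best, best_qps
```

Every accepted or rejected candidate still requires one throughput evaluation, and many of the explored configurations serve less than the homogeneous baseline while being evaluated, which is the online-exploration penalty Fig. 2 quantifies.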
Exploiting heterogeneity via intelligent query distribution
is the key to higher throughput. Next, we show that merely finding
a high-performing heterogeneous configuration is not sufficient. Distributing