Kairos: Building Cost-Efficient Machine Learning Inference Systems with Heterogeneous Cloud Resources
Baolin Li
Northeastern University
Siddharth Samsi
MIT
Vijay Gadepally
MIT
Devesh Tiwari
Northeastern University
ABSTRACT
Online inference is becoming a key service product for many businesses, deployed in cloud platforms to meet customer demands. Despite their revenue-generation capability, these services need to operate under tight Quality-of-Service (QoS) and cost budget constraints. This paper introduces Kairos¹, a novel runtime framework that maximizes query throughput while meeting a QoS target and a cost budget. Kairos designs and implements novel techniques to build a pool of heterogeneous compute hardware without online exploration overhead, and to distribute inference queries optimally at runtime. Our evaluation using industry-grade machine learning (ML) models shows that Kairos yields up to 2× the throughput of an optimal homogeneous solution, and outperforms state-of-the-art schemes by up to 70%, despite advantageous implementations of the competing schemes that ignore their exploration overhead.
KEYWORDS
Machine Learning; Inference Systems; Heterogeneous Hardware.
1 INTRODUCTION
As machine learning (ML) models become widely adopted in commercial services, service providers increasingly utilize cloud computing resources to serve their customers, and online inference has become a highly critical application for both on-premise and public cloud computing platforms [1–3]. As a result, an increasing amount of research effort is dedicated to improving the capability of cloud systems for inference workloads [4–9]. Serving ML inference is particularly challenging because these workloads pose additional constraints and objectives beyond meeting latency deadlines. For example, business service providers can utilize the pay-as-you-go model to rent cloud computing instances, but they seek the following desirable objectives: (1) meet the quality-of-service target (QoS constraint, e.g., 99% of queries finish within 100 ms); (2) remain efficient under a fixed cost budget; (3) process as many queries as possible per unit time (i.e., high query throughput).
Cloud platforms provide a wide range of virtual machines (VMs), each with different hardware (e.g., different CPU, GPU, and memory). While there have been previous attempts at providing partial solutions to exploit hardware heterogeneity in datacenters [10–12], at the edge [13, 14], and in the cloud [15–17], we lack a complete solution that achieves all the desirable properties (Sec. 2). In particular, prior schemes do not consider all aspects of inference serving: heterogeneous resource allocation and intelligent query distribution among allocated hardware resources.
¹ Kairos has been accepted at the 32nd ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC '23).
Note that a heterogeneous pool of cloud compute instances (a mixture of GPUs and CPUs) naturally appears more promising for inference serving, as it provides the opportunity to balance the trade-off between cost and performance (QoS target). More powerful and expensive instances can be used toward satisfying strict QoS targets for larger queries. Less powerful and relatively less expensive instances can be used for executing smaller queries that will not violate their QoS on such instances, thereby providing a chance to reduce the overall cost of the query-serving system. Consequently, many prior techniques have opportunistically taken advantage of hardware heterogeneity to improve query throughput or meet QoS targets [10, 11, 13, 14, 17]. However, none of them provide a systematic methodology to efficiently optimize the heterogeneous configuration (i.e., determine the number of GPUs and CPUs of different types). Therefore, while prior works are heterogeneity-aware, they do not proactively optimize the hardware heterogeneity under a cost budget.
In fact, we show that some heterogeneous configurations can perform significantly worse than an otherwise cost- and QoS-equivalent homogeneous configuration (Sec. 4). Determining a heterogeneous configuration requires online evaluation of multiple potential candidates. Unfortunately, this approach is not suitable when the query load or other system parameters change, since it requires invoking the exploration process frequently and potentially evaluating configurations that are worse than homogeneous ones. This has been the main hindrance preventing the community from exploiting heterogeneous computing hardware. Kairos breaks this limitation and designs novel techniques to take full advantage of hardware heterogeneity while meeting QoS constraints under a given cost budget.
Summary of Contributions. We design and implement Kairos, a novel runtime framework that maximizes throughput under cost budget and QoS constraints for machine learning inference tasks. Kairos breaks away from searching the complex and vast configuration space of heterogeneous hardware. Instead, Kairos devises two techniques to quickly find a high-throughput heterogeneous configuration without exploration.
Table 1: Overview of related works and Kairos.

| Scheme | Inference QoS | Throughput | Cost | Query Mapping | Proactive in Heterogeneity | No Online Exploration | Miscellaneous Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Paragon [10] | ✘ | ✔ | ✘ | ✔ | ✘ | ✘ | Requires prior data for training |
| TetriSched [11] | ✘ | ✘ | ✘ | ✔ | ✘ | ✔ | Supports user-based reservation |
| S3DNN [13] | ✔ | ✔ | ✘ | ✔ | ✘ | ✔ | Uses supervised CUDA stream |
| DART [14] | ✔ | ✔ | ✘ | ✔ | ✘ | ✘ | Profiles layers and applies parallelism |
| Scrooge [15] | ✔ | ✔ | ✔ | ✘ | ✘ | ✘ | Chain execution of media applications |
| Ribbon [16] | ✔ | ✔ | ✔ | ✘ | ✔ | ✘ | Bayesian Optimization for allocation |
| DeepRecSys [17] | ✔ | ✔ | ✘ | ✔ | ✘ | ✘ | Schedules using profiled threshold |
| Clockwork [18] | ✔ | ✔ | ✘ | ✔ | ✘ | ✔ | Consolidates latency for predictability |
| Kairos | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | Full heterogeneity support |

First, Kairos designs an efficient query-distribution mechanism that distributes queries among different cloud computing instances for any given heterogeneous configuration to maximize throughput, formulating this as a bipartite matching problem and solving it efficiently. Second, Kairos approximates the upper bound of the throughput that a heterogeneous configuration can provide at best. Then, Kairos uses the similarity among top-ranked heterogeneous configurations to pick the most promising heterogeneous configuration without online evaluation. Our evaluation confirms that Kairos's configuration choice is often near-optimal across different machine learning models in production, where the optimal configuration is determined via exhaustive offline search of all heterogeneous configurations.
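To make the query-distribution formulation concrete, below is a minimal sketch of matching pending queries to heterogeneous instances as a bipartite assignment; the instance names, latency profiles, and QoS value are illustrative assumptions of ours, and Kairos's actual mechanism is more involved than this one-to-one variant.

```python
# Minimal sketch: assign queries to heterogeneous instances via bipartite
# matching. All numbers are illustrative assumptions, not Kairos's profiles.
import numpy as np
from scipy.optimize import linear_sum_assignment

QOS_MS = 100.0  # assumed tail-latency target (ms)

# Assumed per-type latency profile: latency grows linearly with batch size.
instances = [
    {"name": "G1 (GPU)", "ms_per_item": 0.5},  # powerful, expensive
    {"name": "C1 (CPU)", "ms_per_item": 3.0},  # cheaper, slower
    {"name": "C2 (CPU)", "ms_per_item": 5.0},  # cheapest, slowest
]
queries = [128, 32, 4]  # batch sizes of pending queries

# Cost matrix: rows are queries, columns are instances. Assignments that
# would violate the QoS target get a prohibitively large cost.
INFEASIBLE = 1e9
cost = np.empty((len(queries), len(instances)))
for i, batch in enumerate(queries):
    for j, inst in enumerate(instances):
        latency = batch * inst["ms_per_item"]
        cost[i, j] = latency if latency <= QOS_MS else INFEASIBLE

# The Hungarian algorithm finds the minimum-total-latency assignment:
# the large query lands on the GPU, the smaller ones on cheaper CPUs.
for i, j in zip(*linear_sum_assignment(cost)):
    print(f"batch={queries[i]:>3} -> {instances[j]['name']}: {cost[i, j]:.0f} ms")
```

In a real serving loop, the matching would run repeatedly over queued queries, with each instance exposed as multiple slots in proportion to its capacity; the sketch only shows the shape of the formulation.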
We have leveraged industry-grade deep learning models to drive the evaluation of Kairos's effectiveness [17], although we note that Kairos's design is generic and not tuned for particular kinds of ML models. Our evaluation shows that, compared to the optimal homogeneous configuration, Kairos significantly increases throughput (by up to 2×) under the same QoS target and cost budget. Kairos outperforms the state-of-the-art schemes in this area (Ribbon, DeepRecSys, and Clockwork [16–18]) by up to 70%, despite advantageous implementations of those competing schemes that ignore their exploration overheads and improve their query distribution technique. Our proposed solution, Kairos, is publicly available as an open-source package at https://doi.org/10.5281/zenodo.7888058.
2 RELATED WORK
Table 1 lists the relevant works on exploiting heterogeneous hardware and inference serving. Overall, Kairos is the only work that satisfies all the desirable properties (table header from left to right): (i) meets QoS for inference queries; (ii) has a service throughput requirement; (iii) is aware of heterogeneous hardware cost; (iv) intelligently distributes (or maps) queries among resources; (v) proactively allocates and optimizes heterogeneous resources; and (vi) does not need prior knowledge to train a model or perform online exploration. While some previous works are heterogeneity-aware (i.e., can efficiently use available heterogeneous hardware), they do not proactively configure the heterogeneity to optimize the other aspects: query throughput, QoS, and cost budget.
Latency-critical applications are commonly studied in large-scale datacenter and cloud systems [19–23]. Previous works such as Paragon [10] and TetriSched [11] have focused on optimizing heterogeneous resource utilization [24–27], but their resource heterogeneity is pre-determined and sub-optimal, and their target applications are long-running datacenter jobs, which differ from online inference tasks. Other previous works have relied on tuning by expertise [28–31], prior profiling [32–35], or historical training data from similar applications [36–40], and cannot be used to solve the problem Kairos targets.
Existing ML inference frameworks [1, 4, 8, 13, 14, 18, 41–47] are not suitable for exploiting heterogeneous hardware optimally and may require extensive profiling; Kairos addresses this limitation. For example, S3DNN and DART are heterogeneity-aware deep neural network (DNN) inference frameworks [13, 14], but their hardware heterogeneity is pre-determined. INFaaS [47] selects one particular hardware type from a pool of devices depending on the user application, but unlike Kairos, it does not explore serving the model on different hardware simultaneously. Media application frameworks such as Llama [46] and Scrooge [15] allocate different hardware for different stages of media-application inference, but each query is assigned to the same sequence of hardware types; they do not distribute queries to heterogeneous resources like Kairos and are not suitable for general-purpose applications.
Ribbon [16] optimizes the serving cost by exploring different heterogeneous configurations, but compared to Kairos, it still incurs Bayesian Optimization exploration overhead and does not exploit the heterogeneity by intelligently distributing queries. DeepRecSys [17] explores heterogeneity between GPUs and CPUs when serving online queries. However, it does not explore the potential of different CPU/GPU ratios under a cost budget. It uses a hill-climbing algorithm to find an optimal threshold for query distribution, but it incurs tuning overhead because the threshold differs for each heterogeneous configuration. Clockwork [18] consolidates design choices in a top-down manner for deterministic inference latencies, but its central controller does not exploit heterogeneous hardware like Kairos. Compared to all previous work, Kairos delivers a full suite of heterogeneity support for cloud serving and considers all key metrics (QoS, throughput, and cost).
3 BACKGROUND
Machine learning inference service. When machine learning models have been trained to maturity, they are deployed in production to provide an ML inference service. Service users can submit inference requests through provided interfaces (e.g., HTTP requests) and then receive a response. The inference pipeline can have multiple stages (e.g., data pre-processing, model prediction, post-processing), which are typically packaged into a container image along with the software dependencies. On the cloud, the inference service provider can then allocate a set of compute instances and use a resource manager like Kubernetes to deploy the service. In this work, we focus on the potential of a heterogeneous instance allocation: how to efficiently distribute inference queries and find a good heterogeneous configuration quickly.
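To ground the pipeline description above, here is a minimal sketch of the stages a single query flows through; the stage bodies are placeholders of ours, and a real deployment would load a trained model and serve behind an HTTP endpoint inside the container image.

```python
# Minimal sketch of a multi-stage inference pipeline:
# pre-processing -> model prediction -> post-processing.
import json

def preprocess(raw_request: str) -> list[float]:
    # Parse and normalize the incoming request into model features.
    payload = json.loads(raw_request)
    return [float(x) for x in payload["features"]]

def predict(features: list[float]) -> float:
    # Placeholder model: a real service would invoke the trained model here.
    return sum(features) / max(len(features), 1)

def postprocess(score: float) -> str:
    # Wrap the model output into a response the client can consume.
    return json.dumps({"score": round(score, 4)})

def handle_query(raw_request: str) -> str:
    # The full path a single inference query takes through the service.
    return postprocess(predict(preprocess(raw_request)))

print(handle_query('{"features": [0.2, 0.8, 0.5]}'))  # {"score": 0.5}
```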
Inference serving with QoS constraints and cost budget. The inference service has a QoS target, requiring the tail latency (e.g., the 99th percentile) of queries to stay within a limit for a better user experience. For flexibility reasons and the pay-as-you-go model, businesses rent computing power from the cloud provider to meet the QoS target, but they also have a budget constraint. Each compute instance type rented from the cloud is associated with a price ($/hr). Given a cost budget, one can only allocate a limited number of instances to serve as many queries as possible, that is, to maximize query throughput. Query throughput is defined as queries served per second (QPS). Since QoS cannot be violated, we use the allowable throughput, which is the maximum throughput the allocated instances can serve without causing a QoS violation. In this work, we use allowable throughput, throughput, and QPS interchangeably; all of them carry the implicit condition that QoS is satisfied.
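For concreteness, the objective described above can be stated informally as the optimization problem below; the notation (n_i for the count of instance type i, p_i for its price, B for the budget, λ for the offered load) is ours, not the paper's.

```latex
\begin{align*}
  \max_{n_1,\dots,n_k,\ \lambda} \quad & \lambda
      && \text{maximize allowable throughput (QPS)} \\
  \text{s.t.} \quad & \sum_{i=1}^{k} n_i \, p_i \le B
      && \text{stay within the cost budget (\$/hr)} \\
  & \mathrm{P99}(\lambda;\ n_1,\dots,n_k) \le \mathrm{QoS}
      && \text{meet the tail-latency target}
\end{align*}
```

For a fixed allocation (n_1, ..., n_k), the allowable throughput is then the largest λ that still satisfies the tail-latency constraint.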
4 MOTIVATION
In this section, we first provide experimental evidence to demonstrate that a heterogeneous configuration (a configuration can be a mixture of a few GPU instances, a few instances of CPU type A, and a few instances of CPU type B) can be better than a homogeneous configuration under the same cost budget while respecting QoS. But this is not always true: a heterogeneous configuration is not superior simply by virtue of heterogeneity.
First, we note that given a certain cost budget, one can choose to allocate the most cost-effective instances that can meet the QoS for all queries. We denote such an instance type as the base instance, and such a strategy as homogeneous serving or a homogeneous configuration. However, since inference queries have highly diverse batch sizes (or query sizes) [4, 16, 17], even though a cheaper instance type with higher throughput-per-cost cannot meet the QoS for all queries (so it cannot serve standalone, as its allowable throughput is 0), it can still meet QoS for smaller queries (queries with smaller batch sizes) due to its lower latency. Another choice is therefore to replace some base instances with such cheaper instances (denoted auxiliary instances); we denote this as heterogeneous serving or a heterogeneous configuration. Unlike the base instance, which is the optimal homogeneous instance type, multiple types of auxiliary instances can be used for more flexibility and higher potential.
Are heterogeneous configurations always better? In Fig. 1, we compare the throughput of homogeneous serving against three different heterogeneous configurations on a Meta production model, RM2 [2], under a fixed cost budget (dashed line). All configurations shown here respect the QoS target. We use three AWS EC2 instance types, denoted G1 for the base instance and C1, C2 for the auxiliary instances (details in Sec. 7). The (4, 0, 0) homogeneous configuration still has some unused budget, worth 70% of one G1, so we proportionally scale its throughput and cost up to the budget to give it an advantage. We observe that heterogeneous serving can outperform homogeneous serving: (3, 1, 3) has 15% higher throughput than (4, 0, 0). However, heterogeneity is not always better (e.g., (2, 0, 9) and (1, 4, 2)). Especially, (1, 4, 2) indicates that simply raising the budget is not an ideal approach to gain throughput. Therefore, being only heterogeneity-aware, as previous work is, is not sufficient. But how do we find an optimal configuration like (3, 1, 3)?

Figure 1: Different heterogeneous configurations versus the best homogeneous one. The numbers indicate the instance count of each type.

Figure 2: Throughput improvement over homogeneous serving when exploring with simulated annealing.
Finding a high-performing heterogeneous configuration is expensive. First, the search space of possible heterogeneous configurations is large: with more instance types the space becomes high-dimensional, and each instance type may have multiple instances. Second, evaluating the throughput of a new configuration is expensive and time-consuming because it requires service reconfiguration; just allocating new cloud instances takes significant time (tens of seconds). Also, during an online search over configurations, each explored configuration may not yield enough throughput to sustain all the queries, i.e., lower throughput than the homogeneous setting. Fig. 2 shows the limitation of heterogeneous serving during online exploration using simulated annealing [48]. Although we have pre-filtered out configurations that yield less than 20 QPS, the majority of explored configurations (about 70%) are still worse than homogeneous serving, marked by the red line. QoS violations will occur frequently if the allowable throughput is below the target level. The high cost of exploring and evaluating has prohibited previous works from finding a better heterogeneous configuration. Kairos breaks this limitation by providing an approximate method to quickly determine a promising configuration without any online evaluation.
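As a back-of-the-envelope illustration of how large this search space gets (using made-up prices, not the paper's instance pricing), the sketch below counts configurations that fit under a budget equal to four base instances:

```python
# Enumerate heterogeneous configurations (counts per instance type) that
# fit under a cost budget. Prices and budget are illustrative assumptions.
from itertools import product

prices = {"G1": 3.06, "C1": 0.68, "C2": 0.34}  # assumed $/hr per type
budget = 4 * prices["G1"]  # e.g., the cost of four base (G1) instances

# Upper-bound each type's count by what the budget alone allows.
ranges = [range(int(budget // p) + 1) for p in prices.values()]
feasible = [
    counts for counts in product(*ranges)
    if any(counts)  # skip the empty allocation
    and sum(c * p for c, p in zip(counts, prices.values())) <= budget
]
print(f"{len(feasible)} candidate configurations under the budget")
```

At tens of seconds per online evaluation (instance allocation plus service reconfiguration), probing even a small fraction of these candidates is prohibitively slow, which is why Kairos avoids online exploration altogether.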
Exploiting heterogeneity via intelligent query distribution is the key to higher throughput. Next, we show that only finding a high-performing heterogeneous configuration is not sufficient. Distributing