Kairos: Building Cost-Eicient Machine Learning Inference
Systems with Heterogeneous Cloud Resources
Baolin Li
Northeastern University
Siddharth Samsi
MIT
Vijay Gadepally
MIT
Devesh Tiwari
Northeastern University
ABSTRACT
Online inference is becoming a key service product for many businesses, deployed on cloud platforms to meet customer demands. Despite their revenue-generating capability, these services need to operate under tight Quality-of-Service (QoS) and cost budget constraints. This paper introduces Kairos¹, a novel runtime framework that maximizes query throughput while meeting a QoS target and a cost budget. Kairos designs and implements novel techniques to build a pool of heterogeneous compute hardware without online exploration overhead, and to distribute inference queries optimally at runtime. Our evaluation using industry-grade machine learning (ML) models shows that Kairos yields up to 2× the throughput of an optimal homogeneous solution, and outperforms state-of-the-art schemes by up to 70%, even though the competing schemes are implemented with the advantage of ignoring their exploration overhead.
KEYWORDS
Machine Learning; Inference Systems; Heterogeneous Hardware.
1 INTRODUCTION
As machine learning (ML) models become widely adopted in commercial services, service providers utilize cloud computing resources to serve their customers, and online inference has become a highly critical application for both on-premise and public cloud computing platforms [1–3]. As a result, an increasing amount of research effort is dedicated to improving the capability of cloud systems for inference workloads [4–9]. Serving ML inference is particularly challenging because inference services pose additional constraints and objectives beyond meeting latency deadlines. For example, business service providers can utilize the pay-as-you-go model to rent cloud computing instances, but they seek the following desirable objectives: (1) meet the quality-of-service target (QoS constraint, e.g., 99% of queries finish within 100ms); (2) remain efficient under a fixed cost budget; (3) process as many queries as possible per time unit (i.e., achieve high query throughput).
Cloud platforms provide a wide range of virtual machines (VMs), each with different hardware types (e.g., different CPUs, GPUs, and memory). While there have been previous attempts at providing partial solutions to exploit hardware heterogeneity in datacenters [10–12], at the edge [13, 14], and in the cloud [15–17], we lack a complete solution that achieves all the desirable properties (Sec. 2). In particular, prior schemes do not consider the full aspects of inference serving: heterogeneous resource allocation and intelligent query distribution among allocated hardware resources.
¹ Kairos has been accepted at the 32nd ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC '23).
Note that a heterogeneous pool of cloud compute instances (a mixture of GPUs and CPUs) naturally appears more promising for inference serving, as it provides the opportunity to balance the trade-off between cost and performance (QoS target). More powerful and expensive instances can be used toward satisfying strict QoS targets for larger queries. Less powerful and relatively less expensive instances can be used for executing smaller queries that will not violate their QoS on such instances, thereby providing a chance to reduce the overall cost of the query-serving system. Consequently, many prior techniques have opportunistically taken advantage of hardware heterogeneity to improve query throughput or meet QoS targets [10, 11, 13, 14, 17]. However, none of them provide a systematic methodology to efficiently optimize the heterogeneous configuration (i.e., determine the number of GPUs and CPUs of different types). Therefore, while prior works are heterogeneity-aware, they do not proactively optimize the hardware heterogeneity under a cost budget.

In fact, we show that some heterogeneous configurations can perform significantly worse than an otherwise cost- and QoS-equivalent homogeneous configuration (Sec. 4). Determining a heterogeneous configuration requires online evaluation of multiple potential candidates. Unfortunately, this approach is not suitable when the query load or other system parameters change, since it requires invoking the exploration process frequently and potentially evaluating configurations that are worse than homogeneous configurations. This has been the main hindrance for the community in exploiting heterogeneous computing hardware. Kairos breaks this limitation and designs novel techniques to take full advantage of hardware heterogeneity while meeting QoS constraints under a given cost budget.
Summary of Contributions. We design and implement Kairos, a novel runtime framework to maximize throughput under cost budget and QoS constraints for machine learning inference tasks. Kairos breaks away from searching the complex and vast configuration space of heterogeneous hardware. Instead, Kairos devises two techniques to quickly find a high-throughput heterogeneous configuration without online exploration.
First, Kairos designs an efficient query-distribution mechanism to distribute queries among different cloud computing instances for any given heterogeneous configuration so as to maximize throughput, formulating this as a bipartite matching problem and solving it efficiently. Second, Kairos approximates the upper bound of the throughput that a heterogeneous configuration can provide at its best. Then, Kairos uses the similarity among top-ranked heterogeneous configurations to pick the most promising configuration without online evaluation. Our evaluation confirms that Kairos's configuration choice is often the near-optimal configuration across different machine learning models in production, where the optimal configuration is determined via an exhaustive offline search of all heterogeneous configurations.
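To make the query-distribution idea concrete, below is a minimal sketch of the bipartite-matching formulation in Python. It is an illustration rather than Kairos's actual implementation: the closed-form latency model, the 100ms QoS value, and the one-query-per-instance-per-round simplification are all assumptions for exposition (the instance type names G1, C1, and C2 are borrowed from Sec. 4).

```python
# A minimal sketch of distributing a batch of pending queries across a
# heterogeneous instance pool by solving a bipartite matching problem.
# The latency model below is a hypothetical stand-in for profiled data;
# Kairos's actual formulation and solver may differ.
import numpy as np
from scipy.optimize import linear_sum_assignment

QOS_MS = 100.0  # example QoS target: finish within 100 ms

# Hypothetical profiled latency (ms) of one query of a given batch size
# on each instance type; a real system would use measured profiles.
LATENCY_MS = {
    "G1": lambda batch: 10.0 + 0.8 * batch,  # GPU: high fixed cost, flat scaling
    "C1": lambda batch: 2.0 + 6.0 * batch,   # CPU: cheap, only small batches fit
    "C2": lambda batch: 3.0 + 4.0 * batch,
}

def distribute(queries, instances):
    """queries: list of batch sizes; instances: list of instance type names.
    Returns a list of (query_index, instance_index) assignments."""
    cost = np.empty((len(queries), len(instances)))
    for i, batch in enumerate(queries):
        for j, itype in enumerate(instances):
            lat = LATENCY_MS[itype](batch)
            # Forbid query/instance pairs that would violate QoS.
            cost[i, j] = lat if lat <= QOS_MS else 1e9
    rows, cols = linear_sum_assignment(cost)  # Hungarian-style matching
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < 1e9]

# Example: small queries land on CPU instances, the large one on the GPU.
print(distribute([1, 4, 64], ["G1", "C1", "C2"]))
```

Infeasible pairs, i.e., those that would violate QoS, are priced out of the matching, so small queries naturally flow to the cheap CPU instances while large queries stay on the GPU.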
Table 1: Overview of related works and Kairos.

Work              Inference QoS   Cost   Miscellaneous Notes
Paragon [10]      ✔               ✘      Requires prior data for training
TetriSched [11]   ✘               ✘      Supports user-based reservation
S3DNN [13]        ✔               ✘      Uses supervised CUDA stream
DART [14]         ✔               ✘      Profiles layers and applies parallelism
Scrooge [15]      ✔               ✔      Chain execution of media applications
Ribbon [16]       ✔               ✔      Bayesian Optimization for allocation
DeepRecSys [17]   ✔               ✘      Schedules using profiled threshold
Clockwork [18]    ✔               ✘      Consolidates latency for predictability
Kairos            ✔               ✔      Full heterogeneity support
congurations to pick the most promising heterogeneous cong-
uration without online evaluation. Our evaluation conrms that
Kairos’s conguration choice is often the near-optimal cong-
uration across dierent machine learning models in production,
where the optimal conguration is determined via exhaustive of-
ine search of all heterogeneous congurations.
We have leveraged industry-grade deep learning models to drive the evaluation of Kairos's effectiveness [17], although we note that Kairos's design is generic and not tuned for particular kinds of ML models. Our evaluation shows that, compared to the optimal homogeneous configuration, Kairos is able to significantly increase the throughput (by up to 2×) under the same QoS target and cost budget. Kairos outperforms the state-of-the-art schemes in this area (Ribbon, DeepRecSys, and Clockwork [16–18]) by up to 70%, even though those competing schemes are implemented with the advantage of ignoring their exploration overheads and with an improved query-distribution technique. Our proposed solution, Kairos, is publicly available as an open-source package at https://doi.org/10.5281/zenodo.7888058.
2 RELATED WORK
Table 1 lists the relevant works on exploiting heterogeneous hardware and inference serving. Overall, Kairos is the only work that satisfies all the desirable properties: (i) meets QoS for inference queries; (ii) targets a service throughput requirement; (iii) is aware of heterogeneous hardware cost; (iv) intelligently distributes (or maps) queries among resources; (v) proactively allocates and optimizes heterogeneous resources; and (vi) does not need prior knowledge to train a model or perform online exploration. While some previous works are heterogeneity-aware (i.e., they can efficiently use available heterogeneous hardware), they do not proactively configure the heterogeneity to optimize the other aspects: query throughput, QoS, and cost budget.
Latency-critical applications are commonly studied in large-scale datacenter and cloud systems [19–23]. Previous works such as Paragon [10] and TetriSched [11] have focused on optimizing heterogeneous resource utilization [24–27], but their resource heterogeneity is pre-determined and sub-optimal, and their target applications are long-running datacenter jobs, which differ from online inference tasks. Some other previous works have relied on tuning by expertise [28–31], prior profiling [32–35], or historical training data from similar applications [36–40], and cannot be used to solve the problem Kairos addresses.
Existing ML inference frameworks [1, 4, 8, 13, 14, 18, 41–47] are not suitable for exploiting heterogeneous hardware optimally and may require extensive profiling; Kairos addresses this limitation. For example, S3DNN and DART are heterogeneity-aware deep neural network (DNN) inference frameworks [13, 14], but their hardware heterogeneity is pre-determined. INFaaS [47] selects one particular hardware type from a pool of devices depending on the user application, but unlike Kairos, it does not explore serving the model using different hardware simultaneously. Media application frameworks such as Llama [46] and Scrooge [15] allocate different hardware for different stages of the media application inference pipeline, but each query is assigned to the same sequence of hardware types; they do not distribute queries to heterogeneous resources like Kairos does, and they are not suitable for general-purpose applications.
Ribbon [16] optimizes the serving cost by exploring different heterogeneous configurations, but compared to Kairos, it still incurs Bayesian Optimization exploration overhead and does not exploit the heterogeneity by intelligently distributing the queries. DeepRecSys [17] explores heterogeneity between GPUs and CPUs when serving online queries. However, it does not explore the potential of different CPU/GPU ratios under a cost budget. It uses a hill-climbing algorithm to find an optimal threshold for query distribution, but it incurs tuning overhead since the threshold is different for each heterogeneous configuration. Clockwork [18] consolidates design choices in a top-down manner for deterministic inference latencies, but its central controller does not exploit heterogeneous hardware like Kairos does. Compared to all previous work, Kairos delivers a full suite of heterogeneity support for cloud serving and considers all key metrics (QoS, throughput, and cost).
3 BACKGROUND
Machine learning inference service. Once machine learning models are trained to maturity, they are deployed in production to provide an ML inference service. Service users submit inference requests through provided interfaces (e.g., HTTP requests) and then receive a response. The inference pipeline can have multiple stages (e.g., data pre-processing, model prediction, post-processing), which are typically packaged into a container image along with their software dependencies. On the cloud, the inference service provider can then allocate a set of compute instances and use a resource manager like Kubernetes to deploy the service. In this work, we focus on the potential of using a heterogeneous resource instance allocation: how to efficiently distribute the inference queries and how to find a good heterogeneous configuration quickly.
Inference serving with QoS constraints and cost budget. The inference service has a QoS target, requiring the tail latency (e.g., the 99th percentile) of queries to be within a limit for a better user experience. For flexibility reasons and the pay-as-you-go model, businesses rent computing power from the cloud provider to meet the QoS target, but they also have a budget constraint. Each compute instance type rented from the cloud is associated with a price ($/hr). Given a cost budget, one can only allocate a limited number of instances to serve as many queries as possible, that is, to maximize the query throughput. The query throughput is defined as queries served per second (QPS). Since QoS cannot be violated, we use the allowable throughput, which is the maximum throughput the allocated instances can serve without causing QoS violations. In this work, we use allowable throughput, throughput, and QPS interchangeably. All of them carry the implicit condition that QoS is satisfied.
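The notion of allowable throughput can be made concrete with a short sketch: ramp up the offered load and keep the highest QPS at which the tail latency still meets the QoS target. The latency model below is an assumed closed form for illustration; a real measurement would replay a query trace against the allocated instances.

```python
# A minimal sketch of estimating the "allowable throughput" of an
# allocation: increase the offered load (QPS) and keep the highest rate
# at which the p99 tail latency still meets the QoS target.
QOS_MS = 100.0  # QoS target: 99th-percentile latency within 100 ms

def measure_p99(qps):
    # Assumed latency model standing in for a real measurement: tail
    # latency inflates as the load nears capacity (100 QPS in this toy).
    utilization = qps / 100.0
    return 60.0 / max(1e-6, 1.0 - utilization)

def allowable_throughput(max_qps=200, step=5):
    best = 0
    for qps in range(step, max_qps + 1, step):
        if measure_p99(qps) <= QOS_MS:
            best = qps  # highest load that still satisfies QoS
    return best

print(allowable_throughput())  # -> 40 QPS under this toy model
```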
4 MOTIVATION
In this section, we first provide experimental evidence to demonstrate that a heterogeneous configuration (a configuration can be a mixture of a few GPU instances, a few instances of CPU type A, and a few instances of CPU type B) can be better than a homogeneous configuration under the same cost budget while respecting QoS. But this is not always true: a heterogeneous configuration is not superior simply by virtue of heterogeneity.
First, we note that given a certain cost budget, one can choose to allocate the most cost-effective instances that can meet the QoS for all queries. We denote such an instance type as the base instance, and such a strategy as homogeneous serving or a homogeneous configuration. However, since inference queries have highly diverse batch sizes (or query sizes) [4, 16, 17], even a cheaper instance type with higher throughput-per-cost that cannot meet the QoS for all queries (and so cannot serve standalone, as its allowable throughput is 0) can still meet QoS for some smaller queries (queries with smaller batch sizes) due to its lower latency on them. Another choice is therefore to replace some base instances with such cheaper instances (denoted as auxiliary instances); we denote this as heterogeneous serving or a heterogeneous configuration. Unlike the base instance, which is the instance type of the optimal homogeneous configuration, multiple types of auxiliary instances can be used for more flexibility and higher potential.
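To see why auxiliary instances are useful even though they cannot serve standalone, the sketch below computes the largest batch size each instance type can finish within QoS. The latency profiles reuse the hypothetical G1/C1/C2 models from the earlier sketch and are assumptions, not measured data.

```python
# A minimal sketch of the base/auxiliary distinction: each instance type
# has a largest batch size it can finish within the QoS target. The
# latency functions are hypothetical, not measured profiles.
QOS_MS = 100.0

LATENCY_MS = {
    "G1": lambda b: 10.0 + 0.8 * b,  # base instance: serves all query sizes
    "C1": lambda b: 2.0 + 6.0 * b,   # auxiliary: QoS-safe only for small ones
    "C2": lambda b: 3.0 + 4.0 * b,
}

def max_qos_batch(itype, max_batch=256):
    """Largest batch size this instance type serves within the QoS target."""
    return max((b for b in range(1, max_batch + 1)
                if LATENCY_MS[itype](b) <= QOS_MS), default=0)

for itype in LATENCY_MS:
    print(itype, max_qos_batch(itype))
# G1 112 -> meets QoS even for large queries (base instance)
# C1 16  -> meets QoS only for small queries (auxiliary instance)
# C2 24
```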
Figure 1: Different heterogeneous configurations versus the best homogeneous one. The number indicates the instance count of each type.

Figure 2: Throughput improvement over homogeneous serving when exploring using simulated annealing.

Are heterogeneous configurations always better? In Fig. 1, we compare the throughput of homogeneous serving against three different heterogeneous configurations on a Meta production model, RM2 [2], under a fixed cost budget (dashed line). All configurations shown here respect the QoS target. We use three AWS EC2 instance types, denoted as G1 for the base instance and C1, C2 for the auxiliary instances (details in Sec. 7). The (4, 0, 0) homogeneous configuration still has some unused budget, worth 70% of one G1 instance, so we proportionally scale its throughput and cost up to the budget to give it an advantage. We observe that heterogeneous serving can outperform homogeneous serving, as (3, 1, 3) has 15% higher throughput than (4, 0, 0). However, heterogeneity is not always better (e.g., (2, 0, 9) and (1, 4, 2)). Especially for (1, 4, 2), this indicates that simply raising the budget is not an ideal approach to gain throughput. Therefore, being only heterogeneity-aware, like previous work, is not sufficient. But how do we find an optimal configuration like (3, 1, 3)?
Finding a high-performing heterogeneous configuration is expensive. This is because, first, the search space of possible heterogeneous configurations is large: with more instance types, the space becomes high-dimensional, and each instance type may have multiple instances. Second, evaluating the throughput of a new configuration is expensive and time-consuming because it requires service reconfiguration; just allocating new cloud instances takes significant time (tens of seconds). Also, during an online search over configurations, each explored configuration may not yield enough throughput to sustain all the queries, i.e., it may have lower throughput than the homogeneous setting. Fig. 2 shows the limitation of heterogeneous serving during online exploration using simulated annealing [48]. Although we have pre-filtered out configurations that yield less than 20 QPS, the majority of explored configurations (about 70%) are still worse than homogeneous serving, marked as the red line. QoS violations will occur frequently whenever the allowable throughput is below the target level. The high cost of exploring and evaluating configurations has prohibited previous works from finding a better heterogeneous configuration. Kairos breaks this limitation by providing an approximate method to quickly determine a promising configuration without any online evaluation.
Exploiting heterogeneity via intelligent query distribution is the key to higher throughput. Next, we show that finding a high-performing heterogeneous configuration alone is not sufficient. Distributing