
then get a response. The inference pipeline can have multiple stages
(e.g., data pre-processing, model prediction, post-processing), and
they are typically packaged into a container image along with the
software dependencies. On the cloud, the inference service provider
can then allocate a set of compute instances and use a resource
manager like Kubernetes to deploy the service. In this work, we
focus on the potential of heterogeneous resource instance
allocation – how to efficiently distribute the inference queries and
find a good heterogeneous configuration quickly.
Inference serving with QoS constraints and cost budget. The
inference service has a QoS target, requiring the tail latency (e.g.,
the 99th percentile) of queries to be within a limit for a better user
experience. For flexibility reasons and the pay-as-you-go model,
businesses rent computing power from the cloud computing provider
to meet the QoS target, but they also have a budget constraint.
Each compute instance type, rented from the cloud, is associated
with a price ($/hr). Given a cost budget, one can only allocate a
limited number of instances to serve as many queries as possible –
that is, maximize the query throughput. The query throughput is
defined as queries served per second (QPS). Since QoS cannot be
violated, we use the allowable throughput, which is the maximum
throughput the allocated instances can serve without causing QoS
violations. In this work, we use allowable throughput, throughput,
and QPS interchangeably. All of them hold the implicit condition
that QoS is satisfied.
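The budget-constrained allocation described above can be sketched as follows. The instance price and per-instance allowable throughput used here are hypothetical illustrative numbers, not measurements from this work:

```python
# Sketch: budget-constrained homogeneous allocation (illustrative numbers).
# Each instance type has a price ($/hr) and a per-instance allowable
# throughput: the max QPS it sustains without violating the QoS target.
def homogeneous_allocation(budget, price, qps_per_instance):
    """Return (instance_count, allowable_throughput) under the budget."""
    count = int(budget // price)          # only whole instances can be rented
    return count, count * qps_per_instance

# Hypothetical example: a $10/hr budget and a $2.40/hr instance type
# that serves 25 QPS per instance within QoS.
count, qps = homogeneous_allocation(budget=10.0, price=2.4, qps_per_instance=25)
# 4 instances fit the budget, for an allowable throughput of 100 QPS.
```

Note that the division leaves unused budget ($0.40/hr here) whenever the budget is not an exact multiple of the price, which is why the homogeneous baseline in Sec. 4 is proportionally scaled up for a fair comparison.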
4 MOTIVATION
In this section, we first provide experimental evidence to demonstrate
that a heterogeneous configuration (a configuration can be
a mixture of a few GPU instances, a few instances of CPU type A,
and a few instances of CPU type B) can be better than a homogeneous
configuration under the same cost budget while respecting QoS.
However, this is not always true – a heterogeneous configuration is
not superior simply by virtue of its heterogeneity.
First, we note that given a certain cost budget, one can choose
to allocate the most cost-effective instances that can meet the QoS
for all queries. We denote such an instance type as the base instance,
and such a strategy as homogeneous serving or a homogeneous
configuration. However, since inference queries have highly diverse
batch sizes (or query sizes) [4, 16, 17], even though a cheaper
instance type with higher throughput-per-cost cannot meet the QoS
for all queries (so it cannot serve standalone, as its allowable
throughput is 0), it can still meet QoS for some smaller queries
(queries with smaller batch sizes) due to its lower latency. Another
choice is to replace some base instances with such cheaper instances
(denoted as auxiliary instances); we denote this as heterogeneous
serving or a heterogeneous configuration. Unlike the base instance,
which is the single instance type of the optimal homogeneous
configuration, multiple types of auxiliary instances can be used for
more flexibility and higher potential.
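The base/auxiliary split can be illustrated with a minimal size-aware dispatch rule. The linear latency model and all numbers below are hypothetical, chosen only to show why an auxiliary instance with zero standalone allowable throughput can still absorb small queries:

```python
# Sketch: an auxiliary instance can serve only queries whose batch size
# keeps its latency under the QoS limit; larger queries must go to the
# base instance. All numbers are illustrative, not measurements.
QOS_MS = 50.0  # hypothetical tail-latency limit

def latency_ms(batch_size, per_item_ms, fixed_ms):
    # Simple linear latency model: fixed overhead + per-item cost.
    return fixed_ms + per_item_ms * batch_size

def dispatch(batch_size):
    """Route a query to 'auxiliary' if the cheaper instance meets QoS for it."""
    if latency_ms(batch_size, per_item_ms=4.0, fixed_ms=10.0) <= QOS_MS:
        return "auxiliary"   # cheap CPU instance, fast enough for small batches
    return "base"            # base instance meets QoS for all batch sizes

# Batches of up to 10 items fit within 50 ms on the auxiliary instance;
# a batch of 32 would take 138 ms there, so it is routed to the base.
```

Under this toy model the auxiliary instance alone has an allowable throughput of 0 (it violates QoS for the largest batches), yet it can still offload every query of batch size 10 or less from the base instances.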
Are heterogeneous configurations always better? In Fig. 1, we
compare the throughput of homogeneous serving against three
different heterogeneous configurations on a Meta production model
RM2 [2] under a fixed cost budget (dashed line). All configurations
shown here respect the QoS target. We use three AWS EC2 instance
types, denoted as G1 for the base instance and C1, C2 for the auxiliary
instances (details in Sec. 7). The (4, 0, 0) homogeneous configuration
still has some unused budget, enough for 70% of one G1, so we
proportionally scale its throughput and cost up to the budget to give
it an advantage. We observe that heterogeneous serving outperforms
homogeneous serving, as (3, 1, 3) has 15% higher throughput than
(4, 0, 0). However, heterogeneity is not always better (e.g., (2, 0, 9)
and (1, 4, 2)). The (1, 4, 2) case in particular indicates that simply
raising the budget is not an ideal approach to gain throughput.
Therefore, being only heterogeneity-aware is not sufficient (as in
previous work). But how do we find an optimal configuration like
(3, 1, 3)?

Figure 1: Different heterogeneous configurations versus the best
homogeneous one. The number indicates the instance count of
each type.

Figure 2: Throughput improvement over homogeneous serving when
exploring using simulated annealing.
Finding a high-performing heterogeneous configuration is
expensive. First, the search space of possible heterogeneous
configurations is large: with more instance types, the space becomes
high-dimensional, and each instance type may have multiple
instances. Second, evaluating the throughput of a new configuration
is expensive and time-consuming because it requires service
reconfiguration; just allocating new cloud instances takes
significant time (tens of seconds). Also, during the online search,
each explored configuration may not yield enough throughput to
sustain all the queries – that is, lower throughput than the
homogeneous setting. Fig. 2 shows the limitation of heterogeneous
serving during online exploration using simulated annealing [48].
Although we have pre-filtered out configurations that yield less
than 20 QPS, the majority of explored configurations (about 70%)
are still worse than homogeneous serving, marked as the red line.
QoS violations will occur frequently if the allowable throughput is
below the target level. The high cost of exploring and evaluating
has prohibited previous works from finding a better heterogeneous
configuration. Kairos breaks this limitation by providing an
approximate method to quickly determine a promising configuration
without any online evaluation.
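For concreteness, online exploration in the spirit of Fig. 2 can be sketched with a standard simulated-annealing loop over instance-count tuples. The prices, budget, and `measure_qps` function below are hypothetical stand-ins; in a real deployment, each `measure_qps` call would require an expensive service reconfiguration, which is exactly the cost Kairos avoids:

```python
import math
import random

# Sketch: simulated annealing over heterogeneous configurations, where a
# configuration is a tuple of instance counts per type. All constants are
# illustrative assumptions, not values from this work.
PRICES = [2.4, 0.8, 0.6]   # $/hr per instance type (hypothetical)
BUDGET = 12.0              # cost budget in $/hr (hypothetical)

def within_budget(cfg):
    return sum(n * p for n, p in zip(cfg, PRICES)) <= BUDGET

def measure_qps(cfg):
    # Placeholder for a slow online measurement of allowable throughput;
    # here a toy linear model so the sketch runs standalone.
    return 25 * cfg[0] + 6 * cfg[1] + 4 * cfg[2]

def neighbor(cfg):
    # Perturb one instance count by +/-1, clamped at zero.
    i = random.randrange(len(cfg))
    new = list(cfg)
    new[i] = max(0, new[i] + random.choice([-1, 1]))
    return tuple(new)

def anneal(start, steps=200, temp=5.0, cooling=0.98):
    cur, cur_qps = start, measure_qps(start)
    best, best_qps = cur, cur_qps
    for _ in range(steps):
        cand = neighbor(cur)
        if not within_budget(cand):
            continue                      # skip over-budget configurations
        qps = measure_qps(cand)           # expensive reconfiguration in practice
        # Accept improvements; accept regressions with a cooling probability.
        if qps > cur_qps or random.random() < math.exp((qps - cur_qps) / temp):
            cur, cur_qps = cand, qps
            if qps > best_qps:
                best, best_qps = cand, qps
        temp *= cooling
    return best, best_qps
```

Every accepted or rejected candidate still requires one throughput evaluation, and many of the explored configurations serve less than the homogeneous baseline while being evaluated, which is the online-exploration penalty Fig. 2 quantifies.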
Exploiting heterogeneity via intelligent query distribution
is the key to higher throughput. Next, we show that merely finding
a high-performing heterogeneous configuration is not sufficient. Distributing