Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in
Machine Learning
Pengfei Zheng1, Rui Pan1, Tarannum Khan2, Shivaram Venkataraman1and Aditya Akella2
1University of Wisconsin-Madison, 2The University of Texas at Austin
Abstract
Dynamic adaptation has become an essential technique in
accelerating distributed machine learning (ML) training. Re-
cent studies have shown that dynamically adjusting model
structure (e.g., lottery ticket hypothesis [16]) or hyperparame-
ters (e.g., batch size [1]) can significantly accelerate training
without sacrificing accuracy. However, existing ML cluster
schedulers are not designed to handle dynamic adaptation.
We show that existing schemes fail to provide fairness and de-
grade system efficiency when the training throughput changes
over time under dynamic adaptation. We design Shockwave,
a scheduler with future planning that builds on two key ideas.
First, Shockwave extends classic market theory from static
settings to dynamic settings to co-optimize efficiency and
fairness. Second, Shockwave utilizes stochastic dynamic pro-
gramming to handle dynamic changes. We build a system
for Shockwave and validate its performance with both trace-
driven simulation and cluster experiments. Results show that
for traces of ML jobs with dynamic adaptation, Shockwave improves makespan by 1.3× and fairness by 2× when compared with existing fair scheduling schemes.
1 Introduction
GPU-powered deep neural network (DNN) training is rapidly
becoming a core workload in data centers [25,28,29]. Due to
the sheer volume of training data and massive, ever-increasing
model sizes, many DNN models cannot be trained on a single
GPU device, and distributed, multi-GPU training has become
the norm. The increasing demand for GPU devices motivates
enterprises to consolidate their hardware resources and run
their workloads in a shared GPU cluster [25]. Thus, building
scheduling mechanisms that can fairly arbitrate among jobs
competing for GPU resources and efficiently schedule them
for high cluster utilization is important.
While there has been a plethora of work in designing sched-
ulers for DNN workloads, they do not use a rigorous ap-
proach to co-optimize system efficiency and fairness. Systems
like Gandiva [41] and Tiresias [21] optimize makespan and
average JCT (Job Completion Time) with techniques such
as dynamic scaling, time-slicing, and over-subscription, but
do not consider fairness. Processor sharing [40] based ap-
proaches such as DRF [17] and Gavel (Weighted Max-Min
Fairness) [33] provide instantaneous fair share of (dominant)
resources in each scheduling round, but this can significantly
undermine efficiency [20,35]. Stride [39] scheduling-based
approaches such as Gandiva-Fair [10] require cluster opera-
tors to explicitly specify an individual job’s share (e.g., A 20%
and B 80% of GPUs), and manually specified fixed shares
can violate long-term fairness for ML jobs [29]. Finally, Al-
loX [28] and Themis [29] aim to provide long-term fairness
by adopting a filter-based approach where within each round,
a subset of jobs that are furthest from the fair share are filtered,
and among the filtered jobs the ones which maximize effi-
ciency are chosen by the scheduler. However, the filter value
requires onerous hand-tuning; furthermore, even with careful
tuning, using a fixed filter can lead to sub-optimal efficiency
and fairness (§2).
We design Shockwave, a scheduler that leverages market
theory to jointly optimize efficiency and fairness for ML train-
ing jobs in a systematic and principled fashion. We formulate
a Fisher market [5] where every job receives an equal budget
to purchase resources from a central arbiter. The arbiter then
computes prices such that the market reaches an equilibrium;
i.e., each job’s budget is spent to maximize its performance
(e.g., training throughput) and all resources are completely
sold. Formulating resource allocation using market theory
is powerful because achieving market equilibrium guaran-
tees both fairness and efficiency. Each job has equal purchas-
ing power in acquiring resources, ensuring fairness. Further,
market-clearing equilibrium ensures work conservation and
that each job’s performance is maximized given its budget.
While economic theory has been the basis of many prior
systems (e.g., DRF [17], Themis [29], and REF [43]), they
all assume jobs have known static resource requests. This
assumption is no longer true for elastic ML training jobs [24,
30,36] whose resource requirements dynamically change over
time; further, the changes in resource requirements depend
on model update patterns, and thus they are unknown a priori.
For example, training jobs can dynamically scale their batch
size by computing the gradient noise scale (GNS) [30,31].
OpenAI has used batch size scaling (from 32 to 32M) to
accelerate GPT-3 training by 500x [7] and similarly, BERT-
Large training uses dynamic batch sizes (256 to 4096) to
achieve a 2.5x speedup [37]. In this paper, we extend market
theory to develop an efficient and fair scheduler for ML jobs
with elastic resource requirements.
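For intuition, the sketch below shows one way a training loop might estimate the gradient noise scale from gradient norms measured at two batch sizes (in the spirit of the GNS work [30,31]) and decide when to grow its batch size. The helper names and the simple doubling policy are hypothetical stand-ins, not the exact rules used in the GPT-3 or BERT-Large training runs cited above.

```python
def estimate_gns(grad_sq_small, grad_sq_large, b_small, b_large):
    """Critical-batch-size / noise-scale estimate from gradient norms measured
    at two batch sizes (illustrative; in practice these are averaged over many steps)."""
    # Unbiased estimates of the true gradient norm and per-sample gradient variance
    g2 = (b_large * grad_sq_large - b_small * grad_sq_small) / (b_large - b_small)
    s = (grad_sq_small - grad_sq_large) / (1.0 / b_small - 1.0 / b_large)
    return max(s / max(g2, 1e-12), 1.0)  # noise scale ~ useful batch size

def maybe_scale_batch_size(current_bs, noise_scale, max_bs=4096):
    """Double the batch size once training tolerates a larger batch
    (a simplified policy, not Shockwave's or OpenAI's)."""
    if noise_scale > 2 * current_bs and current_bs * 2 <= max_bs:
        return current_bs * 2
    return current_bs
```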
Existing schedulers are either agnostic or reactive to dy-
namic changes and our experiments show (§3) that they fail
to guarantee fairness or significantly degrade efficiency. The
key reason for this is that an optimal schedule or weight as-
signment [10] at the current instant can be suboptimal in the
future, and reactively re-prioritizing jobs can be too late to
compensate for the under-prioritization in the early phases.
State-of-the-art schedulers that accommodate dynamism, e.g.,
Pollux [36] do so automatically on behalf of jobs, e.g., by
automatically scaling batch sizes. We find that this can hurt
training accuracy [1,11] (§2.3); thus, our aim is to let users
perform elastic changes as their algorithms demand. Achiev-
ing fair allocation under dynamism without assuming any
control over said dynamism is challenging, and is not studied
in existing research. We present a detailed comparison be-
tween Shockwave and other schedulers such as Themis [29],
AFS [24] and Pollux [36] in Section 2.
To support dynamic changes in resource requirements over
time, we extend the classic, static Fisher market and propose
a new discrete-time, dynamic market that can ensure long-
term efficiency and fairness. Using discrete time helps us
capture the effects of running a market repeatedly over many
rounds and a dynamic market helps us capture time-varying
utility¹ for jobs. For example, consider a scenario where we are running 20 rounds of scheduling for a job. If a job's per-GPU batch size increases by 2× after 10 rounds due to GNS scaling, its utility from being allocated one GPU (u_0) will also double after 10 rounds (u_1 = 2u_0). A static market will assume time-invariant utility, and will compute the accrued utility over 20 rounds for the job as 20u_0; in contrast, a dynamic market can capture the change in utility for the job over time, and can accurately compute the accrued utility over the 20 rounds as 30u_0.

¹A utility function maps a job's allocated resource (e.g., GPU) to the resulting performance (e.g., throughput).
Accurately computing the utility can enable the dynamic
market to optimize fairness and efficiency over time. We prove
that our dynamic market formulation (§4.2) guarantees long-
term efficiency and fairness properties such as maximized
Nash social welfare over time, Pareto optimality over time,
and sharing incentive.
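To make the bookkeeping in the example above explicit, the accrued utilities under the two market models can be written out as follows (this is only a restatement of the numbers in the example, not a new result):

```latex
% Static market: assumes the round-1 utility u_0 persists for all 20 rounds
U_{\mathrm{static}} = \sum_{t=1}^{20} u_0 = 20\,u_0
% Dynamic market: utility doubles after round 10 due to batch-size scaling
U_{\mathrm{dynamic}} = \sum_{t=1}^{10} u_0 + \sum_{t=11}^{20} 2\,u_0 = 30\,u_0
```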
Implementing the dynamic market formulation in real sys-
tems is challenging for two main reasons. First, the market for-
mulation needs to know utility values in the future to compute
market equilibrium. Dynamic adaptations in jobs are non-
deterministically triggered, as they are dependent on gradient
values that vary across models and datasets, which makes
it challenging to predict utility in the future. Second, solv-
ing the dynamic market equilibrium for an (infinitely) long
time horizon is difficult and impractical. It is computation-
ally prohibitive and requires forecasting every job’s future
performance characteristics. Further, as jobs arrive and com-
plete online, we need to periodically solve for the market
equilibrium while maintaining low scheduling overheads.
To bridge the gap between theory and systems, Shockwave
addresses these challenges and implements a dynamic adap-
tation predictor and an approximate dynamic market. First,
we observe that dynamic adaptation for real-world ML work-
loads follows a handful of patterns, and these patterns can be
predicted using Bayesian statistics. We then develop methods
to integrate these predictions into our dynamic market for-
mulation. Second, while performing round-based scheduling,
we find that planning a schedule for an (infinitely) long time
horizon can introduce significant overheads. To maintain low
scheduling overheads, Shockwave only plans the schedule for
a finite length window (e.g, 30-60 minutes), and we design
estimators that can capture the effects on long-term fairness
and long-term efficiency that arise from short-term planning.
This design helps us balance the system overheads without
sacrificing long-term objectives.
We evaluate Shockwave on a 32-GPU cluster testbed and
use a simulator to study large-scale GPU clusters. Using multi-
ple workloads derived from prior, real-world systems [33,36],
we find that Shockwave improves makespan by 1.3× and fairness by 2× compared to existing fair DNN cluster schedulers including Themis [29], Gavel [33], AlloX [28], etc. We further evaluate Shockwave on differently sized clusters. Using a simulator built with the same scheduler as in a physical cluster, we find that Shockwave scales to schedule 900 active DNN training jobs on 256 GPUs and maintains the benefits in makespan (1.26-1.37×) and fairness (2.5-3.1×) when compared to existing schedulers. We show that our solver overhead remains low and is less than 12.5% of a two-minute-long round duration.² Shockwave is open sourced at https://github.com/uw-mad-dash/shockwave.

²The solver runs asynchronously in a separate thread and does not block the main scheduling loop.
2 Motivation
We begin by motivating the need to design a new cluster
scheduler for machine learning workloads.
Filter f      Worst FTF-ρ   SI   Avg. JCT   Makespan
Adaptive      0.83          ✓    5          7
Fixed (1/3)   1.0           ✓    5.7        7
Fixed (2/3)   1.1           ✗    5.7        7
Fixed (1)     1.1           ✗    6.0        7

Table 1: Themis example: using a fixed filter yields suboptimal JCT and/or fairness compared with an adaptive filter.
Figure 1 visualizes the schedule for f = 2/3, showing the cluster and job setting, and demonstrates how a filter works.
GPU ID \ round ID   0    1    2    3    4    5    6
GPU 0               A    B    A    A    B    A    A
GPU 1               A    B    A    A    B    A    A
GPU 2               B    C    C    B    C    A    A
GPU 3               B    C    C    B    C
filter f            2/3  2/3  2/3  2/3  2/3  2/3  2/3

Cluster setting: 4 GPUs. Jobs: A, B, C are three DNN training jobs with one iteration each. Serial (1-GPU) iteration times for A, B, and C are 12, 8, and 6. The numbers of requested GPUs per iteration for A, B, and C are 3, 2, and 2.

Figure 1: Example - Themis [29] with a static filter (f = 2/3). In each round of allocation, the filter (grey color) selects 2/3 of the jobs unfairly treated so far. The resulting FTF-ρ values for jobs (A, B, C) are (0.78, 0.83, 1.1), showing a static filter hurts fairness. As in Themis, we assume a linear slowdown when the number of allocated GPUs is less than requested.
2.1 Jointly Optimizing Fairness and Efficiency
Existing scheduling mechanisms lack a systematic approach
to jointly optimize fairness and efficiency. We first formally
define a fairness metric: we adopt the definition of fairness used in recent work such as Themis [29], Gavel [33], and Pollux [36]: Finish Time Fairness (FTF), ρ(G) = t_schedule / t_egalitarian, where t_schedule is the job finish time resulting from a policy G, and t_egalitarian is t_exclusive · N; N is the number of contending jobs, and t_exclusive is the run time when the job runs exclusively with its requested resources. FTF ρ > 1 (ρ ≤ 1) indicates that a job is unfairly (fairly) scheduled. Note that while we focus on FTF, Shockwave's underlying market formulation can be extended to support other fairness metrics. For example, by assigning different budgets to jobs we can support weighted proportional fairness [27], with the budgets encoding priorities.
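For concreteness, the sketch below computes FTF ρ for a set of jobs given a schedule's finish times; the job fields and the numbers are illustrative, not part of any scheduler's API or of the Figure 1 example.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    t_exclusive: float   # run time with its requested GPUs, running alone
    t_schedule: float    # finish time actually achieved under policy G

def finish_time_fairness(jobs):
    """FTF rho = t_schedule / t_egalitarian, with t_egalitarian = t_exclusive * N,
    where N is the number of contending jobs. rho > 1 means unfairly scheduled."""
    n = len(jobs)
    return {j.name: j.t_schedule / (j.t_exclusive * n) for j in jobs}

# Hypothetical example: three contending jobs on a shared cluster.
jobs = [Job("A", t_exclusive=4.0, t_schedule=10.0),
        Job("B", t_exclusive=4.0, t_schedule=11.0),
        Job("C", t_exclusive=3.0, t_schedule=10.0)]
print(finish_time_fairness(jobs))  # rho <= 1 is fair, rho > 1 is unfair
```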
Second, we define a policy as
efficient
if it minimizes
makespan, or correspondingly, maximizes cluster utilization
given a sequence of jobs.
Instantaneous fair-share sacrifices efficiency.
Existing fair
sharing policies, such as Processor-Sharing (PS) [40] and its
multi-dimensional extension, Dominant Resource Fairness
(DRF) [17], guarantee that at each instant, any job in the sched-
ule obtains exactly a 1/N share of the (dominant) resources.
However, restricting schedulers to instantaneous fair share
can adversely degrade long-term efficiency. Previous work
in Altruistic Scheduling (AS) [20] has shown that sacrificing
instantaneous fair share and letting some jobs altruistically
contribute resources can improve efficiency by 26% [20].
Using filters to balance fairness and efficiency is sub-
optimal.
Given the limitations of instantaneous fair sharing
schemes, recent work [28,29] has proposed using round-based
schemes that optimize instantaneous efficiency and long-term
fairness. Within each round of scheduling, AlloX [28] and
Themis [29] select for allocation a fixed fraction (f) of the jobs that have attained the least resource share in the past. Within these filtered jobs, the scheduler tries to maximize efficiency. Across rounds, the filter compensates for jobs unfairly scheduled in the past and thus pursues fairness in the long run. Existing studies pre-specify a fixed value for the filter f across rounds, but we find that adopting a fixed filter can incur a loss in average JCT or makespan [29], and filter tuning is challenging. Table 1 uses a simple example with three jobs to show how different filters yield very different performance outcomes: fixed filter values f = 1 and f = 2/3 violate finish time fairness (ρ > 1), while f = 1/3 leads to worse JCT. We include the full toy examples in Appendix B. Tuning the hand-crafted fairness filter is challenging without any insight into the resulting performance outcomes, and it is more difficult when the workload varies.
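The sketch below illustrates the round-based filter mechanism described above: each round, a fraction f of the jobs currently furthest from their fair share become eligible, and an efficiency-oriented allocation is chosen among them. It is a simplified model of the Themis/AlloX-style approach, not their actual implementations; the dictionary keys and the greedy packing step are assumptions.

```python
import math

def filtered_round(jobs, f, num_gpus):
    """One scheduling round with a fairness filter f in (0, 1].
    jobs: dicts with 'name', 'rho_so_far' (FTF estimate if scheduling stopped now),
    'requested_gpus', and 'throughput_per_gpu'. Returns job names to run this round."""
    # 1. Filter: keep the ceil(f * N) jobs that have been treated most unfairly so far.
    k = math.ceil(f * len(jobs))
    eligible = sorted(jobs, key=lambda j: j["rho_so_far"], reverse=True)[:k]
    # 2. Efficiency: among eligible jobs, greedily pack the ones that make the best
    #    use of the GPUs (a stand-in for the real intra-round optimization).
    eligible.sort(key=lambda j: j["throughput_per_gpu"], reverse=True)
    allocation, remaining = [], num_gpus
    for job in eligible:
        if job["requested_gpus"] <= remaining:
            allocation.append(job["name"])
            remaining -= job["requested_gpus"]
    return allocation
```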
Overall, this argues for a rigorous, systematic approach that
jointly and intrinsically (i.e., without knob-tuning) optimizes
efficiency and fairness in resource sharing.
2.2 Handling Dynamic Batch Size Scaling
The above goal of optimizing for fairness and efficiency is
made further challenging in the presence of dynamism. Dy-
namism can result in a number of different scenarios. For
example, dynamism can result from job arrivals leading to
time-variant cluster contention, and systems like AFS [24]
are designed to improve JCT by adjusting shares based on job
arrivals. On the other hand, dynamism can arise from training
jobs that can change their training configurations dynamically.
For example, if a job uses gradient noise scale (GNS) [30,31],
the batch size used can change dynamically. This can affect
fair scheduling because when a job switches to using a large
batch size, the per epoch time will decrease, and thereby its re-
maining running time will also decrease (Figure 2(a)). Unlike
prior systems which only handle dynamism that arises from job arrivals, Shockwave (and prior work in Pollux [36]) focuses on handling dynamic changes in the batch sizes of training jobs.
Being agnostic or reactive to dynamic adaptation breaks
finish time fairness.
We show that being agnostic or reactive to dynamic changes (or dynamic adaptation) can yield severe unfairness. Finish time fairness (FTF) implies a soft deadline t_egalitarian = t_exclusive · N for job completion; the later the job finishes after the deadline, the more unfair the schedule is. Computing FTF requires computing the exclusive run time (i.e., t_exclusive), which is straightforward for static jobs since run time can be estimated from the training throughput (samples per second) and the remaining amount of work (remaining number of epochs). However, for jobs that use dynamic adaptation, future training epochs could be significantly faster because of a larger batch size. Agnostic scheduling and reactive scheduling are both unaware of future speedups and hence overestimate run time, and thus mistakenly extend the deadline t_egalitarian, leading to significant unfairness.
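The following sketch contrasts the two estimation strategies: a reactive estimator extrapolates the job's current throughput over all remaining epochs, while a proactive estimator folds in a (predicted) schedule of future batch-size changes. The per-epoch times and the speedup factor are illustrative only.

```python
def remaining_time_reactive(epochs_left, current_epoch_time):
    """Reactive/agnostic estimate: assume every remaining epoch runs at today's speed."""
    return epochs_left * current_epoch_time

def remaining_time_proactive(epoch_times_forecast):
    """Proactive estimate: sum per-epoch times from a forecast of future batch-size
    scaling (epochs after a predicted batch-size increase run faster)."""
    return sum(epoch_times_forecast)

# Hypothetical job: 10 epochs left at 60s each today, but a predicted batch-size
# doubling after 4 more epochs cuts per-epoch time to 35s for the final 6 epochs.
print(remaining_time_reactive(10, 60))            # 600s: overestimates, extends the deadline
print(remaining_time_proactive([60]*4 + [35]*6))  # 450s: accounts for the future speedup
```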
Figure 2b uses a job from our trace to show the difference
between Themis, which uses a reactive approach, and Shock-
wave, which uses a proactive approach. The job dynamically
doubles batch size three times from 32 to 256, and gradually
boosts training speed by up to 1.7× (Figure 2a). Themis is notified and updates the job's throughput immediately after each batch size scaling, and recomputes the estimated finish time based on that (red dashed line in Figure 2b). Changes in the estimated finish time lead Themis to detect that the job has received less than its fair share, and Themis attempts to prioritize this job in future scheduling rounds. However, the job has already suffered from under-prioritization in its early phases and misses the fairness deadline by 2.07× (Figure 2c). Agnostic scheduling is even worse and increases the job's FTF ρ to 3.07; we omit this from the figure.

Figure 2: Example - Reactive scheduling (Themis [29]) for dynamic adaptation breaks finish time fairness. Proactive scheduling (Shockwave) for dynamic adaptation preserves finish time fairness. (a) Dynamic adaptation. (b) T_egalitarian - interpolated finish time under a 1/N cluster share. (c) GPU allocation.

Figure 3: Comparing model accuracy (ResNet18-CIFAR-10) for vanilla training, expert-set batch size scaling, and Pollux autoscaling. The legends in the bottom figure indicate batch size.
Being agnostic or reactive to dynamic adaptation de-
grades system efficiency.
Many makespan-minimizing algorithms, such as Mixed Integer Linear Programming (MILP) and Longest Processing Time (LPT) [14], and JCT (Job Completion Time) minimization algorithms, such as Shortest Remaining Processing Time (SRPT) and AlloX [28], rely on the exact job run time, or the ordering of jobs' run times, to derive a schedule. However, dynamic adaptation changes a job's throughput, and thus a job's run time can be reduced (when batch size is increased) or prolonged (when batch size is decreased) on the fly. This means that when making scheduling decisions, the job run time estimated using initial or current throughput is only valid for the current instant, and if it is used beyond that, system efficiency can be significantly undermined.
Figure 4: Being agnostic and/or reactive to dynamic adaptation undermines efficiency while proactive scheduling minimizes makespan and maximizes efficiency. (a) Jobs 1 & 2 apply dynamic adaptation after 2s, accelerating training (Job 1: bs=16 to bs=128; Job 2: bs=40 to bs=80; Job 3: vanilla training with bs=32). (b) Agnostic scheduling. (c) Reactive scheduling. (d) Proactive scheduling.

In Figure 4, the example shows that for MILP makespan minimization, being reactive to dynamic adaptation yields a 22.3% worse makespan and 28% worse cluster utilization compared to proactive scheduling. Reactive scheduling considers J1 and J2 as long-running jobs based on their initial throughput and prioritizes them to minimize makespan. But due to dynamic adaptation, J1 and J2 become shorter than J3 in their second epoch, and it is too late to compensate and re-prioritize J3. Being completely agnostic to dynamic adaptation is even worse, yielding a 30% worse makespan.
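A minimal sketch of why stale run-time estimates mislead makespan-oriented policies: LPT orders jobs by estimated remaining time, longest first, so if a job's remaining time drops mid-run after a batch-size increase, the schedule was built on the wrong lengths. The numbers below are illustrative and are not those of Figure 4.

```python
def lpt_order(jobs):
    """Order jobs by (estimated) remaining run time, longest first, as LPT does."""
    return sorted(jobs, key=lambda j: j["est_time"], reverse=True)

# At t=0, estimates from initial throughput say J1 and J2 are the long jobs...
initial = [{"name": "J1", "est_time": 8}, {"name": "J2", "est_time": 8},
           {"name": "J3", "est_time": 6}]
print([j["name"] for j in lpt_order(initial)])   # ['J1', 'J2', 'J3']

# ...but after dynamic adaptation doubles their batch size, J1 and J2 finish sooner,
# so the ordering that would have served makespan better starts with J3.
adapted = [{"name": "J1", "est_time": 5}, {"name": "J2", "est_time": 5},
           {"name": "J3", "est_time": 6}]
print([j["name"] for j in lpt_order(adapted)])   # ['J3', 'J1', 'J2']
```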
Overall, the above points motivate the need for a scheduler
that can model future dynamic adaptation and account for this
uncertainty while optimizing for both fairness and efficiency.
2.3 Supporting User-Defined Dynamic Adaptation
While dynamic adaptation with batch size scaling is a key
enabler for efficient training of large-scale DNNs, improper
changes to the batch size can adversely impact convergence
properties and degrade model accuracy. Thus, unlike systems
such as Pollux [36] which automatically modify the batch
size of training jobs, we argue that schedulers should support
user-defined dynamic adaptation schedules to avoid affect-
ing training accuracy. This is mainly because no adaptive
batch size scaling technique works consistently well across
all datasets and optimizers. As a result, ML researchers have
developed many different batch size scaling policies, including the linear scaling rule [7], Accordion [1], Gradient Noise Scale (GNS) [36], SimiGrad [37], Hessian eigenspectrum [42], etc.
We next study an example of how choosing an incorrect
batch size schedule can affect accuracy. In Figure 3, we con-
sider a CIFAR-10 training job on 2 GPUs with ResNet18 with
an initial batch size of 32. When using Pollux [36], the end-to-
end training time reduces by 5× as Pollux scales up the batch size from 32 to 64 at epoch 1, then up to 314 at epoch 2, then to 690 at epoch 30, and finally up to 1682 at epoch 70 till completion. However, this aggressive scaling leads to a 2-3% accuracy loss. Plotting the statistical efficiency (as defined in Pollux [36]), we find that using large batch sizes in the first 30 epochs leads to accuracy degradation. Our conversation with the authors of Pollux suggests that the accuracy degradation depends on the initial batch size used (32 in this case) and can thus vary across jobs.³

³We also found that the statistical efficiency metric in Pollux can be incorrect for Neural-MF models [22]. We include details of this experiment in Appendix A.
We also tested an expert heuristic for batch size scaling
of ResNet18 training on CIFAR-10. The heuristic scales up
the batch size when the gradient norm [1] has insignificant
(<50%) changes, and does not scale in the initial 20 epochs or in the 10 epochs before and after each learning rate decay. This expert-defined scaling schedule has minimal accuracy loss and is 3× faster than vanilla training. Such expert heuristics and their associated thresholds vary across models and datasets; the above heuristic is specific to ResNet-18-CIFAR-10 training and is not easily transferable. For example, for ResNet-50-ImageNet training, experts propose a different heuristic that scales up the batch size by a factor of ten at the 30th, 60th, and 80th epochs, respectively [38]. Thus, while prior work has developed scheduling mechanisms for specific dynamic adaptation techniques, Shockwave assumes no preference for any technique and respects any choice made by users regarding how to dynamically scale a training job's batch size.
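As a sketch, the ResNet18-CIFAR-10 heuristic described above might look as follows. The thresholds and epoch windows are the ones quoted in the text, but the function signature and bookkeeping are hypothetical, not the experts' actual code.

```python
def should_scale_batch_size(epoch, grad_norm_history, lr_decay_epochs,
                            warmup_epochs=20, guard=10, change_threshold=0.5):
    """Expert heuristic for ResNet18-CIFAR-10 batch-size scaling (illustrative):
    scale up only when the gradient norm changes by less than 50% between epochs,
    skip the first 20 epochs, and skip the 10 epochs around each LR decay."""
    if epoch < warmup_epochs:
        return False
    if any(abs(epoch - d) <= guard for d in lr_decay_epochs):
        return False
    if len(grad_norm_history) < 2:
        return False
    prev, curr = grad_norm_history[-2], grad_norm_history[-1]
    relative_change = abs(curr - prev) / max(abs(prev), 1e-12)
    return relative_change < change_threshold
```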
In summary, we find that automatically performing dy-
namic adaptation runs the risk of accuracy degradation. Hence
in this work, we aim to develop a scheduler that can observe
and forecast future scaling events but treats dynamic adapta-
tion as a part of the user’s program that cannot be modified.
3 Overview
We next present an overview of Shockwave, a new scheduling
framework that jointly optimizes efficiency and fairness for
machine learning workloads in shared clusters.
Using Market Theory for Efficiency and Fairness
In
Shockwave we propose using market theory to provably guar-
antee efficiency and fairness for resource sharing. While prior
schedulers [29,44] have also leveraged market theory for
fair sharing, they are built on static market models which
assume that resource requests for a job don’t change over
time. We find that the fairness and efficiency guarantees of a
static market do not hold when jobs dynamically change over
time [15]. Thus, Shockwave extends the classic, static market to a discrete-time, dynamic market to support efficient and fair resource sharing under dynamic adaptation.
Predicting Dynamic Adaptation
Building a dynamic mar-
ket alone is not enough as it presumes perfect future knowl-
edge of jobs’ dynamic adaptation behavior; that is, the mar-
ket needs to know when and how much jobs’ performance
(e.g., training throughput) is changed by dynamic adaptation
as training progresses. As ML training itself is a stochastic
process, the trajectory of dynamic scaling is intrinsically un-
certain. We address this problem in Shockwave by forecasting
the trajectory of dynamic adaptation and developing methods
to use these forecasts in the dynamic market.
Scalable System Implementation
Solving a dynamic mar-
ket and predicting dynamic adaptation introduces scheduling
overhead. We build a centralized, round-based scheduler [33]
and incorporate tractable approximations that can ensure the
overhead remains low even as we scale the number of GPUs
and cluster size. We find that Shockwave can maintain low
overhead while scheduling every 120 seconds and scale to
handle 900 active jobs running on a cluster of 256 GPUs.
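Put together, the scheduler's main loop can be pictured as in the sketch below: every round it refreshes the dynamic-adaptation forecasts, re-solves an approximate market over a finite planning window, and dispatches the resulting allocation. The object and method names are placeholders for the components described in this paper, not Shockwave's actual interfaces.

```python
import time

ROUND_SECONDS = 120          # round duration used in our deployment
PLANNING_WINDOW_ROUNDS = 20  # plan ~30-60 minutes ahead rather than to completion

def scheduling_loop(cluster, jobs, predictor, market_solver):
    while True:
        active = [j for j in jobs if not j.finished]
        # 1. Forecast each job's future throughput trajectory (batch-size scaling).
        forecasts = {j.id: predictor.forecast(j) for j in active}
        # 2. Solve the approximate dynamic market over a finite window only, using
        #    estimators for long-term fairness/efficiency beyond the window.
        plan = market_solver.solve(active, forecasts,
                                   horizon=PLANNING_WINDOW_ROUNDS,
                                   gpus=cluster.num_gpus)
        # 3. Dispatch the first round of the plan, then re-plan next round.
        cluster.apply(plan.round(0))
        time.sleep(ROUND_SECONDS)
```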
4 Dynamic Market Theory Formulation
We begin by describing our theoretical formulation of a
discrete-time, dynamic market and the properties it provides.
4.1 Volatile Fisher Market
Market theory provides a fundamental approach to provably
guarantee efficiency and fairness in resource sharing. The
equilibrium of a
Fisher Market
[5], which is solved by max-
imizing Nash Social Welfare (NSW) [8], is a strong condi-
tion that implies all fairness guarantees used in prior sys-
tems. It is known that Fisher market equilibrium (under equal
endowment) implies Pareto Optimality (PO), Envy-freeness
(EF), and Proportionality (PR), which are fairness properties
adopted by existing systems like DRF [17].
We define efficiency in terms of the utility of a job, where
utility
is a function that maps allocated resources to the result-
ing job progress (e.g., throughput improvement if we allocate
more resources). The market equilibrium for a Fisher market
has also been shown to maximize efficiency [6]. Thus, we
explore the applicability of Fisher markets for DL jobs.
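For intuition, the equilibrium of a static Fisher market with linear utilities can be computed via the Eisenberg-Gale convex program, which maximizes budget-weighted log utilities (equivalently, Nash Social Welfare). The sketch below uses cvxpy on a toy instance; it is the textbook static formulation, not Shockwave's dynamic market, and the values and budgets are made up.

```python
import cvxpy as cp
import numpy as np

# Toy static Fisher market: 3 jobs (buyers), 2 resource types (goods), unit supply.
budgets = np.array([1.0, 1.0, 1.0])          # equal endowments -> equal purchasing power
values = np.array([[3.0, 1.0],               # values[i][g]: utility of good g to job i
                   [1.0, 2.0],
                   [2.0, 2.0]])

x = cp.Variable((3, 2), nonneg=True)          # x[i][g]: fraction of good g given to job i
utilities = cp.sum(cp.multiply(values, x), axis=1)
objective = cp.Maximize(budgets @ cp.log(utilities))   # Eisenberg-Gale / NSW objective
constraints = [cp.sum(x, axis=0) <= 1]                 # cannot sell more than the supply
prob = cp.Problem(objective, constraints)
prob.solve()

print("allocation:\n", np.round(x.value, 3))
print("equilibrium prices:", np.round(constraints[0].dual_value, 3))
```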
From static markets to dynamic markets: Volatile Fisher Market. The classic Fisher market assumes static, time-invariant utility for jobs, and a recent study [15] shows that its efficiency and fairness guarantees can be violated for dynamic, time-variant utilities.
studied settings where goods (resources) arrive online, while
our market considers a different setting where buyers in the
market have time-variant utilities over goods.
To perform efficient fair sharing under dynamic adapta-
tion, we extend the static Fisher market to a discrete-time,
dynamic market. We name this new market
Volatile Fisher
Market (VFM)
. We prove that maximizing Nash Social Wel-
fare Over Time (i.e.,
NSWOT
in Equation 1) solves the market