
We next study an example of how choosing an incorrect
batch size schedule can affect accuracy. In Figure 3, we con-
sider a CIFAR-10 training job on 2 GPUs using ResNet18 with
an initial batch size of 32. When using Pollux [36], the end-to-end
training time is reduced by 5× as Pollux scales up the batch
size from 32 to 64 at epoch 1, then to 314 at epoch 2, then to
690 at epoch 30, and finally to 1682 at epoch 70, which it uses
until completion. However, this aggressive scaling leads to a 2-3%
accuracy loss. Plotting the statistical efficiency (as defined in
Pollux [36]), we find that using large batch sizes in the first 30
epochs leads to accuracy degradation. Our conversation with
the authors of Pollux suggests that the accuracy degradation
depends on the initial batch size used (32 in this case) and
can thus vary across jobs.³
We also tested an expert heuristic for batch size scaling
of ResNet18 training on CIFAR-10. The heuristic scales up
the batch size when the gradient norm [1] has insignificant
(<50%) changes, and does not scale during the initial 20 epochs
or the 10 epochs before and after each learning rate decay.
This expert-defined scaling schedule has minimal accuracy loss
and is 3× faster than vanilla training. Such expert heuristics
and their associated thresholds vary across models and datasets;
the above heuristic is specific to ResNet-18 training on CIFAR-10
and is not easily transferable. For example, for
ResNet-50-ImageNet training, the experts propose a different
heuristic that scales up the batch size by a factor of ten at the
30th, 60th, and 80th epochs, respectively [38]. Thus, while
prior work has developed scheduling mechanisms for specific
dynamic adaptation techniques, Shockwave assumes no preference
for any technique and respects whatever choice users make about
how to dynamically scale a training job’s batch size.
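To make the shape of such an expert heuristic concrete, the sketch below encodes the ResNet-18/CIFAR-10 rule described above. It is only an illustration of the rule as stated: the function name, the way the gradient-norm history is passed in, and the doubling policy mentioned afterwards are our assumptions; only the thresholds come from the text.

```python
# Minimal sketch of the expert heuristic described above (ResNet-18 on CIFAR-10).
# Function name and argument layout are illustrative assumptions; only the
# thresholds (20 warmup epochs, +/-10 epochs around LR decays, <50% change)
# come from the heuristic described in the text.

def should_scale_batch_size(epoch, grad_norm_history, lr_decay_epochs,
                            warmup_epochs=20, decay_window=10,
                            change_threshold=0.5):
    """Return True if the batch size should be scaled up at this epoch."""
    # Never scale during the initial warmup epochs.
    if epoch < warmup_epochs:
        return False
    # Never scale within +/- decay_window epochs of a learning-rate decay.
    if any(abs(epoch - d) <= decay_window for d in lr_decay_epochs):
        return False
    # Scale only when the gradient norm change is insignificant (<50%).
    if len(grad_norm_history) < 2 or grad_norm_history[-2] == 0:
        return False
    prev, curr = grad_norm_history[-2], grad_norm_history[-1]
    return abs(curr - prev) / prev < change_threshold
```

A training loop would call this once per epoch with the recorded gradient norms and, for example, double the global batch size whenever it returns True.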
In summary, we find that automatically performing dy-
namic adaptation runs the risk of accuracy degradation. Hence
in this work, we aim to develop a scheduler that can observe
and forecast future scaling events but treats dynamic adapta-
tion as a part of the user’s program that cannot be modified.
3 Overview
We next present an overview of Shockwave, a new scheduling
framework that jointly optimizes efficiency and fairness for
machine learning workloads in shared clusters.
Using Market Theory for Efficiency and Fairness
In Shockwave, we propose using market theory to provably
guarantee efficiency and fairness for resource sharing. While prior
schedulers [29,44] have also leveraged market theory for
fair sharing, they are built on static market models that
assume a job's resource requests do not change over
time. We find that the fairness and efficiency guarantees of a
static market do not hold when jobs dynamically change over
time [15]. Thus, Shockwave extends the classic, static market
to a discrete-time, dynamic market to support efficient and
fair resource sharing under dynamic adaptation.

³We also found that the statistical efficiency metric in Pollux can be
incorrect for Neural-MF models [22]. We include details of this experiment
in Appendix A.
Predicting Dynamic Adaptation
Building a dynamic market alone is not enough as it presumes
perfect future knowledge of jobs’ dynamic adaptation behavior;
that is, the market needs to know when and how much jobs’ performance
(e.g., training throughput) is changed by dynamic adaptation
as training progresses. As ML training itself is a stochastic
process, the trajectory of dynamic scaling is intrinsically un-
certain. We address this problem in Shockwave by forecasting
the trajectory of dynamic adaptation and developing methods
to use these forecasts in the dynamic market.
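As a rough illustration of the forecasting step, the sketch below extrapolates a job's throughput at a batch size it has not yet used by fitting a log-log linear model to throughputs observed at earlier batch sizes. The model choice, function name, and the sample numbers in the usage line are our assumptions, not Shockwave's actual forecaster.

```python
# Illustrative forecast only: fit throughput vs. batch size on observed points
# (log-log linear) and extrapolate to a future, larger batch size.
import numpy as np

def forecast_throughput(batch_sizes, throughputs, future_batch_size):
    """Predict samples/sec at a batch size the job has not run yet."""
    log_b = np.log(np.asarray(batch_sizes, dtype=float))
    log_t = np.log(np.asarray(throughputs, dtype=float))
    slope, intercept = np.polyfit(log_b, log_t, deg=1)  # throughput ~ c * batch^slope
    return float(np.exp(intercept + slope * np.log(future_batch_size)))

# Hypothetical measurements at batch sizes 32, 64, and 314; forecast at 690.
print(forecast_throughput([32, 64, 314], [400.0, 760.0, 2900.0], 690))
```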
Scalable System Implementation
Solving a dynamic market and predicting dynamic adaptation
introduces scheduling
overhead. We build a centralized, round-based scheduler [33]
and incorporate tractable approximations that can ensure the
overhead remains low even as we scale the number of GPUs
and cluster size. We find that Shockwave can maintain low
overhead while scheduling every 120 seconds and scale to
handle 900 active jobs running on a cluster of 256 GPUs.
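For concreteness, a centralized round-based scheduler has roughly the following loop structure. This is a generic sketch under our own naming; the placeholder solve_allocation policy stands in for Shockwave's market-based solver and is not its actual algorithm.

```python
# Generic sketch of a centralized, round-based scheduling loop. Names and the
# round-robin placeholder policy are ours; Shockwave would instead solve (an
# approximation of) its dynamic market inside solve_allocation().
import itertools
import time

ROUND_DURATION_SECONDS = 120  # re-plan GPU allocations every 120-second round

def solve_allocation(active_jobs, num_gpus):
    """Placeholder policy: assign one GPU per active job, round-robin."""
    gpu_ids = itertools.cycle(range(num_gpus))
    return {job_id: [next(gpu_ids)] for job_id in active_jobs}

def scheduling_loop(get_active_jobs, apply_allocation, num_gpus):
    while True:
        round_start = time.time()
        allocation = solve_allocation(get_active_jobs(), num_gpus)
        apply_allocation(allocation)  # suspend/resume jobs to match the new plan
        elapsed = time.time() - round_start
        time.sleep(max(0.0, ROUND_DURATION_SECONDS - elapsed))
```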
4 Dynamic Market Theory Formulation
We begin by describing our theoretical formulation of a
discrete-time, dynamic market and the properties it provides.
4.1 Volatile Fisher Market
Market theory provides a fundamental approach to provably
guarantee efficiency and fairness in resource sharing. The
equilibrium of a Fisher Market [5], which is solved by maximizing
Nash Social Welfare (NSW) [8], is a strong condition that implies
all fairness guarantees used in prior systems.
It is known that Fisher market equilibrium (under equal
endowment) implies Pareto Optimality (PO), Envy-freeness
(EF), and Proportionality (PR), which are fairness properties
adopted by existing systems like DRF [17].
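Concretely, for buyers (jobs) i with budgets B_i and concave, homogeneous utilities u_i over allocations x_i of divisible goods (GPUs), a Fisher market equilibrium can be computed with the classic Eisenberg-Gale convex program, which maximizes budget-weighted Nash Social Welfare; the notation below is the standard textbook form, not necessarily the paper's.

```latex
% Eisenberg-Gale convex program: its optimum is a Fisher market equilibrium
% (with equal budgets B_i, the objective is the log of Nash Social Welfare).
\begin{align*}
  \max_{x \ge 0} \quad & \sum_{i} B_i \log u_i(x_i) \\
  \text{s.t.}    \quad & \sum_{i} x_{ij} \le 1 \qquad \forall\, \text{goods } j
\end{align*}
```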
We define efficiency in terms of the utility of a job, where
utility is a function that maps allocated resources to the
resulting job progress (e.g., throughput improvement if we allocate
more resources). The market equilibrium for a Fisher market
has also been shown to maximize efficiency [6]. Thus, we
explore the applicability of Fisher markets for DL jobs.
From static markets to dynamic markets: Volatile Fisher Market.
The classic Fisher Market assumes static, time-invariant
utility for jobs, and a recent study [15] shows that efficiency
and fairness guarantees can be violated for dynamic, time-
variant utilities. Prior work [2,3] on dynamic markets has also
studied settings where goods (resources) arrive online, while
our market considers a different setting where buyers in the
market have time-variant utilities over goods.
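To make the contrast explicit, the static market gives each buyer a single, fixed utility function, while the setting here lets a job's utility change from round to round; the notation below is purely illustrative and is not the paper's Equation 1.

```latex
% Illustrative notation only (not the paper's Equation 1).
\underbrace{u_i(x_i)}_{\text{static utility of job } i}
\quad \longrightarrow \quad
\underbrace{u_{i,t}(x_{i,t})}_{\text{utility of job } i \text{ in round } t},
\qquad t = 1, \dots, T
```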
To perform efficient fair sharing under dynamic adapta-
tion, we extend the static Fisher market to a discrete-time,
dynamic market. We name this new market the Volatile Fisher
Market (VFM). We prove that maximizing Nash Social Welfare
Over Time (i.e., NSWOT in Equation 1) solves the market