
We next study an example of how choosing an incorrect
batch size schedule can affect accuracy. In Figure 3, we con-
sider a CIFAR-10 training job on 2 GPUs using ResNet18 with
an initial batch size of 32. When using Pollux [36], the end-to-end
training time is reduced by 5× as Pollux scales up the batch
size from 32 to 64 at epoch 1, then to 314 at epoch 2, then to
690 at epoch 30, and finally to 1682 at epoch 70, which it uses
until completion. However, this aggressive scaling leads to a 2-3%
accuracy loss. Plotting the statistical efficiency (as defined in
Pollux [36]), we find that using large batch sizes in the first 30
epochs leads to accuracy degradation. Our conversation with
the authors of Pollux suggests that the accuracy degradation
depends on the initial batch size used (32 in this case) and
can thus vary across jobs.³
We also tested an expert heuristic for batch size scaling
of ResNet18 training on CIFAR-10. The heuristic scales up
the batch size when the gradient norm [1] has insignificant
(<50%) changes, and does not scale during the initial 20 epochs
or the 10 epochs before and after each learning rate decay.
This expert-defined scaling schedule has minimal accuracy loss
and is 3× faster than vanilla training. Such expert heuristics
and their associated thresholds vary across models and datasets;
the above heuristic is specific to ResNet-18 training on CIFAR-10
and is not easily transferable. For example, for
ResNet-50-ImageNet training, the experts propose a different
heuristic that scales up the batch size by a factor of ten at the
30th, 60th, and 80th epochs, respectively [38]. Thus, while
prior work has developed scheduling mechanisms for specific
dynamic adaptation techniques, Shockwave assumes no preference
for any technique and respects whatever choice users make about
how to dynamically scale a training job’s batch size.
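To make the shape of such an expert heuristic concrete, the sketch below encodes the ResNet-18/CIFAR-10 rule described above. It is only an illustration of the rule as stated: the function name, the way the gradient-norm history is passed in, and the doubling policy mentioned afterwards are our assumptions; only the thresholds come from the text.

```python
# Minimal sketch of the expert heuristic described above (ResNet-18 on CIFAR-10).
# Function name and argument layout are illustrative assumptions; only the
# thresholds (20 warmup epochs, +/-10 epochs around LR decays, <50% change)
# come from the heuristic described in the text.

def should_scale_batch_size(epoch, grad_norm_history, lr_decay_epochs,
                            warmup_epochs=20, decay_window=10,
                            change_threshold=0.5):
    """Return True if the batch size should be scaled up at this epoch."""
    # Never scale during the initial warmup epochs.
    if epoch < warmup_epochs:
        return False
    # Never scale within +/- decay_window epochs of a learning-rate decay.
    if any(abs(epoch - d) <= decay_window for d in lr_decay_epochs):
        return False
    # Scale only when the gradient norm change is insignificant (<50%).
    if len(grad_norm_history) < 2 or grad_norm_history[-2] == 0:
        return False
    prev, curr = grad_norm_history[-2], grad_norm_history[-1]
    return abs(curr - prev) / prev < change_threshold
```

A training loop would call this once per epoch with the recorded gradient norms and, for example, double the global batch size whenever it returns True.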
In summary, we find that automatically performing dy-
namic adaptation runs the risk of accuracy degradation. Hence
in this work, we aim to develop a scheduler that can observe
and forecast future scaling events but treats dynamic adapta-
tion as a part of the user’s program that cannot be modified.
3 Overview
We next present an overview of Shockwave, a new scheduling
framework that jointly optimizes efficiency and fairness for
machine learning workloads in shared clusters.
Using Market Theory for Efficiency and Fairness
In Shockwave, we propose using market theory to provably
guarantee efficiency and fairness for resource sharing. While prior
schedulers [29,44] have also leveraged market theory for
fair sharing, they are built on static market models that
assume a job's resource requests do not change over
time. We find that the fairness and efficiency guarantees of a
static market do not hold when jobs dynamically change over
time [15]. Thus, Shockwave extends the classic, static market
to a discrete-time, dynamic market to support efficient and
fair resource sharing under dynamic adaptation.

³We also found that the statistical efficiency metric in Pollux can be
incorrect for Neural-MF models [22]. We include details of this experiment
in Appendix A.
Predicting Dynamic Adaptation
Building a dynamic market alone is not enough as it presumes
perfect future knowledge of jobs’ dynamic adaptation behavior;
that is, the market needs to know when and how much jobs’ performance
(e.g., training throughput) is changed by dynamic adaptation
as training progresses. As ML training itself is a stochastic
process, the trajectory of dynamic scaling is intrinsically un-
certain. We address this problem in Shockwave by forecasting
the trajectory of dynamic adaptation and developing methods
to use these forecasts in the dynamic market.
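As a rough illustration of the forecasting step, the sketch below extrapolates a job's throughput at a batch size it has not yet used by fitting a log-log linear model to throughputs observed at earlier batch sizes. The model choice, function name, and the sample numbers in the usage line are our assumptions, not Shockwave's actual forecaster.

```python
# Illustrative forecast only: fit throughput vs. batch size on observed points
# (log-log linear) and extrapolate to a future, larger batch size.
import numpy as np

def forecast_throughput(batch_sizes, throughputs, future_batch_size):
    """Predict samples/sec at a batch size the job has not run yet."""
    log_b = np.log(np.asarray(batch_sizes, dtype=float))
    log_t = np.log(np.asarray(throughputs, dtype=float))
    slope, intercept = np.polyfit(log_b, log_t, deg=1)  # throughput ~ c * batch^slope
    return float(np.exp(intercept + slope * np.log(future_batch_size)))

# Hypothetical measurements at batch sizes 32, 64, and 314; forecast at 690.
print(forecast_throughput([32, 64, 314], [400.0, 760.0, 2900.0], 690))
```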
Scalable System Implementation
Solving a dynamic market and predicting dynamic adaptation
introduces scheduling
overhead. We build a centralized, round-based scheduler [33]
and incorporate tractable approximations that can ensure the
overhead remains low even as we scale the number of GPUs
and cluster size. We find that Shockwave can maintain low
overhead while scheduling every 120 seconds and scale to
handle 900 active jobs running on a cluster of 256 GPUs.
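For concreteness, a centralized round-based scheduler has roughly the following loop structure. This is a generic sketch under our own naming; the placeholder solve_allocation policy stands in for Shockwave's market-based solver and is not its actual algorithm.

```python
# Generic sketch of a centralized, round-based scheduling loop. Names and the
# round-robin placeholder policy are ours; Shockwave would instead solve (an
# approximation of) its dynamic market inside solve_allocation().
import itertools
import time

ROUND_DURATION_SECONDS = 120  # re-plan GPU allocations every 120-second round

def solve_allocation(active_jobs, num_gpus):
    """Placeholder policy: assign one GPU per active job, round-robin."""
    gpu_ids = itertools.cycle(range(num_gpus))
    return {job_id: [next(gpu_ids)] for job_id in active_jobs}

def scheduling_loop(get_active_jobs, apply_allocation, num_gpus):
    while True:
        round_start = time.time()
        allocation = solve_allocation(get_active_jobs(), num_gpus)
        apply_allocation(allocation)  # suspend/resume jobs to match the new plan
        elapsed = time.time() - round_start
        time.sleep(max(0.0, ROUND_DURATION_SECONDS - elapsed))
```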
4 Dynamic Market Theory Formulation
We begin by describing our theoretical formulation of a
discrete-time, dynamic market and the properties it provides.
4.1 Volatile Fisher Market
Market theory provides a fundamental approach to provably
guarantee efficiency and fairness in resource sharing. The
equilibrium of a Fisher Market [5], which is solved by maximizing
Nash Social Welfare (NSW) [8], is a strong condition that implies
all fairness guarantees used in prior systems.
It is known that Fisher market equilibrium (under equal
endowment) implies Pareto Optimality (PO), Envy-freeness
(EF), and Proportionality (PR), which are fairness properties
adopted by existing systems like DRF [17].
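Concretely, for buyers (jobs) i with budgets B_i and concave, homogeneous utilities u_i over allocations x_i of divisible goods (GPUs), a Fisher market equilibrium can be computed with the classic Eisenberg-Gale convex program, which maximizes budget-weighted Nash Social Welfare; the notation below is the standard textbook form, not necessarily the paper's.

```latex
% Eisenberg-Gale convex program: its optimum is a Fisher market equilibrium
% (with equal budgets B_i, the objective is the log of Nash Social Welfare).
\begin{align*}
  \max_{x \ge 0} \quad & \sum_{i} B_i \log u_i(x_i) \\
  \text{s.t.}    \quad & \sum_{i} x_{ij} \le 1 \qquad \forall\, \text{goods } j
\end{align*}
```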
We define efficiency in terms of the utility of a job, where
utility is a function that maps allocated resources to the
resulting job progress (e.g., throughput improvement if we allocate
more resources). The market equilibrium for a Fisher market
has also been shown to maximize efficiency [6]. Thus, we
explore the applicability of Fisher markets for DL jobs.
From static markets to dynamic markets: Volatile Fisher Market.
The classic Fisher Market assumes static, time-invariant
utility for jobs, and a recent study [15] shows that efficiency
and fairness guarantees can be violated for dynamic, time-
variant utilities. Prior work [2,3] on dynamic markets has also
studied settings where goods (resources) arrive online, while
our market considers a different setting where buyers in the
market have time-variant utilities over goods.
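To make the contrast explicit, the static market gives each buyer a single, fixed utility function, while the setting here lets a job's utility change from round to round; the notation below is purely illustrative and is not the paper's Equation 1.

```latex
% Illustrative notation only (not the paper's Equation 1).
\underbrace{u_i(x_i)}_{\text{static utility of job } i}
\quad \longrightarrow \quad
\underbrace{u_{i,t}(x_{i,t})}_{\text{utility of job } i \text{ in round } t},
\qquad t = 1, \dots, T
```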
To perform efficient fair sharing under dynamic adapta-
tion, we extend the static Fisher market to a discrete-time,
dynamic market. We name this new market the Volatile Fisher
Market (VFM). We prove that maximizing Nash Social Welfare
Over Time (i.e., NSWOT in Equation 1) solves the market