Automatic Discovery of Composite SPMD
Partitioning Strategies in PartIR
Sami Alabed Dominik Grewe Juliana Franco Bart Chrzaszcz
Tom Natan Tamara Norman Norman A. Rink
Dimitrios Vytiniotis Michael Schaarschmidt
DeepMind
Abstract
Large neural network models are commonly trained through a combination of advanced parallelism strategies in a single program, multiple data (SPMD) paradigm. For example, training large transformer models requires combining data, model, and pipeline partitioning, as well as optimizer sharding techniques. However, identifying efficient combinations for many model architectures and accelerator systems requires significant manual analysis. In this work, we present an automatic partitioner that identifies these combinations through a goal-oriented search. Our key finding is that a Monte Carlo Tree Search-based partitioner that leverages partition-specific compiler analyses directly in the search and is guided by user goals matches expert-level strategies for various models.
1 Introduction
The rapid rise of large neural networks with significant memory and compute requirements has made partitioning strategies critical for enabling their training and inference. For example, large language models such as GPT-3 [2], Gopher [22], PaLM [5] or Chinchilla [10] have relied on various combinations of Megatron-style layer sharding [28, 20], batch/data parallelism, pipeline parallelism [19, 35, 11], or ZeRO sharding/offloading [23, 25].
Model sizes across applications have been outpacing device memory and FLOPS growth: Google TPU [14, 15] v2 reports 46 TFLOPS / 16 GB of HBM per chip, while v4 reports 275 TFLOPS / 32 GB², over a period in which model parameters (and consequently FLOPS requirements) increased by approximately four orders of magnitude. The use of sophisticated parallelism strategies is, therefore, unavoidable. Despite progress in the development of partitioning APIs (e.g. GSPMD-style [34, 18] partitioning annotations exposed in front-end APIs by JAX [1]) and parallelism frameworks such as DeepSpeed [24], customising new research models to hardware configurations often still requires expert analysis to account for model-specific compute and communication patterns. Recent work has proposed automated parallelism frameworks specifically targeting large neural network training. For example, Alpa [37] defines a hierarchical space of intra- and inter-operator parallelism and solves sub-problems through integer linear programming (ILP). Unity [29], an extension of FlexFlow [13] and TASO [12], jointly optimises parallelism and algebraic transformations through hierarchical search. There is also work applying reinforcement learning (RL) to the partitioning problem [31], recently combining RL with a constraint solver and achieving good generalization properties [32].
Correspondence: sa894@cam.ac.uk. Work done during an internship at DeepMind.
² https://cloud.google.com/tpu/docs/system-architecture-tpu-vm
Preprint. Under review.
We approach automated parallelism with a different philosophy. Our automated SPMD partitioner is closely integrated with our PartIR compiler stack. PartIR [30] provides our partitioner with the ability to validate partitioning decisions, expose a host of static analyses about the module, automatically propagate sharding decisions across a module, and produce analytical cost model estimates. The latter are obtained using a simulator of a lowered and communication-optimized partitioned module at the compiler level. By contrast, ILP-based encodings [37] attempt to address scalability issues in large graphs by, e.g., heuristically grouping operators to prune the search space effectively; furthermore, there may be communication optimizations applied post-ILP that cannot be captured when deciding the parallel plan. FlexFlow [13] addresses scalability by making the simulator in its MCMC search incremental and relies on pre-defined layers/block-operators (e.g. Multi-Head Attention). Our approach, on the other hand, can partition any array operation of a low-level tensor IR (XLA) that high-level NN libraries compile to. As we show, our tight compiler integration allows our search to be data- and size-efficient and to run routinely as part of our users' (typically researchers') iterative workflows.
In this work, we extend our automated partitioner, Automap [27], to support partitioning a model across a multi-dimensional device topology and to discover expert-level composite SPMD partitioning strategies. Our contribution is the design of a goal-oriented Monte Carlo tree search (MCTS) [3] that decomposes the problem into smaller sub-problems. Furthermore, we incorporate partitioning-specific compiler analyses into the MCTS to reduce both the nodes and edges of the tree, improving the search's robustness. We show that our partitioner discovers composite expert-level SPMD strategies for common models, such as Transformers and GraphNets. Moreover, it produces strategies significantly better than human-written ones for models such as UNet, for which no known strategy is available.
2 Background
2.1 Logical Meshes and Composite Strategies
Figure 1: The composite strategy of batch and model parallelism over mesh {batch:N, model:M}. We also show communication patterns that may emerge: on the left, possible communication along the model axis (e.g., Megatron activation reductions), and on the right, communication along the batch axis (e.g., gradient reductions). The color coding denotes the unique parameter shards that each device holds; e.g., all devices along the batch axis hold the same shard of parameters.
Leveraging composite partitioning techniques has enabled the training of recent large models [2, 22, 10]. The main idea is to structure the available accelerator devices into an n-dimensional logical mesh (which will typically, but not necessarily, correspond to the physical communication topology). For instance, we may view 32 TPU devices as a 4x8 mesh, or a system of 2 GPU servers with 16 GPUs each as a 2x16 mesh. Once we are given such a logical mesh of available devices, e.g. a 2D mesh, a conventional strategy is to do batch parallelism over one axis and parameter sharding (model parallelism) over the other. Figure 1 graphically depicts this strategy. ZeRO-style sharding of the optimizer [23] (on top of batch parallelism, and possibly also parameter sharding) is simply a different stage that shards the optimizer parameters along the axis used for batch parallelism.
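As a concrete illustration, such a 2D logical mesh can be expressed in JAX. The following is a minimal sketch assuming 8 devices; the axis names mirror the {batch:N, model:M} mesh above, and this is ordinary JAX rather than PartIR's interface:

# A minimal JAX sketch of a 2D logical mesh {batch:2, model:4} over 8 devices.
import numpy as np
import jax
from jax.sharding import Mesh, NamedSharding, PartitionSpec

devices = np.asarray(jax.devices()).reshape(2, 4)   # batch N=2, model M=4
mesh = Mesh(devices, axis_names=("batch", "model"))

# Batch parallelism shards activations along "batch"; parameter sharding
# (model parallelism) shards a weight matrix along "model".
acts_sharding = NamedSharding(mesh, PartitionSpec("batch", None))
weight_sharding = NamedSharding(mesh, PartitionSpec(None, "model"))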
Conceptually, each stage of a composite strategy like the above optimizes for specific objectives. For example, batch parallelism and parameter sharding typically improve runtime (while also reducing memory requirements), whereas ZeRO-style optimizer sharding aims mainly at improving memory (but may improve runtime too, as the typically memory-bound vector operations of the optimizer update are sharded, as was already observed in precursor work [33]). ZeRO "stage-2" optimizer sharding may not increase communication cost, since it replaces AllReduce operations with pairs of parameter AllGather and gradient ReduceScatter. ZeRO "stage-3" aims to further improve memory by introducing separate parameter AllGathers for the forward and the backward pass of a model; it may hence slightly increase runtime in favor of keeping smaller tensors live across the forward and backward computation. Note that, in our setting, the logical mesh is given ahead of time by the user; the partitioner's task is to discover composite strategies like those described above based on user-provided objectives.
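The stage-2 exchange can be simulated in a few lines of NumPy. This is a toy single-process sketch with assumed sizes, not PartIR or DeepSpeed code:

# ZeRO stage-2 replaces AllReduce(grads) with ReduceScatter(grads) followed
# by AllGather(updated params); each "device" updates only a 1/N parameter
# shard, so the memory-bound optimizer step is itself sharded.
import numpy as np

N, D = 4, 8                                           # devices, parameter count
params = np.zeros(D)
local_grads = [np.random.randn(D) for _ in range(N)]  # per-device gradients

# ReduceScatter: sum gradients across devices; device i keeps shard i of the sum.
grad_shards = np.split(np.sum(local_grads, axis=0), N)

# Each device applies the optimizer update to its shard only.
lr = 0.1
new_shards = [params[i * D // N:(i + 1) * D // N] - lr * g
              for i, g in enumerate(grad_shards)]

# AllGather: every device reassembles the full updated parameter vector.
params = np.concatenate(new_shards)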
// worklist = {%arg0, %arg1, %arg2}
func f(%arg0, %arg1, %arg2) {
  %0 = add(%arg0, %arg1)
  ...
}

// worklist = {%arg2}
func f(%arg0, %arg1, %arg2) {
  %0 = tile 0 "batch" (%r : range<N>) {
    %slice0 = slice 0 %arg0[%r]
    %slice1 = slice 0 %arg1[%r]
    %a = add(%slice0, %slice1)
    yield %a
  }
  ...
}

Figure 2: Sharding argument %arg0 along axis batch in the left module causes %arg1 to also become sharded, leaving us with a worklist of just %arg2 and the rewritten module on the right.
2.2 PartIR rewriting for SPMD partitioning
The PartIR compiler infrastructure accepts a user-defined ML model in the form of an MLIR [17] module together with a user-provided mesh, and exposes rewrite actions on that module to express forms of sharding. The key idea behind PartIR rewriting is to express parallelism of array operations, like z = matmul(x, w), as parallel tiling or reduction loops, specifically tagged with the axis over which these parallel loops range. For example:
// Tiling loop derived from sharding x
// along dimension 0 on "batch" axis.
%z = tile 0 "batch" (%r : range<N>) {
  %xs = slice 0 %x[%r]
  %zs = matmul(%xs, %w)
  yield %zs
}

// Reduction loop derived from sharding
// both inputs along a contracting
// dimension on "model" axis.
%z = sum "model" (%r : range<M>) {
  %xs = slice 1 %x[%r]
  %ws = slice 0 %w[%r]
  %zs = matmul(%xs, %ws)
  yield %zs
}
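To make the reduction loop concrete, the following NumPy sketch (shapes and the mesh size M are illustrative assumptions) shows why sharding the contracting dimension yields partial results that must be summed:

# Each "device" along the "model" axis computes a partial matmul over its
# slice of the contracting dimension; summing the partials recovers x @ w.
import numpy as np

M = 4                                  # mesh size along the "model" axis
x = np.random.randn(8, 16)             # activations
w = np.random.randn(16, 32)            # weights

x_slices = np.split(x, M, axis=1)      # slice 1 %x[%r]
w_slices = np.split(w, M, axis=0)      # slice 0 %w[%r]

# Mirrors the `sum "model"` reduction loop above.
z = sum(xs @ ws for xs, ws in zip(x_slices, w_slices))
assert np.allclose(z, x @ w)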
Compiler-exposed sharding actions, of the form partition(parameter_id, dimension, axis), introduce and propagate such tiling loops across the MLIR module, based on powerful propagation rules that depend on the semantics of the array operations at hand.
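For instance, the two loops shown above could result from actions such as the following. This is a hypothetical Python rendering of the action tuples; PartIR exposes these as compiler actions rather than through this API:

# Hypothetical action tuples corresponding to the loops above.
tiling_action = ("partition", "x", 0, "batch")   # shard x's dim 0 on "batch"
reduction_actions = [
    ("partition", "x", 1, "model"),              # shard x's contracting dim
    ("partition", "w", 0, "model"),              # and w's matching dim
]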
Search with PartIR actions
Automap [27] applies such actions to the module, guided by a worklist consisting of the module arguments (e.g., model parameters, optimizer state, and the data tensors passed to a training step function). If the compiler detects that the sharding of an argument can be propagated to another argument, then both are removed from the worklist and no longer considered for further sharding actions (see Figure 2 for an example).
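A toy model of this pruning step (the propagation relation is taken from Figure 2; the Python rendering is illustrative, not PartIR's API):

# Sharding %arg0 propagates to %arg1 through the add, so a single action
# removes both from the worklist; only %arg2 remains a candidate.
propagates_with = {"%arg0": {"%arg1"}, "%arg1": {"%arg0"}, "%arg2": set()}

worklist = {"%arg0", "%arg1", "%arg2"}
chosen = "%arg0"                          # e.g. partition(%arg0, 0, "batch")
worklist -= {chosen} | propagates_with[chosen]
assert worklist == {"%arg2"}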
As a result, a very small number of sharding actions can lead to high-quality partitions (e.g., low
communication overhead) within a reasonable search budget. This can be amplified by further
compressing the worklist with user-provided groupings of parameters that should be equi-sharded
(e.g. all transformer blocks across a large transformer model). The overall search is driven by a cost
model, typically simulated runtime cost and peak per-device memory obtained statically.
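One plausible way to scalarise these two signals is sketched below; the exact objective used by the search is not specified here, so the penalty form is an assumption:

# Assumed objective: minimise simulated runtime while penalising any
# peak-memory overshoot of the per-device budget. Constants are illustrative.
def cost(sim_runtime: float, peak_memory: float,
         memory_budget: float, penalty: float = 1e3) -> float:
    overshoot = max(0.0, peak_memory - memory_budget)
    return sim_runtime + penalty * (overshoot / memory_budget)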
3 Multi-axis search
In this work, we extend Automap [27] to support multi-axis automatic partitioning, a capability that our PartIR stack already supports through nested tiling and reduction loops, as well as multi-axis loop propagation. However, multi-axis partitioning increases the search complexity exponentially with every axis: the search must consider dependencies between partitioning decisions as well as, for each decision, the choice of argument, axis, and dimension. Compiler static analysis, together with Automap's [27] worklist mechanism described above, helps us prune the branching factor of partitioning decisions. This section outlines the major additions needed to reduce the complexity of the problem and quickly reach expert-level sharding on several representative models.
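A back-of-envelope count illustrates the blow-up (the numbers are illustrative, not taken from the paper):

# One action picks an (argument, dimension, axis) triple; plans are sequences
# of actions, so the tree grows rapidly with plan length and mesh axes.
num_args, num_dims, num_axes = 200, 2, 2
per_step = num_args * num_dims * num_axes   # 800 candidate actions per step
print(per_step ** 4)                        # ~4.1e11 distinct 4-action plans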
3.1 Monte Carlo Tree Search for SPMD compilation
The automated partitioner is implemented as a Monte Carlo Tree Search (MCTS) [3] with upper confidence bound for trees (UCT). The MCTS starts with no prior knowledge about the model's