
We approach automated parallelism with a different philosophy. Our automated SPMD partitioner is
closely integrated with our PartIR compiler stack. PartIR [30] provides our partitioner with the ability
to validate partitioning decisions, expose a host of static analyses about the module, automatically
propagate sharding decisions across a module, and produce analytical cost-model estimates. The latter
are obtained using a simulator of a lowered and communication-optimized partitioned module at the
compiler level. By contrast, ILP-based encodings [37] attempt to address scalability issues in large
graphs by, e.g., heuristically grouping operators to prune the search space effectively; furthermore,
communication optimizations applied after the ILP solve cannot be captured when deciding the
parallel plan. FlexFlow [13] addresses scalability by making the simulator in its MCMC search
incremental and relies on pre-defined layers/block-operators (e.g. Multi-Head Attention). On the
other hand, our approach can partition any array operation of the low-level tensor IR (XLA) that
high-level NN libraries compile to. As we show, our tight compiler integration allows our search to be
data- and size-efficient and to run routinely as part of the iterative workflows of our users (typically researchers).
In this work, we extend our automated partitioner, Automap [27], to support partitioning a model
across multi-dimensional device topologies and to discover expert-level composite SPMD partitioning
strategies. Our contribution is the design of a goal-oriented Monte Carlo tree search (MCTS) [3] that
decomposes the problem into smaller sub-problems. Furthermore, we incorporate a partitioning-specific
compiler analysis into the MCTS that reduces the number of nodes and edges in the tree, improving the
search’s robustness. We show that our partitioner discovers composite expert-level SPMD strategies
for common models, such as Transformers and GraphNets. Moreover, it produces strategies significantly
better than human-written ones for models such as UNet, for which no established strategy is available.
2 Background
2.1 Logical Meshes and Composite Strategies
Figure 1: The composite strategy of batch and model parallelism over mesh {batch:N, model:M}. We also show communication patterns that may emerge: on the left, possible communication along the model axis (e.g., Megatron activation reductions), and on the right, communication along the batch axis (e.g., gradient reductions). The color coding denotes the unique parameter shards that each device holds; e.g., all devices along the batch axis hold the same shard of parameters.
Leveraging composite partitioning techniques has enabled training of recent large models [2, 22, 10].
The main idea is to structure the available accelerator devices into an n-dimensional logical mesh
(which will typically, but not necessarily, correspond to the physical communication topology). For
instance, we may view 32 TPU devices as a 4x8 mesh or a system of 2 GPU servers with 16 GPUs
each as a 2x16 mesh. Given such a logical mesh of available devices, e.g., a 2D mesh, a conventional
strategy is to perform batch parallelism over one axis and parameter sharding (model parallelism)
over the other. Figure 1 graphically depicts this strategy. ZeRO-style sharding of
the optimizer [23] (on top of batch parallelism, and possibly also parameter sharding) is simply a
different stage that shards the optimizer parameters along the axis used for batch parallelism.
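For concreteness, below is a minimal JAX sketch of the Figure 1 strategy, assuming 32 devices arranged as a 4x8 {batch, model} mesh; the axis names, array shapes, and the two-layer function are illustrative only, not part of our system.

```python
import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange 32 devices as a 4x8 logical mesh with named axes (assumes
# jax.devices() returns 32 devices, e.g. on a TPU pod slice).
devices = np.array(jax.devices()).reshape(4, 8)
mesh = Mesh(devices, axis_names=("batch", "model"))

# Activations: sharded along `batch` (batch parallelism).
x = jax.device_put(np.ones((32, 1024), np.float32),
                   NamedSharding(mesh, P("batch", None)))
# Parameters: replicated along `batch`, sharded along `model` (model
# parallelism) -- the color coding of Figure 1.
w1 = jax.device_put(np.ones((1024, 4096), np.float32),
                    NamedSharding(mesh, P(None, "model")))
w2 = jax.device_put(np.ones((4096, 1024), np.float32),
                    NamedSharding(mesh, P("model", None)))

@jax.jit
def layer(x, w1, w2):
    y = x @ w1          # output sharded over (batch, model)
    # Contracting y's model-sharded dimension with w2 yields partial sums,
    # so the compiler inserts a reduction along `model` -- the Megatron-style
    # activation reduction on the left of Figure 1.
    return y @ w2
```

Under this placement, gradient reductions for w1 and w2 during training become reductions along the batch axis, matching the right side of Figure 1.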
Conceptually, each stage of a composite strategy, like the above, optimizes for specific objectives. For
example, batch parallelism and parameter sharding typically improve runtime (while also reducing
memory requirements); whereas ZeRO-style optimizer sharding aims mainly towards improving
memory (but may improve runtime too, as the typically memory-bound vector operations of the
optimizer update are sharded, as was already observed in precursor work [33]). ZeRO “stage-2”
optimizer sharding may not increase communication cost since it replaces AllReduce operations with
pairs of parameter AllGather and gradient ReduceScatter. ZeRO “stage-3” aims to further improve
memory by introducing separate parameter AllGathers for the forward and the backward pass of a
model; hence it may slightly increase runtime in exchange for keeping smaller tensors live across the forward
and backward computation. Note that, in our setting, the logical mesh will be given ahead of time by
the user; the partitioner’s task is to discover composite strategies like those described above based on
user-provided objectives.
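To illustrate why stage-2 need not increase communication cost, here is a toy NumPy simulation (no real collectives; the device count and gradient shapes are arbitrary) of the identity it relies on: an AllReduce is equivalent to a ReduceScatter followed by an AllGather, which is in fact how ring AllReduce is commonly implemented.

```python
import numpy as np

# Toy simulation of the collective identity behind ZeRO stage-2:
# AllReduce(g_1..g_n) == AllGather(ReduceScatter(g_1..g_n)).
n_dev, dim = 4, 8
rng = np.random.default_rng(0)
grads = [rng.standard_normal(dim) for _ in range(n_dev)]  # per-device gradients

# Plain batch parallelism: AllReduce -- every device holds the full summed gradient.
all_reduced = np.sum(grads, axis=0)

# ZeRO stage-2: ReduceScatter -- device i keeps only shard i of the summed
# gradient and updates its slice of the optimizer state, ...
shards = np.split(np.sum(grads, axis=0), n_dev)
# ... then an AllGather of the updated parameter shards restores full parameters.
gathered = np.concatenate(shards)

assert np.allclose(gathered, all_reduced)
```

Stage-2 thus performs only the ReduceScatter on gradients and defers the AllGather to the updated parameters, moving the same total bytes as the single AllReduce it replaces.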