MSRL: Distributed Reinforcement Learning with Dataflow Fragments
Huanzhou Zhu*
Imperial College London
Bo Zhao*
Imperial College London
Gang Chen
Huawei Technologies Co., Ltd.
Weifeng Chen
Huawei Technologies Co., Ltd.
Yijie Chen
Huawei Technologies Co., Ltd.
Liang Shi
Huawei Technologies Co., Ltd.
Yaodong Yang
Huawei R&D United Kingdom
Peter Pietzuch
Imperial College London
Lei Chen
Hong Kong University of Science and Technology
Abstract
Reinforcement learning (RL) trains many agents, which is
resource-intensive and must scale to large GPU clusters. Dif-
ferent RL training algorithms offer different opportunities for
distributing and parallelising the computation. Yet, current
distributed RL systems tie the definition of RL algorithms
to their distributed execution: they hard-code particular dis-
tribution strategies and only accelerate specific parts of the
computation (e.g. policy network updates) on GPU workers.
Fundamentally, current systems lack abstractions that decou-
ple RL algorithms from their execution.
We describe MindSpore Reinforcement Learning (MSRL),
a distributed RL training system that supports distribution
policies that govern how RL training computation is paral-
lelised and distributed on cluster resources, without requir-
ing changes to the algorithm implementation. MSRL intro-
duces the new abstraction of a fragmented dataflow graph,
which maps Python functions from an RL algorithm’s train-
ing loop to parallel computational fragments. Fragments are
executed on different devices by translating them to low-level
dataflow representations, e.g. computational graphs as sup-
ported by deep learning engines, CUDA implementations or
multi-threaded CPU processes. We show that MSRL sub-
sumes the distribution strategies of existing systems, while
scaling RL training to 64 GPUs.
1 Introduction
Reinforcement learning (RL) solves decision-making prob-
lems in which agents continuously learn policies, typically
represented as deep neural networks (DNNs), on how to act
in an unknown environment [48]. RL has achieved remark-
able outcomes in complex real-world settings: in game play,
AlphaGo [46] has defeated a world champion in the board
game Go; in biology, AlphaFold [20] has predicted three-
dimensional structures for protein folding; in robotics, RL-
based approaches have allowed robots to perform tasks such
as dexterous manipulation without human intervention [14].
*Equal contribution
The advances in RL come with increasing computational
demands for training: AlphaStar trained 12 agents using
384 TPUs and 1,800 CPUs for more than 44 days to achieve
grandmaster level in StarCraft II game play [50]; OpenAI Five
trained to play Dota 2 games for 10 months with 1,536 GPUs
and 172,800 CPUs, defeating 99.4% of human players [3].
Distributed RL systems must therefore scale in terms of agent
numbers, the amount of training data generated by environ-
ments, and the complexity of the trained policies [3,20,50].
Existing designs for distributed RL systems, e.g. SEED
RL [9], ACME [17], Ray [33], RLlib [24] and Podracer [15]
hardcode a single strategy to parallelise and distribute an
RL training algorithm based on its algorithmic structure. For
example, if an RL algorithm is defined as a set of Python
functions for agents, learners and environments, a system
may directly invoke the implementation of an agent's act() function to generate new actions for the environment.
While integrating an RL algorithm’s definition with its
execution strategy simplifies the design and implementation,
it means that systems fail to parallelise and distribute RL
algorithms in the most effective way:
(1) When parallelising the training computation, most RL
systems only accelerate the DNN computation on GPUs or
TPUs [9,17,24] using current deep learning (DL) engines,
such as PyTorch [38], TensorFlow [13] or MindSpore [18].
Other parts of the RL algorithm, such as action generation,
environment execution and trajectory sampling, are left to
be executed as sequential Python functions on worker nodes,
potentially becoming performance bottlenecks.
Recently, researchers have investigated this unrealised ac-
celeration potential: Podracer [15] uses the JAX [11] compi-
lation framework to vectorise the Python implementation of
RL algorithms and parallelise execution on GPUs and TPUs;
WarpDrive [22] can execute the entire RL training loop on a
GPU when implemented in CUDA; and RLlib Flow [25] uses
parallel dataflow operators [54] to express RL training. While
these approaches accelerate a larger portion of RL algorithms,
they force users to define algorithms through custom APIs,
e.g. with a fixed set of supported computational operators.
(2) For the distribution of the training computation, current
RL systems are designed with particular strategies in mind, i.e.
they allocate algorithmic components (e.g. actors and learn-
ers) to workers in a fixed way. For example, SEED RL [9]
assumes that learners perform both policy inference and train-
ing on TPU cores, while actors execute on CPUs; ACME [17]
only distributes actors and maintains a single learner; and
TLeague [47] distributes learners but co-locates environments
with actors on CPU workers. Different RL algorithms de-
ployed on different resources will exhibit different compu-
tational bottlenecks, which means that a single strategy for
distribution cannot be optimal.
We observe that the above challenges originate from the lack
of an abstraction that separates the specification of an RL
algorithm from its execution. Inspired by how compilation-
based DL engines use intermediate representations (IRs) to
express computation [7,52], we want to design a new dis-
tributed RL system that (i) supports the execution of arbitrary
parts of the RL computation on parallel devices, such as GPUs
and CPUs (parallelism); and (ii) offers flexibility in how to deploy parts of an RL algorithm across devices (distribution).
We describe MindSpore Reinforcement Learning (MSRL),
a distributed RL system that decouples the specification of
RL algorithms from their execution. This enables MSRL to
change how it parallelises and distributes different RL algo-
rithms to reduce iteration time and increase the scalability of
training. To achieve this, MSRL’s design makes the following
contributions:
(1) Execution-independent algorithm specification (§3).
In MSRL, users specify an RL algorithm by implementing
its algorithmic components (e.g. agents, actors, learners) as
Python functions in a traditional way. This implementation
makes no assumptions about how the algorithm will be exe-
cuted: all runtime interactions between components are man-
aged by calls to MSRL APIs. A separate deployment configu-
ration defines the devices available for execution.
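To make this separation concrete, the following sketch shows the general shape of such a specification; the class layout, method names and the deployment dictionary are illustrative assumptions, not MSRL's published API.

# Illustrative sketch only: an execution-independent algorithm specification.
# The names below (Actor, Learner, the deployment dict) are hypothetical
# stand-ins, not MSRL's actual interface.

class Actor:
    """Holds a policy and produces actions; knows nothing about devices."""
    def __init__(self, policy):
        self.policy = policy

    def act(self, state):
        # Policy inference: map an environment state to an action.
        return self.policy(state)


class Learner:
    """Improves the policy from trajectories; equally device-agnostic."""
    def __init__(self, policy, lr=0.01):
        self.policy = policy
        self.lr = lr

    def learn(self, trajectory):
        # Policy training: in a real algorithm this would be a DNN update;
        # here a placeholder statistic stands in for the loss.
        rewards = [r for (_, r) in trajectory]
        return sum(rewards) / max(len(rewards), 1)


# Deployment is described separately from the algorithm; only this part
# changes when moving from a laptop to a GPU cluster.
deployment = {"actor_workers": 4, "learner_devices": ["GPU:0"]}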
(2) Fragmented dataflow graph abstraction (§4).
From the
RL algorithm implementation, MSRL constructs a fragmented
dataflow graph (FDG). An FDG is a dataflow representation
of the computation that allows MSRL to map the algorithm
to heterogeneous devices (CPUs and GPUs) at deployment
time. The graph consists of fragments, which are parts of the
RL algorithm that can be parallelised and distributed across
devices.
MSRL uses static analysis on the algorithm implementa-
tion to group its functions into fragments, instances of which
can be deployed on devices. The boundaries between frag-
ments are decided with the help of user-provided partition
annotations in the algorithm implementation. Annotations
specify synchronisation points that require communication
between fragments that are replicated across devices for in-
creased data-parallelism.
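As an illustration of how such annotations might look (the decorator below and its semantics are our own stand-ins, not MSRL's actual annotation syntax), a user could mark the functions that become candidate fragments:

# Hypothetical partition annotation: a marker saying "this function may become
# a separately deployable, replicable dataflow fragment".
def fragment(fn):
    fn.is_fragment = True
    return fn

@fragment
def collect(actor, env, steps):
    # Data-parallel fragment: many replicas can run against separate
    # environment instances.
    state = env.reset()
    trajectory = []
    for _ in range(steps):
        action = actor.act(state)
        state, reward = env.step(action)
        trajectory.append((state, reward))
    return trajectory

@fragment
def update(learner, trajectories):
    # Synchronisation point: replicated collect() instances must send their
    # trajectories to the learner fragment before this can run.
    merged = [step for traj in trajectories for step in traj]
    return learner.learn(merged)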
(3) Fragment execution using DL engines (§5).
For execu-
tion, MSRL deploys hardware-specific implementations of
fragments, permitting instances to run on CPUs and/or GPUs.
MSRL supports different fragment implementations: CPU
implementations use regular Python code, and GPU imple-
mentations are generated as compiled computation graphs of
DL engines (e.g. MindSpore or TensorFlow) if a fragment
is implemented using operators, or they are implemented di-
rectly in CUDA. MSRL fuses multiple data-parallel fragments
for more efficient execution.
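For instance, a fragment expressed with DL-engine operators can be traced into a computational graph that the engine places on a GPU. The snippet below uses TensorFlow's tf.function purely as a generic illustration of that compilation step (the network shape and batch size are arbitrary); it is not MSRL code.

import tensorflow as tf

# A small policy network; in an operator-based fragment, this is the part the
# DL engine compiles and accelerates.
policy = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(4),                     # 4 discrete actions
])

@tf.function  # traced into a computational graph by the engine
def act_fragment(states):
    # Batched policy inference over many environment states at once.
    logits = policy(states)
    return tf.argmax(logits, axis=-1)

actions = act_fragment(tf.random.normal([128, 8]))  # 128 envs, 8-dim states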
(4) Distribution policies (§6).
Since the algorithm implemen-
tation is separated from execution through the FDG, MSRL
can apply different distribution policies to govern how frag-
ments are mapped to devices. MSRL supports a flexible set of
distribution policies, which subsume the hard-coded distribu-
tion strategies of current RL systems: e.g. a distribution policy
can distribute multiple actors to scale the interaction with the
environment (like Acme [17]); distribute actors but move
policy inference to learners (like SEED RL [9]); distribute
both actors and learners (like Sebulba [15]); and represent
the full RL training loop on a GPU (like WarpDrive [22]
and Anakin [15]). As the algorithm configuration, its hyper-
parameters or hardware resources change, MSRL can switch
between distribution policies to maintain high training effi-
ciency without requiring changes to the RL algorithm imple-
mentation.
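A distribution policy can therefore be expressed as plain data that maps fragments to replica counts and devices. The sketch below is a hypothetical configuration format (the fragment names, field names and the selection heuristic are assumptions); it only illustrates how Acme-, SEED RL- and WarpDrive/Anakin-style layouts become different entries rather than different algorithm implementations.

# Hypothetical distribution policies; fragment and field names are illustrative.
POLICIES = {
    # Acme-like: many distributed actors, a single learner, inference on actors.
    "distributed_actors": {
        "collect": {"replicas": 32, "device": "CPU"},
        "update":  {"replicas": 1,  "device": "GPU:0"},
    },
    # SEED RL-like: actors only step environments; inference moves to the learner.
    "central_inference": {
        "env_step": {"replicas": 64, "device": "CPU"},
        "infer":    {"replicas": 1,  "device": "GPU:0"},
        "update":   {"replicas": 1,  "device": "GPU:0"},
    },
    # WarpDrive/Anakin-like: the whole training loop runs on accelerators.
    "single_device_loop": {
        "train_loop": {"replicas": 8, "device": "GPU"},
    },
}

def select_policy(num_envs, env_step_ms):
    # Toy heuristic: switch policies as the workload changes, without touching
    # the algorithm implementation itself.
    if env_step_ms > 5.0:
        return POLICIES["distributed_actors"]
    if num_envs <= 8:
        return POLICIES["single_device_loop"]
    return POLICIES["central_inference"]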
We evaluate MSRL experimentally and show that, by switch-
ing between distribution policies, MSRL improves the train-
ing time of the PPO RL algorithm by up to 2.4× as hyper-parameters, network properties or hardware resources change.
Its FDG abstraction offers more flexible execution with-
out compromising training performance: MSRL scales to
64 GPUs and outperforms the Ray distributed RL system [33]
by up to 3×.
2 Distributed Reinforcement Learning
Next we give background on reinforcement learning (§2.1),
discuss the requirements for distributed RL training (§2.2),
and survey the design space of existing RL systems (§2.3).
2.1 Reinforcement learning
Reinforcement learning (RL) solves a sequential decision-
making problem in which an agent operates in an unknown
environment. The agent’s goal is to learn a policy that max-
imises the cumulative reward based on the feedback from the
environment (see Fig. 1): (1) policy inference: an agent performs an inference computation on a policy, which maps the environment's state to an agent's action; (2) environment execution: the environment executes the actions, generating trajectories, which are sequences of ⟨state, reward⟩ pairs produced by the policy; (3) policy training: finally, the agent improves the policy by training it based on the received reward.
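In code, the loop amounts to the following minimal single-agent sketch; policy, train and env are placeholders following the common reset()/step() convention rather than any particular system's API.

def training_loop(env, policy, train, episodes=10, horizon=100):
    for _ in range(episodes):
        state = env.reset()
        trajectory = []
        for _ in range(horizon):
            action = policy(state)            # (1) policy inference
            state, reward = env.step(action)  # (2) environment execution
            trajectory.append((state, reward))
        train(trajectory)                     # (3) policy training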
RL algorithms that train a policy fall into three categories:
(1) value-based methods (e.g. DQN [32]) use a deep neural
network (DNN) to approximate the value function, i.e. a map-
2
Environments
Agent 1 Agent 2 Agent n
Policy 1 Policy 2 Policy n
Action a1Action a2Action an
Joint action
<a2,a3,..,an>
State1!
reward1
Message
State2!
reward2
Staten!
rewardn
Step
1
Environment 1
Environment 2
Environment 3
Policy
inference
Step
2
Environment
execution
Policy
training
Step
3
Fig. 1: RL training loop with multiple agents
ping of the expected return to perform a given action in a state.
Agents learn the values of actions and select actions based on
these estimated values; (2) policy-based methods (e.g. Rein-
force [51]) directly learn a parametrised policy, approximated
by a DNN, for selecting actions without consulting a value
function. Agents use batched trajectories to train the policy
by updating the parameters in the direction of the gradient
that maximises the reward; and (3) actor–critic methods (e.g.
PPO [45], DDPG [26], A2C [31]) combine the two by learn-
ing a policy to select actions (actor) and a value function to
evaluate selected actions (critic).
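For policy-based methods, the parameter update follows the standard policy-gradient direction: writing $R(\tau)$ for the return of a trajectory $\tau$ sampled from the policy $\pi_\theta$, the ascended gradient is

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\, R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big],$$

which in practice is estimated from batched trajectories, as in category (2) above.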
Multi-agent reinforcement learning (MARL) has multiple
agents, each optimising its own cumulative reward when in-
teracting with the environment or other agents (see
Fig.
1).
A3C [31] executes agents asynchronously on separate environ-
ment instances; MAPPO [53] extends PPO to a multi-agent
setting in which agents share a global parametrised policy.
During training, each agent maintains a central critic that
takes joint information (e.g. state/action pairs) as input, while
learning a policy that only depends on local state.
2.2 Requirements for distributed RL systems
The benefits of RL come at the expense of its computational
cost [37]. Complex settings and environments require the
exploration of large spaces of actions, states and DNN param-
eters, and these spaces grow exponentially with the number
of agents [35]. Therefore, RL systems must be scalable by
exploiting both the parallelism of GPU acceleration and a
large number of distributed worker nodes.
There exists a range of proposals for how RL systems can parallelise and distribute RL training: for single-agent RL, environment execution (step 2 in Fig. 1) and policy inference and training (steps 1 and 3) can be distributed across workers [15,17,33],
potentially using GPUs [9,15]; for MARL, agents can be
distributed [24,25,33,43] and exchange training state using
communication libraries [28,36]. In general, environment
instances can execute in parallel [15,31] or be distributed [8]
to speed up execution.
No single one of the above strategies for parallelising and
distributing RL training is optimal, both in terms of achieving
the lowest iteration time and having the best scalability, for
all possible RL algorithms. For different algorithms, training
workloads and hardware resources, the training bottlenecks
shift: e.g. our measurements show that, for PPO, environment execution (step 2) takes up to 98% of execution time; for MuZero [44], a large MARL algorithm with many agents, environment execution is no longer the bottleneck, and 97% of time is spent on policy inference and training (steps 1 and 3).

Fig. 2: Types of RL system designs: (a) function-based (Python functions such as Agent.act(), Environment.step() and Agent.learn() invoked through direct function calls); (b) actor-based (the same components wrapped in actors that exchange messages); (c) dataflow-based (dataflow operators connected through shared memory).
Instead of hardcoding a particular approach for parallelis-
ing and distributing RL computation, a distributed RL system
design should provide the flexibility to change its execution
approach based on the workload. This leads us to the follow-
ing requirements that our design aims to satisfy:
(1) Execution abstraction.
The RL system should have a
flexible execution abstraction that enables it to parallelise and
distribute computation unencumbered by how the algorithm is
defined. While such execution abstractions are commonplace
in compilation-based DL engines [4,7,11,42], they do not
exist in current RL systems (see §2.3).
(2) Distribution strategies.
The RL system should support
different strategies for distributing RL computation across
worker nodes. Users should be permitted to specify multiple
distribution strategies for a single RL algorithm, switching be-
tween them based on the training workload, without having to
change the algorithm implementation. Distribution strategies
should be applicable across classes of RL algorithms.
(3) Acceleration support.
The RL system should exploit
the parallelism of devices, such as multi-core CPUs, GPUs or
other AI accelerators. It should support the full spectrum from
fine-grained vectorised execution to coarse-grained task-parallel execution on CPU cores. Acceleration should not be restricted to policy training and inference only (steps 1 and 3) but should cover the full training loop, including environment execution (step 2) [22].
(4) Algorithm abstraction.
Despite decoupling execution
from algorithm specification, users expect familiar APIs for
defining algorithms around algorithmic components [10,12,
21], such as agents, actors, learners, policies, environments
etc. An RL system should therefore provide standard APIs,
e.g. defining the main training loop in terms of policy inference (step 1), environment execution (step 2) and policy training (step 3).
2.3 Design space of existing RL systems
To understand how existing RL systems support the require-
ments above, we survey the design space of RL systems. Ex-
isting designs can be categorised into three types, function-
based,actor-based and dataflow-based (see Fig. 2):
(a) Function-based RL systems are the most common type. They express RL algorithms typically as Python functions executed directly by workers (see Fig. 2a). The RL training loop is implemented using direct function calls. For example, Acme [17] and SEED RL [9] organise algorithms as actor/learner functions; RLGraph [43] uses a component abstraction, and users register Python callbacks to define functionality. Distributed execution is delegated to backend engines, e.g. TensorFlow [13], PyTorch [38], Ray [33].

Tab. 1: Design space of distributed RL systems (execution / distribution / acceleration / algorithm abstraction):
Function-based: SEED RL [9] (Python functions), Acme [17] (Python components), RLGraph [43] (delegated to the backend); distribution of environments only; acceleration of DNNs; actor/learner/env APIs (RLGraph: agent components).
Actor-based: Ray [33], RLlib [24], MALib [56]; stateless tasks and stateful actors; distribution via local/global schedulers and RPC; acceleration of DNNs; Python functions with the Ray API [33] (MALib: agent/actor/learner/env).
Dataflow-based: Podracer [15] (JIT-compiled by JAX [11]; two hard-coded distribution schemes; accelerates funcs/DNNs/envs; JAX [11] API); RLlib Flow [25] (predefined dataflow operators; sharded dataflow operators with Ray tasks [33]; accelerates DNNs; operator API); WarpDrive [22] (GPU thread blocks; no distribution; CUDA kernels; CUDA API).
Fragmented dataflow graph: MSRL; heterogeneous dataflow fragments; dataflow partitioning; accelerates funcs/operators/DNNs/envs; agent/actor/learner/env APIs.
(b) Actor-based
RL systems execute algorithms through mes-
sage passing between a set of (programming language) actors
deployed on worker nodes (see
Fig.
2b). Ray [33] defines
algorithms as parallel tasks in an actor model. Tasks are dis-
tributed among nodes using remote calls. Defining control
flow in a fully distributed model, however, is burdensome.
To overcome this issue, RLlib [24] adds logically centralised
control on top of Ray. Similarly, MALib [56] offers a high-
level agent/evaluator/learner abstraction for population-based
MARL algorithms (e.g. PSRO [34]) using Ray as a backend.
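To make the actor-based style concrete, the skeleton below uses Ray's public @ray.remote, .remote() and ray.get() calls; the RolloutWorker class and its methods are a toy illustration of the pattern, not code taken from Ray, RLlib or MALib.

import ray

ray.init()

@ray.remote
class RolloutWorker:
    """A stateful actor; Ray schedules it onto some worker node."""
    def __init__(self, seed):
        self.seed = seed

    def rollout(self, weights):
        # In a real system: sync policy weights, step an environment and
        # return a trajectory; here a dummy trajectory stands in.
        return [(self.seed, 0.0)]

workers = [RolloutWorker.remote(i) for i in range(4)]
for _ in range(3):                       # training iterations
    weights = {}                         # placeholder for policy parameters
    futures = [w.rollout.remote(weights) for w in workers]
    trajectories = ray.get(futures)      # gather results of the remote calls
    # ...a learner update over `trajectories` would go here...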
(c) Dataflow-based
RL systems define algorithms through
data-parallel operators, which are mapped to GPU kernels or
distributed tasks (see
Fig.
2c). Operators are typically prede-
fined, and users must choose from a fixed set of APIs. For
example, Podracer [15] uses JAX [11] to compile subrou-
tines as dataflow operators for distributed execution on TPUs.
WarpDrive [22] defines dataflow operators as CUDA kernels,
and executes the complete RL training loop using GPU thread
blocks. RLlib Flow [25] uses distributed dataflow operators,
implemented as iterators with message passing.
Tab.
1considers how well these approaches satisfy our four
requirements from §2.2:
Execution abstraction.
Function- and actor-based systems
execute RL algorithms directly through implemented (Python)
functions and user-defined (programming language) actors,
respectively. This prevents systems from applying optimi-
sations to how RL algorithms are parallelised or distributed.
In contrast, dataflow-based systems execute computation us-
ing pre-defined/compiled operators [15,25] or CUDA ker-
nels [22], which offers optimisation opportunities.
Distribution strategies.
Most function-based systems adopt
a fixed strategy that distributes actors to parallelise the envi-
ronment execution (steps 1 and 2 in Fig. 1) with only a single learner.
Actor-based systems distribute stateful actors and stateless
tasks across multiple workers, often using a greedy scheduler
without domain-specific planning.
Dataflow-based systems typically hardcode how dataflow
operators are assigned to workers. Podracer [15] provides two
strategies: Anakin co-hosts an environment and an agent on
each TPU core; Sebulba distributes the environment, learners
and actors on different TPU cores; and RLlib Flow [25] shards
dataflow operators across distributed Ray actors.
Acceleration support.
Most RL systems only accelerate
DNN policy inference and training (steps 1 and 3). Some dataflow-based systems (e.g. Podracer [15] and WarpDrive [22]) also accelerate other parts of training, requiring custom dataflow implementations: e.g. Podracer can accelerate the environment execution (step 2) on TPU cores; WarpDrive executes the entire training loop (steps 1-3) on a single GPU using CUDA.
Algorithm abstraction.
Function-based RL systems make it
easy to provide intuitive actor/learner/env APIs. Actor-based
RL systems often provide lower-level, harder-to-use APIs for distributed components (e.g. Ray's get/wait/remote [33]); higher-level libraries (e.g. RLlib's PolicyOptimizer API [24]) try to bridge this gap. Dataflow-based systems
come with fixed dataflow operators, requiring users to rewrite
algorithms. For example, JAX [11] provides the vmap operator for vectorisation and pmap for single-program multiple-data (SPMD) parallelism.
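As a concrete example of this operator style, the snippet below vectorises a per-state policy over a batch of environment states with jax.vmap; it is a generic JAX illustration (the toy linear policy is our own), not code from Podracer or RLlib Flow.

import jax
import jax.numpy as jnp

def act(params, state):
    # Per-state policy: a single linear layer followed by a greedy argmax.
    logits = state @ params["w"] + params["b"]
    return jnp.argmax(logits)

# Map over the batch of states only; parameters are shared across the batch.
batched_act = jax.vmap(act, in_axes=(None, 0))

params = {"w": jnp.ones((8, 4)), "b": jnp.zeros(4)}
states = jnp.ones((128, 8))            # 128 environment states, 8 features each
actions = batched_act(params, states)  # shape (128,)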
In summary, there is an opportunity to design a new RL sys-
tem that combines the usability of function-based systems
with the acceleration potential of dataflow-based systems.
Such a design requires a new execution abstraction that retains
the flexibility to apply different acceleration and distribution
strategies on top of an RL algorithm specification.