MSRL: Distributed Reinforcement Learning with Dataflow Fragments
Huanzhou Zhu*
Imperial College London
Bo Zhao*
Imperial College London
Gang Chen
Huawei Technologies Co., Ltd.
Weifeng Chen
Huawei Technologies Co., Ltd.
Yijie Chen
Huawei Technologies Co., Ltd.
Liang Shi
Huawei Technologies Co., Ltd.
Yaodong Yang
Huawei R&D United Kingdom
Peter Pietzuch
Imperial College London
Lei Chen
Hong Kong University of Science and Technology
Abstract
Reinforcement learning (RL) trains many agents, which is
resource-intensive and must scale to large GPU clusters. Dif-
ferent RL training algorithms offer different opportunities for
distributing and parallelising the computation. Yet, current
distributed RL systems tie the definition of RL algorithms
to their distributed execution: they hard-code particular dis-
tribution strategies and only accelerate specific parts of the
computation (e.g. policy network updates) on GPU workers.
Fundamentally, current systems lack abstractions that decou-
ple RL algorithms from their execution.
We describe MindSpore Reinforcement Learning (MSRL),
a distributed RL training system that supports distribution
policies that govern how RL training computation is paral-
lelised and distributed on cluster resources, without requir-
ing changes to the algorithm implementation. MSRL intro-
duces the new abstraction of a fragmented dataflow graph,
which maps Python functions from an RL algorithm’s train-
ing loop to parallel computational fragments. Fragments are
executed on different devices by translating them to low-level
dataflow representations, e.g. computational graphs as sup-
ported by deep learning engines, CUDA implementations or
multi-threaded CPU processes. We show that MSRL sub-
sumes the distribution strategies of existing systems, while
scaling RL training to 64 GPUs.
1 Introduction
Reinforcement learning (RL) solves decision-making prob-
lems in which agents continuously learn policies, typically
represented as deep neural networks (DNNs), on how to act
in an unknown environment [48]. RL has achieved remark-
able outcomes in complex real-world settings: in game play,
AlphaGo [46] has defeated a world champion in the board
game Go; in biology, AlphaFold [20] has predicted three-
dimensional structures for protein folding; in robotics, RL-
based approaches have allowed robots to perform tasks such
as dexterous manipulation without human intervention [14].
*Equal contribution
The advances in RL come with increasing computational
demands for training: AlphaStar trained 12 agents using
384 TPUs and 1,800 CPUs for more than 44 days to achieve
grandmaster level in StarCraft II game play [50]; OpenAI Five
trained to play Dota 2 games for 10 months with 1,536 GPUs
and 172,800 CPUs, defeating 99.4% of human players [3].
Distributed RL systems must therefore scale in terms of agent
numbers, the amount of training data generated by environ-
ments, and the complexity of the trained policies [3,20,50].
Existing designs for distributed RL systems, e.g. SEED
RL [9], ACME [17], Ray [33], RLlib [24] and Podracer [15]
hardcode a single strategy to parallelise and distribute an
RL training algorithm based on its algorithmic structure. For
example, if an RL algorithm is defined as a set of Python
functions for agents, learners and environments, a system
may directly invoke the implementation of an agent's act() function to generate new actions for the environment.
While integrating an RL algorithm’s definition with its
execution strategy simplifies the design and implementation,
it means that systems fail to parallelise and distribute RL
algorithms in the most effective way:
(1) When parallelising the training computation, most RL
systems only accelerate the DNN computation on GPUs or
TPUs [9,17,24] using current deep learning (DL) engines,
such as PyTorch [38], TensorFlow [13] or MindSpore [18].
Other parts of the RL algorithm, such as action generation,
environment execution and trajectory sampling, are left to
be executed as sequential Python functions on worker nodes,
potentially becoming performance bottlenecks.
Recently, researchers have investigated this unrealised ac-
celeration potential: Podracer [15] uses the JAX [11] compi-
lation framework to vectorise the Python implementation of
RL algorithms and parallelise execution on GPUs and TPUs;
WarpDrive [22] can execute the entire RL training loop on a
GPU when implemented in CUDA; and RLlib Flow [25] uses
parallel dataflow operators [54] to express RL training. While
these approaches accelerate a larger portion of RL algorithms,
they force users to define algorithms through custom APIs,
e.g. with a fixed set of supported computational operators.
(2) For the distribution of the training computation, current
RL systems are designed with particular strategies in mind, i.e.
they allocate algorithmic components (e.g. actors and learn-
ers) to workers in a fixed way. For example, SEED RL [9]
assumes that learners perform both policy inference and train-
ing on TPU cores, while actors execute on CPUs; ACME [17]
only distributes actors and maintains a single learner; and
TLeague [47] distributes learners but co-locates environments
with actors on CPU workers. Different RL algorithms de-
ployed on different resources will exhibit different compu-
tational bottlenecks, which means that a single strategy for
distribution cannot be optimal.
We observe that the above challenges originate from the lack
of an abstraction that separates the specification of an RL
algorithm from its execution. Inspired by how compilation-
based DL engines use intermediate representations (IRs) to
express computation [7,52], we want to design a new dis-
tributed RL system that (i) supports the execution of arbitrary
parts of the RL computation on parallel devices, such as GPUs
and CPUs (parallelism); and (ii) offers flexibility in how to deploy parts of an RL algorithm across devices (distribution).
We describe MindSpore Reinforcement Learning (MSRL),
a distributed RL system that decouples the specification of
RL algorithms from their execution. This enables MSRL to
change how it parallelises and distributes different RL algo-
rithms to reduce iteration time and increase the scalability of
training. To achieve this, MSRL’s design makes the following
contributions:
(1) Execution-independent algorithm specification (§3).
In MSRL, users specify an RL algorithm by implementing
its algorithmic components (e.g. agents, actors, learners) as
Python functions in a traditional way. This implementation
makes no assumptions about how the algorithm will be exe-
cuted: all runtime interactions between components are man-
aged by calls to MSRL APIs. A separate deployment configu-
ration defines the devices available for execution.
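To make this separation concrete, the following sketch shows the general shape of such a specification; the class layout, method names and the deployment dictionary are illustrative assumptions, not MSRL's published API.

# Illustrative sketch only: an execution-independent algorithm specification.
# The names below (Actor, Learner, the deployment dict) are hypothetical
# stand-ins, not MSRL's actual interface.

class Actor:
    """Holds a policy and produces actions; knows nothing about devices."""
    def __init__(self, policy):
        self.policy = policy

    def act(self, state):
        # Policy inference: map an environment state to an action.
        return self.policy(state)


class Learner:
    """Improves the policy from trajectories; equally device-agnostic."""
    def __init__(self, policy, lr=0.01):
        self.policy = policy
        self.lr = lr

    def learn(self, trajectory):
        # Policy training: in a real algorithm this would be a DNN update;
        # here a placeholder statistic stands in for the loss.
        rewards = [r for (_, r) in trajectory]
        return sum(rewards) / max(len(rewards), 1)


# Deployment is described separately from the algorithm; only this part
# changes when moving from a laptop to a GPU cluster.
deployment = {"actor_workers": 4, "learner_devices": ["GPU:0"]}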
(2) Fragmented dataflow graph abstraction (§4).
From the
RL algorithm implementation, MSRL constructs a fragmented
dataflow graph (FDG). An FDG is a dataflow representation
of the computation that allows MSRL to map the algorithm
to heterogeneous devices (CPUs and GPUs) at deployment
time. The graph consists of fragments, which are parts of the
RL algorithm that can be parallelised and distributed across
devices.
MSRL uses static analysis on the algorithm implementa-
tion to group its functions into fragments, instances of which
can be deployed on devices. The boundaries between frag-
ments are decided with the help of user-provided partition
annotations in the algorithm implementation. Annotations
specify synchronisation points that require communication
between fragments that are replicated across devices for in-
creased data-parallelism.
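As an illustration of how such annotations might look (the decorator below and its semantics are our own stand-ins, not MSRL's actual annotation syntax), a user could mark the functions that become candidate fragments:

# Hypothetical partition annotation: a marker saying "this function may become
# a separately deployable, replicable dataflow fragment".
def fragment(fn):
    fn.is_fragment = True
    return fn

@fragment
def collect(actor, env, steps):
    # Data-parallel fragment: many replicas can run against separate
    # environment instances.
    state = env.reset()
    trajectory = []
    for _ in range(steps):
        action = actor.act(state)
        state, reward = env.step(action)
        trajectory.append((state, reward))
    return trajectory

@fragment
def update(learner, trajectories):
    # Synchronisation point: replicated collect() instances must send their
    # trajectories to the learner fragment before this can run.
    merged = [step for traj in trajectories for step in traj]
    return learner.learn(merged)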
(3) Fragment execution using DL engines (§5).
For execu-
tion, MSRL deploys hardware-specific implementations of
fragments, permitting instances to run on CPUs and/or GPUs.
MSRL supports different fragment implementations: CPU
implementations use regular Python code, and GPU imple-
mentations are generated as compiled computation graphs of
DL engines (e.g. MindSpore or TensorFlow) if a fragment
is implemented using operators, or they are implemented di-
rectly in CUDA. MSRL fuses multiple data-parallel fragments
for more efficient execution.
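For instance, a fragment expressed with DL-engine operators can be traced into a computational graph that the engine places on a GPU. The snippet below uses TensorFlow's tf.function purely as a generic illustration of that compilation step (the network shape and batch size are arbitrary); it is not MSRL code.

import tensorflow as tf

# A small policy network; in an operator-based fragment, this is the part the
# DL engine compiles and accelerates.
policy = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(4),                     # 4 discrete actions
])

@tf.function  # traced into a computational graph by the engine
def act_fragment(states):
    # Batched policy inference over many environment states at once.
    logits = policy(states)
    return tf.argmax(logits, axis=-1)

actions = act_fragment(tf.random.normal([128, 8]))  # 128 envs, 8-dim states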
(4) Distribution policies (§6).
Since the algorithm implemen-
tation is separated from execution through the FDG, MSRL
can apply different distribution policies to govern how frag-
ments are mapped to devices. MSRL supports a flexible set of
distribution policies, which subsume the hard-coded distribu-
tion strategies of current RL systems: e.g. a distribution policy
can distribute multiple actors to scale the interaction with the
environment (like Acme [17]); distribute actors but move
policy inference to learners (like SEED RL [9]); distribute
both actors and learners (like Sebulba [15]); and represent
the full RL training loop on a GPU (like WarpDrive [22]
and Anakin [15]). As the algorithm configuration, its hyper-
parameters or hardware resources change, MSRL can switch
between distribution policies to maintain high training effi-
ciency without requiring changes to the RL algorithm imple-
mentation.
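A distribution policy can therefore be expressed as plain data that maps fragments to replica counts and devices. The sketch below is a hypothetical configuration format (the fragment names, field names and the selection heuristic are assumptions); it only illustrates how Acme-, SEED RL- and WarpDrive/Anakin-style layouts become different entries rather than different algorithm implementations.

# Hypothetical distribution policies; fragment and field names are illustrative.
POLICIES = {
    # Acme-like: many distributed actors, a single learner, inference on actors.
    "distributed_actors": {
        "collect": {"replicas": 32, "device": "CPU"},
        "update":  {"replicas": 1,  "device": "GPU:0"},
    },
    # SEED RL-like: actors only step environments; inference moves to the learner.
    "central_inference": {
        "env_step": {"replicas": 64, "device": "CPU"},
        "infer":    {"replicas": 1,  "device": "GPU:0"},
        "update":   {"replicas": 1,  "device": "GPU:0"},
    },
    # WarpDrive/Anakin-like: the whole training loop runs on accelerators.
    "single_device_loop": {
        "train_loop": {"replicas": 8, "device": "GPU"},
    },
}

def select_policy(num_envs, env_step_ms):
    # Toy heuristic: switch policies as the workload changes, without touching
    # the algorithm implementation itself.
    if env_step_ms > 5.0:
        return POLICIES["distributed_actors"]
    if num_envs <= 8:
        return POLICIES["single_device_loop"]
    return POLICIES["central_inference"]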
We evaluate MSRL experimentally and show that, by switch-
ing between distribution policies, MSRL improves the train-
ing time of the PPO RL algorithm by up to 2.4× as hyper-parameters, network properties or hardware resources change.
Its FDG abstraction offers more flexible execution with-
out compromising training performance: MSRL scales to
64 GPUs and outperforms the Ray distributed RL system [33]
by up to 3×.
2 Distributed Reinforcement Learning
Next we give background on reinforcement learning (§2.1),
discuss the requirements for distributed RL training (§2.2),
and survey the design space of existing RL systems (§2.3).
2.1 Reinforcement learning
Reinforcement learning (RL) solves a sequential decision-
making problem in which an agent operates in an unknown
environment. The agent’s goal is to learn a policy that max-
imises the cumulative reward based on the feedback from the
environment (see Fig. 1): (1) policy inference: an agent performs an inference computation on a policy, which maps the environment's state to an agent's action; (2) environment execution: the environment executes the actions, generating trajectories, which are sequences of ⟨state, reward⟩ pairs produced by the policy; (3) policy training: finally, the agent improves the policy by training it based on the received reward.
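In code, the loop amounts to the following minimal single-agent sketch; policy, train and env are placeholders following the common reset()/step() convention rather than any particular system's API.

def training_loop(env, policy, train, episodes=10, horizon=100):
    for _ in range(episodes):
        state = env.reset()
        trajectory = []
        for _ in range(horizon):
            action = policy(state)            # (1) policy inference
            state, reward = env.step(action)  # (2) environment execution
            trajectory.append((state, reward))
        train(trajectory)                     # (3) policy training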
RL algorithms that train a policy fall into three categories:
(1) value-based methods (e.g. DQN [32]) use a deep neural
network (DNN) to approximate the value function, i.e. a map-
2
Environments
Agent 1 Agent 2 Agent n
Policy 1 Policy 2 Policy n
Action a1Action a2Action an
Joint action
<a2,a3,..,an>
State1!
reward1
Message
State2!
reward2
Staten!
rewardn
Step
1
Environment 1
Environment 2
Environment 3
Policy
inference
Step
2
Environment
execution
Policy
training
Step
3
Fig. 1: RL training loop with multiple agents
ping of the expected return to perform a given action in a state.
Agents learn the values of actions and select actions based on
these estimated values; (2) policy-based methods (e.g. Rein-
force [51]) directly learn a parametrised policy, approximated
by a DNN, for selecting actions without consulting a value
function. Agents use batched trajectories to train the policy
by updating the parameters in the direction of the gradient
that maximises the reward; and (3) actor–critic methods (e.g.
PPO [45], DDPG [26], A2C [31]) combine the two by learn-
ing a policy to select actions (actor) and a value function to
evaluate selected actions (critic).
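For policy-based methods, the parameter update follows the standard policy-gradient direction: writing $R(\tau)$ for the return of a trajectory $\tau$ sampled from the policy $\pi_\theta$, the ascended gradient is

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\, R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big],$$

which in practice is estimated from batched trajectories, as in category (2) above.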
Multi-agent reinforcement learning (MARL) has multiple
agents, each optimising its own cumulative reward when in-
teracting with the environment or other agents (see
Fig.
1).
A3C [31] executes agents asynchronously on separate environ-
ment instances; MAPPO [53] extends PPO to a multi-agent
setting in which agents share a global parametrised policy.
During training, each agent maintains a central critic that
takes joint information (e.g. state/action pairs) as input, while
learning a policy that only depends on local state.
2.2 Requirements for distributed RL systems
The benefits of RL come at the expense of its computational
cost [37]. Complex settings and environments require the
exploration of large spaces of actions, states and DNN param-
eters, and these spaces grow exponentially with the number
of agents [35]. Therefore, RL systems must be scalable by
exploiting both the parallelism of GPU acceleration and a
large number of distributed worker nodes.
There exists a range of proposals for how RL systems can parallelise and distribute RL training: for single-agent RL, environment execution (step 2 in Fig. 1) and policy inference and training (steps 1 and 3) can be distributed across workers [15,17,33],
potentially using GPUs [9,15]; for MARL, agents can be
distributed [24,25,33,43] and exchange training state using
communication libraries [28,36]. In general, environment
instances can execute in parallel [15,31] or be distributed [8]
to speed up execution.
No single one of the above strategies for parallelising and
distributing RL training is optimal, both in terms of achieving
the lowest iteration time and having the best scalability, for
all possible RL algorithms. For different algorithms, training
workloads and hardware resources, the training bottlenecks
shift: e.g. our measurements show that, for PPO, environment execution (step 2) takes up to 98% of execution time; for MuZero [44], a large MARL algorithm with many agents, environment execution is no longer the bottleneck, and 97% of time is spent on policy inference and training (steps 1 and 3).

Fig. 2: Types of RL system designs: (a) function-based (Python functions such as Agent.act(), Environment.step() and Agent.learn() invoked through direct function calls); (b) actor-based (the same components wrapped in actors that exchange messages); (c) dataflow-based (dataflow operators connected through shared memory).
Instead of hardcoding a particular approach for parallelis-
ing and distributing RL computation, a distributed RL system
design should provide the flexibility to change its execution
approach based on the workload. This leads us to the follow-
ing requirements that our design aims to satisfy:
(1) Execution abstraction.
The RL system should have a
flexible execution abstraction that enables it to parallelise and
distribute computation unencumbered by how the algorithm is
defined. While such execution abstractions are commonplace
in compilation-based DL engines [4,7,11,42], they do not
exist in current RL systems (see §2.3).
(2) Distribution strategies.
The RL system should support
different strategies for distributing RL computation across
worker nodes. Users should be permitted to specify multiple
distribution strategies for a single RL algorithm, switching be-
tween them based on the training workload, without having to
change the algorithm implementation. Distribution strategies
should be applicable across classes of RL algorithms.
(3) Acceleration support.
The RL system should exploit
the parallelism of devices, such as multi-core CPUs, GPUs or
other AI accelerators. It should support the full spectrum from
fine-grained vectorised execution to coarse-grained task-parallel execution on CPU cores. Acceleration should not be restricted to policy training and inference only (steps 1 and 3) but should cover the full training loop, including environment execution (step 2) [22].
(4) Algorithm abstraction.
Despite decoupling execution
from algorithm specification, users expect familiar APIs for
defining algorithms around algorithmic components [10,12,
21], such as agents, actors, learners, policies, environments
etc. An RL system should therefore provide standard APIs,
e.g. defining the main training loop in terms of policy inference (step 1), environment execution (step 2) and policy training (step 3).
2.3 Design space of existing RL systems
To understand how existing RL systems support the require-
ments above, we survey the design space of RL systems. Ex-
isting designs can be categorised into three types, function-
based,actor-based and dataflow-based (see Fig. 2):
(a) Function-based RL systems are the most common type. They express RL algorithms typically as Python functions executed directly by workers (see Fig. 2a). The RL training loop is implemented using direct function calls. For example, Acme [17] and SEED RL [9] organise algorithms as actor/learner functions; RLGraph [43] uses a component abstraction, and users register Python callbacks to define functionality. Distributed execution is delegated to backend engines, e.g. TensorFlow [13], PyTorch [38], Ray [33].

Tab. 1: Design space of distributed RL systems (execution / distribution / acceleration / algorithm abstraction):
Function-based: SEED RL [9] (Python functions), Acme [17] (Python components), RLGraph [43] (delegated to the backend); distribution of environments only; acceleration of DNNs; actor/learner/env APIs (RLGraph: agent components).
Actor-based: Ray [33], RLlib [24], MALib [56]; stateless tasks and stateful actors; distribution via local/global schedulers and RPC; acceleration of DNNs; Python functions with the Ray API [33] (MALib: agent/actor/learner/env).
Dataflow-based: Podracer [15] (JIT-compiled by JAX [11]; two hard-coded distribution schemes; accelerates funcs/DNNs/envs; JAX [11] API); RLlib Flow [25] (predefined dataflow operators; sharded dataflow operators with Ray tasks [33]; accelerates DNNs; operator API); WarpDrive [22] (GPU thread blocks; no distribution; CUDA kernels; CUDA API).
Fragmented dataflow graph: MSRL; heterogeneous dataflow fragments; dataflow partitioning; accelerates funcs/operators/DNNs/envs; agent/actor/learner/env APIs.
(b) Actor-based
RL systems execute algorithms through mes-
sage passing between a set of (programming language) actors
deployed on worker nodes (see
Fig.
2b). Ray [33] defines
algorithms as parallel tasks in an actor model. Tasks are dis-
tributed among nodes using remote calls. Defining control
flow in a fully distributed model, however, is burdensome.
To overcome this issue, RLlib [24] adds logically centralised
control on top of Ray. Similarly, MALib [56] offers a high-
level agent/evaluator/learner abstraction for population-based
MARL algorithms (e.g. PSRO [34]) using Ray as a backend.
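To make the actor-based style concrete, the skeleton below uses Ray's public @ray.remote, .remote() and ray.get() calls; the RolloutWorker class and its methods are a toy illustration of the pattern, not code taken from Ray, RLlib or MALib.

import ray

ray.init()

@ray.remote
class RolloutWorker:
    """A stateful actor; Ray schedules it onto some worker node."""
    def __init__(self, seed):
        self.seed = seed

    def rollout(self, weights):
        # In a real system: sync policy weights, step an environment and
        # return a trajectory; here a dummy trajectory stands in.
        return [(self.seed, 0.0)]

workers = [RolloutWorker.remote(i) for i in range(4)]
for _ in range(3):                       # training iterations
    weights = {}                         # placeholder for policy parameters
    futures = [w.rollout.remote(weights) for w in workers]
    trajectories = ray.get(futures)      # gather results of the remote calls
    # ...a learner update over `trajectories` would go here...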
(c) Dataflow-based
RL systems define algorithms through
data-parallel operators, which are mapped to GPU kernels or
distributed tasks (see
Fig.
2c). Operators are typically prede-
fined, and users must choose from a fixed set of APIs. For
example, Podracer [15] uses JAX [11] to compile subrou-
tines as dataflow operators for distributed execution on TPUs.
WarpDrive [22] defines dataflow operators as CUDA kernels,
and executes the complete RL training loop using GPU thread
blocks. RLlib Flow [25] uses distributed dataflow operators,
implemented as iterators with message passing.
Tab.
1considers how well these approaches satisfy our four
requirements from §2.2:
Execution abstraction.
Function- and actor-based systems
execute RL algorithms directly through implemented (Python)
functions and user-defined (programming language) actors,
respectively. This prevents systems from applying optimi-
sations to how RL algorithms are parallelised or distributed.
In contrast, dataflow-based systems execute computation us-
ing pre-defined/compiled operators [15,25] or CUDA ker-
nels [22], which offers optimisation opportunities.
Distribution strategies.
Most function-based systems adopt
a fixed strategy that distributes actors to parallelise the envi-
ronment execution (steps 1 and 2 in Fig. 1) with only a single learner.
Actor-based systems distribute stateful actors and stateless
tasks across multiple workers, often using a greedy scheduler
without domain-specific planning.
Dataflow-based systems typically hardcode how dataflow
operators are assigned to workers. Podracer [15] provides two
strategies: Anakin co-hosts an environment and an agent on
each TPU core; Sebulba distributes the environment, learners
and actors on different TPU cores; and RLlib Flow [25] shards
dataflow operators across distributed Ray actors.
Acceleration support.
Most RL systems only accelerate
DNN policy inference and training (steps 1 and 3). Some dataflow-based systems (e.g. Podracer [15] and WarpDrive [22]) also accelerate other parts of training, requiring custom dataflow implementations: e.g. Podracer can accelerate the environment execution (step 2) on TPU cores; WarpDrive executes the entire training loop (steps 1-3) on a single GPU using CUDA.
Algorithm abstraction.
Function-based RL systems make it
easy to provide intuitive actor/learner/env APIs. Actor-based
RL systems often provide lower-level, harder-to-use APIs for distributed components (e.g. Ray's get/wait/remote [33]); higher-level libraries (e.g. RLlib's PolicyOptimizer API [24]) try to bridge this gap. Dataflow-based systems
come with fixed dataflow operators, requiring users to rewrite
algorithms. For example, JAX [11] provides the vmap operator for vectorisation and pmap for single-program multiple-data (SPMD) parallelism.
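As a concrete example of this operator style, the snippet below vectorises a per-state policy over a batch of environment states with jax.vmap; it is a generic JAX illustration (the toy linear policy is our own), not code from Podracer or RLlib Flow.

import jax
import jax.numpy as jnp

def act(params, state):
    # Per-state policy: a single linear layer followed by a greedy argmax.
    logits = state @ params["w"] + params["b"]
    return jnp.argmax(logits)

# Map over the batch of states only; parameters are shared across the batch.
batched_act = jax.vmap(act, in_axes=(None, 0))

params = {"w": jnp.ones((8, 4)), "b": jnp.zeros(4)}
states = jnp.ones((128, 8))            # 128 environment states, 8 features each
actions = batched_act(params, states)  # shape (128,)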
In summary, there is an opportunity to design a new RL sys-
tem that combines the usability of function-based systems
with the acceleration potential of dataflow-based systems.
Such a design requires a new execution abstraction that retains
the flexibility to apply different acceleration and distribution
strategies on top of an RL algorithm specification.