
(2) For the distribution of the training computation, current
RL systems are designed with particular strategies in mind, i.e.
they allocate algorithmic components (e.g. actors and learn-
ers) to workers in a fixed way. For example, SEED RL [9]
assumes that learners perform both policy inference and train-
ing on TPU cores, while actors execute on CPUs; Acme [17]
only distributes actors and maintains a single learner; and
TLeague [47] distributes learners but co-locates environments
with actors on CPU workers. Different RL algorithms deployed
on different hardware resources exhibit different computational
bottlenecks, which means that no single distribution strategy
can be optimal.
We observe that the above challenges originate from the lack
of an abstraction that separates the specification of an RL
algorithm from its execution. Inspired by how compilation-
based DL engines use intermediate representations (IRs) to
express computation [7,52], we want to design a new dis-
tributed RL system that (i) supports the execution of arbitrary
parts of the RL computation on parallel devices, such as GPUs
and CPUs (parallelism); and (ii) offers flexibility in how to
deploy parts of an RL algorithm across devices (distribution).
We describe MindSpore Reinforcement Learning (MSRL),
a distributed RL system that decouples the specification of
RL algorithms from their execution. This enables MSRL to
change how it parallelises and distributes different RL algo-
rithms to reduce iteration time and increase the scalability of
training. To achieve this, MSRL’s design makes the following
contributions:
(1) Execution-independent algorithm specification (§3).
In MSRL, users specify an RL algorithm by implementing
its algorithmic components (e.g. agents, actors, learners) as
ordinary Python functions. This implementation
makes no assumptions about how the algorithm will be exe-
cuted: all runtime interactions between components are man-
aged by calls to MSRL APIs. A separate deployment configu-
ration defines the devices available for execution.
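As an illustration, the sketch below shows how such a specification might look: the actor and learner are plain Python classes whose only runtime interactions go through a handle, and the device placement lives in a separate configuration. All class, method and configuration names here (PPOActor, policy_inference, deploy_config, etc.) are assumptions made for this example, not MSRL's actual API.

    # Illustrative sketch only: names below are assumptions, not MSRL's API.
    class PPOActor:
        """Collects trajectories; contains no device placement logic."""
        def __init__(self, msrl):
            self.msrl = msrl                            # handle for all runtime interactions

        def act(self, state):
            action = self.msrl.policy_inference(state)  # interaction via an MSRL API call
            next_state, reward = self.msrl.env_step(action)
            return next_state, reward

    class PPOLearner:
        """Updates the policy from collected trajectories."""
        def __init__(self, msrl):
            self.msrl = msrl

        def learn(self, trajectory):
            return self.msrl.policy_train(trajectory)   # no assumption about where this runs

    # A separate deployment configuration names the available devices;
    # the algorithm code above never refers to it.
    deploy_config = {
        "workers": ["node-0", "node-1"],
        "devices": {"node-0": ["GPU:0"], "node-1": ["CPU"]},
    }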
(2) Fragmented dataflow graph abstraction (§4).
From the
RL algorithm implementation, MSRL constructs a fragmented
dataflow graph (FDG). An FDG is a dataflow representation
of the computation that allows MSRL to map the algorithm
to heterogeneous devices (CPUs and GPUs) at deployment
time. The graph consists of fragments, which are parts of the
RL algorithm that can be parallelised and distributed across
devices.
MSRL uses static analysis on the algorithm implementa-
tion to group its functions into fragments, instances of which
can be deployed on devices. The boundaries between frag-
ments are decided with the help of user-provided partition
annotations in the algorithm implementation. Annotations
specify synchronisation points that require communication
between fragments that are replicated across devices for in-
creased data-parallelism.
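A minimal sketch of such an annotation is shown below; the decorator name and the stubbed collective are assumptions used for illustration, not MSRL's real annotation syntax.

    # Hypothetical annotation marking a fragment boundary; not MSRL's syntax.
    def partition(func):
        """Marks a synchronisation point between replicated fragments."""
        func.__fragment_boundary__ = True
        return func

    def all_reduce_stub(replica_grads):
        # stand-in for a collective that averages gradients across replicas
        return [sum(g) / len(g) for g in zip(*replica_grads)]

    class Learner:
        def local_gradients(self, batch):
            # purely local computation: static analysis keeps it inside one fragment
            return [0.0 for _ in batch]                 # placeholder gradient

        @partition
        def synchronise(self, replica_grads):
            # crossing this annotated boundary implies communication between the
            # data-parallel replicas of the learner fragment
            return all_reduce_stub(replica_grads)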
(3) Fragment execution using DL engines (§5).
For execu-
tion, MSRL deploys hardware-specific implementations of
fragments, permitting instances to run on CPUs and/or GPUs.
MSRL supports different fragment implementations: CPU
implementations use regular Python code, and GPU imple-
mentations are generated as compiled computation graphs of
DL engines (e.g. MindSpore or TensorFlow) if a fragment
is implemented using operators, or they are implemented di-
rectly in CUDA. MSRL fuses multiple data-parallel fragments
for more efficient execution.
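The following sketch conveys the idea of selecting a hardware-specific fragment implementation at deployment time; the registry and both implementations are simplified stand-ins, not MSRL's actual mechanism.

    # Simplified stand-in for per-device fragment implementations.
    FRAGMENT_IMPLS = {}

    def register(name, device):
        def wrap(func):
            FRAGMENT_IMPLS[(name, device)] = func
            return func
        return wrap

    @register("policy_train", "CPU")
    def policy_train_cpu(batch):
        # regular Python code executed by the interpreter
        return sum(batch) / len(batch)

    @register("policy_train", "GPU")
    def policy_train_gpu(batch):
        # in the real system this would be a compiled computation graph built
        # from DL-engine operators (or a CUDA kernel); here it is only a stub
        return sum(batch) / len(batch)

    def run_fragment(name, device, batch):
        # the deployment decides which implementation to launch
        return FRAGMENT_IMPLS[(name, device)](batch)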
(4) Distribution policies (§6).
Since the algorithm implemen-
tation is separated from execution through the FDG, MSRL
can apply different distribution policies to govern how frag-
ments are mapped to devices. MSRL supports a flexible set of
distribution policies, which subsume the hard-coded distribu-
tion strategies of current RL systems: e.g. a distribution policy
can distribute multiple actors to scale the interaction with the
environment (like Acme [17]); distribute actors but move
policy inference to learners (like SEED RL [9]); distribute
both actors and learners (like Sebulba [15]); and represent
the full RL training loop on a GPU (like WarpDrive [22]
and Anakin [15]). As the algorithm configuration, its hyper-
parameters or the hardware resources change, MSRL can switch
between distribution policies to maintain high training effi-
ciency without requiring changes to the RL algorithm imple-
mentation.
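For illustration, a distribution policy can be thought of as a mapping from fragment types to replica counts and devices, as in the hypothetical sketch below; the structure and names are assumed and do not reflect MSRL's concrete policy format.

    # Hypothetical distribution policies; structure and names are assumed.
    ACME_LIKE = {           # many distributed actors, a single learner
        "actor":   {"replicas": 32, "device": "CPU"},
        "learner": {"replicas": 1,  "device": "GPU:0"},
    }

    SEED_LIKE = {           # actors on CPUs, policy inference moved to the learner side
        "actor":            {"replicas": 32, "device": "CPU"},
        "policy_inference": {"replicas": 1,  "device": "GPU:0"},
        "learner":          {"replicas": 1,  "device": "GPU:0"},
    }

    SINGLE_GPU_LOOP = {     # full training loop on one GPU (WarpDrive/Anakin style)
        "actor":       {"replicas": 1, "device": "GPU:0"},
        "environment": {"replicas": 1, "device": "GPU:0"},
        "learner":     {"replicas": 1, "device": "GPU:0"},
    }

    def place(fragments, policy):
        """Return (fragment, replica index, device) placements under a policy."""
        return [(name, i, spec["device"])
                for name, spec in policy.items() if name in fragments
                for i in range(spec["replicas"])]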
We evaluate MSRL experimentally and show that, by switch-
ing between distribution policies, MSRL improves the train-
ing time of the PPO RL algorithm by up to 2.4× as hyper-
parameters, network properties or hardware resources change.
Its FDG abstraction offers more flexible execution with-
out compromising training performance: MSRL scales to
64 GPUs and outperforms the Ray distributed RL system [33]
by up to 3×.
2 Distributed Reinforcement Learning
Next we give background on reinforcement learning (§2.1),
discuss the requirements for distributed RL training (§2.2),
and survey the design space of existing RL systems (§2.3).
2.1 Reinforcement learning
Reinforcement learning (RL) solves a sequential decision-
making problem in which an agent operates in an unknown
environment. The agent’s goal is to learn a policy that max-
imises the cumulative reward based on the feedback from the
environment (see Fig. 1): ① policy inference: an agent performs an inference
computation on a policy, which maps the environment’s state to an agent’s
action; ② environment execution: the environment executes the actions,
generating trajectories, which are sequences of ⟨state, reward⟩ pairs produced
by the policy; ③ policy training: finally, the agent improves the policy by
training it based on the received reward.
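The three steps correspond to the loop sketched below; the Policy and Environment classes are toy stand-ins used only to make the control flow concrete, not part of any RL system.

    # Toy stand-ins for the three-step RL loop described above.
    import random

    class Policy:
        def infer(self, state):                 # step 1: policy inference
            return random.choice([0, 1])
        def train(self, trajectory):            # step 3: policy training
            pass                                # update parameters from the rewards

    class Environment:
        def step(self, action):                 # step 2: environment execution
            return random.random(), float(action)   # (next state, reward)

    policy, env, state = Policy(), Environment(), 0.0
    for _ in range(100):
        trajectory = []
        for _ in range(10):                     # a trajectory of <state, reward> pairs
            action = policy.infer(state)
            state, reward = env.step(action)
            trajectory.append((state, reward))
        policy.train(trajectory)                # improve the policy from the feedback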
RL algorithms that train a policy fall into three categories:
(1) value-based methods (e.g. DQN [32]) use a deep neural
network (DNN) to approximate the value function, i.e. a map-