the performance of a learning system is affected by the wireless connection or by resource limitations. Moreover, prior works neither systematically study how the computations of a learning system can be distributed between local and remote computers nor suggest how to achieve an effective distribution.
In this paper, we develop two vision-based tasks using a
robotic arm and a mobile robot, and propose a real-time RL
system called the Remote-Local Distributed (ReLoD) system.
Similar to the work of Yuan and Mahmood (2022), ReLoD parallelizes the computations of RL algorithms to maintain small action-cycle times and reduce the computational overhead of real-time learning. Unlike the prior work, however, it is designed to utilize both a local and a remote computer. ReLoD supports three modes of distribution: Remote-Only, which allocates all computations to the remote computer; Local-Only, which allocates all computations to the local computer; and Remote-Local, which carefully distributes the computations between the two computers in a specific way.
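As a rough illustration, the three modes can be viewed as different placements of a learning system's major computations. The sketch below is a minimal, hypothetical example: the component names and the particular split shown for Remote-Local are assumptions made for exposition, not ReLoD's actual code or configuration.

```python
# Hypothetical sketch of the three distribution modes as component placements.
# Component names and the Remote-Local split are illustrative assumptions,
# not ReLoD's actual implementation.
from enum import Enum


class Mode(Enum):
    REMOTE_ONLY = "remote_only"    # all computations on the remote workstation
    LOCAL_ONLY = "local_only"      # all computations on the resource-limited local computer
    REMOTE_LOCAL = "remote_local"  # computations split between the two computers


def component_placement(mode: Mode) -> dict:
    """Map each major computation of the learning system to a machine."""
    if mode is Mode.REMOTE_ONLY:
        return {"image_encoding": "remote", "action_sampling": "remote",
                "replay_buffer": "remote", "gradient_updates": "remote"}
    if mode is Mode.LOCAL_ONLY:
        return {"image_encoding": "local", "action_sampling": "local",
                "replay_buffer": "local", "gradient_updates": "local"}
    # One plausible Remote-Local split: keep latency-critical inference on the
    # local computer and offload expensive learning updates to the workstation.
    return {"image_encoding": "local", "action_sampling": "local",
            "replay_buffer": "remote", "gradient_updates": "remote"}


if __name__ == "__main__":
    for mode in Mode:
        print(f"{mode.value}: {component_placement(mode)}")
```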
Our results show that the performance of SAC on a teth-
ered resource-limited computer drops substantially compared
to its performance on a tethered powerful workstation. Sur-
prisingly, when all computations of SAC are deployed on a
wirelessly connected powerful workstation, the performance
does not improve notably, which contradicts our intuition
since this mode fully utilizes the workstation. On the other
hand, SAC’s Remote-Local mode consistently improves its
performance by a large margin on both tasks, which indicates
that a careful distribution of computations is essential to
utilize a powerful remote workstation. However, the Remote-Local mode benefits only computationally expensive and sample-efficient methods like SAC, since the relatively simple learning algorithm PPO learns similar policies in all three modes. We also notice that the highest average return attained by PPO is about one-third of that attained by SAC, which indicates that SAC is more effective in complex robotic control tasks.
Our system in the Local-Only mode can achieve a perfor-
mance that is on par with a system well-tuned for a single
computer (Yuan & Mahmood 2022), though the latter learns slightly faster overall. This property makes our system suitable for conventional RL studies as well.
II. RELATED WORK
A system comparable to ours is SenseAct, which provides
a computational framework for robotic learning experiments
to be reproducible in different locations and under diverse
conditions (Mahmood et al. 2018b). Although SenseAct
enables the systematic design of robotic tasks for RL, it
does not address how to distribute computations of a real-
time learning agent between two computers, and the original
work does not contain vision-based tasks. We use the guiding principles of SenseAct to design our vision-based tasks and to systematically study the effectiveness of different distributions of a learning agent's computations.
Krishnan et al. (2019) introduced an open-source simulator and a Gym environment for quadrotors. Since these aerial robots must accomplish their tasks with limited onboard energy, and since running current computationally intensive RL methods on the onboard hardware is prohibitive, they carefully designed policies to fit the power and computational resources available onboard. However, they focused on sim-to-real techniques for learning, making their approach unsuited for real-time learning.
Nair et al. (2015) proposed a distributed learning architecture called the GORILA framework, which uses multiple actors and learners to collect data in parallel and accelerate training in simulation on clusters of CPUs and GPUs. GORILA is conceptually akin to the
DistBelief (Dean et al. 2012) architecture. In contrast to
the GORILA framework, our system focuses primarily on
how best to distribute the computations of a learning system
between a resource-limited local computer and a powerful
remote computer to enable effective real-time learning. In
addition, the GORILA framework is customized to Deep Q-
Networks (DQN), while our system supports two different
policy gradient algorithms using a common agent interface.
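As a hypothetical sketch of what such a common agent interface could look like, an abstract base class might specify the per-step and learning operations that each algorithm implements; all names below are assumptions rather than ReLoD's actual API.

```python
# Hypothetical sketch of a common agent interface; not ReLoD's actual API.
from abc import ABC, abstractmethod

import numpy as np


class Agent(ABC):
    """Minimal contract that every learning algorithm must satisfy."""

    @abstractmethod
    def compute_action(self, observation: np.ndarray) -> np.ndarray:
        """Return an action for the current observation (runs every step)."""

    @abstractmethod
    def observe(self, observation, action, reward, next_observation, done) -> None:
        """Record a transition for later learning updates."""

    @abstractmethod
    def update(self) -> None:
        """Perform one learning update; may run on a different computer."""


class RandomAgent(Agent):
    """Trivial implementation used only to show the interface."""

    def __init__(self, action_dim: int) -> None:
        self.action_dim = action_dim

    def compute_action(self, observation: np.ndarray) -> np.ndarray:
        return np.random.uniform(-1.0, 1.0, self.action_dim)

    def observe(self, observation, action, reward, next_observation, done) -> None:
        pass  # a real agent would store the transition in a buffer

    def update(self) -> None:
        pass  # a real agent would run SAC or PPO updates here


if __name__ == "__main__":
    agent = RandomAgent(action_dim=2)
    obs = np.zeros(3)
    action = agent.compute_action(obs)
    agent.observe(obs, action, reward=0.0, next_observation=obs, done=False)
    agent.update()
    print(action)
```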
Lambert et al. (2019) used a model-based reinforcement
learning approach for high-frequency control of a small
quadcopter. Their proposed system is similar to our Remote-
Only mode. A recent paper by Smith et al. (2022) demonstrated real-time learning of a walking gait from scratch on a Unitree A1 quadrupedal robot on various terrains. Their real-time synchronous training of SAC on a laptop is similar to our Local-Only mode. The effectiveness of both of these approaches on vision-based tasks is untested.
Bloesch et al. (2021) used a distributed version of Maximum a Posteriori Policy Optimization (MPO) (Abdolmaleki et al. 2018) to learn a vision-based walking policy for Robotis OP3 bipedal robots. The robot's onboard computer samples actions and synchronizes the policy's neural network weights with a remote learning process at the start of each episode. Haarnoja et al. (2019) also proposed a similar asynchronous learning system, designed to learn a stable gait with SAC on the Minitaur robot (Kenneally et al. 2016). These tasks do not use images. Although their proposed systems are similar to our Remote-Local mode, these two papers aim at solving specific tasks rather than systematically comparing different distributions of a learning agent's computations between a resource-limited computer and a powerful computer. In addition, their systems are tailored to specific tasks and algorithms and are not publicly available, while our system is open-source, task-agnostic, and compatible with multiple algorithms.
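To make this synchronization pattern concrete, the following is a minimal, single-process sketch of episode-boundary weight synchronization between an onboard actor and a remote learner. The class names and the toy linear policy are hypothetical; in the cited systems the two components run as separate processes on separate machines and exchange weights over a network, which this simplified sketch omits.

```python
# Hypothetical, simplified sketch of episode-boundary weight synchronization
# between an onboard actor and a remote learner; not the cited systems' code.
import numpy as np


class RemoteLearner:
    """Stand-in for the remote learning process."""

    def __init__(self, obs_dim: int) -> None:
        self.weights = np.zeros(obs_dim)

    def update(self) -> None:
        # Placeholder for gradient-based updates (e.g., MPO or SAC) on buffered data.
        self.weights += 0.01 * np.random.randn(*self.weights.shape)

    def latest_weights(self) -> np.ndarray:
        return self.weights.copy()


class OnboardActor:
    """Stand-in for the robot's onboard computer."""

    def __init__(self, obs_dim: int) -> None:
        self.weights = np.zeros(obs_dim)

    def sync(self, weights: np.ndarray) -> None:
        self.weights = weights  # called only at episode boundaries

    def act(self, observation: np.ndarray) -> float:
        return float(np.tanh(self.weights @ observation))  # toy linear policy


if __name__ == "__main__":
    obs_dim, episodes, steps = 4, 3, 5
    learner, actor = RemoteLearner(obs_dim), OnboardActor(obs_dim)
    for episode in range(episodes):
        actor.sync(learner.latest_weights())  # synchronize at episode start
        for _ in range(steps):
            action = actor.act(np.random.randn(obs_dim))
            # ...send the action to the robot and log the transition...
        learner.update()  # learning continues off-robot between syncs
        print(f"episode {episode}: synced weights {actor.weights.round(3)}")
```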
III. BACKGROUND
Reinforcement learning is a setting where an agent learns to control through trial-and-error interactions with its environment. The agent-environment interaction is modeled as a Markov Decision Process (MDP), in which the agent interacts with its environment at discrete timesteps. At the current timestep $t$, the agent is in state $S_t \in \mathcal{S}$, where it takes an action $A_t \in \mathcal{A}$ using a probability distribution $\pi$