
Figure 1: (Left) RL Training Systems. In SyncOnRL, actions are computed for all environments, then all environments are stepped. Experience collection is paused during learning. In AsyncOnRL, computing actions, stepping environments, and learning all occur without synchronization. In VER, a variable amount of experience is collected from each environment, enabling synchronous learning without the straggler effect. (Right) Skill policies with navigation are more robust to handoff errors.

this cumulative experience. They are broadly divided into two classes: synchronous (SyncOnRL) and asynchronous (AsyncOnRL). SyncOnRL contains two synchronization points: first the policy is executed for the entire batch, $(o_t \rightarrow a_t)_{b=1}^{B}$¹ (Fig. 1A), then actions are executed in all environments, $(s_t, a_t \rightarrow s_{t+1}, o_{t+1})_{b=1}^{B}$ (Fig. 1B), until $T$ steps have been collected from all $N$ environments. This $(T, N)$-shaped batch of experience is used to update the policy (Fig. 1C). Synchronization reduces throughput as the system spends significant time (sometimes the majority) waiting for the slowest environment to finish. This is the straggler effect [Petrini et al., 2003, Dean and Ghemawat, 2004].

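To make the two synchronization points concrete, here is a minimal toy sketch (stand-in policy and environments, not Habitat's API) of a SyncOnRL collection loop; the batched environment step waits for the slowest environment, which is exactly where the straggler effect appears:

```python
import random, time

N, T = 4, 8                                   # number of environments, rollout length

def policy(batch_obs):                        # sync point 1 (Fig. 1A): act for the whole batch
    return [-o for o in batch_obs]            # stand-in for a batched forward pass

def step_all(actions):                        # sync point 2 (Fig. 1B): step every environment
    durations = [random.uniform(0.001, 0.01) for _ in actions]
    time.sleep(max(durations))                # the batch waits for the slowest env (straggler effect)
    return [random.random() for _ in actions] # stand-in next observations

obs = [random.random() for _ in range(N)]
rollout = []                                  # becomes the (T, N)-shaped batch of experience
for t in range(T):
    actions = policy(obs)
    next_obs = step_all(actions)
    rollout.append(list(zip(obs, actions, next_obs)))
    obs = next_obs
# learning (Fig. 1C) happens here; experience collection is paused until it finishes
print(f"collected a {len(rollout)}x{N} batch of on-policy transitions")
```
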
AsyncOnRL removes these synchronization points, thereby mitigating the straggler effect and improving throughput. Actions are taken as soon as they are computed, $a_t \rightarrow o_{t+1}$ (Fig. 1D), the next action is computed as soon as the observation is ready, $o_t \rightarrow a_t$ (Fig. 1E), and the policy is updated as soon as enough experience has been collected. However, AsyncOnRL systems are not able to ensure that all experience has been collected by only the current policy and thus must consume near-policy data. This reduces sample efficiency [Liu et al., 2020]. Thus, the status quo leaves us with an unpleasant tradeoff: high sample efficiency with low throughput, or high throughput with low sample efficiency.

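For contrast, a minimal toy sketch of the AsyncOnRL pattern using Python threads and queues (again with stand-in policy and environments, not any particular system's implementation); the printed counts show how some transitions in each update were generated by a slightly stale policy, i.e. near-policy data:

```python
import queue, random, threading, time

NUM_ENVS, UPDATE_BATCH = 4, 64

obs_queue = queue.Queue()                         # (env_id, obs) awaiting an action
act_queues = [queue.Queue() for _ in range(NUM_ENVS)]
experience = queue.Queue()                        # transitions awaiting the learner
policy_version = 0                                # bumped on every "update"

def env_worker(env_id):                           # steps its env as soon as an action arrives (Fig. 1D)
    obs = random.random()                         # stand-in for env.reset()
    for _ in range(200):
        obs_queue.put((env_id, obs))
        action, version = act_queues[env_id].get()
        time.sleep(random.uniform(0.001, 0.01))   # stand-in for env.step(action)
        next_obs = random.random()
        experience.put((obs, action, next_obs, version))
        obs = next_obs

def inference_worker():                           # computes an action as soon as an obs is ready (Fig. 1E)
    while True:
        env_id, obs = obs_queue.get()
        act_queues[env_id].put((-obs, policy_version))  # stand-in for policy(obs)

def learner():                                    # updates as soon as enough experience exists
    global policy_version
    batch = []
    while True:
        batch.append(experience.get())
        if len(batch) >= UPDATE_BATCH:
            stale = sum(v != policy_version for *_, v in batch)
            print(f"update {policy_version}: {stale}/{len(batch)} near-policy transitions")
            policy_version += 1
            batch.clear()

workers = [threading.Thread(target=env_worker, args=(i,), daemon=True) for i in range(NUM_ENVS)]
workers += [threading.Thread(target=inference_worker, daemon=True),
            threading.Thread(target=learner, daemon=True)]
for w in workers:
    w.start()
time.sleep(3)                                     # let the toy system run briefly
```
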
In this work, we propose Variable Experience Rollout (VER). VER combines the strengths of, and blurs the line between, SyncOnRL and AsyncOnRL. Like SyncOnRL, VER collects experience with the current policy and then updates it. Like AsyncOnRL, VER does not have synchronization points: it computes next actions, steps environments, and updates the policy as soon as possible. The inspiration for VER comes from two key observations:

1) AsyncOnRL mitigates the straggler effect by implicitly collecting a variable amount of experience from each environment: more from fast-to-simulate environments and less from slow ones.

2) Both SyncOnRL and AsyncOnRL use a fixed rollout length, $T$ steps of experience. Our key insight is that while a fixed rollout length may simplify an implementation, it is not a requirement for RL.

These two key observations naturally lead us to variable experience rollout (VER), i.e., collecting rollouts with a variable number of steps. VER adjusts the rollout length for each environment based on its simulation speed. It explicitly collects more experience from fast-to-simulate environments and less from slow ones (Fig. 1). The result is an RL system that overcomes the straggler effect and maintains sample efficiency by learning from on-policy data.

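A minimal, self-contained sketch of this idea (toy environments modeled only by a per-step simulation cost; not the paper's implementation): the rollout's total size stays fixed at $N \times T$ transitions, but each environment contributes a variable number of steps, so fast environments contribute more and nothing waits on the slowest one:

```python
import heapq, random

N, T = 4, 32                               # environments, average steps per environment
step_cost = [0.01, 0.02, 0.05, 0.10]       # simulated seconds per step (fast -> slow envs)

def collect_ver_rollout():
    budget = N * T                         # total transitions per rollout is fixed...
    per_env_steps = [0] * N                # ...but each env's contribution is variable
    rollout = []
    # priority queue keyed by the (simulated) time each env finishes its current step
    ready_at = [(step_cost[i] * random.uniform(0.5, 1.5), i) for i in range(N)]
    heapq.heapify(ready_at)
    while len(rollout) < budget:
        t, i = heapq.heappop(ready_at)     # take the next env to finish; never wait on stragglers
        rollout.append((i, f"transition {per_env_steps[i]} from env {i}"))
        per_env_steps[i] += 1
        heapq.heappush(ready_at, (t + step_cost[i] * random.uniform(0.5, 1.5), i))
    return rollout, per_env_steps          # the learner consumes all of it, still on-policy

rollout, counts = collect_ver_rollout()
print("steps contributed per environment:", counts)
```
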
VER focuses on efficiently utilizing a single GPU. To enable efficient scaling to multiple GPUs, we combine VER with the decentralized distributed method proposed in [Wijmans et al., 2020].

First, we evaluate VER on well-established embodied navigation tasks using Habitat 1.0 [Savva et al., 2019] on 8 GPUs. VER trains PointGoal navigation [Anderson et al., 2018] 60% faster than Decentralized Distributed PPO (DD-PPO) [Wijmans et al., 2020], the current state-of-the-art for distributed on-policy RL, with the same sample efficiency. For ObjectGoal navigation [Batra et al., 2020b], an active area of research, VER is 100% faster than DD-PPO with (slightly) better sample efficiency.

Next, we evaluate VER on the recently introduced (and significantly more challenging) GeometricGoal rearrangement tasks [Batra et al., 2020a] in Habitat 2.0 [Szot et al., 2021]. In GeoRearrange, a virtual

¹Following standard notation, $s_t$ is the (PO)MDP state, $a_t$ is the action taken, and $o_t$ is the agent's observation.