VER: Scaling On-Policy RL Leads to the Emergence of Navigation in Embodied Rearrangement

Erik Wijmans¹,²  Irfan Essa¹,³  Dhruv Batra²,¹
¹Georgia Institute of Technology  ²Meta AI  ³Google Atlanta
{etw,irfan,dbatra}@gatech.edu
Abstract
We present Variable Experience Rollout (VER), a technique for efficiently scaling batched on-policy reinforcement learning in heterogeneous environments (where different environments take vastly different times to generate rollouts) to many GPUs residing on, potentially, many machines. VER combines the strengths of and blurs the line between synchronous and asynchronous on-policy RL methods (SyncOnRL and AsyncOnRL, respectively). Specifically, it learns from on-policy experience (like SyncOnRL) and has no synchronization points (like AsyncOnRL), enabling high throughput.

We find that VER leads to significant and consistent speed-ups across a broad range of embodied navigation and mobile manipulation tasks in photorealistic 3D simulation environments. Specifically, for PointGoal navigation and ObjectGoal navigation in Habitat 1.0, VER is 60-100% faster (1.6-2x speedup) than DD-PPO, the current state of the art for distributed SyncOnRL, with similar sample efficiency. For mobile manipulation tasks (open fridge/cabinet, pick/place objects) in Habitat 2.0, VER is 150% faster (2.5x speedup) on 1 GPU and 170% faster (2.7x speedup) on 8 GPUs than DD-PPO. Compared to SampleFactory (the current state-of-the-art AsyncOnRL), VER matches its speed on 1 GPU, and is 70% faster (1.7x speedup) on 8 GPUs with better sample efficiency.

We leverage these speed-ups to train chained skills for GeometricGoal rearrangement tasks in the Home Assistant Benchmark (HAB). We find a surprising emergence of navigation in skills that do not ostensibly require any navigation. Specifically, the Pick skill involves a robot picking an object from a table. During training the robot was always spawned close to the table and never needed to navigate. However, we find that if base movement is part of the action space, the robot learns to navigate and then pick an object in new environments with 50% success, demonstrating surprisingly high out-of-distribution generalization.
Code: github.com/facebookresearch/habitat-lab
1 Introduction
Scaling matters. Progress towards building embodied intelligent agents that are capable of performing goal-driven tasks has been driven, in part, by training large neural networks in photo-realistic 3D environments with deep reinforcement learning (RL) for (up to) billions of steps of experience [Wijmans et al., 2020, Maksymets et al., 2021, Mezghani et al., 2021, Ramakrishnan et al., 2021, Miki et al., 2022]. To enable this scale, RL systems must be able to efficiently utilize the available resources (e.g. GPUs) and scale to multiple machines, all while maintaining sample-efficient learning.
One promising class of techniques to achieve this scale is batched on-policy RL. These methods collect experience from many (N) environments simultaneously using the policy and update it with this cumulative experience.
Figure 1: (Left) RL Training Systems. In SyncOnRL, actions are computed for all environments, then all environments are stepped. Experience collection is paused during learning. In AsyncOnRL, computing actions, stepping environments, and learning all occur without synchronization. In VER, a variable amount of experience is collected from each environment, enabling synchronous learning without the straggler effect. (Right) Skill policies with navigation are more robust to handoff errors.
They are broadly divided into two classes: synchronous (SyncOnRL) and asynchronous (AsyncOnRL). SyncOnRL contains two synchronization points: first the policy is executed for the entire batch, $(o_t \to a_t)_{b=1}^{B}$¹ (Fig. 1A), then actions are executed in all environments, $(s_t, a_t \to s_{t+1}, o_{t+1})_{b=1}^{B}$ (Fig. 1B), until T steps have been collected from all N environments. This (T, N)-shaped batch of experience is used to update the policy (Fig. 1C). Synchronization reduces throughput as the system spends significant (sometimes the most) time waiting for the slowest environment to finish. This is the straggler effect [Petrini et al., 2003, Dean and Ghemawat, 2004].

AsyncOnRL removes these synchronization points, thereby mitigating the straggler effect and improving throughput. Actions are taken as soon as they are computed, $a_t \to o_{t+1}$ (Fig. 1D), the next action is computed as soon as the observation is ready, $o_t \to a_t$ (Fig. 1E), and the policy is updated as soon as enough experience is collected. However, AsyncOnRL systems are not able to ensure that all experience has been collected by only the current policy and thus must consume near-policy data. This reduces sample efficiency [Liu et al., 2020]. Thus, the status quo leaves us with an unpleasant tradeoff: high sample-efficiency with low throughput, or high throughput with low sample-efficiency.
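To make the two synchronization points concrete, here is a minimal Python sketch of a SyncOnRL collection loop. The vectorized-environment API (envs.reset, envs.step) and policy.act are illustrative placeholders, not Habitat's actual interface.

import torch

def sync_onrl_rollout(envs, policy, T):
    # Illustrative SyncOnRL rollout: every one of the N environments contributes exactly T steps.
    obs = envs.reset()                                   # batched observations, one per environment
    rollout = []
    for _ in range(T):
        with torch.no_grad():
            actions = policy.act(obs)                    # sync point 1 (Fig. 1A): actions for ALL N envs
        next_obs, rewards, dones, _ = envs.step(actions) # sync point 2 (Fig. 1B): wait for the slowest
                                                         # environment in the batch (the straggler)
        rollout.append((obs, actions, rewards, dones))
        obs = next_obs
    return rollout                                       # a (T, N)-shaped batch; learning (Fig. 1C) runs
                                                         # next, and collection pauses while it does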
In this work, we propose Variable Experience Rollout (VER). VER combines the strengths of and blurs the line between SyncOnRL and AsyncOnRL. Like SyncOnRL, VER collects experience with the current policy and then updates it. Like AsyncOnRL, VER does not have synchronization points: it computes next actions, steps environments, and updates the policy as soon as possible. The inspiration for VER comes from two key observations:

1) AsyncOnRL mitigates the straggler effect by implicitly collecting a variable amount of experience from each environment: more from fast-to-simulate environments and less from slow ones.

2) Both SyncOnRL and AsyncOnRL use a fixed rollout length, T steps of experience. Our key insight is that while a fixed rollout length may simplify an implementation, it is not a requirement for RL.

These two key observations naturally lead us to variable experience rollout (VER), i.e. collecting rollouts with a variable number of steps. VER adjusts the rollout length for each environment based on its simulation speed. It explicitly collects more experience from fast-to-simulate environments and less from slow ones (Fig. 1). The result is an RL system that overcomes the straggler effect and maintains sample-efficiency by learning from on-policy data.
VER focuses on efficiently utilizing a single GPU. To enable efficient scaling to multiple GPUs, we combine VER with the decentralized distributed method proposed in [Wijmans et al., 2020].
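For reference, the decentralized recipe amounts to each GPU-worker collecting its own rollouts and averaging gradients before every optimizer step. The sketch below shows this generic pattern with torch.distributed; it is not the released DD-PPO code, which additionally preempts straggling workers during experience collection.

import torch
import torch.distributed as dist

def decentralized_update(model, optimizer, loss):
    # Assumes torch.distributed has already been initialized (one process per GPU).
    # Each worker computes gradients on its own locally collected rollouts,
    # then gradients are averaged across workers with an all-reduce.
    optimizer.zero_grad()
    loss.backward()
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum gradients across all workers
            p.grad /= world_size                           # average
    optimizer.step()                                       # every worker takes the same step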
First, we evaluate VER on well-established embodied navigation tasks using Habitat 1.0 [Savva et al., 2019] on 8 GPUs. VER trains PointGoal navigation [Anderson et al., 2018] 60% faster than Decentralized Distributed PPO (DD-PPO) [Wijmans et al., 2020], the current state-of-the-art for distributed on-policy RL, with the same sample efficiency. For ObjectGoal navigation [Batra et al., 2020b], an active area of research, VER is 100% faster than DD-PPO with (slightly) better sample efficiency.

Next, we evaluate VER on the recently introduced (and significantly more challenging) GeometricGoal rearrangement tasks [Batra et al., 2020a] in Habitat 2.0 [Szot et al., 2021]. In GeoRearrange, a virtual robot is spawned in a new environment and asked to rearrange a set of objects from their initial to desired coordinates. These environments have highly variable simulation time (physics simulation time increases if the robot bumps into something) and require GPU-acceleration (for photo-realistic rendering), limiting the number of environments that can be run in parallel.

¹ Following standard notation, $s_t$ is the (PO)MDP state, $a_t$ is the action taken, and $o_t$ is the agent observation.
Figure 2: VER system architecture. Environment workers receive actions to simulate and return the result of that environment step (EnvStep). Inference workers receive batches of experience from environment workers. They return the new action to take to environment workers and write the experience into GPU shared memory for learning.
On 1 GPU, VER is 150% faster (2.5x speedup) than DD-PPO with the same sample efficiency. VER is as fast as SampleFactory [Petrenko et al., 2020], the state-of-the-art AsyncOnRL, with the same sample efficiency. VER matching AsyncOnRL in pure throughput is a surprisingly strong result: AsyncOnRL never stops collecting experience and should, in theory, be a strict upper bound on performance. VER is able to match AsyncOnRL for environments that heavily utilize the GPU for rendering, like Habitat. In AsyncOnRL, learning, inference, and rendering contend for the GPU, which reduces throughput. In VER, inference and rendering contend for the GPU while learning does not.

On 8 GPUs, VER achieves better scaling than DD-PPO, achieving a 6.7x speed-up (vs. 6x for DD-PPO) due to lower variance in experience collection time between GPU-workers. Due to this efficient multi-GPU scaling, VER is 70% faster (1.7x speedup) than SampleFactory on 8 GPUs and has better sample efficiency as it learns from on-policy data.
Finally, we leverage these SysML contributions to study open research questions posed in prior work. Specifically, we train RL policies for mobile manipulation skills (Navigate, Pick, Place, etc.) and chain them via a task planner. Szot et al. [2021] called this approach TP-SRL and identified a critical 'handoff problem': downstream skills are set up for failure by small errors made by upstream skills (e.g. the Pick skill failing because the navigation skill stopped the robot a bit too far from the object).

We demonstrate a number of surprising findings when TP-SRL is scaled via VER. Most importantly, we find the emergence of navigation when skills that do not ostensibly require navigation (e.g. Pick) are trained with navigation actions enabled. In principle, Pick and Place policies do not need to navigate during training since the objects are always within arm's reach, but in practice they learn to navigate to recover from their mistakes, and this results in strong out-of-distribution test-time generalization. Specifically, TP-SRL without a navigation skill achieves 50% success on NavPick and 20% success on a NavPickNavPlace task simply because the Pick and Place skills have learned to navigate (sometimes across the room!). TP-SRL with a Navigate skill performs even better: 90% on NavPickNavPlace and 32% on 5 successive NavPickNavPlaces (called Tidy House in Szot et al. [2021]), which are +32% and +30% absolute improvements over Szot et al. [2021], respectively. Prepare Groceries and Set Table, which both require interaction with articulated receptacles (fridge, drawer), remain open problems (5% and 0% success, respectively) and are the next frontiers.
2 VER: Variable Experience Rollout

The key challenge that any batched on-policy RL technique needs to address is the variability of simulation time for the environments in a batch. There are two primary sources of this variability: action-level and episode-level. The amount of time needed to simulate an action within an environment varies depending on the specific action, the state of the robot, and the environment (e.g. simulating the robot navigating on a clear floor is much faster than simulating the robot's arm colliding with objects). The amount of time needed to simulate an entire episode also varies environment to environment, irrespective of action-level variability (e.g. rendering images takes longer for visually complex scenes, and simulating physics takes longer for scenes with a large number of objects).
Figure 3: (A) VER collects a variable amount of experience from each environment. The length of each step represents the time taken to collect it. (B) VER mini-batch. The solid bars denote episode boundaries. The steps selected for the first mini-batch have a dashed border. (C) The PackedSequence data format represents a set of sequences with variable length in a linear buffer such that all elements from each timestep are next to one another in memory.
2.1 Action-Level Straggler Mitigation

We mitigate the action-level straggler effect by applying the experience collection method of AsyncOnRL to SyncOnRL. We represent this visually in Fig. 2 and describe it in text below.

Environment workers receive the next action and step the environment (EnvStep), e.g. $(s_t, a_t) \to (s_{t+1}, o_{t+1}, r_t)$. They write the outputs of the environment (observations, reward, etc.) into pre-allocated CPU shared memory for consumption by inference workers.
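A sketch of an environment worker in this design is shown below. The queue wiring, shared-memory names, and single-environment env API are illustrative assumptions for exposition, not the released habitat-lab implementation.

import numpy as np
from multiprocessing import shared_memory

def environment_worker(env, env_idx, shm_name, obs_shape, action_queue, request_queue):
    # Step a single environment and write each EnvStep result into a row of a
    # pre-allocated shared CPU memory buffer that inference workers read from.
    shm = shared_memory.SharedMemory(name=shm_name)
    obs_buffer = np.ndarray(obs_shape, dtype=np.float32, buffer=shm.buf)  # (num_envs, obs_dim)
    obs = env.reset()
    obs_buffer[env_idx] = obs
    request_queue.put(env_idx)            # request an action for this environment
    while True:
        action = action_queue.get()       # next action, chosen by an inference worker
        obs, reward, done, info = env.step(action)
        if done:
            obs = env.reset()
        obs_buffer[env_idx] = obs         # EnvStep output -> shared CPU memory
        # (reward/done are written into similar shared buffers; omitted for brevity)
        request_queue.put(env_idx)        # notify inference workers this env is ready again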
Inference workers receive batches of steps of experience from environment workers. They perform inference with the current policy to select the next action and send it to the environment worker using pre-allocated CPU shared memory. After inference, they store experience for learning in shared GPU memory. Inference workers use dynamic batching and perform inference on all outstanding inference requests instead of waiting for a fixed number of requests to arrive². This allows us to leverage the benefits of batching without introducing synchronization points between environment workers.
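The dynamic-batching loop an inference worker runs can be sketched as follows. The queue and buffer plumbing is again illustrative; the real system also writes the gathered experience into shared GPU memory for learning, which is elided here.

import queue
import torch

def inference_worker(policy, request_queue, action_queues, obs_buffer, max_batch=64):
    # Perform inference on all outstanding requests rather than a fixed batch size.
    while True:
        ready = [request_queue.get()]              # block until at least one environment is ready
        while len(ready) < max_batch:              # then drain whatever else is already waiting
            try:
                ready.append(request_queue.get_nowait())
            except queue.Empty:
                break
        obs = torch.as_tensor(obs_buffer[ready])   # gather observations for this variable-size batch
        with torch.no_grad():
            actions = policy.act(obs)              # one batched forward pass for all ready envs
        for env_idx, action in zip(ready, actions):
            action_queues[env_idx].put(action)     # send each environment its next action
        # In practice a minimum batch size is also enforced (see footnote 2); omitted here.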
This experience collection technique is similar to that of HTS-RL [Liu et al., 2020] (SyncOnRL) and SampleFactory [Petrenko et al., 2020] (AsyncOnRL). Unlike both, we do not overlap experience collection with learning. This has various system benefits, including reducing GPU memory usage and reducing GPU driver contention. More details are available in Appendix A.
2.2 Environment-Level Straggler Mitigation

In both SyncOnRL and AsyncOnRL, the data used for learning consists of N rollouts of T steps of experience each, a (T, N)-shaped batch. In SyncOnRL these N rollouts are all collected with the current policy; this leads to the environment-level straggler effect. AsyncOnRL mitigates this by relaxing the constraint that experience must be strictly on-policy, and thereby implicitly changes the experience collection rate for each environment.
Variable Experience Rollout (VER). We instead relax the constraint that we must use N rollouts of equal length T. Specifically, VER collects T×N steps of experience from N environments without a constraint on how many steps of experience are collected from each environment. This explicitly varies the experience collection rate for each environment: in effect, collecting more experience from environments that are fast to simulate. Consider the 4 environments shown in Fig. 3A. The length of each step represents the wall-clock time taken to collect it; some steps are fast, some are slow. VER collects more experience from environment 0, as it is the fastest to step, and less from environment 1, the slowest.
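The resulting collection loop can be sketched in a few lines. The ready_env and step_env helpers are hypothetical stand-ins for the worker machinery of Section 2.1; the point is that the budget is T×N total steps with no per-environment quota.

def variable_experience_rollout(num_envs, T, ready_env, step_env):
    # Collect T * num_envs steps in total; whichever environment finishes simulating
    # first gets stepped again, so fast environments contribute more steps than slow ones.
    rollouts = [[] for _ in range(num_envs)]
    steps_collected = 0
    while steps_collected < T * num_envs:
        i = ready_env()                  # hypothetical: index of the next environment whose
                                         # previous EnvStep has finished
        transition = step_env(i)         # hypothetical: compute an action and step environment i
        rollouts[i].append(transition)
        steps_collected += 1
    return rollouts                      # variable-length rollouts, e.g. env 0 may hold > T steps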
Learning mini-batch creation. VER is designed with recurrent policies in mind because memory is key in long-range and partially observable tasks like the HAB. When training recurrent policies, we must create mini-batches of experience with sequences for back-propagation-through-time. Normally, B mini-batches are constructed by splitting the N environments' experience into B (T, N/B)-sized mini-batches. A similar procedure would result in mini-batches of different sizes with VER. This would harm optimization because learning rate and optimization mini-batch size are intertwined, and automatically adjusting the learning rate is an open question [Goyal et al., 2017, You et al., 2020].
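The PackedSequence format referenced in Fig. 3C matches the layout exposed by PyTorch's RNN utilities. The small runnable example below (with made-up rollout lengths and feature sizes) shows how variable-length rollouts can be packed and consumed by a recurrent network in a single call.

import torch
from torch.nn.utils.rnn import pack_sequence

# Toy rollouts of unequal length (as VER produces), each with a 4-dim feature per step.
rollouts = [torch.randn(length, 4) for length in (5, 3, 2)]  # sorted by length, descending
packed = pack_sequence(rollouts)        # timestep-major layout: all elements of timestep t
                                        # are contiguous in one linear buffer
gru = torch.nn.GRU(input_size=4, hidden_size=8)
output, h_n = gru(packed)               # one forward pass over every valid step of every sequence
print(output.data.shape)                # torch.Size([10, 8]) -- the 5 + 3 + 2 packed steps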
² In practice we introduce both a minimum and maximum number of requests to prevent under-utilization of compute and over-utilization of memory.