
Figure 1: (Left) RL Training Systems. In SyncOnRL, actions are computed for all environments, then all environments are stepped. Experience collection is paused during learning. In AsyncOnRL, computing actions, stepping environments, and learning all occur without synchronization. In VER, a variable amount of experience is collected from each environment, enabling synchronous learning without the straggler effect. (Right) Skill policies with navigation are more robust to handoff errors.

this cumulative experience. They are broadly divided into two classes: synchronous (SyncOnRL) and asynchronous (AsyncOnRL). SyncOnRL contains two synchronization points: first the policy is executed for the entire batch, $(o_t \rightarrow a_t)_{b=1}^{B}$¹ (Fig. 1A), then actions are executed in all environments, $(s_t, a_t \rightarrow s_{t+1}, o_{t+1})_{b=1}^{B}$ (Fig. 1B), until $T$ steps have been collected from all $N$ environments. This $(T, N)$-shaped batch of experience is used to update the policy (Fig. 1C). Synchronization reduces throughput as the system spends significant time (sometimes the majority) waiting for the slowest environment to finish. This is the straggler effect [Petrini et al., 2003, Dean and Ghemawat, 2004].

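To make the two synchronization points concrete, here is a minimal toy sketch (stand-in policy and environments, not Habitat's API) of a SyncOnRL collection loop; the batched environment step waits for the slowest environment, which is exactly where the straggler effect appears:

```python
import random, time

N, T = 4, 8                                   # number of environments, rollout length

def policy(batch_obs):                        # sync point 1 (Fig. 1A): act for the whole batch
    return [-o for o in batch_obs]            # stand-in for a batched forward pass

def step_all(actions):                        # sync point 2 (Fig. 1B): step every environment
    durations = [random.uniform(0.001, 0.01) for _ in actions]
    time.sleep(max(durations))                # the batch waits for the slowest env (straggler effect)
    return [random.random() for _ in actions] # stand-in next observations

obs = [random.random() for _ in range(N)]
rollout = []                                  # becomes the (T, N)-shaped batch of experience
for t in range(T):
    actions = policy(obs)
    next_obs = step_all(actions)
    rollout.append(list(zip(obs, actions, next_obs)))
    obs = next_obs
# learning (Fig. 1C) happens here; experience collection is paused until it finishes
print(f"collected a {len(rollout)}x{N} batch of on-policy transitions")
```
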
AsyncOnRL removes these synchronization points, thereby mitigating the straggler effect and improving throughput. Actions are taken as soon as they are computed, $a_t \rightarrow o_{t+1}$ (Fig. 1D), the next action is computed as soon as the observation is ready, $o_t \rightarrow a_t$ (Fig. 1E), and the policy is updated as soon as enough experience has been collected. However, AsyncOnRL systems are not able to ensure that all experience has been collected by only the current policy and thus must consume near-policy data. This reduces sample efficiency [Liu et al., 2020]. Thus, the status quo leaves us with an unpleasant tradeoff: high sample efficiency with low throughput, or high throughput with low sample efficiency.

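For contrast, a minimal toy sketch of the AsyncOnRL pattern using Python threads and queues (again with stand-in policy and environments, not any particular system's implementation); the printed counts show how some transitions in each update were generated by a slightly stale policy, i.e. near-policy data:

```python
import queue, random, threading, time

NUM_ENVS, UPDATE_BATCH = 4, 64

obs_queue = queue.Queue()                         # (env_id, obs) awaiting an action
act_queues = [queue.Queue() for _ in range(NUM_ENVS)]
experience = queue.Queue()                        # transitions awaiting the learner
policy_version = 0                                # bumped on every "update"

def env_worker(env_id):                           # steps its env as soon as an action arrives (Fig. 1D)
    obs = random.random()                         # stand-in for env.reset()
    for _ in range(200):
        obs_queue.put((env_id, obs))
        action, version = act_queues[env_id].get()
        time.sleep(random.uniform(0.001, 0.01))   # stand-in for env.step(action)
        next_obs = random.random()
        experience.put((obs, action, next_obs, version))
        obs = next_obs

def inference_worker():                           # computes an action as soon as an obs is ready (Fig. 1E)
    while True:
        env_id, obs = obs_queue.get()
        act_queues[env_id].put((-obs, policy_version))  # stand-in for policy(obs)

def learner():                                    # updates as soon as enough experience exists
    global policy_version
    batch = []
    while True:
        batch.append(experience.get())
        if len(batch) >= UPDATE_BATCH:
            stale = sum(v != policy_version for *_, v in batch)
            print(f"update {policy_version}: {stale}/{len(batch)} near-policy transitions")
            policy_version += 1
            batch.clear()

workers = [threading.Thread(target=env_worker, args=(i,), daemon=True) for i in range(NUM_ENVS)]
workers += [threading.Thread(target=inference_worker, daemon=True),
            threading.Thread(target=learner, daemon=True)]
for w in workers:
    w.start()
time.sleep(3)                                     # let the toy system run briefly
```
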
In this work, we propose Variable Experience Rollout (VER). VER combines the strengths of, and blurs the line between, SyncOnRL and AsyncOnRL. Like SyncOnRL, VER collects experience with the current policy and then updates it. Like AsyncOnRL, VER does not have synchronization points: it computes next actions, steps environments, and updates the policy as soon as possible. The inspiration for VER comes from two key observations:

1) AsyncOnRL mitigates the straggler effect by implicitly collecting a variable amount of experience from each environment: more from fast-to-simulate environments and less from slow ones.

2) Both SyncOnRL and AsyncOnRL use a fixed rollout length, $T$ steps of experience. Our key insight is that while a fixed rollout length may simplify an implementation, it is not a requirement for RL.

These two key observations naturally lead us to variable experience rollout (VER), i.e., collecting rollouts with a variable number of steps. VER adjusts the rollout length for each environment based on its simulation speed. It explicitly collects more experience from fast-to-simulate environments and less from slow ones (Fig. 1). The result is an RL system that overcomes the straggler effect and maintains sample efficiency by learning from on-policy data.

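A minimal, self-contained sketch of this idea (toy environments modeled only by a per-step simulation cost; not the paper's implementation): the rollout's total size stays fixed at $N \times T$ transitions, but each environment contributes a variable number of steps, so fast environments contribute more and nothing waits on the slowest one:

```python
import heapq, random

N, T = 4, 32                               # environments, average steps per environment
step_cost = [0.01, 0.02, 0.05, 0.10]       # simulated seconds per step (fast -> slow envs)

def collect_ver_rollout():
    budget = N * T                         # total transitions per rollout is fixed...
    per_env_steps = [0] * N                # ...but each env's contribution is variable
    rollout = []
    # priority queue keyed by the (simulated) time each env finishes its current step
    ready_at = [(step_cost[i] * random.uniform(0.5, 1.5), i) for i in range(N)]
    heapq.heapify(ready_at)
    while len(rollout) < budget:
        t, i = heapq.heappop(ready_at)     # take the next env to finish; never wait on stragglers
        rollout.append((i, f"transition {per_env_steps[i]} from env {i}"))
        per_env_steps[i] += 1
        heapq.heappush(ready_at, (t + step_cost[i] * random.uniform(0.5, 1.5), i))
    return rollout, per_env_steps          # the learner consumes all of it, still on-policy

rollout, counts = collect_ver_rollout()
print("steps contributed per environment:", counts)
```
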
VER focuses on efficiently utilizing a single GPU. To enable efficient scaling to multiple GPUs, we combine VER with the decentralized distributed method proposed in [Wijmans et al., 2020].

First, we evaluate VER on well-established embodied navigation tasks using Habitat 1.0 [Savva et al., 2019] on 8 GPUs. VER trains PointGoal navigation [Anderson et al., 2018] 60% faster than Decentralized Distributed PPO (DD-PPO) [Wijmans et al., 2020], the current state-of-the-art for distributed on-policy RL, with the same sample efficiency. For ObjectGoal navigation [Batra et al., 2020b], an active area of research, VER is 100% faster than DD-PPO with (slightly) better sample efficiency.

Next, we evaluate VER on the recently introduced (and significantly more challenging) GeometricGoal rearrangement tasks [Batra et al., 2020a] in Habitat 2.0 [Szot et al., 2021]. In GeoRearrange, a virtual

¹Following standard notation, $s_t$ is the (PO)MDP state, $a_t$ is the action taken, and $o_t$ is the agent's observation.