Iterative Vision-and-Language Navigation
Jacob Krantz1* Shurjo Banerjee2* Wang Zhu3
Jason Corso2 Peter Anderson4 Stefan Lee1 Jesse Thomason3
1Oregon State University 2University of Michigan 3University of Southern California 4Google Research
Abstract
We present Iterative Vision-and-Language Naviga-
tion (IVLN), a paradigm for evaluating language-guided
agents navigating in a persistent environment over time. Ex-
isting Vision-and-Language Navigation (VLN) benchmarks
erase the agent’s memory at the beginning of every episode,
testing the ability to perform cold-start navigation with
no prior information. However, deployed robots occupy
the same environment for long periods of time. The IVLN
paradigm addresses this disparity by training and evaluating
VLN agents that maintain memory across tours of scenes
that consist of up to 100 ordered instruction-following Room-
to-Room (R2R) episodes, each defined by an individual lan-
guage instruction and a target path. We present discrete
and continuous Iterative Room-to-Room (IR2R) benchmarks
comprising about 400 tours each in 80 indoor scenes. We
find that extending the implicit memory of high-performing
transformer VLN agents is not sufficient for IVLN, but agents
that build maps can benefit from environment persistence,
motivating a renewed focus on map-building agents in VLN.
1. Introduction
Robots and virtual agents that persistently operate in hu-
man spaces like homes should improve over time. For ex-
ample, a smart vacuum told to "clean the living room, which is down the hall past the guest bedroom" should learn about both the living room and the guest bedroom. Likewise, agents should be able to associate references in past instructions, such as "guest bedroom", with spatial and visual information from the environment to understand future instructions.
Most work on language-guided, embodied agents per-
forming navigation [3,25] or household tasks [38] is
episodic in nature—agent memory is erased before issu-
ing each new instruction. In contrast, physical robots build
maps [12,43,49] iteratively from visual observations [32,39] as an explicit form of long-term memory. Agents trained to perform language-guided navigation in simulation and then deployed on physical robots [2] fail to take advantage of the mapping-based strategies that facilitate robot navigation.

*Equal contributions. Correspondence: krantzja@oregonstate.edu
We propose Iterative Vision-and-Language Navigation
(IVLN), in which an agent follows an ordered sequence of
language instructions that conduct a tour of an indoor space.
Each tour is composed of individual episodes of language
instructions with target paths. Agents can utilize memory
to better understand future tour instructions. After just 10
episodes an agent has seen on average over 50% of the target
path associated with the next language instruction in a tour.
While performing an IVLN tour, agents iteratively explore
the environment, meaning regions irrelevant to task instruc-
tions need not ever be visited. By conditioning exploration
on language, IVLN enables rich semantic representations,
e.g., unusual, novel, and scene-specific referents grounded
during one episode can be reasoned about later.
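As an illustration of how such an overlap statistic can be computed, the following is a minimal sketch assuming target paths are sequences of discrete viewpoint IDs (as in the R2R navigation graph); it is not the paper's released evaluation code:

```python
# Sketch: fraction of the next target path already observed earlier in a tour.
# Assumes paths are sequences of discrete viewpoint IDs (as in the R2R graph).
def observed_fraction(next_path, previously_visited):
    """Fraction of viewpoints on the next target path that were already seen."""
    seen = sum(1 for viewpoint in next_path if viewpoint in previously_visited)
    return seen / len(next_path)

def overlap_over_tour(tour_paths):
    """Accumulate visited viewpoints over a tour and report per-episode overlap."""
    visited, fractions = set(), []
    for path in tour_paths:
        fractions.append(observed_fraction(path, visited))
        visited.update(path)
    return fractions
```

Averaging these per-episode fractions across tours at a fixed episode index is one way to arrive at a statistic like the 50%-after-10-episodes figure above.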
We explore both a discrete VLN setting based on Room-
to-Room [3] episodes and navigation graphs (IR2R) and a
continuous simulation VLN-CE [25] setting (IR2R-CE). The
markedly different action and visual observation spaces of
these settings may require different memory mechanisms.
In the discrete setting, agents move on graph edges and
observe clear, well-framed images. For IR2R, we extend a
state-of-the-art transformer agent [11] that learns an implicit
memory based on path history when interpreting instructions.
In the continuous setting, agents take motion actions while
observing noisy images of a 3D environment reconstructed
from discrete panorama images. For IR2R-CE, we propose
an agent that builds and interprets an explicit semantic map.
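To make the map-building idea concrete, here is only a rough sketch of accumulating a persistent top-down semantic grid across a tour; the projection helper pixels_to_world_xy, the grid constants, and the update rule are assumptions for illustration, not the agent proposed in this paper:

```python
# Rough sketch of a persistent top-down semantic map updated over a tour.
# Grid constants and `pixels_to_world_xy` are illustrative assumptions only.
import numpy as np

GRID_SIZE = 512      # cells per side of the allocentric map
CELL_M = 0.05        # 5 cm per cell
NUM_CLASSES = 40     # number of semantic categories

def make_map():
    return np.zeros((NUM_CLASSES, GRID_SIZE, GRID_SIZE), dtype=np.float32)

def update_map(semantic_map, semantics, depth, pose, pixels_to_world_xy):
    """Project per-pixel class labels into the persistent top-down grid.

    semantics: (H, W) integer class labels; depth: (H, W) depth image;
    pose: agent pose in the world frame; pixels_to_world_xy: assumed helper
    returning (H, W, 2) world-frame xy coordinates in meters for each pixel.
    """
    xy = pixels_to_world_xy(depth, pose)
    cols = np.clip((xy[..., 0] / CELL_M).astype(int) + GRID_SIZE // 2, 0, GRID_SIZE - 1)
    rows = np.clip((xy[..., 1] / CELL_M).astype(int) + GRID_SIZE // 2, 0, GRID_SIZE - 1)
    semantic_map[semantics, rows, cols] = 1.0  # mark each observed class at its cell
    return semantic_map
```

In the iterative setting, such a map would simply never be reset between episodes of the same tour, so observations made while following one instruction remain available for later ones.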
In short, we define Iterative Vision-and-Language Navi-
gation (IVLN), a paradigm for persistent VLN, and release
IR2R and IR2R-CE to study discrete and continuous navi-
gation agents in the IVLN setting. We create initial agents
for both benchmarks, including explicit mapping and im-
plicit memory models for continuous navigation. Please see
jacobkrantz.github.io/ivln for code and more details.
2. Related Work
Instruction-guided navigation is a growing area in
grounded language understanding with many task settings
[Figure 1 graphic: panels for Episode 1/82, Oracle 1/82, Episode 2/82, Episode 6/82, and Episode 82/82, each showing the tour map, the observed environment map, the language instruction, and the agent's initial observation.]
Figure 1. In IVLN, agents are given language instructions corresponding to a sequence of paths that form a tour around a 3D scene. After
attempting to follow each instruction, the agent is teleoperated by an oracle to the correct goal location, then to the start of the next path
where the next instruction is issued. Unlike conventional episodic paradigms, the agent retains memory between episodes.
developed [3,9,26,33,38,42]. Among these, the Vision-
and-Language Navigation (VLN) task setting based on the
Room-to-Room (R2R) dataset [3] has become a popular
benchmark. An agent in VLN must follow a natural language
instruction by navigating along the described path in a never-
before-seen environment. By design, this paradigm does not
consider how persistent agents operating over time might
leverage prior experiences to better follow future instructions
within the same environment. In contrast, accumulating prior
experience within an environment is a staple of robotic de-
ployment – e.g. building semantic maps for localization and
reasoning [35,41]. Our IVLN paradigm is designed to better
align VLN with a realistic robotic deployment scenario.
Benchmarks for VLN in Discrete Settings VLN tasks fre-
quently involve inferring agent actions in a rendered 2D or
3D scene in response to language commands [8,28]. Agent
control is typically limited to changing position and orienta-
tion by discrete amounts or to predefined possible options.
Advances in camera technology have enabled language-
guided navigation in photorealistic indoor scenes [3,7] and
outdoor city spaces [9]. In “Room-to-Room” (R2R) [3]
VLN, an agent interprets a single English instruction to navi-
gate along a short, indoor path. In a survey of VLN modeling
methods, environment exploration and memorization were
identified as frequent strategies for aligning a language in-
struction to a desired goal location in a scene [16]. However,
R2R evaluates policies on single instructions, limiting the
incentive to perform efficient, effective memorization or
mapping. To study longer horizon planning, researchers
have extended R2R by concatenating language-aligned paths
and their associated instructions [21,51], tasking agents not just with arriving at the goal but with closely following the described path. Others have collected longer paths with instructions in three languages [26] or given as a cooperative conversation [42]. With IR2R tours, we present the longest such paths, with substantial overlap in areas covered over time, challenging researchers to utilize information from prior instructions and experience in the scene.
Benchmarks for VLN in Continuous Settings Moving a
physical robot, such as a quad-copter [5] or a toy car [4],
in response to language instructions requires contending
with the real, continuous world. Existing work has trans-
ferred policies for discrete VLN to the physical world by
manually curating a discrete representation of the world
map as a navigation graph [2] with limited success. VLN-
CE [25] re-introduces Room-to-Room [3] with a continuous,
3D reconstruction of indoor MatterPort3D scenes. However,
VLN-CE evaluates agents on single instructions and asso-
ciated paths in an i.i.d. fashion. In contrast, our IR2R-CE
benchmark incentivizes policies that respect environment
persistence found in the real world. Beyond removing the
abstractions of discrete VLN (VLN-CE), IR2R-CE situates
agents in a scene for long time horizons with many language
instructions; a logical next step towards learning useful world
representations through visual and linguistic information.
Pre-Exploration in VLN Some approaches in VLN have
embraced a setting where agents can fully explore the en-
vironment before following an instruction, either explicitly
through pretraining (e.g. [40,44,50]) or through beam-search
at inference time (e.g. [14,30]). Pre-exploration methods
outperform standard VLN approaches and serve as a natural
upper bound to IVLN where an agent has fully explored the
environment. In contrast, IVLN studies how environment in-
formation can be collected while performing the task (rather
than a priori) and how this partial, opportunistic information
can be leveraged to perform better over time.
Persistent Environments in Embodied AI Zooming out,
visual navigation tasks in embodied AI have seen significant
progress, fueled by increased scale and quality of 3D scene
datasets (e.g. [7,34]) and high-performance simulation plat-
forms (e.g. [23,31,37,48]). A focus on real-world complexity
has emerged. One recognition is that agents act in, and inter-
act with, persistent environments. Tasks such as multi-object
navigation [45] and visual room rearrangement [46] involve
solving sequences of subtasks that, when approached inde-
pendently, cannot be solved optimally. Instead, reasoning
over persistent semantic and spatial information is required.
The proposed IVLN paradigm enriches this scene perception
problem with natural language and enables the association
of persistent visual semantics with linguistic information.
3. Iterative Vision-and-Language Navigation
We facilitate the study of agents given sequential naviga-
tion instructions in natural language. We extend the Room-
to-Room (R2R) [3] dataset of independent episodes—natural
language instructions and associated target paths in a particu-
lar scene—to tours—sequences of many episodes that cover
large swaths of the scene and include backtracking. The re-
sulting Iterative Room-to-Room tours contain substantially
longer paths and navigation instruction context than prior
discrete (IR2R) or continuous (IR2R-CE) VLN benchmarks.
The Iterative Paradigm We define a tour to be an ordered
sequence of episodes within a scene. Tours alternate between
two phases. In the agent navigation phase, the agent is
given a language instruction and infers navigation actions,
equivalent to a VLN episode. The phase ends when the
agent emits the STOP signal or takes a maximum number of
actions. The oracle navigation phase immediately follows in
two parts. First, if the agent has not successfully navigated to
within 0.5m of the episode goal, it is guided without language
to that goal by an oracle that forces its actions, analogous
to a human teaching the robot where the path should have
ended. Second, the agent is oracle-guided to the starting
point of the next episode in the tour, analogous to following
a human and waiting to receive the next instruction. The
agent passively observes the environment during this phase.
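A minimal sketch of this alternating tour structure follows; the agent, oracle, and env interfaces and the action budget are hypothetical names used only for illustration, not a released API:

```python
# Sketch of one IVLN tour: alternate agent and oracle navigation phases.
# `agent`, `oracle`, and `env` are hypothetical interfaces used for illustration.

SUCCESS_RADIUS_M = 0.5   # the agent must stop within 0.5 m of the episode goal
MAX_ACTIONS = 500        # per-episode action budget (illustrative value)

def run_tour(agent, oracle, env, tour_episodes):
    agent.reset_memory()  # memory persists for the whole tour, not per episode
    for i, episode in enumerate(tour_episodes):
        # --- Agent navigation phase (a standard VLN episode) ---
        obs = env.start_episode(episode)
        for _ in range(MAX_ACTIONS):
            action = agent.act(obs, episode.instruction)
            if action == "STOP":
                break
            obs = env.step(action)

        # --- Oracle navigation phase (agent observes but does not act) ---
        # 1) If the agent did not stop near the goal, force it to the goal.
        if env.distance_to(episode.goal) > SUCCESS_RADIUS_M:
            for action in oracle.shortest_path_actions(env.agent_pose, episode.goal):
                obs = env.step(action)
                agent.observe(obs)
        # 2) Guide the agent to the start of the next episode in the tour.
        if i + 1 < len(tour_episodes):
            next_start = tour_episodes[i + 1].start
            for action in oracle.shortest_path_actions(env.agent_pose, next_start):
                obs = env.step(action)
                agent.observe(obs)
```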
Dataset   Split        Scenes  Episodes  Tours  Tours/Scene  Tour Length (Episodes)
                                                             Mean   Min   Max    SD
IR2R      Train            61     14025    183      3.0      76.6     2    99  28.4
IR2R      Val-Seen         53      1011    159      3.0       6.4     2    11   2.1
IR2R      Val-Unseen       11      2349     33      3.0      71.2     6   100  34.0
IR2R-CE   Train            60     10668    222      3.7      48.1     3    93  30.5
IR2R-CE   Val-Seen         50       747    156      3.1       4.8     2    10   2.1
IR2R-CE   Val-Unseen       11      1824     36      3.3      50.7     3   100  31.3

Table 1. We construct sequences of episodes—tours—from the Room-to-Room dataset [3] to create the discrete IR2R and continuous IR2R-CE benchmarks. Here we detail characteristics of these benchmarks, including the average number of episodes per tour.

Generating Tours from VLN Data We generate tours that
minimize the distance between end and start points of sequential episodes. We also maximize the number of included episodes, as path finding between poses can fail in IR2R-CE.

Each R2R split contains a set of scenes, which each contain a set of episodes E. For each E, we seek to derive a set of disjoint tours 𝒯 where each tour T ∈ 𝒯 is a sequence of episodes that can be inter-navigated. That is, for episodes i and i+1 in T, navigation from the end of i to the start of i+1 is possible. Letting X be the set of unique paths in an episode set E, we first partition P(X) such that the paths in each subset p are inter-navigable; closed doors or obstacles can create disjoint regions in the scene. To determine P(X), we compute the navigable geodesic distance between each path pair, where a finite distance implies connectivity. In IR2R, this distance is computed on a navigation graph; in IR2R-CE, it is computed on a 3D navigation mesh and assumes agent dimensions and actions common to VLN-CE [25]. We then order the paths in each subset p to define a tour T. Minimizing the oracle navigation distance in a tour is equivalent to an asymmetric traveling salesperson problem (ATSP), which we approximately solve using the Lin-Kernighan heuristic (LKH) [17]. Finally, if E contains n instructions per path and n > 1, we duplicate each tour n times, sampling an instruction for each path without replacement.
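A rough sketch of this construction under the definitions above is shown below; geodesic_distance and solve_atsp_lkh are stand-in helpers (e.g., a navigation-graph or navigation-mesh distance query and an LKH binding), not the released tour-generation code:

```python
# Sketch of tour construction: partition unique paths by inter-navigability,
# then order each subset by approximately solving an ATSP over travel costs.
# `geodesic_distance` and `solve_atsp_lkh` are stand-in helpers.
import math
from collections import defaultdict

def partition_paths(paths, geodesic_distance):
    """Union-find grouping: a finite geodesic distance between two paths
    implies they lie in the same navigable region of the scene."""
    parent = list(range(len(paths)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(paths)):
        for j in range(i + 1, len(paths)):
            if math.isfinite(geodesic_distance(paths[i].end, paths[j].start)):
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i, path in enumerate(paths):
        groups[find(i)].append(path)
    return list(groups.values())

def order_subset(subset, geodesic_distance, solve_atsp_lkh):
    """Order one subset of paths to minimize total oracle navigation distance.
    The cost from path a to path b is the travel distance from a's end to
    b's start; `solve_atsp_lkh` is assumed to return a visiting order."""
    cost = [[geodesic_distance(a.end, b.start) for b in subset] for a in subset]
    order = solve_atsp_lkh(cost)
    return [subset[k] for k in order]
```

An actual LKH binding additionally expects details such as a dummy node when an open (non-cyclic) tour is desired; those details are omitted from this sketch.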
Dataset Characteristics We generate tours in the Train,
Validation-Seen, and Validation-Unseen splits of discrete
R2R to form IR2R and continuous R2R to form IR2R-CE
(Tab. 1). Validation-Seen (Val-Seen) contains episodes from
scenes seen during training, while Validation-Unseen (Val-
Unseen) contains episodes from scenes not seen during train-
ing. In total, IR2R contains 375 tours and IR2R-CE contains
414. There are fewer discrete tours, which are longer on av-
erage than continuous tours (Fig. 2a), due to discontinuities
in the navigable area of continuous environments. In discrete
VLN, a path exists from each node to every other node in a
scene, but in continuous environments navigation between
episode endpoints can fail, resulting in disjoint spaces within
a scene that have shorter tours. The distribution of episodes
per tour has a high variance for both benchmarks, a reflection