Iterative Vision-and-Language Navigation
Jacob Krantz1* Shurjo Banerjee2* Wang Zhu3
Jason Corso2 Peter Anderson4 Stefan Lee1 Jesse Thomason3
1Oregon State University 2University of Michigan 3University of Southern California 4Google Research
Abstract
We present Iterative Vision-and-Language Naviga-
tion (IVLN), a paradigm for evaluating language-guided
agents navigating in a persistent environment over time. Ex-
isting Vision-and-Language Navigation (VLN) benchmarks
erase the agent’s memory at the beginning of every episode,
testing the ability to perform cold-start navigation with
no prior information. However, deployed robots occupy
the same environment for long periods of time. The IVLN
paradigm addresses this disparity by training and evaluating
VLN agents that maintain memory across tours of scenes
that consist of up to 100 ordered instruction-following Room-
to-Room (R2R) episodes, each defined by an individual lan-
guage instruction and a target path. We present discrete
and continuous Iterative Room-to-Room (IR2R) benchmarks
comprising about 400 tours each in 80 indoor scenes. We
find that extending the implicit memory of high-performing
transformer VLN agents is not sufficient for IVLN, but agents
that build maps can benefit from environment persistence,
motivating a renewed focus on map-building agents in VLN.
1. Introduction
Robots and virtual agents that persistently operate in hu-
man spaces like homes should improve over time. For ex-
ample, a smart vacuum told to "clean the living room, which is down the hall past the guest bedroom" should learn about both the living room and the guest bedroom. Likewise, agents should be able to associate references in past instructions, such as "guest bedroom", with spatial and visual information from the environment to understand future instructions.
Most work on language-guided, embodied agents per-
forming navigation [3,25] or household tasks [38] is
episodic in nature—agent memory is erased before issu-
ing each new instruction. In contrast, physical robots build
maps [12,43,49] iteratively from visual observations [32,39] as an explicit form of long-term memory. Agents trained to perform language-guided navigation in simulation and then deployed on physical robots [2] fail to take advantage of the mapping-based strategies that facilitate robot navigation.

*Equal contributions. Correspondence: krantzja@oregonstate.edu
We propose Iterative Vision-and-Language Navigation
(IVLN), in which an agent follows an ordered sequence of
language instructions that conduct a tour of an indoor space.
Each tour is composed of individual episodes of language
instructions with target paths. Agents can utilize memory
to better understand future tour instructions. After just 10
episodes an agent has seen on average over 50% of the target
path associated with the next language instruction in a tour.
While performing an IVLN tour, agents iteratively explore
the environment, meaning regions irrelevant to task instruc-
tions need not ever be visited. By conditioning exploration
on language, IVLN enables rich semantic representations,
e.g., unusual, novel, and scene-specific referents grounded
during one episode can be reasoned about later.
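As an illustration of how such an overlap statistic can be computed, the following is a minimal sketch assuming target paths are sequences of discrete viewpoint IDs (as in the R2R navigation graph); it is not the paper's released evaluation code:

```python
# Sketch: fraction of the next target path already observed earlier in a tour.
# Assumes paths are sequences of discrete viewpoint IDs (as in the R2R graph).
def observed_fraction(next_path, previously_visited):
    """Fraction of viewpoints on the next target path that were already seen."""
    seen = sum(1 for viewpoint in next_path if viewpoint in previously_visited)
    return seen / len(next_path)

def overlap_over_tour(tour_paths):
    """Accumulate visited viewpoints over a tour and report per-episode overlap."""
    visited, fractions = set(), []
    for path in tour_paths:
        fractions.append(observed_fraction(path, visited))
        visited.update(path)
    return fractions
```

Averaging these per-episode fractions across tours at a fixed episode index is one way to arrive at a statistic like the 50%-after-10-episodes figure above.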
We explore both a discrete VLN setting based on Room-
to-Room [3] episodes and navigation graphs (IR2R) and a
continuous simulation VLN-CE [25] setting (IR2R-CE). The
markedly different action and visual observation spaces of
these settings may require different memory mechanisms.
In the discrete setting, agents move on graph edges and
observe clear, well-framed images. For IR2R, we extend a
state-of-the-art transformer agent [11] that learns an implicit
memory based on path history when interpreting instructions.
In the continuous setting, agents take motion actions while
observing noisy images of a 3D environment reconstructed
from discrete panorama images. For IR2R-CE, we propose
an agent that builds and interprets an explicit semantic map.
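To make the map-building idea concrete, here is only a rough sketch of accumulating a persistent top-down semantic grid across a tour; the projection helper pixels_to_world_xy, the grid constants, and the update rule are assumptions for illustration, not the agent proposed in this paper:

```python
# Rough sketch of a persistent top-down semantic map updated over a tour.
# Grid constants and `pixels_to_world_xy` are illustrative assumptions only.
import numpy as np

GRID_SIZE = 512      # cells per side of the allocentric map
CELL_M = 0.05        # 5 cm per cell
NUM_CLASSES = 40     # number of semantic categories

def make_map():
    return np.zeros((NUM_CLASSES, GRID_SIZE, GRID_SIZE), dtype=np.float32)

def update_map(semantic_map, semantics, depth, pose, pixels_to_world_xy):
    """Project per-pixel class labels into the persistent top-down grid.

    semantics: (H, W) integer class labels; depth: (H, W) depth image;
    pose: agent pose in the world frame; pixels_to_world_xy: assumed helper
    returning (H, W, 2) world-frame xy coordinates in meters for each pixel.
    """
    xy = pixels_to_world_xy(depth, pose)
    cols = np.clip((xy[..., 0] / CELL_M).astype(int) + GRID_SIZE // 2, 0, GRID_SIZE - 1)
    rows = np.clip((xy[..., 1] / CELL_M).astype(int) + GRID_SIZE // 2, 0, GRID_SIZE - 1)
    semantic_map[semantics, rows, cols] = 1.0  # mark each observed class at its cell
    return semantic_map
```

In the iterative setting, such a map would simply never be reset between episodes of the same tour, so observations made while following one instruction remain available for later ones.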
In short, we define Iterative Vision-and-Language Navi-
gation (IVLN), a paradigm for persistent VLN, and release
IR2R and IR2R-CE to study discrete and continuous navi-
gation agents in the IVLN setting. We create initial agents
for both benchmarks, including explicit mapping and im-
plicit memory models for continuous navigation. Please see
jacobkrantz.github.io/ivln for code and more details.
2. Related Work
Instruction-guided navigation is a growing area in
grounded language understanding with many task settings
[Figure 1 graphic: panels for Episode 1/82, Oracle 1/82, Episode 2/82, Episode 6/82, and Episode 82/82, each showing the tour map, the observed environment map, the language instruction, and the agent's initial observation.]
Figure 1. In IVLN, agents are given language instructions corresponding to a sequence of paths that form a tour around a 3D scene. After
attempting to follow each instruction, the agent is teleoperated by an oracle to the correct goal location, then to the start of the next path
where the next instruction is issued. Unlike conventional episodic paradigms, the agent retains memory between episodes.
developed [3,9,26,33,38,42]. Among these, the Vision-
and-Language Navigation (VLN) task setting based on the
Room-to-Room (R2R) dataset [3] has become a popular
benchmark. An agent in VLN must follow a natural language
instruction by navigating along the described path in a never-
before-seen environment. By design, this paradigm does not
consider how persistent agents operating over time might
leverage prior experiences to better follow future instructions
within the same environment. In contrast, accumulating prior
experience within an environment is a staple of robotic de-
ployment – e.g. building semantic maps for localization and
reasoning [35,41]. Our IVLN paradigm is designed to better
align VLN with a realistic robotic deployment scenario.
Benchmarks for VLN in Discrete Settings VLN tasks fre-
quently involve inferring agent actions in a rendered 2D or
3D scene in response to language commands [8,28]. Agent
control is typically limited to changing position and orienta-
tion by discrete amounts or to predefined possible options.
Advances in camera technology have enabled language-
guided navigation in photorealistic indoor scenes [3,7] and
outdoor city spaces [9]. In “Room-to-Room” (R2R) [3]
VLN, an agent interprets a single English instruction to navi-
gate along a short, indoor path. In a survey of VLN modeling
methods, environment exploration and memorization were
identified as frequent strategies for aligning a language in-
struction to a desired goal location in a scene [16]. However,
R2R evaluates policies on single instructions, limiting the
incentive to perform efficient, effective memorization or
mapping. To study longer horizon planning, researchers
have extended R2R by concatenating language-aligned paths
and their associated instructions [21,51], tasking agents not just with arriving at the goal but with closely following the described path. Others have collected longer paths with instructions in three languages [26] or given as a cooperative conversation [42]. With IR2R tours, we present the longest such paths, with substantial overlap in areas covered over time, challenging researchers to utilize information from prior instructions and experience in the scene.
Benchmarks for VLN in Continuous Settings Moving a
physical robot, such as a quad-copter [5] or a toy car [4],
in response to language instructions requires contending
with the real, continuous world. Existing work has trans-
ferred policies for discrete VLN to the physical world by
manually curating a discrete representation of the world
map as a navigation graph [2] with limited success. VLN-
CE [25] re-introduces Room-to-Room [3] with a continuous,
3D reconstruction of indoor MatterPort3D scenes. However,
VLN-CE evaluates agents on single instructions and asso-
ciated paths in an i.i.d. fashion. In contrast, our IR2R-CE
benchmark incentivizes policies that respect environment
persistence found in the real world. Beyond removing the
abstractions of discrete VLN (VLN-CE), IR2R-CE situates
agents in a scene for long time horizons with many language
instructions; a logical next step towards learning useful world
representations through visual and linguistic information.
Pre-Exploration in VLN Some approaches in VLN have
embraced a setting where agents can fully explore the en-
vironment before following an instruction, either explicitly
through pretraining (e.g. [40,44,50]) or through beam-search
at inference time (e.g. [14,30]). Pre-exploration methods
outperform standard VLN approaches and serve as a natural
upper bound to IVLN where an agent has fully explored the
environment. In contrast, IVLN studies how environment in-
formation can be collected while performing the task (rather
than a priori) and how this partial, opportunistic information
can be leveraged to perform better over time.
Persistent Environments in Embodied AI Zooming out,
visual navigation tasks in embodied AI have seen significant
progress, fueled by increased scale and quality of 3D scene
datasets (e.g. [7,34]) and high-performance simulation plat-
forms (e.g. [23,31,37,48]). A focus on real-world complexity
has emerged. One recognition is that agents act in, and inter-
act with, persistent environments. Tasks such as multi-object
navigation [45] and visual room rearrangement [46] involve
solving sequences of subtasks that, when approached inde-
pendently, cannot be solved optimally. Instead, reasoning
over persistent semantic and spatial information is required.
The proposed IVLN paradigm enriches this scene perception
problem with natural language and enables the association
of persistent visual semantics with linguistic information.
3. Iterative Vision-and-Language Navigation
We facilitate the study of agents given sequential naviga-
tion instructions in natural language. We extend the Room-
to-Room (R2R) [3] dataset of independent episodes—natural
language instructions and associated target paths in a particu-
lar scene—to tours—sequences of many episodes that cover
large swaths of the scene and include backtracking. The re-
sulting Iterative Room-to-Room tours contain substantially
longer paths and navigation instruction context than prior
discrete (IR2R) or continuous (IR2R-CE) VLN benchmarks.
The Iterative Paradigm We define a tour to be an ordered
sequence of episodes within a scene. Tours alternate between
two phases. In the agent navigation phase, the agent is
given a language instruction and infers navigation actions,
equivalent to a VLN episode. The phase ends when the
agent emits the STOP signal or takes a maximum number of
actions. The oracle navigation phase immediately follows in
two parts. First, if the agent has not successfully navigated to
within 0.5m of the episode goal, it is guided without language
to that goal by an oracle that forces its actions, analogous
to a human teaching the robot where the path should have
ended. Second, the agent is oracle-guided to the starting
point of the next episode in the tour, analogous to following
a human and waiting to receive the next instruction. The
agent passively observes the environment during this phase.
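A minimal sketch of this alternating tour structure follows; the agent, oracle, and env interfaces and the action budget are hypothetical names used only for illustration, not a released API:

```python
# Sketch of one IVLN tour: alternate agent and oracle navigation phases.
# `agent`, `oracle`, and `env` are hypothetical interfaces used for illustration.

SUCCESS_RADIUS_M = 0.5   # the agent must stop within 0.5 m of the episode goal
MAX_ACTIONS = 500        # per-episode action budget (illustrative value)

def run_tour(agent, oracle, env, tour_episodes):
    agent.reset_memory()  # memory persists for the whole tour, not per episode
    for i, episode in enumerate(tour_episodes):
        # --- Agent navigation phase (a standard VLN episode) ---
        obs = env.start_episode(episode)
        for _ in range(MAX_ACTIONS):
            action = agent.act(obs, episode.instruction)
            if action == "STOP":
                break
            obs = env.step(action)

        # --- Oracle navigation phase (agent observes but does not act) ---
        # 1) If the agent did not stop near the goal, force it to the goal.
        if env.distance_to(episode.goal) > SUCCESS_RADIUS_M:
            for action in oracle.shortest_path_actions(env.agent_pose, episode.goal):
                obs = env.step(action)
                agent.observe(obs)
        # 2) Guide the agent to the start of the next episode in the tour.
        if i + 1 < len(tour_episodes):
            next_start = tour_episodes[i + 1].start
            for action in oracle.shortest_path_actions(env.agent_pose, next_start):
                obs = env.step(action)
                agent.observe(obs)
```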
Dataset   Split        Scenes  Episodes  Tours  Tours/Scene  Tour Length (Episodes)
                                                             Mean   Min   Max    SD
IR2R      Train            61     14025    183      3.0      76.6     2    99  28.4
IR2R      Val-Seen         53      1011    159      3.0       6.4     2    11   2.1
IR2R      Val-Unseen       11      2349     33      3.0      71.2     6   100  34.0
IR2R-CE   Train            60     10668    222      3.7      48.1     3    93  30.5
IR2R-CE   Val-Seen         50       747    156      3.1       4.8     2    10   2.1
IR2R-CE   Val-Unseen       11      1824     36      3.3      50.7     3   100  31.3

Table 1. We construct sequences of episodes—tours—from the Room-to-Room dataset [3] to create the discrete IR2R and continuous IR2R-CE benchmarks. Here we detail characteristics of these benchmarks, including the average number of episodes per tour.

Generating Tours from VLN Data We generate tours that
minimize the distance between end and start points of sequential episodes. We also maximize the number of included episodes, as path finding between poses can fail in IR2R-CE.

Each R2R split contains a set of scenes, which each contain a set of episodes E. For each E, we seek to derive a set of disjoint tours 𝒯 where each tour T ∈ 𝒯 is a sequence of episodes that can be inter-navigated. That is, for episodes i and i+1 in T, navigation from the end of i to the start of i+1 is possible. Letting X be the set of unique paths in an episode set E, we first partition P(X) such that the paths in each subset p are inter-navigable; closed doors or obstacles can create disjoint regions in the scene. To determine P(X), we compute the navigable geodesic distance between each path pair, where a finite distance implies connectivity. In IR2R, this distance is computed on a navigation graph; in IR2R-CE, it is computed on a 3D navigation mesh and assumes agent dimensions and actions common to VLN-CE [25]. We then order the paths in each subset p to define a tour T. Minimizing the oracle navigation distance in a tour is equivalent to an asymmetric traveling salesperson problem (ATSP), which we approximately solve using the Lin-Kernighan heuristic (LKH) [17]. Finally, if E contains n instructions per path and n > 1, we duplicate each tour n times, sampling an instruction for each path without replacement.
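A rough sketch of this construction under the definitions above is shown below; geodesic_distance and solve_atsp_lkh are stand-in helpers (e.g., a navigation-graph or navigation-mesh distance query and an LKH binding), not the released tour-generation code:

```python
# Sketch of tour construction: partition unique paths by inter-navigability,
# then order each subset by approximately solving an ATSP over travel costs.
# `geodesic_distance` and `solve_atsp_lkh` are stand-in helpers.
import math
from collections import defaultdict

def partition_paths(paths, geodesic_distance):
    """Union-find grouping: a finite geodesic distance between two paths
    implies they lie in the same navigable region of the scene."""
    parent = list(range(len(paths)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(paths)):
        for j in range(i + 1, len(paths)):
            if math.isfinite(geodesic_distance(paths[i].end, paths[j].start)):
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i, path in enumerate(paths):
        groups[find(i)].append(path)
    return list(groups.values())

def order_subset(subset, geodesic_distance, solve_atsp_lkh):
    """Order one subset of paths to minimize total oracle navigation distance.
    The cost from path a to path b is the travel distance from a's end to
    b's start; `solve_atsp_lkh` is assumed to return a visiting order."""
    cost = [[geodesic_distance(a.end, b.start) for b in subset] for a in subset]
    order = solve_atsp_lkh(cost)
    return [subset[k] for k in order]
```

An actual LKH binding additionally expects details such as a dummy node when an open (non-cyclic) tour is desired; those details are omitted from this sketch.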
Dataset Characteristics We generate tours in the Train,
Validation-Seen, and Validation-Unseen splits of discrete
R2R to form IR2R and continuous R2R to form IR2R-CE
(Tab. 1). Validation-Seen (Val-Seen) contains episodes from
scenes seen during training, while Validation-Unseen (Val-
Unseen) contains episodes from scenes not seen during train-
ing. In total, IR2R contains 375 tours and IR2R-CE contains
414. There are fewer discrete tours, which are longer on av-
erage than continuous tours (Fig. 2a), due to discontinuities
in the navigable area of continuous environments. In discrete
VLN, a path exists from each node to every other node in a
scene, but in continuous environments navigation between
episode endpoints can fail, resulting in disjoint spaces within
a scene that have shorter tours. The distribution of episodes
per tour has a high variance for both benchmarks, a reflection