
representations through visual and linguistic information.
Pre-Exploration in VLN. Some approaches in VLN have
embraced a setting where agents can fully explore the en-
vironment before following an instruction, either explicitly
through pretraining (e.g. [40,44,50]) or through beam-search
at inference time (e.g. [14,30]). Pre-exploration methods
outperform standard VLN approaches and serve as a natural
upper bound to IVLN where an agent has fully explored the
environment. In contrast, IVLN studies how environment in-
formation can be collected while performing the task (rather
than a priori) and how this partial, opportunistic information
can be leveraged to perform better over time.
Persistent Environments in Embodied AI. Zooming out,
visual navigation tasks in embodied AI have seen significant
progress, fueled by increased scale and quality of 3D scene
datasets (e.g. [7,34]) and high-performance simulation plat-
forms (e.g. [23,31,37,48]). A focus on real-world complexity
has emerged. One recognition is that agents act in, and inter-
act with, persistent environments. Tasks such as multi-object
navigation [45] and visual room rearrangement [46] involve
solving sequences of subtasks that, when approached inde-
pendently, cannot be solved optimally. Instead, reasoning
over persistent semantic and spatial information is required.
The proposed IVLN paradigm enriches this scene perception
problem with natural language and enables the association
of persistent visual semantics with linguistic information.
3. Iterative Vision-and-Language Navigation
We facilitate the study of agents given sequential naviga-
tion instructions in natural language. We extend the Room-
to-Room (R2R) [3] dataset of independent episodes—natural
language instructions and associated target paths in a particu-
lar scene—to tours—sequences of many episodes that cover
large swaths of the scene and include backtracking. The resulting
Iterative Room-to-Room tours, in discrete (IR2R) and continuous
(IR2R-CE) form, contain substantially longer paths and navigation
instruction context than prior discrete or continuous VLN benchmarks.
The Iterative Paradigm. We define a tour to be an ordered
sequence of episodes within a scene. Tours alternate between
two phases. In the agent navigation phase, the agent is
given a language instruction and infers navigation actions,
equivalent to a VLN episode. The phase ends when the
agent emits the STOP signal or takes a maximum number of
actions. The oracle navigation phase immediately follows in
two parts. First, if the agent has not successfully navigated to
within 0.5 m of the episode goal, it is guided without language
to that goal by an oracle that forces its actions, analogous
to a human teaching the robot where the path should have
ended. Second, the agent is oracle-guided to the starting
point of the next episode in the tour, analogous to following
a human and waiting to receive the next instruction. The
agent passively observes the environment during this phase.
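The alternation between the two phases can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the benchmark's implementation: `Episode`, `run_tour`, the `policy` callable, and the `max_actions` default are hypothetical names and values, positions are 2D points, and the oracle is simplified to a single jump rather than a sequence of forced actions. Only the 0.5 m threshold and the order of the phases come from the text.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float]
GOAL_RADIUS_M = 0.5  # from the text: the agent must end within 0.5 m of the goal


@dataclass
class Episode:
    instruction: str
    start: Point
    goal: Point


# Hypothetical policy signature: returns the next position, or None for STOP.
Policy = Callable[[str, Point, List[Tuple[str, Point]]], Optional[Point]]


def run_tour(episodes: Sequence[Episode], policy: Policy, max_actions: int = 10):
    """Run one tour; `memory` persists across episodes within the tour."""
    memory: List[Tuple[str, Point]] = []  # (phase, observed position)
    reached: List[bool] = []
    position = episodes[0].start
    for i, episode in enumerate(episodes):
        # Agent navigation phase: act until STOP or the action budget runs out.
        for _ in range(max_actions):
            memory.append(("agent", position))
            next_position = policy(episode.instruction, position, memory)
            if next_position is None:  # STOP
                break
            position = next_position
        at_goal = math.dist(position, episode.goal) < GOAL_RADIUS_M
        reached.append(at_goal)
        # Oracle navigation phase, part 1: guide a failed agent to the goal.
        if not at_goal:
            position = episode.goal  # simplification: one jump, no forced actions
            memory.append(("oracle", position))
        # Oracle navigation phase, part 2: guide to the next episode's start.
        if i + 1 < len(episodes):
            position = episodes[i + 1].start
            memory.append(("oracle", position))
    return reached, memory
```

Because observations from the oracle phase are appended to the same `memory` as the agent's own, a policy can draw on everything seen earlier in the tour, which is the persistent information the paradigm is meant to expose.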

Dataset   Split        Scenes   Episodes   Tours   Tours/Scene   Tour Length (Episodes)
                                                                 Mean   Min   Max   SD
IR2R      Train        61       14025      183     3.0           76.6   2     99    28.4
IR2R      Val-Seen     53       1011       159     3.0           6.4    2     11    2.1
IR2R      Val-Unseen   11       2349       33      3.0           71.2   6     100   34.0
IR2R-CE   Train        60       10668      222     3.7           48.1   3     93    30.5
IR2R-CE   Val-Seen     50       747        156     3.1           4.8    2     10    2.1
IR2R-CE   Val-Unseen   11       1824       36      3.3           50.7   3     100   31.3

Table 1. We construct sequences of episodes—tours—from the
Room-to-Room dataset [3] to create the discrete IR2R and continuous
IR2R-CE benchmarks. Here we detail characteristics of these
benchmarks, including the average number of episodes per tour.

Generating Tours from VLN Data. We generate tours that
minimize the distance between end and start points of sequential
episodes. We also maximize the number of included
episodes as path finding between poses can fail in IR2R-CE.
Each R2R split contains a set of scenes, each of which contains a
set of episodes $\mathcal{E}$. For each $\mathcal{E}$, we seek to derive a set
of disjoint tours $\mathcal{T}$ where each tour $T \in \mathcal{T}$ is a sequence of
episodes that can be inter-navigated. That is, for episodes $i$ and
$i+1$ in $T$, navigation from the end of $i$ to the start of $i+1$
is possible. Letting $X$ be the set of unique paths in an episode
set $\mathcal{E}$, we first compute a partition $P(X)$ such that the paths in
each subset $p$ are inter-navigable; closed doors or obstacles can
create disjoint regions in the scene. To determine $P(X)$, we
compute the navigable geodesic distance between each path
pair, where a finite distance implies connectivity. In IR2R,
this distance is computed on a navigation graph; in IR2R-CE,
it is computed on a 3D navigation mesh and assumes agent
dimensions and actions common to VLN-CE [25]. We then
order the paths in each subset $p$ to define a tour $T$. Minimizing
the oracle navigation distance in a tour is equivalent to
an asymmetric traveling salesperson problem (ATSP), which
we approximately solve using the Lin-Kernighan heuristic
(LKH) [17]. Finally, if $\mathcal{E}$ contains $n$ instructions per path
and $n > 1$, we duplicate each tour $n$ times, sampling an
instruction for each path without replacement.
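The three steps above (partition, order, duplicate) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline. All function names are hypothetical. `dist[i][j]` is assumed to be a precomputed geodesic distance from the end of path `i` to the start of path `j`, with `math.inf` marking unreachable pairs, and reachability is assumed to be symmetric. A greedy nearest-neighbor ordering stands in for LKH, so the resulting tours are generally not as short as those an LKH solver would find.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def partition_paths(dist: Matrix) -> List[List[int]]:
    """Group path indices into inter-navigable subsets (finite distance = connected)."""
    seen, subsets = set(), []
    for source in range(len(dist)):
        if source in seen:
            continue
        seen.add(source)
        component, stack = [], [source]
        while stack:  # depth-first search over finite-distance links
            u = stack.pop()
            component.append(u)
            for v in range(len(dist)):
                if v not in seen and math.isfinite(dist[u][v]):
                    seen.add(v)
                    stack.append(v)
        subsets.append(sorted(component))
    return subsets


def order_greedy(subset: Sequence[int], dist: Matrix) -> List[int]:
    """Nearest-neighbor ordering; a simple stand-in for LKH, not LKH itself."""
    tour, remaining = [subset[0]], set(subset[1:])
    while remaining:
        last = tour[-1]
        nearest = min(remaining, key=lambda j: (dist[last][j], j))
        tour.append(nearest)
        remaining.remove(nearest)
    return tour


def oracle_distance(tour: Sequence[int], dist: Matrix) -> float:
    """Total end-to-start distance the oracle covers between consecutive paths."""
    return sum(dist[a][b] for a, b in zip(tour, tour[1:]))


def duplicate_with_instructions(
    tour: Sequence[int], instructions: Dict[int, List[str]], rng: random.Random
) -> List[List[Tuple[int, str]]]:
    """Make n copies of a tour; each path uses each of its instructions once."""
    n = min(len(instructions[p]) for p in tour)
    # A full-length sample is a random permutation, i.e. sampling without replacement.
    shuffled = {p: rng.sample(instructions[p], len(instructions[p])) for p in tour}
    return [[(p, shuffled[p][k]) for p in tour] for k in range(n)]
```

For example, with four paths where the last is cut off from the rest, `partition_paths` yields two subsets, and each subset is then ordered and duplicated independently. The starting path of each ordering (`subset[0]`) is an arbitrary choice in this sketch.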
Dataset Characteristics. We generate tours in the Train,
Validation-Seen, and Validation-Unseen splits of discrete
R2R to form IR2R and continuous R2R to form IR2R-CE
(Tab. 1). Validation-Seen (Val-Seen) contains episodes from
scenes seen during training, while Validation-Unseen (Val-
Unseen) contains episodes from scenes not seen during train-
ing. In total, IR2R contains 375 tours and IR2R-CE contains
414. There are fewer discrete tours, which are longer on av-
erage than continuous tours (Fig. 2a), due to discontinuities
in the navigable area of continuous environments. In discrete
VLN, a path exists from each node to every other node in a
scene, but in continuous environments navigation between
episode endpoints can fail, resulting in disjoint spaces within
a scene that have shorter tours. The distribution of episodes
per tour has a high variance for both benchmarks, a reflection