
Kolve et al., 2017; Jain et al., 2019; Nguyen and
Daumé III, 2019; Koh et al., 2021). R2R (Anderson et al., 2018a) and RxR (Ku et al., 2020) provide
fine-grained instructions that guide the agent in
a step-wise manner. FG-R2R (Hong et al., 2020a)
and Landmark-RxR (He et al., 2021) segment the
instructions into action units and explicitly ground
sub-instructions on visual observations. In contrast,
REVERIE (Qi et al., 2020) and SOON (Zhu et al.,
2021a) propose using referring expressions with
no guidance on the intermediate steps that lead to the
final destination. Compared to these datasets, ULN
aims to build an agent that can generalize to multiple
levels of granularity after a single training run, which is more
practical for real-world applications.
Embodied Navigation Agents
Learning to
ground instructions on visual observations is a
major challenge for an agent to generalize to unseen
environments (Wang et al., 2019; Deng et al.,
2020; Fu et al., 2020). Previous studies demonstrate
significant improvements from data augmentation
(Fried et al., 2018; Tan et al., 2019; Zhu et al.,
2021b; Fang et al., 2022; Li et al., 2022), designing
pre-training tasks (Hao et al., 2020; Chen et al.,
2021a; Qiao et al., 2022), and decoding algorithms
(Ma et al., 2019a; Ke et al., 2019; Ma et al., 2019b;
Chen et al., 2022). Among exploration-based methods,
FAST (Ke et al., 2019) proposes a search
algorithm that allows the agent to backtrack to the
most promising trajectory. SSM (Wang et al., 2021)
memorizes local and global action spaces and
estimates multiple scores for candidate nodes in the
frontier of trajectories. Active VLN (Wang et al., 2020)
is the work most relevant to our E2E module:
it learns an additional policy for multi-step
exploration. However, its reward function
is defined by distances to target locations, whereas
our uncertainty estimation is based on step-wise
grounding mismatch. Our E2E module is also more
efficient, with fewer parameters and lower training
complexity.
3 Underspecification in VLN
Our dataset construction proceeds in three stages. We first obtain
underspecified instructions by asking workers
to simplify and rephrase the R2R instructions.
Then, we validate that the goals are still reachable
with underspecified instructions. Finally, we verify
that instructions from R2R-ULN are preferred to
R2R ones from a user’s perspective, which proves
the necessity of the ULN setting. We briefly describe the definitions and our ULN dataset in this section, with more details in Appendix A.

Level  Instructions
L0: Turn around and go down the stairs. At the bottom turn slightly right and enter the room with the TV on the wall and a green table. Walk to the right past the TV. Stop at the door to the right facing into the bathroom. (from R2R)
L1: Take the stairs to the bottom and enter the room with the TV on the wall and a green table. Walk past the TV. Stop at the door facing into the bathroom. (Redundancy Removed)
L2: Take the stairs to the bottom and enter the room with a green table. Walk past the TV. Stop at the bathroom door. (Partial Route Description)
L3: Go to the door of the bathroom next to the room with a green table. (No Route Description)

Table 1: Instruction examples from the R2R-ULN validation set. We mark removed words in red and rephrased words in blue in the next level.
3.1 Instruction Simplification
We formalize the instruction collection as a sen-
tence simplification task and ask human annotators
to remove details from the instructions progres-
sively. Denoting the original R2R instructions as
Level 0 (L0), annotators rewrite each L0 into three
different levels of underspecified instructions. We
discover that some components in L0 can be redundant
or rarely used in indoor environments, such
as "turn 45 degrees". Therefore, to obtain Level 1
(L1) from each L0 instruction, annotators rewrite
L0 by removing any redundant parts but keep most
of the route description unchanged. Redundant
components include, but are not limited to, repetition,
excessive details, and directional phrases (see
Table 1). For Level 2 (L2), annotators remove
one or two sub-instructions from L1, creating a
scenario where users omit some details about commonplace
locations. We collect Level 3 (L3) instructions
by giving annotators destination information, such as the region
label and floor level, and asking them to write one
sentence directly referring to the object or location
at the destination point.
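The four-level scheme above can be pictured as a simple record pairing one trajectory with its instructions at each granularity. The sketch below is illustrative only; the class and field names are hypothetical, not the released dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class ULNInstance:
    """One hypothetical R2R-ULN item: a trajectory paired with
    instructions at four underspecification levels (L0-L3)."""
    path_id: str
    l0: str  # original fine-grained R2R instruction
    l1: str  # redundancy removed
    l2: str  # partial route description
    l3: str  # goal-only referring expression

    def instruction(self, level: int) -> str:
        # Select the instruction at the requested level (0 = most detailed).
        return {0: self.l0, 1: self.l1, 2: self.l2, 3: self.l3}[level]

# Example record using the Table 1 instructions (abridged).
example = ULNInstance(
    path_id="demo",
    l0="Turn around and go down the stairs. ... Stop at the door "
       "to the right facing into the bathroom.",
    l1="Take the stairs to the bottom ... Stop at the door facing "
       "into the bathroom.",
    l2="Take the stairs to the bottom and enter the room with a "
       "green table. ...",
    l3="Go to the door of the bathroom next to the room with a green table.",
)
print(example.instruction(3))
```

A multi-level agent would be trained once and then queried with `instruction(level)` for any level at evaluation time, which is the generalization setting ULN targets.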
3.2 Instruction Verification
To ensure that the underspecified instructions pro-
vide a feasible performance upper bound for VLN
agents, we have another group of annotators navi-
gate in an interactive interface from R2R (Anderson