ULN: Towards Underspecified Vision-and-Language Navigation
Weixi Feng Tsu-Jui Fu Yujie Lu William Yang Wang
UC Santa Barbara
{weixifeng, tsu-juifu, yujielu, william}@cs.ucsb.edu
Abstract
Vision-and-Language Navigation (VLN) is a task in which an embodied agent is guided to a target position using language instructions. Despite significant performance improvements, the wide use of fine-grained instructions fails to characterize the more practical linguistic variations found in reality. To fill this gap, we introduce a new setting, namely Underspecified vision-and-Language Navigation (ULN), and associated evaluation datasets. ULN evaluates agents using multi-level underspecified instructions instead of purely fine-grained or coarse-grained ones, which is a more realistic and general setting. As a primary step toward ULN, we propose a VLN framework that consists of a classification module, a navigation agent, and an Exploitation-to-Exploration (E2E) module. Specifically, we propose to learn Granularity Specific Sub-networks (GSS) for the agent to ground multi-level instructions with minimal additional parameters. Our E2E module then estimates grounding uncertainty and conducts multi-step lookahead exploration to further improve the success rate. Experimental results show that existing VLN models are still brittle to multi-level language underspecification. Our framework is more robust and outperforms the baselines on ULN by 10% relative success rate across all levels.¹
1 Introduction
Vision-and-Language Navigation (VLN) allows a human user to command or instruct an embodied agent to reach target locations using verbal instructions. For this application to step beyond curated datasets into real-world settings, agents must generalize to the many linguistic variations of human instructions. Despite significant progress in VLN datasets (Anderson et al., 2018b; Chen et al., 2019; Ku et al., 2020; Shridhar et al., 2020) and agent design (Fried et al., 2018; Li et al., 2021; Min et al., 2021), it remains an open question whether existing models are general and robust enough to deal with all kinds of language variations.

¹Our code and data are available at https://github.com/weixi-feng/ULN.

Figure 1: Navigation results of a baseline, (a) VLN⟳BERT (left), and (b) our VLN framework (right), with multi-level underspecified instructions (L0-L3). Trajectories are curved for demonstration. Note that the baseline stops early and fails to reach the target position with L1-L3. Our agent manages to reach the goal across all levels.
For the language input in an indoor environment, some datasets focus on long and detailed instructions with route descriptions at every step to achieve fine-grained language grounding (Anderson et al., 2018b; Ku et al., 2020) or long-horizon navigation (Jain et al., 2019; Zhu et al., 2020a). For instance, from Room-to-Room (R2R) (Anderson et al., 2018b) to Room-Across-Room (RxR) (Ku et al., 2020), the average instruction length increases from 29 to 129 words. Other datasets have coarse-grained instructions, such as REVERIE (Qi et al., 2020) or SOON (Zhu et al., 2021a). Agents are trained and evaluated on a single granularity or one type of expression.
In contrast, we propose to evaluate VLN agents on multi-level granularity to better understand the behavior of embodied agents with respect to language variations. Our motivation is that users are inclined to give shorter instructions instead of detailed route descriptions because 1) users are not omniscient observers who follow the route and describe it step by step for the agent; 2) shorter instructions are more practical, reproducible, and efficient from a user's perspective; and 3) users tend to underspecify commands in familiar environments such as personal households. Therefore, we propose a new setting, namely Underspecified vision-and-Language Navigation (ULN), and an associated evaluation dataset built on top of R2R, namely R2R-ULN, to address these issues. R2R-ULN contains underspecified instructions in which route descriptions are successively removed from the original instructions. Each long R2R instruction corresponds to three shortened and rephrased instructions at different levels, which preserves partial alignment but also introduces variation.
As shown in Fig. 1, the goal of ULN is to facilitate the development of a generalized VLN design that achieves balanced performance across all granularity levels. As a primary step toward ULN, we propose a modular VLN framework that consists of an instruction classification module, a navigation agent, and an Exploitation-to-Exploration (E2E) module. The classification module first classifies the input instruction as high-level or low-level in granularity so that our agent can encode the two types accordingly. As for the agent, we propose to learn Granularity Specific Sub-networks (GSS) to handle both levels with minimal additional parameters. A sub-network, e.g., the text encoder, is trained for each level while the other parameters are shared. Finally, the E2E module estimates the step-wise language grounding uncertainty and conducts multi-step lookahead exploration to rectify wrong decisions that originate from underspecified language.

Our VLN framework is model-agnostic and can be applied to many previous agents that follow an "encode-then-fuse" mechanism for language and visual inputs. We build our framework on two state-of-the-art (SOTA) VLN agents to demonstrate its effectiveness. We conduct extensive experiments to analyze the generalization of existing agents and of our framework on ULN and on the original datasets with fine-grained instructions. Our contribution is three-fold:
• We propose a novel setting named Underspecified vision-and-Language Navigation (ULN) to account for multi-level language variations in instructions. We collect a large-scale evaluation dataset, R2R-ULN, which consists of 9k validation and 4k testing instructions.

• We propose a VLN framework that consists of Granularity Specific Sub-networks (GSS) and an E2E module, enabling navigation agents to handle both low-level and high-level instructions.

• Experiments show that achieving consistent performance across multi-level underspecification can be much more challenging for existing VLN agents. Furthermore, our VLN framework improves the success rate by 10% relative to the baselines and mitigates the performance gap across all levels.
2 Related Work
Language Variations for Multimodal Learning
Natural language input has been an essential component of modern multimodal learning tasks, combined with other modalities such as vision (Antol et al., 2015; Johnson et al., 2017), speech (Alayrac et al., 2020), or gestures (Chen et al., 2021b). The effect of language variations has been studied in many vision-and-language (V&L) tasks (Bisk et al., 2016; Agrawal et al., 2018; Cirik et al., 2018; Zhu et al., 2020b; Lin et al., 2021). For instance, referring expression datasets (Kazemzadeh et al., 2014; Yu et al., 2016; Mao et al., 2016) contain multiple expressions for the same referred object. Ref-Adv (Akula et al., 2020) studies the robustness of referring expression models by switching word orders. In Visual Question Answering (VQA), Shah et al. (2019) find that VQA models are brittle to rephrased questions with the same meaning. As for VLN, we characterize the linguistic and compositional variations introduced by rephrasing and dropping sub-instructions from a full instruction with complete route descriptions. We also define three different levels to formalize underspecification for navigation instructions.
VLN Datasets
VLN has gained much attention (Gu et al., 2022) with the emergence of various simulation environments and datasets (Chang et al., 2017; Kolve et al., 2017; Jain et al., 2019; Nguyen and Daumé III, 2019; Koh et al., 2021). R2R (Anderson et al., 2018a) and RxR (Ku et al., 2020) provide fine-grained instructions that guide the agent in a step-wise manner. FG-R2R (Hong et al., 2020a) and Landmark-RxR (He et al., 2021) segment the instructions into action units and explicitly ground sub-instructions on visual observations. In contrast, REVERIE (Qi et al., 2020) and SOON (Zhu et al., 2021a) propose to use referring expressions with no guidance on the intermediate steps that lead to the final destination. Compared to these datasets, ULN aims to build an agent that can generalize to multi-level granularity after training once, which is more practical for real-world applications.
Embodied Navigation Agents
Learning to ground instructions on visual observations is a major challenge for an agent generalizing to unseen environments (Wang et al., 2019; Deng et al., 2020; Fu et al., 2020). Previous studies demonstrate significant improvements from data augmentation (Fried et al., 2018; Tan et al., 2019; Zhu et al., 2021b; Fang et al., 2022; Li et al., 2022), pre-training tasks (Hao et al., 2020; Chen et al., 2021a; Qiao et al., 2022), and decoding algorithms (Ma et al., 2019a; Ke et al., 2019; Ma et al., 2019b; Chen et al., 2022). Among exploration-based methods, FAST (Ke et al., 2019) proposes a search algorithm that allows the agent to backtrack to the most promising trajectory. SSM (Wang et al., 2021) memorizes local and global action spaces and estimates multiple scores for candidate nodes on the frontier of trajectories. Compared to E2E, Active VLN (Wang et al., 2020) is the most relevant work, in which an additional policy is learned for multi-step exploration. However, their reward function is defined based on distances to target locations, whereas our uncertainty estimation is based on step-wise grounding mismatch. Our E2E module is also more efficient, with fewer parameters and lower training complexity.
3 Underspecification in VLN
Our dataset construction is three-fold: We first obtain underspecified instructions by asking workers to simplify and rephrase the R2R instructions. Then, we validate that the goals are still reachable with underspecified instructions. Finally, we verify that instructions from R2R-ULN are preferred over R2R ones from a user's perspective, which supports the necessity of the ULN setting. We briefly describe the definitions and our ULN dataset in this section, with more details in Appendix A.

Level  Instruction
L0     Turn around and go down the stairs. At the bottom turn slightly right and enter the room with the TV on the wall and a green table. Walk to the right past the TV. Stop at the door to the right facing into the bathroom. (from R2R)
L1     Take the stairs to the bottom and enter the room with the TV on the wall and a green table. Walk past the TV. Stop at the door to facing into the bathroom. (Redundancy Removed)
L2     Take the stairs to the bottom and enter the room a green table. Walk past the TV. Stop at the bathroom door. (Partial Route Description)
L3     Go to the door of the bathroom next to the room with a green table. (No Route Description)

Table 1: Instruction examples from the R2R-ULN validation set. We mark removed words in red and rephrased words in blue in the next level.
3.1 Instruction Simplification
We formalize the instruction collection as a sentence simplification task and ask human annotators to remove details from the instructions progressively. Denoting the original R2R instructions as Level 0 (L0), annotators rewrite each L0 into three different levels of underspecified instructions. We discover that some components in L0 can be redundant or rarely used in indoor environments, such as "turn 45 degrees". Therefore, to obtain Level 1 (L1) from each L0 instruction, annotators rewrite L0 by removing any redundant part while keeping most of the route description unchanged. Redundant components include, but are not limited to, repetition, excessive details, and directional phrases (see Table 1). As for Level 2 (L2), annotators remove one or two sub-instructions from L1, creating a scenario where users omit some details in commonplace environments. We collect Level 3 (L3) instructions by giving destination information, such as the region label and floor level, and asking annotators to write one sentence directly referring to the object or location of the destination point.
3.2 Instruction Verification
R2R-ULN Val-Unseen
Level  Instr. Following   Instr. Preference
       SR↑   SPL↑         Practicality  Efficiency
L0     86    72           -     -
L1     82    68           55%   57%
L2     82    65           63%   59%
L3     75    58           68%   66%

Table 2: Human performance on R2R-ULN validation unseen in terms of Success Rate (SR) and SR weighted by Path Length (SPL), and human preference assessment results. The percentages denote the ratio of participants selecting Li over L0.

To ensure that the underspecified instructions provide a feasible performance upper bound for VLN agents, we have another group of annotators navigate in an interactive interface from R2R (Anderson et al., 2018b). As shown in Table 2, annotators achieve a slightly degraded but still promising success rate (SR) with L3. SPL is a metric that normalizes SR by the path length. Therefore, the trade-off for maintaining a high SR is taking more exploration steps, which results in a much lower SPL value. We also verify that $L_i$, $i \in \{1,2,3\}$, are more practical and efficient choices than L0. Table 2 shows that people prefer underspecified instructions over full instructions in both aspects, with an increasing trend as $i$ increases to 3.
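Since Table 2 reports both SR and SPL, a minimal reference implementation of the SPL computation may help; the sketch below follows the standard definition (success weighted by the ratio of shortest-path length to actual path length) and is not the evaluation script used for the paper's numbers.

```python
def spl(successes, shortest_dists, path_lengths):
    """Success weighted by Path Length.

    successes      : list of 0/1 episode outcomes
    shortest_dists : shortest-path distance from start to goal per episode
    path_lengths   : length of the path the agent (or annotator) actually took
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_dists, path_lengths):
        total += s * l / max(p, l)  # longer paths are penalized, success required
    return total / len(successes)

# Example: one successful but inefficient episode and one failure.
print(spl([1, 0], [10.0, 8.0], [14.0, 9.0]))  # -> ~0.357
```

This makes the trade-off in Table 2 explicit: an annotator can keep SR high under L3 by exploring more, but every extra step inflates the path length and lowers SPL.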
4 Method
4.1 Overview
In this section, we present our VLN framework for handling multi-level underspecified language inputs, which mainly consists of three modules (see Figure 2). Given a natural language instruction as a sequence of tokens, $W = (w_1, \ldots, w_n)$, the classification module first categorizes the language input as a low-level (L0, L1, L2) or high-level (L3) instruction. To handle these two types accordingly, GSS learns a sub-network, e.g., the text encoder, for each type while the other parameters are shared. At each step $t$, we denote the visual observation as $O_t = ([v_1; a_1], \ldots, [v_N; a_N])$ with visual feature $v_i$ and angle feature $a_i$ of the $i$-th view among all $N$ views. The history is a concatenation of all observations before step $t$: $H_t = (O_1, \ldots, O_{t-1})$. Given $W$, $H_t$, and $O_t$, the GSS-based agent predicts an action $a_t$ by choosing a navigable viewpoint from $O_t$. To overcome the reference misalignment issue, the E2E module predicts a sequence of uncertainty scores $S = (s_1, \ldots, s_T)$ and conducts multi-step lookahead exploration to collect future visual information.
Figure 2: Our VLN framework with the instruction classification module, the navigation agent, and the E2E module (uncertainty estimation, multi-step lookahead, and decision making).
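To make the overview concrete, the following Python sketch shows one way the three modules could be wired together at inference time. The module interfaces (classifier, agent.encode_text, agent.step, e2e.uncertainty, e2e.lookahead), the uncertainty threshold, and the step limit are illustrative assumptions rather than the authors' released implementation.

```python
def navigate(instruction_tokens, env, classifier, agent, e2e,
             max_steps=15, uncertainty_threshold=0.5):
    """Run one episode with a ULN-style framework (illustrative sketch).

    classifier: maps token ids -> granularity logits (0 = low-level, 1 = high-level)
    agent:      GSS-based navigator with encode_text(tokens, granularity) and
                step(text_emb, history, obs) -> (action_logits, state)
    e2e:        Exploitation-to-Exploration module with uncertainty(state) -> float
                and lookahead(env, candidate_actions) -> action
    """
    granularity = classifier(instruction_tokens).argmax(-1).item()
    text_emb = agent.encode_text(instruction_tokens, granularity)  # GSS text encoder

    history, obs = [], env.reset()
    for _ in range(max_steps):
        action_logits, state = agent.step(text_emb, history, obs)
        if e2e.uncertainty(state) > uncertainty_threshold:
            # Exploration: grounding looks unreliable, so look several steps
            # ahead along the top candidate actions before committing.
            action = e2e.lookahead(env, action_logits.topk(2).indices)
        else:
            # Exploitation: greedily take the most confident action.
            action = action_logits.argmax(-1)
        history.append(obs)
        obs, done = env.step(action)
        if done:
            break
    return obs
```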
4.2 Instruction Classification
VLN agents can operate in two different modes, fidelity-oriented or goal-oriented, depending on the reward function (Jain et al., 2019) or the text input (Zhu et al., 2022). Agents trained on low-level granularity suffer performance degradation when applied to high-level instructions, and vice versa. As shown in Figure 2, we therefore propose to first classify the text input into the two granularities and then encode it accordingly in the downstream modules. Our classification module contains an embedding layer, average pooling, and a fully-connected layer that outputs binary class predictions.
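As an illustration of this design, the PyTorch sketch below stacks an embedding layer, average pooling over non-padding tokens, and a fully-connected output layer. The vocabulary size, embedding dimension, and padding convention are assumptions for the example, not values specified in the paper.

```python
import torch
import torch.nn as nn

class GranularityClassifier(nn.Module):
    """Binary low-level vs. high-level instruction classifier (sketch)."""

    def __init__(self, vocab_size=30522, embed_dim=256, pad_id=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_id)
        self.fc = nn.Linear(embed_dim, 2)  # 2 classes: low-level / high-level
        self.pad_id = pad_id

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        emb = self.embedding(token_ids)           # (batch, seq_len, embed_dim)
        mask = (token_ids != self.pad_id).unsqueeze(-1).float()
        # Average pooling over non-padding tokens only.
        pooled = (emb * mask).sum(1) / mask.sum(1).clamp(min=1.0)
        return self.fc(pooled)                    # (batch, 2) class logits

# Usage: classify a toy batch of token ids (indices here are arbitrary).
clf = GranularityClassifier()
logits = clf(torch.tensor([[101, 2175, 2091, 1996, 5108, 102, 0, 0]]))
granularity = logits.argmax(-1)  # 0 = low-level (L0-L2), 1 = high-level (L3)
```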
4.3 Navigation Agent
Base Agent
We summarize the high-level framework of many transformer-based agents (Hao et al., 2020; Guhur et al., 2021; Moudgil et al., 2021), parameterized as $\theta$, as shown in Figure 2. Given the history $H_t$, the instruction $W$, and the visual observation $O_t$, the agent first encodes each modality with encoders $f_{hist}$, $f_{text}$, $f_{img}$:

$$X = f_{text}(W), \quad H_t = f_{hist}(H_t), \quad O_t = f_{img}(O_t) \tag{1}$$
HAMT (Chen et al., 2021a) applies a ViT (Dosovitskiy et al., 2020) and a Panoramic Transformer to hierarchically encode $H_t$ as a sequence of embeddings $H_t = (h_1, \ldots, h_{t-1})$, while VLN⟳BERT (Hong et al., 2021) encodes $H_t$ as a single state vector $H_t = h_t$. The embeddings from each modality are then fed into an $L$-layer cross-modal transformer $f_{cm}$, where each layer $l$ first applies a cross-attention:

$$\alpha^{v \to t}_{t,l} = \frac{\left([H_{t,l}; O_{t,l}]\, W^{query}_l\right)\left(X_{t,l}\, W^{key}_l\right)^{T}}{\sqrt{d_h}} \tag{2}$$

where $\alpha^{v \to t}_{t,l}$ denotes the attention weights of the history-visual concatenation over the language embeddings, and $d_h$ is the hidden dimension. We omit
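To make Eq. (2) concrete, the following PyTorch sketch computes the attention of the concatenated history-visual sequence over the language embeddings. It uses a single head, random toy inputs, and an explicit softmax normalization (which Eq. (2) leaves implicit), so it illustrates the operation rather than reproducing the agents' actual cross-modal layer.

```python
import math
import torch
import torch.nn as nn

def cross_modal_attention_weights(hist_vis, text, w_query, w_key):
    """Attention of the [history; observation] sequence over language tokens,
    following the form of Eq. (2); single head, no masking, illustration only.

    hist_vis : (batch, n_hv, d_h)   concatenated history and visual embeddings
    text     : (batch, n_txt, d_h)  language embeddings
    """
    d_h = hist_vis.size(-1)
    queries = w_query(hist_vis)                           # (batch, n_hv, d_h)
    keys = w_key(text)                                    # (batch, n_txt, d_h)
    scores = queries @ keys.transpose(-2, -1) / math.sqrt(d_h)
    return scores.softmax(dim=-1)                         # (batch, n_hv, n_txt)

# Toy usage with random inputs and illustrative dimensions.
d_h = 768
w_query, w_key = nn.Linear(d_h, d_h), nn.Linear(d_h, d_h)
hist_vis = torch.randn(1, 37, d_h)   # e.g., history state + 36 panoramic views
text = torch.randn(1, 60, d_h)       # encoded instruction tokens
alpha = cross_modal_attention_weights(hist_vis, text, w_query, w_key)
```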