ULN: Towards Underspecified Vision-and-Language Navigation
Weixi Feng Tsu-Jui Fu Yujie Lu William Yang Wang
UC Santa Barbara
{weixifeng, tsu-juifu, yujielu, william}@cs.ucsb.edu
Abstract
Vision-and-Language Navigation (VLN) is a task in which an embodied agent is guided to a target position using language instructions. Despite significant performance improvements, the wide use of fine-grained instructions fails to characterize the more practical linguistic variations found in reality. To fill this gap, we introduce a new setting, namely Underspecified vision-and-Language Navigation (ULN), and associated evaluation datasets. ULN evaluates agents using multi-level underspecified instructions instead of purely fine-grained or coarse-grained ones, which is a more realistic and general setting. As a primary step toward ULN, we propose a VLN framework that consists of a classification module, a navigation agent, and an Exploitation-to-Exploration (E2E) module. Specifically, we propose to learn Granularity Specific Sub-networks (GSS) for the agent to ground multi-level instructions with minimal additional parameters. Our E2E module then estimates grounding uncertainty and conducts multi-step lookahead exploration to further improve the success rate. Experimental results show that existing VLN models are still brittle to multi-level language underspecification. Our framework is more robust and outperforms the baselines on ULN by 10% relative success rate across all levels.¹
1 Introduction
Vision-and-Language Navigation (VLN) allows a human user to command or instruct an embodied agent to reach target locations using verbal instructions. For this application to step beyond curated datasets into real-world settings, agents must generalize to the many linguistic variations of human instructions. Despite significant progress in VLN datasets (Anderson et al., 2018b; Chen et al., 2019; Ku et al., 2020; Shridhar et al., 2020) and agent design (Fried et al., 2018; Li et al., 2021; Min et al., 2021), it remains an open question whether existing models are general and robust enough to deal with all kinds of language variations.

¹Our code and data are available at https://github.com/weixi-feng/ULN.

Figure 1: Navigation results of a baseline, (a) VLN⟳BERT (left), and (b) our VLN framework (right), with multi-level underspecified instructions (L0-L3). Trajectories are curved for demonstration. Note that the baseline stops early and fails to reach the target position with L1-L3. Our agent manages to reach the goal across all levels.
For the language input in an indoor environment, some datasets focus on long and detailed instructions with route descriptions at every step to achieve fine-grained language grounding (Anderson et al., 2018b; Ku et al., 2020) or long-horizon navigation (Jain et al., 2019; Zhu et al., 2020a). For instance, from Room-to-Room (R2R) (Anderson et al., 2018b) to Room-Across-Room (RxR) (Ku et al., 2020), the average instruction length increases from 29 to 129 words. Other datasets have coarse-grained instructions, such as REVERIE (Qi et al., 2020) or SOON (Zhu et al., 2021a). Agents are trained and evaluated on a single granularity or one type of expression.
In contrast, we propose to evaluate VLN agents on multi-level granularity to better understand the behavior of embodied agents with respect to language variations. Our motivation is that users are inclined to give shorter instructions instead of detailed route descriptions because 1) users are not omniscient observers who follow the route and describe it step by step for the agent; 2) shorter instructions are more practical, reproducible, and efficient from a user's perspective; and 3) users tend to underspecify commands in familiar environments such as personal households. Therefore, we propose a new setting, namely Underspecified vision-and-Language Navigation (ULN), and an associated evaluation dataset built on top of R2R, namely R2R-ULN, to address these issues. R2R-ULN contains underspecified instructions in which route descriptions are successively removed from the original instructions. Each long R2R instruction corresponds to three shortened and rephrased instructions at different levels, which preserves partial alignment but also introduces variation.
As shown in Fig. 1, the goal of ULN is to facilitate the development of a generalized VLN design that achieves balanced performance across all granularity levels. As a primary step toward ULN, we propose a modular VLN framework that consists of an instruction classification module, a navigation agent, and an Exploitation-to-Exploration (E2E) module. The classification module first classifies the input instruction as high-level or low-level in granularity so that our agent can encode the two types accordingly. As for the agent, we propose to learn Granularity Specific Sub-networks (GSS) to handle both levels with minimal additional parameters. A sub-network, e.g., the text encoder, is trained for each level while the other parameters are shared. Finally, the E2E module estimates the step-wise language grounding uncertainty and conducts multi-step lookahead exploration to rectify wrong decisions that originate from underspecified language.

Our VLN framework is model-agnostic and can be applied to many previous agents that follow an "encode-then-fuse" mechanism for language and visual inputs. We build our framework on two state-of-the-art (SOTA) VLN agents to demonstrate its effectiveness. We conduct extensive experiments to analyze the generalization of existing agents and of our framework on ULN and on the original datasets with fine-grained instructions. Our contribution is three-fold:
• We propose a novel setting named Underspecified vision-and-Language Navigation (ULN) to account for multi-level language variations in instructions. We collect a large-scale evaluation dataset, R2R-ULN, which consists of 9k validation and 4k testing instructions.

• We propose a VLN framework that consists of Granularity Specific Sub-networks (GSS) and an E2E module, enabling navigation agents to handle both low-level and high-level instructions.

• Experiments show that achieving consistent performance across multi-level underspecification can be much more challenging for existing VLN agents. Furthermore, our VLN framework improves the success rate by 10% relative to the baselines and mitigates the performance gap across all levels.
2 Related Work
Language Variations for Multimodal Learning
Natural language input has been an essential component of modern multimodal learning tasks, combined with other modalities such as vision (Antol et al., 2015; Johnson et al., 2017), speech (Alayrac et al., 2020), or gestures (Chen et al., 2021b). The effect of language variations has been studied in many vision-and-language (V&L) tasks (Bisk et al., 2016; Agrawal et al., 2018; Cirik et al., 2018; Zhu et al., 2020b; Lin et al., 2021). For instance, referring expression datasets (Kazemzadeh et al., 2014; Yu et al., 2016; Mao et al., 2016) contain multiple expressions for the same referred object. Ref-Adv (Akula et al., 2020) studies the robustness of referring expression models by switching word orders. In Visual Question Answering (VQA), Shah et al. (2019) find that VQA models are brittle to rephrased questions with the same meaning. As for VLN, we characterize the linguistic and compositional variations introduced by rephrasing and dropping sub-instructions from a full instruction with complete route descriptions. We also define three different levels to formalize underspecification for navigation instructions.
VLN Datasets
VLN has gained much attention (Gu et al., 2022) with the emergence of various simulation environments and datasets (Chang et al., 2017; Kolve et al., 2017; Jain et al., 2019; Nguyen and Daumé III, 2019; Koh et al., 2021). R2R (Anderson et al., 2018a) and RxR (Ku et al., 2020) provide fine-grained instructions that guide the agent in a step-wise manner. FG-R2R (Hong et al., 2020a) and Landmark-RxR (He et al., 2021) segment the instructions into action units and explicitly ground sub-instructions on visual observations. In contrast, REVERIE (Qi et al., 2020) and SOON (Zhu et al., 2021a) propose to use referring expressions with no guidance on the intermediate steps that lead to the final destination. Compared to these datasets, ULN aims to build an agent that can generalize to multi-level granularity after training once, which is more practical for real-world applications.
Embodied Navigation Agents
Learning to ground instructions on visual observations is a major challenge for an agent generalizing to unseen environments (Wang et al., 2019; Deng et al., 2020; Fu et al., 2020). Previous studies demonstrate significant improvements from data augmentation (Fried et al., 2018; Tan et al., 2019; Zhu et al., 2021b; Fang et al., 2022; Li et al., 2022), pre-training tasks (Hao et al., 2020; Chen et al., 2021a; Qiao et al., 2022), and decoding algorithms (Ma et al., 2019a; Ke et al., 2019; Ma et al., 2019b; Chen et al., 2022). Among exploration-based methods, FAST (Ke et al., 2019) proposes a search algorithm that allows the agent to backtrack to the most promising trajectory. SSM (Wang et al., 2021) memorizes local and global action spaces and estimates multiple scores for candidate nodes on the frontier of trajectories. Compared to E2E, Active VLN (Wang et al., 2020) is the most relevant work, in which an additional policy is learned for multi-step exploration. However, their reward function is defined based on distances to target locations, whereas our uncertainty estimation is based on step-wise grounding mismatch. Our E2E module is also more efficient, with fewer parameters and lower training complexity.
3 Underspecification in VLN
Our dataset construction is three-fold: We first obtain underspecified instructions by asking workers to simplify and rephrase the R2R instructions. Then, we validate that the goals are still reachable with underspecified instructions. Finally, we verify that instructions from R2R-ULN are preferred over R2R ones from a user's perspective, which supports the necessity of the ULN setting. We briefly describe the definitions and our ULN dataset in this section, with more details in Appendix A.

Level  Instruction
L0     Turn around and go down the stairs. At the bottom turn slightly right and enter the room with the TV on the wall and a green table. Walk to the right past the TV. Stop at the door to the right facing into the bathroom. (from R2R)
L1     Take the stairs to the bottom and enter the room with the TV on the wall and a green table. Walk past the TV. Stop at the door to facing into the bathroom. (Redundancy Removed)
L2     Take the stairs to the bottom and enter the room a green table. Walk past the TV. Stop at the bathroom door. (Partial Route Description)
L3     Go to the door of the bathroom next to the room with a green table. (No Route Description)

Table 1: Instruction examples from the R2R-ULN validation set. We mark removed words in red and rephrased words in blue in the next level.
3.1 Instruction Simplification
We formalize the instruction collection as a sentence simplification task and ask human annotators to remove details from the instructions progressively. Denoting the original R2R instructions as Level 0 (L0), annotators rewrite each L0 into three different levels of underspecified instructions. We discover that some components in L0 can be redundant or rarely used in indoor environments, such as "turn 45 degrees". Therefore, to obtain Level 1 (L1) from each L0 instruction, annotators rewrite L0 by removing any redundant part while keeping most of the route description unchanged. Redundant components include, but are not limited to, repetition, excessive details, and directional phrases (see Table 1). As for Level 2 (L2), annotators remove one or two sub-instructions from L1, creating a scenario where users omit some details in commonplace environments. We collect Level 3 (L3) instructions by giving destination information, such as the region label and floor level, and asking annotators to write one sentence directly referring to the object or location of the destination point.
3.2 Instruction Verification
R2R-ULN Val-Unseen
Level  Instr. Following   Instr. Preference
       SR↑   SPL↑         Practicality  Efficiency
L0     86    72           -     -
L1     82    68           55%   57%
L2     82    65           63%   59%
L3     75    58           68%   66%

Table 2: Human performance on R2R-ULN validation unseen in terms of Success Rate (SR) and SR weighted by Path Length (SPL), and human preference assessment results. The percentages denote the ratio of participants selecting Li over L0.

To ensure that the underspecified instructions provide a feasible performance upper bound for VLN agents, we have another group of annotators navigate in an interactive interface from R2R (Anderson et al., 2018b). As shown in Table 2, annotators achieve a slightly degraded but still promising success rate (SR) with L3. SPL is a metric that normalizes SR by the path length. Therefore, the trade-off for maintaining a high SR is taking more exploration steps, which results in a much lower SPL value. We also verify that $L_i$, $i \in \{1,2,3\}$, are more practical and efficient choices than L0. Table 2 shows that people prefer underspecified instructions over full instructions in both aspects, with an increasing trend as $i$ increases to 3.
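Since Table 2 reports both SR and SPL, a minimal reference implementation of the SPL computation may help; the sketch below follows the standard definition (success weighted by the ratio of shortest-path length to actual path length) and is not the evaluation script used for the paper's numbers.

```python
def spl(successes, shortest_dists, path_lengths):
    """Success weighted by Path Length.

    successes      : list of 0/1 episode outcomes
    shortest_dists : shortest-path distance from start to goal per episode
    path_lengths   : length of the path the agent (or annotator) actually took
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_dists, path_lengths):
        total += s * l / max(p, l)  # longer paths are penalized, success required
    return total / len(successes)

# Example: one successful but inefficient episode and one failure.
print(spl([1, 0], [10.0, 8.0], [14.0, 9.0]))  # -> ~0.357
```

This makes the trade-off in Table 2 explicit: an annotator can keep SR high under L3 by exploring more, but every extra step inflates the path length and lowers SPL.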
4 Method
4.1 Overview
In this section, we present our VLN framework for handling multi-level underspecified language inputs, which mainly consists of three modules (see Figure 2). Given a natural language instruction as a sequence of tokens, $W = (w_1, \ldots, w_n)$, the classification module first categorizes the language input as a low-level (L0, L1, L2) or high-level (L3) instruction. To handle these two types accordingly, GSS learns a sub-network, e.g., the text encoder, for each type while the other parameters are shared. At each step $t$, we denote the visual observation as $O_t = ([v_1; a_1], \ldots, [v_N; a_N])$ with visual feature $v_i$ and angle feature $a_i$ of the $i$-th view among all $N$ views. The history is a concatenation of all observations before step $t$: $H_t = (O_1, \ldots, O_{t-1})$. Given $W$, $H_t$, and $O_t$, the GSS-based agent predicts an action $a_t$ by choosing a navigable viewpoint from $O_t$. To overcome the reference misalignment issue, the E2E module predicts a sequence of uncertainty scores $S = (s_1, \ldots, s_T)$ and conducts multi-step lookahead exploration to collect future visual information.
Figure 2: Our VLN framework with the instruction classification module, the navigation agent, and the E2E module (uncertainty estimation, multi-step lookahead, and decision making).
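To make the overview concrete, the following Python sketch shows one way the three modules could be wired together at inference time. The module interfaces (classifier, agent.encode_text, agent.step, e2e.uncertainty, e2e.lookahead), the uncertainty threshold, and the step limit are illustrative assumptions rather than the authors' released implementation.

```python
def navigate(instruction_tokens, env, classifier, agent, e2e,
             max_steps=15, uncertainty_threshold=0.5):
    """Run one episode with a ULN-style framework (illustrative sketch).

    classifier: maps token ids -> granularity logits (0 = low-level, 1 = high-level)
    agent:      GSS-based navigator with encode_text(tokens, granularity) and
                step(text_emb, history, obs) -> (action_logits, state)
    e2e:        Exploitation-to-Exploration module with uncertainty(state) -> float
                and lookahead(env, candidate_actions) -> action
    """
    granularity = classifier(instruction_tokens).argmax(-1).item()
    text_emb = agent.encode_text(instruction_tokens, granularity)  # GSS text encoder

    history, obs = [], env.reset()
    for _ in range(max_steps):
        action_logits, state = agent.step(text_emb, history, obs)
        if e2e.uncertainty(state) > uncertainty_threshold:
            # Exploration: grounding looks unreliable, so look several steps
            # ahead along the top candidate actions before committing.
            action = e2e.lookahead(env, action_logits.topk(2).indices)
        else:
            # Exploitation: greedily take the most confident action.
            action = action_logits.argmax(-1)
        history.append(obs)
        obs, done = env.step(action)
        if done:
            break
    return obs
```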
4.2 Instruction Classification
VLN agents can operate in two different modes, fidelity-oriented or goal-oriented, depending on the reward function (Jain et al., 2019) or the text input (Zhu et al., 2022). Agents trained on low-level granularity suffer performance degradation when applied to high-level instructions, and vice versa. As shown in Figure 2, we therefore propose to first classify the text input into the two granularities and then encode it accordingly in the downstream modules. Our classification module contains an embedding layer, average pooling, and a fully-connected layer that outputs binary class predictions.
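As an illustration of this design, the PyTorch sketch below stacks an embedding layer, average pooling over non-padding tokens, and a fully-connected output layer. The vocabulary size, embedding dimension, and padding convention are assumptions for the example, not values specified in the paper.

```python
import torch
import torch.nn as nn

class GranularityClassifier(nn.Module):
    """Binary low-level vs. high-level instruction classifier (sketch)."""

    def __init__(self, vocab_size=30522, embed_dim=256, pad_id=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_id)
        self.fc = nn.Linear(embed_dim, 2)  # 2 classes: low-level / high-level
        self.pad_id = pad_id

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        emb = self.embedding(token_ids)           # (batch, seq_len, embed_dim)
        mask = (token_ids != self.pad_id).unsqueeze(-1).float()
        # Average pooling over non-padding tokens only.
        pooled = (emb * mask).sum(1) / mask.sum(1).clamp(min=1.0)
        return self.fc(pooled)                    # (batch, 2) class logits

# Usage: classify a toy batch of token ids (indices here are arbitrary).
clf = GranularityClassifier()
logits = clf(torch.tensor([[101, 2175, 2091, 1996, 5108, 102, 0, 0]]))
granularity = logits.argmax(-1)  # 0 = low-level (L0-L2), 1 = high-level (L3)
```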
4.3 Navigation Agent
Base Agent
We summarize the high-level framework of many transformer-based agents (Hao et al., 2020; Guhur et al., 2021; Moudgil et al., 2021), parameterized as $\theta$, as shown in Figure 2. Given the history $H_t$, the instruction $W$, and the visual observation $O_t$, the agent first encodes each modality with encoders $f_{hist}$, $f_{text}$, $f_{img}$:

$$X = f_{text}(W), \quad H_t = f_{hist}(H_t), \quad O_t = f_{img}(O_t) \tag{1}$$
HAMT (Chen et al., 2021a) applies a ViT (Dosovitskiy et al., 2020) and a Panoramic Transformer to hierarchically encode $H_t$ as a sequence of embeddings $H_t = (h_1, \ldots, h_{t-1})$, while VLN⟳BERT (Hong et al., 2021) encodes $H_t$ as a single state vector $H_t = h_t$. The embeddings from each modality are then fed into an $L$-layer cross-modal transformer $f_{cm}$, where each layer $l$ first applies a cross-attention:

$$\alpha^{v \to t}_{t,l} = \frac{\left([H_{t,l}; O_{t,l}]\, W^{query}_l\right)\left(X_{t,l}\, W^{key}_l\right)^{T}}{\sqrt{d_h}} \tag{2}$$

where $\alpha^{v \to t}_{t,l}$ denotes the attention weights of the history-visual concatenation over the language embeddings, and $d_h$ is the hidden dimension. We omit
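To make Eq. (2) concrete, the following PyTorch sketch computes the attention of the concatenated history-visual sequence over the language embeddings. It uses a single head, random toy inputs, and an explicit softmax normalization (which Eq. (2) leaves implicit), so it illustrates the operation rather than reproducing the agents' actual cross-modal layer.

```python
import math
import torch
import torch.nn as nn

def cross_modal_attention_weights(hist_vis, text, w_query, w_key):
    """Attention of the [history; observation] sequence over language tokens,
    following the form of Eq. (2); single head, no masking, illustration only.

    hist_vis : (batch, n_hv, d_h)   concatenated history and visual embeddings
    text     : (batch, n_txt, d_h)  language embeddings
    """
    d_h = hist_vis.size(-1)
    queries = w_query(hist_vis)                           # (batch, n_hv, d_h)
    keys = w_key(text)                                    # (batch, n_txt, d_h)
    scores = queries @ keys.transpose(-2, -1) / math.sqrt(d_h)
    return scores.softmax(dim=-1)                         # (batch, n_hv, n_txt)

# Toy usage with random inputs and illustrative dimensions.
d_h = 768
w_query, w_key = nn.Linear(d_h, d_h), nn.Linear(d_h, d_h)
hist_vis = torch.randn(1, 37, d_h)   # e.g., history state + 36 panoramic views
text = torch.randn(1, 60, d_h)       # encoded instruction tokens
alpha = cross_modal_attention_weights(hist_vis, text, w_query, w_key)
```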