Transformer-based Localization from Embodied Dialog with Large-scale Pre-training

Meera Hahn
Google Research
meerahahn@google.com
(Work done in part at Georgia Institute of Technology.)

James M. Rehg
Georgia Institute of Technology
rehg@gatech.edu
Abstract
We address the challenging task of Localization via Embodied Dialog (LED). Given a dialog between two agents, an Observer navigating through an unknown environment and a Locator who is attempting to identify the Observer's location, the goal is to predict the Observer's final location in a map. We develop a novel LED-Bert architecture and present an effective pretraining strategy. We show that a graph-based scene representation is more effective than the top-down 2D maps used in prior works. Our approach outperforms previous baselines.
1 Introduction
A key goal in AI is to develop embodied agents that can accurately perceive and navigate an environment as well as communicate about their surroundings in natural language. The recently introduced Where Are You? (WAY) dataset (Hahn et al., 2020) provides a setting for developing such a multi-modal and multi-agent paradigm. This dataset (collected via AMT) contains episodes of a localization scenario in which two agents communicate via turn-taking natural language dialog: an Observer agent moves through an unknown environment, while a Locator agent attempts to identify the Observer's location in a map.
The Observer produces descriptions such as 'I'm in a living room with a gray couch and blue armchairs. Behind me there is a door.' and can respond to instructions and questions provided by the Locator: 'If you walk straight past the seating area, do you see a bathroom on your right?' Via this dialog (and without access to the Observer's view of the scene), the Locator attempts to identify the Observer's location on a map (which is not available to the Observer). This is a complex task, in which successful localization requires accurate situational grounding and the production of relevant questions and instructions.
Figure 1: WAY Dataset Localization Scenario: The Locator has a map of the building and is trying to localize the Observer by asking questions and giving instructions. The Observer has a first-person view and may navigate while responding to the Locator. The turn-taking dialog ends when the Locator predicts the Observer's position.
One of the benchmark tasks supported by WAY is 'Localization via Embodied Dialog (LED)'. In this task a model takes the dialog and a representation of the map as inputs, and must output a prediction of the final location of the Observer agent. The model's performance is measured by the distance between the predicted location of the Observer and its true location. LED is a first step towards developing a Locator agent. One challenge of the task is to identify an effective map representation. The LED baseline from Hahn et al. (2020) uses 2D images of top-down (bird's-eye view) floor maps to represent the environment and an (x, y) location for the Observer.
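To make the evaluation concrete, the sketch below computes a localization error and an accuracy-at-threshold figure of the kind reported for this task. It is an illustration only, not the WAY evaluation code: the straight-line distance, the toy coordinates, and the thresholds are assumptions made for the example.

```python
import math

def localization_error(pred_xy, true_xy):
    """Straight-line distance (in meters) between predicted and true 2D positions.

    Illustrative only: the benchmark reports error in meters, but the exact
    distance computation follows the dataset's own evaluation code.
    """
    dx = pred_xy[0] - true_xy[0]
    dy = pred_xy[1] - true_xy[1]
    return math.hypot(dx, dy)

def accuracy_at_threshold(errors, threshold_m):
    """Fraction of episodes whose localization error is within `threshold_m` meters."""
    return sum(e <= threshold_m for e in errors) / len(errors)

# Example usage with made-up predictions on three episodes.
errors = [localization_error(p, t) for p, t in [((1.0, 2.0), (1.0, 2.0)),
                                                ((0.0, 0.0), (3.0, 4.0)),
                                                ((2.0, 2.0), (2.0, 5.0))]]
print(accuracy_at_threshold(errors, 0.0))  # "accuracy at 0m" style metric
print(accuracy_at_threshold(errors, 3.0))
```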
This paper provides a new solution to the LED task with two key components. First, we propose to model the environment using the first-person view (FPV) panoramic navigation graph from Matterport (Anderson et al., 2018a), as an alternative to top-down maps. Second, we introduce a novel visiolinguistic transformer model, LED-Bert, which scores the alignment between navigation graph
nodes and dialogs. LED-Bert is an adaptation of ViLBERT (Lu et al., 2019) for the LED task, and we show that it outperforms all prior baselines. A key challenge is the small size of the WAY dataset (approximately 6K episodes), which makes it difficult to use transformer-based models given their reliance on large-scale training data. We address this challenge by developing a pretraining approach, based on Majumdar et al. (2020), that yields an effective visiolinguistic representation.
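Viewed abstractly, localization over a navigation graph reduces to scoring each panorama node against the dialog and returning the best-scoring node. The sketch below illustrates only that formulation; it is not the LED-Bert implementation, and `score_fn`, `toy_score`, and the node features are hypothetical placeholders.

```python
from typing import Callable, Dict, Sequence

def localize_on_graph(
    dialog: str,
    node_features: Dict[str, Sequence[float]],
    score_fn: Callable[[str, Sequence[float]], float],
) -> str:
    """Return the navigation-graph node whose features best match the dialog.

    `score_fn` stands in for a learned alignment model such as LED-Bert; here
    it is simply any callable mapping (dialog, node features) -> scalar score.
    """
    scores = {node_id: score_fn(dialog, feats) for node_id, feats in node_features.items()}
    return max(scores, key=scores.get)

# Toy usage with a placeholder scoring function.
def toy_score(dialog: str, feats: Sequence[float]) -> float:
    return sum(feats)  # placeholder; a real model would fuse vision and language

nodes = {"pano_12": [0.1, 0.3], "pano_47": [0.9, 0.8]}
print(localize_on_graph("I am in a room with an eating area and white chairs", nodes, toy_score))
```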
Contributions: To summarize:
1. We demonstrate an LED approach using navigation graphs to represent the environment.
2. We present LED-Bert, a visiolinguistic transformer model which scores alignment between graph nodes and dialogs. We develop an effective pretraining strategy that leverages large-scale disembodied web data and similar embodied datasets to pretrain LED-Bert.
3. We show that LED-Bert outperforms all baselines, increasing accuracy at 0m by 8.21 absolute percentage points on the test split.
2 Related Work
BERT
Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) is a transformer-based encoder used for language modeling. BERT is trained on massive amounts of unlabeled text data, and takes as input sentences of tokenized words together with a positional embedding per token. BERT is trained with the masked language modeling and next sentence prediction objectives. In the masked language modeling scheme, 15% of the input tokens are replaced with a [MASK] token; the model is then trained to predict the true value of the masked tokens using the other tokens as context. In the next sentence prediction scheme, the model is trained to predict whether or not the two input sentences follow each other. BERT is trained on Wikipedia and BooksCorpus (Zhu et al., 2015).
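As a concrete illustration of the masked language modeling objective described above, the snippet below applies the 15% masking rate to a toy token sequence. It is a simplified sketch: full BERT also replaces some selected tokens with random words or keeps them unchanged (the 80/10/10 rule), and operates on WordPiece tokens rather than whitespace-split words.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly replace ~15% of tokens with [MASK] and record the prediction targets.

    Simplified sketch: real BERT does not always substitute [MASK] for a
    selected token (see the 80/10/10 rule in Devlin et al., 2018).
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # the model must recover this original token
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

sentence = "the observer is standing in a living room with a gray couch".split()
print(mask_tokens(sentence))
```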
ViLBERT
ViLBERT (Lu et al., 2019) is a multi-modal transformer that extends the BERT architecture (Devlin et al., 2018) to learn joint visiolinguistic representations. Similar multi-modal transformer models exist (Li et al., 2020, 2019; Su et al., 2020; Tan and Bansal, 2019; Zhou et al., 2020). ViLBERT consists of two transformer encoding streams, one for visual inputs and one for text inputs, both built on the standard BERT-BASE (Devlin et al., 2018) backbone. The input tokens for the text stream are text tokens, identical to BERT. The input tokens for the visual stream are a sequence of image regions generated by an object detector pretrained on Visual Genome (Krishna et al., 2017). The input to ViLBERT is thus a sequence of visual and textual tokens which are not concatenated and only enter their respective streams. The two streams then interact through co-attention layers, implemented by swapping the key and value matrices between the visual and textual encoder streams in certain layers. Co-attention layers attend to one modality conditioned on the other, allowing attention over image regions given the corresponding text input and vice versa.
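The following sketch shows one way such a co-attention layer can be written, assuming PyTorch: each stream's queries attend over the other stream's keys and values. It is a minimal illustration of the key/value swap, not ViLBERT's actual implementation (feed-forward sublayers, attention masks, and the exact layer placement are omitted).

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Minimal two-stream co-attention layer in the spirit of ViLBERT.

    Each stream attends over the other modality: queries come from the stream
    itself, while keys and values come from the other stream. Sketch only;
    hyperparameters are placeholders.
    """

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.txt_attends_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_txt = nn.LayerNorm(dim)
        self.norm_img = nn.LayerNorm(dim)

    def forward(self, txt: torch.Tensor, img: torch.Tensor):
        # Text queries attend to visual keys/values, and vice versa.
        txt_out, _ = self.txt_attends_img(query=txt, key=img, value=img)
        img_out, _ = self.img_attends_txt(query=img, key=txt, value=txt)
        return self.norm_txt(txt + txt_out), self.norm_img(img + img_out)

# Toy usage: a batch of 2 dialogs (20 text tokens) and 2 panoramas (36 image regions).
txt = torch.randn(2, 20, 768)
img = torch.randn(2, 36, 768)
txt2, img2 = CoAttentionBlock()(txt, img)
print(txt2.shape, img2.shape)
```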
Vision-and-Language Pre-training
Prior work has shown, with large success, that dual-stream transformer models pretrained with self-supervised objectives can be transferred to downstream multi-modal tasks. This has been demonstrated for tasks such as Visual Question Answering (Antol et al., 2015), Commonsense Reasoning (Zellers et al., 2019), Natural Language Visual Reasoning (Suhr et al., 2018), Image-Text Retrieval (Lee et al., 2018), Visual Dialog (Murahari et al., 2020) and Vision-and-Language Navigation (Majumdar et al., 2020). In particular, VLN-Bert and VisDial + BERT adapt the ViLBERT architecture and use a pretraining scheme which inspired our approach to training LED-Bert.
3 Approach
3.1 Environment Representation
A key challenge in the LED task is that environments often have multiple rooms with numerous similar attributes, e.g. multiple bedrooms with the same furniture. A successful model must therefore be able to visually ground fine-grained attributes. The model must also generalize to unseen test environments. The LED baseline in Hahn et al. (2020) approaches localization as a language-conditioned pixel-to-pixel prediction task, producing a probability distribution over positions in a top-down view of the environment, as illustrated in Part A of Figure 3 in the Supplementary. This choice is justified by the fact that it mirrors the observations that the human Locator had access to during data collection, allowing for a straightforward comparison. However, it does not address the question of what representation is optimal for localization.
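For contrast, the snippet below sketches that pixel-prediction formulation: a score map over the top-down floor plan is normalized into a probability distribution and reduced to a single (x, y) prediction. The random score map stands in for the output of a language-conditioned model; this illustrates the formulation only and is not the baseline's code.

```python
import numpy as np

def predict_location(score_map: np.ndarray) -> tuple:
    """Turn a score map over a top-down floor plan into an (x, y) prediction.

    Illustrative only: `score_map` stands in for the output of a
    language-conditioned pixel-to-pixel model; we softmax the scores into a
    distribution over map pixels and take the most likely one.
    """
    flat = score_map.reshape(-1)
    probs = np.exp(flat - flat.max())
    probs /= probs.sum()                      # probability distribution over pixels
    idx = int(probs.argmax())
    y, x = np.unravel_index(idx, score_map.shape)
    return int(x), int(y)

# Toy usage on a random 64x64 "floor map" of scores.
rng = np.random.default_rng(0)
print(predict_location(rng.standard_normal((64, 64))))
```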