nodes and dialogs. LED-Bert is an adaptation of ViLBERT (Lu et al., 2019) for the LED task, and we show that it outperforms all prior baselines. A key challenge is the small size of the WAY dataset (approximately 6K episodes), which makes it difficult to train transformer-based models given their reliance on large-scale training data. We address this challenge by developing a pretraining approach, based on Majumdar et al. (2020), that yields an effective visiolinguistic representation.
Contributions: To summarize:
1. We demonstrate an LED approach using navigation graphs to represent the environment.
2. We present LED-Bert, a visiolinguistic transformer model that scores alignment between graph nodes and dialogs, and we develop an effective pretraining strategy that leverages large-scale disembodied web data and similar embodied datasets to pretrain LED-Bert.
3. We show that LED-Bert outperforms all baselines, increasing accuracy at 0m by 8.21 absolute percentage points on the test split.
2 Related Work
BERT
Bidirectional Encoder Representations from Transformers (BERT) is a transformer-based encoder used for language modeling. BERT is trained on massive amounts of unlabeled text data, and takes as input sentences of tokenized words along with a positional embedding per token. BERT is trained with the masked language modeling and next sentence prediction objectives. In the masked language modeling objective, 15% of the input tokens are replaced with a [MASK] token, and the model is trained to predict the true value of the masked tokens using the other tokens as context. In the next sentence prediction objective, the model is trained to predict whether the two input sentences follow each other or not. BERT is trained on Wikipedia and BooksCorpus (Zhu et al., 2015).
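A minimal sketch of the masked language modeling corruption step is shown below; the [MASK] id, the ignore label, and the helper function are illustrative placeholders, not BERT's reference implementation.

```python
import random

# Illustrative sketch of masked language modeling input corruption: 15% of
# input tokens are replaced with [MASK], and the model is trained to recover
# their original values using the surrounding tokens as context.
MASK_ID = 0          # placeholder id for the [MASK] token
IGNORE = -100        # label value for positions the loss should skip

def mask_tokens(token_ids, mask_prob=0.15):
    corrupted, labels = list(token_ids), [IGNORE] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok         # target: the original token at this position
            corrupted[i] = MASK_ID  # input: the [MASK] token
    return corrupted, labels
```

(In the actual BERT recipe, a fraction of the selected positions are kept unchanged or replaced with a random token rather than [MASK]; the sketch follows the simplified description above.)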
ViLBERT
ViLBERT (Lu et al., 2019) is a multi-modal transformer that extends the BERT architecture (Devlin et al., 2018) to learn joint visiolinguistic representations; similar multi-modal transformer models exist (Li et al., 2020, 2019; Su et al., 2020; Tan and Bansal, 2019; Zhou et al., 2020). ViLBERT consists of two transformer encoder streams, one for visual inputs and one for textual inputs, both built on the standard BERT-BASE (Devlin et al., 2018) backbone. The input to the text stream is a sequence of text tokens, identical to BERT. The input to the visual stream is a sequence of image regions generated by an object detector pretrained on Visual Genome (Krishna et al., 2017). The visual and textual tokens are not concatenated; each enters only its respective stream. The two streams then interact through co-attention layers, implemented by swapping the key and value matrices between the visual and textual encoder streams at certain layers. Co-attention layers attend to one modality conditioned on the other, allowing attention over image regions given the corresponding text input and vice versa.
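A minimal sketch of this co-attention pattern follows, in which each stream queries with its own tokens while drawing keys and values from the other stream; the hidden size, head count, and use of PyTorch's generic multi-head attention are our own illustrative choices rather than ViLBERT's reference implementation.

```python
import torch
import torch.nn as nn

# Sketch of a co-attention layer in the spirit of ViLBERT: each stream attends
# with its own queries while taking keys and values from the other stream.
class CoAttentionLayer(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.vis_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attends_vis = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis_tokens, txt_tokens):
        # Visual stream: queries from image regions, keys/values from the text.
        vis_out, _ = self.vis_attends_txt(vis_tokens, txt_tokens, txt_tokens)
        # Textual stream: queries from text, keys/values from the image regions.
        txt_out, _ = self.txt_attends_vis(txt_tokens, vis_tokens, vis_tokens)
        return vis_out, txt_out

# Example with a batch of 36 image-region features and 20 text tokens.
vis, txt = torch.randn(2, 36, 768), torch.randn(2, 20, 768)
vis_out, txt_out = CoAttentionLayer()(vis, txt)
```

In ViLBERT these co-attention blocks are interleaved with standard self-attention and feed-forward layers within each stream; the sketch isolates only the cross-stream attention step.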
Vision-and-Language Pre-training
Prior work
has experimented with utilizing dual-stream trans-
former based models that have been pretrained with
self-supervised objectives and transferring them to
downstream multi-modal tasks with large success.
This has been seen for tasks such as Visual Ques-
tion Answering (Antol et al.,2015), Commonsense
Reasoning (Zellers et al.,2019), Natural Language
Visual Reasoning (Suhr et al.,2018), Image-Text
Retrieval (Lee et al.,2018), Visual-Dialog (Mura-
hari et al.,2020) and Vision Language Navigation
(Majumdar et al.,2020). Specifically VLN-Bert
and VisDial + BERT adapt the ViLBERT architec-
ture and utilize a pretraining scheme which inspired
our approach to train LED-Bert.
3 Approach
3.1 Environment Representation
A key challenge in the LED task is that environments often have multiple rooms with numerous similar attributes, e.g., multiple bedrooms with the same furniture. A successful model must therefore visually ground fine-grained attributes. Models must also generalize to unseen test environments. The LED baseline of Hahn et al. (2020) approaches localization as a language-conditioned pixel-to-pixel prediction task, producing a probability distribution over positions in a top-down view of the environment, as illustrated in Part A of Figure 3 in the Supplementary. This choice is justified by the fact that it mirrors the observations the human Locator had access to during data collection, allowing for a straightforward comparison. However, it does not address the question of which representation is optimal for localization.
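As a sketch of this pixel-wise formulation, the snippet below shows only the output stage; the language-conditioned network that produces the logits is omitted, and the map resolution and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

# Sketch of pixel-wise localization: a (omitted) language-conditioned network
# maps the top-down view and dialog to one logit per pixel, and a softmax over
# all pixels yields a probability distribution over possible locations.
H, W = 128, 128                                    # placeholder map resolution
logits = torch.randn(1, H, W)                      # stand-in for network output
probs = F.softmax(logits.view(1, -1), dim=-1).view(1, H, W)
pred_y, pred_x = divmod(probs.view(-1).argmax().item(), W)  # most likely pixel
```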