
Instruction-Following Agents with Multimodal Transformer
Hao Liu 1,2   Lisa Lee 2   Kimin Lee 2   Pieter Abbeel 1
1 University of California, Berkeley   2 Google Research
hao.liu@cs.berkeley.edu
Abstract
Humans are excellent at understanding language and vision to accomplish a wide range of tasks. In contrast, creating general instruction-following embodied agents remains a difficult challenge. Prior work that uses pure language-only models lacks visual grounding, making it difficult to connect language instructions with visual observations. On the other hand, methods that use pre-trained multimodal models typically come with divided language and visual representations, requiring specialized network architectures to fuse them together. We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments. Our InstructRL method consists of a multimodal transformer that encodes visual observations and language instructions, and a transformer-based policy that predicts actions based on the encoded representations. The multimodal transformer is pre-trained on millions of image-text pairs and natural language text, thereby producing generic cross-modal representations of observations and instructions. The transformer-based policy keeps track of the full history of observations and actions, and predicts actions autoregressively. Despite its simplicity, we show that this unified transformer model outperforms all state-of-the-art pre-trained or trained-from-scratch methods in both single-task and multi-task settings. Our model also shows better scalability and generalization than prior work.
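As a concrete illustration of the architecture described above, the following minimal PyTorch sketch pairs a multimodal transformer encoder over image patches and instruction tokens with a causal transformer policy over the history of encoded observations and past actions. This is a sketch under simplifying assumptions, not the authors' implementation: the module names (MultimodalEncoder, HistoryPolicy), the toy dimensions, the discrete action head, and the randomly initialized encoder (in place of a pre-trained multimodal model) are all illustrative.

import torch
import torch.nn as nn


class MultimodalEncoder(nn.Module):
    """Jointly encodes image patches and instruction tokens with one transformer."""

    def __init__(self, d_model=256, n_heads=4, n_layers=2,
                 vocab_size=1000, patch_dim=3 * 16 * 16):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, d_model)      # image patches -> tokens
        self.text_embed = nn.Embedding(vocab_size, d_model)   # instruction tokens
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, patches, instruction_ids):
        # patches: (B, P, patch_dim); instruction_ids: (B, T)
        tokens = torch.cat(
            [self.patch_embed(patches), self.text_embed(instruction_ids)], dim=1)
        joint = self.encoder(tokens)   # cross-modal attention over both modalities
        return joint.mean(dim=1)       # pooled representation, shape (B, d_model)


class HistoryPolicy(nn.Module):
    """Causal transformer over the history of encoded observations and actions."""

    def __init__(self, d_model=256, n_heads=4, n_layers=2, n_actions=8):
        super().__init__()
        self.action_embed = nn.Embedding(n_actions, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_actions)

    def forward(self, obs_history, action_history):
        # obs_history: (B, H, d_model); action_history: (B, H) of past action ids
        seq = obs_history + self.action_embed(action_history)
        H = seq.size(1)
        # Causal mask so each timestep attends only to itself and earlier timesteps.
        causal_mask = torch.triu(torch.full((H, H), float("-inf")), diagonal=1)
        hidden = self.transformer(seq, mask=causal_mask)
        return self.head(hidden[:, -1])   # logits for the next action


# Toy usage with random data: 16x16 RGB patches, a 6-token instruction,
# and a history of 3 timesteps (the same observation repeated for simplicity).
encoder, policy = MultimodalEncoder(), HistoryPolicy()
patches = torch.randn(1, 4, 3 * 16 * 16)
instruction = torch.randint(0, 1000, (1, 6))
obs = encoder(patches, instruction).unsqueeze(1).repeat(1, 3, 1)
past_actions = torch.randint(0, 8, (1, 3))
print(policy(obs, past_actions).shape)  # torch.Size([1, 8])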
1. Introduction
Humans are able to understand language and vision to accomplish a wide range of tasks. Many tasks require language understanding and visual perception, from driving to whiteboard discussion and cooking. Humans can also generalize to new tasks by building upon knowledge acquired from previously seen tasks.
Meanwhile, creating generic instruction-following agents that can generalize to multiple tasks and environments is one of the central challenges of reinforcement learning (RL) and robotics.
Driven by significant advances in learning generic pre-trained models for language understanding (Devlin et al., 2018; Brown et al., 2020; Chowdhery et al., 2022), recent work has made great progress towards building instruction-following agents (Lynch & Sermanet, 2020; Mandlekar et al., 2021; Ahn et al., 2022; Jang et al., 2022; Guhur et al., 2022; Shridhar et al., 2022b). For example, SayCan (Ahn et al., 2022) exploits PaLM models (Chowdhery et al., 2022) to generate language descriptions of step-by-step plans from language instructions, then executes the plans by mapping the steps to predefined macro actions. HiveFormer (Guhur et al., 2022) uses a pre-trained language encoder to generalize to multiple manipulation tasks. However, a remaining challenge is that pure language-only pre-trained models are disconnected from visual representations, making it difficult to differentiate vision-related semantics such as colors. Therefore, visual semantics have to be further learned to connect language instructions and visual inputs.
Another category of methods uses pre-trained multimodal models, which have shown great success in joint visual and language understanding (Radford et al., 2021), and has made tremendous progress towards creating general RL agents (Zeng et al., 2022; Khandelwal et al., 2022; Nair et al., 2022b; Shridhar et al., 2022a). For example, CLIPort (Shridhar et al., 2022a) uses the CLIP (Radford et al., 2021) vision and language encoders to solve manipulation tasks. However, a drawback is that such models come with limited language understanding compared to pure language-only pre-trained models like BERT (Devlin et al., 2018), lacking the ability to follow long and detailed instructions. In addition, the representations of visual input and textual input are often disjointly learned, so such methods typically require specialized network architectures on top of the pre-trained models to fuse them together.
To address the above challenges, we introduce InstructRL, a simple yet effective method based on the multimodal transformer (Vaswani et al., 2017; Tsai et al., 2019). It