Instruction-Following Agents with Multimodal Transformer
Hao Liu 1,2  Lisa Lee 2  Kimin Lee 2  Pieter Abbeel 1
1 University of California, Berkeley   2 Google Research
hao.liu@cs.berkeley.edu
Abstract
Humans are excellent at understanding language and vision to accomplish a wide range of tasks. In contrast, creating general instruction-following embodied agents remains a difficult challenge. Prior work that uses pure language-only models lacks visual grounding, making it difficult to connect language instructions with visual observations. On the other hand, methods that use pre-trained multimodal models typically come with divided language and visual representations, requiring specialized network architectures to fuse them together. We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments. Our InstructRL method consists of a multimodal transformer that encodes visual observations and language instructions, and a transformer-based policy that predicts actions based on the encoded representations. The multimodal transformer is pre-trained on millions of image-text pairs and natural language text, thereby producing generic cross-modal representations of observations and instructions. The transformer-based policy keeps track of the full history of observations and actions, and predicts actions autoregressively. Despite its simplicity, we show that this unified transformer model outperforms all state-of-the-art pre-trained or trained-from-scratch methods in both single-task and multi-task settings. Our model also shows better model scalability and generalization ability than prior work.
1. Introduction
Humans are able to understand language and vision to accomplish a wide range of tasks. Many tasks, from driving to whiteboard discussion to cooking, require both language understanding and visual perception. Humans can also generalize to new tasks by building upon knowledge acquired from previously seen tasks. Meanwhile, creating generic instruction-following agents that can generalize to multiple tasks and environments is one of the central challenges of reinforcement learning (RL) and robotics.
Driven by significant advances in learning generic pre-trained models for language understanding (Devlin et al., 2018; Brown et al., 2020; Chowdhery et al., 2022), recent work has made great progress towards building instruction-following agents (Lynch & Sermanet, 2020; Mandlekar et al., 2021; Ahn et al., 2022; Jang et al., 2022; Guhur et al., 2022; Shridhar et al., 2022b). For example, SayCan (Ahn et al., 2022) exploits PaLM models (Chowdhery et al., 2022) to generate language descriptions of step-by-step plans from language instructions, then executes the plans by mapping the steps to predefined macro actions. HiveFormer (Guhur et al., 2022) uses a pre-trained language encoder to generalize to multiple manipulation tasks. However, a remaining challenge is that pure language-only pre-trained models are disconnected from visual representations, making it difficult to differentiate vision-related semantics such as colors. Therefore, visual semantics have to be further learned to connect language instructions and visual inputs.
Another category of methods uses pre-trained multimodal models, which have shown great success in joint visual and language understanding (Radford et al., 2021). This has led to tremendous progress towards creating a general RL agent (Zeng et al., 2022; Khandelwal et al., 2022; Nair et al., 2022b; Shridhar et al., 2022a). For example, CLIPort (Shridhar et al., 2022a) uses the CLIP (Radford et al., 2021) vision encoder and language encoder to solve manipulation tasks. However, a drawback is that such models come with limited language understanding compared to pure language-only pre-trained models like BERT (Devlin et al., 2018), lacking the ability to follow long and detailed instructions. In addition, the representations of visual input and textual input are often disjointly learned, so such methods typically require designing specialized network architectures on top of the pre-trained models to fuse them together.
To address the above challenges, we introduce InstructRL,
a simple yet effective method based on the multimodal
transformer (Vaswani et al., 2017; Tsai et al., 2019). It first encodes fine-grained cross-modal alignment between vision and language using a pre-trained multimodal transformer (Geng et al., 2022), which is a large masked autoencoding transformer (Vaswani et al., 2017; He et al., 2022) jointly trained on image-text (Changpinyo et al., 2021; Thomee et al., 2016) and text-only data (Devlin et al., 2018). The generic representations of each camera view and the instruction form a sequence, and are concatenated with the embeddings of proprioception data and actions. These tokens are fed into a multimodal transformer-based policy, which jointly models dependencies between the current and past observations, and cross-modal alignment between the instruction and views from multiple cameras. Based on the output representations from our multimodal transformer, we predict 7-DoF actions, i.e., the position, rotation, and state of the gripper.

Figure 1. Examples of RLBench tasks considered in this work. Left: InstructRL can perform multiple tasks from RLBench given language instructions, by leveraging the representations of a pre-trained multimodal transformer model and learning a transformer policy. Right: Each task can be composed of multiple variations that share the same skills but differ in objects. For example, in the block stacking task, InstructRL can generalize to varying colors and orderings of the blocks.
We evaluate InstructRL on RLBench (James et al., 2020), measuring capabilities for single-task learning, multi-task learning, multi-variation generalization, long-instruction following, and model scalability. On all 74 tasks, which belong to 9 categories (see Figure 1 for example tasks), our InstructRL significantly outperforms state-of-the-art models (Shridhar et al., 2022a; Guhur et al., 2022; Liu et al., 2022), demonstrating the effectiveness of combining pre-trained language and vision representations. Moreover, InstructRL not only excels at following basic language instructions, but is also able to benefit from long and detailed human-written language instructions. We also demonstrate that InstructRL generalizes to new instructions that represent variations of the task unseen during training, and shows excellent model scalability, with performance continuing to increase with larger model size.
2. Problem Definition
We consider the problem of robotic manipulation from visual observations and natural language instructions (see Figure 1 and Table 6 in Appendix for examples). We assume the agent receives a natural language instruction $x := \{x_1, \ldots, x_n\}$ consisting of $n$ text tokens. At each timestep $t$, the agent receives a visual observation $o_t \in \mathcal{O}$ and takes an action $a_t \in \mathcal{A}$ in order to solve the task specified by the instruction.
We parameterize the policy $\pi\left(a_t \mid x, \{o_i\}_{i=1}^{t}, \{a_i\}_{i=1}^{t-1}\right)$ as a transformer model, which is conditioned on the instruction $x$, observations $\{o_i\}_{i=1}^{t}$, and previous actions $\{a_i\}_{i=1}^{t-1}$. For robotic control, we use macro steps (James & Davison, 2022), which are key turning points in the action trajectory where the gripper changes its state (open/close) or the joint velocities are set to near zero. Following James & Davison (2022), we employ an inverse-kinematics based controller to find a trajectory between macro-steps. In this way, the sequence length of an episode is significantly reduced from hundreds of small steps to typically less than 10 macro steps.
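To make the macro-step extraction concrete, the following is a minimal sketch of one way to detect such keyframes along the lines of James & Davison (2022); the array layout, function name, and velocity threshold are our own illustrative assumptions rather than the exact RLBench implementation.

```python
import numpy as np

def extract_macro_steps(gripper_open, joint_velocities, vel_eps=1e-3):
    """Return candidate macro-step (keyframe) indices for one demonstration.

    A timestep is flagged when the gripper changes state (open <-> close) or
    when all joint velocities are close to zero. `gripper_open` is a (T,)
    array of 0/1 flags and `joint_velocities` is a (T, J) array.
    """
    keyframes = []
    was_stopped = False
    for t in range(1, len(gripper_open)):
        gripper_changed = gripper_open[t] != gripper_open[t - 1]
        arm_stopped = np.allclose(joint_velocities[t], 0.0, atol=vel_eps)
        # Only the first frame of a stationary stretch counts, so each pause
        # contributes a single keyframe.
        if gripper_changed or (arm_stopped and not was_stopped):
            keyframes.append(t)
        was_stopped = arm_stopped
    return keyframes
```

Given such indices, the demonstration is subsampled to the flagged timesteps, and the inverse-kinematics controller connects consecutive macro-steps at execution time.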
Observation space: Each observation $o_t$ consists of images $\{c^k_t\}_{k=1}^{K}$ taken from $K$ different camera viewpoints, as well as proprioception data $o^P_t \in \mathbb{R}^4$. Each image $c^k_t$ is an RGB image of size $256 \times 256 \times 3$. We use $K = 3$ camera viewpoints located on the robot's wrist, left shoulder, and right shoulder. The proprioception data $o^P_t$ consists of 4 scalar values: gripper open, left finger joint position, right finger joint position, and timestep of the action sequence.
Note that we do not use point cloud data, so that our method can be more flexibly applied to other domains. Since RLBench consists of sparse-reward and challenging tasks, using point cloud data can benefit performance (James & Davison, 2022; Guhur et al., 2022), but we leave this as future work.

Figure 2. Different frameworks for leveraging pre-trained representations for instruction-following agents. In prior work, additional training from scratch is needed to combine the representations of text and image from (I) a pre-trained vision model, (II) a pre-trained language model, or (III) both a pre-trained language model and a pre-trained vision model. In contrast, InstructRL extracts generic representations from (IV) a simple and unified multimodal model pretrained on aligned image-text and unaligned text data.
Action space: Following the standard setup in RLBench (James & Davison, 2022), each action $a_t := (p_t, q_t, g_t)$ consists of the desired gripper position $p_t = (x_t, y_t, z_t)$ in Cartesian coordinates, the quaternion $q_t = (q^0_t, q^1_t, q^2_t, q^3_t)$ relative to the base frame, and the gripper state $g_t$ indicating whether the gripper is open or closed. An object is grasped when it is located between the gripper's two fingers and the gripper is closing its grasp. The execution of an action is achieved by a motion planner in RLBench.
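As a concrete reference for the observation and action spaces above, here is a minimal sketch of how they could be represented in code; the class and field names are ours, not part of RLBench or the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    """One timestep o_t: K = 3 RGB camera views plus 4-dim proprioception."""
    rgb: np.ndarray      # shape (3, 256, 256, 3): wrist, left-shoulder, right-shoulder cameras
    proprio: np.ndarray  # shape (4,): gripper open, left/right finger joint positions, timestep

@dataclass
class Action:
    """One macro-step action a_t = (p_t, q_t, g_t)."""
    position: np.ndarray    # shape (3,): desired gripper position (x, y, z) in Cartesian coordinates
    quaternion: np.ndarray  # shape (4,): rotation (q0, q1, q2, q3) relative to the base frame
    gripper_open: float     # 1.0 if the gripper should be open, 0.0 if closed

    def to_vector(self) -> np.ndarray:
        """Flatten to a single vector, e.g. as a convenient regression target."""
        return np.concatenate([self.position, self.quaternion, [self.gripper_open]])
```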
3. InstructRL
We propose a unified architecture for robotic tasks called InstructRL, which is shown in Figure 3. It consists of two modules: a pre-trained multimodal masked autoencoder (He et al., 2022; Geng et al., 2022) to encode instructions and visual observations, and a transformer-based (Vaswani et al., 2017) policy that predicts actions. First, the feature encoding module (Sec. 3.1) generates token embeddings for language instructions $\{x_j\}_{j=1}^{n}$, observations $\{o_i\}_{i=1}^{t}$, and previous actions $\{a_i\}_{i=1}^{t-1}$. Then, given the token embeddings, the multimodal transformer policy (Sec. 3.2) learns relationships between the instruction, image observations, and the history of past observations and actions, in order to predict the next action $a_t$.
3.1. Multimodal Representation
We encode the instruction and visual observations using a pre-trained multimodal transformer encoder, as shown in Figure 3. Specifically, we use the encoder of a pre-trained multimodal masked autoencoder (M3AE) (Geng et al., 2022), a large transformer-based architecture built on ViT (Dosovitskiy et al., 2020) and BERT (Devlin et al., 2018) that learns a unified encoder for both vision and language data via masked token prediction. It is trained on a large-scale image-text dataset (CC12M (Changpinyo et al., 2021)) and a text-only corpus (Devlin et al., 2018), and learns generalizable representations that transfer well to downstream tasks.
Encoding Instructions and Observations. Following the practice of M3AE, we first tokenize the language instructions $\{x_j\}_{j=1}^{n}$ into embedding vectors and then apply 1D positional encodings. We denote the resulting language embeddings as $E_x \in \mathbb{R}^{n \times d_e}$, where $n$ is the number of language tokens and $d_e$ is the embedding dimension.
We divide each image observation in $\{c^k_t\}_{k=1}^{K}$ into image patches, and use a linear projection to convert them to image embeddings that have the same dimension as the language embeddings. Then, we apply 2D positional encodings. Each image is represented as $E_c \in \mathbb{R}^{l_c \times d_e}$, where $l_c$ is the number of image patch tokens and $d_e$ is the embedding dimension. The image embeddings and text embeddings are then concatenated along the sequence dimension: $E = \mathrm{concat}(E_c, E_x) \in \mathbb{R}^{(l_c + n) \times d_e}$. The combined language and image embeddings are then processed by a series of transformer blocks to obtain the final representation $\hat{o}^k_t \in \mathbb{R}^{(l_c + n) \times d_e}$. Following the practice of ViT and M3AE, we also apply average pooling over the sequence length dimension of $\hat{o}^k_t$ to get $o^k_t \in \mathbb{R}^{d_e}$ as the final representation of the $k$-th camera image $c^k_t$ and the instruction. We use multi-scale features $h^k_t \in \mathbb{R}^{d}$, which are a concatenation of all intermediate layer representations, where the feature dimension $d = L \cdot d_e$ equals the number of intermediate layers $L$ times the embedding dimension $d_e$. Finally, we obtain the representations over all $K$ camera viewpoints $h_t = \{h^1_t, \cdots, h^K_t\} \in \mathbb{R}^{K \times d}$ as the representation of the vision-language input.

Figure 3. InstructRL is composed of a multimodal transformer and a transformer-based policy. First, the instruction (text) and multi-view image observations are jointly encoded using the pre-trained multimodal transformer. Next, the sequence of representations and a history of actions are encoded by the transformer-based policy to predict the next action.

Figure 4. The architecture of the transformer policy. The model is conditioned on a history of language-vision representations and actions to predict the next action.
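To illustrate the multi-scale feature construction described in Sec. 3.1, here is a minimal sketch in PyTorch. The `backbone` is assumed to be an M3AE-style encoder that returns its intermediate hidden states for a concatenated image-patch/text-token sequence; that interface, and all names below, are our assumptions rather than the actual M3AE API.

```python
import torch
import torch.nn as nn

class MultiScaleObsEncoder(nn.Module):
    """Pool each intermediate layer and concatenate them into h_t^k."""

    def __init__(self, backbone: nn.Module, num_layers: int):
        super().__init__()
        self.backbone = backbone      # assumed to return a list of L tensors of shape (B, l_c + n, d_e)
        self.num_layers = num_layers  # number of intermediate layers L to concatenate

    def forward(self, image_text_embeddings: torch.Tensor) -> torch.Tensor:
        # hidden_states[i]: (B, l_c + n, d_e) for intermediate layer i
        hidden_states = self.backbone(image_text_embeddings)[: self.num_layers]
        # Average-pool over the sequence dimension, following ViT/M3AE practice.
        pooled = [h.mean(dim=1) for h in hidden_states]   # each (B, d_e)
        # Multi-scale feature h_t^k in R^{L * d_e}.
        return torch.cat(pooled, dim=-1)
```

In this sketch the encoder is applied once per camera view $c^k_t$ (with the same instruction tokens appended), and the $K$ resulting vectors form $h_t \in \mathbb{R}^{K \times d}$.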
Encoding Proprioceptions and Actions. The proprioception data $o^P_t \in \mathbb{R}^4$ is encoded with a linear layer that upsamples the input dimension to $d$ (i.e., each scalar in $o^P_t$ is mapped to $\mathbb{R}^{d}$), yielding a representation $z_t = \{z^1_t, \cdots, z^4_t\} \in \mathbb{R}^{4 \times d}$ with one embedding per state in $o^P_t$. Similarly, the action is projected to feature space $f_t \in \mathbb{R}^{d}$.
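A minimal sketch of this per-scalar and per-action projection is shown below, assuming the action is given as a flat vector of dimension `action_dim`; the module and argument names are ours.

```python
import torch
import torch.nn as nn

class ProprioActionEmbedder(nn.Module):
    """Project proprioception scalars and actions into the policy feature space R^d."""

    def __init__(self, d: int, action_dim: int):
        super().__init__()
        self.scalar_proj = nn.Linear(1, d)           # maps each proprio scalar to R^d
        self.action_proj = nn.Linear(action_dim, d)  # maps a flat action vector to R^d

    def forward(self, proprio: torch.Tensor, action: torch.Tensor):
        # proprio: (B, 4) -> z_t: (B, 4, d); action: (B, action_dim) -> f_t: (B, d)
        z_t = self.scalar_proj(proprio.unsqueeze(-1))
        f_t = self.action_proj(action)
        return z_t, f_t
```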
3.2. Transformer-based Policy
We consider a context-conditional policy, which takes all encoded instructions, observations, and actions as input, i.e., $\{(h_i, z_i)\}_{i=1}^{t}$ and $\{f_i\}_{i=1}^{t-1}$. By default, we use a context length of 4 throughout the paper (i.e., $4(K + 5)$ embeddings are processed by the transformer policy). This enables learning relationships among views from multiple cameras, between the current observations and instructions, and between the current observations and the history, for action prediction. The architecture of the transformer policy is illustrated in Figure 4.
We pass the output embeddings of the transformer into a feature map to predict the next action $a_t = [p_t; q_t; g_t]$. We use behavioral cloning to train the models. In RLBench, we generate $\mathcal{D}$, a collection of $N$ successful demonstrations for each task. Each demonstration $\delta \in \mathcal{D}$ is composed of a sequence of (maximum) $T$ macro-steps with observations $\{o_i\}_{i=1}^{T}$, actions $\{a^{*}_i\}_{i=1}^{T}$, and instructions $\{x_l\}_{l=1}^{n}$. We minimize a loss function $\mathcal{L}$ over a batch of demonstrations $B = \{\delta_j\}_{j=1}^{|B|} \subseteq \mathcal{D}$. The loss function is the mean-square error (MSE) on the gripper's action:

$$\mathcal{L} = \frac{1}{|B|} \sum_{\delta \in B} \sum_{t \leq T} \mathrm{MSE}\left(a_t, a^{*}_t\right). \tag{1}$$
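As an illustration of Eq. (1), the snippet below sketches one behavioral-cloning training step; the batch layout and field names are hypothetical, and the policy is assumed to output a flat action vector matching the demonstrated macro-step action $a^{*}_t$.

```python
import torch
import torch.nn.functional as F

def behavioral_cloning_loss(policy, batch):
    """Mean-squared error between predicted and demonstrated macro-step actions."""
    # batch["obs_features"]: encoded {(h_i, z_i)} for i <= t (context window)
    # batch["action_history"]: encoded {f_i} for i < t
    # batch["target_actions"]: ground-truth actions a_t^* from the demonstrations
    pred_actions = policy(batch["obs_features"], batch["action_history"])
    return F.mse_loss(pred_actions, batch["target_actions"])

# One optimization step under these assumptions:
# loss = behavioral_cloning_loss(policy, batch)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```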
4. Experimental Setup
To evaluate the effectiveness of our method, we run experiments on RLBench (James et al., 2020), a benchmark of robotic manipulation tasks (see Figure 1). We use the same setup as in Guhur et al. (2022), including the same set of 74 tasks with 100 demonstrations per task for training, and the