Instruction-Following Agents with Multimodal Transformer
Hao Liu 1,2  Lisa Lee 2  Kimin Lee 2  Pieter Abbeel 1
1 University of California, Berkeley   2 Google Research
hao.liu@cs.berkeley.edu
Abstract
Humans are excellent at understanding language and vision to accomplish a wide range of tasks. In contrast, creating general instruction-following embodied agents remains a difficult challenge. Prior work that uses pure language-only models lacks visual grounding, making it difficult to connect language instructions with visual observations. On the other hand, methods that use pre-trained multimodal models typically come with divided language and visual representations, requiring specialized network architectures to fuse them together. We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments. Our InstructRL method consists of a multimodal transformer that encodes visual observations and language instructions, and a transformer-based policy that predicts actions based on the encoded representations. The multimodal transformer is pre-trained on millions of image-text pairs and natural language text, thereby producing generic cross-modal representations of observations and instructions. The transformer-based policy keeps track of the full history of observations and actions, and predicts actions autoregressively. Despite its simplicity, we show that this unified transformer model outperforms all state-of-the-art pre-trained or trained-from-scratch methods in both single-task and multi-task settings. Our model also shows better model scalability and generalization ability than prior work.
1. Introduction
Humans are able to understand language and vision to accomplish a wide range of tasks. Many tasks, from driving to whiteboard discussion to cooking, require both language understanding and visual perception. Humans can also generalize to new tasks by building upon knowledge acquired from previously seen tasks. Meanwhile, creating generic instruction-following agents that can generalize to multiple tasks and environments is one of the central challenges of reinforcement learning (RL) and robotics.
Driven by significant advances in learning generic pre-trained models for language understanding (Devlin et al., 2018; Brown et al., 2020; Chowdhery et al., 2022), recent work has made great progress towards building instruction-following agents (Lynch & Sermanet, 2020; Mandlekar et al., 2021; Ahn et al., 2022; Jang et al., 2022; Guhur et al., 2022; Shridhar et al., 2022b). For example, SayCan (Ahn et al., 2022) exploits PaLM models (Chowdhery et al., 2022) to generate language descriptions of step-by-step plans from language instructions, then executes the plans by mapping the steps to predefined macro actions. HiveFormer (Guhur et al., 2022) uses a pre-trained language encoder to generalize to multiple manipulation tasks. However, a remaining challenge is that pure language-only pre-trained models are disconnected from visual representations, making it difficult to differentiate vision-related semantics such as colors. Therefore, visual semantics have to be further learned to connect language instructions and visual inputs.
Another category of methods uses pre-trained multimodal models, which have shown great success in joint visual and language understanding (Radford et al., 2021). This has led to tremendous progress towards creating a general RL agent (Zeng et al., 2022; Khandelwal et al., 2022; Nair et al., 2022b; Shridhar et al., 2022a). For example, CLIPort (Shridhar et al., 2022a) uses the CLIP (Radford et al., 2021) vision encoder and language encoder to solve manipulation tasks. However, a drawback is that such models come with limited language understanding compared to pure language-only pre-trained models like BERT (Devlin et al., 2018), lacking the ability to follow long and detailed instructions. In addition, the representations of visual input and textual input are often disjointly learned, so such methods typically require designing specialized network architectures on top of the pre-trained models to fuse them together.
To address the above challenges, we introduce InstructRL,
a simple yet effective method based on the multimodal
transformer (Vaswani et al., 2017; Tsai et al., 2019). It first encodes fine-grained cross-modal alignment between vision and language using a pre-trained multimodal transformer (Geng et al., 2022), which is a large masked autoencoding transformer (Vaswani et al., 2017; He et al., 2022) jointly trained on image-text (Changpinyo et al., 2021; Thomee et al., 2016) and text-only data (Devlin et al., 2018). The generic representations of each camera view and the instruction form a sequence, and are concatenated with the embeddings of proprioception data and actions. These tokens are fed into a multimodal transformer-based policy, which jointly models dependencies between the current and past observations, and cross-modal alignment between the instruction and views from multiple cameras. Based on the output representations from our multimodal transformer, we predict 7-DoF actions, i.e., the position, rotation, and state of the gripper.

Figure 1. Examples of RLBench tasks considered in this work. Left: InstructRL can perform multiple tasks from RLBench given language instructions, by leveraging the representations of a pre-trained multimodal transformer model and learning a transformer policy. Right: Each task can be composed of multiple variations that share the same skills but differ in objects. For example, in the block stacking task, InstructRL can generalize to varying colors and orderings of the blocks.
We evaluate InstructRL on RLBench (James et al., 2020), measuring capabilities for single-task learning, multi-task learning, multi-variation generalization, long-instruction following, and model scalability. On all 74 tasks, which belong to 9 categories (see Figure 1 for example tasks), our InstructRL significantly outperforms state-of-the-art models (Shridhar et al., 2022a; Guhur et al., 2022; Liu et al., 2022), demonstrating the effectiveness of combining pre-trained language and vision representations. Moreover, InstructRL not only excels at following basic language instructions, but is also able to benefit from long and detailed human-written language instructions. We also demonstrate that InstructRL generalizes to new instructions that represent variations of the task unseen during training, and shows excellent model scalability, with performance continuing to increase with larger model size.
2. Problem Definition
We consider the problem of robotic manipulation from visual observations and natural language instructions (see Figure 1 and Table 6 in Appendix for examples). We assume the agent receives a natural language instruction $x := \{x_1, \ldots, x_n\}$ consisting of $n$ text tokens. At each timestep $t$, the agent receives a visual observation $o_t \in \mathcal{O}$ and takes an action $a_t \in \mathcal{A}$ in order to solve the task specified by the instruction.
We parameterize the policy $\pi\left(a_t \mid x, \{o_i\}_{i=1}^{t}, \{a_i\}_{i=1}^{t-1}\right)$ as a transformer model, which is conditioned on the instruction $x$, observations $\{o_i\}_{i=1}^{t}$, and previous actions $\{a_i\}_{i=1}^{t-1}$. For robotic control, we use macro steps (James & Davison, 2022), which are key turning points in the action trajectory where the gripper changes its state (open/close) or the joint velocities are set to near zero. Following James & Davison (2022), we employ an inverse-kinematics based controller to find a trajectory between macro-steps. In this way, the sequence length of an episode is significantly reduced from hundreds of small steps to typically less than 10 macro steps.
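To make the macro-step extraction concrete, the following is a minimal sketch of one way to detect such keyframes along the lines of James & Davison (2022); the array layout, function name, and velocity threshold are our own illustrative assumptions rather than the exact RLBench implementation.

```python
import numpy as np

def extract_macro_steps(gripper_open, joint_velocities, vel_eps=1e-3):
    """Return candidate macro-step (keyframe) indices for one demonstration.

    A timestep is flagged when the gripper changes state (open <-> close) or
    when all joint velocities are close to zero. `gripper_open` is a (T,)
    array of 0/1 flags and `joint_velocities` is a (T, J) array.
    """
    keyframes = []
    was_stopped = False
    for t in range(1, len(gripper_open)):
        gripper_changed = gripper_open[t] != gripper_open[t - 1]
        arm_stopped = np.allclose(joint_velocities[t], 0.0, atol=vel_eps)
        # Only the first frame of a stationary stretch counts, so each pause
        # contributes a single keyframe.
        if gripper_changed or (arm_stopped and not was_stopped):
            keyframes.append(t)
        was_stopped = arm_stopped
    return keyframes
```

Given such indices, the demonstration is subsampled to the flagged timesteps, and the inverse-kinematics controller connects consecutive macro-steps at execution time.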
Observation space: Each observation $o_t$ consists of images $\{c^k_t\}_{k=1}^{K}$ taken from $K$ different camera viewpoints, as well as proprioception data $o^P_t \in \mathbb{R}^4$. Each image $c^k_t$ is an RGB image of size $256 \times 256 \times 3$. We use $K = 3$ camera viewpoints located on the robot's wrist, left shoulder, and right shoulder. The proprioception data $o^P_t$ consists of 4 scalar values: gripper open, left finger joint position, right finger joint position, and timestep of the action sequence.
Note that we do not use point cloud data, so that our method can be more flexibly applied to other domains. Since RLBench consists of sparse-reward and challenging tasks, using point cloud data can benefit performance (James & Davison, 2022; Guhur et al., 2022), but we leave this as future work.

Figure 2. Different frameworks for leveraging pre-trained representations for instruction-following agents. In prior work, additional training from scratch is needed to combine the representations of text and image from (I) a pre-trained vision model, (II) a pre-trained language model, or (III) both a pre-trained language model and a pre-trained vision model. In contrast, InstructRL extracts generic representations from (IV) a simple and unified multimodal model pretrained on aligned image-text and unaligned text data.
Action space: Following the standard setup in RLBench (James & Davison, 2022), each action $a_t := (p_t, q_t, g_t)$ consists of the desired gripper position $p_t = (x_t, y_t, z_t)$ in Cartesian coordinates, the quaternion $q_t = (q^0_t, q^1_t, q^2_t, q^3_t)$ relative to the base frame, and the gripper state $g_t$ indicating whether the gripper is open or closed. An object is grasped when it is located between the gripper's two fingers and the gripper is closing its grasp. The execution of an action is achieved by a motion planner in RLBench.
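As a concrete reference for the observation and action spaces above, here is a minimal sketch of how they could be represented in code; the class and field names are ours, not part of RLBench or the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    """One timestep o_t: K = 3 RGB camera views plus 4-dim proprioception."""
    rgb: np.ndarray      # shape (3, 256, 256, 3): wrist, left-shoulder, right-shoulder cameras
    proprio: np.ndarray  # shape (4,): gripper open, left/right finger joint positions, timestep

@dataclass
class Action:
    """One macro-step action a_t = (p_t, q_t, g_t)."""
    position: np.ndarray    # shape (3,): desired gripper position (x, y, z) in Cartesian coordinates
    quaternion: np.ndarray  # shape (4,): rotation (q0, q1, q2, q3) relative to the base frame
    gripper_open: float     # 1.0 if the gripper should be open, 0.0 if closed

    def to_vector(self) -> np.ndarray:
        """Flatten to a single vector, e.g. as a convenient regression target."""
        return np.concatenate([self.position, self.quaternion, [self.gripper_open]])
```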
3. InstructRL
We propose a unified architecture for robotic tasks called InstructRL, which is shown in Figure 3. It consists of two modules: a pre-trained multimodal masked autoencoder (He et al., 2022; Geng et al., 2022) to encode instructions and visual observations, and a transformer-based (Vaswani et al., 2017) policy that predicts actions. First, the feature encoding module (Sec. 3.1) generates token embeddings for language instructions $\{x_j\}_{j=1}^{n}$, observations $\{o_i\}_{i=1}^{t}$, and previous actions $\{a_i\}_{i=1}^{t-1}$. Then, given the token embeddings, the multimodal transformer policy (Sec. 3.2) learns relationships between the instruction, image observations, and the history of past observations and actions, in order to predict the next action $a_t$.
3.1. Multimodal Representation
We encode the instruction and visual observations using a pre-trained multimodal transformer encoder, as shown in Figure 3. Specifically, we use the encoder of a pre-trained multimodal masked autoencoder (M3AE) (Geng et al., 2022), a large transformer-based architecture built on ViT (Dosovitskiy et al., 2020) and BERT (Devlin et al., 2018) that learns a unified encoder for both vision and language data via masked token prediction. It is trained on a large-scale image-text dataset (CC12M (Changpinyo et al., 2021)) and a text-only corpus (Devlin et al., 2018), and learns generalizable representations that transfer well to downstream tasks.
Encoding Instructions and Observations. Following the practice of M3AE, we first tokenize the language instructions $\{x_j\}_{j=1}^{n}$ into embedding vectors and then apply 1D positional encodings. We denote the resulting language embeddings as $E_x \in \mathbb{R}^{n \times d_e}$, where $n$ is the number of language tokens and $d_e$ is the embedding dimension.
We divide each image observation in $\{c^k_t\}_{k=1}^{K}$ into image patches, and use a linear projection to convert them to image embeddings that have the same dimension as the language embeddings. Then, we apply 2D positional encodings. Each image is represented as $E_c \in \mathbb{R}^{l_c \times d_e}$, where $l_c$ is the number of image patch tokens and $d_e$ is the embedding dimension. The image embeddings and text embeddings are then concatenated along the sequence dimension: $E = \mathrm{concat}(E_c, E_x) \in \mathbb{R}^{(l_c + n) \times d_e}$. The combined language and image embeddings are then processed by a series of transformer blocks to obtain the final representation $\hat{o}^k_t \in \mathbb{R}^{(l_c + n) \times d_e}$. Following the practice of ViT and M3AE, we also apply average pooling over the sequence length dimension of $\hat{o}^k_t$ to get $o^k_t \in \mathbb{R}^{d_e}$ as the final representation of the $k$-th camera image $c^k_t$ and the instruction. We use multi-scale features $h^k_t \in \mathbb{R}^{d}$, which are a concatenation of all intermediate layer representations, where the feature dimension $d = L \cdot d_e$ equals the number of intermediate layers $L$ times the embedding dimension $d_e$. Finally, we obtain the representations over all $K$ camera viewpoints $h_t = \{h^1_t, \cdots, h^K_t\} \in \mathbb{R}^{K \times d}$ as the representation of the vision-language input.

Figure 3. InstructRL is composed of a multimodal transformer and a transformer-based policy. First, the instruction (text) and multi-view image observations are jointly encoded using the pre-trained multimodal transformer. Next, the sequence of representations and a history of actions are encoded by the transformer-based policy to predict the next action.

Figure 4. The architecture of the transformer policy. The model is conditioned on a history of language-vision representations and actions to predict the next action.
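To illustrate the multi-scale feature construction described in Sec. 3.1, here is a minimal sketch in PyTorch. The `backbone` is assumed to be an M3AE-style encoder that returns its intermediate hidden states for a concatenated image-patch/text-token sequence; that interface, and all names below, are our assumptions rather than the actual M3AE API.

```python
import torch
import torch.nn as nn

class MultiScaleObsEncoder(nn.Module):
    """Pool each intermediate layer and concatenate them into h_t^k."""

    def __init__(self, backbone: nn.Module, num_layers: int):
        super().__init__()
        self.backbone = backbone      # assumed to return a list of L tensors of shape (B, l_c + n, d_e)
        self.num_layers = num_layers  # number of intermediate layers L to concatenate

    def forward(self, image_text_embeddings: torch.Tensor) -> torch.Tensor:
        # hidden_states[i]: (B, l_c + n, d_e) for intermediate layer i
        hidden_states = self.backbone(image_text_embeddings)[: self.num_layers]
        # Average-pool over the sequence dimension, following ViT/M3AE practice.
        pooled = [h.mean(dim=1) for h in hidden_states]   # each (B, d_e)
        # Multi-scale feature h_t^k in R^{L * d_e}.
        return torch.cat(pooled, dim=-1)
```

In this sketch the encoder is applied once per camera view $c^k_t$ (with the same instruction tokens appended), and the $K$ resulting vectors form $h_t \in \mathbb{R}^{K \times d}$.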
Encoding Proprioceptions and Actions. The proprioception data $o^P_t \in \mathbb{R}^4$ is encoded with a linear layer that upsamples the input dimension to $d$ (i.e., each scalar in $o^P_t$ is mapped to $\mathbb{R}^{d}$), yielding a representation $z_t = \{z^1_t, \cdots, z^4_t\} \in \mathbb{R}^{4 \times d}$ with one embedding per state in $o^P_t$. Similarly, the action is projected to feature space $f_t \in \mathbb{R}^{d}$.
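A minimal sketch of this per-scalar and per-action projection is shown below, assuming the action is given as a flat vector of dimension `action_dim`; the module and argument names are ours.

```python
import torch
import torch.nn as nn

class ProprioActionEmbedder(nn.Module):
    """Project proprioception scalars and actions into the policy feature space R^d."""

    def __init__(self, d: int, action_dim: int):
        super().__init__()
        self.scalar_proj = nn.Linear(1, d)           # maps each proprio scalar to R^d
        self.action_proj = nn.Linear(action_dim, d)  # maps a flat action vector to R^d

    def forward(self, proprio: torch.Tensor, action: torch.Tensor):
        # proprio: (B, 4) -> z_t: (B, 4, d); action: (B, action_dim) -> f_t: (B, d)
        z_t = self.scalar_proj(proprio.unsqueeze(-1))
        f_t = self.action_proj(action)
        return z_t, f_t
```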
3.2. Transformer-based Policy
We consider a context-conditional policy, which takes all encoded instructions, observations, and actions as input, i.e., $\{(h_i, z_i)\}_{i=1}^{t}$ and $\{f_i\}_{i=1}^{t-1}$. By default, we use a context length of 4 throughout the paper (i.e., $4(K + 5)$ embeddings are processed by the transformer policy). This enables learning relationships among views from multiple cameras, between the current observations and instructions, and between the current observations and the history, for action prediction. The architecture of the transformer policy is illustrated in Figure 4.
We pass the output embeddings of the transformer into a feature map to predict the next action $a_t = [p_t; q_t; g_t]$. We use behavioral cloning to train the models. In RLBench, we generate $\mathcal{D}$, a collection of $N$ successful demonstrations for each task. Each demonstration $\delta \in \mathcal{D}$ is composed of a sequence of (maximum) $T$ macro-steps with observations $\{o_i\}_{i=1}^{T}$, actions $\{a^{*}_i\}_{i=1}^{T}$, and instructions $\{x_l\}_{l=1}^{n}$. We minimize a loss function $\mathcal{L}$ over a batch of demonstrations $B = \{\delta_j\}_{j=1}^{|B|} \subseteq \mathcal{D}$. The loss function is the mean-square error (MSE) on the gripper's action:

$$\mathcal{L} = \frac{1}{|B|} \sum_{\delta \in B} \sum_{t \leq T} \mathrm{MSE}\left(a_t, a^{*}_t\right). \tag{1}$$
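As an illustration of Eq. (1), the snippet below sketches one behavioral-cloning training step; the batch layout and field names are hypothetical, and the policy is assumed to output a flat action vector matching the demonstrated macro-step action $a^{*}_t$.

```python
import torch
import torch.nn.functional as F

def behavioral_cloning_loss(policy, batch):
    """Mean-squared error between predicted and demonstrated macro-step actions."""
    # batch["obs_features"]: encoded {(h_i, z_i)} for i <= t (context window)
    # batch["action_history"]: encoded {f_i} for i < t
    # batch["target_actions"]: ground-truth actions a_t^* from the demonstrations
    pred_actions = policy(batch["obs_features"], batch["action_history"])
    return F.mse_loss(pred_actions, batch["target_actions"])

# One optimization step under these assumptions:
# loss = behavioral_cloning_loss(policy, batch)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```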
4. Experimental Setup
To evaluate the effectiveness of our method, we run experiments on RLBench (James et al., 2020), a benchmark of robotic manipulation tasks (see Figure 1). We use the same setup as in Guhur et al. (2022), including the same set of 74 tasks with 100 demonstrations per task for training, and the