
our work learns real-time natural language policies end-to-end
from RGB pixels to continuous control outputs with a simple
behavioral cloning objective [13], and applies them to contact-rich
real-world manipulation tasks.
Scaling real-world imitation learning.
One of the largest
bottlenecks in robot imitation is often simply the amount of diverse
robot data made available to learning [9], [22], [23]. Many multi-task imitation learning frameworks determine the set of tasks to be
learned upfront [7], [9], [10], [12], [14]. While this may simplify
collection conceptually, it also often requires that reset protocols
and success criteria be designed manually for each behavior. Another challenge, particular to large-scale multi-operator collections,
is that typically not all data can be considered optimal [41], [42],
often requiring manual post-hoc success filtering [9], [10]. These
per-task manual efforts have historically been difficult to scale to a
large and diverse task setting, like the one studied in this work. We
sidestep both these scaling concerns by instead having operators continuously teleoperate long-horizon behaviors, with no requirements on low-level task segmentation or resets [11], [25], [43], and then leveraging after-the-fact crowdsourced language annotation [8], [11]. In contrast to the “random window” relabeling explored in [11], we give annotators precise control over the start and end of behaviors they are annotating, which we find in practice better aligns relabeled training data with the actual commands given at test time.
III. PROBLEM SETUP
Our goal is to train a conditional policy, $\pi_\theta(a \mid s, l)$, parameterized by $\theta$, which maps from observations $s \in \mathcal{S}$ and human-provided language $l \in \mathcal{L}$ to actions $a \in \mathcal{A}$ on a physical robot. In particular we are interested in open-vocabulary language-conditioned visuomotor policies, in which the observation space contains high-dimensional RGB images, e.g. $\mathcal{S} = \mathbb{R}^{H \times W \times C}$, and where language conditioning $\mathcal{L}$ has no predefined template, grammar, or vocabulary. We are also particularly interested in allowing humans to interject new language $l \in \mathcal{L}$ at any time, at the natural rate of the visuo-linguo-motor policy. Each commanded $l$ encodes a distribution of achievable goals $g_{\text{short}} \in \mathcal{G}_{\text{short}}$ in the environment. Note that humans may generate a new language instruction $l$ based on their own perception of the environment, $s_H \in \mathcal{S}_H$, which may differ substantially from the robot's $s \in \mathcal{S}$ (e.g. due to viewpoint, self-occlusion, limited observational memory, etc.). As in prior works [11], we treat natural-language-conditioned visuomotor skill learning as a contextual imitation learning problem [14]. As such, we acquire an offline dataset $\mathcal{D}$ containing pairs of valid demonstrations and the conditions they resolve, $\{(\tau, l)_i\}_{i=0}^{|\mathcal{D}|}$. Each $\tau_i$ is a variable-length trajectory of robot observations and actions, $\tau_i = [(s_0, a_0), (s_1, a_1), \ldots, (s_T)]$, and each $l_i$ describes the full trajectory as a second-person command.
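To make this setup concrete, the following minimal Python sketch shows one way the objects above could be represented; the names (Step, LabeledDemo, Dataset) and field layout are illustrative assumptions, not structures from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional
import numpy as np

@dataclass
class Step:
    observation: np.ndarray       # s_t: an RGB image in R^{H x W x C}
    action: Optional[np.ndarray]  # a_t: continuous control output; None for the final state s_T

@dataclass
class LabeledDemo:
    steps: List[Step]  # tau_i = [(s_0, a_0), (s_1, a_1), ..., (s_T)], variable length
    instruction: str   # l_i: a free-form, second-person natural language command

# The offline dataset D is simply a collection of (trajectory, instruction) pairs.
Dataset = List[LabeledDemo]
```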
IV. INTERACTIVE LANGUAGE: METHODS AND ANALYSIS
First we introduce Interactive Language, summarized in Figure 2, a simple and generically applicable imitation learning framework for training real-time natural-language-interactable robots. Interactive Language combines a scalable method for collecting varied, real-world language-conditioned demonstration datasets with straightforward language-conditioned behavioral cloning (LCBC).
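As a rough illustration of what LCBC amounts to in code, here is a minimal sketch that scores a policy against relabeled demonstrations; the mean-squared-error surrogate for $-\log \pi_\theta(a \mid s, l)$ and the policy/LabeledDemo interfaces are assumptions for illustration, not the paper's actual objective or architecture.

```python
from typing import Callable, List
import numpy as np

def lcbc_loss(policy: Callable[[np.ndarray, str], np.ndarray],
              demos: List["LabeledDemo"]) -> float:
    """Average imitation error over every (s_t, a_t, l) tuple in the relabeled dataset."""
    errors = []
    for demo in demos:
        for step in demo.steps:
            if step.action is None:  # the terminal state s_T carries no action to imitate
                continue
            predicted = policy(step.observation, demo.instruction)
            errors.append(float(np.mean((predicted - step.action) ** 2)))
    return float(np.mean(errors))
```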
                            Has contact    Object/location-directed    Compound
                                           instructions                instructions
Random window [8], [11]         86%                 47%                    16%
Event-selectable (ours)         91%                 83%                    <1%
Real test instructions          89%                 84%                    <1%

TABLE I: Which relabeling strategy aligns best with test-time language?
Real-World Data Collection
  Total robots                                                4
  Total teleoperators                                         10
  Total episodes                                              16.4k
  Average episode length (minutes)                            9.9
  Total hours of collect time                                 2.7k
Hindsight Relabeling
  Total crowdsourced annotators                               64
  Total relabeled demonstrations obtained                     299k
  Total unique relabeled instructions                         87k
  Average relabeled demonstration length (seconds)            5.8
  Total number of hours of relabeled demonstrations obtained  488
  Total instruction hours / Collect hours                     18.06%

TABLE II: Statistics: real-world collection and relabeling. This data snapshot
went into training and is a subset of the full Language-Table data.
A. Data Collection
High-throughput raw data collection. Interactive Language
adopts purposefully minimal collection assumptions to maximize
the flow of human demonstrated behavior to learning. Operators
teleoperate a variety of long-horizon behaviors constantly,
without low-level task definition, segmentation, or episodic resets.
This strategy shares assumptions with “play” collection [25],
but additionally guides collect towards temporally extended
low-entropy states like lines, shapes, and complex arrangements.
Each collect episode lasts $\sim$10 minutes before a break, and is guided by multiple randomly chosen long-horizon prompts $p \in \mathcal{P}$ (e.g. “make a square shape out of the blocks”), drawn from the set of target long-horizon goals, which teleoperators are free to follow or ignore. We do not assume all of the data collected for each prompt $p$ is optimal (each $p$ is discarded after collecting).
In practice, our collection includes many inevitable edge cases that might otherwise require data cleaning, e.g. solving for the wrong $p$ or knocking blocks off the table. We log all of these cases and incorporate them later on as training data. Concretely, this collect procedure yields a semi-structured, optimality-agnostic collection $\mathcal{D}_{\text{collect}} = \{\tau_i\}_{i=0}^{|\mathcal{D}_{\text{collect}}|}$. The purpose of $\mathcal{D}_{\text{collect}}$ is to provide a sufficiently diverse basis for crowdsourced hindsight language relabeling [8], [11], described next.
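For concreteness, a minimal sketch of such a reset-free, prompt-guided collect loop follows; the robot/teleop interfaces, the prompt list (beyond the quoted example), and the reuse of the Step structure from the earlier sketch are all illustrative assumptions rather than the authors' actual tooling.

```python
import random
import time
from typing import List

# Hypothetical long-horizon prompts p drawn from the set of target goals P.
LONG_HORIZON_PROMPTS = [
    "make a square shape out of the blocks",
    "sort the blocks by color",
    "put all the blocks in one vertical line",
]

def collect_episode(robot, teleop, episode_minutes: float = 10.0) -> List["Step"]:
    """Record one ~10 minute collect episode with no segmentation, resets, or success filtering."""
    prompt = random.choice(LONG_HORIZON_PROMPTS)  # teleoperators are free to follow or ignore it
    teleop.show_prompt(prompt)
    steps, deadline = [], time.time() + 60.0 * episode_minutes
    while time.time() < deadline:
        observation = robot.get_rgb_observation()  # assumed robot interface
        action = teleop.read_action()              # assumed teleoperation interface
        robot.apply_action(action)
        steps.append(Step(observation=observation, action=action))
    # Everything is logged, including wrong-prompt solutions and dropped blocks;
    # the prompt itself is discarded, matching the optimality-agnostic collection above.
    return steps
```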
Event-selectable hindsight relabeling. We convert $\mathcal{D}_{\text{collect}}$ into natural-language-conditioned demonstrations $\mathcal{D}_{\text{training}} = \{(\tau, l)_i\}_{i=0}^{|\mathcal{D}_{\text{training}}|}$, using a new variant of hindsight language relabeling [11] we call “Event-Selectable Hindsight Relabeling” (Fig. 2, left). Previous “random window” relabeling systems [8], [11] have at least two drawbacks: each random window is not guaranteed to contain “usefully describable” actions, and random window lengths must be determined upfront as a sensitive hyperparameter. We instead ask annotators to watch the full collect video, then find $K$ coherent behaviors ($K = 24$ in our case). Annotators have the ability to mark the start and end frame of each behavior, and are asked to phrase their text descriptions as natural language commands.
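A minimal sketch of how such annotator-marked events could be sliced into relabeled demonstrations is shown below, before returning to the comparison in Table I; the AnnotatedEvent structure and relabel_episode helper are illustrative, and reuse the Step/LabeledDemo sketches from earlier.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AnnotatedEvent:
    start_frame: int   # start of one coherent behavior within the collect episode
    end_frame: int     # end of that behavior (inclusive)
    instruction: str   # annotator's caption, phrased as a natural language command

def relabel_episode(episode: List["Step"],
                    events: List[AnnotatedEvent]) -> List["LabeledDemo"]:
    """Slice annotator-selected windows out of one collect episode into (tau, l) pairs."""
    demos = []
    for event in events:
        window = episode[event.start_frame : event.end_frame + 1]
        demos.append(LabeledDemo(steps=window, instruction=event.instruction))
    return demos
```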
commands. In Table I, we compare event-selectable relabeling to