Interactive Language: Talking to Robots in Real Time
Corey Lynch, Ayzaan Wahid, Jonathan Tompson
Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, Pete Florence
Robotics at Google
Abstract. We present a framework for building interactive, real-
time, natural language-instructable robots in the real world, and we
open source related assets (dataset, environment, benchmark, and
policies). Trained with behavioral cloning on a dataset of hundreds
of thousands of language-annotated trajectories, a produced policy
can proficiently execute an order of magnitude more commands
than previous works: specifically we estimate a 93.5% success rate
on a set of 87,000 unique natural language strings specifying raw
end-to-end visuo-linguo-motor skills in the real world. We find that
the same policy is capable of being guided by a human via real-time
language to address a wide range of precise long-horizon rearrange-
ment goals, e.g. “make a smiley face out of blocks”. The dataset
we release comprises nearly 600,000 language-labeled trajectories,
an order of magnitude larger than prior available datasets. We
hope the demonstrated results and associated assets enable further
advancement of helpful, capable, natural-language-interactable
robots. See videos at https://interactive-language.github.io.
I. INTRODUCTION
Building a robot that can follow a diverse array of
natural language instructions has been a longstanding goal of AI
research, since at least the SHRDLU [1] experiments starting in the
late 1960s. While recent research on this topic has been abundant
[2]–[9], few efforts have actually produced a robot that (i) exists in
the real world, and (ii) can capably respond to a large number of
rich, diverse language commands. We expect that future research
will continue to produce larger and more diverse sets of behaviors,
either by sequencing raw skills together [10] or growing the num-
ber of raw skills themselves [11]. However, we are also interested
in (iii), the capacity to follow interactive language commands, by
which we mean that the robot reacts capably and in-the-moment
to new natural language instructions provided during ongoing task
execution. Although we might expect such a robot to be possible
given current methods, natural language-interactable robots are
frequently slow in practice, and often use blocking parameterized
skills [7], [10] or simplifying self-resetting behaviors [9], [12] that
prohibit this kind of live, real-time interaction.
In this paper, we demonstrate a framework for producing real-
world, real-time-interactable, natural-language-instructable robots
(Fig. 1, a) that by certain metrics operate at an order of magnitude
larger scale than prior works. To accelerate further research in
this setting, we accordingly provide our associated recipe, dataset,
models, hardware environment description, simulated analogue
environment, and a research benchmark for language conditioned
manipulation (Fig. 1, c). In terms of scale, the produced robot
policies can address 87,000 unique commands at an estimated
93.5% success rate (Fig. 1, b), with continuous 5Hz visuolinguo-
motor control, and are capable of chaining raw skills to reach
hundreds of thousands of long horizon goals in its environment.
Fig. 1: Real-time language, diverse robot behaviors. a) Over the
course of 5 minutes, a human guides a robot to precisely rearrange
objects on a table into a desired shape, with real-time natural language
as the only mechanism for specifying behaviors. b) We demonstrate
a single robot that can capably address 87,000 behaviors specified
entirely in natural language. c) We release Language-Table, a suite of
human-collected datasets and a multi-task continuous control benchmark
for open-vocabulary visuolinguomotor learning.
This robot exists in an environment which we designed to
provide a tractable yet difficult level of challenge (perception from
pixels, feedback-rich control, multiple objects, ambiguous natural
language instructions). We cast real-time language guidance as a
large-scale imitation learning problem [11], [13], [14] (Figure 2).
The learning-algorithm recipe itself is intentionally simple; the
complexity of this effort lies primarily in the data effort, for
which we detail insights and techniques. We hope
the dataset and benchmark may catalyze further work which may
improve on our demonstrated sample complexity and performance.
Beyond demonstrating diverse short-horizon skills, we also use
these capabilities to study the nonobvious benefits of a real-time
language robot. For one, we show that through occasional human
natural-language feedback, the robot can accomplish complex
long-horizon rearrangements such as “put the blocks into a smiley
face with green eyes” that require multiple minutes of precise
coordinated control (Figure 5, left). We also find that real-time
language competency unlocks new capabilities like simultaneous,
multi-robot instruction – in which a single human can guide mul-
tiple real-time robots through long-horizon tasks (Figure 5, right).
Fig. 2: Interactive Language: a large-scale robot imitation learning framework for real-time language. Stage 1: high-throughput robot data collection with multiple operators; post-collection, robot video and actions are relabeled into language-conditioned demonstrations using event-selectable hindsight relabeling. Stage 2: simple language-conditioned behavioral cloning (LCBC) with an MSE loss. Stage 3: a human guides a single learned policy in real time using natural language to accomplish hundreds of thousands of goals.

Contributions. Our primary contributions include (i) Interactive
Language, a framework for producing real-world robots that can
capably receive interactive open-vocabulary language conditioning
in real time (for the scope of this paper, by real-time we mean
that new language conditioning can occur in the “blink of an eye”,
i.e. at approximately 3 Hz [15] or greater) while performing
continuous-control visuomotor manipulation. Interactive Language
combines existing techniques, together with novel components like
event-selectable hindsight relabeling, to define a simple and
scalable recipe for learning large repertoires of
natural-language-conditionable skills. (ii) We
use this system to present and study the setting of interactive
language guidance, showing that the combination of real-time lan-
guage feedback and a low-level language-conditionable policy can
address long-horizon manipulation goal states in a tabletop rear-
rangement setting. (iii) To facilitate future research in this domain,
we release Language-Table, a dataset and simulated multitask imi-
tation learning benchmark. With nearly 600,000 diverse demonstra-
tions across simulation and the real world, Language-Table is, to
our knowledge, the largest natural language conditioned imitation
learning dataset of its kind by an order of magnitude (Table III).
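To make the interaction pattern in (i) concrete, the following minimal Python sketch shows a fixed-rate control loop in which the language conditioning is simply re-read at every tick, so a new command takes effect at the next control step. All names here (`policy`, `robot`, `instruction_queue`) are hypothetical stand-ins for illustration, not the released code.

```python
import queue
import time

CONTROL_HZ = 5  # the produced policies run continuous 5 Hz visuolinguomotor control

def interactive_loop(policy, robot, instruction_queue: queue.Queue):
    """Hypothetical real-time loop: language is re-read every tick, so a new
    human command takes effect roughly one control step (~0.2 s) later."""
    instruction = ""
    while True:
        tick = time.monotonic()
        # A human may interject at any time; keep only the most recent command.
        try:
            while True:
                instruction = instruction_queue.get_nowait()
        except queue.Empty:
            pass
        rgb = robot.get_camera_image()             # observation s: RGB pixels
        action = policy.predict(rgb, instruction)  # a ~ pi_theta(a | s, l)
        robot.apply_action(action)                 # continuous low-level action
        # Sleep off the remainder of the 1/5 s control period.
        time.sleep(max(0.0, 1.0 / CONTROL_HZ - (time.monotonic() - tick)))
```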
II. RELATED WORK
From single-task imitation to multi-task and language conditioning. Imitation learning (see review [14]), the perspective
we adopt in this work, provides a simple and stable way for
robots to acquire behaviors from human expert demonstrations.
While historically imitation learning has been applied to individual
tasks from instrumented state [16]–[19], the desire for more
general purpose robots has motivated study into policies capable
of learning multiple skills at once from more generic on-board
sensory observations like RGB pixels [20]–[22]. To condition
multiple learned behaviors, prior setups have relied on discrete
one-hot task identifiers [23], which can be difficult to scale to
many tasks, or goal images [24]–[26], which can be impractical
to provide in real world scenarios. Alternatively, a long history of
prior work in broader AI research [1]–[6], [11] has sought a more
convenient form of specification in the form of natural language
conditioning (survey [27]), with some results on physical robots
[7]–[9], [12]. This focus has yielded many varied and impressive
approaches to tackling the grounding problem [1], [28]: learning
to relate language to one's embodied observations and actions.
However, in both simulation and the real world, instruction-
following robots rarely leverage the full capabilities of contin-
uous control, instead employing simplified, parameterized action
spaces [6], [7], [29], [30]. Furthermore, once provided, language
conditioning is typically presumed fixed over robot execution
[8]–[10], [12], with little opportunity for subsequent interaction by
the instructor. Our work, in contrast, studies the first combination,
to our knowledge, of real-time natural language guidance of a
physical robot engaged in continuous visuomotor manipulation.
Interactively guiding robot behavior with language. Our
work exists in a larger setting of humans modifying or correcting
the behavior of autonomous agents [31], historically addressed
in forms like teleoperation [32]–[34], kinesthetic teaching [35],
or sparse human preference feedback [36]. Certain works have
studied language as a means of correction, but typically do so
under simplifying assumptions that we relax in the current work.
For example, [37], [38], [39], and [40] study language corrections,
but under the respective simplifying assumptions of hand-defined
optimization for grounding, undivided operator attention, paired
iterative corrections at training time, and presumed access to
motion planners and task cost functions. Additionally, to the
best of our knowledge, none of these works support multiple-Hz
iterative specification over the course of execution. Closest to our
approach are [11] and [30], which study language-interactive agents
learned via imitation, but entirely in simulation and under varying
degrees of actuation realism. In contrast to these prior studies,
our work learns real-time natural language policies end-to-end
from RGB pixels to continuous control outputs with a simple
behavioral cloning objective [13], and applies them to contact-rich
real-world manipulation tasks.
Scaling real-world imitation learning. One of the largest
bottlenecks in robot imitation is often simply the amount of diverse
robot data made available to learning [9], [22], [23]. Many multi-
task imitation learning frameworks determine the set of tasks to be
learned upfront [7], [9], [10], [12], [14]. While this may simplify
collection conceptually, it also often requires that reset protocols
and success criteria be designed manually for each behavior. An-
other challenge particular to large scale multi-operator collections
is that typically not all data can be considered optimal [41], [42],
often requiring manual post-hoc success filtering [9], [10]. These
per-task manual efforts have historically been difficult to scale to a
large and diverse task setting, like the one studied in this work. We
sidestep both these scaling concerns by instead having operators
continuously teleoperate long-horizon behaviors, with no require-
ments on low level task segmentation or resets [11], [25], [43] and
then leverage after-the-fact crowdsourced language annotation [8],
[11]. In contrast to the “random window” relabeling explored in
[11], we give annotators precise control over the start and end of be-
haviors they are annotating, which we find in practice better aligns
relabeled training data to the actual commands given at test time.
III. PROBLEM SETUP
Our goal is to train a conditional policy, $\pi_\theta(a \mid s, l)$,
parameterized by $\theta$, which maps from observations $s \in \mathcal{S}$
and human-provided language $l \in \mathcal{L}$ to actions $a \in \mathcal{A}$
on a physical robot. In particular we are interested in open-vocabulary
language-conditioned visuomotor policies, in which the observation space
contains high-dimensional RGB images, e.g. $\mathcal{S} = \mathbb{R}^{H \times W \times C}$,
and where the language conditioning $\mathcal{L}$ has no predefined template,
grammar, or vocabulary. We are also particularly interested in
allowing humans to interject new language $l \in \mathcal{L}$ at any time, at the
natural rate of the visuo-linguo-motor policy. Each commanded $l$
encodes a distribution of achievable goals $g_{\mathrm{short}} \in \mathcal{G}_{\mathrm{short}}$ in
the environment. Note that humans may generate a new language
instruction $l$ based on their own perception of the environment,
$s_H \in \mathcal{S}_H$, which may differ substantially from the robot's
$s \in \mathcal{S}$ (e.g. due to viewpoint, self-occlusion, limited observational
memory, etc.). As in prior works [11], we treat natural-language-
conditioned visuomotor skill learning as a contextual imitation
learning problem [14]. As such, we acquire an offline dataset $\mathcal{D}$
containing pairs of valid demonstrations and the conditions
they resolve, $\{(\tau, l)_i\}_{i=0}^{|\mathcal{D}|}$. Each $\tau_i$ is a variable-length trajectory of
robot observations and actions, $\tau_i = [(s_0, a_0), (s_1, a_1), \ldots, (s_T)]$, and
each $l_i$ describes the full trajectory as a second-person command.
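As a concrete, purely illustrative rendering of this formulation, the sketch below encodes the dataset objects as Python dataclasses; the field names are our own assumptions, not the schema of the released Language-Table data.

```python
from dataclasses import dataclass
from typing import List

import numpy as np

@dataclass
class Step:
    observation: np.ndarray  # s_t: an RGB image in R^{H x W x C}
    action: np.ndarray       # a_t: a continuous robot action

@dataclass
class Demonstration:
    # tau_i = [(s_0, a_0), (s_1, a_1), ..., (s_T)]: variable length,
    # with a final observation s_T that has no paired action.
    steps: List[Step]
    final_observation: np.ndarray
    # l_i: a free-form, second-person command with no fixed template,
    # grammar, or vocabulary.
    instruction: str

# The offline dataset D is a collection of (trajectory, language) pairs.
Dataset = List[Demonstration]
```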
IV. INTERACTIVE LANGUAGE: METHODS AND ANALYSIS
First we introduce Interactive Language, summarized in Fig-
ure 2, a simple and generically applicable imitation learning frame-
work for training real-time natural-language-interactable robots. In-
teractive Language combines a scalable method for collecting var-
ied, real world language-conditioned demonstration datasets, with
straightforward language conditioned behavioral cloning (LCBC).
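LCBC itself is ordinary supervised learning; Figure 2 annotates it with an MSE loss. Below is a minimal PyTorch-flavored sketch of one training step, assuming hypothetical `encode_image`, `encode_text`, and `policy_net` modules; it is a sketch of the objective, not a reproduction of the paper's architecture.

```python
import torch
import torch.nn.functional as F

def lcbc_training_step(policy_net, encode_image, encode_text, batch, optimizer):
    """One step of language-conditioned behavioral cloning (LCBC).

    batch["rgb"]:    (B, C, H, W) float tensor of observations s_t
    batch["text"]:   list of B instruction strings l
    batch["action"]: (B, A) float tensor of demonstrated actions a_t
    All module names are illustrative assumptions, not the released code.
    """
    img_emb = encode_image(batch["rgb"])        # visual features
    txt_emb = encode_text(batch["text"])        # language features
    pred_action = policy_net(img_emb, txt_emb)  # pi_theta(a | s, l)
    loss = F.mse_loss(pred_action, batch["action"])  # MSE loss, as in Fig. 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```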
                           Has contact   Object/location-directed instructions   Compound instructions
Random window [8], [11]        86%                        47%                            16%
Event-selectable (ours)        91%                        83%                            <1%
Real test instructions         89%                        84%                            <1%

TABLE I: Which relabeling strategy aligns best with test-time language?
Real-World Data Collection
Total robots 4
Total teleoperators 10
Total episodes 16.4k
Average episode length (minutes) 9.9
Total hours of collect time 2.7k
Hindsight Relabeling
Total crowdsourced annotators 64
Total relabeled demonstrations obtained 299k
Total unique relabeled instructions 87k
Average relabeled demonstration length (seconds) 5.8
Total number of hours of relabeled demonstrations obtained 488
Total instruction hours / Collect hours 18.06%
TABLE II: Statistics: real-world collection and relabeling. This data snapshot went into training and is a subset of the full Language-Table data.
A. Data Collection
High-throughput raw data collection. Interactive Language
adopts purposefully minimal collection assumptions to maximize
the flow of human-demonstrated behavior to learning. Operators
teleoperate a variety of long-horizon behaviors constantly,
without low-level task definition, segmentation, or episodic resets.
This strategy shares assumptions with “play” collection [25],
but additionally guides collect towards temporally extended
low-entropy states like lines, shapes, and complex arrangements.
Each collect episode lasts roughly 10 minutes before a break, and is
guided by multiple randomly chosen long-horizon prompts $p \in \mathcal{P}$
(e.g. “make a square shape out of the blocks”), drawn from the
set of target long-horizon goals, which teleoperators are free to
follow or ignore. We do not assume all of the data collected for
each prompt $p$ is optimal (each $p$ is discarded after collection).
In practice, our collection includes many inevitable edge cases
that might otherwise require data cleaning, e.g. solving for the
wrong $p$ or knocking blocks off the table. We log all of these cases
and incorporate them later on as training data. Concretely, this
collect procedure yields a semi-structured, optimality-agnostic
collection $\mathcal{D}_{\mathrm{collect}} = \{\tau_i\}_{i=0}^{|\mathcal{D}_{\mathrm{collect}}|}$. The purpose of $\mathcal{D}_{\mathrm{collect}}$ is to
provide a sufficiently diverse basis for crowdsourced hindsight
language relabeling [8], [11], described next.
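A minimal sketch of what this optimality-agnostic logging could look like in code: every teleoperated step is recorded against the current prompt, with no per-task segmentation, success checking, or resets. All names and the episode layout are hypothetical.

```python
import random

def run_collect_episode(robot, teleop, prompts, episode_minutes=10, hz=5):
    """Record one semi-structured collect episode for D_collect.

    The prompt only guides the operator and is discarded after collection;
    nothing here checks success or filters out "bad" data.
    """
    prompt = random.choice(prompts)  # e.g. "make a square shape out of the blocks"
    episode = {"prompt": prompt, "frames": [], "actions": []}
    for _ in range(episode_minutes * 60 * hz):
        episode["frames"].append(robot.get_camera_image())
        action = teleop.read_action()  # raw human teleoperation command
        robot.apply_action(action)
        episode["actions"].append(action)
    return episode  # appended to D_collect as-is, edge cases included
```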
Event-selectable hindsight relabeling. We convert $\mathcal{D}_{\mathrm{collect}}$
into natural language conditioned demonstrations
$\mathcal{D}_{\mathrm{training}} = \{(\tau, l)_i\}_{i=0}^{|\mathcal{D}_{\mathrm{training}}|}$, using a new variant of hindsight language relabeling
[11] we call “Event-Selectable Hindsight Relabeling” (Fig. 2,
left). Previous “random window” relabeling systems [8], [11] have
at least two drawbacks: each random window is not guaranteed
to contain “usefully describable” actions, and random window
lengths must be determined upfront as a sensitive hyperparameter.
We instead ask annotators to watch the full collect video, then
find $K$ coherent behaviors ($K = 24$ in our case). Annotators have
the ability to mark the start and end frame of each behavior, and
are asked to phrase their text descriptions as natural language
commands. In Table I, we compare event-selectable relabeling to
the random window strategy [8], [11], measuring how well each
aligns with real test-time instructions.
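Mechanically, event-selectable relabeling amounts to slicing: each annotator-marked (start frame, end frame, command) event carves one language-conditioned demonstration out of the long collect episode. A sketch under assumed field names, reusing the hypothetical episode dictionary from the collection sketch above:

```python
from typing import Dict, List, Tuple

def relabel_episode(episode: Dict, events: List[Tuple[int, int, str]]) -> List[Dict]:
    """Convert one collect episode into up to K language-conditioned demos.

    events: annotator-chosen (start_frame, end_frame, instruction) triples,
    where each instruction is phrased as a natural-language command.
    Unlike random-window relabeling, the boundaries are picked by a human,
    so every window contains a coherent, describable behavior.
    """
    demos = []
    for start, end, instruction in events:
        demos.append({
            "frames": episode["frames"][start:end + 1],   # s_start ... s_end
            "actions": episode["actions"][start:end],     # a_start ... a_{end-1}
            "instruction": instruction,  # e.g. "push the red star to the top center"
        })
    return demos
```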