
our work learns real-time natural language policies end-to-end
from RGB pixels to continuous control outputs with a simple
behavioral cloning objective [13], and applies them to contact-rich
real-world manipulation tasks.
Scaling real-world imitation learning.
One of the largest
bottlenecks in robot imitation is often simply the amount of diverse
robot data made available to learning [9], [22], [23]. Many multi-task imitation learning frameworks determine the set of tasks to be
learned upfront [7], [9], [10], [12], [14]. While this may simplify
collection conceptually, it also often requires that reset protocols
and success criteria be designed manually for each behavior. Another challenge, particular to large-scale multi-operator collections,
is that typically not all data can be considered optimal [41], [42],
often requiring manual post-hoc success filtering [9], [10]. These
per-task manual efforts have historically been difficult to scale to a
large and diverse task setting, like the one studied in this work. We
sidestep both these scaling concerns by instead having operators continuously teleoperate long-horizon behaviors, with no requirements on low-level task segmentation or resets [11], [25], [43], and then leveraging after-the-fact crowdsourced language annotation [8], [11]. In contrast to the “random window” relabeling explored in [11], we give annotators precise control over the start and end of behaviors they are annotating, which we find in practice better aligns relabeled training data with the actual commands given at test time.
III. PROBLEM SETUP
Our goal is to train a conditional policy, $\pi_\theta(a \mid s, l)$, parameterized by $\theta$, which maps from observations $s \in \mathcal{S}$ and human-provided language $l \in \mathcal{L}$ to actions $a \in \mathcal{A}$ on a physical robot. In particular we are interested in open-vocabulary language-conditioned visuomotor policies, in which the observation space contains high-dimensional RGB images, e.g. $\mathcal{S} = \mathbb{R}^{H \times W \times C}$, and where language conditioning $\mathcal{L}$ has no predefined template, grammar, or vocabulary. We are also particularly interested in allowing humans to interject new language $l \in \mathcal{L}$ at any time, at the natural rate of the visuo-linguo-motor policy. Each commanded $l$ encodes a distribution of achievable goals $g_{\text{short}} \in \mathcal{G}_{\text{short}}$ in the environment. Note that humans may generate a new language instruction $l$ based on their own perception of the environment, $s_H \in \mathcal{S}_H$, which may differ substantially from the robot's $s \in \mathcal{S}$ (e.g. due to viewpoint, self-occlusion, limited observational memory, etc.). As in prior works [11], we treat natural-language-conditioned visuomotor skill learning as a contextual imitation learning problem [14]. As such, we acquire an offline dataset $\mathcal{D}$ containing pairs of valid demonstrations and the conditions they resolve, $\{(\tau, l)_i\}_{i=0}^{|\mathcal{D}|}$. Each $\tau_i$ is a variable-length trajectory of robot observations and actions, $\tau_i = [(s_0, a_0), (s_1, a_1), \ldots, (s_T)]$, and each $l_i$ describes the full trajectory as a second-person command.
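To make this setup concrete, the following minimal Python sketch shows one way the objects above could be represented; the names (Step, LabeledDemo, Dataset) and field layout are illustrative assumptions, not structures from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional
import numpy as np

@dataclass
class Step:
    observation: np.ndarray       # s_t: an RGB image in R^{H x W x C}
    action: Optional[np.ndarray]  # a_t: continuous control output; None for the final state s_T

@dataclass
class LabeledDemo:
    steps: List[Step]  # tau_i = [(s_0, a_0), (s_1, a_1), ..., (s_T)], variable length
    instruction: str   # l_i: a free-form, second-person natural language command

# The offline dataset D is simply a collection of (trajectory, instruction) pairs.
Dataset = List[LabeledDemo]
```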
IV. INTERACTIVE LANGUAGE: METHODS AND ANALYSIS
First we introduce Interactive Language, summarized in Figure 2, a simple and generically applicable imitation learning framework for training real-time natural-language-interactable robots. Interactive Language combines a scalable method for collecting varied, real-world language-conditioned demonstration datasets with straightforward language-conditioned behavioral cloning (LCBC).
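As a rough illustration of what LCBC amounts to in code, here is a minimal sketch that scores a policy against relabeled demonstrations; the mean-squared-error surrogate for $-\log \pi_\theta(a \mid s, l)$ and the policy/LabeledDemo interfaces are assumptions for illustration, not the paper's actual objective or architecture.

```python
from typing import Callable, List
import numpy as np

def lcbc_loss(policy: Callable[[np.ndarray, str], np.ndarray],
              demos: List["LabeledDemo"]) -> float:
    """Average imitation error over every (s_t, a_t, l) tuple in the relabeled dataset."""
    errors = []
    for demo in demos:
        for step in demo.steps:
            if step.action is None:  # the terminal state s_T carries no action to imitate
                continue
            predicted = policy(step.observation, demo.instruction)
            errors.append(float(np.mean((predicted - step.action) ** 2)))
    return float(np.mean(errors))
```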
                            Has contact    Object/location-directed    Compound
                                           instructions                instructions
Random window [8], [11]         86%                 47%                    16%
Event-selectable (ours)         91%                 83%                    <1%
Real test instructions          89%                 84%                    <1%

TABLE I: Which relabeling strategy aligns best with test-time language?
Real-World Data Collection
  Total robots                                                4
  Total teleoperators                                         10
  Total episodes                                              16.4k
  Average episode length (minutes)                            9.9
  Total hours of collect time                                 2.7k
Hindsight Relabeling
  Total crowdsourced annotators                               64
  Total relabeled demonstrations obtained                     299k
  Total unique relabeled instructions                         87k
  Average relabeled demonstration length (seconds)            5.8
  Total number of hours of relabeled demonstrations obtained  488
  Total instruction hours / Collect hours                     18.06%

TABLE II: Statistics: real-world collection and relabeling. This data snapshot
went into training and is a subset of the full Language-Table data.
A. Data Collection
High-throughput raw data collection. Interactive Language
adopts purposefully minimal collection assumptions to maximize
the flow of human demonstrated behavior to learning. Operators
teleoperate a variety of long-horizon behaviors constantly,
without low-level task definition, segmentation, or episodic resets.
This strategy shares assumptions with “play” collection [25],
but additionally guides collect towards temporally extended
low-entropy states like lines, shapes, and complex arrangements.
Each collect episode lasts $\sim$10 minutes before a break, and is guided by multiple randomly chosen long-horizon prompts $p \in \mathcal{P}$ (e.g. “make a square shape out of the blocks”), drawn from the set of target long-horizon goals, which teleoperators are free to follow or ignore. We do not assume all of the data collected for each prompt $p$ is optimal (each $p$ is discarded after collecting).
In practice, our collection includes many inevitable edge cases that might otherwise require data cleaning, e.g. solving for the wrong $p$ or knocking blocks off the table. We log all of these cases and incorporate them later on as training data. Concretely, this collect procedure yields a semi-structured, optimality-agnostic collection $\mathcal{D}_{\text{collect}} = \{\tau_i\}_{i=0}^{|\mathcal{D}_{\text{collect}}|}$. The purpose of $\mathcal{D}_{\text{collect}}$ is to provide a sufficiently diverse basis for crowdsourced hindsight language relabeling [8], [11], described next.
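For concreteness, a minimal sketch of such a reset-free, prompt-guided collect loop follows; the robot/teleop interfaces, the prompt list (beyond the quoted example), and the reuse of the Step structure from the earlier sketch are all illustrative assumptions rather than the authors' actual tooling.

```python
import random
import time
from typing import List

# Hypothetical long-horizon prompts p drawn from the set of target goals P.
LONG_HORIZON_PROMPTS = [
    "make a square shape out of the blocks",
    "sort the blocks by color",
    "put all the blocks in one vertical line",
]

def collect_episode(robot, teleop, episode_minutes: float = 10.0) -> List["Step"]:
    """Record one ~10 minute collect episode with no segmentation, resets, or success filtering."""
    prompt = random.choice(LONG_HORIZON_PROMPTS)  # teleoperators are free to follow or ignore it
    teleop.show_prompt(prompt)
    steps, deadline = [], time.time() + 60.0 * episode_minutes
    while time.time() < deadline:
        observation = robot.get_rgb_observation()  # assumed robot interface
        action = teleop.read_action()              # assumed teleoperation interface
        robot.apply_action(action)
        steps.append(Step(observation=observation, action=action))
    # Everything is logged, including wrong-prompt solutions and dropped blocks;
    # the prompt itself is discarded, matching the optimality-agnostic collection above.
    return steps
```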
Event-selectable hindsight relabeling. We convert $\mathcal{D}_{\text{collect}}$ into natural-language-conditioned demonstrations $\mathcal{D}_{\text{training}} = \{(\tau, l)_i\}_{i=0}^{|\mathcal{D}_{\text{training}}|}$, using a new variant of hindsight language relabeling [11] we call “Event-Selectable Hindsight Relabeling” (Fig. 2, left). Previous “random window” relabeling systems [8], [11] have at least two drawbacks: each random window is not guaranteed to contain “usefully describable” actions, and random window lengths must be determined upfront as a sensitive hyperparameter. We instead ask annotators to watch the full collect video, then find $K$ coherent behaviors ($K = 24$ in our case). Annotators have the ability to mark the start and end frame of each behavior, and are asked to phrase their text descriptions as natural language commands.
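A minimal sketch of how such annotator-marked events could be sliced into relabeled demonstrations is shown below, before returning to the comparison in Table I; the AnnotatedEvent structure and relabel_episode helper are illustrative, and reuse the Step/LabeledDemo sketches from earlier.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AnnotatedEvent:
    start_frame: int   # start of one coherent behavior within the collect episode
    end_frame: int     # end of that behavior (inclusive)
    instruction: str   # annotator's caption, phrased as a natural language command

def relabel_episode(episode: List["Step"],
                    events: List[AnnotatedEvent]) -> List["LabeledDemo"]:
    """Slice annotator-selected windows out of one collect episode into (tau, l) pairs."""
    demos = []
    for event in events:
        window = episode[event.start_frame : event.end_frame + 1]
        demos.append(LabeledDemo(steps=window, instruction=event.instruction))
    return demos
```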
commands. In Table I, we compare event-selectable relabeling to