
than previous approaches. Unlike prior work, which relies
on costly expert demonstrations and fully annotated datasets
to learn language-conditioned agents in the real world, our
approach leverages a more scalable data collection scheme:
unstructured, reset-free, and possibly suboptimal teleoperated
play data [16]. Moreover, our approach requires annotating
as little as 1% of the total data with language. Extensive
experiments show that when paired with LLMs that translate
abstract natural language instructions into a sequence of
subgoals, HULC++ enables completing long-horizon, multi-
stage natural language instructions in the real world. Finally,
we show that our model sets a new state of the art on the
challenging CALVIN benchmark [8], on following multiple
long-horizon manipulation tasks in a row with 7-DoF control,
from high-dimensional perceptual observations, and specified
via natural language. To our knowledge, our method is the
first explicitly aiming to solve language-conditioned long-
horizon, multi-tier tasks from purely offline, reset-free and
unstructured data in the real world, while requiring as little
as 1% of language annotations.
II. RELATED WORK
There has been a growing interest in the robotics com-
munity to build language-driven robot systems [17], spurred
by the advancements in grounding language and vision [18],
[19]. Earlier works focused on localizing objects mentioned
in referring expressions [20], [21], [22], [23], [24] and
following pick-and-place instructions with predefined motion
primitives [25], [6], [26]. More recently, end-to-end learning
has been used to study the challenging problem of fusing
perception, language and control [4], [27], [28], [1], [10],
[9], [15], [5]. End-to-end learning from pixels is an attrac-
tive choice for modeling general-purpose agents due to its
flexibility, as it makes the least assumptions about objects
and tasks. However, such pixel-to-action models often suffer
from poor sample efficiency. In the area of robot manipulation,
the two extremes of the spectrum are CLIPort [6] on the
one hand, and agents like GATO [5] and BC-Z [4] on
the other: the former needs a few hundred expert
demonstrations to pick and place objects with motion
planning, while the latter require several months of expert
demonstration collection to learn visuomotor manipulation
skills for continuous control. In contrast, we lift the requirement of
collecting expert demonstrations and the corresponding need
for manually resetting the scene, to learn from unstructured,
reset-free, teleoperated play data [16]. Another orthogonal
line of work tackles data inefficiency by using pre-trained
image representations [29], [6], [30] to bootstrap downstream
task learning, which we also leverage in this work.
We propose a novel hierarchical approach that com-
bines the strengths of both paradigms to learn language-
conditioned, task-agnostic, long-horizon policies from high-
dimensional camera observations. Inspired by the line of
work that decomposes robot manipulation into semantic and
spatial pathways [12], [13], [6], we propose leveraging a
self-supervised affordance model from unstructured data that
guides the robot to the vicinity of actionable regions referred
to in language instructions. Once inside this area, we switch to
a single multi-task 7-DoF language-conditioned visuomotor
policy, also trained from offline, unstructured data.

Fig. 2: Visualization of the procedure to extract language-conditioned
visual affordances from human teleoperated, unstructured, free-form
interaction data. We leverage the gripper open/close signal during
teleoperation to project the end-effector into the camera images to
detect affordances in undirected data. (Example instruction shown:
"Move the sliding door to the right".)
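The procedure in Fig. 2 hinges on projecting the 3D end-effector (TCP) position into the camera image at gripper open/close events. Below is a minimal sketch of that projection, assuming a calibrated pinhole camera; the function name, the world-to-camera transform T_cam_world, and the intrinsics K are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def project_tcp(tcp_world, T_cam_world, K):
    # Transform the 3D TCP position from the world frame into the
    # camera frame (using homogeneous coordinates).
    p_cam = (T_cam_world @ np.append(tcp_world, 1.0))[:3]
    # Pinhole projection with camera intrinsics K -> pixel (u, v).
    uvw = K @ p_cam
    return int(round(uvw[0] / uvw[2])), int(round(uvw[1] / uvw[2]))
```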
III. METHOD
We decompose our approach into three main steps. First,
we train a language-conditioned affordance model from
unstructured, teleoperated data to predict the 3D location of
an object that affords an input language instruction (Section
III-A). Second, we leverage model-based planning to move
towards the predicted location and switch to a local language-
conditioned, learning-based policy πfree to interact with the
scene (Section III-C). Third, we show how HULC++ can
be used together with large language models (LLMs) for
decomposing abstract language instructions into a sequence
of feasible, executable subtasks (Section III-D).
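As a reading aid, here is a minimal sketch of how these three steps could compose at run time; all interfaces (llm_decompose, affordance_model, plan_and_move, pi_free, env) are hypothetical names, not the paper's code.

```python
def rollout(abstract_instruction, env, affordance_model,
            plan_and_move, pi_free, llm_decompose):
    # Step 3 (Sec. III-D): an LLM decomposes an abstract instruction
    # into a sequence of feasible, executable subtasks.
    for subtask in llm_decompose(abstract_instruction):
        obs = env.observe()
        # Step 1 (Sec. III-A): predict the 3D location that affords
        # the subtask from the current camera observation.
        target = affordance_model(obs, subtask)
        # Step 2 (Sec. III-C): model-based motion to the vicinity of
        # the predicted location ...
        obs = plan_and_move(env, target)
        # ... then the local language-conditioned policy interacts
        # with the scene until the subtask is complete.
        while not env.subtask_done(subtask):
            obs = env.step(pi_free(obs, subtask))
```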
Formally, our final robot policy is defined as a mixture:
\[
\pi(a \mid s, l) = (1 - \alpha(s, l)) \cdot \pi_{\text{mod}}(a \mid s) + \alpha(s, l) \cdot \pi_{\text{free}}(a \mid s, l) \tag{1}
\]
Specifically, we use the pixel distance between the pro-
jected end-effector position Itcp and the predicted pixel from
the affordance model Iaff to select which policy to use.
If the distance is larger than a threshold ε, the predicted
region is far from the robot's current position, so we use the
model-based policy πmod to move to the predicted location.
Otherwise, the end-effector is already near the predicted
position and we keep using the learning-based policy πfree.
Thus, we define α as:
\[
\alpha(s, l) =
\begin{cases}
0, & \text{if } |I_{\text{aff}} - I_{\text{tcp}}| > \epsilon \\
1, & \text{otherwise}
\end{cases} \tag{2}
\]
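To make the switching rule concrete, the following is a minimal sketch of Eqs. (1) and (2), assuming I_aff and I_tcp are pixel coordinates and pi_mod, pi_free are callable policies; the threshold value is illustrative, not from the paper.

```python
import numpy as np

def alpha(I_aff, I_tcp, eps=25.0):
    # Eq. (2): 1 when the projected end-effector pixel I_tcp is within
    # eps pixels of the predicted affordance pixel I_aff, else 0.
    # eps is an illustrative threshold, not a value from the paper.
    dist = np.linalg.norm(np.asarray(I_aff) - np.asarray(I_tcp))
    return 0 if dist > eps else 1

def select_action(s, l, I_aff, I_tcp, pi_mod, pi_free):
    # Eq. (1): a hard mixture that switches between the two policies.
    if alpha(I_aff, I_tcp) == 1:
        return pi_free(s, l)  # local, language-conditioned interaction
    return pi_mod(s)          # model-based motion toward the prediction
```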
As the affordance prediction is conditioned on language,
each time the agent receives a new instruction, it decides
which policy to use based on α(s, l). Restricting the area
where the model-free policy is active to the vicinity of
regions that afford human-object interactions makes it more
sample efficient, as it only needs to learn local behaviors.