Grounding Language with Visual Affordances over Unstructured Data
Oier Mees1, Jessica Borja-Diaz1, Wolfram Burgard2
Abstract: Recent works have shown that Large Language
Models (LLMs) can be applied to ground natural language to a
wide variety of robot skills. However, in practice, learning multi-
task, language-conditioned robotic skills typically requires
large-scale data collection and frequent human intervention to
reset the environment or help correct the current policies.
In this work, we propose a novel approach to efficiently
learn general-purpose language-conditioned robot skills from
unstructured, offline and reset-free data in the real world
by exploiting a self-supervised visuo-lingual affordance model,
which requires annotating as little as 1% of the total data with
language. We evaluate our method in extensive experiments
both in simulated and real-world robotic tasks, achieving state-
of-the-art performance on the challenging CALVIN benchmark
and learning over 25 distinct visuomotor manipulation tasks
with a single policy in the real world. We find that when
paired with LLMs to break down abstract natural language
instructions into subgoals via few-shot prompting, our method
is capable of completing long-horizon, multi-tier tasks in the
real world, while requiring an order of magnitude less data
than previous approaches. Code and videos are available at
http://hulc2.cs.uni-freiburg.de.
I. INTRODUCTION
Recent advances in large-scale language modeling have
produced promising results in bridging their semantic knowl-
edge of the world to robot instruction following and plan-
ning [1], [2], [3]. In reality, planning with Large Language
Models (LLMs) requires having a large set of diverse low-
level behaviors that can be seamlessly combined together to
intelligently act in the world. Learning such sensorimotor
skills and grounding them in language typically requires
either a massive large-scale data collection effort [1], [2],
[4], [5] with frequent human interventions, limiting the skills
to templated pick-and-place operations [6], [7] or deploying
the policies in simpler simulated environments [8], [9], [10].
The phenomenon that tasks which appear easy for humans,
such as pouring water into a cup, are difficult to teach a robot,
is known as Moravec’s paradox [11]. This raises
the question: how can we learn a diverse repertoire of visuo-
motor skills in the real world in a scalable and data-efficient
manner for instruction following?
Prior studies show that decomposing robot manipulation
into semantic and spatial pathways [12], [13], [6] improves
generalization, data-efficiency, and understanding of multi-
modal information. Inspired by these pathway architectures,
we propose a novel, sample-efficient method for learning
Equal contribution.
1University of Freiburg, Germany. 2University of Technology Nuremberg, Germany.
[Fig. 1 graphic: the prompt “Can you tidy up the workspace?” is decomposed by the LLM into subgoals such as “Open the drawer”, “Place the pink block inside the drawer”, “Place the purple block inside the drawer”, “Place the yellow block inside the drawer”, and “Close the drawer”.]
Fig. 1: When paired with Large Language Models, HULC++
enables completing long-horizon, multi-tier tasks from abstract
natural language instructions in the real world, such as “tidy up
the workspace” with no additional training. We leverage a visual
affordance model to guide the robot to the vicinity of actionable
regions referred to by language. Once inside this area, we switch to a
single 7-DoF language-conditioned visuomotor policy, trained from
offline, unstructured data.
general-purpose language-conditioned robot skills from un-
structured, offline and reset-free data in the real world by
exploiting a self-supervised visuo-lingual affordance model.
Our key observation is that instead of scaling the data
collection to learn how to reach any reachable goal state from
any current state [14] with a single end-to-end model, we can
decompose the goal-reaching problem hierarchically with a
high-level stream that grounds semantic concepts and a low-
level stream that grounds 3D spatial interaction knowledge,
as seen in Figure 1.
Specifically, we present Hierarchical Universal Lan-
guage Conditioned Policies 2.0 (HULC++), a hierarchical
language-conditioned agent that integrates the task-agnostic
control of HULC [10] with the object-centric semantic
understanding of VAPO [13]. HULC is a state-of-the-art
language-conditioned imitation learning agent that learns 7-
DoF goal-reaching policies end-to-end. However, in order to
jointly learn language, vision, and control, it needs a large
amount of robot interaction data, similar to other end-to-
end agents [4], [9], [15]. VAPO extracts a self-supervised
visual affordance model from unstructured data and not only
accelerates learning, but was also shown to boost general-
ization of downstream control policies. We show that by
extending VAPO to learn language-conditioned affordances
and combining it with a 7-DoF low-level policy that builds
upon HULC, our method is capable of following multiple
long-horizon manipulation tasks in a row, directly from
images, while requiring an order of magnitude less data
than previous approaches. Unlike prior work, which relies
on costly expert demonstrations and fully annotated datasets
to learn language-conditioned agents in the real world, our
approach leverages a more scalable data collection scheme:
unstructured, reset-free, and possibly suboptimal teleoperated
play data [16]. Moreover, our approach requires annotating
as little as 1% of the total data with language. Extensive
experiments show that when paired with LLMs that translate
abstract natural language instructions into a sequence of
subgoals, HULC++ enables completing long-horizon, multi-
stage natural language instructions in the real world. Finally,
we show that our model sets a new state of the art on the
challenging CALVIN benchmark [8], on following multiple
long-horizon manipulation tasks in a row with 7-DoF control,
from high-dimensional perceptual observations, and specified
via natural language. To our knowledge, our method is the
first explicitly aiming to solve language-conditioned long-
horizon, multi-tier tasks from purely offline, reset-free and
unstructured data in the real world, while requiring as little
as 1% of language annotations.
II. RELATED WORK
There has been a growing interest in the robotics com-
munity to build language-driven robot systems [17], spurred
by the advancements in grounding language and vision [18],
[19]. Earlier works focused on localizing objects mentioned
in referring expressions [20], [21], [22], [23], [24] and
following pick-and-place instructions with predefined motion
primitives [25], [6], [26]. More recently, end-to-end learning
has been used to study the challenging problem of fusing
perception, language and control [4], [27], [28], [1], [10],
[9], [15], [5]. End-to-end learning from pixels is an attrac-
tive choice for modeling general-purpose agents due to its
flexibility, as it makes the least assumptions about objects
and tasks. However, such pixel-to-action models often have
a poor sample efficiency. In the area of robot manipulation,
the two extremes of the spectrum are CLIPort [6] on the
one hand, and agents like GATO [5] and BC-Z [4] on
the other, which range from needing a few hundred expert
demonstrations for pick-and-placing objects with motion
planning, to several months of data collection of expert
demonstrations to learn visuomotor manipulation skills for
continuous control. In contrast, we lift the requirement of
collecting expert demonstrations and the corresponding need
for manually resetting the scene, to learn from unstructured,
reset-free, teleoperated play data [16]. Another orthogonal
line of work tackles data inefficiency by using pre-trained
image representations [29], [6], [30] to bootstrap downstream
task learning, which we also leverage in this work.
We propose a novel hierarchical approach that com-
bines the strengths of both paradigms to learn language-
conditioned, task-agnostic, long-horizon policies from high-
dimensional camera observations. Inspired by the line of
work that decomposes robot manipulation into semantic and
spatial pathways [12], [13], [6], we propose leveraging a
self-supervised affordance model from unstructured data that
guides the robot to the vicinity of actionable regions referred
to in language instructions. Once inside this area, we switch to
a single multi-task 7-DoF language-conditioned visuomotor
policy, trained also from offline, unstructured data.
[Fig. 2 graphic: frames from teleoperated play annotated with the instruction “Move the sliding door to the right” and the projected end-effector position.]
Fig. 2: Visualization of the procedure to extract language-
conditioned visual affordances from human-teleoperated, unstruc-
tured, free-form interaction data. We leverage the gripper open/close
signal during teleoperation to project the end-effector into the
camera images to detect affordances in undirected data.
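To make the labeling procedure of Fig. 2 concrete, the following is a minimal sketch of how gripper-closing events in teleoperated play data could be turned into pixel-level affordance labels. It assumes a pinhole camera with known intrinsics K and extrinsics T_cam_world; the field names (gripper_open, tcp_position, rgb_static, lang_annotation) and the 16-frame labeling window are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np


def project_point(point_world, T_cam_world, K):
    """Project a 3D world point into pixel coordinates (pinhole camera model)."""
    p_h = np.append(np.asarray(point_world, dtype=float), 1.0)  # homogeneous coords
    p_cam = (T_cam_world @ p_h)[:3]                              # world -> camera frame
    u, v, w = K @ p_cam                                          # camera -> image plane
    return np.array([u / w, v / w])                              # pixel (x, y)


def extract_affordance_labels(episode, T_cam_world, K, context_frames=16):
    """Whenever the teleoperator closes the gripper, project the end-effector
    (TCP) position into the static camera image and use that pixel as an
    affordance label for the preceding frames (illustrative sketch)."""
    labels = []
    for t in range(1, len(episode)):
        was_open = episode[t - 1]["gripper_open"]
        closed_now = not episode[t]["gripper_open"]
        if was_open and closed_now:                  # open -> closed transition
            pixel = project_point(episode[t]["tcp_position"], T_cam_world, K)
            # Also label earlier frames, so the model learns to predict the
            # affordance before the interaction actually happens.
            for k in range(max(0, t - context_frames), t + 1):
                labels.append({
                    "rgb": episode[k]["rgb_static"],
                    "affordance_pixel": pixel,
                    "language": episode[k].get("lang_annotation"),
                })
    return labels
```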
III. METHOD
We decompose our approach into three main steps. First,
we train a language-conditioned affordance model from
unstructured, teleoperated data to predict 3D locations of
an object that affords an input language instruction (Section
III-A). Second, we leverage model-based planning to move
towards the predicted location and switch to a local language-
conditioned, learning-based policy π_free to interact with the
scene (Section III-C). Third, we show how HULC++ can
be used together with large language models (LLMs) for
decomposing abstract language instructions into a sequence
of feasible, executable subtasks (Section III-D).
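As a rough illustration of this third step, the sketch below shows how few-shot prompting might be used to decompose an abstract instruction into executable subtasks, reusing the “tidy up the workspace” example and subgoals shown in Fig. 1 (ordering illustrative). The prompt wording, the `complete` callable, and the parsing logic are assumptions for illustration, not the prompt used in the paper.

```python
FEW_SHOT_PROMPT = """\
Decompose the user's request into short robot subtasks, one per line.

Request: tidy up the workspace
Subtasks:
1. open the drawer
2. place the pink block inside the drawer
3. place the purple block inside the drawer
4. place the yellow block inside the drawer
5. close the drawer

Request: {instruction}
Subtasks:
"""


def decompose_instruction(instruction, complete):
    """Query an LLM through the user-supplied `complete` callable and parse
    the numbered subtasks it returns."""
    response = complete(FEW_SHOT_PROMPT.format(instruction=instruction))
    subtasks = []
    for line in response.splitlines():
        line = line.strip()
        if line and line[0].isdigit() and "." in line:  # keep "1. ..." style lines
            subtasks.append(line.split(".", 1)[1].strip())
    return subtasks
```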
Formally, our final robot policy is defined as a mixture:
\pi(a \mid s, l) = (1 - \alpha(s, l)) \cdot \pi_{mod}(a \mid s) + \alpha(s, l) \cdot \pi_{free}(a \mid s, l) \qquad (1)
Specifically, we use the pixel distance between the pro-
jected end-effector position I_tcp and the predicted pixel from
the affordance model I_aff to select which policy to use.
If the distance is larger than a threshold ε, the predicted
region is far from the robot's current position and we use the
model-based policy π_mod to move to the predicted location.
Otherwise, the end-effector is already near the predicted
position and we keep using the learning-based policy π_free.
Thus, we define α as:
\alpha(s, l) = \begin{cases} 0, & \text{if } \lvert I_{aff} - I_{tcp} \rvert > \epsilon \\ 1, & \text{otherwise} \end{cases} \qquad (2)
As the affordance prediction is conditioned on language,
each time the agent receives a new instruction it decides
which policy to use based on α(s, l). Restricting the
area where the model-free policy is active to the vicinity
of regions that afford human-object interactions makes the
policy more sample efficient, as it only needs to learn local
behaviors.
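The switching rule of Eqs. (1)-(2) can be summarized in a short sketch. The policy interfaces (affordance_model.predict, pi_mod.act, pi_free.act), the tcp_pixel observation field, and the 20-pixel threshold are illustrative assumptions rather than the actual HULC++ implementation.

```python
import numpy as np

EPSILON_PX = 20.0  # pixel-distance threshold epsilon (illustrative value)


def select_action(obs, instruction, affordance_model, pi_mod, pi_free):
    """Mixture policy of Eq. (1)-(2): move with the model-based policy until
    the end-effector is close to the predicted affordance region, then hand
    control over to the language-conditioned visuomotor policy."""
    i_aff, target_3d = affordance_model.predict(obs["rgb_static"], instruction)
    i_tcp = obs["tcp_pixel"]                      # projected end-effector pixel

    far_from_region = np.linalg.norm(np.asarray(i_aff) - np.asarray(i_tcp)) > EPSILON_PX
    if far_from_region:
        # alpha(s, l) = 0: approach the predicted 3D location with pi_mod.
        return pi_mod.act(obs, target=target_3d)
    # alpha(s, l) = 1: act with the learning-based policy pi_free near the region.
    return pi_free.act(obs, instruction)
```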