
than previous approaches. Unlike prior work, which relies
on costly expert demonstrations and fully annotated datasets
to learn language-conditioned agents in the real world, our
approach leverages a more scalable data collection scheme:
unstructured, reset-free, and possibly suboptimal teleoperated
play data [16]. Moreover, our approach requires annotating
as little as 1% of the total data with language. Extensive
experiments show that when paired with LLMs that translate
abstract natural language instructions into a sequence of
subgoals, HULC++ enables completing long-horizon, multi-
stage natural language instructions in the real world. Finally,
we show that our model sets a new state of the art on the
challenging CALVIN benchmark [8], on following multiple
long-horizon manipulation tasks in a row with 7-DoF control,
from high-dimensional perceptual observations, and specified
via natural language. To our knowledge, our method is the
first explicitly aiming to solve language-conditioned long-
horizon, multi-tier tasks from purely offline, reset-free and
unstructured data in the real world, while requiring as little
as 1% of language annotations.
II. RELATED WORK
There has been a growing interest in the robotics com-
munity to build language-driven robot systems [17], spurred
by the advancements in grounding language and vision [18],
[19]. Earlier works focused on localizing objects mentioned
in referring expressions [20], [21], [22], [23], [24] and
following pick-and-place instructions with predefined motion
primitives [25], [6], [26]. More recently, end-to-end learning
has been used to study the challenging problem of fusing
perception, language and control [4], [27], [28], [1], [10],
[9], [15], [5]. End-to-end learning from pixels is an attrac-
tive choice for modeling general-purpose agents due to its
flexibility, as it makes the least assumptions about objects
and tasks. However, such pixel-to-action models often suffer
from poor sample efficiency. In the area of robot manipulation,
the two extremes of the spectrum are CLIPort [6] on the
one hand, and agents like GATO [5] and BC-Z [4] on
the other: the former needs a few hundred expert
demonstrations to pick and place objects with motion
planning, while the latter require several months of expert
demonstration collection to learn visuomotor manipulation
skills for continuous control. In contrast, we lift the requirement of
collecting expert demonstrations and the corresponding need
for manually resetting the scene, to learn from unstructured,
reset-free, teleoperated play data [16]. Another orthogonal
line of work tackles data inefficiency by using pre-trained
image representations [29], [6], [30] to bootstrap downstream
task learning, which we also leverage in this work.
We propose a novel hierarchical approach that com-
bines the strengths of both paradigms to learn language-
conditioned, task-agnostic, long-horizon policies from high-
dimensional camera observations. Inspired by the line of
work that decomposes robot manipulation into semantic and
spatial pathways [12], [13], [6], we propose leveraging a
self-supervised affordance model from unstructured data that
guides the robot to the vicinity of actionable regions referred
to in language instructions. Once inside this area, we switch to
a single multi-task 7-DoF language-conditioned visuomotor
policy, also trained from offline, unstructured data.

Fig. 2: Visualization of the procedure to extract language-conditioned
visual affordances from human teleoperated, unstructured, free-form
interaction data. We leverage the gripper open/close signal during
teleoperation to project the end-effector into the camera images to
detect affordances in undirected data. (Example instruction shown:
"Move the sliding door to the right".)
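The procedure in Fig. 2 hinges on projecting the 3D end-effector (TCP) position into the camera image at gripper open/close events. Below is a minimal sketch of that projection, assuming a calibrated pinhole camera; the function name, the world-to-camera transform T_cam_world, and the intrinsics K are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def project_tcp(tcp_world, T_cam_world, K):
    # Transform the 3D TCP position from the world frame into the
    # camera frame (using homogeneous coordinates).
    p_cam = (T_cam_world @ np.append(tcp_world, 1.0))[:3]
    # Pinhole projection with camera intrinsics K -> pixel (u, v).
    uvw = K @ p_cam
    return int(round(uvw[0] / uvw[2])), int(round(uvw[1] / uvw[2]))
```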
III. METHOD
We decompose our approach into three main steps. First,
we train a language-conditioned affordance model from
unstructured, teleoperated data to predict the 3D location of
an object that affords an input language instruction (Section
III-A). Second, we leverage model-based planning to move
towards the predicted location and switch to a local language-
conditioned, learning-based policy πfree to interact with the
scene (Section III-C). Third, we show how HULC++ can
be used together with large language models (LLMs) for
decomposing abstract language instructions into a sequence
of feasible, executable subtasks (Section III-D).
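As a reading aid, here is a minimal sketch of how these three steps could compose at run time; all interfaces (llm_decompose, affordance_model, plan_and_move, pi_free, env) are hypothetical names, not the paper's code.

```python
def rollout(abstract_instruction, env, affordance_model,
            plan_and_move, pi_free, llm_decompose):
    # Step 3 (Sec. III-D): an LLM decomposes an abstract instruction
    # into a sequence of feasible, executable subtasks.
    for subtask in llm_decompose(abstract_instruction):
        obs = env.observe()
        # Step 1 (Sec. III-A): predict the 3D location that affords
        # the subtask from the current camera observation.
        target = affordance_model(obs, subtask)
        # Step 2 (Sec. III-C): model-based motion to the vicinity of
        # the predicted location ...
        obs = plan_and_move(env, target)
        # ... then the local language-conditioned policy interacts
        # with the scene until the subtask is complete.
        while not env.subtask_done(subtask):
            obs = env.step(pi_free(obs, subtask))
```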
Formally, our final robot policy is defined as a mixture:
\[
\pi(a \mid s, l) = (1 - \alpha(s, l)) \cdot \pi_{\text{mod}}(a \mid s) + \alpha(s, l) \cdot \pi_{\text{free}}(a \mid s, l) \tag{1}
\]
Specifically, we use the pixel distance between the pro-
jected end-effector position Itcp and the predicted pixel from
the affordance model Iaff to select which policy to use.
If the distance is larger than a threshold ε, the predicted
region is far from the robot's current position, so we use the
model-based policy πmod to move to the predicted location.
Otherwise, the end-effector is already near the predicted
position and we keep using the learning-based policy πfree.
Thus, we define α as:
\[
\alpha(s, l) =
\begin{cases}
0, & \text{if } |I_{\text{aff}} - I_{\text{tcp}}| > \epsilon \\
1, & \text{otherwise}
\end{cases} \tag{2}
\]
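To make the switching rule concrete, the following is a minimal sketch of Eqs. (1) and (2), assuming I_aff and I_tcp are pixel coordinates and pi_mod, pi_free are callable policies; the threshold value is illustrative, not from the paper.

```python
import numpy as np

def alpha(I_aff, I_tcp, eps=25.0):
    # Eq. (2): 1 when the projected end-effector pixel I_tcp is within
    # eps pixels of the predicted affordance pixel I_aff, else 0.
    # eps is an illustrative threshold, not a value from the paper.
    dist = np.linalg.norm(np.asarray(I_aff) - np.asarray(I_tcp))
    return 0 if dist > eps else 1

def select_action(s, l, I_aff, I_tcp, pi_mod, pi_free):
    # Eq. (1): a hard mixture that switches between the two policies.
    if alpha(I_aff, I_tcp) == 1:
        return pi_free(s, l)  # local, language-conditioned interaction
    return pi_mod(s)          # model-based motion toward the prediction
```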
As the affordance prediction is conditioned on language,
each time the agent receives a new instruction, it decides
which policy to use based on α(s, l). Restricting the area
where the model-free policy is active to the vicinity of
regions that afford human-object interactions makes it more
sample efficient, as it only needs to learn local behaviors.