Other work studies learning from diverse demonstrations. Several approaches [7, 8, 9, 22, 23] model
the multimodality of diverse demonstrations directly, usually by explicitly handling each mode
of the data. Unfortunately, these approaches fail in the sequential, interactive imitation learning
setting that we study, where we gradually introduce new modes and retrain after each collection. A
separate line of work tackles learning with suboptimal data through the use of offline reinforcement
learning [12, 13, 14], filtered imitation [11, 15, 16, 17], or inverse reinforcement learning [24, 25, 26].
These methods rely on extra reward annotations for each demonstration, and generally try to reweight or
selectively remove suboptimal data. Unlike these approaches, this work tackles the root cause of the
problem: how to collect compatible demonstrations from humans in the first place, removing the
need to filter out tens to hundreds of painstakingly collected demonstrations.
Interactive Imitation Learning.
Most of the work in interactive imitation learning attempts to
address the problem of covariate shift endemic to policies trained via behavioral cloning. Ross
et al. [1] introduce DAgger, a method for iteratively collecting demonstrations by relabeling the
states visited by a policy interactively, teaching the policy to recover. Later work builds on DAgger,
primarily focusing on reducing the number of expert labels requested during training. These variants
identify different measures, such as safety and uncertainty, for deciding when to query the expert, limiting the
amount of supervision a user needs to provide [2, 20, 27, 28, 29, 30]. The focus of these works is on
learning from a single user rather than from multiple users demonstrating heterogeneous behaviors.
We present a novel interactive imitation learning method that learns to actively
elicit demonstrations from a pool of multiple users during the course of data collection, with the
express goal of guiding users towards compatible, optimal demonstrations.
3 Problem Setting
We consider sequential decision making tasks, modeled as a Markov Decision Process (MDP) with
the following components: the state space S, action space A, transition function T, reward function
R, and discount factor γ. In this work, we assume sparse rewards only provided on task completion.
A state s = (ogrounded, oproprio) ∈ S is comprised of a grounded observation (either coordinates/poses
of objects in the environment, or an RGB visual observation) and the robot's proprioceptive state.
An action a ∈ A is a continuous vector of robot joint actions. We assume access to a small dataset
of trajectories Dbase = {τ1, τ2, . . . , τN}, where each trajectory τi is a sequence of states and actions,
τi = {(s1, a1), . . . , (sT, aT)}. We train an initial policy πbase on this dataset via behavioral cloning.
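To make the setup concrete, a minimal behavioral cloning sketch on Dbase might look as follows; the MLP policy architecture, MSE objective, and flat (state, action) dataset format are illustrative assumptions, not details specified above.

```python
import torch
import torch.nn as nn

def behavioral_cloning(demos, state_dim, action_dim, epochs=100, lr=1e-3):
    """Fit a base policy on a demonstration set via behavioral cloning.

    `demos` is a list of trajectories, each a list of (state, action) pairs
    with states and actions as flat float lists (an assumed format).
    The 2-layer MLP and MSE loss are illustrative choices.
    """
    policy = nn.Sequential(
        nn.Linear(state_dim, 256), nn.ReLU(),
        nn.Linear(256, action_dim),
    )
    states = torch.tensor([s for tau in demos for (s, a) in tau], dtype=torch.float32)
    actions = torch.tensor([a for tau in demos for (s, a) in tau], dtype=torch.float32)
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(policy(states), actions)
        loss.backward()
        optimizer.step()
    return policy  # plays the role of pi_base
```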
Our approach has two components: 1) developing a measure of the compatibility of a new demonstra-
tion with an existing base policy, and 2) building a method for actively eliciting compatible demon-
strations from users. For the first component, we learn a measure M(Dbase, (snew, anew)) ∈ [0, 1] that
defines compatibility at the granularity of a state-action pair. For the second component, we use our
fine-grained measure M to provide rich feedback to users about their demonstrations, in addition to
deciding whether to accept/reject a new demonstration τnew. The set of new demonstrations the user
collects comprises Dnew; after each collection step, we train a new policy πnew on Dbase ∪ Dnew.
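As a rough illustration of how the two components fit together, one possible collection step is sketched below. Here `measure` stands in for M, `feedback_fn` for the user-facing feedback channel, and accepting a demonstration by thresholding its mean score is an assumed rule rather than the method's actual criterion; the sketch reuses the behavioral_cloning function above.

```python
def collection_step(d_base, user_demos, measure, feedback_fn,
                    state_dim, action_dim, accept_threshold=0.5):
    """One interactive data-collection step (illustrative sketch).

    measure(d_base, s, a) -> compatibility score in [0, 1]  (plays the role of M)
    feedback_fn(tau, scores) -> surfaces per-state scores to the demonstrator
    """
    d_new = []
    for tau_new in user_demos:
        scores = [measure(d_base, s, a) for (s, a) in tau_new]
        feedback_fn(tau_new, scores)  # rich, state-level feedback to the user
        if sum(scores) / len(scores) >= accept_threshold:
            d_new.append(tau_new)     # accept the compatible demonstration
    # Retrain on D_base ∪ D_new using the behavioral cloning sketch above.
    policy_new = behavioral_cloning(d_base + d_new, state_dim, action_dim)
    return d_new, policy_new
```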
Under our definition of compatibility (and the measure M we derive), we hope to guide users to
provide new datasets Dnew such that πnew will have as high an expected return as possible and, at a
minimum, a higher expected return than πbase. In other words, our definition of compatibility asks
that new data should only help, not hurt, performance relative to the initial set.
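Stated compactly (writing J(π) for the expected discounted return, a symbol not used above, and assuming the standard objective for the MDP of Section 3), this asks that:

```latex
J(\pi) = \mathbb{E}\Big[\textstyle\sum_{t} \gamma^{t} R(s_t, a_t)\Big],
\qquad J(\pi_{\mathrm{new}}) \geq J(\pi_{\mathrm{base}}).
```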
4 Learning to Measure Compatibility in Multi-Human Demonstrations
We first derive a general compatibility measure M given a base set of demonstrations, then evaluate
our measure through a series of case studies grounded in real, user-provided demonstration data.
4.1 Estimating Compatibility and Identifying Good Demonstrations
An idealized compatibility measure M has one role: estimating the performance of a policy πbase that
is retrained on the union of a known base dataset Dbase and a new dataset Dnew. Crucially, M needs to
operate at a granular level (ideally at the level of individual state-action pairs), without incurring the
cost of retraining and evaluating πbase on the new dataset. Phrased this way, there is a clear connection
to pool-based active learning [31], informing a choice of plausible metrics that could help predict
downstream success. While many metrics could work, we choose two easy-to-compute metrics that
lend themselves well to interpretability: the likelihood of actions anew in Dnew under πbase, measured