Eliciting Compatible Demonstrations for
Multi-Human Imitation Learning
Kanishk Gandhi, Siddharth Karamcheti, Madeline Liao, Dorsa Sadigh
Department of Computer Science, Stanford University
{kanishk.gandhi, skaramcheti, madelineliao, dorsa}@stanford.edu
Abstract:
Imitation learning from human-provided demonstrations is a strong approach for learning policies for robot manipulation. While the ideal dataset for imitation learning is homogeneous and low-variance – reflecting a single, optimal method for performing a task – natural human behavior has a great deal of heterogeneity, with several optimal ways to demonstrate a task. This multimodality is inconsequential to human users, with task variations manifesting as subconscious choices; for example, reaching down, then across to grasp an object, versus reaching across, then down. Yet, this mismatch presents a problem for interactive imitation learning, where sequences of users improve on a policy by iteratively collecting new, possibly conflicting demonstrations. To combat this problem of demonstrator incompatibility, this work designs an approach for 1) measuring the compatibility of a new demonstration given a base policy, and 2) actively eliciting more compatible demonstrations from new users. Across two simulation tasks requiring long-horizon, dexterous manipulation and a real-world “food plating” task with a Franka Emika Panda arm, we show that we can both identify incompatible demonstrations via post-hoc filtering, and apply our compatibility measure to actively elicit compatible demonstrations from new users, leading to improved task success rates across simulated and real environments.¹
Keywords: Interactive Imitation Learning, Active Demonstration Elicitation, Human-Robot Interaction

¹ Additional videos & results: https://sites.google.com/view/eliciting-demos-corl22/home
6th Conference on Robot Learning (CoRL 2022), Auckland, New Zealand.
1 Introduction
Interactive imitation learning [1, 2, 3] from a pool of human demonstrators is a scalable approach for learning multi-task policies for robotic manipulation [4, 5, 6]. Yet, such approaches have a critical problem, especially in the low-to-moderate data regime: data from multiple human demonstrators often have conflicting modes, where two users provide opposing behaviors for a single task – behaviors that manifest as subconscious, random choices. For example, consider the nut-on-peg task in Fig. 1: one user (in orange) approaches the nut by moving across the table, then down, while the other user (blue) reaches down, then across.
Training on aggregated batches of data in series – starting with a base policy, adding small amounts of data from new users, and retraining the policy after each batch – is common in interactive imitation formulations [2, 3]; unfortunately, when we add a small number of conflicting demonstrations during the interaction phase, the retrained policy attempts to cover both the base demonstrations and the new set. This leads to incongruent overfitting, where a policy – even one equipped to learn multimodal behaviors [7, 8, 9] – tries to fit the base set for most of a trajectory, but overfits to the new set for a small subset of the state space, often with catastrophic failure modes.
To mitigate this problem, this work tackles two questions: 1) how can we measure the compatibility
between a new demonstrator and an existing policy, and 2) how can we use this measure to actively
elicit better demonstrations from a new user?
While our approach for measuring and eliciting compatible demonstrations during online collection is novel, prior work has studied the impact of suboptimal demonstrations on learning. Most relevant, Mandlekar et al. [10] introduce RoboMimic, a suite of simulated manipulation tasks that consist of demonstrations collected from 6 humans of varying qualities (Worse, Okay, and Better). This work and recent follow-ups [11] show how policies degrade in the presence of heterogeneous data with multimodal behaviors, finding that training imitation learning policies on data from a single demonstrator can often exceed the performance of training on equivalent amounts of data from multiple demonstrators. A related body of work on offline policy learning proposes approaches for overcoming suboptimality by using extra annotations, usually in the form of rewards or returns for each demonstration. These methods seek to reweight [12, 13, 14], or directly filter out, demonstrations [15, 16, 17]. Yet, these are all post-hoc approaches that operate after demonstrations have been collected. None of these prior approaches target the root cause: informing users of the compatibility of their demonstrations as they collect data in the first place.
Figure 1: Visualization of two users (orange and blue) inserting a nut onto a round peg. Both demonstrators approach the round nut, pick it up, and insert it onto the peg in very different ways, introducing heterogeneous data that is incompatible with the base policy and hurts performance when the policy is retrained.
In contrast, our proposed approach targets this root cause by first learning a fine-grained measure of demonstrator compatibility subject to an initial base policy, and then operationalizing this measure to inform users while they collect demonstrations. Our compatibility measure is derived by sorting individual state-action pairs in the data based on their likelihood and novelty under the base policy. By plotting each state-action pair along a 2D “map” with these two properties as the axes, we are able to learn visual boundaries separating “aligned and compatible” demonstrations (those with high likelihood and low novelty) from those that are grossly incompatible (low likelihood and low novelty) – an approach inspired by prior work in dataset interpretability [18, 19]. From these maps, we identify a scoring function and a set of thresholds to define our demonstrator compatibility measure, which we then cheaply and efficiently deploy during active data collection. Crucially, with this fine-grained measure, users with “incompatible” demonstrations are given rich feedback as to where in their demonstration the incompatibility arose, coupled with visualizations of compatible demonstrations from the base policy's training set, guiding their behavior moving forward.
Our results across two simulated tasks from prior work [10, 20] and a real-world “food-plating” task with a Franka Emika Panda arm show that our compatibility measure enables the elicitation of more compatible demonstrations – demonstrations that, when added to the base dataset, lead to a boost of up to 25% in success rate, relative to a 20% decrease in performance when collecting data naively, without feedback.
2 Related Work
Identifying & Learning from Diverse Demonstrations.
Our definition of diversity subsumes both
heterogeneity – where demonstrators capture multiple distinct, but equally optimal behaviors – and
suboptimality – where users provide demonstrations that are poor relative to some objective reward
function. On examining the effect of learning from these diverse demonstrations, Mandlekar et al.
[10]
found that imitation learning models trained on demonstrations from a single “proficient” human
achieved higher success rates than models trained on significantly larger dataset sourced from a
mixture of human demonstrators with varying optimality. Gopalan et al.
[21]
further show that
without prompting users with videos or specific subtasks, only a third of demonstrators provide com-
patible demonstrations. These works motivate our approach by showing that in naive demonstration
collection, heterogeneity and suboptimality are prevalent; instead, we need a direct way to intervene
during collection, actively inform users, and guide them to provide compatible demonstrations.
Other work studies learning from diverse demonstrations. Several approaches [7, 8, 9, 22, 23] model the multimodality of diverse demonstrations directly, usually by explicitly handling each mode of the data. Unfortunately, these approaches fail in the sequential, interactive imitation learning setting that we study, where we gradually introduce new modes and retrain after each collection. A separate line of work tackles learning with suboptimal data through the use of offline reinforcement learning [12, 13, 14], filtered imitation [11, 15, 16, 17], or inverse reinforcement learning [24, 25, 26]. These methods use extra reward annotations for each demonstration, and generally try to reweight or selectively remove suboptimal data. Unlike these approaches, this work tackles the root cause of the problem – how to collect compatible demonstrations from humans in the first place, removing the need to filter out tens to hundreds of painstakingly collected demonstrations.
Interactive Imitation Learning. Most work in interactive imitation learning attempts to address the problem of covariate shift endemic to policies trained via behavioral cloning. Ross et al. [1] introduce DAgger, a method for iteratively collecting demonstrations by relabeling the states visited by a policy interactively, teaching the policy to recover. Later work builds on DAgger, primarily focusing on reducing the number of expert labels requested during training. These variants identify different measures, such as safety and uncertainty, for deciding when to query the expert, limiting the amount of supervision a user needs to provide [2, 20, 27, 28, 29, 30]. The focus of these works is on learning from a single user rather than from multiple users demonstrating heterogeneous behaviors. Our proposed approach presents a novel interactive imitation learning method that learns to actively elicit demonstrations from a pool of multiple users during the course of data collection, with the express goal of guiding users towards compatible, optimal demonstrations.
3 Problem Setting
We consider sequential decision making tasks, modeled as a Markov Decision Process (MDP) with the following components: the state space $\mathcal{S}$, action space $\mathcal{A}$, transition function $\mathcal{T}$, reward function $\mathcal{R}$, and discount factor $\gamma$. In this work, we assume sparse rewards, provided only on task completion. A state $s = (o_{\text{grounded}}, o_{\text{proprio}}) \in \mathcal{S}$ is comprised of a grounded observation (either coordinates/poses of objects in the environment, or an RGB visual observation) and the robot's proprioceptive state. An action $a \in \mathcal{A}$ is a continuous vector of robot joint actions. We assume access to a small dataset of trajectories $\mathcal{D}_{\text{base}} = \{\tau_1, \tau_2, \ldots, \tau_N\}$, where each trajectory $\tau_i$ is a sequence of states and actions, $\tau_i = \{(s_1, a_1), \ldots, (s_T, a_T)\}$. We train an initial policy $\pi_{\text{base}}$ on this dataset via behavioral cloning.
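As a point of reference for the rest of the section, the following minimal Python/PyTorch sketch (not from the paper; the MLP architecture and hyperparameters are placeholder assumptions) shows how a base policy of this form might be fit to $\mathcal{D}_{\text{base}}$ via behavioral cloning.

```python
import torch
import torch.nn as nn


class BasePolicy(nn.Module):
    """Small MLP mapping states to continuous actions (placeholder architecture)."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Dropout(p=0.1),  # stochasticity reused later for a novelty estimate
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


def train_bc(policy: BasePolicy, states: torch.Tensor, actions: torch.Tensor,
             epochs: int = 50, lr: float = 1e-3) -> BasePolicy:
    """Behavioral cloning: regress demonstrated actions from states in D_base."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(policy(states), actions)
        loss.backward()
        optimizer.step()
    return policy
```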
Our approach has two components: 1) developing a measure of the compatibility of a new demonstration with an existing base policy, and 2) building a method for actively eliciting compatible demonstrations from users. For the first component, we learn a measure $M(\mathcal{D}_{\text{base}}, (s_{\text{new}}, a_{\text{new}})) \in [0, 1]$ that defines compatibility at the granularity of a state-action pair. For the second component, we use our fine-grained measure $M$ to provide rich feedback to users about their demonstrations, in addition to deciding whether to accept or reject a new demonstration $\tau_{\text{new}}$. The set of new demonstrations the user collects comprises $\mathcal{D}_{\text{new}}$ – after each collection step, we train a new policy $\pi_{\text{new}}$ on $\mathcal{D}_{\text{base}} \cup \mathcal{D}_{\text{new}}$.

Under our definition of compatibility (and the measure $M$ we derive), we hope to guide users to provide new datasets $\mathcal{D}_{\text{new}}$ such that $\pi_{\text{new}}$ will have as high an expected return as possible, and minimally, a higher expected return than $\pi_{\text{base}}$. In other words, our definition of compatibility asks that new data should only help, not hurt, performance relative to the initial set.
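To make the collection loop concrete, here is a minimal sketch of how a per-step compatibility measure could gate whether a new demonstration enters $\mathcal{D}_{\text{new}}$ before retraining. The acceptance rule (mean per-step compatibility above a cutoff) and the `retrain_bc` helper are hypothetical, not details specified in this section.

```python
def collect_and_retrain(d_base, candidate_demos, compatibility_fn, retrain_bc,
                        accept_threshold: float = 0.8):
    """Accept or reject candidate demonstrations by mean per-step compatibility,
    then retrain on the union of the base set and the accepted demonstrations.

    candidate_demos: list of trajectories, each a list of (state, action) pairs.
    compatibility_fn: maps a (state, action) pair to a score in [0, 1].
    retrain_bc: hypothetical helper that fits a fresh policy to a dataset.
    """
    d_new = []
    for trajectory in candidate_demos:
        scores = [compatibility_fn(s, a) for s, a in trajectory]
        if sum(scores) / len(scores) >= accept_threshold:
            d_new.append(trajectory)
        # else: surface the low-scoring steps to the user, alongside
        # compatible reference demonstrations from the base set.
    pi_new = retrain_bc(d_base + d_new)
    return pi_new, d_new
```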
4 Learning to Measure Compatibility in Multi-Human Demonstrations
We first derive a general compatibility measure $M$ given a base set of demonstrations, then evaluate our measure through a series of case studies grounded in real, user-provided demonstration data.
4.1 Estimating Compatibility and Identifying Good Demonstrations
An idealized compatibility measure $M$ has one role – estimating the performance of a policy $\pi_{\text{base}}$ that is retrained on the union of a known base dataset $\mathcal{D}_{\text{base}}$ and a new dataset $\mathcal{D}_{\text{new}}$. Crucially, $M$ needs to operate at a granular level (ideally at the level of individual state-action pairs), without incurring the cost of retraining and evaluating $\pi_{\text{base}}$ on the new dataset. Phrased this way, there is a clear connection to pool-based active learning [31], informing the choice of plausible metrics that could help predict downstream success. While many metrics could work, we choose two easy-to-compute metrics that lend themselves well to interpretability: the likelihood of actions $a_{\text{new}}$ in $\mathcal{D}_{\text{new}}$ under $\pi_{\text{base}}$, measured by the negative mean squared error between the actions predicted by $\pi_{\text{base}}(s_{\text{new}})$ and $a_{\text{new}}$, as well as the novelty of a given state, measured by the standard deviation in the predicted actions under the base policy. To turn these two metrics into a single measure, we borrow from the interpretability and active learning literature [18, 19]; we plot, or “map,” on a 2D plane the likelihood and novelty values for each trajectory in our dataset, with the goal of identifying two thresholds $(\lambda, \eta)$ on likelihood and novelty – thresholds that allow us to classify a given trajectory based on its compatibility.

Figure 2: Plots, or “maps,” of the likelihood and novelty values for demonstrations recorded by (a) the blue compatible operator and (b) the orange incompatible operator, with respect to the base policy $\pi_{\text{base}}$. The gray border indicates a filter with our chosen thresholds $(\lambda, \eta)$. The orange operator performs far more low-likelihood actions in familiar states, earning a lower compatibility score.
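As an illustration, the sketch below scores a single state-action pair with these two metrics, reusing the dropout-equipped policy from the earlier sketch. Using stochastic (MC-dropout) forward passes to estimate the standard deviation of predicted actions is our assumption; an ensemble of base policies would be a drop-in alternative.

```python
import torch


@torch.no_grad()
def likelihood_and_novelty(policy, state: torch.Tensor, action: torch.Tensor,
                           n_samples: int = 20):
    """Score one (s_new, a_new) pair against pi_base.

    likelihood: negative mean squared error between pi_base(s_new) and a_new.
    novelty:    standard deviation of predicted actions across stochastic
                forward passes (assumed MC-dropout; an ensemble works too).
    """
    policy.eval()
    likelihood = -((policy(state) - action) ** 2).mean().item()

    policy.train()  # enable dropout so repeated passes sample different actions
    samples = torch.stack([policy(state) for _ in range(n_samples)])
    novelty = samples.std(dim=0).mean().item()
    policy.eval()
    return likelihood, novelty
```

Plotting these two values for every state-action pair in a trajectory reproduces the kind of 2D “map” shown in Fig. 2, from which the thresholds are chosen.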
The final piece is setting the thresholds for novelty ($\eta$) and likelihood ($\lambda$). While this can be done in many ways – often just by looking at the “mapped” dataset on the 2D plane [18] – in this work, we assume the ability to obtain or collect a handful of a priori incompatible demonstrations that we can use to build contrast sets to regress a threshold. We define compatibility $M(\mathcal{D}_{\text{base}}, (s_{\text{new}}, a_{\text{new}})) \in [0, 1]$, parameterized by the thresholds $(\lambda, \eta)$:
$$
M = \begin{cases}
1 - \min\!\left(\dfrac{\left(\pi_{\text{base}}(s_{\text{new}}) - a_{\text{new}}\right)^2}{\lambda},\; 1\right) & \text{if } \mathrm{novelty}(s_{\text{new}}) < \eta \\
1 & \text{otherwise.}
\end{cases}
$$
A state-action pair is fully compatible ($M(\mathcal{D}_{\text{base}}, (s_{\text{new}}, a_{\text{new}})) = 1$) if the likelihood of $a_{\text{new}}$ under the base policy is high, or the novelty of $s_{\text{new}}$ is at least $\eta$; incompatibility is defined smoothly on the remaining interval, with a state-action pair being completely incompatible ($M(\mathcal{D}_{\text{base}}, (s_{\text{new}}, a_{\text{new}})) = 0$) when $(\pi_{\text{base}}(s_{\text{new}}) - a_{\text{new}})^2 \geq \lambda$. Intuitively, this measure mirrors what one expects when collecting new data – in states where the policy is confident about what actions to take, conflicting actions are rejected, while in novel states where the policy is uncertain, more diverse data is welcome. The following subsection validates this measure against demonstration data collected by real users on a series of tasks, showing how we can use $M$ as a filter to boost policy performance in the presence of a large dataset of heterogeneous demonstrations.
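A minimal sketch of the piecewise measure above, assuming a `novelty_fn` in the style of the previous sketch; averaging the squared error over action dimensions is our assumption, and the concrete threshold values would come from the contrast-set procedure described earlier.

```python
import torch


@torch.no_grad()
def compatibility(policy_base, state_new, action_new, novelty_fn,
                  lam: float, eta: float) -> float:
    """Per-step compatibility M in [0, 1].

    In a novel state (novelty >= eta) any action is accepted (M = 1). Otherwise
    M decays linearly with the squared error between the base policy's predicted
    action and the demonstrated action, reaching 0 once the error exceeds lambda.
    """
    if novelty_fn(state_new) >= eta:
        return 1.0
    sq_err = ((policy_base(state_new) - action_new) ** 2).mean().item()
    return 1.0 - min(sq_err / lam, 1.0)
```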
4.2 Case Studies and Experiments
To validate our measure $M$, we consider three tasks in simulation, each with a set of demonstrations collected by multiple operators. For each task, we choose one operator to initialize the base set of demos $\mathcal{D}_{\text{base}}$ and use trajectories from another operator to form $\mathcal{D}_{\text{new}}$. For a more detailed summary of the tasks, datasets, and policy training procedure, please consult the Supplementary Material.
Square Nut [10]: The goal of the task is to place a square nut onto a square peg. We use the environment and demonstrations collected by Mandlekar et al. [10], consisting of 200 demonstrations from a proficient operator, and 50 demonstrations each from 6 operators of varying qualities. We initialize $\mathcal{D}_{\text{base}}$ with 50 demonstrations from the proficient operator, and use demonstrations from each of 4 of the other operators as the different $\mathcal{D}_{\text{new}}$.
Round Nut [10, 20]: Similar to the prior task, the goal is to place a round nut onto a round peg (as in Fig. 1). For this task, we collected 30 demonstrations from a proficient operator and 30 demonstrations each from 3 other operators – where the data provided for one operator is taken