Extraneousness-Aware Imitation Learning
Ray Chen Zheng*1,4, Kaizhe Hu*1,4, Zhecheng Yuan1,4, Boyuan Chen3, Huazhe Xu1,2,4
Abstract: Visual imitation learning provides an effective framework to learn skills from demonstrations. However, the quality of the provided demonstrations usually significantly affects the ability of an agent to acquire desired skills. Therefore, standard visual imitation learning assumes near-optimal demonstrations, which are expensive or sometimes prohibitive to collect. Previous works propose to learn from noisy demonstrations; however, the noise is usually assumed to follow a context-independent distribution such as a uniform or Gaussian distribution. In this paper, we consider another crucial yet underexplored setting: imitation learning with task-irrelevant yet locally consistent segments in the demonstrations (e.g., wiping sweat while cutting potatoes in a cooking tutorial). We argue that such noise is common in real-world data and term these segments "extraneous". To tackle this problem, we introduce Extraneousness-Aware Imitation Learning (EIL), a self-supervised approach that learns visuomotor policies from third-person demonstrations containing extraneous subsequences. EIL learns action-conditioned observation embeddings in a self-supervised manner and retrieves task-relevant observations across visual demonstrations while excluding the extraneous ones. Experimental results show that EIL outperforms strong baselines and achieves performance comparable to policies trained with perfect demonstrations on both simulated and real-world robot control tasks. The project page can be found here: https://sites.google.com/view/eil-website.
I. INTRODUCTION
Imitation learning (IL) enables intelligent agents to acquire various skills from demonstrations [1], [2]; recent advances also extend IL to the visual domain [3], [4], [5], [6]. However, in contrast to how humans learn from demonstrations, artificial agents usually require "clean" data sampled from expert policies. Some recent works [7], [8], [9], [10] propose methods to perform imitation learning from noisy demonstrations. However, many of these methods are state-based and are limited by their requirements, such as additional labels or assumptions about the noise. Despite these efforts, real-world data may contain extraneous segments that can hardly be defined or labelled. For example, when learning to cut potatoes from videos, humans can naturally ignore some of the demonstrators' extraneous actions, such as wiping sweat halfway through. This distinction naturally raises the question: how can we leverage the rich range of unannotated visual demonstrations for imitation learning without being hindered by their noise?
In this paper, we propose Extraneousness-Aware Imita-
tion Learning (EIL) that enables agents to imitate from
noisy video demonstrations with extraneous segments. Our
*Denotes equal contribution.
1Tsinghua University, 2Shanghai AI Lab, 3Massachusetts Institute of
Technology, 4Shanghai Qi Zhi Institute
Contact: zhengrc19@mails.tsinghua.edu.cn.
method allows agents to identify extraneous subsequences via self-supervised learning and selectively perform imitation from the task-relevant parts. Specifically, we train an action-conditioned encoder with a temporal cycle-consistency (TCC) loss [11] to obtain an embedding of each observation. In this way, observations at similar stages of task progress across demonstrations obtain similar embeddings. Then, we propose an Unsupervised Voting-based Alignment (UVA) algorithm to filter task-irrelevant frames across video clips. Finally, we introduce a few tasks to benchmark the performance of imitation learning from noisy data with extraneous subsequences.
We evaluate our method on multiple visual-input robot control tasks in both simulation and the real world. The experimental results suggest that the proposed encoder produces embeddings useful for extraneousness detection. As a result, EIL outperforms various baselines and achieves performance comparable to policies trained with perfect demonstrations.
Our contributions can be summarized as follows: 1) We propose a meaningful yet underexplored setting of visual imitation learning from demonstrations with extraneous segments. 2) We introduce Extraneousness-Aware Imitation Learning (EIL), which learns selectively from the task-relevant parts by leveraging action-conditioned embeddings and alignment algorithms. 3) We introduce datasets with extraneous segments over several simulated and real-world tasks and demonstrate our method's empirical effectiveness.
II. RELATED WORKS
A. Learning from Noisy Demonstration
Imitation learning [1], [2], [12], [13] includes behavior cloning [14], [15], which aims to copy the behaviors in the demonstrations, and inverse reinforcement learning [16], which infers a reward function for learning policies. However, these methods usually assume access to expert demonstrations, which are hard to obtain in practice.
Recent works try to tackle the imitation learning problem when the demonstrations are noisy. However, within this line of research [17], [18], [19], [20], [8], [9], [10], [21], the vast majority of the works operate in the low-dimensional state space rather than the high-dimensional image space. Furthermore, it is common in previous works [21] to assume that the noise is sampled from a distribution known a priori. Methods designed specifically for such noise might fail completely when the noise violates the assumption. Recently, more attention has been drawn to learning from realistic visual demonstrations; e.g., Chen et al. propose to learn policies from "in-the-wild" videos [22]. While the method achieves impressive results, they focus on dealing with diverse demonstrations without explicitly considering the "extraneousness".
[Figure 1 (diagram): A ResNet-backbone image encoder and an action encoder produce frame embeddings; nearest-neighbor matching against a virtual reference (UVA) filters the state-action pairs with extraneousness, and the filtered pairs supervise the learned actions.]
Fig. 1: Extraneousness-Aware Imitation Learning (EIL). The overall framework contains three components: (a) it encodes the state-action pairs into representations; (b) it processes the embeddings with the unsupervised voting-based alignment (UVA) algorithm; (c) it performs visual imitation learning with the aligned state-action pairs. We note that (b) can be a simple filtering algorithm when reference trajectories are available.
B. Self-Supervised Learning from Videos and its Application
to Control and Robotics Tasks
Self-supervised learning (SSL) from videos can learn visual representations with temporal information for various downstream tasks from unlabeled data [23], [24], [25], [26], [27], [28], [29]. A recent line of research utilizes SSL for learning correspondences [30], [31], [32], [11], [33], [34]. Specifically, Dwibedi et al. propose to find correspondences across time in multiple videos with the help of cycle-consistency, where frames with similar progress are encoded to similar embeddings [11], [35], [36]. These methods offer a promising way to leverage unlabeled and noisy real-world data. In recent years, SSL has also shown promise for visuomotor tasks in control and robotics [37], [38]. For example, TCN [39] learns a self-supervised, temporally consistent embedding for imitation learning and reinforcement learning. XIRL [40] learns a self-supervised embedding that estimates task progress for inverse reinforcement learning. Other works [41], [42], [43] directly map observations such as images to the target domain. The distinction between EIL and previous works is that we tackle the problem where demonstrations contain extraneous subsequences, rather than different visual appearances, viewpoints, or embodiments.
III. METHOD
In this section, we first describe the problem setup and then
introduce Extraneousness-Aware Imitation Learning (EIL),
a simple yet effective approach for learning visuomotor
policies from videos that have extraneous subsequences.
A. Problem Statement
We consider the setting where an agent aims at learning visuomotor policies from $K$ video demonstrations $\{D_i\}_{i=1}^{K}$. In the $i$-th video, the $j$-th observation $o^i_j$ is paired up with its corresponding action $a^i_j$. For each sequence in the demonstration set, there are $L$ extraneous subsequences $\{E_n\}_{n=1}^{L}$ that are task-irrelevant yet locally consistent. In contrast to existing works that have various assumptions about the noise, our setting only assumes each video to contain more than 50% of task-relevant content [44], [45].
The imitation agent takes a high-dimensional observation $o_t$ as input and outputs an action $\hat{a}_t$ at timestep $t$. To successfully imitate from the aforementioned demonstrations, the agent needs to reason about what the task-relevant parts are and rule out the extraneous subsequences.
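To make this setting concrete, the following is a minimal sketch of how a noisy demonstration could be represented; the class and field names are illustrative assumptions, not notation from the paper.

```python
# A minimal sketch of the problem setting (names are illustrative, not from the paper).
from dataclasses import dataclass, field
from typing import List, Tuple
import numpy as np

@dataclass
class Demonstration:
    observations: np.ndarray                      # (T, H, W, C) image observations o^i_j
    actions: np.ndarray                           # (T, action_dim) paired actions a^i_j
    # Ground-truth extraneous spans are NOT available to the learner; they are
    # listed here only to express the data-generating assumption.
    extraneous_spans: List[Tuple[int, int]] = field(default_factory=list)

def task_relevant_fraction(demo: Demonstration) -> float:
    """Fraction of frames outside extraneous spans; assumed to exceed 0.5."""
    total = len(demo.observations)
    extraneous = sum(end - start for start, end in demo.extraneous_spans)
    return 1.0 - extraneous / total
```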
B. Extraneousness-Aware Imitation Learning (EIL)
1) Overview: EIL is a general framework for imitating from videos with extraneous subsequences. The intuition behind EIL is that the task-relevant parts of different demonstrations share similar semantic meaning in the latent space, and thus they can be aligned with each other. Following this intuition, when more than one demonstration sequence is given, we can match their embeddings to retrieve the task-relevant parts. In the case where a perfect reference demonstration is available, we can match frames in other sequences with those of the reference trajectory. However, in most cases, such a reference is hard to obtain. Hence, we propose an unsupervised alignment algorithm to retrieve task-relevant parts from a set of noisy demonstrations.
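When a clean reference trajectory is available, the retrieval step can be as simple as nearest-neighbor matching in the learned embedding space. Below is a minimal sketch under that assumption; the function name and the use of squared Euclidean distance are our own illustrative choices, not a specification from the paper.

```python
import numpy as np

def filter_with_reference(ref_emb: np.ndarray, demo_emb: np.ndarray) -> np.ndarray:
    """Keep the frames of a noisy demonstration matched to a clean reference.

    ref_emb:  (T_ref, d) frame embeddings of a perfect reference demonstration.
    demo_emb: (T, d)     frame embeddings of a noisy demonstration.
    """
    # Pairwise squared Euclidean distances between reference and demo embeddings.
    dists = ((ref_emb[:, None, :] - demo_emb[None, :, :]) ** 2).sum(-1)  # (T_ref, T)
    matched = dists.argmin(axis=1)       # nearest demo frame for each reference frame
    # Frames never selected (e.g. extraneous segments) are discarded.
    return np.unique(matched)            # sorted indices of task-relevant frames
```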
Figure 1 gives an overview of EIL. In Figure 1(a), we learn a temporal representation of each frame conditioned on both its visual observation and its action through the temporal cycle-consistency loss. After obtaining the representation, as shown in Figure 1(b), we propose an unsupervised voting method to perform video filtering when no perfect demonstration is available. Finally, as described in Figure 1(c), we perform standard visual imitation learning on top of the denoised data from the alignment procedure.
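The UVA algorithm itself is specified later in the paper; purely as a rough illustration of the voting intuition (our own simplification, not the authors' exact procedure), each demonstration can serve as a reference for the others and cast nearest-neighbor votes, with frames that collect enough votes kept as task-relevant.

```python
import numpy as np

def voting_filter(embeddings, keep_ratio=0.5):
    """Illustrative voting-based filtering (a simplification, not the exact UVA).

    embeddings: list of (T_i, d) arrays, one per demonstration.
    Returns a list of kept frame indices for each demonstration.
    """
    K = len(embeddings)
    kept = []
    for i, demo in enumerate(embeddings):
        votes = np.zeros(len(demo))
        for j, ref in enumerate(embeddings):
            if i == j:
                continue
            # Each frame of the "reference" demo votes for its nearest neighbor in demo i.
            dists = ((ref[:, None, :] - demo[None, :, :]) ** 2).sum(-1)  # (T_j, T_i)
            votes[np.unique(dists.argmin(axis=1))] += 1
        # Keep frames supported by at least keep_ratio of the other demonstrations.
        kept.append(np.where(votes >= keep_ratio * (K - 1))[0])
    return kept
```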
2) Action-conditioned Temporal Cycle Consistency Representation Learning: We first learn representations that encode temporal information for frame alignment across different video demonstrations. We train an image encoder $\psi_I$ and an action encoder $\psi_A$ that embed the observations and actions into corresponding features $\psi_I(o)$ and $\psi_A(a)$. Then, we concatenate $\psi_I(o)$ and $\psi_A(a)$ and pass the result through a multi-layer perceptron (MLP) $\psi_E$ to obtain the final action-conditioned embedding of each frame.
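As a sketch of this encoder and of a cycle-consistency objective it could be trained with, the following PyTorch code follows the description above; the ResNet-18 backbone, the layer sizes, and the simplified cycle-back regression loss (after Dwibedi et al. [11]) are our own illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ActionConditionedEncoder(nn.Module):
    """Sketch of the action-conditioned frame encoder (layer sizes are assumptions)."""
    def __init__(self, action_dim, emb_dim=128):
        super().__init__()
        resnet = models.resnet18(weights=None)
        self.psi_I = nn.Sequential(*list(resnet.children())[:-1])         # image encoder ψ_I
        self.psi_A = nn.Sequential(nn.Linear(action_dim, 64), nn.ReLU())  # action encoder ψ_A
        self.psi_E = nn.Sequential(nn.Linear(512 + 64, 256), nn.ReLU(),
                                   nn.Linear(256, emb_dim))               # fusion MLP ψ_E

    def forward(self, obs, act):
        feat_i = self.psi_I(obs).flatten(1)                               # (B, 512)
        feat_a = self.psi_A(act)                                          # (B, 64)
        return self.psi_E(torch.cat([feat_i, feat_a], dim=-1))            # (B, emb_dim)

def tcc_cycle_back_loss(u, v):
    """Simplified cycle-back regression form of the TCC objective.

    u: (T_u, d) embeddings of one video; v: (T_v, d) embeddings of another.
    Each frame of u is softly matched into v and cycled back into u; the
    cycled-back soft index should regress to the original frame index.
    """
    alpha = (-torch.cdist(u, v) ** 2).softmax(dim=1)         # soft nearest neighbors in v
    v_tilde = alpha @ v                                      # (T_u, d)
    beta = (-torch.cdist(v_tilde, u) ** 2).softmax(dim=1)    # cycle back into u
    idx = torch.arange(u.shape[0], dtype=u.dtype, device=u.device)
    mu = beta @ idx                                          # expected cycled-back index
    return ((mu - idx) ** 2).mean()
```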