Extraneousness-Aware Imitation Learning
Ray Chen Zheng*1,4, Kaizhe Hu*1,4, Zhecheng Yuan1,4, Boyuan Chen3, Huazhe Xu1,2,4
Abstract— Visual imitation learning provides an effective framework for learning skills from demonstrations. However, the quality of the provided demonstrations significantly affects an agent's ability to acquire the desired skills. Standard visual imitation learning therefore assumes near-optimal demonstrations, which are expensive or sometimes prohibitive to collect. Previous works propose to learn from noisy demonstrations; however, the noise is usually assumed to follow a context-independent distribution such as a uniform or Gaussian distribution. In this paper, we consider another crucial yet underexplored setting: imitation learning with task-irrelevant yet locally consistent segments in the demonstrations (e.g., wiping sweat while cutting potatoes in a cooking tutorial). We argue that such noise is common in real-world data and term these segments “extraneous”. To tackle this problem, we introduce Extraneousness-Aware Imitation Learning (EIL), a self-supervised approach that learns visuomotor policies from third-person demonstrations containing extraneous subsequences. EIL learns action-conditioned observation embeddings in a self-supervised manner and retrieves task-relevant observations across visual demonstrations while excluding the extraneous ones. Experimental results show that EIL outperforms strong baselines and learns policies comparable to those trained with perfect demonstrations on both simulated and real-world robot control tasks. The project page can be found here: https://sites.google.com/view/eil-website.
I. INTRODUCTION
Imitation learning (IL) enables intelligent agents to acquire various skills from demonstrations [1], [2]; recent advances also extend IL to the visual domain [3], [4], [5], [6]. However, in contrast to how humans learn from demonstrations, artificial agents usually require “clean” data sampled from expert policies. Some recent works [7], [8], [9], [10] propose methods for imitation learning from noisy demonstrations. However, many of these methods are state-based and are limited by requirements such as additional labels or assumptions about the noise. Despite these efforts, real-world data may contain extraneous segments that can hardly be defined or labelled. For example, when learning to cut potatoes from videos, humans can naturally ignore a demonstrator's extraneous actions, such as wiping sweat halfway through. This distinction naturally raises the question: how can we leverage the rich range of unannotated visual demonstrations for imitation learning without being hindered by their noise?
*Denotes equal contribution.
1Tsinghua University, 2Shanghai AI Lab, 3Massachusetts Institute of Technology, 4Shanghai Qi Zhi Institute
Contact: zhengrc19@mails.tsinghua.edu.cn.

In this paper, we propose Extraneousness-Aware Imitation Learning (EIL), which enables agents to imitate noisy video demonstrations that contain extraneous segments. Our method allows agents to identify extraneous subsequences via self-supervised learning and to imitate selectively from the task-relevant parts. Specifically, we train an action-conditioned encoder with a temporal cycle-consistency (TCC) loss [11] to obtain an embedding for each observation, so that observations corresponding to similar task progress across demonstrations receive similar embeddings. We then propose an Unsupervised Voting-based Alignment (UVA) algorithm that filters out task-irrelevant frames across video clips. Finally, we introduce a set of tasks to benchmark imitation learning from noisy data with extraneous segments.
We evaluate our method on multiple visual-input robot control tasks in both simulation and the real world. The experimental results suggest that the proposed encoder produces embeddings useful for extraneousness detection. As a result, EIL outperforms various baselines and achieves performance comparable to policies trained with perfect demonstrations.
Our contributions can be summarized as follows: 1) We propose a meaningful yet underexplored setting of visual imitation learning from demonstrations with extraneous segments. 2) We introduce Extraneousness-Aware Imitation Learning (EIL), which learns selectively from the task-relevant parts by leveraging action-conditioned embeddings and alignment algorithms. 3) We introduce datasets with extraneous segments for several simulated and real-world tasks and demonstrate our method's empirical effectiveness.
II. RELATED WORKS
A. Learning from Noisy Demonstrations
Imitation learning [1], [2], [12], [13] includes behavior cloning [14], [15], which aims to copy the behaviors in the demonstrations, and inverse reinforcement learning [16], which infers a reward function for learning policies. However, these methods usually assume access to expert demonstrations, which are hard to obtain in practice.
Recent works try to tackle the imitation learning problem when the demonstrations are noisy. However, in this line of research [17], [18], [19], [20], [8], [9], [10], [21], the vast majority of works operate in low-dimensional state spaces rather than the high-dimensional image space. Furthermore, it is common in previous works [21] to assume that the noise is sampled from a distribution known a priori; methods designed specifically for such noise might fail completely when the noise violates this assumption. Recently, more attention has been drawn to learning from realistic visual demonstrations, e.g., Chen et al. propose to learn policies from “in-the-wild” videos [22]. While the method achieves impressive