Extraneousness-Aware Imitation Learning
Ray Chen Zheng*1,4, Kaizhe Hu*1,4, Zhecheng Yuan1,4, Boyuan Chen3, Huazhe Xu1,2,4
Abstract: Visual imitation learning provides an effective framework to learn skills from demonstrations. However, the quality of the provided demonstrations usually significantly affects the ability of an agent to acquire desired skills. Therefore, standard visual imitation learning assumes near-optimal demonstrations, which are expensive or sometimes prohibitive to collect. Previous works propose to learn from noisy demonstrations; however, the noise is usually assumed to follow a context-independent distribution such as a uniform or Gaussian distribution. In this paper, we consider another crucial yet underexplored setting: imitation learning with task-irrelevant yet locally consistent segments in the demonstrations (e.g., wiping sweat while cutting potatoes in a cooking tutorial). We argue that such noise is common in real-world data and term these segments "extraneous". To tackle this problem, we introduce Extraneousness-Aware Imitation Learning (EIL), a self-supervised approach that learns visuomotor policies from third-person demonstrations containing extraneous subsequences. EIL learns action-conditioned observation embeddings in a self-supervised manner and retrieves task-relevant observations across visual demonstrations while excluding the extraneous ones. Experimental results show that EIL outperforms strong baselines and achieves performance comparable to policies trained with perfect demonstrations on both simulated and real-world robot control tasks. The project page can be found here: https://sites.google.com/view/eil-website.
I. INTRODUCTION
Imitation learning (IL) enables intelligent agents to acquire various skills from demonstrations [1], [2]; recent advances also extend IL to the visual domain [3], [4], [5], [6]. However, in contrast to how humans learn from demonstrations, artificial agents usually require "clean" data sampled from expert policies. Some recent works [7], [8], [9], [10] propose methods to perform imitation learning from noisy demonstrations. However, many of these methods are state-based and are limited by their requirements, such as additional labels or assumptions about the noise. Despite these efforts, real-world data may contain extraneous segments that can hardly be defined or labelled. For example, when learning to cut potatoes from videos, humans can naturally ignore some of the demonstrators' extraneous actions, such as wiping sweat halfway through. This distinction naturally raises the question: how can we leverage the rich range of unannotated visual demonstrations for imitation learning without being hindered by their noise?
In this paper, we propose Extraneousness-Aware Imita-
tion Learning (EIL) that enables agents to imitate from
noisy video demonstrations with extraneous segments. Our
*Denotes equal contribution.
1Tsinghua University, 2Shanghai AI Lab, 3Massachusetts Institute of
Technology, 4Shanghai Qi Zhi Institute
Contact: zhengrc19@mails.tsinghua.edu.cn.
method allows agents to identify extraneous subsequences via self-supervised learning and selectively perform imitation from the task-relevant parts. Specifically, we train an action-conditioned encoder with a temporal cycle-consistency (TCC) loss [11] to obtain an embedding of each observation. In this way, observations at similar stages of task progress across demonstrations obtain similar embeddings. Then, we propose an Unsupervised Voting-based Alignment (UVA) algorithm to filter task-irrelevant frames across video clips. Finally, we introduce a few tasks to benchmark the performance of imitation learning from noisy data with extraneous subsequences.
We evaluate our method on multiple visual-input robot control tasks in both simulation and the real world. The experimental results suggest that the proposed encoder produces embeddings useful for extraneousness detection. As a result, EIL outperforms various baselines and achieves performance comparable to policies trained with perfect demonstrations.
Our contributions can be summarized as follows: 1) We propose a meaningful yet underexplored setting of visual imitation learning from demonstrations with extraneous segments. 2) We introduce Extraneousness-Aware Imitation Learning (EIL), which learns selectively from the task-relevant parts by leveraging action-conditioned embeddings and alignment algorithms. 3) We introduce datasets with extraneous segments over several simulated and real-world tasks and demonstrate our method's empirical effectiveness.
II. RELATED WORKS
A. Learning from Noisy Demonstration
Imitation learning [1], [2], [12], [13] includes behavior cloning [14], [15], which aims to copy the behaviors in the demonstrations, and inverse reinforcement learning [16], which infers a reward function for learning policies. However, these methods usually assume access to expert demonstrations, which are hard to obtain in practice.
Recent works try to tackle the imitation learning problem when the demonstrations are noisy. However, within this line of research [17], [18], [19], [20], [8], [9], [10], [21], the vast majority of the works operate in the low-dimensional state space rather than the high-dimensional image space. Furthermore, it is common in previous works [21] to assume that the noise is sampled from a distribution known a priori. Methods designed specifically for such noise might fail completely when the noise violates the assumption. Recently, more attention has been drawn to learning from realistic visual demonstrations; e.g., Chen et al. propose to learn policies from "in-the-wild" videos [22]. While the method achieves impressive results, they focus on dealing with diverse demonstrations without explicitly considering the "extraneousness".
[Figure 1 (diagram): A ResNet-backbone image encoder and an action encoder produce frame embeddings; nearest-neighbor matching against a virtual reference (UVA) filters the state-action pairs with extraneousness, and the filtered pairs supervise the learned actions.]
Fig. 1: Extraneousness-Aware Imitation Learning (EIL). The overall framework contains three components: (a) it encodes the state-action pairs into representations; (b) it processes the embeddings with the unsupervised voting-based alignment (UVA) algorithm; (c) it performs visual imitation learning with the aligned state-action pairs. We note that (b) can be a simple filtering algorithm when reference trajectories are available.
B. Self-Supervised Learning from Videos and its Application
to Control and Robotics Tasks
Self-supervised learning (SSL) from videos can learn visual representations with temporal information for various downstream tasks from unlabeled data [23], [24], [25], [26], [27], [28], [29]. A recent line of research utilizes SSL for learning correspondences [30], [31], [32], [11], [33], [34]. Specifically, Dwibedi et al. propose to find correspondences across time in multiple videos with the help of cycle-consistency, where frames with similar progress are encoded to similar embeddings [11], [35], [36]. These methods offer a promising way to leverage unlabeled and noisy real-world data. In recent years, SSL has also shown promise for visuomotor tasks in control and robotics [37], [38]. For example, TCN [39] learns a self-supervised, temporally consistent embedding for imitation learning and reinforcement learning. XIRL [40] learns a self-supervised embedding that estimates task progress for inverse reinforcement learning. Other works [41], [42], [43] directly map observations such as images to the target domain. The distinction between EIL and previous works is that we tackle the problem where demonstrations contain extraneous subsequences, rather than different visual appearances, viewpoints, or embodiments.
III. METHOD
In this section, we first describe the problem setup and then
introduce Extraneousness-Aware Imitation Learning (EIL),
a simple yet effective approach for learning visuomotor
policies from videos that have extraneous subsequences.
A. Problem Statement
We consider the setting where an agent aims at learning visuomotor policies from $K$ video demonstrations $\{D_i\}_{i=1}^{K}$. In the $i$-th video, the $j$-th observation $o^i_j$ is paired up with its corresponding action $a^i_j$. For each sequence in the demonstration set, there are $L$ extraneous subsequences $\{E_n\}_{n=1}^{L}$ that are task-irrelevant yet locally consistent. In contrast to existing works that have various assumptions about the noise, our setting only assumes each video to contain more than 50% of task-relevant content [44], [45].
The imitation agent takes a high-dimensional observation $o_t$ as input and outputs an action $\hat{a}_t$ at timestep $t$. To successfully imitate from the aforementioned demonstrations, the agent needs to reason about what the task-relevant parts are and rule out the extraneous subsequences.
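To make this setting concrete, the following is a minimal sketch of how a noisy demonstration could be represented; the class and field names are illustrative assumptions, not notation from the paper.

```python
# A minimal sketch of the problem setting (names are illustrative, not from the paper).
from dataclasses import dataclass, field
from typing import List, Tuple
import numpy as np

@dataclass
class Demonstration:
    observations: np.ndarray                      # (T, H, W, C) image observations o^i_j
    actions: np.ndarray                           # (T, action_dim) paired actions a^i_j
    # Ground-truth extraneous spans are NOT available to the learner; they are
    # listed here only to express the data-generating assumption.
    extraneous_spans: List[Tuple[int, int]] = field(default_factory=list)

def task_relevant_fraction(demo: Demonstration) -> float:
    """Fraction of frames outside extraneous spans; assumed to exceed 0.5."""
    total = len(demo.observations)
    extraneous = sum(end - start for start, end in demo.extraneous_spans)
    return 1.0 - extraneous / total
```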
B. Extraneousness-Aware Imitation Learning (EIL)
1) Overview: EIL is a general framework for imitating from videos with extraneous subsequences. The intuition behind EIL is that the task-relevant parts of different demonstrations share similar semantic meaning in the latent space, and thus they can be aligned with each other. Following this intuition, when more than one demonstration sequence is given, we can match their embeddings to retrieve the task-relevant parts. In the case where a perfect reference demonstration is available, we can match frames in other sequences with those of the reference trajectory. However, in most cases, such a reference is hard to obtain. Hence, we propose an unsupervised alignment algorithm to retrieve task-relevant parts from a set of noisy demonstrations.
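When a clean reference trajectory is available, the retrieval step can be as simple as nearest-neighbor matching in the learned embedding space. Below is a minimal sketch under that assumption; the function name and the use of squared Euclidean distance are our own illustrative choices, not a specification from the paper.

```python
import numpy as np

def filter_with_reference(ref_emb: np.ndarray, demo_emb: np.ndarray) -> np.ndarray:
    """Keep the frames of a noisy demonstration matched to a clean reference.

    ref_emb:  (T_ref, d) frame embeddings of a perfect reference demonstration.
    demo_emb: (T, d)     frame embeddings of a noisy demonstration.
    """
    # Pairwise squared Euclidean distances between reference and demo embeddings.
    dists = ((ref_emb[:, None, :] - demo_emb[None, :, :]) ** 2).sum(-1)  # (T_ref, T)
    matched = dists.argmin(axis=1)       # nearest demo frame for each reference frame
    # Frames never selected (e.g. extraneous segments) are discarded.
    return np.unique(matched)            # sorted indices of task-relevant frames
```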
Figure 1 gives an overview of EIL. In Figure 1(a), we learn a temporal representation of each frame conditioned on both its visual observation and its action through the temporal cycle-consistency loss. After obtaining the representation, as shown in Figure 1(b), we propose an unsupervised voting method to perform video filtering when no perfect demonstration is available. Finally, as described in Figure 1(c), we perform standard visual imitation learning on top of the denoised data from the alignment procedure.
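The UVA algorithm itself is specified later in the paper; purely as a rough illustration of the voting intuition (our own simplification, not the authors' exact procedure), each demonstration can serve as a reference for the others and cast nearest-neighbor votes, with frames that collect enough votes kept as task-relevant.

```python
import numpy as np

def voting_filter(embeddings, keep_ratio=0.5):
    """Illustrative voting-based filtering (a simplification, not the exact UVA).

    embeddings: list of (T_i, d) arrays, one per demonstration.
    Returns a list of kept frame indices for each demonstration.
    """
    K = len(embeddings)
    kept = []
    for i, demo in enumerate(embeddings):
        votes = np.zeros(len(demo))
        for j, ref in enumerate(embeddings):
            if i == j:
                continue
            # Each frame of the "reference" demo votes for its nearest neighbor in demo i.
            dists = ((ref[:, None, :] - demo[None, :, :]) ** 2).sum(-1)  # (T_j, T_i)
            votes[np.unique(dists.argmin(axis=1))] += 1
        # Keep frames supported by at least keep_ratio of the other demonstrations.
        kept.append(np.where(votes >= keep_ratio * (K - 1))[0])
    return kept
```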
2) Action-conditioned Temporal Cycle Consistency Representation Learning: We first learn representations that encode temporal information for frame alignment across different video demonstrations. We train an image encoder $\psi_I$ and an action encoder $\psi_A$ that embed the observations and actions into corresponding features $\psi_I(o)$ and $\psi_A(a)$. Then, we concatenate $\psi_I(o)$ and $\psi_A(a)$ and pass the result through a multi-layer perceptron (MLP) $\psi_E$ to obtain the final action-conditioned embedding of each frame.
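As a sketch of this encoder and of a cycle-consistency objective it could be trained with, the following PyTorch code follows the description above; the ResNet-18 backbone, the layer sizes, and the simplified cycle-back regression loss (after Dwibedi et al. [11]) are our own illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ActionConditionedEncoder(nn.Module):
    """Sketch of the action-conditioned frame encoder (layer sizes are assumptions)."""
    def __init__(self, action_dim, emb_dim=128):
        super().__init__()
        resnet = models.resnet18(weights=None)
        self.psi_I = nn.Sequential(*list(resnet.children())[:-1])         # image encoder ψ_I
        self.psi_A = nn.Sequential(nn.Linear(action_dim, 64), nn.ReLU())  # action encoder ψ_A
        self.psi_E = nn.Sequential(nn.Linear(512 + 64, 256), nn.ReLU(),
                                   nn.Linear(256, emb_dim))               # fusion MLP ψ_E

    def forward(self, obs, act):
        feat_i = self.psi_I(obs).flatten(1)                               # (B, 512)
        feat_a = self.psi_A(act)                                          # (B, 64)
        return self.psi_E(torch.cat([feat_i, feat_a], dim=-1))            # (B, emb_dim)

def tcc_cycle_back_loss(u, v):
    """Simplified cycle-back regression form of the TCC objective.

    u: (T_u, d) embeddings of one video; v: (T_v, d) embeddings of another.
    Each frame of u is softly matched into v and cycled back into u; the
    cycled-back soft index should regress to the original frame index.
    """
    alpha = (-torch.cdist(u, v) ** 2).softmax(dim=1)         # soft nearest neighbors in v
    v_tilde = alpha @ v                                      # (T_u, d)
    beta = (-torch.cdist(v_tilde, u) ** 2).softmax(dim=1)    # cycle back into u
    idx = torch.arange(u.shape[0], dtype=u.dtype, device=u.device)
    mu = beta @ idx                                          # expected cycled-back index
    return ((mu - idx) ** 2).mean()
```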