Visual Representations for Visuomotor Policies. Various types of intermediate visual representations have been explored for visuomotor policy learning. Object bounding boxes have commonly been used as intermediate visual representations [28,16,29]. However, they require fine-tuned or category-specific detectors and cannot easily generalize to tasks with previously unknown objects. Recently, deep learning methods have enabled IL to train policies end-to-end on raw observations [2,3]. These methods are prone to covariate shift and causal confusion [6], resulting in poor generalization performance. Similar to our work, a large body of literature has looked into incorporating additional inductive biases into end-to-end policies, notably spatial attention [30–32] and affordances [33–38]. However, these representations are purposefully designed for specific motion primitives, such as pick-and-place, limiting their ability to generate diverse manipulation behaviors.
Object-Centric Representation. Object-centric representations have been widely used in visual understanding and robotics tasks, where researchers seek to reason about visual scenes in a modular way based on the objects present. In robotics, poses [39–41] or bounding boxes [28,16,29] are commonly used as object-level abstractions. These representations often require prior knowledge about object instances or categories and do not capture fine-grained details, falling short when applied to new tasks with previously unknown objects. Unsupervised object discovery methods [42–44] learn object representations without manual supervision. However, they struggle to handle complex scenes [42,43], hindering their applicability to realistic manipulation tasks. Recent work from the vision community has made significant progress in generating object proposals for various downstream tasks, such as object detection [19–21] and visual-language reasoning [45,46]. Motivated by the effectiveness of region proposal networks (RPNs) on out-of-distribution images [21], we use object proposals to scaffold our object-centric representations for robot manipulation tasks.
3 Approach
We introduce VIOLA, an object-centric imitation learning approach to vision-based manipulation. The core idea is to decompose raw visual observations into object-centric representations, on top of which the policy generates closed-loop control actions. Figure 2 illustrates the pipeline. We first formulate the problem of visuomotor policy learning and then describe the two key components of VIOLA: 1) how we build the object-centric representations based on general object proposals, and 2) how we use a transformer-based architecture to learn a policy over the object-centric representations.
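For concreteness, the following is a minimal sketch of the kind of transformer policy this pipeline suggests: object proposals and context inputs arrive as tokens, a transformer encoder aggregates them, and an action head predicts a motor command. The module names, layer sizes, and mean-pooling scheme here are illustrative assumptions rather than VIOLA's exact architecture.

```python
# Minimal sketch of a transformer policy over object-centric tokens.
# All names and sizes are illustrative assumptions, not the exact VIOLA model.
import torch
import torch.nn as nn

class ObjectCentricTransformerPolicy(nn.Module):
    def __init__(self, token_dim=256, num_heads=4, num_layers=4, action_dim=7):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=token_dim, nhead=num_heads, dim_feedforward=512, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Action head maps the pooled token embedding to a motor command.
        self.action_head = nn.Sequential(
            nn.Linear(token_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def forward(self, region_tokens, context_tokens):
        # region_tokens:  (B, K, token_dim) -- one token per object proposal
        # context_tokens: (B, C, token_dim) -- global / eye-in-hand / proprioception
        tokens = torch.cat([region_tokens, context_tokens], dim=1)
        encoded = self.encoder(tokens)    # (B, K + C, token_dim)
        pooled = encoded.mean(dim=1)      # simple pooling over all tokens
        return self.action_head(pooled)   # (B, action_dim)
```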
3.1 Problem Formulation
We formulate a robot manipulation task as a discrete-time Markov Decision Process, which is defined as a 5-tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, R, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}(\cdot \mid s, a)$ is the stochastic transition probability, $R(s, a, s')$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor. In our context, $\mathcal{S}$ is the space of the robot's raw sensory data, including RGB images and proprioception, $\mathcal{A}$ is the space of the robot's motor commands, and $\pi: \mathcal{S} \rightarrow \mathcal{A}$ is a closed-loop sensorimotor policy that we deploy on the robot to perform a task. The goal of learning a visuomotor policy for robot manipulation is to learn a policy $\pi$ that maximizes the expected return $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1})\right]$.
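For reference, the expected return above is the discounted sum of per-step rewards along a rollout. A minimal sketch of computing it for a finite trajectory is shown below; the reward list is a placeholder, and in the imitation setting that follows we never query $R$ directly.

```python
# Discounted return of a finite rollout: sum_t gamma^t * R(s_t, a_t, s_{t+1}).
# Illustrative only; rewards would come from the environment, not from data.
def discounted_return(rewards, gamma=0.99):
    g, discount = 0.0, 1.0
    for r in rewards:
        g += discount * r
        discount *= gamma
    return g
```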
In our work, we use behavioral cloning as our imitation learning algorithm. We assume access to a set of $N$ demonstrations $\mathcal{D} = \{\tau_i\}_{i=1}^{N}$, where each trajectory $\tau_i$ is demonstrated through teleoperation. The goal of our behavioral cloning approach is to learn a policy that clones the actions from the demonstrations $\mathcal{D}$.
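A minimal sketch of one behavioral cloning epoch under these assumptions is given below. The names `policy` and `demo_loader` and the squared-error loss are placeholder choices for illustration; VIOLA's actual action head and loss may differ.

```python
# One behavioral cloning epoch: regress predicted actions onto demonstrated
# actions over (s_t, a_t) pairs drawn from the demonstration set D.
import torch.nn.functional as F

def bc_epoch(policy, demo_loader, optimizer):
    for obs, expert_action in demo_loader:   # obs is whatever the policy consumes
        pred_action = policy(obs)
        loss = F.mse_loss(pred_action, expert_action)  # placeholder BC loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```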
We aim to design an object-centric representation that factorizes the visual observations of an unstructured scene into features of individual entities. For vision-based manipulation, we assume no access to the ground-truth states of objects. Instead, we use the top $K$ object proposals from a pre-trained Region Proposal Network (RPN) [18] to represent the set of object-related regions of an image. These proposals are grounded in image regions and optimized to cover the bounding boxes of potential objects. We treat these $K$ proposals as the $K$ (approximate) objects and extract visual and positional features to build region features from each proposal.
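The sketch below illustrates one way to build such region features, assuming the pre-trained RPN is exposed as a callable `rpn(image) -> (boxes, scores)` and the visual backbone as `backbone(image)`; both wrappers, the feature pooling, and the positional encoding are assumptions for illustration, not the paper's exact implementation.

```python
# Sketch: build K region features from RPN proposals by pooling a visual crop
# per proposal and appending normalized box coordinates as a positional feature.
import torch
from torchvision.ops import roi_align

def region_features(image, backbone, rpn, K=20, crop_size=7):
    feat_map = backbone(image)                    # (1, C, H', W') feature map (assumed)
    boxes, scores = rpn(image)                    # (N, 4) boxes in image coords (assumed)
    top_boxes = boxes[scores.topk(K).indices]     # keep the K highest-scoring proposals
    # Pool a fixed-size visual feature per proposal from the feature map.
    scale = feat_map.shape[-1] / image.shape[-1]  # image coords -> feature-map coords
    visual = roi_align(feat_map, [top_boxes], output_size=crop_size, spatial_scale=scale)
    visual = visual.flatten(start_dim=1)          # (K, C * crop_size**2)
    # Positional feature: box coordinates normalized by image width/height.
    H, W = image.shape[-2:]
    norm = torch.tensor([W, H, W, H], dtype=top_boxes.dtype, device=top_boxes.device)
    positional = top_boxes / norm                 # (K, 4)
    return torch.cat([visual, positional], dim=-1)  # (K, feature_dim)
```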
For manipulation tasks, reasoning about object interactions is also essential for deciding actions. To provide contextual information for such relational reasoning, we design three context features: a global context feature from the workspace image to encode the current stage of the task; an eye-in-hand visual feature from the eye-in-hand camera to mitigate occlusion and partial observability of objects; and a proprioceptive feature from the robot's states. We call the set of region features and context features at time step $t$ as the