VIOLA: Imitation Learning for Vision-Based
Manipulation with Object Proposal Priors
Yifeng Zhu1, Abhishek Joshi1, Peter Stone1,2, Yuke Zhu1
1The University of Texas at Austin 2Sony AI
Abstract: We introduce VIOLA, an object-centric imitation learning approach
to learning closed-loop visuomotor policies for robot manipulation. Our approach
constructs object-centric representations based on general object proposals from
a pre-trained vision model. VIOLA uses a transformer-based policy to reason
over these representations and attend to the task-relevant visual factors for action
prediction. Such object-based structural priors improve the robustness of deep imitation learning
algorithms against object variations and environmental perturbations.
We quantitatively evaluate VIOLA in simulation and on real robots. VIOLA out-
performs the state-of-the-art imitation learning methods by 45.8% in success rate.
It has also been deployed successfully on a physical robot to solve challenging
long-horizon tasks, such as dining table arrangement and coffee making. More
videos and model details can be found in the supplementary material and on the project
website: https://ut-austin-rpl.github.io/VIOLA.
Keywords: Imitation Learning, Manipulation, Object-Centric Representations
1 Introduction
Vision-based manipulation is a critical ability for autonomous robots to interact with everyday en-
vironments. It requires the robots to understand the unstructured world through visual perception
to determine intelligent behaviors. In recent years, deep imitation learning (IL) [1–4] has emerged
as a powerful approach to training visuomotor policies on diverse offline data, particularly human
demonstrations. Its success stems from the effectiveness of training over-parameterized neural net-
works end-to-end with supervised learning objectives. These models excel at mapping raw visual
observations to motor actions without manual engineering. While deep IL methods often distin-
guish themselves from reinforcement learning counterparts in their scalability to long-horizon tasks,
a burgeoning body of recent work has pointed out that IL methods lack robustness to covariate shifts and
environmental perturbations [5–11]. End-to-end visuomotor policies are likely to falsely associate
actions with task-irrelevant visual factors, leading to poor generalization in new situations.
In this work, we endow imitation learning algorithms with awareness about objects and their inter-
actions to improve their efficacy and robustness in vision-based manipulation tasks. As cognitive
science studies suggest, explaining a visual scene in terms of objects and their interactions helps humans
learn quickly and make accurate predictions [12–14]. Inspired by these findings, we hypothesize that
decomposing a visual scene into factorized representations of the objects in the scene would enable
robots to reason about the manipulation workspace in a modular fashion and improve their general-
ization ability. To this end, we develop an object-centric imitation learning approach, which infuses
structural object-based priors into the model architecture of visuomotor policies. Training policies
with these priors would make it easier for the model to focus on task-relevant visual cues while
discarding spurious dependencies.
The first and foremost challenge of such an object-centric approach is to determine what consti-
tutes an object and how objects are represented. The definitions of objects are often fluid and
task-dependent for manipulation tasks. This work adopts an operational notion of objects, treating
them as disentangled visual concepts that inform the robot's decision-making. Previous
works have explored learning visuomotor policies with awareness of objects, but they are limited to
simple control domains [6] and single-object manipulation [15], or require costly annotations for object
detection [16]. We are motivated by the recent advances in visual recognition, in particular, image
models for generating object proposals [17,18], i.e., localized bounding boxes on 2D images. These
object proposals capture general priors of “objectness” across appearance variations and object cate-
Figure 1: VIOLA first obtains a set of general object proposals from raw visual observations. It extracts object
features from the proposals to build the object-centric representation. The transformer-based policy uses multi-
head self-attention to reason over the representation and identify task-relevant regions for action generation.
gories. They have served as intermediate representations for downstream vision tasks, such as object
detection and instance segmentation [19–21]. In this work, we investigate using object proposals
from a pre-trained vision model as object-centric priors for visuomotor policies in manipulation and
use the object proposals as a starting point to build our object-centric representations.
We introduce VIOLA (Visuomotor Imitation via Object-centric LeArning), an object-centric imita-
tion learning model to train closed-loop visuomotor policies for robot manipulation. The high-level
overview of the method is illustrated in Figure 1. VIOLA first uses a pre-trained Region Proposal
Network (RPN) [18] to get a set of general object proposals from raw visual observations. We ex-
tract features from each proposal region to build the factorized object-centric representations of the
visual scene. These object-centric representations are converted into a set of discrete tokens and
processed by a Transformer encoder [22]. Through its multi-head self-attention mechanism, the
transformer encoder learns to focus on task-relevant regions and ignore irrelevant visual factors
when trained with supervised imitation learning objectives.
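To make this pipeline concrete, the following is a minimal sketch of how one might obtain top-K proposals from a pre-trained RPN and pool per-region features, assuming a torchvision Faster R-CNN backbone as a stand-in for the pre-trained vision model; the function name, the choice of K, and the RoIAlign pooling are illustrative assumptions rather than the exact implementation used in VIOLA.

```python
# Minimal sketch (not the authors' released code): obtain top-K class-agnostic
# object proposals from a pre-trained RPN and pool a fixed-size feature per region.
# The torchvision backbone, K=20, and RoIAlign pooling are illustrative assumptions.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

detector = fasterrcnn_resnet50_fpn(pretrained=True).eval()

@torch.no_grad()
def extract_region_features(image: torch.Tensor, K: int = 20):
    """image: (3, H, W) float tensor in [0, 1].
    Returns (K, D) flattened region features and (K, 4) proposal boxes."""
    # Resize/normalize the image the way the detector expects.
    images, _ = detector.transform([image])
    # Shared multi-scale feature maps from the FPN backbone.
    features = detector.backbone(images.tensors)
    # Class-agnostic box proposals from the Region Proposal Network.
    proposals, _ = detector.rpn(images, features)
    boxes = proposals[0][:K]  # keep the top-K proposals
    # Pool a fixed-size (7x7) feature map for each proposal region.
    pooled = detector.roi_heads.box_roi_pool(features, [boxes], images.image_sizes)
    return pooled.flatten(start_dim=1), boxes
```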
We compare VIOLA against state-of-the-art deep imitation learning methods for vision-based ma-
nipulation in simulation and on a real robot. We use simulation to systematically evaluate the poli-
cies’ performances and generalization abilities in the canonical setting (i.e., testing in the training
distribution) and under three challenging variations, including initial object placements, presence of
distracting objects, and camera pose perturbations. Our quantitative evaluations show that VIOLA
outperforms the most competitive baseline by 45.8% in success rate. When visual variations such
as jittered camera views are introduced, VIOLA maintains robust, precise grasping and manipulation
behaviors, while end-to-end learning baselines often fail to reach the target objects. VIOLA
also produces visuomotor policies to solve three challenging real-world tasks with a small set of 50
demonstrations, including a multi-stage coffee-making task where VIOLA achieves 60% success
rate while baseline methods fail entirely.
Our contributions with VIOLA are three-fold: 1) We learn object-centric representations based
on general object proposals and design a transformer-based policy that determines task-relevant
proposals to generate the robot’s actions; 2) We show that VIOLA outperforms state-of-the-art
baselines in simulation and validate the effectiveness of our model designs through ablative studies;
and 3) We show that VIOLA learns policies on a real robot to complete challenging tasks.
2 Related Work
Imitation Learning (IL) for Manipulation. IL has been an established paradigm for acquiring
manipulation policies for decades. It can be roughly categorized into non-parametric and parametric
approaches. Non-parametric approaches, such as DMPs and ProMPs, can effectively acquire manipula-
tion behaviors from a small number of demonstrations [23–26]. However, they typically focus on
open-loop trajectory generation and fall short in handling high-dimensional observations. Paramet-
ric approaches, especially neural networks, have shown promise in vision-based manipulation. Nev-
ertheless, these approaches are susceptible to distributional shifts and observation noise [1–4,27].
Object-centric priors have been previously explored in imitation learning policies to overcome the
issues above [15,16]. However, these previous works either focus on the manipulation of single
object instances or require costly annotations for pre-training object detectors. Based on the same
conceptual idea as previous object-centric imitation learning approaches, VIOLA uses a pre-trained RPN to
introduce object proposals as object-centric priors into end-to-end IL policies, improving
their robustness to visual variations and enabling them to solve tasks that involve complicated interactions with
multiple objects.
Visual Representations for Visuomotor Policies. Various types of intermediate visual rep-
resentations have been explored for visuomotor policy learning. Object bounding boxes have
been commonly used as intermediate visual representations [28,16,29]. However, they require
fine-tuned or category-specific detectors and cannot easily generalize to tasks with previously
unknown objects. Recently, deep learning methods have enabled IL to train policies end-to-end
on raw observations [2,3]. These methods are prone to covariate shift and causal confusion [6],
resulting in poor generalization performance. Similar to our work, a large body of literature has
looked into incorporating additional inductive biases into end-to-end policies. Notable ones include
spatial attention [30–32] and affordances [33–38]. However, these representations are purposefully
designed for specific motion primitives, such as pick-and-place, limiting their abilities to generate
diverse manipulation behaviors.
Object-Centric Representation. Object-centric representations have been widely used in visual
understanding and robotics tasks, where researchers seek to reason about visual scenes in a modu-
lar way based on the objects present. In robotics, poses [39–41] or bounding boxes [28,16,29]
are commonly used as object-level abstractions. These representations often require prior knowl-
edge about object instances or categories and do not capture fine-grained details, falling short when
applied to new tasks with previously unseen objects. Unsupervised object discovery methods [42–44]
learn object representations without manual supervision. However, they fall short in handling com-
plex scenes [42,43], hindering their applicability to realistic manipulation tasks. Recent work from
the vision community has made significant progress in generating object proposals for various down-
stream tasks, such as object detection [19–21] and vision-language reasoning [45,46]. Motivated
by the effectiveness of region proposal networks (RPNs) on out-of-distribution images [21], we use
object proposals to scaffold our object-centric representations for robot manipulation tasks.
3 Approach
We introduce VIOLA, an object-centric imitation learning approach to vision-based manipulation.
The core idea is to decompose raw visual observations into object-centric representations, on top of
which the policy generates closed-loop control actions. Figure 2 illustrates the pipeline. We first
formulate the problem of visuomotor policy learning and describe two key components of VIOLA:
1) how we build the object-centric representations based on general object proposals, and 2) how we
use a transformer-based architecture to learn a policy over the object-centric representations.
3.1 Problem Formulation
We formulate a robot manipulation task as a discrete-time Markov Decision Process, which is defined as a 5-tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, R, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}(\cdot \mid s, a)$ is the stochastic transition probability, $R(s, a, s')$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor. In our context, $\mathcal{S}$ is the space of the robot's raw sensory data, including RGB images and proprioception, $\mathcal{A}$ is the space of the robot's motor commands, and $\pi: \mathcal{S} \rightarrow \mathcal{A}$ is a closed-loop sensorimotor policy that we deploy on the robot to perform a task. The goal of learning a visuomotor policy for robot manipulation is to find a policy $\pi$ that maximizes the expected return $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1})\right]$.
In our work, we use behavioral cloning as our imitation learning algorithm. We assume access to a set of $N$ demonstrations $D = \{\tau_i\}_{i=1}^{N}$, where each trajectory $\tau_i$ is demonstrated through teleoperation. The goal of our behavioral cloning approach is to learn a policy that clones the actions from the demonstrations $D$.
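As a concrete but hedged illustration of this behavioral cloning objective, the snippet below fits a policy to demonstrated actions with a simple supervised regression loss; the `policy` callable, the batch format, and the MSE loss are assumptions for illustration, and the actual action head and loss used by VIOLA may differ.

```python
# Minimal behavioral cloning sketch: regress demonstrated actions from observations.
# `policy`, the dataset format, and the MSE loss are illustrative assumptions; the
# paper's policy may use a different action head and a likelihood-based loss.
import torch
import torch.nn.functional as F

def bc_update(policy, optimizer, batch):
    """batch['obs']: observation tensor(s); batch['actions']: demonstrated actions."""
    pred_actions = policy(batch["obs"])                 # pi_theta(s_t)
    loss = F.mse_loss(pred_actions, batch["actions"])   # clone the demonstrator
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```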
We aim to design an object-centric representation that factorizes the visual observations of an un-
structured scene into features of individual entities. For vision-based manipulation, we assume no
access to the ground-truth states of objects. Instead, we use the top $K$ object proposals from a
pre-trained Region Proposal Network (RPN) [18] to represent the set of object-related regions of an
image. These proposals are grounded on image regions and optimized for covering the bounding
boxes of potential objects. We treat these $K$ proposals as $K$ (approximate) objects and extract
visual and positional features to build region features from each proposal. For manipulation tasks,
reasoning about object interactions is also essential for deciding actions. To provide contextual informa-
tion for such relational reasoning, we design three context features: a global context feature from the
workspace image to encode the current stage of the task; an eye-in-hand visual feature from the eye-
in-hand camera to mitigate occlusion and partial observability of objects; and a proprioceptive fea-
ture from the robot's states. We call the set of region features and context features at time step $t$ the object-centric representation.
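As a rough sketch of how such a representation could feed a transformer-based policy, the snippet below projects the $K$ region features and the three context features into a shared token space and reasons over them with multi-head self-attention; the embedding sizes, projection layers, and the deterministic action head are illustrative assumptions, not the exact architecture described later in the paper.

```python
# Illustrative sketch of the object-centric policy: project K region features and
# three context features (global, eye-in-hand, proprioception) into a shared token
# space and reason over them with a transformer encoder. Dimensions are assumptions.
import torch
import torch.nn as nn

class ObjectCentricPolicy(nn.Module):
    def __init__(self, region_dim, global_dim, hand_dim, proprio_dim,
                 embed_dim=256, action_dim=7, num_layers=4, num_heads=4):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, embed_dim)
        self.global_proj = nn.Linear(global_dim, embed_dim)
        self.hand_proj = nn.Linear(hand_dim, embed_dim)
        self.proprio_proj = nn.Linear(proprio_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.action_head = nn.Linear(embed_dim, action_dim)  # simple deterministic head

    def forward(self, region_feats, global_feat, hand_feat, proprio):
        # region_feats: (B, K, region_dim); the rest are (B, dim) context features.
        tokens = torch.cat([
            self.region_proj(region_feats),                # K object tokens
            self.global_proj(global_feat).unsqueeze(1),    # workspace context token
            self.hand_proj(hand_feat).unsqueeze(1),        # eye-in-hand token
            self.proprio_proj(proprio).unsqueeze(1),       # proprioception token
        ], dim=1)
        encoded = self.encoder(tokens)                     # multi-head self-attention
        # Pool over tokens and predict the next robot command.
        return self.action_head(encoded.mean(dim=1))
```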