Visual Representations for Visuomotor Policies. Various types of intermediate visual representations have been explored for visuomotor policy learning. Object bounding boxes have commonly been used as intermediate visual representations [28,16,29]. However, they require fine-tuned or category-specific detectors and cannot easily generalize to tasks with previously unknown objects. Recently, deep learning methods have enabled IL to train policies end-to-end on raw observations [2,3]. These methods are prone to covariate shift and causal confusion [6], resulting in poor generalization performance. Similar to our work, a large body of literature has looked into incorporating additional inductive biases into end-to-end policies, notably spatial attention [30–32] and affordances [33–38]. However, these representations are purposefully designed for specific motion primitives, such as pick-and-place, limiting their ability to generate diverse manipulation behaviors.
Object-Centric Representation. Object-centric representations have been widely used in visual understanding and robotics tasks, where researchers seek to reason about visual scenes in a modular way based on the objects present. In robotics, poses [39–41] or bounding boxes [28,16,29] are commonly used as object-level abstractions. These representations often require prior knowledge about object instances or categories and do not capture fine-grained details, falling short when applied to new tasks with previously unknown objects. Unsupervised object discovery methods [42–44] learn object representations without manual supervision. However, they struggle to handle complex scenes [42,43], hindering their applicability to realistic manipulation tasks. Recent work from the vision community has made significant progress in generating object proposals for various downstream tasks, such as object detection [19–21] and visual-language reasoning [45,46]. Motivated by the effectiveness of region proposal networks (RPNs) on out-of-distribution images [21], we use object proposals to scaffold our object-centric representations for robot manipulation tasks.
3 Approach
We introduce VIOLA, an object-centric imitation learning approach to vision-based manipulation. The core idea is to decompose raw visual observations into object-centric representations, on top of which the policy generates closed-loop control actions. Figure 2 illustrates the pipeline. We first formulate the problem of visuomotor policy learning and then describe the two key components of VIOLA: 1) how we build the object-centric representations based on general object proposals, and 2) how we use a transformer-based architecture to learn a policy over the object-centric representations.
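For concreteness, the following is a minimal sketch of the kind of transformer policy this pipeline suggests: object proposals and context inputs arrive as tokens, a transformer encoder aggregates them, and an action head predicts a motor command. The module names, layer sizes, and mean-pooling scheme here are illustrative assumptions rather than VIOLA's exact architecture.

```python
# Minimal sketch of a transformer policy over object-centric tokens.
# All names and sizes are illustrative assumptions, not the exact VIOLA model.
import torch
import torch.nn as nn

class ObjectCentricTransformerPolicy(nn.Module):
    def __init__(self, token_dim=256, num_heads=4, num_layers=4, action_dim=7):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=token_dim, nhead=num_heads, dim_feedforward=512, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Action head maps the pooled token embedding to a motor command.
        self.action_head = nn.Sequential(
            nn.Linear(token_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def forward(self, region_tokens, context_tokens):
        # region_tokens:  (B, K, token_dim) -- one token per object proposal
        # context_tokens: (B, C, token_dim) -- global / eye-in-hand / proprioception
        tokens = torch.cat([region_tokens, context_tokens], dim=1)
        encoded = self.encoder(tokens)    # (B, K + C, token_dim)
        pooled = encoded.mean(dim=1)      # simple pooling over all tokens
        return self.action_head(pooled)   # (B, action_dim)
```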
3.1 Problem Formulation
We formulate a robot manipulation task as a discrete-time Markov Decision Process, which is defined as a 5-tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, R, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}(\cdot \mid s, a)$ is the stochastic transition probability, $R(s, a, s')$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor. In our context, $\mathcal{S}$ is the space of the robot's raw sensory data, including RGB images and proprioception, $\mathcal{A}$ is the space of the robot's motor commands, and $\pi: \mathcal{S} \rightarrow \mathcal{A}$ is a closed-loop sensorimotor policy that we deploy on the robot to perform a task. The goal of learning a visuomotor policy for robot manipulation is to learn a policy $\pi$ that maximizes the expected return $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1})\right]$.
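For reference, the expected return above is the discounted sum of per-step rewards along a rollout. A minimal sketch of computing it for a finite trajectory is shown below; the reward list is a placeholder, and in the imitation setting that follows we never query $R$ directly.

```python
# Discounted return of a finite rollout: sum_t gamma^t * R(s_t, a_t, s_{t+1}).
# Illustrative only; rewards would come from the environment, not from data.
def discounted_return(rewards, gamma=0.99):
    g, discount = 0.0, 1.0
    for r in rewards:
        g += discount * r
        discount *= gamma
    return g
```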
In our work, we use behavioral cloning as our imitation learning algorithm. We assume access to a set of $N$ demonstrations $\mathcal{D} = \{\tau_i\}_{i=1}^{N}$, where each trajectory $\tau_i$ is demonstrated through teleoperation. The goal of our behavioral cloning approach is to learn a policy that clones the actions from the demonstrations $\mathcal{D}$.
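A minimal sketch of one behavioral cloning epoch under these assumptions is given below. The names `policy` and `demo_loader` and the squared-error loss are placeholder choices for illustration; VIOLA's actual action head and loss may differ.

```python
# One behavioral cloning epoch: regress predicted actions onto demonstrated
# actions over (s_t, a_t) pairs drawn from the demonstration set D.
import torch.nn.functional as F

def bc_epoch(policy, demo_loader, optimizer):
    for obs, expert_action in demo_loader:   # obs is whatever the policy consumes
        pred_action = policy(obs)
        loss = F.mse_loss(pred_action, expert_action)  # placeholder BC loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```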
We aim to design an object-centric representation that factorizes the visual observations of an unstructured scene into features of individual entities. For vision-based manipulation, we assume no access to the ground-truth states of objects. Instead, we use the top $K$ object proposals from a pre-trained Region Proposal Network (RPN) [18] to represent the set of object-related regions of an image. These proposals are grounded in image regions and optimized to cover the bounding boxes of potential objects. We treat these $K$ proposals as the $K$ (approximate) objects and extract visual and positional features to build region features from each proposal.
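The sketch below illustrates one way to build such region features, assuming the pre-trained RPN is exposed as a callable `rpn(image) -> (boxes, scores)` and the visual backbone as `backbone(image)`; both wrappers, the feature pooling, and the positional encoding are assumptions for illustration, not the paper's exact implementation.

```python
# Sketch: build K region features from RPN proposals by pooling a visual crop
# per proposal and appending normalized box coordinates as a positional feature.
import torch
from torchvision.ops import roi_align

def region_features(image, backbone, rpn, K=20, crop_size=7):
    feat_map = backbone(image)                    # (1, C, H', W') feature map (assumed)
    boxes, scores = rpn(image)                    # (N, 4) boxes in image coords (assumed)
    top_boxes = boxes[scores.topk(K).indices]     # keep the K highest-scoring proposals
    # Pool a fixed-size visual feature per proposal from the feature map.
    scale = feat_map.shape[-1] / image.shape[-1]  # image coords -> feature-map coords
    visual = roi_align(feat_map, [top_boxes], output_size=crop_size, spatial_scale=scale)
    visual = visual.flatten(start_dim=1)          # (K, C * crop_size**2)
    # Positional feature: box coordinates normalized by image width/height.
    H, W = image.shape[-2:]
    norm = torch.tensor([W, H, W, H], dtype=top_boxes.dtype, device=top_boxes.device)
    positional = top_boxes / norm                 # (K, 4)
    return torch.cat([visual, positional], dim=-1)  # (K, feature_dim)
```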
For manipulation tasks, reasoning about object interactions is also essential for deciding actions. To provide contextual information for such relational reasoning, we design three context features: a global context feature from the workspace image to encode the current stage of the task; an eye-in-hand visual feature from the eye-in-hand camera to mitigate occlusion and partial observability of objects; and a proprioceptive feature from the robot's states. We call the set of region features and context features at time step $t$ as the