an intelligent way, but learning when actions lead to failures
and updating mental representations of the world to reflect
this information [10]. We argue that robotic systems that
can learn how to manipulate articulated objects should be
designed using similar principles.
In this work, we propose “Hypothesize, Simulate, Act,
Update, and Repeat” (H-SAUR), an exploration strategy that
allows an agent to figure out the underlying articulation
mechanism of man-made objects from a handful of actions.
At the core of our model is a probabilistic generative model
that generates hypotheses of how articulated objects might
deform given an action. Given a kinematic object, our model
first generates several hypothetical articulation configurations
of the object from 3D point clouds segmented by object parts.
Our model then evaluates the likelihood of each hypothesis
through analysis-by-synthesis – the proposed model simulates
objects representative of each hypothetical configuration, us-
ing a physics engine to predict likely outcomes given an action.
The virtual simulation helps resolve three critical components
in this interactive perception setup: (1) deciding real-world
exploratory actions that might produce meaningful outcomes,
(2) reducing uncertainty over beliefs after observing the action-
outcome pairs from real-world interactions, (3) generating
actions that will lead to successful execution of a given
task after fully figuring out the articulation mechanism. The
contributions of this paper can be summarized as follows:
1) We propose a novel exploration algorithm for efficient exploration and manipulation of puzzle boxes and articulated objects, by integrating the power of probabilistic generative models and forward simulation. Our model explicitly captures the uncertainty over articulation hypotheses.
2) We compare H-SAUR against existing state-of-the-art methods and show that it outperforms them at operating unknown articulated objects, despite requiring far fewer interactions with the object of interest.
3) We propose a new manipulation benchmark, PuzzleBoxes, which consists of locked boxes that require multi-step sequential actions to unlock and open, in order to test the ability to explore and manipulate complex articulated objects.
II. RELATED WORK
Kinematic Structure Estimation.
A natural first step toward manipulating an object is to predict its articulation mechanism. Li et al. [11] and Wang et al. [12] proposed models that segment object point clouds into independently moving parts and articulated joints. However, these models require part and articulation annotations and thus do not generalize
to unexpected articulation mechanisms. Prior work addresses this by visually parsing articulated objects under motion [13], [14], [15], [16], [17], [18]. Yet most of this work assumes the objects are articulated manually by humans or by scripted robot actions. In this paper, we study how an agent can jointly infer the articulation mechanism of an object and the exploratory actions that help reveal it, i.e., an interactive perception setup [19]. Niekum et al. [20] address a similar setup, but only handle articulated objects with a single joint and assume the robot knows where to apply forces. Kulick et al. [21] and Baum et al. [22] handle dependencies between joints, but assume each joint is either locked or unlocked, which is ambiguous for general kinematic objects.
H-SAUR takes raw point clouds and part segmentations
as inputs, and infers both the joint structure of the object
and how to act. This model can handle articulated objects
with an arbitrary number of joints and joint dependencies
by leveraging off-the-shelf physics simulation for general
physical constraint reasoning.
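A hypothesis of this kind, covering an arbitrary number of joints and dependencies between them, can be captured in a small data structure. The sketch below is our own illustration, not the paper's code; all field and function names are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class JointHypothesis:
    """One hypothesized joint between two object parts."""
    joint_type: str             # "revolute", "prismatic", or "fixed"
    axis: tuple                 # joint axis direction in the object frame
    origin: tuple               # a point on the joint axis
    limits: tuple = (0.0, 1.0)  # lower/upper motion limits
    blocked_by: list = field(default_factory=list)  # parts that must move first

@dataclass
class ArticulationHypothesis:
    """A full articulation hypothesis: one joint per movable part."""
    joints: dict                # part id -> JointHypothesis
    belief: float = 1.0         # current probability weight

def is_movable(hyp, part_id, opened_parts):
    """A part can move if its joint is not fixed and every blocking
    part (e.g., a latch over a lid) has already been moved."""
    j = hyp.joints[part_id]
    return j.joint_type != "fixed" and all(p in opened_parts for p in j.blocked_by)
```

The `blocked_by` list is one simple way to encode joint dependencies such as a latch that locks a lid; the physics engine used in the paper would resolve such constraints implicitly during simulation.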
Model-free approaches for manipulating articulated objects.
Instead of explicitly inferring the articulation mechanism, recent works in deep RL learn to generate plausible object manipulation actions from point clouds [23], [3], [5], RGB(-D) images [4], [1], [2], or the full 3D state of the objects and their segments [24], [25], [26]. While most of these RL approaches learn through explicit rewards, recent approaches have learned to manipulate objects in a self-supervised manner, through self-driven goals or imitation learning [27], [28]. However, all of these systems require
a large number of interactions during training and cannot
discover hidden mechanisms that are only revealed through
test-time exploratory behaviors. Furthermore, while they focus
on training-time exploration, our work focuses on test-time exploration, where only a small number of interactions is permitted.
III. METHOD
We consider the task of estimating the kinematic structure of an unknown articulated object and using the estimate for efficient manipulation. We are particularly interested in
manipulating a visually ambiguous object, e.g., a closed
door that can be opened by pulling, pushing, sliding, etc. In
such a situation, the agent needs to estimate its underlying
kinematic configuration, and update its beliefs over different
configurations based on the outcome of past failed actions.
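One simple way to realize this belief update is Bayesian reweighting of hypotheses by how well their simulated predictions match the observed outcome. The sketch below is our illustration under a Gaussian discrepancy assumption; the function names and noise scale are ours, not the paper's:

```python
import numpy as np

def likelihood(sim_outcome, real_outcome, sigma=0.05):
    """Score a hypothesis by how closely its simulated outcome
    matches the observed one (Gaussian discrepancy model)."""
    err = np.linalg.norm(np.asarray(sim_outcome) - np.asarray(real_outcome))
    return np.exp(-0.5 * (err / sigma) ** 2)

def update_beliefs(beliefs, sim_outcomes, real_outcome):
    """One Bayesian update over articulation hypotheses after
    observing the real-world outcome of an action."""
    posterior = np.array([b * likelihood(s, real_outcome)
                          for b, s in zip(beliefs, sim_outcomes)])
    return posterior / posterior.sum()
```

For example, starting from a uniform belief over a "pushable" and a "pullable" hypothesis, an observed motion that matches only the first hypothesis's simulated prediction concentrates nearly all posterior mass on it.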
We propose “Hypothesize, Simulate, Act, Update, and
Repeat” (H-SAUR), a physics-aware generative model that
represents an articulated object manipulation scene in terms of
3D shapes of object parts, articulation joint types and positions
of each part, actions to apply on the object, and the change to
the object after applying the actions. In this work, we assume
access to a physics engine that can take as input the 3D
meshes (estimated from a point cloud) of a target unknown
object with an estimated kinematic configuration, and produce
hypothetical simulated articulations of this object when
kinematically acted upon. The method consists of three parts.
First, we instantiate a number of hypothetical configurations of the target object by sampling articulation structures from
a prior distribution. The prior distribution can be uniform or
from learned vision models. Second, we sample one of the
hypotheses to generate an action that is expected to provide
evidence for or against that hypothesis. Finally, we apply the
optimal action to the target object and update beliefs about
object joints based on the outcome.
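The three steps above can be sketched as one loop. This is a minimal illustration under strong simplifications: `simulate` stands in for the physics engine, outcomes are scalars, and the action is chosen to maximize belief-weighted disagreement among hypotheses, a cheap proxy for the expected evidence for or against them rather than a full information-gain criterion:

```python
import numpy as np

def hsaur_loop(hypotheses, actions, simulate, execute, n_steps=5):
    """Hypothesize, Simulate, Act, Update, and Repeat (sketch).

    hypotheses: candidate articulation structures sampled from a prior
    actions:    candidate exploratory actions
    simulate:   (hypothesis, action) -> predicted outcome (physics engine)
    execute:    action -> observed real-world outcome
    """
    beliefs = np.full(len(hypotheses), 1.0 / len(hypotheses))
    for _ in range(n_steps):
        # Simulate: predict the outcome of each action under each hypothesis.
        preds = np.array([[simulate(h, a) for h in hypotheses] for a in actions])
        # Act: pick the action whose predicted outcomes disagree the most
        # under the current beliefs, i.e., the most discriminating action.
        scores = [np.average((p - np.average(p, weights=beliefs)) ** 2,
                             weights=beliefs) for p in preds]
        a_idx = int(np.argmax(scores))
        outcome = execute(actions[a_idx])
        # Update: reweight each hypothesis by how well it predicted the outcome.
        lik = np.exp(-0.5 * ((preds[a_idx] - outcome) / 0.05) ** 2)
        beliefs = beliefs * lik
        beliefs = beliefs / beliefs.sum()
    return beliefs
```

In practice the paper's pipeline would regenerate hypotheses and actions from the observed point cloud at each step; here both sets are fixed for brevity.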