H-SAUR: Hypothesize, Simulate, Act, Update, and Repeat
for Understanding Object Articulations from Interactions
Kei Ota1,2, Hsiao-Yu Tung3, Kevin A. Smith3, Anoop Cherian4,
Tim K. Marks4, Alan Sullivan4, Asako Kanezaki2, and Joshua B. Tenenbaum3
[Fig. 1 graphic: observation and segmentation mask → segmented pointcloud; panels for hypothetical configurations (Sec. 3.1), the generative model (Sec. 3.1, 3.2), action selection (Sec. 3.2), and posterior inference (Sec. 3.3); hypotheses are weighted by comparing simulated outcomes with the real-world outcome.]
Fig. 1: Overview of our “Hypothesize, Simulate, Act, Update, and Repeat” (H-SAUR) framework. We consider the task of estimating the kinematic structure of an unknown articulated object and using that structure to manipulate the object efficiently. Left: A generative model produces several hypothetical configurations given point cloud segments and simulates possible actions that maximally deform a sampled configuration. Right: After applying an action and observing the outcome, posterior inference is performed with the same generative model by simulating the action on each hypothesis and updating the posterior distribution. We repeat the process until convergence.
Abstract— The world is filled with articulated objects whose use is difficult to determine from vision alone, e.g., a door might open inwards or outwards. Humans handle these objects with strategic trial-and-error: first pushing a door, then pulling if that does not work. We enable these capabilities in autonomous agents by proposing “Hypothesize, Simulate, Act, Update, and Repeat” (H-SAUR), a probabilistic generative framework that simultaneously generates a distribution of hypotheses about how objects articulate given input observations, tracks its certainty over those hypotheses over time, and infers plausible actions for exploration and goal-conditioned manipulation. We compare our model with existing work on manipulating objects after a handful of exploration actions, on the PartNet-Mobility dataset. We further propose a novel PuzzleBoxes benchmark that contains locked boxes that require multiple steps to solve. We show that the proposed model significantly outperforms the current state-of-the-art articulated object manipulation framework, despite using zero training data. We further improve the test-time efficiency of H-SAUR by integrating a learned prior from learning-based vision models.
1Kei Ota is with the Information Technology R&D Center, Mitsubishi Electric Corporation, Japan. Ota.Kei@ds.MitsubishiElectric.co.jp
2Kei Ota and Asako Kanezaki are with the Tokyo Institute of Technology, Japan.
3Hsiao-Yu Tung, Kevin A. Smith, and Joshua B. Tenenbaum are with the Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA.
4Anoop Cherian, Tim K. Marks, and Alan Sullivan are with Mitsubishi Electric Research Labs, Cambridge, MA, USA.
I. INTRODUCTION
Every day we are surrounded by a number of articulated
objects that require specific interactions to use: our laptops
can be opened or shut, windows can be raised or lowered,
and drawers can be pulled out or pushed back in. A robot
designed to function in real-world contexts should thus be
able to understand and interact with these articulated objects.
Recent advances in deep reinforcement learning (RL) have
focused on this problem and enabled robots to manipulate
articulated objects such as drawers and doors [1], [2], [3], [4].
However, these systems typically produce fixed actions based
on observations of a scene, and thus, when the articulated
joint is ambiguous (e.g., a door that slides or swings), they
cannot adapt their policies in response to failed actions. While
some systems attempt to adjust policies during test-time
exploration to recover from failure modes [5], [6], they only
propose local action adjustments (pull harder or run faster)
and so are insufficient in cases where dramatically different
strategies need to be applied, e.g., from “sliding the window”
to “pushing the window outward from the bottom.”
In contrast, humans and many other animals can quickly
figure out how to manipulate complex articulated man-made
objects, e.g., puzzle boxes, with very little training [7], [8], [9]. These capabilities are thought to be supported by rapid,
strategic trial-and-error learning – interacting with objects in
an intelligent way, but learning when actions lead to failures
and updating mental representations of the world to reflect
this information [10]. We argue that robotic systems that
can learn how to manipulate articulated objects should be
designed using similar principles.
In this work, we propose “Hypothesize, Simulate, Act,
Update, and Repeat” (H-SAUR), an exploration strategy that
allows an agent to figure out the underlying articulation
mechanism of man-made objects from a handful of actions.
At the core of our model is a probabilistic generative model
that generates hypotheses of how articulated objects might
deform given an action. Given a kinematic object, our model
first generates several hypothetical articulation configurations
of the object from 3D point clouds segmented by object parts.
Our model then evaluates the likelihood of each hypothesis
through analysis-by-synthesis – the proposed model simulates
objects representative of each hypothetical configuration, using a physics engine to predict likely outcomes given an action. The virtual simulation helps resolve three critical components in this interactive perception setup: (1) deciding real-world exploratory actions that might produce meaningful outcomes, (2) reducing uncertainty over beliefs after observing the action-outcome pairs from real-world interactions, and (3) generating actions that will lead to successful execution of a given task after fully figuring out the articulation mechanism. The
contributions of this paper can be summarized as follows:
1) We propose a novel algorithm for efficient exploration and manipulation of puzzle boxes and articulated objects, which integrates the power of probabilistic generative models and forward simulation. Our model explicitly captures the uncertainty over articulation hypotheses.
2) We compare H-SAUR against existing state-of-the-art methods, and show it outperforms them at manipulating unknown articulated objects, despite requiring many fewer interactions with the object of interest.
3) We propose a new manipulation benchmark – PuzzleBoxes – which consists of locked boxes that require multi-step sequential actions to unlock and open, in order to test the ability to explore and manipulate complex articulated objects.
II. RELATED WORK
Kinematic Structure Estimation. A natural first step in manipulating an object is to predict its articulation mechanism. Li et al. [11] and Wang et al. [12] proposed
models to segment object point clouds into independently
moving parts and articulated joints. However, this requires
part and articulation annotations, and thus does not generalize
to unexpected articulation mechanisms. Previous work addresses this by proposing to visually parse articulated objects under motion [13], [14], [15], [16], [17], [18]. Yet, most of this work assumes the objects are manually articulated by humans or through scripted actions from the robot. In this paper, we study how an agent can jointly infer the articulation mechanism and exploratory actions that help to reveal the articulation of an object, i.e., in an interactive perception setup [19]. Niekum et al. [20] address a similar setup, but only handle articulated objects with a single joint and assume the robot knows where to apply forces. Kulick et al. [21] and Baum et al. [22] handle joints with dependencies but assume each joint is either locked or unlocked, which is ambiguous for general kinematic objects.
H-SAUR takes raw point clouds and part segmentations
as inputs, and infers both the joint structure of the object
and how to act. This model can handle articulated objects
with an arbitrary number of joints and joint dependencies
by leveraging off-the-shelf physics simulation for general
physical constraint reasoning.
Model-free approaches for manipulating articulated objects.
Instead of explicitly inferring the articulation mechanism, recent works in deep RL learn to generate plausible object manipulation actions from pointclouds [23], [3], [5], RGB(-D) images [4], [1], [2], or the full 3D state of the objects and their segments [24], [25], [26]. While most of
these RL approaches learn through explicit rewards, recent
approaches have learned to manipulate objects in a self-
supervised manner, through self-driven goals or imitation
learning [27], [28]. However, all of these systems require
a large number of interactions during training and cannot
discover hidden mechanisms that are only revealed through
test-time exploratory behaviors. Moreover, whereas these methods focus on training-time exploration, our work focuses on test-time exploration, where only a small number of interactions is permitted.
III. METHOD
We consider the task of estimating the kinematic structure of an unknown articulated object and using that estimate for efficient manipulation. We are particularly interested in
manipulating a visually ambiguous object, e.g., a closed
door that can be opened by pulling, pushing, sliding, etc. In
such a situation, the agent needs to estimate its underlying
kinematic configuration, and update its beliefs over different
configurations based on the outcome of past failed actions.
We propose “Hypothesize, Simulate, Act, Update, and
Repeat” (H-SAUR), a physics-aware generative model that
represents an articulated object manipulation scene in terms of
3D shapes of object parts, articulation joint types and positions
of each part, actions to apply on the object, and the change to
the object after applying the actions. In this work, we assume access to a physics engine that can take as input 3D meshes (estimated from a point cloud) of an unknown target object with an estimated kinematic configuration, and produce hypothetical simulated articulations of this object when kinematically acted upon. The method consists of three parts. First, we initialize a number of hypothetical configurations that
imitate a target object by sampling articulation structures from
a prior distribution. The prior distribution can be uniform or
from learned vision models. Second, we sample one of the
hypotheses to generate an action that is expected to provide
evidence for or against that hypothesis. Finally, we apply the
optimal action to the target object and update beliefs about
object joints based on the outcome.
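To make this loop concrete, here is a minimal, schematic sketch of the H-SAUR cycle in Python. It is not the authors' implementation: `sample_prior`, `propose_action`, `simulate`, `chamfer_weight`, and `execute_on_robot` are hypothetical placeholders standing in for the components detailed in Sections III-A through III-C and for the real robot interface.

```python
import numpy as np

def h_saur_loop(parts, sample_prior, propose_action, simulate, chamfer_weight,
                execute_on_robot, n_particles=100, n_steps=10, seed=0):
    """Schematic Hypothesize-Simulate-Act-Update-Repeat loop.

    All callables are placeholders for the components of Sec. III-A to III-C:
      sample_prior(parts)              -> one articulation hypothesis (particle)
      propose_action(hypothesis)       -> action expected to maximally deform the object
      simulate(hypothesis, action)     -> predicted pointcloud after the action
      chamfer_weight(real, predicted)  -> likelihood weight of the hypothesis
      execute_on_robot(action)         -> pointcloud observed on the real object
    """
    rng = np.random.default_rng(seed)

    # Hypothesize: sample articulation configurations from the prior.
    particles = [sample_prior(parts) for _ in range(n_particles)]
    weights = np.full(n_particles, 1.0 / n_particles)

    for _ in range(n_steps):
        # Simulate: pick one hypothesis and search for an informative action.
        k = rng.choice(n_particles, p=weights)
        action = propose_action(particles[k])

        # Act: apply the action on the real object and observe the outcome.
        observed = execute_on_robot(action)

        # Update: re-weight every hypothesis by how well its simulated outcome
        # matches the real outcome (analysis-by-synthesis).
        likelihood = np.array([chamfer_weight(observed, simulate(p, action))
                               for p in particles])
        weights = weights * likelihood
        weights = weights / weights.sum()

        # Repeat until one hypothesis clearly dominates (illustrative stopping rule).
        if weights.max() > 0.95:
            break
    return particles[int(np.argmax(weights))], weights
```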
A. Generating Hypothetical Articulated Objects
Given the observed pointcloud $O$ of a target object along with its part segmentation $m$, we generate a number of kinematic replicas of the object. Since the true articulation mechanism is initially unknown, we generate these replicas by sampling different kinematic structures from uniform prior distributions over joint types and parameters.
Object Parts. From the observed pointcloud $O$ and segmentation masks $m_1, m_2, \cdots, m_{N_v}$, where $N_v$ is the number of available views, we can break the pointcloud into part-centric pointclouds $O_1, O_2, \cdots, O_{N_p}$, where $N_p$ is the total number of object parts.
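As a small illustration of this step, the sketch below splits a fused pointcloud into part-centric pointclouds, assuming the per-view masks have already been fused into one integer part label per point; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def split_into_parts(points, part_labels):
    """Split an (N, 3) pointcloud O into part-centric pointclouds O_1..O_Np.

    points      : (N, 3) array of 3D points.
    part_labels : (N,) integer array; label i marks points belonging to part i
                  (obtained by fusing the per-view segmentation masks m_1..m_Nv).
    """
    part_ids = np.unique(part_labels)
    return {int(i): points[part_labels == i] for i in part_ids}

# Example: 1000 random points assigned to 3 parts.
pts = np.random.rand(1000, 3)
labels = np.random.randint(0, 3, size=1000)
parts = split_into_parts(pts, labels)      # {0: O_1, 1: O_2, 2: O_3}
```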
Articulation Joints. Each object part is attached to a base of the object with a joint. We consider the three most common types of articulation joints: revolute (r), prismatic (p), and fixed (f). For revolute and prismatic joints, we further generate possible joint axes and positions, using the tight bounding boxes fitted to the part-centric pointclouds, to obtain a total of $J$ possible joints. The $j$-th joint is denoted as $\theta^{(j)} = (c, d)$, where $c \in \{r, p, f\}$ is the joint type and $d \in \mathbb{R}^6$ is the 6-DoF pose of the joint axis. The prior distribution $p(\theta^{(j)})$ over the joint type is assumed to be uniform at $t = 0$. One can also use a learned prior from vision models that predict joint types. In addition, most articulated joints have lower and upper limits on how much the joint can be deformed. We denote these limits as $\theta_{low}$ and $\theta_{high}$; their priors are uniform over $[-\theta_{MAX}, 0]$ and $[0, \theta_{MAX}]$, respectively. The full state of the joint for object part $O_i$ is $s_i = (\theta^{(\sigma(i))}, \theta_{low_i}, \theta_{high_i}, \theta_{cur_i})$, where $\sigma(i) \in \{1, 2, \cdots, J\}$ is the joint configuration for the $i$-th object part, and $\theta_{cur_i}$ is the joint position at the current time step. The prior over all the latent variables is:
$$p(s_{1:N_p}) = \prod_{i=1}^{N_p} p\big(\theta^{(\sigma(i))}\big)\, p_{\mathrm{unif}[-\theta_{MAX},\,0]}\big(\theta_{low_i}\big)\, p_{\mathrm{unif}[0,\,\theta_{MAX}]}\big(\theta_{high_i}\big). \tag{1}$$
We approximate the distribution by maintaining a particle pool $\mathcal{S}$, where each particle in the pool represents a particular setup of the articulation configuration.
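The following sketch shows how one such particle could be drawn from the uniform prior of Eq. (1), assuming candidate joint poses have already been derived from the part bounding boxes. The dictionary layout, `THETA_MAX` value, and function names are illustrative assumptions, not the paper's data structures.

```python
import numpy as np

THETA_MAX = np.pi / 2                      # illustrative joint-limit bound
JOINT_TYPES = ["revolute", "prismatic", "fixed"]

def sample_hypothesis(candidate_axes, rng):
    """Sample one articulation hypothesis s_1..s_Np from the uniform prior of Eq. (1).

    candidate_axes : list with one entry per object part; each entry is a list of
                     candidate 6-DoF joint poses d derived from that part's bounding box.
    """
    hypothesis = []
    for axes in candidate_axes:
        joint_type = str(rng.choice(JOINT_TYPES))         # c ~ Uniform{r, p, f}
        axis = np.asarray(axes[rng.integers(len(axes))])  # one candidate 6-DoF pose d
        theta_low = rng.uniform(-THETA_MAX, 0.0)          # lower limit ~ Uniform[-THETA_MAX, 0]
        theta_high = rng.uniform(0.0, THETA_MAX)          # upper limit ~ Uniform[0, THETA_MAX]
        hypothesis.append({"type": joint_type, "axis": axis,
                           "low": theta_low, "high": theta_high, "cur": 0.0})
    return hypothesis

# Particle pool S: e.g. two parts, each with one or two candidate joint poses.
rng = np.random.default_rng(0)
candidate_axes = [[np.zeros(6), np.ones(6)], [np.zeros(6)]]
pool = [sample_hypothesis(candidate_axes, rng) for _ in range(100)]
```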
B. Simulating and Selecting Informative Action
We utilize virtual simulations to generate an optimal
action that reduces the uncertainty of joint configuration
hypotheses. Yet, computing the optimal action that maximizes
the information gain involves an integral over all latent variables, which is intractable. One can approximate this with a sampling-based method [19]. However, the high computational requirements still prohibit the agent from solving the task within a reasonable time. We address this by using only a single particle to make a noisy approximation of the optimal action.
We sample a joint configuration from the set of particles, $s^{(k)} \sim \mathcal{S}$, and obtain the optimal action by simulating different actions on the object with the physics simulation. The action $a_t = (p_t, r_t) \in \mathbb{R}^6$ is represented as a 3D point $p_t \in \mathbb{R}^3$ on the object and the direction $r_t \in \mathbb{R}^3$ in which to apply force. The optimal action is defined as the action that maximally deforms the object or a target object part over a single step. For multi-part objects, we maintain a list of parts-of-interest, which we will introduce shortly, and we sample a target part from the list to act on. We measure how much an object part $i$ deforms by $d_i = \|\theta_{cur_i}^{t+1} - \theta_{cur_i}^{t}\|$. Although one can naively sample a huge number of actions and pick the best action through simulation, we found this can be extremely inefficient for large object parts. To improve inference speed, we instead treat the action inference as a particle filtering problem: we initialize a number of action proposals by randomly sampling 3D locations on the target point cloud and assigning random directions in which to apply force, then we use the measured deformation $d_j$ of each proposal as its likelihood to update the posterior distribution of the particles. We add noise to the actions while resampling the particles from previous iterations. We repeat this process three times and finally sample a particle from the pool to obtain the action $a^*$.¹ We found the inferred action $a^*$ is often close to the oracle optimal action that maximizes $d_i$.
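A rough sketch of this action-selection filter under one sampled hypothesis is given below. The physics rollout is abstracted as a caller-supplied `deformation(point, direction)` function returning the simulated deformation for a candidate push, since the simulator interface is not specified here; the particle counts and noise scale are illustrative.

```python
import numpy as np

def select_action(part_points, deformation, n_particles=100, n_rounds=3,
                  noise=0.01, rng=None):
    """Particle-filter search for an action a = (p, r) that maximally deforms a part.

    part_points : (N, 3) pointcloud of the target part (candidate contact points).
    deformation : callable(point, direction) -> simulated deformation d >= 0.
    """
    if rng is None:
        rng = np.random.default_rng()
    part_points = np.asarray(part_points)

    # Initialise proposals: random contact points and random unit force directions.
    idx = rng.integers(len(part_points), size=n_particles)
    points = part_points[idx]
    dirs = rng.normal(size=(n_particles, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

    for round_idx in range(n_rounds):
        if round_idx > 0:
            # Resample proposals in proportion to their weights, then perturb them
            # (perturbed points may drift slightly off the surface in this sketch).
            keep = rng.choice(n_particles, size=n_particles, p=weights)
            points = points[keep] + rng.normal(scale=noise, size=(n_particles, 3))
            dirs = dirs[keep] + rng.normal(scale=noise, size=(n_particles, 3))
            dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
        # Likelihood of each proposal = deformation it causes in simulation.
        d = np.array([deformation(p, r) for p, r in zip(points, dirs)])
        weights = (d + 1e-8) / (d + 1e-8).sum()

    # Sample one proposal from the final pool as the action a*.
    best = rng.choice(n_particles, p=weights)
    return points[best], dirs[best]
```

Resampling in proportion to the simulated deformation concentrates proposals on contact points and directions that actually move the part, which is what makes this search much cheaper than exhaustive sampling.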
The probabilistic formulation of an articulation mechanism
given past observation and action is
$$p(s_t \mid O_{1:t-1}, a_{1:t-1}) = \int \underbrace{p(s_t \mid s_{t-1}, a_{t-1})}_{\text{forward dynamics}}\; \underbrace{p(s_{t-1} \mid O_{1:t-1}, a_{1:t-1})}_{\text{obtained through recursion}} \, ds_{t-1}, \tag{2}$$
where the first term is handled by the physics engine via
forward simulation, and the second is initialized with the prior
defined in Eq. (1) and can be obtained through recursion.
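For readers tracking the particle filter (this step is implicit in the text rather than written out): under the particle approximation, $p(s_{t-1} \mid O_{1:t-1}, a_{1:t-1}) \approx \sum_{k=1}^{K} w_k\, \delta(s_{t-1} - s^{(k)}_{t-1})$, so the integral in Eq. (2) reduces to propagating each hypothesis through the simulator, i.e., the standard particle-filter prediction step:

$$p(s_t \mid O_{1:t-1}, a_{1:t-1}) \;\approx\; \sum_{k=1}^{K} w_k\, p\big(s_t \mid s^{(k)}_{t-1}, a_{t-1}\big) \;\approx\; \sum_{k=1}^{K} w_k\, \delta\big(s_t - \mathrm{sim}(s^{(k)}_{t-1}, a_{t-1})\big),$$

where $\mathrm{sim}(\cdot)$ denotes a (near-)deterministic rollout of the physics engine.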
C. Updating Hypotheses through Analysis-by-Synthesis
We apply the inferred action $a^*$ on the target object $O_t$ to observe the outcome $O_{t+1}$. We then update the probability of each hypothesis through analysis-by-synthesis: we first apply the same action $a^*$ on all the "imagined" objects, $s \in \mathcal{S}$, in the physics engine. After applying the action, we obtain $\hat{O}^{(k)}_{t+1}$ for each particle $s^{(k)}$. We define the likelihood of the particle $s^{(k)}$ as $w_k = \frac{1}{\mathrm{dist}(O_{t+1}, \hat{O}^{(k)}_{t+1}) + \epsilon}$, where $\mathrm{dist}(o_1, o_2) = \frac{1}{|o_1|}\sum_{x \in o_1} \min_{y \in o_2} \|x - y\|_2^2$ is the chamfer distance between two point clouds $o_1$ and $o_2$.
The overall updated posterior is:
$$p(s_t \mid O_{1:t}, a_{1:t-1}) \propto p(O_t \mid s_t)\, p(s_t \mid O_{1:t-1}, a_{1:t-1}) = \sum_{k=1}^{K} w_k\, p(s_t \mid O_{1:t-1}, a_{1:t-1}), \tag{3}$$
where the second term can be computed from Eq. (2), and
the whole inference is implemented through particle filtering
with weighted sampling.
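A small sketch of this update step is given below, assuming the simulated outcomes $\hat{O}^{(k)}_{t+1}$ are already available as point arrays; the brute-force chamfer computation and the small constant standing in for the unspecified $\epsilon$ are illustrative choices, not the authors' implementation.

```python
import numpy as np

def chamfer(o1, o2):
    """One-sided chamfer distance dist(o1, o2) between (N,3) and (M,3) pointclouds."""
    diff = o1[:, None, :] - o2[None, :, :]   # (N, M, 3) pairwise differences
    sq = np.sum(diff ** 2, axis=-1)          # (N, M) squared distances
    return np.mean(np.min(sq, axis=1))       # mean nearest-neighbour distance

def update_weights(observed, predicted, prior_weights, eps=1e-6):
    """Re-weight hypotheses by how well their simulated outcome matches reality (Eq. 3).

    observed      : (N, 3) pointcloud O_{t+1} seen after acting on the real object.
    predicted     : list of (M_k, 3) pointclouds, one simulated outcome per particle.
    prior_weights : (K,) weights carried over from the previous step.
    """
    w = np.array([1.0 / (chamfer(observed, o_hat) + eps) for o_hat in predicted])
    w = w * prior_weights                    # combine likelihood with the recursive prior
    return w / w.sum()
```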
D. Handling Joints with Dependency in Goal-Conditioned Manipulation
A real puzzle box often consists of joints with dependencies, e.g., one lock needs to be opened first in order to operate another lock. Randomly selecting a part to act on is ineffective and
¹We found the particle filter (PF) generates a nearly optimal action 1,500 times faster than an oracle optimal action generated by exhaustive search (ES). Comparing the deformations caused by the two, we found that PF with 100 particles almost always generates the same action ($d_i^{PF}/d_i^{ES} = 0.995$).