H-SAUR: Hypothesize, Simulate, Act, Update, and Repeat
for Understanding Object Articulations from Interactions
Kei Ota1,2, Hsiao-Yu Tung3, Kevin A. Smith3, Anoop Cherian4,
Tim K. Marks4, Alan Sullivan4, Asako Kanezaki2, and Joshua B. Tenenbaum3
[Fig. 1 graphic: observation and segmentation mask → segmented pointcloud; panels for hypothetical configurations (Sec. 3.1), the generative model (Sec. 3.1, 3.2), action selection (Sec. 3.2), and posterior inference (Sec. 3.3); hypotheses are weighted by comparing simulated outcomes with the real-world outcome.]
Fig. 1: Overview of our “Hypothesize, Simulate, Act, Update, and Repeat” (H-SAUR) framework. We consider the task of estimating the kinematic structure of an unknown articulated object and using that structure to manipulate the object efficiently. Left: A generative model produces several hypothetical configurations given point cloud segments and simulates possible actions that maximally deform a sampled configuration. Right: After applying an action and observing the outcome, posterior inference is performed with the same generative model by simulating the action on each hypothesis and updating the posterior distribution. We repeat the process until convergence.
Abstract— The world is filled with articulated objects whose use is difficult to determine from vision alone, e.g., a door might open inwards or outwards. Humans handle these objects with strategic trial-and-error: first pushing a door, then pulling if that does not work. We enable these capabilities in autonomous agents by proposing “Hypothesize, Simulate, Act, Update, and Repeat” (H-SAUR), a probabilistic generative framework that simultaneously generates a distribution of hypotheses about how objects articulate given input observations, tracks its certainty over those hypotheses over time, and infers plausible actions for exploration and goal-conditioned manipulation. We compare our model with existing work on manipulating objects after a handful of exploration actions, on the PartNet-Mobility dataset. We further propose a novel PuzzleBoxes benchmark that contains locked boxes that require multiple steps to solve. We show that the proposed model significantly outperforms the current state-of-the-art articulated object manipulation framework, despite using zero training data. We further improve the test-time efficiency of H-SAUR by integrating a learned prior from learning-based vision models.
1Kei Ota is with the Information Technology R&D Center, Mitsubishi Electric Corporation, Japan. Ota.Kei@ds.MitsubishiElectric.co.jp
2Kei Ota and Asako Kanezaki are with the Tokyo Institute of Technology, Japan.
3Hsiao-Yu Tung, Kevin A. Smith, and Joshua B. Tenenbaum are with the Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, USA.
4Anoop Cherian, Tim K. Marks, and Alan Sullivan are with Mitsubishi Electric Research Labs, Cambridge, MA, USA.
I. INTRODUCTION
Every day we are surrounded by a number of articulated
objects that require specific interactions to use: our laptops
can be opened or shut, windows can be raised or lowered,
and drawers can be pulled out or pushed back in. A robot
designed to function in real-world contexts should thus be
able to understand and interact with these articulated objects.
Recent advances in deep reinforcement learning (RL) have
focused on this problem and enabled robots to manipulate
articulated objects such as drawers and doors [1], [2], [3], [4].
However, these systems typically produce fixed actions based
on observations of a scene, and thus, when the articulated
joint is ambiguous (e.g., a door that slides or swings), they
cannot adapt their policies in response to failed actions. While
some systems attempt to adjust policies during test-time
exploration to recover from failure modes [5], [6], they only
propose local action adjustments (pull harder or run faster)
and so are insufficient in cases where dramatically different
strategies need to be applied, e.g., from “sliding the window”
to “pushing the window outward from the bottom.”
In contrast, humans and many other animals can quickly
figure out how to manipulate complex articulated man-made
objects, e.g., puzzle boxes, with very little training [7], [8], [9]. These capabilities are thought to be supported by rapid,
strategic trial-and-error learning – interacting with objects in
an intelligent way, but learning when actions lead to failures
and updating mental representations of the world to reflect
this information [10]. We argue that robotic systems that
can learn how to manipulate articulated objects should be
designed using similar principles.
In this work, we propose “Hypothesize, Simulate, Act,
Update, and Repeat” (H-SAUR), an exploration strategy that
allows an agent to figure out the underlying articulation
mechanism of man-made objects from a handful of actions.
At the core of our model is a probabilistic generative model
that generates hypotheses of how articulated objects might
deform given an action. Given a kinematic object, our model
first generates several hypothetical articulation configurations
of the object from 3D point clouds segmented by object parts.
Our model then evaluates the likelihood of each hypothesis
through analysis-by-synthesis – the proposed model simulates
objects representative of each hypothetical configuration, using a physics engine to predict likely outcomes given an action. The virtual simulation helps resolve three critical components in this interactive perception setup: (1) deciding real-world exploratory actions that might produce meaningful outcomes, (2) reducing uncertainty over beliefs after observing the action-outcome pairs from real-world interactions, and (3) generating actions that will lead to successful execution of a given task after fully figuring out the articulation mechanism. The
contributions of this paper can be summarized as follows:
1) We propose a novel algorithm for efficient exploration and manipulation of puzzle boxes and articulated objects, which integrates the power of probabilistic generative models and forward simulation. Our model explicitly captures the uncertainty over articulation hypotheses.
2) We compare H-SAUR against existing state-of-the-art methods, and show it outperforms them at manipulating unknown articulated objects, despite requiring many fewer interactions with the object of interest.
3) We propose a new manipulation benchmark – PuzzleBoxes – which consists of locked boxes that require multi-step sequential actions to unlock and open, in order to test the ability to explore and manipulate complex articulated objects.
II. RELATED WORK
Kinematic Structure Estimation. A natural first step in manipulating an object is to predict its articulation mechanism. Li et al. [11] and Wang et al. [12] proposed
models to segment object point clouds into independently
moving parts and articulated joints. However, this requires
part and articulation annotations, and thus does not generalize
to unexpected articulation mechanisms. Previous work addresses this by proposing to visually parse articulated objects under motion [13], [14], [15], [16], [17], [18]. Yet, most of this work assumes the objects are manually articulated by humans or through scripted actions from the robot. In this paper, we study how an agent can jointly infer the articulation mechanism and exploratory actions that help to reveal the articulation of an object, i.e., in an interactive perception setup [19]. Niekum et al. [20] address a similar setup, but only handle articulated objects with a single joint and assume the robot knows where to apply forces. Kulick et al. [21] and Baum et al. [22] handle joints with dependencies but assume each joint is either locked or unlocked, which is ambiguous for general kinematic objects.
H-SAUR takes raw point clouds and part segmentations
as inputs, and infers both the joint structure of the object
and how to act. This model can handle articulated objects
with an arbitrary number of joints and joint dependencies
by leveraging off-the-shelf physics simulation for general
physical constraint reasoning.
Model-free approaches for manipulating articulated objects.
Instead of explicitly inferring the articulation mechanism, recent works in deep RL learn to generate plausible object manipulation actions from pointclouds [23], [3], [5], RGB(-D) images [4], [1], [2], or the full 3D state of the objects and their segments [24], [25], [26]. While most of
these RL approaches learn through explicit rewards, recent
approaches have learned to manipulate objects in a self-
supervised manner, through self-driven goals or imitation
learning [27], [28]. However, all of these systems require
a large number of interactions during training and cannot
discover hidden mechanisms that are only revealed through
test-time exploratory behaviors. Moreover, whereas these methods focus on training-time exploration, our work focuses on test-time exploration, where only a small number of interactions is permitted.
III. METHOD
We consider the task of estimating the kinematic structure of an unknown articulated object and using that estimate for efficient manipulation. We are particularly interested in
manipulating a visually ambiguous object, e.g., a closed
door that can be opened by pulling, pushing, sliding, etc. In
such a situation, the agent needs to estimate its underlying
kinematic configuration, and update its beliefs over different
configurations based on the outcome of past failed actions.
We propose “Hypothesize, Simulate, Act, Update, and
Repeat” (H-SAUR), a physics-aware generative model that
represents an articulated object manipulation scene in terms of
3D shapes of object parts, articulation joint types and positions
of each part, actions to apply on the object, and the change to
the object after applying the actions. In this work, we assume access to a physics engine that can take as input 3D meshes (estimated from a point cloud) of an unknown target object with an estimated kinematic configuration, and produce hypothetical simulated articulations of this object when kinematically acted upon. The method consists of three parts. First, we initialize a number of hypothetical configurations that
imitate a target object by sampling articulation structures from
a prior distribution. The prior distribution can be uniform or
from learned vision models. Second, we sample one of the
hypotheses to generate an action that is expected to provide
evidence for or against that hypothesis. Finally, we apply the
optimal action to the target object and update beliefs about
object joints based on the outcome.
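To make this loop concrete, here is a minimal, schematic sketch of the H-SAUR cycle in Python. It is not the authors' implementation: `sample_prior`, `propose_action`, `simulate`, `chamfer_weight`, and `execute_on_robot` are hypothetical placeholders standing in for the components detailed in Sections III-A through III-C and for the real robot interface.

```python
import numpy as np

def h_saur_loop(parts, sample_prior, propose_action, simulate, chamfer_weight,
                execute_on_robot, n_particles=100, n_steps=10, seed=0):
    """Schematic Hypothesize-Simulate-Act-Update-Repeat loop.

    All callables are placeholders for the components of Sec. III-A to III-C:
      sample_prior(parts)              -> one articulation hypothesis (particle)
      propose_action(hypothesis)       -> action expected to maximally deform the object
      simulate(hypothesis, action)     -> predicted pointcloud after the action
      chamfer_weight(real, predicted)  -> likelihood weight of the hypothesis
      execute_on_robot(action)         -> pointcloud observed on the real object
    """
    rng = np.random.default_rng(seed)

    # Hypothesize: sample articulation configurations from the prior.
    particles = [sample_prior(parts) for _ in range(n_particles)]
    weights = np.full(n_particles, 1.0 / n_particles)

    for _ in range(n_steps):
        # Simulate: pick one hypothesis and search for an informative action.
        k = rng.choice(n_particles, p=weights)
        action = propose_action(particles[k])

        # Act: apply the action on the real object and observe the outcome.
        observed = execute_on_robot(action)

        # Update: re-weight every hypothesis by how well its simulated outcome
        # matches the real outcome (analysis-by-synthesis).
        likelihood = np.array([chamfer_weight(observed, simulate(p, action))
                               for p in particles])
        weights = weights * likelihood
        weights = weights / weights.sum()

        # Repeat until one hypothesis clearly dominates (illustrative stopping rule).
        if weights.max() > 0.95:
            break
    return particles[int(np.argmax(weights))], weights
```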
A. Generating Hypothetical Articulated Objects
Given the observed pointcloud $O$ of a target object along with its part segmentation $m$, we generate a number of kinematic replicas of the object. Since the true articulation mechanism is initially unknown, we generate these replicas by sampling different kinematic structures from uniform prior distributions over joint types and parameters.
Object Parts. From the observed pointcloud $O$ and segmentation masks $m_1, m_2, \cdots, m_{N_v}$, where $N_v$ is the number of available views, we can break the pointcloud into part-centric pointclouds $O_1, O_2, \cdots, O_{N_p}$, where $N_p$ is the total number of object parts.
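As a small illustration of this step, the sketch below splits a fused pointcloud into part-centric pointclouds, assuming the per-view masks have already been fused into one integer part label per point; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def split_into_parts(points, part_labels):
    """Split an (N, 3) pointcloud O into part-centric pointclouds O_1..O_Np.

    points      : (N, 3) array of 3D points.
    part_labels : (N,) integer array; label i marks points belonging to part i
                  (obtained by fusing the per-view segmentation masks m_1..m_Nv).
    """
    part_ids = np.unique(part_labels)
    return {int(i): points[part_labels == i] for i in part_ids}

# Example: 1000 random points assigned to 3 parts.
pts = np.random.rand(1000, 3)
labels = np.random.randint(0, 3, size=1000)
parts = split_into_parts(pts, labels)      # {0: O_1, 1: O_2, 2: O_3}
```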
Articulation Joints. Each object part is attached to a base of the object with a joint. We consider the three most common types of articulation joints: revolute (r), prismatic (p), and fixed (f). For revolute and prismatic joints, we further generate possible joint axes and positions, using the tight bounding boxes fitted to the part-centric pointclouds, to obtain a total of $J$ possible joints. The $j$-th joint is denoted as $\theta^{(j)} = (c, d)$, where $c \in \{r, p, f\}$ is the joint type and $d \in \mathbb{R}^6$ is the 6-DoF pose of the joint axis. The prior distribution $p(\theta^{(j)})$ over the joint type is assumed to be uniform at $t = 0$. One can also use a learned prior from vision models that predict joint types. In addition, most articulated joints have lower and upper limits on how much the joint can be deformed. We denote these limits as $\theta_{low}$ and $\theta_{high}$; their priors are uniform over $[-\theta_{MAX}, 0]$ and $[0, \theta_{MAX}]$, respectively. The full state of the joint for object part $O_i$ is $s_i = (\theta^{(\sigma(i))}, \theta_{low_i}, \theta_{high_i}, \theta_{cur_i})$, where $\sigma(i) \in \{1, 2, \cdots, J\}$ is the joint configuration for the $i$-th object part, and $\theta_{cur_i}$ is the joint position at the current time step. The prior over all the latent variables is:
$$p(s_{1:N_p}) = \prod_{i=1}^{N_p} p\big(\theta^{(\sigma(i))}\big)\, p_{\mathrm{unif}[-\theta_{MAX},\,0]}\big(\theta_{low_i}\big)\, p_{\mathrm{unif}[0,\,\theta_{MAX}]}\big(\theta_{high_i}\big). \tag{1}$$
We approximate the distribution by maintaining a particle pool $\mathcal{S}$, where each particle in the pool represents a particular setup of the articulation configuration.
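The following sketch shows how one such particle could be drawn from the uniform prior of Eq. (1), assuming candidate joint poses have already been derived from the part bounding boxes. The dictionary layout, `THETA_MAX` value, and function names are illustrative assumptions, not the paper's data structures.

```python
import numpy as np

THETA_MAX = np.pi / 2                      # illustrative joint-limit bound
JOINT_TYPES = ["revolute", "prismatic", "fixed"]

def sample_hypothesis(candidate_axes, rng):
    """Sample one articulation hypothesis s_1..s_Np from the uniform prior of Eq. (1).

    candidate_axes : list with one entry per object part; each entry is a list of
                     candidate 6-DoF joint poses d derived from that part's bounding box.
    """
    hypothesis = []
    for axes in candidate_axes:
        joint_type = str(rng.choice(JOINT_TYPES))         # c ~ Uniform{r, p, f}
        axis = np.asarray(axes[rng.integers(len(axes))])  # one candidate 6-DoF pose d
        theta_low = rng.uniform(-THETA_MAX, 0.0)          # lower limit ~ Uniform[-THETA_MAX, 0]
        theta_high = rng.uniform(0.0, THETA_MAX)          # upper limit ~ Uniform[0, THETA_MAX]
        hypothesis.append({"type": joint_type, "axis": axis,
                           "low": theta_low, "high": theta_high, "cur": 0.0})
    return hypothesis

# Particle pool S: e.g. two parts, each with one or two candidate joint poses.
rng = np.random.default_rng(0)
candidate_axes = [[np.zeros(6), np.ones(6)], [np.zeros(6)]]
pool = [sample_hypothesis(candidate_axes, rng) for _ in range(100)]
```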
B. Simulating and Selecting Informative Action
We utilize virtual simulations to generate an optimal
action that reduces the uncertainty of joint configuration
hypotheses. Yet, computing the optimal action that maximizes
the information gain involves an integral over all latent variables, which is intractable. One can approximate this with a sampling-based method [19]. However, the high computational requirements still prohibit the agent from solving the task within a reasonable time. We address this by using only a single particle to make a noisy approximation of the optimal action.
We sample a joint configuration from the set of particles, $s^{(k)} \sim \mathcal{S}$, and obtain the optimal action by simulating different actions on the object with the physics simulation. The action $a_t = (p_t, r_t) \in \mathbb{R}^6$ is represented as a 3D point $p_t \in \mathbb{R}^3$ on the object and the direction $r_t \in \mathbb{R}^3$ in which to apply force. The optimal action is defined as the action that maximally deforms the object or a target object part over a single step. For multi-part objects, we maintain a list of parts-of-interest, which we will introduce shortly, and we sample a target part from the list to act on. We measure how much an object part $i$ deforms by $d_i = \|\theta_{cur_i}^{t+1} - \theta_{cur_i}^{t}\|$. Although one can naively sample a huge number of actions and pick the best action through simulation, we found this can be extremely inefficient for large object parts. To improve inference speed, we instead treat the action inference as a particle filtering problem: we initialize a number of action proposals by randomly sampling 3D locations on the target point cloud and assigning random directions in which to apply force, then we use the measured deformation $d_j$ of each proposal as its likelihood to update the posterior distribution of the particles. We add noise to the actions while resampling the particles from previous iterations. We repeat this process three times and finally sample a particle from the pool to obtain the action $a^*$.¹ We found the inferred action $a^*$ is often close to the oracle optimal action that maximizes $d_i$.
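A rough sketch of this action-selection filter under one sampled hypothesis is given below. The physics rollout is abstracted as a caller-supplied `deformation(point, direction)` function returning the simulated deformation for a candidate push, since the simulator interface is not specified here; the particle counts and noise scale are illustrative.

```python
import numpy as np

def select_action(part_points, deformation, n_particles=100, n_rounds=3,
                  noise=0.01, rng=None):
    """Particle-filter search for an action a = (p, r) that maximally deforms a part.

    part_points : (N, 3) pointcloud of the target part (candidate contact points).
    deformation : callable(point, direction) -> simulated deformation d >= 0.
    """
    if rng is None:
        rng = np.random.default_rng()
    part_points = np.asarray(part_points)

    # Initialise proposals: random contact points and random unit force directions.
    idx = rng.integers(len(part_points), size=n_particles)
    points = part_points[idx]
    dirs = rng.normal(size=(n_particles, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

    for round_idx in range(n_rounds):
        if round_idx > 0:
            # Resample proposals in proportion to their weights, then perturb them
            # (perturbed points may drift slightly off the surface in this sketch).
            keep = rng.choice(n_particles, size=n_particles, p=weights)
            points = points[keep] + rng.normal(scale=noise, size=(n_particles, 3))
            dirs = dirs[keep] + rng.normal(scale=noise, size=(n_particles, 3))
            dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
        # Likelihood of each proposal = deformation it causes in simulation.
        d = np.array([deformation(p, r) for p, r in zip(points, dirs)])
        weights = (d + 1e-8) / (d + 1e-8).sum()

    # Sample one proposal from the final pool as the action a*.
    best = rng.choice(n_particles, p=weights)
    return points[best], dirs[best]
```

Resampling in proportion to the simulated deformation concentrates proposals on contact points and directions that actually move the part, which is what makes this search much cheaper than exhaustive sampling.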
The probabilistic formulation of an articulation mechanism
given past observation and action is
$$p(s_t \mid O_{1:t-1}, a_{1:t-1}) = \int \underbrace{p(s_t \mid s_{t-1}, a_{t-1})}_{\text{forward dynamics}}\; \underbrace{p(s_{t-1} \mid O_{1:t-1}, a_{1:t-1})}_{\text{obtained through recursion}} \, ds_{t-1}, \tag{2}$$
where the first term is handled by the physics engine via
forward simulation, and the second is initialized with the prior
defined in Eq. (1) and can be obtained through recursion.
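For readers tracking the particle filter (this step is implicit in the text rather than written out): under the particle approximation, $p(s_{t-1} \mid O_{1:t-1}, a_{1:t-1}) \approx \sum_{k=1}^{K} w_k\, \delta(s_{t-1} - s^{(k)}_{t-1})$, so the integral in Eq. (2) reduces to propagating each hypothesis through the simulator, i.e., the standard particle-filter prediction step:

$$p(s_t \mid O_{1:t-1}, a_{1:t-1}) \;\approx\; \sum_{k=1}^{K} w_k\, p\big(s_t \mid s^{(k)}_{t-1}, a_{t-1}\big) \;\approx\; \sum_{k=1}^{K} w_k\, \delta\big(s_t - \mathrm{sim}(s^{(k)}_{t-1}, a_{t-1})\big),$$

where $\mathrm{sim}(\cdot)$ denotes a (near-)deterministic rollout of the physics engine.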
C. Updating Hypotheses through Analysis-by-Synthesis
We apply the inferred action $a^*$ on the target object $O_t$ to observe the outcome $O_{t+1}$. We then update the probability of each hypothesis through analysis-by-synthesis: we first apply the same action $a^*$ on all the "imagined" objects, $s \in \mathcal{S}$, in the physics engine. After applying the action, we obtain $\hat{O}^{(k)}_{t+1}$ for each particle $s^{(k)}$. We define the likelihood of the particle $s^{(k)}$ as $w_k = \frac{1}{\mathrm{dist}(O_{t+1}, \hat{O}^{(k)}_{t+1}) + \epsilon}$, where $\mathrm{dist}(o_1, o_2) = \frac{1}{|o_1|}\sum_{x \in o_1} \min_{y \in o_2} \|x - y\|_2^2$ is the chamfer distance between two point clouds $o_1$ and $o_2$.
The overall updated posterior is:
$$p(s_t \mid O_{1:t}, a_{1:t-1}) \propto p(O_t \mid s_t)\, p(s_t \mid O_{1:t-1}, a_{1:t-1}) = \sum_{k=1}^{K} w_k\, p(s_t \mid O_{1:t-1}, a_{1:t-1}), \tag{3}$$
where the second term can be computed from Eq. (2), and
the whole inference is implemented through particle filtering
with weighted sampling.
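A small sketch of this update step is given below, assuming the simulated outcomes $\hat{O}^{(k)}_{t+1}$ are already available as point arrays; the brute-force chamfer computation and the small constant standing in for the unspecified $\epsilon$ are illustrative choices, not the authors' implementation.

```python
import numpy as np

def chamfer(o1, o2):
    """One-sided chamfer distance dist(o1, o2) between (N,3) and (M,3) pointclouds."""
    diff = o1[:, None, :] - o2[None, :, :]   # (N, M, 3) pairwise differences
    sq = np.sum(diff ** 2, axis=-1)          # (N, M) squared distances
    return np.mean(np.min(sq, axis=1))       # mean nearest-neighbour distance

def update_weights(observed, predicted, prior_weights, eps=1e-6):
    """Re-weight hypotheses by how well their simulated outcome matches reality (Eq. 3).

    observed      : (N, 3) pointcloud O_{t+1} seen after acting on the real object.
    predicted     : list of (M_k, 3) pointclouds, one simulated outcome per particle.
    prior_weights : (K,) weights carried over from the previous step.
    """
    w = np.array([1.0 / (chamfer(observed, o_hat) + eps) for o_hat in predicted])
    w = w * prior_weights                    # combine likelihood with the recursive prior
    return w / w.sum()
```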
D. Handling Joints with Dependency in Goal-Conditioned Manipulation
A real puzzle box often consists of joints with dependencies, e.g., one lock needs to be opened first in order to operate another lock. Randomly selecting a part to act on is ineffective and
¹We found the particle filter (PF) generates a nearly optimal action 1,500 times faster than an oracle optimal action generated by exhaustive search (ES). Comparing the deformations caused by the two, we found that PF with 100 particles almost always generates the same action ($d_i^{PF}/d_i^{ES} = 0.995$).