
Fig. 2: Given objects placed on a tabletop surface, the robot
should pick up the objects and move them above the tabletop
surface without knocking the neighboring objects over.
rameters. We instead focus on the geometric Real2Sim2Real challenge: automatically constructing object geometry from camera observations, placing the reconstructed objects at physically plausible poses that allow for forward simulation, and demonstrating that the reconstructed scenes can be used to train performant neural networks. In manipulation research, constructing object meshes
and placing them into simulated scenes can be a highly
manual process [18], [25]. Very recently, we have witnessed breakthroughs in learning-based 3D modeling. State-of-the-art neural surface reconstruction methods can convert input
from 3D sensors to meshes with geometric details and have
demonstrated strong cross-scene generalizability [26]–[29].
It is therefore natural to ask whether the 3D meshes generated by state-of-the-art neural reconstruction algorithms are of sufficient quality to support physical simulation and the learning of manipulation algorithms. In fact, autonomous driving research, which requires only coarser environment geometry, has already benefited tremendously from efforts to automatically reconstruct digital clones [30]–[34].
However, manipulation research requires finer details for
simulation, including precise geometric meshes and accurate
physical properties. Using the 6-DoF grasping-in-clutter task with point cloud input as a case study, we present experimental evidence supporting our hypothesis that modern neural reconstruction algorithms capture objects in sufficient detail to support simulation and learning. We choose grasping as the case-study application because grasping is an essential primitive for virtually all manipulation tasks, and enabling grasping with our compact framework holds strong promise for more complex manipulation tasks.
Our pipeline uses a performant and recent neural surface
reconstruction network that learns to convert point clouds
to detailed meshes. To create a digital replica of the scene,
we propose a simple yet effective method to place the
object meshes into the scene automatically without having
to explicitly perform pose estimation. We demonstrate our
results in the tabletop grasping-in-clutter task and across two
recent state-of-the-art grasp pose prediction networks [35],
[36]. The two grasping networks have different input modalities and network architectures, providing additional evidence for the generality of our results.
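Both networks can be queried through the same thin wrapper, since a predicted grasp reduces to a 6-DoF pose plus a gripper opening width and a confidence score regardless of the input modality. The sketch below is an illustrative interface only; the names and signatures are ours, not the actual APIs of [35] or [36].

from dataclasses import dataclass
import numpy as np

@dataclass
class Grasp:
    pose: np.ndarray    # 4x4 gripper pose in the world frame
    width: float        # gripper opening width in meters
    score: float        # predicted quality, used to rank candidates

def best_grasp(model, scene_points: np.ndarray) -> Grasp:
    """Query a grasp pose prediction network on a fused scene point cloud
    and return the highest-scoring candidate. `model` is assumed to return
    an iterable of Grasp instances."""
    candidates = model(scene_points)    # network-specific forward pass
    return max(candidates, key=lambda g: g.score)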
We compare our method with a recent scene reconstruction baseline [37], which uses object pose estimation followed by a retrieval-based method to build the digital scene replica. The results show that we achieve higher sample efficiency when the shapes of the digital replicas are closer to the real-world scene.

Algorithm 1 Step-by-step description of our reconstruction framework to create a digital replica of the test scenes from camera observations
Input: depth and segmentation maps, camera poses
Output: A digital replica of the test scene
1: Convert the depth maps to point clouds and fuse the point clouds
2: Extract object-level point clouds using the semantic segmentation maps
3: for each object-level point cloud do
4:   Perform point cloud outlier removal
5:   Use Algorithm 2 to reconstruct an object mesh
6:   Place the object mesh into the simulation scene
7: end for
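As a concrete illustration, the sketch below shows one possible Python realization of Algorithm 1 using Open3D. The reconstruct_mesh call (standing in for Algorithm 2) and the sim.add_mesh call are hypothetical placeholders, and the parameter values are illustrative rather than the settings used in our experiments.

import numpy as np
import open3d as o3d

def build_scene_replica(depth_maps, seg_maps, intrinsic, cam_poses, sim):
    """Fuse per-view depth into object-level point clouds, reconstruct
    meshes, and place them into the simulator (Algorithm 1, sketched)."""
    object_ids = np.unique(np.concatenate([m.ravel() for m in seg_maps]))
    object_ids = object_ids[object_ids != 0]           # 0 = background/table

    for obj_id in object_ids:
        fused = o3d.geometry.PointCloud()
        for depth, seg, T_world_cam in zip(depth_maps, seg_maps, cam_poses):
            # Keep only the pixels belonging to this object, then unproject.
            masked = np.where(seg == obj_id, depth, 0.0).astype(np.float32)
            pcd = o3d.geometry.PointCloud.create_from_depth_image(
                o3d.geometry.Image(masked), intrinsic,
                extrinsic=np.linalg.inv(T_world_cam),   # world-to-camera
                depth_scale=1.0, depth_trunc=2.0)
            fused += pcd                                # fuse in the world frame

        # Step 4: statistical outlier removal on the object-level cloud.
        fused, _ = fused.remove_statistical_outlier(nb_neighbors=20,
                                                    std_ratio=2.0)

        # Step 5: neural surface reconstruction (placeholder for Algorithm 2).
        mesh = reconstruct_mesh(np.asarray(fused.points))

        # Step 6: the cloud is already expressed in world coordinates, so the
        # mesh can be inserted at identity pose, without explicit pose estimation.
        sim.add_mesh(mesh, pose=np.eye(4))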
To put the performance of our Real2Sim2Real pipeline
into perspective, we also compare the pre-trained variant and
the train-with-reconstruction variant of the best-performing
grasping network we used. Although the pre-trained model
used a 10^4× larger training dataset, we observe similar grasping performance in simulation. We can also achieve the same
robust grasping performance when adapting our compact
framework to a new gripper in the real world without having
to relabel the entire large dataset. In summary, our method
decouples 3D modeling and grasp pose sampling, and both
sub-problems can be solved with quality and generalizability
using state-of-the-art methods.
II. RELATED WORK
3D Reconstruction methods are generally explicit [38]–
[40] or implicit. Implicit representations encode the shape as a continuous function [26], [41] and can, in principle, handle
more complex geometries. For Real2Sim, we utilize Con-
vONet [26], which provides a flexible implicit representation
to reconstruct 3D scenes with fine-grained details.
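As a toy illustration of the implicit route (not ConvONet itself), the snippet below encodes a shape as a continuous occupancy function and extracts a triangle mesh from its 0.5 level set with marching cubes; in ConvONet the analytic function is replaced by a learned occupancy network conditioned on the input point cloud.

import numpy as np
from skimage.measure import marching_cubes

def occupancy(points):
    """Continuous implicit shape: smooth occupancy of a sphere of radius 0.5
    (close to 1 inside, close to 0 outside)."""
    d = np.linalg.norm(points, axis=-1) - 0.5
    return 1.0 / (1.0 + np.exp(50.0 * d))

# Query the function on a dense grid and extract the 0.5 level set as a mesh.
res = 64
grid = np.linspace(-1.0, 1.0, res)
xyz = np.stack(np.meshgrid(grid, grid, grid, indexing="ij"), axis=-1)
occ = occupancy(xyz.reshape(-1, 3)).reshape(res, res, res)
verts, faces, normals, _ = marching_cubes(occ, level=0.5,
                                          spacing=(2.0 / (res - 1),) * 3)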
Sim2Real gap is a common problem when transferring a policy trained in simulation to the real world. While existing techniques,
such as domain randomization and system identification, can
reduce the Sim2Real gap [9], [42], [43], they focus on gen-
eralization with respect to the low-dimensional dynamics pa-
rameters. We instead study geometric Real2Sim2Real, align-
ing the high-dimensional geometry to improve Sim2Real
transfer.
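To make the distinction concrete, domain randomization typically perturbs a handful of scalar dynamics parameters per episode. The minimal PyBullet sketch below illustrates this; the parameter ranges are illustrative, not values taken from the cited works.

import numpy as np
import pybullet as p

def randomize_dynamics(body_id, rng):
    """Per-episode randomization of low-dimensional dynamics parameters.
    Geometric Real2Sim2Real instead aligns the object's shape itself."""
    p.changeDynamics(body_id, -1,
                     mass=rng.uniform(0.05, 0.5),       # kg
                     lateralFriction=rng.uniform(0.3, 1.2),
                     restitution=rng.uniform(0.0, 0.3))

rng = np.random.default_rng(0)
# randomize_dynamics(object_id, rng)   # called once per simulated episode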
Geometric Real2Sim2Real is a generative problem: given
observations of a real scene, an algorithm should generate the
corresponding digital replica. Generative algorithms typically
fall into two classes. In the first, the algorithm retrieves objects from an external database during the generation process, an approach taken by [37]. The second class of approaches relies solely
on the trained generative model at test time. A representative
algorithm is [44]. While [37], [44] can reconstruct objects
and scenes, they do not demonstrate that the reconstruction
can be used to train networks to perform robotic manipu-
lation tasks. The ability to use the reconstructed scenes to