A Real2Sim2Real Method for Robust Object Grasping with Neural Surface Reconstruction
Luobin Wang, Runlin Guo, Quan Vuong, Yuzhe Qin, Hao Su, Henrik Christensen

Fig. 1: Left (our pipeline): given a real scene (I), we fuse and segment the camera observations to obtain object-level point
clouds (II), which we use to construct a digital replica of the real scene (III). The replica is used to generate grasp labels
(IV) to obtain trained grasping networks (V). The grasp poses predicted by the trained networks are evaluated in the real
scene (VI). Right (the Real2Sim step) illustrates how we can automatically place the reconstructed meshes in the digital
replica without having to explicitly perform pose estimation. Given an object-level point cloud, we use a trained ConvONet
to reconstruct the mesh (I), and then apply the inverse normalization operation (II) to obtain the mesh represented in the
world frame (III).
Abstract—We explore an emerging technique, geometric
Real2Sim2Real, in the context of object manipulation. We
hypothesize that recent 3D modeling methods provide a path
toward building digital replicas of real-world scenes that
afford physical simulation and support learning of robust
manipulation algorithms. Since 6-DoF grasping is one of the most
important primitives for all manipulation tasks, we study
whether geometric Real2Sim2Real can help us train a robust
grasping network with high sample efficiency. We propose to
reconstruct high-quality meshes from real-world point clouds
using a state-of-the-art neural surface reconstruction method (the
Real2Sim step). Because most simulators take meshes as input for fast
simulation, the reconstructed meshes enable grasp pose label
generation without human effort. The generated labels can
train a grasping network that performs robustly in real evaluation
scenes (the Sim2Real step). In synthetic and real experiments,
we show that the Real2Sim2Real pipeline performs better than
baseline grasp networks trained with a 10⁴× larger dataset
by mimicking the geometric shapes of target objects in simulation.
We also show that our method has better sample efficiency
than training the grasping network with a retrieval-based
scene reconstruction method. The benefit of the Real2Sim2Real
pipeline comes from 1) decoupling scene modeling and grasp
sampling into sub-problems, and 2) the fact that both sub-problems can be
solved with sufficiently high quality using recent 3D learning
algorithms and mesh-based physical simulation techniques.
Video presentation available at this link.
All authors affiliated with Contextual Robotics Institute, UC San Diego.
I. INTRODUCTION
Learning robotic manipulation skills in simulation and
executing the skills in the real world, often termed Sim2Real,
has fueled many recent advances in robot manipulation [1]–[5].
The paradigm is effective because many recent learning
approaches require a high volume of interaction samples (e.g.,
reinforcement learning [6], [7] and simulation-based grasp
pose auto-labeling [8]), and obtaining these samples in
simulators is much cheaper than in the real world. The Sim2Real
approach usually requires practitioners to build a digital
replica of the real physical scene, where the robots perform
the task of interest. To construct the digital replica in sim-
ulation, the practitioners have to manually curate the object
meshes, calibrate their dynamics parameters, and place them
at realistic poses. Even though there are existing approaches
that lower the realism requirement of the digital replica, such
as domain randomization and domain adaptation [4], [9]–
[17] , these approaches still require the manual creation of
3D assets before they can be applied. Manual model creation,
physical property calibration and scene construction require
domain expertise and can be prohibitively costly to scale to
multiple large-scale scenes with many objects [18]–[21].
Recognizing scene creation and calibration as a Sim2Real
bottleneck, recent research has attempted to automate this
process, dubbing the problem Real2Sim2Real [22]–[24]. In
fact, these works tackle dynamics Real2Sim2Real,
since they focus on estimating the simulation dynamics parameters.
arXiv:2210.02685v3 [cs.RO] 2 Oct 2023
Fig. 2: Given objects placed on a tabletop surface, the robot
should pick up the objects and move them above the tabletop
surface without knocking the neighboring objects over.
We instead focus on the geometric Real2Sim2Real
challenge – automatically constructing object geometry from
camera observations, placing them at physically-plausible
poses that allow for forward simulation, and demonstrating
that the reconstructed scenes can train performant neural net-
works. In manipulation research, constructing object meshes
and placing them into simulated scenes can be a highly
manual process [18], [25]. Very recently, we have witnessed
breakthroughs in learning-based 3D modeling. State-of-the-art
neural surface reconstruction methods can convert input
from 3D sensors to meshes with geometric details and have
demonstrated strong cross-scene generalizability [26]–[29].
Therefore, it is time to test the hypothesis that 3D meshes
generated by state-of-the-art neural reconstruction algorithms
are of sufficient quality for physical simulation and for
learning manipulation algorithms. In fact,
autonomous driving research, which requires coarser environment
geometry, has already benefited tremendously from
efforts to automatically reconstruct digital clones [30]–[34].
However, manipulation research requires finer details for
simulation, including precise geometric meshes and accurate
physical properties. Using the 6-DoF object clutter grasping task
with point cloud input as the case study, we present
experimental evidence to support our hypothesis that modern
neural reconstruction algorithms can provide sufficient
object detail to support simulation and learning. We
choose grasping as the case study application because grasping
is an essential primitive for all manipulation tasks, and
enabling grasping with our compact framework holds
strong promise for more complex manipulation tasks.
Our pipeline uses a performant and recent neural surface
reconstruction network that learns to convert point clouds
to detailed meshes. To create a digital replica of the scene,
we propose a simple yet effective method to place the
object meshes into the scene automatically without having
to explicitly perform pose estimation. We demonstrate our
results in the tabletop grasping-in-clutter task and across two
recent state-of-the-art grasp pose prediction networks [35],
[36]. The grasping networks have different input modalities
and network architectures, providing additional evidence for
the generality of our results.
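The placement step can be made concrete: normalize each object-level point cloud into a canonical frame, reconstruct the mesh there, and apply the inverse normalization to drop the mesh back at its observed world pose, with no explicit pose estimation. The following NumPy sketch illustrates the normalize/denormalize round trip; the function names and the toy point cloud are illustrative, not the paper's implementation.

```python
import numpy as np

def normalize_to_unit_cube(points):
    """Translate and scale a point cloud into a canonical unit cube
    centered at the origin. Returns the normalized cloud plus the
    (center, scale) needed to undo the operation."""
    center = (points.min(axis=0) + points.max(axis=0)) / 2.0
    scale = (points.max(axis=0) - points.min(axis=0)).max()
    return (points - center) / scale, center, scale

def denormalize(vertices, center, scale):
    """Inverse normalization: map reconstructed mesh vertices from the
    canonical frame back to the world frame, so the mesh lands at the
    object's observed pose without explicit pose estimation."""
    return vertices * scale + center

# Toy example: an object-level point cloud sitting somewhere on the table.
rng = np.random.default_rng(0)
cloud = rng.uniform(-0.05, 0.05, size=(200, 3)) + np.array([0.4, -0.1, 0.02])

canonical, center, scale = normalize_to_unit_cube(cloud)
# A reconstruction network would consume `canonical` and output mesh
# vertices in the same canonical frame; here we reuse the points as a stand-in.
world = denormalize(canonical, center, scale)
```

Because the network only ever sees canonical-frame inputs, the same trained model generalizes across object placements; the (center, scale) bookkeeping carries all the pose information.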
We compare our method with a recent scene reconstruction
baseline [37] which uses object pose estimation followed
by retrieval-based method for building the digital scene
replica. The results show that we can achieve higher sample
efficiency when the shapes of the digital replicas are closer to the
real-world scene.

Algorithm 1 Step-by-step description of our reconstruction
framework to create a digital replica of the test scenes from
camera observations
Input: depth and segmentation maps, camera poses
Output: A digital replica of the test scene
1: Convert the depth maps to point clouds and fuse the point clouds
2: Extract object-level point clouds using the semantic segmentation maps
3: for each object-level point cloud do
4:   Perform point cloud outlier removal
5:   Use Algorithm 2 to reconstruct an object mesh
6:   Place the object mesh into the simulation scene
7: end for
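Steps 1–4 of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal, self-contained illustration under simplifying assumptions (a single camera with identity pose, a flat toy depth map, and a centroid-distance outlier filter rather than the paper's actual filter); the function names are our own.

```python
import numpy as np

def depth_to_points(depth, K, cam_pose):
    """Back-project a depth map into a world-frame point cloud using
    pinhole intrinsics K and a 4x4 camera-to-world pose (step 1)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    return (cam_pose @ pts_cam.T).T[:, :3]

def remove_outliers(points, std_ratio=2.0):
    """Simple statistical outlier removal (step 4): drop points whose
    distance to the cloud centroid deviates by more than std_ratio
    standard deviations from the mean distance."""
    d = np.linalg.norm(points - points.mean(axis=0), axis=1)
    keep = np.abs(d - d.mean()) < std_ratio * d.std()
    return points[keep]

# Toy scene: flat depth at 1 m, with one "object" region encoded in a
# segmentation map of per-pixel instance ids (0 = background).
K = np.array([[100.0, 0, 32], [0, 100.0, 32], [0, 0, 1]])
cam_pose = np.eye(4)
depth = np.full((64, 64), 1.0)
seg = np.zeros((64, 64), dtype=int)
seg[20:40, 20:40] = 1

points = depth_to_points(depth, K, cam_pose)
obj_points = points[seg.ravel() == 1]      # step 2: object-level cloud
obj_points = remove_outliers(obj_points)   # step 4: outlier removal
# Steps 5-6 (mesh reconstruction and placement) would follow here.
```

With multiple cameras, step 1 simply concatenates the back-projected clouds, each transformed by its own calibrated camera pose.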
To put the performance of our Real2Sim2Real pipeline
into perspective, we also compare the pre-trained variant and
the train-with-reconstruction variant of the best-performing
grasping network we used. Although the pre-trained model
used a 10⁴× larger training dataset, we observe similar grasping
performance in simulation. We can also achieve the same
robust grasping performance when adapting our compact
framework to a new gripper in the real world without having
to relabel the entire large dataset. In summary, our method
decouples 3D modeling and grasp pose sampling, and both
sub-problems can be solved with high quality and generalizability
using state-of-the-art methods.
II. RELATED WORK
3D Reconstruction methods are generally explicit [38]–
[40] or implicit. Implicit representations encode the shape as a
continuous function [26], [41] and can, in principle, handle
more complex geometries. For Real2Sim, we utilize ConvONet
[26], which provides a flexible implicit representation
to reconstruct 3D scenes with fine-grained details.
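The implicit-to-mesh conversion works by querying the learned occupancy function on a dense grid and then extracting the 0.5 level set with a surface extractor such as marching cubes. The sketch below is not ConvONet itself: a hand-written sigmoid of distance to a unit sphere stands in for the learned decoder, purely to show the grid-query pattern.

```python
import numpy as np

def occupancy(points):
    """Stand-in for a learned implicit function (e.g. an occupancy
    network decoder): returns values in (0, 1), with > 0.5 meaning
    'inside'. Here a soft unit sphere plays the role of the shape."""
    return 1.0 / (1.0 + np.exp(8.0 * (np.linalg.norm(points, axis=-1) - 1.0)))

# Query the implicit function on a dense voxel grid; a surface
# extractor (e.g. marching cubes) would then turn this scalar field
# into a triangle mesh at the 0.5 iso-level.
n = 32
axis = np.linspace(-1.5, 1.5, n)
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
occ = occupancy(grid.reshape(-1, 3)).reshape(n, n, n)

inside = occ > 0.5  # boolean voxel classification of the queried field
```

Because the function is continuous, the grid resolution `n` trades off reconstruction detail against query cost, independently of the network itself.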
Sim2Real gap is a common problem when transferring a
policy from simulation to the real world. While existing techniques,
such as domain randomization and system identification, can
reduce the Sim2Real gap [9], [42], [43], they focus on gen-
eralization with respect to the low-dimensional dynamics pa-
rameters. We instead study geometric Real2Sim2Real, align-
ing the high-dimensional geometry to improve Sim2Real
transfer.
Geometric Real2Sim2Real is a generative problem: given
observations of a real scene, an algorithm should generate the
corresponding digital replica. Generative algorithms typically
fall into two classes. In the first, the algorithm retrieves from an
external database during the generation process, an approach
taken by [37]. The second class of approaches relies solely
on the trained generative model at test time. A representative
algorithm is [44]. While [37], [44] can reconstruct objects
and scenes, they do not demonstrate that the reconstruction
can be used to train networks to perform robotic manipu-
lation tasks. The ability to use the reconstructed scenes to