
Fig. 2: Given objects placed on a tabletop surface, the robot
should pick up the objects and move them above the tabletop
surface without knocking the neighboring objects over.
rameters. We instead focus on the geometric Real2Sim2Real challenge: automatically constructing object geometry from camera observations, placing the reconstructed objects at physically plausible poses that allow for forward simulation, and demonstrating that the reconstructed scenes can be used to train performant neural networks. In manipulation research, constructing object meshes
and placing them into simulated scenes can be a highly
manual process [18], [25]. Very recently, we have witnessed breakthroughs in learning-based 3D modeling. State-of-the-art neural surface reconstruction methods can convert input
from 3D sensors to meshes with geometric details and have
demonstrated strong cross-scene generalizability [26]–[29].
It is therefore natural to ask whether the 3D meshes generated by state-of-the-art neural reconstruction algorithms are of sufficient quality to support physical simulation and the learning of manipulation algorithms. In fact, autonomous driving research, which requires only coarser environment geometry, has already benefited tremendously from efforts to automatically reconstruct digital clones [30]–[34].
However, manipulation research requires finer details for
simulation, including precise geometric meshes and accurate
physical properties. Using the 6-DoF grasping-in-clutter task with point cloud input as a case study, we present experimental evidence supporting our hypothesis that modern neural reconstruction algorithms capture objects in sufficient detail to support simulation and learning. We choose grasping as the case-study application because grasping is an essential primitive for virtually all manipulation tasks, and enabling grasping with our compact framework holds strong promise for more complex manipulation tasks.
Our pipeline uses a performant and recent neural surface
reconstruction network that learns to convert point clouds
to detailed meshes. To create a digital replica of the scene,
we propose a simple yet effective method to place the
object meshes into the scene automatically without having
to explicitly perform pose estimation. We demonstrate our
results in the tabletop grasping-in-clutter task and across two
recent state-of-the-art grasp pose prediction networks [35],
[36]. The two grasping networks have different input modalities and network architectures, providing additional evidence for the generality of our results.
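Both networks can be queried through the same thin wrapper, since a predicted grasp reduces to a 6-DoF pose plus a gripper opening width and a confidence score regardless of the input modality. The sketch below is an illustrative interface only; the names and signatures are ours, not the actual APIs of [35] or [36].

from dataclasses import dataclass
import numpy as np

@dataclass
class Grasp:
    pose: np.ndarray    # 4x4 gripper pose in the world frame
    width: float        # gripper opening width in meters
    score: float        # predicted quality, used to rank candidates

def best_grasp(model, scene_points: np.ndarray) -> Grasp:
    """Query a grasp pose prediction network on a fused scene point cloud
    and return the highest-scoring candidate. `model` is assumed to return
    an iterable of Grasp instances."""
    candidates = model(scene_points)    # network-specific forward pass
    return max(candidates, key=lambda g: g.score)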
We compare our method with a recent scene reconstruction baseline [37], which uses object pose estimation followed by a retrieval-based method to build the digital scene replica. The results show that we achieve higher sample efficiency when the shapes of the digital replicas are closer to the real-world scene.

Algorithm 1 Step-by-step description of our reconstruction framework to create a digital replica of the test scenes from camera observations
Input: depth and segmentation maps, camera poses
Output: A digital replica of the test scene
1: Convert the depth maps to point clouds and fuse the point clouds
2: Extract object-level point clouds using the semantic segmentation maps
3: for each object-level point cloud do
4:   Perform point cloud outlier removal
5:   Use Algorithm 2 to reconstruct an object mesh
6:   Place the object mesh into the simulation scene
7: end for
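As a concrete illustration, the sketch below shows one possible Python realization of Algorithm 1 using Open3D. The reconstruct_mesh call (standing in for Algorithm 2) and the sim.add_mesh call are hypothetical placeholders, and the parameter values are illustrative rather than the settings used in our experiments.

import numpy as np
import open3d as o3d

def build_scene_replica(depth_maps, seg_maps, intrinsic, cam_poses, sim):
    """Fuse per-view depth into object-level point clouds, reconstruct
    meshes, and place them into the simulator (Algorithm 1, sketched)."""
    object_ids = np.unique(np.concatenate([m.ravel() for m in seg_maps]))
    object_ids = object_ids[object_ids != 0]           # 0 = background/table

    for obj_id in object_ids:
        fused = o3d.geometry.PointCloud()
        for depth, seg, T_world_cam in zip(depth_maps, seg_maps, cam_poses):
            # Keep only the pixels belonging to this object, then unproject.
            masked = np.where(seg == obj_id, depth, 0.0).astype(np.float32)
            pcd = o3d.geometry.PointCloud.create_from_depth_image(
                o3d.geometry.Image(masked), intrinsic,
                extrinsic=np.linalg.inv(T_world_cam),   # world-to-camera
                depth_scale=1.0, depth_trunc=2.0)
            fused += pcd                                # fuse in the world frame

        # Step 4: statistical outlier removal on the object-level cloud.
        fused, _ = fused.remove_statistical_outlier(nb_neighbors=20,
                                                    std_ratio=2.0)

        # Step 5: neural surface reconstruction (placeholder for Algorithm 2).
        mesh = reconstruct_mesh(np.asarray(fused.points))

        # Step 6: the cloud is already expressed in world coordinates, so the
        # mesh can be inserted at identity pose, without explicit pose estimation.
        sim.add_mesh(mesh, pose=np.eye(4))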
To put the performance of our Real2Sim2Real pipeline
into perspective, we also compare the pre-trained variant and
the train-with-reconstruction variant of the best-performing
grasping network we used. Although the pre-trained model
used a 10^4× larger training dataset, we observe similar grasping performance in simulation. We can also achieve the same
robust grasping performance when adapting our compact
framework to a new gripper in the real world without having
to relabel the entire large dataset. In summary, our method
decouples 3D modeling and grasp pose sampling, and both
sub-problems can be solved with quality and generalizability
using state-of-the-art methods.
II. RELATED WORK
3D Reconstruction methods are generally explicit [38]–
[40] or implicit. Implicit representations encode the shape as a continuous function [26], [41] and can, in principle, handle
more complex geometries. For Real2Sim, we utilize Con-
vONet [26], which provides a flexible implicit representation
to reconstruct 3D scenes with fine-grained details.
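As a toy illustration of the implicit route (not ConvONet itself), the snippet below encodes a shape as a continuous occupancy function and extracts a triangle mesh from its 0.5 level set with marching cubes; in ConvONet the analytic function is replaced by a learned occupancy network conditioned on the input point cloud.

import numpy as np
from skimage.measure import marching_cubes

def occupancy(points):
    """Continuous implicit shape: smooth occupancy of a sphere of radius 0.5
    (close to 1 inside, close to 0 outside)."""
    d = np.linalg.norm(points, axis=-1) - 0.5
    return 1.0 / (1.0 + np.exp(50.0 * d))

# Query the function on a dense grid and extract the 0.5 level set as a mesh.
res = 64
grid = np.linspace(-1.0, 1.0, res)
xyz = np.stack(np.meshgrid(grid, grid, grid, indexing="ij"), axis=-1)
occ = occupancy(xyz.reshape(-1, 3)).reshape(res, res, res)
verts, faces, normals, _ = marching_cubes(occ, level=0.5,
                                          spacing=(2.0 / (res - 1),) * 3)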
Sim2Real gap is a common problem when transferring a policy trained in simulation to the real world. While existing techniques,
such as domain randomization and system identification, can
reduce the Sim2Real gap [9], [42], [43], they focus on gen-
eralization with respect to the low-dimensional dynamics pa-
rameters. We instead study geometric Real2Sim2Real, align-
ing the high-dimensional geometry to improve Sim2Real
transfer.
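To make the distinction concrete, domain randomization typically perturbs a handful of scalar dynamics parameters per episode. The minimal PyBullet sketch below illustrates this; the parameter ranges are illustrative, not values taken from the cited works.

import numpy as np
import pybullet as p

def randomize_dynamics(body_id, rng):
    """Per-episode randomization of low-dimensional dynamics parameters.
    Geometric Real2Sim2Real instead aligns the object's shape itself."""
    p.changeDynamics(body_id, -1,
                     mass=rng.uniform(0.05, 0.5),       # kg
                     lateralFriction=rng.uniform(0.3, 1.2),
                     restitution=rng.uniform(0.0, 0.3))

rng = np.random.default_rng(0)
# randomize_dynamics(object_id, rng)   # called once per simulated episode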
Geometric Real2Sim2Real is a generative problem: given
observations of a real scene, an algorithm should generate the
corresponding digital replica. Generative algorithms typically
fall into two classes. In the first, the algorithm retrieves objects from an external database during the generation process, an approach taken by [37]. The second class of approaches relies solely
on the trained generative model at test time. A representative
algorithm is [44]. While [37], [44] can reconstruct objects
and scenes, they do not demonstrate that the reconstruction
can be used to train networks to perform robotic manipu-
lation tasks. The ability to use the reconstructed scenes to