
Figure 2: The framework of our method. It consists of three stages: ContactCVAE, GraspNet and Penetration-aware Partial Optimization.
ContactCVAE takes an object point cloud O as input and generates a contact map C0. GraspNet estimates a grasp parameterized by θ from
the contact map C0. Finally, penetration-aware partial (PAP) optimization refines θ to get the final grasp.
ample, [Mo et al., 2021] and [Wu et al., 2021] first estimate
the contact points for parallel-jaw grippers and then plan paths
to grasp the target objects. The common assumption in this
literature is that the contact area is a point, so contact point
generation is treated as a per-point (or per-pixel, per-voxel)
detection problem, i.e., classifying each 3D object point as
contact or non-contact. This formulation cannot be applied to
dexterous hand grasps, which exhibit much more complex contact.
For dexterous robotic hand grasping, recent work [Mandikal and
Grauman, 2021] finds that leveraging contact areas from human
grasps can improve the grasping success rate in a reinforcement
learning framework. However, it assumes that an object affords
only one grasp, which does not hold in practice and limits its
applicability.
To tackle these limitations, we propose to leverage contact
maps to constrain the grasp synthesis. Specifically, we
factorize the learning task into two sequential stages, rather
than using a black-box hand pose generative network that
directly maps an object to the possible grasping poses, as in
previous work. In the first stage, we generate multiple
hypotheses of the grasping contact areas, represented by binary
3D segmentation maps. In the second stage, we learn a mapping
from the contact to the grasping pose by assuming the grasping
pose is fully constrained given a contact map.
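To make the notion of a binary contact map concrete, the following is a minimal sketch that labels each object point as contact (1) or non-contact (0). The labeling rule used here, thresholding the distance to the nearest hand surface point, as well as the function name `contact_map` and the 5 mm threshold, are illustrative assumptions and are not taken from the method described above.

```python
import math

def contact_map(object_points, hand_points, threshold=0.005):
    """Return one binary label per object point.

    A point is labeled 1 when some hand point lies within `threshold`
    (assumed to be in meters), otherwise 0. This rule is an assumption
    made for illustration only.
    """
    labels = []
    for p in object_points:
        nearest = min(math.dist(p, h) for h in hand_points)
        labels.append(1 if nearest <= threshold else 0)
    return labels

# Toy data: three object points and two hand surface points.
object_points = [(0.0, 0.0, 0.0), (0.01, 0.0, 0.0), (0.05, 0.0, 0.0)]
hand_points = [(0.0, 0.0, 0.002), (0.012, 0.0, 0.0)]
labels = contact_map(object_points, hand_points)
print(labels)  # [1, 1, 0]
```

The result is a per-point segmentation of the object surface, which is the kind of representation the first stage generates and the second stage consumes.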
The intermediate segmentation contact maps align with the
smooth manifold of the object surface: for example, a small
change in a valid contact map would likely produce another
valid solution (as illustrated in Figure 1). The corresponding
pose can then be deterministically established by the
subsequent GraspNet and PAP optimization. This reduces the
challenging pose generation problem to an easier map generation
problem on a low-dimensional, smooth manifold, benefiting
generation efficiency and generality.
The other benefit of the intermediate contact representation
is that it enables optimization from the contacts. Unlike
optimizing full grasps from scratch [Brahmbhatt et al., 2019b;
Xing et al., 2022], we propose a penetration-aware partial
(PAP) optimization with the intermediate contacts. It detects
the partial poses causing penetration and leverages the
generated contact maps as a consistency constraint for refining
those partial poses. The PAP optimization restricts the
gradients from wrong partial poses so that they affect only the
poses requiring adjustment, which results in better grasp
quality than a global optimization method.
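The idea of restricting gradients to the partial poses that need adjustment can be illustrated with a small masked gradient step. This is only a sketch under stated assumptions: the grouping of pose parameters into named parts, the function `partial_update`, the `penetrating` set, and the toy gradients are all hypothetical, and the actual penetration detection and contact consistency loss are not reproduced here.

```python
def partial_update(theta, grads, part_slices, penetrating, lr=0.5):
    """Apply one gradient step only to the parts flagged as penetrating.

    theta       : flat list of pose parameters
    grads       : gradient of some loss w.r.t. theta (same length)
    part_slices : hypothetical mapping part name -> (start, stop) index range
    penetrating : set of part names detected as causing penetration
    """
    updated = list(theta)
    for part, (start, stop) in part_slices.items():
        if part not in penetrating:
            continue  # gradient masked out: this part is left unchanged
        for i in range(start, stop):
            updated[i] = theta[i] - lr * grads[i]
    return updated

# Toy pose with three parts of two parameters each.
theta = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
grads = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
part_slices = {"thumb": (0, 2), "index": (2, 4), "middle": (4, 6)}
new_theta = partial_update(theta, grads, part_slices, penetrating={"index"})
print(new_theta)  # [0.0, 0.0, 0.5, 0.5, 2.0, 2.0]
```

In contrast, a global update would move all six parameters, including those of parts that were already placed correctly.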
In summary, our key contributions are: 1) we tackle the
high non-linearity of the 3D generation problem by
introducing the contact map constraint and factorizing the
generation into two stages: contact map generation and mapping
from contact maps to grasps; 2) we propose a PAP optimization
with the intermediate contacts for grasp refinement;
3) benefiting from the two decomposed learning stages and
partial optimization, our method outperforms existing methods
both quantitatively and qualitatively.
2 Related Works
Human grasp generation is a challenging task due to the
high degrees of freedom of human hands and the requirement
that the generated hands interact with objects in a physically
reasonable manner. Most methods use models such as
MANO [Romero et al., 2017] to parameterize hand poses,
aiming to directly learn a latent conditional distribution of the
hand parameters given objects from large datasets. The
distribution is usually learned by generative network models such
as Conditional Variational Auto-Encoders [Sohn et al., 2015]
or Generative Adversarial Networks [Arjovsky et al., 2017].
To get finer poses, many existing works adopt a coarse-to-fine
strategy by learning the residuals of the grasping poses in the
refinement stage. [Corona et al., 2020] uses a generative
adversarial network to obtain an initial grasp, and then an extra
network to refine it. [Taheri et al., 2020] follows a similar
strategy but uses a CVAE model to output an initial grasp.
In recent works, contact maps are exploited to improve
robotic grasping, hand-object reconstruction, and 3D grasp
synthesis. [Brahmbhatt et al., 2019b] introduces a loss for
robotic grasping optimization using contact maps captured
from thermal cameras [Brahmbhatt et al., 2019a; Brahmbhatt
et al., 2020] to filter and rank random grasps sampled
by GraspIt! [Miller and Allen, 2004]. It concludes that
synthesized grasping poses optimized directly from the contact
demonstrate superior quality to other approaches which
kinematically re-target observed human grasps to the target hand
model. In the reconstruction of hand-object interaction,
[Grady et al., 2021] proposes a differentiable contact
optimization to refine the hand pose reconstructed from an image. In
3D grasp synthesis, [Jiang et al., 2021] also exploits contact
maps but only uses them to refine generated grasps
during inference. Our work differs from these works using
contact maps in three aspects: 1) these works use contact
maps as a loss for the grasp optimization or post-processing
for further grasp refinement while our work exploits the con-
tact maps as an intermediate constraint for the learning of the
grasp distribution; 2) in contrast to the learning-based works
with contact maps which treat objects-to-grasps as a black
box, our work factorizes the grasp synthesis into objects-to-
contact maps and contact maps-to-grasps; 3) moreover, these
works refine the whole grasps with global optimization meth-
ods using contact maps while our penetration-aware partial
optimization detects the partial poses causing the penetration
and leverages the contact map constraint to optimize the par-