
TABLE I: Dexterous Grasp Dataset Comparison

Dataset             Hand        Observations   Sim./Real   Grasps   Obj.(Cat.)   Grasps/Obj.   Method
ObMan [14]          MANO        -              Sim.        27k      2772(8)      10            GraspIt!
HO3D [15]           MANO        RGBD           Real        77k      10           >7k           Estimation
DexYCB [16]         MANO        RGBD           Real        582k     20           >29k          Human annotation
ContactDB [17]      MANO        RGBD+thermal   Real        3750     50           75            Capture
ContactPose [18]    MANO        RGBD           Real        2306     25           92            Capture
DDGdata [9]         ShadowHand  -              Sim.        6.9k     565          >100          GraspIt!
DexGraspNet (Ours)  ShadowHand  -              Sim.        1.32M    5355(133)    >200          Optimization
force closure and then uses it to synthesize diverse and
stable grasps via optimization. However, [19] suffers from
low yield, slow convergence, and strict requirements on object
meshes, making it impractical for synthesizing a large-scale
dataset.
To achieve our desired diversity, quality, and scale, we
propose several critical improvements to [19], making it much
more efficient and robust. First, we design a better hand pose
initialization strategy and carefully select contact candidates
to boost the yield, reducing the time to synthesize 10,000
valid grasps from 400 GPU hours to 7. Second, we propose
an alternative way to compute penetration energy and signed
distances, which enables us to handle object meshes of much
lower quality and greatly simplifies their preprocessing.
Third, we introduce energy terms that penalize
self-penetration and out-of-limit joint angles to further
improve grasp quality. Additionally, with simple modifications,
the entire pipeline can be applied to other dexterous hands,
such as MANO [20] and Allegro.
To verify the advantage of our dataset over the one from
DDG, we train two dexterous grasping algorithms on each
dataset. The cross-dataset experiments confirm that
training on our dataset yields better grasping quality and
higher diversity. Moreover, the great diversity of the hand
grasps in our dataset leaves substantial room for improvement
for future dexterous grasping algorithms.
II. RELATED WORK
Research in grasping can be broadly categorized by the
type of end effector involved. The most thoroughly studied
end effectors are suction cups and parallel-jaw grippers, whose
grasp pose can be defined by a vector of at most 7 dimensions:
3 for translation, 3 for rotation, and 1 for the width
between the two fingers. Dexterous robotic hands with three
or more fingers, such as ShadowHand [8], and humanoid
hands, such as MANO [20], require more complex descriptors,
with up to 24 DoF in the case of ShadowHand [8]. In this
paper, we focus on the latter type. To
bridge the gap between humanoid hands and robotic hands,
numerous studies have shown the efficacy of retargeting
humanoid hand poses to dexterous robotic hands [21–24].
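To make the dimensionality gap concrete, the two grasp descriptions above can be sketched as simple data structures. This is only an illustrative sketch: the class names, the axis-angle rotation parameterization, and the flat 24-angle joint vector are our assumptions, not an interface from any of the cited works.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class ParallelJawGrasp:
    """Parallel-jaw grasp: at most 7 numbers fully describe the pose."""
    translation: np.ndarray  # (3,) gripper position
    rotation: np.ndarray     # (3,) gripper orientation (axis-angle)
    width: float             # 1 DoF: distance between the two fingers

    def as_vector(self) -> np.ndarray:
        return np.concatenate([self.translation, self.rotation, [self.width]])


@dataclass
class DexterousGrasp:
    """Dexterous grasp: wrist pose plus one angle per articulated joint,
    e.g. 24 joint angles for a ShadowHand-like kinematic model."""
    translation: np.ndarray  # (3,) wrist position
    rotation: np.ndarray     # (3,) wrist orientation (axis-angle)
    joint_angles: np.ndarray = field(default_factory=lambda: np.zeros(24))

    def as_vector(self) -> np.ndarray:
        return np.concatenate([self.translation, self.rotation, self.joint_angles])


jaw = ParallelJawGrasp(np.zeros(3), np.zeros(3), 0.05)
dex = DexterousGrasp(np.zeros(3), np.zeros(3))
print(jaw.as_vector().shape)  # (7,)
print(dex.as_vector().shape)  # (30,)
```

The roughly fourfold jump in pose dimensionality is what makes dexterous grasps so much harder to search for and to annotate than parallel-jaw grasps.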
A. Analytical Grasping
Early research in dexterous grasping focused on optimizing
grasping poses to form force closure that can resist external
forces and torques [25–28].
Due to the complexity of computing hand kinematics and
testing force closure, many works were devoted to simplifying
the search space [29–31]. As a result, these methods
were applicable only in restricted settings and could produce
only limited types of grasping poses. Another line of work
[32–34] seeks to simplify the optimization process with an
auxiliary function. [19] proposed to use a differentiable
estimator of the force closure metric to synthesize diverse
grasping poses for arbitrary hands.
B. Data-Driven Grasping
Recent works shift their focus to data-driven methods.
Given an object, the most straightforward approach is to
directly generate the pose vectors of the grasping hand
[35–39]. These methods usually include a refinement step
to remove inconsistencies such as penetration.
Other methods take an indirect approach that first generates
an intermediate representation. Existing methods
use contact points [40–42], contact maps [21, 22, 43–45], and
occupancy fields [46] as the intermediate representation.
The grasping poses are then obtained via optimization
[40, 41, 44, 46], planning [43], RL policies [22, 42], or
another generative model [45].
Compared to most analytical methods, data-driven methods
show improved inference speed and diversity of generated
grasping poses. However, the diversity is still limited
by the training data.
C. Dexterous Grasp Datasets
Dexterous grasping is prohibitively difficult to annotate
manually because of its overwhelming degrees of freedom.
Most existing works are trained on programmatically
synthesized grasping poses [9, 14, 38, 47] using the
GraspIt! [10] planner. The planner first searches the
eigengrasp space for pregrasp poses whose quality exceeds
a threshold. Then, the planner squeezes all fingers of the
selected pregrasp poses to construct a firm grasp. Since the
initial search is performed in the low-dimensional eigengrasp
space, the resulting data follows a narrow distribution and
cannot cover the full dexterity of multi-finger hands.
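The search-then-squeeze procedure above can be sketched as follows. Everything here is an illustrative assumption rather than the actual GraspIt! implementation: the random linear eigengrasp basis (in practice such a basis comes from PCA over recorded grasps), the stand-in quality score (a real planner scores contacts and wrenches against object geometry), and the uniform-squeeze step.

```python
import numpy as np

rng = np.random.default_rng(0)

N_JOINTS = 24  # full hand DoF (ShadowHand-like, illustrative)
N_EIGEN = 2    # the eigengrasp search space is much lower-dimensional

# Hypothetical linear eigengrasp basis: maps 2 coefficients to 24 joint angles.
basis = rng.standard_normal((N_JOINTS, N_EIGEN))
mean_pose = np.zeros(N_JOINTS)


def to_joint_angles(eigen_coeffs: np.ndarray) -> np.ndarray:
    """Lift low-dimensional eigengrasp coordinates to the full joint space."""
    return mean_pose + basis @ eigen_coeffs


def quality(joint_angles: np.ndarray) -> float:
    """Stand-in grasp quality score (higher is better); a real planner
    would evaluate contacts and wrenches against the object mesh."""
    return float(-np.linalg.norm(joint_angles - 0.5))


def squeeze(joint_angles: np.ndarray, delta: float = 0.1) -> np.ndarray:
    """Close all fingers slightly from the pregrasp to form a firm grasp."""
    return joint_angles + delta


# Step 1: search the 2D eigengrasp space, keeping pregrasps above a threshold
# (the better half of the samples, purely for illustration).
samples = [to_joint_angles(rng.standard_normal(N_EIGEN)) for _ in range(200)]
scores = [quality(q) for q in samples]
threshold = float(np.median(scores))
pregrasps = [q for q, s in zip(samples, scores) if s > threshold]

# Step 2: squeeze each selected pregrasp into a final grasp.
grasps = [squeeze(q) for q in pregrasps]
print(len(grasps), grasps[0].shape)  # 100 (24,)
```

The key point the sketch makes visible: every candidate, however many are sampled, lies in the 2-dimensional image of the eigengrasp basis, which is why the resulting grasps follow a narrow distribution in the 24-dimensional joint space.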
More recent works leverage the increasing capability of
computer vision to collect human hand poses during
interaction with objects. HO3D [15, 48] computes the ground
truth 3D hand pose for images from 2D hand keypoint
annotations, resolving ambiguities by considering physical
constraints in hand-object and hand-hand interactions.
DexYCB [16] and ContactPose [18] recover the
3D hand shape from multi-view RGBD camera recordings.
The latest datasets [49–51] use optical motion capture systems
to track hand and object shapes during interactions. While these