• An algorithm that computes grasp quality scores using the geometric information (approach direction and spatial representation) from interactions between the objects, human hand, and robot gripper.
• A neural model for fast and parallel evaluation of various robot grasp candidates for human friendliness and stability.
• A new set of metrics that evaluate the quality of grasps based on their safety, human friendliness, and efficiency for human-robot collaboration tasks.
• A validation of our CoGrasp approach through real-robot experiments, demonstrating that our method achieves an 88% success rate in generating stable robot grasps while leaving socially compliant space on the object for humans to co-grasp concurrently.
• A validation of our approach through a user study with 10 participants, indicating that our method achieves 22% higher scores than a traditional robot-centric method [6] on various metrics of social compliance and safety.
II. RELATED WORK
This section discusses various techniques that generate
collision-free, stable robot grasps for object manipulation.
We divide these methods into three categories, i.e., classical, data-driven, and contextual, as described below.
A. Classical Methods
The study of robot grasp generation dates back decades, beginning with attempts to handle objects using a robot hand with elastic fingers [11]. This early work gave rise to geometry-based approaches [12]–[14] that classify contact points as frictionless, frictional, or soft contacts to identify regions of the object suitable for a successful grasp.
Another line of work, e.g., [15], studied the number of contact points needed for stable grasping. Contact points are essential for grasp stability, but current techniques examine them only to identify suitable regions from the robot's perspective. In a similar vein, [16], [17] demonstrated that grasping an object induces a pull force and that overcoming the resulting wrench is essential for stability. [18]–[20] study the complex kinematics of the object and the hand motion involved during an interaction, showing that the movement of the approaching hand or gripper is critical for grasping. Building on these formulations of stable grasping, geometry-based techniques [5], [21]–[24] were proposed that rely directly on the object shape to generate a suitable grasp. However, such methods do not generalize to real-world scenarios where object models are often unknown.
Modern methods [3] account for surface normals to evaluate the quality of grasps and use them to compute a safe distance for a stable grasp. [25] models the mean axis of an object by running PCA and empirically chooses a safe space in a plane normal to that axis. Despite this progress in stable robot grasping, these geometric approaches do not consider a human in the loop and assume a single manipulator; they therefore do not apply to collaboration tasks that require co-grasping of objects.
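For concreteness, the PCA step in such approaches can be sketched as follows. This is a minimal illustration under our own assumptions; the function names, the point-cloud shape, and the hand-tuned clearance margin are illustrative and not taken from [25].

```python
import numpy as np

def object_mean_axis(points):
    """Estimate the mean (principal) axis of an object point cloud via PCA.

    points: (N, 3) array of surface points. Returns the centroid and a
    unit vector along the dominant axis.
    """
    centroid = points.mean(axis=0)
    centered = points - centroid
    # Eigen-decomposition of the 3x3 covariance matrix; the eigenvector
    # with the largest eigenvalue gives the mean axis of the object.
    eigvals, eigvecs = np.linalg.eigh(centered.T @ centered / len(points))
    axis = eigvecs[:, np.argmax(eigvals)]
    return centroid, axis / np.linalg.norm(axis)

# A grasp region can then be chosen empirically in a plane normal to
# `axis` at some clearance from the centroid (the margin is hand-tuned).
```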
B. Data-Driven Methods
With advances in computational resources and deep neural models, data-driven methods for grasp generation have become prominent. In scenarios where input is available only from visual sensors, existing methods rely on estimating object models and their 6-DoF poses [26] before deploying traditional grasping methods. Furthermore, when complete 3D object models are unavailable, learning-based shape reconstruction, inferred from partial observations [27] or built from multiple views [28], has been proposed to fill the gap. Many reinforcement learning techniques have also emerged that learn gripper poses through exploration [29] and learn policies for manipulation [30]. These help identify dynamic responses to disturbances while grasping; still, such perturbations are based only on gripper-object and environment interactions. Moreover, 3D reconstruction and exploration of the large 6-DoF state space both suffer from the need to store huge amounts of data consisting exclusively of relative gripper and object poses.
Since contact points are crucial in grasping, architectures like PointNet++ [31] are used to learn patterns from point clouds. For instance, [6], [32], [33] utilize PointNet++ to learn the geometric relationships between grippers and objects from the contact-point data available in grasping datasets such as ACRONYM [34]. To extend beyond the contact points alone, learning-based methods like Dex-Net [35]–[37] directly learn the orientation and approach direction of the gripper. There are also methods [38], [39] that produce a grasp score for each point in space when contact information is not available. However, the contact information and orientations considered come only from the gripper and the object. Other neural sampling-based methods [40], [41] use various grasp quality metrics as objective functions. Because these metrics depend on gripper orientations and surface areas, they relate only to overall stability rather than human awareness.
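As a rough, hypothetical illustration of this family of per-point scoring methods (not the architecture of any cited paper), a small scoring head can be attached to a point-cloud encoder; here a shared per-point MLP stands in for a full PointNet++ backbone:

```python
import torch
import torch.nn as nn

class PerPointGraspScorer(nn.Module):
    """Illustrative per-point grasp-quality head (assumed architecture).

    An encoder maps each point to a feature vector; a small head then
    regresses a stability score in [0, 1] for every point.
    """
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        # Shared per-point MLP standing in for a real point-cloud encoder.
        self.encoder = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.score_head = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),  # grasp-quality score per point
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) object point cloud -> (B, N) per-point scores
        feats = self.encoder(points)
        return self.score_head(feats).squeeze(-1)

# Usage: scores = PerPointGraspScorer()(torch.randn(2, 1024, 3))
```

Such per-point scores support fast, parallel ranking of many grasp candidates in a single forward pass, which is what makes these methods attractive despite their robot-centric objectives.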
C. Contextual Grasping
Contextual grasping refers to grasp generation with some
context about the objects and their underlying tasks. This
problem often involves encoding contextual information like
the semantic representation or the object properties into
the network inputs. For example, [9] encodes the target
candidate’s visual, tactile, and texture information while
performing tasks like picking, lifting, or pouring. [10] additionally encodes the relationships between objects in the scene, which allows reasoning about occluded points and enables collision-free grasping. However, this enhanced grasp reliability comes at the extensive cost of acquiring hand-labeled training data. In addition, these works focus on producing human-like grasps or on moving the object of interest to a target position, neither of which involves human actions.
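As a minimal sketch of how such context might be folded into a network's input (the tensor shapes, helper name, and property list are our own assumptions, not taken from [9] or [10]):

```python
import torch

def encode_contextual_input(points, task_emb, obj_props):
    """Hypothetical input encoding for contextual grasping.

    points:    (N, 3) object point cloud
    task_emb:  (D,) semantic/task embedding (e.g., for 'pour' vs. 'pick')
    obj_props: (P,) physical properties (mass, friction, fragility, ...)

    The global context is broadcast onto every point so that a standard
    point-cloud network consumes geometry and context jointly.
    """
    context = torch.cat([task_emb, obj_props])              # (D + P,)
    context = context.unsqueeze(0).expand(points.shape[0], -1)
    return torch.cat([points, context], dim=-1)             # (N, 3 + D + P)
```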
In summary, none of the aforementioned methods considers a simultaneous human grasp while producing a stable