networks are designed to be geometrically invariant, meaning they are robust to changes in scale, orientation, and position. Geometric invariance is a crucial property in real-world robotic scenarios, where objects appear in diverse poses. Moreover, capsule networks employ a dynamic routing mechanism that allows more flexible information flow between layers. This dynamic routing enables capsules to detect spatial relationships and contributes to a richer understanding of the input data [25] (a minimal sketch of the routing procedure is given below). In the context of object grasping, capsule networks enable object-aware grasping by generating grasp configurations based on object features and geometry. Building on this concept, we have developed a novel architecture, GraspCaps, which takes as input a point cloud representation of an object and outputs a semantic category label together with per-point grasp configurations. Our approach processes the activation of a single capsule in the capsule network to produce per-point grasp vectors and corresponding fitness values. To the best of our knowledge, GraspCaps is the first grasp network architecture that employs a capsule network for object-aware grasping.
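To make the dynamic routing mechanism concrete, the following minimal NumPy sketch implements the routing-by-agreement procedure of [25]. It assumes the prediction vectors from the lower capsule layer have already been computed (the learned transformation matrices are omitted), and all names and sizes are illustrative rather than taken from our implementation.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squash nonlinearity: shrinks vector length into [0, 1) while preserving direction."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def routing_by_agreement(u_hat, num_iters=3):
    """Routing-by-agreement over precomputed predictions.
    u_hat: predictions of lower capsules for each output capsule,
           shape (num_in, num_out, dim_out).
    Returns output capsule vectors of shape (num_out, dim_out)."""
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))                            # routing logits
    for _ in range(num_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)   # coupling coefficients (softmax)
        s = (c[..., None] * u_hat).sum(axis=0)                 # weighted sum of predictions
        v = squash(s)                                          # candidate output capsules
        b += np.einsum('iod,od->io', u_hat, v)                 # raise logits where predictions agree
    return v

# toy usage: 8 input capsules routed to 3 output capsules of dimension 4
v = routing_by_agreement(np.random.randn(8, 3, 4))
print(v.shape)  # (3, 4)
```

The agreement term increases the routing logit of an output capsule whenever a lower capsule's prediction aligns with that capsule's current output, so information flows preferentially along mutually consistent paths.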
The contributions of this paper can be summarized as:
• We present a novel architecture for object-aware grasping that uses a capsule network to process a point cloud representation of an object and generate a corresponding semantic category label along with point-wise grasp synthesis. To the best of our knowledge, this is the first grasping model built on a capsule network.
• We propose an algorithm for generating 6D grasp vectors from point clouds and use it to create a synthetic grasp dataset of 4,576 samples with corresponding object labels and target grasp vectors.
• We conduct a comprehensive series of experiments in both simulation and real-robot setups to rigorously evaluate the effectiveness of the proposed approach.
2. Related Work
Deep learning-based object grasping methods provide enhanced accuracy and adaptability, reduced dependence on manual engineering, and improved robustness to variability in real-world scenarios [20]. Current approaches that process point cloud data can be divided into two categories: (i) approaches that first transform the point cloud into a different data structure [1, 5, 26], and (ii) approaches that process the point cloud directly [14, 21, 22, 28, 33]. Our method falls into the second category.
Processing the point set directly has several advantages: no overhead is added by transforming the point set, and no information is lost in the conversion. However, point sets are by definition unordered, which makes extracting local structures and identifying similar regions non-trivial. PointNet [21] was one of the first architectures to effectively use point set data for training a neural network on an object recognition task. By design, the PointNet architecture is largely invariant to point order, which suits point sets since extracting a natural order from them is non-trivial (a minimal sketch of this permutation-invariant design follows below). However, this also limits the performance of PointNet, as it cannot recognize local structures in point sets. Prior research has illustrated the importance of input order for neural network performance [31]; hence, order should not be disregarded entirely. PointNet++ [22] improves upon PointNet by recognizing local structures in the data. Our network architecture is based in part on the architecture of [3], whose key insight is to split the PointNet architecture into several distinct modules.
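The permutation invariance described above can be illustrated in a few lines: a function applied to each point independently, followed by a symmetric aggregation such as max pooling, produces the same global feature for every ordering of the input. The sketch below is a schematic simplification with random illustrative weights, not the actual PointNet model.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((3, 64)), rng.standard_normal((64, 128))

def global_feature(points):
    """Shared per-point MLP followed by max pooling (a symmetric function).
    points: (N, 3) array; returns a (128,) global feature independent of point order."""
    h = np.maximum(points @ W1, 0)   # shared layer 1 + ReLU, applied per point
    h = np.maximum(h @ W2, 0)        # shared layer 2 + ReLU
    return h.max(axis=0)             # max over points: order-invariant aggregation

cloud = rng.standard_normal((1024, 3))
shuffled = rng.permutation(cloud)    # same points, different order
assert np.allclose(global_feature(cloud), global_feature(shuffled))
```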
Later research showed successful results on point sets by transforming them into a form that can be processed by a convolutional neural network. PointCNN [14] processes the input data by applying a χ-transform to the point set. DGCNN [33] and Point-GNN [28] employ layer architectures that transform the point set into a graph representation and apply convolution to the resulting graph edges (sketched below).
Several approaches have succeeded in processing point sets with a CNN by first transforming the point set into a more regular data structure, such as a 3D voxel grid [1], a top-down view [13, 17, 26], or multi-view 2D images [5]. The resulting data structures can be processed with existing deep neural network architectures. These conversions come with significant limitations, however, as considerable information is lost when converting the point cloud to a different structure: natural point densities are lost when converting to a voxel grid, and spatial relations between points are lost when converting to a top-down image. Additionally, the generated voxel grid may be more voluminous than the original point set, as many of its voxels are likely to remain empty [21] (see the sketch below). Due to these considerations, we designed the GraspCaps architecture to process the point cloud directly. Moreover, our method utilizes capsule activations to generate per-point grasp configurations. The understanding of spatial hierarchies afforded by capsule networks distinguishes GraspCaps as a pioneering solution for object grasping. Unlike the reviewed approaches, our approach avoids the excessive pooling layers employed in CNN architectures, which can result in a loss of detailed spatial information.
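The sparsity argument can be made concrete with a small sketch: voxelizing a surface-like point cloud typically occupies only a few percent of the grid cells, and a binary occupancy grid additionally discards how many points fell into each cell. The grid resolution and data below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# illustrative cloud: 2048 points sampled on a unit sphere surface
pts = rng.standard_normal((2048, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)

res = 32                                          # 32^3 voxel grid
idx = ((pts + 1) / 2 * (res - 1)).astype(int)     # map [-1, 1]^3 to voxel indices
grid = np.zeros((res, res, res), dtype=bool)
grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True      # binary occupancy

occupied = grid.sum()
print(f"{occupied} of {res**3} voxels occupied "
      f"({100 * occupied / res**3:.1f}%)")        # typically only a few percent
print(f"occupied cells hold {2048 / occupied:.1f} points on average; "
      "binary occupancy discards this density information")
```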
In the field of grasp generation, S4G [23] extended the
PointNet architecture to generate 6D grasps based on the
input point set. Grasp pose detection (GPD) [29] generates grasp candidates from an input point cloud and evaluates their fitness, classifying each candidate as either successful or unsuccessful. PointNetGPD [16] builds upon the idea of GPD and expands on it by employing the PointNet architecture to evaluate the