
2 Convolutional Neural Networks (CNNs)
Convolutional neural networks (CNNs) are feed-forward neural networks that extract features from
data in a hierarchical manner. The architecture of CNNs is inspired by visual perception [16]. CNN
architectures date back decades [21] and consist of convolutional layers that extract high-level
features from images. CNNs detect these features in images and learn how to recognize objects from
this information. Layers near the input detect simpler features such as edges, and as the layers get deeper,
they detect more complex features, such as eyes, noses, or an entire face in the case of face recognition.
The network then combines all of these learned features to make a final prediction. Deep CNNs have made
significant contributions to computer vision tasks such as image classification [9, 18]. They have
also been successfully applied to other computer vision fields, such as object detection [4, 25, 32]
and face recognition [29].
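To make this hierarchy concrete, the following is a minimal sketch of such a network, assuming PyTorch and hypothetical layer sizes; it is illustrative only and not a reference architecture from the cited works.

# Minimal sketch of a CNN classifier (hypothetical layer sizes, PyTorch assumed).
# Early convolutional layers respond to simple patterns such as edges; deeper
# layers combine them into more complex features before the final prediction.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low-level features (edges, blobs)
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # mid-level features (corners, textures)
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # higher-level features (object parts)
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)      # final prediction from learned features

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

logits = TinyCNN()(torch.randn(1, 3, 32, 32))  # e.g. a single 32x32 RGB image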
Limitations:
CNNs perform exceptionally well when applied to images that resemble the training
dataset [7]. Despite this success, a CNN performs poorly when it receives the same
image from a different viewpoint [1, 26]. Sliding a convolutional kernel across an image contributes to
translational invariance, but it does not guarantee equivariance. Max-pooling [30] was introduced to further
aid in creating positional invariance. On close scrutiny, the pooling operation stacked with a convolutional
layer will only detect features; it does not preserve any spatial relationships between the detected features.
The difference can be illustrated with a CNN detecting a human face. An average human face has
a pair of eyes, a nose, and a mouth; however, the mere presence of these parts in an image does not qualify
it as an image of a human face. Therefore, a system that only detects certain features in an image,
without any information about their spatial arrangement, can fail in many scenarios. In short, the
pooling operation, together with convolution, was supposed to introduce positional, orientational, and
proportional invariance, but in practice it became a source of the very failures described above [26].
Including more data or using methods like data augmentation is commonly used to tackle this problem,
ensuring that the model is trained on as many viewpoints and orientations as possible. However, this
is a very crude way of handling the issue.
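As a toy illustration of this loss of spatial arrangement, the NumPy snippet below (with made-up activation values) applies a single max-pooling step to two feature maps that contain the same "eye" and "mouth" activations in different relative positions; both collapse to the same pooled output.

# Toy illustration (NumPy) of how max-pooling discards spatial arrangement:
# two feature maps containing the same activations at different positions
# collapse to the same pooled value, so their relative layout is lost.
import numpy as np

def max_pool(x, size=2):
    h, w = x.shape
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

eyes_above_mouth = np.array([[0.9, 0.0],
                             [0.0, 0.8]])
mouth_above_eyes = np.array([[0.0, 0.8],
                             [0.9, 0.0]])

print(max_pool(eyes_above_mouth))  # [[0.9]]
print(max_pool(mouth_above_eyes))  # [[0.9]]  -- same output, arrangement lost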
3 Capsule Networks (CapsNet)
Hinton et al. [12] proposed the first capsule networks, which, unlike conventional CNNs, were designed
to encode and preserve the underlying spatial information between their learned features. The basic
idea behind the introduction of capsules by Hinton et al. [12] was to create a neural network capable of
performing inverse graphics [11]. Computer graphics deals with generating a visual image from some
internal hierarchical representation of geometric data. This internal representation consists of matrices
that represent the relative positions and orientations of the geometric objects. The representation
is then converted to an image that is finally rendered on the screen. Hinton and his colleagues argued in their
paper [12] that humans deconstruct the visual information received through the eyes into a hierarchical
representation and use this representation for recognition. From a pure machine learning
perspective, this means that the network should be able to deconstruct a scene into co-related parts,
which can be represented hierarchically [19]. To achieve this, Hinton et al. [12] proposed to augment
the conventional neural network architecture with the notion of entities: each entity
encapsulates a certain number of neurons and learns to recognize an implicitly defined visual entity
over a limited domain of viewing conditions and deformations.
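To illustrate the "graphics" direction that capsules aim to invert, the sketch below (a hypothetical 2-D NumPy example, not taken from the cited papers) composes a part's pose matrix with the pose of the whole object to place the part in the image; inverse graphics would recover such pose relationships from the image instead.

# Sketch of the forward "graphics" direction (assumed 2-D toy example): a part's
# pose relative to the whole is a matrix, and composing it with the whole-object
# pose places the part in image coordinates.
import numpy as np

def pose(tx, ty, theta):
    """Homogeneous 2-D pose matrix: rotation by theta, then translation by (tx, ty)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0,  0,  1]])

face_in_image = pose(40.0, 60.0, np.pi / 8)   # where the face sits in the image
eye_in_face   = pose(-5.0,  8.0, 0.0)         # where an eye sits relative to the face

# Composing the two poses gives the eye's position and orientation in the image.
eye_in_image = face_in_image @ eye_in_face
print(eye_in_image[:2, 2])                    # translation component of the composed pose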
3.1 Evolution of Capsules
In the first proposed capsule networks [12], the output of each capsule consisted of the probability
that a specific feature exists, along with a set of instantiating parameters¹. The goal of Hinton et al. [12]
was not to recognize or classify the objects in an image, but rather to force the outputs of a capsule
to recognize the pose in an image. A significant limitation of this first implementation of capsule
networks was that it required an additional, external set of inputs specifying how the image had
been transformed in order for each of the entities to work. Although these transformations could be learned in
principle, the more significant challenge was to devise a way to instruct each of the capsules to discover
the underlying hierarchical relationship between the transformations in a complete end-to-end training
setting, without explicitly using additional inputs.
¹ Instantiating parameters [12] may include and encode the underlying pose, lighting, and deformation of
a particular feature relative to the other features detected by the capsules in the image.
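As a rough sketch of this idea, the snippet below (PyTorch, with hypothetical sizes and names; not the authors' exact architecture) shows a single capsule that outputs a presence probability together with a small set of instantiating parameters, and accepts the externally supplied transformation discussed above.

# Minimal sketch of a single capsule in the spirit of the first capsule networks:
# from an input vector it emits the probability that its visual entity is present
# together with a small set of instantiating parameters (here, a 2-D pose).
# Hypothetical sizes; illustrative only.
import torch
import torch.nn as nn

class Capsule(nn.Module):
    def __init__(self, in_dim, hidden=32, pose_dim=2):
        super().__init__()
        self.recognise = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.pose = nn.Linear(hidden, pose_dim)       # instantiating parameters (e.g. x, y)
        self.presence = nn.Linear(hidden, 1)          # probability that the entity is present

    def forward(self, x, known_shift=None):
        h = self.recognise(x)
        pose = self.pose(h)
        if known_shift is not None:                   # the external transformation input that
            pose = pose + known_shift                 # the first capsule networks required
        return torch.sigmoid(self.presence(h)), pose

p, pose = Capsule(in_dim=784)(torch.randn(1, 784), known_shift=torch.tensor([[2.0, -1.0]]))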