Effectiveness of the Recent Advances
in Capsule Networks
Nidhin Harilal
Department of Computer Science
University of Colorado, Boulder
nidhin.harilal@colorado.edu
Rohan Patil
Department of Computer Science
University of California, San Diego
rpatil@ucsd.edu
Abstract
Convolutional neural networks (CNNs) have revolutionized the field of deep neural
networks. However, recent research has shown that CNNs fail to generalize under
various conditions; hence, the idea of capsules was introduced in 2011, though
the real surge of research started in 2017. In this paper, we present an overview of
the recent advances in capsule architecture and routing mechanisms. In addition, we
find that the relative focus in recent literature is on modifying the routing procedure or
the architecture as a whole, while the study of other, finer components, specifically the
squash function, is lacking. Thus, we also present some new insights regarding the effect
of squash functions on the performance of capsule networks. Finally, we conclude
by discussing and proposing possible opportunities in the field of capsule networks.
1 Introduction
Over the last few years, neural networks have made remarkable progress in various tasks ranging
from vision tasks like image recognition and object segmentation to machine translation tasks. The
availability of huge amounts of data has made it possible for neural networks to excel in different
areas of computer vision. The considerable success lies in the fact that CNNs can automatically
extract high-level features from images, which are much more powerful than human-designed features.
Capsule Networks (CapsNet) were introduced by Hinton et al. [12, 34], which addressed the significant
limitations of CNNs and showed superior performance on the MNIST [22] dataset. Presently, capsule
networks are regarded as one of the most promising breakthroughs in deep learning. The primary
reason behind this is that capsules’ idea provides a much more promising way of dealing with different
variations in images, including position, scale, orientation, and lighting, than the currently employed
methods in the neural networks community.
Despite capsule networks showing an increasingly positive impact on various vision tasks [2, 3, 17], the
lack of architectural knowledge has limited researchers to exploit the full potential of this new field.
Therefore, this paper aims to provide insights behind the working of capsule networks and critically
review the latest advances in this field. First, we briefly discuss the reason behind the introduction
of capsule networks, followed by a description of its components. Then, we provide an analysis
describing aspects and limitations of various components of different capsule networks, followed by an
analysis of augmenting squash functions. Lastly, we conclude our analysis by describing opportunities
for further research in this field.
* These authors contributed equally to this work
arXiv:2210.05834v1 [cs.CV] 11 Oct 2022
2 Convolutional Neural Networks (CNNs)
Convolutional neural networks (CNN) are feed-forward neural networks that can extract features from
data in a hierarchical structure. The architecture of CNN is inspired by visual perception [16]. CNN
architectures date back decades [21], consisting of convolutional layers that can extract high-level
features from images. CNNs detect these features in images and learn how to recognize objects with
this information. Layers near the start detect simpler features like edges, and as the layers get deeper,
they detect more complex features like eyes, noses, or an entire face in case of face recognition. It then
uses all of these features, which it has learned to make a final prediction. Deep CNNs have provided
a significant contribution in computer vision tasks such as image classification [
9
,
18
]. They have
also been successfully applied to other computer vision fields, such as object detection [
4
,
25
,
32
],
face recognition [29], etc.
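The hierarchical feature extraction described above can be illustrated with a minimal sketch (our own toy example, not from the paper): a single convolution with a hand-designed vertical-edge kernel, the kind of filter an early CNN layer would instead learn from data.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation, as used in CNN layers."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A toy 6x6 image with a vertical edge between columns 2 and 3.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# A hand-designed vertical-edge detector; a CNN *learns* kernels
# like this from data instead of having them designed.
kernel = np.array([[-1.0, 1.0]])

response = conv2d(image, kernel)
print(response.max())  # -> 1.0, strongest response sits on the edge
```

Deeper layers would combine many such feature maps into progressively more abstract detectors.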
Limitations:
CNNs perform exceptionally well when inference is run on images that
resemble the training dataset [7]. Despite being successful, CNNs perform poorly when they
receive the same image from a different viewpoint [1, 26]. Convolving a kernel across an
image ensures invariance, but it does not guarantee equivariance. Max-pooling [30] was
introduced to further aid in creating positional invariance. On close scrutiny, the pooling
operation stacked with a convolutional layer will only detect features but not preserve any
spatial relationships between the detected features. The difference can be illustrated with
a CNN detecting a human face. An average human face will have a pair of eyes, a nose, and
a mouth; however, an image in which these parts are present does not necessarily qualify as
an image of a human face. Therefore, a system that merely detects certain features in an
image, without any information about their spatial arrangement, can fail in many scenarios.
In short, the pooling operation, along with convolution, was supposed to introduce positional,
orientational, and proportional invariances, but instead became a cause of the problems
described above [26]. Including more data or using methods like data augmentation is a
common way to tackle this problem, ensuring that the model is trained on as many viewpoints
and orientations as possible. However, this is a very crude way of handling it.
3 Capsule Networks (CapsNet)
Hinton et al. [12] proposed the first capsule networks, which, unlike conventional CNNs, were designed
to encode and preserve the underlying spatial information between their learned features. The basic
idea behind the introduction of capsules by Hinton et al. [12] was to create a neural network capable of
performing inverse graphics [11]. Computer graphics deals with generating a visual image from some
internal hierarchical representation of geometric data. This internal representation consists of matrices
that represent the relative positions and orientation of the geometrical objects. This representation
is then converted to an image that is finally rendered on the screen. Hinton and his colleagues, in their
paper [12], argued that humans deconstruct a hierarchical representation of the visual information
received through the eyes and use this representation for recognition. From a pure machine learning
perspective, this means that the network should be able to deconstruct a scene into co-related parts,
which can be hierarchically represented [19]. To achieve this, Hinton et al. [12] proposed to augment
the conventional idea of neural network architecture to reflect the idea of several entities. Each entity
encapsulates a certain number of neurons and learns to recognize an implicitly defined visual entity
over a limited domain of viewing conditions and deformations.
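Each such entity (capsule) outputs a vector whose length is typically read as the probability that the entity exists. A common non-linearity serving this purpose in later capsule networks, and an example of the "finer components" this paper examines, is the squash function from the dynamic-routing formulation. A minimal NumPy sketch (the `eps` stabilizer is our own addition for numerical safety):

```python
import numpy as np

def squash(s, eps=1e-8):
    """Shrink the vector's length into [0, 1) while preserving its
    direction, so the length can be read as an existence probability."""
    sq_norm = np.sum(s ** 2)
    norm = np.sqrt(sq_norm + eps)
    return (sq_norm / (1.0 + sq_norm)) * (s / norm)

# A long vector keeps its direction but its length approaches 1 ...
long_vec = squash(np.array([3.0, 4.0]))     # ||s|| = 5
# ... while a short vector is shrunk towards zero length.
short_vec = squash(np.array([0.06, 0.08]))  # ||s|| = 0.1

print(np.linalg.norm(long_vec))   # close to 1
print(np.linalg.norm(short_vec))  # close to 0
```

Direction encodes the instantiation parameters, length the probability; this separation is what lets a capsule represent both "what" and "how" in a single vector.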
3.1 Evolution of Capsules
In the first proposed capsule networks [12], the output of each capsule consisted of the probability
that a specific feature exists, along with a set of instantiating parameters¹. The goal of Hinton et al. [12]
was not to recognize or classify the objects in an image, but rather to force the outputs of a capsule
to recognize the pose in an image. A significant limitation of this first implementation of capsule
networks was that it required an additional external set of inputs specifying how the image had
been transformed for each of the entities to work. Although these transformations could be learned in
principle, the more significant challenge was to devise a way to instruct each of the capsules to discover
the underlying hierarchical relationship between the transformations in a complete end-to-end training
setting without explicitly using additional inputs.
¹ Instantiating parameters [12] may include and encode the underlying pose, lighting, and deformation of
a particular feature relative to the other features that are detected by the capsules in the image.