the network can be folded into layers with shared
parameters and deepened without increasing the
number of parameters.
When applied to CNNs, this technique amounts to sharing kernels between isomorphic convolutional layers, i.e., layers with the same configuration (kernel size and number of input and output channels). In traditional CNNs, weight sharing is limited to neurons within the same feature map. With kernel-sharing, a kernel can be reused in feature maps at different layers. To update a shared kernel with backpropagation, its gradients must be accumulated across all the feature maps it produces throughout the network.
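As a minimal illustration (a sketch in PyTorch, not the implementation used in our experiments), this accumulation happens automatically when the same convolution module is reused at several depths:

    import torch
    import torch.nn as nn

    shared = nn.Conv2d(16, 16, kernel_size=3, padding=1)   # one shared set of kernels

    x = torch.randn(8, 16, 32, 32)
    h = x
    for _ in range(4):                      # four isomorphic layers reuse the same module
        h = torch.relu(shared(h))

    loss = h.mean()
    loss.backward()                         # autograd sums the gradient contributions
                                            # of all four layers into shared.weight.grad
    print(shared.weight.grad.shape)         # torch.Size([16, 16, 3, 3])

The channel counts and depth above are arbitrary; the point is only that a single parameter tensor receives the summed gradient from every layer that uses it.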
With kernel-sharing, the learning algorithm does not search for layer-specific kernels; instead, it looks for shared kernels that can play different roles at different depths of the network. This can be viewed as a form of regularization of the network complexity. In common regularization techniques such as the L2-norm [13], the complexity of over-parameterized networks is controlled by pushing the majority of parameters toward near-zero values. With shared kernels, in contrast, the number of parameters is significantly reduced and the kernels are forced to maximally exploit their learning capacity.
We applied the proposed kernel-sharing to two different deep CNN architectures, ConvMixer [14] and SE-ResNet [15], on the CIFAR-10 and CIFAR-100 datasets [16]. The first layer of ConvMixer performs a convolution with large kernels and stride; it is followed by a cascade of isomorphic, interleaved depthwise and pointwise convolutional layers. The SE-ResNet architecture has several convolutional stages with squeeze-and-excitation modules [15] in between, where each stage consists of several isomorphic convolutional layers.
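For intuition, the ConvMixer layout can be sketched roughly as follows (a simplified PyTorch sketch with illustrative hyperparameters; the residual connections and normalization layers of [14] are omitted):

    import torch.nn as nn

    dim, depth = 256, 8                      # illustrative values, not those of [14]
    convmixer_like = nn.Sequential(
        nn.Conv2d(3, dim, kernel_size=7, stride=7),   # patch embedding: large kernel and stride
        nn.GELU(),
        *[nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=9, padding=4, groups=dim),  # depthwise
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=1),                         # pointwise
            nn.GELU(),
        ) for _ in range(depth)],            # isomorphic blocks with identical configuration
    )

All depth blocks have exactly the same configuration, which is what makes them candidates for kernel-sharing.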
In the extreme case of applying kernel-sharing to all the isomorphic layers, the classification accuracy of the ConvMixer model dropped by only 1.8% and 3.6% on the CIFAR-10 and CIFAR-100 datasets, respectively, while the number of its trainable parameters was reduced by 13.4 and 9.6 times with respect to the baseline model. Similarly, the accuracy drop in SE-ResNet with kernel-sharing was negligible, while the number of its trainable parameters was drastically smaller.
Our results also indicate that kernel-sharing can mitigate overfitting when the network depth or the number of trainable parameters is kept the same as in the baseline model.
2 Method
State-of-the-art CNNs are usually made of tens of convolutional layers stacked on top of each other, and some of these layers have similar configurations, such as the same number of channels and kernel size. In multi-stage deep CNNs, the model typically has several stages of consecutive convolutional layers with the same configuration (see Fig. 1a). In
other words, layer configurations are similar in each
stage and vary across stages. For example, the
ConvMixer model [14] has only one stage of convo-
lutional layers, and the SE-ResNet [15] model has
three or four stages depending on the given dataset.
Even in deep CNNs with multi-block stages, layers may have different configurations within the same block yet match the corresponding layers in other blocks (see Fig. 1b).
We introduce kernel-sharing, or inter-layer weight-sharing, for deep CNNs to reduce the number of trainable parameters and, consequently, the memory footprint, which is especially useful in memory-constrained situations. In the context of CNNs, weight-sharing is traditionally an intra-layer concept: the same kernel is applied at different spatial locations within a layer, while the kernels differ between layers. Here, we propose to go beyond this and train deep CNNs with shared kernels (i.e., kernel weights) among layers with the same configuration, which we hereafter call “isomorphic layers” (see Fig. 2).
We define a sharing group as a set of isomorphic layers that share their kernels and are located in the same stage (Fig. 1a) or in different stages (Fig. 1b). One might also partition a set of isomorphic layers into two or more sharing groups. For example, the n isomorphic layers in Fig. 1a could be divided into two sharing groups of n/2 isomorphic layers each.
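A hypothetical sketch of such a partition (in PyTorch, with arbitrary sizes) builds a stack of n isomorphic depthwise layers in which the first half reuses one convolution module and the second half reuses another:

    import torch.nn as nn
    import torch.nn.functional as F

    n, dim = 8, 256                                   # hypothetical depth and width
    group_a = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
    group_b = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    # the first n/2 layers draw their kernels from group_a, the rest from group_b
    layers = [group_a if i < n // 2 else group_b for i in range(n)]

    def forward(x):
        for conv in layers:              # each layer applies its group's shared kernels
            x = F.gelu(conv(x))
        return x

Only two kernel tensors are trained here, instead of n, while the effective depth of the stack is unchanged.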
During the forward pass, isomorphic layers in a
sharing group use the same set of trainable pa-
rameters (i.e., shared kernels). Hence, we should