
Pushing the Efficiency Limit Using Structured Sparse Convolutions
Vinay Kumar Verma1*, Nikhil Mehta1*, Shijing Si3, Ricardo Henao1, Lawrence Carin1,2
1Duke University 2KAUST Saudi Arabia 3SEF, Shanghai International Studies University
*The authors contributed equally to this work. Correspondence to {vinayugc, nikhilmehta.dce}@gmail.com.
Abstract
Weight pruning is among the most popular approaches
for compressing deep convolutional neural networks. Re-
cent work suggests that in a randomly initialized deep neu-
ral network, there exist sparse subnetworks that achieve
performance comparable to the original network. Unfortu-
nately, finding these subnetworks involves iterative stages
of training and pruning, which can be computationally ex-
pensive. We propose Structured Sparse Convolution (SSC),
which leverages the inherent structure in images to reduce
the parameters in the convolutional filter. This leads to im-
proved efficiency of convolutional architectures compared to
existing methods that perform pruning at initialization. We
show that SSC is a generalization of commonly used layers
(depthwise, groupwise and pointwise convolution) in “effi-
cient architectures.” Extensive experiments on well-known
CNN models and datasets show the effectiveness of the pro-
posed method. Architectures based on SSC achieve state-
of-the-art performance compared to baselines on CIFAR-
10, CIFAR-100, Tiny-ImageNet, and ImageNet classifica-
tion benchmarks. Our source code is publicly available at
https://github.com/vkvermaa/SSC.
1. Introduction
Overparameterized deep neural networks (DNNs) are known
to generalize well on the test data [4, 1]. However, overpa-
rameterization increases the network size, making DNNs
resource-hungry and leading to extended training and in-
ference time. This hinders the training and deployment of
DNNs on low-power devices and limits the application of
DNNs in systems with strict latency requirements. Several
efforts have been made to reduce the storage and com-
putational complexity of DNNs using model compression
[19, 62, 51, 45, 33, 3, 34, 7, 44]. Network pruning is the
most popular approach for model compression. In network
pruning, we compress a large neural network by pruning
redundant parameters while maintaining the model perfor-
mance. The pruning approaches can be divided into two
categories: unstructured and structured. Unstructured pruning
removes redundant connections in the kernel, leading
to sparse tensors [27, 18, 64]. Unstructured sparsity produces
sporadic connectivity in the neural architecture, causing
irregular memory access [55], which adversely impacts
acceleration on hardware platforms. On the other hand,
structured pruning involves pruning parameters that follow
a high-level structure (e.g., pruning parameters at the
filter-level [29, 34, 12]). Typically, structured pruning leads to prac-
tical acceleration, as the parameters are reduced while the
memory access remains contiguous. Existing pruning methods
typically involve a three-stage pipeline: pretraining, pruning,
and finetuning, where the latter two stages are repeated
until the desired pruning ratio is achieved.
While the final pruned model leads to a low inference cost,
the cost to achieve the pruned architecture remains high.
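To make the distinction concrete, below is a minimal PyTorch sketch (not taken from the paper; the layer sizes and sparsity level are illustrative) contrasting an unstructured magnitude mask with a structured, filter-level mask on a single convolutional layer.

```python
import torch
import torch.nn as nn

# Illustrative only: unstructured vs. structured (filter-level) magnitude
# pruning of a single convolutional layer. Sizes and sparsity are arbitrary.
conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)
w = conv.weight.data                     # shape: [128, 64, 3, 3]
sparsity = 0.9                           # fraction of weights/filters to drop

# Unstructured pruning: zero out individual small-magnitude weights.
# The surviving connectivity is irregular, so memory access stays sporadic.
threshold = w.abs().flatten().kthvalue(int(sparsity * w.numel())).values
unstructured_mask = (w.abs() > threshold).float()

# Structured pruning: rank whole output filters by their L1 norm and keep
# only the strongest, so the remaining tensor stays dense and contiguous.
filter_scores = w.abs().sum(dim=(1, 2, 3))       # one score per output filter
num_keep = int((1 - sparsity) * w.shape[0])
kept = filter_scores.topk(num_keep).indices
structured_mask = torch.zeros_like(w)
structured_mask[kept] = 1.0

conv.weight.data = w * unstructured_mask         # or: w * structured_mask
```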
The lottery ticket hypothesis (LTH) [15, 16] showed that
a randomly initialized, overparameterized neural network con-
tains a sub-network, referred to as the “winning ticket,” that
when trained in isolation achieves the same test accuracy as
the original network. Similar to LTH, there is compelling
evidence [38, 39, 14, 13, 2, 1, 45] suggesting that overpa-
rameterization is not essential for high test accuracy, but is
helpful for finding a good initialization for the network [30, 65].
However, the procedure to find such sub-networks involves
iterative pruning [15], making it computationally intensive. If
we know the sub-network beforehand, we can train a much
smaller and more efficient model with only 1-10% of the parame-
ters of the original network, reducing the computational cost
involved during training.
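For reference, the iterative procedure behind LTH [15] can be summarized by the sketch below. This is an illustration of iterative magnitude pruning with weight rewinding, not code from this paper; `model` and `train_one_cycle` are placeholder names, and `train_one_cycle` is assumed to keep masked weights at zero during training.

```python
import copy
import torch

# Sketch of LTH-style iterative magnitude pruning with weight rewinding.
def find_winning_ticket(model, train_one_cycle, rounds=5, prune_frac=0.2):
    init_state = copy.deepcopy(model.state_dict())       # keep original init
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()
             if p.dim() > 1}                              # weight tensors only

    for _ in range(rounds):
        train_one_cycle(model, masks)                     # full training pass
        # Prune the smallest-magnitude surviving weights in each layer.
        for name, param in model.named_parameters():
            if name not in masks:
                continue
            alive = param.data[masks[name].bool()].abs()  # surviving weights
            k = int(prune_frac * alive.numel())
            if k == 0:
                continue
            threshold = alive.kthvalue(k).values
            masks[name] *= (param.data.abs() > threshold).float()
        # Rewind surviving weights to their original initialization.
        model.load_state_dict(init_state)
        for name, param in model.named_parameters():
            if name in masks:
                param.data *= masks[name]
    return masks
```

Each round requires a full training pass, which is exactly the computational burden the discussion above refers to.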
An open research question concerns how to design a
sub-network without undergoing the expensive multi-stage
process of training, pruning and finetuning. There have been
recent attempts [27, 52] to alleviate this issue, involving a
one-time pruning of the neural network at initialization by solving
an optimization problem to detect and remove unimportant
connections. Once the sub-network is identified, the
model is trained without any further pruning. This
procedure of pruning only once is referred to as pruning at
initialization or foresight pruning [52].
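As an illustration of this one-shot setting, the sketch below scores connections at initialization with a saliency of the form |weight × gradient| and keeps only the top fraction. Here `model`, `loss_fn`, and the mini-batch are placeholders; the sketch conveys the spirit of saliency-based foresight pruning rather than the exact criterion of [27] or [52].

```python
import torch

# Sketch of saliency-based pruning at initialization (one-shot).
def prune_at_init(model, loss_fn, batch, keep_ratio=0.05):
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)
    loss.backward()                                   # a single backward pass

    # Connection saliency: |weight * gradient|, computed at initialization.
    scores = {n: (p.data * p.grad).abs()
              for n, p in model.named_parameters() if p.dim() > 1}

    all_scores = torch.cat([s.flatten() for s in scores.values()])
    k = int(keep_ratio * all_scores.numel())
    threshold = all_scores.topk(k).values[-1]         # global cutoff

    # Keep only the most salient connections; train the masked model afterwards.
    return {n: (s >= threshold).float() for n, s in scores.items()}
```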
While these methods can find an approximation to the winning ticket,
they have the following limitations hindering their practical applicability:
(1) The initial optimization procedure still requires