Pushing the Efficiency Limit Using Structured Sparse Convolutions
Vinay Kumar Verma1*, Nikhil Mehta1*, Shijing Si3, Ricardo Henao1, Lawrence Carin1,2
1Duke University  2KAUST, Saudi Arabia  3SEF, Shanghai International Studies University
*The authors contributed equally to this work. Correspondence to {vinayugc, nikhilmehta.dce}@gmail.com.
Abstract
Weight pruning is among the most popular approaches for compressing deep convolutional neural networks. Recent work suggests that in a randomly initialized deep neural network, there exist sparse subnetworks that achieve performance comparable to the original network. Unfortunately, finding these subnetworks involves iterative stages of training and pruning, which can be computationally expensive. We propose Structured Sparse Convolution (SSC), which leverages the inherent structure in images to reduce the parameters in the convolutional filter. This leads to improved efficiency of convolutional architectures compared to existing methods that perform pruning at initialization. We show that SSC is a generalization of commonly used layers (depthwise, groupwise, and pointwise convolution) in "efficient architectures." Extensive experiments on well-known CNN models and datasets show the effectiveness of the proposed method. Architectures based on SSC achieve state-of-the-art performance compared to baselines on CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet classification benchmarks. Our source code is publicly available at https://github.com/vkvermaa/SSC.
1. Introduction
Overparameterized deep neural networks (DNNs) are known to generalize well on the test data [4, 1]. However, overparameterization increases the network size, making DNNs resource-hungry and leading to extended training and inference time. This hinders the training and deployment of DNNs on low-power devices and limits the application of DNNs in systems with strict latency requirements. Several efforts have been made to reduce the storage and computational complexity of DNNs using model compression [19, 62, 51, 45, 33, 3, 34, 7, 44]. Network pruning is the most popular approach for model compression. In network pruning, we compress a large neural network by pruning redundant parameters while maintaining the model performance. The pruning approaches can be divided into two categories: unstructured and structured. Unstructured pruning removes redundant connections in the kernel, leading to sparse tensors [27, 18, 64]. Unstructured sparsity produces sporadic connectivity in the neural architecture, causing irregular memory access [55] that adversely impacts acceleration on hardware platforms. On the other hand, structured pruning involves pruning parameters that follow a high-level structure (e.g., pruning parameters at the filter level [29, 34, 12]). Typically, structured pruning leads to practical acceleration, as the parameters are reduced while memory access remains contiguous. Existing pruning methods typically involve a three-stage pipeline: pretraining, pruning, and finetuning, where the latter two stages are repeated until a desired pruning ratio is achieved. While the final pruned model leads to a low inference cost, the cost to obtain the pruned architecture remains high.
The lottery ticket hypothesis (LTH) [15, 16] showed that a randomly initialized overparameterized neural network contains a sub-network, referred to as the "winning ticket," that when trained in isolation achieves the same test accuracy as the original network. Similar to LTH, there is compelling evidence [38, 39, 14, 13, 2, 1, 45] suggesting that overparameterization is not essential for high test accuracy, but is helpful for finding a good initialization for the network [30, 65]. However, the procedure to find such sub-networks involves iterative pruning [15], making it computationally intensive. If we knew the sub-network beforehand, we could train a much smaller and more efficient model with only 1-10% of the parameters of the original network, reducing the computational cost involved during training.
An open research question concerns how to design a sub-network without undergoing the expensive multi-stage process of training, pruning, and finetuning. There have been recent attempts [27, 52] to alleviate this issue, involving a one-time neural network pruning at initialization by solving an optimization problem for detecting and removing unimportant connections. Once the sub-network is identified, the model is trained without carrying out further pruning. This procedure of pruning only once is referred to as pruning at initialization or foresight pruning [52]. While these methods can find an approximation to the winning ticket, they have the following limitations hindering their practical applicability: (1) The initial optimization procedure still requires large memory, since the optimization process is carried out over the original overparameterized model. (2) The obtained winning ticket is specific to the particular dataset on which it is approximated, i.e., a network pruned using a particular dataset may not perform optimally on a different dataset. (3) These pruning-based methods lead to unstructured sparsity in the model. Due to common hardware limitations, it is very difficult to get a practical speedup from unstructured compression.
In this paper, we design a novel structured sparse convolution (SSC) filter for convolutional layers, requiring significantly fewer parameters compared to standard convolution. The proposed filter leverages the inherent spatial properties of images. Commonly used deep convolutional architectures, when coupled with SSC, outperform other state-of-the-art methods that prune at initialization. Unlike typical pruning approaches, the proposed architecture is sparse by design and does not require multiple stages of pruning. The sparsity of the architecture is dataset agnostic and leads to better transferability of the model when compared to existing state-of-the-art methods that prune at initialization. We also show that the proposed filter has implicit orthogonality that ensures minimum filter redundancy at each layer. Additionally, we show that the proposed filter can be viewed as a generalization of existing efficient convolutional filters used in group-wise convolution (GWC) [59], point-wise convolution (PWC) [48], and depth-wise convolution (DWC) [50]. Extensive experiments and ablation studies on standard benchmarks demonstrate the efficacy of the proposed filter. Moreover, we further compress existing efficient models such as MobileNetV2 [40] and ShuffleNetV2 [35] while achieving performance comparable to the original models.
2. Methods
We propose the Structured Sparse Convolution (SSC) filter, which is composed of layered, spatially sparse $K \times K$ and $1 \times 1$ kernels. Unlike typical CNN filters, which have kernels of a fixed size, the SSC filter has three types of kernels, as shown in Figure 1. The heterogeneous kernels are designed to have varying receptive fields that can capture different features in the input. As shown in Section 2.1, heterogeneity in the kernels allows the neural network layer to accumulate information from different spatial locations in the feature map while significantly reducing redundancy in the number of parameters.

Figure 1. The three basic components used in the proposed SSC filter. Blue blocks indicate zero-weight locations; the red, orange, and green blocks show the active weight locations in the three different types of kernels.

Figure 2. The proposed convolutional layer with $N$ SSC filters. Blue blocks denote the zero-weight locations in the $3 \times 3$ and $1 \times 1$ kernels, while the other colors show active weights.

Consider layer $l$ of a model with an input $h_{l-1}$ of size $i_{l-1} \times i_{l-1} \times M$, where $i_{l-1}$ corresponds to the spatial dimension (width and height) and $M$ denotes the number of input channels. Assume that layer $l$ has $N$ filters, resulting in an output feature map $h_l$ of size $i_l \times i_l \times N$. We represent the computational and memory cost at the $l$-th layer by the number of floating-point operations ($F_l$) and the number of parameters ($P_l$), respectively. The computational and memory cost associated with a standard convolutional layer with a $K \times K$ kernel is the following:

$$F_l = i_l^2 \times N \times (K^2 M), \quad (1)$$

$$P_l = N \times (K^2 M), \quad (2)$$

where $K^2 M$ represents the total number of parameters from all $M$ channel-specific kernels. As is evident from (1) and (2), reducing $K^2 M$ directly reduces both the number of parameters and the computational cost of the model. This is indeed what our proposed method (SSC) achieves: we design two types of sparse kernels, which form the basic components of SSC.
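As a concrete illustration of (1) and (2), a minimal sketch (ours, with purely illustrative layer sizes) is:

```python
def standard_conv_cost(i_l: int, N: int, K: int, M: int):
    """FLOPs and parameter count of a standard K x K convolutional layer, Eqs. (1)-(2)."""
    flops = i_l ** 2 * N * (K ** 2 * M)   # Eq. (1)
    params = N * (K ** 2 * M)             # Eq. (2)
    return flops, params

# Example: 56 x 56 output, N = 128 filters, 3 x 3 kernels, M = 64 input channels.
print(standard_conv_cost(56, 128, 3, 64))  # (231211008, 73728)
```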
Odd/Even $K \times K$ kernel: The two types of $K \times K$ kernels differ in the location of the enforced sparsity. Considering $S \in \mathbb{R}^{K^2}$ to be the flattened version of the $K \times K$ 2D kernel, we define the odd kernel as:

$$\begin{cases} S[i] = 0, & i \in \{2p \mid 0 \le 2p < K^2,\; p \in \mathbb{N}\} \\ S[i] = w_i, & i \in \{2p+1 \mid 0 < 2p+1 < K^2,\; p \in \mathbb{N}\}, \end{cases} \quad (3)$$

The even kernel is defined in a similar fashion, with the kernel being zero at odd coordinates and non-zero at even coordinates of the filter. Figure 1 (a-b) illustrates the odd and even kernels, respectively, for $K = 3$. These kernels replace the standard $K \times K$ kernels used in a convolutional layer.
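For concreteness, the following sketch (our own illustration, not taken from the paper's repository) builds the binary masks implied by Eq. (3):

```python
import numpy as np

def sparse_kernel_mask(K: int, kind: str = "odd") -> np.ndarray:
    """Binary K x K mask of an odd/even kernel: the flattened kernel is zeroed
    at even flat indices (odd kernel) or at odd flat indices (even kernel)."""
    idx = np.arange(K * K)
    if kind == "odd":          # weights live at odd flat indices
        mask = (idx % 2 == 1)
    else:                      # "even": weights live at even flat indices
        mask = (idx % 2 == 0)
    return mask.reshape(K, K).astype(np.float32)

print(sparse_kernel_mask(3, "odd"))
# [[0. 1. 0.]
#  [1. 0. 1.]
#  [0. 1. 0.]]
```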
SSC Filter: A convolutional layer having $N$ SSC filters is shown in Figure 2. An SSC filter is referred to as an odd (or even) filter if it contains only odd (or even) kernels. For each convolutional layer, an equal number of odd and even filters is used. We make the following modifications to a standard convolution filter having $M$ kernels of size $K \times K$:
1. Among the $M$ different kernels, we replace each kernel at the $kg$-th location with an odd/even kernel, where $g$ is a hyperparameter such that $0 < g < M$ and $k \in \{n \in \mathbb{N} \mid 0 < kg \le M\}$. Note that each filter has only one type (odd/even) of kernel. Each of the $N$ filters has $M/g$ such kernels. The computational cost ($F_{sg}$) and the memory cost ($P_{sg}$) for all the odd/even kernels in a filter are:

$$F_{sg} = i_l^2 \times N \times (K^2 - c)\frac{M}{g}, \quad (4)$$

$$P_{sg} = N \times (K^2 - c)\frac{M}{g}, \quad (5)$$

$$c = \begin{cases} \left\lceil \frac{K^2}{2} \right\rceil & \text{odd kernel} \\ K^2 - \left\lceil \frac{K^2}{2} \right\rceil & \text{even kernel}, \end{cases} \quad (6)$$

where $c$ represents the number of zeros in the kernel and $\lceil \cdot \rceil$ denotes the ceiling function.
2. Out of the remaining $M(1 - 1/g)$ kernel locations in the filter, we place a $1 \times 1$ kernel at a fixed interval of $p$, as shown in Figure 1 (c). Each of the $N$ filters has $M(1 - 1/g)/p$ $1 \times 1$ kernels. The computational and memory cost of these $1 \times 1$ kernels can be defined as:

$$F_{sp} = i_l^2 \times N \times \frac{M(1 - 1/g)}{p}, \quad (7)$$

$$P_{sp} = N \times \frac{M(1 - 1/g)}{p}. \quad (8)$$
3. The SSC filter is empty at the remaining $M(1 - 1/p)(1 - 1/g)$ locations, causing the filter to ignore the corresponding feature maps (input channels). Note that while a particular filter may not act on certain input features, other SSC filters of the convolutional layer will. This is enforced by the shifting procedure introduced below (a code sketch of the resulting per-filter mask follows this list).
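Putting items 1-3 together, the per-filter sparsity pattern can be sketched as below. This is our own illustration rather than the authors' released code; it reuses `sparse_kernel_mask` from the previous sketch, and the exact channel-indexing convention (which channels receive the $K \times K$ versus the $1 \times 1$ kernels) is an assumption on our part.

```python
import numpy as np

def ssc_filter_mask(M: int, K: int, g: int, p: int, kind: str = "odd") -> np.ndarray:
    """Binary mask of one SSC filter, shape (M, K, K). Assumed indexing:
    every g-th input channel holds an odd/even K x K kernel, every p-th of the
    remaining channels holds a 1 x 1 (center-tap) kernel, and the rest are empty."""
    mask = np.zeros((M, K, K), dtype=np.float32)
    kxk_mask = sparse_kernel_mask(K, kind)     # odd/even K x K pattern from Eq. (3)
    kxk_channels = set(range(0, M, g))         # roughly M/g channels
    remaining = [m for m in range(M) if m not in kxk_channels]
    for m in kxk_channels:
        mask[m] = kxk_mask
    for m in remaining[::p]:                   # roughly M(1 - 1/g)/p channels
        mask[m, K // 2, K // 2] = 1.0          # 1 x 1 kernel at the center tap
    return mask
```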
Shift operation: If we naively use SSC filters in a convolutional layer, there will be a loss of information, as all $N$ filters will ignore the same input feature maps. To ensure that each SSC filter attends to a different set of feature maps, we shift the location of all kernels ($K \times K$ and $1 \times 1$) by $(n \bmod q)$ at initialization¹, where $n \in \{1, \ldots, N\}$ denotes the index of the filter and $q := \max(g, p)$. The shift operation across the $N$ filters can be visualized in Figure 2. We can divide the $N$ filters into sets of disjoint filters such that all the filters in a particular set attend to distinct input feature maps. Formally, let the collection of sets be defined as:

$$Q := \{[0, q), [q, 2q), \ldots, [N - (N \bmod q), N)\}, \quad (9)$$

where $[a, a+q)$ denotes the set of filters $a$ through $a + q - 1$. Then, for all $f, f' \in [a, a+q)$, $f$ and $f'$ attend to disjoint input feature maps if $f \ne f'$. Moreover, $f$ and $f'$ are "near-orthogonal" ($f^T f' \approx 0$), since they attend to non-overlapping regions of the input feature maps. As discussed in Section 2.1, the orthogonality property of a layer is of independent interest and allows the network to learn uncorrelated filters. Note that the design of the SSC filter induces structured sparsity, as the sparse region is predetermined and fixed, in contrast to unstructured pruning methods [15, 27, 52].

¹The shift operation is applied only once, before training begins.
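A possible realization of the shift, continuing the sketches above (again our own code; how odd and even filters are interleaved and whether the shift is applied as a cyclic roll along the channel axis are assumptions on our part):

```python
def ssc_layer_masks(N: int, M: int, K: int, g: int, p: int) -> np.ndarray:
    """Masks for all N SSC filters of a layer, shape (N, M, K, K)."""
    q = max(g, p)
    masks = np.zeros((N, M, K, K), dtype=np.float32)
    for n in range(N):
        kind = "odd" if n % 2 == 0 else "even"         # equal numbers of odd/even filters (assumed interleaving)
        base = ssc_filter_mask(M, K, g, p, kind)
        masks[n] = np.roll(base, shift=n % q, axis=0)  # shift kernel locations by (n mod q)
    return masks
```

In practice, one simple way to use such masks is to keep a dense weight tensor and multiply it elementwise by the fixed mask before every convolution, so the predetermined sparsity pattern is preserved throughout training.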
We can quantify the total reduction in the number of floating-point operations ($R_F$) and the number of parameters ($R_P$) with respect to the standard convolutional layer:

$$R_F = \left(1 - \frac{F_{sg} + F_{sp}}{F_l}\right) \times 100\% \quad (10)$$

$$= \left(1 - \frac{1 - c/K^2}{g} - \frac{1 - 1/g}{K^2 p}\right) \times 100\%, \quad (11)$$

$$R_P = \left(1 - \frac{P_{sg} + P_{sp}}{P_l}\right) \times 100\% \quad (12)$$

$$= \left(1 - \frac{1 - c/K^2}{g} - \frac{1 - 1/g}{K^2 p}\right) \times 100\%, \quad (13)$$

where $0 < g, p < M$. The hyperparameters $p$ and $g$ are set to achieve the desired sparsity in the architecture; we use $R_P$ as the guiding principle for choosing $p$ and $g$. One can also target a desired reduction in floating-point operations ($R_F$) to determine the corresponding hyperparameters. However, in our experiments we consider sparsity constraints only.
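As a quick sanity check on Eq. (13), the following helper (our own; the settings in the example are hypothetical and evaluated for an odd filter) computes the parameter reduction for given $K$, $g$, and $p$:

```python
import math

def parameter_reduction(K: int, g: int, p: int, kind: str = "odd") -> float:
    """Parameter reduction R_P (in %) of an SSC layer relative to a standard K x K layer, Eq. (13)."""
    c = math.ceil(K**2 / 2) if kind == "odd" else K**2 - math.ceil(K**2 / 2)
    kept = (1 - c / K**2) / g + (1 - 1 / g) / (K**2 * p)
    return (1 - kept) * 100.0

# Example: 3 x 3 kernels with g = 4 and p = 2 keep roughly 15% of the parameters (R_P ~ 84.7%).
print(round(parameter_reduction(K=3, g=4, p=2), 1))  # 84.7
```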
2.1. Implicit Orthogonality

Recent work [42, 58] shows that deep convolutional networks learn correlated filters in overparameterized regimes. This implies filter redundancy and correlated feature maps when working with deep architectures. The issue of correlation across multiple filters in a convolutional layer has been addressed by incorporating an explicit orthogonality constraint on the filters of each layer [5, 53]. Consider a 2D matrix $W \in \mathbb{R}^{J \times N}$ containing all the filters: $W = [f_1, f_2, \ldots, f_N]$, where $f_n \in \mathbb{R}^J$ is the vector containing all the parameters of the $n$-th filter, and $J = K^2 M$ for a standard convolutional layer. The soft-orthogonality (SO) constraint on a layer $l$ with the corresponding 2D matrix $W_l$ is defined as:

$$\mathcal{L}_{SO} = \lambda \, \| W_l^T W_l - I \|_F^2, \quad (14)$$

where $I \in \mathbb{R}^{N \times N}$ is the identity matrix, $\lambda$ controls the degree of orthogonality, and $\| \cdot \|_F$ is the Frobenius norm.
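A direct PyTorch rendering of the penalty in Eq. (14) might look as follows (a sketch of ours, not the authors' implementation; the regularization weight `lam` and the example layer sizes are placeholders):

```python
import torch

def soft_orthogonality_loss(conv_weight: torch.Tensor, lam: float = 1e-4) -> torch.Tensor:
    """Soft-orthogonality penalty lam * ||W^T W - I||_F^2 for one conv layer."""
    # conv_weight has shape (N, M, K, K); flatten each filter into a column of W.
    N = conv_weight.shape[0]
    W = conv_weight.reshape(N, -1).t()                      # shape (J, N), J = K^2 * M
    gram = W.t() @ W                                        # N x N Gram matrix W^T W
    identity = torch.eye(N, device=W.device, dtype=W.dtype)
    return lam * torch.norm(gram - identity, p="fro") ** 2

# Usage (hypothetical layer): add the penalty to the task loss during training.
layer = torch.nn.Conv2d(64, 128, kernel_size=3)
penalty = soft_orthogonality_loss(layer.weight)
```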