Pushing the Efficiency Limit Using Structured Sparse Convolutions
Vinay Kumar Verma1*, Nikhil Mehta1*, Shijing Si3, Ricardo Henao1, Lawrence Carin1,2
1Duke University  2KAUST, Saudi Arabia  3SEF, Shanghai International Studies University
*The authors contributed equally to this work. Correspondence to {vinayugc, nikhilmehta.dce}@gmail.com.
Abstract
Weight pruning is among the most popular approaches for compressing deep convolutional neural networks. Recent work suggests that in a randomly initialized deep neural network, there exist sparse subnetworks that achieve performance comparable to the original network. Unfortunately, finding these subnetworks involves iterative stages of training and pruning, which can be computationally expensive. We propose Structured Sparse Convolution (SSC), which leverages the inherent structure in images to reduce the parameters in the convolutional filter. This leads to improved efficiency of convolutional architectures compared to existing methods that perform pruning at initialization. We show that SSC is a generalization of commonly used layers (depthwise, groupwise, and pointwise convolution) in "efficient architectures." Extensive experiments on well-known CNN models and datasets show the effectiveness of the proposed method. Architectures based on SSC achieve state-of-the-art performance compared to baselines on CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet classification benchmarks. Our source code is publicly available at https://github.com/vkvermaa/SSC.
1. Introduction
Overparameterized deep neural networks (DNNs) are known to generalize well on the test data [4, 1]. However, overparameterization increases the network size, making DNNs resource-hungry and leading to extended training and inference time. This hinders the training and deployment of DNNs on low-power devices and limits the application of DNNs in systems with strict latency requirements. Several efforts have been made to reduce the storage and computational complexity of DNNs using model compression [19, 62, 51, 45, 33, 3, 34, 7, 44]. Network pruning is the most popular approach for model compression. In network pruning, we compress a large neural network by pruning redundant parameters while maintaining the model performance. The pruning approaches can be divided into two categories: unstructured and structured. Unstructured pruning removes redundant connections in the kernel, leading to sparse tensors [27, 18, 64]. Unstructured sparsity produces sporadic connectivity in the neural architecture, causing irregular memory access [55] that adversely impacts acceleration on hardware platforms. On the other hand, structured pruning involves pruning parameters that follow a high-level structure (e.g., pruning parameters at the filter level [29, 34, 12]). Typically, structured pruning leads to practical acceleration, as the parameters are reduced while memory access remains contiguous. Existing pruning methods typically involve a three-stage pipeline: pretraining, pruning, and finetuning, where the latter two stages are repeated until a desired pruning ratio is achieved. While the final pruned model leads to a low inference cost, the cost to obtain the pruned architecture remains high.
The lottery ticket hypothesis (LTH) [15, 16] showed that a randomly initialized overparameterized neural network contains a sub-network, referred to as the "winning ticket," that when trained in isolation achieves the same test accuracy as the original network. Similar to LTH, there is compelling evidence [38, 39, 14, 13, 2, 1, 45] suggesting that overparameterization is not essential for high test accuracy, but is helpful for finding a good initialization for the network [30, 65]. However, the procedure to find such sub-networks involves iterative pruning [15], making it computationally intensive. If we knew the sub-network beforehand, we could train a much smaller and more efficient model with only 1-10% of the parameters of the original network, reducing the computational cost involved during training.
An open research question concerns how to design a sub-network without undergoing the expensive multi-stage process of training, pruning, and finetuning. There have been recent attempts [27, 52] to alleviate this issue, involving a one-time neural network pruning at initialization by solving an optimization problem for detecting and removing unimportant connections. Once the sub-network is identified, the model is trained without carrying out further pruning. This procedure of pruning only once is referred to as pruning at initialization or foresight pruning [52]. While these methods can find an approximation to the winning ticket, they have the following limitations hindering their practical applicability: (1) The initial optimization procedure still requires large memory, since the optimization process is carried out over the original overparameterized model. (2) The obtained winning ticket is specific to the particular dataset on which it is approximated, i.e., a network pruned using a particular dataset may not perform optimally on a different dataset. (3) These pruning-based methods lead to unstructured sparsity in the model. Due to common hardware limitations, it is very difficult to get a practical speedup from unstructured compression.
In this paper, we design a novel structured sparse convolution (SSC) filter for convolutional layers, requiring significantly fewer parameters compared to standard convolution. The proposed filter leverages the inherent spatial properties of images. Commonly used deep convolutional architectures, when coupled with SSC, outperform other state-of-the-art methods that prune at initialization. Unlike typical pruning approaches, the proposed architecture is sparse by design and does not require multiple stages of pruning. The sparsity of the architecture is dataset agnostic and leads to better transferability of the model when compared to existing state-of-the-art methods that prune at initialization. We also show that the proposed filter has implicit orthogonality that ensures minimum filter redundancy at each layer. Additionally, we show that the proposed filter can be viewed as a generalization of existing efficient convolutional filters used in group-wise convolution (GWC) [59], point-wise convolution (PWC) [48], and depth-wise convolution (DWC) [50]. Extensive experiments and ablation studies on standard benchmarks demonstrate the efficacy of the proposed filter. Moreover, we further compress existing efficient models such as MobileNetV2 [40] and ShuffleNetV2 [35] while achieving performance comparable to the original models.
2. Methods
We propose the Structured Sparse Convolution (SSC) filter, which is composed of layered, spatially sparse $K \times K$ and $1 \times 1$ kernels. Unlike typical CNN filters, which have kernels of a fixed size, the SSC filter has three types of kernels, as shown in Figure 1. The heterogeneous kernels are designed to have varying receptive fields that can capture different features in the input. As shown in Section 2.1, heterogeneity in the kernels allows the neural network layer to accumulate information from different spatial locations in the feature map while significantly reducing redundancy in the number of parameters.

Figure 1. The three basic components used in the proposed SSC filter. Blue blocks indicate zero-weight locations; the red, orange, and green blocks show the active weight locations in the three different types of kernels.

Figure 2. The proposed convolutional layer with $N$ SSC filters. Blue blocks denote the zero-weight locations in the $3 \times 3$ and $1 \times 1$ kernels, while the other colors show active weights.

Consider layer $l$ of a model with an input $h_{l-1}$ of size $i_{l-1} \times i_{l-1} \times M$, where $i_{l-1}$ corresponds to the spatial dimension (width and height) and $M$ denotes the number of input channels. Assume that layer $l$ has $N$ filters, resulting in an output feature map $h_l$ of size $i_l \times i_l \times N$. We represent the computational and memory cost at the $l$-th layer by the number of floating-point operations ($F_l$) and the number of parameters ($P_l$), respectively. The computational and memory cost associated with a standard convolutional layer with a $K \times K$ kernel is the following:

$$F_l = i_l^2 \times N \times (K^2 M), \quad (1)$$

$$P_l = N \times (K^2 M), \quad (2)$$

where $K^2 M$ represents the total number of parameters from all $M$ channel-specific kernels. As is evident from (1) and (2), reducing $K^2 M$ directly reduces both the number of parameters and the computational cost of the model. This is indeed what our proposed method (SSC) achieves: we design two types of sparse kernels, which form the basic components of SSC.
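As a concrete illustration of (1) and (2), a minimal sketch (ours, with purely illustrative layer sizes) is:

```python
def standard_conv_cost(i_l: int, N: int, K: int, M: int):
    """FLOPs and parameter count of a standard K x K convolutional layer, Eqs. (1)-(2)."""
    flops = i_l ** 2 * N * (K ** 2 * M)   # Eq. (1)
    params = N * (K ** 2 * M)             # Eq. (2)
    return flops, params

# Example: 56 x 56 output, N = 128 filters, 3 x 3 kernels, M = 64 input channels.
print(standard_conv_cost(56, 128, 3, 64))  # (231211008, 73728)
```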
Odd/Even $K \times K$ kernel: The two types of $K \times K$ kernels differ in the location of the enforced sparsity. Considering $S \in \mathbb{R}^{K^2}$ to be the flattened version of the $K \times K$ 2D kernel, we define the odd kernel as:

$$\begin{cases} S[i] = 0, & i \in \{2p \mid 0 \le 2p < K^2,\; p \in \mathbb{N}\} \\ S[i] = w_i, & i \in \{2p+1 \mid 0 < 2p+1 < K^2,\; p \in \mathbb{N}\}, \end{cases} \quad (3)$$

The even kernel is defined in a similar fashion, with the kernel being zero at odd coordinates and non-zero at even coordinates of the filter. Figure 1 (a-b) illustrates the odd and even kernels, respectively, for $K = 3$. These kernels replace the standard $K \times K$ kernels used in a convolutional layer.
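For concreteness, the following sketch (our own illustration, not taken from the paper's repository) builds the binary masks implied by Eq. (3):

```python
import numpy as np

def sparse_kernel_mask(K: int, kind: str = "odd") -> np.ndarray:
    """Binary K x K mask of an odd/even kernel: the flattened kernel is zeroed
    at even flat indices (odd kernel) or at odd flat indices (even kernel)."""
    idx = np.arange(K * K)
    if kind == "odd":          # weights live at odd flat indices
        mask = (idx % 2 == 1)
    else:                      # "even": weights live at even flat indices
        mask = (idx % 2 == 0)
    return mask.reshape(K, K).astype(np.float32)

print(sparse_kernel_mask(3, "odd"))
# [[0. 1. 0.]
#  [1. 0. 1.]
#  [0. 1. 0.]]
```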
SSC Filter: A convolutional layer having $N$ SSC filters is shown in Figure 2. An SSC filter is referred to as an odd (or even) filter if it contains only odd (or even) kernels. For each convolutional layer, an equal number of odd and even filters is used. We make the following modifications to a standard convolution filter having $M$ kernels of size $K \times K$:
1. Among the $M$ different kernels, we replace each kernel at the $kg$-th location with an odd/even kernel, where $g$ is a hyperparameter such that $0 < g < M$ and $k \in \{n \in \mathbb{N} \mid 0 < kg \le M\}$. Note that each filter has only one type (odd/even) of kernel. Each of the $N$ filters has $M/g$ such kernels. The computational cost ($F_{sg}$) and the memory cost ($P_{sg}$) for all the odd/even kernels in a filter are:

$$F_{sg} = i_l^2 \times N \times (K^2 - c)\frac{M}{g}, \quad (4)$$

$$P_{sg} = N \times (K^2 - c)\frac{M}{g}, \quad (5)$$

$$c = \begin{cases} \left\lceil \frac{K^2}{2} \right\rceil & \text{odd kernel} \\ K^2 - \left\lceil \frac{K^2}{2} \right\rceil & \text{even kernel}, \end{cases} \quad (6)$$

where $c$ represents the number of zeros in the kernel and $\lceil \cdot \rceil$ denotes the ceiling function.
2. Out of the remaining $M(1 - 1/g)$ kernel locations in the filter, we place a $1 \times 1$ kernel at a fixed interval of $p$, as shown in Figure 1 (c). Each of the $N$ filters has $M(1 - 1/g)/p$ $1 \times 1$ kernels. The computational and memory cost of these $1 \times 1$ kernels can be defined as:

$$F_{sp} = i_l^2 \times N \times \frac{M(1 - 1/g)}{p}, \quad (7)$$

$$P_{sp} = N \times \frac{M(1 - 1/g)}{p}. \quad (8)$$
3. The SSC filter is empty at the remaining $M(1 - 1/p)(1 - 1/g)$ locations, causing the filter to ignore the corresponding feature maps (input channels). Note that while a particular filter may not act on certain input features, other SSC filters of the convolutional layer will. This is enforced by the shifting procedure introduced below (a code sketch of the resulting per-filter mask follows this list).
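Putting items 1-3 together, the per-filter sparsity pattern can be sketched as below. This is our own illustration rather than the authors' released code; it reuses `sparse_kernel_mask` from the previous sketch, and the exact channel-indexing convention (which channels receive the $K \times K$ versus the $1 \times 1$ kernels) is an assumption on our part.

```python
import numpy as np

def ssc_filter_mask(M: int, K: int, g: int, p: int, kind: str = "odd") -> np.ndarray:
    """Binary mask of one SSC filter, shape (M, K, K). Assumed indexing:
    every g-th input channel holds an odd/even K x K kernel, every p-th of the
    remaining channels holds a 1 x 1 (center-tap) kernel, and the rest are empty."""
    mask = np.zeros((M, K, K), dtype=np.float32)
    kxk_mask = sparse_kernel_mask(K, kind)     # odd/even K x K pattern from Eq. (3)
    kxk_channels = set(range(0, M, g))         # roughly M/g channels
    remaining = [m for m in range(M) if m not in kxk_channels]
    for m in kxk_channels:
        mask[m] = kxk_mask
    for m in remaining[::p]:                   # roughly M(1 - 1/g)/p channels
        mask[m, K // 2, K // 2] = 1.0          # 1 x 1 kernel at the center tap
    return mask
```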
Shift operation: If we naively use SSC filters in a convolutional layer, there will be a loss of information, as all $N$ filters will ignore the same input feature maps. To ensure that each SSC filter attends to a different set of feature maps, we shift the location of all kernels ($K \times K$ and $1 \times 1$) by $(n \bmod q)$ at initialization¹, where $n \in \{1, \ldots, N\}$ denotes the index of the filter and $q := \max(g, p)$. The shift operation across the $N$ filters can be visualized in Figure 2. We can divide the $N$ filters into sets of disjoint filters such that all the filters in a particular set attend to distinct input feature maps. Formally, let the collection of sets be defined as:

$$Q := \{[0, q), [q, 2q), \ldots, [N - (N \bmod q), N)\}, \quad (9)$$

where $[a, a+q)$ denotes the set of filters $a$ through $a + q - 1$. Then, for all $f, f' \in [a, a+q)$, $f$ and $f'$ attend to disjoint input feature maps if $f \ne f'$. Moreover, $f$ and $f'$ are "near-orthogonal" ($f^T f' \approx 0$), since they attend to non-overlapping regions of the input feature maps. As discussed in Section 2.1, the orthogonality property of a layer is of independent interest and allows the network to learn uncorrelated filters. Note that the design of the SSC filter induces structured sparsity, as the sparse region is predetermined and fixed, in contrast to unstructured pruning methods [15, 27, 52].

¹The shift operation is applied only once, before training begins.
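A possible realization of the shift, continuing the sketches above (again our own code; how odd and even filters are interleaved and whether the shift is applied as a cyclic roll along the channel axis are assumptions on our part):

```python
def ssc_layer_masks(N: int, M: int, K: int, g: int, p: int) -> np.ndarray:
    """Masks for all N SSC filters of a layer, shape (N, M, K, K)."""
    q = max(g, p)
    masks = np.zeros((N, M, K, K), dtype=np.float32)
    for n in range(N):
        kind = "odd" if n % 2 == 0 else "even"         # equal numbers of odd/even filters (assumed interleaving)
        base = ssc_filter_mask(M, K, g, p, kind)
        masks[n] = np.roll(base, shift=n % q, axis=0)  # shift kernel locations by (n mod q)
    return masks
```

In practice, one simple way to use such masks is to keep a dense weight tensor and multiply it elementwise by the fixed mask before every convolution, so the predetermined sparsity pattern is preserved throughout training.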
We can quantify the total reduction in the number of floating-point operations ($R_F$) and the number of parameters ($R_P$) with respect to the standard convolutional layer:

$$R_F = \left(1 - \frac{F_{sg} + F_{sp}}{F_l}\right) \times 100\% \quad (10)$$

$$= \left(1 - \frac{1 - c/K^2}{g} - \frac{1 - 1/g}{K^2 p}\right) \times 100\%, \quad (11)$$

$$R_P = \left(1 - \frac{P_{sg} + P_{sp}}{P_l}\right) \times 100\% \quad (12)$$

$$= \left(1 - \frac{1 - c/K^2}{g} - \frac{1 - 1/g}{K^2 p}\right) \times 100\%, \quad (13)$$

where $0 < g, p < M$. The hyperparameters $p$ and $g$ are set to achieve the desired sparsity in the architecture; we use $R_P$ as the guiding principle for choosing $p$ and $g$. One can also target a desired reduction in floating-point operations ($R_F$) to determine the corresponding hyperparameters. However, in our experiments we consider sparsity constraints only.
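As a quick sanity check on Eq. (13), the following helper (our own; the settings in the example are hypothetical and evaluated for an odd filter) computes the parameter reduction for given $K$, $g$, and $p$:

```python
import math

def parameter_reduction(K: int, g: int, p: int, kind: str = "odd") -> float:
    """Parameter reduction R_P (in %) of an SSC layer relative to a standard K x K layer, Eq. (13)."""
    c = math.ceil(K**2 / 2) if kind == "odd" else K**2 - math.ceil(K**2 / 2)
    kept = (1 - c / K**2) / g + (1 - 1 / g) / (K**2 * p)
    return (1 - kept) * 100.0

# Example: 3 x 3 kernels with g = 4 and p = 2 keep roughly 15% of the parameters (R_P ~ 84.7%).
print(round(parameter_reduction(K=3, g=4, p=2), 1))  # 84.7
```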
2.1. Implicit Orthogonality

Recent work [42, 58] shows that deep convolutional networks learn correlated filters in overparameterized regimes. This implies filter redundancy and correlated feature maps when working with deep architectures. The issue of correlation across multiple filters in a convolutional layer has been addressed by incorporating an explicit orthogonality constraint on the filters of each layer [5, 53]. Consider a 2D matrix $W \in \mathbb{R}^{J \times N}$ containing all the filters: $W = [f_1, f_2, \ldots, f_N]$, where $f_n \in \mathbb{R}^J$ is the vector containing all the parameters of the $n$-th filter, and $J = K^2 M$ for a standard convolutional layer. The soft-orthogonality (SO) constraint on a layer $l$ with the corresponding 2D matrix $W_l$ is defined as:

$$\mathcal{L}_{SO} = \lambda \, \| W_l^T W_l - I \|_F^2, \quad (14)$$

where $I \in \mathbb{R}^{N \times N}$ is the identity matrix, $\lambda$ controls the degree of orthogonality, and $\| \cdot \|_F$ is the Frobenius norm.
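A direct PyTorch rendering of the penalty in Eq. (14) might look as follows (a sketch of ours, not the authors' implementation; the regularization weight `lam` and the example layer sizes are placeholders):

```python
import torch

def soft_orthogonality_loss(conv_weight: torch.Tensor, lam: float = 1e-4) -> torch.Tensor:
    """Soft-orthogonality penalty lam * ||W^T W - I||_F^2 for one conv layer."""
    # conv_weight has shape (N, M, K, K); flatten each filter into a column of W.
    N = conv_weight.shape[0]
    W = conv_weight.reshape(N, -1).t()                      # shape (J, N), J = K^2 * M
    gram = W.t() @ W                                        # N x N Gram matrix W^T W
    identity = torch.eye(N, device=W.device, dtype=W.dtype)
    return lam * torch.norm(gram - identity, p="fro") ** 2

# Usage (hypothetical layer): add the penalty to the task loss during training.
layer = torch.nn.Conv2d(64, 128, kernel_size=3)
penalty = soft_orthogonality_loss(layer.weight)
```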