et al. 2016) or pruning techniques (Luo et al. 2017; Han et al. 2015b). Many pruning methods (Luo
et al. 2017; Zhang et al. 2018b) aim for a high compression ratio and accuracy regardless of the
structure of the sparsity. As a result, they often suffer from imbalanced workloads caused by irregular
memory access. Hence, several works aim to zero out structured groups of DNN components
through more hardware-friendly approaches (Wen et al. 2016).
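For illustration, the following is a minimal magnitude-based channel-pruning sketch in the structured spirit of (Wen et al. 2016); the function name, the keep_ratio parameter, and the L2-norm criterion are our own generic choices rather than details of the cited works.

import numpy as np

def prune_channels(W, keep_ratio=0.5):
    # Structured pruning sketch: zero entire output channels of a conv
    # weight of shape (out_ch, in_ch, kH, kW), ranked by L2 norm.
    norms = np.linalg.norm(W.reshape(W.shape[0], -1), axis=1)
    k = max(1, int(W.shape[0] * keep_ratio))
    threshold = np.sort(norms)[-k]            # k-th largest channel norm
    mask = norms >= threshold
    return W * mask[:, None, None, None]

W = np.random.randn(16, 8, 3, 3)
print(np.count_nonzero(prune_channels(W).reshape(16, -1).any(axis=1)))  # ~8 channels kept

Because whole channels are removed, the surviving computation stays dense and regular, which is precisely what makes such schemes hardware-friendly compared to unstructured sparsity.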
Quantization.
The computation and memory complexity of DNNs can be reduced by quantizing
model parameters to lower bit-widths, and the majority of research works use fixed-bit quanti-
zation. For instance, the methods proposed in (Gysel et al. 2018; Louizos et al. 2018) use fixed 4- or
8-bit quantization. Model parameters have been quantized even further into ternary (Li et al. 2016;
Zhu et al. 2016) and binary (Courbariaux et al. 2015; Rastegari et al. 2016; Courbariaux et al. 2016)
representations. These methods often suffer from low performance even when activations are left
unquantized (Li et al. 2016). Mixed-precision approaches, however, achieve more competitive
performance, as shown in (Uhlich et al. 2019), where the bit-width of each layer is determined
adaptively. Moreover, the choice between uniform (Jacob et al. 2018) and nonuniform (Han et al.
2015a; Tang et al. 2017; Zhang et al. 2018a) quantization intervals has an important effect on both
the compression rate and the achievable acceleration.
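To make the fixed-bit setting concrete, here is a minimal sketch of symmetric uniform quantization of a weight tensor; the per-tensor scale and the function name are our own simplifying assumptions, not details of any cited method.

import numpy as np

def uniform_quantize(w, num_bits=8):
    # Symmetric uniform quantization: w is approximated by scale * q,
    # with q an integer tensor and scale a single float.
    qmax = 2 ** (num_bits - 1) - 1        # e.g. 127 for 8 bits, 7 for 4 bits
    scale = np.abs(w).max() / qmax        # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int32)
    return q, scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = uniform_quantize(w, num_bits=4)
print(np.max(np.abs(w - scale * q)))      # worst-case quantization error

Dequantization is simply scale * q, so only the integer tensor and one scalar need to be stored; nonuniform schemes replace the fixed grid with learned or clustered quantization levels.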
Tensor Decomposition.
Tensor decomposition approaches factorize weight tensors
into smaller tensors to reduce model size (Yin et al. 2021). Singular value decomposition (SVD),
applied to matrices as the 2-dimensional instance of tensor decomposition, was one of the
pioneering approaches to model compression (Jaderberg et al. 2014). Other classical high-
dimensional tensor decomposition methods, such as Tucker (Tucker 1963) and CP decomposition
(Harshman et al. 1970), have also been adopted for model compression. However, using these
methods often leads to significant accuracy drops (Kim et al. 2015; Lebedev et al. 2015; Phan et al.
2020). The idea of reshaping the weights of fully-connected layers into high-dimensional tensors and
representing them in Tensor-Train (TT) format (Oseledets 2011) was extended to CNNs in (Garipov et al. 2016). For
multidimensional tensors, Tensor-Ring (TR) decomposition (Wang et al. 2018a) has become a more popular option
than TT (Wang et al. 2017). Subsequent filter-basis decomposition works refined these approaches
by sharing a filter basis, and have been applied to low-level computer vision tasks such as single-image
super-resolution (Li et al. 2019). Kronecker factorization is another approach for replacing
the weight tensors of fully-connected and convolutional layers (Zhou et al. 2015). Its restriction to
rank-1 Kronecker product representations is alleviated in (Hameed et al. 2022), where the
compression rate is determined by both the rank and the factor dimensions.
For a fixed rank, maximum compression is achieved by selecting dimensions for each factor that
are closest to the square root of the original tensor's dimensions. This still leads to representations with
more parameters than those achieved using sequences of Kronecker products, as shown in Fig. 1b.
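To make the square-root claim concrete, consider a worked example (in our own generic notation, not that of (Hameed et al. 2022)): approximating a weight matrix $W \in \mathbb{R}^{m \times n}$ by a rank-$R$ sum of Kronecker products

$$W \approx \sum_{r=1}^{R} A_r \otimes B_r, \qquad A_r \in \mathbb{R}^{m_1 \times n_1}, \quad B_r \in \mathbb{R}^{m_2 \times n_2}, \quad m_1 m_2 = m, \quad n_1 n_2 = n,$$

costs $R\,(m_1 n_1 + m_2 n_2) \geq 2R\sqrt{mn}$ parameters by the AM-GM inequality, since $(m_1 n_1)(m_2 n_2) = mn$, with equality when $m_1 n_1 = m_2 n_2 = \sqrt{mn}$. For instance, with $m = n = 64$ and $R = 1$, two $8 \times 8$ factors cost $64 + 64 = 128$ parameters, whereas $2 \times 2$ and $32 \times 32$ factors cost $4 + 1024 = 1028$, both representing the same $4096$-entry matrix.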
There has been extensive research on tensor decomposition more broadly, e.g., characterizing the global correlation
of tensors (Zheng et al. 2021), extending CP to non-Gaussian data (Hong et al. 2020), and employing
augmented decomposition loss functions (Afshar et al. 2021) for various applications. Our
main focus in this paper is on decompositions used for NN compression.
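As background for the SVD-based approach mentioned above, the following is a minimal low-rank compression sketch for a fully-connected weight matrix; it is a generic illustration under our own naming, not the method of (Jaderberg et al. 2014).

import numpy as np

def svd_compress(W, rank):
    # Factor an (m, n) weight matrix into (m, rank) and (rank, n) pieces,
    # keeping only the top singular values.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # absorb singular values into the left factor
    B = Vt[:rank, :]
    return A, B                  # forward pass: x @ A @ B instead of x @ W

W = np.random.randn(512, 512)
A, B = svd_compress(W, rank=32)
print(A.size + B.size, W.size)   # 32768 vs. 262144 stored parameters

The two factors replace the original layer with two thinner layers, so both storage and multiply-accumulate cost drop from O(mn) to O(rank * (m + n)).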
Other Methods.
NNs can also be compressed using Knowledge Distillation (KD), where a large pre-
trained network, known as the teacher, is used to train a smaller student network (Mirzadeh et al. 2020; Heo
et al. 2019). Sharing weights in a more structured manner is another model compression approach,
as in FSNet (Yang et al. 2020), which shares filter weights across spatial locations, or ShaResNet (Boulch
2018), which reuses convolutional mappings within the same scale level. Designing lightweight CNNs
(Sandler et al. 2018; Iandola et al. 2016; Chollet 2017; Howard et al. 2019; Zhang et al. 2018c; Tan &
Le 2019) is another direction, orthogonal to the aforementioned approaches.
3 METHOD
In this section, we introduce SeKron and how it can be used to compress tensors in deep learning
models. We start by providing background on the Kronecker Product Decomposition in Section 3.1.
Then, we introduce our decomposition method in Section 3.2. In Section 3.3, we provide an algorithm for
computing the convolution operation using each of the factors directly (avoiding reconstruction) at
runtime. Finally, we discuss the computational complexity of the proposed method in Section 3.4.