SEKRON: A DECOMPOSITION METHOD SUPPORTING
MANY FACTORIZATION STRUCTURES
Marawan Gamal Abdel Hameed, Ali Mosleh, Marzieh S. Tahaei, Vahid Partovi Nia
Noah’s Ark Lab, Huawei Technologies Canada
marawan.abdel.hameed@huawei.com ali.mosleh@huawei.com
marzieh.tahaei@huawei.com vahid.partovinia@huawei.com
ABSTRACT
While convolutional neural networks (CNNs) have become the de facto standard
for most image processing and computer vision applications, their deployment on
edge devices remains challenging. Tensor decomposition methods provide a means
of compressing CNNs to meet the wide range of device constraints by imposing certain
factorization structures on their convolution tensors. However, being limited to
the small set of factorization structures presented by state-of-the-art decomposition
approaches can lead to sub-optimal performance. We propose SeKron, a novel
tensor decomposition method that offers a wide variety of factorization structures,
using sequences of Kronecker products. By recursively finding approximating
Kronecker factors, we arrive at optimal decompositions for each of the factorization
structures. We show that SeKron is a flexible decomposition that generalizes widely
used methods, such as Tensor-Train (TT), Tensor-Ring (TR), Canonical Polyadic
(CP) and Tucker decompositions. Crucially, we derive an efficient convolution
projection algorithm shared by all SeKron structures, leading to seamless compression
of CNN models. We validate SeKron for model compression on both high-level
and low-level computer vision tasks and find that it outperforms state-of-the-art
decomposition methods.
1 INTRODUCTION
Deep learning models have introduced new state-of-the-art solutions to both high-level computer
vision problems (He et al. 2016; Ren et al. 2015), and low-level image processing tasks (Wang et al.
2018b; Schuler et al. 2015; Kokkinos & Lefkimmiatis 2018) through convolutional neural networks
(CNNs). Such models are obtained at the expense of the millions of training parameters that come
along with deep CNNs, making them computationally intensive. As a result, many of these models are of
limited use as they are challenging to deploy on resource-constrained edge devices. Compared with
neural networks for high-level computer vision tasks (e.g., ResNet-50 (He et al. 2016)), models for
low-level imaging problems such as single image super-resolution have a much higher computational
complexity due to the larger feature map sizes. Moreover, they are typically infeasible to run on cloud
computing servers. Thus, their deployment on edge devices is even more critical.
In recent years, an increasing trend has emerged in reducing the size of state-of-the-art CNN backbones
through efficient architecture designs such as Xception (Chollet 2017), MobileNet (Howard et al.
2019), ShuffleNet (Zhang et al. 2018c), and EfficientNet (Tan & Le 2019), to name a few. On
the other hand, there have been studies demonstrating significant redundancy in the parameters of
large CNN models, implying that in theory the number of model parameters can be reduced while
maintaining performance (Denil et al. 2013). These studies provide the basis for the development of
many model compression techniques such as pruning (He et al. 2020), quantization (Hubara et al.
2017), knowledge distillation (Hinton et al. 2015), and tensor decomposition (Phan et al. 2020).
Tensor decomposition methods such as Tucker (Kim et al. 2016), Canonical Polyadic (CP) (Lebedev
et al. 2015), Tensor-Train (TT) (Novikov et al. 2015) and Tensor-Ring (TR) (Wang et al. 2018a)
rely on finding low-rank approximations of tensors under some imposed factorization structure as
illustrated in Figure 1a. In practice, some structures are more suitable than others when decomposing
tensors.

Figure 1: (a) Tensor network diagrams of various decomposition methods for a 4D convolution tensor $\mathcal{W} \in \mathbb{R}^{F \times C \times K_h \times K_w}$. Unlike all other decomposition methods, where $f, c, h, w$ index over fixed dimensions (i.e., dimensions of $\mathcal{W}$), SeKron is flexible in its factor dimensions, with $f_k, c_k, h_k, w_k$, $k \in \{1, \dots, S\}$, indexing over variable dimension choices, as well as in its sequence length $S$. Thus, it allows for a wide range of factorization structures to be achieved. (b) Example of a $16 \times 16$ tensor $\mathcal{W}$ that can be represented more efficiently using a sequence of four Kronecker factors (requiring 16 parameters) than using a sequence of length two (requiring 32 parameters).

Choosing from a limited set of factorization structures can lead to sub-optimal compressions
as well as lengthy runtimes depending on the hardware. This limitation can be alleviated by reshaping
tensors prior to their compression to improve performance as shown in (Garipov et al. 2016). However,
this approach requires time-consuming development of customized convolution algorithms.
We propose SeKron, a novel tensor decomposition method offering a wide range of factorization
structures that share the same efficient convolution algorithm. Our method is inspired by approaches
based on the Kronecker Product Decomposition (Thakker et al. 2019; Hameed et al. 2022). Unlike
other decomposition methods, Kronecker Product Decomposition generalizes the product of smaller
factors from vectors and matrices to a range of tensor shapes, thereby exploiting local redundancy
between arbitrary slices of multi-dimensional weight tensors. SeKron represents tensors using
sequences of Kronecker products to compress convolution tensors in CNNs. We show that using
sequences of Kronecker products unlocks a wide range of factorization structures and generalizes
other decomposition methods such as Tensor-Train (TT), Tensor-Ring (TR), Canonical Polyadic (CP)
and Tucker, under the same framework. Sequences of Kronecker products also have the potential to
exploit local redundancies using far fewer parameters as illustrated in the example in Figure 1b. By
performing the convolution operation using each of the Kronecker factors independently, the number
of parameters, computational intensity, and runtime are reduced simultaneously. Leveraging the
flexibility of SeKron, we find efficient factorization structures that outperform existing decomposition
methods on various image classification and low-level image processing (super-resolution) tasks. In
summary, our contributions are:
• Introducing SeKron, a novel tensor decomposition method based on sequences of Kronecker products that allows for a wide range of factorization structures and generalizes other decomposition methods such as TT, TR, CP and Tucker.
• Providing a solution to the problem of finding the summation of sequences of Kronecker products between factor tensors that best approximates the original tensor.
• Deriving a single convolution algorithm shared by all factorization structures achievable by SeKron, utilized as compressed convolutional layers in CNNs.
• Improving the state-of-the-art of low-rank model compression on image classification (high-level vision) benchmarks such as ImageNet and CIFAR-10, as well as super-resolution (low-level vision) benchmarks such as Set5, Set14, B100 and Urban100.
2 RELATED WORK ON DNN MODEL COMPRESSION
Sparsification.
Different components of DNNs, such as weights (Han et al. 2015b;a), convolutional
filters (He et al. 2018; Luo et al. 2017) and feature maps (He et al. 2017; Zhuang et al. 2018) can
be sparse. The sparsity can be enforced using sparsity-aware regularization (Liu et al. 2015; Zhou
et al. 2016) or pruning techniques (Luo et al. 2017; Han et al. 2015b). Many pruning methods (Luo
et al. 2017; Zhang et al. 2018b) aim for a high compression ratio and accuracy regardless of the
structure of the sparsity. Thus, they often suffer from an imbalanced workload caused by irregular
memory access. Hence, several works aim at zeroing out structured groups of DNN components
through more hardware-friendly approaches (Wen et al. 2016).
Quantization.
The computation and memory complexity of DNNs can be reduced by quantizing
model parameters into lower bit-widths; the majority of research works use fixed-bit quantization.
For instance, the methods proposed in (Gysel et al. 2018; Louizos et al. 2018) use fixed 4- or
8-bit quantization. Model parameters have been quantized even further into ternary (Li et al. 2016;
Zhu et al. 2016) and binary (Courbariaux et al. 2015; Rastegari et al. 2016; Courbariaux et al. 2016)
representations. These methods often achieve low performance even with unquantized activations (Li
et al. 2016). Mixed-precision approaches, however, achieve more competitive performance as shown
in (Uhlich et al. 2019) where the bit-width for each layer is determined in an adaptive manner. Also,
choosing a uniform (Jacob et al. 2018) or nonuniform (Han et al. 2015a; Tang et al. 2017; Zhang et al.
2018a) quantization interval has important effects on the compression rate and the acceleration.
Tensor Decomposition.
Tensor decomposition approaches are based on factorizing weight tensors
into smaller tensors to reduce model sizes (Yin et al. 2021). Singular value decomposition (SVD),
applied to matrices as a two-dimensional instance of tensor decomposition, was one of the
pioneering approaches to model compression (Jaderberg et al. 2014). Other classical high-
dimensional tensor decomposition methods, such as Tucker (Tucker 1963) and CP decomposition
(Harshman et al. 1970), are also adopted to perform model compression. However, using these
methods often leads to significant accuracy drops (Kim et al. 2015; Lebedev et al. 2015; Phan et al.
2020). The idea of reshaping weights of fully-connected layers into high-dimensional tensors and
representing them in TT format (Oseledets 2011) was extended to CNNs in (Garipov et al. 2016). For
multidimensional tensors, TR decomposition (Wang et al. 2018a) has become a more popular option
than TT (Wang et al. 2017). Subsequent works refined these approaches using a shared filter basis,
and have been applied to low-level computer vision tasks such as single
image super-resolution (Li et al. 2019). Kronecker factorization is another approach to replacing
the weight tensors within fully-connected and convolution layers (Zhou et al. 2015). The restriction
of this approach to rank-1 Kronecker product representations is alleviated in (Hameed et al. 2022).
The compression rate in (Hameed et al. 2022) is determined by both the rank and factor dimensions.
For a fixed rank, the maximum compression is achieved by selecting dimensions for each factor that
are closest to the square root of the original tensors’ dimensions. This leads to representations with
more parameters than those achieved using sequences of Kronecker products as shown in Fig. 1b.
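As a back-of-the-envelope illustration of this point, the following snippet (our numbers, taken from the Figure 1b example rather than from the cited works) compares the parameter counts of a dense weight, a single Kronecker pair with square-root-sized factors, and a sequence of Kronecker factors:

```python
# Parameters needed to represent a 16x16 weight, with rank 1 in each case.
dense            = 16 * 16        # 256: the original tensor
single_kron_pair = 4*4 + 4*4      # 32: two square-root-sized 4x4 factors
kron_sequence    = 4 * (2 * 2)    # 16: a sequence of four 2x2 factors (SeKron, Figure 1b)
print(dense, single_kron_pair, kron_sequence)  # 256 32 16
```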
There has been extensive research on tensor decomposition for different applications, e.g., characterizing the global correlation
of tensors (Zheng et al. 2021), extending CP to non-Gaussian data (Hong et al. 2020), and employing
augmented decomposition loss functions (Afshar et al. 2021). Our
main focus in this paper is on methods used for NN compression.
Other Methods.
NNs can also be compressed using Knowledge Distillation (KD), where a large pre-
trained network, known as the teacher, is used to train a smaller student network (Mirzadeh et al. 2020; Heo
et al. 2019). Sharing weights in a more structured manner is another model compression approach,
as in FSNet (Yang et al. 2020), which shares filter weights across spatial locations, or ShaResNet (Boulch
2018), which reuses convolutional mappings within the same scale level. Designing lightweight CNNs
(Sandler et al. 2018; Iandola et al. 2016; Chollet 2017; Howard et al. 2019; Zhang et al. 2018c; Tan &
Le 2019) is another direction orthogonal to the aforementioned approaches.
3 METHOD
In this section, we introduce SeKron and how it can be used to compress tensors in deep learning
models. We start by providing background on the Kronecker Product Decomposition in Section 3.1.
Then, we introduce our decomposition method in Section 3.2. In Section 3.3, we provide an algorithm for
computing the convolution operation using each of the factors directly (avoiding reconstruction) at
runtime. Finally, we discuss the computational complexity of the proposed method in Section 3.4.
3.1 PRELIMINARIES
Convolutional layers prevalent in CNNs transform an input tensor $\mathcal{X} \in \mathbb{R}^{C \times H \times W}$ using a weight tensor $\mathcal{W} \in \mathbb{R}^{F \times C \times K_h \times K_w}$ via a multi-linear map given by
$$\mathcal{Y}_{f,x,y} = \sum_{i=1}^{K_h} \sum_{j=1}^{K_w} \sum_{c=1}^{C} \mathcal{W}_{f,c,i,j}\, \mathcal{X}_{c,i+x,j+y}, \qquad (1)$$
where $C$ and $F$ denote the number of input and output channels, respectively, and $K_h \times K_w$ denotes the spatial size of the weight (filter). Tensor decomposition seeks a compact approximation to replace $\mathcal{W}$, typically by finding lower-rank tensors using SVD.
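As a concrete reference, the following NumPy sketch (our illustration, not code from the paper) evaluates the multi-linear map of equation 1 directly, ignoring stride, padding, and the batch dimension, and cross-checks it against a vectorized sliding-window formulation:

```python
import numpy as np

def conv_multilinear(W, X):
    """Direct evaluation of equation (1): Y[f, x, y] = sum_{i,j,c} W[f, c, i, j] * X[c, i+x, j+y]."""
    F, C, Kh, Kw = W.shape
    _, H, Wx = X.shape
    Y = np.zeros((F, H - Kh + 1, Wx - Kw + 1))
    for f in range(F):
        for x in range(H - Kh + 1):
            for y in range(Wx - Kw + 1):
                # Elementwise product over the (C, Kh, Kw) window, summed over c, i, j.
                Y[f, x, y] = np.sum(W[f] * X[:, x:x + Kh, y:y + Kw])
    return Y

# Cross-check against an einsum over sliding windows of the input.
W = np.random.randn(8, 3, 3, 3)   # F x C x Kh x Kw
X = np.random.randn(3, 10, 10)    # C x H x W
patches = np.lib.stride_tricks.sliding_window_view(X, (3, 3), axis=(1, 2))  # C x H' x W' x Kh x Kw
assert np.allclose(conv_multilinear(W, X), np.einsum('fcij,cxyij->fxy', W, patches))
```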
One possible way of obtaining such a compact approximation comes from the fact that any tensor $\mathcal{W} \in \mathbb{R}^{w_1 \times \cdots \times w_N}$ can be written as a sum of Kronecker products (Hameed et al. 2022). Let the Kronecker product between two matrices $A \in \mathbb{R}^{a_1 \times a_2}$ and $B \in \mathbb{R}^{b_1 \times b_2}$ be given by
$$A \otimes B \triangleq \begin{bmatrix} A_{11}B & \cdots & A_{1 a_2}B \\ \vdots & \ddots & \vdots \\ A_{a_1 1}B & \cdots & A_{a_1 a_2}B \end{bmatrix}, \qquad (2)$$
which can be extended to two multi-way tensors $\mathcal{A} \in \mathbb{R}^{a_1 \times \cdots \times a_N}$ and $\mathcal{B} \in \mathbb{R}^{b_1 \times \cdots \times b_N}$ as
$$(\mathcal{A} \otimes \mathcal{B})_{i_1 \cdots i_N} \triangleq \mathcal{A}_{j_1 \cdots j_N} \mathcal{B}_{k_1 \cdots k_N}, \qquad (3)$$
where $j_n = \lfloor i_n / b_n \rfloor$ and $k_n = i_n \bmod b_n$. Since $\mathcal{W}$ can be written as a sum of Kronecker products, i.e., $\mathcal{W} = \sum_{r=1}^{R} \mathcal{A}_r \otimes \mathcal{B}_r$, it can be approximated using a lower-rank representation by solving
$$\min_{\{\mathcal{A}_r\},\{\mathcal{B}_r\}} \left\| \mathcal{W} - \sum_{r=1}^{\widehat{R}} \mathcal{A}_r \otimes \mathcal{B}_r \right\|_F^2, \qquad (4)$$
for $\widehat{R}$ sums of Kronecker products ($\widehat{R} \leq R$) using the SVD of a particular reshaping (unfolding) of $\mathcal{W}$, where $\|\cdot\|_F$ denotes the Frobenius norm. Thus, the convolutional layer can be compressed by replacing $\mathcal{W}$ with $\sum_{r=1}^{\widehat{R}} \mathcal{A}_r \otimes \mathcal{B}_r$ in equation 1, where $\widehat{R}$ controls the compression rate.
3.2 SEKRON TENSOR DECOMPOSITION
The multi-way tensor representation in equation 3 can be extended to sequences of multi-way tensors:
$$\left(\mathcal{A}^{(1)} \otimes \cdots \otimes \mathcal{A}^{(S)}\right)_{i_1 \cdots i_N} \triangleq \mathcal{A}^{(1)}_{j^{(1)}_1 \cdots j^{(1)}_N} \cdots \mathcal{A}^{(S)}_{j^{(S)}_1 \cdots j^{(S)}_N}, \qquad (5)$$
where
$$j^{(k)}_n = \begin{cases} \left( i_n - \sum_{t=1}^{k-2} j^{(t)}_n \prod_{l=t+1}^{S} a^{(l)}_n \right) \bmod a^{(S)}_n, & k = S, \\[6pt] \left\lfloor \left( i_n - \sum_{t=1}^{k-1} j^{(t)}_n \prod_{l=t+1}^{S} a^{(l)}_n \right) \Big/ \prod_{l=k+1}^{S} a^{(l)}_n \right\rfloor, & \text{otherwise}, \end{cases} \qquad (6)$$
and $\mathcal{A}^{(k)} \in \mathbb{R}^{a^{(k)}_1 \times \cdots \times a^{(k)}_N}$. Therefore, our decomposition method using sequences of Kronecker products, i.e., SeKron, applied to a given tensor $\mathcal{W} \in \mathbb{R}^{w_1 \times \cdots \times w_N}$ involves finding a solution to
$$\min_{\{\mathcal{A}^{(k)}\}_{k=1}^{S}} \left\| \mathcal{W} - \sum_{r_1=1}^{R_1} \mathcal{A}^{(1)}_{r_1} \otimes \sum_{r_2=1}^{R_2} \mathcal{A}^{(2)}_{r_1 r_2} \otimes \cdots \otimes \sum_{r_{S-1}=1}^{R_{S-1}} \mathcal{A}^{(S-1)}_{r_1 \cdots r_{S-1}} \otimes \mathcal{A}^{(S)}_{r_1 \cdots r_{S-1}} \right\|_F^2. \qquad (7)$$
Although this is a non-convex optimization problem, we provide a solution based on recursive
application of SVD and demonstrate its quasi-optimality:
Theorem 1 (Tensor Decomposition using a Sequence of Kronecker Products). Any tensor $\mathcal{W} \in \mathbb{R}^{w_1 \times \cdots \times w_N}$ can be represented by a sequence of Kronecker products between $S \in \mathbb{N}$ factors:
$$\mathcal{W} = \sum_{r_1=1}^{R_1} \mathcal{A}^{(1)}_{r_1} \otimes \sum_{r_2=1}^{R_2} \mathcal{A}^{(2)}_{r_1 r_2} \otimes \cdots \otimes \sum_{r_{S-1}=1}^{R_{S-1}} \mathcal{A}^{(S-1)}_{r_1 \cdots r_{S-1}} \otimes \mathcal{A}^{(S)}_{r_1 \cdots r_{S-1}}, \qquad (8)$$
where $R_i$ are ranks of intermediate matrices and $\mathcal{A}^{(k)} \in \mathbb{R}^{a^{(k)}_1 \times \cdots \times a^{(k)}_N}$.