et al. 2016) or pruning techniques (Luo et al. 2017; Han et al. 2015b). Many pruning methods (Luo
et al. 2017; Zhang et al. 2018b) aim for a high compression ratio and accuracy regardless of the
structure of the sparsity. As a result, they often suffer from imbalanced workloads caused by irregular
memory access. Hence, several works aim to zero out structured groups of DNN components
through more hardware-friendly approaches (Wen et al. 2016).
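For illustration, the following is a minimal magnitude-based channel-pruning sketch in the structured spirit of (Wen et al. 2016); the function name, the keep_ratio parameter, and the L2-norm criterion are our own generic choices rather than details of the cited works.

import numpy as np

def prune_channels(W, keep_ratio=0.5):
    # Structured pruning sketch: zero entire output channels of a conv
    # weight of shape (out_ch, in_ch, kH, kW), ranked by L2 norm.
    norms = np.linalg.norm(W.reshape(W.shape[0], -1), axis=1)
    k = max(1, int(W.shape[0] * keep_ratio))
    threshold = np.sort(norms)[-k]            # k-th largest channel norm
    mask = norms >= threshold
    return W * mask[:, None, None, None]

W = np.random.randn(16, 8, 3, 3)
print(np.count_nonzero(prune_channels(W).reshape(16, -1).any(axis=1)))  # ~8 channels kept

Because whole channels are removed, the surviving computation stays dense and regular, which is precisely what makes such schemes hardware-friendly compared to unstructured sparsity.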
Quantization.
The computation and memory complexity of DNNs can be reduced by quantizing
model parameters to lower bit-widths, and the majority of research works use fixed-bit quanti-
zation. For instance, the methods proposed in (Gysel et al. 2018; Louizos et al. 2018) use fixed 4- or
8-bit quantization. Model parameters have been quantized even further into ternary (Li et al. 2016;
Zhu et al. 2016) and binary (Courbariaux et al. 2015; Rastegari et al. 2016; Courbariaux et al. 2016)
representations. These methods often suffer from low performance even when activations are left
unquantized (Li et al. 2016). Mixed-precision approaches, however, achieve more competitive
performance, as shown in (Uhlich et al. 2019), where the bit-width of each layer is determined
adaptively. Moreover, the choice between uniform (Jacob et al. 2018) and nonuniform (Han et al.
2015a; Tang et al. 2017; Zhang et al. 2018a) quantization intervals has an important effect on both
the compression rate and the achievable acceleration.
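To make the fixed-bit setting concrete, here is a minimal sketch of symmetric uniform quantization of a weight tensor; the per-tensor scale and the function name are our own simplifying assumptions, not details of any cited method.

import numpy as np

def uniform_quantize(w, num_bits=8):
    # Symmetric uniform quantization: w is approximated by scale * q,
    # with q an integer tensor and scale a single float.
    qmax = 2 ** (num_bits - 1) - 1        # e.g. 127 for 8 bits, 7 for 4 bits
    scale = np.abs(w).max() / qmax        # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int32)
    return q, scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = uniform_quantize(w, num_bits=4)
print(np.max(np.abs(w - scale * q)))      # worst-case quantization error

Dequantization is simply scale * q, so only the integer tensor and one scalar need to be stored; nonuniform schemes replace the fixed grid with learned or clustered quantization levels.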
Tensor Decomposition.
Tensor decomposition approaches factorize weight tensors
into smaller tensors to reduce model size (Yin et al. 2021). Singular value decomposition (SVD),
applied to matrices as the 2-dimensional instance of tensor decomposition, was one of the
pioneering approaches to model compression (Jaderberg et al. 2014). Other classical high-
dimensional tensor decomposition methods, such as Tucker (Tucker 1963) and CP decomposition
(Harshman et al. 1970), have also been adopted for model compression. However, using these
methods often leads to significant accuracy drops (Kim et al. 2015; Lebedev et al. 2015; Phan et al.
2020). The idea of reshaping the weights of fully-connected layers into high-dimensional tensors and
representing them in Tensor-Train (TT) format (Oseledets 2011) was extended to CNNs in (Garipov et al. 2016). For
multidimensional tensors, Tensor-Ring (TR) decomposition (Wang et al. 2018a) has become a more popular option
than TT (Wang et al. 2017). Subsequent filter-basis decomposition works refined these approaches
by sharing a filter basis, and have been applied to low-level computer vision tasks such as single-image
super-resolution (Li et al. 2019). Kronecker factorization is another approach for replacing
the weight tensors of fully-connected and convolutional layers (Zhou et al. 2015). Its restriction to
rank-1 Kronecker product representations is alleviated in (Hameed et al. 2022), where the
compression rate is determined by both the rank and the factor dimensions.
For a fixed rank, maximum compression is achieved by selecting dimensions for each factor that
are closest to the square root of the original tensor's dimensions. This still leads to representations with
more parameters than those achieved using sequences of Kronecker products, as shown in Fig. 1b.
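To make the square-root claim concrete, consider a worked example (in our own generic notation, not that of (Hameed et al. 2022)): approximating a weight matrix $W \in \mathbb{R}^{m \times n}$ by a rank-$R$ sum of Kronecker products

$$W \approx \sum_{r=1}^{R} A_r \otimes B_r, \qquad A_r \in \mathbb{R}^{m_1 \times n_1}, \quad B_r \in \mathbb{R}^{m_2 \times n_2}, \quad m_1 m_2 = m, \quad n_1 n_2 = n,$$

costs $R\,(m_1 n_1 + m_2 n_2) \geq 2R\sqrt{mn}$ parameters by the AM-GM inequality, since $(m_1 n_1)(m_2 n_2) = mn$, with equality when $m_1 n_1 = m_2 n_2 = \sqrt{mn}$. For instance, with $m = n = 64$ and $R = 1$, two $8 \times 8$ factors cost $64 + 64 = 128$ parameters, whereas $2 \times 2$ and $32 \times 32$ factors cost $4 + 1024 = 1028$, both representing the same $4096$-entry matrix.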
There has been extensive research on tensor decomposition more broadly, e.g., characterizing the global correlation
of tensors (Zheng et al. 2021), extending CP to non-Gaussian data (Hong et al. 2020), and employing
augmented decomposition loss functions (Afshar et al. 2021) for various applications. Our
main focus in this paper is on decompositions used for NN compression.
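As background for the SVD-based approach mentioned above, the following is a minimal low-rank compression sketch for a fully-connected weight matrix; it is a generic illustration under our own naming, not the method of (Jaderberg et al. 2014).

import numpy as np

def svd_compress(W, rank):
    # Factor an (m, n) weight matrix into (m, rank) and (rank, n) pieces,
    # keeping only the top singular values.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # absorb singular values into the left factor
    B = Vt[:rank, :]
    return A, B                  # forward pass: x @ A @ B instead of x @ W

W = np.random.randn(512, 512)
A, B = svd_compress(W, rank=32)
print(A.size + B.size, W.size)   # 32768 vs. 262144 stored parameters

The two factors replace the original layer with two thinner layers, so both storage and multiply-accumulate cost drop from O(mn) to O(rank * (m + n)).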
Other Methods.
NNs can also be compressed using Knowledge Distillation (KD), where a large pre-
trained network, known as the teacher, is used to train a smaller student network (Mirzadeh et al. 2020; Heo
et al. 2019). Sharing weights in a more structured manner is another model compression approach,
as in FSNet (Yang et al. 2020), which shares filter weights across spatial locations, or ShaResNet (Boulch
2018), which reuses convolutional mappings within the same scale level. Designing lightweight CNNs
(Sandler et al. 2018; Iandola et al. 2016; Chollet 2017; Howard et al. 2019; Zhang et al. 2018c; Tan &
Le 2019) is another direction, orthogonal to the aforementioned approaches.
3 METHOD
In this section, we introduce SeKron and how it can be used to compress tensors in deep learning
models. We start by providing background on the Kronecker Product Decomposition in Section 3.1.
Then, we introduce our decomposition method in Section 3.2. In Section 3.3, we provide an algorithm for
computing the convolution operation using each of the factors directly (avoiding reconstruction) at
runtime. Finally, we discuss the computational complexity of the proposed method in Section 3.4.