Deep learning model compression using network
sensitivity and gradients
Madhumitha Sakthi1, Niranjan Yadla2, Raj Pawate2
1. Electrical and Computer Engineering, The University of Texas at Austin, USA (madhumithasakthi.iyer@utexas.edu), 2. Tensilica IPG, Cadence Design Systems Inc., USA (pawateb@cadence.com)
Abstract—Deep learning model compression is an evolving and important field for the edge deployment of deep learning
models. Given the increasing size of the models and their
corresponding power consumption, it is vital to decrease the
model size and compute requirement without a significant drop
in the model’s performance. In this paper, we present model
compression algorithms for both non-retraining and retraining
conditions. In the first case, where retraining of the model is not feasible due to lack of access to the original data or to the necessary compute resources and only off-the-shelf models are available, we propose the Bin & Quant algorithm for
compression of the deep learning models using the sensitivity of
the network parameters. This results in 13x compression of the
speech command and control model and 7x compression of the DeepSpeech2 model. In the second case, when the models can be
retrained and utmost compression is required with negligible loss in accuracy, we propose our novel gradient-weighted k-means
clustering algorithm (GWK). This method uses the gradients
in identifying the important weight values in a given cluster
and nudges the centroid towards those values, thereby giving
importance to sensitive weights. Our method effectively combines
product quantization with the EWGS [1] algorithm for sub-1-
bit representation of the quantized models. We test our GWK
algorithm on the CIFAR10 dataset across a range of models such
as ResNet20, ResNet56, MobileNetv2 and show 35x compression
on quantized models for less than 2% absolute loss in accuracy
compared to the floating-point models.
Index Terms—model compression, storage compression, computer vision, image classification
I. INTRODUCTION
Deep learning models are applied to achieve state-of-the-
art results across applications in various fields. The initial
success of deep learning models was attributed to their extremely large number of parameters and hence their ability to model complex problems. However, given the applicability
of deep learning across various fields and their upcoming
edge device deployment, it is crucial to reduce the model
size and computational complexity without compromising on
the model performance. Therefore, recent research has focused on producing lightweight deep learning models and on pruning, quantization and clustering techniques for compression. Most
often, these methods either focus on storage compression or
model quantization for edge computing advantage. Also, these
methods need extensive re-training of the compressed model
starting from the floating point model which is already heavily
trained. While re-training these models is compute-intensive and requires knowledge of the hyperparameters used to train the original model, retraining remains essential to achieve extreme compression with the least loss in accuracy.
Fig. 1: The overall compression algorithm for various models.
As the compression complexity increases, the size and inference complexity of the model decrease.
In this paper, we present 3 such compression cases for
varying storage, compression complexity and accuracy re-
quirements. First, our original Bin & Quant method [2] is
specifically designed for compressing the pre-trained models
without having to retrain them after compression. In this
method, we use sensitivity analysis to guide us in identifying
the appropriate bins for a particular layer in a model and
cluster the bin values to represent them using a single label
and hence introduce storage compression. This method was
specifically developed for float activation models and for
storage compression. The main motivation for developing a technique for compression without retraining is that retraining requires access to the original training data. In this era of growing privacy and security concerns, obtaining access to the original data is often difficult [3]. Retraining also requires prior knowledge of the hyperparameters used for training the original model, and without access to the same amount of compute capacity that was used to train the initial model, a fresh hyperparameter search may be needed. In addition, setting up the system for retraining is itself time-consuming and hence expensive. Therefore, we propose a compression algorithm
without the need for retraining. In the second case, we extend
our Bin & Quant algorithm to quantized models and apply the
algorithm for compression of the uint8 weight values without
retraining. This compression scheme works on top of the
quantized models and provides extra compression at minimal
cost in accuracy and compression complexity.
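As a rough illustration of the bin-and-label step shared by both variants, the sketch below maps one layer's float weights to bin labels plus a small codebook; it assumes the bin edges have already been chosen by the sensitivity analysis, and all names are illustrative rather than taken from our implementation:

import numpy as np

def bin_and_quant_layer(weights, bin_edges):
    # Map every weight of one layer to the index of its bin and replace the
    # whole bin by a single representative value (here the bin mean), so the
    # layer is stored as small integer labels plus a short codebook.
    flat = weights.ravel()
    labels = np.digitize(flat, bin_edges)        # bin index per weight
    n_bins = len(bin_edges) + 1
    codebook = np.zeros(n_bins, dtype=flat.dtype)
    for b in range(n_bins):
        members = labels == b
        if np.any(members):
            codebook[b] = flat[members].mean()
    # Labels fit in uint8 as long as at most 256 bins are used per layer.
    return labels.astype(np.uint8), codebook

# Approximate reconstruction at load time:
# restored = codebook[labels].reshape(weights.shape)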
Finally, in cases where retraining is feasible, we present
a gradient-weighted k-means (GWK) algorithm in addition to
quantization-aware training of the model using the EWGS [1]
algorithm for storage compression of the quantized models.
This method is proposed for utmost model compression while
taking advantage of quantized weights and activation to retain
the edge deployment advantages of the quantized models.
Since we use product quantization, similar to [4], [5], we
achieve sub-1-bit representation per weight while the activa-
tion is also quantized. Our GWK method utilizes the gradients
in identifying the sensitive parameters of the network. We
empirically show that perturbing the network parameters with the highest gradients leads to a larger drop in accuracy than perturbing randomly chosen weights, emphasizing that the gradients are capable of identifying sensitive parameters of the network; this information aids in nudging the centroids of the clusters towards these sensitive weights in our GWK algorithm. Therefore, we present these three
algorithms for three different scenarios. In the first case, where
retraining is not feasible and the model has float weights and
activations, Bin & Quant would be used for storage level
compression. In the second case, we propose to use Bin &
Quant for the storage compression of the quantized models.
Finally, the third algorithm is proposed for the utmost model
compression in conditions where retraining is possible and
least/no accuracy drop is required.
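To make the third case concrete, a condensed sketch of the gradient-weighted centroid update is given below; grouping the weights into product-quantization subvectors and weighting each subvector by its gradient magnitude is one simple way to realize the nudging described above, and the exact objective, EWGS coupling and training schedule are given in Section III. The names and the importance measure here are illustrative only:

import numpy as np

def gwk_update_centroids(subvectors, grads, assignments, n_clusters):
    # subvectors : (N, d) product-quantization subvectors of a layer's weights
    # grads      : (N, d) gradients of the loss w.r.t. those weights
    # assignments: (N,)   current cluster index of each subvector
    # Each centroid is a weighted mean in which subvectors carrying larger
    # gradient magnitude count more, nudging the centroid towards the
    # sensitive weights inside that cluster.
    importance = np.linalg.norm(grads, axis=1) + 1e-8   # floor for zero gradients
    centroids = np.zeros((n_clusters, subvectors.shape[1]))
    for k in range(n_clusters):
        members = assignments == k
        if np.any(members):
            w = importance[members][:, None]
            centroids[k] = (w * subvectors[members]).sum(axis=0) / w.sum()
    return centroids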
We tested our first algorithm on speech recognition models, showing 13x compression of the speech command and control model and 7x compression of the DeepSpeech2 model; the evaluation covered speech command models, large vocabulary speech recognition models and the VGG16 network. The second algorithm was tested on
computer vision models such as Mobilenetv2 and Resnet
since these models are widely used as post training quantized
models, especially in edge devices for image classification
tasks. In this case, we achieved 6.32x compression of the
Mobilenetv2 model with quantized activations. Finally, the
third algorithm was tested on ResNet and MobileNet models, and we achieved sub-1-bit representation per weight of the
models with quantized activations. In Figure 1, we describe the three compression algorithms and their corresponding results. As the compression complexity increases, the size and inference complexity decrease.
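The sub-1-bit figures follow from the usual product-quantization storage accounting. With $d$-dimensional subvectors, a codebook of $k$ centroids per layer and $b$-bit centroid elements shared by $n$ weights, the cost is approximately
\[
  \text{bits per weight} \;\approx\; \frac{\log_2 k}{d} \;+\; \frac{k\,d\,b}{n},
\]
so that, as a purely illustrative example not tied to any particular experiment in this paper, $k = 256$ and $d = 16$ give $8/16 = 0.5$ bits per weight before the small codebook term.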
Therefore, the main contributions of this paper are the following:
• A novel Bin & Quant model compression algorithm for storage compression of off-the-shelf models with float weight and activation parameters, without retraining.
• An extension of the Bin & Quant algorithm to integer quantized models, for compression of weight- and activation-quantized models without retraining.
• An empirical demonstration of the significance of gradients in identifying the sensitive parameters of the network.
• A novel gradient-weighted k-means algorithm for utmost model compression of quantized models trained with EWGS [1]. The gradients nudge the centroids of the clusters towards sensitive parameters, and the algorithm achieves a sub-1-bit weight representation of the quantized parameters.
We organize the rest of the paper as follows. In section
two, we give a brief overview of the existing research on deep
learning model compression algorithms for storage compres-
sion and quantization methods. In section three, we present our
three proposed algorithms for retraining and non-retraining-
based compression methods. In section four, the experimental
evaluation of the proposed algorithms is presented across
various models, and finally, we conclude in section five.
II. RELATED WORK
Clustering-based model compression techniques have re-
cently gained a lot of attention due to their ability to represent
multiple weight values using a single label and hence save on
storage space. Particularly, the work by [6] has been widely
used and developed further. In [6], the authors presented a
multi-stage compression pipeline of clustering, re-training and
Huffman encoding of various vision models such as VGG16,
AlexNet. Recently, the work by [5] extended this clustering technique with product quantization, so that multiple weight values in proximity are represented by a single label, leading to a sub-1-bit-per-weight representation. Their k-means algorithm is
optimized to preserve the output activation rather than the
weights directly and showed 29x compression in the case of
ResNet-18 and 18x compression in the case of ResNet-50
model. In the [4] paper, the non-differentiable k-means clus-
tering technique was addressed by adding attention matrices
to guide cluster formation and enabled joint optimization of
DNN parameters and cluster centroids. They could compress
the Mobilenet-v2 to 3 bits/weight and ResNet-18 to 0.717
bits/weight. However, in all of the above methods, the authors
focused on weight storage compression and retained float
activations. In DKM [4], the authors propose to use attention matrices to guide centroid formation; as a result, in the case of a large ResNet-50 model with an 8/8 configuration, i.e., 8-bit clusters and 8-d subvector size, the model does not converge due to an out-of-memory error. However, in our proposed GWK, we use a 1-d
gradient vector, the same size as that of our original model
weight that is being compressed without the need for an extra
attention matrix parameter to guide cluster formation. Also,
while DKM only focuses on storage compression, we provide
results on both storage compression and weight quantization.
Finally, in DKM the attention matrices are derived from the distance metric and are used to guide the centroid formation, whereas we use the gradients to guide the centroid formation, motivated by the empirically tested hypothesis that gradients are capable of identifying the sensitive parameters in a layer; we therefore give importance to these parameters during centroid determination. The work by [7] compresses the
Mobilenet model by 10.7x using universal vector quantization
and universal source coding techniques.

Fig. 2: From left to right, the effect of Gaussian perturbation is shown in the first row and the effect of a ±50% relative magnitude perturbation is shown in the second row, for VGG16 and SPC measured using accuracy, and for DS and DS2 measured using error rate.

The [8] paper proposed to use reinforcement learning to automatically determine the best quantization policy based on the hardware accelerator's feedback, and tested their algorithm on MobileNetv1, v2 and ResNet models. In another paper [9], the authors propose
the GOBO algorithm for weight clustering using a method
similar to k-means on NLP models. In [10], the authors
propose to permute the weights before compressing and fine-
tuning the layers and this results in significant performance
improvement. The authors in [11] propose a non-retraining
setup for compression of vision models by determining the
best k based on accuracy loss. In contrast, in our B&Q work, we focus on a novel compression algorithm based on the sensitivity of the network to noise, and the method is applicable to both float and integer quantized models.
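The sensitivity-to-noise probe that underlies B&Q (illustrated in Fig. 2) can be summarized by the following simplified sketch; the evaluation callback, the perturbation scale and the number of trials are illustrative placeholders rather than the settings used in our experiments:

import numpy as np

def layer_sensitivity(layer_weights, evaluate, sigma=0.01, trials=5):
    # layer_weights: ndarray of one layer's parameters (perturbed in place,
    #                restored after every trial)
    # evaluate     : callable returning task accuracy (or negative error rate)
    baseline = evaluate()
    original = layer_weights.copy()
    drops = []
    for _ in range(trials):
        # Gaussian perturbation (first row of Fig. 2); a +/-50% relative
        # magnitude perturbation is the other variant shown there.
        layer_weights += np.random.normal(0.0, sigma, size=layer_weights.shape)
        drops.append(baseline - evaluate())
        layer_weights[...] = original   # restore before the next trial
    return float(np.mean(drops))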
Finally, in order to deploy the models in INT DSPs,
both the weights and activations should be quantized to
INT values. Model quantization methods have also gained
popularity due to their ability to quantize both the weights and
activation values [12]. In [1], the authors propose EWGS as an alternative to the straight-through estimator (STE), scaling each gradient element based on the sign of the gradient and the discretization error. They showed an accuracy of 73.9% on the ImageNet dataset using the ResNet34 model
with 4-bit weights and 4-bit activations. In [13], again an
algorithm was proposed for training quantized models by
using a customized regularization loss for directing the weight
values towards a distribution with maximum accuracy while
minimizing quantization error. On MobilenetV2 for the person
detection task, they compressed it to 2 bits and showed
87.5% accuracy. PROFIT [14] is another Mobilenet model
compression technique where the activation instability due
to weight quantization is alleviated by progressive freezing
and iterative training of the model. They achieved 4-bit com-
pression of the Mobilenetv1 model for a 60.056% accuracy.
The authors in [15] presented an additive powers-of-two quantization technique. They reparametrize the clipping function to obtain a better-defined gradient for learning the clipping threshold, and use weight normalization to stabilize training. They showed ResNet-18 compressed to 5-bit weight
and activation with 70.9% accuracy on the ImageNet dataset.
Memory-efficient networks such as MobileNet [16], [17] and EfficientNet [18], [19] are alternatives that directly train efficient neural networks instead of training a heavy network and then compressing and/or quantizing it. However, in our study, we compress and quantize the memory-efficient MobileNet network as well.
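Since the element-wise gradient scaling of [1] is reused in our retraining pipeline (Section III), we note the form of its backward rule here. The snippet below is our simplified paraphrase, with an illustrative scaling factor, and is not the reference implementation of [1]:

import numpy as np

def ewgs_scaled_grad(grad_out, x, x_q, delta=1e-3):
    # grad_out: gradient flowing back through the discretizer output
    # x, x_q  : full-precision value and its quantized counterpart
    # Each gradient element is scaled using its own sign and the
    # discretization error x - x_q, instead of being passed straight through.
    return grad_out * (1.0 + delta * np.sign(grad_out) * (x - x_q))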
Although the model quantization papers above focus only on weight and activation quantization, the authors of [20] presented a method for both storage compression and weight/activation quantization using product quantization. As a quantization-aware training scheme, their algorithm quantizes a different random subset of the weights during each forward pass rather than all the weights, which improves the performance of the model while inducing extreme compression. However, the authors only tested their method on the EfficientNet-B3 and RoBERTa models and reduced the representation to 4-bit weights and activations at a significant loss in accuracy.
In the case of speech recognition models, there have been
several works on compression. In [21], the authors used knowledge distillation to train a student network. However, the knowledge distillation method is resource-intensive. Other works
[21], [22] applied Singular Value Decomposition (SVD) and
pruned the neurons for network compression. The authors in
[23] showed compression to 1/3rd the original size but they
had to use model adaptation and fine-tuning for maintaining
the network performance. Recently, the authors in [24] trained highly sparse speech recognition networks that were 20% of the size of the full-weight models, and the networks also exhibited noise robustness. The authors in [25] proposed to quantize the
speech recognition models and calibrate the model during
quantization using synthetic data.
III. METHOD
A. Bin & Quant
We designed the Bin & Quant method [2] specifically for
compressing speech recognition models. We compress Speech
Command and Control (SPC), Deep Speech, Deep Speech
2 and VGG16 models using the proposed algorithm. In this