algorithm for compression of the uint8 weight values without
retraining. This compression scheme works on top of the
quantized models and provides extra compression at minimal
cost in accuracy and compression complexity.
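As a rough, hypothetical sketch of this kind of post-hoc compression (the exact Bin & Quant procedure is presented later in the paper and may differ), one can imagine grouping the uint8 weight values of a layer into a small number of bins and storing only per-weight bin indices plus a short codebook; the function name and binning rule below are illustrative assumptions, not the authors' exact method.

import numpy as np
from sklearn.cluster import KMeans

def bin_uint8_weights(weights_u8, n_bins=16):
    # Hypothetical post-hoc binning of already-quantized uint8 weights:
    # cluster the observed values into n_bins levels and keep only the
    # bin indices (log2(n_bins) bits each) plus the tiny codebook.
    values = weights_u8.reshape(-1, 1).astype(np.float32)
    km = KMeans(n_clusters=n_bins, n_init=10).fit(values)
    codebook = np.rint(km.cluster_centers_.ravel()).astype(np.uint8)
    labels = km.labels_.astype(np.uint8)  # 4-bit indices when n_bins = 16
    return labels, codebook

In such a scheme, storing 4-bit indices instead of 8-bit values already halves the weight storage, and entropy coding of the indices can provide further gains.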
Finally, in cases where retraining is feasible, we present a gradient-weighted k-means (GWK) algorithm, in addition to quantization-aware training of the model using the EWGS [1] algorithm, for storage compression of the quantized models. This method is proposed for maximum model compression while taking advantage of quantized weights and activations to retain the edge deployment advantages of the quantized models.
Since we use product quantization, similar to [4], [5], we
achieve sub-1-bit representation per weight while the activa-
tion is also quantized. Our GWK method uses gradients to identify the sensitive parameters of the network. We empirically show that perturbing the parameters with the highest gradient magnitudes causes a larger accuracy drop than perturbing randomly chosen weights. This demonstrates that gradients can identify the sensitive parameters of a network, and GWK uses this information to nudge the cluster centroids towards these sensitive weights. We therefore present these three algorithms for three different scenarios. In the first case, where retraining is not feasible and the model has float weights and activations, Bin & Quant is used for storage-level compression. In the second case, we propose to use Bin & Quant for the storage compression of the quantized models. Finally, the third algorithm is proposed for maximum model compression when retraining is possible and little to no accuracy drop is acceptable.
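As a minimal illustration of the gradient-weighted centroid update at the core of GWK (the full algorithm, including product quantization and EWGS-based training, is presented in section three), the sketch below assumes per-weight gradient magnitudes act as sample weights in an otherwise standard k-means step; the variable names are illustrative.

import numpy as np

def gwk_centroid_step(weights, grads, assignments, centroids):
    # One gradient-weighted centroid update: each weight contributes in
    # proportion to its gradient magnitude, pulling centroids towards the
    # sensitive (high-gradient) parameters instead of the plain cluster mean.
    importance = np.abs(grads)
    new_centroids = centroids.copy()
    for k in range(len(centroids)):
        mask = assignments == k
        total = importance[mask].sum()
        if total > 0:                      # leave empty clusters unchanged
            new_centroids[k] = (importance[mask] * weights[mask]).sum() / total
    return new_centroids

With all importance values equal, this reduces to the ordinary k-means mean update; larger gradient magnitudes pull each centroid towards the corresponding sensitive weights.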
We tested our first algorithm on a range of models, including speech command-and-control models, large-vocabulary speech recognition models, and the VGG16 network, achieving 13x compression of the speech command-and-control model and 7x compression of the DeepSpeech2 model. The second algorithm was tested on
computer vision models such as Mobilenetv2 and Resnet
since these models are widely used as post-training quantized
models, especially in edge devices for image classification
tasks. In this case, we achieved 6.32x compression of the
Mobilenetv2 model with quantized activations. Finally, the
third algorithm was tested on ResNet and MobileNet models, and we achieved sub-1-bit representation per weight with quantized activations. In figure 1, we describe the three compression algorithms and their corresponding results. As the compression complexity increases, the model size and inference complexity decrease.
Therefore, the main contributions of this paper are the following:
• A novel Bin & Quant model compression algorithm for storage compression of off-the-shelf models with float weights and activations, without retraining.
• The Bin & Quant algorithm extended to integer-quantized models, for compression of weight- and activation-quantized models without retraining.
• An empirical demonstration of the significance of gradients in identifying the sensitive parameters of a network (see the sketch after this list).
• A novel gradient-weighted k-means algorithm for maximum model compression of quantized models using EWGS [1]. The gradients nudge the cluster centroids towards sensitive parameters, and the algorithm achieves a sub-1-bit weight representation of the quantized parameters.
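The gradient-sensitivity claim above can be probed with an experiment of the following form; this is an illustrative sketch only, assuming a PyTorch classification model, a loss function, and an eval_fn returning held-out accuracy (all placeholders, not the exact protocol of our experiments): perturbing the weights with the largest gradient magnitudes should degrade accuracy more than perturbing an equally sized random subset.

import copy
import torch

def gradient_sensitivity_probe(model, loss_fn, eval_fn, batch,
                               frac=0.01, scale=0.05):
    # Compare the accuracy drop from perturbing the highest-gradient weights
    # against perturbing an equally sized random subset of weights.
    x, y = batch
    model.zero_grad()
    loss_fn(model(x), y).backward()            # populate parameter gradients
    results = {}
    for mode in ("high_gradient", "random"):
        probe = copy.deepcopy(model)
        for p, q in zip(model.parameters(), probe.parameters()):
            if p.grad is None:
                continue
            n = max(1, int(frac * p.numel()))
            if mode == "high_gradient":
                idx = p.grad.abs().flatten().topk(n).indices
            else:
                idx = torch.randperm(p.numel())[:n]
            flat = q.data.flatten().clone()
            flat[idx] += scale * flat.abs().mean()   # small additive perturbation
            q.data = flat.view_as(q.data)
        results[mode] = eval_fn(probe)         # held-out accuracy of perturbed copy
    return results

A consistently larger accuracy drop in the high-gradient case supports using gradient magnitude as a sensitivity signal for centroid determination.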
We organize the rest of the paper as follows. In section
two, we give a brief overview of the existing research on deep
learning model compression algorithms for storage compres-
sion and quantization methods. In section three, we present our
three proposed algorithms for retraining-based and non-retraining-based compression. In section four, we present the experimental evaluation of the proposed algorithms across various models, and in section five we conclude.
II. RELATED WORK
Clustering-based model compression techniques have re-
cently gained a lot of attention due to their ability to represent
multiple weight values using a single label and hence save on
storage space. In particular, the work by [6] has been widely used and developed further. In [6], the authors presented a multi-stage compression pipeline of clustering, retraining, and Huffman encoding applied to various vision models such as VGG16 and AlexNet. Recently, the work in [5] extended this clustering technique with product quantization, so that multiple nearby weight values are represented by a single label, enabling a sub-1-bit representation per weight. Their k-means algorithm is optimized to preserve the output activations rather than the weights directly, and showed 29x compression for ResNet-18 and 18x compression for the ResNet-50 model.
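To make the sub-1-bit arithmetic concrete (an illustrative sketch with assumed sizes, not the exact configuration of [5] or [4]): if every d consecutive weights form a subvector and each subvector is replaced by the index of one of k codebook entries, the index cost is log2(k)/d bits per weight.

import numpy as np
from sklearn.cluster import KMeans

def product_quantize(weights, d=8, k=256):
    # Cluster d-dimensional subvectors into k codewords; each subvector is
    # then stored as a single log2(k)-bit index, i.e. log2(k)/d bits/weight.
    w = weights.reshape(-1, d)                 # assumes weights.size % d == 0
    km = KMeans(n_clusters=k, n_init=4).fit(w)
    codes = km.labels_.astype(np.uint16)       # one index per subvector
    bits_per_weight = np.log2(k) / d           # 8/8 = 1 bit here; <1 for larger d
    return codes, km.cluster_centers_, bits_per_weight

With d = 16 and k = 256, for example, the index cost drops to 0.5 bits per weight, before accounting for the small codebook overhead.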
In [4], the non-differentiability of k-means clustering was addressed by adding attention matrices to guide cluster formation, enabling joint optimization of the DNN parameters and cluster centroids. They compressed Mobilenet-v2 to 3 bits/weight and ResNet-18 to 0.717
bits/weight. However, in all of the above methods, the authors
focused on weight storage compression and retained float
activations. In DKM [4], the authors propose attention matrices to guide centroid formation; as a result, for a large ResNet-50 model with an 8/8 configuration, i.e., 8-bit clusters and 8-d subvectors, the model does not converge due to out-of-memory errors. In contrast, our proposed GWK uses a 1-d gradient vector of the same size as the original model weights being compressed, without any extra attention matrix parameters to guide cluster formation. Also, while DKM focuses only on storage compression, we provide results on both storage compression and weight quantization.
Finally, in DKM the attention matrices are derived from the distance metric and used to guide centroid formation, whereas we use gradients to guide centroid formation, motivated by the empirically tested hypothesis that gradients can identify the sensitive parameters in a layer; we therefore give these parameters greater importance during centroid determination. The work by [7] compresses the
Mobilenet model by 10.7x using universal vector quantization
and a universal source coding technique. The work in [8] proposed