algorithm for compression of the uint8 weight values without
retraining. This compression scheme works on top of the
quantized models and provides extra compression at minimal
cost in accuracy and compression complexity.
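As a rough, hypothetical sketch of this kind of post-hoc compression (the exact Bin & Quant procedure is presented later in the paper and may differ), one can imagine grouping the uint8 weight values of a layer into a small number of bins and storing only per-weight bin indices plus a short codebook; the function name and binning rule below are illustrative assumptions, not the authors' exact method.

import numpy as np
from sklearn.cluster import KMeans

def bin_uint8_weights(weights_u8, n_bins=16):
    # Hypothetical post-hoc binning of already-quantized uint8 weights:
    # cluster the observed values into n_bins levels and keep only the
    # bin indices (log2(n_bins) bits each) plus the tiny codebook.
    values = weights_u8.reshape(-1, 1).astype(np.float32)
    km = KMeans(n_clusters=n_bins, n_init=10).fit(values)
    codebook = np.rint(km.cluster_centers_.ravel()).astype(np.uint8)
    labels = km.labels_.astype(np.uint8)  # 4-bit indices when n_bins = 16
    return labels, codebook

In such a scheme, storing 4-bit indices instead of 8-bit values already halves the weight storage, and entropy coding of the indices can provide further gains.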
Finally, in cases where retraining is feasible, we present a gradient-weighted k-means (GWK) algorithm, in addition to quantization-aware training of the model using the EWGS [1] algorithm, for storage compression of the quantized models. This method is proposed for maximum model compression while taking advantage of quantized weights and activations to retain the edge deployment advantages of the quantized models.
Since we use product quantization, similar to [4], [5], we
achieve sub-1-bit representation per weight while the activa-
tion is also quantized. Our GWK method uses gradients to identify the sensitive parameters of the network. We empirically show that perturbing the parameters with the highest gradient magnitudes causes a larger accuracy drop than perturbing randomly chosen weights. This demonstrates that gradients can identify the sensitive parameters of a network, and GWK uses this information to nudge the cluster centroids towards these sensitive weights. We therefore present these three algorithms for three different scenarios. In the first case, where retraining is not feasible and the model has float weights and activations, Bin & Quant is used for storage-level compression. In the second case, we propose to use Bin & Quant for the storage compression of the quantized models. Finally, the third algorithm is proposed for maximum model compression when retraining is possible and little to no accuracy drop is acceptable.
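As a minimal illustration of the gradient-weighted centroid update at the core of GWK (the full algorithm, including product quantization and EWGS-based training, is presented in section three), the sketch below assumes per-weight gradient magnitudes act as sample weights in an otherwise standard k-means step; the variable names are illustrative.

import numpy as np

def gwk_centroid_step(weights, grads, assignments, centroids):
    # One gradient-weighted centroid update: each weight contributes in
    # proportion to its gradient magnitude, pulling centroids towards the
    # sensitive (high-gradient) parameters instead of the plain cluster mean.
    importance = np.abs(grads)
    new_centroids = centroids.copy()
    for k in range(len(centroids)):
        mask = assignments == k
        total = importance[mask].sum()
        if total > 0:                      # leave empty clusters unchanged
            new_centroids[k] = (importance[mask] * weights[mask]).sum() / total
    return new_centroids

With all importance values equal, this reduces to the ordinary k-means mean update; larger gradient magnitudes pull each centroid towards the corresponding sensitive weights.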
We tested our first algorithm on a range of models, including speech command-and-control models, large-vocabulary speech recognition models, and the VGG16 network, achieving 13x compression of the speech command-and-control model and 7x compression of the DeepSpeech2 model. The second algorithm was tested on
computer vision models such as Mobilenetv2 and Resnet
since these models are widely used as post-training quantized
models, especially in edge devices for image classification
tasks. In this case, we achieved 6.32x compression of the
Mobilenetv2 model with quantized activations. Finally, the
third algorithm was tested on ResNet and MobileNet models, and we achieved sub-1-bit representation per weight with quantized activations. In figure 1, we describe the three compression algorithms and their corresponding results. As the compression complexity increases, the model size and inference complexity decrease.
Therefore, the main contributions of this paper are the following:
• A novel Bin & Quant model compression algorithm for storage compression of off-the-shelf models with float weights and activations, without retraining.
• The Bin & Quant algorithm extended to integer-quantized models, for compression of weight- and activation-quantized models without retraining.
• An empirical demonstration of the significance of gradients in identifying the sensitive parameters of a network (see the sketch after this list).
• A novel gradient-weighted k-means algorithm for maximum model compression of quantized models using EWGS [1]. The gradients nudge the cluster centroids towards sensitive parameters, and the algorithm achieves a sub-1-bit weight representation of the quantized parameters.
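The gradient-sensitivity claim above can be probed with an experiment of the following form; this is an illustrative sketch only, assuming a PyTorch classification model, a loss function, and an eval_fn returning held-out accuracy (all placeholders, not the exact protocol of our experiments): perturbing the weights with the largest gradient magnitudes should degrade accuracy more than perturbing an equally sized random subset.

import copy
import torch

def gradient_sensitivity_probe(model, loss_fn, eval_fn, batch,
                               frac=0.01, scale=0.05):
    # Compare the accuracy drop from perturbing the highest-gradient weights
    # against perturbing an equally sized random subset of weights.
    x, y = batch
    model.zero_grad()
    loss_fn(model(x), y).backward()            # populate parameter gradients
    results = {}
    for mode in ("high_gradient", "random"):
        probe = copy.deepcopy(model)
        for p, q in zip(model.parameters(), probe.parameters()):
            if p.grad is None:
                continue
            n = max(1, int(frac * p.numel()))
            if mode == "high_gradient":
                idx = p.grad.abs().flatten().topk(n).indices
            else:
                idx = torch.randperm(p.numel())[:n]
            flat = q.data.flatten().clone()
            flat[idx] += scale * flat.abs().mean()   # small additive perturbation
            q.data = flat.view_as(q.data)
        results[mode] = eval_fn(probe)         # held-out accuracy of perturbed copy
    return results

A consistently larger accuracy drop in the high-gradient case supports using gradient magnitude as a sensitivity signal for centroid determination.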
We organize the rest of the paper as follows. In section
two, we give a brief overview of the existing research on deep
learning model compression algorithms for storage compres-
sion and quantization methods. In section three, we present our
three proposed algorithms for retraining-based and non-retraining-based compression. In section four, we present the experimental evaluation of the proposed algorithms across various models, and in section five we conclude.
II. RELATED WORK
Clustering-based model compression techniques have re-
cently gained a lot of attention due to their ability to represent
multiple weight values using a single label and hence save on
storage space. In particular, the work by [6] has been widely used and developed further. In [6], the authors presented a multi-stage compression pipeline of clustering, retraining, and Huffman encoding applied to various vision models such as VGG16 and AlexNet. Recently, the work in [5] extended this clustering technique with product quantization, so that multiple nearby weight values are represented by a single label, enabling a sub-1-bit representation per weight. Their k-means algorithm is optimized to preserve the output activations rather than the weights directly, and showed 29x compression for ResNet-18 and 18x compression for the ResNet-50 model.
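To make the sub-1-bit arithmetic concrete (an illustrative sketch with assumed sizes, not the exact configuration of [5] or [4]): if every d consecutive weights form a subvector and each subvector is replaced by the index of one of k codebook entries, the index cost is log2(k)/d bits per weight.

import numpy as np
from sklearn.cluster import KMeans

def product_quantize(weights, d=8, k=256):
    # Cluster d-dimensional subvectors into k codewords; each subvector is
    # then stored as a single log2(k)-bit index, i.e. log2(k)/d bits/weight.
    w = weights.reshape(-1, d)                 # assumes weights.size % d == 0
    km = KMeans(n_clusters=k, n_init=4).fit(w)
    codes = km.labels_.astype(np.uint16)       # one index per subvector
    bits_per_weight = np.log2(k) / d           # 8/8 = 1 bit here; <1 for larger d
    return codes, km.cluster_centers_, bits_per_weight

With d = 16 and k = 256, for example, the index cost drops to 0.5 bits per weight, before accounting for the small codebook overhead.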
In [4], the non-differentiability of k-means clustering was addressed by adding attention matrices to guide cluster formation, enabling joint optimization of the DNN parameters and cluster centroids. They compressed Mobilenet-v2 to 3 bits/weight and ResNet-18 to 0.717
bits/weight. However, in all of the above methods, the authors
focused on weight storage compression and retained float
activations. In DKM [4], the authors propose attention matrices to guide centroid formation; as a result, for a large ResNet-50 model with an 8/8 configuration, i.e., 8-bit clusters and 8-d subvectors, the model does not converge due to out-of-memory errors. In contrast, our proposed GWK uses a 1-d gradient vector of the same size as the original model weights being compressed, without any extra attention matrix parameters to guide cluster formation. Also, while DKM focuses only on storage compression, we provide results on both storage compression and weight quantization.
Finally, in DKM the attention matrices are derived from the distance metric and used to guide centroid formation, whereas we use gradients to guide centroid formation, motivated by the empirically tested hypothesis that gradients can identify the sensitive parameters in a layer; we therefore give these parameters greater importance during centroid determination. The work by [7] compresses the
Mobilenet model by 10.7x using universal vector quantization
and a universal source coding technique. The work in [8] proposed