
other network parameters to minimize the loss function. [5] developed a method to learn a clipping parameter for the activation function. [8] proposed a learnable scaling factor together with a means of estimating and scaling its gradient so that it can be trained jointly with the other network parameters. [15] suggested a gradient-based, hardware-friendly quantizer² that optimizes a power-of-2 (PO2) constrained clipping range³.
In this paper we study MSQE- and gradient-based methods for weight quantization on commonly used vision backbones, with the objective of understanding the underlying causes of the performance difference between the two classes of methods. Based on our insights, we propose improvements to the training process that increase training stability and the performance of the final models, achieving validation accuracies of 66.9±0.17% on MobileNetV1 and 67.5±0.01% on MobileNetV2. To the best of our knowledge, no prior work satisfies all of our hardware-friendliness constraints. The code for the quantizers will be available on GitHub⁴.
The rest of the paper is organized as follows. Section 2 provides background on hardware-friendly quantization and describes the two main classes of hardware-friendly quantizers. In Section 3, we analyze the effects of quantization during training for each quantizer and improve the performance of both quantizers based on the findings of this analysis. Finally, experiments and conclusions are presented in Section 4 and Section 5, respectively.
2 Background
In this section, we describe hardware-friendly quantization constraints for efficient custom hardware
implementation and the two main classes of hardware-friendly quantization methods used in our
study.
2.1 Quantization for Deep Neural Networks
Quantization of a DNN is the process of carefully searching for quantization parameters, such as the scaling factor ($\Delta$) in (1) below⁵, that map model parameters (e.g., weights, biases) and/or activations from a large set to approximate values in a finite, smaller set using a function $Q:\mathbb{R}\to\mathcal{C}$, with minimum accuracy degradation, as shown in (1).
$$w_q = Q(w, \Delta) = \Delta \cdot \mathrm{clip}\!\left(\mathrm{round}\!\left(\frac{w}{\Delta}\right);\ q_{\min}, q_{\max}\right), \quad \text{where } w \in \mathbb{R},\ w_q \in \mathcal{C} \tag{1}$$
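To make (1) concrete, the following is a minimal NumPy sketch of a uniform, symmetric quantizer; the function name and the example values of $\Delta$ and bit width are ours and purely illustrative, not the MSQE- or gradient-based quantizers studied in this paper.

```python
import numpy as np

def quantize(w, delta, bw, signed=True):
    """Uniform, symmetric quantizer Q(w, delta) as in Eq. (1)."""
    if signed:
        q_min, q_max = -2 ** (bw - 1) + 1, 2 ** (bw - 1) - 1
    else:
        q_min, q_max = 0, 2 ** bw - 1
    # Round to the nearest quantization level, clip to the valid range,
    # then rescale back to the original domain.
    return delta * np.clip(np.round(w / delta), q_min, q_max)

# Example: 4-bit signed quantization with delta = 0.25.
w = np.array([-0.83, -0.12, 0.05, 0.47, 1.90])
print(quantize(w, delta=0.25, bw=4))  # values snap to multiples of 0.25, clipped to [-1.75, 1.75]
```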
2.2 Hardware-Friendly Quantizer Constraints
Uniform quantization, which has uniformly spaced quantization levels, is often preferred due to its simplicity in hardware. Non-uniform quantization techniques such as K-means-clustering-based methods require a code-book look-up table and its associated hardware overhead, while logarithm-based methods suffer from the so-called rigid-resolution problem⁶ [20]. Adding multiple logarithmic quantization terms has been suggested to solve this problem [20]; however, it incurs hardware overhead.
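As a small illustration of the rigid-resolution issue (a sketch under our own simplifying assumptions, ignoring the sign bit and any scaling factor): the levels of a power-of-two (logarithmic) quantizer are listed below; increasing the bit width only adds levels ever closer to zero, so resolution near the largest magnitudes never improves.

```python
import numpy as np

def po2_levels(bw):
    """Non-negative levels of a bw-bit power-of-two quantizer (sign bit ignored)."""
    exponents = np.arange(-(2 ** bw) + 2, 1)      # e.g. bw=3 -> exponents -6..0
    return np.concatenate(([0.0], 2.0 ** exponents))

print(po2_levels(3))  # [0, 2^-6, ..., 0.25, 0.5, 1.0]: dense near 0, sparse near 1
print(po2_levels(4))  # extra levels appear only below 2^-6; the gap near 1.0 is unchanged
```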
Symmetric quantization has a symmetric number of quantization levels around zero. It has less hardware overhead than asymmetric quantization because the zero point (i.e., the quantization-range offset) is 0. Although some of the terms involving a non-zero zero point can be pre-computed and added to the bias, one term must still be computed on-the-fly with respect to the input [18].
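To see where that unavoidable term comes from, consider the standard expansion of a dot product under asymmetric quantization (notation ours, not taken from [18]): with $w_i \approx \Delta_w (w_{q,i} - z_w)$ and $x_i \approx \Delta_x (x_{q,i} - z_x)$,

$$\sum_{i=1}^{N} w_i x_i \approx \Delta_w \Delta_x \left( \sum_i w_{q,i}\, x_{q,i} \;-\; z_x \sum_i w_{q,i} \;+\; N z_w z_x \;-\; z_w \sum_i x_{q,i} \right).$$

The second and third terms depend only on the weights and zero points, so they can be pre-computed offline and folded into the bias, whereas the last term, $z_w \sum_i x_{q,i}$, depends on the activations and must be accumulated at inference time. With symmetric quantization ($z_w = z_x = 0$) all three extra terms vanish.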
Per-tensor (per-layer) quantization has a single scaling factor (and zero point) for the entire tensor. It is more hardware-efficient than per-channel quantization, which uses a different scaling factor for each output channel of the tensor, because all the accumulators for the entire tensor operate with the same scaling factor.
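The sketch below contrasts the two granularities using a simple max-abs scale heuristic; this heuristic, the function names, and the tensor layout are ours for illustration only and are not how the MSQE- or gradient-based quantizers in this paper choose their scaling factors.

```python
import numpy as np

def per_tensor_scale(w, bw=8):
    """One scaling factor shared by the whole tensor (max-abs heuristic)."""
    q_max = 2 ** (bw - 1) - 1
    return np.max(np.abs(w)) / q_max

def per_channel_scales(w, bw=8, channel_axis=-1):
    """One scaling factor per output channel: finer-grained, but every
    accumulator must be rescaled with its own channel's factor."""
    q_max = 2 ** (bw - 1) - 1
    reduce_axes = tuple(i for i in range(w.ndim) if i != channel_axis % w.ndim)
    return np.max(np.abs(w), axis=reduce_axes) / q_max

w = np.random.randn(3, 3, 16, 32)    # e.g. a conv kernel with 32 output channels (last axis)
print(per_tensor_scale(w))           # scalar shared by all accumulators
print(per_channel_scales(w).shape)   # (32,): one scale per output channel
```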
² It is similar to the LSQ quantizer in [8], with the additional PO2 scaling-factor constraint.
³ It is equivalent to learning the PO2 scaling factor for fixed-bit-width, uniform, symmetric quantization.
⁴ https://github.com/google/qkeras/tree/master/experimental/quantizers
⁵ It models uniform, symmetric quantization, where $q_{\min} = -2^{bw-1} + 1$, $q_{\max} = 2^{bw-1} - 1$ for signed input and $q_{\min} = 0$, $q_{\max} = 2^{bw} - 1$ for unsigned input (the $+1$ in the signed $q_{\min}$ matches the number of quantization levels to that of $q_{\max}$, and $bw$ is the bit width of the quantizer).
⁶ Having a larger bit-width only increases the number of quantization levels concentrated around 0.