A Closer Look at Hardware-Friendly
Weight Quantization
Sungmin Bae1, Piotr Zielinski1, and Satrajit Chatterjee*
1Google Research
1{smbae,zielinski}@google.com
*satrajit@gmail.com
Abstract
Quantizing a Deep Neural Network (DNN) model to run on a custom accelerator with an
efficient fixed-point hardware implementation requires satisfying many stringent
hardware-friendly quantization constraints during training. We evaluate the two main
classes of hardware-friendly quantization methods in the context of weight quantization:
the traditional Mean Squared Quantization Error (MSQE)-based methods and the more recent
gradient-based methods. We study the two methods on MobileNetV1 and MobileNetV2 using
multiple empirical metrics to identify the sources of performance differences between the
two classes, namely sensitivity to outliers and convergence instability of the quantizer
scaling factor. Using those insights, we propose several techniques to improve the
performance of both quantization methods: they fix the optimization instability issues
present in the MSQE-based methods during quantization of the MobileNet models, and they
improve the validation performance of the gradient-based methods by 4.0% and 3.3% on
ImageNet for MobileNetV1 and MobileNetV2, respectively.
1 Introduction
Recently there has been a rapid increase in demand for deploying Deep Neural Network
(DNN)-based inference tasks on battery-powered edge devices for privacy and latency
reasons. A hardware-based DNN accelerator designed for a specific application has been
shown to be a promising approach to processing such computationally challenging tasks
under a tight power budget, being much faster and significantly more power-efficient than
a general-purpose DNN accelerator [27]. In order to compress a DNN model enough to fit it
onto such an accelerator, quantization methods need to deal with more stringent
constraints [9] such as uniform and symmetric quantization, power-of-2 (PO2) scaling
factors, and batch normalization layer folding [13]. The goal is to quantize model
weights down to 4 bits or lower, but even with quantization-aware training, doing so
without incurring a significant accuracy loss remains a challenging research problem for
the community.
There are two main classes of quantizer optimization methods for quantization-aware
training: the Mean Squared Quantization Error (MSQE)-based methods [1, 7, 24, 29, 11] and
the more recent gradient-based methods [15, 16, 28, 5, 8, 23]. The MSQE-based methods
iteratively search for optimal quantizer parameters at each training step by minimizing
the MSQE between the full-precision values and their quantized low-precision
counterparts. There are also MSQE-based methods that formulate the optimization problem
to directly minimize the loss, as in [11, 10]. QKeras [7], a popular framework for
hardware-friendly quantization-aware training, has extended the MSQE-based methods with
the hardware-friendly constraints. The gradient-based methods optimize the quantization
parameters directly during the gradient descent steps, so they can be jointly learned
with the other network parameters to minimize the loss function. [5] developed a method
to learn a clipping parameter for the activation function. [8] proposed a learnable
scaling factor and a means to estimate and scale its gradients so that it can be trained
jointly with the other network parameters. [15] suggested a gradient-based
hardware-friendly quantizer² which optimizes a power-of-2 (PO2) constrained clipping
range³.

*This work was done when the author worked at Google, Mountain View, CA.
In this paper we study MSQE-based and gradient-based methods for weight quantization on
commonly used vision backbones, with the objective of understanding the underlying causes
of the performance difference between the two classes of methods. Based on our insights,
we propose improvements to the training process that increase the stability of training
and the performance of the final models, achieving validation accuracies of 66.9±0.17% on
MobileNetV1 and 67.5±0.01% on MobileNetV2. To the best of our knowledge, no prior work in
the literature satisfies all of our hardware-friendliness constraints. The code for the
quantizers will be available on GitHub⁴.
The rest of the paper is organized as follows. Section 2 provides background on hardware-friendly
quantization and describes the two main classes of hardware-friendly quantizers. In Section 3, we
analyze the effects of quantization during training of the quantizers, and we improve the performance
of both quantizers based on the findings of the analysis. Finally, experiments and conclusion can be
found in Section 4 and Section 5, respectively.
2 Background
In this section, we describe hardware-friendly quantization constraints for efficient custom hardware
implementation and the two main classes of hardware-friendly quantization methods used in our
study.
2.1 Quantization for Deep Neural Networks
Quantization in DNNs is the process of carefully searching for quantization parameters,
such as the scaling factor (∆) in (1) below⁵, that map model parameters (e.g., weights,
biases) and/or activations from a large set to approximate values in a finite, smaller
set using a function Q: ℝ → C with minimum accuracy degradation, as shown in (1).

$$w_q = Q(w, \Delta) = \Delta \cdot \mathrm{clip}\!\left(\mathrm{round}\!\left(\tfrac{w}{\Delta}\right);\ q_{min}, q_{max}\right), \quad \text{where } w \in \mathbb{R},\ w_q \in C \tag{1}$$
2.2 Hardware-Friendly Quantizer Constraints
Uniform quantization, which has uniformly spaced quantization levels, is often preferred
due to its simplicity in hardware. Non-uniform quantization techniques such as
K-means-clustering-based methods require a code-book look-up table and its associated
hardware overhead, and logarithm-based methods can suffer from the so-called rigid
resolution problem⁶ [20]. Adding multiple logarithmic quantization terms has been
suggested to solve this problem [20]; however, it incurs additional hardware overhead.
Symmetric quantization has a symmetric number of quantization levels around zero. It has
less hardware overhead than asymmetric quantization, since the zero point (i.e., the
quantization range offset) is 0. Although some of the terms related to a non-zero zero
point can be pre-computed and added to the bias, there is still a term that must
unavoidably be computed on-the-fly with respect to the input [18].
Per-tensor (per-layer) quantization has a single scaling factor (and zero point) for the
entire tensor. It has better hardware efficiency than per-channel quantization, which has
a different scaling factor for each output channel of the tensor, since all the
accumulators for the entire tensor operate with the same scaling factor.
²It is similar to the LSQ quantizer in [8] with the additional PO2 scaling factor constraint.
³It is equivalent to learning the PO2 scaling factor for fixed-bit-width, uniform, and symmetric quantization.
⁴https://github.com/google/qkeras/tree/master/experimental/quantizers
⁵It models uniform and symmetric quantization, where q_min = −2^(bw−1) + 1, q_max = 2^(bw−1) − 1 for signed input and q_min = 0, q_max = 2^bw − 1 for unsigned input (the +1 in the signed q_min is to match the number of quantization levels of q_max, and bw is the bit width of the quantizer).
⁶Having a larger bit-width only increases the quantization levels concentrated around 0.
Batch Normalization (BN) layer folding [18] reduces the two-step computation of a
batch-normalized convolution to a one-step convolution by merging the batch normalization
operation into the weight and bias of the adjacent linear layer [13].
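A minimal sketch of the folding step, assuming a Keras-style Conv2D weight layout of
shape (kh, kw, c_in, c_out) and per-output-channel BN parameters; the function and
variable names are ours.

import numpy as np

def fold_batch_norm(conv_weight, conv_bias, gamma, beta, moving_mean, moving_var, eps=1e-3):
    """Fold a BatchNorm layer into the weight and bias of the preceding conv."""
    scale = gamma / np.sqrt(moving_var + eps)   # one factor per output channel
    folded_weight = conv_weight * scale         # broadcasts over the last (c_out) axis
    folded_bias = (conv_bias - moving_mean) * scale + beta
    return folded_weight, folded_bias

After folding, the per-channel BN scales are absorbed into the weights, which is why the
per-channel dynamic ranges of the folded weights can differ so much (see Section 3.1.2).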
Power-of-2 (PO2) scaling factors improve hardware efficiency by constraining the scaling
factor to a power of 2 (that is, to 2^s where s is an integer), as the scaling operation
then only requires a simple bit-shift instead of a fixed-point multiplication. However,
this often results in accuracy degradation since the expressiveness of the scaling factor
is limited.
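As a toy illustration (ours) of why this is hardware-friendly, rescaling an integer
accumulator by 2^s reduces to a bit-shift:

def rescale_by_po2(acc: int, s: int) -> int:
    """Multiply an integer accumulator by 2**s using only shifts (no multiplier)."""
    return acc << s if s >= 0 else acc >> (-s)

assert rescale_by_po2(13, 3) == 13 * 8
assert rescale_by_po2(104, -3) == 13   # 104 / 8, i.e. a right shift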
2.3 MSQE-based Hardware-Friendly Quantizer
An MSQE-based hardware-friendly quantizer directly optimizes the PO2 scaling factor to
minimize the mean squared quantization error [7]; its basic form is shown in (2).

$$\Delta_{PO2} = \arg\min_{\Delta_{PO2}} \lVert Q(w, \Delta_{PO2}) - w \rVert^{2}, \quad \text{where } \Delta_{PO2} = PO2(\Delta) = 2^{\mathrm{round}(\log_2(\Delta))} \tag{2}$$
Algorithm 1 gives pseudo-code for the MSQE optimization. It reduces the quantization
problem to a linear regression problem with a closed-form solution for the MSQE-optimal
scaling factor, and it iteratively performs least-squares fits at each training step
[29]. The initial scaling factor ∆_init can be either the previously found scaling factor
or a rough estimate based on the input distribution. We call this quantizer MSQE and use
it as the baseline MSQE-based hardware-friendly quantizer in our study (specifically, the
implementation from QKeras [2], i.e., quantized_bits).
Algorithm 1 MSQE-based Optimization
Input: w, ∆_init, N_iters
Output: ∆_PO2
procedure:
    q ← Q(w, ∆_init) / ∆_init
    N ← N_iters
    while N ≠ 0 do
        ∆ ← qᵀw / qᵀq              ▷ Find an unconstrained optimal ∆ minimizing the MSQE.
        ∆_PO2 ← PO2(∆)             ▷ Constrain ∆ to be a power-of-2 value.
        q ← Q(w, ∆_PO2) / ∆_PO2    ▷ Quantize w with the newly found ∆_PO2.
        N ← N − 1
    end while
end procedure
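A minimal NumPy transcription of Algorithm 1 might look as follows; the function names
and the 4-bit clipping range are ours, and the actual QKeras quantized_bits
implementation adds further machinery.

import numpy as np

def po2(delta):
    """Round a scaling factor to the nearest power of two."""
    return 2.0 ** np.round(np.log2(delta))

def msqe_po2_scale(w, delta_init, n_iters=5, q_min=-7, q_max=7):
    """Algorithm 1: alternate a least-squares fit of delta with PO2 rounding."""
    w = np.ravel(w)
    delta_po2 = po2(delta_init)
    q = np.clip(np.round(w / delta_init), q_min, q_max)   # q = Q(w, delta_init) / delta_init
    for _ in range(n_iters):
        delta = np.dot(q, w) / np.dot(q, q)               # unconstrained MSQE-optimal delta
        delta_po2 = po2(delta)                            # constrain delta to a power of two
        q = np.clip(np.round(w / delta_po2), q_min, q_max)
    return delta_po2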
2.4 Gradient-based Hardware-Friendly Quantizer
A gradient-based hardware-friendly quantizer learns the PO2 scaling factor from
gradients, seeking to optimize the scaling factor to directly minimize the task loss
instead of locally optimizing the MSQE objective at each training iteration [8, 15, 4].
However, training ∆ (in (2)) from the gradients can cause a numerical problem, as the
input to the log2 function can become negative after a gradient update. This can be
resolved by learning the scaling factor directly in the log domain (∆_log2), without the
log2 function [15]⁷, as shown in (3)⁸.

$$\Delta_{PO2} = PO2(\Delta_{log2}) = 2^{\mathrm{round}(\Delta_{log2})} \tag{3}$$
The local gradient of ∆_log2 depends on the magnitude of the scaling factor (2^∆_log2)
(see (4)⁹), which can cause training instability (too large a gradient) or slow
convergence (too small a gradient) depending on the scaling factor value [15]. To lessen
this problem, [15] suggests normalizing the gradient with its moving-average variance and
clipping any large gradients for SGD training. However, it also reports that using the
Adam optimizer [17] without the normalization works well.

$$\frac{\partial w_q}{\partial \Delta_{log2}} = \frac{\partial w_q}{\partial \Delta_{PO2}} \cdot \frac{\partial \Delta_{PO2}}{\partial \Delta_{log2}} := \frac{\partial w_q}{\partial \Delta_{PO2}} \cdot 2^{\Delta_{log2}} \cdot \ln 2 \tag{4}$$

Figure 1: MSQE and MSQE(finetune) were unstable and diverged due to outliers, while GRAD
was stable, during training of quantized MobileNetV1 (4W4A) from scratch on ImageNet.

⁷A softplus function can also be used to train ∆ directly and avoid the numerical problem [23].
⁸[15] suggests using a ceil function instead of a round function to bias the scaling factor in the direction of having more elements within the quantization range; however, round is used here for simplicity.
⁹It uses a straight-through estimator (STE) [3] in order to back-propagate through the round function.
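A simplified TensorFlow sketch of such a gradient-based quantizer is shown below; it is
ours, not the exact quantizer of [15]: it learns ∆_log2 as a trainable variable, applies
the straight-through estimator of footnote 9 to both round operations, and omits the
gradient normalization discussed above.

import tensorflow as tf

def round_ste(x):
    """Round on the forward pass; pass the gradient straight through (STE)."""
    return x + tf.stop_gradient(tf.round(x) - x)

class GradPO2Quantizer(tf.Module):
    """Fake-quantizer that learns the scaling factor in the log2 domain, as in (3)."""

    def __init__(self, delta_log2_init, bit_width=4):
        super().__init__()
        self.delta_log2 = tf.Variable(float(delta_log2_init), dtype=tf.float32)
        self.q_min = -2.0 ** (bit_width - 1) + 1.0
        self.q_max = 2.0 ** (bit_width - 1) - 1.0

    def __call__(self, w):
        delta_po2 = 2.0 ** round_ste(self.delta_log2)   # PO2 scaling factor
        q = tf.clip_by_value(round_ste(w / delta_po2), self.q_min, self.q_max)
        return delta_po2 * q                            # fake-quantized weights

Because delta_po2 is computed as 2 raised to delta_log2 (with the STE bypassing the
round), its gradient with respect to ∆_log2 is exactly the 2^∆_log2 · ln 2 factor in (4),
so ∆_log2 can be trained jointly with the model weights from the task loss.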
The scaling factor can be initialized in several ways based on the input values: from the
dynamic range of the inputs, from the input distribution [8, 15], or from the
MSQE-optimal scaling factor [4]. We use a quantizer based on [15] as the baseline
gradient-based hardware-friendly quantizer and call it GRAD.
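For illustration, one simple dynamic-range-based initialization (an assumption on our
part, not necessarily the exact scheme of [8, 15, 4]) is:

import numpy as np

def init_delta_log2(w, bit_width=4):
    """Initialize the log2-domain scaling factor from the dynamic range of w."""
    q_max = 2 ** (bit_width - 1) - 1
    # Pick delta so that max|w| lands near the largest quantization level, then
    # move to the log2 domain used by the gradient-based quantizer.
    delta = max(float(np.max(np.abs(w))), 1e-8) / q_max
    return float(np.log2(delta))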
3 Improving the Hardware-Friendly Quantizers
We analyze the advantages and disadvantages of both quantizer types and improve their performance
based on our analysis.
3.1 Improving MSQE
3.1.1 Sub-optimality in the Optimization of MSQE
Algorithm 1 can converge to a sub-optimal solution, as it makes a linear approximation to
the original discrete, non-linear problem. The sub-optimality can be reduced by using a
line search algorithm, which simply searches for an MSQE-optimal PO2 scaling factor in
the vicinity of the solution from Algorithm 1 (we call this method finetune in this
paper; see Appendix A for an example and details of the method).
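A sketch of such a line search is shown below; it is ours, the search radius is a
hypothetical parameter, and the exact procedure is the one described in Appendix A of the
paper.

import numpy as np

def finetune_po2_scale(w, delta_po2, q_min=-7, q_max=7, search_radius=2):
    """Pick the neighboring PO2 scaling factor with the lowest MSQE."""
    w = np.ravel(w)
    base_exp = int(np.round(np.log2(delta_po2)))
    best_delta, best_msqe = delta_po2, np.inf
    for exp in range(base_exp - search_radius, base_exp + search_radius + 1):
        delta = 2.0 ** exp
        w_q = delta * np.clip(np.round(w / delta), q_min, q_max)  # Q(w, delta)
        msqe = np.mean((w_q - w) ** 2)
        if msqe < best_msqe:
            best_delta, best_msqe = delta, msqe
    return best_delta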
3.1.2 A Problem with Outliers in MSQE
A quantizer should make a good trade-off between rounding and clipping errors in order to
keep the precision of the majority of the input values in the presence of outliers.
However, Algorithm 1 quantizes the input values uniformly between the largest and
smallest values, so it is highly sensitive to outliers due to the squared error term in
the optimization objective. This can be especially problematic for per-tensor
quantization with batch normalization (BN) folding, where the difference in the dynamic
range of the weights between channels can be large (see Figure 12 in the Appendix for
details). Figure 1 shows that MSQE(baseline) and MSQE(finetune) were unstable and
diverged, while GRAD was stable, when training quantized MobileNetV1 on ImageNet from
scratch. The per-layer metrics in Figure 1 show that the scaling factor of MSQE(finetune)
followed the rapidly increasing dynamic range of the input (weight) values, causing about
80% of them to round to zero, which led to training divergence. GRAD, on the other hand,
had a well-controlled scaling factor in comparison.
To reduce the sensitivity to outliers, a weighting factor can be used to suppress them
during the MSQE optimization at each training step, by performing a weighted
least-squares fit in Algorithm 1 and a weighted line search (see Appendix B for the
method in detail). (5) shows a simple outlier mask (M_outlier) using a percentile-based
threshold (σ_outlier) (assuming the input (w) has a Gaussian distribution) to suppress
outliers, where each element in the outlier mask gets either 1 or 0 depending on whether
the corresponding input value falls within the threshold.
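Since (5) itself falls outside this excerpt, the following is only a sketch of the idea
as we read it: values beyond a Gaussian-percentile threshold are masked out, and the
resulting 0/1 mask weights the least-squares fit of Algorithm 1. The function names, the
threshold rule, and the 99.9% (≈3.29 sigma) choice are our assumptions.

import numpy as np

def outlier_mask(w, num_std=3.29):
    """0/1 mask: 1 for values inside the threshold, 0 for suspected outliers.

    Under a Gaussian assumption, 3.29 standard deviations corresponds to a
    two-sided 99.9th-percentile threshold.
    """
    w = np.ravel(w)
    sigma_outlier = num_std * np.std(w)                 # percentile-based threshold
    return (np.abs(w - np.mean(w)) <= sigma_outlier).astype(w.dtype)

def weighted_ls_delta(w, q, mask):
    """Weighted least-squares fit of delta that ignores masked-out outliers."""
    w, q, mask = np.ravel(w), np.ravel(q), np.ravel(mask)
    return np.dot(mask * q, w) / np.dot(mask * q, q)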