
other network parameters to minimize the loss function. [5] developed a method to learn a clipping parameter for the activation function. [8] proposed a learnable scaling factor together with a means of estimating and scaling its gradient so that it can be trained jointly with the other network parameters. [15] suggested a gradient-based, hardware-friendly quantizer² that optimizes a power-of-2 (PO2) constrained clipping range³.
In this paper we study MSQE- and gradient-based methods for weight quantization on commonly used vision backbones, with the objective of understanding the underlying causes of the performance difference between the two classes of methods. Based on our insights, we propose improvements to the training process that increase training stability and the performance of the final models, achieving validation accuracies of 66.9±0.17% on MobileNetV1 and 67.5±0.01% on MobileNetV2. To the best of our knowledge, no prior work satisfies all of our hardware-friendliness constraints. The code for the quantizers will be available on GitHub⁴.
The rest of the paper is organized as follows. Section 2 provides background on hardware-friendly quantization and describes the two main classes of hardware-friendly quantizers. In Section 3, we analyze the effects of quantization during training for each quantizer and improve the performance of both quantizers based on the findings of this analysis. Finally, experiments and conclusions are presented in Section 4 and Section 5, respectively.
2 Background
In this section, we describe hardware-friendly quantization constraints for efficient custom hardware
implementation and the two main classes of hardware-friendly quantization methods used in our
study.
2.1 Quantization for Deep Neural Networks
Quantization of a DNN is the process of carefully searching for quantization parameters, such as the scaling factor ($\Delta$) in (1) below⁵, that map model parameters (e.g., weights, biases) and/or activations from a large set to approximate values in a finite, smaller set using a function $Q:\mathbb{R}\to\mathcal{C}$, with minimum accuracy degradation, as shown in (1).
$$w_q = Q(w, \Delta) = \Delta \cdot \mathrm{clip}\!\left(\mathrm{round}\!\left(\frac{w}{\Delta}\right);\ q_{\min}, q_{\max}\right), \quad \text{where } w \in \mathbb{R},\ w_q \in \mathcal{C} \tag{1}$$
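To make (1) concrete, the following is a minimal NumPy sketch of a uniform, symmetric quantizer; the function name and the example values of $\Delta$ and bit width are ours and purely illustrative, not the MSQE- or gradient-based quantizers studied in this paper.

```python
import numpy as np

def quantize(w, delta, bw, signed=True):
    """Uniform, symmetric quantizer Q(w, delta) as in Eq. (1)."""
    if signed:
        q_min, q_max = -2 ** (bw - 1) + 1, 2 ** (bw - 1) - 1
    else:
        q_min, q_max = 0, 2 ** bw - 1
    # Round to the nearest quantization level, clip to the valid range,
    # then rescale back to the original domain.
    return delta * np.clip(np.round(w / delta), q_min, q_max)

# Example: 4-bit signed quantization with delta = 0.25.
w = np.array([-0.83, -0.12, 0.05, 0.47, 1.90])
print(quantize(w, delta=0.25, bw=4))  # values snap to multiples of 0.25, clipped to [-1.75, 1.75]
```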
2.2 Hardware-Friendly Quantizer Constraints
Uniform quantization, which has uniformly spaced quantization levels, is often preferred due to its simplicity in hardware. Non-uniform quantization techniques such as K-means-clustering-based methods require a code-book look-up table and its associated hardware overhead, while logarithm-based methods suffer from the so-called rigid-resolution problem⁶ [20]. Adding multiple logarithmic quantization terms has been suggested to solve this problem [20]; however, it incurs hardware overhead.
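As a small illustration of the rigid-resolution issue (a sketch under our own simplifying assumptions, ignoring the sign bit and any scaling factor): the levels of a power-of-two (logarithmic) quantizer are listed below; increasing the bit width only adds levels ever closer to zero, so resolution near the largest magnitudes never improves.

```python
import numpy as np

def po2_levels(bw):
    """Non-negative levels of a bw-bit power-of-two quantizer (sign bit ignored)."""
    exponents = np.arange(-(2 ** bw) + 2, 1)      # e.g. bw=3 -> exponents -6..0
    return np.concatenate(([0.0], 2.0 ** exponents))

print(po2_levels(3))  # [0, 2^-6, ..., 0.25, 0.5, 1.0]: dense near 0, sparse near 1
print(po2_levels(4))  # extra levels appear only below 2^-6; the gap near 1.0 is unchanged
```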
Symmetric quantization has a symmetric number of quantization levels around zero. It has less hardware overhead than asymmetric quantization because the zero point (i.e., the quantization-range offset) is 0. Although some of the terms involving a non-zero zero point can be pre-computed and added to the bias, one term must still be computed on-the-fly with respect to the input [18].
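To see where that unavoidable term comes from, consider the standard expansion of a dot product under asymmetric quantization (notation ours, not taken from [18]): with $w_i \approx \Delta_w (w_{q,i} - z_w)$ and $x_i \approx \Delta_x (x_{q,i} - z_x)$,

$$\sum_{i=1}^{N} w_i x_i \approx \Delta_w \Delta_x \left( \sum_i w_{q,i}\, x_{q,i} \;-\; z_x \sum_i w_{q,i} \;+\; N z_w z_x \;-\; z_w \sum_i x_{q,i} \right).$$

The second and third terms depend only on the weights and zero points, so they can be pre-computed offline and folded into the bias, whereas the last term, $z_w \sum_i x_{q,i}$, depends on the activations and must be accumulated at inference time. With symmetric quantization ($z_w = z_x = 0$) all three extra terms vanish.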
Per-tensor (per-layer) quantization has a single scaling factor (and zero point) for the entire tensor. It is more hardware-efficient than per-channel quantization, which uses a different scaling factor for each output channel of the tensor, because all the accumulators for the entire tensor operate with the same scaling factor.
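The sketch below contrasts the two granularities using a simple max-abs scale heuristic; this heuristic, the function names, and the tensor layout are ours for illustration only and are not how the MSQE- or gradient-based quantizers in this paper choose their scaling factors.

```python
import numpy as np

def per_tensor_scale(w, bw=8):
    """One scaling factor shared by the whole tensor (max-abs heuristic)."""
    q_max = 2 ** (bw - 1) - 1
    return np.max(np.abs(w)) / q_max

def per_channel_scales(w, bw=8, channel_axis=-1):
    """One scaling factor per output channel: finer-grained, but every
    accumulator must be rescaled with its own channel's factor."""
    q_max = 2 ** (bw - 1) - 1
    reduce_axes = tuple(i for i in range(w.ndim) if i != channel_axis % w.ndim)
    return np.max(np.abs(w), axis=reduce_axes) / q_max

w = np.random.randn(3, 3, 16, 32)    # e.g. a conv kernel with 32 output channels (last axis)
print(per_tensor_scale(w))           # scalar shared by all accumulators
print(per_channel_scales(w).shape)   # (32,): one scale per output channel
```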
² It is similar to the LSQ quantizer in [8], with the additional PO2 scaling-factor constraint.
³ It is equivalent to learning the PO2 scaling factor for fixed-bit-width, uniform, symmetric quantization.
⁴ https://github.com/google/qkeras/tree/master/experimental/quantizers
⁵ It models uniform, symmetric quantization, where $q_{\min} = -2^{bw-1} + 1$, $q_{\max} = 2^{bw-1} - 1$ for signed input and $q_{\min} = 0$, $q_{\max} = 2^{bw} - 1$ for unsigned input (the $+1$ in the signed $q_{\min}$ matches the number of quantization levels to that of $q_{\max}$, and $bw$ is the bit width of the quantizer).
⁶ Having a larger bit-width only increases the number of quantization levels concentrated around 0.