
T) [17], Bifocal RNN-T [18], and Conformer [19, 20]. Our
results show that in all three architectures, GQ yields little to
no accuracy loss when compressing models to S8B or even
sub-5-bit (5-bit or lower). We also present performance op-
timization strategies from ablation studies on bit-allocation
and quantization frequency. Our contributions are as follows:
• We propose GQ, inspired by both FP- and BP-QAT ap-
proaches. GQ enables on-centroid weight aggregation
without augmented regularizers. Instead, it leverages
Softmax annealing to impose soft-to-hard quantization
on centroids from the µ-Law constrained space.
• GQ supports different quantization modes for a wide
range of granularity: different bit depths can be speci-
fied for different kernels/layers/modules.
• With GQ, we losslessly compress a lightweight streaming Conformer to sub-5-bit precision with more than 6× model size reduction. To the best of our knowledge, this is among the first sub-5-bit Conformer models for on-device ASR. Without accuracy degradation, our GQ-compressed 5-bit Bifocal RNN-T reduces the memory footprint by 30.73% and P90 user-perceived latency (UPL) by 31.30%.
We describe the problem in Sec. 2 and GQ in Sec. 3. The
experimental settings and results are detailed in Sec. 4. We
conclude in Sec. 5 with some final remarks.
2. PRELIMINARIES
2.1. Problem Formulation
Consider a general deep neural network architecture with $K$ layers, $\mathcal{F} = F_1 \circ \cdots \circ F_K$, mapping the input from $\mathbb{R}^{d_1}$ to the output in $\mathbb{R}^{d_{K+1}}$ as $\mathcal{F}: \mathbb{R}^{d_1} \mapsto \mathbb{R}^{d_{K+1}}$, where the input and output of an arbitrary $k$-th layer are related by $\mathbf{x}^{(k+1)} := F_k(\mathbf{x}^{(k)})$. Under supervised learning, the training data $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ and $\mathcal{Y} = \{\mathbf{y}_1, \ldots, \mathbf{y}_N\}$ are used to update the model weights $\mathbf{W} = [\mathbf{W}_1, \ldots, \mathbf{W}_K]$ of the $K$ layers in $\mathcal{F}$. Usually the optimization is over the training objective function
$$\mathcal{L}(\mathcal{X}, \mathcal{Y}, \mathcal{F}, \mathbf{W}) = \frac{1}{N} \sum_{i=1}^{N} \ell(\mathcal{F}(\mathbf{x}_i), \mathbf{y}_i) + \lambda R(\mathbf{W}),$$
where $i$ is the data index, $\ell$ is the main loss term measuring model accuracy, and $R(\mathbf{W})$ is the regularizer blended into the objective function via the coefficient $\lambda$.
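As a concrete reading of this objective, the following NumPy sketch evaluates it over a dataset of $N$ examples; the callables model, ell, and R are hypothetical placeholders for $\mathcal{F}$, the accuracy loss, and the regularizer (they are not part of this paper's method).

import numpy as np

def training_objective(model, weights, X, Y, ell, R, lam):
    """Evaluate L(X, Y, F, W) = (1/N) * sum_i ell(F(x_i), y_i) + lam * R(W)."""
    N = len(X)
    # Average per-example loss over the N training pairs.
    data_term = sum(ell(model(x, weights), y) for x, y in zip(X, Y)) / N
    # Regularization term blended in via the coefficient lam (lambda).
    return data_term + lam * R(weights)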
Network quantization aims at discretizing model weights. Scalar quantization converts each weight $w \in \mathbf{W}$ to a quantization centroid $z \in \mathbf{z}$, where $\mathbf{z} = [z_1, \ldots, z_m]$, so that the network is compressed to $\lceil \log_2 m \rceil$ bits. For S8B quantization, the centroids are a subset of the INT8 values.
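The scalar-quantization step itself is simple; below is a minimal NumPy sketch, assuming a centroid vector $\mathbf{z}$ is already given (this is only the assignment step, not the GQ training procedure).

import numpy as np

def scalar_quantize(weights, centroids):
    """Map every weight to its nearest centroid; with m centroids each
    weight can be stored as a ceil(log2(m))-bit index."""
    w = np.asarray(weights, dtype=np.float64).ravel()
    z = np.asarray(centroids, dtype=np.float64)
    # Index of the closest centroid for each weight.
    idx = np.argmin(np.abs(w[:, None] - z[None, :]), axis=1)
    bit_depth = int(np.ceil(np.log2(len(z))))
    return z[idx].reshape(np.shape(weights)), idx, bit_depth

# Example: m = 4 centroids gives a 2-bit representation.
q, codes, bits = scalar_quantize([0.07, -0.52, 0.9], [-1.0, -0.5, 0.0, 0.5])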
2.2. Related QAT Approaches
BP-QAT counters model weight continuity via regularization. For example, it introduces a weight regularizer $R(\mathbf{W})$ measuring the point-wise distance between each weight and the $m$ quantization centroids in the centroid vector $\mathbf{z} = [z_1, \ldots, z_m]$. Note that the quantization regularizer $R(\mathbf{W})$ in the loss function must be compatible with gradient descent. Consequently, $R(\mathbf{W})$ cannot enforce each weight to be replaced by its closest centroid in $\mathbf{z}$ as $w = \arg\min_{z_i \in \mathbf{z}} \|w - z_i\|$ for $w \in \mathbf{W}$, because the min operator is not differentiable. Recent BP-QAT methods instead force weights to approach the centroids in $\mathbf{z}$ using $R(\mathbf{W}) = \sum_{w \in \mathbf{W}} D(w, \mathbf{z})$, where the differentiable dissimilarity function $D$ is based on a cosine function in [1, 14].
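The precise regularizer of [1, 14] is not reproduced here; as a hedged illustration, one differentiable cosine-based penalty that vanishes exactly on a uniform centroid grid (an assumed grid spacing, not necessarily the centroid set used in those works) could look as follows.

import numpy as np

def cosine_quantization_penalty(weights, step):
    """Smooth penalty that is zero exactly when every weight equals a
    multiple of `step`, i.e., lies on an assumed uniform centroid grid.
    Unlike the argmin assignment above, it is differentiable everywhere."""
    w = np.asarray(weights, dtype=np.float64)
    return float(np.sum(1.0 - np.cos(2.0 * np.pi * w / step)))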
In contrast, FP-QAT can be regularizer free [11, 21]. Typically, a "fake quantizer" or an equivalent operation is applied during training to hard-quantize weights to a specific range and bit depth; at runtime, the model is then converted to INT8 format via TFLite [22]. The study [11] uses native quantization operators with which, during training, the weights are quantized and then converted to the integer type for model deployment. However, FP-QAT is essentially hard compression recurring throughout training, which severely degrades performance when applied to S8B quantization. Consequently, finetuning is usually needed, which prolongs the model training time [23, 24].
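As a generic sketch of such a "fake quantizer" (uniform, range-based; not the specific operator of [11] or of TFLite), the forward pass might round weights to a b-bit grid and immediately dequantize them:

import numpy as np

def fake_quantize(weights, num_bits=8):
    """Round weights to 2**num_bits uniform levels over their observed
    range and map them back to float; the integer codes are what a
    deployed model would store."""
    w = np.asarray(weights, dtype=np.float64)
    w_min, w_max = w.min(), w.max()
    levels = 2 ** num_bits - 1
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    codes = np.round((w - w_min) / scale)   # integers in [0, levels]
    return codes * scale + w_min            # dequantized float weights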
Both FP-QAT and BP-QAT require specifying appropriate quantization centroids before model training. While the centroids for INT8 model compression are pre-defined, for S8B quantization the optimal set of centroids is usually kernel- or layer-specific. For models such as Conformer [16], which typically contain hundreds of kernels, current S8B QAT methods become less tractable.
In this work, we combine the merits of both FP- and BP-QAT and propose the General Quantizer (GQ), which navigates weights to quantization centroids without introducing augmented regularizers, relying on feedforward-only operators instead. Our work is inspired by a continuous relaxation of quantization [25], also used for speech representation learning [26, 27, 28, 29, 30, 31, 32], and by the µ-Law algorithm for 8-bit pulse-code modulation (PCM) in digital telecommunication [33].
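For reference, standard µ-Law companding in PCM telephony [33] maps a signal $x \in [-1, 1]$ to $\operatorname{sgn}(x)\,\ln(1 + \mu|x|)/\ln(1 + \mu)$, allocating finer resolution near zero; how GQ uses it to constrain the centroid space is described in Sec. 3. A direct transcription of the standard formula (µ = 255 for 8-bit PCM) is shown below.

import numpy as np

def mu_law_compress(x, mu=255.0):
    """Standard mu-law companding: [-1, 1] -> [-1, 1], finer near zero."""
    x = np.clip(np.asarray(x, dtype=np.float64), -1.0, 1.0)
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_expand(y, mu=255.0):
    """Inverse of mu_law_compress."""
    y = np.asarray(y, dtype=np.float64)
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu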
3. METHODS
3.1. Centroid Selection via Softmax-Based Dissimilarity
Matrices
For any weight value $w_i \in \mathbf{w}$, where $|\mathbf{w}| = n$, and the quantization centroid vector $\mathbf{z} = [z_1, \ldots, z_m]$, we define the point-wise dissimilarity matrix in Eq. (1):
$$\mathbf{A}_{\text{soft}} =
\begin{bmatrix}
a_{11} & \cdots & a_{1m} \\
\vdots & \ddots & \vdots \\
a_{n1} & \cdots & a_{nm}
\end{bmatrix}, \quad (1)$$
where $a_{ij}$ is the probability of representing $w_i$ by $z_j$. Each row $\mathbf{A}_{\text{soft}}[i\,\cdot]$ sums to $1$, with the largest probability