SUB-8-BIT QUANTIZATION FOR ON-DEVICE SPEECH RECOGNITION:
A REGULARIZATION-FREE APPROACH
Kai Zhen, Martin Radfar, Hieu Nguyen, Grant P. Strimel, Nathan Susanj, Athanasios Mouchtaris
Amazon Alexa AI
ABSTRACT
For on-device automatic speech recognition (ASR), quantization aware training (QAT) is ubiquitous for achieving the trade-off between model predictive performance and efficiency. A major drawback of existing QAT methods is that the quantization centroids have to be predetermined and fixed. To overcome this limitation, we introduce a regularization-free, "soft-to-hard" compression mechanism with self-adjustable centroids in a µ-Law constrained space, resulting in a simpler yet more versatile quantization scheme, called General Quantizer (GQ). We apply GQ to ASR tasks using Recurrent Neural Network Transducer (RNN-T) and Conformer architectures on both LibriSpeech and de-identified far-field datasets. Without accuracy degradation, GQ can compress both RNN-T and Conformer into sub-8-bit, and for some RNN-T layers, to 1-bit for fast and accurate inference. Via physical device benchmarking, we observe a 30.73% memory footprint saving and a 31.75% user-perceived latency reduction compared to 8-bit QAT.
Index Terms— On-device speech recognition, quantization aware training, RNN-T, Conformer, model efficiency
1. INTRODUCTION
Improving the efficiency of neural automatic speech recognition (ASR) models via quantization is critical for on-device deployment scenarios [1, 2]. For neural network accelerator (NNA) embedded devices, where memory and bandwidth are at a premium, quantization can reduce the footprint and lower the bandwidth consumption of ASR execution, which not only affords faster model inference but also facilitates model deployment to various portable devices where a stable network connection is limited.
Existing quantization methods can be categorized as post-training quantization (PTQ) or in-training / quantization aware training (QAT). PTQ is applied after model training is complete by compressing models into 8-bit representations and is relatively well supported by various libraries [3, 4, 5, 6, 7, 8], such as TensorFlow Lite [9] and AIMET [10] for on-device deployment. However, almost no existing PTQ toolkit supports customized quantization configurations to compress machine learning (ML) layers and kernels into sub-8-bit (S8B) regimes [11]. Moreover, a performance drop is inevitable because the model is unaware of the loss of precision when it is quantized at test time. In contrast, QAT performs bit-depth reduction of model weights (for example, from 32-bit floating point to 8-bit integer) during training, which usually yields superior performance over PTQ [12, 13]. The QAT mechanism can operate in the forward pass (FP-QAT) or the backward pass (BP-QAT), with the difference being whether regularization is used in the loss function. FP-QAT [11] quantizes the model weights during forward propagation to pre-defined quantization centroids. BP-QAT [1, 14, 15] relies on customized regularizers to gradually force weights toward those quantization centroids (i.e., "soft quantization" via gradients) during training, before hard compression is applied in the late training phase. Because the customized regularizers push model weights closer, at each training step, to where they will be quantized at runtime, the predictive performance is often well preserved. Therefore, the focus of this work is on QAT.
Under both FP- and BP-QAT, it is essential that the quantization centroids be defined and specified before model training. The drawback is low feasibility when quantizing models in the S8B regime, because one needs to select the proper quantization centroids and their configurations for each kernel in each layer to ensure minimal runtime performance degradation. Consequently, applying existing QAT methods to Conformer [16], which usually contains hundreds of kernels, becomes quite challenging.
In this work, we propose General Quantizer (GQ), a regularization-free, model-agnostic quantization scheme with a mixed flavor of both FP- and BP-QAT. GQ is "general" in that it does not augment the objective function with any regularizer as in BP-QAT but determines the appropriate quantization centroids during model training for a given bit depth, and it can simply be applied in a plug-and-play manner to an arbitrary ASR model. Unlike FP-QAT, GQ features soft-to-hard quantization during training, allowing model weights to hop across adjacent partitions more easily. Under GQ, quantization centroids are self-adjustable but constrained to a µ-Law companded space. As a proof of concept, we adopt the ASR task and conduct experiments on both the LibriSpeech and de-identified far-field datasets to evaluate GQ on three major end-to-end ASR architectures, namely the conventional Recurrent Neural Network Transducer (RNN-T) [17], Bifocal RNN-T [18], and Conformer [19, 20]. Our results show that in all three architectures, GQ yields little to no accuracy loss when compressing models to S8B or even sub-5-bit (5-bit or lower). We also present performance optimization strategies from ablation studies on bit allocation and quantization frequency. Our contributions are as follows:
• We propose GQ, inspired by both FP- and BP-QAT approaches. GQ enables on-centroid weight aggregation without augmented regularizers; instead, it leverages Softmax annealing to impose soft-to-hard quantization on centroids from the µ-Law constrained space.
• GQ supports different quantization modes for a wide range of granularity: different bit depths can be specified for different kernels/layers/modules.
• With GQ, we losslessly compress a lightweight streaming Conformer into sub-5-bit with more than 6× model size reduction. To our best knowledge, this is among the first sub-5-bit Conformer models for on-device ASR. Without accuracy degradation, our GQ-compressed 5-bit Bifocal RNN-T reduces the memory footprint by 30.73% and P90 user-perceived latency (UPL) by 31.30%.
We describe the problem in Sec. 2 and GQ in Sec. 3. The
experimental settings and results are detailed in Sec. 4. We
conclude in Sec. 5 with some final remarks.
2. PRELIMINARIES
2.1. Problem Formulation
Consider a general deep neural network architecture with $K$ layers, $F = F_1 \circ \cdots \circ F_K$, mapping the input from $\mathbb{R}^{d_1}$ to the output in $\mathbb{R}^{d_{K+1}}$, i.e., $F: \mathbb{R}^{d_1} \mapsto \mathbb{R}^{d_{K+1}}$, where the input and output of an arbitrary $k$-th layer are related by $x^{(k+1)} := F_k(x^{(k)})$. Under supervised learning, the training data $X = \{x_1, \ldots, x_N\}$ and $Y = \{y_1, \ldots, y_N\}$ are used to update the model weights $W = [W_1, \ldots, W_K]$ of the $K$ layers in $F$. Usually the optimization is over the training objective function
$$L(X, Y, F, W) = \frac{1}{N} \sum_{i=1}^{N} \ell(F(x_i), y_i) + \lambda R(W),$$
where $i$ is the data batch index, $\ell$ is the main loss term measuring model accuracy, and $R(W)$ is the regularizer blended into the objective function via a coefficient $\lambda$.
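For concreteness, here is a minimal NumPy sketch of such a regularized objective; the model, loss, and helper names are illustrative stand-ins rather than anything from the paper.

```python
import numpy as np

def objective(model_fn, weights, X, Y, regularizer, lam):
    """L(X, Y, F, W) = (1/N) * sum_i loss(F(x_i), y_i) + lam * R(W)."""
    preds = np.array([model_fn(x, weights) for x in X])
    data_loss = np.mean((preds - Y) ** 2)          # example loss: mean squared error
    return data_loss + lam * regularizer(weights)

# Toy usage: a single linear layer F(x) = W1 @ x with an L2 regularizer R(W).
rng = np.random.default_rng(0)
W = [rng.standard_normal((1, 4))]                  # one layer's weights
X = rng.standard_normal((8, 4))                    # N = 8 training inputs
Y = rng.standard_normal((8, 1))                    # matching targets
model_fn = lambda x, w: w[0] @ x                   # F(x)
l2 = lambda w: sum(float(np.sum(wk ** 2)) for wk in w)
print(objective(model_fn, W, X, Y, l2, lam=1e-3))
```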
Network quantization aims at discretizing model weights. For scalar quantization, each weight $w \in W$ is converted to a quantization centroid $z \in \mathbf{z}$, where $\mathbf{z} = [z_1, \ldots, z_m]$, so that the network is compressed into $\lceil \log_2 m \rceil$-bit. For S8B quantization, the centroids form a subset of the INT8 values.
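The following sketch (a hypothetical helper, not the paper's GQ code) illustrates this generic scalar-quantization step: each weight is snapped to its nearest centroid and the implied bit depth is $\lceil \log_2 m \rceil$.

```python
import numpy as np

def quantize_to_centroids(w, z):
    """Replace each weight with its nearest centroid; bit depth is ceil(log2(len(z)))."""
    w = np.asarray(w, dtype=np.float32)
    z = np.asarray(z, dtype=np.float32)
    idx = np.argmin(np.abs(w[:, None] - z[None, :]), axis=1)  # nearest-centroid index per weight
    bit_depth = int(np.ceil(np.log2(len(z))))
    return z[idx], idx, bit_depth

weights = np.array([-0.83, -0.12, 0.05, 0.47, 0.91])
centroids = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])             # m = 5 centroids -> 3-bit storage
q, codes, bits = quantize_to_centroids(weights, centroids)
print(q, codes, bits)   # quantized weights, integer codes, 3
```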
2.2. Related QAT Approaches
BP-QAT counters model weight continuity via regularization. For example, it introduces a regularizer on the model weights, $R(W)$, measuring the point-wise distance between each weight and the $m$ quantization centroids in the centroid vector $\mathbf{z} = [z_1, \ldots, z_m]$. Note that the quantization regularizer in the loss function, $R(W)$, must be compatible with gradient descent. Consequently, $R(W)$ cannot enforce each weight to be replaced by its closest centroid in $\mathbf{z}$ as $w = \arg\min_{i} \|w - z_i\|$, for $w \in W$ and $z_i \in \mathbf{z}$, because the $\min$ operator is not differentiable. Recent BP-QAT methods force weights to approach the centroids in $\mathbf{z}$ using $R(W) = \sum_{w \in W} D(w, \mathbf{z})$, where the differentiable dissimilarity function $D$ is based on a cosine function in [1, 14].
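As a hedged illustration of such a regularizer, the sketch below computes $R(W) = \sum_{w} D(w, \mathbf{z})$ with $D$ chosen as a smooth soft-min over squared distances; this is a differentiable stand-in, not the exact cosine-based dissimilarity of [1, 14].

```python
import numpy as np

def soft_quantization_regularizer(weights, z, temperature=0.05):
    """R(W) = sum_w D(w, z), with D a smooth surrogate for the distance to the
    nearest centroid (soft-min over squared distances); differentiable, unlike argmin."""
    w = np.asarray(weights, dtype=np.float64).ravel()
    z = np.asarray(z, dtype=np.float64)
    sq_dist = (w[:, None] - z[None, :]) ** 2                 # (n, m) pairwise squared distances
    # Soft-min: -T * logsumexp(-d / T) approaches min(d) as the temperature T -> 0.
    softmin = -temperature * np.log(np.sum(np.exp(-sq_dist / temperature), axis=1))
    return float(np.sum(softmin))

centroids = np.linspace(-1.0, 1.0, 8)                        # 8 centroids -> 3-bit
print(soft_quantization_regularizer([0.03, -0.76, 0.51], centroids))
```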
In contrast, FP-QAT can be regularizer free [11, 21]. Usually, the process uses a "fake quantizer" or equivalent operations during training, hard quantizing weights to a specific range and bit depth, and then, at runtime, converts the model to INT8 format via TFLite [22]. The study in [11] uses native quantization operators with which, during training, the weights are quantized and then converted to the integer type for model deployment. However, FP-QAT is essentially hard compression recurring during training, which severely degrades performance when applied to S8B quantization. Consequently, fine-tuning is usually needed, which prolongs model training time [23, 24].
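To make the "fake quantizer" idea concrete, here is a minimal uniform quantize-dequantize sketch (illustrative names only); real FP-QAT frameworks additionally pass gradients through this operation, typically with a straight-through estimator.

```python
import numpy as np

def fake_quantize(w, num_bits=8):
    """Forward-pass 'fake quantization': round weights onto a uniform integer grid
    and map them back to float, so training sees the quantized values."""
    w = np.asarray(w, dtype=np.float32)
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1   # e.g. [-128, 127] for INT8
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), qmin, qmax)                   # integer codes
    return q * scale                                               # dequantized floats used in training

w = np.array([-0.92, -0.31, 0.004, 0.27, 0.88], dtype=np.float32)
print(fake_quantize(w, num_bits=4))   # coarser grid: sub-8-bit quantization is harsher
```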
Both FP-QAT and BP-QAT require specifying appropriate quantization centroids before model training. While the centroids for INT8 model compression are pre-defined, for S8B quantization the optimal set of centroids is usually kernel- or layer-specific. For models such as Conformer [16], which usually contain hundreds of kernels, current S8B QAT methods become less tractable.
In this work, we combine the merits of both FP- and BP-QAT and propose General Quantizer (GQ), which navigates weights to quantization centroids without introducing augmented regularizers but via feedforward-only operators. Our work is inspired by a continuous relaxation of quantization [25], also used for speech representation learning [26, 27, 28, 29, 30, 31, 32], and by the µ-Law algorithm for 8-bit pulse-code modulation (PCM) in digital telecommunication [33].
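For reference, the standard µ-Law companding pair used in 8-bit PCM telephony [33] is sketched below; GQ constrains its centroids to this companded space, but the actual centroid-selection procedure is described in Sec. 3.

```python
import numpy as np

def mu_law_compress(x, mu=255.0):
    """mu-Law companding: F(x) = sign(x) * ln(1 + mu*|x|) / ln(1 + mu), for x in [-1, 1]."""
    x = np.asarray(x, dtype=np.float64)
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_expand(y, mu=255.0):
    """Inverse transform: F^{-1}(y) = sign(y) * ((1 + mu)^|y| - 1) / mu."""
    y = np.asarray(y, dtype=np.float64)
    return np.sign(y) * (np.power(1.0 + mu, np.abs(y)) - 1.0) / mu

x = np.array([-0.5, -0.01, 0.0, 0.01, 0.5])
print(mu_law_compress(x))                   # small magnitudes are stretched, large ones compressed
print(mu_law_expand(mu_law_compress(x)))    # round-trips back to x (up to float error)
```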
3. METHODS
3.1. Centroid Selection via Softmax-Based Dissimilarity Matrices
For any weight value $w_i \in \mathbf{w}$, where $|\mathbf{w}| = n$, and the quantization centroid vector $\mathbf{z} = [z_1, \ldots, z_m]$, we define the point-wise dissimilarity matrix in Eq. (1):
$$A_{\text{soft}} = \begin{bmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nm} \end{bmatrix}, \qquad (1)$$
where $a_{ij}$ is the probability of representing $w_i$ by $z_j$. Each row $A_{\text{soft}}[i\,\cdot]$ sums to $1$, with the largest probability corresponding to the centroid most likely to represent $w_i$.
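A minimal sketch of how such a row-stochastic soft-assignment matrix can be formed is given below, assuming a softmax over negative scaled squared distances with a scaling factor $\alpha$ acting as the annealing knob; the exact dissimilarity and annealing schedule used by GQ are defined in the remainder of Sec. 3, so treat this only as an illustration of Eq. (1).

```python
import numpy as np

def soft_assignment_matrix(w, z, alpha=10.0):
    """A_soft[i, j]: probability of representing weight w_i by centroid z_j.
    Rows are a softmax over negative (scaled) squared distances, so each row sums to 1;
    increasing alpha anneals the soft assignment toward a hard (one-hot) one."""
    w = np.asarray(w, dtype=np.float64).ravel()
    z = np.asarray(z, dtype=np.float64)
    logits = -alpha * (w[:, None] - z[None, :]) ** 2      # (n, m), higher = more similar
    logits -= logits.max(axis=1, keepdims=True)           # stabilize the softmax
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

z = np.linspace(-1.0, 1.0, 4)
A_soft = soft_assignment_matrix([-0.9, 0.05, 0.7], z, alpha=25.0)
print(A_soft.round(3))                                    # rows sum to 1; largest entry marks the nearest centroid
```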