
T) [17], Bifocal RNN-T [18], and Conformer [19, 20]. Our
results show that in all three architectures, GQ yields little to
no accuracy loss when compressing models to S8B or even
sub-5-bit (5-bit or lower). We also present performance op-
timization strategies from ablation studies on bit-allocation
and quantization frequency. Our contributions are as follows:
• We propose GQ, inspired by both FP- and BP-QAT ap-
proaches. GQ enables on-centroid weight aggregation
without augmented regularizers. Instead, it leverages
Softmax annealing to impose soft-to-hard quantization
on centroids from the µ-Law constrained space.
• GQ supports different quantization modes for a wide
range of granularity: different bit depths can be speci-
fied for different kernels/layers/modules.
• With GQ, we losslessly compress a lightweight streaming Conformer to sub-5-bit precision with more than 6× model size reduction. To the best of our knowledge, this is among the first sub-5-bit Conformer models for on-device ASR. Without accuracy degradation, our GQ-compressed 5-bit Bifocal RNN-T reduces the memory footprint by 30.73% and P90 user-perceived latency (UPL) by 31.30%.
We describe the problem in Sec. 2 and GQ in Sec. 3. The
experimental settings and results are detailed in Sec. 4. We
conclude in Sec. 5 with some final remarks.
2. PRELIMINARIES
2.1. Problem Formulation
Consider a general deep neural network architecture with $K$ layers, $\mathcal{F} = F_1 \circ \cdots \circ F_K$, mapping the input from $\mathbb{R}^{d_1}$ to the output in $\mathbb{R}^{d_{K+1}}$ as $\mathcal{F}: \mathbb{R}^{d_1} \mapsto \mathbb{R}^{d_{K+1}}$, where the input and output of an arbitrary $k$-th layer are related by $\mathbf{x}^{(k+1)} := F_k(\mathbf{x}^{(k)})$. Under supervised learning, the training data $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ and $\mathcal{Y} = \{\mathbf{y}_1, \ldots, \mathbf{y}_N\}$ are used to update the model weights $\mathbf{W} = [\mathbf{W}_1, \ldots, \mathbf{W}_K]$ of the $K$ layers in $\mathcal{F}$. Usually the optimization is over the training objective function
$$\mathcal{L}(\mathcal{X}, \mathcal{Y}, \mathcal{F}, \mathbf{W}) = \frac{1}{N} \sum_{i=1}^{N} \ell(\mathcal{F}(\mathbf{x}_i), \mathbf{y}_i) + \lambda R(\mathbf{W}),$$
where $i$ is the data index, $\ell$ is the main loss term measuring model accuracy, and $R(\mathbf{W})$ is the regularizer blended into the objective function via the coefficient $\lambda$.
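As a concrete reading of this objective, the following NumPy sketch evaluates it over a dataset of $N$ examples; the callables model, ell, and R are hypothetical placeholders for $\mathcal{F}$, the accuracy loss, and the regularizer (they are not part of this paper's method).

import numpy as np

def training_objective(model, weights, X, Y, ell, R, lam):
    """Evaluate L(X, Y, F, W) = (1/N) * sum_i ell(F(x_i), y_i) + lam * R(W)."""
    N = len(X)
    # Average per-example loss over the N training pairs.
    data_term = sum(ell(model(x, weights), y) for x, y in zip(X, Y)) / N
    # Regularization term blended in via the coefficient lam (lambda).
    return data_term + lam * R(weights)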
Network quantization aims at discretizing model weights. Scalar quantization converts each weight $w \in \mathbf{W}$ to a quantization centroid $z \in \mathbf{z}$, where $\mathbf{z} = [z_1, \ldots, z_m]$, so that the network is compressed to $\lceil \log_2 m \rceil$ bits. For S8B quantization, the centroids are a subset of the INT8 values.
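The scalar-quantization step itself is simple; below is a minimal NumPy sketch, assuming a centroid vector $\mathbf{z}$ is already given (this is only the assignment step, not the GQ training procedure).

import numpy as np

def scalar_quantize(weights, centroids):
    """Map every weight to its nearest centroid; with m centroids each
    weight can be stored as a ceil(log2(m))-bit index."""
    w = np.asarray(weights, dtype=np.float64).ravel()
    z = np.asarray(centroids, dtype=np.float64)
    # Index of the closest centroid for each weight.
    idx = np.argmin(np.abs(w[:, None] - z[None, :]), axis=1)
    bit_depth = int(np.ceil(np.log2(len(z))))
    return z[idx].reshape(np.shape(weights)), idx, bit_depth

# Example: m = 4 centroids gives a 2-bit representation.
q, codes, bits = scalar_quantize([0.07, -0.52, 0.9], [-1.0, -0.5, 0.0, 0.5])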
2.2. Related QAT Approaches
BP-QAT counters model weight continuity via regularization. For example, it introduces a weight regularizer $R(\mathbf{W})$ measuring the point-wise distance between each weight and the $m$ quantization centroids in the centroid vector $\mathbf{z} = [z_1, \ldots, z_m]$. Note that the quantization regularizer $R(\mathbf{W})$ in the loss function must be compatible with gradient descent. Consequently, $R(\mathbf{W})$ cannot enforce each weight to be replaced by its closest centroid in $\mathbf{z}$ as $w = \arg\min_{z_i \in \mathbf{z}} \|w - z_i\|$ for $w \in \mathbf{W}$, because the min operator is not differentiable. Recent BP-QAT methods instead force weights to approach the centroids in $\mathbf{z}$ using $R(\mathbf{W}) = \sum_{w \in \mathbf{W}} D(w, \mathbf{z})$, where the differentiable dissimilarity function $D$ is based on a cosine function in [1, 14].
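The precise regularizer of [1, 14] is not reproduced here; as a hedged illustration, one differentiable cosine-based penalty that vanishes exactly on a uniform centroid grid (an assumed grid spacing, not necessarily the centroid set used in those works) could look as follows.

import numpy as np

def cosine_quantization_penalty(weights, step):
    """Smooth penalty that is zero exactly when every weight equals a
    multiple of `step`, i.e., lies on an assumed uniform centroid grid.
    Unlike the argmin assignment above, it is differentiable everywhere."""
    w = np.asarray(weights, dtype=np.float64)
    return float(np.sum(1.0 - np.cos(2.0 * np.pi * w / step)))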
In contrast, FP-QAT can be regularizer free [11, 21]. Typically, a "fake quantizer" or an equivalent operation is applied during training to hard-quantize weights to a specific range and bit depth; at runtime, the model is then converted to INT8 format via TFLite [22]. The study [11] uses native quantization operators with which, during training, the weights are quantized and then converted to the integer type for model deployment. However, FP-QAT is essentially hard compression recurring throughout training, which severely degrades performance when applied to S8B quantization. Consequently, finetuning is usually needed, which prolongs the model training time [23, 24].
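As a generic sketch of such a "fake quantizer" (uniform, range-based; not the specific operator of [11] or of TFLite), the forward pass might round weights to a b-bit grid and immediately dequantize them:

import numpy as np

def fake_quantize(weights, num_bits=8):
    """Round weights to 2**num_bits uniform levels over their observed
    range and map them back to float; the integer codes are what a
    deployed model would store."""
    w = np.asarray(weights, dtype=np.float64)
    w_min, w_max = w.min(), w.max()
    levels = 2 ** num_bits - 1
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    codes = np.round((w - w_min) / scale)   # integers in [0, levels]
    return codes * scale + w_min            # dequantized float weights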
Both FP-QAT and BP-QAT require specifying appropriate quantization centroids before model training. While the centroids for INT8 model compression are pre-defined, for S8B quantization the optimal set of centroids is usually kernel- or layer-specific. For models such as Conformer [16], which typically contain hundreds of kernels, current S8B QAT methods become less tractable.
In this work, we combine the merits of both FP- and BP-QAT and propose the General Quantizer (GQ), which navigates weights to quantization centroids without introducing augmented regularizers, relying on feedforward-only operators instead. Our work is inspired by a continuous relaxation of quantization [25], also used for speech representation learning [26, 27, 28, 29, 30, 31, 32], and by the µ-Law algorithm for 8-bit pulse-code modulation (PCM) in digital telecommunication [33].
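For reference, standard µ-Law companding in PCM telephony [33] maps a signal $x \in [-1, 1]$ to $\operatorname{sgn}(x)\,\ln(1 + \mu|x|)/\ln(1 + \mu)$, allocating finer resolution near zero; how GQ uses it to constrain the centroid space is described in Sec. 3. A direct transcription of the standard formula (µ = 255 for 8-bit PCM) is shown below.

import numpy as np

def mu_law_compress(x, mu=255.0):
    """Standard mu-law companding: [-1, 1] -> [-1, 1], finer near zero."""
    x = np.clip(np.asarray(x, dtype=np.float64), -1.0, 1.0)
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_expand(y, mu=255.0):
    """Inverse of mu_law_compress."""
    y = np.asarray(y, dtype=np.float64)
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu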
3. METHODS
3.1. Centroid Selection via Softmax-Based Dissimilarity
Matrices
For any weight value $w_i \in \mathbf{w}$, where $|\mathbf{w}| = n$, and the quantization centroid vector $\mathbf{z} = [z_1, \ldots, z_m]$, we define the point-wise dissimilarity matrix in Eq. (1):
$$\mathbf{A}_{\text{soft}} =
\begin{bmatrix}
a_{11} & \cdots & a_{1m} \\
\vdots & \ddots & \vdots \\
a_{n1} & \cdots & a_{nm}
\end{bmatrix}, \quad (1)$$
where $a_{ij}$ is the probability of representing $w_i$ by $z_j$. Each row $\mathbf{A}_{\text{soft}}[i\,\cdot]$ sums to $1$, with the largest probability