COMPARISON OF SOFT AND HARD TARGET RNN-T DISTILLATION FOR LARGE-SCALE
ASR
Dongseong Hwang, Khe Chai Sim, Yu Zhang, Trevor Strohman
Google LLC, USA
{dongseong, khechai, ngyuzh, strohman}@google.com
ABSTRACT
Knowledge distillation is an effective machine learning technique to
transfer knowledge from a teacher model to a smaller student model,
especially with unlabeled data. In this paper, we focus on knowledge
distillation for the RNN-T model, which is widely used in state-of-
the-art (SoTA) automatic speech recognition (ASR). Specifically, we
compared using soft and hard target distillation to train large-scale
RNN-T models on the LibriSpeech/LibriLight public dataset (60k
hours) and our in-house data (600k hours). We found that hard targets are more effective when the teacher and student have different architectures, such as a large teacher and a small streaming student. On the other hand, soft target distillation works better in self-training scenarios, such as iterative large-teacher training. For a large model with
0.6B weights, we achieve a new SoTA word error rate (WER) on
LibriSpeech (8% relative improvement on dev-other) using Noisy
Student Training with soft target distillation. It also allows our production teacher to adapt to new data domains continuously.
Index Terms—RNN Transducer, Knowledge Distillation,
Noisy Student Training, Semi-supervised learning
1. INTRODUCTION
The success of end-to-end (E2E) speech recognition models [1, 2, 3] is highly dependent on having a large amount of high-quality transcribed speech data. However, obtaining high-quality human transcriptions is expensive and difficult, which restricts the development of automatic speech recognition (ASR).
Knowledge distillation [4] is an effective technique to transfer
knowledge from a teacher model to a student model. Noisy Stu-
dent Training (NST) [5, 6] is a well-established method that applies knowledge distillation in an iterative training fashion to progressively improve the model. In each iteration, NST transcribes a large amount of unlabeled data with the teacher model and uses the transcripts to train the student model with data augmentation. NST has been shown to be effective for ImageNet [5] and the LibriSpeech automatic speech recognition task [6].
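To make the loop concrete, below is a minimal Python sketch of one way an NST iteration could be organized; the transcribe_fn and train_fn callables are hypothetical placeholders, not components described in this paper.

from typing import Callable, List, Sequence, Tuple

def noisy_student_training(
    teacher,
    labeled_data: List[Tuple[object, str]],      # (audio, human transcript) pairs
    unlabeled_audio: Sequence[object],
    transcribe_fn: Callable,  # (model, audio) -> pseudo transcript (hard target)
    train_fn: Callable,       # (training pairs) -> new model; augmentation applied inside
    num_iterations: int = 3,
):
    for _ in range(num_iterations):
        # 1. The teacher transcribes the unlabeled audio to produce pseudo-labels.
        pseudo_labeled = [(x, transcribe_fn(teacher, x)) for x in unlabeled_audio]
        # 2. The student is trained on supervised plus pseudo-labeled data; augmentation
        #    (e.g. SpecAugment) inside train_fn supplies the "noise" in NST.
        student = train_fn(list(labeled_data) + pseudo_labeled)
        # 3. The student becomes the teacher for the next iteration.
        teacher = student
    return teacher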
In this paper, we explored knowledge distillation for the RNN-
T [7] model. RNN-T is widely used in large-scale ASR sys-
tems [8, 9, 10] and achieves state-of-the-art results on the Lib-
riSpeech dataset [11, 12, 13]. NST training of RNN-T models was
first studied in [6] using hard target distillation [4, 14], where the
student model is trained using pseudo labels generated by a teacher
model. Hard target distillation was also used in the follow-up
works [11, 13] that further improved the SoTA results on Lib-
riSpeech by combining with pre-training. More recently, soft target
distillation for RNN-T was explored in [15, 16], where the KL di-
vergence between the teacher and student output label distributions is used as the loss function, similar to that used in [4]. However,
it was only used for model compression [15] and streaming ASR
models [16].
Motivated by the success of soft target distillation in the image domain [5] and a recent theoretical analysis [17] claiming that soft target distillation is better than hard target distillation, this paper investigates
the optimal knowledge distillation method for the large-scale RNN-
T architecture.
We demonstrate that soft target distillation yields a better word error rate (WER) in self-training, where the teacher and student have the same architecture; otherwise, hard target distillation yields a better WER. Teacher self-training with soft target distillation achieves a new LibriSpeech SoTA WER in a setup similar to the W2v-BERT paper [13]: the WERs (dev/dev-other/test/test-other) of the 600M-parameter Conformer model without an external language model improve from 1.3/2.6/1.4/2.7 to 1.3/2.4/1.4/2.6. In addition, we succeeded in training a new production teacher without performance degradation using soft distillation when the training data distribution shifts.
The contributions of this work include the following: (1) A sys-
tematic study of soft and hard target distillation on large-scale
SoTA RNN-T models. (2) Practical guidance on choosing soft or hard target distillation in different situations. (3) A more efficient way to perform soft distillation, achieving a new SoTA on LibriSpeech.
2. RELATED WORK
2.1. RNN-T model
In this section, we briefly summarize the RNN Transducer [7] (RNN-T). The RNN-T loss is the negative log posterior probability of the target sequence, obtained by summing over all possible alignment paths through the U × T lattice, as shown in Fig. 1, where U is the length of the target token sequence y and T is the number of audio input features x. Each node (u, t) in the lattice carries a log probability computed from a pair of acoustic (am_t) and label (lm_u) states. The RNN-T joint network combines these states to output the probability of the next label token (such as a character or word piece [18]; a vertical transition) or of a special blank token (a horizontal transition). The RNN-T loss can be computed efficiently using the forward-backward algorithm [19].
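As a concrete illustration of this recursion, the sketch below computes the RNN-T negative log posterior with a NumPy forward pass over the lattice, assuming the per-node blank and label log probabilities have already been produced by the joint network; the function and variable names are ours, not from [7] or [19].

import numpy as np

def rnnt_negative_log_posterior(log_blank, log_label):
    """Forward algorithm over the U x T lattice.

    log_blank[t, u]: log prob of the blank token (horizontal transition) at node (t, u).
    log_label[t, u]: log prob of the (u+1)-th target token (vertical transition) at (t, u).
    Shapes: (T, U + 1) and (T, U), assumed precomputed by the joint network.
    """
    T, U_plus_1 = log_blank.shape
    U = U_plus_1 - 1
    alpha = np.full((T, U + 1), -np.inf)
    alpha[0, 0] = 0.0  # log(1): start at the bottom-left corner of the lattice
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # arrive via a horizontal (blank) transition
                alpha[t, u] = np.logaddexp(alpha[t, u], alpha[t - 1, u] + log_blank[t - 1, u])
            if u > 0:  # arrive via a vertical (label) transition
                alpha[t, u] = np.logaddexp(alpha[t, u], alpha[t, u - 1] + log_label[t, u - 1])
    # Terminate with a final blank emission from the top-right node (T - 1, U).
    return -(alpha[T - 1, U] + log_blank[T - 1, U])

# Toy usage: T = 3 frames, U = 2 target tokens, all transition probabilities 0.5.
loss = rnnt_negative_log_posterior(np.log(np.full((3, 3), 0.5)), np.log(np.full((3, 2), 0.5)))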
2.2. RNN-T distillation
The original knowledge distillation paper [4] introduces both soft target and hard target distillation for classification tasks. A soft target is a categorical distribution over all classes, while a hard target is a one-hot vector.
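As a minimal classification-style illustration of the two target types (the logits, vocabulary size, and temperature below are made up for this sketch):

import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

teacher_logits = np.array([2.0, 0.5, 0.1, -1.0, -2.0])  # hypothetical teacher outputs

# Soft target: the full categorical distribution predicted by the teacher,
# optionally smoothed with a temperature > 1 as in [4].
soft_target = softmax(teacher_logits, temperature=2.0)

# Hard target: a one-hot vector at the teacher's most likely class.
hard_target = np.eye(len(teacher_logits))[np.argmax(teacher_logits)]

# Soft target distillation trains the student to match soft_target (e.g. via KL
# divergence); hard target distillation simply uses hard_target as a training label.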
In RNN-T distillation, the hard target is a transcript label, which is represented by a sequence of one-hot vectors. The first RNN-T
NST [6] utilizes hard target distillation and then hard target distilla-