COMPARISON OF SOFT AND HARD TARGET RNN-T DISTILLATION FOR LARGE-SCALE
ASR
Dongseong Hwang, Khe Chai Sim, Yu Zhang, Trevor Strohman
Google LLC, USA
{dongseong, khechai, ngyuzh, strohman}@google.com
ABSTRACT
Knowledge distillation is an effective machine learning technique to
transfer knowledge from a teacher model to a smaller student model,
especially with unlabeled data. In this paper, we focus on knowledge
distillation for the RNN-T model, which is widely used in state-of-
the-art (SoTA) automatic speech recognition (ASR). Specifically, we
compare soft and hard target distillation for training large-scale
RNN-T models on the LibriSpeech/LibriLight public dataset (60k
hours) and our in-house data (600k hours). We found that hard targets
are more effective when the teacher and student have different
architectures, such as a large teacher and a small streaming student. On
the other hand, soft target distillation works better in self-training
scenarios such as iterative large teacher training. For a large model with
0.6B weights, we achieve a new SoTA word error rate (WER) on
LibriSpeech (8% relative improvement on dev-other) using Noisy
Student Training with soft target distillation. It also allows our
production teacher to adapt to new data domains continuously.
Index Terms— RNN Transducer, Knowledge Distillation,
Noisy Student Training, Semi-supervised learning
1. INTRODUCTION
The success of end-to-end (E2E) speech recognition models [1, 2, 3]
is highly dependent on having a large amount of high-quality transcribed
speech data. However, high-quality human transcriptions are expensive
and difficult to obtain, which restricts the development
of automatic speech recognition (ASR).
Knowledge distillation [4] is an effective technique to transfer
knowledge from a teacher model to a student model. Noisy Student
Training (NST) [5, 6] is a well-established method that applies
knowledge distillation in an iterative fashion to progressively
improve the model. In each iteration, NST transcribes a large
amount of unlabeled data with the teacher model and uses the pseudo labels
to train the student model with data augmentation. NST has been shown to be
effective for ImageNet [5] and the LibriSpeech automatic speech
recognition task [6].
In this paper, we explored knowledge distillation for the RNN-
T [7] model. RNN-T is widely used in large-scale ASR sys-
tems [8, 9, 10] and achieves state-of-the-art results on the Lib-
riSpeech dataset [11, 12, 13]. NST training of RNN-T models was
first studied in [6] using hard target distillation [4, 14], where the
student model is trained using pseudo labels generated by a teacher
model. Hard target distillation was also used in the follow-up
works [11, 13] that further improved the SoTA results on Lib-
riSpeech by combining it with pre-training. More recently, soft target
distillation for RNN-T was explored in [15, 16], where the KL
divergence between the teacher and student output label distributions
is used as the loss function, similar to the loss in [4]. However,
it was only used for model compression [15] and streaming ASR
models [16].
Motivated by the success of soft target distillation in the image
domain [5] and a recent theoretical analysis [17] claiming that soft
target distillation is better than hard target, this paper investigates
the optimal knowledge distillation method for the large-scale RNN-
T architecture.
We demonstrate that soft target distillation yields a better word
error rate (WER) in self-training, where the teacher and student share
the same architecture; otherwise, hard target distillation yields a better
WER. Teacher self-training with soft target distillation sets a
new LibriSpeech SoTA WER in a setup similar to the W2v-BERT paper [13]:
the WERs (dev/dev-other/test/test-other) of a 600M Conformer
model without an external language model improve from 1.3/2.6/1.4/2.7 to
1.3/2.4/1.4/2.6. In addition, we succeeded in training a new production
teacher without performance degradation using soft distillation when
the training data distribution is shifted.
The contributions of this work include the following: (1) a systematic
study of soft and hard target distillation on large-scale
SoTA RNN-T; (2) practical guidance on when to use soft or hard target
distillation in different situations; (3) a more efficient way to perform
soft distillation that achieves a new SoTA on LibriSpeech.
2. RELATED WORK
2.1. RNN-T model
In this section, we briefly summarize the RNN Transducer [7] (RNN-T).
The RNN-T loss is given by the sum of the negative log posterior
probability over all possible alignment paths through the U×T lattice, as
shown in Fig. 1, where U is the length of the target token sequence y and
T is the number of audio input features x. Each node (u, t) in the
lattice represents a log probability computed from a pair of acoustic (am_t)
and label (lm_u) states. The RNN-T joint network combines these
states to output the probabilities of the next label token (such as characters
or word pieces [18]; a vertical transition) or a special blank
token (a horizontal transition). The RNN-T loss can be computed
efficiently using the forward-backward algorithm [19].
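To make the lattice concrete, the sketch below shows one way a joint network can combine an acoustic state and a label state at a single node (u, t). This is a minimal NumPy illustration, not the authors' implementation: the projection matrices, the additive tanh combination, and the blank indexing are assumptions chosen for clarity.

```python
import numpy as np

def joint_network(am_t, lm_u, W_enc, W_pred, W_out, b_out):
    """Combine one acoustic state am_t and one label state lm_u into
    log probabilities over the next token: K labels plus the blank.

    am_t: (D_enc,) encoder output at frame t.
    lm_u: (D_pred,) prediction-network output after label u.
    W_enc, W_pred: projections into a shared joint dimension.
    W_out, b_out: output projection to K + 1 classes (blank + labels).
    """
    h = np.tanh(W_enc @ am_t + W_pred @ lm_u)     # joint hidden state
    logits = W_out @ h + b_out                    # shape (K + 1,)
    logits = logits - logits.max()                # numerical stability
    return logits - np.log(np.exp(logits).sum())  # log P(k | u, t)
```

Evaluating such a function at every node of the U×T lattice produces the per-node label distributions that both the RNN-T loss and the distillation losses in Section 2.2 operate on.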
2.2. RNN-T distillation
The original knowledge distillation paper [4] introduces both soft and
hard target distillation for classification tasks. A soft target is a
categorical distribution over all classes, while a hard target is a
one-hot vector.
In RNN-T distillation, the hard target is a transcript label, represented
as a sequence of one-hot vectors. The first RNN-T NST work [6] utilizes
hard target distillation, and hard target distillation is widely used by
the follow-up LibriSpeech SoTA papers [11, 13, 20].

Fig. 1. The RNN-T log probability U×T lattice, following [7].
As a soft target is a categorical distribution, soft target distillation
uses a KL divergence loss between the teacher and student log
probabilities. The natural extension to the RNN-T model is the KL
divergence over the RNN-T log probability U×T lattice, as shown in
Eq. 1. The recent RNN-T distillation papers [15, 16] use this method.
L_{\mathrm{KL}} = \sum_{u,t} \sum_{k} P_T(k \mid u, t) \, \ln \frac{P_T(k \mid u, t)}{P_S(k \mid u, t)} \qquad (1)
Soft RNN-T distillation is based on a node-wise KL divergence,
which means that the student directly learns the alignment of the
teacher, whereas for hard distillation the student learns the alignment
through the RNN-T loss.
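For concreteness, a direct NumPy sketch of Eq. 1 follows, assuming the teacher and student log probabilities have already been evaluated at every lattice node; the tensor names and shapes are illustrative, not the authors' code.

```python
import numpy as np

def rnnt_soft_distillation_loss(teacher_logp, student_logp):
    """Node-wise KL divergence of Eq. 1.

    teacher_logp, student_logp: arrays of shape (T, U, K) holding
    log P_T(k | u, t) and log P_S(k | u, t) over the RNN-T lattice,
    where K includes the blank token.
    """
    p_teacher = np.exp(teacher_logp)
    # KL(P_T || P_S) per node, then summed over all (u, t).
    kl_per_node = np.sum(p_teacher * (teacher_logp - student_logp), axis=-1)
    return kl_per_node.sum()
```

Materializing the full (T, U, K) tensors in this way is exactly the O(U×T×K) memory cost addressed next.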
The efficient RNN-T distillation paper [15] proposes a memory-efficient
soft target distillation, because a naive implementation of Eq. 1
requires O(U×T×K) memory. The efficient distillation distils three
probabilities, namely the next label y, the blank, and the remainder,
instead of all K classes.
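As a rough sketch of that idea (only as summarized above; the exact reduction in [15] may differ in detail), the full distribution at each node can be collapsed into three probabilities before taking the KL. The blank index, the clipping, and the tensor layout are assumptions.

```python
import numpy as np

def collapse_to_three(logp, next_label, eps=1e-10):
    """Collapse log P(k | u, t) of shape (T, U, K) into three
    probabilities per node: the next reference label y_{u+1}, the
    blank (assumed to be index 0), and the remainder."""
    p = np.exp(logp)
    T = p.shape[0]
    p_blank = p[:, :, 0]
    # next_label: (U,) reference token ids, broadcast over time.
    idx = np.repeat(next_label[None, :, None], T, axis=0)
    p_label = np.take_along_axis(p, idx, axis=-1)[:, :, 0]
    p_rest = 1.0 - p_blank - p_label
    q = np.stack([p_label, p_blank, p_rest], axis=-1)   # (T, U, 3)
    return np.clip(q, eps, 1.0)

def efficient_soft_distillation_loss(teacher_logp, student_logp, next_label):
    """KL over the collapsed three-way distributions, following the
    idea summarized above."""
    q_t = collapse_to_three(teacher_logp, next_label)
    q_s = collapse_to_three(student_logp, next_label)
    return np.sum(q_t * (np.log(q_t) - np.log(q_s)))
```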
2.3. W2v-BERT
Self-supervised learning techniques [13, 20, 21, 22, 23, 24, 25]
have been shown to be effective for pre-training ASR models and
achieved impressive performance for speech recognition tasks. We
utilize self-supervised learning to pretrain the encoder of the student
model before distillation. We compared two recent methods, W2v-
BERT [13] and BEST-RQ [20], and found that the former achieved
better WER results. Therefore, we use W2v-BERT pre-training for
all the experiments in this paper.
3. METHOD
3.1. Multi-stage training using self/semi-supervised learning
We combine W2v-BERT and NST as follows:
1. Prepare the existing strong teacher.
2. Pretrain the RNN-T encoder of the student with W2v-BERT.
3. Distill from the teacher to the pre-trained student.
The existing strong teacher is a bi-directional model trained with
W2v-BERT and multi-generation NST. We refer to the distillation target
as the student. In this paper, the student model is either a large
bi-directional model or a small streaming model.
3.2. Distillation methods
We use a linear combination of the RNN-T loss (L_RNNT) and the
KL-divergence loss (L_KL) as the overall training loss:
L = \alpha L_{\mathrm{RNNT}} + (1 - \alpha) L_{\mathrm{KL}} \qquad (2)
Eq. 2 can express different training methods by adjusting α. With α = 1
and human-labeled data, it is conventional RNN-T training. When
pseudo-labeled data are used, we can perform hard target distillation
with α = 1 and soft target distillation with α = 0. Furthermore,
we can mix hard and soft target distillation by setting α
between 0 and 1. We can also mix soft target distillation with
supervised training by using L_RNNT on labeled data and L_KL on
unlabeled data, which is done in existing distillation work in other
domains [4, 26, 27]. In this paper, most of the experiments use α = 0
because we found that using L_KL alone achieves better WERs, as
shown in Section 4.3.
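As a small illustration of Eq. 2 and the regimes described above, the mixing can be written as a one-line combination; the individual loss values are assumed to be computed elsewhere.

```python
def combined_loss(rnnt_loss, kl_loss, alpha):
    """Eq. 2: alpha = 1 gives plain RNN-T training (or hard target
    distillation when the labels are pseudo labels), alpha = 0 gives
    pure soft target distillation, and intermediate values mix both."""
    return alpha * rnnt_loss + (1.0 - alpha) * kl_loss
```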
We implement RNN-T soft target distillation efficiently. In Eq. 1,
the KL divergence for each (u, t) can be computed independently, so
we are free to compute all nodes at once or break them into smaller
groups to balance the time-space trade-off. We found that computing 8
time frames per iteration works best. With this implementation, vanilla
soft target distillation and efficient RNN-T distillation [15] consume
similar memory. See Section 5.5 for the experimental results.
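Because the KL term in Eq. 1 factorizes over lattice nodes, the sum can be accumulated over blocks of time frames, trading recomputation for memory. The sketch below assumes hypothetical callables that return the log probabilities for a slice of frames; the chunk size of 8 follows the setting reported here, everything else is illustrative.

```python
import numpy as np

def chunked_soft_distillation_loss(teacher_logp_fn, student_logp_fn,
                                   num_frames, chunk_size=8):
    """Compute Eq. 1 over blocks of time frames so that the full
    (T, U, K) tensors are never materialized at once.

    teacher_logp_fn(t0, t1) / student_logp_fn(t0, t1) are assumed to
    return log P(k | u, t) of shape (t1 - t0, U, K) for frames [t0, t1).
    """
    total = 0.0
    for t0 in range(0, num_frames, chunk_size):
        t1 = min(t0 + chunk_size, num_frames)
        t_logp = teacher_logp_fn(t0, t1)
        s_logp = student_logp_fn(t0, t1)
        # Same node-wise KL as before, restricted to this block of frames.
        total += float(np.sum(np.exp(t_logp) * (t_logp - s_logp)))
    return total
```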
3.3. Spectrum Frequency augmentation
The choice of augmentation method is an important part of NST. We
propose two augmentation algorithms on top of SpecAugment [28].
These augmentation methods improve soft target distillation.
See Section 5.4 for the experimental results.
3.3.1. Frequency Warping
Frequency warping warps a log mel spectrogram along the frequency
axis, which imitates a pitch shift. SpecAugment [28] introduces time
warping; frequency warping is the analogous technique for the frequency
axis, as shown in Alg. 1.
Algorithm 1 Frequency Warping
Require: input log mel feature x ∈ R^{F×T}, where F is the frequency dimension and T is the time sequence length.
1. Draw y uniformly at random from the range (0, F) as the anchor point in Fig. 2.
2. Compute the maximum distance ∆F = F × γ_f, where γ_f is the frequency warping ratio, a hyperparameter.
3. Draw dy uniformly at random from the range (−∆F, ∆F).
4. Compute the destination point y' = y + dy, clipped to the range (0, F).
5. Warp the log mel feature from the anchor point y to the destination point y'.
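A minimal NumPy sketch of Alg. 1 is given below. The piecewise-linear remapping of the frequency axis and the linear interpolation between neighboring bins are assumptions, since the exact warping function is not specified here.

```python
import numpy as np

def frequency_warp(x, gamma_f=0.1, rng=None):
    """Alg. 1: warp the frequency axis of a log mel spectrogram.

    x: array of shape (F, T), frequency by time.
    gamma_f: frequency warping ratio (hyperparameter).
    """
    rng = np.random.default_rng() if rng is None else rng
    F, _ = x.shape
    y = rng.uniform(0, F)                      # step 1: anchor point
    max_dist = F * gamma_f                     # step 2: max distance
    dy = rng.uniform(-max_dist, max_dist)      # step 3
    # Step 4: destination point; clipped slightly inside (0, F) so the
    # piecewise-linear map below stays monotonic (a safeguard, not in Alg. 1).
    y_dst = float(np.clip(y + dy, 1.0, F - 2.0))

    # Step 5: output bin f is read from source position src(f), with
    # src(0) = 0, src(y_dst) = y, src(F - 1) = F - 1.
    out_bins = np.arange(F, dtype=np.float64)
    src = np.interp(out_bins, [0.0, y_dst, F - 1.0], [0.0, y, F - 1.0])

    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, F - 1)
    w = (src - lo)[:, None]                    # linear interpolation weight
    return (1.0 - w) * x[lo, :] + w * x[hi, :]
```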
3.3.2. Frequency Noise
Frequency noise applies multiplicative noise to the log mel spectrogram,
which imitates a random equalizer that boosts or attenuates random
frequencies. The frequency noise algorithm is shown in Alg. 2.
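As a rough sketch of the description above (Alg. 2 gives the exact procedure), one possible realization is an independent random gain per frequency bin applied multiplicatively to the log mel features; the uniform gain range and its hyperparameter are assumptions.

```python
import numpy as np

def frequency_noise(x, gamma_n=0.1, rng=None):
    """Multiplicative per-frequency noise on a log mel spectrogram,
    imitating a random equalizer (see Alg. 2 for the exact procedure).

    x: array of shape (F, T), frequency by time.
    gamma_n: assumed hyperparameter controlling the gain range.
    """
    rng = np.random.default_rng() if rng is None else rng
    F = x.shape[0]
    # One random gain per frequency bin, broadcast over all time frames.
    gains = rng.uniform(1.0 - gamma_n, 1.0 + gamma_n, size=(F, 1))
    return x * gains
```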