COMPARISON OF SOFT AND HARD TARGET RNN-T DISTILLATION FOR LARGE-SCALE
ASR
Dongseong Hwang, Khe Chai Sim, Yu Zhang, Trevor Strohman
Google LLC, USA
{dongseong, khechai, ngyuzh, strohman}@google.com
ABSTRACT
Knowledge distillation is an effective machine learning technique to
transfer knowledge from a teacher model to a smaller student model,
especially with unlabeled data. In this paper, we focus on knowledge
distillation for the RNN-T model, which is widely used in state-of-
the-art (SoTA) automatic speech recognition (ASR). Specifically, we
compared using soft and hard target distillation to train large-scale
RNN-T models on the LibriSpeech/LibriLight public dataset (60k
hours) and our in-house data (600k hours). We found that hard targets are more effective when the teacher and student have different architectures, such as a large teacher and a small streaming student. On the other hand, soft target distillation works better in self-training scenarios, such as iterative large-teacher training. For a large model with
0.6B weights, we achieve a new SoTA word error rate (WER) on
LibriSpeech (8% relative improvement on dev-other) using Noisy
Student Training with soft target distillation. It also allows our production teacher to adapt to new data domains continuously.
Index Terms—RNN Transducer, Knowledge Distillation,
Noisy Student Training, Semi-supervised learning
1. INTRODUCTION
The success of end-to-end (E2E) speech recognition models [1, 2, 3] is highly dependent on having a large amount of high-quality transcribed speech data. However, obtaining high-quality human transcriptions is expensive and difficult, which restricts the development of automatic speech recognition (ASR).
Knowledge distillation [4] is an effective technique to transfer
knowledge from a teacher model to a student model. Noisy Stu-
dent Training (NST) [5, 6] is a well-established method that applies knowledge distillation in an iterative training fashion to progressively improve the model. In each iteration, NST transcribes a large amount of unlabeled data with the teacher model and uses the transcripts to train the student model with data augmentation. NST has been shown to be effective for ImageNet [5] and the LibriSpeech automatic speech recognition task [6].
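To make the loop concrete, below is a minimal Python sketch of one way an NST iteration could be organized; the transcribe_fn and train_fn callables are hypothetical placeholders, not components described in this paper.

from typing import Callable, List, Sequence, Tuple

def noisy_student_training(
    teacher,
    labeled_data: List[Tuple[object, str]],      # (audio, human transcript) pairs
    unlabeled_audio: Sequence[object],
    transcribe_fn: Callable,  # (model, audio) -> pseudo transcript (hard target)
    train_fn: Callable,       # (training pairs) -> new model; augmentation applied inside
    num_iterations: int = 3,
):
    for _ in range(num_iterations):
        # 1. The teacher transcribes the unlabeled audio to produce pseudo-labels.
        pseudo_labeled = [(x, transcribe_fn(teacher, x)) for x in unlabeled_audio]
        # 2. The student is trained on supervised plus pseudo-labeled data; augmentation
        #    (e.g. SpecAugment) inside train_fn supplies the "noise" in NST.
        student = train_fn(list(labeled_data) + pseudo_labeled)
        # 3. The student becomes the teacher for the next iteration.
        teacher = student
    return teacher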
In this paper, we explored knowledge distillation for the RNN-
T [7] model. RNN-T is widely used in large-scale ASR sys-
tems [8, 9, 10] and achieves state-of-the-art results on the Lib-
riSpeech dataset [11, 12, 13]. NST training of RNN-T models was
first studied in [6] using hard target distillation [4, 14], where the
student model is trained using pseudo labels generated by a teacher
model. Hard target distillation was also used in the follow-up
works [11, 13] that further improved the SoTA results on Lib-
riSpeech by combining with pre-training. More recently, soft target
distillation for RNN-T was explored in [15, 16], where the KL di-
vergence between the teacher and student output label distributions is used as the loss function, similar to that used in [4]. However,
it was only used for model compression [15] and streaming ASR
models [16].
Motivated by the success of soft target distillation in the image domain [5] and a recent theoretical analysis [17] claiming that soft target distillation is better than hard target distillation, this paper investigates
the optimal knowledge distillation method for the large-scale RNN-
T architecture.
We demonstrate that soft target distillation yields a better word error rate (WER) in self-training, where the teacher and student have the same architecture; otherwise, hard target distillation yields a better WER. Teacher self-training with soft target distillation achieves a new LibriSpeech SoTA WER in a setup similar to the W2v-BERT paper [13]: the WERs (dev/dev-other/test/test-other) of the 600M-parameter Conformer model without an external language model improve from 1.3/2.6/1.4/2.7 to 1.3/2.4/1.4/2.6. In addition, we succeeded in training a new production teacher without performance degradation using soft distillation when the training data distribution shifts.
The contributions of this work include the following: (1) A sys-
tematic study of soft and hard target distillation on large-scale
SoTA RNN-T models. (2) Practical guidance on choosing soft or hard target distillation in different situations. (3) A more efficient way to perform soft distillation, achieving a new SoTA on LibriSpeech.
2. RELATED WORK
2.1. RNN-T model
In this section, we briefly summarize the RNN Transducer [7] (RNN-T). The RNN-T loss is the negative log posterior probability of the target sequence, obtained by summing over all possible alignment paths through the U × T lattice, as shown in Fig. 1, where U is the length of the target token sequence y and T is the number of audio input features x. Each node (u, t) in the lattice carries a log probability computed from a pair of acoustic (am_t) and label (lm_u) states. The RNN-T joint network combines these states to output the probability of the next label token (such as a character or word piece [18]; a vertical transition) or of a special blank token (a horizontal transition). The RNN-T loss can be computed efficiently using the forward-backward algorithm [19].
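As a concrete illustration of this recursion, the sketch below computes the RNN-T negative log posterior with a NumPy forward pass over the lattice, assuming the per-node blank and label log probabilities have already been produced by the joint network; the function and variable names are ours, not from [7] or [19].

import numpy as np

def rnnt_negative_log_posterior(log_blank, log_label):
    """Forward algorithm over the U x T lattice.

    log_blank[t, u]: log prob of the blank token (horizontal transition) at node (t, u).
    log_label[t, u]: log prob of the (u+1)-th target token (vertical transition) at (t, u).
    Shapes: (T, U + 1) and (T, U), assumed precomputed by the joint network.
    """
    T, U_plus_1 = log_blank.shape
    U = U_plus_1 - 1
    alpha = np.full((T, U + 1), -np.inf)
    alpha[0, 0] = 0.0  # log(1): start at the bottom-left corner of the lattice
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # arrive via a horizontal (blank) transition
                alpha[t, u] = np.logaddexp(alpha[t, u], alpha[t - 1, u] + log_blank[t - 1, u])
            if u > 0:  # arrive via a vertical (label) transition
                alpha[t, u] = np.logaddexp(alpha[t, u], alpha[t, u - 1] + log_label[t, u - 1])
    # Terminate with a final blank emission from the top-right node (T - 1, U).
    return -(alpha[T - 1, U] + log_blank[T - 1, U])

# Toy usage: T = 3 frames, U = 2 target tokens, all transition probabilities 0.5.
loss = rnnt_negative_log_posterior(np.log(np.full((3, 3), 0.5)), np.log(np.full((3, 2), 0.5)))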
2.2. RNN-T distillation
The original knowledge distillation paper [4] introduces both soft target and hard target distillation for classification tasks. A soft target is a categorical distribution over all classes, while a hard target is a one-hot vector.
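As a minimal classification-style illustration of the two target types (the logits, vocabulary size, and temperature below are made up for this sketch):

import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

teacher_logits = np.array([2.0, 0.5, 0.1, -1.0, -2.0])  # hypothetical teacher outputs

# Soft target: the full categorical distribution predicted by the teacher,
# optionally smoothed with a temperature > 1 as in [4].
soft_target = softmax(teacher_logits, temperature=2.0)

# Hard target: a one-hot vector at the teacher's most likely class.
hard_target = np.eye(len(teacher_logits))[np.argmax(teacher_logits)]

# Soft target distillation trains the student to match soft_target (e.g. via KL
# divergence); hard target distillation simply uses hard_target as a training label.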
In RNN-T distillation, the hard target is a transcript label, which is represented by a sequence of one-hot vectors. The first RNN-T
NST [6] utilizes hard target distillation and then hard target distilla-