
adaptation.
We consider a Constrained Domain Adaptation task,
where the adaptation for the new domain is done without
access to any of the original domain data. We also strictly
limit the allowed degradation on the original domain after
adaptation [12]. Our main contributions are the following:
1. We add adapter modules [8] to the encoder, decoder,
and joint modules of the Conformer Transducer.
2. We train the adapters using various regularization tech-
niques alongside a constrained training schedule, and
show considerable improvement on the new domain
while limiting degradation on the original domain.
3. Finally, we propose a scoring scheme to select mod-
els that perform well in the constrained adaptation set-
ting and evaluate the proposed approach on the Google
Speech Commands [15] benchmark and the UK and
Ireland English Dialect speech dataset [16].
2. CONSTRAINED DOMAIN ADAPTATION
2.1. Degradation control on original domain
In order to reduce the accuracy loss on the original domain,
we formalize the first constraint as follows. During con-
strained domain adaptation, a candidate solution (C) must
be evaluated on some evaluation set of the original domain
prior to adaptation (o) and after adaptation (o∗) so as to limit
the absolute degradation of WER to at most κ, where κ is
some predetermined acceptable degradation in WER on the
evaluation datasets of the original domain.
To formalize the above constraint, we first define Word
Error Rate (WER) degradation after adaptation as:
\mathrm{WERDeg}_{o} = \max\left(0,\ \mathrm{WER}_{o^{*}} - \mathrm{WER}_{o}\right) \quad (1)
where the subscripts o∗ and a∗ denote evaluation on the original and adapted domains after the adaptation process, and o and a denote evaluation on the original and adapted domains prior to the adaptation process, respectively.
We then define the weight of degradation on the original domain as OSCALE, which is a scaling factor computed as:
\mathrm{OSCALE} = \frac{1}{N} \sum_{i=1}^{N} \max\left(0,\ \frac{\kappa_{i} - \mathrm{WERDeg}_{o,i}}{\kappa_{i}}\right) \quad (2)
where N is the number of evaluation datasets from the original domain that the model was initially trained on and κ is the maximum tolerable absolute degradation of word error rate on the original domain.
Next, we define relative WER improvement (WERR) on the new domain as AWERR, such that
\mathrm{AWERR} = \max\left(0,\ \frac{\mathrm{WER}_{a} - \mathrm{WER}_{a^{*}}}{\mathrm{WER}_{a}}\right) \quad (3)
In this formulation, we only accept candidates which improve
in WER on the new domain.
Combining the above definitions, we propose the following candidate selection metric, which, when maximized, yields candidates that maximize the relative WER improvement on the new domain while simultaneously minimizing the degradation on the original domain. We define this metric as a score function in Eqn. 4:
\mathrm{Score} = \mathrm{OSCALE} \cdot \mathrm{AWERR} \quad (4)
We select the value of κ to be 3%, such that the absolute increase in WER on the original dataset is constrained to 3%. It is possible to select a stricter threshold; however, the number of candidate solutions that satisfy the constraints then decreases significantly, and exceedingly few valid candidates
exist for the fine-tuning case. This score is maximized when
the candidate attains the largest relative WER improvement
on the new domain after scaling by a factor in the range of
[0, 1], which indicates the weight of degradation of WER on the original domain. Note that the score has a minimum value of 0 if the absolute degradation of WER on the original domain surpasses κ, or if the WER on the new domain becomes worse after adaptation.
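For concreteness, the selection procedure of Eqns. 1-4 can be sketched as the following illustrative Python snippet, assuming WERs are given as percentages and a common threshold κ for every original-domain evaluation set (function and argument names are ours, for illustration only):

from typing import Sequence

def candidate_score(wer_orig_before: Sequence[float],
                    wer_orig_after: Sequence[float],
                    wer_new_before: float,
                    wer_new_after: float,
                    kappa: float = 3.0) -> float:
    # Eqn. 1: absolute WER degradation on each original-domain evaluation set
    degradations = [max(0.0, after - before)
                    for before, after in zip(wer_orig_before, wer_orig_after)]
    # Eqn. 2: OSCALE, the weight of degradation on the original domain, in [0, 1]
    oscale = sum(max(0.0, (kappa - deg) / kappa) for deg in degradations) / len(degradations)
    # Eqn. 3: AWERR, relative WER improvement on the new domain, clipped at 0
    awerr = max(0.0, (wer_new_before - wer_new_after) / wer_new_before)
    # Eqn. 4: final score; it is 0 if every original-domain set degrades beyond kappa
    # or if the new-domain WER does not improve
    return oscale * awerr

A candidate adapted model is then kept or discarded by comparing its score against that of other candidates obtained during the constrained training schedule.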
2.2. Adaptation without access to original dataset
During constrained domain adaptation, a candidate solution
must only use the data of the new domain for adaptation, with-
out access to any data from the original domain. It may use
data from the original domain only for evaluation, in order
to determine the severity of degradation on that domain after
adaptation. When applying this constraint, we cannot freely
compute the Coverage (COV) metric [12] since it computes
the difference between fine-tuning and CJT, though we may
still utilize Learning Without Forgetting (LWF) [18], which
distills the model’s knowledge using just the data of the new
domain.
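As an illustration of the latter, one common way to instantiate such distillation is to keep a frozen copy of the pre-adaptation model as a teacher and penalize divergence between its output distribution and that of the adapted model on new-domain batches only. The sketch below assumes a temperature-scaled KL term over output logits and a hypothetical weighting against the transducer loss; the exact formulation in [18] is not reproduced here.

import torch
import torch.nn.functional as F

def lwf_distillation_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor,
                          temperature: float = 2.0) -> torch.Tensor:
    # Both tensors come from the same new-domain batch; the frozen
    # pre-adaptation model provides teacher_logits. Shape: (..., vocab_size).
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student), scaled by t^2 as in standard distillation
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (t * t)

# Hypothetical total adaptation loss:
#   loss = transducer_loss + lwf_weight * lwf_distillation_loss(student_logits, teacher_logits)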
3. SPEECH TRANSDUCERS WITH ADAPTERS
Adapters are small sub-networks injected into specific layers
of a pre-trained neural network, such that the original param-
eters of the network are frozen, avoiding gradient updates,
and only the injected parameters of the adapter sub-networks
are updated. The original adapters from [8] are two-layer residual feed-forward networks with an intermediate activation function, usually ReLU [19]. Optionally, layer normalization is applied to the input or output of the adapter.
Adapters have also been applied to multilingual ASR [20],
cross-lingual ASR [21], self-supervised ASR [22] and con-
textual ASR [2]. Adapters have been found useful in reducing
Catastrophic Forgetting in ASR [12].
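As a rough sketch of the adapter architecture described above (the bottleneck width and the placement of the layer normalization are illustrative choices, not prescribed values):

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Two-layer residual feed-forward adapter with optional input LayerNorm."""

    def __init__(self, d_model: int, bottleneck_dim: int = 64, use_layer_norm: bool = True):
        super().__init__()
        self.norm = nn.LayerNorm(d_model) if use_layer_norm else nn.Identity()
        self.down_proj = nn.Linear(d_model, bottleneck_dim)  # down-projection
        self.activation = nn.ReLU()                          # intermediate activation [19]
        self.up_proj = nn.Linear(bottleneck_dim, d_model)    # up-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection: only the adapter parameters receive gradients;
        # the surrounding pre-trained layers stay frozen.
        return x + self.up_proj(self.activation(self.down_proj(self.norm(x))))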
Following the unified framework for adapters proposed
by He et al. [23], we consider three types of adapters for