DAMAGE CONTROL DURING DOMAIN ADAPTATION FOR TRANSDUCER BASED
AUTOMATIC SPEECH RECOGNITION
Somshubra Majumdar*, Shantanu Acharya*, Vitaly Lavrukhin, Boris Ginsburg
{smajumdar, shantanua, vlavrukhin, bginsburg}@nvidia.com
*Equal contribution.
ABSTRACT
Automatic speech recognition models are often adapted
to improve their accuracy in a new domain. A potential
drawback of model adaptation to new domains is catastrophic
forgetting, where the Word Error Rate on the original domain
is significantly degraded. This paper addresses the situation
when we want to simultaneously adapt automatic speech
recognition models to a new domain and limit the degrada-
tion of accuracy on the original domain without access to the
original training dataset. We propose several techniques such
as a limited training strategy and regularized adapter modules
for the Transducer encoder, prediction, and joiner network.
We apply these methods to the Google Speech Commands
and to the UK and Ireland English Dialect speech data set and
obtain strong results on the new target domain while limiting
the degradation on the original domain.
Index Terms: Automatic Speech Recognition, Domain
Adaptation, Catastrophic Forgetting, Transducer, Adapter
1. INTRODUCTION
Using a pre-trained Automatic Speech Recognition (ASR)
system on a different domain than the one it was trained
on usually leads to severe degradation in Word Error Rate
(WER). The adaptation of end-to-end ASR models to new
domains presents several challenges. First, obtaining large
amounts of labeled data on a new domain is expensive. Second, the most common domain adaptation approach is to fine-tune the ASR model; however, fine-tuning the model on
relatively small amounts of data causes it to overfit the new
domain. Finally, during adaptation, the WER of the model on
the original domain may deteriorate, a phenomenon known as
Catastrophic Forgetting [1]. This is a significant drawback as
the adapted model can no longer accurately transcribe speech
from the original domain.
Prior works addressing domain adaptation generally fall
into two categories: post-training adaptation and on-the-fly
adaptation [2]. Post-training adaptation generally involves
using domain-specific Language Models (LMs) [3]. These
models do not require the acoustic model to be re-trained but
their usefulness is limited to applications where the new domain differs only by vocabulary terms that were not present in the original domain. If the new domain instead differs in speaker accent or grammar, these approaches do not perform well [4]. On-the-
fly adaptation techniques usually involve either Continual
Joint Training (CJT) [5] or finetuning an existing pre-trained
model. Since both these approaches require the training of the
entire original model, they tend to require significant compute
resources and data to perform well. Continual joint training
has several drawbacks, primarily that it assumes that the en-
tire original dataset is available for adaptation, and it does not
consider the cumulative cost of training on an ever-growing
dataset.
Zhao et al. [6] propose a unified speaker adaptation ap-
proach that incorporates a speaker-aware persistent memory
model and a gradual pruning method. Hwang et al. [7]
utilize the combination of self- and semi-supervised learn-
ing methods to solve unseen domain adaptation problems in
a large-scale production setting for an online ASR model.
While these approaches help in overcoming catastrophic for-
getting, they make an implicit assumption that the data of the
original domain is always available during the adaptation pro-
cess. This assumption might not be viable in practical scenarios, as production systems are generally trained on licensed or sensitive data, which makes data sharing unfeasible.
Houlsby et al. [8] propose adapter modules that are
small sub-networks injected into the layers of a pre-trained
neural network. The parameters of the pre-trained network
are frozen and only the injected parameters are updated on
the new domain. Even with just a small fraction of the en-
tire model being trained on the new domain, adapters show
performance comparable to fine-tuning. Tomanek et al. [9]
demonstrate ASR domain adaptation by attaching adapters to
the encoder part of a Recurrent Neural Network Transducer
(RNN-T) [10] and the Transformer Transducer (T-T) [11].
Eeckt et al. [12] use task-specific adapters to overcome
catastrophic forgetting in domain adaptation. They demon-
strate three adaptation techniques: (1) keeping the original
model’s parameters frozen, (2) training with special regularization such as Elastic Weight Consolidation (EWC) [13], and (3) using Knowledge Distillation (KD) [14]. Still, the underlying
assumption is that the original dataset is available during the
adaptation.
We consider a Constrained Domain Adaptation task,
where the adaptation for the new domain is done without
access to any of the original domain data. We also strictly
limit the allowed degradation on the original domain after
adaptation [12]. Our main contributions are the following:
1. We add adapter modules [8] to the encoder, decoder,
and joint modules of the Conformer Transducer.
2. We train the adapters using various regularization tech-
niques alongside a constrained training schedule, and
show considerable improvement on the new domain
while limiting degradation on the original domain.
3. Finally, we propose a scoring scheme to select mod-
els that perform well in the constrained adaptation set-
ting and evaluate the proposed approach on the Google
Speech Commands [15] benchmark and the UK and
Ireland English Dialect speech dataset [16].
2. CONSTRAINED DOMAIN ADAPTATION
2.1. Degradation control on original domain
In order to reduce the accuracy loss on the original domain,
we formalize the first constraint as follows. During con-
strained domain adaptation, a candidate solution (C) must
be evaluated on some evaluation set of the original domain
prior to adaptation (o') and after adaptation (o), so as to limit the absolute degradation of WER to at most κ, where κ is
some predetermined acceptable degradation in WER on the
evaluation datasets of the original domain.
To formalize the above constraint, we first define Word
Error Rate (WER) degradation after adaptation as:
$\mathrm{WERDeg}_{o} = \max\left(0,\ \mathrm{WER}_{o} - \mathrm{WER}_{o'}\right)$   (1)
where the subscripts o, a represent evaluation on the original domain and adapted domain after the adaptation process, and o', a' represent the evaluation on the original domain and adapted domain prior to the adaptation process, respectively.
We then define the weight of degradation on the original
domain as O_SCALE, which is a scaling factor computed as:
$\mathrm{O_{SCALE}} = \frac{1}{N}\sum_{i=1}^{N}\max\!\left(0,\ \frac{\kappa_i - \mathrm{WERDeg}_{o,i}}{\kappa_i}\right)$   (2)
where N is the number of evaluation datasets from the original domain that the model was initially trained on, and κ is the
maximum tolerable absolute degradation of word error rate
on the original domain.
Next, we define relative WER improvement (WERR) on
the new domain as A_WERR, such that
$\mathrm{A_{WERR}} = \max\!\left(0,\ \frac{\mathrm{WER}_{a'} - \mathrm{WER}_{a}}{\mathrm{WER}_{a'}}\right)$   (3)
In this formulation, we only accept candidates which improve
in WER on the new domain.
Combining the above definitions, we propose the follow-
ing candidate selection metric which, when maximized, yields
candidates that maximize the relative WER improvement on
the new domain, while simultaneously minimizing the degra-
dation on the old domain. We define this metric as a score
function in Eqn 4:
$\mathrm{Score} = \mathrm{O_{SCALE}} \cdot \mathrm{A_{WERR}}$   (4)
We select the value of κ to be 3%, such that the abso-
lute increase in WER on the original dataset is constrained
to 3%. It is possible to select a stricter threshold, however,
the number of candidate solutions that satisfy the constraints decreases significantly, and exceedingly few valid candidates
exist for the fine-tuning case. This score is maximized when
the candidate attains the largest relative WER improvement
on the new domain after scaling by a factor in the range of
[0, 1], which indicates the weight of degradation of WER on
the original domain. Note that the score has a minimum value
of 0, if the absolute degradation of WER on the original do-
main surpasses κ, or if WER on the new domain becomes
worse after adaptation.
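To make the selection procedure concrete, the following is a minimal Python sketch of the score defined in Eqns. (1)-(4). The function and variable names are ours (the paper provides no reference implementation), and WER values are assumed to be expressed in absolute percentage points.

```python
# Hypothetical sketch of the candidate-selection score in Eqns. (1)-(4).
# All names are illustrative; WER values are in absolute percentage points.
from typing import Sequence


def selection_score(
    wer_orig_before: Sequence[float],  # WER_o' per original-domain eval set, before adaptation
    wer_orig_after: Sequence[float],   # WER_o per original-domain eval set, after adaptation
    wer_new_before: float,             # WER_a' on the new domain, before adaptation
    wer_new_after: float,              # WER_a on the new domain, after adaptation
    kappa: float = 3.0,                # maximum tolerable absolute WER degradation, Eqn. (2)
) -> float:
    """Score = O_SCALE * A_WERR; larger is better, 0 means a constraint is violated."""
    # Eqns. (1)-(2): per-dataset degradation on the original domain, scaled into [0, 1].
    o_scale = 0.0
    for before, after in zip(wer_orig_before, wer_orig_after):
        degradation = max(0.0, after - before)              # WERDeg_o
        o_scale += max(0.0, (kappa - degradation) / kappa)  # 0 once degradation exceeds kappa
    o_scale /= len(wer_orig_before)

    # Eqn. (3): relative WER improvement on the new domain.
    a_werr = max(0.0, (wer_new_before - wer_new_after) / wer_new_before)

    # Eqn. (4)
    return o_scale * a_werr


# Example: a 1.2-point degradation on one original eval set and a 20% relative gain
# on the new domain give o_scale = 0.6, a_werr = 0.2, score = 0.12.
print(selection_score([8.0], [9.2], wer_new_before=30.0, wer_new_after=24.0))
```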
2.2. Adaptation without access to original dataset
During constrained domain adaptation, a candidate solution
must only use the data of the new domain for adaptation, with-
out access to any data from the original domain. It may use
data from the original domain only for evaluation, in order
to determine the severity of degradation on that domain after
adaptation. When applying this constraint, we cannot freely
compute the Coverage (COV) metric [12] since it computes
the difference between fine-tuning and CJT, though we may
still utilize Learning Without Forgetting (LWF) [18] which
distills the model’s knowledge using just the data of the new
domain.
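As a rough illustration of how LWF can be used in this setting, the sketch below distills the frozen original model's output distribution into the adapted model using only new-domain batches. The names (frozen_model, adapted_model, asr_loss, lwf_weight) are our own assumptions, not an API from the paper or from any specific toolkit.

```python
# Hedged sketch of a Learning-Without-Forgetting style auxiliary loss [18]:
# the frozen original model acts as a teacher on the new-domain batch, and a
# KL term keeps the adapted model's outputs close to it. Names are illustrative.
import torch
import torch.nn.functional as F


def lwf_loss(adapted_logits: torch.Tensor,
             frozen_logits: torch.Tensor,
             temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary dimension, averaged over the batch."""
    teacher = F.log_softmax(frozen_logits / temperature, dim=-1)
    student = F.log_softmax(adapted_logits / temperature, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean", log_target=True)


# Illustrative training step:
# with torch.no_grad():
#     frozen_logits = frozen_model(new_domain_batch)   # original model, parameters frozen
# adapted_logits = adapted_model(new_domain_batch)     # model being adapted
# loss = asr_loss + lwf_weight * lwf_loss(adapted_logits, frozen_logits)
```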
3. SPEECH TRANSDUCERS WITH ADAPTERS
Adapters are small sub-networks injected into specific layers
of a pre-trained neural network, such that the original param-
eters of the network are frozen, avoiding gradient updates,
and only the injected parameters of the adapter sub-networks
are updated. Original adapters from [8] are two-layer residual
feed-forward networks with an intermediate activation function, usually ReLU [19]. Optionally, a layer normalization is applied to the input or output of the adapter.
Adapters have also been applied to multilingual ASR [20],
cross-lingual ASR [21], self-supervised ASR [22] and con-
textual ASR [2]. Adapters have been found useful in reducing
Catastrophic Forgetting in ASR [12].
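A minimal PyTorch sketch of such an adapter module is shown below. The bottleneck width, the placement of the layer normalization, and the zero initialization of the up-projection (so the adapter starts as an identity mapping) are illustrative choices, not specifics taken from the paper.

```python
# Sketch of a Houlsby-style residual adapter: LayerNorm -> down-projection ->
# ReLU -> up-projection, added back to the input. Only these parameters are
# trained; the surrounding pre-trained network stays frozen.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    def __init__(self, d_model: int, d_bottleneck: int, use_layer_norm: bool = True):
        super().__init__()
        self.norm = nn.LayerNorm(d_model) if use_layer_norm else nn.Identity()
        self.down = nn.Linear(d_model, d_bottleneck)  # project to a small bottleneck
        self.up = nn.Linear(d_bottleneck, d_model)    # project back to the model width
        nn.init.zeros_(self.up.weight)                # start as a (near) identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(self.norm(x))))  # residual connection


# Usage (illustrative): freeze the pre-trained model and train only the adapters.
# for p in pretrained_model.parameters():
#     p.requires_grad = False
# adapter = Adapter(d_model=512, d_bottleneck=64)
```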
Following a unified framework for adapters proposed
by He et al. [23], we consider three types of adapters for