TRAINING AUTOREGRESSIVE SPEECH RECOGNITION MODELS WITH LIMITED
IN-DOMAIN SUPERVISION
Chak-Fai Li, Francis Keith, William Hartmann, Matthew Snover
Raytheon BBN, Cambridge MA, USA
{chak.fai.li, francis.keith, william.hartmann, matt.snover}@rtx.com
ABSTRACT
Advances in self-supervised learning have significantly reduced the
amount of transcribed audio required for training. However, the ma-
jority of work in this area is focused on read speech. We explore
limited supervision in the domain of conversational speech. While
we assume the amount of in-domain data is limited, we augment the
model with open source read speech data. The XLS-R model has
been shown to perform well with limited adaptation data and serves
as a strong baseline. We use untranscribed data for self-supervised
learning and semi-supervised training in an autoregressive encoder-
decoder model. We demonstrate that by using the XLS-R model for
pseudotranscription, a much smaller autoregressive model can out-
perform a finetuned XLS-R model when transcribed in-domain data
is limited, reducing WER by as much as 8% absolute.
Index Terms: seq2seq, self-supervised learning, semi-supervised training, domain adaptation
1. INTRODUCTION
Recent advances in self-supervised learning (SSL) have led to better
utilization of untranscribed data and reduced reliance on labeled
data. Some work has sought to eliminate the requirement of tran-
scribed data entirely [1]. Pretrained models that can be used directly
or finetuned to new datasets are widely available [2, 3, 4]. When
combined with traditional semi-supervised learning techniques,
these models achieve the state-of-the-art (SOTA) on a number of
datasets.
The vast majority of the work focuses on read speech, where the
standard benchmark is Librispeech [5]. While a single point of com-
parison has been advantageous for the community, there are other
challenging applications of automatic speech recognition (ASR) be-
yond read speech. It is unlikely that large scale self-supervised learn-
ing of read speech is the best way to improve conversational speech
(CS) recognition; previous work has shown that pretraining in the
target domain is more beneficial [6].
In this work we focus on reducing the amount of in-domain
supervision required for autoregressive (AR) ASR models for con-
versational speech. This domain presents unique challenges due to
both the data and the model. For many languages the amount of
transcribed conversational speech is severely limited, and even publicly
available untranscribed speech is scarce. Approaches that require
thousands of hours of audio are therefore difficult to apply.
Autoregressive models tend to require more data, at least par-
tially due to their need to learn an internal language model (LM).
Hybrid and CTC-trained models can be finetuned on extremely small
amounts of data. While initial results may be poor, performance
can be dramatically improved through the inclusion of an external
lexicon and LM. External LMs can also be applied to AR models
through techniques like shallow fusion, cold fusion [7], deep fusion
[8], component fusion [9], and internal language model (ILM) es-
timation [10]. However, the relative improvements from these ap-
proaches are limited compared to non-autoregressive models.
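For illustration, shallow fusion is simply a log-linear interpolation of the decoder's next-token scores with an external LM inside beam search. The short Python sketch below shows the score combination on a toy vocabulary; it is a minimal illustration rather than the configuration used in this work, and the interpolation weight lm_weight and the toy distributions are placeholders.

    import numpy as np

    def shallow_fusion(asr_log_probs, lm_log_probs, lm_weight=0.3):
        # Shallow fusion: log p_ASR(y | x, prefix) + lambda * log p_LM(y | prefix)
        return asr_log_probs + lm_weight * lm_log_probs

    # Toy next-token distributions over a five-word vocabulary, in log space.
    asr_log_probs = np.log(np.array([0.50, 0.20, 0.15, 0.10, 0.05]))
    lm_log_probs = np.log(np.array([0.10, 0.60, 0.15, 0.10, 0.05]))

    # Beam search extends each hypothesis with the highest fused score rather
    # than the raw decoder score, letting the external LM reshape the search.
    best_token = int(np.argmax(shallow_fusion(asr_log_probs, lm_log_probs)))

The other fusion variants cited above modify this combination, for example by training the seq2seq model jointly with a pretrained LM (cold fusion) or by estimating and discounting the decoder's internal LM score (ILM estimation).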
Early work with limited supervised training focused on the stan-
dard pseudolabeling approach to semi-supervised training (SST) [11,
12]. With the advent of deep neural networks and hybrid models,
there was renewed interest in SST. The models could be trained on
orders of magnitude more data [13] and there were advances in both
the selection [14] and use of the data [15]. Recent SST work with
all-neural models includes classic approaches and methods adapted
from image classification. Noisy student training [16], a commonly
used technique, is an iterative pseudolabeling approach with filtering
that incorporates data augmentation on the source side. Instead of it-
eratively updating the pseudotranscripts, they can also be generated
on-the-fly with a continuously updated transcription model [17, 18].
Combining SSL and SST has further pushed the state-of-the-art [19].
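As a concrete reference point for the pseudolabeling loop, the sketch below outlines one round-based variant of noisy student training: the current model transcribes the untranscribed pool, low-confidence hypotheses are filtered out, the retained audio is augmented on the source side, and a new student is trained on labeled plus pseudolabeled data. The helpers (transcribe, confidence, train_fn, augment_fn) are hypothetical stand-ins for a real training stack, not an existing API.

    def noisy_student_training(teacher, labeled_data, untranscribed_audio,
                               train_fn, augment_fn,
                               num_rounds=3, conf_threshold=0.9):
        """Iterative pseudolabeling with confidence filtering and augmentation."""
        model = teacher
        for _ in range(num_rounds):
            # 1. Pseudotranscribe the untranscribed pool with the current model.
            hyps = [(utt, model.transcribe(utt)) for utt in untranscribed_audio]
            # 2. Keep only confident hypotheses (a simple filtering criterion).
            pseudo = [(utt, hyp.text) for utt, hyp in hyps
                      if hyp.confidence >= conf_threshold]
            # 3. Augment the pseudolabeled audio on the source side and train a
            #    fresh student on the union of labeled and pseudolabeled data.
            train_set = list(labeled_data) + [(augment_fn(utt), text)
                                              for utt, text in pseudo]
            model = train_fn(train_set)
        return model

Generating pseudotranscripts on-the-fly, as in [17, 18], amounts to moving the first two steps inside the training loop with a continuously updated transcription model.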
Some recent work has applied these techniques to conversational
speech. In [3], they pretrained a model on a large amount of read
speech and conversational speech from the BABEL program. When
the models were finetuned on 30+ hours of in-domain data and com-
bined with an external LM, they were able to surpass the perfor-
mance of state-of-the-art hybrid models. Later work explored fine-
tuning pretrained models to conversational speech [20]. When the
data was limited, performance was poor compared to a model trained
on the full set of supervised data. The work most similar to ours is
[21]. They start from a model trained only on read speech and at-
tempt to adapt to a conversational dataset through SST. However,
they do not use self-supervised learning and performance on conver-
sational speech is poor compared to a fully supervised model.
We compare the state-of-the-art for three ASR approaches where
the amount of transcribed in-domain data ranges from 68 hours to
nothing. Our contributions include:
• Demonstrating that autoregressive models can outperform state-of-the-art hybrid models and large finetuned XLS-R models in the low-resource conversational speech domain
• Drastically reducing the amount of in-domain supervised data required for autoregressive conversational speech models
• Showing that read speech can be beneficial for conversational speech systems.
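Concretely, the recipe sketched in the abstract, a large pretrained model supplying pseudotranscripts for a much smaller autoregressive model, can be summarized as follows. This is a schematic outline under the assumption of hypothetical helpers (finetune_fn, pseudotranscribe_fn, train_ar_fn); the actual models, data, and training details are described in the remainder of the paper.

    def low_resource_ar_recipe(xlsr_model, labeled_cs_data, untranscribed_cs_audio,
                               finetune_fn, pseudotranscribe_fn, train_ar_fn):
        """Schematic: a finetuned XLS-R teacher labels data for a small AR model."""
        # 1. Adapt the large pretrained SSL model with the limited in-domain
        #    transcripts (XLS-R performs well with little adaptation data).
        teacher = finetune_fn(xlsr_model, labeled_cs_data)
        # 2. Pseudotranscribe the untranscribed conversational audio.
        pseudo_data = [(utt, pseudotranscribe_fn(teacher, utt))
                       for utt in untranscribed_cs_audio]
        # 3. Train a much smaller autoregressive encoder-decoder on the union of
        #    transcribed and pseudotranscribed data.
        return train_ar_fn(list(labeled_cs_data) + pseudo_data)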
2. LIMITED SUPERVISION TRAINING FOR
AUTOREGRESSIVE MODELS
Autoregressive style models require large amounts of training data.
While it is possible to adapt hybrid and CTC-trained models using
limited transcribed data, autoregressive models perform a more com-
plicated task and must rely on their own internal LM. We leverage