TRAINING AUTOREGRESSIVE SPEECH RECOGNITION MODELS WITH LIMITED
IN-DOMAIN SUPERVISION
Chak-Fai Li, Francis Keith, William Hartmann, Matthew Snover
Raytheon BBN, Cambridge MA, USA
{chak.fai.li, francis.keith, william.hartmann, matt.snover}@rtx.com
ABSTRACT
Advances in self-supervised learning have significantly reduced the
amount of transcribed audio required for training. However, the ma-
jority of work in this area is focused on read speech. We explore
limited supervision in the domain of conversational speech. While
we assume the amount of in-domain data is limited, we augment the
model with open source read speech data. The XLS-R model has
been shown to perform well with limited adaptation data and serves
as a strong baseline. We use untranscribed data for self-supervised
learning and semi-supervised training in an autoregressive encoder-
decoder model. We demonstrate that by using the XLS-R model for
pseudotranscription, a much smaller autoregressive model can out-
perform a finetuned XLS-R model when transcribed in-domain data
is limited, reducing WER by as much as 8% absolute.
Index Terms: seq2seq, self-supervised learning, semi-supervised training, domain adaptation
1. INTRODUCTION
Recent advances in self-supervised learning (SSL) have led to better
utilization of untranscribed data and reduced reliance on labeled
data. Some work has sought to eliminate the requirement of tran-
scribed data entirely [1]. Pretrained models that can be used directly
or finetuned to new datasets are widely available [2, 3, 4]. When
combined with traditional semi-supervised learning techniques,
these models achieve the state-of-the-art (SOTA) on a number of
datasets.
The vast majority of the work focuses on read speech, where the
standard benchmark is Librispeech [5]. While a single point of com-
parison has been advantageous for the community, there are other
challenging applications of automatic speech recognition (ASR) be-
yond read speech. It is unlikely that large scale self-supervised learn-
ing of read speech is the best way to improve conversational speech
(CS) recognition; previous work has shown that pretraining in the
target domain is more beneficial [6].
In this work we focus on reducing the amount of in-domain
supervision required for autoregressive (AR) ASR models for con-
versational speech. This domain presents unique challenges due to
both the data and the model. For many languages the amount of
transcribed conversational speech is severely limited, and even publicly
available untranscribed speech is scarce. Approaches that require
thousands of hours of audio are therefore difficult to apply.
Autoregressive models tend to require more data, at least par-
tially due to their need to learn an internal language model (LM).
Hybrid and CTC-trained models can be finetuned on extremely small
amounts of data. While initial results may be poor, performance
can be dramatically improved through the inclusion of an external
lexicon and LM. External LMs can also be applied to AR models
through techniques like shallow fusion, cold fusion [7], deep fusion
[8], component fusion [9], and internal language model (ILM) es-
timation [10]. However, the relative improvements from these ap-
proaches are limited compared to non-autoregressive models.
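For illustration, shallow fusion is simply a log-linear interpolation of the decoder's next-token scores with an external LM inside beam search. The short Python sketch below shows the score combination on a toy vocabulary; it is a minimal illustration rather than the configuration used in this work, and the interpolation weight lm_weight and the toy distributions are placeholders.

    import numpy as np

    def shallow_fusion(asr_log_probs, lm_log_probs, lm_weight=0.3):
        # Shallow fusion: log p_ASR(y | x, prefix) + lambda * log p_LM(y | prefix)
        return asr_log_probs + lm_weight * lm_log_probs

    # Toy next-token distributions over a five-word vocabulary, in log space.
    asr_log_probs = np.log(np.array([0.50, 0.20, 0.15, 0.10, 0.05]))
    lm_log_probs = np.log(np.array([0.10, 0.60, 0.15, 0.10, 0.05]))

    # Beam search extends each hypothesis with the highest fused score rather
    # than the raw decoder score, letting the external LM reshape the search.
    best_token = int(np.argmax(shallow_fusion(asr_log_probs, lm_log_probs)))

The other fusion variants cited above modify this combination, for example by training the seq2seq model jointly with a pretrained LM (cold fusion) or by estimating and discounting the decoder's internal LM score (ILM estimation).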
Early work with limited supervised training focused on the stan-
dard pseudolabeling approach to semi-supervised training (SST) [11,
12]. With the advent of deep neural networks and hybrid models,
there was renewed interest in SST. The models could be trained on
orders of magnitude more data [13] and there were advances in both
the selection [14] and use of the data [15]. Recent SST work with
all-neural models includes classic approaches and methods adapted
from image classification. Noisy student training [16], a commonly
used technique, is an iterative pseudolabeling approach with filtering
that incorporates data augmentation on the source side. Instead of it-
eratively updating the pseudotranscripts, they can also be generated
on-the-fly with a continuously updated transcription model [17, 18].
Combining SSL and SST has further pushed the state-of-the-art [19].
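As a concrete reference point for the pseudolabeling loop, the sketch below outlines one round-based variant of noisy student training: the current model transcribes the untranscribed pool, low-confidence hypotheses are filtered out, the retained audio is augmented on the source side, and a new student is trained on labeled plus pseudolabeled data. The helpers (transcribe, confidence, train_fn, augment_fn) are hypothetical stand-ins for a real training stack, not an existing API.

    def noisy_student_training(teacher, labeled_data, untranscribed_audio,
                               train_fn, augment_fn,
                               num_rounds=3, conf_threshold=0.9):
        """Iterative pseudolabeling with confidence filtering and augmentation."""
        model = teacher
        for _ in range(num_rounds):
            # 1. Pseudotranscribe the untranscribed pool with the current model.
            hyps = [(utt, model.transcribe(utt)) for utt in untranscribed_audio]
            # 2. Keep only confident hypotheses (a simple filtering criterion).
            pseudo = [(utt, hyp.text) for utt, hyp in hyps
                      if hyp.confidence >= conf_threshold]
            # 3. Augment the pseudolabeled audio on the source side and train a
            #    fresh student on the union of labeled and pseudolabeled data.
            train_set = list(labeled_data) + [(augment_fn(utt), text)
                                              for utt, text in pseudo]
            model = train_fn(train_set)
        return model

Generating pseudotranscripts on-the-fly, as in [17, 18], amounts to moving the first two steps inside the training loop with a continuously updated transcription model.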
Some recent work has applied these techniques to conversational
speech. In [3], they pretrained a model on a large amount of read
speech and conversational speech from the BABEL program. When
the models were finetuned on 30+ hours of in-domain data and com-
bined with an external LM, they were able to surpass the perfor-
mance of state-of-the-art hybrid models. Later work explored fine-
tuning pretrained models to conversational speech [20]. When the
data was limited, performance was poor compared to a model trained
on the full set of supervised data. The work most similar to ours is
[21]. They start from a model trained only on read speech and at-
tempt to adapt to a conversational dataset through SST. However,
they do not use self-supervised learning and performance on conver-
sational speech is poor compared to a fully supervised model.
We compare the state-of-the-art for three ASR approaches where
the amount of transcribed in-domain data ranges from 68 hours to
nothing. Our contributions include:
• Demonstrating that autoregressive models can outperform state-of-the-art hybrid models and large finetuned XLS-R models in the low-resource conversational speech domain
• Drastically reducing the amount of in-domain supervised data required for autoregressive conversational speech models
• Showing that read speech can be beneficial for conversational speech systems.
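Concretely, the recipe sketched in the abstract, a large pretrained model supplying pseudotranscripts for a much smaller autoregressive model, can be summarized as follows. This is a schematic outline under the assumption of hypothetical helpers (finetune_fn, pseudotranscribe_fn, train_ar_fn); the actual models, data, and training details are described in the remainder of the paper.

    def low_resource_ar_recipe(xlsr_model, labeled_cs_data, untranscribed_cs_audio,
                               finetune_fn, pseudotranscribe_fn, train_ar_fn):
        """Schematic: a finetuned XLS-R teacher labels data for a small AR model."""
        # 1. Adapt the large pretrained SSL model with the limited in-domain
        #    transcripts (XLS-R performs well with little adaptation data).
        teacher = finetune_fn(xlsr_model, labeled_cs_data)
        # 2. Pseudotranscribe the untranscribed conversational audio.
        pseudo_data = [(utt, pseudotranscribe_fn(teacher, utt))
                       for utt in untranscribed_cs_audio]
        # 3. Train a much smaller autoregressive encoder-decoder on the union of
        #    transcribed and pseudotranscribed data.
        return train_ar_fn(list(labeled_cs_data) + pseudo_data)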
2. LIMITED SUPERVISION TRAINING FOR
AUTOREGRESSIVE MODELS
Autoregressive style models require large amounts of training data.
While it is possible to adapt hybrid and CTC-trained models using
limited transcribed data, autoregressive models perform a more com-
plicated task and must rely on their own internal LM. We leverage