Iterative pseudo-forced alignment by acoustic CTC loss
for self-supervised ASR domain adaptation
Fernando López¹,², Jordi Luque¹
¹Telefónica I+D
²Universidad Autónoma de Madrid
wiliam.lopezgavilanez@telefonica.com, jordi.luque@telefonica.com
Abstract
High-quality data labeling for specific domains is costly and time-consuming. In this work, we propose a self-supervised domain adaptation method based upon an iterative pseudo-forced alignment algorithm. The produced alignments are employed to customize an end-to-end Automatic Speech Recognition (ASR) system and are iteratively refined. The algorithm is fed with frame-wise character posteriors produced by a seed ASR, trained with out-of-domain data and optimized with a Connectionist Temporal Classification (CTC) loss. The alignments are computed iteratively over a corpus of broadcast TV. The process is repeated, reducing the quantity of text to be aligned or expanding the alignment window, until the best possible audio-text alignment is found. The starting timestamps, or temporal anchors, are produced solely from the confidence score of the last aligned utterance. This score is computed from the paths of the CTC-alignment matrix. With this methodology, no human-revised text references are required. Alignments from long audio files with low-quality transcriptions, such as TV captions, are filtered by confidence score and are ready for further ASR adaptation. The results obtained on both the Spanish RTVE2022 and CommonVoice databases underpin the feasibility of using CTC-based systems to perform highly accurate audio-text alignments, domain adaptation, and semi-supervised training of end-to-end ASR.
Index Terms: Forced-alignment, Iterative, Automatic Speech
Recognition, Connectionist Temporal Classification, Speech
Segmentation, Closed-Captions
1. Introduction
The quality and quantity of the data required to train ASR systems are fundamental, even more so for end-to-end ASR systems, which require more data than hybrid Deep Neural Network (DNN) - Hidden Markov Model (HMM) systems to learn acoustic representations [1]. When it comes to medium- and low-resource languages, the maximum exploitation of the available data is sought, given that manual annotations are costly and time-consuming. A common technique is to retrieve audio-to-text alignments from available audio data that has text references from the internet or TV broadcast (e.g. conferences, subtitles). Some alignment methods only work under strict conditions, such as human-revised transcriptions [2]. Nevertheless, in most of the available data the transcription does not exactly match the spoken content, depending on the method used to generate the text (e.g. respeaking for subtitling). Therefore, other methods have explored alignment retrieval with low-quality text references [3, 4, 5, 1]. Some of these systems use a post-filtering process based on a confidence score to overcome mistaken references. Additionally, there is work on unaligned speech and text, but it requires word-level segmentation [6].
More recently, the use of non-labeled data has been explored, either in semi-supervised approaches, with pseudo-labeling of unlabeled audio [7], or in self-supervised approaches, which learn speech representations that are subsequently used in a concrete supervised task, requiring less labeled data to achieve state-of-the-art results [8, 9, 10].
In this work, we train a Spanish-language ASR and adapt it to the TV broadcast domain. Firstly, we fine-tune, only acoustically and through a CTC loss, a model that was formerly trained with non-labeled data from several languages. Then, we use the trained ASR to perform utterance-level alignments on a corpus comprised of Spanish TV broadcast data. For the alignments, we developed an anchor-based algorithm that generates pseudo-forced alignments with an iterative approach. The aligned data are used to continue training the model. Hence, we propose a framework consisting of a multi-step training process that iteratively improves the former ASR performance and adapts it, step by step, to the specific broadcast TV acoustic and language domain.
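To illustrate the alignment step, the following is a minimal sketch in Python of how a pseudo-forced alignment confidence can be derived from frame-wise CTC log-posteriors and used to drive the iterative expand/shrink search. The function names, the threshold, the window sizes, and the exact shrink/expand policy are illustrative assumptions and do not reproduce the exact implementation used in this work.

import numpy as np

def ctc_align_score(log_probs, tokens, blank=0):
    # log_probs: (T, C) array of frame-wise log-posteriors from the acoustic model.
    # tokens: target character ids of the text to align (without blanks).
    # Returns the best forced-alignment log-probability, normalized by the
    # number of frames, used here as the alignment confidence score.
    T = len(log_probs)
    ext = [blank]
    for tok in tokens:                 # interleave blanks: b c1 b c2 ... cN b
        ext += [tok, blank]
    S = len(ext)
    trellis = np.full((T, S), -np.inf)
    trellis[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        trellis[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            best_prev = trellis[t - 1, s]                        # stay in the same state
            if s > 0:
                best_prev = max(best_prev, trellis[t - 1, s - 1])  # advance one state
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                best_prev = max(best_prev, trellis[t - 1, s - 2])  # skip over a blank
            trellis[t, s] = best_prev + log_probs[t, ext[s]]
    final = trellis[T - 1, S - 1]
    if S > 1:
        final = max(final, trellis[T - 1, S - 2])
    return final / T

def align_caption(log_probs, caption_ids, min_conf=-1.5, window=600, step=200):
    # Iterative search: expand the acoustic window and, if that is not enough,
    # reduce the quantity of text to align, until the confidence clears the
    # threshold. The accepted end frame acts as the next temporal anchor.
    text = list(caption_ids)
    end = min(window, len(log_probs))
    while text:
        conf = ctc_align_score(log_probs[:end], text)
        if conf >= min_conf:
            return text, end, conf
        if end < len(log_probs):
            end = min(end + step, len(log_probs))  # expand the alignment window
        else:
            text = text[:-1]                       # drop trailing text and retry
    return None, None, -np.inf

Normalizing the path score by the number of frames makes confidences comparable across windows of different lengths, which is what allows a single threshold to decide whether to accept an alignment, expand the window, or shrink the text.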
2. Related work
2.1. Spanish Speech Recognition
Recent tools and challenges have promoted the improvement of Spanish ASR. To begin with, Hugging Face's Robust Speech Recognition Challenge sought to improve ASR systems in more than 50 languages. The participants used a fine-tuned version of Wav2Vec2-XLS-R (300M, 1B, or 2B parameters) with the CommonVoice database, and decoding based on an n-gram language model (LM). Additionally, any pre- and post-processing system, such as denoising mechanisms or spelling correctors, was allowed. The winner in the Spanish language achieved a Word Error Rate (WER) of 8.82% on the test set of CommonVoice v6.0, reducing it to 6.27% with an LM [11]. Moreover, the Speech-To-Text Albayzin Evaluation consists of automatically transcribing Spanish TV shows; the generated transcripts are compared with human-revised transcriptions. In previous editions, the Machine Learning and Language Processing (MLLP) research group achieved the best results, reaching in 2021 a 16.06% WER on the test split of RTVE2020DB [12].
2.2. Audio-to-Text alignments
Audio-to-text alignments enrich transcriptions with temporal references to the spoken content. They are required in fields such as closed-captioning [13], audiobooks, or database generation for speech technologies [14, 2]. Traditionally, alignments have been performed with HMM-based models and the Viterbi algorithm [15, 3, 2, 14]. Frameworks such as Sphinx [16] and MAUS [17] are based on these technologies. This approach presents many inconveniences [15, 1, 13], such as the assumption that