Iterative pseudo-forced alignment by acoustic CTC loss
for self-supervised ASR domain adaptation
Fernando López1,2, Jordi Luque1
1Telefónica I+D
2Universidad Autónoma de Madrid
wiliam.lopezgavilanez@telefonica.com, jordi.luque@telefonica.com
Abstract
High-quality data labeling for specific domains is costly and time-consuming. In this work, we propose a self-supervised domain adaptation method based upon an iterative pseudo-forced alignment algorithm. The produced alignments are employed to customize an end-to-end Automatic Speech Recognition (ASR) system and are iteratively refined. The algorithm is fed with frame-wise character posteriors produced by a seed ASR, trained with out-of-domain data and optimized through a Connectionist Temporal Classification (CTC) loss. The alignments are computed iteratively upon a corpus of broadcast TV. The process is repeated by reducing the quantity of text to be aligned or expanding the alignment window until the best possible audio-text alignment is found. The starting timestamps, or temporal anchors, are produced based solely on the confidence score of the last aligned utterance. This score is computed from the paths of the CTC-alignment matrix. With this methodology, no human-revised text references are required. Alignments from long audio files with low-quality transcriptions, such as TV captions, are filtered by confidence score and made ready for further ASR adaptation. The obtained results, on both the Spanish RTVE2022 and CommonVoice databases, underpin the feasibility of using CTC-based systems to perform highly accurate audio-text alignments, domain adaptation, and semi-supervised training of end-to-end ASR.
Index Terms: Forced-alignment, Iterative, Automatic Speech
Recognition, Connectionist Temporal Classification, Speech
Segmentation, Closed-Captions
1. Introduction
The quality and quantity of data required to train ASR systems are fundamental. This is even more the case for end-to-end ASR systems, which require more data than hybrid Deep Neural Network (DNN) - Hidden Markov Model (HMM) systems to learn acoustic representations [1]. For medium- and low-resource languages, the maximum exploitation of data is sought, given that manual annotations are costly and time-consuming. A common technique is to retrieve audio-to-text alignments from available audio data that has text references on the internet or TV broadcast (e.g., conferences, subtitles). Some alignment methods only work under strict conditions, such as human-revised transcriptions [2]. Nevertheless, for most of the available data, the transcription does not exactly match the spoken content, depending on the method used to generate the text (e.g., respeaking for subtitling). Therefore, other methods have explored alignment retrieval with low-quality text references [3, 4, 5, 1]. Some of these systems use a post-filtering process based on a confidence score to overcome mistaken references. Additionally, there is work with unaligned speech and text, but it requires word-level segmentation [6].
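The confidence-based post-filtering mentioned above can be illustrated with a minimal sketch. The segment fields and the threshold value below are assumptions for illustration, not taken from any of the cited systems; a real score would come from the aligner itself (e.g., an average per-frame log-probability).

```python
# Illustrative sketch: filtering aligned segments by confidence score.
# Field names and the threshold are hypothetical, chosen for the example.

def filter_by_confidence(segments, threshold=-1.5):
    """Keep only segments whose confidence (here, an assumed average
    log-probability) exceeds the threshold; discard likely-mistaken
    text references."""
    return [seg for seg in segments if seg["confidence"] >= threshold]

segments = [
    {"text": "buenas noches", "start": 0.0, "end": 1.2, "confidence": -0.4},
    {"text": "???",           "start": 1.2, "end": 2.0, "confidence": -3.7},
]
kept = filter_by_confidence(segments)
# The second segment, whose text does not match the audio well, is dropped.
```

The threshold trades off data quantity against label quality: a stricter value keeps fewer but cleaner utterances for subsequent training.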
More recently, the use of non-labeled data has been explored, either in semi-supervised approaches, with pseudo-labeling of unlabeled audio [7], or in self-supervised approaches, learning speech representations that are subsequently used in a concrete (supervised) task, requiring less labeled data to achieve state-of-the-art results [8, 9, 10].
In this work, we train an ASR system for the Spanish language and adapt it to the TV broadcast domain. First, we fine-tune, acoustically only and through a CTC loss, a model that was formerly trained with non-labeled data from several languages. Then, we use the trained ASR to perform utterance-level alignments on a corpus of Spanish TV broadcast data. For the alignments, we developed an anchor-based algorithm that generates pseudo-forced alignments with an iterative approach. The aligned data is used to continue training the model. Hence, we propose a framework consisting of a multi-step training process that iteratively improves the former ASR performance and adapts it, step by step, to the specific broadcast TV acoustic and language domain.
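The iterative, anchor-based alignment described above can be sketched as a loop over caption lines. This is an illustrative reconstruction, not the paper's exact algorithm: the text-reduction step is omitted for brevity, and the aligner, confidence threshold, and window-growth policy are assumptions. In the actual system, the confidence would be derived from the paths of the CTC-alignment matrix of the seed ASR.

```python
def iterative_alignment(captions, align_fn, audio_len,
                        min_conf=0.6, init_window=10.0, window_step=5.0):
    """Align a list of caption strings against a long audio file.

    align_fn(start, end, text) -> (utt_end, confidence) stands in for a
    CTC forced alignment over the window [start, end]. On low confidence
    the window is expanded and the alignment retried; the temporal anchor
    (the next start time) is the end of the last confidently aligned
    utterance.
    """
    anchor, aligned = 0.0, []
    for text in captions:
        window = init_window
        while True:
            end = min(anchor + window, audio_len)
            utt_end, conf = align_fn(anchor, end, text)
            if conf >= min_conf or end >= audio_len:
                break
            window += window_step  # expand the alignment window and retry
        if conf >= min_conf:       # confidence-based filtering
            aligned.append((anchor, utt_end, text, conf))
            anchor = utt_end       # new temporal anchor
    return aligned

# Toy aligner for demonstration: each word is assumed to take 0.5 s, and
# confidence is high only when the window fully covers the utterance.
def toy_align(start, end, text):
    dur = 0.5 * len(text.split())
    covered = (end - start) >= dur
    return start + dur, 0.9 if covered else 0.2

caps = ["buenas noches a todos", "hoy hablamos de alineamiento"]
segs = iterative_alignment(caps, toy_align, audio_len=60.0)
# Each accepted utterance becomes the anchor for the next one.
```

Because each new anchor depends only on the confidence score of the previous utterance, the loop needs no human-revised timestamps: a single bad caption is skipped by the filter rather than derailing the rest of the alignment.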
2. Related work
2.1. Spanish Speech Recognition
Recent tools and challenges have promoted the improvement of Spanish ASR. To begin with, Hugging Face's Robust Speech Recognition Challenge sought to improve ASR systems in more than 50 languages. The participants used a fine-tuned version of Wav2Vec2-XLS-R (300M, 1B, or 2B parameters) with the CommonVoice database, and decoding based on an n-gram language model (LM). Additionally, any pre- and post-processing system, such as denoising mechanisms or spelling correctors, was allowed. The winner in the Spanish language achieved a Word Error Rate (WER) of 8.82% on the test set of CommonVoice v6.0, reducing it to 6.27% with an LM [11]. Moreover, the Speech-To-Text Albayzin Evaluation consists of the automatic transcription of Spanish TV shows. The generated transcripts are compared with human-revised transcriptions. In previous editions, the Machine Learning and Language Processing (MLLP) research group achieved the best results, with a 16.06% WER in 2021 on the test split of RTVE2020DB [12].
2.2. Audio-to-Text alignments
Audio-to-text alignments enrich transcriptions with temporal references to the spoken content. They are required in fields such as closed-captioning [13], audiobooks, and database generation for speech technologies [14, 2]. Traditionally, alignments have been performed with HMM-based models and the Viterbi algorithm [15, 3, 2, 14]. Frameworks such as Sphinx [16] and MAUS [17] are based on these technologies. This approach presents many inconveniences [15, 1, 13], such as the assumption that
arXiv:2210.15226v2 [cs.CL] 15 Jan 2023