Iterative pseudo-forced alignment by acoustic CTC loss
for self-supervised ASR domain adaptation
Fernando López¹,², Jordi Luque¹
¹Telefónica I+D
²Universidad Autónoma de Madrid
wiliam.lopezgavilanez@telefonica.com, jordi.luque@telefonica.com
Abstract
High-quality data labeling for specific domains is costly and time-consuming. In this work, we propose a self-supervised domain adaptation method based upon an iterative pseudo-forced alignment algorithm. The produced alignments are employed to customize an end-to-end Automatic Speech Recognition (ASR) system and are iteratively refined. The algorithm is fed with frame-wise character posteriors produced by a seed ASR, trained with out-of-domain data and optimized with a Connectionist Temporal Classification (CTC) loss. The alignments are computed iteratively over a corpus of broadcast TV. The process is repeated, reducing the quantity of text to be aligned or expanding the alignment window, until the best possible audio-text alignment is found. The starting timestamps, or temporal anchors, are produced solely from the confidence score of the last aligned utterance. This score is computed from the paths of the CTC-alignment matrix. With this methodology, no human-revised text references are required. Alignments from long audio files with low-quality transcriptions, such as TV captions, are filtered by confidence score and are ready for further ASR adaptation. The results obtained on both the Spanish RTVE2022 and CommonVoice databases underpin the feasibility of using CTC-based systems to perform highly accurate audio-text alignments, domain adaptation, and semi-supervised training of end-to-end ASR.
Index Terms: Forced-alignment, Iterative, Automatic Speech
Recognition, Connectionist Temporal Classification, Speech
Segmentation, Closed-Captions
1. Introduction
The quality and quantity of the data required to train ASR systems are fundamental, even more so for end-to-end ASR systems, which require more data than hybrid Deep Neural Network (DNN) - Hidden Markov Model (HMM) systems to learn acoustic representations [1]. When it comes to medium- and low-resource languages, the maximum exploitation of the available data is sought, given that manual annotations are costly and time-consuming. A common technique is to retrieve audio-to-text alignments from available audio data that has text references from the internet or TV broadcast (e.g. conferences, subtitles). Some alignment methods only work under strict conditions, such as human-revised transcriptions [2]. Nevertheless, in most of the available data the transcription does not exactly match the spoken content, depending on the method used to generate the text (e.g. respeaking for subtitling). Therefore, other methods have explored alignment retrieval with low-quality text references [3, 4, 5, 1]. Some of these systems use a post-filtering process based on a confidence score to overcome mistaken references. Additionally, there is work on unaligned speech and text, but it requires word-level segmentation [6].
More recently, the use of non-labeled data has been explored, either in semi-supervised approaches, with pseudo-labeling of unlabeled audio [7], or in self-supervised approaches, which learn speech representations that are subsequently used in a concrete supervised task, requiring less labeled data to achieve state-of-the-art results [8, 9, 10].
In this work, we train a Spanish-language ASR and adapt it to the TV broadcast domain. Firstly, we fine-tune, only acoustically and through a CTC loss, a model that was formerly trained with non-labeled data from several languages. Then, we use the trained ASR to perform utterance-level alignments on a corpus comprised of Spanish TV broadcast data. For the alignments, we developed an anchor-based algorithm that generates pseudo-forced alignments with an iterative approach. The aligned data are used to continue training the model. Hence, we propose a framework consisting of a multi-step training process that iteratively improves the former ASR performance and adapts it, step by step, to the specific broadcast TV acoustic and language domain.
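To illustrate the alignment step, the following is a minimal sketch in Python of how a pseudo-forced alignment confidence can be derived from frame-wise CTC log-posteriors and used to drive the iterative expand/shrink search. The function names, the threshold, the window sizes, and the exact shrink/expand policy are illustrative assumptions and do not reproduce the exact implementation used in this work.

import numpy as np

def ctc_align_score(log_probs, tokens, blank=0):
    # log_probs: (T, C) array of frame-wise log-posteriors from the acoustic model.
    # tokens: target character ids of the text to align (without blanks).
    # Returns the best forced-alignment log-probability, normalized by the
    # number of frames, used here as the alignment confidence score.
    T = len(log_probs)
    ext = [blank]
    for tok in tokens:                 # interleave blanks: b c1 b c2 ... cN b
        ext += [tok, blank]
    S = len(ext)
    trellis = np.full((T, S), -np.inf)
    trellis[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        trellis[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            best_prev = trellis[t - 1, s]                        # stay in the same state
            if s > 0:
                best_prev = max(best_prev, trellis[t - 1, s - 1])  # advance one state
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                best_prev = max(best_prev, trellis[t - 1, s - 2])  # skip over a blank
            trellis[t, s] = best_prev + log_probs[t, ext[s]]
    final = trellis[T - 1, S - 1]
    if S > 1:
        final = max(final, trellis[T - 1, S - 2])
    return final / T

def align_caption(log_probs, caption_ids, min_conf=-1.5, window=600, step=200):
    # Iterative search: expand the acoustic window and, if that is not enough,
    # reduce the quantity of text to align, until the confidence clears the
    # threshold. The accepted end frame acts as the next temporal anchor.
    text = list(caption_ids)
    end = min(window, len(log_probs))
    while text:
        conf = ctc_align_score(log_probs[:end], text)
        if conf >= min_conf:
            return text, end, conf
        if end < len(log_probs):
            end = min(end + step, len(log_probs))  # expand the alignment window
        else:
            text = text[:-1]                       # drop trailing text and retry
    return None, None, -np.inf

Normalizing the path score by the number of frames makes confidences comparable across windows of different lengths, which is what allows a single threshold to decide whether to accept an alignment, expand the window, or shrink the text.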
2. Related work
2.1. Spanish Speech Recognition
Recent tools and challenges have promoted the improvement of Spanish ASR. To begin with, Hugging Face's Robust Speech Recognition Challenge sought to improve ASR systems in more than 50 languages. The participants used a fine-tuned version of Wav2Vec2-XLS-R (300M, 1B, or 2B parameters) with the CommonVoice database, and decoding based on an n-gram language model (LM). Additionally, any pre- and post-processing system, such as denoising mechanisms or spelling correctors, was allowed. The winner in the Spanish language achieved a Word Error Rate (WER) of 8.82% on the test set of CommonVoice v6.0, reducing it to 6.27% with an LM [11]. Moreover, the Speech-To-Text Albayzin Evaluation consists of automatically transcribing Spanish TV shows; the generated transcripts are compared with human-revised transcriptions. In previous editions, the Machine Learning and Language Processing (MLLP) research group achieved the best results, reaching in 2021 a 16.06% WER on the test split of RTVE2020DB [12].
2.2. Audio-to-Text alignments
Audio-to-text alignments enrich transcriptions with temporal references to the spoken content. They are required in fields such as closed-captioning [13], audiobooks, or database generation for speech technologies [14, 2]. Traditionally, alignments have been performed with HMM-based models and the Viterbi algorithm [15, 3, 2, 14]. Frameworks such as Sphinx [16] and MAUS [17] are based on these technologies. This approach presents many inconveniences [15, 1, 13], such as the assumption that