
EFFICIENT UTILIZATION OF LARGE PRE-TRAINED MODELS FOR LOW RESOURCE ASR
Peter Vieting∗1, Christoph Lüscher∗1,2, Julian Dierkes1, Ralf Schlüter1,2, Hermann Ney1,2
1Human Language Technology and Pattern Recognition Group,
Computer Science Department, RWTH Aachen University, 52074 Aachen, Germany
2AppTek GmbH, 52062 Aachen, Germany
{vieting,luescher,schlueter,ney}@cs.rwth-aachen.de
∗equal contribution
ABSTRACT
Unsupervised representation learning has recently helped automatic speech recognition (ASR) to tackle tasks with limited labeled data. Following this, hardware limitations and applications give rise to the question of how to take advantage of large pre-trained models efficiently and reduce their complexity. In this work, we study a challenging low-resource conversational telephony speech corpus from the medical domain in Vietnamese and German. We show the benefits of using unsupervised techniques beyond simple fine-tuning of large pre-trained models, discuss how to adapt them to a practical telephony task including bandwidth transfer, and investigate different data conditions for pre-training and fine-tuning. We outperform the project baselines by 22% relative using pre-training techniques. Further gains of 29% can be achieved by refinements of architecture and training, and 6% by adding 0.8 h of in-domain adaptation data.
Index Terms—speech recognition, medical ASR, unsupervised pre-training
1. INTRODUCTION
The development of ASR systems has come a long way and established remarkable performance, especially on tasks with sufficient training data. However, varying acoustic and recording conditions and speaking styles as well as a lack of sufficient in-domain training data still pose challenges to the development of accurate models [1]. Unsupervised pre-training has recently made it possible to exploit unlabeled audio data, which is available at much lower cost, significantly reducing the need for transcribed data. Additionally, the public availability of pre-trained model checkpoints is appealing to reduce training resource consumption from both an economic and an environmental point of view.
Nevertheless, these models are often very large, requiring cutting-edge hardware both for training and recognition to satisfy the computational and memory requirements. Moreover, application requirements regarding the real-time factor in recognition can be difficult to meet. This gives rise to the question of how to efficiently take advantage of large pre-trained models and how to reduce their complexity in order to meet the demands mentioned above.
Furthermore, despite the feasibility of training ASR systems on very small amounts of labeled data when using pre-trained models, there is certainly room for improvement beyond vanilla fine-tuning of existing models. This paper addresses a challenging real-world low-resource task. Concretely, we use a conversational telephony speech corpus from the medical domain with a very small amount of data in Vietnamese and German. This task constitutes a prime example for the application of pre-trained models while still posing several challenges such as domain shift with respect to the unsupervised models’ training data (conversational speech, acoustic conditions, medical domain), telephony bandwidth, and application requirements on limiting the complexity of models and training.
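To make the bandwidth mismatch concrete: publicly available pre-trained models typically expect 16 kHz audio, whereas telephony recordings are sampled at 8 kHz. The sketch below shows only the simple re-sampling baseline using torchaudio; the file name is a placeholder, and the contributions listed next go beyond this straightforward approach.

```python
import torchaudio

# Illustrative baseline for the bandwidth mismatch: upsample 8 kHz telephony
# audio to the 16 kHz expected by typical pre-trained models.
# "call_recording_8khz.wav" is a placeholder file name.
waveform, sample_rate = torchaudio.load("call_recording_8khz.wav")
assert sample_rate == 8000

resample = torchaudio.transforms.Resample(orig_freq=8000, new_freq=16000)
waveform_16k = resample(waveform)  # upsampled waveform with twice as many samples
```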
This work shows how to exploit large pre-trained models in a practical scenario with limited resources and has contributions along three main lines. The sampling rate mismatch is addressed beyond simple re-sampling by different proposed modifications of the feature extractor. We reduce model sizes and GPU memory footprint by exploiting intermediate representations and applying freezing schemes. Moreover, we study multi-stage pre-training and fine-tuning to address the data conditions and achieve adaptation for the target task.
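As an illustration of the model-size reduction, the following sketch (hypothetical PyTorch code, not the exact setup of this work) truncates a pre-trained Transformer encoder to an intermediate layer and freezes the feature extractor and the lowest blocks; frozen parameters need no gradients or optimizer state, which reduces the GPU memory footprint. The attribute names (feature_extractor, encoder.layers) follow common wav2vec 2.0-style implementations and are assumptions.

```python
import torch

def prepare_for_finetuning(model, keep_layers=8, freeze_layers=4):
    """Shrink and partially freeze a pre-trained encoder (illustrative only).

    Assumes a wav2vec 2.0-style model with `model.feature_extractor` and
    `model.encoder.layers`; these attribute names are hypothetical.
    """
    # Keep only the first `keep_layers` Transformer blocks, i.e. use an
    # intermediate representation instead of the full-depth encoder.
    model.encoder.layers = torch.nn.ModuleList(
        list(model.encoder.layers)[:keep_layers]
    )

    # Freeze the convolutional feature extractor entirely.
    for p in model.feature_extractor.parameters():
        p.requires_grad = False

    # Freeze the lowest Transformer blocks as a simple freezing scheme.
    for block in list(model.encoder.layers)[:freeze_layers]:
        for p in block.parameters():
            p.requires_grad = False

    return model
```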
2. RELATED WORK
Unsupervised approaches have gained popularity since they have shown the potential of achieving high performance with only little annotated data [2]. Initial work applied this method to an ASR task by running unsupervised pre-training on a large unlabeled dataset, followed by a fine-tuning step with a small annotated dataset [3, 4, 5]. This technique can drastically reduce the amount of labeled data that is necessary to build ASR systems. These successes motivated further research into improving the modeling approach [6, 7] and understanding the individual components [8]. Furthermore, the data used for pre-training and fine-tuning was studied, e.g., in a domain-shift scenario [9] or using multilingual data [10].
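For readers unfamiliar with this paradigm, the snippet below sketches the second stage: fine-tuning a publicly released wav2vec 2.0 checkpoint with a CTC loss on a small labeled set, using the Hugging Face transformers library. The checkpoint name and the single-utterance training step are illustrative and do not describe the setup evaluated in this paper.

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Illustrative fine-tuning of a pre-trained wav2vec 2.0 checkpoint.
# The checkpoint name is an example, not the model used in this work.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def finetune_step(waveform, transcript):
    """One gradient step on a single utterance (waveform: 1-D numpy array at 16 kHz)."""
    inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
    labels = processor.tokenizer(transcript, return_tensors="pt").input_ids
    outputs = model(inputs.input_values, labels=labels)  # CTC loss computed internally
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```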
Since the unsupervised loss is computed solely based on