
EFFICIENT UTILIZATION OF LARGE PRE-TRAINED MODELS FOR LOW RESOURCE ASR
Peter Vieting∗1, Christoph Lüscher∗1,2, Julian Dierkes1, Ralf Schlüter1,2, Hermann Ney1,2
1Human Language Technology and Pattern Recognition Group,
Computer Science Department, RWTH Aachen University, 52074 Aachen, Germany
2AppTek GmbH, 52062 Aachen, Germany
{vieting,luescher,schlueter,ney}@cs.rwth-aachen.de
∗equal contribution
ABSTRACT
Unsupervised representation learning has recently helped automatic speech recognition (ASR) to tackle tasks with limited labeled data. Following this, hardware limitations and applications give rise to the question of how to take advantage of large pre-trained models efficiently and reduce their complexity. In this work, we study a challenging low-resource conversational telephony speech corpus from the medical domain in Vietnamese and German. We show the benefits of using unsupervised techniques beyond simple fine-tuning of large pre-trained models, discuss how to adapt them to a practical telephony task including bandwidth transfer, and investigate different data conditions for pre-training and fine-tuning. We outperform the project baselines by 22% relative using pre-training techniques. Further gains of 29% can be achieved by refinements of architecture and training, and 6% by adding 0.8 h of in-domain adaptation data.
Index Terms—speech recognition, medical ASR, unsupervised pre-training
1. INTRODUCTION
The development of ASR systems has come a long way and established remarkable performance, especially on tasks with sufficient training data. However, varying acoustic and recording conditions and speaking styles as well as a lack of sufficient in-domain training data still pose challenges to the development of accurate models [1]. Unsupervised pre-training has recently made it possible to exploit unlabeled audio data, which is available at much lower cost, significantly reducing the need for transcribed data. Additionally, the public availability of pre-trained model checkpoints is appealing to reduce training resource consumption from both an economic and an environmental point of view.
Nevertheless, these models are often very large, requiring cutting-edge hardware both for training and recognition to satisfy the computational and memory requirements. Moreover, application requirements regarding the real-time factor in recognition can be difficult to meet. This gives rise to the question of how to efficiently take advantage of large pre-trained models and how to reduce their complexity in order to meet the demands mentioned above.
Furthermore, despite the feasibility of training ASR systems on very small amounts of labeled data when using pre-trained models, there is certainly room for improvement beyond vanilla fine-tuning of existing models. This paper addresses a challenging real-world low-resource task. Concretely, we use a conversational telephony speech corpus from the medical domain with a very small amount of data in Vietnamese and German. This task constitutes a prime example for the application of pre-trained models while still posing several challenges such as domain shift with respect to the unsupervised models’ training data (conversational speech, acoustic conditions, medical domain), telephony bandwidth, and application requirements on limiting the complexity of models and training.
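To make the bandwidth mismatch concrete: publicly available pre-trained models typically expect 16 kHz audio, whereas telephony recordings are sampled at 8 kHz. The sketch below shows only the simple re-sampling baseline using torchaudio; the file name is a placeholder, and the contributions listed next go beyond this straightforward approach.

```python
import torchaudio

# Illustrative baseline for the bandwidth mismatch: upsample 8 kHz telephony
# audio to the 16 kHz expected by typical pre-trained models.
# "call_recording_8khz.wav" is a placeholder file name.
waveform, sample_rate = torchaudio.load("call_recording_8khz.wav")
assert sample_rate == 8000

resample = torchaudio.transforms.Resample(orig_freq=8000, new_freq=16000)
waveform_16k = resample(waveform)  # upsampled waveform with twice as many samples
```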
This work shows how to exploit large pre-trained models in a practical scenario with limited resources and has contributions along three main lines. The sampling rate mismatch is addressed beyond simple re-sampling by different proposed modifications of the feature extractor. We reduce model sizes and GPU memory footprint by exploiting intermediate representations and applying freezing schemes. Moreover, we study multi-stage pre-training and fine-tuning to address the data conditions and achieve adaptation for the target task.
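As an illustration of the model-size reduction, the following sketch (hypothetical PyTorch code, not the exact setup of this work) truncates a pre-trained Transformer encoder to an intermediate layer and freezes the feature extractor and the lowest blocks; frozen parameters need no gradients or optimizer state, which reduces the GPU memory footprint. The attribute names (feature_extractor, encoder.layers) follow common wav2vec 2.0-style implementations and are assumptions.

```python
import torch

def prepare_for_finetuning(model, keep_layers=8, freeze_layers=4):
    """Shrink and partially freeze a pre-trained encoder (illustrative only).

    Assumes a wav2vec 2.0-style model with `model.feature_extractor` and
    `model.encoder.layers`; these attribute names are hypothetical.
    """
    # Keep only the first `keep_layers` Transformer blocks, i.e. use an
    # intermediate representation instead of the full-depth encoder.
    model.encoder.layers = torch.nn.ModuleList(
        list(model.encoder.layers)[:keep_layers]
    )

    # Freeze the convolutional feature extractor entirely.
    for p in model.feature_extractor.parameters():
        p.requires_grad = False

    # Freeze the lowest Transformer blocks as a simple freezing scheme.
    for block in list(model.encoder.layers)[:freeze_layers]:
        for p in block.parameters():
            p.requires_grad = False

    return model
```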
2. RELATED WORK
Unsupervised approaches have gained popularity since they have shown the potential of achieving high performance with only little annotated data [2]. Initial work applied this method to an ASR task by running unsupervised pre-training on a large unlabeled dataset, followed by a fine-tuning step with a small annotated dataset [3, 4, 5]. This technique can drastically reduce the amount of labeled data that is necessary to build ASR systems. These successes motivated further research into improving the modeling approach [6, 7] and understanding the individual components [8]. Furthermore, the data used for pre-training and fine-tuning was studied, e.g., in a domain-shift scenario [9] or using multilingual data [10].
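For readers unfamiliar with this paradigm, the snippet below sketches the second stage: fine-tuning a publicly released wav2vec 2.0 checkpoint with a CTC loss on a small labeled set, using the Hugging Face transformers library. The checkpoint name and the single-utterance training step are illustrative and do not describe the setup evaluated in this paper.

```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Illustrative fine-tuning of a pre-trained wav2vec 2.0 checkpoint.
# The checkpoint name is an example, not the model used in this work.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def finetune_step(waveform, transcript):
    """One gradient step on a single utterance (waveform: 1-D numpy array at 16 kHz)."""
    inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
    labels = processor.tokenizer(transcript, return_tensors="pt").input_ids
    outputs = model(inputs.input_values, labels=labels)  # CTC loss computed internally
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```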
Since the unsupervised loss is computed solely based on