EFFICIENT UTILIZATION OF LARGE PRE-TRAINED MODELS FOR LOW RESOURCE ASR
Peter Vieting∗1, Christoph Lüscher∗1,2, Julian Dierkes1, Ralf Schlüter1,2, Hermann Ney1,2
1Human Language Technology and Pattern Recognition Group,
Computer Science Department, RWTH Aachen University, 52074 Aachen, Germany
2AppTek GmbH, 52062 Aachen, Germany
{vieting,luescher,schlueter,ney}@cs.rwth-aachen.de
∗Equal contribution
ABSTRACT
Unsupervised representation learning has recently helped automatic speech recognition (ASR) to tackle tasks with limited labeled data. Following this, hardware limitations and applications give rise to the question of how to take advantage of large pre-trained models efficiently and reduce their complexity. In this work, we study a challenging low-resource conversational telephony speech corpus from the medical domain in Vietnamese and German. We show the benefits of using unsupervised techniques beyond simple fine-tuning of large pre-trained models, discuss how to adapt them to a practical telephony task including bandwidth transfer, and investigate different data conditions for pre-training and fine-tuning. We outperform the project baselines by 22% relative using pre-training techniques. Further gains of 29% can be achieved by refinements of architecture and training, and 6% by adding 0.8 h of in-domain adaptation data.
Index Terms— speech recognition, medical ASR, unsupervised pre-training
1. INTRODUCTION
The development of ASR systems has come a long way and established remarkable performance, especially on tasks with sufficient training data. However, varying acoustic and recording conditions and speaking styles as well as a lack of sufficient in-domain training data still pose challenges to the development of accurate models [1]. Unsupervised pre-training has recently made it possible to exploit unlabeled audio data, which is available at much lower cost, significantly reducing the need for transcribed data. Additionally, the public availability of pre-trained model checkpoints is appealing to reduce training resource consumption from both an economic and an environmental point of view.
Nevertheless, these models are often very large, requiring cutting-edge hardware both for training and recognition to satisfy the computational and memory requirements. Moreover, application requirements regarding the real time factor in recognition can be difficult to meet. This gives rise to the question of how to efficiently take advantage of large pre-trained models and how to reduce their complexity in order to meet the demands mentioned above.
Furthermore, despite the feasibility of training ASR systems on very small amounts of labeled data when using pre-trained models, there is certainly room for improvement beyond vanilla fine-tuning of existing models. This paper addresses a challenging real-world low-resource task. Concretely, we use a conversational telephony speech corpus from the medical domain with very small amounts of data in Vietnamese and German. This task constitutes a prime example for the application of pre-trained models while still posing several challenges, such as a domain shift with respect to the unsupervised models' training data (conversational speech, acoustic conditions, medical domain), telephony bandwidth, and application requirements limiting the complexity of models and training.
This work shows how to exploit large pre-trained models in a practical scenario with limited resources and makes contributions along three main lines. The sampling rate mismatch is addressed beyond simple re-sampling by different proposed modifications of the feature extractor. We reduce model sizes and GPU memory footprint by exploiting intermediate representations and applying freezing schemes. Moreover, we study multi-stage pre-training and fine-tuning to address the data conditions and achieve adaptation for the target task.
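As a rough illustration of the first two points, the following is a minimal sketch and not the recipe used in this work: it assumes the HuggingFace transformers and torchaudio libraries, an XLSR-53 checkpoint as a placeholder, and an arbitrary choice of frozen blocks and intermediate layer. Telephony audio is upsampled from 8 kHz to the 16 kHz expected by the pre-trained feature extractor, the convolutional front-end and the lower transformer blocks are frozen, and an intermediate layer is read out instead of the topmost one.

import torch
import torchaudio
from transformers import Wav2Vec2Model

# Load a publicly available pre-trained checkpoint (illustrative choice).
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53")

# Baseline handling of the bandwidth mismatch: upsample 8 kHz telephony
# audio to the 16 kHz sampling rate the feature extractor was trained on.
waveform_8k = torch.randn(1, 8000)  # placeholder for 1 s of telephony audio
waveform_16k = torchaudio.functional.resample(waveform_8k, orig_freq=8000, new_freq=16000)

# Freezing scheme: the convolutional front-end and the lower half of the
# transformer blocks receive no gradients during fine-tuning.
for p in model.feature_extractor.parameters():
    p.requires_grad = False
for block in model.encoder.layers[:12]:  # assumed split; XLSR-53 has 24 blocks
    for p in block.parameters():
        p.requires_grad = False

# Exploit an intermediate representation instead of the topmost layer to
# reduce the effective depth (and memory footprint) at recognition time.
outputs = model(waveform_16k, output_hidden_states=True)
intermediate = outputs.hidden_states[12]  # (batch, frames, hidden_dim)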
2. RELATED WORK
Unsupervised approaches have gained popularity since they have shown the potential for high performance with only little annotated data [2]. Initial work applied this method to an ASR task by running unsupervised pre-training on a large unlabeled dataset, followed by a fine-tuning step with a small annotated dataset [3, 4, 5]. This technique can drastically reduce the amount of labeled data that is necessary to build ASR systems. The successes motivated further research into improving the modeling approach [6, 7] and understanding the individual components [8]. Furthermore, the data used for pre-training and fine-tuning was studied, e.g., in a domain-shift scenario [9] or using multilingual data [10].
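To make the pre-train-then-fine-tune recipe concrete, the sketch below computes the supervised CTC fine-tuning loss on top of a pre-trained wav2vec 2.0 model for a single dummy utterance using the HuggingFace transformers library; it illustrates the general recipe, not the specific systems built in this work. The checkpoint name and transcript are placeholders, and a real setup would start from a pre-trained-only checkpoint with a task-specific vocabulary.

import numpy as np
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Pre-trained encoder plus a CTC output layer for supervised fine-tuning
# (checkpoint chosen only because it ships with a ready-made vocabulary).
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# One labeled example: 16 kHz audio and its transcription.
audio = np.random.randn(16000).astype(np.float32)  # placeholder utterance
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer("HELLO WORLD", return_tensors="pt").input_ids

# The forward pass returns the CTC loss; gradients from this loss update
# the (possibly partially frozen) pre-trained network during fine-tuning.
loss = model(inputs.input_values, labels=labels).loss
loss.backward()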
Since the unsupervised loss is computed solely based on