The NPU-ASLP System for the ISCSLP 2022 Magichub Code-Switching ASR Challenge
Yuhao Liang, Peikun Chen, Fan Yu, Xinfa Zhu, Tianyi Xu, Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science,
Northwestern Polytechnical University, Xi’an, China
liangyuhao@mail.nwpu.edu.cn, lxie@nwpu.edu.cn
Abstract
This paper describes our NPU-ASLP system submitted to the ISCSLP 2022 Magichub Code-Switching ASR Challenge. In this challenge, we first explore several popular end-to-end ASR architectures and training strategies, including the bi-encoder, the language-aware encoder (LAE) and the mixture of experts (MoE). To improve our system's language modeling ability, we further attempt an internal language model as well as a long-context language model. Given the limited training data in the challenge, we also investigate the effects of data augmentation, including speed perturbation, pitch shifting, speech codec augmentation, SpecAugment and synthetic data from text-to-speech (TTS). Finally, we explore ROVER-based score fusion to make full use of complementary hypotheses from different models. Our submitted system achieves a 16.87% mix error rate (MER) on the test set and ranks 2nd in the challenge.
Index Terms: Automatic Speech Recognition, Code-Switching, Data Augmentation
1. Introduction
Code-switching occurs when a speaker alternates between two or more languages. With fast globalization and frequent cultural exchange, code-switching has become a common language phenomenon that poses significant challenges to speech and language processing tasks, including automatic speech recognition (ASR). Code-switching may occur in the middle of a sentence (intra-sentential) or at sentence boundaries (inter-sentential), and the former is considered more difficult for a speech recognizer. To promote reproducible research on Mandarin-English code-switching ASR, ISCSLP 2022 has specifically held the Magichub Code-Switching ASR challenge1, which provides a sizeable corpus and a common test bed to benchmark code-switching ASR performance.
Code-switching ASR has been explored since the era of the conventional hybrid ASR paradigm [1], and progress has been advanced by several challenges specifically focusing on the code-switching phenomenon [2, 3, 4]. With recent advances in deep learning, neural end-to-end (E2E) frameworks, such as the attention-based encoder-decoder (AED) [5, 6] and the neural transducer [7], have emerged as the mainstream for ASR, offering a simplified system-building pipeline and substantial performance improvements. However, modeling multiple languages simultaneously in a unified neural architecture is non-trivial because different languages (e.g., Mandarin and English) differ significantly in many aspects, including modeling units and manner of articulation.
Recently, language-expert modules [8, 9, 10] were proposed to model different languages with separate parameters in multilingual or cross-lingual settings. They capture language-specific knowledge effectively and mitigate the overfitting caused by the scarcity of code-switching data. Specifically, network parameters were decomposed into language-specific parts (or experts) in a bi-encoder structure, where each transformer encoder represents one language (i.e., Mandarin or English) [8, 9].
1 https://magichub.com/competition/code-switching-asr-challenge
Meanwhile, the bi-encoder architecture can effectively leverage rich monolingual data from both languages, but due to the lack of interaction between the separated encoders, the language-common feature space is ignored. The language-aware encoder (LAE) [9] was therefore proposed to address this problem by sharing the preliminary blocks before the language-specific experts, so that both language-specific and language-common features are modeled efficiently. Instead of sharing only the preliminary blocks, the mixture of experts (MoE) [10] was designed to share the majority of parameters, which may learn more language-common features and be better suited to limited training data.
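The parameter-sharing pattern behind the LAE can be sketched very schematically in plain NumPy. This is not the actual ESPnet/WeNet implementation: the real encoders stack conformer blocks, whereas the Linear blocks, layer counts and dimensions below are placeholder assumptions chosen only to show the shared-bottom / per-language-expert split.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class Linear:
    """A single dense layer standing in for a transformer/conformer block."""
    def __init__(self, d_in, d_out):
        self.w = rng.standard_normal((d_in, d_out)) * 0.1
        self.b = np.zeros(d_out)

    def __call__(self, x):
        return relu(x @ self.w + self.b)

class LAEEncoder:
    """Shared preliminary blocks followed by per-language expert blocks."""
    def __init__(self, d_model=16, n_shared=2, n_expert=2):
        self.shared = [Linear(d_model, d_model) for _ in range(n_shared)]
        self.experts = {
            "zh": [Linear(d_model, d_model) for _ in range(n_expert)],
            "en": [Linear(d_model, d_model) for _ in range(n_expert)],
        }

    def __call__(self, x):
        # Language-common representation from the shared bottom blocks.
        for blk in self.shared:
            x = blk(x)
        # Language-specific representations from each expert branch.
        outs = {}
        for lang, blocks in self.experts.items():
            h = x
            for blk in blocks:
                h = blk(h)
            outs[lang] = h
        # Concatenate expert outputs for the downstream decoder/CTC layer.
        return np.concatenate([outs["zh"], outs["en"]], axis=-1)

enc = LAEEncoder()
frames = rng.standard_normal((10, 16))  # (time, feature)
out = enc(frames)
print(out.shape)  # (10, 32)
```

A bi-encoder corresponds to the degenerate case with no shared blocks at all; the MoE variant instead pushes most blocks into the shared stack and keeps the expert branches thin.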
Another difficulty is data sparsity. Since a language switch can occur anywhere in an utterance in the more difficult intra-sentential case, it is hard to collect enough code-switching data, and predicting the switching position is rather difficult.
To overcome this problem, data augmentation is a feasible solution, including text-to-speech (TTS) augmentation and text data augmentation. Note that using synthetic data directly often brings negligible gain or even misguides the ASR system because of the mismatch between synthetic and real data. To make better use of synthetic data, additional loss functions [11, 12] and filtering strategies [13, 14] were proposed to enforce the consistency of hypothesized labels between real and synthetic data. For text augmentation, a machine translation model is usually adopted to expand the original code-switching text [15].
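One common way to enforce such consistency can be sketched as a combined objective: an ASR loss on both real and synthetic utterances of the same text, plus a penalty on the divergence between their posterior distributions. The exact loss in [11] differs; the KL-based formulation, the weight lam and the toy cross-entropy below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def cross_entropy(logp, labels):
    """Token-level ASR loss: mean negative log-probability of the labels."""
    return -np.mean(logp[np.arange(len(labels)), labels])

def consistency_loss(real_logits, synth_logits, labels, lam=0.5):
    """ASR loss on real + synthetic utterances of the same transcript,
    plus lam * KL(real posterior || synthetic posterior) per frame."""
    logp_r = log_softmax(real_logits)
    logp_s = log_softmax(synth_logits)
    asr = cross_entropy(logp_r, labels) + cross_entropy(logp_s, labels)
    kl = np.sum(np.exp(logp_r) * (logp_r - logp_s), axis=-1).mean()
    return asr + lam * kl

# Toy example: 2 frames, 3 output tokens.
labels = np.array([0, 1])
real = np.array([[2.0, 0.0, 0.0],
                 [0.0, 2.0, 0.0]])
synth = real * 0.5  # flatter synthetic posteriors -> positive KL penalty
loss = consistency_loss(real, synth, labels)
```

The KL term is zero when real and synthetic posteriors agree, so the penalty only fires where the synthetic data diverges from the real distribution.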
In this challenge, we approach Mandarin-English code-switching ASR by exploring both multilingual neural architectures and data augmentation. Specifically, we study the bi-encoder, LAE and MoE architectures reviewed above under the popular Conformer-based AED framework, implemented with two popular ASR toolkits, ESPnet [16] and WeNet [17]. We apply various data augmentation methods, including speed perturbation, pitch shifting, audio codec augmentation, spectrum augmentation (SpecAugment) as well as text-to-speech augmentation. For TTS augmentation in particular, a consistency loss [11] proves effective in mitigating the mismatch between the distributions of real and synthetic data. We further explore the effectiveness of language modeling, including both an internal language model and a long-context language model [18]. Finally, ROVER [19] is adopted to fuse multiple hypotheses from various models, which has previously proven effective [20, 21, 22]. Our fused system achieves a 16.87% MER on the test set and ranks 2nd in the challenge.
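The voting step of ROVER can be sketched as follows. Real ROVER first builds a word transition network by iterative dynamic-programming alignment of the hypotheses (handling insertions and deletions, optionally weighting votes by confidence); the sketch assumes the hypotheses are already position-aligned, with `<eps>` marking a null arc, and shows only majority voting.

```python
from collections import Counter

def rover_vote(hypotheses):
    """Majority vote over position-aligned hypotheses (alignment assumed done)."""
    fused = []
    for slot in zip(*hypotheses):
        word, _ = Counter(slot).most_common(1)[0]
        if word != "<eps>":  # a winning null arc emits nothing
            fused.append(word)
    return fused

# Three aligned Mandarin-English hypotheses; systems disagree on two slots.
hyps = [
    ["我", "want", "一个", "apple"],
    ["我", "want", "<eps>", "apple"],
    ["你", "want", "一个", "apple"],
]
print(rover_vote(hyps))  # ['我', 'want', '一个', 'apple']
```

Because errors of different models tend to land in different slots, the per-slot vote can outperform every individual system, which is what makes fusing complementary hypotheses worthwhile.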
arXiv:2210.14448v1 [cs.SD] 26 Oct 2022