The NPU-ASLP System for the ISCSLP 2022 Magichub Code-Switching ASR Challenge
Yuhao Liang, Peikun Chen, Fan Yu, Xinfa Zhu, Tianyi Xu, Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science,
Northwestern Polytechnical University, Xi’an, China
liangyuhao@mail.nwpu.edu.cn, lxie@nwpu.edu.cn
Abstract
This paper describes our NPU-ASLP system submitted to the ISCSLP 2022 Magichub Code-Switching ASR Challenge. In this challenge, we first explore several popular end-to-end ASR architectures and training strategies, including the bi-encoder, the language-aware encoder (LAE) and the mixture of experts (MoE). To improve our system's language modeling ability, we further attempt an internal language model as well as a long-context language model. Given the limited training data in the challenge, we also investigate the effects of data augmentation, including speed perturbation, pitch shifting, speech codec augmentation, SpecAugment and synthetic data from text-to-speech (TTS). Finally, we explore ROVER-based score fusion to make full use of complementary hypotheses from different models. Our submitted system achieves a 16.87% mix error rate (MER) on the test set and ranks 2nd in the challenge.
Index Terms: Automatic Speech Recognition, Code-Switching, Data Augmentation
1. Introduction
Code-switching occurs when a speaker alternates between two or more languages. With fast globalization and frequent cultural exchange, code-switching has become a common language phenomenon that poses significant challenges to speech and language processing tasks, including automatic speech recognition (ASR). Code-switching may occur in the middle of a sentence (intra-sentential) or at sentence boundaries (inter-sentential), and the former is considered more difficult for a speech recognizer. To promote reproducible research on Mandarin-English code-switching ASR, ISCSLP 2022 has specifically held the Magichub Code-Switching ASR challenge1, which provides a sizeable corpus and a common test bed to benchmark code-switching ASR performance.
Code-switching ASR has been explored since the era of the conventional hybrid ASR paradigm [1], and progress has been advanced by several challenges specifically focusing on the code-switching phenomenon [2, 3, 4]. With recent advances in deep learning, neural end-to-end (E2E) frameworks, such as the attention-based encoder-decoder (AED) [5, 6] and the neural transducer [7], have emerged as the mainstream for ASR, offering a simplified system-building pipeline and substantial performance improvements. However, modeling multiple languages simultaneously in a unified neural architecture is non-trivial because different languages (e.g., Mandarin and English) differ significantly in many aspects, including modeling units and manner of articulation.
Recently, language-expert modules [8, 9, 10] were proposed to model different languages with separate parameters in multilingual or cross-lingual settings. They capture language-specific knowledge effectively and mitigate the overfitting caused by the scarcity of code-switching data. Specifically, network parameters were decomposed into language-specific parts (or experts) in a bi-encoder structure, where each transformer encoder represents one language (i.e., Mandarin or English) [8, 9].
1 https://magichub.com/competition/code-switching-asr-challenge
Meanwhile, the bi-encoder architecture can effectively leverage rich monolingual data from both languages, but due to the lack of interaction between the separated encoders, the language-common feature space is ignored. The language-aware encoder (LAE) [9] was therefore proposed to address this problem by sharing the preliminary blocks before the language-specific experts, so that both language-specific and language-common features are modeled efficiently. Instead of sharing only the preliminary blocks, the mixture of experts (MoE) [10] was designed to share the majority of parameters, which may learn more language-common features and be better suited to limited training data.
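The parameter-sharing pattern behind the LAE can be sketched very schematically in plain NumPy. This is not the actual ESPnet/WeNet implementation: the real encoders stack conformer blocks, whereas the Linear blocks, layer counts and dimensions below are placeholder assumptions chosen only to show the shared-bottom / per-language-expert split.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class Linear:
    """A single dense layer standing in for a transformer/conformer block."""
    def __init__(self, d_in, d_out):
        self.w = rng.standard_normal((d_in, d_out)) * 0.1
        self.b = np.zeros(d_out)

    def __call__(self, x):
        return relu(x @ self.w + self.b)

class LAEEncoder:
    """Shared preliminary blocks followed by per-language expert blocks."""
    def __init__(self, d_model=16, n_shared=2, n_expert=2):
        self.shared = [Linear(d_model, d_model) for _ in range(n_shared)]
        self.experts = {
            "zh": [Linear(d_model, d_model) for _ in range(n_expert)],
            "en": [Linear(d_model, d_model) for _ in range(n_expert)],
        }

    def __call__(self, x):
        # Language-common representation from the shared bottom blocks.
        for blk in self.shared:
            x = blk(x)
        # Language-specific representations from each expert branch.
        outs = {}
        for lang, blocks in self.experts.items():
            h = x
            for blk in blocks:
                h = blk(h)
            outs[lang] = h
        # Concatenate expert outputs for the downstream decoder/CTC layer.
        return np.concatenate([outs["zh"], outs["en"]], axis=-1)

enc = LAEEncoder()
frames = rng.standard_normal((10, 16))  # (time, feature)
out = enc(frames)
print(out.shape)  # (10, 32)
```

A bi-encoder corresponds to the degenerate case with no shared blocks at all; the MoE variant instead pushes most blocks into the shared stack and keeps the expert branches thin.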
Another difficulty is data sparsity. Since a language switch can occur anywhere in an utterance in the more difficult intra-sentential case, it is hard to collect enough code-switching data, and predicting the switching position is rather difficult.
To overcome this problem, data augmentation is a feasible solution, including text-to-speech (TTS) augmentation and text data augmentation. Note that using synthetic data directly often brings negligible gain or even misguides the ASR system because of the mismatch between synthetic and real data. To make better use of synthetic data, additional loss functions [11, 12] and filtering strategies [13, 14] were proposed to enforce the consistency of hypothesized labels between real and synthetic data. For text augmentation, a machine translation model is usually adopted to expand the original code-switching text [15].
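One common way to enforce such consistency can be sketched as a combined objective: an ASR loss on both real and synthetic utterances of the same text, plus a penalty on the divergence between their posterior distributions. The exact loss in [11] differs; the KL-based formulation, the weight lam and the toy cross-entropy below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def cross_entropy(logp, labels):
    """Token-level ASR loss: mean negative log-probability of the labels."""
    return -np.mean(logp[np.arange(len(labels)), labels])

def consistency_loss(real_logits, synth_logits, labels, lam=0.5):
    """ASR loss on real + synthetic utterances of the same transcript,
    plus lam * KL(real posterior || synthetic posterior) per frame."""
    logp_r = log_softmax(real_logits)
    logp_s = log_softmax(synth_logits)
    asr = cross_entropy(logp_r, labels) + cross_entropy(logp_s, labels)
    kl = np.sum(np.exp(logp_r) * (logp_r - logp_s), axis=-1).mean()
    return asr + lam * kl

# Toy example: 2 frames, 3 output tokens.
labels = np.array([0, 1])
real = np.array([[2.0, 0.0, 0.0],
                 [0.0, 2.0, 0.0]])
synth = real * 0.5  # flatter synthetic posteriors -> positive KL penalty
loss = consistency_loss(real, synth, labels)
```

The KL term is zero when real and synthetic posteriors agree, so the penalty only fires where the synthetic data diverges from the real distribution.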
In this challenge, we approach Mandarin-English code-switching ASR by exploring both multilingual neural architectures and data augmentation. Specifically, we study the bi-encoder, LAE and MoE architectures reviewed above under the popular Conformer-based AED framework, implemented with two popular ASR toolkits, ESPnet [16] and WeNet [17]. We apply various data augmentation methods, including speed perturbation, pitch shifting, audio codec augmentation, spectrum augmentation (SpecAugment) as well as text-to-speech augmentation. For TTS augmentation in particular, a consistency loss [11] proves effective in mitigating the mismatch between the distributions of real and synthetic data. We further explore the effectiveness of language modeling, including both an internal language model and a long-context language model [18]. Finally, ROVER [19] is adopted to fuse multiple hypotheses from various models, which has previously proven effective [20, 21, 22]. Our fused system achieves a 16.87% MER on the test set and ranks 2nd in the challenge.
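The voting step of ROVER can be sketched as follows. Real ROVER first builds a word transition network by iterative dynamic-programming alignment of the hypotheses (handling insertions and deletions, optionally weighting votes by confidence); the sketch assumes the hypotheses are already position-aligned, with `<eps>` marking a null arc, and shows only majority voting.

```python
from collections import Counter

def rover_vote(hypotheses):
    """Majority vote over position-aligned hypotheses (alignment assumed done)."""
    fused = []
    for slot in zip(*hypotheses):
        word, _ = Counter(slot).most_common(1)[0]
        if word != "<eps>":  # a winning null arc emits nothing
            fused.append(word)
    return fused

# Three aligned Mandarin-English hypotheses; systems disagree on two slots.
hyps = [
    ["我", "want", "一个", "apple"],
    ["我", "want", "<eps>", "apple"],
    ["你", "want", "一个", "apple"],
]
print(rover_vote(hyps))  # ['我', 'want', '一个', 'apple']
```

Because errors of different models tend to land in different slots, the per-slot vote can outperform every individual system, which is what makes fusing complementary hypotheses worthwhile.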
arXiv:2210.14448v1 [cs.SD] 26 Oct 2022