REDUCING LANGUAGE CONFUSION FOR CODE-SWITCHING SPEECH RECOGNITION
WITH TOKEN-LEVEL LANGUAGE DIARIZATION
Hexin Liu1,2, Haihua Xu1, Leibny Paola Garcia3, Andy W. H. Khong2, Yi He1, Sanjeev Khudanpur3
1Bytedance AI Lab
2School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
3CLSP and HLT-COE, Johns Hopkins University, USA
ABSTRACT
Code-switching (CS) refers to the phenomenon in which lan-
guages switch within a speech signal, leading to language
confusion for automatic speech recognition (ASR). This pa-
per aims to address language confusion and thereby improve
CS-ASR from two perspectives: incorporating and disentangling
language information. We incorporate language information
in the CS-ASR model by dynamically biasing the model
with token-level language posteriors which are outputs of a
sequence-to-sequence auxiliary language diarization module.
In contrast, the disentangling process reduces the differences
between the two languages via adversarial training so as to
normalize them. We conduct experiments on the SEAME
dataset. Compared to the baseline model, both joint opti-
mization with language diarization (LD) and the language
posterior bias yield performance improvements. The comparison of the proposed
methods indicates that incorporating language information
is more effective than disentangling for reducing language
confusion in CS speech.
Index Terms—code-switching, automatic speech recog-
nition, token, language diarization, language posterior
1. INTRODUCTION
Code-switching (CS) refers to the switching of languages
within a spontaneous multilingual recording. Although exist-
ing automatic speech recognition (ASR) methods have been
shown to achieve good performance on monolingual speech [1, 2],
CS-ASR is still a challenge due to language confusion arising
from code switches and the lack of annotated data.
Language information is often incorporated into CS-ASR
models to tackle challenges associated with language confu-
sion. In [3], language identification (LID) serves as an aux-
iliary task which enriches the shared encoder with language
information. A bi-encoder transformer network was proposed
in [4], where two encoders are pre-trained on monolingual
data independently to decouple the modeling of Mandarin and
English to capture language-specific information. This
dual-encoder CS-ASR approach has been shown to be effective, and
several methods were subsequently proposed based on this
framework [5, 6]. A language-specific attention mechanism
has also been proposed to reduce multilingual contextual in-
formation for a transformer encoder-decoder CS-ASR model
[7, 8]. In this approach, monolingual token embeddings are
separated from code-switching token sequences before being
fed into their respective self-attention modules within the de-
coder layers.
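As a toy illustration of the separation step described above, the snippet below splits a code-switched token sequence into monolingual subsequences given per-token language labels. The function and label names are hypothetical; the actual mechanism of [7, 8] operates on token embeddings inside the decoder's self-attention modules rather than on surface tokens.

```python
def split_by_language(tokens, lang_labels):
    """Separate a code-switched token sequence into monolingual
    subsequences, one per language, using per-token language labels."""
    mandarin = [t for t, lab in zip(tokens, lang_labels) if lab == "zh"]
    english = [t for t, lab in zip(tokens, lang_labels) if lab == "en"]
    return mandarin, english

# Example: a Mandarin-English code-switched token sequence.
tokens = ["我", "想", "book", "a", "ticket"]
labels = ["zh", "zh", "en", "en", "en"]
zh_seq, en_seq = split_by_language(tokens, labels)
```

Each monolingual subsequence would then be fed to its own self-attention module, which is precisely why cross-lingual context shared between the two subsequences is lost.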
It is useful to note that the dual-encoder approach, in
general, performs LID at frame-level units—frame-level LID
outputs are assigned to the outputs of language-specific en-
coders before the weighted sum in the mixture-of-experts
interpolation process. The frame-level LID, however, is not
desirable since the LID performance generally degrades with
shorter speech signals [9, 10]. In addition, CS can be regarded
as a speaker-dependent phenomenon [11], where languages
within a CS speech signal share information such as the ac-
cent and discourse markers. Therefore, the language-specific
attention mechanism would lead to cross-lingual informa-
tion loss while learning monolingual information. Given the
nature of languages and their transitions in multilingual
recordings, modeling language identity at the coarser
token level would be more appropriate for CS-ASR.
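A minimal sketch of the frame-level mixture-of-experts interpolation discussed above, assuming two language-specific encoders whose per-frame outputs are combined using frame-level LID posteriors as weights. All names and the plain-list representation are illustrative, not the implementation of [4].

```python
def moe_interpolate(h_zh, h_en, lid_post):
    """Frame-level mixture-of-experts interpolation (sketch).
    h_zh, h_en: per-frame hidden vectors from the Mandarin and English
                encoders (T frames x D dims, as nested lists).
    lid_post:   per-frame LID posterior of Mandarin (length T)."""
    fused = []
    for hz, he, p in zip(h_zh, h_en, lid_post):
        fused.append([p * z + (1.0 - p) * e for z, e in zip(hz, he)])
    return fused

# Two frames with 2-dim hidden states: frame 1 is confidently Mandarin,
# frame 2 is ambiguous between the two languages.
fused = moe_interpolate([[1.0, 0.0], [1.0, 0.0]],
                        [[0.0, 1.0], [0.0, 1.0]],
                        [1.0, 0.5])
```

The sketch makes the weakness concrete: each weight `p` is estimated from a single frame of speech, exactly the short-duration regime in which LID accuracy degrades [9, 10].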
Language diarization (LD), as a special case of LID,
involves partitioning a code-switching speech signal into
homogeneous segments before determining their language
identities [12, 13]. In our work, LD is reformulated into a
sequence-to-sequence task similar to that of ASR to capture
token-level language information. Inspired by the success of
utterance-level one-hot language vectors for multilingual ASR
[14, 15], we propose to reduce language confusion within CS
speech by supplementing the token embeddings with their
respective soft language labels—token-level language pos-
teriors predicted by the LD module—before feeding these
embeddings into the ASR decoder. Since the two languages in
a CS scenario can be acoustically similar to each other owing
to the accent and tone of the bilingual speaker, language posteri-
ors are expected to convey more language information than
one-hot language label vectors. Moreover, to explore the
effect of language information for CS-ASR, we also propose
a second technique to disentangle the language information
arXiv:2210.14567v1 [eess.AS] 26 Oct 2022