
Pronunciation generation for foreign language words with few driving materials usually performs poorly. To address this imbalance in driving materials, we further propose an internal assistance strategy in which words with sufficient materials help words with scarce materials, improving the overall pronunciation quality of the seed lexicon.
The rest of this paper is organized as follows. Section 2 summarizes related work on retraining-free and phonetic decoding methods for CS. Section 3 describes the candidate generation and selection methods for seed word pronunciations. Section 4 introduces the pronunciation prediction work, which covers building the seed lexicon, the architecture of the Transformer G2P model, and a novel internal assistance method to improve the seed lexicon. Section 5 gives the detailed experimental configuration and results. Section 6 summarizes the advantages of the proposed methods and lists directions for future work.
2. Related Works
Compared to the data available for monolingual ASR, CS corpora are very limited (Ganji et al., 2019; Lyu et al., 2010; Li et al., 2012; Shen et al., 2011; Chan et al., 2005; Lyu et al., 2006); hence, some CS recognizers adopt retraining-free methods that add new language recognition capabilities to an existing monolingual ASR system instead of rebuilding a mixed-language AM/LM. (Yu et al., 2009) proposed and compared four approaches for real-time CS recognition under the constraint of a native language acoustic model (NL-AM); in that work, the foreign words were expressed in the native language phoneme set through phoneme/senone mapping with the least Kullback-Leibler divergence, which achieved the best result among the AM merging techniques.
Based on the NL-AM, pronunciation generation for foreign words is considered another low-cost solution for intra-sentential CS speech recognition, so the core work is to generate good mapped pronunciations for foreign words in native language phonemes; this is similar to automatic lexicon learning (McGraw et al., 2013; Lu et al., 2013; Chen et al., 2016a; Tsujioka et al., 2016; Zhang et al., 2017) for solving out-of-vocabulary issues. (Laurent et al., 2010) proposed acoustic-based phonetic decoding and iterative filtering methods for the phonetic transcription of proper nouns. (Bhuvanagiri and Kopparapu, 2010) and (Bhuvanagirir and Kopparapu, 2012) built a Hindi-English ASR system on an existing monolingual AM using a mapped lexicon and a modified language model. (Modipa et al., 2013) constructed a Sepedi-English ASR system based on a Sepedi speech decoder, where the pronunciations of English words were obtained from the Sepedi phonetic decoder and then added to the original lexicon. (Huang et al., 2019) obtained high-quality pronunciations of foreign words from a grapheme-to-phoneme (G2P) model trained on a linguist/data-driven lexicon, where the data-driven method consisted of a candidate generation step based on phonetic decoding of foreign words spoken by native language speakers and a ROVER-like (Fiscus, 1997) phoneme confusion network method with acoustic score ranking.
Also, in building a mixed-language AM for the Mandarin-English CS task, (Guo et al., 2018) adopted a phonetic decoding method to correct mismatched pronunciations.
3. Pronunciation Generation For Seed Word
This section mainly introduces the data-driven way to obtain good pronunciations for foreign seed words. Approaches to generating pronunciations fall into manual and data-driven categories. In the manual approach, people judge how reasonable a pronunciation sounds perceptually, but this can be time-consuming and imprecise because accent issues are often neglected. In the data-driven approach, we use phonetic decoding, whose results reflect acoustic similarity and take native accents into account.
The phonetic decoding method is a data-driven way to obtain pronunciations of foreign words: an NL-AM based phonetic decoder decodes the audio segments of foreign words to obtain native phoneme sequences as candidate pronunciations. The audio segments of foreign words are mainly derived from the speech of native or foreign speakers; since the CS speech recognition task is oriented to native speakers, in this article we use the audio segments spoken by native speakers in a limited CS corpus as driving materials.
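To make this data-driven flow concrete, the minimal Python sketch below strings the two steps together. The callables `utterances_of`, `segments_of`, and `phonetic_decode` are hypothetical placeholders for corpus lookup, forced-alignment segmentation (Section 3.1), and NL-AM phonetic decoding (Section 3.2); they are assumptions for illustration, not part of our implementation.

```python
from typing import Callable, Dict, List, Sequence

def generate_candidate_pronunciations(
    foreign_words: Sequence[str],
    utterances_of: Callable[[str], Sequence[str]],               # word -> ids of utterances containing it (O_w)
    segments_of: Callable[[Sequence[str], str], Sequence[str]],  # (O_w, word) -> audio segment paths (S_w)
    phonetic_decode: Callable[[str], List[str]],                 # segment -> native phoneme sequence
) -> Dict[str, List[List[str]]]:
    """Collect candidate native-phoneme pronunciations for each foreign word."""
    candidates: Dict[str, List[List[str]]] = {}
    for word in foreign_words:
        utt_ids = utterances_of(word)           # utterances in the CS corpus that contain the word
        segments = segments_of(utt_ids, word)   # segments cut by forced alignment (Section 3.1)
        candidates[word] = [phonetic_decode(seg) for seg in segments]  # phonetic decoding (Section 3.2)
    return candidates
```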
3.1. Extracting Audio Segments
To segment the audio of foreign words, we first need their start/end timestamps. The timestamps are obtained by speech-text forced alignment on the AM, and the general method is to build a mixed-language GMM-HMM based AM with a combination of the native language lexicon and the foreign language lexicon. (Huang et al., 2019) kept the retraining-free approach: they used foreign speakers' audio segments as driving materials to obtain a mapped lexicon by phonetic decoding and then performed forced alignment on the NL-AM. Although this avoids pre-training a mixed-language AM, pronunciations of foreign words carrying a foreign accent may cause inaccurate alignments. In this work, we therefore still choose to pre-train a mixed-language GMM-HMM based AM for the alignment and segmentation of foreign words.
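As a small illustration of the segmentation step, suppose the forced aligner has already produced word-level start/end times in seconds (the alignment output format is an assumption here); the segments can then be cut from the utterance waveform, for example with the `soundfile` package:

```python
from typing import List, Tuple

import soundfile as sf  # third-party audio I/O package

def cut_word_segments(wav_path: str, intervals: List[Tuple[float, float]], out_prefix: str) -> List[str]:
    """Cut the foreign-word segments out of one utterance waveform.

    intervals: (start_sec, end_sec) pairs for the target word, e.g. taken from
    the forced-alignment output of the mixed-language GMM-HMM AM (format assumed).
    """
    audio, sample_rate = sf.read(wav_path)
    segment_paths = []
    for i, (start, end) in enumerate(intervals):
        segment = audio[int(start * sample_rate):int(end * sample_rate)]
        path = f"{out_prefix}_{i}.wav"
        sf.write(path, segment, sample_rate)
        segment_paths.append(path)
    return segment_paths
```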
3.2. Phonetic Decoding
Different from a word-level ASR system, a phonetic decoder is built on a decoding graph with a phoneme-level LM. Based on the NL-AM, we build a phonetic decoder $\mathbf{P}$ and then decode the audio segments of foreign words with a high acoustic weight setting, obtaining native language phoneme sequences with high acoustic similarity.
For a foreign language word $w$, we extract its embedded utterance subset $\mathbf{O}_w = \{O_1, O_2, \cdots, O_{M_w}\}$ from the limited CS corpus, where $M_w$ denotes the number of utterances that contain the word $w$. As subsection 3.1 introduces, we extract its segment set $S_w$ from $\mathbf{O}_w$ through forced alignment with the mixed-language GMM-HMM:

$$S_w = \{ s_i \mid i = 1, 2, \cdots, k \} \quad (1)$$
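In code, the per-word decoding of $S_w$ might look like the sketch below; `decode_fn` is a hypothetical wrapper around the phonetic decoder $\mathbf{P}$, and the `acoustic_scale` value only illustrates the high acoustic weight setting rather than the exact configuration used in our experiments.

```python
from typing import Callable, List, Sequence

def decode_candidates(
    segments: Sequence[str],                        # S_w: audio segment paths of word w
    decode_fn: Callable[[str, float], List[str]],   # wrapper around the phonetic decoder P (assumed)
    acoustic_scale: float = 5.0,  # set high so acoustic similarity dominates the phoneme-level LM
) -> List[List[str]]:
    """Decode every segment s_i of a foreign word into a native-language phoneme sequence."""
    return [decode_fn(segment, acoustic_scale) for segment in segments]
```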