Pronunciation Generation for Foreign Language Words in Intra-Sentential Code-Switching Speech Recognition

Wei Wang (a,b), Chao Zhang (a) and Xiaopei Wu (a,**)
(a) Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei 230601, China
ARTICLE INFO
Keywords: speech recognition, code-switching, data-driven, pronunciation generation
ABSTRACT
Code-switching refers to the phenomenon of switching languages within a sentence or discourse. However, limited code-switching data, differing phoneme sets across languages, and high rebuilding costs make it challenging to build a specialized acoustic model for code-switching speech recognition. In this paper, we use limited code-switching data as driving material and explore a shortcut to quickly add intra-sentential code-switching recognition capability to a commissioned native language acoustic model. We propose a data-driven method to build a seed lexicon, which is used to train a grapheme-to-phoneme model that predicts mapping pronunciations for foreign language words in code-switching sentences. The core of the data-driven technique consists of a phonetic decoding method and different selection methods. To address imbalanced word-level driving material, we propose an internal assistance strategy: the grapheme-to-phoneme model learns good pronunciation rules from words with sufficient material and uses them to help words with scarce material. Our experiments show that the Mixed Error Rate in intra-sentential Chinese-English code-switching recognition falls from 29.15%, obtained with the pure Chinese recognizer, to 12.13% after adding foreign language words' pronunciations through our data-driven approach, and reaches its best result of 11.14% when different selection methods are combined with the internal assistance strategy.
1. Introduction
Code-switching (CS) is a common oral phenomenon among multilingual speakers, in which different languages coexist within sentences. As defined by Sankoff and Poplack (1981), CS can be categorized into inter-sentential switching and intra-sentential switching. In the intra-sentential case that this paper focuses on, foreign words usually appear in a native language sentence as loan words. Compared with monolingual automatic speech recognition (ASR) systems, the obstacles in CS recognition can be summarized as follows: (1) the lack of CS training data, (2) the phoneme differences between languages, and (3) the accent of native language speakers when they speak foreign languages.
In recent years, work on CS has steadily deepened. Some mainstream methods focus on building a mixed-language acoustic model (AM) and language model (LM). Because CS speech data is limited, phoneme sharing across languages has been widely applied to reduce the number of mixed-language AM units. Sharing methods can generally be divided into phoneme merging (Lin et al., 2009; Li et al., 2011; Sivasankaran et al., 2019) and the use of a universal phoneme set such as IPA (Smith, 2000). However, these may increase the risk of inter-language substitution errors, because they disrupt the context of some triphones in each language. Multi-task learning (MTL) techniques (Huang et al., 2013; Yılmaz et al., 2016) have also been explored in CS, where recognition of accented speech (Mendes et al., 2019) and recognition around the switching position (Chen et al., 2016b) have been improved by transferring knowledge between tasks. In the intra-sentential CS case, the main part of the speech is generally still in the native language (NL), and foreign words occur less often. A commissioned native language acoustic model (NL-AM) shows good recognition accuracy and robustness in real scenarios because it has abundant training data. It is therefore worthwhile to explore a shortcut that preserves its stability on the native language while extending its capability to foreign words.

** Principal corresponding author
wei.wang@imsl.org.cn (W. Wang); 14042@ahu.edu.cn (C. Zhang); wxp2001@ahu.edu.cn (X. Wu)
ORCID(s): 0000-0002-1765-0486 (W. Wang)
For intra-sentential CS speech recognition, this paper proposes a data-driven scheme that generates mapping pronunciations for foreign words. Concretely, reliable pronunciations are produced by a grapheme-to-phoneme (G2P) model trained on a seed lexicon. Building this seed lexicon is the core data-driven work: this paper adopts phonetic decoding for candidate generation, and average posterior estimation (APE) or a phoneme confusion network (PCN) for candidate selection. Compared with the seed lexicon built by Huang et al. (2019), this paper proposes the following improvements: (1) a purely data-driven approach that requires no pronunciation labeling by linguists; (2) in acoustic-based candidate selection, the average utterance-level posterior probability of a candidate is used as its acoustic score; (3) for pronunciation prediction, our G2P model is designed on the transformer-based (Zhang et al., 2017) sequence-to-sequence (seq2seq) architecture that is popular in the natural language processing (NLP) field.
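The acoustic-based selection in improvement (2) can be illustrated with a minimal sketch, assuming each candidate pronunciation has already received a posterior probability on every utterance that contains the word. The function name `rank_by_average_posterior`, the input layout, and the toy phoneme strings and scores are hypothetical and are not taken from the paper's implementation.

```python
from statistics import mean


def rank_by_average_posterior(posteriors):
    """Rank candidate pronunciations by their mean utterance-level posterior.

    posteriors: dict mapping a candidate phoneme string to the list of
    posterior probabilities it received, one per utterance (assumed layout).
    Returns a list of (candidate, average_posterior) pairs, best first.
    """
    averaged = {cand: mean(values) for cand, values in posteriors.items() if values}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Toy scores for two hypothetical candidates of one foreign word.
toy = {
    "h a l ou": [0.62, 0.71, 0.58],
    "h ai l ou": [0.40, 0.75, 0.35],
}
ranking = rank_by_average_posterior(toy)
best_candidate, best_score = ranking[0]
```

Averaging over utterances, rather than taking the single highest score, keeps one unusually favourable recording from deciding the outcome: the second candidate has the highest individual score here but the lower average.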
Wei Wang et al.: Preprint submitted to Elsevier
arXiv:2210.14691v1 [cs.SD] 26 Oct 2022

However, foreign words occur with different frequencies in the CS corpus, which leads to imbalanced word-level driving material in the data-driven approach. This in turn causes imbalanced driving processing, and the pronunciations of words with less material consistently show poor quality. In this work, to address the imbalanced driving material issue, we further propose an internal assistance strategy between words with sufficient material and words with scarce material, to improve the overall pronunciation quality of the seed lexicon.
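The first step of such a strategy, separating words with sufficient material from words with scarce material, might look like the sketch below. The cutoff `min_segments`, the function name `split_by_material`, and the toy counts are assumptions made for illustration; the text above does not specify how the two groups are separated.

```python
def split_by_material(segment_counts, min_segments=5):
    """Partition seed words by how many audio segments (driving material) they have.

    segment_counts: dict mapping word -> number of extracted audio segments.
    min_segments: hypothetical cutoff; the source text does not give a value.
    Returns (sufficient, scarce) as two alphabetically sorted lists.
    """
    sufficient = sorted(w for w, n in segment_counts.items() if n >= min_segments)
    scarce = sorted(w for w, n in segment_counts.items() if n < min_segments)
    return sufficient, scarce


counts = {"email": 42, "okay": 120, "wifi": 3, "deadline": 1}
sufficient, scarce = split_by_material(counts)
# Under this reading, `sufficient` would supply G2P training pairs and
# `scarce` would receive pronunciations predicted by that G2P model.
```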
The rest of this paper is organized as follows. Section 2 summarizes related work on retraining-free and phonetic decoding methods for CS. Section 3 describes the candidate generation and selection methods for seed word pronunciations. Section 4 introduces pronunciation prediction, which covers building the seed lexicon, the architecture of the transformer G2P model, and a novel internal assistance method for improving the seed lexicon. Section 5 gives the detailed experimental configuration and results. Section 6 summarizes the advantages of the proposed methods and lists directions for future work.
2. Related Works
Compared with the data available for monolingual ASR, CS corpora are very limited (Ganji et al., 2019; Lyu et al., 2010; Li et al., 2012; Shen et al., 2011; Chan et al., 2005; Lyu et al., 2006). Hence, some CS recognizers adopt retraining-free methods that extend an existing monolingual ASR system with new language recognition capabilities instead of rebuilding a mixed-language AM/LM. Yu et al. (2009) proposed and compared four approaches for real-time CS recognition under the constraint of a native language acoustic model (NL-AM). In that case, foreign words were expressed in the native language phoneme set through phoneme/senone mapping using the least Kullback-Leibler divergence, which achieved the best result among the AM merging techniques. Based on the NL-AM, pronunciation generation for foreign words is considered another low-cost solution for intra-sentential CS speech recognition. The core work is to generate good mapping pronunciations for foreign words in native language phonemes, which is similar to automatic lexicon learning (McGraw et al., 2013; Lu et al., 2013; Chen et al., 2016a; Tsujioka et al., 2016; Zhang et al., 2017) for solving out-of-vocabulary issues. Laurent et al. (2010) proposed acoustic-based phonetic decoding and iterative filtering methods for the phonetic transcription of proper nouns. Bhuvanagiri and Kopparapu (2010, 2012) built a Hindi-English ASR system based on an existing monolingual AM using a mapped lexicon and a modified language model. Modipa et al. (2013) constructed a Sepedi-English ASR system based on a Sepedi speech decoder, where the pronunciations of English words were obtained from a Sepedi phonetic decoder and then added to the original lexicon. Huang et al. (2019) obtained high-quality pronunciations for foreign words from a grapheme-to-phoneme (G2P) model trained on a linguist/data-driven lexicon, where the data-driven method consisted of a generation step, phonetic decoding of foreign words spoken by native language speakers, and a ROVER-like (Fiscus, 1997) phoneme confusion network method with acoustic score ranking. Also, in building a mixed-language AM for the Mandarin-English CS task, Guo et al. (2018) adopted a phonetic decoding method to correct mismatched pronunciations by decoding.
3. Pronunciation Generation For Seed Word
This section introduces the data-driven way to obtain good pronunciations for foreign seed words. Approaches to generating pronunciations can be divided into manual and data-driven categories. Manual approaches consider how reasonable a pronunciation is perceptually, but they can be time-consuming and imprecise because accent problems are often neglected. On the data-driven side, we use phonetic decoding, whose results reflect acoustic similarity and take native accents into account.

Phonetic decoding is a data-driven way to obtain pronunciations for foreign words: an NL-AM based phonetic decoder decodes the audio segments of foreign words to obtain native phoneme sequences as candidate pronunciations. The audio segments of foreign words are mainly derived from the speech of native or foreign speakers. Because the CS speech recognition task is oriented to native speakers, in this article we use the audio segments from a limited CS corpus spoken by native speakers as driving material.
3.1. Extracting Audio Segments
To segment the audio of foreign words, we first need their start and end timestamps. The timestamps are acquired by speech-text forced alignment on the AM, and the general method is to build a mixed-language GMM-HMM based AM with a combination of the native language lexicon and the foreign language lexicon. Huang et al. (2019) kept to a retraining-free approach: they used foreign speakers' audio segments as driving material to obtain a mapped lexicon by phonetic decoding, and then executed forced alignment on the NL-AM. Although this method avoids pre-training a mixed-language AM, foreign words' pronunciations with a foreign accent may cause inaccurate alignments. In this work, we still choose to pre-train a mixed-language GMM-HMM based AM for the alignment and segmentation of foreign words.
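Once forced alignment has produced word-level timestamps, cutting out the foreign words is a matter of converting seconds to sample indices. The following is a minimal sketch under assumed inputs: the function name `extract_word_segments`, the `(word, start_sec, end_sec)` alignment layout, and the 16 kHz toy signal are illustrative choices, not details given in the paper.

```python
def extract_word_segments(samples, sample_rate, alignment, foreign_words):
    """Cut the audio of foreign words out of one utterance.

    samples: sequence of audio samples for the utterance.
    sample_rate: samples per second.
    alignment: list of (word, start_sec, end_sec) from forced alignment
               (assumed layout).
    foreign_words: set of words treated as foreign.
    Returns a list of (word, segment_samples) pairs in utterance order.
    """
    segments = []
    for word, start_sec, end_sec in alignment:
        if word in foreign_words:
            start = int(round(start_sec * sample_rate))
            end = int(round(end_sec * sample_rate))
            segments.append((word, samples[start:end]))
    return segments


# One second of placeholder audio at 16 kHz with a toy alignment.
audio = list(range(16000))
ali = [("我", 0.0, 0.25), ("okay", 0.25, 0.5), ("吗", 0.5, 1.0)]
cut = extract_word_segments(audio, 16000, ali, {"okay"})
```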
3.2. Phonetic Decoding
Unlike a word-level ASR system, a phonetic decoder is built on a decoding graph with a phoneme-level LM. Based on the NL-AM, we build a phonetic decoder $\mathbf{P}$ and then decode the audio segments of foreign words with a high acoustic weight to obtain native language phoneme sequences with high acoustic similarity.
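The effect of a high acoustic weight can be seen in a small sketch, assuming the decoder ranks hypotheses by a log-linear combination of acoustic and phoneme-level LM scores. That combination is a common convention and an assumption here; the function name `pick_phoneme_sequence` and the toy scores are hypothetical.

```python
def pick_phoneme_sequence(hypotheses, acoustic_weight):
    """Return the phoneme sequence with the highest combined log score.

    hypotheses: list of (phonemes, acoustic_logprob, lm_logprob) tuples.
    acoustic_weight: multiplier on the acoustic term (assumed log-linear form).
    """
    def combined(hyp):
        _, acoustic_logprob, lm_logprob = hyp
        return acoustic_weight * acoustic_logprob + lm_logprob

    return max(hypotheses, key=combined)[0]


hyps = [
    ("k ou", -10.0, -2.0),  # favoured by the phoneme-level LM
    ("k u", -8.0, -6.0),    # closer acoustic match
]
low = pick_phoneme_sequence(hyps, acoustic_weight=1.0)
high = pick_phoneme_sequence(hyps, acoustic_weight=3.0)
```

With the lower weight, the LM-preferred sequence wins; raising the weight lets the acoustically closer sequence win. This is the behaviour wanted when the goal is to capture how native speakers actually pronounce a foreign word.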
For a foreign language word $w$, we extract the subset of utterances in which it is embedded, $\mathcal{O}_w = \{O_1, O_2, \ldots, O_{M_w}\}$, from the limited CS corpus, where $M_w$ denotes the number of utterances that contain the word $w$. As introduced in subsection 3.1, we extract its segment set $S_w$ from $\mathcal{O}_w$ through forced alignment with the mixed-language GMM-HMM:

$$S_w = \{ s_i \mid i = 1, 2, \ldots, k \} \tag{1}$$
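As a rough illustration of how the utterance subset and the segment set of Eq. (1) relate, the sketch below gathers both from a toy alignment table. The data layout and the name `collect_word_segments` are assumptions. The sketch also assumes that $k$ can exceed $M_w$ when a word occurs more than once in an utterance, which is one reading of the notation and is not stated in the text.

```python
def collect_word_segments(alignments, word):
    """Gather the utterances containing `word` and every aligned occurrence of it.

    alignments: dict mapping utt_id -> list of (token, start_sec, end_sec)
                (assumed layout).
    Returns (utterances, segments): utterance ids that contain the word, and
    (utt_id, start_sec, end_sec) triples, one per occurrence.
    """
    utterances = []  # plays the role of the utterance subset for w
    segments = []    # plays the role of S_w
    for utt_id, tokens in alignments.items():
        hits = [(utt_id, start, end) for tok, start, end in tokens if tok == word]
        if hits:
            utterances.append(utt_id)
            segments.extend(hits)
    return utterances, segments


toy_alignments = {
    "utt1": [("我", 0.0, 0.2), ("okay", 0.2, 0.6)],
    "utt2": [("okay", 0.0, 0.4), ("吗", 0.4, 0.5), ("okay", 0.5, 0.9)],
    "utt3": [("你好", 0.0, 0.5)],
}
O_w, S_w = collect_word_segments(toy_alignments, "okay")
M_w, k = len(O_w), len(S_w)
```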