
Pronunciation generation for foreign language words with few driving materials usually performs poorly. To address this imbalance in driving materials, we further propose an internal assistance strategy in which words with sufficient materials help words with scarce materials, improving the overall pronunciation quality of the seed lexicon.
The rest of this paper is organized as follows. Section 2 summarizes related work on retraining-free and phonetic decoding methods for CS. Section 3 describes the candidate generation and selection methods for seed word pronunciations. Section 4 introduces the pronunciation prediction work, which covers building the seed lexicon, the architecture of the Transformer G2P model, and a novel internal assistance method to improve the seed lexicon. Section 5 gives the detailed experimental configuration and results. Section 6 summarizes the advantages of the proposed methods and lists directions for future work.
2. Related Works
Compared to the data available for monolingual ASR, CS corpora are very limited (Ganji et al., 2019; Lyu et al., 2010; Li et al., 2012; Shen et al., 2011; Chan et al., 2005; Lyu et al., 2006); hence, some CS recognizers adopt retraining-free methods that add new language recognition capabilities to an existing monolingual ASR system instead of rebuilding a mixed-language AM/LM. (Yu et al., 2009) proposed and compared four approaches for real-time CS recognition under the constraint of a native language acoustic model (NL-AM); in that work, the foreign words were expressed in the native language phoneme set through phoneme/senone mapping with the least Kullback-Leibler divergence, which achieved the best result among the AM merging techniques.
Based on the NL-AM, pronunciation generation for foreign words is considered another low-cost solution for intra-sentential CS speech recognition, so the core work is to generate good mapped pronunciations for foreign words in native language phonemes; this is similar to automatic lexicon learning (McGraw et al., 2013; Lu et al., 2013; Chen et al., 2016a; Tsujioka et al., 2016; Zhang et al., 2017) for solving out-of-vocabulary issues. (Laurent et al., 2010) proposed acoustic-based phonetic decoding and iterative filtering methods for the phonetic transcription of proper nouns. (Bhuvanagiri and Kopparapu, 2010) and (Bhuvanagirir and Kopparapu, 2012) built a Hindi-English ASR system on an existing monolingual AM using a mapped lexicon and a modified language model. (Modipa et al., 2013) constructed a Sepedi-English ASR system based on a Sepedi speech decoder, where the pronunciations of English words were obtained from the Sepedi phonetic decoder and then added to the original lexicon. (Huang et al., 2019) obtained high-quality pronunciations of foreign words from a grapheme-to-phoneme (G2P) model trained on a linguist/data-driven lexicon, where the data-driven method consisted of a candidate generation step based on phonetic decoding of foreign words spoken by native language speakers and a ROVER-like (Fiscus, 1997) phoneme confusion network method with acoustic score ranking.
Also, in building a mixed-language AM for the Mandarin-English CS task, (Guo et al., 2018) adopted a phonetic decoding method to correct mismatched pronunciations.
3. Pronunciation Generation For Seed Word
This section mainly introduces the data-driven way to obtain good pronunciations for foreign seed words. Approaches to generating pronunciations fall into manual and data-driven categories. In the manual approach, people judge how reasonable a pronunciation sounds perceptually, but this can be time-consuming and imprecise because accent issues are often neglected. In the data-driven approach, we use phonetic decoding, whose results reflect acoustic similarity and take native accents into account.
The phonetic decoding method is a data-driven way to obtain pronunciations of foreign words: an NL-AM based phonetic decoder decodes the audio segments of foreign words to obtain native phoneme sequences as candidate pronunciations. The audio segments of foreign words are mainly derived from the speech of native or foreign speakers; since the CS speech recognition task is oriented to native speakers, in this article we use the audio segments spoken by native speakers in a limited CS corpus as driving materials.
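To make this data-driven flow concrete, the minimal Python sketch below strings the two steps together. The callables `utterances_of`, `segments_of`, and `phonetic_decode` are hypothetical placeholders for corpus lookup, forced-alignment segmentation (Section 3.1), and NL-AM phonetic decoding (Section 3.2); they are assumptions for illustration, not part of our implementation.

```python
from typing import Callable, Dict, List, Sequence

def generate_candidate_pronunciations(
    foreign_words: Sequence[str],
    utterances_of: Callable[[str], Sequence[str]],               # word -> ids of utterances containing it (O_w)
    segments_of: Callable[[Sequence[str], str], Sequence[str]],  # (O_w, word) -> audio segment paths (S_w)
    phonetic_decode: Callable[[str], List[str]],                 # segment -> native phoneme sequence
) -> Dict[str, List[List[str]]]:
    """Collect candidate native-phoneme pronunciations for each foreign word."""
    candidates: Dict[str, List[List[str]]] = {}
    for word in foreign_words:
        utt_ids = utterances_of(word)           # utterances in the CS corpus that contain the word
        segments = segments_of(utt_ids, word)   # segments cut by forced alignment (Section 3.1)
        candidates[word] = [phonetic_decode(seg) for seg in segments]  # phonetic decoding (Section 3.2)
    return candidates
```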
3.1. Extracting Audio Segments
To segment the audio of foreign words, we first need their start/end timestamps. The timestamps are obtained by speech-text forced alignment on the AM, and the general method is to build a mixed-language GMM-HMM based AM with a combination of the native language lexicon and the foreign language lexicon. (Huang et al., 2019) kept the retraining-free approach: they used foreign speakers' audio segments as driving materials to obtain a mapped lexicon by phonetic decoding and then performed forced alignment on the NL-AM. Although this avoids pre-training a mixed-language AM, pronunciations of foreign words carrying a foreign accent may cause inaccurate alignments. In this work, we therefore still choose to pre-train a mixed-language GMM-HMM based AM for the alignment and segmentation of foreign words.
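As a small illustration of the segmentation step, suppose the forced aligner has already produced word-level start/end times in seconds (the alignment output format is an assumption here); the segments can then be cut from the utterance waveform, for example with the `soundfile` package:

```python
from typing import List, Tuple

import soundfile as sf  # third-party audio I/O package

def cut_word_segments(wav_path: str, intervals: List[Tuple[float, float]], out_prefix: str) -> List[str]:
    """Cut the foreign-word segments out of one utterance waveform.

    intervals: (start_sec, end_sec) pairs for the target word, e.g. taken from
    the forced-alignment output of the mixed-language GMM-HMM AM (format assumed).
    """
    audio, sample_rate = sf.read(wav_path)
    segment_paths = []
    for i, (start, end) in enumerate(intervals):
        segment = audio[int(start * sample_rate):int(end * sample_rate)]
        path = f"{out_prefix}_{i}.wav"
        sf.write(path, segment, sample_rate)
        segment_paths.append(path)
    return segment_paths
```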
3.2. Phonetic Decoding
Different from a word-level ASR system, a phonetic decoder is built on a decoding graph with a phoneme-level LM. Based on the NL-AM, we build a phonetic decoder $\mathbf{P}$ and then decode the audio segments of foreign words with a high acoustic weight setting, obtaining native language phoneme sequences with high acoustic similarity.
For a foreign language word $w$, we extract its embedded utterance subset $\mathbf{O}_w = \{O_1, O_2, \cdots, O_{M_w}\}$ from the limited CS corpus, where $M_w$ denotes the number of utterances that contain the word $w$. As subsection 3.1 introduces, we extract its segment set $S_w$ from $\mathbf{O}_w$ through forced alignment with the mixed-language GMM-HMM:

$$S_w = \{ s_i \mid i = 1, 2, \cdots, k \} \quad (1)$$
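In code, the per-word decoding of $S_w$ might look like the sketch below; `decode_fn` is a hypothetical wrapper around the phonetic decoder $\mathbf{P}$, and the `acoustic_scale` value only illustrates the high acoustic weight setting rather than the exact configuration used in our experiments.

```python
from typing import Callable, List, Sequence

def decode_candidates(
    segments: Sequence[str],                        # S_w: audio segment paths of word w
    decode_fn: Callable[[str, float], List[str]],   # wrapper around the phonetic decoder P (assumed)
    acoustic_scale: float = 5.0,  # set high so acoustic similarity dominates the phoneme-level LM
) -> List[List[str]]:
    """Decode every segment s_i of a foreign word into a native-language phoneme sequence."""
    return [decode_fn(segment, acoustic_scale) for segment in segments]
```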