REDUCING LANGUAGE CONFUSION FOR CODE-SWITCHING SPEECH RECOGNITION
WITH TOKEN-LEVEL LANGUAGE DIARIZATION
Hexin Liu1,2, Haihua Xu1, Leibny Paola Garcia3, Andy W. H. Khong2, Yi He1, Sanjeev Khudanpur3
1Bytedance AI Lab
2School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
3CLSP and HLT-COE, Johns Hopkins University, USA
ABSTRACT
Code-switching (CS) refers to the phenomenon in which lan-
guages switch within a speech signal, leading to language
confusion for automatic speech recognition (ASR). This pa-
per aims to address language confusion and thereby improve
CS-ASR from two perspectives: incorporating and disentangling
language information. We incorporate language information
in the CS-ASR model by dynamically biasing the model
with token-level language posteriors which are outputs of a
sequence-to-sequence auxiliary language diarization module.
In contrast, the disentangling process reduces the differences
between the two languages via adversarial training so as to
normalize them. We conduct experiments on the SEAME
dataset. Compared to the baseline model, both joint opti-
mization with language diarization (LD) and the language
posterior bias yield performance improvements. The comparison of the proposed
methods indicates that incorporating language information
is more effective than disentangling for reducing language
confusion in CS speech.
Index Terms—code-switching, automatic speech recog-
nition, token, language diarization, language posterior
1. INTRODUCTION
Code-switching (CS) refers to the switching of languages
within a spontaneous multilingual recording. Although exist-
ing automatic speech recognition (ASR) methods have been
shown to achieve good performance on monolingual speech [1, 2],
CS-ASR is still a challenge due to language confusion arising
from code switches and the lack of annotated data.
Language information is often incorporated into CS-ASR
models to tackle challenges associated with language confu-
sion. In [3], language identification (LID) serves as an aux-
iliary task which enriches the shared encoder with language
information. A bi-encoder transformer network was proposed
in [4], where two encoders are pre-trained on monolingual
data independently to decouple the modeling of Mandarin and
English to capture language-specific information. This
dual-encoder CS-ASR approach has been shown to be effective, and
several methods were subsequently proposed based on this
framework [5, 6]. A language-specific attention mechanism
has also been proposed to reduce multilingual contextual in-
formation for a transformer encoder-decoder CS-ASR model
[7, 8]. In this approach, monolingual token embeddings are
separated from code-switching token sequences before being
fed into their respective self-attention modules within the de-
coder layers.
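As a toy illustration of the separation step described above, the snippet below splits a code-switched token sequence into monolingual subsequences given per-token language labels. The function and label names are hypothetical; the actual mechanism of [7, 8] operates on token embeddings inside the decoder's self-attention modules rather than on surface tokens.

```python
def split_by_language(tokens, lang_labels):
    """Separate a code-switched token sequence into monolingual
    subsequences, one per language, using per-token language labels."""
    mandarin = [t for t, lab in zip(tokens, lang_labels) if lab == "zh"]
    english = [t for t, lab in zip(tokens, lang_labels) if lab == "en"]
    return mandarin, english

# Example: a Mandarin-English code-switched token sequence.
tokens = ["我", "想", "book", "a", "ticket"]
labels = ["zh", "zh", "en", "en", "en"]
zh_seq, en_seq = split_by_language(tokens, labels)
```

Each monolingual subsequence would then be fed to its own self-attention module, which is precisely why cross-lingual context shared between the two subsequences is lost.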
It is useful to note that the dual-encoder approach, in
general, performs LID at frame-level units—frame-level LID
outputs are assigned to the outputs of language-specific en-
coders before the weighted sum in the mixture-of-experts
interpolation process. The frame-level LID, however, is not
desirable since the LID performance generally degrades with
shorter speech signals [9, 10]. In addition, CS can be regarded
as a speaker-dependent phenomenon [11], where languages
within a CS speech signal share information such as the ac-
cent and discourse markers. Therefore, the language-specific
attention mechanism would lead to cross-lingual informa-
tion loss while learning monolingual information. Given the
nature of languages and their transitions in multilingual
recordings, modeling language identity at the coarser
token level would be more appropriate for CS-ASR.
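A minimal sketch of the frame-level mixture-of-experts interpolation discussed above, assuming two language-specific encoders whose per-frame outputs are combined using frame-level LID posteriors as weights. All names and the plain-list representation are illustrative, not the implementation of [4].

```python
def moe_interpolate(h_zh, h_en, lid_post):
    """Frame-level mixture-of-experts interpolation (sketch).
    h_zh, h_en: per-frame hidden vectors from the Mandarin and English
                encoders (T frames x D dims, as nested lists).
    lid_post:   per-frame LID posterior of Mandarin (length T)."""
    fused = []
    for hz, he, p in zip(h_zh, h_en, lid_post):
        fused.append([p * z + (1.0 - p) * e for z, e in zip(hz, he)])
    return fused

# Two frames with 2-dim hidden states: frame 1 is confidently Mandarin,
# frame 2 is ambiguous between the two languages.
fused = moe_interpolate([[1.0, 0.0], [1.0, 0.0]],
                        [[0.0, 1.0], [0.0, 1.0]],
                        [1.0, 0.5])
```

The sketch makes the weakness concrete: each weight `p` is estimated from a single frame of speech, exactly the short-duration regime in which LID accuracy degrades [9, 10].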
Language diarization (LD), as a special case of LID,
involves partitioning a code-switching speech signal into
homogeneous segments before determining their language
identities [12, 13]. In our work, LD is reformulated into a
sequence-to-sequence task similar to that of ASR to capture
token-level language information. Inspired by the success of
utterance-level one-hot language vectors for multilingual ASR
[14, 15], we propose to reduce language confusion within CS
speech by supplementing the token embeddings with their
respective soft language labels—token-level language pos-
teriors predicted by the LD module—before feeding these
embeddings into the ASR decoder. Since the two languages in
a CS scenario can be acoustically similar to each other owing
to the accent and tone of the bilingual speaker, language posteri-
ors are expected to convey more language information than
one-hot language label vectors. Moreover, to explore the
effect of language information for CS-ASR, we also propose
a second technique to disentangle the language information
arXiv:2210.14567v1 [eess.AS] 26 Oct 2022