CASNET: INVESTIGATING CHANNEL ROBUSTNESS FOR SPEECH SEPARATION
Fan-Lin Wang2, Yao-Fei Cheng2, Hung-Shin Lee1, Yu Tsao3, and Hsin-Min Wang2
1North Co., Ltd., Taiwan
2Institute of Information Science, Academia Sinica
3Research Center for Information Technology Innovation, Academia Sinica
ABSTRACT
Recording channel mismatch between training and testing conditions has been shown to be a serious problem for speech separation. It greatly degrades separation performance and prevents models from meeting the requirements of everyday use. In this study, continuing to use our previously constructed TAT-2mix corpus, we address the channel mismatch problem by proposing a channel-aware audio separation network (CasNet), a deep learning framework for end-to-end time-domain speech separation. CasNet is implemented on top of TasNet. A channel embedding (characterizing the channel information in a mixture of multiple utterances) generated by a Channel Encoder is introduced into the separation module via the FiLM technique. Through two training strategies, we explore two roles that the channel embedding may play: 1) a real-life noise disturbance that makes the model more robust, or 2) a guide that instructs the separation model to retain the desired channel information. Experimental results on TAT-2mix show that CasNet trained with both training strategies outperforms the TasNet baseline, which does not use channel embeddings.
Index Terms—Speech separation, channel embeddings
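The FiLM conditioning mentioned in the abstract amounts to a feature-wise affine transformation whose scale and shift are predicted from the channel embedding. The following is a minimal sketch of such a module; the module name, tensor shapes, and dimensions are illustrative assumptions and do not reflect the actual CasNet implementation.

```python
# Minimal sketch (not the authors' code): FiLM-style conditioning of
# separator features on a channel embedding.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: y = gamma(c) * x + beta(c)."""
    def __init__(self, feat_dim: int, emb_dim: int):
        super().__init__()
        self.scale = nn.Linear(emb_dim, feat_dim)  # predicts gamma from the channel embedding
        self.shift = nn.Linear(emb_dim, feat_dim)  # predicts beta from the channel embedding

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # x: separator features, shape (batch, feat_dim, time)
        # c: channel embedding, shape (batch, emb_dim)
        gamma = self.scale(c).unsqueeze(-1)  # (batch, feat_dim, 1), broadcast over time
        beta = self.shift(c).unsqueeze(-1)
        return gamma * x + beta

# Hypothetical usage inside a TasNet-style separation module:
# film = FiLM(feat_dim=128, emb_dim=64)
# modulated = film(separator_features, channel_embedding)
```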
1. INTRODUCTION
Speech separation [1] originates from the cocktail party prob-
lem [2], which refers to the perception of each speech source
in a noisy social environment. To understand each speaker’s
speech, we first need to separate overlapping speech, which is
the goal of speech separation. Because speech separation is a necessary pre-processing step for downstream tasks such as speaker diarization [3] and automatic speech recognition [4], many efforts have been devoted to it.
Nowadays, the main dataset used in speech separation re-
search is the WSJ0-2mix dataset [5]. In WSJ0-2mix, an arti-
ficially synthesized dataset, all mixed utterances are full over-
laps of clean speech from two speakers. In recent research, a
popular architecture is the time-domain audio separation net-
work (TasNet) [6]. Many TasNet-based models have achieved
extraordinary performance [7, 8, 9, 10, 11] on WSJ0-2mix.
However, WSJ0-2mix imposes many restrictions on the experimental conditions, which may lead to domain mismatches.
Domain mismatch can be attributed to four factors:
speaker, content, channel, and environment. Regarding
speaker mismatch, the speakers in the test sets of all datasets
are designed to be unseen in the training sets. Nevertheless, there is no noticeable drop in performance, indicating that these models generalize well to unseen speakers. Environment
mismatch refers to reverberation and noise that may be en-
countered in reality and are not seen in the training set. To
address this issue, two new datasets have been presented:
WHAM! [12] and WHAMR! [13], which are the noisy and
reverberant extensions of WSJ0-2mix, respectively.
Content mismatch concerns what the speaker says, such as differences in vocabulary or even in language, which involve different sets of phonemes. In [14, 15], the authors argue that the larger the
vocabulary presented in the training set, the better the gener-
alization of the model. In [16], using the GlobalPhoneMS2
dataset consisting of 22 spoken languages, the authors show
that when trained on a multilingual dataset, the model can im-
prove its performance on unseen languages. Channel mismatch, in contrast, concerns the type of microphone used in the
recording. The authors of [17] argue that near-field data are
easier to separate than far-field data, even though both were
recorded in the same environment.
In the COVID-19 pandemic era, virtual meetings have become prevalent and are recorded with a wider variety of microphones. Furthermore, smartphones are frequently used for everyday recording. If all training sets are recorded with condenser microphones, speech separation performance will drop significantly in daily use. Therefore, channel mismatch should be investigated in more depth to meet this demand. In our previous work [18], we found that the impact of different languages is small enough to be ignored compared to the impact of different channels. Moreover, even when the content is the same, the separation performance varies with the microphone used. To address channel mismatch, it is necessary to build a channel-robust speech separation model.
Here, we define “channel robustness” in two directions ac-
cording to the channel of the target. The first is classic speech
separation: no matter which channel the mixture is recorded
through, the separated utterances must remain on the same
channel as the mixture. The other definition is that separated
utterances should be enhanced as if they were recorded by a