CASNET: INVESTIGATING CHANNEL ROBUSTNESS FOR SPEECH SEPARATION
Fan-Lin Wang2, Yao-Fei Cheng2, Hung-Shin Lee1, Yu Tsao3, and Hsin-Min Wang2
1North Co., Ltd., Taiwan
2Institute of Information Science, Academia Sinica
3Research Center for Information Technology Innovation, Academia Sinica
ABSTRACT
Recording channel mismatch between training and testing
conditions has been shown to be a serious problem for speech
separation. It greatly degrades separation performance and
prevents models from meeting the demands of daily use. In
this study, inheriting the use of our previously constructed
TAT-2mix corpus, we address the channel mismatch problem
by proposing a channel-aware audio separation network (Cas-
Net), a deep learning framework for end-to-end time-domain
speech separation. CasNet is implemented on top of TasNet.
A channel embedding (characterizing the channel information
of a mixture of multiple utterances), generated by the Channel
Encoder, is introduced into the separation module via the FiLM
technique. Through two training strategies, we explore two
roles that channel embedding may play: 1) a real-life noise
disturbance, making the model more robust, or 2) a guide,
instructing the separation model to retain the desired channel
information. Experimental results on TAT-2mix show that
CasNet trained with both training strategies outperforms the
TasNet baseline, which does not use channel embeddings.
Index Terms— Speech separation, channel embeddings
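The FiLM conditioning mentioned in the abstract can be sketched as a feature-wise affine transform whose scale and shift are predicted from the channel embedding. The following is a minimal illustrative sketch; all names and dimensions (`film`, `feat_dim`, `embed_dim`) are assumptions for exposition, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def film(feats, channel_emb, w_gamma, w_beta):
    """Feature-wise linear modulation (FiLM): scale and shift each
    feature channel of `feats` using values predicted from the
    conditioning vector `channel_emb`.
    feats: (feat_dim, time); channel_emb: (embed_dim,)
    w_gamma, w_beta: (embed_dim, feat_dim) linear projections."""
    gamma = channel_emb @ w_gamma  # per-feature scale, shape (feat_dim,)
    beta = channel_emb @ w_beta    # per-feature shift, shape (feat_dim,)
    # Broadcast over the time axis.
    return gamma[:, None] * feats + beta[:, None]

feat_dim, embed_dim, time = 64, 16, 100
feats = rng.standard_normal((feat_dim, time))   # separator features
emb = rng.standard_normal(embed_dim)            # channel embedding
w_g = rng.standard_normal((embed_dim, feat_dim))
w_b = rng.standard_normal((embed_dim, feat_dim))
out = film(feats, emb, w_g, w_b)
print(out.shape)  # (64, 100)
```

In a trained model the projections would be learned layers and the modulation applied inside the separation blocks; this sketch only shows the conditioning mechanism itself.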
1. INTRODUCTION
Speech separation [1] originates from the cocktail party prob-
lem [2], which refers to the perception of each speech source
in a noisy social environment. To understand each speaker’s
speech, we first need to separate overlapping speech, which is
the goal of speech separation. As a necessary pre-processing
for downstream tasks, such as speaker diarization [3] and au-
tomatic speech recognition [4], many efforts have been made
in speech separation.
Nowadays, the main dataset used in speech separation re-
search is the WSJ0-2mix dataset [5]. In WSJ0-2mix, an arti-
ficially synthesized dataset, all mixed utterances are full over-
laps of clean speech from two speakers. In recent research, a
popular architecture is the time-domain audio separation net-
work (TasNet) [6]. Many TasNet-based models have achieved
extraordinary performance [7, 8, 9, 10, 11] on WSJ0-2mix.
However, WSJ0-2mix sets many restrictions on the experi-
ments, which may lead to domain mismatches.
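A fully overlapped two-speaker mixture of the kind WSJ0-2mix contains can be synthesized by scaling one clean source relative to the other at a target SNR and summing. The sketch below is illustrative; the exact gain convention of WSJ0-2mix may differ:

```python
import numpy as np

def mix_pair(s1, s2, snr_db):
    """Fully overlap two equal-length clean sources at a target
    relative SNR (dB), as in artificially synthesized corpora.
    Returns the mixture and the (scaled) reference sources."""
    p1 = np.mean(s1 ** 2)
    p2 = np.mean(s2 ** 2)
    # Scale s2 so that 10*log10(power(s1)/power(s2_scaled)) == snr_db.
    scale = np.sqrt(p1 / (p2 * 10 ** (snr_db / 10)))
    s2_scaled = s2 * scale
    return s1 + s2_scaled, s1, s2_scaled

rng = np.random.default_rng(1)
a = rng.standard_normal(16000)  # 1 s of "speech" at 16 kHz (toy signal)
b = rng.standard_normal(16000)
mix, t1, t2 = mix_pair(a, b, snr_db=3.0)
ratio = 10 * np.log10(np.mean(t1 ** 2) / np.mean(t2 ** 2))
print(round(ratio, 3))  # 3.0
```

Separation models are then trained to recover `t1` and `t2` from `mix`; the channel mismatch discussed below arises when such mixtures are synthesized from recordings made with microphones unseen at test time.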
Domain mismatch can be attributed to four factors:
speaker, content, channel, and environment. Regarding
speaker mismatch, the speakers in the test sets of all datasets
are designed to be unseen in the training sets. Nevertheless,
there is no noticeable drop in performance, demonstrating
that these models generalize well to unseen speakers. Environment
mismatch refers to reverberation and noise that may be en-
countered in reality but are not seen in the training set. To
address this issue, two new datasets have been presented:
WHAM! [12] and WHAMR! [13], which are the noisy and
reverberant extensions of WSJ0-2mix, respectively.
Content mismatch concerns what is spoken, such as the
vocabulary or even different languages with different phoneme
inventories. In [14, 15], the authors argue that the larger the
vocabulary presented in the training set, the better the gener-
alization of the model. In [16], using the GlobalPhoneMS2
dataset consisting of 22 spoken languages, the authors show
that when trained on a multilingual dataset, the model can im-
prove its performance on unseen languages. Channel mismatch
concerns the type of microphone used for recording.
The authors of [17] argue that near-field data are
easier to separate than far-field data, even though both were
recorded in the same environment.
In the COVID-19 pandemic era, virtual meetings have be-
come prevalent and are recorded with a wider variety of micro-
phones. Furthermore, smartphones are frequently used for
everyday recording. If all training sets are recorded with
condenser microphones, speech separation performance
will drop significantly in daily use. Therefore, channel mis-
match should be investigated in more depth to meet this demand.
In our previous work [18], we found that the impact of dif-
ferent languages is negligible compared to that of different
channels. Moreover, even when the content is identical, sepa-
ration performance varies with the recording microphone. To
address channel mismatch, it is necessary to build a channel-
robust speech separation model.
Here, we define “channel robustness” in two directions ac-
cording to the channel of the target. The first is classic speech
separation: no matter which channel the mixture is recorded
through, the separated utterances must remain on the same
channel as the mixture. The other definition is that separated
utterances should be enhanced as if they were recorded by a
arXiv:2210.15370v1 [cs.SD] 27 Oct 2022