CASNET: INVESTIGATING CHANNEL ROBUSTNESS FOR SPEECH SEPARATION
Fan-Lin Wang2, Yao-Fei Cheng2, Hung-Shin Lee1, Yu Tsao3, and Hsin-Min Wang2
1North Co., Ltd., Taiwan
2Institute of Information Science, Academia Sinica
3Research Center for Information Technology Innovation, Academia Sinica
ABSTRACT
Recording channel mismatch between training and testing conditions has been shown to be a serious problem for speech separation. It greatly degrades separation performance and prevents models from meeting the requirements of everyday use. In this study, continuing to use our previously constructed TAT-2mix corpus, we address the channel mismatch problem by proposing a channel-aware audio separation network (CasNet), a deep learning framework for end-to-end time-domain speech separation. CasNet is implemented on top of TasNet. A channel embedding (characterizing the channel information in a mixture of multiple utterances) generated by a Channel Encoder is introduced into the separation module via the FiLM technique. Through two training strategies, we explore two roles that the channel embedding may play: 1) a real-life noise disturbance that makes the model more robust, or 2) a guide that instructs the separation model to retain the desired channel information. Experimental results on TAT-2mix show that CasNet trained with both training strategies outperforms the TasNet baseline, which does not use channel embeddings.
Index Terms—Speech separation, channel embeddings
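The FiLM conditioning mentioned in the abstract amounts to a feature-wise affine transformation whose scale and shift are predicted from the channel embedding. The following is a minimal sketch of such a module; the module name, tensor shapes, and dimensions are illustrative assumptions and do not reflect the actual CasNet implementation.

```python
# Minimal sketch (not the authors' code): FiLM-style conditioning of
# separator features on a channel embedding.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: y = gamma(c) * x + beta(c)."""
    def __init__(self, feat_dim: int, emb_dim: int):
        super().__init__()
        self.scale = nn.Linear(emb_dim, feat_dim)  # predicts gamma from the channel embedding
        self.shift = nn.Linear(emb_dim, feat_dim)  # predicts beta from the channel embedding

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # x: separator features, shape (batch, feat_dim, time)
        # c: channel embedding, shape (batch, emb_dim)
        gamma = self.scale(c).unsqueeze(-1)  # (batch, feat_dim, 1), broadcast over time
        beta = self.shift(c).unsqueeze(-1)
        return gamma * x + beta

# Hypothetical usage inside a TasNet-style separation module:
# film = FiLM(feat_dim=128, emb_dim=64)
# modulated = film(separator_features, channel_embedding)
```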
1. INTRODUCTION
Speech separation [1] originates from the cocktail party prob-
lem [2], which refers to the perception of each speech source
in a noisy social environment. To understand each speaker’s
speech, we first need to separate overlapping speech, which is
the goal of speech separation. Because speech separation is a necessary pre-processing step for downstream tasks such as speaker diarization [3] and automatic speech recognition [4], many efforts have been devoted to it.
Nowadays, the main dataset used in speech separation re-
search is the WSJ0-2mix dataset [5]. In WSJ0-2mix, an arti-
ficially synthesized dataset, all mixed utterances are full over-
laps of clean speech from two speakers. In recent research, a
popular architecture is the time-domain audio separation net-
work (TasNet) [6]. Many TasNet-based models have achieved
extraordinary performance [7, 8, 9, 10, 11] on WSJ0-2mix.
However, WSJ0-2mix imposes many restrictions on the experimental conditions, which may lead to domain mismatches.
Domain mismatch can be attributed to four factors:
speaker, content, channel, and environment. Regarding
speaker mismatch, the speakers in the test sets of all datasets
are designed to be unseen in the training sets. Nevertheless, there is no noticeable drop in performance, indicating that these models generalize well to unseen speakers. Environment
mismatch refers to reverberation and noise that may be en-
countered in reality and are not seen in the training set. To
address this issue, two new datasets have been presented:
WHAM! [12] and WHAMR! [13], which are the noisy and
reverberant extensions of WSJ0-2mix, respectively.
Content mismatch concerns what the speaker says, such as differences in vocabulary or even in language, which involve different sets of phonemes. In [14, 15], the authors argue that the larger the
vocabulary presented in the training set, the better the gener-
alization of the model. In [16], using the GlobalPhoneMS2
dataset consisting of 22 spoken languages, the authors show
that when trained on a multilingual dataset, the model can im-
prove its performance on unseen languages. Channel mismatch, in contrast, concerns the type of microphone used in the
recording. The authors of [17] argue that near-field data are
easier to separate than far-field data, even though both were
recorded in the same environment.
In the COVID-19 pandemic era, virtual meetings have become prevalent and are recorded with a wider variety of microphones. Furthermore, smartphones are frequently used for everyday recording. If all training sets are recorded with condenser microphones, speech separation performance will drop significantly in daily use. Therefore, channel mismatch should be investigated in more depth to meet this demand. In our previous work [18], we found that the impact of different languages is small enough to be ignored compared to the impact of different channels. Moreover, even when the content is the same, the separation performance varies with the microphone used. To address channel mismatch, it is necessary to build a channel-robust speech separation model.
Here, we define “channel robustness” in two directions ac-
cording to the channel of the target. The first is classic speech
separation: no matter which channel the mixture is recorded
through, the separated utterances must remain on the same
channel as the mixture. The other definition is that separated
utterances should be enhanced as if they were recorded by a