
DEFORMABLE TEMPORAL CONVOLUTIONAL NETWORKS FOR MONAURAL NOISY
REVERBERANT SPEECH SEPARATION
William Ravenscroft, Stefan Goetze, and Thomas Hain
Department of Computer Science, The University of Sheffield, Sheffield, United Kingdom
{jwravenscroft1, s.goetze, t.hain}@sheffield.ac.uk
ABSTRACT
Speech separation models are used for isolating individual speakers in many speech processing applications. Deep learning models have been shown to lead to state-of-the-art (SOTA) results on a number of speech separation benchmarks. One such class of models, known as temporal convolutional networks (TCNs), has shown promising results for speech separation tasks. A limitation of these models is that they have a fixed receptive field (RF). Recent research in speech dereverberation has shown that the optimal RF of a TCN varies with the reverberation characteristics of the speech signal. In this work, deformable convolution is proposed as a solution to allow TCN models to have dynamic RFs that can adapt to various reverberation times for reverberant speech separation. The proposed models are capable of achieving an 11.1 dB average scale-invariant signal-to-distortion ratio (SISDR) improvement over the input signal on the WHAMR benchmark. A relatively small deformable TCN model of 1.3M parameters is proposed which gives comparable separation performance to larger and more computationally complex models.
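For reference, the SISDR metric reported above can be sketched as follows. This is a generic illustration of scale-invariant SDR in plain Python, not the authors' evaluation code; the function name and list-based interface are assumptions for illustration.

```python
import math

def si_sdr(estimate, target):
    """Scale-invariant signal-to-distortion ratio in dB (illustrative sketch).

    The target is rescaled by the optimal projection coefficient so the
    metric is invariant to any rescaling of the estimate.
    """
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    alpha = dot / target_energy  # optimal scaling of the target
    s_target = [alpha * t for t in target]  # target component of the estimate
    e_noise = [e - s for e, s in zip(estimate, s_target)]  # residual distortion
    num = sum(s * s for s in s_target)
    den = sum(n * n for n in e_noise)
    return 10 * math.log10(num / den)
```

Because of the projection step, scaling the estimate by any nonzero constant leaves the score unchanged, which is the property that distinguishes SISDR from plain SDR.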
Index Terms— speech separation, deformable convolution, dynamic neural networks
1. INTRODUCTION
The separation of overlapping speech signals is an area that has been widely studied and which has many applications [1–4]. Deep learning models have demonstrated impressive results on separating clean speech mixtures [5, 6]. However, this performance still degrades heavily under noisy reverberant conditions [7]. This performance loss can be alleviated somewhat with careful hyper-parameter optimization, but a significant performance gap still exists [8].

The Conv-TasNet speech separation model has been widely studied and adapted for a number of speech enhancement tasks [5, 9–11]. Conv-TasNet generally performs very well on clean speech mixtures with a very low computational cost compared to the most performant speech separation models [6, 12, 13] on the WSJ0-2Mix benchmark [14]. As such, it is still used in many related areas of research [9, 11]. Recent research efforts in speech separation have focused on producing more resource-efficient models, even if these do not achieve SOTA results on separation benchmarks [12, 13]. Previous work has investigated adaptations to Conv-TasNet with additional modifications, such as multi-scale convolution and gating mechanisms applied to the outputs of convolutional layers, but these significantly increase the computational
complexity [15].

This work was supported by the Centre for Doctoral Training in Speech and Language Technologies (SLT) and their Applications, funded by UK Research and Innovation [grant number EP/S023062/1]. This work was also funded in part by 3M Health Information Systems, Inc.

The Conv-TasNet model uses a sequence model known as a TCN. It was recently shown that the optimal RF of TCNs
in dereverberation models varies with reverberation time when the model size is sufficiently large [10]. Furthermore, it was shown that multi-dilation TCN models can be trained implicitly to weight differently dilated convolutional kernels so as to focus, within the RF, on more or less temporal context according to the reverberation time in the data for dereverberation tasks [16], i.e. for larger reverberation times more weight was given to kernels with larger dilation factors.
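The fixed RF discussed above can be computed in closed form for a standard dilated TCN. The sketch below is illustrative rather than taken from the paper; it assumes the common Conv-TasNet layout of repeated stacks of convolutional blocks with the dilation factor doubling from block to block.

```python
def tcn_receptive_field(kernel_size, n_repeats, n_blocks):
    """Receptive field (in frames) of a TCN built from n_repeats stacks
    of n_blocks dilated conv blocks, dilation doubling within each stack.

    Each layer with kernel size P and dilation d widens the RF by (P-1)*d.
    """
    rf = 1
    for _ in range(n_repeats):
        for x in range(n_blocks):
            rf += (kernel_size - 1) * 2 ** x
    return rf
```

For example, a single stack of three blocks with kernel size 3 yields an RF of 15 frames, while three stacks of eight blocks yield 1531 frames; the point above is that no single fixed choice is optimal across all reverberation times.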
In this work, deformable depthwise convolutional layers [17–19] are proposed as a replacement for standard depthwise convolutional layers [5] in TCN-based speech separation models for reverberant acoustic conditions. Deformable convolution allows each convolutional layer to have an adaptive RF. When used as a replacement for standard convolution in a TCN, this enables the TCN to have an RF that can adapt to different reverberant conditions. Using shared weights [15] and dynamic mixing [20] are also explored as ways to reduce model size and improve performance. A PyTorch library for training deformable 1D convolutional layers as well as a SpeechBrain [21] recipe for reproducing results (cf. Section 5) are provided.
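As a rough illustration of the deformable convolution idea (not the authors' PyTorch library), the following sketch implements a single-channel deformable 1-D convolution in plain Python: each kernel tap is displaced by a per-position fractional offset, and the input is sampled at the resulting non-integer position by linear interpolation. All names, the offset layout, and the zero-padding choice are assumptions for illustration.

```python
import math

def deformable_conv1d(signal, weights, offsets, dilation=1):
    """Single-channel deformable 1-D convolution (illustrative sketch).

    Kernel tap k at output position t samples the input at the fractional
    position t + k*dilation + offsets[t][k], using linear interpolation
    between the two nearest samples, with zero padding outside the signal.
    """
    K = len(weights)
    T = len(signal)

    def sample(pos):
        # linear interpolation with zero padding outside [0, T-1]
        lo = math.floor(pos)
        frac = pos - lo
        left = signal[lo] if 0 <= lo < T else 0.0
        right = signal[lo + 1] if 0 <= lo + 1 < T else 0.0
        return (1 - frac) * left + frac * right

    out = []
    for t in range(T):
        acc = 0.0
        for k in range(K):
            acc += weights[k] * sample(t + k * dilation + offsets[t][k])
        out.append(acc)
    return out
```

With all offsets set to zero this reduces to an ordinary dilated convolution; learned non-zero offsets let each layer stretch or shrink its effective RF, which is the mechanism exploited here for varying reverberation times.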
The remainder of the paper proceeds as follows. In Section 2 the
signal model is discussed. The deformable temporal convolutional
network (DTCN) is introduced in Section 3. Section 4 discusses the
experimental setup, data and baseline systems. Results are given in
Section 5. Section 6 provides analysis of the proposed models and
conclusions are provided in Section 7.
2. SIGNAL MODEL
A noisy reverberant mixture of C speech signals s_c[i] for discrete sample index i, convolved with room impulse responses (RIRs) h_c[i] and corrupted by an additive noise signal ν[i], is defined as

x[i] = \sum_{c=1}^{C} h_c[i] \ast s_c[i] + \nu[i]    (1)
where \ast is the convolution operator. The goal in this work is to estimate the direct speech signal s_dir,c[i] and remove the reverberant reflections s_rev,c[i], where

x[i] = \sum_{c=1}^{C} \left( s_{\mathrm{dir},c}[i] + s_{\mathrm{rev},c}[i] \right) + \nu[i].    (2)
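Eq. (1) can be illustrated with a minimal synthesis sketch in plain Python; the function names are hypothetical and the truncation of the mixture to the noise length is a simplifying assumption.

```python
def convolve(x, h):
    """Full linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def noisy_reverberant_mixture(sources, rirs, noise):
    """x[i] = sum_c h_c[i] * s_c[i] + nu[i], truncated to len(noise)."""
    T = len(noise)
    x = list(noise)
    for s, h in zip(sources, rirs):
        y = convolve(s, h)  # reverberant image of source s
        for i in range(min(T, len(y))):
            x[i] += y[i]
    return x
```

In terms of Eq. (2), the direct signal s_dir,c corresponds to convolution with the early taps of h_c, and the reverberant reflections s_rev,c to convolution with the remaining taps.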
3. DEFORMABLE TEMPORAL CONVOLUTIONAL
SEPARATION NETWORK
3.1. Network Architecture
The separation network uses a mask-based approach similar to [5].
The noisy reverberant microphone signal is first segmented into Lx
arXiv:2210.15305v3 [cs.SD] 10 Mar 2023