
DEFORMABLE TEMPORAL CONVOLUTIONAL NETWORKS FOR MONAURAL NOISY
REVERBERANT SPEECH SEPARATION
William Ravenscroft, Stefan Goetze, and Thomas Hain
Department of Computer Science, The University of Sheffield, Sheffield, United Kingdom
{jwravenscroft1, s.goetze, t.hain}@sheffield.ac.uk
ABSTRACT
Speech separation models are used for isolating individual speakers in many speech processing applications. Deep learning models have been shown to lead to state-of-the-art (SOTA) results on a number of speech separation benchmarks. One such class of models, known as temporal convolutional networks (TCNs), has shown promising results for speech separation tasks. A limitation of these models is that they have a fixed receptive field (RF). Recent research in speech dereverberation has shown that the optimal RF of a TCN varies with the reverberation characteristics of the speech signal. In this work, deformable convolution is proposed as a solution to allow TCN models to have dynamic RFs that can adapt to various reverberation times for reverberant speech separation. The proposed models are capable of achieving an 11.1 dB average scale-invariant signal-to-distortion ratio (SISDR) improvement over the input signal on the WHAMR benchmark. A relatively small deformable TCN model of 1.3M parameters is proposed which gives comparable separation performance to larger and more computationally complex models.
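For reference, the SISDR metric reported above can be sketched as follows. This is a generic illustration of scale-invariant SDR in plain Python, not the authors' evaluation code; the function name and list-based interface are assumptions for illustration.

```python
import math

def si_sdr(estimate, target):
    """Scale-invariant signal-to-distortion ratio in dB (illustrative sketch).

    The target is rescaled by the optimal projection coefficient so the
    metric is invariant to any rescaling of the estimate.
    """
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    alpha = dot / target_energy  # optimal scaling of the target
    s_target = [alpha * t for t in target]  # target component of the estimate
    e_noise = [e - s for e, s in zip(estimate, s_target)]  # residual distortion
    num = sum(s * s for s in s_target)
    den = sum(n * n for n in e_noise)
    return 10 * math.log10(num / den)
```

Because of the projection step, scaling the estimate by any nonzero constant leaves the score unchanged, which is the property that distinguishes SISDR from plain SDR.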
Index Terms— speech separation, deformable convolution, dynamic neural networks
1. INTRODUCTION
The separation of overlapping speech signals is an area that has been widely studied and which has many applications [1–4]. Deep learning models have demonstrated impressive results on separating clean speech mixtures [5, 6]. However, this performance still degrades heavily under noisy reverberant conditions [7]. This performance loss can be alleviated somewhat with careful hyper-parameter optimization, but a significant performance gap still exists [8].

The Conv-TasNet speech separation model has been widely studied and adapted for a number of speech enhancement tasks [5, 9–11]. Conv-TasNet generally performs very well on clean speech mixtures with a very low computational cost compared to the most performant speech separation models [6, 12, 13] on the WSJ0-2Mix benchmark [14]. As such, it is still used in many related areas of research [9, 11]. Recent research efforts in speech separation have focused on producing more resource-efficient models, even if these do not achieve SOTA results on separation benchmarks [12, 13]. Previous work has investigated adaptations to Conv-TasNet with additional modifications, such as multi-scale convolution and gating mechanisms applied to the outputs of convolutional layers, but these significantly increase the computational
complexity [15].

This work was supported by the Centre for Doctoral Training in Speech and Language Technologies (SLT) and their Applications, funded by UK Research and Innovation [grant number EP/S023062/1]. This work was also funded in part by 3M Health Information Systems, Inc.

The Conv-TasNet model uses a sequence model known as a TCN. It was recently shown that the optimal RF of TCNs
in dereverberation models varies with reverberation time when the model size is sufficiently large [10]. Furthermore, it was shown that multi-dilation TCN models can be trained implicitly to weight differently dilated convolutional kernels so as to focus, within the RF, on more or less temporal context according to the reverberation time in the data for dereverberation tasks [16], i.e. for larger reverberation times more weight was given to kernels with larger dilation factors.
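The fixed RF discussed above can be computed in closed form for a standard dilated TCN. The sketch below is illustrative rather than taken from the paper; it assumes the common Conv-TasNet layout of repeated stacks of convolutional blocks with the dilation factor doubling from block to block.

```python
def tcn_receptive_field(kernel_size, n_repeats, n_blocks):
    """Receptive field (in frames) of a TCN built from n_repeats stacks
    of n_blocks dilated conv blocks, dilation doubling within each stack.

    Each layer with kernel size P and dilation d widens the RF by (P-1)*d.
    """
    rf = 1
    for _ in range(n_repeats):
        for x in range(n_blocks):
            rf += (kernel_size - 1) * 2 ** x
    return rf
```

For example, a single stack of three blocks with kernel size 3 yields an RF of 15 frames, while three stacks of eight blocks yield 1531 frames; the point above is that no single fixed choice is optimal across all reverberation times.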
In this work, deformable depthwise convolutional layers [17–19] are proposed as a replacement for standard depthwise convolutional layers [5] in TCN-based speech separation models for reverberant acoustic conditions. Deformable convolution allows each convolutional layer to have an adaptive RF. When used as a replacement for standard convolution in a TCN, this enables the TCN to have an RF that can adapt to different reverberant conditions. Using shared weights [15] and dynamic mixing [20] are also explored as ways to reduce model size and improve performance. A PyTorch library for training deformable 1D convolutional layers as well as a SpeechBrain [21] recipe for reproducing results (cf. Section 5) are provided.
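As a rough illustration of the deformable convolution idea (not the authors' PyTorch library), the following sketch implements a single-channel deformable 1-D convolution in plain Python: each kernel tap is displaced by a per-position fractional offset, and the input is sampled at the resulting non-integer position by linear interpolation. All names, the offset layout, and the zero-padding choice are assumptions for illustration.

```python
import math

def deformable_conv1d(signal, weights, offsets, dilation=1):
    """Single-channel deformable 1-D convolution (illustrative sketch).

    Kernel tap k at output position t samples the input at the fractional
    position t + k*dilation + offsets[t][k], using linear interpolation
    between the two nearest samples, with zero padding outside the signal.
    """
    K = len(weights)
    T = len(signal)

    def sample(pos):
        # linear interpolation with zero padding outside [0, T-1]
        lo = math.floor(pos)
        frac = pos - lo
        left = signal[lo] if 0 <= lo < T else 0.0
        right = signal[lo + 1] if 0 <= lo + 1 < T else 0.0
        return (1 - frac) * left + frac * right

    out = []
    for t in range(T):
        acc = 0.0
        for k in range(K):
            acc += weights[k] * sample(t + k * dilation + offsets[t][k])
        out.append(acc)
    return out
```

With all offsets set to zero this reduces to an ordinary dilated convolution; learned non-zero offsets let each layer stretch or shrink its effective RF, which is the mechanism exploited here for varying reverberation times.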
The remainder of the paper proceeds as follows. In Section 2 the
signal model is discussed. The deformable temporal convolutional
network (DTCN) is introduced in Section 3. Section 4 discusses the
experimental setup, data and baseline systems. Results are given in
Section 5. Section 6 provides analysis of the proposed models and
conclusions are provided in Section 7.
2. SIGNAL MODEL
A noisy reverberant mixture of C speech signals s_c[i] for discrete sample index i, convolved with room impulse responses (RIRs) h_c[i] and corrupted by an additive noise signal ν[i], is defined as

x[i] = \sum_{c=1}^{C} h_c[i] \ast s_c[i] + \nu[i]    (1)
where \ast is the convolution operator. The goal in this work is to estimate the direct speech signal s_dir,c[i] and remove the reverberant reflections s_rev,c[i], where

x[i] = \sum_{c=1}^{C} \left( s_{\mathrm{dir},c}[i] + s_{\mathrm{rev},c}[i] \right) + \nu[i].    (2)
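Eq. (1) can be illustrated with a minimal synthesis sketch in plain Python; the function names are hypothetical and the truncation of the mixture to the noise length is a simplifying assumption.

```python
def convolve(x, h):
    """Full linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def noisy_reverberant_mixture(sources, rirs, noise):
    """x[i] = sum_c h_c[i] * s_c[i] + nu[i], truncated to len(noise)."""
    T = len(noise)
    x = list(noise)
    for s, h in zip(sources, rirs):
        y = convolve(s, h)  # reverberant image of source s
        for i in range(min(T, len(y))):
            x[i] += y[i]
    return x
```

In terms of Eq. (2), the direct signal s_dir,c corresponds to convolution with the early taps of h_c, and the reverberant reflections s_rev,c to convolution with the remaining taps.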
3. DEFORMABLE TEMPORAL CONVOLUTIONAL
SEPARATION NETWORK
3.1. Network Architecture
The separation network uses a mask-based approach similar to [5].
The noisy reverberant microphone signal is first segmented into Lx
arXiv:2210.15305v3 [cs.SD] 10 Mar 2023