
SPEECH DEREVERBERATION WITH A REVERBERATION TIME SHORTENING TARGET
Rui Zhou1, Wenye Zhu1,2, Xiaofei Li1,∗
1Westlake University & Westlake Institute for Advanced Study, Hangzhou, China
2Zhejiang University, Hangzhou, China
ABSTRACT
This work proposes a new learning target based on reverberation
time shortening (RTS) for speech dereverberation. The learning tar-
get for dereverberation is usually set as the direct-path speech or
optionally with some early reflections. This type of target suddenly
truncates the reverberation, and thus it may not be suitable for net-
work training. The proposed RTS target suppresses reverberation
and meanwhile maintains the exponential decaying property of re-
verberation, which will ease the network training, and thus reduce
signal distortion caused by the prediction error. Moreover, this work
experimentally studies the adaptation of our previously proposed FullSubNet
speech denoising network to speech dereverberation. Experiments
show that RTS is a more suitable learning target than direct-path
speech and early reflections, in terms of better suppressing reverber-
ation and signal distortion. FullSubNet is able to achieve outstanding
dereverberation performance.
Index Terms—Speech dereverberation, Reverberation time
shortening, FullSubNet
1. INTRODUCTION
Severe late reverberation brings significant damage to the quality and
intelligibility of speech [1] and will also cause performance degrada-
tion for back-end tasks such as automatic speech recognition (ASR)
[2]. Normally, early reflections do not cause as much negative effect
[3]. Speech dereverberation, especially in the single-channel case,
is still a challenging task.
Before deep neural networks (DNNs) were widely used,
traditional dereverberation methods were based on statistical models
and signal processing algorithms. The essential problem of dere-
verberation is the deconvolution between speech signal and room
impulse response (RIR). Deconvolution can be accomplished by ap-
plying an inverse filter of RIR to the reverberant speech, which is
referred to as inverse filtering methods, e.g. [4], [5], [6]. For the
inverse filtering methods, an accurate RIR must first be blindly identified,
which is very challenging, especially in the single-channel case
[7]. Even if the RIR is known, due to its non-minimum phase char-
acteristics in typical cases, directly computing its inverse filter will
cause system instability or non-causality [8], [9]. Moreover, inverse
filtering is very sensitive to noise. Alternatively, instead of deriving
the inverse filter from the RIR, weighted prediction error (WPE) [10, 11]
uses linear prediction to estimate the inverse filter directly from the
reverberant signal, and applies it to remove late reverberation.
WPE has achieved remarkable performance, and is one of
the most popular dereverberation methods. Another technical line of
dereverberation is spectral subtraction, following the perspective of
speech enhancement. Late reverberation can be considered as additive
noise, which is assumed to be independent of the direct-path signal
and early reflections [12, 13]. In [14], methods for estimating the
power spectral density of late reverberation have been summarized.
*Corresponding author.
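
The additive view of late reverberation can be illustrated with a minimal sketch. It is not the method of any cited work: the RIR is a toy model (white noise under an exponential envelope with a hypothetical reverberation time `t60`, plus a unit direct path at lag 0), and the 50 ms boundary between early and late parts is an assumed value. Because convolution is linear, the reverberant signal splits exactly into a direct-plus-early component and a late component.

```python
import numpy as np

def synthetic_rir(t60, fs, rng):
    """Toy RIR (assumption): white noise whose amplitude envelope is -60 dB at t60."""
    n = int(fs * t60)
    t = np.arange(n) / fs
    envelope = 10.0 ** (-3.0 * t / t60)   # 20*log10(envelope) = -60 dB at t = t60
    h = rng.standard_normal(n) * envelope
    h[0] = 1.0                            # unit direct path at lag 0 (toy assumption)
    return h

def split_reverberant(s, h, fs, early_ms=50.0):
    """Split s * h into (direct + early) and late parts at an assumed boundary."""
    n_early = int(fs * early_ms / 1000.0)
    h_early = h.copy()
    h_early[n_early:] = 0.0               # keep only the first early_ms of the RIR
    h_late = h - h_early                  # remaining tail: late reverberation
    return np.convolve(s, h_early), np.convolve(s, h_late)

fs = 16000
rng = np.random.default_rng(0)
s = rng.standard_normal(4000)             # white noise standing in for source speech
h = synthetic_rir(0.6, fs, rng)
early, late = split_reverberant(s, h, fs)
x = np.convolve(s, h)                     # full reverberant signal
```

In this sketch `x` equals `early + late` up to floating-point error, which is why the late part can be treated as an additive disturbance; the independence assumption mentioned above is a modelling choice and is not demonstrated by the code.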
The application of DNNs has brought great progress to
speech dereverberation. The basic idea is to construct a nonlinear
mapping function, based on supervised learning of a DNN, from the
spectral feature of reverberant speech to that of the target speech.
The input feature could be the time-domain signal itself, or the
STFT (short-time Fourier transform) coefficients or magnitude spec-
trum of reverberant speech. Correspondingly, the output features
could be the time-domain signal, STFT coefficients, magnitude spectrum
or magnitude mask of the target speech. The network architecture
used for single-channel speech dereverberation has evolved considerably,
from the initial fully connected networks [15] to recurrent neural
networks (RNN) with long short-term memory (LSTM) for time series
modeling [16, 17], to convolutional neural networks (CNN), such as
U-Net [18, 19] and temporal convolutional networks (TCN) [20], and
then to (self-)attention-based methods [21, 22].
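
The input and output representations listed above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline of any cited method: the STFT parameters (`n_fft=512`, `hop=256`), the toy decaying filter used to create a reverberant signal, and the choice of a ratio mask clipped to [0, 1] are all hypothetical.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Hann-windowed STFT; returns an array of shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

rng = np.random.default_rng(0)
target = rng.standard_normal(16000)                # stand-in for the target speech
decay = 10.0 ** (-3.0 * np.arange(800) / 800.0)    # toy envelope, not a measured RIR
rir = rng.standard_normal(800) * decay
rir[0] = 1.0
reverberant = np.convolve(target, rir)[: len(target)]

X = stft(reverberant)                  # complex STFT coefficients (possible input)
S = stft(target)                       # complex STFT coefficients (possible output)
input_magnitude = np.abs(X)            # magnitude spectrum (possible input)
target_magnitude = np.abs(S)           # magnitude spectrum (possible output)

# Magnitude mask (assumed form): ratio of target to input magnitude, clipped to [0, 1].
eps = 1e-8
magnitude_mask = np.clip(target_magnitude / (input_magnitude + eps), 0.0, 1.0)
enhanced_magnitude = magnitude_mask * input_magnitude
```

A network trained with a mask target would predict `magnitude_mask` from `input_magnitude`, and the enhanced magnitude would then be combined with a phase (for example the reverberant phase) to resynthesize the signal.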
In this work, we experimentally study the adaptation of our two previously
proposed speech denoising networks, i.e. the subband LSTM network
[23] (referred to as SubNet) and FullSubNet [24], to speech
dereverberation. Based on the cross-band filters model [25], the
time-domain convolution between source speech and RIR can be
decomposed into subband convolutions, and hence speech dereverberation
can, in principle, be perfectly performed in each subband by deconvolution
or inverse filtering. SubNet takes as input the noisy spectra of one
frequency and its neighbouring frequencies, and outputs/predicts the
clean speech spectra of this frequency, which seems exactly suitable
for speech dereverberation by mimicking the inverse filtering pro-
cess. FullSubNet combines SubNet with a fullband network to also
exploit the fullband spectral pattern, as the enhanced speech should
have a correct spectral pattern across all frequencies. FullSubNet
used for speech dereverberation can be seen as a combination of
speech spectral regression and subband inverse filtering. Experiments
show that SubNet and FullSubNet are indeed able to achieve out-
standing dereverberation performance.
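
The subband input described above (one frequency together with its neighbouring frequencies) can be sketched as follows. This is a hedged illustration, not the released SubNet or FullSubNet code: the function name `subband_units`, the use of magnitude spectra, the number of neighbours, and reflect padding at the band edges are all assumptions.

```python
import numpy as np

def subband_units(mag, n_neighbors=2):
    """Group each frequency with its neighbours.

    mag: array of shape (F, T), e.g. a magnitude spectrogram.
    Returns an array of shape (F, 2 * n_neighbors + 1, T), where unit f holds
    frequencies f - n_neighbors, ..., f + n_neighbors along axis 1.
    """
    # Pad along frequency so that edge bins also have a full set of neighbours.
    padded = np.pad(mag, ((n_neighbors, n_neighbors), (0, 0)), mode="reflect")
    width = 2 * n_neighbors + 1
    return np.stack([padded[f : f + width] for f in range(mag.shape[0])])

rng = np.random.default_rng(0)
mag = np.abs(rng.standard_normal((257, 61)))   # toy spectrogram: 257 bins, 61 frames
units = subband_units(mag, n_neighbors=2)      # shape (257, 5, 61)
```

Each of the 257 units is a short multi-channel time sequence that a shared subband model could process independently, with the centre channel `units[:, 2, :]` being the frequency whose output is predicted.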
More importantly, this work proposes a new learning target
based on reverberation time shortening (RTS). DNN-based methods
in the literature normally take the direct-path speech as the learning
target, which is actually a very strict target, as it requires removing all
reverberation. As a result, they normally have a large prediction error,
which may cause speech distortion. Since early reflections do not
cause speech quality degradation, they are often preserved and only
late reverberation is removed, such as in WPE [10, 11] and the
spectral subtraction methods [14]. Preserving early reflections in
the learning target would reduce the prediction error of the network.
However, preserving only early reflections will also reduce
the naturalness of the sound, as sounds in real life never exhibit this
form of reverberation. In addition, no matter which training
target is used, the direct path or early reflections, the network needs
arXiv:2210.11089v6 [eess.AS] 6 Jun 2023