SPEECH DEREVERBERATION WITH A REVERBERATION TIME SHORTENING TARGET
Rui Zhou1, Wenye Zhu1,2, Xiaofei Li1,*
1Westlake University & Westlake Institute for Advanced Study, Hangzhou, China
2Zhejiang University, Hangzhou, China
ABSTRACT
This work proposes a new learning target based on reverberation
time shortening (RTS) for speech dereverberation. The learning tar-
get for dereverberation is usually set as the direct-path speech or
optionally with some early reflections. This type of target suddenly
truncates the reverberation, and thus it may not be suitable for net-
work training. The proposed RTS target suppresses reverberation
and meanwhile maintains the exponential decaying property of re-
verberation, which will ease the network training, and thus reduce
signal distortion caused by the prediction error. Moreover, this work
experimentally studies adapting our previously proposed FullSubNet
speech denoising network to speech dereverberation. Experiments
show that RTS is a more suitable learning target than direct-path
speech and early reflections, in terms of better suppressing reverber-
ation and signal distortion. FullSubNet is able to achieve outstanding
dereverberation performance.
Index Terms: Speech dereverberation, reverberation time
shortening, FullSubNet
1. INTRODUCTION
Severe late reverberation brings significant damage to the quality and
intelligibility of speech [1] and will also cause performance degrada-
tion for back-end tasks such as automatic speech recognition (ASR)
[2]. Normally, early reflections do not have such a negative effect
[3]. Speech dereverberation, especially in the single-channel case,
is still a challenging task.
Before deep neural networks (DNNs) were widely used,
traditional dereverberation methods were based on statistical models
and signal processing algorithms. The essential problem of dere-
verberation is the deconvolution between speech signal and room
impulse response (RIR). Deconvolution can be accomplished by ap-
plying an inverse filter of RIR to the reverberant speech, which is
referred to as inverse filtering methods, e.g. [4], [5], [6]. For
inverse filtering methods, an accurate RIR must first be blindly identified,
which is very challenging, especially for the single-channel case
[7]. Even if the RIR is known, due to its non-minimum phase char-
acteristics in typical cases, directly computing its inverse filter will
cause system instability or non-causality [8], [9]. Moreover, inverse
filtering is very sensitive to noise. Alternatively, instead of resolving
the inverse filter of RIR, weighted prediction error (WPE) [10, 11]
uses linear prediction to directly estimate the inverse filter from re-
verberant signal, and applies the inverse filter to remove late rever-
beration. WPE has achieved remarkable performance, and is one of
the most popular dereverberation methods. Another technical line of
dereverberation is spectral subtraction, following the perspective of
speech enhancement. Late reverberation can be considered as additive
noise, which is assumed to be independent of the direct-path signal
and early reflections [12, 13]. In [14], methods for estimating the
power spectral density of late reverberation have been summarized.
*Corresponding author.
The application of DNNs has brought great progress in
speech dereverberation. The basic idea is to construct a nonlinear
mapping function, based on supervised learning of DNN, from the
spectral feature of reverberant speech to the one of target speech.
The input feature could be directly the time-domain signal, or the
STFT (short-time Fourier transform) coefficients or magnitude spec-
trum of reverberant speech. Correspondingly, the output features
could be the time-domain signal, STFT coefficients, magnitude spec-
trum or magnitude mask of target speech. The network architecture
used for single-channel speech dereverberation has evolved
considerably, from the initial fully connected networks
[15] to recurrent neural networks (RNN) with long short-term
memory (LSTM) for time series modeling [16, 17], and to convolutional
neural networks (CNN), such as U-Net [18, 19] and temporal
convolutional networks (TCN) [20], then to (self-)attention-based
methods [21, 22].
In this work, we experimentally study adapting our two previously
proposed speech denoising networks, i.e. the subband LSTM network
[23] (referred to as SubNet) and FullSubNet [24], to speech
dereverberation. Based on the cross-band filters model [25], the
time-domain convolution between source speech and RIR can be
decomposed into subband convolutions, and hence speech dereverberation
can be perfectly performed in each subband based on deconvolution
or inverse filtering. SubNet inputs the noisy spectra of one
frequency and its neighbouring frequencies, and outputs/predicts the
clean speech spectra of this frequency, which seems exactly suitable
for speech dereverberation by mimicking the inverse filtering pro-
cess. FullSubNet combines SubNet with a fullband network to also
exploit the fullband spectral pattern, as the enhanced speech should
have a correct spectral pattern across all frequencies. FullSubNet
used for speech dereverberation can be seen as a combination of
speech spectral regression and subband inverse filtering. Experiments
show that SubNet and FullSubNet are indeed able to achieve out-
standing dereverberation performance.
More importantly, this work proposes a new learning target
based on reverberation time shortening (RTS). DNN-based methods
in the literature normally take the direct-path speech as the learning
target, which is a very strict target since it requires removing all
reverberation. As a result, they normally have a large prediction error,
which may cause speech distortion. Since early reflections do not
cause speech quality degradation, they are often preserved and only
late reverberation is removed, such as in WPE [10, 11] and the
spectral subtraction methods [14]. Preserving early reflections in
the learning target would reduce the prediction error of the net-
work. However, preserving only early reflections will also reduce
the sound naturalness, as sounds in real life never have this
type of reverberation. In addition, no matter which training
target is used, the direct path or early reflections, the network needs
to learn a sudden truncation of reverberation, which is not fully
suitable for network training and will cause signal distortion. The
proposed learning target is a shortened version of the original RIR,
and has a small target T60, e.g. 0.15 s. Instead of suddenly truncating
the RIR, this target still maintains the property of exponential decay,
which will maintain the sound naturalness and also ease the network
training. Experiments show that using the proposed learning target
can more effectively suppress reverberation and signal distortion. In
the context of channel equalization, the RIR reshaping method [26]
shares a similar spirit with the proposed RTS target.
arXiv:2210.11089v6 [eess.AS] 6 Jun 2023
2. THE PROPOSED METHOD
Denote single-channel signals in the time domain as:
$$y(n) = s(n) * a(n) + e(n), \tag{1}$$
where $*$ stands for convolution and $n$ denotes the discrete time index.
$y(n)$, $s(n)$, $a(n)$ and $e(n)$ are the reverberant speech, clean speech,
RIR and ambient noise, respectively. This work mainly focuses on
dereverberation, but a certain amount of ambient noise will also be
considered and suppressed.
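The signal model of Eq. (1) can be illustrated with a minimal NumPy sketch. The function name `simulate_reverberant`, the white Gaussian noise model and the truncation of the convolution to the length of the source are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

def simulate_reverberant(s, a, noise_std=0.0, seed=0):
    """Return y(n) = s(n) * a(n) + e(n), truncated to the length of s."""
    rng = np.random.default_rng(seed)
    reverberant = np.convolve(s, a)[: len(s)]     # s(n) * a(n)
    e = noise_std * rng.standard_normal(len(s))   # assumed white noise e(n)
    return reverberant + e

# Toy example: a unit impulse as "speech", so y reproduces the RIR.
s = np.zeros(8)
s[0] = 1.0
a = np.array([1.0, 0.5, 0.25])
y = simulate_reverberant(s, a)
```

With `noise_std=0.0` the output is purely the convolution term, which makes the role of the RIR easy to inspect.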
We can divide the RIR $a(n)$ into two parts, where $a_d(n)$ and
$a_u(n) = a(n) - a_d(n)$ are the desired and undesired parts, respectively.
The reverberant speech can be rewritten as:
$$s(n) * a(n) = s(n) * a_d(n) + s(n) * a_u(n) = x(n) + u(n). \tag{2}$$
This work aims to recover the desired signal $x(n)$ from the reverberant
and noisy speech $y(n)$.
Setting the learning target as the direct-path speech, optionally with some
early reflections, amounts to applying a rectangular window $w_{\mathrm{rect}}(n)$
to the RIR to obtain the desired part of the RIR, i.e. $a_d(n) = w_{\mathrm{rect}}(n)\,a(n)$.
The rectangular windows for the direct path and for 50 ms of early reflections
are shown in Fig. 1 (a). The rectangular window suddenly truncates
the RIR, as shown in Fig. 1 (b) and (c) for the targets of direct path and
early reflections, respectively. This may make it hard for the neural network
to learn a mapping function between the input and the output,
and lead to a large prediction error and signal distortion.
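As a rough sketch of the rectangular-window targets described above, the following code builds the window for the direct path alone and for the direct path plus 50 ms of early reflections. The helper `rect_window`, the 16 kHz sampling rate, the all-ones placeholder RIR and the convention that the sample at the direct-path end index is kept are all assumptions for illustration.

```python
import numpy as np

def rect_window(length, n_direct_end, early_ms=0.0, fs=16000):
    """Rectangular window: 1 up to the last kept sample, 0 afterwards."""
    # Last kept index: end of direct path plus the early-reflection span.
    last_kept = n_direct_end + int(round(early_ms * 1e-3 * fs))
    w = np.zeros(length)
    w[: min(last_kept + 1, length)] = 1.0
    return w

fs = 16000                 # assumed sampling rate
a = np.ones(2000)          # placeholder RIR, only used to show the masking
n1 = 40                    # hypothetical index where the direct path ends

w_direct = rect_window(len(a), n1)                       # direct path only
w_early = rect_window(len(a), n1, early_ms=50.0, fs=fs)  # plus 50 ms
a_d = w_early * a                                        # desired part of the RIR
```

The jump from 1 to 0 at the window edge is the sudden truncation that the text refers to.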
2.1. Learning Target: Reverberation Time Shortening
In this work, we propose a new learning target based on RTS, which
is a shortened version of the original RIR, and has a small target
T60, e.g. 0.15 s. Instead of suddenly truncating the RIR, the RTS
target still maintains the property of exponential decay, which will
maintain the sound naturalness and also ease the network training.
Formally, we define the new window function as:
$$w(n) = \begin{cases} 1 & \text{for } n \le N_1 \\ 10^{-q(n-N_1)} & \text{for } n > N_1 \end{cases} \tag{3}$$
where $N_1$ denotes the discrete time index at which the direct path ends.
The parameter $q$ controls the decay rate of the window. The original
RIR is shortened by applying this window.
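The window of Eq. (3) translates directly into a few lines of NumPy. This is only a sketch: the function name `rts_window` and the example values of the direct-path index and decay rate are hypothetical.

```python
import numpy as np

def rts_window(length, n1, q):
    """Window of Eq. (3): 1 for n <= n1, 10^(-q (n - n1)) for n > n1."""
    n = np.arange(length)
    w = np.ones(length)
    tail = n > n1
    w[tail] = 10.0 ** (-q * (n[tail] - n1))
    return w

# Hypothetical values, chosen only to make the decay easy to read.
w = rts_window(10, n1=3, q=0.5)
```

Unlike a rectangular window, this window never drops abruptly to zero; it decays smoothly after the direct path.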
In Polack's statistical model [27], the reverberation component
of the RIR can be realized by a Gaussian process with an exponentially
decaying envelope. Based on this model, the RIR can be written
in the form of:
$$a(n) \approx b(n)\,10^{-p(n-N_1)} \quad \text{for } n > N_1, \tag{4}$$
where $b(n)$ denotes a zero-mean Gaussian noise sequence, and $p$
reflects the decay rate.
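A synthetic RIR following the form of Eq. (4) can be sketched as below. The model only describes the tail after the direct path, so setting the samples before the direct path to zero and the direct-path sample to one is an assumption of this sketch, as are the function name `polack_rir` and the example parameter values.

```python
import numpy as np

def polack_rir(length, n1, p, seed=0):
    """Gaussian noise b(n) shaped by the envelope 10^(-p (n - n1))."""
    rng = np.random.default_rng(seed)
    n = np.arange(length)
    b = rng.standard_normal(length)                 # zero-mean Gaussian b(n)
    a = b * 10.0 ** (-p * np.maximum(n - n1, 0))    # decaying tail
    a[:n1] = 0.0   # assumption: silence before the direct path
    a[n1] = 1.0    # assumption: unit direct-path impulse
    return a

# Hypothetical parameters: direct path at sample 40, decay rate 1e-3.
a = polack_rir(length=4000, n1=40, p=0.001)
```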
Fig. 1. (a) Window functions. The desired part of the RIR for (b) direct
path, (c) 50 ms of early reflections, (d) RTS.
Applying the window function to the original RIR, the target
$$a_d(n) = w(n)\,a(n) \;\; \forall n, \qquad a_d(n) \approx b(n)\,10^{-(p+q)(n-N_1)} \;\; \text{for } n > N_1, \tag{5}$$
is still exponentially decaying, with a new decay rate of $p+q$.
Based on the definition of T60, namely the power decaying by 60
dB, the original T60 of $a(n)$ and the new T60 of $a_d(n)$ (denoted as
$T'_{60}$) are respectively
$$T_{60} = \frac{3}{p f_s}, \qquad T'_{60} = \frac{3}{(p+q) f_s}, \tag{6}$$
where $f_s$ denotes the sampling frequency. It is obvious that $T'_{60}$ is
smaller than $T_{60}$ as long as $q$ is positive.
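The factor 3 in Eq. (6) follows from the 60 dB definition applied to the amplitude envelope of Eq. (4). A short derivation, where the symbol $n_{60}$ (the number of samples after the direct path needed for a 60 dB decay) is introduced here only for illustration:

```latex
\begin{align}
20 \log_{10}\!\left(10^{-p\, n_{60}}\right) = -60
  \;&\Rightarrow\; n_{60} = \frac{3}{p}, \\
T_{60} = \frac{n_{60}}{f_s} \;&=\; \frac{3}{p f_s}.
\end{align}
```

Replacing $p$ with $p+q$ gives the shortened reverberation time in the same way.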
In practice, we set the learning target with a desired $T'_{60}$. Given
the original and target T60s, the window parameter $q$ is set to:
$$q = \frac{3}{T'_{60} f_s} - \frac{3}{T_{60} f_s}. \tag{7}$$
Fig. 1 (a) shows the window function for the case with $T_{60} = 0.7$
s and $T'_{60} = 0.15$ s, and Fig. 1 (d) shows the corresponding desired
part of the RIR. Unlike the proposed RTS target, which sets a varying
decay rate $q$ according to the original and target T60s, the
exponential window proposed in [28] uses a constant decay rate.
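The computation of the window parameter from Eq. (7) can be checked numerically for the example of Fig. 1 (a). The sketch below assumes a sampling rate of 16 kHz, which is not stated in this section, and the function name `rts_decay_rate` is hypothetical.

```python
def rts_decay_rate(t60, t60_target, fs):
    """Window parameter q of Eq. (7), in units of decades per sample."""
    return 3.0 / (t60_target * fs) - 3.0 / (t60 * fs)

fs = 16000                      # assumed sampling rate in Hz
t60, t60_target = 0.7, 0.15     # values used for Fig. 1 (a)

p = 3.0 / (t60 * fs)            # original decay rate, from Eq. (6)
q = rts_decay_rate(t60, t60_target, fs)
t60_new = 3.0 / ((p + q) * fs)  # T60 after windowing, from Eq. (6)
```

Under these assumptions `q` is roughly 9.8e-4, and substituting `p + q` back into Eq. (6) recovers the 0.15 s target. If the target T60 were longer than the original one, `q` would come out negative, which is outside the case considered in the text.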
2.2. Single-channel Dereverberation Neural Networks
First, we analyze how to model reverberation in the time-frequency
domain, based on which a dereverberation neural network can be
properly designed, selected and analyzed. Applying the STFT to Eq. (1), ignoring
the additive noise term and using the narrow-band assumption,
we have $Y(k, p) \approx S(k, p) A(k)$, where $p \in [1, P]$ and $k \in
[0, K-1]$ denote the frame and frequency indices (here $p$ is the frame
index, not the decay rate of Eq. (4)), $Y(k, p)$ and
$S(k, p)$ are the STFT coefficients of $y(n)$ and $s(n)$, and $A(k)$ is the
Fourier transform of $a(n)$. This assumption is only valid when the RIR
is short relative to the STFT window, which is obviously not the case
for the dereverberation problem, where the RIR is normally very long.