

SPEECH DEREVERBERATION WITH A REVERBERATION TIME SHORTENING TARGET

Rui Zhou1, Wenye Zhu1,2, Xiaofei Li1,∗

1Westlake University & Westlake Institute for Advanced Study, Hangzhou, China

2Zhejiang University, Hangzhou, China

ABSTRACT

This work proposes a new learning target based on reverberation time shortening (RTS) for speech dereverberation. The learning target for dereverberation is usually set as the direct-path speech, optionally with some early reflections. This type of target abruptly truncates the reverberation, and thus may not be suitable for network training. The proposed RTS target suppresses reverberation while maintaining the exponentially decaying property of reverberation, which eases network training and thus reduces the signal distortion caused by prediction error. Moreover, this work experimentally studies the adaptation of our previously proposed FullSubNet speech denoising network to speech dereverberation. Experiments show that RTS is a more suitable learning target than direct-path speech and early reflections, in terms of better suppressing both reverberation and signal distortion. FullSubNet is able to achieve outstanding dereverberation performance.

Index Terms—Speech dereverberation, Reverberation time

shortening, FullSubNet

1. INTRODUCTION

Severe late reverberation significantly damages the quality and intelligibility of speech [1], and also degrades the performance of back-end tasks such as automatic speech recognition (ASR) [2]. Early reflections normally do not have such a negative effect [3]. Speech dereverberation, especially in the single-channel case, is still a challenging task.

Before deep neural networks (DNNs) were widely used, traditional dereverberation methods were based on statistical models and signal processing algorithms. The essential problem of dereverberation is the deconvolution of the speech signal and the room impulse response (RIR). Deconvolution can be accomplished by applying an inverse filter of the RIR to the reverberant speech, which is referred to as inverse filtering, e.g. [4], [5], [6]. For inverse filtering methods, an accurate RIR must first be blindly identified, which is very challenging, especially in the single-channel case [7]. Even if the RIR is known, due to its non-minimum-phase characteristics in typical cases, directly computing its inverse filter will cause system instability or non-causality [8], [9]. Moreover, inverse filtering is very sensitive to noise. Alternatively, instead of solving for the inverse filter of the RIR, weighted prediction error (WPE) [10, 11] uses linear prediction to directly estimate the inverse filter from the reverberant signal, and applies the inverse filter to remove late reverberation. WPE has achieved remarkable performance and is one of the most popular dereverberation methods. Another technical line of dereverberation is spectral subtraction, following the perspective of speech enhancement. Late reverberation can be considered as additive noise, which is assumed to be independent of the direct-path signal and early reflections [12, 13]. Methods for estimating the power spectral density of late reverberation are summarized in [14].

*Corresponding author.

The application of DNNs has brought great progress in speech dereverberation. The basic idea is to construct a nonlinear mapping function, based on supervised learning of a DNN, from the feature of the reverberant speech to that of the target speech. The input feature can be the time-domain signal itself, or the short-time Fourier transform (STFT) coefficients or magnitude spectrum of the reverberant speech. Correspondingly, the output feature can be the time-domain signal, STFT coefficients, magnitude spectrum or magnitude mask of the target speech. The network architectures used for single-channel speech dereverberation have evolved considerably, from the initial fully connected networks [15] to recurrent neural networks (RNNs) with long short-term memory (LSTM) for time series modeling [16, 17], to convolutional neural networks (CNNs) such as U-Net [18, 19] and temporal convolutional networks (TCNs) [20], and then to (self-)attention-based methods [21, 22].

In this work, we experimentally study the adaptation of our two previously proposed speech denoising networks, i.e. the subband LSTM network [23] (referred to as SubNet) and FullSubNet [24], to speech dereverberation. Based on the cross-band filter model [25], the time-domain convolution between the source speech and the RIR can be decomposed into subband convolutions, and hence speech dereverberation can be perfectly performed in each subband by deconvolution or inverse filtering. SubNet takes as input the noisy spectra of one frequency and its neighbouring frequencies, and predicts the clean speech spectra of this frequency, which seems well suited to speech dereverberation as it mimics the inverse filtering process. FullSubNet combines SubNet with a fullband network to also exploit the fullband spectral pattern, as the enhanced speech should have a correct spectral pattern across all frequencies. FullSubNet used for speech dereverberation can be seen as a combination of speech spectral regression and subband inverse filtering. Experiments show that SubNet and FullSubNet are indeed able to achieve outstanding dereverberation performance.

More importantly, this work proposes a new learning target based on reverberation time shortening (RTS). DNN-based methods in the literature normally take the direct-path speech as the learning target, which is a very strict target as it requires removing all reverberation. As a result, they normally have a large prediction error, which may cause speech distortion. Since early reflections do not degrade speech quality, they are often preserved and only late reverberation is removed, such as in WPE [10, 11] and the spectral subtraction methods [14]. Preserving early reflections in the learning target would reduce the prediction error of the network. However, preserving only early reflections also reduces the naturalness of the sound, as sounds in real life never exhibit this form of reverberation. In addition, no matter which training target is used, the direct path or early reflections, the network needs to learn a sudden truncation of reverberation, which is not fully suitable for network training and will cause signal distortion. The proposed learning target is a shortened version of the original RIR, and has a small target T60, e.g. 0.15 s. Instead of suddenly truncating the RIR, this target still maintains the property of exponential decay, which maintains the naturalness of the sound and also eases network training. Experiments show that using the proposed learning target can more effectively suppress reverberation and signal distortion. In the context of channel equalization, the RIR reshaping method [26] shares a similar spirit with the proposed RTS target.

arXiv:2210.11089v6 [eess.AS] 6 Jun 2023

2. THE PROPOSED METHOD

Denote single-channel signals in the time domain as:

$$y(n) = s(n) * a(n) + e(n), \tag{1}$$

where $*$ stands for convolution and $n$ denotes the discrete time index. $y(n)$, $s(n)$, $a(n)$ and $e(n)$ are the reverberant speech, clean speech, RIR and ambient noise, respectively. This work mainly addresses dereverberation, but a certain amount of ambient noise is also considered and suppressed.

We can divide the RIR $a(n)$ into two parts, where $a_d(n)$ and $a_u(n) = a(n) - a_d(n)$ are the desired and undesired parts, respectively. The reverberant speech can be rewritten as:

$$s(n) * a(n) = s(n) * a_d(n) + s(n) * a_u(n) = x(n) + u(n). \tag{2}$$

This work aims to recover the desired signal $x(n)$ from the reverberant and noisy speech $y(n)$.
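The signal model of Eqs. (1) and (2) can be illustrated with a minimal NumPy sketch. Everything numeric here is an assumption made for illustration only: white noise stands in for speech, the RIR is a toy decaying noise sequence, the sampling rate is 16 kHz, and the split point between $a_d(n)$ and $a_u(n)$ is an arbitrary 50 ms.

```python
import numpy as np

rng = np.random.default_rng(0)

fs = 16000                      # assumed sampling rate in Hz
s = rng.standard_normal(fs)     # noise stand-in for 1 s of clean speech s(n)

# Toy RIR a(n): unit direct-path tap followed by a decaying noise tail (assumed shape).
n = np.arange(4000)
a = 0.1 * rng.standard_normal(n.size) * 10.0 ** (-n / 2000.0)
a[0] = 1.0

# Any split a = a_d + a_u is valid; here a_d keeps the first 50 ms (hypothetical choice).
split = int(0.05 * fs)
a_d = np.zeros_like(a)
a_d[:split] = a[:split]
a_u = a - a_d

e = 0.01 * rng.standard_normal(s.size + a.size - 1)   # ambient noise e(n)

y = np.convolve(s, a) + e       # Eq. (1): reverberant and noisy speech
x = np.convolve(s, a_d)         # desired signal x(n)
u = np.convolve(s, a_u)         # undesired signal u(n)

# Eq. (2): convolution is linear, so s * a equals x + u up to rounding error.
decomposition_error = np.max(np.abs(np.convolve(s, a) - (x + u)))
```

Because the decomposition holds for any choice of $a_d(n)$, the choice of learning target reduces to the choice of which part of the RIR to keep.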

Setting the learning target as the direct-path speech, optionally with some early reflections, amounts to applying a rectangular window $w_{\mathrm{rect}}(n)$ to the RIR to obtain the desired part of the RIR, i.e. $a_d(n) = w_{\mathrm{rect}}(n)\,a(n)$. The rectangular windows for the direct path and for 50 ms of early reflections are shown in Fig. 1 (a). The rectangular window suddenly truncates the RIR, as shown in Fig. 1 (b) and (c) for the targets of direct path and early reflections, respectively. This may make it hard for the neural network to learn a mapping function between the input and the output, and lead to a large prediction error and signal distortion.
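The two conventional targets can be sketched as rectangular windows applied to a toy RIR. This is only an illustration: the helper name `rect_window`, the sampling rate, the direct-path end index and the toy RIR are assumptions, and the 50 ms of early reflections is counted from the end of the direct path, which the text does not specify.

```python
import numpy as np

def rect_window(num_samples, n_end):
    """w_rect(n): 1 for n <= n_end and 0 afterwards."""
    return (np.arange(num_samples) <= n_end).astype(float)

fs = 16000                          # assumed sampling rate in Hz
n1 = 40                             # hypothetical index where the direct path ends
early_end = n1 + int(0.05 * fs)     # direct path plus 50 ms of early reflections

rng = np.random.default_rng(0)
n = np.arange(fs // 2)
a = rng.standard_normal(n.size) * 10.0 ** (-n / 4000.0)   # toy decaying RIR

a_direct = rect_window(a.size, n1) * a         # direct-path target
a_early = rect_window(a.size, early_end) * a   # early-reflection target
```

In both cases every sample after the window edge is exactly zero, which is the abrupt truncation discussed above.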

2.1. Learning Target: Reverberation Time Shortening

In this work, we propose a new learning target based on RTS, which is a shortened version of the original RIR, and has a small target $T_{60}$, e.g. 0.15 s. Instead of suddenly truncating the RIR, the RTS target still maintains the property of exponential decay, which maintains the naturalness of the sound and also eases network training. Formally, we define the new window function as:

$$w(n) = \begin{cases} 1, & n \le N_1 \\ 10^{-q(n-N_1)}, & n > N_1 \end{cases} \tag{3}$$

where $N_1$ denotes the discrete time index at which the direct path ends. The parameter $q$ controls the decay rate of the window. The original RIR is shortened by applying this window.

In Polack's statistical model [27], the reverberation component of the RIR can be realized by a Gaussian process with an exponentially decaying envelope. Based on this model, the RIR can be written in the form:

$$a(n) \approx b(n)\,10^{-p(n-N_1)} \quad \text{for } n > N_1, \tag{4}$$

where $b(n)$ denotes a zero-mean Gaussian noise sequence, and $p$ reflects the decay rate.
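A toy RIR following Eq. (4) can be synthesized as in the sketch below. The function name `polack_rir`, the unit direct-path tap placed at $N_1$, and the default values are assumptions for illustration. The decay rate is set to $p = 3 / (T_{60} f_s)$, the value for which the amplitude envelope $10^{-p(n-N_1)}$ has dropped by 60 dB in power after $T_{60}$ seconds.

```python
import numpy as np

def polack_rir(t60, fs, n1=0, length_s=1.0, seed=0):
    """Toy RIR: zero-mean Gaussian noise b(n) times the envelope 10^{-p (n - N1)}."""
    rng = np.random.default_rng(seed)
    n = np.arange(int(length_s * fs))
    p = 3.0 / (t60 * fs)                        # decay rate for a 60 dB power drop at T60
    envelope = 10.0 ** (-p * np.maximum(n - n1, 0))
    b = rng.standard_normal(n.size)             # zero-mean Gaussian sequence b(n)
    a = b * envelope
    a[:n1] = 0.0                                # nothing arrives before the direct path
    a[n1] = 1.0                                 # unit direct-path tap at N1 (assumption)
    return a, envelope

a, envelope = polack_rir(t60=0.7, fs=16000)
```

The returned `envelope` makes it easy to check the decay in dB at any sample index.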

Fig. 1. (a) Window functions. The desired part of the RIR for (b) direct path, (c) 50 ms of early reflections, (d) RTS.

Applying the window function to the original RIR, the target

$$\begin{aligned} a_d(n) &= w(n)\,a(n) \quad \forall n, \\ &\approx b(n)\,10^{-(p+q)(n-N_1)} \quad \text{for } n > N_1, \end{aligned} \tag{5}$$

is still exponentially decaying, with a new decay rate of $p+q$.

Based on the definition of $T_{60}$, namely the time for the power to decay by 60 dB, the original $T_{60}$ of $a(n)$ and the new $T_{60}$ of $a_d(n)$ (denoted as $T'_{60}$) are respectively

$$T_{60} = \frac{3}{p f_s}, \qquad T'_{60} = \frac{3}{(p+q) f_s}, \tag{6}$$

where $f_s$ denotes the sampling frequency. It is obvious that $T'_{60}$ is smaller than $T_{60}$ as long as $q$ is positive.
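The step from the decay rate to the reverberation time can be spelled out as follows, reading the exponential factor in the RIR model as the amplitude envelope, so that the power envelope is its square:

$$10 \log_{10} \left( 10^{-p(n-N_1)} \right)^{2} = -20\,p\,(n-N_1) = -60 \;\;\Rightarrow\;\; n - N_1 = \frac{3}{p}, \qquad T_{60} = \frac{3}{p f_s}.$$

The same argument with $p+q$ in place of $p$ gives $T'_{60}$. As a numerical illustration under an assumed sampling frequency of $f_s = 16$ kHz, a reverberation time of $T_{60} = 0.7$ s corresponds to $p = 3 / (0.7 \times 16000) \approx 2.68 \times 10^{-4}$ per sample.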

In practice, we set the learning target with a desired $T'_{60}$. Given the original and target $T_{60}$s, the window parameter $q$ is set to:

$$q = \frac{3}{T'_{60} f_s} - \frac{3}{T_{60} f_s}. \tag{7}$$

Fig. 1 (a) shows the window function for the case with $T_{60} = 0.7$ s and $T'_{60} = 0.15$ s, and Fig. 1 (d) shows the corresponding desired part of the RIR. Unlike the proposed RTS target, which sets a varying decay rate $q$ according to the original and target $T_{60}$s, an exponential window with a constant decay rate was proposed in [28].
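Putting the window definition and the choice of $q$ together, an RTS target RIR could be computed along the following lines. This is a sketch under stated assumptions rather than the authors' implementation: the function name `rts_window`, the sampling rate, the direct-path end index `n1` and the toy RIR are all hypothetical, and the original $T_{60}$ is assumed to be known.

```python
import numpy as np

def rts_window(num_samples, n1, t60, target_t60, fs):
    """Exponential window: 1 up to N1, then 10^{-q (n - N1)} with q set from both T60s."""
    q = 3.0 / (target_t60 * fs) - 3.0 / (t60 * fs)
    n = np.arange(num_samples)
    # Clipping (n - N1) at zero yields w(n) = 1 for all n <= N1.
    w = 10.0 ** (-q * np.maximum(n - n1, 0))
    return w, q

fs = 16000                      # assumed sampling rate in Hz
t60, target_t60 = 0.7, 0.15     # values used for Fig. 1
n1 = 40                         # hypothetical end of the direct path, in samples

# Toy RIR with decay rate p matching the original T60.
rng = np.random.default_rng(0)
n = np.arange(fs)
p = 3.0 / (t60 * fs)
a = rng.standard_normal(n.size) * 10.0 ** (-p * np.maximum(n - n1, 0))

w, q = rts_window(a.size, n1, t60, target_t60, fs)
a_d = w * a                             # RTS target RIR
new_t60 = 3.0 / ((p + q) * fs)          # reverberation time of the windowed RIR
```

The training target signal would then be the clean speech convolved with `a_d`. Note that `q` is positive only when the target $T_{60}$ is shorter than the original one; otherwise the window would lengthen the RIR rather than shorten it.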

2.2. Single-channel Dereverberation Neural Networks

First, we analyze how to model reverberation in the time-frequency domain, based on which a dereverberation neural network can be properly designed, selected and analyzed. Applying the STFT to Eq. (1), ignoring the additive noise term and using the narrow-band assumption, we have $Y(k, p) \approx S(k, p) A(k)$, where $p \in [1, P]$ and $k \in [0, K-1]$ denote the frame and frequency indices, $Y(k, p)$ and $S(k, p)$ are the STFT coefficients of $y(n)$ and $s(n)$, and $A(k)$ is the Fourier transform of $a(n)$. This assumption is only valid when the RIR is short relative to the STFT window, which is obviously not suitable for the dereverberation problem, where the RIR is normally very long.
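The limitation of the narrow-band assumption can be checked numerically with a small NumPy experiment. All settings here are illustrative assumptions: white noise stands in for speech, the STFT uses a 512-sample Hann window with a hop of 256 at an assumed 16 kHz, and the helper names are hypothetical. The transfer function $A(k)$ is obtained by sampling the DTFT of the RIR at the STFT frequencies, which is done by time-aliasing the RIR into one window length before the DFT.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Plain Hann-windowed STFT; returns an array of shape (frames, n_fft // 2 + 1)."""
    win = np.hanning(n_fft)
    starts = range(0, len(x) - n_fft + 1, hop)
    return np.array([np.fft.rfft(win * x[i:i + n_fft]) for i in starts])

def sampled_dtft(a, n_fft=512):
    """A(k): DTFT of a(n) at the STFT frequencies (fold a into n_fft samples, then DFT)."""
    pad = (-len(a)) % n_fft
    folded = np.pad(a, (0, pad)).reshape(-1, n_fft).sum(axis=0)
    return np.fft.rfft(folded)

def narrowband_error(s, a, n_fft=512, hop=256):
    """Relative error of the approximation Y(k, p) ~= S(k, p) A(k) for y = s * a."""
    y = np.convolve(s, a)[:len(s)]
    Y, S = stft(y, n_fft, hop), stft(s, n_fft, hop)
    approx = S * sampled_dtft(a, n_fft)[None, :]
    return np.linalg.norm(Y - approx) / np.linalg.norm(Y)

rng = np.random.default_rng(0)
fs = 16000                                   # assumed sampling rate in Hz
s = rng.standard_normal(fs)                  # noise stand-in for speech

a_short = 0.5 ** np.arange(8)                # 8 taps, far shorter than the window
n = np.arange(int(0.7 * fs))
a_long = rng.standard_normal(n.size) * 10.0 ** (-3.0 * n / (0.7 * fs))  # T60 = 0.7 s

err_short = narrowband_error(s, a_short)
err_long = narrowband_error(s, a_long)
```

Comparing `err_short` with `err_long` shows how the approximation degrades once the RIR spans many STFT frames, which is the regime relevant to dereverberation.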
