PARALLEL GATED NEURAL NETWORK WITH ATTENTION MECHANISM FOR SPEECH
ENHANCEMENT
Jianqiao Cui, Stefan Bleeck
Institute of Sound and Vibration Research, University of Southampton, UK
ABSTRACT
Deep learning algorithms are increasingly used for speech
enhancement (SE). In supervised methods, both global and local
information is required for accurate spectral mapping. A key
restriction is often the poor capture of contextual information.
To leverage long-term information of target speakers and to
compensate for distortions in the enhanced speech, this paper
adopts a sequence-to-sequence (S2S) mapping structure and
proposes a novel monaural speech enhancement system consisting
of a Feature Extraction Block (FEB), a Compensation
Enhancement Block (ComEB) and a Mask Block (MB). In the FEB, a
U-Net block extracts abstract features from complex-valued
spectra; one path suppresses the background noise in the
magnitude domain using masking methods, and the MB takes
magnitude features from the FEB and compensates for the lost
complex-domain details using features produced by the ComEB to
restore the final cleaned speech. Experiments are conducted on
the LibriSpeech dataset, and the results show that the proposed
model outperforms recent models in terms of ESTOI and PESQ
scores.
Index Terms—Supervised speech enhancement, global
and local speech information, sequence-to-sequence mapping,
complex domain compensation, magnitude domain mask
1. INTRODUCTION
Single-channel speech enhancement (SE) aims to restore
target speech corrupted by background noise. Additive noise
degrades the performance of speech recognition systems [1]
as well as of human listeners, especially the hearing impaired [2].
Analytical methods such as Wiener filtering [3] or
statistical model-based methods [4] have nowadays largely been
replaced with deep neural networks (DNNs), which have
demonstrated promising performance on single-channel speech
enhancement [5,6]. Most SE algorithms are based either on
mapping [7] or on masking [8]. Mapping-based methods mainly
use the spectral magnitude or complex-valued features as input
[7]. Successful masking-based methods estimate the ideal
binary mask (IBM) [9] or, more often, the ideal ratio mask
(IRM) [10]. In the former (mapping) approach, magnitude and
phase information are estimated individually in the complex
domain to restore the clean speech; in the latter (masking)
approach, the original noisy phase is used directly to
reconstruct the output. Often, the mean square error (MSE) or
the scale-invariant signal-to-distortion ratio (SI-SDR) [11]
is adopted as the loss function of the DNNs; however, the
resulting speech quality is hard to assess with these losses,
as they only weakly correlate with human ratings [12].
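To make the masking and loss concepts above concrete, the following toy sketch computes an ideal ratio mask from clean and noise spectra, applies it with the noisy phase (as masking-based methods do), and evaluates an SI-SDR score. The random "spectra", the waveform example, and the helper name `si_sdr` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))  # clean STFT (toy)
N = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))  # noise STFT (toy)
Y = S + N  # additive-noise mixture

# IRM in the magnitude domain (one common definition; variants exist)
irm = np.abs(S) / (np.abs(S) + np.abs(N) + 1e-8)

# Masking-based enhancement: scale the noisy magnitude, reuse the noisy phase
S_hat = irm * np.abs(Y) * np.exp(1j * np.angle(Y))

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR in dB between two time-domain signals."""
    ref = ref - ref.mean()
    est = est - est.mean()
    s_target = (est @ ref) / (ref @ ref + eps) * ref  # projection onto the reference
    e_noise = est - s_target
    return 10.0 * np.log10((s_target @ s_target) / (e_noise @ e_noise + eps))

x = rng.standard_normal(1600)           # toy clean waveform
noisy = x + 0.5 * rng.standard_normal(1600)
print(si_sdr(noisy, x))                  # a perfect estimate would score far higher
```

Because the mask only rescales magnitudes, the phase of `S_hat` is still the noisy phase, which is exactly the limitation the complex-domain compensation discussed below addresses.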
Recently, cascaded or multi-stage concepts have been
suggested for SE [13], because intermediate priors can aid
optimization by decomposing the original task into several
sub-tasks. However, each sub-model's performance is limited,
because each stage only incrementally improves the SNR. In
[13], a two-pipeline structure was suggested, in which a
coarse spectrum is first estimated and then compensated and
polished in a second stage. However, the performance of the
second stage depends heavily on the output of the first; in
such a cascade topology, the second-stage model must therefore
be tolerant enough to correct the errors of the previous
stage.
In this paper, we propose a parallel structure with two
modules for coarse and refined estimation, respectively. The
first module, the Compensation for Complex Domain Network
(CCDN), calculates masked features to compensate for complex
components from the second module. In this parallel-path
structure, one path is fed with the magnitude spectrum and
estimates a mask, while the second path outputs complex-domain
details. Because the mask path deals only with magnitude
information, some spectral details are inevitably lost. Li et
al. [14] showed that it is important to decouple magnitude and
phase optimization. We therefore introduce the compensation
path to remove distortion and to compensate for the lost
details. Additionally, our model uses a module that extracts
more abstract feature details for the subsequent estimation.
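The parallel-path idea can be sketched schematically as follows. The two "networks" here are toy stand-ins (simple closed-form functions, not the trained CCDN model); only the combination step, a magnitude mask applied with the noisy phase plus an additive complex-domain compensation term, reflects the structure described above.

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))  # noisy STFT (toy)

def mask_path(mag):
    # stand-in for the magnitude-domain masking network; output lies in [0, 1)
    return mag / (mag + 1.0)

def compensation_path(spec):
    # stand-in for the complex-domain compensation network
    return 0.1 * spec

mask = mask_path(np.abs(Y))
coarse = mask * np.abs(Y) * np.exp(1j * np.angle(Y))  # masked estimate, noisy phase
S_hat = coarse + compensation_path(Y)                 # compensation restores lost details
```

Because the two paths run in parallel rather than in cascade, the compensation path sees the original input rather than only the (possibly distorted) output of the mask path.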
The rest of this paper is organized as follows. Section 2
introduces the signal model. Section 3 introduces our
proposed model. In Section 4, we present the dataset and
experimental setup. The experimental results and
comparisons are shown in Section 5. Section 6 draws
conclusions.
2. SIGNAL MODEL FORMULATION