PARALLEL GATED NEURAL NETWORK WITH ATTENTION MECHANISM FOR SPEECH
ENHANCEMENT
Jianqiao Cui, Stefan Bleeck
Institute of Sound and Vibration Research, University of Southampton, UK
ABSTRACT
Deep learning algorithms are increasingly used for speech
enhancement (SE). In supervised methods, both global and local
information is required for accurate spectral mapping. A key
restriction is often the poor capture of contextual information.
To leverage long-term contexts of target speakers and to
compensate distortions of cleaned speech, this paper adopts a
sequence-to-sequence (S2S) mapping structure and proposes a
novel monaural speech enhancement system consisting of a
Feature Extraction Block (FEB), a Compensation Enhancement
Block (ComEB) and a Mask Block (MB). In the FEB, a U-net block
extracts abstract features from the complex-valued spectrum;
one path suppresses the background noise in the magnitude
domain using masking methods, and the MB takes magnitude
features from the FEB and compensates them with the
complex-domain features produced by the ComEB to restore the
final cleaned speech. Experiments are conducted on the
Librispeech dataset, and the results show that the proposed
model outperforms recent models in terms of ESTOI and PESQ
scores.
Index Terms— Supervised speech enhancement, global
and local speech information, sequence-to-sequence mapping,
complex-domain compensation, magnitude-domain mask
1. INTRODUCTION
Single-channel speech enhancement (SE) aims to restore
target speech corrupted by background noise. Additive noise
degrades the performance of speech recognition systems [1]
as well as of human listeners, especially the hearing impaired [2].
Nowadays, analytical methods such as Wiener filtering [3] or
statistical model-based methods [4] have been replaced by
deep neural networks (DNNs), which have demonstrated
promising performance on single-channel speech enhancement
[5, 6]. Most SE algorithms are based either on mapping [7] or
on masking [8]. Mapping-based methods mainly use the spectral
magnitude or complex-valued features as input [7]. Successful
masking-based methods are the ideal binary mask (IBM) [9] or,
more often, the ideal ratio mask (IRM) [10]. For the former, the
magnitude and phase information are used individually in the
complex domain and estimated to restore the clean speech.
For the latter, the original phase information is directly used
to reconstruct the output. Often the mean square error (MSE)
or the scale-invariant signal-to-distortion ratio (SI-SDR) [11]
is adopted as the loss function of the DNN; however, speech
quality is then hard to estimate, as these losses only weakly
correlate with human ratings [12].
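To make the SI-SDR objective concrete, here is a minimal NumPy sketch (the function and signal names are our own, not taken from any particular SE toolkit):

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB."""
    # Project the estimate onto the target to find the optimal scaling,
    # which is what makes the measure invariant to the estimate's gain.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = alpha * target       # target component of the estimate
    e_noise = estimate - s_target   # residual error component
    return 10.0 * np.log10(np.dot(s_target, s_target) /
                           (np.dot(e_noise, e_noise) + eps))

t = np.sin(np.linspace(0.0, 100.0, 16000))   # stand-in for clean speech
rng = np.random.default_rng(0)
noisy_est = t + 0.1 * rng.standard_normal(t.shape)
print(si_sdr(noisy_est, t))   # moderate score for a noisy estimate
print(si_sdr(0.5 * t, t))     # very high: pure rescaling is not penalized
```

In training, the negative of this quantity would typically be minimized as the loss.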
Recently, cascaded or multi-stage concepts have been
suggested for SE [13], because intermediate priors can
boost the optimization by decomposing the original task into
several sub-tasks. However, each sub-model’s performance
is restricted, because each stage only incrementally improves
the SNR. In [13], a two-pipeline structure was suggested,
which first produces a coarse spectrum and then compensates
and polishes it. However, the performance of the second part
depends heavily on the previous output, and therefore, in such
a cascade topology, the second-stage model must be tolerant
enough to correct the previous stage’s errors.
In this paper, we propose a parallel structure with two modules
for coarse and refined estimation, respectively. The first
module (Compensation for Complex Domain Network
(CCDN)) calculates masked features to compensate for complex
components from the second module. In this parallel-path
structure, one path is fed with the magnitude spectrum and
estimates a mask, while the second path outputs complex-domain
details. Because the mask path deals only with magnitude
information, some spectral details are lost. Li et al. [14]
showed that it is important to decouple magnitude and phase
optimization. We therefore introduce the compensation path to
remove distortion and to compensate for the lost details.
Additionally, our model uses a module that extracts more
abstract feature details for the subsequent estimation.
The rest of this paper is organized as follows. Section 2
introduces the signal model. Section 3 introduces our
proposed model. In Section 4, we present the dataset and
experimental setup. The experimental results and
comparisons are shown in Section 5. Section 6 draws
conclusions.
2. SIGNAL MODEL FORMULATION
Single-channel speech enhancement aims to remove the
background noise v(n) from the single-channel noisy speech
y(n); the corresponding original clean speech is denoted s(n):

y(n) = s(n) + v(n),    (1)

where n represents the time sample index. We use the
short-time Fourier transform (STFT) to convert the
time-domain speech signals into the time-frequency (TF)
domain, that is:

Y(t, f) = S(t, f) + V(t, f),    (2)

where Y, S and V are the STFTs of y, s and v, respectively,
t is the corresponding time frame index and f is the
frequency bin. Eq. 2 can also be written as

(Y_r + jY_i)(t, f) = (S_r + jS_i)(t, f) + (V_r + jV_i)(t, f),    (3)

where subscripts r and i respectively represent the real and
imaginary parts of the complex-valued features. In the rest
of the paper, the indices (t, f) will be dropped.
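Because the STFT is linear, the additive time-domain model carries over bin by bin to the TF domain; a small NumPy/SciPy sketch (the test signals are our own choices) checks Eqs. 1–3 numerically:

```python
import numpy as np
from scipy.signal import stft

fs = 16000
n = np.arange(fs) / fs
s = np.sin(2 * np.pi * 440 * n)        # stand-in for clean speech s(n)
rng = np.random.default_rng(0)
v = 0.1 * rng.standard_normal(fs)      # additive noise v(n)
y = s + v                              # Eq. 1: y(n) = s(n) + v(n)

_, _, Y = stft(y, fs=fs, nperseg=512)
_, _, S = stft(s, fs=fs, nperseg=512)
_, _, V = stft(v, fs=fs, nperseg=512)

# Eq. 2: Y(t, f) = S(t, f) + V(t, f) holds in every TF bin,
# and Eq. 3 follows by splitting into real and imaginary parts.
print(np.allclose(Y, S + V))                 # True
print(np.allclose(Y.real, S.real + V.real))  # True
```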
3. METHODOLOGY
[Figure: the noisy real and imaginary spectra x_r and x_i enter the Feature Extraction Module (a stack of U-blocks); one branch forms the magnitude and passes through the Mask Block, whose mask multiplies the spectrum, while the Complex-valued Enhancement Module with its Compensation Block supplies complex-domain details; after concatenation and multiplication, an iSTFT produces the enhanced outputs <S_r> and <S_i>.]
Fig. 1. Architecture of the proposed Compensation for Complex Domain Network.
The overall diagram of our proposed model is shown in Fig. 1.
It consists of four parts: the Feature Extraction Block (FEB),
the Mask Block (MB), the Complex-valued Enhancement Block
(ComEB) and the Compensation Block (CB). The model input is
the noisy complex spectrum, denoted X = Cat(X_r, X_i) ∈ R^(2×T×F),
and the corresponding target output is S = Cat(S_r, S_i) ∈ R^(2×T×F),
where Cat(·) represents the concatenation operation, and T and F
denote the number of time frames and frequency bins, respectively.
Subscripts r and i denote the real and imaginary parts.
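As a sanity check on the shapes, the input Cat(X_r, X_i) can be assembled from a complex spectrogram as follows (a hedged sketch; the random spectrogram and the channel-first axis order are our assumptions, not the paper's code):

```python
import numpy as np

T, F = 100, 257    # number of time frames and frequency bins
rng = np.random.default_rng(0)
X_complex = rng.standard_normal((T, F)) + 1j * rng.standard_normal((T, F))

# Cat(X_r, X_i): stack real and imaginary parts along a channel axis.
X = np.stack([X_complex.real, X_complex.imag], axis=0)
print(X.shape)   # (2, 100, 257), i.e. an element of R^(2 x T x F)

# The magnitude consumed by the mask path is recoverable from the channels.
magnitude = np.sqrt(X[0] ** 2 + X[1] ** 2)
```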
3.1 Feature Extraction Block
U-nets have been shown to be successful in acoustic feature
extraction [15]; however, consecutive up- and down-sampling
causes a loss of spectral information. For instance, the
power spectral density of the harmonic structure gradually
attenuates from low- to high-frequency regions. Moreover, the
correlation between adjacent frames is important, so it is
desirable to capture both the local and the global information
of each speech sample. U2-Net [16], proposed in 2020, employs
sub-U-nets as embedding layers with residual learning so as to
learn multi-scale features more effectively. Motivated by this,
in this paper we replace the traditional 2-D convolutional layer
with our proposed U-block module, which uses an LSTM as the
middle layer to mitigate the information loss, as shown in
Fig. 2. Fig. 3 shows the details of the FEB: it comprises a
Gated Linear Unit (GLU), Layer Normalization (LN), an ELU
activation function and a U-block with a residual connection.
This structure has two advantages: first, the U-block can grasp
multi-scale information between frames, which gives it a better
ability to capture contextual features; second, the 2-D GLU can
filter out disturbing information while keeping useful details.
The processing of the FEB is given by:

F_out = F_mid + U(F_mid),  with  F_mid = ELU(LN(GLU(F_in))),    (4)

where U(·) denotes the U-block.
[Figure: the U-block consists of an encoder path (Conv2D + batch normalization + ELU), an LSTM middle layer, and a decoder path (DeConv2D + batch normalization + ELU), with concatenation skip connections from encoder to decoder between input and output.]
Fig. 2. Architecture of the proposed U-block.
[Figure: the FEB input passes through a Gated Linear Unit (two parallel 1-D convolutions, one followed by a sigmoid and one linear, multiplied elementwise), then Layer Normalization, an ELU activation and a U-block, with a residual add connecting to the output.]
Fig. 3. Architecture of the proposed FEB module.
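The GLU inside the FEB gates a linear branch with a sigmoid branch, which is what lets it pass useful details while suppressing disturbing ones. A minimal NumPy sketch (dense matrices stand in for the 1-D convolutions of Fig. 3, and all names are our own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu(x, w_lin, w_gate):
    """Gated Linear Unit: elementwise product of a linear branch and a
    sigmoid gate in [0, 1], so the gate can only attenuate features."""
    return (x @ w_lin) * sigmoid(x @ w_gate)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))        # (time frames, features)
w_lin = rng.standard_normal((8, 8))
w_gate = rng.standard_normal((8, 8))
out = glu(x, w_lin, w_gate)
print(out.shape)   # (4, 8): same frame count, gated features
```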
3.2 Mask Block (MB)
In the MB, the output is a mask that suppresses the noise in
the magnitude domain, contributing coarse features obtained by
filtering the overall spectrum with the mask. Fig. 4 shows the
structure of the Mask Block (MB), which consists of an encoder,
a decoder and stacked Gated Residual Units (GRUs), as shown
in Fig. 3. We use five sub-layers in the encoder and in the
decoder, respectively. Each encoder sub-layer contains a 1-D
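In isolation, the magnitude-domain masking step of the MB amounts to attenuating each TF bin and reusing the noisy phase; a hedged NumPy sketch with a made-up mask and shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
T, F = 50, 129
noisy = rng.standard_normal((T, F)) + 1j * rng.standard_normal((T, F))

# A mask predicted by the MB would lie in [0, 1]; here we fake one.
mask = rng.uniform(0.0, 1.0, size=(T, F))

noisy_mag = np.abs(noisy)
noisy_phase = np.angle(noisy)

# Coarse estimate: masked magnitude recombined with the unmodified noisy
# phase -- this is exactly why the complex compensation path is needed.
coarse = (mask * noisy_mag) * np.exp(1j * noisy_phase)
print(coarse.shape)                          # (50, 129)
print(np.all(np.abs(coarse) <= noisy_mag))   # True: the mask only attenuates
```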