PARALLEL GATED NEURAL NETWORK WITH ATTENTION MECHANISM FOR SPEECH
ENHANCEMENT
Jianqiao Cui, Stefan Bleeck
Institute of Sound and Vibration Research, University of Southampton, UK
ABSTRACT
Deep learning algorithms are increasingly used for speech
enhancement (SE). In supervised methods, both global and local
information is required for accurate spectral mapping. A key
restriction is often the poor capture of contextual information.
To leverage long-term information of target speakers and to
compensate for distortions in the enhanced speech, this paper
adopts a sequence-to-sequence (S2S) mapping structure and
proposes a novel monaural speech enhancement system consisting
of a Feature Extraction Block (FEB), a Compensation
Enhancement Block (ComEB) and a Mask Block (MB). In the FEB, a
U-Net block extracts abstract features from complex-valued
spectra; one path suppresses the background noise in the
magnitude domain using masking methods, and the MB takes
magnitude features from the FEB and compensates for the lost
complex-domain details using features produced by the ComEB to
restore the final cleaned speech. Experiments are conducted on
the LibriSpeech dataset, and the results show that the proposed
model outperforms recent models in terms of ESTOI and PESQ
scores.
Index Terms—Supervised speech enhancement, global
and local speech information, sequence-to-sequence mapping,
complex domain compensation, magnitude domain mask
1. INTRODUCTION
Single-channel speech enhancement (SE) aims to restore
target speech corrupted by background noise. Additive noise
degrades the performance of speech recognition systems [1]
as well as of human listeners, especially the hearing impaired [2].
Analytical methods such as Wiener filtering [3] or
statistical model-based methods [4] have nowadays largely been
replaced with deep neural networks (DNNs), which have
demonstrated promising performance on single-channel speech
enhancement [5,6]. Most SE algorithms are based either on
mapping [7] or on masking [8]. Mapping-based methods mainly
use the spectral magnitude or complex-valued features as input
[7]. Successful masking-based methods estimate the ideal
binary mask (IBM) [9] or, more often, the ideal ratio mask
(IRM) [10]. In the former (mapping) approach, magnitude and
phase information are estimated individually in the complex
domain to restore the clean speech; in the latter (masking)
approach, the original noisy phase is used directly to
reconstruct the output. Often, the mean square error (MSE) or
the scale-invariant signal-to-distortion ratio (SI-SDR) [11]
is adopted as the loss function of the DNNs; however, the
resulting speech quality is hard to assess with these losses,
as they only weakly correlate with human ratings [12].
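To make the masking and loss concepts above concrete, the following toy sketch computes an ideal ratio mask from clean and noise spectra, applies it with the noisy phase (as masking-based methods do), and evaluates an SI-SDR score. The random "spectra", the waveform example, and the helper name `si_sdr` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))  # clean STFT (toy)
N = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))  # noise STFT (toy)
Y = S + N  # additive-noise mixture

# IRM in the magnitude domain (one common definition; variants exist)
irm = np.abs(S) / (np.abs(S) + np.abs(N) + 1e-8)

# Masking-based enhancement: scale the noisy magnitude, reuse the noisy phase
S_hat = irm * np.abs(Y) * np.exp(1j * np.angle(Y))

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR in dB between two time-domain signals."""
    ref = ref - ref.mean()
    est = est - est.mean()
    s_target = (est @ ref) / (ref @ ref + eps) * ref  # projection onto the reference
    e_noise = est - s_target
    return 10.0 * np.log10((s_target @ s_target) / (e_noise @ e_noise + eps))

x = rng.standard_normal(1600)           # toy clean waveform
noisy = x + 0.5 * rng.standard_normal(1600)
print(si_sdr(noisy, x))                  # a perfect estimate would score far higher
```

Because the mask only rescales magnitudes, the phase of `S_hat` is still the noisy phase, which is exactly the limitation the complex-domain compensation discussed below addresses.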
Recently, cascaded or multi-stage concepts have been
suggested for SE [13], because intermediate priors can aid
optimization by decomposing the original task into several
sub-tasks. However, each sub-model's performance is limited,
because each stage only incrementally improves the SNR. In
[13], a two-pipeline structure was suggested, in which a
coarse spectrum is first estimated and then compensated and
polished in a second stage. However, the performance of the
second stage depends heavily on the output of the first; in
such a cascade topology, the second-stage model must therefore
be tolerant enough to correct the errors of the previous
stage.
In this paper, we propose a parallel structure with two
modules for coarse and refined estimation, respectively. The
first module, the Compensation for Complex Domain Network
(CCDN), calculates masked features to compensate for complex
components from the second module. In this parallel-path
structure, one path is fed with the magnitude spectrum and
estimates a mask, while the second path outputs complex-domain
details. Because the mask path deals only with magnitude
information, some spectral details are inevitably lost. Li et
al. [14] showed that it is important to decouple magnitude and
phase optimization. We therefore introduce the compensation
path to remove distortion and to compensate for the lost
details. Additionally, our model uses a module that extracts
more abstract feature details for the subsequent estimation.
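The parallel-path idea can be sketched schematically as follows. The two "networks" here are toy stand-ins (simple closed-form functions, not the trained CCDN model); only the combination step, a magnitude mask applied with the noisy phase plus an additive complex-domain compensation term, reflects the structure described above.

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))  # noisy STFT (toy)

def mask_path(mag):
    # stand-in for the magnitude-domain masking network; output lies in [0, 1)
    return mag / (mag + 1.0)

def compensation_path(spec):
    # stand-in for the complex-domain compensation network
    return 0.1 * spec

mask = mask_path(np.abs(Y))
coarse = mask * np.abs(Y) * np.exp(1j * np.angle(Y))  # masked estimate, noisy phase
S_hat = coarse + compensation_path(Y)                 # compensation restores lost details
```

Because the two paths run in parallel rather than in cascade, the compensation path sees the original input rather than only the (possibly distorted) output of the mask path.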
The rest of this paper is organized as follows. Section 2
introduces the signal model. Section 3 introduces our
proposed model. In Section 4, we present the dataset and
experimental setup. The experimental results and
comparisons are shown in Section 5. Section 6 draws
conclusions.
2. SIGNAL MODEL FORMULATION