audio denoising into an image segmentation problem. By
removing the noise regions in the audio image, we achieve
audio denoising.
•
We develop an audio ImageMask tool to label the collected
dataset and apply a few-shot generalization strategy to
accelerate the data labeling process. We also demonstrate
that our model can be easily extended to speech denoising,
audio separation, audio enhancement, and noise estimation.
2. Related Work
Audio denoising has been widely explored, and methods have
evolved from traditional approaches that estimate the
statistical differences between noise and clean audio [37]
to deep learning methods [3].
Traditional methods for audio denoising date back to the
1970s. Boll [5] proposed a noise suppression algorithm for
spectral subtraction using the spectral noise bias calculated
in a non-speech environment. Another statistical method,
proposed in [28], is a more comprehensive algorithm that
combines the concept of the a priori signal-to-noise ratio
(SNR) with earlier typical audio enhancement schemes such as
Wiener filtering [6, 18], spectral subtraction, or maximum
likelihood estimation. Among frequency-domain algorithms,
minimum mean square error (MMSE)-based approaches are a
mainstream alternative to Wiener filtering. Hansen et
al. [12] proposed an auditory masking threshold enhancement
method by applying a generalized MMSE estimator in an
auditory enhancement scheme. In [19], the MMSE estimator is
used to enhance the performance of short-time spectral
coefficients by estimating the discrete Fourier transform
(DFT) coefficients of both noisy and clean speech.
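To make the classical pipeline concrete, the following is a minimal sketch of magnitude spectral subtraction in the spirit of Boll [5]; the 16 kHz sampling rate, the frame length, and the assumption that the first half second of the recording is noise-only are illustrative choices, not parameters from the cited work.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(x, fs=16000, noise_dur=0.5, nperseg=512):
    """Minimal magnitude spectral subtraction (after Boll [5]).

    Assumes the first `noise_dur` seconds of `x` are noise-only;
    the spectral noise bias is estimated from those frames.
    """
    f, t, X = stft(x, fs=fs, nperseg=nperseg)
    mag, phase = np.abs(X), np.angle(X)

    # Average noise magnitude over the noise-only frames (hop = nperseg // 2).
    n_noise = int(noise_dur * fs / (nperseg // 2))
    noise_mag = mag[:, :n_noise].mean(axis=1, keepdims=True)

    # Subtract the bias; floor negative values (half-wave rectification).
    clean_mag = np.maximum(mag - noise_mag, 0.0)

    _, y = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return y
```

The noisy phase is reused for reconstruction, a simplification shared by most of the classical schemes discussed above.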
One major problem is that the performance of traditional
methods for noise separation and reduction degrades in the
presence of natural noise, which differs substantially from
the artificial noise applied in experiments.
Wavelet transform methods were developed to overcome the
difficulty of studying signals with low SNR and are reported
to perform better than filtering methods. Zhao et al. [41]
used an improved threshold denoising method that overcomes
the discontinuity of hard-threshold denoising and reduces the
permanent bias of soft-threshold denoising. Srivastava et
al. [30] developed a wavelet denoising approach based on
wavelet shrinkage, allowing for the analysis of low-SNR
signals. Pouyani et al. [22] proposed an adaptive method
based on the discrete wavelet transform and an artificial
neural network to filter lung sound signals in a noisy
environment. Kui et al. [15] also combined the wavelet
algorithm with CNNs to classify log mel-frequency spectral
coefficient features from heart sound signals with higher
accuracy. These combined methods outperformed single wavelet
transform methods.
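As an illustration of the thresholding idea behind [30, 41], a minimal soft-threshold wavelet shrinkage sketch follows; the db4 wavelet, the decomposition level, and the universal threshold sigma * sqrt(2 log N) are common textbook choices assumed here, not the exact settings of the cited methods.

```python
import numpy as np
import pywt

def wavelet_denoise(x, wavelet="db4", level=4):
    """Soft-threshold wavelet shrinkage (illustrative settings)."""
    coeffs = pywt.wavedec(x, wavelet, level=level)
    # Robust noise estimate from the finest detail band (median / 0.6745).
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thresh = sigma * np.sqrt(2 * np.log(len(x)))  # universal threshold
    # Keep the approximation band; soft-threshold every detail band.
    denoised = [coeffs[0]] + [pywt.threshold(c, thresh, mode="soft")
                              for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)[: len(x)]
```

Soft thresholding shrinks the surviving coefficients toward zero, which avoids the discontinuity of hard thresholding at the cost of the bias that [41] seeks to reduce.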
Deep learning methods were later introduced to the audio
denoising field, compensating for the disadvantages of
traditional methods and demonstrating a stronger ability to
learn data characteristics from a few samples [37]. Deep
neural network (DNN)-based audio enhancement algorithms have
shown great potential in their ability to capture data
features with complicated nonlinear functions [16]. Xu et
al. [38] introduced a deep learning model for automatic
speech denoising that detects silent intervals to better
capture the time-varying noise pattern. Saleem et al. [27]
used a deep learning-based approach for audio enhancement in
the presence of complex noises. An ideal binary mask (IBM) is
used during training and testing, and the trained DNNs are
used to estimate the IBM (see the sketch after this
paragraph). Xu et al. [39] proposed a DNN-based supervised
method to enhance audio by finding a mapping function between
noisy and clean audio samples. A large set of noisy mixtures
is used during training, together with other techniques
including global variance equalization, dropout, and
noise-aware training strategies. Saleem et al. [26] also
developed a supervised DNN-based single-channel audio
enhancement algorithm and applied less aggressive Wiener
filtering as an additional DNN layer. Vuong et al. [34]
described a modulation-domain loss function for a deep
learning-based audio enhancement approach, applying
additional learnable spectro-temporal receptive fields to
improve the objective prediction of audio quality and
intelligibility.
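For reference, the ideal binary mask used as a training target in [27] can be computed from paired clean and noise signals as sketched below; the 0 dB local criterion and the STFT settings are common assumptions, not necessarily those of the cited work.

```python
import numpy as np
from scipy.signal import stft

def ideal_binary_mask(clean, noise, fs=16000, nperseg=512, lc_db=0.0):
    """IBM: 1 where the local SNR exceeds the criterion `lc_db`, else 0."""
    _, _, S = stft(clean, fs=fs, nperseg=nperseg)
    _, _, N = stft(noise, fs=fs, nperseg=nperseg)
    # Per time-frequency bin SNR in dB (epsilons guard against log(0)).
    snr_db = 20.0 * np.log10((np.abs(S) + 1e-10) / (np.abs(N) + 1e-10))
    return (snr_db > lc_db).astype(np.float32)
```

At test time the DNN predicts this mask from noisy features alone and applies it to the noisy spectrogram.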
Yet one problem in DNN-based speech denoising is that models
sometimes find it difficult to track a target speaker among
multiple training speakers, meaning that DNNs do not easily
handle long-term contexts [33, 16].
Therefore, deep learning approaches such as convolutional
neural network (CNN)-based and recurrent neural network
(RNN)-based models have been explored. Alamdari et al. [2]
applied a fully convolutional neural network (FCN) for audio
denoising using only noisy samples, and the study
demonstrated the superiority of the new model over
traditional supervised approaches. Germain et al. [10]
trained an FCN using a deep feature loss derived from a
network trained for acoustic environment detection and
domestic audio tagging. The research showed that this
approach is particularly useful for audio with the most
intrusive background noise. Kong et al. [14] proposed an
audio enhancement method with pre-trained audio neural
networks (PANNs), which learn from weakly labeled data
containing only clip-level audio tags, and applied a
convolutional U-Net to predict the waveforms of individual
anchor segments selected by the PANNs. Raj et al. [24]
proposed a multilayered CNN-based auto-CODEC for audio signal
denoising that uses mel-frequency cepstral coefficients,
providing good encoding and high security. Abouzid et al. [1]
combined convolutional and denoising autoencoders into
convolutional denoising autoencoders for the suppression of
noise and compression of audio data.
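A minimal PyTorch sketch of the convolutional denoising autoencoder idea in [1] is shown below; the layer widths, kernel sizes, and spectrogram input format are illustrative assumptions rather than the cited architecture.

```python
import torch
import torch.nn as nn

class ConvDenoisingAE(nn.Module):
    """Toy convolutional denoising autoencoder (illustrative sizes).

    Input: (batch, 1, freq, time) magnitude spectrogram patches,
    with freq and time divisible by 4.
    """
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1,
                               output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1,
                               output_padding=1), nn.ReLU(),
        )

    def forward(self, noisy):
        # Compress to a bottleneck, then reconstruct a clean estimate.
        return self.decoder(self.encoder(noisy))

# Training minimizes reconstruction error on (noisy, clean) pairs, e.g.
# loss = nn.functional.mse_loss(model(noisy_batch), clean_batch)
```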
On top of single-type deep learning methods, Tan et