BirdSoundsDenoising: Deep Visual Audio Denoising for Bird Sounds
Youshan Zhang
Yeshiva University, NYC, NY
youshan.zhang@yu.edu
Jialu Li
Cornell University, Ithaca, NY
jl4284@cornell.edu
Figure 1: The overall process of our proposed deep visual audio denoising model (DVAD).
Abstract
Audio denoising has been explored for decades using
both traditional and deep learning-based methods. However,
these methods are still limited to either manually added ar-
tificial noise or lower denoised audio quality. To overcome
these challenges, we collect a large-scale natural noise bird
sound dataset. We are the first to transfer the audio de-
noising problem into an image segmentation problem and
propose a deep visual audio denoising (DVAD) model. With
a total of 14,120 audio images, we develop an audio Im-
ageMask tool and propose to use a few-shot generalization
strategy to label these images. Extensive experimental re-
sults demonstrate that the proposed model achieves state-
of-the-art performance. We also show that our method can
be easily generalized to speech denoising, audio separation,
audio enhancement, and noise estimation.
1. Introduction
With the development of technology, audio signals have been increasingly used as main sources of information transmission [17], such as teleconferences [14], the speech-to-text function in social media [32], lung [22] and heart [15] sounds for disease diagnosis, instrument solo identification [11], and hearing aids [37, 27, 33]. Therefore, it
is important to maintain the quality of signal transmission
and retain as much useful information as possible. However, due to noises in the actual environment, the transmission of audio signals, including speech and other signals that we intend to collect, is inevitably affected, resulting in poor quality and intelligibility. Audio denoising can significantly increase audio quality and contribute to better information transmission.
Audio denoising has been a popular research area in recent years, and different methods have been applied to reduce noise and separate audio, including traditional statistical approaches [5, 28, 12, 19] and deep learning approaches [27, 26, 2, 24, 16]. However, several difficulties are encountered across these models. In this paper, we specifically use samples from the natural environment, which present more challenges to the proposed research models.
Why is natural audio denoising difficult?
Firstly, the most common difficulty encountered is the limited sources for training. Deep learning-based models require both clean and noisy audio samples for training. However, in reality, audio signals come with noises that cannot be separated to produce the desired training samples [14].
Secondly, most noisy audio samples used for model training are artificially compiled, such as white Gaussian noise (WGN) [30, 39], which is composed differently from natural noise. In addition, clean signal patterns can still be observed in artificially noised audio, while they are difficult to observe in audio with real noise, as shown in Fig. 1 (leftmost and rightmost signals). Therefore, models trained on artificial noise might not perform as well in real settings as they do in experiments.
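The artificial WGN corruption used in prior work can be sketched as follows. This is a generic illustration rather than the exact protocol of any cited paper; the target SNR is a free parameter:

```python
import numpy as np

def add_wgn(clean, snr_db, rng=None):
    """Corrupt a clean signal with white Gaussian noise at a target SNR (dB)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    noise = rng.standard_normal(clean.shape)
    # scale the noise so that 10*log10(P_signal / P_noise) equals snr_db exactly
    scale = np.sqrt(np.mean(clean**2) / (np.mean(noise**2) * 10 ** (snr_db / 10)))
    return clean + scale * noise
```

Natural recordings offer no such clean reference to start from, which is exactly why paired training data is scarce.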
These two challenges are commonly encountered in the
audio denoising field, and we address them using a deep
visual audio denoising model (DVAD). In this paper, we
first collect audio samples that are directly acquired from the
natural environment. The proposed model can process more
complex and natural noises compared to previous models.
We offer three principal contributions:
• We present a benchmark bird sounds denoising dataset with the goal of advancing the state of the art in audio denoising under natural noise backgrounds.
• To the best of our knowledge, we are the first to transfer audio denoising into an image segmentation problem. By removing the noise area in the audio image, we can realize the purpose of audio denoising.
• We develop an audio ImageMask tool to label the collected dataset and apply a few-shot generalization strategy to accelerate the data labeling process. We also demonstrate that our model can be easily extended to speech denoising, audio separation, audio enhancement, and noise estimation.
arXiv:2210.10196v1 [cs.SD] 18 Oct 2022
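The segmentation-as-denoising idea can be sketched end to end: the waveform becomes a time-frequency image via the STFT, a segmentation model predicts a mask over that image, and masking plus inverse STFT returns the denoised waveform. The mask function below is a hypothetical stand-in for the trained segmentation network (a simple energy threshold), not the DVAD model itself:

```python
import numpy as np
from scipy.signal import stft, istft

def mask_denoise(audio, mask_fn, fs=32000, nperseg=512):
    """Audio -> spectrogram image -> segmentation mask -> inverse STFT."""
    _, _, Z = stft(audio, fs=fs, nperseg=nperseg)
    mask = mask_fn(np.abs(Z))            # stand-in for the segmentation model
    _, denoised = istft(Z * mask, fs=fs, nperseg=nperseg)
    return denoised

# toy check: a 1 kHz tone buried in broadband noise
rng = np.random.default_rng(0)
t = np.arange(32000) / 32000
noisy = np.sin(2 * np.pi * 1000 * t) + 0.3 * rng.standard_normal(t.size)
energy_mask = lambda mag: (mag > 0.5 * mag.max()).astype(float)
clean_est = mask_denoise(noisy, energy_mask)
```

Any segmentation backbone can be swapped in as `mask_fn`, which is what makes the image-segmentation framing attractive.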
2. Related Work
Audio denoising has been widely explored, and many methods have evolved, from traditional approaches that estimate the difference between noise and clean audio statistics [37] to deep learning methods [3].
Traditional methods for audio denoising date back to the 1970s. Boll [5] proposed a noise suppression algorithm for spectral subtraction using the spectral noise bias calculated in a non-speech environment. Another statistical method, proposed in [28], is a more comprehensive algorithm, combining the concept of the a priori signal-to-noise ratio (SNR) with earlier typical audio enhancement schemes such as Wiener filtering [6, 18], spectral subtraction, or maximum likelihood estimates. Among frequency-domain algorithms, minimum mean square error (MMSE)-based approaches are a mainstream alternative to Wiener filtering. Hansen et al. [12] proposed an auditory masking threshold enhancement method by applying a generalized MMSE estimator in an auditory enhancement scheme. In [19], the MMSE estimator is used to enhance the performance of short-time spectral coefficients by estimating the discrete Fourier transform (DFT) coefficients of both noisy and clean speech.
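Under the assumption that the noise magnitude spectrum can be estimated from noise-only frames, the two classical schemes above reduce to a few lines per frequency bin. This is a minimal sketch, not a faithful reimplementation of any cited algorithm:

```python
import numpy as np

def spectral_subtract(frame, noise_mag, floor=0.02):
    """Boll-style magnitude spectral subtraction on a single frame."""
    spec = np.fft.rfft(frame)
    # subtract the noise bias; a spectral floor avoids negative magnitudes
    mag = np.maximum(np.abs(spec) - noise_mag, floor * noise_mag)
    return np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=len(frame))

def wiener_gain(xi):
    """Per-bin Wiener gain from the a priori SNR xi (power ratio): G = xi / (1 + xi)."""
    return xi / (1.0 + xi)
```

`wiener_gain(1.0)` is 0.5: a bin where signal and noise power are equal is attenuated by half. How the a priori SNR `xi` is estimated (e.g. the decision-directed rule) is where methods in the [28] lineage mainly differ.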
One major problem is that the performance of traditional methods for noise separation and reduction degrades in the presence of natural noises, which differ largely from the artificial noises applied in experiments.
Wavelet transformation methods were developed to overcome the difficulty of studying signals with low SNR and are reported to perform better than filtering methods. Zhao et al. [41] used an improved threshold denoising method, which overcame the discontinuity in hard-threshold denoising and reduced the permanent bias in soft-threshold denoising. Srivastava et al. [30] developed a wavelet denoising approach based on wavelet shrinkage, allowing for the analysis of low-SNR signals. Pouyani et al. [22] proposed an adaptive method based on the discrete wavelet transform and an artificial neural network to filter lung sound signals in a noisy environment. Kui et al. [15] also combined the wavelet algorithm with CNNs to classify log mel-frequency spectral coefficient features from heart sound signals with higher accuracy. These combined methods outperformed single wavelet transformation methods.
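The hard/soft thresholding trade-off mentioned above (discontinuity versus permanent bias) is visible directly in the two rules, shown here on raw coefficients; a full denoiser would apply them to wavelet-transform coefficients (e.g. via PyWavelets), which is omitted in this sketch:

```python
import numpy as np

def hard_threshold(c, thr):
    """Hard rule: keep coefficients above the threshold unchanged; discontinuous at +/-thr."""
    return np.where(np.abs(c) > thr, c, 0.0)

def soft_threshold(c, thr):
    """Soft rule: shrink every surviving coefficient by thr; continuous but biased."""
    return np.sign(c) * np.maximum(np.abs(c) - thr, 0.0)

c = np.array([-3.0, -0.5, 0.2, 2.0])
hard_threshold(c, 1.0)  # -> [-3., 0., 0., 2.]
soft_threshold(c, 1.0)  # -> [-2., 0., 0., 1.]
```

The improved thresholds in [41] interpolate between these two behaviors to avoid both defects at once.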
Deep learning methods were later introduced to the audio denoising field, complementing the disadvantages of traditional methods and demonstrating a stronger ability to learn data characteristics from a few samples [37]. Deep neural network (DNN)-based audio enhancement algorithms have shown great potential in their ability to capture data features with complicated nonlinear functions [16]. Xu et al. [38] introduced a deep learning model for automatic speech denoising that detects silent intervals to better capture the time-varying noise pattern. Saleem et al. [27] used a deep learning-based approach for audio enhancement in the presence of complex noises; an ideal binary mask (IBM) is used during training and testing, and the trained DNNs are used to estimate the IBM. Xu et al. [39] proposed a DNN-based supervised method to enhance audio by finding a mapping function between noisy and clean audio samples; a large mixture of noisy datasets is used during training, together with techniques including global variance equalization, dropout, and noise-aware training. Saleem et al. [26] also developed a supervised DNN-based single-channel audio enhancement algorithm and applied less aggressive Wiener filtering as an additional DNN layer. Vuong et al. [34] described a modulation-domain loss function for a deep learning-based audio enhancement approach, applying additional learnable spectro-temporal receptive fields to enhance the objective prediction of audio quality and intelligibility.
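An IBM training target of the kind used in [27] is defined bin-wise from the local SNR. The sketch below assumes access to separate clean and noise magnitudes (available only for synthetic mixtures) and a 0 dB local criterion; both are illustrative choices, not the exact configuration of the cited work:

```python
import numpy as np

def ideal_binary_mask(clean_mag, noise_mag, lc_db=0.0):
    """IBM: 1 where the local SNR exceeds the criterion lc_db, else 0."""
    snr_db = 20.0 * np.log10(np.maximum(clean_mag, 1e-12) / np.maximum(noise_mag, 1e-12))
    return (snr_db > lc_db).astype(np.float32)
```

A DNN trained on such targets then estimates the mask from the noisy mixture alone, since the clean reference is unavailable at test time.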
Yet a problem in DNN-based speech denoising is that it is sometimes difficult for models to track a target speaker among multiple training speakers, which means that DNNs do not easily handle long-term contexts [33, 16].
Therefore, deep learning approaches such as convolutional neural network (CNN)-based and recurrent neural network (RNN)-based models have been explored. Alamdari et al. [2] applied a fully convolutional neural network (FCN) for audio denoising with only noisy samples, and the study displayed the superiority of the new model compared to traditional supervised approaches. Germain et al. [10] trained an FCN using a deep feature loss derived from a network trained for acoustic environment detection and domestic audio tagging; the research showed that this approach is particularly useful for audio with the most intrusive background noise. Kong et al. [14] proposed an audio enhancement method with pre-trained audio neural networks (PANNs) using weakly labeled data with only audio tags of audio clips, and applied a convolutional U-Net to predict the waveform of individual anchor segments selected by the PANNs. Raj et al. [24] proposed a multilayered CNN-based auto-CODEC for audio signal denoising using mel-frequency cepstral coefficients, providing good encoding and high security. Abouzid et al. [1] combined convolutional and denoising autoencoders into convolutional denoising autoencoders for the suppression of noise and compression of audio data.
On top of single-type deep learning methods, Tan et