audio denoising into an image segmentation problem. By
removing the noise regions in the audio image, we achieve
audio denoising.
•
We develop an audio ImageMask tool to label the collected
dataset and apply a few-shot generalization strategy to
accelerate the data labeling process. We also demonstrate
that our model can be easily extended to speech denoising,
audio separation, audio enhancement, and noise estimation.
2. Related Work
Audio denoising has been widely explored, and methods have
evolved from traditional approaches that estimate the
statistical differences between noise and clean audio [37]
to deep learning methods [3].
Traditional methods for audio denoising date back to the
1970s. Boll [5] proposed a noise suppression algorithm for
spectral subtraction using the spectral noise bias calculated
in a non-speech environment. Another statistical method,
proposed in [28], is a more comprehensive algorithm that
combines the concept of the a priori signal-to-noise ratio
(SNR) with earlier typical audio enhancement schemes such as
Wiener filtering [6, 18], spectral subtraction, or maximum
likelihood estimation. Among frequency-domain algorithms,
minimum mean square error (MMSE)-based approaches are a
mainstream alternative to Wiener filtering. Hansen et
al. [12] proposed an auditory masking threshold enhancement
method by applying a generalized MMSE estimator in an
auditory enhancement scheme. In [19], the MMSE estimator is
used to enhance the performance of short-time spectral
coefficients by estimating the discrete Fourier transform
(DFT) coefficients of both noisy and clean speech.
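To make the classical pipeline concrete, the following is a minimal sketch of magnitude spectral subtraction in the spirit of Boll [5]; the 16 kHz sampling rate, the frame length, and the assumption that the first half second of the recording is noise-only are illustrative choices, not parameters from the cited work.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(x, fs=16000, noise_dur=0.5, nperseg=512):
    """Minimal magnitude spectral subtraction (after Boll [5]).

    Assumes the first `noise_dur` seconds of `x` are noise-only;
    the spectral noise bias is estimated from those frames.
    """
    f, t, X = stft(x, fs=fs, nperseg=nperseg)
    mag, phase = np.abs(X), np.angle(X)

    # Average noise magnitude over the noise-only frames (hop = nperseg // 2).
    n_noise = int(noise_dur * fs / (nperseg // 2))
    noise_mag = mag[:, :n_noise].mean(axis=1, keepdims=True)

    # Subtract the bias; floor negative values (half-wave rectification).
    clean_mag = np.maximum(mag - noise_mag, 0.0)

    _, y = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return y
```

The noisy phase is reused for reconstruction, a simplification shared by most of the classical schemes discussed above.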
One major problem is that the performance of traditional
methods for noise separation and reduction degrades in the
presence of natural noise, which differs substantially from
the artificial noise applied in experiments.
Wavelet transform methods were developed to overcome the
difficulty of studying signals with low SNR and are reported
to perform better than filtering methods. Zhao et al. [41]
used an improved threshold denoising method that overcomes
the discontinuity of hard-threshold denoising and reduces the
permanent bias of soft-threshold denoising. Srivastava et
al. [30] developed a wavelet denoising approach based on
wavelet shrinkage, allowing for the analysis of low-SNR
signals. Pouyani et al. [22] proposed an adaptive method
based on the discrete wavelet transform and an artificial
neural network to filter lung sound signals in a noisy
environment. Kui et al. [15] also combined the wavelet
algorithm with CNNs to classify log mel-frequency spectral
coefficient features from heart sound signals with higher
accuracy. These combined methods outperformed single wavelet
transform methods.
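As an illustration of the thresholding idea behind [30, 41], a minimal soft-threshold wavelet shrinkage sketch follows; the db4 wavelet, the decomposition level, and the universal threshold sigma * sqrt(2 log N) are common textbook choices assumed here, not the exact settings of the cited methods.

```python
import numpy as np
import pywt

def wavelet_denoise(x, wavelet="db4", level=4):
    """Soft-threshold wavelet shrinkage (illustrative settings)."""
    coeffs = pywt.wavedec(x, wavelet, level=level)
    # Robust noise estimate from the finest detail band (median / 0.6745).
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thresh = sigma * np.sqrt(2 * np.log(len(x)))  # universal threshold
    # Keep the approximation band; soft-threshold every detail band.
    denoised = [coeffs[0]] + [pywt.threshold(c, thresh, mode="soft")
                              for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)[: len(x)]
```

Soft thresholding shrinks the surviving coefficients toward zero, which avoids the discontinuity of hard thresholding at the cost of the bias that [41] seeks to reduce.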
Deep learning methods were later introduced to the audio
denoising field, compensating for the disadvantages of
traditional methods and demonstrating a stronger ability to
learn data characteristics from a few samples [37]. Deep
neural network (DNN)-based audio enhancement algorithms have
shown great potential in their ability to capture data
features with complicated nonlinear functions [16]. Xu et
al. [38] introduced a deep learning model for automatic
speech denoising that detects silent intervals to better
capture the time-varying noise pattern. Saleem et al. [27]
used a deep learning-based approach for audio enhancement in
the presence of complex noises. An ideal binary mask (IBM) is
used during training and testing, and the trained DNNs are
used to estimate the IBM (see the sketch after this
paragraph). Xu et al. [39] proposed a DNN-based supervised
method to enhance audio by finding a mapping function between
noisy and clean audio samples. A large set of noisy mixtures
is used during training, together with other techniques
including global variance equalization, dropout, and
noise-aware training strategies. Saleem et al. [26] also
developed a supervised DNN-based single-channel audio
enhancement algorithm and applied less aggressive Wiener
filtering as an additional DNN layer. Vuong et al. [34]
described a modulation-domain loss function for a deep
learning-based audio enhancement approach, applying
additional learnable spectro-temporal receptive fields to
improve the objective prediction of audio quality and
intelligibility.
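For reference, the ideal binary mask used as a training target in [27] can be computed from paired clean and noise signals as sketched below; the 0 dB local criterion and the STFT settings are common assumptions, not necessarily those of the cited work.

```python
import numpy as np
from scipy.signal import stft

def ideal_binary_mask(clean, noise, fs=16000, nperseg=512, lc_db=0.0):
    """IBM: 1 where the local SNR exceeds the criterion `lc_db`, else 0."""
    _, _, S = stft(clean, fs=fs, nperseg=nperseg)
    _, _, N = stft(noise, fs=fs, nperseg=nperseg)
    # Per time-frequency bin SNR in dB (epsilons guard against log(0)).
    snr_db = 20.0 * np.log10((np.abs(S) + 1e-10) / (np.abs(N) + 1e-10))
    return (snr_db > lc_db).astype(np.float32)
```

At test time the DNN predicts this mask from noisy features alone and applies it to the noisy spectrogram.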
Yet one problem in DNN-based speech denoising is that models
sometimes find it difficult to track a target speaker among
multiple training speakers, meaning that DNNs do not easily
handle long-term contexts [33, 16].
Therefore, deep learning approaches such as convolutional
neural network (CNN)-based and recurrent neural network
(RNN)-based models have been explored. Alamdari et al. [2]
applied a fully convolutional neural network (FCN) for audio
denoising using only noisy samples, and the study
demonstrated the superiority of the new model over
traditional supervised approaches. Germain et al. [10]
trained an FCN using a deep feature loss derived from a
network trained for acoustic environment detection and
domestic audio tagging. The research showed that this
approach is particularly useful for audio with the most
intrusive background noise. Kong et al. [14] proposed an
audio enhancement method with pre-trained audio neural
networks (PANNs), which learn from weakly labeled data
containing only clip-level audio tags, and applied a
convolutional U-Net to predict the waveforms of individual
anchor segments selected by the PANNs. Raj et al. [24]
proposed a multilayered CNN-based auto-CODEC for audio signal
denoising that uses mel-frequency cepstral coefficients,
providing good encoding and high security. Abouzid et al. [1]
combined convolutional and denoising autoencoders into
convolutional denoising autoencoders for the suppression of
noise and compression of audio data.
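A minimal PyTorch sketch of the convolutional denoising autoencoder idea in [1] is shown below; the layer widths, kernel sizes, and spectrogram input format are illustrative assumptions rather than the cited architecture.

```python
import torch
import torch.nn as nn

class ConvDenoisingAE(nn.Module):
    """Toy convolutional denoising autoencoder (illustrative sizes).

    Input: (batch, 1, freq, time) magnitude spectrogram patches,
    with freq and time divisible by 4.
    """
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1,
                               output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1,
                               output_padding=1), nn.ReLU(),
        )

    def forward(self, noisy):
        # Compress to a bottleneck, then reconstruct a clean estimate.
        return self.decoder(self.encoder(noisy))

# Training minimizes reconstruction error on (noisy, clean) pairs, e.g.
# loss = nn.functional.mse_loss(model(noisy_batch), clean_batch)
```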
On top of single-type deep learning methods, Tan et