SIMPLE POOLING FRONT-ENDS FOR EFFICIENT AUDIO CLASSIFICATION
Xubo Liu1, Haohe Liu1, Qiuqiang Kong2, Xinhao Mei1, Mark D. Plumbley1, Wenwu Wang1
1Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK
2Speech, Audio, and Music Intelligence (SAMI) Group, ByteDance, China
ABSTRACT
Recently, there has been increasing interest in building efficient audio neural networks for on-device scenarios. Most existing approaches are designed to reduce the size of audio neural networks using methods such as model pruning. In this work, we show that instead of reducing model size using complex methods, eliminating the temporal redundancy in the input audio features (e.g., the mel-spectrogram) could be an effective approach for efficient audio classification. To do so, we propose a family of simple pooling front-ends (SimPFs) which use simple non-parametric pooling operations to reduce the redundant information within the mel-spectrogram. We perform extensive experiments on four audio classification tasks to evaluate the performance of SimPFs. Experimental results show that SimPFs can reduce the number of floating point operations (FLOPs) of off-the-shelf audio neural networks by more than half, with negligible degradation or even some improvement in audio classification performance.
Index Terms: Audio classification, audio front-ends, on-device, convolutional neural networks, deep learning
1. INTRODUCTION
Audio classification is an important research topic in the field of signal processing and machine learning. There are many applications of audio classification, such as acoustic scene classification [1], sound event detection [2] and keyword spotting [3]. Audio classification plays a key role in many real-world applications including acoustic monitoring [4], healthcare [5] and multimedia indexing [6].
Neural network methods such as convolutional neural networks (CNNs) have been used for audio classification and have achieved state-of-the-art performance [7, 8, 9]. Generally, state-of-the-art audio classification models are designed with large sizes and complicated modules, which make the audio classification networks computationally inefficient in terms of, e.g., the number of floating point operations (FLOPs) and running memory. However, in many real-world scenarios, audio classification models need to be deployed on resource-constrained platforms such as mobile devices [10].
There has been increasing interest in building efficient audio neural networks in the literature. Existing methods can generally be divided into three categories. The first is to utilize model compression techniques such as pruning [11, 12]. The second is to transfer the knowledge from a large-scale pre-trained model to a small model via knowledge distillation [13, 14, 15]. The last one is to directly exploit efficient networks for audio classification, such as MobileNets [7, 16]. In summary, these methods mainly focus on reducing model size. However, the computational cost (e.g., FLOPs) of the audio neural network is not only determined by the size of the model but also highly dependent on the size of the input features.
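As a rough back-of-the-envelope illustration of this dependence (our own, not an equation taken from the paper): for a single 2D convolutional layer with a k × k kernel, C_in input channels and C_out output channels applied to a feature map with T time frames and F mel bins, the multiply-accumulate count grows linearly with T, so compressing the time axis by a factor r shrinks the cost of every such layer by roughly the same factor:

```latex
% Approximate cost of one convolutional layer; all symbols
% (k, C_in, C_out, T, F, r) are illustrative assumptions.
\mathrm{MACs} \approx k^{2}\, C_{\mathrm{in}}\, C_{\mathrm{out}}\, T F,
\qquad
\frac{\mathrm{MACs}(T/r)}{\mathrm{MACs}(T)} \approx \frac{1}{r}.
```

Under this approximation, a time-compression factor of r = 4 would remove about 75% of the convolutional FLOPs, which matches the scale of the reductions reported later in the paper.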
Existing audio neural networks usually take a mel-spectrogram as input, which may be temporally redundant. For example, the pattern of a siren audio clip is highly repetitive in the spectrogram, as shown in Figure 1. In principle, if one can remove the redundancy in the input mel-spectrogram, the computational cost can be significantly reduced. However, reducing the input feature size of audio neural networks has received little attention in the literature, especially in terms of improving their computation efficiency.
In this paper, we propose a family of simple pooling front-ends (SimPFs) for improving the computation efficiency of audio neural networks. SimPFs utilize simple non-parametric pooling methods (e.g., max pooling) to eliminate the temporally redundant information in the input mel-spectrogram. This simple pooling operation on the input mel-spectrogram achieves a substantial improvement in computation efficiency for audio neural networks. To evaluate the effectiveness of SimPFs, we conduct extensive experiments on four audio classification datasets including DCASE19 acoustic scene classification [17], ESC-50 environmental sound classification [18], Google SpeechCommands keyword spotting [3], and AudioSet audio tagging [19]. We demonstrate that SimPFs can reduce more than half of the computation FLOPs for off-the-shelf audio neural networks [7], with negligibly degraded or even improved classification performance. For example, on DCASE19 acoustic scene classification, SimPF can reduce the FLOPs by 75% while improving the classification accuracy by approximately 1.2%. Our proposed SimPFs are simple to implement and can be integrated into any audio neural network at a negligible computation cost. The code of our proposed method is made available at GitHub1.
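To make the idea concrete, the sketch below shows one way such a non-parametric pooling front-end could look in PyTorch, assuming max pooling along the time axis; the class name SimPFMaxPool, the pool_factor argument and the tensor layout are our own illustrative choices, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F


class SimPFMaxPool(torch.nn.Module):
    """Minimal sketch of a simple pooling front-end (SimPF).

    It compresses the time axis of a mel-spectrogram with
    non-parametric max pooling before the spectrogram is fed
    to an off-the-shelf audio classifier. Names and defaults
    are illustrative assumptions, not the paper's code.
    """

    def __init__(self, pool_factor: int = 4):
        super().__init__()
        self.pool_factor = pool_factor

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, time_frames, mel_bins).
        # max_pool1d pools over the last dimension, so swap time
        # and frequency, pool along time, then swap back.
        x = mel.transpose(1, 2)                       # (batch, mel_bins, time_frames)
        x = F.max_pool1d(x, kernel_size=self.pool_factor)
        return x.transpose(1, 2)                      # (batch, time_frames // pool_factor, mel_bins)


if __name__ == "__main__":
    # A 10-second clip at ~100 frames/s with 64 mel bins,
    # compressed 4x along time before classification.
    mel = torch.randn(1, 1000, 64)
    front_end = SimPFMaxPool(pool_factor=4)
    print(front_end(mel).shape)  # torch.Size([1, 250, 64])
```

Because the front-end has no learnable parameters, it adds essentially no cost of its own, while every downstream convolutional layer sees pool_factor times fewer time frames, which is where the FLOPs savings come from.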
The remainder of this paper is organized as follows. Section 2 reviews work related to this paper. Section 3 introduces our proposed SimPFs for efficient audio classification. Section 4 presents the experimental settings and the evaluation results. Conclusions and future directions are given in Section 5.
2. RELATED WORK
Our work relates to several lines of work in the literature: efficient audio classification, feature reduction for audio classification, and audio front-ends. We discuss each of these below.
2.1. Efficient audio classification
Efficient audio classification for on-device applications has attracted increasing attention in recent years. Singh et al. [11, 12, 20] proposed using pruning methods to eliminate redundancy in audio convolutional neural networks for acoustic scene classification, which can reduce FLOPs by approximately 25% with a 1% reduction in accuracy. Knowledge distillation methods [13, 14, 15] have been used for efficient audio
1https://github.com/liuxubo717/SimPFs