SIMPLE POOLING FRONT-ENDS FOR EFFICIENT AUDIO CLASSIFICATION
Xubo Liu1, Haohe Liu1, Qiuqiang Kong2, Xinhao Mei1, Mark D. Plumbley1, Wenwu Wang1
1Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, UK
2Speech, Audio, and Music Intelligence (SAMI) Group, ByteDance, China
ABSTRACT
Recently, there has been increasing interest in building efficient audio neural networks for on-device scenarios. Most existing approaches are designed to reduce the size of audio neural networks using methods such as model pruning. In this work, we show that instead of reducing model size using complex methods, eliminating the temporal redundancy in the input audio features (e.g., the mel-spectrogram) could be an effective approach for efficient audio classification. To do so, we propose a family of simple pooling front-ends (SimPFs) which use simple non-parametric pooling operations to reduce the redundant information within the mel-spectrogram. We perform extensive experiments on four audio classification tasks to evaluate the performance of SimPFs. Experimental results show that SimPFs can reduce the number of floating point operations (FLOPs) of off-the-shelf audio neural networks by more than half, with negligible degradation or even some improvement in audio classification performance.
Index Terms: Audio classification, audio front-ends, on-device, convolutional neural networks, deep learning
1. INTRODUCTION
Audio classification is an important research topic in the field of signal processing and machine learning. There are many applications of audio classification, such as acoustic scene classification [1], sound event detection [2] and keyword spotting [3]. Audio classification plays a key role in many real-world applications including acoustic monitoring [4], healthcare [5] and multimedia indexing [6].
Neural network methods such as convolutional neural networks (CNNs) have been used for audio classification and have achieved state-of-the-art performance [7, 8, 9]. Generally, state-of-the-art audio classification models are designed with large sizes and complicated modules, which make the audio classification networks computationally inefficient in terms of, e.g., the number of floating point operations (FLOPs) and running memory. However, in many real-world scenarios, audio classification models need to be deployed on resource-constrained platforms such as mobile devices [10].
There has been increasing interest in building efficient audio neural networks in the literature. Existing methods can generally be divided into three categories. The first is to utilize model compression techniques such as pruning [11, 12]. The second is to transfer the knowledge from a large-scale pre-trained model to a small model via knowledge distillation [13, 14, 15]. The last one is to directly exploit efficient networks for audio classification, such as MobileNets [7, 16]. In summary, these methods mainly focus on reducing model size. However, the computational cost (e.g., FLOPs) of the audio neural network is not only determined by the size of the model but also highly dependent on the size of the input features.
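As a rough back-of-the-envelope illustration of this dependence (our own, not an equation taken from the paper): for a single 2D convolutional layer with a k × k kernel, C_in input channels and C_out output channels applied to a feature map with T time frames and F mel bins, the multiply-accumulate count grows linearly with T, so compressing the time axis by a factor r shrinks the cost of every such layer by roughly the same factor:

```latex
% Approximate cost of one convolutional layer; all symbols
% (k, C_in, C_out, T, F, r) are illustrative assumptions.
\mathrm{MACs} \approx k^{2}\, C_{\mathrm{in}}\, C_{\mathrm{out}}\, T F,
\qquad
\frac{\mathrm{MACs}(T/r)}{\mathrm{MACs}(T)} \approx \frac{1}{r}.
```

Under this approximation, a time-compression factor of r = 4 would remove about 75% of the convolutional FLOPs, which matches the scale of the reductions reported later in the paper.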
Existing audio neural networks usually take a mel-spectrogram as input, which may be temporally redundant. For example, the pattern of a siren audio clip is highly repetitive in the spectrogram, as shown in Figure 1. In principle, if one can remove the redundancy in the input mel-spectrogram, the computational cost can be significantly reduced. However, reducing the input feature size of audio neural networks has received little attention in the literature, especially in terms of improving their computation efficiency.
In this paper, we propose a family of simple pooling front-ends (SimPFs) for improving the computation efficiency of audio neural networks. SimPFs utilize simple non-parametric pooling methods (e.g., max pooling) to eliminate the temporally redundant information in the input mel-spectrogram. This simple pooling operation on the input mel-spectrogram achieves a substantial improvement in computation efficiency for audio neural networks. To evaluate the effectiveness of SimPFs, we conduct extensive experiments on four audio classification datasets including DCASE19 acoustic scene classification [17], ESC-50 environmental sound classification [18], Google SpeechCommands keyword spotting [3], and AudioSet audio tagging [19]. We demonstrate that SimPFs can reduce more than half of the computation FLOPs for off-the-shelf audio neural networks [7], with negligibly degraded or even improved classification performance. For example, on DCASE19 acoustic scene classification, SimPF can reduce the FLOPs by 75% while improving the classification accuracy by approximately 1.2%. Our proposed SimPFs are simple to implement and can be integrated into any audio neural network at a negligible computation cost. The code of our proposed method is made available at GitHub1.
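To make the idea concrete, the sketch below shows one way such a non-parametric pooling front-end could look in PyTorch, assuming max pooling along the time axis; the class name SimPFMaxPool, the pool_factor argument and the tensor layout are our own illustrative choices, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F


class SimPFMaxPool(torch.nn.Module):
    """Minimal sketch of a simple pooling front-end (SimPF).

    It compresses the time axis of a mel-spectrogram with
    non-parametric max pooling before the spectrogram is fed
    to an off-the-shelf audio classifier. Names and defaults
    are illustrative assumptions, not the paper's code.
    """

    def __init__(self, pool_factor: int = 4):
        super().__init__()
        self.pool_factor = pool_factor

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, time_frames, mel_bins).
        # max_pool1d pools over the last dimension, so swap time
        # and frequency, pool along time, then swap back.
        x = mel.transpose(1, 2)                       # (batch, mel_bins, time_frames)
        x = F.max_pool1d(x, kernel_size=self.pool_factor)
        return x.transpose(1, 2)                      # (batch, time_frames // pool_factor, mel_bins)


if __name__ == "__main__":
    # A 10-second clip at ~100 frames/s with 64 mel bins,
    # compressed 4x along time before classification.
    mel = torch.randn(1, 1000, 64)
    front_end = SimPFMaxPool(pool_factor=4)
    print(front_end(mel).shape)  # torch.Size([1, 250, 64])
```

Because the front-end has no learnable parameters, it adds essentially no cost of its own, while every downstream convolutional layer sees pool_factor times fewer time frames, which is where the FLOPs savings come from.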
The remainder of this paper is organized as follows. Section 2 reviews work related to this paper. Section 3 introduces our proposed SimPFs for efficient audio classification. Section 4 presents the experimental settings and the evaluation results. Conclusions and future directions are given in Section 5.
2. RELATED WORK
Our work relates to several lines of work in the literature: efficient audio classification, feature reduction for audio classification, and audio front-ends. We discuss each of these below.
2.1. Efficient audio classification
Efficient audio classification for on-device applications has attracted increasing attention in recent years. Singh et al. [11, 12, 20] proposed using pruning methods to eliminate redundancy in audio convolutional neural networks for acoustic scene classification, which can reduce FLOPs by approximately 25% with a 1% reduction in accuracy. Knowledge distillation methods [13, 14, 15] have been used for efficient audio
1https://github.com/liuxubo717/SimPFs