Learning Temporal Resolution in Spectrogram for Audio Classification
Haohe Liu1, Xubo Liu1, Qiuqiang Kong2, Wenwu Wang1, Mark D. Plumbley1
1University of Surrey
2The Chinese University of Hong Kong
Abstract
The audio spectrogram is a time-frequency representation
that has been widely used for audio classification. One of
the key attributes of the audio spectrogram is the temporal
resolution, which depends on the hop size used in the Short-
Time Fourier Transform (STFT). Previous works generally
assume the hop size should be a constant value (e.g., 10 ms).
However, a fixed temporal resolution is not always optimal
for different types of sound. The temporal resolution affects
not only classification accuracy but also computational cost.
This paper proposes a novel method, DiffRes, that enables
differentiable temporal resolution modeling for audio classi-
fication. Given a spectrogram calculated with a fixed hop size,
DiffRes merges non-essential time frames while preserving
important frames. DiffRes acts as a “drop-in” module be-
tween an audio spectrogram and a classifier and can be jointly
optimized with the classification task. We evaluate DiffRes on
five audio classification tasks, using mel-spectrograms as the
acoustic features, followed by off-the-shelf classifier back-
bones. Compared with previous methods using the fixed tem-
poral resolution, the DiffRes-based method can achieve the
equivalent or better classification accuracy with at least 25%
computational cost reduction. We further show that DiffRes
can improve classification accuracy by increasing the tempo-
ral resolution of input acoustic features, without adding to the
computational cost.
1 Introduction
Audio classification refers to a series of tasks that assign
labels to an audio clip. Those tasks include audio tag-
ging (Kong et al. 2020), speech keyword classification (Kim
et al. 2021), and music genre classification (Castellon, Don-
ahue, and Liang 2021). The input to an audio classification
system is usually a one-dimensional audio waveform, which
can be represented by discrete samples. Although there are
methods using time-domain samples as features (Kong et al.
2020; Lee et al. 2017), the majority of studies on audio clas-
sification convert the waveform into a spectrogram as the in-
put feature (Gong, Chung, and Glass 2021b,a). A spectrogram is usually calculated with the Fourier transform (Champeney and Champeney 1987), which is applied to short waveform chunks multiplied by a windowing function, resulting in a two-dimensional time-frequency representation.
[Figure 1 panels: spectrograms of Insect, Siren, and Alarm Clock sounds at 10 ms and 40 ms hop sizes.]
Figure 1: Spectrograms of the Alarm Clock and Siren sounds with 40 ms and 10 ms hop sizes, all with a 25 ms window size. The pattern of Siren, which is relatively stable, does not change significantly with a smaller hop size (i.e., higher temporal resolution), while Alarm Clock is the opposite.
to the Gabor’s uncertainty principle (Gabor 1946), there is
always a trade-off between time and frequency resolutions.
To achieve the desired resolution on the temporal dimension,
it is a common practice (Kong et al. 2021a; Liu et al. 2022)
to apply a fixed hop size between windows to capture the
dynamics between adjacent frames. With the fixed hop size,
the spectrogram has a fixed temporal resolution, which we
will refer to simply as resolution in this work.
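As a concrete illustration of how the hop size sets this resolution, the short sketch below (a minimal example assuming torchaudio is available; the waveform and parameter values are placeholders, not values used in our experiments) computes mel-spectrograms with 10 ms and 40 ms hop sizes and a 25 ms window, mirroring the setup in Figure 1.

```python
# Minimal sketch: hop size determines the temporal resolution (frames per second)
# of a mel-spectrogram. Assumes torchaudio; the waveform and settings are placeholders.
import torch
import torchaudio

sample_rate = 16000
waveform = torch.randn(1, sample_rate * 10)          # 10 s placeholder signal

for hop_ms in (10, 40):
    hop_length = int(sample_rate * hop_ms / 1000)    # hop size in samples
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,
        n_fft=400,                                   # 25 ms window at 16 kHz
        win_length=400,
        hop_length=hop_length,
        n_mels=64,
    )(waveform)                                      # shape: (1, n_mels, T)
    print(f"{hop_ms} ms hop -> {mel.shape[-1]} frames ({1000 // hop_ms} frames per second)")
```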
Using a fixed resolution is not necessarily optimal for an
audio classification model. Intuitively, the resolution should
depend on the temporal pattern: fast-changing signals are
supposed to have high resolution, while relatively steady
signals or blank signals may not need the same high reso-
lution for the best accuracy (Huzaifah 2017). For example,
Figure 1 shows that by increasing resolution, more details
appear in the spectrogram of Alarm Clock while the pattern
of Siren stays mostly the same. This suggests that the finer details in the high-resolution Siren spectrogram may not contribute substantially to classification accuracy. There are plenty of studies on learning a suitable frequency resolution in a similar spirit (Stevens, Volkmann, and Newman 1937; Sainath et al. 2013; Ravanelli and Bengio 2018b; Zeghidour et al. 2021). In terms of temporal resolution, most previous studies focus on investigating the effect of different resolutions (Kekre et al. 2012; Huzaifah 2017; Ilyashenko et al. 2019; Liu et al. 2023).
Huzaifah (2017) observes that the optimal temporal resolution for audio classification is class-dependent. Ferraro et al. (2021) experiment on music tagging with coarse-resolution spectrograms and observe that similar performance can be maintained while being much faster to compute. Kazakos et al. (2021) propose a two-stream architecture that processes both fine-grained and coarse-resolution spectrograms,
achieving state-of-the-art results on VGG-Sound (Chen et al. 2020). Recently, Liu et al. (2023) proposed a non-parametric spectrogram-pooling-based module that can improve classification efficiency with negligible performance degradation. However, these approaches are generally built
on a fixed temporal resolution, which is not always optimal
for diverse sounds in the world. Intuitively, it is natural to
ask: can we dynamically learn the temporal resolution for
audio classification?
In this work, we demonstrate the first attempt to learn
temporal resolution in the spectrogram for audio classifi-
cation. We show that learning temporal resolution leads
to efficiency and accuracy improvements over the fixed-
resolution spectrogram. We propose a lightweight algo-
rithm, DiffRes, that makes spectrogram resolution differ-
entiable during model optimization. DiffRes can be used
as a “drop-in” module after spectrogram calculation and
optimized jointly with the downstream task. For the op-
timization of DiffRes, we propose a loss function, guide
loss, to inform the model of the low importance of empty
frames formed by SpecAug (Park et al. 2019). The output of
DiffRes is a time-frequency representation with varying res-
olution, which is achieved by adaptively merging the time
steps of a fixed-resolution spectrogram. The adaptive temporal resolution alleviates temporal redundancy in the spectrogram and can speed up computation during training and
inference. We perform experiments on five different audio
tasks, including the largest audio dataset AudioSet (Gem-
meke et al. 2017). DiffRes shows clear improvements on all
tasks over the fixed-resolution mel-spectrogram baseline and
other learnable front-ends (Zeghidour et al. 2021; Ravanelli
and Bengio 2018b; Zeghidour et al. 2018). Compared with methods using fixed-resolution spectrograms, we show that DiffRes-based models can achieve a computational cost reduction of at least 25% with equivalent or better audio classification accuracy.
Besides, the potential of high-resolution spectrograms, e.g., with a one-millisecond hop size, is still unclear. Popular choices of hop size include 10 ms (Böck et al. 2012; Kong et al. 2020; Gong, Chung, and Glass 2021a) and 12.5 ms (Rybakov et al. 2022). Previous studies (Kong et al. 2020; Ferraro et al. 2021) show that classification performance can be steadily improved by increasing the resolution. One remaining question is: can even finer resolution improve the performance? We conduct an ablation study on this question on a limited-vocabulary speech recognition task with hop sizes smaller than 10 ms. We notice that accuracy can still be improved with smaller hop sizes, at the cost of increased computational complexity. By introducing DiffRes with high-resolution spectrograms, we observe that the classifier performance gains are maintained while the computational cost is significantly reduced.
Our contributions are summarized as follows:
• We present DiffRes, a differentiable approach for learning temporal resolution in the audio spectrogram, which improves classification accuracy and reduces the computational cost for off-the-shelf audio classification models.
• We extensively evaluate the effectiveness of DiffRes on five audio classification tasks. We further show that DiffRes can improve classification accuracy by increasing the temporal resolution of input acoustic features, without adding to the computational cost.
• Our code is available at https://github.com/haoheliu/diffres-python.
2 Method
We provide an overview of DiffRes-based audio classification in Section 2.1. We introduce the detailed formulation and the optimization of DiffRes in Sections 2.2 and 2.3.
2.1 Overview
Let $x \in \mathbb{R}^L$ denote a one-dimensional audio time waveform, where $L$ is the number of audio samples. An audio classification system can be decomposed into a feature extraction stage and a classification stage. In the feature extraction stage, the audio waveform is processed by a function $Q_{l,h}: \mathbb{R}^L \rightarrow \mathbb{R}^{F \times T}$, which maps the time waveform into a two-dimensional time-frequency representation $X$, such as a mel-spectrogram, where $X_{:,\tau} = (X_{1,\tau}, \dots, X_{F,\tau})$ is the $\tau$-th frame. Here, $T$ and $F$ stand for the time and frequency dimensions of the extracted representation. We also refer to the representation along the temporal dimension as frames. We use $l$ and $h$ to denote window length and hop size, respectively. Usually $T \approx \frac{L}{h}$. We define the temporal resolution $\frac{1}{h}$ in frames per second (FPS), which denotes the number of frames in one second. In the classification stage, $X$ is processed by a classification model $G_\theta$ parameterized by $\theta$. The output of $G_\theta$ is the label prediction $\hat{y}$, in which $\hat{y}_i$ denotes the probability of class $i$. Given the paired training data $(x, y) \in D$, where $y$ denotes the one-hot vector of ground-truth labels, the optimization of the classification system can be formulated as

$$\arg\min_{\theta} \; \mathbb{E}_{(x,y)\in D} \, \mathcal{L}(G_\theta(X), y), \quad (1)$$

where $\mathcal{L}$ is a loss function such as cross entropy (De Boer et al. 2005). Figure 2 shows an overview of performing classification with DiffRes. DiffRes is a "drop-in" module between $X$ and $G_\theta$ that focuses on learning the optimal temporal resolution with a learnable function $F_\phi: \mathbb{R}^{F \times T} \rightarrow \mathbb{R}^{F \times t}$, where $t$ is the parameter denoting the target output time dimension of DiffRes, and $\phi$ denotes the learnable parameters. DiffRes formulates $F_\phi$ in two steps: i) estimating the importance of each time frame with a learnable model $H_\phi: X \mapsto s'$, where $s'$ is a row vector of shape $1 \times T$; and ii) warping frames based on a frame warping algorithm, where the warping is performed along a single direction on the temporal dimension. We introduce the details of these two steps in Section 2.2. We define the dimension reduction rate $\delta$ of DiffRes by $\delta = (T - t)/T$. Usually, $\delta \leq 1$ and $t \leq T$ because the temporal resolution of the DiffRes output is either coarser than or equal to that of $X$. Given the same $T$, a larger $\delta$ means fewer temporal dimensions $t$ in the output of DiffRes, and usually less computation is needed for $G_\theta$. Similar to Equation 1, $F_\phi$ can be jointly optimized with $G_\theta$ by

$$\arg\min_{\theta, \phi} \; \mathbb{E}_{(x,y)\in D} \, \mathcal{L}(G_\theta(F_\phi(X)), y). \quad (2)$$
[Figure 2 diagram: input audio waveform → STFT → MelFB+Log (mel-spectrogram) → frame importance estimation with a stack of ResConv1D blocks → rescale importance score → calculate warp matrix → merge frames while preserving important frames (DiffRes) → downstream task (EfficientNet classifier).]
Figure 2: Audio classification with DiffRes and mel-spectrogram. Green blocks contain learnable parameters. DiffRes is a
“drop-in” module between spectrogram calculation and the downstream task.
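To make the "drop-in" role and the joint optimization in Equation 2 concrete, the following is a minimal PyTorch sketch. The DiffResLike module, the tiny classifier, and the importance-weighted pooling used here are illustrative placeholders rather than the released diffres-python API; in particular, the pooling is a simplification of the frame warping algorithm described in Section 2.2.

```python
# Minimal sketch of joint optimization (Equation 2): a DiffRes-like front-end F_phi
# and a classifier G_theta trained together. All module names and the pooling-based
# "warping" below are simplified placeholders, not the released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffResLike(nn.Module):
    """Toy stand-in for F_phi: maps a fixed-resolution spectrogram (F x T) to (F x t)
    using learned frame-importance scores and importance-weighted pooling."""
    def __init__(self, n_freq: int, dim_reduction_rate: float = 0.5):
        super().__init__()
        self.delta = dim_reduction_rate
        # Simplified H_phi: a single Conv1d instead of the ResConv1D stack.
        self.importance = nn.Conv1d(n_freq, 1, kernel_size=3, padding=1)

    def forward(self, spec):                           # spec: (batch, F, T)
        T = spec.shape[-1]
        t = int(round(T * (1 - self.delta)))           # target number of output frames
        s = torch.sigmoid(self.importance(spec))       # (batch, 1, T) importance scores
        weighted = F.adaptive_avg_pool1d(spec * s, t)  # importance-weighted pooling to t frames
        norm = F.adaptive_avg_pool1d(s, t).clamp(min=1e-6)
        return weighted / norm                         # (batch, F, t)

n_freq, n_classes = 64, 10
front_end = DiffResLike(n_freq)                        # phi
classifier = nn.Sequential(                            # theta (placeholder G_theta)
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(n_freq, n_classes))
optimizer = torch.optim.Adam(
    list(front_end.parameters()) + list(classifier.parameters()), lr=1e-3)

spec = torch.randn(8, n_freq, 1000)                    # batch of fixed-resolution spectrograms
labels = torch.randint(0, n_classes, (8,))
optimizer.zero_grad()
loss = F.cross_entropy(classifier(front_end(spec)), labels)
loss.backward()                                        # gradients flow into both phi and theta
optimizer.step()
```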
2.2 Differentiable temporal resolution modeling
Frame importance estimation. We design a frame importance estimation module $H_\phi$ to decide the proportion of each frame that needs to be kept in the output, which is similar to the sample weighting operation (Zhang and Pfister 2021) in previous studies. The frame importance estimation module outputs a row vector $s'$ of shape $1 \times T$, where the element $s'_\tau$ is the importance score of the $\tau$-th time frame $X_{:,\tau}$. The frame importance estimation can be denoted as

$$s' = \sigma(H_\phi(X)), \quad (3)$$

where $s'$ is the row vector of importance scores and $\sigma$ is the sigmoid function. A higher value of $s'_\tau$ indicates that the $\tau$-th frame is important for classification. We apply the sigmoid function to stabilize training by limiting the values in $s'$ to between zero and one. We implement $H_\phi$ with a stack of one-dimensional convolutional neural networks (CNNs) (Fukushima and Miyake 1982; LeCun et al. 1989). Specifically, $H_\phi$ is a stack of five one-dimensional convolutional blocks (ResConv1D). We design the ResConv1D block following other CNN-based methods (Shu et al. 2021; Liu et al. 2020; Kong et al. 2021b). Each ResConv1D has two layers of one-dimensional CNN with batch normalization (Ioffe and Szegedy 2015) and leaky rectified linear unit activation functions. We apply residual connections (He et al. 2016) for easier training of the deep architecture (Zaeemzadeh, Rahnavard, and Shah 2020). Each CNN layer is zero-padded so that the temporal dimension does not change (LeCun, Bengio, and Hinton 2015). We use exponentially decreasing channel numbers to reduce the computation. In the next frame warping step (Section 2.2), elements of the importance score will represent the proportion of each input frame that contributes to an output frame. Therefore, we perform a rescale operation on $s'$, resulting in an $s$ that satisfies $s \in [0,1]^{1 \times T}$ and $\sum_{k=1}^{T} s_k \leq t$. The rescale operation can be denoted as $\check{s} = \frac{s'}{\sum_{i=1}^{T} s'_i} \cdot t$, $s = \frac{\check{s}}{\max(\check{s}, 1)}$, where $\check{s}$ is an intermediate variable that may contain elements greater than one, and $\max$ denotes the (element-wise) maximum operation.
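As a quick numeric check of this rescale step, the sketch below (plain NumPy; the score values, $T = 6$, and $t = 3$ are arbitrary illustrations) shows that after rescaling the scores lie in $[0, 1]$ and sum to at most $t$.

```python
# Numeric sketch of the rescale operation: s' -> s_check -> s.
# The raw scores, T = 6, and t = 3 below are arbitrary illustrative values.
import numpy as np

t = 3
s_prime = np.array([0.9, 0.8, 0.1, 0.1, 0.05, 0.05])  # sigma(H_phi(X)), shape (T,)

s_check = s_prime * t / s_prime.sum()                  # intermediate: sums to exactly t
s = s_check / np.maximum(s_check, 1.0)                 # entries above one are clipped to one

print(np.round(s, 3))                                  # [1. 1. 0.15 0.15 0.075 0.075]
print(float(s.sum()) <= t)                             # True: scores sum to at most t
```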
To quantify how actively $H_\phi$ tries to distinguish between important and less important frames, we also design a measurement, the activeness $\rho$, which is calculated as the standard deviation of the scores of the non-empty frames, given by

$$\rho = \frac{1}{1-\delta} \sqrt{\frac{\sum_{i \in S_{\text{active}}} (s_i - \bar{s})^2}{|S_{\text{active}}|}}, \quad (4)$$

$$S_{\text{active}} = \{\, i \mid E(X_{:,i}) > \min_j\big(E(X_{:,j})\big) + \epsilon \,\}, \quad (5)$$

where $S_{\text{active}}$ is the set of indices of non-empty frames, $\bar{s}$ is the mean score over $S_{\text{active}}$, $\epsilon$ is a small value, $|S|$ denotes the size of a set $S$, the function $E(\cdot)$ calculates the root-mean-square energy (Law and Rennie 2015) of a frame in the spectrogram, and $\min_j(\cdot)$ takes the minimum over all frames. We use $\delta$ to unify the value of $\rho$ for easier comparison between different $\delta$ settings. The activeness $\rho$ can be used as an indicator of how DiffRes behaves during training. A higher $\rho$ indicates the model is more active at learning the frame importance, while a lower $\rho$, such as zero, indicates it is learning nothing. We will discuss the learning process of DiffRes with $\rho$ in Section 3.3.
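A rough sketch of how the activeness $\rho$ in Equations 4 and 5 can be computed is given below (NumPy; the spectrogram, scores, $\delta$, and $\epsilon$ are placeholder values, not settings from our experiments).

```python
# Rough sketch of the activeness measurement rho (Equations 4-5).
# X, s, delta, and eps below are placeholders, not values from the paper.
import numpy as np

def activeness(X, s, delta, eps=1e-2):
    energy = np.sqrt((X ** 2).mean(axis=0))        # RMS energy E(.) of each frame, shape (T,)
    active = energy > energy.min() + eps           # S_active: indices of non-empty frames
    return s[active].std() / (1.0 - delta)         # std of scores on non-empty frames, scaled

X = np.abs(np.random.randn(64, 100))               # placeholder spectrogram, F = 64, T = 100
s = np.random.rand(100)                            # placeholder rescaled importance scores
print(activeness(X, s, delta=0.25))
```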
Temporal frame warping. We perform temporal frame warping based on $s$ and $X$ to calculate a representation $O$ with adaptive temporal resolution, which is similar to the idea of generating derived features (Pentreath 2015). Generally, the temporal frame warping algorithm can be denoted by $W = \alpha(s)$ and $O = \beta(X, W)$, where $\alpha(\cdot)$ is a function that converts $s$ into a warp matrix $W$ of shape $t \times T$, and $\beta(\cdot)$ is a function that applies $W$ to $X$ to calculate the warped feature $O$. An element $W_{i,j}$ of $W$ denotes the contribution of the $j$-th input frame $X_{:,j}$ to the $i$-th output frame $O_{:,i}$. We introduce the realizations of $\alpha(\cdot)$ and $\beta(\cdot)$ as follows.

The function $\alpha(\cdot)$ calculates the warp matrix $W$ from $s$ by

$$W_{i,j} = \begin{cases} s_j, & \text{if } i < \sum_{k=1}^{j} s_k \leq i + 1, \\ 0, & \text{otherwise}, \end{cases} \quad (6)$$

where we calculate the cumulative sum of $s$ to decide which output frame each input frame will be warped into. The warp matrix $W$ is then used by the frame warping function $\beta(\cdot)$.

The function $\beta(\cdot)$ performs frame warping based on the warp matrix $W$. The $i$-th output frame is calculated from $X$ and the $i$-th row of $W$, given by

$$O_{j,i} = \mathcal{A}\big(X_{j,:} \odot W_{i,:}\big), \quad (7)$$

where $\odot$ denotes element-wise multiplication, $\mathcal{A}: \mathbb{R}^{1 \times T} \rightarrow \mathbb{R}$ stands for a frame aggregation function such as averaging, and $O$ is the final output feature with shape $F \times t$.
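To illustrate the warping step, here is a rough NumPy sketch of $\alpha(\cdot)$ and $\beta(\cdot)$ under the cumulative-sum assignment above, with averaging as the aggregation function; the helper name and the weighted-average normalization are illustrative choices and may differ from the released implementation.

```python
# Rough sketch of the frame warping step (Equations 6-7): build the warp matrix W
# from the cumulative sum of s, then aggregate input frames into each output frame.
# The helper name and the weighted-average normalization are illustrative choices.
import numpy as np

def warp_frames(X, s, t):
    F_dim, T = X.shape
    cumsum = np.cumsum(s)                          # decides the target output frame per input frame
    W = np.zeros((t, T))
    for j in range(T):
        i = int(np.ceil(cumsum[j])) - 1            # output index i such that i < cumsum[j] <= i + 1
        i = min(max(i, 0), t - 1)
        W[i, j] = s[j]
    weights = np.maximum(W.sum(axis=1, keepdims=True), 1e-8)
    O = (W @ X.T) / weights                        # weighted average of input frames, shape (t, F)
    return O.T                                     # output feature O, shape (F, t)

X = np.random.randn(64, 100)                       # placeholder spectrogram, F = 64, T = 100
s = np.full(100, 0.25)                             # uniform scores -> roughly 4x frame merging
print(warp_frames(X, s, t=25).shape)               # (64, 25)
```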