Learning Temporal Resolution in Spectrogram for Audio Classification
Haohe Liu1, Xubo Liu1, Qiuqiang Kong2, Wenwu Wang1, Mark D. Plumbley1
1University of Surrey
2The Chinese University of Hong Kong
Abstract
The audio spectrogram is a time-frequency representation
that has been widely used for audio classification. One of
the key attributes of the audio spectrogram is the temporal
resolution, which depends on the hop size used in the Short-
Time Fourier Transform (STFT). Previous works generally
assume the hop size should be a constant value (e.g., 10 ms).
However, a fixed temporal resolution is not always optimal
for different types of sound. The temporal resolution affects
not only classification accuracy but also computational cost.
This paper proposes a novel method, DiffRes, that enables
differentiable temporal resolution modeling for audio classi-
fication. Given a spectrogram calculated with a fixed hop size,
DiffRes merges non-essential time frames while preserving
important frames. DiffRes acts as a “drop-in” module be-
tween an audio spectrogram and a classifier and can be jointly
optimized with the classification task. We evaluate DiffRes on
five audio classification tasks, using mel-spectrograms as the
acoustic features, followed by off-the-shelf classifier back-
bones. Compared with previous methods using the fixed tem-
poral resolution, the DiffRes-based method can achieve the
equivalent or better classification accuracy with at least 25%
computational cost reduction. We further show that DiffRes
can improve classification accuracy by increasing the tempo-
ral resolution of input acoustic features, without adding to the
computational cost.
1 Introduction
Audio classification refers to a series of tasks that assign
labels to an audio clip. Those tasks include audio tag-
ging (Kong et al. 2020), speech keyword classification (Kim
et al. 2021), and music genre classification (Castellon, Don-
ahue, and Liang 2021). The input to an audio classification
system is usually a one-dimensional audio waveform, which
can be represented by discrete samples. Although there are
methods using time-domain samples as features (Kong et al.
2020; Lee et al. 2017), the majority of studies on audio clas-
sification convert the waveform into a spectrogram as the in-
put feature (Gong, Chung, and Glass 2021b,a). A spectrogram is usually calculated with the Fourier transform (Champeney and Champeney 1987), which is applied to short waveform chunks multiplied by a windowing function, resulting in a two-dimensional time-frequency representation.
[Figure 1 panels: spectrograms of Insect, Siren, and Alarm Clock sounds at 10 ms and 40 ms hop sizes.]
Figure 1: Spectrograms of the Alarm Clock and Siren sounds with 40 ms and 10 ms hop sizes, all with a 25 ms window size. The pattern of Siren, which is relatively stable, does not change significantly with a smaller hop size (i.e., higher temporal resolution), while Alarm Clock is the opposite.
to the Gabor’s uncertainty principle (Gabor 1946), there is
always a trade-off between time and frequency resolutions.
To achieve the desired resolution on the temporal dimension,
it is a common practice (Kong et al. 2021a; Liu et al. 2022)
to apply a fixed hop size between windows to capture the
dynamics between adjacent frames. With the fixed hop size,
the spectrogram has a fixed temporal resolution, which we
will refer to simply as resolution in this work.
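As a concrete illustration of how the hop size sets this resolution, the short sketch below (a minimal example assuming torchaudio is available; the waveform and parameter values are placeholders, not values used in our experiments) computes mel-spectrograms with 10 ms and 40 ms hop sizes and a 25 ms window, mirroring the setup in Figure 1.

```python
# Minimal sketch: hop size determines the temporal resolution (frames per second)
# of a mel-spectrogram. Assumes torchaudio; the waveform and settings are placeholders.
import torch
import torchaudio

sample_rate = 16000
waveform = torch.randn(1, sample_rate * 10)          # 10 s placeholder signal

for hop_ms in (10, 40):
    hop_length = int(sample_rate * hop_ms / 1000)    # hop size in samples
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,
        n_fft=400,                                   # 25 ms window at 16 kHz
        win_length=400,
        hop_length=hop_length,
        n_mels=64,
    )(waveform)                                      # shape: (1, n_mels, T)
    print(f"{hop_ms} ms hop -> {mel.shape[-1]} frames ({1000 // hop_ms} frames per second)")
```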
Using a fixed resolution is not necessarily optimal for an
audio classification model. Intuitively, the resolution should
depend on the temporal pattern: fast-changing signals are
supposed to have high resolution, while relatively steady
signals or blank signals may not need the same high reso-
lution for the best accuracy (Huzaifah 2017). For example,
Figure 1 shows that by increasing resolution, more details
appear in the spectrogram of Alarm Clock while the pattern
of Siren stays mostly the same. This suggests that the finer details in the high-resolution Siren spectrogram may not contribute substantially to classification accuracy. There are plenty of studies on learning a suitable frequency resolution in a similar spirit (Stevens, Volkmann, and Newman 1937; Sainath et al. 2013; Ravanelli and Bengio 2018b; Zeghidour et al. 2021). In terms of temporal resolution, most previous studies focus on investigating the effect of different resolutions (Kekre et al. 2012; Huzaifah 2017; Ilyashenko et al. 2019; Liu et al. 2023).
Huzaifah (2017) observes that the optimal temporal resolution for audio classification is class-dependent. Ferraro et al. (2021) experiment on music tagging with coarse-resolution spectrograms and observe that similar performance can be maintained while being much faster to compute. Kazakos et al. (2021) propose a two-stream architecture that processes both fine-grained and coarse-resolution spectrograms,
achieving state-of-the-art results on VGG-Sound (Chen et al. 2020). Recently, Liu et al. (2023) proposed a non-parametric spectrogram-pooling-based module that can improve classification efficiency with negligible performance degradation. However, these approaches are generally built
on a fixed temporal resolution, which is not always optimal
for diverse sounds in the world. Intuitively, it is natural to
ask: can we dynamically learn the temporal resolution for
audio classification?
In this work, we demonstrate the first attempt to learn
temporal resolution in the spectrogram for audio classifi-
cation. We show that learning temporal resolution leads
to efficiency and accuracy improvements over the fixed-
resolution spectrogram. We propose a lightweight algo-
rithm, DiffRes, that makes spectrogram resolution differ-
entiable during model optimization. DiffRes can be used
as a “drop-in” module after spectrogram calculation and
optimized jointly with the downstream task. For the op-
timization of DiffRes, we propose a loss function, guide
loss, to inform the model of the low importance of empty
frames formed by SpecAug (Park et al. 2019). The output of
DiffRes is a time-frequency representation with varying res-
olution, which is achieved by adaptively merging the time
steps of a fixed-resolution spectrogram. The adaptive temporal resolution alleviates temporal redundancy in the spectrogram and can speed up computation during training and
inference. We perform experiments on five different audio
tasks, including the largest audio dataset AudioSet (Gem-
meke et al. 2017). DiffRes shows clear improvements on all
tasks over the fixed-resolution mel-spectrogram baseline and
other learnable front-ends (Zeghidour et al. 2021; Ravanelli
and Bengio 2018b; Zeghidour et al. 2018). Compared with methods using fixed-resolution spectrograms, we show that DiffRes-based models can achieve a computational cost reduction of at least 25% with equivalent or better audio classification accuracy.
Besides, the potential of high-resolution spectrograms, e.g., with a one-millisecond hop size, is still unclear. Popular choices of hop size include 10 ms (Böck et al. 2012; Kong et al. 2020; Gong, Chung, and Glass 2021a) and 12.5 ms (Rybakov et al. 2022). Previous studies (Kong et al. 2020; Ferraro et al. 2021) show that classification performance can be steadily improved by increasing the resolution. One remaining question is: can even finer resolution improve the performance? We conduct an ablation study on this question on a limited-vocabulary speech recognition task with hop sizes smaller than 10 ms. We notice that accuracy can still be improved with smaller hop sizes, at the cost of increased computational complexity. By introducing DiffRes with high-resolution spectrograms, we observe that the classifier performance gains are maintained while the computational cost is significantly reduced.
Our contributions are summarized as follows:
• We present DiffRes, a differentiable approach for learning temporal resolution in the audio spectrogram, which improves classification accuracy and reduces the computational cost for off-the-shelf audio classification models.
• We extensively evaluate the effectiveness of DiffRes on five audio classification tasks. We further show that DiffRes can improve classification accuracy by increasing the temporal resolution of input acoustic features, without adding to the computational cost.
• Our code is available at https://github.com/haoheliu/diffres-python.
2 Method
We provide an overview of DiffRes-based audio classification in Section 2.1. We introduce the detailed formulation and the optimization of DiffRes in Sections 2.2 and 2.3.
2.1 Overview
Let $x \in \mathbb{R}^L$ denote a one-dimensional audio time waveform, where $L$ is the number of audio samples. An audio classification system can be decomposed into a feature extraction stage and a classification stage. In the feature extraction stage, the audio waveform is processed by a function $Q_{l,h}: \mathbb{R}^L \rightarrow \mathbb{R}^{F \times T}$, which maps the time waveform into a two-dimensional time-frequency representation $X$, such as a mel-spectrogram, where $X_{:,\tau} = (X_{1,\tau}, \dots, X_{F,\tau})$ is the $\tau$-th frame. Here, $T$ and $F$ stand for the time and frequency dimensions of the extracted representation. We also refer to the representation along the temporal dimension as frames. We use $l$ and $h$ to denote window length and hop size, respectively. Usually $T \approx \frac{L}{h}$. We define the temporal resolution $\frac{1}{h}$ in frames per second (FPS), which denotes the number of frames in one second. In the classification stage, $X$ is processed by a classification model $G_\theta$ parameterized by $\theta$. The output of $G_\theta$ is the label prediction $\hat{y}$, in which $\hat{y}_i$ denotes the probability of class $i$. Given the paired training data $(x, y) \in D$, where $y$ denotes the one-hot vector of ground-truth labels, the optimization of the classification system can be formulated as

$$\arg\min_{\theta} \; \mathbb{E}_{(x,y)\in D} \, \mathcal{L}(G_\theta(X), y), \quad (1)$$

where $\mathcal{L}$ is a loss function such as cross entropy (De Boer et al. 2005). Figure 2 shows an overview of performing classification with DiffRes. DiffRes is a "drop-in" module between $X$ and $G_\theta$ that focuses on learning the optimal temporal resolution with a learnable function $F_\phi: \mathbb{R}^{F \times T} \rightarrow \mathbb{R}^{F \times t}$, where $t$ is the parameter denoting the target output time dimension of DiffRes, and $\phi$ denotes the learnable parameters. DiffRes formulates $F_\phi$ in two steps: i) estimating the importance of each time frame with a learnable model $H_\phi: X \mapsto s'$, where $s'$ is a row vector of shape $1 \times T$; and ii) warping frames based on a frame warping algorithm, where the warping is performed along a single direction on the temporal dimension. We introduce the details of these two steps in Section 2.2. We define the dimension reduction rate $\delta$ of DiffRes by $\delta = (T - t)/T$. Usually, $\delta \leq 1$ and $t \leq T$ because the temporal resolution of the DiffRes output is either coarser than or equal to that of $X$. Given the same $T$, a larger $\delta$ means fewer temporal dimensions $t$ in the output of DiffRes, and usually less computation is needed for $G_\theta$. Similar to Equation 1, $F_\phi$ can be jointly optimized with $G_\theta$ by

$$\arg\min_{\theta, \phi} \; \mathbb{E}_{(x,y)\in D} \, \mathcal{L}(G_\theta(F_\phi(X)), y). \quad (2)$$
[Figure 2 diagram: input audio waveform → STFT → MelFB+Log (mel-spectrogram) → frame importance estimation with a stack of ResConv1D blocks → rescale importance score → calculate warp matrix → merge frames while preserving important frames (DiffRes) → downstream task (EfficientNet classifier).]
Figure 2: Audio classification with DiffRes and mel-spectrogram. Green blocks contain learnable parameters. DiffRes is a
“drop-in” module between spectrogram calculation and the downstream task.
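To make the "drop-in" role and the joint optimization in Equation 2 concrete, the following is a minimal PyTorch sketch. The DiffResLike module, the tiny classifier, and the importance-weighted pooling used here are illustrative placeholders rather than the released diffres-python API; in particular, the pooling is a simplification of the frame warping algorithm described in Section 2.2.

```python
# Minimal sketch of joint optimization (Equation 2): a DiffRes-like front-end F_phi
# and a classifier G_theta trained together. All module names and the pooling-based
# "warping" below are simplified placeholders, not the released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffResLike(nn.Module):
    """Toy stand-in for F_phi: maps a fixed-resolution spectrogram (F x T) to (F x t)
    using learned frame-importance scores and importance-weighted pooling."""
    def __init__(self, n_freq: int, dim_reduction_rate: float = 0.5):
        super().__init__()
        self.delta = dim_reduction_rate
        # Simplified H_phi: a single Conv1d instead of the ResConv1D stack.
        self.importance = nn.Conv1d(n_freq, 1, kernel_size=3, padding=1)

    def forward(self, spec):                           # spec: (batch, F, T)
        T = spec.shape[-1]
        t = int(round(T * (1 - self.delta)))           # target number of output frames
        s = torch.sigmoid(self.importance(spec))       # (batch, 1, T) importance scores
        weighted = F.adaptive_avg_pool1d(spec * s, t)  # importance-weighted pooling to t frames
        norm = F.adaptive_avg_pool1d(s, t).clamp(min=1e-6)
        return weighted / norm                         # (batch, F, t)

n_freq, n_classes = 64, 10
front_end = DiffResLike(n_freq)                        # phi
classifier = nn.Sequential(                            # theta (placeholder G_theta)
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(n_freq, n_classes))
optimizer = torch.optim.Adam(
    list(front_end.parameters()) + list(classifier.parameters()), lr=1e-3)

spec = torch.randn(8, n_freq, 1000)                    # batch of fixed-resolution spectrograms
labels = torch.randint(0, n_classes, (8,))
optimizer.zero_grad()
loss = F.cross_entropy(classifier(front_end(spec)), labels)
loss.backward()                                        # gradients flow into both phi and theta
optimizer.step()
```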
2.2 Differentiable temporal resolution modeling
Frame importance estimation. We design a frame importance estimation module $H_\phi$ to decide the proportion of each frame that needs to be kept in the output, which is similar to the sample weighting operation (Zhang and Pfister 2021) in previous studies. The frame importance estimation module outputs a row vector $s'$ of shape $1 \times T$, where the element $s'_\tau$ is the importance score of the $\tau$-th time frame $X_{:,\tau}$. The frame importance estimation can be denoted as

$$s' = \sigma(H_\phi(X)), \quad (3)$$

where $s'$ is the row vector of importance scores and $\sigma$ is the sigmoid function. A higher value of $s'_\tau$ indicates that the $\tau$-th frame is important for classification. We apply the sigmoid function to stabilize training by limiting the values in $s'$ to between zero and one. We implement $H_\phi$ with a stack of one-dimensional convolutional neural networks (CNNs) (Fukushima and Miyake 1982; LeCun et al. 1989). Specifically, $H_\phi$ is a stack of five one-dimensional convolutional blocks (ResConv1D). We design the ResConv1D block following other CNN-based methods (Shu et al. 2021; Liu et al. 2020; Kong et al. 2021b). Each ResConv1D has two layers of one-dimensional CNN with batch normalization (Ioffe and Szegedy 2015) and leaky rectified linear unit activation functions. We apply residual connections (He et al. 2016) for easier training of the deep architecture (Zaeemzadeh, Rahnavard, and Shah 2020). Each CNN layer is zero-padded so that the temporal dimension does not change (LeCun, Bengio, and Hinton 2015). We use exponentially decreasing channel numbers to reduce the computation. In the next frame warping step (Section 2.2), elements of the importance score will represent the proportion of each input frame that contributes to an output frame. Therefore, we perform a rescale operation on $s'$, resulting in an $s$ that satisfies $s \in [0,1]^{1 \times T}$ and $\sum_{k=1}^{T} s_k \leq t$. The rescale operation can be denoted as $\check{s} = \frac{s'}{\sum_{i=1}^{T} s'_i} \cdot t$, $s = \frac{\check{s}}{\max(\check{s}, 1)}$, where $\check{s}$ is an intermediate variable that may contain elements greater than one, and $\max$ denotes the (element-wise) maximum operation.
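As a quick numeric check of this rescale step, the sketch below (plain NumPy; the score values, $T = 6$, and $t = 3$ are arbitrary illustrations) shows that after rescaling the scores lie in $[0, 1]$ and sum to at most $t$.

```python
# Numeric sketch of the rescale operation: s' -> s_check -> s.
# The raw scores, T = 6, and t = 3 below are arbitrary illustrative values.
import numpy as np

t = 3
s_prime = np.array([0.9, 0.8, 0.1, 0.1, 0.05, 0.05])  # sigma(H_phi(X)), shape (T,)

s_check = s_prime * t / s_prime.sum()                  # intermediate: sums to exactly t
s = s_check / np.maximum(s_check, 1.0)                 # entries above one are clipped to one

print(np.round(s, 3))                                  # [1. 1. 0.15 0.15 0.075 0.075]
print(float(s.sum()) <= t)                             # True: scores sum to at most t
```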
To quantify how actively $H_\phi$ tries to distinguish between important and less important frames, we also design a measurement, the activeness $\rho$, which is calculated as the standard deviation of the scores of the non-empty frames, given by

$$\rho = \frac{1}{1-\delta} \sqrt{\frac{\sum_{i \in S_{\text{active}}} (s_i - \bar{s})^2}{|S_{\text{active}}|}}, \quad (4)$$

$$S_{\text{active}} = \{\, i \mid E(X_{:,i}) > \min_j\big(E(X_{:,j})\big) + \epsilon \,\}, \quad (5)$$

where $S_{\text{active}}$ is the set of indices of non-empty frames, $\bar{s}$ is the mean score over $S_{\text{active}}$, $\epsilon$ is a small value, $|S|$ denotes the size of a set $S$, the function $E(\cdot)$ calculates the root-mean-square energy (Law and Rennie 2015) of a frame in the spectrogram, and $\min_j(\cdot)$ takes the minimum over all frames. We use $\delta$ to unify the value of $\rho$ for easier comparison between different $\delta$ settings. The activeness $\rho$ can be used as an indicator of how DiffRes behaves during training. A higher $\rho$ indicates the model is more active at learning the frame importance, while a lower $\rho$, such as zero, indicates it is learning nothing. We will discuss the learning process of DiffRes with $\rho$ in Section 3.3.
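A rough sketch of how the activeness $\rho$ in Equations 4 and 5 can be computed is given below (NumPy; the spectrogram, scores, $\delta$, and $\epsilon$ are placeholder values, not settings from our experiments).

```python
# Rough sketch of the activeness measurement rho (Equations 4-5).
# X, s, delta, and eps below are placeholders, not values from the paper.
import numpy as np

def activeness(X, s, delta, eps=1e-2):
    energy = np.sqrt((X ** 2).mean(axis=0))        # RMS energy E(.) of each frame, shape (T,)
    active = energy > energy.min() + eps           # S_active: indices of non-empty frames
    return s[active].std() / (1.0 - delta)         # std of scores on non-empty frames, scaled

X = np.abs(np.random.randn(64, 100))               # placeholder spectrogram, F = 64, T = 100
s = np.random.rand(100)                            # placeholder rescaled importance scores
print(activeness(X, s, delta=0.25))
```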
Temporal frame warping. We perform temporal frame warping based on $s$ and $X$ to calculate a representation $O$ with adaptive temporal resolution, which is similar to the idea of generating derived features (Pentreath 2015). Generally, the temporal frame warping algorithm can be denoted by $W = \alpha(s)$ and $O = \beta(X, W)$, where $\alpha(\cdot)$ is a function that converts $s$ into a warp matrix $W$ of shape $t \times T$, and $\beta(\cdot)$ is a function that applies $W$ to $X$ to calculate the warped feature $O$. An element $W_{i,j}$ of $W$ denotes the contribution of the $j$-th input frame $X_{:,j}$ to the $i$-th output frame $O_{:,i}$. We introduce the realizations of $\alpha(\cdot)$ and $\beta(\cdot)$ as follows.

The function $\alpha(\cdot)$ calculates the warp matrix $W$ from $s$ by

$$W_{i,j} = \begin{cases} s_j, & \text{if } i < \sum_{k=1}^{j} s_k \leq i + 1, \\ 0, & \text{otherwise}, \end{cases} \quad (6)$$

where we calculate the cumulative sum of $s$ to decide which output frame each input frame will be warped into. The warp matrix $W$ is then used by the frame warping function $\beta(\cdot)$.

The function $\beta(\cdot)$ performs frame warping based on the warp matrix $W$. The $i$-th output frame is calculated from $X$ and the $i$-th row of $W$, given by

$$O_{j,i} = \mathcal{A}\big(X_{j,:} \odot W_{i,:}\big), \quad (7)$$

where $\odot$ denotes element-wise multiplication, $\mathcal{A}: \mathbb{R}^{1 \times T} \rightarrow \mathbb{R}$ stands for a frame aggregation function such as averaging, and $O$ is the final output feature with shape $F \times t$.
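To illustrate the warping step, here is a rough NumPy sketch of $\alpha(\cdot)$ and $\beta(\cdot)$ under the cumulative-sum assignment above, with averaging as the aggregation function; the helper name and the weighted-average normalization are illustrative choices and may differ from the released implementation.

```python
# Rough sketch of the frame warping step (Equations 6-7): build the warp matrix W
# from the cumulative sum of s, then aggregate input frames into each output frame.
# The helper name and the weighted-average normalization are illustrative choices.
import numpy as np

def warp_frames(X, s, t):
    F_dim, T = X.shape
    cumsum = np.cumsum(s)                          # decides the target output frame per input frame
    W = np.zeros((t, T))
    for j in range(T):
        i = int(np.ceil(cumsum[j])) - 1            # output index i such that i < cumsum[j] <= i + 1
        i = min(max(i, 0), t - 1)
        W[i, j] = s[j]
    weights = np.maximum(W.sum(axis=1, keepdims=True), 1e-8)
    O = (W @ X.T) / weights                        # weighted average of input frames, shape (t, F)
    return O.T                                     # output feature O, shape (F, t)

X = np.random.randn(64, 100)                       # placeholder spectrogram, F = 64, T = 100
s = np.full(100, 0.25)                             # uniform scores -> roughly 4x frame merging
print(warp_frames(X, s, t=25).shape)               # (64, 25)
```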