
and shows the state-of-the-art result on VGG-Sound (Chen
et al. 2020). Recently, Liu et al. (2023) proposed a non-
parametric spectrogram-pooling-based module that can im-
prove classification efficiency with negligible performance
degradation. However, these approaches are generally built
on a fixed temporal resolution, which is not always optimal
for diverse sounds in the world. Intuitively, it is natural to
ask: can we dynamically learn the temporal resolution for
audio classification?
In this work, we make the first attempt to learn temporal resolution in the spectrogram for audio classification. We show that learning temporal resolution leads
to efficiency and accuracy improvements over the fixed-
resolution spectrogram. We propose a lightweight algo-
rithm, DiffRes, that makes spectrogram resolution differ-
entiable during model optimization. DiffRes can be used
as a “drop-in” module after spectrogram calculation and
optimized jointly with the downstream task. For the op-
timization of DiffRes, we propose a loss function, guide
loss, to inform the model of the low importance of empty
frames formed by SpecAug (Park et al. 2019). The output of
DiffRes is a time-frequency representation with varying res-
olution, which is achieved by adaptively merging the time
steps of a fixed-resolution spectrogram. The adaptive tem-
poral resolution alleviates the spectrogram temporal redun-
dancy and can speed up computation during training and
inference. We perform experiments on five different audio tasks, including AudioSet (Gemmeke et al. 2017), the largest audio dataset. DiffRes shows clear improvements on all tasks over the fixed-resolution mel-spectrogram baseline and other learnable front-ends (Zeghidour et al. 2021; Ravanelli and Bengio 2018b; Zeghidour et al. 2018). Compared with methods using a fixed-resolution spectrogram, we show that DiffRes-based models can achieve a computational cost reduction of at least 25% with equivalent or better audio classification accuracy.
In addition, the potential of the high-resolution spectrogram, e.g., with a one-millisecond hop size, is still unclear. Popular choices of hop size include 10 ms (Böck et al. 2012; Kong et al. 2020; Gong, Chung, and Glass 2021a) and 12.5 ms (Rybakov et al. 2022). Previous studies (Kong et al. 2020; Ferraro et al. 2021) show that classification performance can be steadily improved by increasing the resolution. One remaining question is: can an even finer resolution improve the performance? We conduct an ablation study on this question using a limited-vocabulary speech recognition task with hop sizes smaller than 10 ms. We find that accuracy can still be improved with a smaller hop size, at the cost of increased computational complexity. By introducing DiffRes with high-resolution spectrograms, we observe that the classifier performance gains are maintained while the computational cost is significantly reduced.
Our contributions are summarized as follows:
• We present DiffRes, a differentiable approach for learn-
ing temporal resolution in the audio spectrogram, which
improves classification accuracy and reduces the compu-
tational cost for off-the-shelf audio classification models.
• We extensively evaluate the effectiveness of DiffRes
on five audio classification tasks. We further show that
DiffRes can improve classification accuracy by increas-
ing the temporal resolution of input acoustic features,
without adding to the computational cost.
• Our code is available at https://github.com/haoheliu/
diffres-python.
2 Method
We provide an overview of DiffRes-based audio classification in Section 2.1. We introduce the detailed formulation and the optimization of DiffRes in Sections 2.2 and 2.3, respectively.
2.1 Overview
Let $x \in \mathbb{R}^{L}$ denote a one-dimensional audio time waveform, where $L$ is the number of audio samples. An audio classification system can be decomposed into a feature extraction stage and a classification stage. In the feature extraction stage, the audio waveform is processed by a function $Q_{l,h}: \mathbb{R}^{L} \rightarrow \mathbb{R}^{F \times T}$, which maps the time waveform into a two-dimensional time-frequency representation $X$, such as a mel-spectrogram, where $X_{:,\tau} = (X_{1,\tau}, \ldots, X_{F,\tau})$ is the $\tau$-th frame. Here, $T$ and $F$ stand for the time and frequency dimensions of the extracted representation. We also refer to the representation along the temporal dimension as frames. We use $l$ and $h$ to denote window length and hop size, respectively. Usually $T \propto \frac{L}{h}$. We define the temporal resolution $\frac{1}{h}$ as frames per second (FPS), which denotes the number of frames in one second. In the classification stage, $X$ is processed by a classification model $G_{\theta}$ parameterized by $\theta$. The output of $G_{\theta}$ is the label prediction $\hat{y}$, in which $\hat{y}_{i}$ denotes the probability of class $i$. Given the paired training data $(x, y) \in \mathcal{D}$, where $y$ denotes the one-hot vector of ground-truth labels, the optimization of the classification system can be formulated as
$$\arg\min_{\theta} \; \mathbb{E}_{(x,y)\sim \mathcal{D}}\, \mathcal{L}(G_{\theta}(X), y), \qquad (1)$$
where $\mathcal{L}$ is a loss function such as cross entropy (De Boer et al. 2005).
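To make the notation concrete, the snippet below sketches the feature extraction stage $Q_{l,h}$ with a standard mel-spectrogram and shows how the hop size $h$ determines the temporal resolution (FPS). This is an illustrative sketch rather than the implementation used in the paper; the file name, window length, hop size, FFT size, and number of mel bins are placeholder assumptions.

```python
import librosa

sr = 16000                                  # sampling rate (Hz), placeholder
x, _ = librosa.load("audio.wav", sr=sr)     # x in R^L, L audio samples

# Q_{l,h}: waveform -> F x T mel-spectrogram.
# Placeholder settings: 25 ms window (l), 10 ms hop (h), F = 64 mel bins.
win_length = int(0.025 * sr)                # l, in samples
hop_length = int(0.010 * sr)                # h, in samples
X = librosa.feature.melspectrogram(
    y=x, sr=sr, n_fft=1024,
    win_length=win_length, hop_length=hop_length, n_mels=64,
)
X = librosa.power_to_db(X)                  # log-mel features, shape (F, T)

F_dim, T_dim = X.shape
fps = sr / hop_length                       # temporal resolution: 1/h = 100 FPS here
print(F_dim, T_dim, fps)
```

With a 10 ms hop size, a 10-second clip yields roughly $T \approx 1000$ frames; this fixed-resolution representation is the input that DiffRes subsequently compresses to $t$ time steps.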
Figure 2 shows an overview of performing classification with DiffRes. DiffRes is a "drop-in" module between $X$ and $G_{\theta}$ that focuses on learning the optimal temporal resolution with a learnable function $F_{\phi}: \mathbb{R}^{F \times T} \rightarrow \mathbb{R}^{F \times t}$, where $t$ is the parameter denoting the target output time dimension of DiffRes, and $\phi$ denotes the learnable parameters. DiffRes formulates $F_{\phi}$ in two steps: i) estimating the importance of each time frame with a learnable model $H_{\phi}: X \rightarrow s$, where $s$ is a $1 \times T$ row vector; and ii) warping frames with a frame warping algorithm, where the warping is performed along a single direction on the temporal dimension. We introduce the details of these two steps in Section 2.2. We define the dimension reduction rate $\delta$ of DiffRes by $\delta = (T - t)/T$. Usually, $\delta \leq 1$ and $t \leq T$, because the temporal resolution of the DiffRes output is either coarser than or equal to that of $X$. Given the same $T$, a larger $\delta$ means fewer temporal dimensions $t$ in the output of DiffRes, and usually less computation is needed for $G_{\theta}$. Similar to Equation 1, $F_{\phi}$ can be jointly optimized with $G_{\theta}$ by
$$\arg\min_{\theta, \phi} \; \mathbb{E}_{(x,y)\sim \mathcal{D}}\, \mathcal{L}(G_{\theta}(F_{\phi}(X)), y). \qquad (2)$$
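Since the frame warping algorithm is only introduced in Section 2.2, the sketch below should be read as an interface-level illustration of $F_{\phi}$, not the actual DiffRes algorithm: a small convolutional network stands in for $H_{\phi}$ and produces the $1 \times T$ importance vector $s$, and adjacent frames are merged by a simplified importance-weighted averaging so that $T$ time steps are reduced to $t$ (with the placeholder reduction factor of 4, $\delta = (T - t)/T \approx 3/4$). All module names and hyperparameters here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DiffResSketch(nn.Module):
    """Illustrative stand-in for F_phi (NOT the paper's warping algorithm).

    i)  H_phi: a small conv net scoring the importance s (1 x T) of each frame.
    ii) A simplified 'warp': importance-weighted averaging of every k adjacent
        frames, reducing T time steps to t = T // k, i.e. delta = (T - t) / T.
    The real DiffRes instead warps frames adaptively along the temporal axis.
    """
    def __init__(self, n_mels: int = 64, reduction: int = 4):
        super().__init__()
        self.k = reduction
        self.h_phi = nn.Sequential(           # H_phi: X -> s
            nn.Conv1d(n_mels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(32, 1, kernel_size=1),
        )

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # X: (batch, F, T) fixed-resolution spectrogram
        B, F, T = X.shape
        s = torch.sigmoid(self.h_phi(X))       # (B, 1, T) per-frame importance
        t = T // self.k                         # target time dimension after reduction
        Xg = X[..., : t * self.k].reshape(B, F, t, self.k)
        sg = s[..., : t * self.k].reshape(B, 1, t, self.k)
        w = sg / (sg.sum(dim=-1, keepdim=True) + 1e-8)  # normalize within each group
        return (Xg * w).sum(dim=-1)             # (B, F, t): importance-weighted merge

# Joint optimization of phi and theta (Equation 2), sketched:
# logits = G_theta(DiffResSketch()(X)); loss = cross_entropy(logits, y); loss.backward()
```

Because every operation above is differentiable, the importance estimator receives gradients through the classification loss, which is exactly the joint optimization expressed by Equation 2.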