
SUPERVISED AND UNSUPERVISED LEARNING OF AUDIO
REPRESENTATIONS FOR MUSIC UNDERSTANDING
Matthew C. McCallum, Filip Korzeniowski, Sergio Oramas, Fabien Gouyon, Andreas F. Ehmann
SiriusXM, USA
ABSTRACT
In this work, we provide a broad comparative analysis of strategies for pre-training audio understanding models for several tasks in the music domain, including labelling of genre, era, origin, mood, instrumentation, key, pitch, vocal characteristics, tempo and sonority. Specifically, we explore how the domain of pre-training datasets (music or generic audio) and the pre-training methodology (supervised or unsupervised) affect the adequacy of the resulting audio embeddings for downstream tasks.
We show that models trained via supervised learning on large-scale expert-annotated music datasets achieve state-of-the-art performance in a wide range of music labelling tasks, each with novel content and vocabularies. This can be done efficiently with models containing fewer than 100 million parameters that require no fine-tuning or reparameterization for downstream tasks, making this approach practical for industry-scale audio catalogs.
Within the class of unsupervised learning strategies, we show that the domain of the training dataset can significantly impact the performance of the representations learned by the model. We find that restricting the domain of the pre-training dataset to music allows for training with smaller batch sizes while achieving state-of-the-art performance in unsupervised learning, and in some cases supervised learning, for music understanding.
We also corroborate that, while supervised learning achieves state-of-the-art performance on many tasks, it can cause models to specialize to the supervised information provided, somewhat compromising their generality.
1. INTRODUCTION
In this work, we consider a broad array of classification and
labelling tasks under the umbrella of music understanding.
Such tasks include the labelling of genre, origin, mood,
musical key, instruments, era, emotion and pitch present
in music. These tasks have many applications in industry,
particularly in music streaming and recommendation services where automated understanding of audio can assist in a range of tasks such as organizing, filtering and personalizing content to a listener's taste and context.
Recent research in automated audio understanding has focused on training convolutional [1–12] and/or attention-based [13–21] networks on moderately large collections of frequency-domain audio [2, 5–8, 12–14, 18, 20, 21], time-domain audio [3, 10, 11, 19] or multi-format / multi-modal [4, 9, 21, 22] data. Such models are often trained on tags encompassing some of the musical labels listed above, achieving promising results [1, 3, 6, 11, 13–15, 22]. More recent works propose unsupervised strategies for music understanding such as contrastive learning [2, 4, 5, 8–10, 21] or predictive / generative approaches [7, 20, 21, 23]. Unsupervised strategies are appealing because they require no annotated data and generalize well to new tasks [2, 21], but lag behind the performance of supervised learning at a similar scale [2, 10]. Generative learning strategies [7, 23] have been shown to achieve competitive, and sometimes state-of-the-art (SOTA), performance in several music understanding tasks [24], although there is currently no evaluation comparing the effectiveness of this approach against any of the aforementioned approaches at comparable scale.
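To make the contrastive strategy concrete, below is a minimal sketch of an InfoNCE-style loss over paired audio embeddings, in the spirit of the contrastive methods cited above. The function name, tensor shapes and temperature value are our own illustrative assumptions, not details taken from the cited works.

import torch
import torch.nn.functional as F

def info_nce_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                  temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style contrastive loss for paired embeddings.

    z_a, z_b: (batch, dim) embeddings of two "views" of the same clip,
    e.g. two segments of one track. Matching rows are positive pairs;
    every other row in the batch serves as a negative.
    """
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = (z_a @ z_b.t()) / temperature  # (batch, batch) cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Symmetrized cross-entropy: each view must identify its partner.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

Note that the pool of negatives grows with the batch size, which is one reason contrastive methods are often sensitive to it; the batch-size effects noted in the abstract relate to this dependence.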
Modern music streaming services have very large music catalogs that amount to many petabytes of audio data if uncompressed. Due to the scale of this data, it is desirable to build models that are efficiently scalable and that understand audio in a general enough way that, as needs or requirements change, they may be used to solve novel problems without reprocessing such data. Models on the order of 10M to 100M parameters are currently relatively cost-effective to both train and run inference with over industry-scale catalogs, whilst models consisting of billions of parameters, e.g. the model evaluated in [24], are typically impractical, or very expensive, for both training and inference.
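As a rough, illustrative estimate of that scale (the catalog size and track length here are our assumptions, not figures from this work): a catalog of 100 million tracks averaging four minutes of uncompressed 16-bit, 44.1 kHz stereo audio occupies roughly

10^8 tracks × 240 s × 44,100 samples/s × 2 channels × 2 bytes/sample ≈ 4.2 PB,

consistent with the "many petabytes" figure above.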
More recently, research has adopted approaches producing generalized audio embeddings [2, 4–10, 12, 18, 19, 25] in a supervised or unsupervised way, by training models on large amounts of labelled or unlabelled audio. When such models are applied to novel audio, the internal state of the models has been found to contain much of the information necessary for previously unseen tasks. This is demonstrated by training shallow classifiers (probes) that map embeddings, i.e., the activations of a given model layer, to a downstream task (a minimal sketch follows below). Such an approach achieves competitive results using either unsuper-
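A minimal sketch of such a probe, assuming frozen, precomputed embeddings and using scikit-learn (our choice of tooling; the data shapes and probe configuration are illustrative, not taken from this work):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical precomputed data: embeddings are the frozen activations of
# one layer of a pre-trained audio model; labels are downstream tags
# (e.g. genre). Shapes: (n_clips, embed_dim) and (n_clips,).
rng = np.random.default_rng(0)
train_emb, test_emb = rng.normal(size=(800, 512)), rng.normal(size=(200, 512))
train_y, test_y = rng.integers(0, 10, 800), rng.integers(0, 10, 200)

# The "shallow probe": a single linear classifier trained on top of the
# frozen embeddings. The pre-trained model itself is never updated.
probe = LogisticRegression(max_iter=1000)
probe.fit(train_emb, train_y)
print("probe accuracy:", accuracy_score(test_y, probe.predict(test_emb)))

Because the pre-trained backbone is never updated, embeddings can be computed once per track and reused across many downstream tasks, which is precisely the property that makes this approach attractive for industry-scale catalogs.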