SUPERVISED AND UNSUPERVISED LEARNING OF AUDIO
REPRESENTATIONS FOR MUSIC UNDERSTANDING
Matthew C. McCallum Filip Korzeniowski Sergio Oramas
Fabien Gouyon Andreas F. Ehmann
SiriusXM, USA
ABSTRACT
In this work, we provide a broad comparative analysis of
strategies for pre-training audio understanding models for
several tasks in the music domain, including labelling of
genre, era, origin, mood, instrumentation, key, pitch, vo-
cal characteristics, tempo and sonority. Specifically, we
explore how the domain of pre-training datasets (music or
generic audio) and the pre-training methodology (supervised or unsupervised) affect the adequacy of the resulting audio embeddings for downstream tasks.
We show that models trained via supervised learning on
large-scale expert-annotated music datasets achieve state-
of-the-art performance in a wide range of music labelling
tasks, each with novel content and vocabularies. This can
be done in an efficient manner with models containing less
than 100 million parameters that require no fine-tuning or
reparameterization for downstream tasks, making this ap-
proach practical for industry-scale audio catalogs.
Within the class of unsupervised learning strategies,
we show that the domain of the training dataset can
significantly impact the performance of representations
learned by the model. We find that restricting the domain of the pre-training dataset to music allows for training with smaller batch sizes while achieving state-of-the-art results in unsupervised learning, and in some cases supervised learning, for music understanding.
We also corroborate that, while achieving state-of-the-
art performance on many tasks, supervised learning can
cause models to specialize to the supervised information
provided, somewhat compromising a model’s generality.
1. INTRODUCTION
In this work, we consider a broad array of classification and
labelling tasks under the umbrella of music understanding.
Such tasks include the labelling of genre, origin, mood,
musical key, instruments, era, emotion and pitch present
in music. These tasks have many applications in industry, particularly in music streaming and recommendation services, where automated understanding of audio can assist in a range of tasks such as organizing, filtering and personalizing content to a listener's taste and context.

© Matthew C. McCallum, Filip Korzeniowski, Sergio Oramas, Fabien Gouyon, Andreas F. Ehmann. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Matthew C. McCallum, Filip Korzeniowski, Sergio Oramas, Fabien Gouyon, Andreas F. Ehmann, "Supervised and Unsupervised Learning of Audio Representations for Music Understanding", in Proc. of the 23rd Int. Society for Music Information Retrieval Conf., Bengaluru, India, 2022.
Recent research in automated audio understanding has focused on training convolutional [1–12] and/or attention-based [13–21] networks on moderately large collections of frequency-domain audio [2, 5–8, 12–14, 18, 20, 21], time-domain audio [3, 10, 11, 19] or multi-format / multi-modal [4, 9, 21, 22] data. Such models are often trained on tags encompassing some of the musical labels listed above, achieving promising results [1, 3, 6, 11, 13–15, 22]. More re-
cent works propose unsupervised strategies for music un-
derstanding such as contrastive learning [2, 4, 5, 8–10, 21]
or predictive / generative approaches [7, 20, 21, 23]. Unsu-
pervised strategies are appealing because they require no
annotated data and generalize well to new tasks [2, 21],
but lag the performance of supervised learning at a simi-
lar scale [2, 10]. Generative learning strategies [7, 23] have been shown to achieve competitive, and sometimes state-of-the-art (SOTA), performance in several music understanding tasks [24], although there is currently no evaluation comparing the effectiveness of this approach against the aforementioned approaches at comparable scale.
Modern music streaming services have very large mu-
sic catalogs that amount to many petabytes of audio data
if uncompressed. Due to the scale of this data, it is de-
sirable to build models that are efficiently scalable, and
understand audio in a general enough way that, as needs
or requirements change, they may be used to solve novel
problems without reprocessing such data. Models on the order of 10M or 100M parameters are currently relatively cost-effective both to train and to run inference with over industry-scale catalogs, whilst models consisting of billions of parameters, e.g. the model evaluated in [24], are typically impractical, or very expensive, for both training and inference.
More recently, research has adopted approaches pro-
ducing generalized audio embeddings [2, 4–10, 12, 18, 19,
25] in a supervised or unsupervised way, by training mod-
els on large amounts of labelled or unlabelled audio. When
such models are applied to novel audio, the internal state of
the models has been found to contain much of the informa-
tion necessary for previously unseen tasks. This is demon-
strated by training shallow classifiers (probes) that map embeddings, consisting of the activations of a given model layer, to a downstream task. Such an ap-
proach achieves competitive results using either unsuper-
vised [25] or supervised [6] learning. Most importantly,
the embeddings on which the probes are trained are many
orders of magnitude smaller than the audio itself and only
need to be computed once per audio file. Such embeddings
can be stored efficiently, and downstream classifiers can be
trained with significantly fewer resources. The excellent per-
formance, generality, and scalability of this approach are
crucial factors for its utility in industry.
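To make the probing setup concrete, the following is a minimal sketch, assuming precomputed embeddings stored as a NumPy array and synthetic labels; the data, probe architecture, and hyperparameters here are illustrative, not the exact configuration used in this work.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical precomputed embeddings: one 1728-d vector per track,
# extracted once from a frozen pre-trained model.
rng = np.random.default_rng(0)
Z = rng.normal(size=(1000, 1728))   # frozen model activations
y = rng.integers(0, 10, size=1000)  # e.g. 10 genre classes

Z_tr, Z_te, y_tr, y_te = train_test_split(Z, y, test_size=0.2, random_state=0)

# A shallow probe: logistic regression mapping embeddings to labels.
probe = LogisticRegression(max_iter=1000)
probe.fit(Z_tr, y_tr)
print("probe accuracy:", probe.score(Z_te, y_te))
```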
This approach to audio understanding has been high-
lighted in recent benchmarks such as HARES [2] and
HEAR [26], where embeddings are evaluated across a
number of audio understanding tasks pertaining to a range
of content types. Any score aggregation across these benchmarks to determine a "best" embedding is difficult due to the disparate range of metrics employed and, furthermore, may obfuscate the strengths and weaknesses of any given approach. However, evaluating across a common range of tasks can be useful in comparing such strengths and weaknesses. We find that the tasks currently evaluated in the HEAR and HARES benchmarks provide limited coverage of music content. With respect to polyphonic music, the HARES benchmark includes only the MagnaTagATune dataset, and the HEAR benchmark includes only the GTZAN genre and music / speech datasets. While other public music datasets
exist, such benchmarks are somewhat limited by the re-
quirement to provide access to the audio of all datasets.
Here, we do not intend to establish a new open bench-
mark, but investigate the effectiveness of supervised and
unsupervised learning for audio embeddings employed
specifically for music understanding, across as broad an
array of tasks as is available within time and resource
constraints. For supervised learning, we train models on large-scale datasets of annotated magnitude log-mel spectrograms, both in the music domain and in the general audio domain. For unsupervised learning, we train contrastive models using the SimCLR loss [27, 28] on the same sets of magnitude log-mel spectrograms, excluding annotations.
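As an illustration of this input representation, a magnitude log-mel spectrogram can be computed as sketched below; the sample rate, FFT size, hop length, and number of mel bands are assumptions for illustration, since the exact analysis parameters are not given in this excerpt.

```python
import librosa
import numpy as np

# Illustrative parameters; the paper's exact settings may differ.
audio, sr = librosa.load("track.wav", sr=16000, mono=True)
mel = librosa.feature.melspectrogram(
    y=audio, sr=sr, n_fft=1024, hop_length=512, n_mels=96, power=1.0
)  # power=1.0 yields a magnitude (not power) mel spectrogram
log_mel = np.log(mel + 1e-6)  # log compression; shape (n_mels, n_frames)
```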
The contributions of this work are as follows: we
provide a broad analysis of supervised and unsupervised
learning strategies for pre-training audio models for mu-
sic understanding; we show that for multilabel / multiclass
classification of music, large-scale supervised learning on
music data achieves SOTA performance, in many cases
outperforming both prior SOTA and unsupervised learn-
ing by significant margins; we show that supervised learn-
ing on labelled music data does not generalize as well as
unsupervised learning to novel tasks not covered in those
labels; finally, we show that the domain of pre-training au-
dio datasets has a significant impact on the performance of
embeddings, particularly for unsupervised learning.
2. PRE-TRAINING METHODOLOGY
To achieve the objectives outlined in Section 1, we fol-
low a familiar transfer learning paradigm (Figure 1) where
models are pre-trained using supervised or unsupervised
learning. Thereafter, the frozen activations from a layer of
that model, forming embeddings, z, are mapped to a down-
stream task using a simple network p(z).
Figure 1. System diagram of both pre-training approaches employed in this paper, and evaluation. (Diagram blocks: CNN, PROJECT, PROBE, FEATURE SAMPLING; panels: supervised pre-training, unsupervised pre-training, evaluation.)
2.1 Supervised and Unsupervised Training
In the supervised setting we learn a function $f(\mathbf{X}) \rightarrow \hat{\mathbf{y}}$ mapping features $\mathbf{X}$ (log-mel spectrograms) to binary labels, $\mathbf{y}$, by applying Adam optimization [29] to the binary cross-entropy loss function,

$$\mathcal{L}_s(\mathbf{y}, \hat{\mathbf{y}}) = -\frac{1}{NK} \sum_{i=0}^{N-1} \mathbf{y}_i \log(\hat{\mathbf{y}}_i) + (1 - \mathbf{y}_i) \log(1 - \hat{\mathbf{y}}_i),$$

where batch size $N = 512$ and $K$ is the number of labels.
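In a framework such as PyTorch, this loss corresponds to mean-reduced binary cross-entropy over all $N \times K$ entries; a minimal sketch follows, where the number of labels $K$ is an arbitrary placeholder, not the value used in this work.

```python
import torch
import torch.nn.functional as F

N, K = 512, 500                          # batch size; K is illustrative
logits = torch.randn(N, K)               # raw model outputs f(X)
y = torch.randint(0, 2, (N, K)).float()  # binary multi-label targets

# Mean reduction over all N*K entries matches the 1/(NK) factor above.
loss = F.binary_cross_entropy_with_logits(logits, y, reduction="mean")
```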
In the unsupervised setting, we employ the SimCLR objective [27, 28], which has been shown to provide promising results for both music and audio understanding [2, 10]. The SimCLR objective employs correlated (positive) pairs of samples by mapping each feature to an embedding space, $f(\mathbf{X}) \rightarrow \mathbf{z} \in \mathbb{R}^m$, with embedding dimensionality $m = 1728$. A projector then maps the embedding space to a loss space, $h(\mathbf{z}) \rightarrow \mathbf{v} \in \mathbb{R}^n$, with dimensionality $n = 1024$. Here, each element is then compared to all other elements in a batch via the distance function $d(\mathbf{v}_i, \mathbf{v}_j) = \mathbf{v}_i \cdot \mathbf{v}_j / (\|\mathbf{v}_i\| \|\mathbf{v}_j\|)$. The loss is then computed as the normalized temperature-scaled cross entropy,

$$\mathcal{L}_u(\mathbf{v}_i, \mathbf{v}_j) = -\log \frac{\exp(d(\mathbf{v}_i, \mathbf{v}_j))}{\sum_{k=0}^{2N-1} \mathbb{1}_{[k \neq i]} \exp(d(\mathbf{v}_i, \mathbf{v}_k))},$$

which is summed across the $2N$ examples in $N = 1920$ positive pairs, where $i \neq j$, for all values of both $i \in [0, N-1]$ …
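A minimal sketch of this objective is given below. The temperature parameter tau defaults to 1.0 to match the equation above; SimCLR implementations typically use a smaller temperature, but the value used in this work is not stated in this excerpt, and the sketch averages rather than sums over the batch.

```python
import torch
import torch.nn.functional as F

def nt_xent(v1, v2, tau=1.0):
    """Normalized temperature-scaled cross-entropy over N positive pairs.

    v1, v2: (N, n) projections of two correlated views of the same batch.
    """
    v = F.normalize(torch.cat([v1, v2], dim=0), dim=1)  # 2N unit vectors
    sim = (v @ v.t()) / tau            # cosine similarities d(v_i, v_k)
    sim.fill_diagonal_(float("-inf"))  # implements the k != i indicator
    n = v1.shape[0]
    # Each sample's positive partner sits n positions away in the batch.
    pos = (torch.arange(2 * n) + n) % (2 * n)
    return F.cross_entropy(sim, pos)   # mean of -log softmax at positives

# Dimensions from the text: N = 1920 positive pairs, loss space n = 1024.
loss = nt_xent(torch.randn(1920, 1024), torch.randn(1920, 1024))
```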