
SUPERVISED AND UNSUPERVISED LEARNING OF AUDIO
REPRESENTATIONS FOR MUSIC UNDERSTANDING
Matthew C. McCallum, Filip Korzeniowski, Sergio Oramas, Fabien Gouyon, Andreas F. Ehmann
SiriusXM, USA
ABSTRACT
In this work, we provide a broad comparative analysis of strategies for pre-training audio understanding models for several tasks in the music domain, including labelling of genre, era, origin, mood, instrumentation, key, pitch, vocal characteristics, tempo and sonority. Specifically, we explore how the domain of pre-training datasets (music or generic audio) and the pre-training methodology (supervised or unsupervised) affect the adequacy of the resulting audio embeddings for downstream tasks.
We show that models trained via supervised learning on large-scale expert-annotated music datasets achieve state-of-the-art performance in a wide range of music labelling tasks, each with novel content and vocabularies. This can be done efficiently with models containing fewer than 100 million parameters that require no fine-tuning or reparameterization for downstream tasks, making this approach practical for industry-scale audio catalogs.
Within the class of unsupervised learning strategies, we show that the domain of the training dataset can significantly impact the performance of the representations learned by the model. We find that restricting the domain of the pre-training dataset to music allows for training with smaller batch sizes while achieving state-of-the-art performance in unsupervised learning, and in some cases supervised learning, for music understanding.
We also corroborate that, while supervised learning achieves state-of-the-art performance on many tasks, it can cause models to specialize to the supervised information provided, somewhat compromising their generality.
1. INTRODUCTION
In this work, we consider a broad array of classification and
labelling tasks under the umbrella of music understanding.
Such tasks include the labelling of genre, origin, mood,
musical key, instruments, era, emotion and pitch present
in music. These tasks have many applications in industry,
particularly in music streaming and recommendation services where automated understanding of audio can assist in a range of tasks such as organizing, filtering and personalizing content to a listener's taste and context.
Recent research in automated audio understanding has focused on training convolutional [1–12] and/or attention-based [13–21] networks on moderately large collections of frequency-domain audio [2, 5–8, 12–14, 18, 20, 21], time-domain audio [3, 10, 11, 19] or multi-format / multi-modal [4, 9, 21, 22] data. Such models are often trained on tags encompassing some of the musical labels listed above, achieving promising results [1, 3, 6, 11, 13–15, 22]. More recent works propose unsupervised strategies for music understanding such as contrastive learning [2, 4, 5, 8–10, 21] or predictive / generative approaches [7, 20, 21, 23]. Unsupervised strategies are appealing because they require no annotated data and generalize well to new tasks [2, 21], but lag behind the performance of supervised learning at a similar scale [2, 10]. Generative learning strategies [7, 23] have been shown to achieve competitive, and sometimes state-of-the-art (SOTA), performance in several music understanding tasks [24], although there is currently no evaluation comparing the effectiveness of this approach against any of the aforementioned approaches at comparable scale.
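To make the contrastive strategy concrete, below is a minimal sketch of an InfoNCE-style loss over paired audio embeddings, in the spirit of the contrastive methods cited above. The function name, tensor shapes and temperature value are our own illustrative assumptions, not details taken from the cited works.

import torch
import torch.nn.functional as F

def info_nce_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                  temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style contrastive loss for paired embeddings.

    z_a, z_b: (batch, dim) embeddings of two "views" of the same clip,
    e.g. two segments of one track. Matching rows are positive pairs;
    every other row in the batch serves as a negative.
    """
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = (z_a @ z_b.t()) / temperature  # (batch, batch) cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Symmetrized cross-entropy: each view must identify its partner.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

Note that the pool of negatives grows with the batch size, which is one reason contrastive methods are often sensitive to it; the batch-size effects noted in the abstract relate to this dependence.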
Modern music streaming services have very large music catalogs that amount to many petabytes of audio data if uncompressed. Due to the scale of this data, it is desirable to build models that are efficiently scalable and that understand audio in a general enough way that, as needs or requirements change, they may be used to solve novel problems without reprocessing such data. Models on the order of 10M to 100M parameters are currently relatively cost-effective to both train and run inference with over industry-scale catalogs, whilst models consisting of billions of parameters, e.g. the model evaluated in [24], are typically impractical, or very expensive, for both training and inference.
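As a rough, illustrative estimate of that scale (the catalog size and track length here are our assumptions, not figures from this work): a catalog of 100 million tracks averaging four minutes of uncompressed 16-bit, 44.1 kHz stereo audio occupies roughly

10^8 tracks × 240 s × 44,100 samples/s × 2 channels × 2 bytes/sample ≈ 4.2 PB,

consistent with the "many petabytes" figure above.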
More recently, research has adopted approaches producing generalized audio embeddings [2, 4–10, 12, 18, 19, 25] in a supervised or unsupervised way, by training models on large amounts of labelled or unlabelled audio. When such models are applied to novel audio, the internal state of the models has been found to contain much of the information necessary for previously unseen tasks. This is demonstrated by training shallow classifiers (probes) that map embeddings, i.e., the activations of a given model layer, to a downstream task (a minimal sketch follows below). Such an approach achieves competitive results using either unsuper-
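A minimal sketch of such a probe, assuming frozen, precomputed embeddings and using scikit-learn (our choice of tooling; the data shapes and probe configuration are illustrative, not taken from this work):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical precomputed data: embeddings are the frozen activations of
# one layer of a pre-trained audio model; labels are downstream tags
# (e.g. genre). Shapes: (n_clips, embed_dim) and (n_clips,).
rng = np.random.default_rng(0)
train_emb, test_emb = rng.normal(size=(800, 512)), rng.normal(size=(200, 512))
train_y, test_y = rng.integers(0, 10, 800), rng.integers(0, 10, 200)

# The "shallow probe": a single linear classifier trained on top of the
# frozen embeddings. The pre-trained model itself is never updated.
probe = LogisticRegression(max_iter=1000)
probe.fit(train_emb, train_y)
print("probe accuracy:", accuracy_score(test_y, probe.predict(test_emb)))

Because the pre-trained backbone is never updated, embeddings can be computed once per track and reused across many downstream tasks, which is precisely the property that makes this approach attractive for industry-scale catalogs.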