LEARNING MUSIC REPRESENTATIONS WITH WAV2VEC 2.0
Alessandro Ragano1, Emmanouil Benetos2, Andrew Hines1
1School of CS, University College Dublin, Ireland
2School of EECS, Queen Mary University of London, UK
ABSTRACT
Learning general-purpose music representations offers the flexibility to finetune models for several downstream tasks using smaller datasets. The wav2vec 2.0 speech representation model showed promising results in many downstream speech tasks, but has been less effective when adapted to music. In this paper, we evaluate whether pre-training wav2vec 2.0 directly on music data can be a better solution than finetuning the speech model. We illustrate that when pre-training on music data, the discrete latent representations are able to encode the semantic meaning of musical concepts such as pitch and instrument. Our results show that finetuning wav2vec 2.0 pre-trained on music data allows us to achieve promising results on music classification tasks that are competitive with prior work on audio representations. In addition, the results are superior to those obtained from the model pre-trained on speech, demonstrating that wav2vec 2.0 pre-trained on music data can be a promising music representation model.
Index Terms— music representations, self-supervision, pre-training
1. INTRODUCTION
Learning feature representations with deep architectures has shown
remarkable success over hand-crafted features in Music Information
Retrieval (MIR) [1]. Approaches such as transfer learning from music auto-tagging [2, 3, 4] allow one to pre-train neural networks on large datasets and extract features for downstream MIR tasks such as instrument classification or genre recognition. In this way, downstream MIR tasks can be solved using smaller annotated datasets, which is desirable since labeling is costly and difficult to achieve. One issue with auto-tagging models is that they require very large annotated datasets, which are still difficult to obtain. To overcome this need, new music representation techniques have emerged that do not directly use waveform-related labels, for example, pre-training from language models [5] or using noisy language descriptors of the musical content [6].
A different approach, based on using proxy tasks to learn representations, is self-supervised learning (SSL), where information from the input data itself is extracted to provide labels. This is advantageous since labels can be generated automatically without requiring human intervention. Some SSL models have been proposed for music representations, showing competitive performance in several downstream MIR tasks [7, 8, 9, 10].

(This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Numbers 17/RC-PhD/3483 and 17/RC/2289 P2 and was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1. EB is supported by a Turing Fellowship.)

Beyond music representation learning, SSL models have grown in popularity for speech representations and downstream tasks such as speaker identification, automatic speech recognition, phoneme recognition, and speech translation [11]. Examples of speech SSL models include wav2vec 2.0 [12], a contrastive learning-based approach in which the model learns to distinguish a target sample (positive) from distractors (negatives). The original model was pre-trained on the LibriSpeech dataset [13], and its success is highlighted by its ability to retain high performance even when the dedicated datasets for downstream tasks are very small, e.g., only 10 minutes of audio for speech recognition [12] or 1000 observations for non-intrusive speech quality assessment [14].
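As a brief reminder of this objective (a restatement of the contrastive loss from [12], not a contribution of this paper): for a masked time step t, with context network output c_t, true quantized latent q_t, a candidate set Q_t containing q_t and K distractors, and temperature κ, the model minimizes

\mathcal{L}_m = -\log \frac{\exp\left(\mathrm{sim}(c_t, q_t)/\kappa\right)}{\sum_{\tilde{q} \in Q_t} \exp\left(\mathrm{sim}(c_t, \tilde{q})/\kappa\right)},

where \mathrm{sim}(a, b) = a^\top b / (\lVert a \rVert \, \lVert b \rVert) is the cosine similarity. The full pre-training objective in [12] adds a weighted diversity term that encourages uniform use of the codebook entries.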
The wav2vec 2.0 SSL model has been extensively evaluated for speech tasks. However, its adaptation to music tasks (such as pitch classification or instrument classification) has so far been explored without success, as shown in two studies. In the NeurIPS challenge HEAR [15], wav2vec 2.0 embeddings are extracted from the model pre-trained on the LibriSpeech dataset and are used as input features without finetuning. Their performance is relatively low on music tasks, even though wav2vec 2.0 speech embeddings can still represent some musical concepts, such as pitch, to some degree [15]. Wang et al. [16] have also evaluated wav2vec 2.0 outside of the speech domain. In this case, the authors found that wav2vec 2.0 did not perform well when pre-trained on AudioSet [17], possibly due to the limitations of the masked prediction objective when learning from a dataset more complex than LibriSpeech [16].
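For context, the sketch below shows a minimal version of this frozen-feature setup, assuming the Hugging Face transformers library and the publicly released LibriSpeech-pre-trained checkpoint; the exact pipeline in [15] may differ.

import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Frozen wav2vec 2.0 embeddings used as input features, without finetuning.
# "facebook/wav2vec2-base" is the public checkpoint pre-trained on LibriSpeech.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

# Stand-in for a real music clip: one second of mono audio at 16 kHz.
audio = np.random.randn(16000).astype(np.float32)

inputs = extractor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Frame-level embeddings of shape (batch, frames, 768) for the base model.
# Mean-pooling over time is a common choice for clip-level downstream tasks.
clip_embedding = outputs.last_hidden_state.mean(dim=1)

A downstream classifier (e.g., a shallow MLP) can then be trained on the pooled embeddings, which is the typical way such frozen representations are evaluated.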
An approach that is still unexplored is pre-training wav2vec 2.0 on music data only. The transferability of deep networks becomes more challenging when the source and the target tasks have different domains [18], and it has been shown that wav2vec 2.0 might be sensitive to a domain shift. For example, pre-training wav2vec 2.0 with cross-lingual datasets improves the performance of ASR systems [19], and finetuning with non-English languages shows a performance drop for speech quality assessment [14].
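To make concrete what pre-training on music data involves, the following is a minimal, hypothetical sketch of a single wav2vec 2.0 pre-training step using the Hugging Face transformers implementation of the objective; it is adapted from that library's documentation example and is not the exact configuration used in this paper (our setup is described in Section 2).

import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForPreTraining
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    _compute_mask_indices, _sample_negative_indices)

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForPreTraining.from_pretrained("facebook/wav2vec2-base")
model.train()

# Stand-in for a batch of music audio: one 16 kHz mono clip.
music_clip = np.random.randn(16000).astype(np.float32)
input_values = extractor(music_clip, sampling_rate=16000,
                         return_tensors="pt").input_values
batch_size, raw_len = input_values.shape
seq_len = int(model._get_feat_extract_output_lengths(raw_len))

# Sample masked time steps and negative (distractor) indices, using the
# masking hyperparameters of the base model from [12].
mask_time_indices = _compute_mask_indices(
    shape=(batch_size, seq_len), mask_prob=0.65, mask_length=10)
sampled_negative_indices = _sample_negative_indices(
    features_shape=(batch_size, seq_len),
    num_negatives=model.config.num_negatives,
    mask_time_indices=mask_time_indices)

outputs = model(
    input_values,
    mask_time_indices=torch.tensor(mask_time_indices, dtype=torch.bool),
    sampled_negative_indices=torch.tensor(sampled_negative_indices,
                                          dtype=torch.long))
outputs.loss.backward()  # contrastive loss plus diversity penalty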
In this paper, we study whether the domain shift between the pre-training data and the downstream tasks can explain the performance drop in music tasks reported in the studies above. We further explore the capacity of wav2vec 2.0 features in non-speech tasks by asking the following questions:
1. Does wav2vec 2.0 pre-trained on music encode meaningful music representations, i.e., related to musical concepts such as pitch or instruments?
2. Is it possible to obtain competitive performance on MIR tasks when finetuning wav2vec 2.0 pre-trained on music?
3. Can we establish whether wav2vec 2.0 is a potential candidate model for music tasks, beyond its use in speech?
The paper is structured as follows. In Section 2 we illustrate how we pre-train wav2vec 2.0 with music data. Section 3 is dedicated to the analysis of the features learned by wav2vec 2.0, where we show whether the information encoded in the codebooks is related to musical concepts such as pitch and instrument.