LEARNING MUSIC REPRESENTATIONS WITH WAV2VEC 2.0
Alessandro Ragano1, Emmanouil Benetos2, Andrew Hines1
1School of CS, University College Dublin, Ireland
2School of EECS, Queen Mary University of London, UK
ABSTRACT
Learning general-purpose music representations offers the flexibility to finetune models for several downstream tasks using smaller datasets. The wav2vec 2.0 speech representation model showed promising results in many downstream speech tasks, but has been less effective when adapted to music. In this paper, we evaluate whether pre-training wav2vec 2.0 directly on music data can be a better solution than finetuning the speech model. We illustrate that when pre-training on music data, the discrete latent representations are able to encode the semantic meaning of musical concepts such as pitch and instrument. Our results show that finetuning wav2vec 2.0 pre-trained on music data allows us to achieve promising results on music classification tasks that are competitive with prior work on audio representations. In addition, the results are superior to those obtained from the model pre-trained on speech, demonstrating that wav2vec 2.0 pre-trained on music data can be a promising music representation model.
Index Terms— music representations, self-supervision, pre-training
1. INTRODUCTION
Learning feature representations with deep architectures has shown
remarkable success over hand-crafted features in Music Information
Retrieval (MIR) [1]. Approaches such as transfer learning from music auto-tagging [2, 3, 4] allow one to pre-train neural networks on large datasets and extract features for downstream MIR tasks such as instrument classification or genre recognition. In this way, downstream MIR tasks can be solved using smaller annotated datasets, which is desirable since labeling is costly and difficult to achieve. One issue with auto-tagging models is that they require very large annotated datasets, which are still difficult to obtain. To overcome this need, new music representation techniques have emerged that do not directly use waveform-related labels, for example, pre-training from language models [5] or using noisy language descriptors of the musical content [6].
A different approach, based on using proxy tasks to learn representations, is self-supervised learning (SSL), where information from the input data itself is extracted to provide labels. This is advantageous since labels can be generated automatically without requiring human intervention. Some SSL models have been proposed for music representations, showing competitive performance in several downstream MIR tasks [7, 8, 9, 10].

(This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Numbers 17/RC-PhD/3483 and 17/RC/2289 P2 and was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1. EB is supported by a Turing Fellowship.)

Beyond music representation learning, SSL models have grown in popularity for speech representations and downstream tasks such as speaker identification, automatic speech recognition, phoneme recognition, and speech translation [11]. Examples of speech SSL models include wav2vec 2.0 [12], a contrastive learning-based approach in which the model learns to distinguish a target sample (positive) from distractors (negatives). The original model was pre-trained on the LibriSpeech dataset [13], and its success is highlighted by its ability to retain high performance even when the dedicated datasets for downstream tasks are very small, e.g., only 10 minutes of audio for speech recognition [12] or 1000 observations for non-intrusive speech quality assessment [14].
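As a brief reminder of this objective (a restatement of the contrastive loss from [12], not a contribution of this paper): for a masked time step t, with context network output c_t, true quantized latent q_t, a candidate set Q_t containing q_t and K distractors, and temperature κ, the model minimizes

\mathcal{L}_m = -\log \frac{\exp\left(\mathrm{sim}(c_t, q_t)/\kappa\right)}{\sum_{\tilde{q} \in Q_t} \exp\left(\mathrm{sim}(c_t, \tilde{q})/\kappa\right)},

where \mathrm{sim}(a, b) = a^\top b / (\lVert a \rVert \, \lVert b \rVert) is the cosine similarity. The full pre-training objective in [12] adds a weighted diversity term that encourages uniform use of the codebook entries.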
The wav2vec 2.0 SSL model has been extensively evaluated for speech tasks. However, its adaptation to music tasks (such as pitch classification or instrument classification) has so far been explored without success, as shown in two studies. In the NeurIPS challenge HEAR [15], wav2vec 2.0 embeddings are extracted from the model pre-trained on the LibriSpeech dataset and are used as input features without finetuning. Their performance is relatively low on music tasks, even though wav2vec 2.0 speech embeddings can still represent some musical concepts, such as pitch, to some degree [15]. Wang et al. [16] have also evaluated wav2vec 2.0 outside of the speech domain. In this case, the authors found that wav2vec 2.0 did not perform well when pre-trained on AudioSet [17], possibly due to the limitations of the masked prediction objective when learning from a dataset more complex than LibriSpeech [16].
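For context, the sketch below shows a minimal version of this frozen-feature setup, assuming the Hugging Face transformers library and the publicly released LibriSpeech-pre-trained checkpoint; the exact pipeline in [15] may differ.

import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Frozen wav2vec 2.0 embeddings used as input features, without finetuning.
# "facebook/wav2vec2-base" is the public checkpoint pre-trained on LibriSpeech.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

# Stand-in for a real music clip: one second of mono audio at 16 kHz.
audio = np.random.randn(16000).astype(np.float32)

inputs = extractor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Frame-level embeddings of shape (batch, frames, 768) for the base model.
# Mean-pooling over time is a common choice for clip-level downstream tasks.
clip_embedding = outputs.last_hidden_state.mean(dim=1)

A downstream classifier (e.g., a shallow MLP) can then be trained on the pooled embeddings, which is the typical way such frozen representations are evaluated.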
An approach that is still unexplored is pre-training wav2vec 2.0 on music data only. The transferability of deep networks becomes more challenging when the source and the target tasks have different domains [18], and it has been shown that wav2vec 2.0 might be sensitive to a domain shift. For example, pre-training wav2vec 2.0 with cross-lingual datasets improves the performance of ASR systems [19], and finetuning with non-English languages shows a performance drop for speech quality assessment [14].
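To make concrete what pre-training on music data involves, the following is a minimal, hypothetical sketch of a single wav2vec 2.0 pre-training step using the Hugging Face transformers implementation of the objective; it is adapted from that library's documentation example and is not the exact configuration used in this paper (our setup is described in Section 2).

import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForPreTraining
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    _compute_mask_indices, _sample_negative_indices)

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForPreTraining.from_pretrained("facebook/wav2vec2-base")
model.train()

# Stand-in for a batch of music audio: one 16 kHz mono clip.
music_clip = np.random.randn(16000).astype(np.float32)
input_values = extractor(music_clip, sampling_rate=16000,
                         return_tensors="pt").input_values
batch_size, raw_len = input_values.shape
seq_len = int(model._get_feat_extract_output_lengths(raw_len))

# Sample masked time steps and negative (distractor) indices, using the
# masking hyperparameters of the base model from [12].
mask_time_indices = _compute_mask_indices(
    shape=(batch_size, seq_len), mask_prob=0.65, mask_length=10)
sampled_negative_indices = _sample_negative_indices(
    features_shape=(batch_size, seq_len),
    num_negatives=model.config.num_negatives,
    mask_time_indices=mask_time_indices)

outputs = model(
    input_values,
    mask_time_indices=torch.tensor(mask_time_indices, dtype=torch.bool),
    sampled_negative_indices=torch.tensor(sampled_negative_indices,
                                          dtype=torch.long))
outputs.loss.backward()  # contrastive loss plus diversity penalty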
In this paper, we study whether the domain shift between the pre-training data and the downstream tasks can explain the performance drop in music tasks reported in the studies above. We further explore the capacity of wav2vec 2.0 features in non-speech tasks by asking the following questions:
1. Does wav2vec 2.0 pre-trained on music encode meaningful music representations, i.e., related to musical concepts such as pitch or instruments?
2. Is it possible to obtain competitive performance on MIR tasks when finetuning wav2vec 2.0 pre-trained on music?
3. Can we establish whether wav2vec 2.0 is a potential candidate model for music tasks, beyond its use in speech?
The paper is structured as follows. In Section 2 we illustrate how we pre-train wav2vec 2.0 with music data. Section 3 is dedicated to the analysis of the features learned by wav2vec 2.0, where we show whether the information encoded in the codebooks is related to musical concepts such as pitch and instrument.