DIFFROLL: DIFFUSION-BASED GENERATIVE MUSIC TRANSCRIPTION
WITH UNSUPERVISED PRETRAINING CAPABILITY
Kin Wai Cheuk1,2, Ryosuke Sawata3, Toshimitsu Uesaka3, Naoki Murata3,
Naoya Takahashi3, Shusuke Takahashi3, Dorien Herremans1, Yuki Mitsufuji3
1Singapore University of Technology and Design, Singapore
2Agency for Science, Technology and Research, Singapore
3Sony Group Corporation, Tokyo, Japan
Work done during internship at Sony.
ABSTRACT
In this paper, we propose a novel generative approach, DiffRoll, to tackle
automatic music transcription (AMT). Instead of treating AMT as a
discriminative task in which the model is trained to convert spectrograms
into piano rolls, we frame it as a conditional generative task in which we
train our model to generate realistic-looking piano rolls from pure
Gaussian noise conditioned on spectrograms. This new AMT formulation
enables DiffRoll to transcribe, generate and even inpaint music. Due to
its classifier-free nature, DiffRoll can also be trained on unpaired
datasets where only piano rolls are available. Our experiments show that
DiffRoll outperforms its discriminative counterpart by 19 percentage
points (ppt.), and our ablation studies also indicate that it outperforms
similar existing methods by 4.8 ppt. Source code and a demonstration are
available at https://sony.github.io/DiffRoll/.
Index Terms— Automatic Music Transcription, Music Information Retrieval, Signal Processing, Diffusion, Generative Model, Unsupervised Pretraining
1. INTRODUCTION
Automatic Music Transcription (AMT) has typically been treated as
a discriminative task [1] in which the frequency bins of spectrograms
are projected onto 88 notes (from A0 to C8) in posteriorgrams. There
have been attempts at modeling AMT using Bayesian inference [2] and
probabilistic models [3]; however, these models lack the generative power
found in variational autoencoders (VAEs) [4] and generative adversarial
networks (GANs) [5, 6].
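To make the contrast with the generative formulation concrete, the conventional discriminative setup described above can be sketched as a frame-wise multi-label classifier over the 88 piano keys. The module below is only an illustrative, hypothetical baseline; the class name, layer sizes, and the n_mels value are assumptions and do not correspond to any cited system.

```python
import torch
import torch.nn as nn

class FramewiseTranscriber(nn.Module):
    """Minimal discriminative AMT sketch: each spectrogram frame is mapped
    to 88 note probabilities (A0 to C8), i.e. a posteriorgram."""

    def __init__(self, n_mels: int = 229, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 88),  # one logit per piano key
        )

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, frames, n_mels) -> posteriorgram: (batch, frames, 88)
        return torch.sigmoid(self.net(spec))
```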
In addition, the majority of existing state-of-the-art (SOTA)
AMT models are fully supervised [7, 8, 9, 10, 11]. While there
have been attempts to develop weakly-supervised [12] and semi-
supervised AMT models [13, 14], unsupervised deep learning AMT
remains an under-explored direction and is currently limited to drum
transcription [15] and non-deep-learning approaches [3, 16].
In this paper, we devise a novel formulation for AMT called “DiffRoll”
that leverages the power of diffusion, yielding a generative model capable
of generating new piano rolls and offering the potential for unsupervised
pretraining.
2. RELATED WORK
Diffusion models in deep learning, inspired by the concept of non-equilibrium
thermodynamics [17], have shown promising results in generating
high-fidelity audio samples [18, 19]. Subsequent improvements have led to
better diffusion methods, for example, a better forward process [20], new
regularization [21], faster sampling speed [22], and discrete diffusion for
binary data [23]. In the field
of vision, conditional diffusion such as DALL-E [24] enables gen-
erated images to be controlled by an input text, resulting in SOTA
text-to-image performance.
Diffusion is now also being used for music-related tasks such as
music generation [25] and music synthesis [26]. These tasks are intuitively
considered generative, and hence it is reasonable to model them using
diffusion.
considered a discriminative task in which an AMT model classifies
the notes present in the spectrograms frame-by-frame. Even though
there have been attempts [8, 27] to model AMT as a token-based
task inspired by natural language processing, these approaches are
still discriminative. AMT, however, can also be modeled as a generative
task: the spectrograms can be considered as the conditions, and the piano
rolls as the images to be generated. In other words, we can think of AMT
as a piano roll generation task, which is the reverse process of music
synthesis [26]. In this paper, we take a novel perspective by modeling AMT
as a generative task, with source code and a demonstration available1.
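Under this generative framing, transcription amounts to drawing a piano roll from the conditional distribution given the spectrogram by reversing a diffusion process. The sketch below illustrates the idea with standard DDPM ancestral sampling; the function name, the epsilon-prediction assumption, and the noise-schedule handling are assumptions for illustration and are not claimed to match the exact sampler used in the paper.

```python
import torch

@torch.no_grad()
def sample_piano_roll(model, c_mel, betas, shape):
    """DDPM-style ancestral sampling of a piano roll conditioned on a mel
    spectrogram c_mel. `model(x_t, t, c_mel)` is assumed to predict the
    added noise (epsilon parameterisation)."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x_t = torch.randn(shape, device=c_mel.device)  # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, device=x_t.device)
        eps = model(x_t, t_batch, c_mel)
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        mean = (x_t - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) \
               / torch.sqrt(alphas[t])
        noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
        x_t = mean + torch.sqrt(betas[t]) * noise  # sigma_t^2 = beta_t variant
    return x_t  # continuous posteriorgram, later thresholded (see Sec. 3)
```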
3. PROPOSED METHOD
DiffRoll, as shown in Fig. 1, is designed to convert Gaussian noise
x_t into a posteriorgram x̂_0 conditioned on a spectrogram c_mel. To handle
binary piano rolls x_roll ∈ {0, 1} during training, we cast them into
[0, 1], similar to Analog Bits [23]. During sampling, the posteriorgram is
binarized back into a piano roll with a threshold of 0.5 and then exported
as a MIDI file.
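As a concrete illustration of this post-processing step, the snippet below binarizes a posteriorgram at the 0.5 threshold and writes the result to a MIDI file with pretty_midi. The function name, the frame rate, and the fixed velocity are placeholder assumptions for the sketch, not settings taken from the paper.

```python
import numpy as np
import pretty_midi

def posteriorgram_to_midi(post, frame_rate=31.25, threshold=0.5, path="out.mid"):
    """Binarize a posteriorgram of shape (88, frames) at `threshold` and
    export it as a MIDI file. `frame_rate` (frames per second) is a
    placeholder value, not the paper's setting."""
    roll = (post >= threshold).astype(np.int8)           # back to {0, 1}
    pm = pretty_midi.PrettyMIDI()
    piano = pretty_midi.Instrument(program=0)
    for key in range(88):                                 # A0 (MIDI 21) .. C8 (MIDI 108)
        active = np.flatnonzero(roll[key])
        if active.size == 0:
            continue
        # Split contiguous runs of active frames into individual notes.
        runs = np.split(active, np.where(np.diff(active) > 1)[0] + 1)
        for run in runs:
            piano.notes.append(pretty_midi.Note(
                velocity=80, pitch=key + 21,
                start=run[0] / frame_rate, end=(run[-1] + 1) / frame_rate))
    pm.instruments.append(piano)
    pm.write(path)
```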
The DiffRoll model architecture is inspired by DiffWave [18]
which is also a 1D convolutional model. In DiffWave, x_t is a 1-channel
tensor, while in DiffRoll we consider x_t as an 88-channel tensor with
dimension (B, 88, τ), where B is the batch size and τ is the number of
frames in the piano roll. As in DiffWave, x_t is projected into a tensor
with shape (B, 512, τ) via a 1D convolutional layer with a kernel size of
1. A total of 15 residual layers with the same design as in DiffWave are
used, where the output of the previous residual layer is added to the
input of the next layer. The diffusion time t, i.e. a batch of integers
with shape (B, 1), is projected into (B, 512) and then broadcast to
(B, 512, τ) so that it can be added to the input of each residual layer.
Note that the diffusion time t should not be confused with the time
dimension of the spectrograms.
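For readers who prefer code, the shape handling described above can be sketched as follows. This is a simplified, shape-level sketch only: the residual layer internals are reduced to a plain convolution rather than DiffWave's gated dilated blocks, the spectrogram conditioning is omitted since its injection point is not detailed in the excerpt above, and the class names and the output projection back to 88 channels are assumptions added to keep the sketch self-contained.

```python
import torch
import torch.nn as nn

class ResidualLayer(nn.Module):
    """Simplified stand-in for a DiffWave-style residual layer."""

    def __init__(self, channels: int = 512):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x, t_emb):
        # t_emb: (B, 512) broadcast to (B, 512, tau) and added to the input.
        h = x + t_emb.unsqueeze(-1)
        return x + torch.relu(self.conv(h))    # output added back to the input


class DiffRollSketch(nn.Module):
    """Shape-level sketch of the backbone described in the text:
    (B, 88, tau) -> kernel-1 Conv1d -> (B, 512, tau) -> 15 residual layers."""

    def __init__(self, n_layers: int = 15, channels: int = 512):
        super().__init__()
        self.input_proj = nn.Conv1d(88, channels, kernel_size=1)
        self.t_proj = nn.Linear(1, channels)    # diffusion time (B, 1) -> (B, 512)
        self.layers = nn.ModuleList(ResidualLayer(channels) for _ in range(n_layers))
        self.output_proj = nn.Conv1d(channels, 88, kernel_size=1)

    def forward(self, x_t, t):
        # x_t: (B, 88, tau) noisy piano roll; t: (B, 1) diffusion time steps.
        h = self.input_proj(x_t)
        t_emb = self.t_proj(t.float())
        for layer in self.layers:
            h = layer(h, t_emb)
        return self.output_proj(h)
```

For example, passing x_t of shape (4, 88, 640) and t of shape (4, 1) through DiffRollSketch() returns a tensor of shape (4, 88, 640), matching the piano roll dimensions discussed above.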
1 https://sony.github.io/DiffRoll/