DIFFROLL: DIFFUSION-BASED GENERATIVE MUSIC TRANSCRIPTION
WITH UNSUPERVISED PRETRAINING CAPABILITY
Kin Wai Cheuk1,2, Ryosuke Sawata3, Toshimitsu Uesaka3, Naoki Murata3,
Naoya Takahashi3, Shusuke Takahashi3, Dorien Herremans1, Yuki Mitsufuji3
1Singapore University of Technology and Design, Singapore
2Agency for Science, Technology and Research, Singapore
3Sony Group Corporation, Tokyo, Japan
Work done during internship at Sony.
ABSTRACT
In this paper, we propose a novel generative approach, DiffRoll, to tackle
automatic music transcription (AMT). Instead of treating AMT as a
discriminative task in which the model is trained to convert spectrograms
into piano rolls, we frame it as a conditional generative task in which we
train our model to generate realistic-looking piano rolls from pure
Gaussian noise conditioned on spectrograms. This new AMT formulation
enables DiffRoll to transcribe, generate and even inpaint music. Due to
its classifier-free nature, DiffRoll can also be trained on unpaired
datasets where only piano rolls are available. Our experiments show that
DiffRoll outperforms its discriminative counterpart by 19 percentage
points (ppt.), and our ablation studies also indicate that it outperforms
similar existing methods by 4.8 ppt. Source code and a demonstration are
available at https://sony.github.io/DiffRoll/.
Index Terms— Automatic Music Transcription, Music Information Retrieval, Signal Processing, Diffusion, Generative Model, Unsupervised Pretraining
1. INTRODUCTION
Automatic Music Transcription (AMT) has typically been treated as
a discriminative task [1] in which the frequency bins of spectrograms
are projected onto 88 notes (from A0 to C8) in posteriorgrams. There
have been attempts at modeling AMT using Bayesian inference [2] and
probabilistic models [3]; however, these models lack the generative power
found in variational autoencoders (VAEs) [4] and generative adversarial
networks (GANs) [5, 6].
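To make the contrast with the generative formulation concrete, the conventional discriminative setup described above can be sketched as a frame-wise multi-label classifier over the 88 piano keys. The module below is only an illustrative, hypothetical baseline; the class name, layer sizes, and the n_mels value are assumptions and do not correspond to any cited system.

```python
import torch
import torch.nn as nn

class FramewiseTranscriber(nn.Module):
    """Minimal discriminative AMT sketch: each spectrogram frame is mapped
    to 88 note probabilities (A0 to C8), i.e. a posteriorgram."""

    def __init__(self, n_mels: int = 229, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 88),  # one logit per piano key
        )

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, frames, n_mels) -> posteriorgram: (batch, frames, 88)
        return torch.sigmoid(self.net(spec))
```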
In addition, the majority of existing state-of-the-art (SOTA)
AMT models are fully supervised [7, 8, 9, 10, 11]. While there
have been attempts to develop weakly-supervised [12] and semi-
supervised AMT models [13, 14], unsupervised deep learning AMT
remains an under-explored direction and is currently limited to drum
transcription [15] and non-deep-learning approaches [3, 16].
In this paper, we devise a novel formulation for AMT called “DiffRoll”
that leverages the power of diffusion, yielding a generative model capable
of generating new piano rolls and offering the potential for unsupervised
pretraining.
2. RELATED WORK
Diffusion models in deep learning, inspired by the concept of non-equilibrium
thermodynamics [17], have shown promising results in generating
high-fidelity audio samples [18, 19]. Subsequent improvements have led to
better diffusion methods, for example, a better forward process [20], new
regularization [21], faster sampling speed [22], and discrete diffusion for
binary data [23]. In the field
of vision, conditional diffusion such as DALL-E [24] enables gen-
erated images to be controlled by an input text, resulting in SOTA
text-to-image performance.
Diffusion is now also being used for music-related tasks such as
music generation [25] and music synthesis [26]. These tasks are intuitively
considered generative, and hence it is reasonable to model them using
diffusion.
considered a discriminative task in which an AMT model classifies
the notes present in the spectrograms frame-by-frame. Even though
there have been attempts [8, 27] to model AMT as a token-based
task inspired by natural language processing, these approaches are
still discriminative. AMT, however, can also be modeled as a generative
task: the spectrograms can be considered as the conditions, and the piano
rolls as the images to be generated. In other words, we can think of AMT
as a piano roll generation task, which is the reverse process of music
synthesis [26]. In this paper, we take a novel perspective by modeling AMT
as a generative task, with source code and a demonstration available1.
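Under this generative framing, transcription amounts to drawing a piano roll from the conditional distribution given the spectrogram by reversing a diffusion process. The sketch below illustrates the idea with standard DDPM ancestral sampling; the function name, the epsilon-prediction assumption, and the noise-schedule handling are assumptions for illustration and are not claimed to match the exact sampler used in the paper.

```python
import torch

@torch.no_grad()
def sample_piano_roll(model, c_mel, betas, shape):
    """DDPM-style ancestral sampling of a piano roll conditioned on a mel
    spectrogram c_mel. `model(x_t, t, c_mel)` is assumed to predict the
    added noise (epsilon parameterisation)."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x_t = torch.randn(shape, device=c_mel.device)  # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, device=x_t.device)
        eps = model(x_t, t_batch, c_mel)
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        mean = (x_t - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) \
               / torch.sqrt(alphas[t])
        noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
        x_t = mean + torch.sqrt(betas[t]) * noise  # sigma_t^2 = beta_t variant
    return x_t  # continuous posteriorgram, later thresholded (see Sec. 3)
```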
3. PROPOSED METHOD
DiffRoll, as shown in Fig. 1, is designed to convert Gaussian noise
x_t into a posteriorgram x̂_0 conditioned on a spectrogram c_mel. To handle
binary piano rolls x_roll ∈ {0, 1} during training, we cast them into
[0, 1], similar to Analog Bits [23]. During sampling, the posteriorgram is
binarized back into a piano roll with a threshold of 0.5 and then exported
as a MIDI file.
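As a concrete illustration of this post-processing step, the snippet below binarizes a posteriorgram at the 0.5 threshold and writes the result to a MIDI file with pretty_midi. The function name, the frame rate, and the fixed velocity are placeholder assumptions for the sketch, not settings taken from the paper.

```python
import numpy as np
import pretty_midi

def posteriorgram_to_midi(post, frame_rate=31.25, threshold=0.5, path="out.mid"):
    """Binarize a posteriorgram of shape (88, frames) at `threshold` and
    export it as a MIDI file. `frame_rate` (frames per second) is a
    placeholder value, not the paper's setting."""
    roll = (post >= threshold).astype(np.int8)           # back to {0, 1}
    pm = pretty_midi.PrettyMIDI()
    piano = pretty_midi.Instrument(program=0)
    for key in range(88):                                 # A0 (MIDI 21) .. C8 (MIDI 108)
        active = np.flatnonzero(roll[key])
        if active.size == 0:
            continue
        # Split contiguous runs of active frames into individual notes.
        runs = np.split(active, np.where(np.diff(active) > 1)[0] + 1)
        for run in runs:
            piano.notes.append(pretty_midi.Note(
                velocity=80, pitch=key + 21,
                start=run[0] / frame_rate, end=(run[-1] + 1) / frame_rate))
    pm.instruments.append(piano)
    pm.write(path)
```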
The DiffRoll model architecture is inspired by DiffWave [18]
which is also a 1D convolutional model. In DiffWave, x_t is a 1-channel
tensor, while in DiffRoll we consider x_t as an 88-channel tensor with
dimension (B, 88, τ), where B is the batch size and τ is the number of
frames in the piano roll. As in DiffWave, x_t is projected into a tensor
with shape (B, 512, τ) via a 1D convolutional layer with a kernel size of
1. A total of 15 residual layers with the same design as in DiffWave are
used, where the output of the previous residual layer is added to the
input of the next layer. The diffusion time t, i.e. a batch of integers
with shape (B, 1), is projected into (B, 512) and then broadcast to
(B, 512, τ) so that it can be added to the input of each residual layer.
Note that the diffusion time t should not be confused with the time
dimension of the spectrograms.
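For readers who prefer code, the shape handling described above can be sketched as follows. This is a simplified, shape-level sketch only: the residual layer internals are reduced to a plain convolution rather than DiffWave's gated dilated blocks, the spectrogram conditioning is omitted since its injection point is not detailed in the excerpt above, and the class names and the output projection back to 88 channels are assumptions added to keep the sketch self-contained.

```python
import torch
import torch.nn as nn

class ResidualLayer(nn.Module):
    """Simplified stand-in for a DiffWave-style residual layer."""

    def __init__(self, channels: int = 512):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x, t_emb):
        # t_emb: (B, 512) broadcast to (B, 512, tau) and added to the input.
        h = x + t_emb.unsqueeze(-1)
        return x + torch.relu(self.conv(h))    # output added back to the input


class DiffRollSketch(nn.Module):
    """Shape-level sketch of the backbone described in the text:
    (B, 88, tau) -> kernel-1 Conv1d -> (B, 512, tau) -> 15 residual layers."""

    def __init__(self, n_layers: int = 15, channels: int = 512):
        super().__init__()
        self.input_proj = nn.Conv1d(88, channels, kernel_size=1)
        self.t_proj = nn.Linear(1, channels)    # diffusion time (B, 1) -> (B, 512)
        self.layers = nn.ModuleList(ResidualLayer(channels) for _ in range(n_layers))
        self.output_proj = nn.Conv1d(channels, 88, kernel_size=1)

    def forward(self, x_t, t):
        # x_t: (B, 88, tau) noisy piano roll; t: (B, 1) diffusion time steps.
        h = self.input_proj(x_t)
        t_emb = self.t_proj(t.float())
        for layer in self.layers:
            h = layer(h, t_emb)
        return self.output_proj(h)
```

For example, passing x_t of shape (4, 88, 640) and t of shape (4, 1) through DiffRollSketch() returns a tensor of shape (4, 88, 640), matching the piano roll dimensions discussed above.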
1 https://sony.github.io/DiffRoll/