JUKEDRUMMER: CONDITIONAL BEAT-AWARE AUDIO-DOMAIN
DRUM ACCOMPANIMENT GENERATION VIA TRANSFORMER VQ-VAE
Yueh-Kao Wu
Academia Sinica
yk.lego09@gmail.com
Ching-Yu Chiu
National Cheng Kung University
x2009971@gmail.com
Yi-Hsuan Yang
Taiwan AI Labs
yhyang@ailabs.tw
ABSTRACT
This paper proposes a model that generates a drum track
in the audio domain to play along to a user-provided drum-
free recording. Specifically, using paired data of drumless
tracks and the corresponding human-made drum tracks,
we train a Transformer model to improvise the drum part
of an unseen drumless recording. We combine two ap-
proaches to encode the input audio. First, we train a vector-
quantized variational autoencoder (VQ-VAE) to represent
the input audio with discrete codes, which can then be
readily used in a Transformer. Second, using an audio-
domain beat tracking model, we compute beat-related fea-
tures of the input audio and use them as embeddings in
the Transformer. Instead of generating the drum track di-
rectly as waveforms, we use a separate VQ-VAE to encode
the mel-spectrogram of a drum track into another set of
discrete codes, and train the Transformer to predict the se-
quence of drum-related discrete codes. The output codes
are then converted to a mel-spectrogram with a decoder,
and then to the waveform with a vocoder. We report both
objective and subjective evaluations of variants of the pro-
posed model, demonstrating that the model with beat in-
formation generates drum accompaniment that is rhythmi-
cally and stylistically consistent with the input audio.
1. INTRODUCTION
Deep generative models for musical audio generation have
witnessed great progress in recent years [1–8]. While mod-
els for generating symbolic music such as MIDI [9–12] or
musical scores [13] focus primarily on the composition of
musical content, an audio-domain music generation model
deals with sounds and thereby has extra complexities re-
lated to timbre and audio quality. For example, while a
model for generating symbolic guitar tabs can simply con-
sider a guitar tab as a sequence of notes [11], a model that
generates audio recordings of guitar needs to determine not
only the underlying sequence of notes but also the way to
render (synthesize) the notes into sounds. Due to the com-
plexities involved, research on deep generative models for
musical audio begins with the simpler task of synthesizing
individual musical notes [1–3], dispensing with the need to con-
sider the composition of notes. Follow-up research [4–6]
extends the capability to generating musical passages of
a single instrument. The Jukebox model [7] proposed by
OpenAI greatly advances the state-of-the-art by being able
to, in their own words, “generate high-fidelity and di-
verse songs with coherence up to multiple minutes.” Being
trained on a massive collection of audio recordings with
the corresponding lyrics but not the symbolic transcrip-
tions of music, Jukebox generates multi-instrument music
as raw waveforms directly without an explicit model of the
underlying sequence of notes.
This work aims to improve upon Jukebox in two as-
pects. First, the backbone of Jukebox is a hundred-layer
Transformer [14, 15] with billions of parameters that are
trained with 1.2 million songs on hundreds of NVIDIA
V100 GPUs for weeks at OpenAI, which is hard to repro-
duce elsewhere. Inspired by a recent Jukebox-like model
for singing voice generation called KaraSinger [8], we in-
stead build a light-weight model with only 25 million pa-
rameters by working on Mel-spectrograms instead of raw
waveforms. Our model is trained with only 457 recordings
on a single GeForce GTX 1080 Ti GPU for 2 days.
Second, and more importantly, instead of a fully au-
tonomous model that makes a song from scratch with var-
ious instruments, we aim to build a model that can work
cooperatively with a human, allowing the human partner to
provide the musical audio of some instruments as input to
the model, which generates in return the musical audio of
some other instruments to accompany and complement the
user input, completing the song together. Such a
model can potentially contribute to human-AI co-creation
in songwriting [16] and enable new applications.
In technical terms, our work enhances the controllabil-
ity of the model by allowing its generation to be steered
by a user-provided audio track. It can be viewed as an in-
teresting sequence-to-sequence problem where the model
creates a “target sequence” of music that is to be played
along to the input “source sequence.” Besides the requirement
on audio quality, the coordination between the source and
target sequences in terms of musical aspects such as style,
rhythm, and harmony is also of central importance.
We note that, for controllability and the intelligibility
of the generated singing, both Jukebox [7] and KaraSinger
[8] have a lyrics encoder that allows their generation to be
steered by textual lyrics. While technically similar,
our accompaniment generation task (“audio-to-audio”) is
different from the lyric-conditioned generation task (“text-
to-audio”) in that the latter does not need to deal with the
coordination between two audio recordings.
Specifically, we consider a drum accompaniment gener-
ation problem in our implementation, using a “drumless”
recording as the input and generating as the output a drum
track that involves the use of an entire drum kit. We use
this as an example task to investigate the audio-domain ac-
companiment generation problem for the following reasons.
First, datasets used in musical source separation [17]
usually consist of an isolated drum stem along with stems
corresponding to other instruments. We can therefore eas-
ily merge the other stems to create paired data of drum-
less tracks and drum tracks as training data of our model.
(In musical terms, drumless or “Minus Drums” songs are
recordings where the drum part has been taken out, which
corresponds nicely to our scenario; a data-pairing sketch is
given after this paragraph.) Second, we suppose a
drum accompaniment generation model can easily find ap-
plications in songwriting [18], as it allows a user (who may
not be familiar with drum playing or beat making) to focus
on the other non-drum tracks. Third, audio-domain drum
accompaniment generation poses interesting challenges as
the model needs to determine not only the drum patterns,
which should be rhythmically consistent with the input, but
also the drum sounds, which should be stylistically consis-
tent with it. Moreover, the generated drum track is expected to
follow a steady tempo, which is a basic requirement for a
human drummer. We call our model the “JukeDrummer.”
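As a rough illustration of the data pairing mentioned above (not the authors' actual pipeline), the following sketch assumes a source-separation-style dataset in which each song provides isolated stems as WAV files; the file names, stem labels (MUSDB18-style), and the use of the soundfile library are illustrative assumptions.

```python
import numpy as np
import soundfile as sf  # assumed audio I/O library; any WAV reader works

def make_training_pair(stem_paths, drum_key="drums"):
    """Sum all non-drum stems into a drumless mix and keep the drum stem
    as the target, yielding one (input, target) pair for training."""
    stems = {name: sf.read(path)[0] for name, path in stem_paths.items()}
    drums = stems.pop(drum_key)                      # target: isolated drum track
    drumless = np.sum(list(stems.values()), axis=0)  # input: mix of all other stems
    return drumless, drums

# Hypothetical per-song stem layout:
drumless, drums = make_training_pair({
    "vocals": "song1/vocals.wav",
    "bass":   "song1/bass.wav",
    "other":  "song1/other.wav",
    "drums":  "song1/drums.wav",
})
```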
As depicted in Figure 1, the proposed model architec-
ture contains an “audio encoder” (instead of the original
text encoder [7, 8]) named the drumless VQ encoder that
takes a drum-free recording as input. In addition, we experiment
with different ways to capitalize on an audio-domain beat and
downbeat tracking model proposed recently [19] in a novel
beat-aware module that extracts beat-related information
from the input audio, so that the language model for gener-
ation (i.e., the Transformer) is better informed of the rhyth-
mic properties of the input. The specific model [19] was
trained on drumless recordings as well, befitting our task.
We extract features from different levels, including low-
level tracker embeddings, mid-level activation peaks, and
high-level beat/downbeat positions, and investigate which
one benefits the generation model the most.
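To make the flow of Figure 1 concrete, below is a schematic sketch of the inference procedure under our own naming; the module interfaces (vq_encoder, beat_module, transformer, drum_decoder, vocoder) and the start-token convention are hypothetical stand-ins for the components described above, not the released code.

```python
import torch

@torch.no_grad()
def generate_drum_track(drumless_mel, vq_encoder, beat_module, transformer,
                        drum_decoder, vocoder, num_steps, start_id=0):
    """Schematic inference loop: encode the drumless input, extract beat
    features, then autoregressively sample drum VQ codes and decode them."""
    # 1) Discrete codes summarizing the drumless input (conditioning sequence).
    cond_codes = vq_encoder(drumless_mel)            # (T_cond,) integer code ids

    # 2) Beat-related embeddings from the beat/downbeat tracker
    #    (low-level embeddings, activation peaks, or beat/downbeat positions).
    beat_emb = beat_module(drumless_mel)             # (T_cond, d_beat)

    # 3) Autoregressive sampling of drum codes, one token at a time,
    #    starting from an assumed begin-of-sequence token.
    drum_codes = torch.tensor([start_id], dtype=torch.long)
    for _ in range(num_steps):
        logits = transformer(drum_codes, cond_codes, beat_emb)  # next-token logits
        probs = torch.softmax(logits[-1], dim=-1)
        next_code = torch.multinomial(probs, num_samples=1)
        drum_codes = torch.cat([drum_codes, next_code])

    # 4) Drum codes -> mel-spectrogram -> waveform.
    drum_mel = drum_decoder(drum_codes[1:])          # drop the start token
    return vocoder(drum_mel)
```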
Our contribution is four-fold. First, to our best knowl-
edge, this work represents the first attempt at drum ac-
companiment generation with a full drum kit given drum-free
mixed audio. Second, we develop a light-weight audio-to-
audio Jukebox variant that takes an input audio of up to 24
seconds as conditioning and generates accompanying mu-
sic in the domain of Mel-spectrograms (Section 3). Third,
we experiment with different beat-related conditions in the
context of audio generation (Section 4). Finally, we report
objective and subjective evaluations demonstrating the ef-
fectiveness of the proposed model (Sections 6 & 7).¹

¹ We share our code and checkpoint at: https://github.com/legoodmanner/jukedrummer. Moreover, we provide audio examples at the following demo page: https://legoodmanner.github.io/jukedrummer-demo/
Figure 1: Diagram of the proposed JukeDrummer model
for the inference stage. The training stage involves learning
an additional Drum VQ Encoder and a Drumless VQ Decoder
(see Figure 2) that are not used at inference time.
2. BACKGROUND
2.1 Related Work on Drum Generation
Conditional drum accompaniment generation has been
studied in the literature, but only in the symbolic domain
[20, 21], to the best of our knowledge. Dahale et al. [20]
used a Transformer encoder to generate an accompany-
ing symbolic drum pattern of 12 bars given a four-track,
melodic MIDI passage. Makris et al. [21] adopted instead
a sequence-to-sequence architecture with a bi-directional
long short-term memory (BLSTM) encoder extracting in-
formation from the melodic input and a Transformer de-
coder generating the drum track for up to 16 bars in MIDI
format. While symbolic-domain music generation has its
own challenges, it differs greatly from the audio-domain
counterpart studied in this paper, for it is not about gener-
ating sounds that can be readily listened to by humans.
Related tasks that have been attempted in the litera-
ture with deep learning include symbolic-domain gener-
ation of a monophonic drum track (i.e., kick drum only)
of multiple bars [4], symbolic-domain drum pattern gener-
ation [22–25], symbolic-domain drum track generation as
part of a multi-track MIDI [26–29], audio-domain one-shot
drum hit generation [30–34], audio-domain generation of
drum sounds of an entire drum kit of a single bar [35], and
audio-domain drum loop generation [36]. Jukebox [7] gen-
erates a mixture of sounds that includes drums, but not an
isolated drum track. By design, Jukebox does not take any
input audio as a condition and hence cannot generate accompaniments.
2.2 The Original Jukebox model
The main architecture of Jukebox [7] is composed of two
components: a multi-scale vector-quantized variational au-
toencoder (VQ-VAE) [37–41] and an autoregressive Trans-
former decoder [14, 15]. The VQ-VAE is for converting a
continuous-valued raw audio waveform into a sequence of
so-called discrete VQ codes, while the Transformer estab-
lishes a language model (LM) of the VQ codes capable of
generating new code sequences.
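As a minimal illustration of the VQ bottleneck (a generic sketch of vector quantization, not Jukebox's exact multi-scale implementation), each continuous encoder frame is replaced by the index of its nearest codebook vector; the codebook size and feature dimension below are arbitrary.

```python
import torch

def quantize(encoder_frames: torch.Tensor, codebook: torch.Tensor):
    """Replace each continuous encoder frame with the index of its nearest
    codebook vector (Euclidean distance), as in a VQ-VAE bottleneck."""
    # encoder_frames: (T, d), codebook: (K, d)
    dists = torch.cdist(encoder_frames, codebook)   # (T, K) pairwise distances
    codes = dists.argmin(dim=-1)                    # (T,) discrete code ids
    quantized = codebook[codes]                     # (T, d) codebook lookup
    return codes, quantized

# Toy example: 24 encoder frames, a codebook of 2048 entries of dimension 64
# (both numbers are arbitrary here).
frames = torch.randn(24, 64)
codebook = torch.randn(2048, 64)
codes, _ = quantize(frames, codebook)
# The Transformer is then trained as a language model over such code
# sequences, predicting the next drum code from past drum codes and the
# conditioning (drumless and beat-related) information.
```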