
Our accompaniment generation task (“audio-to-audio”) is different from the lyric-conditioned generation task (“text-to-audio”) in that the latter does not need to deal with the coordination between two audio recordings.
Specifically, we consider a drum accompaniment gener-
ation problem in our implementation, using a “drumless”
recording as the input and generating as the output a drum
track that involves the use of an entire drum kit. We use
this as an example task to investigate the audio-domain ac-
companiment generation problem for the following reasons. First, datasets used in musical source separation [17]
usually consist of an isolated drum stem along with stems
corresponding to other instruments. We can therefore eas-
ily merge the other stems to create paired data of drum-
less tracks and drum tracks as training data of our model.
(In musical terms, drumless, or “Minus Drums” songs are
recordings where the drum part has been taken out, which
corresponds nicely to our scenario.) Second, we expect that a drum accompaniment generation model can readily find applications in songwriting [18], as it allows a user (who may
not be familiar with drum playing or beat making) to focus
on the other non-drum tracks. Third, audio-domain drum
accompaniment generation poses interesting challenges as
the model needs to determine not only the drum patterns, which should be rhythmically consistent with the input, but also the drum sounds, which should be stylistically consistent with it. Moreover, the generated drum track is expected to
follow a steady tempo, which is a basic requirement for a
human drummer. We call our model the “JukeDrummer.”
As depicted in Figure 1, the proposed model architec-
ture contains an “audio encoder” (instead of the original
text encoder [7, 8]) named the drumless VQ encoder that
takes drum-free audio as input. In addition, we experiment with different ways to capitalize on a recently proposed audio-domain beat and downbeat tracking model [19] in a novel
beat-aware module that extracts beat-related information
from the input audio, so that the language model for gener-
ation (i.e., the Transformer) is better informed of the rhyth-
mic properties of the input. The specific model [19] was
trained on drumless recordings as well, befitting our task.
We extract features at different levels, including low-
level tracker embeddings, mid-level activation peaks, and
high-level beat/downbeat positions, and investigate which
one benefits the generation model the most.
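To make these three feature levels concrete, the sketch below is a minimal illustration with hypothetical function and variable names; it assumes only that the beat tracker of [19] exposes per-frame embeddings and frame-wise beat/downbeat activation curves, and is not the exact pipeline used in our implementation.

import numpy as np
from scipy.signal import find_peaks

def beat_conditioning_features(embeddings, beat_act, downbeat_act,
                               peak_height=0.3, min_dist=7):
    """Derive three levels of beat-related conditioning features.

    embeddings:   (T, D) per-frame tracker embeddings (low-level)
    beat_act:     (T,)   frame-wise beat activation curve in [0, 1]
    downbeat_act: (T,)   frame-wise downbeat activation curve in [0, 1]
    """
    # Low level: use the tracker's internal embeddings directly.
    low = embeddings

    # Mid level: keep the activation curves only at their local peaks,
    # zeroing out all other frames (a soft, frame-aligned rhythmic cue).
    mid = np.zeros((len(beat_act), 2), dtype=np.float32)
    for i, act in enumerate((beat_act, downbeat_act)):
        peaks, _ = find_peaks(act, height=peak_height, distance=min_dist)
        mid[peaks, i] = act[peaks]

    # High level: binary per-frame beat / downbeat position indicators.
    high = (mid > 0).astype(np.float32)
    return low, mid, high

# Toy usage with random stand-in data (T = 100 frames, D = 16 dims).
low, mid, high = beat_conditioning_features(
    np.random.randn(100, 16).astype(np.float32),
    np.random.rand(100).astype(np.float32),
    np.random.rand(100).astype(np.float32),
)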
Our contribution is four-fold. First, to the best of our knowledge, this work represents the first attempt at drum accompaniment generation with a full drum kit given drum-free
mixed audio. Second, we develop a lightweight audio-to-audio Jukebox variant that takes input audio of up to 24
seconds as conditioning and generates accompanying mu-
sic in the domain of Mel-spectrograms (Section 3). Third,
we experiment with different beat-related conditions in the
context of audio generation (Section 4). Finally, we report
objective and subjective evaluations demonstrating the ef-
fectiveness of the proposed model (Sections 6 & 7).¹

¹ We share our code and checkpoint at https://github.com/legoodmanner/jukedrummer. Moreover, we provide audio examples at the following demo page: https://legoodmanner.github.io/jukedrummer-demo/
Figure 1: Diagram of the proposed JukeDrummer model
for the inference stage. The training stage involves learning an additional Drum VQ Encoder and a Drumless VQ Decoder (see Figure 2) that are not used at inference time.
2. BACKGROUND
2.1 Related Work on Drum Generation
Conditional drum accompaniment generation has been
studied in the literature, but only in the symbolic domain
[20, 21], to the best of our knowledge. Dahale et al. [20]
used a Transformer encoder to generate an accompany-
ing symbolic drum pattern of 12 bars given a four-track,
melodic MIDI passage. Makris et al. [21] adopted instead
a sequence-to-sequence architecture with a bi-directional
long short-term memory (BLSTM) encoder extracting in-
formation from the melodic input and a Transformer de-
coder generating the drum track for up to 16 bars in MIDI
format. While symbolic-domain music generation has its
own challenges, it differs greatly from the audio-domain
counterpart studied in this paper, for it is not about gener-
ating sounds that can be readily listened to by humans.
Related tasks that have been attempted in the litera-
ture with deep learning include symbolic-domain gener-
ation of a monophonic drum track (i.e., kick drum only)
of multiple bars [4], symbolic-domain drum pattern gener-
ation [22–25], symbolic-domain drum track generation as
part of a multi-track MIDI [26–29], audio-domain one-shot
drum hit generation [30–34], audio-domain generation of a single bar of drum sounds spanning an entire drum kit [35], and
audio-domain drum loop generation [36]. Jukebox [7] gen-
erates a mixture of sounds that include drums, but not an
isolated drum track. By design, Jukebox neither takes any input audio as a condition nor generates accompaniments.
2.2 The Original Jukebox model
The main architecture of Jukebox [7] is composed of two
components: a multi-scale vector-quantized variational au-
toencoder (VQ-VAE) [37–41] and an autoregressive Trans-
former decoder [14, 15]. The VQ-VAE is for converting a
continuous-valued raw audio waveform into a sequence of
so-called discrete VQ codes, while the Transformer estab-
lishes a language model (LM) of the VQ codes capable of
generating new code sequences.
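For readers less familiar with this two-stage design, the following minimal PyTorch sketch (simplified toy modules with hypothetical names and sizes, not Jukebox's actual implementation) illustrates the idea: continuous latents are quantized to discrete codes by nearest-neighbor lookup in a codebook, and an autoregressive Transformer then models the resulting code sequence.

import torch
import torch.nn as nn

class ToyVQ(nn.Module):
    """Minimal vector quantizer: nearest-codebook-entry lookup only
    (the straight-through gradient and codebook losses needed to
    train a real VQ-VAE are omitted here)."""
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                              # z: (B, T, dim)
        # Squared distance between each latent and every codebook vector.
        d = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        codes = d.argmin(dim=-1)                       # discrete VQ codes, (B, T)
        return codes, self.codebook(codes)             # codes and quantized latents

class ToyCodeLM(nn.Module):
    """Autoregressive Transformer language model over VQ codes."""
    def __init__(self, num_codes=512, dim=64, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(num_codes, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, num_codes)

    def forward(self, codes):                          # codes: (B, T)
        T = codes.size(1)
        # Causal mask so each position attends only to earlier codes.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.blocks(self.embed(codes), mask=causal)
        return self.head(h)                            # next-code logits, (B, T, num_codes)

# Toy usage: quantize 4 latent frames, then score the next code at each step.
latents = torch.randn(2, 4, 64)                        # stand-in for encoder output
codes, _ = ToyVQ()(latents)
logits = ToyCodeLM()(codes)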