
Our accompaniment generation task (“audio-to-audio”) is different from the lyric-conditioned generation task (“text-to-audio”) in that the latter does not need to deal with the coordination between two audio recordings.
Specifically, we consider a drum accompaniment gener-
ation problem in our implementation, using a “drumless”
recording as the input and generating as the output a drum
track that involves the use of an entire drum kit. We use
this as an example task to investigate the audio-domain ac-
companiment generation problem for the following reasons. First, datasets used in musical source separation [17]
usually consist of an isolated drum stem along with stems
corresponding to other instruments. We can therefore eas-
ily merge the other stems to create paired data of drum-
less tracks and drum tracks as training data of our model.
(In musical terms, drumless, or “Minus Drums” songs are
recordings where the drum part has been taken out, which
corresponds nicely to our scenario.) Second, we expect that a drum accompaniment generation model can readily find applications in songwriting [18], as it allows a user (who may
not be familiar with drum playing or beat making) to focus
on the other non-drum tracks. Third, audio-domain drum
accompaniment generation poses interesting challenges as
the model needs to determine not only the drum patterns, which should be rhythmically consistent with the input, but also the drum sounds, which should be stylistically consistent with it. Moreover, the generated drum track is expected to
follow a steady tempo, which is a basic requirement for a
human drummer. We call our model the “JukeDrummer.”
As depicted in Figure 1, the proposed model architec-
ture contains an “audio encoder” (instead of the original
text encoder [7, 8]) named the drumless VQ encoder that
takes drum-free audio as input. In addition, we experiment with different ways to capitalize on a recently proposed audio-domain beat and downbeat tracking model [19] in a novel
beat-aware module that extracts beat-related information
from the input audio, so that the language model for gener-
ation (i.e., the Transformer) is better informed of the rhyth-
mic properties of the input. The specific model [19] was
trained on drumless recordings as well, befitting our task.
We extract features at different levels, including low-
level tracker embeddings, mid-level activation peaks, and
high-level beat/downbeat positions, and investigate which
one benefits the generation model the most.
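To make these three feature levels concrete, the sketch below is a minimal illustration with hypothetical function and variable names; it assumes only that the beat tracker of [19] exposes per-frame embeddings and frame-wise beat/downbeat activation curves, and is not the exact pipeline used in our implementation.

import numpy as np
from scipy.signal import find_peaks

def beat_conditioning_features(embeddings, beat_act, downbeat_act,
                               peak_height=0.3, min_dist=7):
    """Derive three levels of beat-related conditioning features.

    embeddings:   (T, D) per-frame tracker embeddings (low-level)
    beat_act:     (T,)   frame-wise beat activation curve in [0, 1]
    downbeat_act: (T,)   frame-wise downbeat activation curve in [0, 1]
    """
    # Low level: use the tracker's internal embeddings directly.
    low = embeddings

    # Mid level: keep the activation curves only at their local peaks,
    # zeroing out all other frames (a soft, frame-aligned rhythmic cue).
    mid = np.zeros((len(beat_act), 2), dtype=np.float32)
    for i, act in enumerate((beat_act, downbeat_act)):
        peaks, _ = find_peaks(act, height=peak_height, distance=min_dist)
        mid[peaks, i] = act[peaks]

    # High level: binary per-frame beat / downbeat position indicators.
    high = (mid > 0).astype(np.float32)
    return low, mid, high

# Toy usage with random stand-in data (T = 100 frames, D = 16 dims).
low, mid, high = beat_conditioning_features(
    np.random.randn(100, 16).astype(np.float32),
    np.random.rand(100).astype(np.float32),
    np.random.rand(100).astype(np.float32),
)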
Our contribution is four-fold. First, to the best of our knowledge, this work represents the first attempt at drum accompaniment generation with a full drum kit given drum-free
mixed audio. Second, we develop a lightweight audio-to-audio Jukebox variant that takes input audio of up to 24
seconds as conditioning and generates accompanying mu-
sic in the domain of Mel-spectrograms (Section 3). Third,
we experiment with different beat-related conditions in the
context of audio generation (Section 4). Finally, we report
objective and subjective evaluations demonstrating the ef-
fectiveness of the proposed model (Sections 6 & 7).¹

¹ We share our code and checkpoint at https://github.com/legoodmanner/jukedrummer. Moreover, we provide audio examples at the following demo page: https://legoodmanner.github.io/jukedrummer-demo/
Figure 1: Diagram of the proposed JukeDrummer model
for the inference stage. The training stage involves learning an additional Drum VQ Encoder and a Drumless VQ Decoder (see Figure 2) that are not used at inference time.
2. BACKGROUND
2.1 Related Work on Drum Generation
Conditional drum accompaniment generation has been
studied in the literature, but only in the symbolic domain
[20, 21], to the best of our knowledge. Dahale et al. [20]
used a Transformer encoder to generate an accompany-
ing symbolic drum pattern of 12 bars given a four-track,
melodic MIDI passage. Makris et al. [21] adopted instead
a sequence-to-sequence architecture with a bi-directional
long short-term memory (BLSTM) encoder extracting in-
formation from the melodic input and a Transformer de-
coder generating the drum track for up to 16 bars in MIDI
format. While symbolic-domain music generation has its
own challenges, it differs greatly from the audio-domain
counterpart studied in this paper, for it is not about gener-
ating sounds that can be readily listened to by humans.
Related tasks that have been attempted in the litera-
ture with deep learning include symbolic-domain gener-
ation of a monophonic drum track (i.e., kick drum only)
of multiple bars [4], symbolic-domain drum pattern gener-
ation [22–25], symbolic-domain drum track generation as
part of a multi-track MIDI [26–29], audio-domain one-shot
drum hit generation [30–34], audio-domain generation of a single bar of drum sounds spanning an entire drum kit [35], and
audio-domain drum loop generation [36]. Jukebox [7] gen-
erates a mixture of sounds that include drums, but not an
isolated drum track. By design, Jukebox neither takes any input audio as a condition nor generates accompaniments.
2.2 The Original Jukebox model
The main architecture of Jukebox [7] is composed of two
components: a multi-scale vector-quantized variational au-
toencoder (VQ-VAE) [37–41] and an autoregressive Trans-
former decoder [14, 15]. The VQ-VAE is for converting a
continuous-valued raw audio waveform into a sequence of
so-called discrete VQ codes, while the Transformer estab-
lishes a language model (LM) of the VQ codes capable of
generating new code sequences.
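For readers less familiar with this two-stage design, the following minimal PyTorch sketch (simplified toy modules with hypothetical names and sizes, not Jukebox's actual implementation) illustrates the idea: continuous latents are quantized to discrete codes by nearest-neighbor lookup in a codebook, and an autoregressive Transformer then models the resulting code sequence.

import torch
import torch.nn as nn

class ToyVQ(nn.Module):
    """Minimal vector quantizer: nearest-codebook-entry lookup only
    (the straight-through gradient and codebook losses needed to
    train a real VQ-VAE are omitted here)."""
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                              # z: (B, T, dim)
        # Squared distance between each latent and every codebook vector.
        d = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        codes = d.argmin(dim=-1)                       # discrete VQ codes, (B, T)
        return codes, self.codebook(codes)             # codes and quantized latents

class ToyCodeLM(nn.Module):
    """Autoregressive Transformer language model over VQ codes."""
    def __init__(self, num_codes=512, dim=64, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(num_codes, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, num_codes)

    def forward(self, codes):                          # codes: (B, T)
        T = codes.size(1)
        # Causal mask so each position attends only to earlier codes.
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.blocks(self.embed(codes), mask=causal)
        return self.head(h)                            # next-code logits, (B, T, num_codes)

# Toy usage: quantize 4 latent frames, then score the next code at each step.
latents = torch.randn(2, 4, 64)                        # stand-in for encoder output
codes, _ = ToyVQ()(latents)
logits = ToyCodeLM()(codes)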