
inform the model of the intended musical form.
We improve upon the VLI model [20] in the following
ways to realize structure-aware infilling. First, we use the
classic Transformer [26–28] instead of the more sophisti-
cated XLNet [24] as the model backbone, to make it eas-
ier to add a conditioning module to exploit the structural
context. To improve the capability of the Transformer to
account for bi-directional contexts, we propose two novel
components, the bar-count-down technique (Section 3.2)
and order embeddings (Section 3.3), which respectively
give the model explicit control over the length of the generated music and a convenient way to attend to the future context. Second, inspired by the Theme Transformer [29], we adopt a sequence-to-sequence (seq2seq) Transformer encoder/decoder architecture rather than a decoder-only one, using the cross-attention between the encoder and decoder as the conditioning module to account for the structural context. Moreover, we propose
an attention-selecting module that allows the Transformer
to access multiple structural contexts while infilling different parts of a music piece, which can be useful at both training and inference time (Section 3.4).
For evaluation, we compare our model with two strong
baselines, the VLI [20] and the work of Hsu & Chang [21],
on the task of symbolic-domain melody infilling of 4-bar
content using the POP909 dataset [30] and the associated
structural labels from Dai et al. [31]. With objective and
subjective analyses, we show that our model greatly out-
performs the baselines in the structure completeness of the
generated pieces, without degrading local smoothness.
We set up a webpage for demos1 and open-source our code in a public GitHub repository.2
2. RELATED WORK
Generating missing parts given the surrounding contexts has been attempted in early works. DeepBach [17] predicts missing notes based on the notes around them. They
use two recurrent neural networks (RNNs) to capture the
past and future contexts, and a feedforward neural network
to capture the current context from notes with the same
temporal position as the target note. COCONET [16] trains
a convolutional neural network (CNN) to complete partial
musical scores and explores the use of blocked Gibbs sam-
pling as an analog to rewriting. They encode the music
data with the piano-roll representation and treat it as a fixed-size image, so the model can only perform fixed-length music infilling. Inpainting Net [15] uses an RNN
to integrate the temporal information from a variational
auto-encoder (VAE) [32] for bar-wise generation. Wei et al. [23] build their model with a similar concept as Inpainting Net, using a contrastive loss [33, 34] in training to improve the infilling quality. Some Transformer-based
models have also been proposed to achieve music infilling.
models have also been proposed to achieve music infilling.
Ippolito et al. [18] concatenate the past and future contexts with a special separator token. They keep the original positional encoding of the contexts and the missing segment, which again constrains the lengths of the given contexts and the generated sequence to be fixed.
1 https://tanchihpin0517.github.io/structure-aware_infilling
2 https://github.com/tanchihpin0517/structure-aware_infilling
We see that these infilling models impose some data assumptions and thereby have certain
restrictions, e.g., the length of the input sequence cannot
be arbitrary, or the missing segment needs to be complete
bars. The work of Hsu & Chang [21] is free of these restrictions: they use two Transformer encoders to capture the past and future contexts, respectively, and generate results with a Transformer decoder. The VLI model [20] can also realize variable-length infilling. However, to the best of our knowledge, no existing model has explicitly considered
structure-related information for infilling.
Structure-based conditioning has been explored only re-
cently by Shi et al. [29] in their Theme Transformer model
for sequential music generation. They use a seq2seq Trans-
former to account for not only the past context but also
an additional pre-given theme segment that is supposed to
manifest itself multiple times in the model's generation result. The present work can be considered an extension of their work to the scenario of music infilling.
3. METHODOLOGY
Given a past context C_past and a future context C_future, the general, structure-agnostic music infilling task entails generating an infilled segment T that interconnects C_past and C_future smoothly, preferably in a musically meaningful way. When using an autoregressive generative model such as the Transformer as the model backbone, the training objective is to maximize the following likelihood function:

∏_{0<k≤|T|} P(t_k | t_{<k}, C_past, C_future),   (1)

where t_k denotes the element of T at timestep k, t_{<k} the subsequence consisting of all the previously generated elements, and |·| the length of a sequence.
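To make the objective concrete: the product in Eq. (1) is in practice maximized by minimizing the summed negative log-likelihood of the infilled tokens, the usual cross-entropy loss for autoregressive models. The sketch below uses made-up per-token probabilities purely for illustration.

```python
import math

# Hypothetical per-token probabilities P(t_k | t_<k, C_past, C_future)
# for an infilled segment T of length 3 (illustrative values only).
per_token_probs = [0.8, 0.5, 0.9]

# The product in Eq. (1) ...
likelihood = math.prod(per_token_probs)

# ... is maximized by minimizing the summed negative log-likelihood,
# which equals -log of the product.
nll = -sum(math.log(p) for p in per_token_probs)
```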
Extending from Eq. (1), we propose and study in this paper a special case, called structure-aware music infilling, where an additional segment G representing the structural context is given, leading to the new objective:

∏_{0<k≤|T|} P(t_k | t_{<k}, C_past, C_future; G).   (2)
As depicted in Figure 2(a), our model is based on a Transformer with the encoder-decoder architecture. It uses the decoder to self-attend to the prompt (i.e., C_past and C_future) and the previously generated elements (i.e., t_{<k}), and the encoder to cross-attend to the structural context G. We provide details of the proposed model below.
Note that we do not require the lengths of all the involved segments to be fixed; namely, |T|, |C_past|, |C_future|, and |G| are all variables in our setting.
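As a rough sketch of the conditioning mechanism, cross-attention reduces to scaled dot-product attention in which a decoder state queries the encoder states computed from G. The minimal pure-Python version below (toy 2-dimensional vectors and our own naming, not the released code) illustrates the computation:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, keys, values):
    # Scaled dot-product attention: one decoder query attends over the
    # encoder outputs (keys/values) derived from the structural context G.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# With a zero query, all keys score equally, so the output is the mean of the values.
out = cross_attention([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```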
3.1 REMI-based Token Representation
To incorporate structure-related information into our representation of the music data, we devise an extension of the
REMI-based representation [8] that comprises five types