MELODY INFILLING WITH USER-PROVIDED STRUCTURAL CONTEXT
Chih-Pin Tan1,2, Alvin W.Y. Su1, Yi-Hsuan Yang2,3
1National Cheng Kung University, 2Academia Sinica, 3Taiwan AI Labs
p76091551@gs.ncku.edu.tw, alvinsu@mail.ncku.edu.tw, yang@citi.sinica.edu.tw
ABSTRACT
This paper proposes a novel Transformer-based model for
music score infilling, to generate a music passage that fills
in the gap between given past and future contexts. While
existing infilling approaches can generate a passage that
connects smoothly with the given local contexts, they do
not take into account the musical form or structure of the
piece and may therefore produce overly smooth results.
To address this issue, we propose a structure-aware condi-
tioning approach that employs a novel attention-selecting
module to supply user-provided structure-related informa-
tion to the Transformer for infilling. With both objective
and subjective evaluations, we show that the proposed
model can harness the structural information effectively
and generate pop-style melodies of higher quality than
two existing structure-agnostic infilling models.
1. INTRODUCTION
In recent years, machine learning techniques have been
widely applied to symbolic music generation. A large
number of models perform sequential generation by account-
ing for only the past context, i.e., the generated music de-
pends on only the preceding musical content [1–14]. While
sequential generation has its use cases, it does not
align with typical human compositional practice, which
can be non-sequential in nature. Musicians often write
motifs or small pieces first to get inspiration, before
working on the middle parts to connect them.
Hence, we focus on the scenario when both the past and
future contexts are given, which is called music score in-
filling or inpainting [15]. As shown in Figure 1(a), the task
is to let models fill in the missing part between the two
given segments. Prompt-based conditioning approaches
[15–23] have been applied to such a task in recent years,
treating the two given segments as the “prompt”. Among
them, the variable-length infilling model (VLI) [20] ob-
tains promising results by adding special positional encod-
ings to XLNet [24], a permutation-based language model
that is naturally suitable for generative tasks with given
bidirectional contexts. The experiments of VLI show that
their model is capable of connecting the past and future
contexts smoothly at the local level when infilling solo
piano passages of up to 4 bars (measures).

Figure 1: Comparison between (a) structure-agnostic and
(b) structure-aware approaches for music score infilling.

© Chih-Pin Tan, Alvin W.Y. Su, and Yi-Hsuan Yang. Licensed
under a Creative Commons Attribution 4.0 International License
(CC BY 4.0). Attribution: Chih-Pin Tan, Alvin W.Y. Su, and
Yi-Hsuan Yang, “Melody Infilling with User-Provided Structural
Context”, in Proc. of the 23rd Int. Society for Music Information
Retrieval Conf., Bengaluru, India, 2022.
Considering composers usually write musical pieces in
a hierarchical manner [25], we note that prompt-based con-
ditioning approaches have a strong limitation: they gen-
erate results by considering only the local smoothness
among the past context, the future context, and the generated
result, without taking into account the overall musical form or structure of the
music. For instance, a composer may like to write a song
in a musical form of ABA’B’. If we consider the concate-
nation of the segments corresponding to A and B (i.e., AB)
as the past context and the segment corresponding to B'
as the future context, and feed them to an existing infilling
model, the model may generate a sequence whose melody
and chord progression resemble those of the segments
corresponding to B and B', not the intended repetition or
variation of the segment corresponding to A.
To address this issue, we propose in this paper a novel
structure-aware setting for music infilling. As shown in
Figure 1(b), besides the past and future contexts exploited
by conventional structure-agnostic, prompt-based models,
our approach additionally capitalizes, for the infilling task,
on the structural context: a music segment corresponding to a
certain part of the whole piece that is supposed to share
the same structure label (such as A or B) with the missing
segment. Accordingly, besides local smoothness, the
model also needs to consider the similarity between the
infilled segment and the structural context. Here, we assume
the structural context is provided by a user, not generated
by a model. For example, the user may designate the
segment corresponding to A as the structural context, thereby
informing the model of the intended musical form.
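To make this setup concrete, the following minimal sketch spells out how the ABA'B' example above maps onto the inputs of a structure-aware infilling model; the field names and values are hypothetical illustrations, not part of the paper's data format.

```python
# Hypothetical illustration of structure-aware infilling for a piece with the
# intended form A B A' B'.  Each value stands for a token sequence in practice.
infilling_task = {
    "past_context":       ["segment_A", "segment_B"],  # C_past: A followed by B
    "future_context":     ["segment_B_prime"],         # C_future: B'
    "structural_context": ["segment_A"],               # G, chosen by the user
    "missing_segment":    "segment_A_prime",           # target T to be generated
}

# A structure-agnostic model conditions only on past/future and tends to echo
# the B material; conditioning additionally on G steers the model toward a
# repetition or variation of A, matching the intended form.
```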
We improve upon the VLI model [20] in the following
ways to realize structure-aware infilling. First, we use the
classic Transformer [26–28] instead of the more sophisti-
cated XLNet [24] as the model backbone, to make it eas-
ier to add a conditioning module to exploit the structural
context. To improve the capability of the Transformer to
account for bi-directional contexts, we propose two novel
components, the bar-count-down technique (Section 3.2)
and order embeddings (Section 3.3), which respectively
give the model an explicit control of the length of the gen-
erated music, and a convenient way to attend to the fu-
ture context. Second, inspired by the Theme Transformer
[29], we adopt a sequence-to-sequence (seq2seq) Transformer
encoder/decoder architecture rather than a decoder-only
architecture, using the cross-attention be-
tween the encoder and decoder as the conditioning module
to account for the structural context. Moreover, we propose
an attention-selecting module that allows the Transformer
to access multiple structural contexts while infilling differ-
ent parts of a music piece, which can be useful at both
training and inference time (Section 3.4).
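As a rough illustration of the bar-count-down idea (the actual token design is specified in Section 3.2), the sketch below shows how bar tokens of the missing segment could carry the number of remaining bars, so the model knows exactly when to hand over to the future context; the function and token names are hypothetical, not the authors' implementation.

```python
def with_bar_countdown(bars):
    """Sketch: prefix each bar of the segment to be infilled with a countdown
    token (Bar_Countdown_4, ..., Bar_Countdown_1 for a 4-bar gap), giving the
    model explicit control over the length of the generated music.

    `bars` is a list of per-bar token lists in a REMI-like format."""
    tokens = []
    for i, bar_tokens in enumerate(bars):
        remaining = len(bars) - i                    # e.g. 4, 3, 2, 1
        tokens.append(f"Bar_Countdown_{remaining}")  # hypothetical token name
        tokens.extend(bar_tokens)
    return tokens

# For a 4-bar gap, generation starts at Bar_Countdown_4 and the model learns
# to stop after completing the bar tagged Bar_Countdown_1.
example = with_bar_countdown([["Pos_0", "Pitch_60", "Dur_4"]] * 4)
```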
For evaluation, we compare our model with two strong
baselines, the VLI [20] and the work of Hsu & Chang [21],
on the task of symbolic-domain melody infilling of 4-bar
content using the POP909 dataset [30] and the associated
structural labels from Dai et al. [31]. With objective and
subjective analyses, we show that our model greatly out-
performs the baselines in the structure completeness of the
generated pieces, without degrading local smoothness.
We set up a webpage for demos 1 and open-source our
code at a public GitHub repository. 2
1 https://tanchihpin0517.github.io/structure-aware_infilling
2 https://github.com/tanchihpin0517/structure-aware_infilling
2. RELATED WORK
Generating missing parts from given surrounding contexts
has been attempted in earlier works. DeepBach [17] pre-
dicts missing notes based on the notes around them. They
use two recurrent neural networks (RNNs) to capture the
past and future contexts, and a feedforward neural network
to capture the current context from notes with the same
temporal position as the target note. COCONET [16] trains
a convolutional neural network (CNN) to complete partial
musical scores and explores the use of blocked Gibbs sam-
pling as an analog to rewriting. They encode the music
data with the piano roll representation and treat that as a
fixed-size image, so the model can only perform fixed-
length music infilling. Inpainting Net [15] uses an RNN
to integrate the temporal information from a variational
auto-encoder (VAE) [32] for bar-wise generation. Wei et
al. [23] build their model on a concept similar to Inpainting
Net and use a contrastive loss [33, 34] for training
to improve the infilling quality. Some Transformer-based
models have also been proposed to achieve music infilling.
Ippolito et al. [18] concatenate the past and future context
with a special separator token. They keep the original po-
sitional encoding of the contexts and the missing segment,
which again requires the lengths of the given contexts and
the generated sequence to be fixed. We see that these infilling mod-
els impose some data assumptions and thereby have certain
restrictions, e.g., the length of the input sequence cannot
be arbitrary, or the missing segment needs to be complete
bars. The work of Hsu & Chang [21] is free of these re-
strictions. They use two Transformer encoders to capture
the past and future context respectively and generate re-
sults with a Transformer decoder. The VLI model [20] can
also realize variable-length infilling. However, to the best of
our knowledge, no existing model has explicitly considered
structure-related information for infilling.
Structure-based conditioning has been explored only re-
cently by Shi et al. [29] in their Theme Transformer model
for sequential music generation. They use a seq2seq Trans-
former to account for not only the past context but also
an additional pre-given theme segment that is supposed to
manifest itself multiple times in the model’s generation re-
sult. The present work can be considered as an extension
of their work to the scenario of music infilling.
3. METHODOLOGY
Given a past context Cpast and a future context Cfuture, the
general, structure-agnostic music infilling task entails
generating an infilled segment T that interconnects Cpast
and Cfuture smoothly, preferably in a musically meaningful
way. When using an autoregressive generative model such
as the Transformer as the model backbone, the training
objective is to maximize the following likelihood function:
\[
\prod_{0 < k \le |T|} P\left(t_k \mid t_{<k}, C_{\mathrm{past}}, C_{\mathrm{future}}\right), \tag{1}
\]
where t_k denotes the element of T at timestep k, t_{<k} the
subsequence consisting of all the previously generated
elements, and |·| the length of a sequence.
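In practice, maximizing Eq. (1) amounts to minimizing the token-level cross-entropy of the infilled segment under teacher forcing, with the contexts fed to the model but excluded from the loss. The sketch below illustrates this with a generic autoregressive model in PyTorch; the tensor layout and masking scheme are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def infilling_nll(model, input_ids, target_ids, loss_mask):
    """Negative log-likelihood corresponding to Eq. (1), sketched.

    input_ids  : (B, L) token ids covering C_past, the (shifted) infilled
                 segment T, and C_future, arranged as the model expects.
    target_ids : (B, L) next-token targets aligned with input_ids.
    loss_mask  : (B, L) 1.0 where the target belongs to T, 0.0 elsewhere,
                 so the contexts condition the model but are not predicted.
    """
    logits = model(input_ids)                       # (B, L, vocab)
    nll = F.cross_entropy(
        logits.transpose(1, 2), target_ids, reduction="none"
    )                                               # (B, L) per-token NLL
    return (nll * loss_mask).sum() / loss_mask.sum()
```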
Extending from Eq. (1), we propose and study in this
paper a special case, called structure-aware music infill-
ing, where an additional segment G representing the struc-
tural context is given, leading to the new objective:
\[
\prod_{0 < k \le |T|} P\left(t_k \mid t_{<k}, C_{\mathrm{past}}, C_{\mathrm{future}}; G\right). \tag{2}
\]
As depicted in Figure 2(a), our model is based on Trans-
former with the encoder-decoder architecture. It uses the
decoder to self-attend to the prompt (i.e., Cpast and Cfuture)
and the previously generated elements (i.e., t_{<k}), and the
encoder to cross-attend to the structural context G. We
provide details of the proposed model below.
Note that we do not require the lengths of the involved
segments to be fixed; namely, |T|, |Cpast|, |Cfuture|, and |G|
are all variables in our setting.
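A minimal sketch of how the conditioning in Eq. (2) can be realized with an off-the-shelf encoder/decoder Transformer is given below; the use of torch.nn.Transformer, the embedding size, and the token layout are assumptions for illustration only. The paper's model additionally involves bar-count-down tokens, order embeddings, and the attention-selecting module (Sections 3.2 to 3.4), none of which are reproduced here.

```python
import torch
import torch.nn as nn

class StructureAwareInfiller(nn.Module):
    """Sketch of an encoder/decoder setup for Eq. (2): the encoder reads the
    structural context G, while the decoder reads the prompt (C_past, C_future)
    plus the tokens generated so far and cross-attends to the encoded G.
    Positional/order embeddings are omitted for brevity."""

    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, structural_ctx_ids, decoder_ids):
        # structural_ctx_ids: (B, |G|)    decoder_ids: (B, L) prompt + t_{<k}
        src = self.embed(structural_ctx_ids)   # encoder input: G
        tgt = self.embed(decoder_ids)          # decoder input
        L = decoder_ids.size(1)
        causal = torch.triu(                   # causal mask over decoder tokens
            torch.full((L, L), float("-inf"), device=decoder_ids.device),
            diagonal=1,
        )
        h = self.transformer(src, tgt, tgt_mask=causal)
        return self.out(h)                     # (B, L, vocab) next-token logits
```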
3.1 REMI-based Token Representation
To incorporate structure-related information into our repre-
sentation of the music data, we devise an extension of the
REMI-based representation [8] that comprises five types