
inform the model of the intended musical form.
We improve upon the VLI model [20] in the following
ways to realize structure-aware infilling. First, we use the
classic Transformer [26–28] instead of the more sophisti-
cated XLNet [24] as the model backbone, to make it eas-
ier to add a conditioning module to exploit the structural
context. To improve the capability of the Transformer to
account for bi-directional contexts, we propose two novel
components, the bar-count-down technique (Section 3.2)
and order embeddings (Section 3.3), which respectively
give the model explicit control over the length of the generated music and a convenient way to attend to the future context. Second, inspired by the Theme Transformer [29], we adopt a sequence-to-sequence (seq2seq) Transformer encoder/decoder architecture rather than a decoder-only one, using the cross-attention between the encoder and decoder as the conditioning module to account for the structural context. Moreover, we propose
an attention-selecting module that allows the Transformer
to access multiple structural contexts while infilling different parts of a music piece, which can be useful at both training and inference time (Section 3.4).
For evaluation, we compare our model with two strong
baselines, the VLI [20] and the work of Hsu & Chang [21],
on the task of symbolic-domain melody infilling of 4-bar
content using the POP909 dataset [30] and the associated
structural labels from Dai et al. [31]. With objective and
subjective analyses, we show that our model greatly out-
performs the baselines in the structure completeness of the
generated pieces, without degrading local smoothness.
We set up a webpage for demos1 and open-source our code in a public GitHub repository.2
2. RELATED WORK
Generating missing parts given the surrounding contexts has been attempted in early works. DeepBach [17] predicts missing notes based on the notes around them. They
use two recurrent neural networks (RNNs) to capture the
past and future contexts, and a feedforward neural network
to capture the current context from notes with the same
temporal position as the target note. COCONET [16] trains
a convolutional neural network (CNN) to complete partial
musical scores and explores the use of blocked Gibbs sam-
pling as an analog to rewriting. They encode the music
data with the piano-roll representation and treat it as a fixed-size image, so the model can only perform fixed-length music infilling. Inpainting Net [15] uses an RNN
to integrate the temporal information from a variational
auto-encoder (VAE) [32] for bar-wise generation. Wei et al. [23] build their model with a similar concept as Inpainting Net, using a contrastive loss [33, 34] in training to improve the infilling quality. Some Transformer-based
models have also been proposed to achieve music infilling.
models have also been proposed to achieve music infilling.
Ippolito et al. [18] concatenate the past and future contexts with a special separator token. They keep the original positional encoding of the contexts and the missing segment, which again constrains the lengths of the given contexts and the generated sequence to be fixed.
1 https://tanchihpin0517.github.io/structure-aware_infilling
2 https://github.com/tanchihpin0517/structure-aware_infilling
We see that these infilling models impose some data assumptions and thereby have certain
restrictions, e.g., the length of the input sequence cannot
be arbitrary, or the missing segment needs to be complete
bars. The work of Hsu & Chang [21] is free of these restrictions: they use two Transformer encoders to capture the past and future contexts, respectively, and generate results with a Transformer decoder. The VLI model [20] can also realize variable-length infilling. However, to the best of our knowledge, no existing model has explicitly considered
structure-related information for infilling.
Structure-based conditioning has been explored only re-
cently by Shi et al. [29] in their Theme Transformer model
for sequential music generation. They use a seq2seq Trans-
former to account for not only the past context but also
an additional pre-given theme segment that is supposed to
manifest itself multiple times in the model's generation result. The present work can be considered an extension of their work to the scenario of music infilling.
3. METHODOLOGY
Given a past context C_past and a future context C_future, the general, structure-agnostic music infilling task entails generating an infilled segment T that interconnects C_past and C_future smoothly, preferably in a musically meaningful way. When using an autoregressive generative model such as the Transformer as the model backbone, the training objective is to maximize the following likelihood function:

∏_{0<k≤|T|} P(t_k | t_{<k}, C_past, C_future),   (1)

where t_k denotes the element of T at timestep k, t_{<k} the subsequence consisting of all the previously generated elements, and |·| the length of a sequence.
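To make the objective concrete: the product in Eq. (1) is in practice maximized by minimizing the summed negative log-likelihood of the infilled tokens, the usual cross-entropy loss for autoregressive models. The sketch below uses made-up per-token probabilities purely for illustration.

```python
import math

# Hypothetical per-token probabilities P(t_k | t_<k, C_past, C_future)
# for an infilled segment T of length 3 (illustrative values only).
per_token_probs = [0.8, 0.5, 0.9]

# The product in Eq. (1) ...
likelihood = math.prod(per_token_probs)

# ... is maximized by minimizing the summed negative log-likelihood,
# which equals -log of the product.
nll = -sum(math.log(p) for p in per_token_probs)
```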
Extending from Eq. (1), we propose and study in this paper a special case, called structure-aware music infilling, where an additional segment G representing the structural context is given, leading to the new objective:

∏_{0<k≤|T|} P(t_k | t_{<k}, C_past, C_future; G).   (2)
As depicted in Figure 2(a), our model is based on a Transformer with the encoder-decoder architecture. It uses the decoder to self-attend to the prompt (i.e., C_past and C_future) and the previously generated elements (i.e., t_{<k}), and the encoder to cross-attend to the structural context G. We provide details of the proposed model below.
Note that we do not require the lengths of all the involved segments to be fixed; namely, |T|, |C_past|, |C_future|, and |G| are all variables in our setting.
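As a rough sketch of the conditioning mechanism, cross-attention reduces to scaled dot-product attention in which a decoder state queries the encoder states computed from G. The minimal pure-Python version below (toy 2-dimensional vectors and our own naming, not the released code) illustrates the computation:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, keys, values):
    # Scaled dot-product attention: one decoder query attends over the
    # encoder outputs (keys/values) derived from the structural context G.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# With a zero query, all keys score equally, so the output is the mean of the values.
out = cross_attention([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```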
3.1 REMI-based Token Representation
To incorporate structure-related information into our representation of the music data, we devise an extension of the
REMI-based representation [8] that comprises five types