
similarity function (e.g., Euclidean distance) or a structure constraint. These works include Wang et al. [24], who present a method to learn a common subspace based on adversarial learning for cross-modal retrieval, and Peng et al. [25], who propose a modality-specific cross-modal similarity measurement approach for tasks including cross-modal retrieval. In this work, we experiment with different losses on the coordinate model, as it achieves the best performance among all three types of models.
Music Tagging and Captioning Tasks.
We notice some pioneering studies on music tagging or captioning tasks [26, 27, 28]. Manco et al. [29] use a private production music dataset, with music clips of only 30 to 360 seconds in length and captions containing between 3 and 22 tokens. Their proposed model is an encoder-decoder network with a multimodal CNN-LSTM encoder with temporal attention and an LSTM decoder. Our proposed task differs from tagging and captioning tasks, as we aim to translate semantics and preserve sentiment between modalities.
Existing public music-related datasets mainly contain simple music tags. The AudioSet dataset [30] is a large-scale collection of human-labeled 10-second sound clips (not music recordings) drawn from YouTube videos. This dataset only has descriptions for categories, not for individual sounds. The MTG-Jamendo dataset [31] contains over 55,000 full audio tracks with 195 tags covering genre, instrument, and mood/theme categories. Oramas et al. [32] describe a dataset of Amazon album reviews. However, users' reviews may not necessarily describe the actual contents of music recordings. Cai et al. [16] formulate music tagging as a multi-class classification problem, using the MajorMiner dataset, in which each music recording is associated with tags collected from different users. Zhang et al. [17] study bidirectional music-sentence retrieval and generation tasks. Their dataset contains 16,257 folk songs paired with metadata, including key, meter, and style as keywords. However, such text describing music focuses only on specific attributes and has limited writing styles.
3 Data Collection and Analysis
In order to generate descriptive texts from music recordings, a dataset containing aligned music-text pairs is required for model training. Although there are several public music/audio datasets with tags or user reviews (see Section 2), they are unfortunately not suitable for our task for the following reasons: (1) From the text side, current datasets only have pre-defined tags for music pieces, rather than descriptive texts about the music content. (2) From the audio side, some clips are too short and lack a musical melody. In light of this, we collect a new dataset for the music-to-text synaesthesia task.
Data Collection and Post-Processing.
We collected the data from EARSENSE,2 a website that hosts a database for chamber music. EARSENSE provides comprehensive meta-information for each music composition, including composers, works, and related multimedia resources. Each composition is also accompanied by an introductory article from professional experts, with detailed explanations, comments, or analyses of its movements. Figure 1 shows an illustrative example of the music-text pairs. A typical music composition contains several movements. Each movement has its own title that normally contains tempo markings or terms such as minuet and trio; in some cases, it has a unique name speaking to the larger story of the entire work. As movements have their own form, key, and mood, and often contain a complete resolution or ending, we treat each movement as the basic unit in this work.3
Figure 2: Pairwise similarity matrices of (a) music representations from a self-reconstruction autoencoder and (b) raw text, measured by cosine similarity and BLEU score.
We collected 2,380 text descriptions in total, of which 1,955 have corresponding music pieces. We converted the tempo markings in the titles of movements into four universal categories ranging from slow to super fast. These categories are then added to the movements as tags by directly checking whether a title contains tokens from our list.
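As a concrete illustration of this conversion step, a minimal sketch is given below; it assumes a simple keyword lookup over the movement title, and the four category names and tempo-token lists are illustrative placeholders rather than the exact vocabulary used for our dataset.

# Sketch of the tempo-tag conversion (category names and token lists are illustrative).
TEMPO_CATEGORIES = {
    "slow": ["grave", "largo", "lento", "adagio"],
    "moderate": ["andante", "andantino", "moderato", "allegretto"],
    "fast": ["allegro", "vivace"],
    "super fast": ["presto", "prestissimo", "vivacissimo"],
}

def tempo_tag(movement_title):
    """Return a coarse tempo category if the title contains a known tempo token, else None."""
    tokens = movement_title.lower().replace(":", " ").replace(",", " ").split()
    for category, markings in TEMPO_CATEGORIES.items():
        if any(token in markings for token in tokens):
            return category
    return None  # no recognized tempo marking in the title

# e.g., tempo_tag("Rondo: Allegro") -> "fast"; tempo_tag("Adagio cantabile") -> "slow"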
Preliminary Exploration.
We observe that the lengths of 95% of the collected music pieces vary from 2.5 to 14 minutes, and the corresponding descriptive texts contain 14 to 192 tokens. We provide more details of our data statistics in Appendix A.
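To make the pairwise similarity analysis visualized in Figure 2 concrete, the sketch below shows one plausible way such matrices could be computed, assuming the music embeddings produced by a self-reconstruction autoencoder are already available as a NumPy array; the function names and exact similarity choices are placeholders rather than our actual implementation.

import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between row-wise music embeddings."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T

def bleu_similarity_matrix(texts):
    """Pairwise sentence-level BLEU between raw text descriptions."""
    tokenized = [t.lower().split() for t in texts]
    smooth = SmoothingFunction().method1
    n = len(tokenized)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            sim[i, j] = sentence_bleu([tokenized[i]], tokenized[j],
                                      smoothing_function=smooth)
    return sim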
2 http://earsense.org/
3 For example, Ludwig van Beethoven's Sonata Pathétique (No. 8 in C minor, Op. 13) contains three movements: (I) Grave (slowly, with solemnity), (II) Adagio cantabile (slowly, in a singing style), and (III) Rondo: Allegro (quickly).