MUSIC-TO-TEXT SYNAESTHESIA: GENERATING DESCRIPTIVE
TEXT FROM MUSIC RECORDINGS
Zhihuan Kuang1, Shi Zong1, Jianbing Zhang1, Jiajun Chen1, Hongfu Liu2
1Nanjing University 2Brandeis University
kuangzh@smail.nju.edu.cn {szong, zjb, chenjj}@nju.edu.cn
hongfuliu@brandeis.edu
ABSTRACT
In this paper, we consider a novel research problem: music-to-text synaesthesia. Different from
the classical music tagging problem that classifies a music recording into pre-defined categories,
music-to-text synaesthesia aims to generate descriptive texts from music recordings with the same
sentiment for further understanding. As existing music-related datasets do not contain semantic
descriptions of music recordings, we collect a new dataset that contains 1,955 aligned pairs of
classical music recordings and text descriptions. Based on this, we build a computational model
to generate sentences that can describe the content of the music recording. To tackle the highly
non-discriminative nature of classical music, we design a group topology-preservation loss, which considers
more samples as a group reference and preserves the relative topology among different samples.
Extensive experimental results qualitatively and quantitatively demonstrate the effectiveness of our
proposed model over five heuristic or pre-trained competitive methods and their variants on our
collected dataset.1
1 Introduction
Our physical world is naturally composed of various modalities. In recent years, multi-modal learning has drawn great
attention and has been developed in diverse applications. Visual frames in videos are matched with text captions, and these pairs have been widely used for video-language pre-training [1, 2, 3]; Kinect devices employ the RGB camera and the depth sensor for action recognition and human pose estimation [4, 5]; autonomous driving cars integrate visible and invisible light via the camera, radar, and lidar for a series of driving-related tasks [6, 7]; cross-modal retrieval aims to match text with the existing textual repository and other modalities to meet users' queries [8, 9, 10]; language grounding learns the meaning of language by leveraging sensory data such as videos or images [11, 12].
Besides the above studies that employ multi-modal data to jointly achieve the learning task, translating information
among different modalities, also regarded as synaesthesia, is another crucial task in the multi-modal community. Various
methods for synaesthesia between text and other modalities have been studied. Speech recognition can be directly regarded as a translation between the text and audio modalities [13]. Image captioning extracts high-level visual cues and translates them into a sentence describing the image content, while some studies consider the inverse process of image captioning by converting a semantic text into a visual image [14, 15]. Different from the existing modality translation studies, in this paper, we consider a novel problem, music-to-text synaesthesia, i.e., generating descriptive texts from music recordings with the same sentiment orientation.
There have been some pioneering attempts that build connections between music recordings and tags at the initial stage. Cai et al. [16] formulate music auto-tagging as a captioning task and automatically output a sequence of tags given a clip of music. Zhang et al. [17] use keywords of music key, meter, and style to generate music descriptions, which can be used for caption generation. However, we argue that descriptive texts contain much richer information than tags, thus providing a better understanding of a music recording. Moreover, we notice that tags might have a biased interpretation. To demonstrate this, in Figure 1 we present two music recordings with the same music tags but opposite sentiment orientations in their texts. The first one expresses a positive sentiment by describing the music as "peaceful" and "beautiful," while the second one uses tokens including "sadness" and "loss" to express a negative sentiment. It is clear that music tags are insufficient for describing the content of a music piece.
1Our code is available at https://github.com/MusicTextSynaesthesia.
Movement III: Andante espressivo (by Charles Stanford)
Text: Perhaps the work's center of gravity is the slow movement, Andante espressivo. The motif is given a peaceful, and extraordinarily beautiful treatment. The stormy middle section makes a very strong impression.
Tags: mode - minor, instrument - string, ensemble - quartet, tempo - medium

Movement I: Tempo moderato (by Wilhelm Stenhammar)
Text: The movement to Quartet No.6, Tempo moderato un poco rubato, though not funereal, clearly conveys the feelings of sadness and loss. The tempo never speeds up and the players are warned off keeping a strict tempo.
Tags: mode - minor, instrument - string, ensemble - quartet, tempo - medium

Figure 1: Samples of classical music and corresponding descriptions in our collected dataset. The first piece is from String Quartet No.2 in A minor Op.45 composed by Charles Stanford, and the second is from String Quartet No.6 in D minor Op.35 by Wilhelm Stenhammar. These two samples have the same music tags but different sentiments.
Contributions. In this paper, we propose a new task of generating descriptive text from music recordings. Specifically, given a music recording, we aim to build a computational model that can generate sentences describing the content of the music recording, as well as the music's inherent sentiment. We make the following contributions:
• From the research problem perspective, different from the music tagging problem, our proposed music-to-text synaesthesia is a cross-modality translation task that aims at converting a given music piece into a text description. To the best of our knowledge, it is a novel research problem in the multi-modal learning community.
• From the dataset perspective, the existing music-related datasets do not contain semantic descriptions of music recordings. To build computational models for this task, we collect a new dataset that contains 1,955 aligned pairs of classical music recordings and text descriptions.
• From the technical perspective, we design a group topology-preservation loss in our computational model to tackle the non-discriminative music representations, which considers more data points as a group reference and preserves the relative topology among different nodes. Thus it can better align the music representations with the structure in the text space (see the illustrative sketch after this list).
• From the empirical evaluation perspective, extensive experimental results demonstrate the effectiveness of our proposed model over five heuristic or pre-trained competitive methods and their variants on our collected dataset. We also provide several case studies for comparison and elaborate on the explorations of our group topology-preservation loss and some parameter analyses.
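To make the intuition behind the group topology-preservation loss concrete, the following PyTorch sketch is our own illustration (the exact formulation used in the model appears later in the paper and may differ): it compares the pairwise similarity structure of a group of music embeddings against that of the paired text embeddings and penalizes the discrepancy, so each sample is positioned relative to the whole group rather than aligned to its own pair in isolation.

```python
import torch
import torch.nn.functional as F

def group_topology_loss(music_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Encourage a batch ("group") of music embeddings to preserve the relative
    topology (pairwise similarity structure) of the paired text embeddings.

    music_emb, text_emb: (batch_size, dim) tensors for aligned music/text pairs.
    """
    # Pairwise cosine-similarity matrices within each modality.
    m = F.normalize(music_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim_music = m @ m.t()   # (B, B): relative structure of the music group
    sim_text = t @ t.t()    # (B, B): relative structure of the text group

    # Penalize mismatch between the two relative structures, ignoring the
    # trivial diagonal (self-similarity is always 1 after normalization).
    off_diag = ~torch.eye(music_emb.size(0), dtype=torch.bool, device=music_emb.device)
    return F.mse_loss(sim_music[off_diag], sim_text[off_diag])
```

In practice such a term would be combined with a standard generation or alignment objective; the point of using the whole batch as a group reference is that music embeddings which are hard to tell apart individually can still be constrained by their positions relative to the other samples.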
2 Related Work
We introduce the related work on multi-modality learning and music tagging and captioning below.
Multi-modality Learning. The goal of multi-modal machine learning is to build computational models that are able to process and relate information from different modalities, such as audio, text, and image. A large portion of prior work has focused on modality fusion, which aims at making predictions by joining information from two or more modalities [18]. Applications include audio-visual speech recognition [19], visual question answering [20], and media summarization.
Beyond multi-modality fusion, translation among different modalities also draws increasing attention. There are three common frameworks for multi-modality translation. (1) Encoder-decoder models directly learn intermediate representations used for projecting one modality into another. Zhang et al. [21] adapt a sketch-refinement process to generate photo-realistic images for text-to-image synthesis tasks, and Wang et al. [22] design a framework for end-to-end dense video captioning with parallel decoding. (2) Models with joint representations fuse multi-modal features by mapping representations of different modalities together into a shared semantic subspace. Sun et al. [1] propose ViLBERT, which extends the BERT architecture to a multi-modal two-stream model that learns task-agnostic joint representations of image content and natural language. Habibian et al. [23] design an embedding between video features and term vectors to learn the entire representation from freely available web videos and their descriptions. (3) Representations in coordinated representation-based models exist in separate spaces, but are coordinated through a similarity function (e.g., Euclidean distance) or a structure constraint. These works include Wang et al. [24], who present a method to learn a common subspace based on adversarial learning for adversarial cross-modal retrieval, and Peng et al. [25], who propose a modality-specific cross-modal similarity measurement approach for tasks including cross-modal retrieval. In this work, we experiment with different losses on the coordinated model, as it achieves the best performance among all three types of models.
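As a minimal sketch of framework (3) under our own assumptions (not any cited author's implementation), two modality-specific encoders keep separate output spaces and are coordinated only through a distance constraint on paired samples:

```python
import torch
import torch.nn as nn

class CoordinatedEncoders(nn.Module):
    """Modality-specific encoders whose outputs live in separate spaces but are
    coordinated through a simple Euclidean-distance constraint on paired samples."""

    def __init__(self, music_dim: int, text_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.music_encoder = nn.Sequential(
            nn.Linear(music_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, hidden_dim))
        self.text_encoder = nn.Sequential(
            nn.Linear(text_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, hidden_dim))

    def coordination_loss(self, music_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # Pull each music representation toward its paired text representation.
        m = self.music_encoder(music_feat)
        t = self.text_encoder(text_feat)
        return (m - t).pow(2).sum(dim=-1).mean()
```

Joint-representation models would instead merge the two branches into a single shared space, while encoder-decoder models would decode one modality directly from the other.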
Music Tagging and Captioning Tasks. We notice some pioneering studies on music tagging or captioning tasks [26, 27, 28]. Manco et al. [29] use a private production music dataset, with music clips of length between only 30 and 360 seconds and captions containing between 3 and 22 tokens. Their proposed model is an encoder-decoder network consisting of a multimodal CNN-LSTM encoder with temporal attention and an LSTM decoder. Our proposed task is different from tagging and captioning tasks, as we aim at translating semantics and preserving sentiment between modalities.
Existing public music-related datasets mainly contain simple music tags. The AudioSet dataset [30] is a large-scale collection of human-labeled 10-second sound clips (not music recordings) drawn from YouTube videos; it only has descriptions for categories, not for individual sounds. The MTG-Jamendo dataset [31] contains over 55,000 full audio tracks with 195 tags spanning genre, instrument, and mood/theme categories. Oramas et al. [32] describe a dataset of album reviews collected from Amazon. However, users' reviews may not necessarily describe the actual contents of music recordings. Cai et al. [16] formulate the music tagging problem as a multi-class classification problem, using a dataset called MajorMiner in which each music recording is associated with tags collected from different users. Zhang et al. [17] study bidirectional music-sentence retrieval and generation tasks; the dataset they use contains 16,257 folk songs paired with metadata, from which key, meter, and style are selected as keywords. However, the text describing each piece focuses only on such specific information and has limited writing styles.
3 Data Collection and Analysis
In order to generate descriptive texts from music recordings, a dataset containing aligned music-text pairs is required
for model training. Although there are several public music/audio datasets with tags or user reviews (see Section 2),
unfortunately, they are not suitable for our task for the following reasons: (1) From the text side, current datasets only
have pre-defined tags for music pieces, rather than descriptive texts for music contents. (2) From the audio side, some
clips are too short to contain a musical melody. In light of this, we collect a new dataset for the music-to-text synaesthesia task.
Data Collection and Post-Processing. We collected the data from EARSENSE,2 a website that hosts a database for chamber music. EARSENSE provides comprehensive meta-information for each music composition, including composers, works, and related multi-media resources. There is also an associated introductory article by professional experts, with detailed explanations, comments, or analyses of the movements. Figure 1 shows an illustrative example of the music-text pairs. A typical music composition contains several movements. Each movement has its own title that normally contains tempo markings or terms such as minuet and trio; in some cases, it has a unique name speaking to the larger story of the entire work. As movements have their own form, key, and mood, and often contain a complete resolution or ending, we treat each movement as the basic unit in this work.3
(a) music similarity; (b) text similarity.
Figure 2: Pairwise similarity matrices of music representations from a self-reconstruction autoencoder and of raw text, by cosine and BLEU score.
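As a rough sketch of how similarity matrices of this kind could be computed (the autoencoder bottleneck, tokenization, and BLEU smoothing choices below are our assumptions, not the paper's published setup):

```python
import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def music_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity of music embeddings, e.g. the bottleneck
    vectors of a self-reconstruction autoencoder; shape (N, dim)."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T

def text_similarity_matrix(tokenized_texts: list) -> np.ndarray:
    """Pairwise BLEU scores between tokenized text descriptions."""
    smooth = SmoothingFunction().method1
    n = len(tokenized_texts)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            sim[i, j] = sentence_bleu([tokenized_texts[i]], tokenized_texts[j],
                                      smoothing_function=smooth)
    return sim
```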
We managed to collect 2,380 text descriptions in total, of which 1,955 have corresponding music pieces. We converted the tempo markings in the titles of movements into four universal categories ranging from slow to super fast. These categories are then added to movements as tags by directly checking whether a title contains tokens from our list.
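A minimal sketch of this token-matching step, assuming an illustrative mapping from common tempo terms to the four categories (the actual token list and category names used in our pipeline may differ):

```python
# Illustrative mapping from tempo markings to four coarse categories (slow to super fast);
# the exact token list used for the dataset is an assumption here.
TEMPO_CATEGORIES = {
    "slow": ["grave", "largo", "lento", "adagio"],
    "medium": ["andante", "andantino", "moderato"],
    "fast": ["allegretto", "allegro", "vivace"],
    "super fast": ["presto", "prestissimo"],
}

def tempo_tag(movement_title: str):
    """Assign a tempo category by checking whether the movement title
    contains any known tempo token."""
    title = movement_title.lower()
    for category, tokens in TEMPO_CATEGORIES.items():
        if any(token in title for token in tokens):
            return category
    return None  # no recognizable tempo marking in the title

# e.g. tempo_tag("Movement I: Tempo moderato") -> "medium"
```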
Preliminary Exploration. We observe that the lengths of 95% of the collected music pieces vary from 2.5 to 14 minutes, and their corresponding descriptive texts contain 14 to 192 tokens. We provide more details of our data statistics in Appendix A.
2http://earsense.org/
3For example, Ludwig van Beethoven's Sonata Pathétique (No. 8 in C minor, Op. 13) contains three movements: (I) Grave (slowly, with solemnity), (II) Adagio cantabile (slowly, in a singing style), and (III) Rondo: Allegro (quickly).