
similarity function (e.g., Euclidean distance) or a structure constraint. These works include Wang et al. [24], who present a method to learn a common subspace based on adversarial learning for cross-modal retrieval, and Peng et al. [25], who propose a modality-specific cross-modal similarity measurement approach for tasks including cross-modal retrieval. In this work, we experiment with different losses on the coordinate model, as it achieves the best performance among all three types of models.
Music Tagging and Captioning Tasks.
We notice some pioneering studies on music tagging or captioning tasks [26, 27, 28]. Manco et al. [29] use a private production music dataset, with music clips of only 30 to 360 seconds in length and captions containing between 3 and 22 tokens. Their proposed model is an encoder-decoder network with a multimodal CNN-LSTM encoder with temporal attention and an LSTM decoder. Our proposed task differs from tagging and captioning tasks, as we aim to translate semantics and preserve sentiment between modalities.
Existing public music-related datasets mainly contain simple music tags. The AudioSet dataset [30] is a large-scale collection of human-labeled 10-second sound clips (not music recordings) drawn from YouTube videos. This dataset only has descriptions for categories, not for individual sounds. The MTG-Jamendo dataset [31] contains over 55,000 full audio tracks with 195 tags covering genre, instrument, and mood/theme categories. Oramas et al. [32] describe a dataset of Amazon album reviews. However, users' reviews may not necessarily describe the actual contents of music recordings. Cai et al. [16] formulate music tagging as a multi-class classification problem, using the MajorMiner dataset, in which each music recording is associated with tags collected from different users. Zhang et al. [17] study bidirectional music-sentence retrieval and generation tasks. Their dataset contains 16,257 folk songs paired with metadata, including key, meter, and style as keywords. However, such text describing music focuses only on specific attributes and has limited writing styles.
3 Data Collection and Analysis
In order to generate descriptive texts from music recordings, a dataset containing aligned music-text pairs is required for model training. Although there are several public music/audio datasets with tags or user reviews (see Section 2), they are unfortunately not suitable for our task for the following reasons: (1) From the text side, current datasets only have pre-defined tags for music pieces, rather than descriptive texts about the music content. (2) From the audio side, some clips are too short and lack a musical melody. In light of this, we collect a new dataset for the music-to-text synaesthesia task.
Data Collection and Post-Processing.
We collected the data from EARSENSE,2 a website that hosts a database for chamber music. EARSENSE provides comprehensive meta-information for each music composition, including composers, works, and related multimedia resources. Each composition is also accompanied by an introductory article from professional experts, with detailed explanations, comments, or analyses of its movements. Figure 1 shows an illustrative example of the music-text pairs. A typical music composition contains several movements. Each movement has its own title that normally contains tempo markings or terms such as minuet and trio; in some cases, it has a unique name speaking to the larger story of the entire work. As movements have their own form, key, and mood, and often contain a complete resolution or ending, we treat each movement as the basic unit in this work.3
Figure 2: Pairwise similarity matrices of (a) music representations from a self-reconstruction autoencoder and (b) raw text, measured by cosine similarity and BLEU score.
We collected 2,380 text descriptions in total, of which 1,955 have corresponding music pieces. We converted the tempo markings in the titles of movements into four universal categories ranging from slow to super fast. These categories are then added to the movements as tags by directly checking whether a title contains tokens from our list.
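As a concrete illustration of this conversion step, a minimal sketch is given below; it assumes a simple keyword lookup over the movement title, and the four category names and tempo-token lists are illustrative placeholders rather than the exact vocabulary used for our dataset.

# Sketch of the tempo-tag conversion (category names and token lists are illustrative).
TEMPO_CATEGORIES = {
    "slow": ["grave", "largo", "lento", "adagio"],
    "moderate": ["andante", "andantino", "moderato", "allegretto"],
    "fast": ["allegro", "vivace"],
    "super fast": ["presto", "prestissimo", "vivacissimo"],
}

def tempo_tag(movement_title):
    """Return a coarse tempo category if the title contains a known tempo token, else None."""
    tokens = movement_title.lower().replace(":", " ").replace(",", " ").split()
    for category, markings in TEMPO_CATEGORIES.items():
        if any(token in markings for token in tokens):
            return category
    return None  # no recognized tempo marking in the title

# e.g., tempo_tag("Rondo: Allegro") -> "fast"; tempo_tag("Adagio cantabile") -> "slow"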
Preliminary Exploration.
We observe that the lengths of 95% of the collected music pieces vary from 2.5 to 14 minutes, and the corresponding descriptive texts contain 14 to 192 tokens. We provide more details of our data statistics in Appendix A.
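To make the pairwise similarity analysis visualized in Figure 2 concrete, the sketch below shows one plausible way such matrices could be computed, assuming the music embeddings produced by a self-reconstruction autoencoder are already available as a NumPy array; the function names and exact similarity choices are placeholders rather than our actual implementation.

import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between row-wise music embeddings."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T

def bleu_similarity_matrix(texts):
    """Pairwise sentence-level BLEU between raw text descriptions."""
    tokenized = [t.lower().split() for t in texts]
    smooth = SmoothingFunction().method1
    n = len(tokenized)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            sim[i, j] = sentence_bleu([tokenized[i]], tokenized[j],
                                      smoothing_function=smooth)
    return sim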
2 http://earsense.org/
3 For example, Ludwig van Beethoven's Sonata Pathétique (No. 8 in C minor, Op. 13) contains three movements: (I) Grave (slowly, with solemnity), (II) Adagio cantabile (slowly, in a singing style), and (III) Rondo: Allegro (quickly).