
Detection and Classification of Acoustic Scenes and Events 2022, 3–4 November 2022, Nancy, France
MATCHING TEXT AND AUDIO EMBEDDINGS: EXPLORING TRANSFER-LEARNING
STRATEGIES FOR LANGUAGE-BASED AUDIO RETRIEVAL
Benno Weck1,2, Miguel Pérez Fernández1,2, Holger Kirchhoff1, Xavier Serra2
1Huawei Technologies, Munich Research Center, Germany
{firstname.lastname}@huawei.com
2Universitat Pompeu Fabra, Music Technology Group, Spain
{firstname.lastname}01@estudiant.upf.edu, xavier.serra@upf.edu
ABSTRACT
We present an analysis of large-scale pretrained deep learning models used for cross-modal (text-to-audio) retrieval. We use embeddings extracted by these models in a metric learning framework to connect matching pairs of audio and text. Shallow neural networks map the embeddings to a common dimensionality. Our system, which is an extension of our submission to the Language-based Audio Retrieval Task of the DCASE Challenge 2022, employs the RoBERTa foundation model as the text embedding extractor. A pretrained PANNs model extracts the audio embeddings. To improve the generalisation of our model, we investigate how pretraining with audio and associated noisy text collected from the online platform Freesound improves the performance of our method. Furthermore, our ablation study reveals that the proper choice of the loss function and fine-tuning of the pretrained models are essential in training a competitive retrieval system.
1. INTRODUCTION
The DCASE2022 challenge subtask 6b provides a platform to stimulate research in the underexplored problem domain of language-based audio retrieval [1]. The goal of this task is to find the closest-matching audio recordings for a given text query. A possible application for this task is a search engine for audio files in which a user can enter a free-form textual description to retrieve matching recordings. Such systems need to draw a connection between the two modalities: audio and text.
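To make this retrieval setting concrete, the sketch below ranks a collection of audio clips against a single text query by cosine similarity in a shared embedding space. It assumes the embeddings have already been computed, and all names in it are illustrative rather than taken from our system, which is described in Section 2.

import numpy as np

def retrieve(text_emb: np.ndarray, audio_embs: np.ndarray, top_k: int = 10) -> np.ndarray:
    # Rank audio clips by cosine similarity to a text query embedding.
    text_emb = text_emb / np.linalg.norm(text_emb)
    audio_embs = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    scores = audio_embs @ text_emb            # one similarity score per clip
    return np.argsort(scores)[::-1][:top_k]   # indices of the best-matching clips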
Given the complex nature of both audio and text, we expect that a system can only perform well in this task if it can capitalise on a large amount of training data. Due to the novelty of the task, few previous studies and systems exist for language-based audio retrieval, and training data is still limited. We instead turn to the fields of machine listening, specifically audio tagging, and natural language processing to draw inspiration from related problems and to make use of existing resources such as pretrained models. Using large-scale pretrained models in a transfer-learning setup has become a popular approach for tasks where only limited training data is available.
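As a minimal sketch of this transfer-learning idea, the snippet below extracts a sentence-level embedding from a pretrained RoBERTa model with frozen weights, using the Hugging Face transformers library. The checkpoint name ("roberta-base") and the mean-pooling strategy are assumptions made for illustration; they are not prescribed by this paper.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")
encoder.eval()  # frozen weights: used purely as a feature extractor

@torch.no_grad()
def embed_text(caption: str) -> torch.Tensor:
    inputs = tokenizer(caption, return_tensors="pt", truncation=True)
    hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)           # mean-pool over tokens

query_emb = embed_text("A machine breaks, and the alarm turns on")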
The goal of this work is to study a simple, generic cross-modal alignment system. Our approach should be able to process audio and text independently so that it can be used in a cross-modal retrieval context. Therefore, we leverage the power of pretrained models and a metric learning framework to semantically link the two modalities. We limit the complexity of our approach by employing the pretrained models with fixed weights and only training shallow network architectures to perform the alignment. Additionally, this paper presents an analysis of our submission [2] to the Language-based Audio Retrieval Task of the DCASE2022 Challenge. With an ablation study, we investigate the impact of different training strategies on the performance of our system. This helps us to understand the differences in performance between our system and other submissions to the challenge.

[Figure 1: Overview of the architecture of our system. An audio tower and a text tower process the respective input data separately and produce a single embedding; matching pairs are linked through a contrastive loss.]
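A minimal sketch of the two-tower idea shown in Figure 1 follows: shallow projection networks map the fixed pretrained audio and text embeddings to a shared dimensionality. The layer sizes and the joint dimensionality here are illustrative assumptions, not our exact configuration; the input sizes correspond to typical PANNs (2048) and RoBERTa-base (768) embedding dimensions.

import torch
import torch.nn as nn

class Tower(nn.Module):
    # Shallow MLP projecting a pretrained embedding into the joint space.
    def __init__(self, in_dim: int, out_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.net(x)
        return z / z.norm(dim=-1, keepdim=True)  # L2-normalise for retrieval

audio_tower = Tower(in_dim=2048)  # assumed PANNs embedding size
text_tower = Tower(in_dim=768)    # assumed RoBERTa-base hidden size

Only these shallow towers are trained; the pretrained extractors that produce their inputs remain frozen.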
The remainder of this paper is structured as follows. In the next section, we introduce the methodological framework of our system. Section 3 explains the experiments that led to our challenge submission, and Section 4 presents the results of the submitted systems. The results of additional experiments performed as an ablation study are discussed in Section 5. We summarise our findings in Section 6.
2. METHOD
We adopt a metric learning [3] framework in our approach, which differs from the classification scenario used in related tasks such as audio tagging. In a classification scenario, the outputs of a network are the predictions for the different classes, and the features that characterise each of those classes remain in the intermediate layers of the network. In metric learning, however, the goal is to obtain those features directly, so that the output of the network can be used to measure the similarity between two different inputs. The
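As a concrete illustration of such a metric learning objective, the sketch below implements a generic batch-wise contrastive loss of the kind indicated in Figure 1: matching audio-text pairs are pulled together while mismatched pairs within the batch are pushed apart. This symmetric InfoNCE-style formulation and the temperature value are illustrative assumptions; the actual choice of loss function is examined in our ablation study.

import torch
import torch.nn.functional as F

def contrastive_loss(z_audio: torch.Tensor, z_text: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # z_audio, z_text: (batch, dim) L2-normalised embeddings of matched pairs.
    logits = z_audio @ z_text.t() / temperature  # pairwise similarity matrix
    targets = torch.arange(z_audio.size(0))      # i-th audio matches i-th text
    # symmetric cross-entropy over both retrieval directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))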