
Detection and Classification of Acoustic Scenes and Events 2022, 3–4 November 2022, Nancy, France
MATCHING TEXT AND AUDIO EMBEDDINGS: EXPLORING TRANSFER-LEARNING
STRATEGIES FOR LANGUAGE-BASED AUDIO RETRIEVAL
Benno Weck1,2, Miguel Pérez Fernández1,2, Holger Kirchhoff1, Xavier Serra2
1Huawei Technologies, Munich Research Center, Germany
{firstname.lastname}@huawei.com
2Universitat Pompeu Fabra, Music Technology Group, Spain
{firstname.lastname}01@estudiant.upf.edu, xavier.serra@upf.edu
ABSTRACT
We present an analysis of large-scale pretrained deep learning models used for cross-modal (text-to-audio) retrieval. We use embeddings extracted by these models in a metric learning framework to connect matching pairs of audio and text. Shallow neural networks map the embeddings to a common dimensionality. Our system, which is an extension of our submission to the Language-based Audio Retrieval Task of the DCASE Challenge 2022, employs the RoBERTa foundation model as the text embedding extractor. A pretrained PANNs model extracts the audio embeddings. To improve the generalisation of our model, we investigate how pretraining with audio and associated noisy text collected from the online platform Freesound improves the performance of our method. Furthermore, our ablation study reveals that the proper choice of the loss function and fine-tuning of the pretrained models are essential in training a competitive retrieval system.
1. INTRODUCTION
The DCASE2022 challenge subtask 6b provides a platform to stimulate research in the underexplored problem domain of language-based audio retrieval [1]. The goal of this task is to find the closest-matching audio recordings for a given text query. A possible application for this task is a search engine for audio files in which a user can enter a free-form textual description to retrieve matching recordings. Such systems need to draw a connection between the two modalities: audio and text.
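To make this retrieval setting concrete, the sketch below ranks a collection of audio clips against a single text query by cosine similarity in a shared embedding space. It assumes the embeddings have already been computed, and all names in it are illustrative rather than taken from our system, which is described in Section 2.

import numpy as np

def retrieve(text_emb: np.ndarray, audio_embs: np.ndarray, top_k: int = 10) -> np.ndarray:
    # Rank audio clips by cosine similarity to a text query embedding.
    text_emb = text_emb / np.linalg.norm(text_emb)
    audio_embs = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    scores = audio_embs @ text_emb            # one similarity score per clip
    return np.argsort(scores)[::-1][:top_k]   # indices of the best-matching clips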
Given the complex nature of both audio and text, we expect that a system can only perform well in this task if it can capitalise on a large amount of training data. Due to the novelty of the task, few previous studies and systems exist for language-based audio retrieval, and training data is still limited. We instead turn to the fields of machine listening, specifically audio tagging, and natural language processing to draw inspiration from related problems and to make use of existing resources such as pretrained models. Using large-scale pretrained models in a transfer-learning setup has become a popular approach for tasks where only limited training data is available.
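As a minimal sketch of this transfer-learning idea, the snippet below extracts a sentence-level embedding from a pretrained RoBERTa model with frozen weights, using the Hugging Face transformers library. The checkpoint name ("roberta-base") and the mean-pooling strategy are assumptions made for illustration; they are not prescribed by this paper.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")
encoder.eval()  # frozen weights: used purely as a feature extractor

@torch.no_grad()
def embed_text(caption: str) -> torch.Tensor:
    inputs = tokenizer(caption, return_tensors="pt", truncation=True)
    hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)           # mean-pool over tokens

query_emb = embed_text("A machine breaks, and the alarm turns on")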
The goal of this work is to study a simple, generic cross-modal alignment system. Our approach should be able to process audio and text independently so that it can be used in a cross-modal retrieval context. Therefore, we leverage the power of pretrained models and a metric learning framework to semantically link the two modalities. We limit the complexity of our approach by employing the pretrained models with fixed weights and only training shallow network architectures to perform the alignment. Additionally, this paper presents an analysis of our submission [2] to the Language-based Audio Retrieval Task of the DCASE2022 Challenge. With an ablation study, we investigate the impact of different training strategies on the performance of our system. This helps us to understand the differences in performance between our system and other submissions to the challenge.

[Figure 1: Overview of the architecture of our system. An audio tower and a text tower process the respective input data separately and produce a single embedding; matching pairs are linked through a contrastive loss.]
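A minimal sketch of the two-tower idea shown in Figure 1 follows: shallow projection networks map the fixed pretrained audio and text embeddings to a shared dimensionality. The layer sizes and the joint dimensionality here are illustrative assumptions, not our exact configuration; the input sizes correspond to typical PANNs (2048) and RoBERTa-base (768) embedding dimensions.

import torch
import torch.nn as nn

class Tower(nn.Module):
    # Shallow MLP projecting a pretrained embedding into the joint space.
    def __init__(self, in_dim: int, out_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.net(x)
        return z / z.norm(dim=-1, keepdim=True)  # L2-normalise for retrieval

audio_tower = Tower(in_dim=2048)  # assumed PANNs embedding size
text_tower = Tower(in_dim=768)    # assumed RoBERTa-base hidden size

Only these shallow towers are trained; the pretrained extractors that produce their inputs remain frozen.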
The remainder of this paper is structured as follows. In the next section, we introduce the methodological framework of our system. Section 3 explains the experiments that led to our challenge submission, and Section 4 presents the results of the submitted systems. The results of additional experiments performed as an ablation study are discussed in Section 5. We summarise our findings in Section 6.
2. METHOD
We adopt a metric learning [3] framework in our approach, which differs from the classification scenario used in related tasks such as audio tagging. In a classification scenario, the outputs of a network are the predictions for the different classes, and the features that characterise each of those classes remain in the intermediate layers of the network. In metric learning, however, the goal is to obtain those features directly, so that the output of the network can be used to measure the similarity between two different inputs. The
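As a concrete illustration of such a metric learning objective, the sketch below implements a generic batch-wise contrastive loss of the kind indicated in Figure 1: matching audio-text pairs are pulled together while mismatched pairs within the batch are pushed apart. This symmetric InfoNCE-style formulation and the temperature value are illustrative assumptions; the actual choice of loss function is examined in our ablation study.

import torch
import torch.nn.functional as F

def contrastive_loss(z_audio: torch.Tensor, z_text: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # z_audio, z_text: (batch, dim) L2-normalised embeddings of matched pairs.
    logits = z_audio @ z_text.t() / temperature  # pairwise similarity matrix
    targets = torch.arange(z_audio.size(0))      # i-th audio matches i-th text
    # symmetric cross-entropy over both retrieval directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))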