
MULTI-DIMENSIONAL EDGE-BASED AUDIO EVENT RELATIONAL GRAPH
REPRESENTATION LEARNING FOR ACOUSTIC SCENE CLASSIFICATION
Yuanbo Hou1, Siyang Song2, Chuang Yu3, Yuxin Song4, Wenwu Wang5, Dick Botteldooren1
1Ghent University, Belgium. 2University of Cambridge, UK.
3University of Manchester, UK. 4Baidu Inc., China. 5University of Surrey, UK.
ABSTRACT
Most existing deep learning-based acoustic scene classifi-
cation (ASC) approaches directly utilize representations ex-
tracted from spectrograms to identify target scenes. However,
these approaches pay little attention to the audio events occurring in the scene, even though they provide crucial semantic information. This paper conducts the first study that investigates
whether real-life acoustic scenes can be reliably recognized
based only on the features that describe a limited number of
audio events. To model the task-specific relationships be-
tween coarse-grained acoustic scenes and fine-grained audio
events, we propose an event relational graph representation
learning (ERGL) framework for ASC. Specifically, ERGL
learns a graph representation of an acoustic scene from the
input audio, where the embedding of each event is treated
as a node, while the relationship cues derived from each
pair of event embeddings are described by a learned multi-
dimensional edge feature. Experiments on a polyphonic
acoustic scene dataset show that the proposed ERGL achieves
competitive performance on ASC by using only a limited
number of audio event embeddings, without any data augmentation. The effectiveness of the proposed ERGL framework demonstrates the feasibility of recognizing diverse acoustic scenes based on the event relational graph. Our code is available on
our homepage (https://github.com/Yuanbo2020/ERGL).
Index Terms—Acoustic scene classification, audio
event, graph representation learning, multi-dimensional edge
1. INTRODUCTION
Acoustic scene classification (ASC) aims to classify an audio
clip from various sources in real scenarios into a predefined
semantic label (e.g., park, mall, or bus) [1]. ASC provides a
broad description of the surrounding environment to assist in-
telligent agents in quickly understanding the general picture
of the environment, and thus is beneficial for various applica-
tions, such as sound source recognition [2], elderly well-being
assistance [3], and audio-visual scene recognition [4].
Typical deep learning-based ASC methods consist of three steps: first, they convert the input time-domain audio stream into a time-frequency spectrogram as its acoustic features. Then, the obtained acoustic features are fed to neural networks to automatically generate task-orientated representations. Finally, the classifier recognizes the acoustic scene of the input audio stream based on such high-level representations.
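For clarity, a minimal PyTorch sketch of this three-step pipeline is given below; the sampling rate, mel-spectrogram settings, CNN layers, and ten-class scene set are illustrative assumptions rather than the configuration of any particular system.

# Minimal sketch of the typical spectrogram-based ASC pipeline: waveform ->
# log-mel spectrogram -> CNN representation -> scene classifier.
# All settings below are illustrative placeholders.
import torch
import torch.nn as nn
import torchaudio

class SpectrogramASC(nn.Module):
    def __init__(self, n_mels=64, n_scenes=10):
        super().__init__()
        # Step 1: time-domain audio -> time-frequency (log-mel) spectrogram.
        self.mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=32000, n_fft=1024, hop_length=320, n_mels=n_mels)
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        # Step 2: CNN encoder generates a task-orientated representation.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        # Step 3: a linear classifier maps the representation to scene logits.
        self.classifier = nn.Linear(64, n_scenes)

    def forward(self, waveform):                         # (batch, samples)
        x = self.to_db(self.mel(waveform)).unsqueeze(1)  # (batch, 1, mels, frames)
        h = self.encoder(x).flatten(1)                   # (batch, 64)
        return self.classifier(h)                        # (batch, n_scenes)

scene_logits = SpectrogramASC()(torch.randn(2, 32000))   # two 1-second clips at 32 kHz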
For example, following this pipeline, [5] applies a CNN-based model to the mel spectrograms of the input audio for ASC, where attention-based pooling layers are used to reduce the dimension of the representation. In [6], a spatial pyramid pooling approach is combined with a CNN to provide various resolutions for ASC. Beyond mel spectrograms, the wavelet-based deep scattering spectrum [7] has been introduced for ASC to exploit
higher-order temporal information of acoustic features by
convolutional recurrent neural networks (CRNN) with bidi-
rectional gated recurrent units. Given the intrinsic relation-
ship between acoustic scenes and audio events, some studies
jointly analyze scenes and events by relying on multi-task learning (MTL) [8, 9, 10]. To further mine the implicit relational
information between coarse-grained scenes and embedded
fine-grained events, a relation-guided ASC [11] is proposed
to guide the model to bidirectionally fuse scene-event rela-
tions for mutually beneficial scene and event classification.
However, most of the aforementioned approaches do not explicitly consider the semantically meaningful information in the acoustic scene (i.e., audio events). It is dif-
ficult to explain what types of cues in the audio stream are
utilized by these approaches to recognize the acoustic scene.
Meanwhile, it is natural for humans to recognize acoustic
scenes based on the semantically meaningful audio events
contained in them, where the occurring events and their re-
lationships vary in different acoustic scenes [12]. This paper proposes to learn, in an end-to-end manner, a multi-dimensional edge-based event relational graph that represents the acoustic scene in the input audio, encoding not only the activations of the audio events (treated as nodes in the graph) in the audio signal but also their task-specific relationships (represented as edges).
Then, the scene-dependent event relational graph is fed into
a gated graph convolutional network (Gated GCN) to extract
scene-related cues for classification. Our results show that, by relying only on a limited number of explicit audio event embeddings, the proposed ERGL can successfully build scene-dependent event relational graphs and effectively distinguish scenes.
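To illustrate the construction, the following minimal PyTorch sketch builds an event relational graph from a set of audio event embeddings, derives a multi-dimensional edge feature from each pair of embeddings, and applies one layer of edge-gated message passing before a graph readout and scene classifier; the dimensions and the single-layer design are illustrative assumptions, and the released code on our homepage should be consulted for the actual ERGL architecture.

# Minimal sketch of the event relational graph idea: each audio event contributes
# a node embedding, every ordered pair of node embeddings yields a learned
# multi-dimensional edge feature, and gated message passing followed by a readout
# produces scene logits. Dimensions and the single layer are illustrative assumptions.
import torch
import torch.nn as nn

class ERGLSketch(nn.Module):
    def __init__(self, emb_dim=128, edge_dim=16, n_scenes=10):
        super().__init__()
        self.edge_mlp = nn.Linear(2 * emb_dim, edge_dim)   # pair of nodes -> edge feature
        self.gate = nn.Linear(edge_dim, 1)                 # edge feature -> message gate
        self.message = nn.Linear(emb_dim, emb_dim)         # transform of the sender node
        self.classifier = nn.Linear(emb_dim, n_scenes)

    def forward(self, node_emb):                 # node_emb: (batch, n_events, emb_dim)
        b, n, d = node_emb.shape
        h_i = node_emb.unsqueeze(2).expand(b, n, n, d)     # receiver embedding per pair
        h_j = node_emb.unsqueeze(1).expand(b, n, n, d)     # sender embedding per pair
        edges = torch.relu(self.edge_mlp(torch.cat([h_i, h_j], dim=-1)))  # (b, n, n, edge_dim)
        gates = torch.sigmoid(self.gate(edges))                           # (b, n, n, 1)
        # Each node aggregates gated messages from all other nodes.
        agg = (gates * self.message(h_j)).mean(dim=2)                     # (b, n, emb_dim)
        nodes = torch.relu(node_emb + agg)                                # updated node states
        return self.classifier(nodes.mean(dim=1))                         # readout -> scene logits

# Usage with placeholder event embeddings (in ERGL these come from an audio event encoder).
scene_logits = ERGLSketch()(torch.randn(2, 25, 128))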
The remainder of this paper is organized as follows. Section 2 introduces the proposed ERGL. Section 3 describes the dataset and experimental setup and analyzes the results. Section 4 draws conclusions.