MULTI-DIMENSIONAL EDGE-BASED AUDIO EVENT RELATIONAL GRAPH
REPRESENTATION LEARNING FOR ACOUSTIC SCENE CLASSIFICATION
Yuanbo Hou1, Siyang Song2, Chuang Yu3, Yuxin Song4, Wenwu Wang5, Dick Botteldooren1
1Ghent University, Belgium. 2University of Cambridge, UK.
3University of Manchester, UK. 4Baidu Inc., China. 5University of Surrey, UK.
ABSTRACT
Most existing deep learning-based acoustic scene classifi-
cation (ASC) approaches directly utilize representations ex-
tracted from spectrograms to identify target scenes. However,
these approaches pay little attention to the audio events occurring
in the scene, even though these events provide crucial semantic
information. This paper conducts the first study that investigates
whether real-life acoustic scenes can be reliably recognized
based only on the features that describe a limited number of
audio events. To model the task-specific relationships be-
tween coarse-grained acoustic scenes and fine-grained audio
events, we propose an event relational graph representation
learning (ERGL) framework for ASC. Specifically, ERGL
learns a graph representation of an acoustic scene from the
input audio, where the embedding of each event is treated
as a node, while the relationship cues derived from each
pair of event embeddings are described by a learned multi-
dimensional edge feature. Experiments on a polyphonic
acoustic scene dataset show that the proposed ERGL achieves
competitive performance on ASC by using only a limited
number of embeddings of audio events without any data augmentations.
These results demonstrate the feasibility of recognizing diverse
acoustic scenes based on the event relational graph. Our code is available on
our homepage (https://github.com/Yuanbo2020/ERGL).
Index Terms: Acoustic scene classification, audio
event, graph representation learning, multi-dimensional edge
1. INTRODUCTION
Acoustic scene classification (ASC) aims to classify an audio
clip from various sources in real scenarios into a predefined
semantic label (e.g., park, mall, or bus) [1]. ASC provides a
broad description of the surrounding environment to assist in-
telligent agents in quickly understanding the general picture
of the environment, and thus is beneficial for various applica-
tions, such as sound source recognition [2], elderly well-being
assistance [3], and audio-visual scene recognition [4].
Typical deep learning-based ASC methods usually con-
sist of three steps: first, they convert the input time-domain
audio stream into a time-frequency spectrogram as its acous-
tic features. Then, the obtained acoustic features are fed
to neural networks to automatically generate task-orientated
representations. Finally, the classifier recognizes the acous-
tic scene of the input audio stream based on such high-level
representations. For example, [5] uses a CNN-based method
with mel spectrograms of the input audio for ASC,
where attention-based pooling layers are used to reduce the
dimension of the representation. The CNN in [6] uses spatial
pyramid pooling to provide various resolutions
for ASC. Besides mel spectrograms, a wavelet-based
deep scattering spectrum [7] is introduced in ASC to exploit
higher-order temporal information of acoustic features by
convolutional recurrent neural networks (CRNN) with bidi-
rectional gated recurrent units. Given the intrinsic relation-
ship between acoustic scenes and audio events, some studies
jointly analyze scenes and events relying on multi-task learn-
ing (MTL) [8, 9, 10]. To further mine the implicit relational
information between coarse-grained scenes and embedded
fine-grained events, a relation-guided ASC [11] is proposed
to guide the model to bidirectionally fuse scene-event rela-
tions for mutually beneficial scene and event classification.
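The three-step pipeline outlined at the start of this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any cited work: a plain log-power spectrogram stands in for the mel spectrogram, a single random linear layer with ReLU stands in for the neural network, and the names log_spectrogram, encode, and classify, the 16 kHz sampling rate, and all sizes are hypothetical.

```python
import numpy as np

def log_spectrogram(waveform, frame_len=256, hop=128):
    """Step 1: time-domain signal -> log-power time-frequency features."""
    n_frames = 1 + (len(waveform) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([
        waveform[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + 1e-10)  # shape: (n_frames, frame_len // 2 + 1)

def encode(features, w_enc):
    """Step 2: stand-in for the network (linear layer + ReLU, mean over time)."""
    return np.maximum(features @ w_enc, 0.0).mean(axis=0)

def classify(representation, w_cls):
    """Step 3: linear classifier followed by a softmax over scene classes."""
    logits = representation @ w_cls
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()

rng = np.random.default_rng(0)
waveform = rng.normal(size=16000)          # 1 s of noise at an assumed 16 kHz
features = log_spectrogram(waveform)
w_enc = rng.normal(size=(features.shape[1], 32)) * 0.01  # untrained weights
w_cls = rng.normal(size=(32, 3)) * 0.1                   # 3 hypothetical scenes
probs = classify(encode(features, w_enc), w_cls)
predicted_scene = int(np.argmax(probs))
```

With untrained random weights the predicted class is meaningless; the sketch only shows how the data flows and changes shape from waveform to scene probabilities.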
However, most of the aforementioned approaches do not
specifically consider the important semantically meaningful
information in the acoustic scene (i.e., audio events). It is dif-
ficult to explain what types of cues in the audio stream are
utilized by these approaches to recognize the acoustic scene.
Meanwhile, it is natural for humans to recognize acoustic
scenes based on the semantically meaningful audio events
contained in them, where the occurring events and their relationships
vary across acoustic scenes [12]. This paper proposes to learn, in an
end-to-end manner, a multi-dimensional edge-based graph that represents
each acoustic scene. The graph contains not only the activation of
audio events in the audio signal (treated as nodes) but
also their task-specific relationships (represented as edges).
Then, the scene-dependent event relational graph is fed into
a gated graph convolutional network (Gated GCN) to extract
scene-related cues for classification. Experimental results show that
by relying only on several explicit audio event embeddings,
the proposed ERGL can successfully build scene-dependent
event relational graphs and effectively distinguish scenes. The
paper is organized as follows. Section 2 introduces the proposed
ERGL. Section 3 describes the dataset and experimental
setup and analyzes the results. Section 4 draws conclusions.
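To make the graph structure described above concrete, the sketch below builds a small event relational graph: each event embedding is a node, and each ordered pair of nodes gets a multi-dimensional edge vector. It is an illustrative assumption, not the authors' implementation. The edge function (a linear map of the concatenated pair followed by tanh), the simplified gated aggregation, the mean-pooled readout, and all names and sizes are hypothetical, and the weights are random rather than learned.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def build_event_graph(h, w_edge, b_edge):
    """Nodes are event embeddings (N, D); edges are vectors (N, N, E)."""
    n = h.shape[0]
    src = np.repeat(h[:, None, :], n, axis=1)    # src[i, j] = h[i]
    dst = np.repeat(h[None, :, :], n, axis=0)    # dst[i, j] = h[j]
    pairs = np.concatenate([src, dst], axis=-1)  # (N, N, 2D)
    edges = np.tanh(pairs @ w_edge + b_edge)     # (N, N, E)
    return h, edges

def gated_graph_layer(h, edges, w_self, w_nbr, w_gate):
    """Simplified gated aggregation: edge vectors gate neighbour messages."""
    gate = sigmoid(edges @ w_gate)               # (N, N, D), one gate per pair
    messages = gate * (h @ w_nbr)[None, :, :]    # messages[i, j]: from j to i
    return np.maximum(h @ w_self + messages.sum(axis=1), 0.0)

rng = np.random.default_rng(0)
num_events, emb_dim, edge_dim, num_scenes = 5, 8, 4, 3  # hypothetical sizes
h = rng.normal(size=(num_events, emb_dim))      # stand-in event embeddings
w_edge = rng.normal(size=(2 * emb_dim, edge_dim)) * 0.1
b_edge = np.zeros(edge_dim)
nodes, edges = build_event_graph(h, w_edge, b_edge)

w_self = rng.normal(size=(emb_dim, emb_dim)) * 0.1
w_nbr = rng.normal(size=(emb_dim, emb_dim)) * 0.1
w_gate = rng.normal(size=(edge_dim, emb_dim)) * 0.1
h_out = gated_graph_layer(nodes, edges, w_self, w_nbr, w_gate)

w_cls = rng.normal(size=(emb_dim, num_scenes)) * 0.1
logits = h_out.mean(axis=0) @ w_cls             # graph-level readout
scene = int(np.argmax(logits))
```

Because each edge is a vector rather than a single scalar weight, the gate can modulate every feature dimension of a neighbour's message separately, which is the intuition behind using multi-dimensional edge features.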
arXiv:2210.15366v2 [eess.AS] 1 Nov 2022