
MULTI-DIMENSIONAL EDGE-BASED AUDIO EVENT RELATIONAL GRAPH
REPRESENTATION LEARNING FOR ACOUSTIC SCENE CLASSIFICATION
Yuanbo Hou1, Siyang Song2, Chuang Yu3, Yuxin Song4, Wenwu Wang5, Dick Botteldooren1
1Ghent University, Belgium. 2University of Cambridge, UK.
3University of Manchester, UK. 4Baidu Inc., China. 5University of Surrey, UK.
ABSTRACT
Most existing deep learning-based acoustic scene classifi-
cation (ASC) approaches directly utilize representations ex-
tracted from spectrograms to identify target scenes. However,
these approaches pay little attention to the audio events occurring in the scene, even though they provide crucial semantic information. This paper conducts the first study that investigates
whether real-life acoustic scenes can be reliably recognized
based only on the features that describe a limited number of
audio events. To model the task-specific relationships be-
tween coarse-grained acoustic scenes and fine-grained audio
events, we propose an event relational graph representation
learning (ERGL) framework for ASC. Specifically, ERGL
learns a graph representation of an acoustic scene from the
input audio, where the embedding of each event is treated
as a node, while the relationship cues derived from each
pair of event embeddings are described by a learned multi-
dimensional edge feature. Experiments on a polyphonic
acoustic scene dataset show that the proposed ERGL achieves
competitive performance on ASC by using only a limited
number of audio event embeddings, without any data augmentation. The effectiveness of the proposed ERGL framework demonstrates the feasibility of recognizing diverse acoustic scenes based on the event relational graph. Our code is available on
our homepage (https://github.com/Yuanbo2020/ERGL).
Index Terms—Acoustic scene classification, audio
event, graph representation learning, multi-dimensional edge
1. INTRODUCTION
Acoustic scene classification (ASC) aims to classify an audio
clip from various sources in real scenarios into a predefined
semantic label (e.g., park, mall, or bus) [1]. ASC provides a
broad description of the surrounding environment to assist in-
telligent agents in quickly understanding the general picture
of the environment, and thus is beneficial for various applica-
tions, such as sound source recognition [2], elderly well-being
assistance [3], and audio-visual scene recognition [4].
Typical deep learning-based ASC methods consist of three steps: first, they convert the input time-domain audio stream into a time-frequency spectrogram as its acoustic features. Then, the obtained acoustic features are fed to neural networks to automatically generate task-orientated representations. Finally, the classifier recognizes the acoustic scene of the input audio stream based on such high-level representations.
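For clarity, a minimal PyTorch sketch of this three-step pipeline is given below; the sampling rate, mel-spectrogram settings, CNN layers, and ten-class scene set are illustrative assumptions rather than the configuration of any particular system.

# Minimal sketch of the typical spectrogram-based ASC pipeline: waveform ->
# log-mel spectrogram -> CNN representation -> scene classifier.
# All settings below are illustrative placeholders.
import torch
import torch.nn as nn
import torchaudio

class SpectrogramASC(nn.Module):
    def __init__(self, n_mels=64, n_scenes=10):
        super().__init__()
        # Step 1: time-domain audio -> time-frequency (log-mel) spectrogram.
        self.mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=32000, n_fft=1024, hop_length=320, n_mels=n_mels)
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        # Step 2: CNN encoder generates a task-orientated representation.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        # Step 3: a linear classifier maps the representation to scene logits.
        self.classifier = nn.Linear(64, n_scenes)

    def forward(self, waveform):                         # (batch, samples)
        x = self.to_db(self.mel(waveform)).unsqueeze(1)  # (batch, 1, mels, frames)
        h = self.encoder(x).flatten(1)                   # (batch, 64)
        return self.classifier(h)                        # (batch, n_scenes)

scene_logits = SpectrogramASC()(torch.randn(2, 32000))   # two 1-second clips at 32 kHz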
For example, following this pipeline, [5] applies a CNN-based model to the mel spectrograms of the input audio for ASC, where attention-based pooling layers are used to reduce the dimension of the representation. In [6], a spatial pyramid pooling approach is combined with a CNN to provide various resolutions for ASC. Beyond mel spectrograms, the wavelet-based deep scattering spectrum [7] has been introduced for ASC to exploit
higher-order temporal information of acoustic features by
convolutional recurrent neural networks (CRNN) with bidi-
rectional gated recurrent units. Given the intrinsic relation-
ship between acoustic scenes and audio events, some studies
jointly analyze scenes and events by relying on multi-task learning (MTL) [8, 9, 10]. To further mine the implicit relational
information between coarse-grained scenes and embedded
fine-grained events, a relation-guided ASC [11] is proposed
to guide the model to bidirectionally fuse scene-event rela-
tions for mutually beneficial scene and event classification.
However, most of the aforementioned approaches do not explicitly consider the semantically meaningful information in the acoustic scene (i.e., audio events). It is dif-
ficult to explain what types of cues in the audio stream are
utilized by these approaches to recognize the acoustic scene.
Meanwhile, it is natural for humans to recognize acoustic
scenes based on the semantically meaningful audio events
contained in them, where the occurring events and their re-
lationships vary in different acoustic scenes [12]. This paper proposes to learn, in an end-to-end manner, a multi-dimensional edge-based event relational graph that represents the acoustic scene in the input audio, encoding not only the activations of the audio events (treated as nodes in the graph) in the audio signal but also their task-specific relationships (represented as edges).
Then, the scene-dependent event relational graph is fed into
a gated graph convolutional network (Gated GCN) to extract
scene-related cues for classification. Our results show that, by relying only on a limited number of explicit audio event embeddings, the proposed ERGL can successfully build scene-dependent event relational graphs and effectively distinguish scenes.
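To illustrate the construction, the following minimal PyTorch sketch builds an event relational graph from a set of audio event embeddings, derives a multi-dimensional edge feature from each pair of embeddings, and applies one layer of edge-gated message passing before a graph readout and scene classifier; the dimensions and the single-layer design are illustrative assumptions, and the released code on our homepage should be consulted for the actual ERGL architecture.

# Minimal sketch of the event relational graph idea: each audio event contributes
# a node embedding, every ordered pair of node embeddings yields a learned
# multi-dimensional edge feature, and gated message passing followed by a readout
# produces scene logits. Dimensions and the single layer are illustrative assumptions.
import torch
import torch.nn as nn

class ERGLSketch(nn.Module):
    def __init__(self, emb_dim=128, edge_dim=16, n_scenes=10):
        super().__init__()
        self.edge_mlp = nn.Linear(2 * emb_dim, edge_dim)   # pair of nodes -> edge feature
        self.gate = nn.Linear(edge_dim, 1)                 # edge feature -> message gate
        self.message = nn.Linear(emb_dim, emb_dim)         # transform of the sender node
        self.classifier = nn.Linear(emb_dim, n_scenes)

    def forward(self, node_emb):                 # node_emb: (batch, n_events, emb_dim)
        b, n, d = node_emb.shape
        h_i = node_emb.unsqueeze(2).expand(b, n, n, d)     # receiver embedding per pair
        h_j = node_emb.unsqueeze(1).expand(b, n, n, d)     # sender embedding per pair
        edges = torch.relu(self.edge_mlp(torch.cat([h_i, h_j], dim=-1)))  # (b, n, n, edge_dim)
        gates = torch.sigmoid(self.gate(edges))                           # (b, n, n, 1)
        # Each node aggregates gated messages from all other nodes.
        agg = (gates * self.message(h_j)).mean(dim=2)                     # (b, n, emb_dim)
        nodes = torch.relu(node_emb + agg)                                # updated node states
        return self.classifier(nodes.mean(dim=1))                         # readout -> scene logits

# Usage with placeholder event embeddings (in ERGL these come from an audio event encoder).
scene_logits = ERGLSketch()(torch.randn(2, 25, 128))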
The remainder of this paper is organized as follows. Section 2 introduces the proposed ERGL. Section 3 describes the dataset and experimental setup and analyzes the results. Section 4 draws conclusions.