
Detection and Classification of Acoustic Scenes and Events 2021 Challenge
MULTI-SOURCE TRANSFORMER ARCHITECTURES
FOR AUDIOVISUAL SCENE CLASSIFICATION
Technical Report
Wim Boes, Hugo Van hamme
ESAT, KU Leuven
wim.boes@esat.kuleuven.be, hugo.vanhamme@esat.kuleuven.be
ABSTRACT
In this technical report, the systems we submitted for subtask 1B
of the DCASE 2021 challenge, regarding audiovisual scene classi-
fication, are described in detail. They are essentially multi-source
transformers employing a combination of auditory and visual fea-
tures to make predictions. These models are evaluated using the
macro-averaged multi-class cross-entropy and accuracy metrics.
In terms of the macro-averaged multi-class cross-entropy, our
best model achieved a score of 0.620 on the validation data. This is
slightly better than the performance of the baseline system (0.658).
With regard to the accuracy measure, our best model achieved
a score of 77.1% on the validation data, which is about the same as
the performance obtained by the baseline system (77.0%).
Index Terms—DCASE 2021, audiovisual scene classification,
transformer
1. INTRODUCTION
Subtask 1B of DCASE 2021 [1] is dedicated to audiovisual scene
classification. Models are ranked using the macro-averaged multi-
class cross-entropy, which is further explained in Section 3.
For this work, we were influenced by the prior use of trans-
formers in the context of environmental event classification with
audiovisual data [2]. We were also inspired by the use of
multi-source transformer architectures for machine translation
involving multiple languages [3].
In Section 2, the submitted systems are described. In Section 3,
we describe the experimental setup. Next, in Section 4, we report the
obtained results, and finally, we draw a conclusion in Section 5.
2. MODELS
In this section, the submitted models are elaborated upon.
2.1. Architecture
The architecture of the models is visualized in Figure 1.
The system takes three inputs: spectral auditory features,
pretrained auditory features and pretrained visual features. How
these are obtained from the audiovisual scene recordings is further
explained in Section 3.
The spectral auditory features are first processed by a convolu-
tional neural network (CNN), consisting of four blocks. Each block
comprises five layers: a convolutional layer, a batch normalization
layer [4], a ReLU activation layer, a dropout layer [5] (with a drop
rate of 33%) and an average pooling layer.
Each convolutional layer uses a square kernel of size 3 and a
stride of 1. The numbers of output channels are 12, 24, 48
and 96 for the first, second, third and fourth blocks, respectively.
Table 1: Kernel sizes and strides of the pooling layers in the CNN
Block    Kernel size (= stride)
0        (3, 4)
1-2      (2, 4)
3        (1, 2)
The kernel sizes and strides of the pooling operations are listed
in Table 1. The first and second numbers of each tuple relate to the
time and frequency axes respectively.
At the end of the last block, the frequency-related dimension of
the spectral input has been reduced to one and the corresponding
axis can therefore be discarded.
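For illustration, this convolutional front end could be written in PyTorch roughly as follows. The padding scheme and the single-channel spectrogram input are assumptions on our part, as they are not specified above; this is a sketch rather than our exact implementation.

import torch.nn as nn

def conv_block(in_channels, out_channels, pool):
    # One block: convolution -> batch normalization -> ReLU -> dropout (33%)
    # -> average pooling, with the pooling kernel/stride taken from Table 1.
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(),
        nn.Dropout(0.33),
        nn.AvgPool2d(kernel_size=pool, stride=pool),
    )

# Four blocks with 12, 24, 48 and 96 output channels.
cnn = nn.Sequential(
    conv_block(1, 12, (3, 4)),
    conv_block(12, 24, (2, 4)),
    conv_block(24, 48, (2, 4)),
    conv_block(48, 96, (1, 2)),
)
# Input shape: (batch, 1, time, frequency). After the last block, the frequency
# axis has size one and can be squeezed, leaving a (batch, 96, time) sequence.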
The sequences of convolutional auditory, pretrained auditory
and pretrained visual features are linearly mapped to embeddings
of size 96. The outputs of these operations are additionally passed
through dropout layers with a drop rate of 33%.
The embeddings extracted from the convolutional auditory and
pretrained visual features are then (separately) passed through a
transformer encoder consisting of three layers.
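As a sketch, these projections and encoders could look as follows in PyTorch. The input dimensionalities of the pretrained features are placeholders here, since they depend on the extractors discussed in Section 3, and the attention settings are the ones given further below (three heads, 96 feed-forward units, dropout 0.1).

import torch.nn as nn

d_model = 96

# Linear maps to embeddings of size 96, each followed by dropout.
# The input dimensionalities of the pretrained features are placeholders.
embed_conv_audio = nn.Sequential(nn.Linear(96, d_model), nn.Dropout(0.33))
embed_pre_audio = nn.Sequential(nn.Linear(512, d_model), nn.Dropout(0.33))
embed_pre_video = nn.Sequential(nn.Linear(512, d_model), nn.Dropout(0.33))

# Separate three-layer transformer encoders for the convolutional auditory
# and pretrained visual embeddings; inputs have shape (time, batch, d_model).
def make_encoder():
    layer = nn.TransformerEncoderLayer(d_model, nhead=3, dim_feedforward=96, dropout=0.1)
    return nn.TransformerEncoder(layer, num_layers=3)

audio_encoder = make_encoder()
video_encoder = make_encoder()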
Next, a multi-source serial transformer decoder with three lay-
ers is used. This module is a simple extension of the regular trans-
former decoder allowing for more than two inputs by using multiple
multi-head cross-attention blocks (in series) instead of just one.
In the considered model, the embeddings extracted from the
pretrained auditory features constitute the queries of the multi-
source serial transformer decoder. The keys and values of the first
and second multi-head cross-attention blocks in this decoder come
from the transformer encodings based on the convolutional auditory
and pretrained visual features respectively.
All transformer components described above use three attention
heads, fully connected layers with 96 units and dropout rates of 0.1.
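A minimal sketch of one layer of such a multi-source serial decoder is given below. The ordering of the two cross-attention blocks follows the description above; the use of post-layer normalization and the exact placement of the residual connections are assumptions corresponding to the standard transformer decoder layout.

import torch.nn as nn

class MultiSourceDecoderLayer(nn.Module):
    # Transformer decoder layer with two multi-head cross-attention blocks in series.
    def __init__(self, d_model=96, nhead=3, dim_feedforward=96, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.cross_attn_audio = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.cross_attn_video = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, dim_feedforward), nn.ReLU(),
            nn.Dropout(dropout), nn.Linear(dim_feedforward, d_model),
        )
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, audio_memory, video_memory):
        # queries: embedded pretrained auditory features, shape (query_length, batch, d_model).
        # audio_memory, video_memory: encoder outputs, shape (time, batch, d_model).
        x = self.norms[0](queries + self.dropout(self.self_attn(queries, queries, queries)[0]))
        x = self.norms[1](x + self.dropout(self.cross_attn_audio(x, audio_memory, audio_memory)[0]))
        x = self.norms[2](x + self.dropout(self.cross_attn_video(x, video_memory, video_memory)[0]))
        return self.norms[3](x + self.dropout(self.feed_forward(x)))

Three such layers are stacked to obtain the full decoder.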
As explained in Section 3, the pretrained auditory features fed
as input to the architecture are “sequences” of one vector. Thus, the
output of the transformer decoder also consists of just one element.
This vector is passed through a dropout layer, a linear map is
computed, and finally, the softmax activation function is applied to
obtain the output audiovisual scene probabilities.
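As a sketch, assuming the ten scene classes of the task, this classification head could look as follows; the drop rate of this final dropout layer is an assumption (chosen to match the other dropout layers).

import torch.nn as nn

head = nn.Sequential(
    nn.Dropout(0.33),   # drop rate assumed to match the earlier dropout layers
    nn.Linear(96, 10),  # ten audiovisual scene classes
    nn.Softmax(dim=-1),
)
# scene_probs = head(decoder_output.squeeze(0))  # shape: (batch, 10)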
2.2. Loss function
The loss used to train these models is the categorical cross-entropy.
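Since the network outputs softmax probabilities, this amounts to the negative log-likelihood of the target class. A minimal sketch (with a small constant added for numerical stability; names are ours):

import torch
import torch.nn.functional as F

def categorical_cross_entropy(scene_probs, targets):
    # scene_probs: (batch, num_classes) softmax outputs; targets: (batch,) class indices.
    return F.nll_loss(torch.log(scene_probs + 1e-8), targets)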