
Detection and Classification of Acoustic Scenes and Events 2021 Challenge
MULTI-SOURCE TRANSFORMER ARCHITECTURES
FOR AUDIOVISUAL SCENE CLASSIFICATION
Technical Report
Wim Boes, Hugo Van hamme
ESAT, KU Leuven
wim.boes@esat.kuleuven.be, hugo.vanhamme@esat.kuleuven.be
ABSTRACT
In this technical report, the systems we submitted for subtask 1B
of the DCASE 2021 challenge, regarding audiovisual scene classi-
fication, are described in detail. They are essentially multi-source
transformers employing a combination of auditory and visual fea-
tures to make predictions. These models are evaluated using the
macro-averaged multi-class cross-entropy and accuracy metrics.
In terms of the macro-averaged multi-class cross-entropy, our
best model achieved a score of 0.620 on the validation data. This is
slightly better than the performance of the baseline system (0.658).
With regard to the accuracy measure, our best model achieved
a score of 77.1% on the validation data, which is about the same as
the performance obtained by the baseline system (77.0%).
Index Terms—DCASE 2021, audiovisual scene classification,
transformer
1. INTRODUCTION
Subtask 1B of DCASE 2021 [1] is dedicated to audiovisual scene
classification. Models are ranked using the macro-averaged multi-
class cross-entropy, which is further explained in Section 3.
For this work, we were influenced by the prior use of trans-
formers in the context of environmental event classification with
audiovisual data [2]. We were also inspired by the use of
multi-source transformer architectures for machine translation
involving multiple languages [3].
In Section 2, the submitted systems are described. In Section 3,
we describe the experimental setup. Next, in Section 4, we report the
obtained results, and finally, we draw a conclusion in Section 5.
2. MODELS
In this section, the submitted models are elaborated upon.
2.1. Architecture
The architecture of the models is visualized in Figure 1.
The system takes three inputs: spectral auditory features,
pretrained auditory features and pretrained visual features. How
these are obtained from the audiovisual scene recordings is further
explained in Section 3.
The spectral auditory features are first processed by a convolu-
tional neural network (CNN), consisting of four blocks. Each block
comprises five layers: a convolutional layer, a batch normalization
layer [4], a ReLU activation layer, a dropout layer [5] (with a drop
rate of 33%) and an average pooling layer.
Each convolutional layer uses a square kernel of size 3 and a
stride of 1. The numbers of output channels are 12, 24, 48
and 96 for the first, second, third and fourth blocks, respectively.
Table 1: Kernel sizes and strides of the pooling layers in the CNN
Block    Kernel size (= stride)
0        (3, 4)
1-2      (2, 4)
3        (1, 2)
The kernel sizes and strides of the pooling operations are listed
in Table 1. The first and second numbers of each tuple relate to the
time and frequency axes respectively.
At the end of the last block, the frequency-related dimension of
the spectral input has been reduced to one and the corresponding
axis can therefore be discarded.
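For illustration, this convolutional front end could be written in PyTorch roughly as follows. The padding scheme and the single-channel spectrogram input are assumptions on our part, as they are not specified above; this is a sketch rather than our exact implementation.

import torch.nn as nn

def conv_block(in_channels, out_channels, pool):
    # One block: convolution -> batch normalization -> ReLU -> dropout (33%)
    # -> average pooling, with the pooling kernel/stride taken from Table 1.
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(),
        nn.Dropout(0.33),
        nn.AvgPool2d(kernel_size=pool, stride=pool),
    )

# Four blocks with 12, 24, 48 and 96 output channels.
cnn = nn.Sequential(
    conv_block(1, 12, (3, 4)),
    conv_block(12, 24, (2, 4)),
    conv_block(24, 48, (2, 4)),
    conv_block(48, 96, (1, 2)),
)
# Input shape: (batch, 1, time, frequency). After the last block, the frequency
# axis has size one and can be squeezed, leaving a (batch, 96, time) sequence.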
The sequences of convolutional auditory, pretrained auditory
and pretrained visual features are linearly mapped to embeddings
of size 96. The outputs of these operations are additionally passed
through dropout layers with a drop rate of 33%.
The embeddings extracted from the convolutional auditory and
pretrained visual features are then (separately) passed through a
transformer encoder consisting of three layers.
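As a sketch, these projections and encoders could look as follows in PyTorch. The input dimensionalities of the pretrained features are placeholders here, since they depend on the extractors discussed in Section 3, and the attention settings are the ones given further below (three heads, 96 feed-forward units, dropout 0.1).

import torch.nn as nn

d_model = 96

# Linear maps to embeddings of size 96, each followed by dropout.
# The input dimensionalities of the pretrained features are placeholders.
embed_conv_audio = nn.Sequential(nn.Linear(96, d_model), nn.Dropout(0.33))
embed_pre_audio = nn.Sequential(nn.Linear(512, d_model), nn.Dropout(0.33))
embed_pre_video = nn.Sequential(nn.Linear(512, d_model), nn.Dropout(0.33))

# Separate three-layer transformer encoders for the convolutional auditory
# and pretrained visual embeddings; inputs have shape (time, batch, d_model).
def make_encoder():
    layer = nn.TransformerEncoderLayer(d_model, nhead=3, dim_feedforward=96, dropout=0.1)
    return nn.TransformerEncoder(layer, num_layers=3)

audio_encoder = make_encoder()
video_encoder = make_encoder()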
Next, a multi-source serial transformer decoder with three lay-
ers is used. This module is a simple extension of the regular trans-
former decoder allowing for more than two inputs by using multiple
multi-head cross-attention blocks (in series) instead of just one.
In the considered model, the embeddings extracted from the
pretrained auditory features constitute the queries of the multi-
source serial transformer decoder. The keys and values of the first
and second multi-head cross-attention blocks in this decoder come
from the transformer encodings based on the convolutional auditory
and pretrained visual features respectively.
All transformer components described above use three attention
heads, fully connected layers with 96 units and dropout rates of 0.1.
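A minimal sketch of one layer of such a multi-source serial decoder is given below. The ordering of the two cross-attention blocks follows the description above; the use of post-layer normalization and the exact placement of the residual connections are assumptions corresponding to the standard transformer decoder layout.

import torch.nn as nn

class MultiSourceDecoderLayer(nn.Module):
    # Transformer decoder layer with two multi-head cross-attention blocks in series.
    def __init__(self, d_model=96, nhead=3, dim_feedforward=96, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.cross_attn_audio = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.cross_attn_video = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, dim_feedforward), nn.ReLU(),
            nn.Dropout(dropout), nn.Linear(dim_feedforward, d_model),
        )
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, audio_memory, video_memory):
        # queries: embedded pretrained auditory features, shape (query_length, batch, d_model).
        # audio_memory, video_memory: encoder outputs, shape (time, batch, d_model).
        x = self.norms[0](queries + self.dropout(self.self_attn(queries, queries, queries)[0]))
        x = self.norms[1](x + self.dropout(self.cross_attn_audio(x, audio_memory, audio_memory)[0]))
        x = self.norms[2](x + self.dropout(self.cross_attn_video(x, video_memory, video_memory)[0]))
        return self.norms[3](x + self.dropout(self.feed_forward(x)))

Three such layers are stacked to obtain the full decoder.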
As explained in Section 3, the pretrained auditory features fed
as input to the architecture are “sequences” of one vector. Thus, the
output of the transformer decoder also consists of just one element.
This vector is passed through a dropout layer, a linear map is
computed, and finally, the softmax activation function is applied to
obtain the output audiovisual scene probabilities.
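As a sketch, assuming the ten scene classes of the task, this classification head could look as follows; the drop rate of this final dropout layer is an assumption (chosen to match the other dropout layers).

import torch.nn as nn

head = nn.Sequential(
    nn.Dropout(0.33),   # drop rate assumed to match the earlier dropout layers
    nn.Linear(96, 10),  # ten audiovisual scene classes
    nn.Softmax(dim=-1),
)
# scene_probs = head(decoder_output.squeeze(0))  # shape: (batch, 10)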
2.2. Loss function
The loss used to train these models is the categorical cross-entropy.
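Since the network outputs softmax probabilities, this amounts to the negative log-likelihood of the target class. A minimal sketch (with a small constant added for numerical stability; names are ours):

import torch
import torch.nn.functional as F

def categorical_cross_entropy(scene_probs, targets):
    # scene_probs: (batch, num_classes) softmax outputs; targets: (batch,) class indices.
    return F.nll_loss(torch.log(scene_probs + 1e-8), targets)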