
PLAY IT BACK: ITERATIVE ATTENTION FOR AUDIO RECOGNITION
Alexandros Stergiou1,2,∗, Dima Damen3
1Vrije Universiteit Brussel, Belgium 2Interuniversity Microelectronics Centre, Leuven, Belgium
3University of Bristol, United Kingdom
ABSTRACT
A key function of auditory cognition is the association of
characteristic sounds with their corresponding semantics
over time. Humans attempting to discriminate between fine-grained audio categories often replay the same discriminative sounds to increase their prediction confidence. We propose an end-to-end attention-based architecture that, through selective repetition, attends to the most discriminative sounds across the audio sequence. Our model initially uses the full audio sequence and iteratively refines the temporal segments to be replayed based on slot attention. At each playback, the selected segments are replayed using a smaller hop length, which yields higher-resolution features within these segments. We
show that our method can consistently achieve state-of-the-
art performance across three audio-classification benchmarks:
AudioSet, VGG-Sound, and EPIC-KITCHENS-100. 1
Index Terms—Audio classification, playback, attention
1. INTRODUCTION
Audio recognition is the task of categorizing audio with dis-
crete labels that semantically represent the emitted sounds.
The task poses significant challenges given the similarity between object sounds (e.g. boat motors and road vehicles), musical instruments (e.g. guitar, banjo, and ukulele), human sounds (e.g. wail and groan), and animal sounds (e.g. yip and growl).
In everyday life, we repeat parts of songs or ask someone to repeat themselves to better understand audio. This relates to echoic memory, which is responsible for the memorization of sounds [1,2]. Repeated listens and replays of sound stimuli [3] are therefore an essential part of learning and associating sound patterns.
Driven by the perception of sound through echoic mem-
ory and the recent success of Vision Transformers (ViT) [4]
at utilizing global context information, we propose an end-
to-end attention-based model that recognizes sounds by discovering and playing back the most informative segments of the audio sequence, as shown in Figure 1. We use
slots [5] to attend to category-relevant sounds in the input se-
quence. These slots select the time segments to be replayed.
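As a concrete illustration of this selection step, below is a minimal slot-attention sketch in the spirit of [5], written in PyTorch; the module layout, shapes, and the deterministic slot initialization are our simplifying assumptions rather than the paper's exact implementation. The per-slot attention over time steps is the signal from which replay segments could be chosen.

import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    # Minimal slot attention (after [5]): slots compete over input tokens
    # via a softmax across slots, then are updated with a GRU cell.
    def __init__(self, num_slots, dim, iters=3):
        super().__init__()
        self.iters = iters
        self.scale = dim ** -0.5
        self.slots_init = nn.Parameter(torch.randn(1, num_slots, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)

    def forward(self, x):                         # x: (B, T, dim) audio tokens
        B, T, D = x.shape
        x = self.norm_in(x)
        k, v = self.to_k(x), self.to_v(x)
        slots = self.slots_init.expand(B, -1, -1)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=1)
            attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)
            updates = attn @ v                     # (B, num_slots, dim)
            slots = self.gru(updates.reshape(-1, D),
                             slots.reshape(-1, D)).reshape(B, -1, D)
        return slots, attn                         # attn: (B, num_slots, T)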
∗Work was done while A. Stergiou was at the University of Bristol.
1 Our code is available at: tinyurl.com/playitback2023
[Fig. 1 image: panels labelled motorcycle engine, outboard motor, Hercules beetle; salient regions slowed & played back]
Fig. 1: Playback of discriminative sounds. Given an audio sequence, the most relevant sounds are selected and played back at a reduced hop length. The generated playbacks attend solely to informative sounds at a higher temporal resolution. Coarser features from earlier playbacks are memorized alongside finer (i.e. higher temporal resolution) features from later playbacks with the use of a transformer decoder.
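One plausible reading of this memorization step, sketched with PyTorch's stock transformer decoder: features from every playback are kept side by side as the decoder memory, and a set of latent queries cross-attends over them. The shapes, the query design, and the plain concatenation are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

B, D = 8, 256
coarse = torch.randn(B, 60, D)    # features from the full-sequence pass
fine = torch.randn(B, 120, D)     # playback features: smaller hop, more frames

layer = nn.TransformerDecoderLayer(d_model=D, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)

queries = torch.randn(B, 16, D)             # learned latent queries (hypothetical)
memory = torch.cat([coarse, fine], dim=1)   # coarse kept alongside finer features
out = decoder(queries, memory)              # queries cross-attend the joint memory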
Our contributions are as follows: i) We propose to select and replay relevant audio features with decreased hop lengths, slowing down relevant parts of the audio (illustrated in the sketch after this list). ii) We propose an end-to-end transformer architecture for audio recognition that jointly selects and attends to multiple audio replays, and refines the final class predictions. iii) Our method achieves state-of-the-art performance on AudioSet [6], VGG-Sound [7], and EPIC-KITCHENS-100 [8].
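To make contribution i) concrete, the following sketch (an illustration under assumed settings, not the paper's pipeline) computes log-mel spectrograms of the same waveform at two hop lengths with torchaudio. Halving the hop length roughly doubles the number of time frames, i.e. the replayed segment is represented at a higher temporal resolution; the hop lengths and mel parameters here are assumptions.

import torch
import torchaudio

wave, sr = torch.randn(1, 16000), 16000       # stand-in for a 1 s audio clip

def log_mel(wave, sr, hop_length):
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sr, n_fft=1024, hop_length=hop_length, n_mels=128)
    return torch.log(mel(wave) + 1e-6)         # (1, n_mels, n_frames)

coarse = log_mel(wave, sr, hop_length=320)     # full-sequence pass
fine = log_mel(wave, sr, hop_length=160)       # playback at half the hop length
print(coarse.shape[-1], fine.shape[-1])        # the playback has ~2x the frames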
2. RELATED WORK
Audio recognition. A popular approach for audio classifica-
tion has been the use of convolutional networks, previously
used for image-based object recognition [9,10,11] or video
classification [12] tasks, to learn features from audio spectro-
grams. The introduction of Transformer-based architectures
has further given rise to their adaptation for audio recognition
by works relying on hybrid architectures [13,14,15]. Simi-
lar attempts have also built on image-pretrained Transformer
models for attending to audio spectrograms [16,17]. [18] incor-