PLAY IT BACK: ITERATIVE ATTENTION FOR AUDIO RECOGNITION
Alexandros Stergiou1,2,∗, Dima Damen3
1Vrije University of Brussels, Belgium 2Interuniversity Microelectronics Centre, Leuven, Belgium
3University of Bristol, United Kingdom
ABSTRACT
A key function of auditory cognition is the association of
characteristic sounds with their corresponding semantics
over time. Humans attempting to discriminate between fine-grained
audio categories often replay the same discriminative sounds to
increase their prediction confidence. We propose an end-to-end
attention-based architecture that, through selective repetition,
attends over the most discriminative sounds across
the audio sequence. Our model initially uses the full audio
sequence and iteratively refines the temporal segments re-
played based on slot attention. At each playback, the selected
segments are replayed using a smaller hop length which rep-
resents higher resolution features within these segments. We
show that our method can consistently achieve state-of-the-
art performance across three audio-classification benchmarks:
AudioSet, VGG-Sound, and EPIC-KITCHENS-100. 1
Index Terms—Audio classification, playback, attention
1. INTRODUCTION
Audio recognition is the task of categorizing audio with dis-
crete labels that semantically represent the emitted sounds.
The task is challenging because of the similarity between the
sounds of objects (e.g. boat motors and road vehicles), musical
instruments (e.g. guitar, banjo, and ukulele), humans (e.g.
wail and groan), and animals (e.g. yip and growl).
In everyday life, we repeat parts of songs or ask someone
to repeat themselves to better understand audio. This relates
to the development of echoic memory, which is responsible
for the memorization of sounds [1,2]. Therefore, repeated
listens and replays of sound stimuli [3] are an essential part
of learning and associating sound patterns.
Driven by the perception of sound through echoic mem-
ory and the recent success of Vision Transformers (ViT) [4]
at utilizing global context information, we propose an end-
to-end attention-based model that recognizes sounds through
discovering and playing back the most informative sounds
from the audio sequence, as shown in Figure 1. We use
slots [5] to attend to category-relevant sounds in the input se-
quence. These slots select the time segments to be replayed.
∗Work was done while A. Stergiou was at the University of Bristol.
1Our code is available at: tinyurl.com/playitback2023
[Figure 1 panels: motorcycle engine, outboard motor, Hercules beetle. Salient regions are slowed and played back.]
Fig. 1: Playback of discriminative sounds. Given an audio
sequence, the most relevant sounds are selected and played
back at reduced hop length. The generated playbacks attend
solely to informative sounds at a higher temporal resolution.
Coarser features from earlier playbacks are memorized along-
side finer (i.e. higher-temporal resolution) features from later
playbacks with the use of a transformer decoder.
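To illustrate why a reduced hop length gives a higher temporal resolution, the following is a minimal sketch and not the authors' implementation. It computes a plain Hann-windowed magnitude spectrogram with NumPy and compares the number of frames obtained for the same segment at two hop lengths. The sample rate (16 kHz), window length (400 samples), hop lengths (160 and 80 samples), and the selected 1 s to 2 s segment are all assumed values chosen for illustration; in the proposed model the segments are selected by slot attention rather than fixed by hand.

```python
import numpy as np

def stft_magnitude(signal, win_length=400, hop_length=160):
    """Magnitude spectrogram from a Hann-windowed short-time Fourier transform."""
    # Number of full windows that fit into the signal at this hop length.
    n_frames = 1 + (len(signal) - win_length) // hop_length
    window = np.hanning(win_length)
    frames = np.stack([
        signal[i * hop_length : i * hop_length + win_length] * window
        for i in range(n_frames)
    ])
    # Shape: (n_frames, win_length // 2 + 1)
    return np.abs(np.fft.rfft(frames, axis=1))

sr = 16000                              # assumed sample rate
rng = np.random.default_rng(0)
audio = rng.standard_normal(sr * 4)     # 4 s of noise as a stand-in for real audio

# Initial pass over the full sequence at the coarse hop length (assumed: 160).
full = stft_magnitude(audio, hop_length=160)

# Hypothetical selected segment (1 s to 2 s), hard-coded here for illustration.
segment = audio[sr * 1 : sr * 2]
coarse = stft_magnitude(segment, hop_length=160)   # segment at the original hop
replay = stft_magnitude(segment, hop_length=80)    # "playback" at half the hop

print(full.shape, coarse.shape, replay.shape)
```

Halving the hop length doubles the number of time frames covering the same one-second segment while the frequency axis is unchanged, which is the sense in which a playback is "slowed down".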
Our contributions are as follows: i) We propose to se-
lect and replay relevant audio features with decreased hop
lengths, slowing down relevant parts of the audio. ii) We pro-
pose an end-to-end transformer architecture for audio recog-
nition that jointly selects and attends to multiple audio re-
plays, and refines the final class predictions. iii) Our method
achieves state-of-the-art performance on AudioSet [6], VGG-
Sound [7], and EPIC-KITCHENS-100 [8].
2. RELATED WORK
Audio recognition. A popular approach for audio classifica-
tion has been the use of convolutional networks, previously
used for image-based object recognition [9,10,11] or video
classification [12] tasks, to learn features from audio spectro-
grams. The introduction of Transformer-based architectures
has further given rise to their adaptation for audio recognition
by works relying on hybrid architectures [13,14,15]. Simi-
lar attempts have also built on image-pretrained Transformer
models for attending audio spectrograms [16,17]. [18] incor-
arXiv:2210.11328v2 [cs.SD] 12 Mar 2023