Effective audio-visual fusion for multi-modal reasoning
over entire video frames is another major challenge for
proper utilization of the uni-modal features. Recently, sev-
eral approaches have focused on using the grounded multi-
modal features to generate temporal attention for operating
on the intra-modal feature space [37, 17, 32]. Other recent
work has applied recursive temporal attention on the aggre-
gated multi-modal features [7, 17, 31]. However, these ex-
isting approaches attempt to generalize audio-visual context
over the whole video and hence struggle to capture the local variations that are particularly significant at
event transitions. Though generalized multi-modal context
over long intervals is of great importance for categorizing
diverse events, local changes of multi-modal features are
critical for precise event detection at transition edges. To
solve this dilemma, we introduce a multi-window temporal
transformer-based fusion scheme that operates on different timescales, guiding attention over sharp local changes with short temporal windows while extracting global context across long temporal windows.
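To make this concrete, the sketch below shows one way such multi-window temporal fusion could be realized in PyTorch, assuming the audio and visual streams have already been fused into a single sequence of shape (batch, T, dim); the module name, the window sizes, and the averaging-based aggregation are illustrative assumptions rather than the exact AVE-CLIP design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiWindowTemporalFusion(nn.Module):
    """Attend over fused audio-visual features at several timescales."""

    def __init__(self, dim=256, heads=4, window_sizes=(2, 4, 10)):
        super().__init__()
        self.window_sizes = window_sizes
        # One transformer encoder layer per timescale (illustrative choice).
        self.encoders = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            for _ in window_sizes
        ])

    def forward(self, x):                      # x: (batch, T, dim)
        b, t, d = x.shape
        outputs = []
        for w, enc in zip(self.window_sizes, self.encoders):
            pad = (w - t % w) % w              # pad T to a multiple of w
            xp = F.pad(x, (0, 0, 0, pad))
            # Fold time into windows of length w so attention stays local.
            chunks = xp.reshape(b * (t + pad) // w, w, d)
            out = enc(chunks).reshape(b, t + pad, d)[:, :t]
            outputs.append(out)
        # Aggregate the per-timescale views (simple average here).
        return torch.stack(outputs, dim=0).mean(dim=0)

# Example: ten 1-second segments with 256-dimensional fused features.
# fused = MultiWindowTemporalFusion()(torch.randn(8, 10, 256))  # -> (8, 10, 256)

Short windows restrict attention to immediate neighbors, which emphasizes sharp local changes, while a window spanning all segments recovers the usual global self-attention.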
The background class representing uncorrelated audio-
visual frames varies considerably across AVEs due to their diverse surroundings (Figure 1). In many cases, it becomes difficult
to distinguish the background from the event regions due to
subtle variations [37]. Xu et al. [30] suggest that joint bi-
nary classification of the event regions (event/background)
along with the multi-class event prediction improves over-
all performance through better discrimination of event-oriented features. Inspired by this, we introduce a temporal
feature refinement scheme that guides temporal attention over the event regions to sharpen their contrast with the background. Moreover, we propose a simple post-processing algorithm that filters out spurious predictions around event transitions by exploiting the high temporal locality of event/background frames in AVEs (Figure 1). By
unifying these strategies in the AVE-CLIP framework, we
achieve state-of-the-art performance on the AVE dataset, outperforming existing approaches by a considerable margin.
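For illustration, the post-processing step can be viewed as a temporal smoothing of the per-segment predictions; the single-pass rule below, which overwrites an isolated label that disagrees with two identical neighbors, is an assumed instance of such filtering rather than necessarily the exact algorithm used in AVE-CLIP.

# Illustrative temporal smoothing of per-segment predictions (an assumed
# rule): because event/background labels are highly temporally local, a
# prediction that differs from both of its identical neighbors is treated
# as spurious and overwritten with the neighboring label.
def smooth_predictions(labels):
    """labels: per-segment class labels, e.g. [0, 0, 3, 0, 0, 5, 5, 0, 5, 5]."""
    smoothed = list(labels)
    for t in range(1, len(labels) - 1):
        prev_label, next_label = smoothed[t - 1], labels[t + 1]
        if prev_label == next_label and smoothed[t] != prev_label:
            smoothed[t] = prev_label      # remove the isolated flip
    return smoothed

# smooth_predictions([0, 0, 3, 0, 0, 5, 5, 0, 5, 5])
# -> [0, 0, 0, 0, 0, 5, 5, 5, 5, 5]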
The major contributions of this work are summarized as
follows:
• We introduce AVE-CLIP to exploit AudioCLIP pre-
trained on large-scale audio-image pairs for improving
inter-modal feature correspondence on video AVEs.
• We propose a multi-window temporal transformer-based fusion scheme that operates on different
timescales of AVE frames to extract local and global
variations of multi-modal features.
• We introduce a temporal feature refinement scheme
through event-guided temporal attention, followed by a simple yet effective post-processing method to increase contrast with the background.
2. Related Work
Audio-Visual Event Localization
AVE localization, introduced by Tian et al. [24], targets
the identification of different types of events (e.g., indi-
vidual man/woman speaking, crying babies, frying food,
musical instruments) at each temporal instance based
on audio-visual correspondence. The authors introduced a
residual learning method with LSTM-guided audio-visual
attention relying on simple concatenation and addition fu-
sion. A dual attention matching (DAM) module is intro-
duced by Wu et al. [28] for operating on event-relevant fea-
tures. Zhou et al. [37] proposed a positive sample propa-
gation scheme by pruning out the weaker multi-modal in-
teractions. Xuan et al. [32, 33] proposed a discriminative
multi-modal attention module for sequential learning with
an eigenvalue-based objective function. Duan et al. [7]
introduced joint co-learning with cyclic attention over the
aggregated multi-modal features. Lin and Wang [17] intro-
duced a transformer-based approach that operates on groups
of video frames based on audio-visual attention. Xu et
al. [30] introduced multi-modal relation-aware audio-visual
representation learning with an interaction module. Differ-
ent from existing approaches, AVE-CLIP exploits temporal
features from various windows by extracting short- and long-
range multi-modal interactions along with temporal refine-
ment of the event frames.
Sound Source Localization
The sound source localization task [35] identifies the sound-
ing object in the corresponding video based on the audi-
tory signal. Arda et al. [22] introduced an audio-visual
classification model that can be adapted for sound source
localization without explicit training by utilizing simple
multi-modal attention. Wu et al. [27] proposed an encoder-
decoder-based framework to operate on the continuous fea-
ture space through likelihood measurements of the sounding
sources. Qian et al. [18] attempted multiple source localiza-
tion by exploiting gradient-weighted class activation map
(Grad-CAM) correspondence on the audio-visual signal. A
self-supervised audio-visual matching scheme is introduced
by Hu et al. [15] with dictionary learning of the sound-
ing objects. Afouras et al. [1] utilized optical flow features
along with multi-modal attention maps targeting both source
localization and audio source separation.
Large-Scale Contrastive Pre-training
To improve data efficiency on diverse target tasks, large-
scale pre-training of very deep neural networks has been
found to be effective for transfer learning [16]. CLIP in-
troduced vision-language pre-training with self-supervised
contrastive learning on large-scale datasets, an approach