AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio
Visual Event Localization
Tanvir Mahmud
The University of Texas at Austin
tanvirmahmud@utexas.edu
Diana Marculescu
The University of Texas at Austin
dianam@utexas.edu
Abstract
An audio-visual event (AVE) is denoted by the correspon-
dence of the visual and auditory signals in a video segment.
Precise localization of AVEs is challenging because it
demands effective multi-modal feature correspondence to
ground both short- and long-range temporal interactions.
Existing approaches struggle to capture these different scales
of multi-modal interaction due to ineffective multi-modal
training strategies. To overcome this limitation, we intro-
duce AVE-CLIP, a novel framework that integrates the Au-
dioCLIP pre-trained on large-scale audio-visual data with
a multi-window temporal transformer to effectively operate
on different temporal scales of video frames. Our contribu-
tions are three-fold: (1) We introduce a multi-stage train-
ing framework to incorporate AudioCLIP pre-trained with
audio-image pairs into the AVE localization task on video
frames through contrastive fine-tuning, effective mean video
feature extraction, and multi-scale training phases. (2)
We propose a multi-domain attention mechanism that op-
erates on both temporal and feature domains over varying
timescales to fuse the local and global feature variations.
(3) We introduce a temporal refining scheme with event-
guided attention followed by a simple yet effective
post-processing step to handle significant variations of the
background over diverse events. Our method achieves
state-of-the-art performance on the publicly available AVE
dataset, improving mean accuracy by 5.9% over existing
approaches.
1. Introduction
Temporal reasoning of multi-modal data plays a signifi-
cant role in human perception in diverse environmental con-
ditions [10, 38]. Grounding the multi-modal context is crit-
ical to current and future tasks of interest, especially those
that guide current research efforts in this space, e.g., embod-
ied perception of automated agents [29, 4, 8], human-robot
interaction with multi-sensor guidance [25, 6, 2], and active
sound source localization [35, 22, 18, 27]. Similarly, audio-
visual event (AVE) localization demands complex multi-modal
correspondence of grounded audio-visual perception [24, 7].
The simultaneous presence of audio and visual cues in a video
segment denotes an audio-visual event. As shown in Fig. 1,
the person's speech is audible in all of the frames, but the
speaker is visible in only a few particular frames, and only
those frames constitute the AVE. Precise detection of such
events therefore depends heavily on contextual understanding
of the multi-modal features over the video frames.

Figure 1. Example of an audio-visual event (AVE) representing
an individual speaking. The person's voice is audible in all of
the frames; an AVE is identified only when the person is also
visible.
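To make the localization target in Fig. 1 concrete, the following minimal sketch shows one way to represent per-segment AVE labels and recover contiguous event intervals from them. It assumes the common AVE-dataset convention of ten one-second segments per clip, which is not stated in this excerpt; the label strings and helper name are purely illustrative.

```python
from itertools import groupby

# Per-segment labels for a 10-second clip (one label per 1-second segment).
# "bg" marks background segments where audio and visual cues do not co-occur.
labels = ["bg", "bg", "speech", "speech", "speech", "speech", "bg", "bg", "bg", "bg"]

def event_intervals(labels):
    """Collapse per-segment labels into (event, start_sec, end_sec) tuples."""
    intervals, t = [], 0
    for event, run in groupby(labels):
        length = len(list(run))
        if event != "bg":
            intervals.append((event, t, t + length))
        t += length
    return intervals

print(event_intervals(labels))  # [('speech', 2, 6)]
```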
Learning the inter-modal audio-visual feature correspon-
dence over the video frames is one of the major challenges
of AVE localization. Effective multi-modal training strate-
gies can significantly improve performance by enhancing
the relevant features. Earlier work integrates audio and
image encoders pre-trained on large-scale unimodal (im-
age/audio) datasets [5, 9] to improve performance [36, 17,
32, 7, 30]. However, such a uni-modal pre-training scheme
struggles to extract relevant inter-modal features that are
particularly significant for AVEs. Recently, following the
widespread success of CLIP [19] pre-trained on large-scale
vision-language datasets, AudioCLIP [12] has integrated an
audio encoder into the vision-language model with large-
scale pre-training on audio-image pairs. To enhance the
audio-visual feature correspondence for AVEs, we integrate
the image and audio encoders from AudioCLIP with ef-
fective contrastive fine-tuning that exploits the large-scale
pre-trained knowledge from multi-modal datasets instead of
uni-modal ones.
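As a rough illustration of the contrastive fine-tuning mentioned above, the sketch below computes a symmetric, CLIP-style contrastive loss over a batch of matched audio-image embeddings. The encoders themselves are not shown, and the embedding size, temperature, and tensor shapes are assumptions rather than details taken from AVE-CLIP or AudioCLIP.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioImageContrastiveLoss(nn.Module):
    """Symmetric InfoNCE loss over matched audio-image pairs (CLIP-style)."""

    def __init__(self, temperature: float = 0.07):
        super().__init__()
        self.temperature = temperature

    def forward(self, audio_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        # audio_emb, image_emb: (batch, dim); row i of each tensor is a matched pair.
        audio_emb = F.normalize(audio_emb, dim=-1)
        image_emb = F.normalize(image_emb, dim=-1)
        logits = audio_emb @ image_emb.t() / self.temperature  # (batch, batch)
        targets = torch.arange(logits.size(0), device=logits.device)
        # Cross-entropy in both directions: audio-to-image and image-to-audio.
        loss_a2i = F.cross_entropy(logits, targets)
        loss_i2a = F.cross_entropy(logits.t(), targets)
        return 0.5 * (loss_a2i + loss_i2a)

# Example with placeholder embeddings (shapes are assumed, not from the paper).
audio_feats = torch.randn(8, 1024)
image_feats = torch.randn(8, 1024)
loss = AudioImageContrastiveLoss()(audio_feats, image_feats)
print(loss.item())
```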
Effective audio-visual fusion for multi-modal reasoning
over entire video frames is another major challenge for
proper utilization of the uni-modal features. Recently, sev-
eral approaches have focused on using the grounded multi-
modal features to generate temporal attention for operating
on the intra-modal feature space [37, 17, 32]. Other recent
work has applied recursive temporal attention on the aggre-
gated multi-modal features [7, 17, 31]. However, these ex-
isting approaches attempt to generalize audio-visual context
over the whole video frame and hence struggle to extract
local variational patterns that are particularly significant at
event transitions. Though generalized multi-modal context
over long intervals is of great importance for categorizing
diverse events, local changes of multi-modal features are
critical for precise event detection at transition edges. To
solve this dilemma, we introduce a multi-window temporal
transformer-based fusion scheme that operates on different
timescales to guide attention over sharp local changes with
short temporal windows, as well as extract the global con-
text across long temporal windows.
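The sketch below illustrates the multi-window idea in simplified form: the same fused audio-visual sequence is processed by self-attention restricted to short, non-overlapping local windows and, in parallel, by attention over the full sequence, and the two temporal views are merged. It is a stand-in for the scheme described above, not the paper's exact architecture; the window length, feature dimension, and the linear merge layer are assumptions.

```python
import torch
import torch.nn as nn

class MultiWindowTemporalAttention(nn.Module):
    """Self-attention over short local windows plus a full-length global window."""

    def __init__(self, dim: int = 256, heads: int = 4, local_window: int = 2):
        super().__init__()
        self.local_window = local_window
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.merge = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) fused audio-visual features;
        # assumes time is divisible by local_window.
        b, t, d = x.shape
        w = self.local_window
        # Local view: attend only within non-overlapping windows of length w.
        local = x.reshape(b * t // w, w, d)
        local, _ = self.local_attn(local, local, local)
        local = local.reshape(b, t, d)
        # Global view: attend across the whole sequence.
        global_, _ = self.global_attn(x, x, x)
        # Fuse the two temporal scales.
        return self.merge(torch.cat([local, global_], dim=-1))

fused = torch.randn(2, 10, 256)   # e.g. 10 one-second segments per clip (assumed)
out = MultiWindowTemporalAttention()(fused)
print(out.shape)                  # torch.Size([2, 10, 256])
```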
The background class, representing uncorrelated audio-visual
frames, varies considerably across different AVEs and diverse
surroundings (Figure 1). In many cases, it becomes difficult
to distinguish the background from the event regions due to
subtle variations [37]. Xu et al. [30] suggest that joint binary
classification of the event regions (event/background) along
with multi-class event prediction improves overall performance
by better discriminating event-oriented features. Inspired by
this, we introduce a temporal feature refining scheme that
guides temporal attention over the event regions to introduce
sharp contrast with the background. Moreover, we introduce
a simple post-processing algorithm that filters out isolated
incorrect predictions between event transitions by exploiting
the high temporal locality of event/background frames in
AVEs (Figure 1). By
unifying these strategies in the AVE-CLIP framework, we
achieve state-of-the-art performance on the AVE dataset
which outperforms existing approaches by a considerable
margin.
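As a concrete illustration of the post-processing step, the following minimal filter replaces any single-segment prediction that disagrees with two agreeing temporal neighbours, exploiting the high temporal locality of event/background frames noted above. It is a plausible sketch of such a filter under that assumption, not necessarily the exact algorithm used in AVE-CLIP.

```python
def smooth_predictions(preds):
    """Flip any single-segment prediction whose two neighbours agree with each other.

    preds: list of per-segment labels (event class names or "bg").
    Relies on event/background frames occurring in contiguous runs.
    """
    smoothed = list(preds)
    for t in range(1, len(preds) - 1):
        left, mid, right = preds[t - 1], preds[t], preds[t + 1]
        if left == right and mid != left:
            smoothed[t] = left  # isolated outlier between two agreeing neighbours
    return smoothed

raw = ["bg", "speech", "bg", "speech", "speech", "speech", "bg", "bg", "bg", "bg"]
print(smooth_predictions(raw))
# ['bg', 'bg', 'speech', 'speech', 'speech', 'speech', 'bg', 'bg', 'bg', 'bg']
```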
The major contributions of this work are summarized as
follows:
• We introduce AVE-CLIP to exploit AudioCLIP, pre-trained
  on large-scale audio-image pairs, for improving inter-modal
  feature correspondence on video AVEs.
• We propose a multi-window temporal transformer-based
  fusion scheme that operates on different timescales of AVE
  frames to extract local and global variations of multi-modal
  features.
• We introduce a temporal feature refinement scheme through
  event-guided temporal attention followed by a simple yet
  effective post-processing method to increase contrast with
  the background.
2. Related Work
Audio Visual Event Localization
AVE localization, introduced by Tian et al. [24], targets
the identification of different types of events (e.g., indi-
vidual man/woman speaking, crying babies, frying food,
musical instruments, etc.) at each temporal instance based
on audio-visual correspondence. The authors introduced a
residual learning method with LSTM-guided audio-visual
attention relying on simple concatenation and addition fu-
sion. A dual attention matching (DAM) module is intro-
duced by Wu et al. [28] for operating on event-relevant fea-
tures. Zhou et al. [37] proposed a positive sample propa-
gation scheme by pruning out the weaker multi-modal in-
teractions. Xuan et al. [32, 33] proposed a discriminative
multi-modal attention module for sequential learning with
an eigenvalue-based objective function. Duan et al. [7]
introduced joint co-learning with cyclic attention over the
aggregated multi-modal features. Lin and Wang [17] intro-
duced a transformer-based approach that operates on groups
of video frames based on audio-visual attention. Xu et
al. [30] introduced multi-modal relation-aware audio-visual
representation learning with an interaction module. Differ-
ent from existing approaches, AVE-CLIP exploits temporal
features from various windows by extracting short and long
range multi-modal interactions along with temporal refine-
ment of the event frames.
Sound Source Localization
The sound source localization task [35] identifies the sound-
ing object in the corresponding video based on the audi-
tory signal. Arda et al. [22] introduced an audio-visual
classification model that can be adapted for sound source
localization without explicit training by utilizing simple
multi-modal attention. Wu et al. [27] proposed an encoder-
decoder based framework to operate on the continuous fea-
ture space through likelihood measurements of the sounding
sources. Qian et al. [18] attempted multiple source localiza-
tion by exploiting gradient weighted class activation map
(Grad-CAM) correspondence on the audio-visual signal. A
self-supervised audio-visual matching scheme is introduced
by Hu et al. [15] with dictionary learning of the sound-
ing objects. Afouras et al. [1] utilized optical flow features
along with multimodal attention maps targeting both source
localization and audio source separation.
Large Scale Contrastive Pre-training
To improve the data-efficiency on diverse target tasks, large-
scale pre-training of very deep neural networks has been
found to be effective for transfer learning [16]. CLIP has in-
troduced vision-language pre-training with self-supervised
contrastive learning on large-scale datasets, an approach