Effective audio-visual fusion for multi-modal reasoning
over entire video frames is another major challenge for
proper utilization of the uni-modal features. Recently, sev-
eral approaches have focused on using the grounded multi-
modal features to generate temporal attention for operating
on the intra-modal feature space [37, 17, 32]. Other recent
work has applied recursive temporal attention on the aggre-
gated multi-modal features [7, 17, 31]. However, these ex-
isting approaches attempt to generalize audio-visual context
over the whole video and hence struggle to capture the local variations that are particularly significant at
event transitions. Though generalized multi-modal context
over long intervals is of great importance for categorizing
diverse events, local changes of multi-modal features are
critical for precise event detection at transition edges. To
solve this dilemma, we introduce a multi-window temporal
transformer-based fusion scheme that operates on different timescales, guiding attention over sharp local changes with short temporal windows while extracting global context across long temporal windows.
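To make this concrete, the sketch below shows one way such multi-window temporal fusion could be realized in PyTorch, assuming the audio and visual streams have already been fused into a single sequence of shape (batch, T, dim); the module name, the window sizes, and the averaging-based aggregation are illustrative assumptions rather than the exact AVE-CLIP design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiWindowTemporalFusion(nn.Module):
    """Attend over fused audio-visual features at several timescales."""

    def __init__(self, dim=256, heads=4, window_sizes=(2, 4, 10)):
        super().__init__()
        self.window_sizes = window_sizes
        # One transformer encoder layer per timescale (illustrative choice).
        self.encoders = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            for _ in window_sizes
        ])

    def forward(self, x):                      # x: (batch, T, dim)
        b, t, d = x.shape
        outputs = []
        for w, enc in zip(self.window_sizes, self.encoders):
            pad = (w - t % w) % w              # pad T to a multiple of w
            xp = F.pad(x, (0, 0, 0, pad))
            # Fold time into windows of length w so attention stays local.
            chunks = xp.reshape(b * (t + pad) // w, w, d)
            out = enc(chunks).reshape(b, t + pad, d)[:, :t]
            outputs.append(out)
        # Aggregate the per-timescale views (simple average here).
        return torch.stack(outputs, dim=0).mean(dim=0)

# Example: ten 1-second segments with 256-dimensional fused features.
# fused = MultiWindowTemporalFusion()(torch.randn(8, 10, 256))  # -> (8, 10, 256)

Short windows restrict attention to immediate neighbors, which emphasizes sharp local changes, while a window spanning all segments recovers the usual global self-attention.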
The background class representing uncorrelated audio-
visual frames varies considerably across AVEs due to their diverse surroundings (Figure 1). In many cases, it becomes difficult
to distinguish the background from the event regions due to
subtle variations [37]. Xu et al. [30] suggest that joint bi-
nary classification of the event regions (event/background)
along with the multi-class event prediction improves over-
all performance through better discrimination of event-oriented features. Inspired by this, we introduce a temporal
feature refinement scheme that guides temporal attention over the event regions to sharpen their contrast with the background. Moreover, we propose a simple post-processing algorithm that filters out spurious predictions around event transitions by exploiting the high temporal locality of event/background frames in AVEs (Figure 1). By
unifying these strategies in the AVE-CLIP framework, we
achieve state-of-the-art performance on the AVE dataset, outperforming existing approaches by a considerable margin.
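For illustration, the post-processing step can be viewed as a temporal smoothing of the per-segment predictions; the single-pass rule below, which overwrites an isolated label that disagrees with two identical neighbors, is an assumed instance of such filtering rather than necessarily the exact algorithm used in AVE-CLIP.

# Illustrative temporal smoothing of per-segment predictions (an assumed
# rule): because event/background labels are highly temporally local, a
# prediction that differs from both of its identical neighbors is treated
# as spurious and overwritten with the neighboring label.
def smooth_predictions(labels):
    """labels: per-segment class labels, e.g. [0, 0, 3, 0, 0, 5, 5, 0, 5, 5]."""
    smoothed = list(labels)
    for t in range(1, len(labels) - 1):
        prev_label, next_label = smoothed[t - 1], labels[t + 1]
        if prev_label == next_label and smoothed[t] != prev_label:
            smoothed[t] = prev_label      # remove the isolated flip
    return smoothed

# smooth_predictions([0, 0, 3, 0, 0, 5, 5, 0, 5, 5])
# -> [0, 0, 0, 0, 0, 5, 5, 5, 5, 5]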
The major contributions of this work are summarized as
follows:
• We introduce AVE-CLIP to exploit AudioCLIP pre-
trained on large-scale audio-image pairs for improving
inter-modal feature correspondence on video AVEs.
• We propose a multi-window temporal transformer-based fusion scheme that operates on different
timescales of AVE frames to extract local and global
variations of multi-modal features.
• We introduce a temporal feature refinement scheme
through event-guided temporal attention, followed by a simple yet effective post-processing method to increase contrast with the background.
2. Related Work
Audio-Visual Event Localization
AVE localization, introduced by Tian et al. [24], targets
the identification of different types of events (e.g., indi-
vidual man/woman speaking, crying babies, frying food,
musical instruments) at each temporal instance based
on audio-visual correspondence. The authors introduced a
residual learning method with LSTM-guided audio-visual
attention relying on simple concatenation and addition fu-
sion. A dual attention matching (DAM) module is intro-
duced by Wu et al. [28] for operating on event-relevant fea-
tures. Zhou et al. [37] proposed a positive sample propa-
gation scheme by pruning out the weaker multi-modal in-
teractions. Xuan et al. [32, 33] proposed a discriminative
multi-modal attention module for sequential learning with
an eigenvalue-based objective function. Duan et al. [7]
introduced joint co-learning with cyclic attention over the
aggregated multi-modal features. Lin and Wang [17] intro-
duced a transformer-based approach that operates on groups
of video frames based on audio-visual attention. Xu et
al. [30] introduced multi-modal relation-aware audio-visual
representation learning with an interaction module. Differ-
ent from existing approaches, AVE-CLIP exploits temporal
features from various windows by extracting short- and long-
range multi-modal interactions along with temporal refine-
ment of the event frames.
Sound Source Localization
The sound source localization task [35] identifies the sound-
ing object in the corresponding video based on the audi-
tory signal. Arda et al. [22] introduced an audio-visual
classification model that can be adapted for sound source
localization without explicit training by utilizing simple
multi-modal attention. Wu et al. [27] proposed an encoder-
decoder-based framework to operate on the continuous fea-
ture space through likelihood measurements of the sounding
sources. Qian et al. [18] attempted multiple source localiza-
tion by exploiting gradient-weighted class activation map
(Grad-CAM) correspondence on the audio-visual signal. A
self-supervised audio-visual matching scheme is introduced
by Hu et al. [15] with dictionary learning of the sound-
ing objects. Afouras et al. [1] utilized optical flow features
along with multi-modal attention maps targeting both source
localization and audio source separation.
Large-Scale Contrastive Pre-training
To improve data efficiency on diverse target tasks, large-
scale pre-training of very deep neural networks has been
found to be effective for transfer learning [16]. CLIP in-
troduced vision-language pre-training with self-supervised
contrastive learning on large-scale datasets, an approach