
VISUAL ANSWER LOCALIZATION WITH CROSS-MODAL
MUTUAL KNOWLEDGE TRANSFER
Yixuan Weng*†, Bin Li*§‡
†National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
‡College of Electrical and Information Engineering, Hunan University
ABSTRACT
The goal of visual answer localization (VAL) in videos is to obtain a relevant and concise time clip from a video as the answer to a given natural language question. Early methods are based on interaction modelling between the video and the text, predicting the visual answer with a visual predictor. Later work shows that a textual predictor operating on subtitles is more precise for VAL. However, these existing methods still suffer from cross-modal knowledge deviations between visual frames and textual subtitles. In this paper, we propose a cross-modal mutual knowledge transfer span localization (MutualSL) method to reduce this knowledge deviation. MutualSL has both a visual predictor and a textual predictor, and we expect the predictions of the two to be consistent, so as to promote semantic knowledge understanding across modalities. On this basis, we design a one-way dynamic loss function to dynamically adjust the proportion of knowledge transfer. We have conducted extensive experiments on three public datasets for evaluation. The experimental results show that our method outperforms other competitive state-of-the-art (SOTA) methods, demonstrating its effectiveness1.
Index Terms—Cross-modal, Mutual Knowledge Transfer, Visual Answer Localization
1. INTRODUCTION
The explosion of online videos has changed the way people obtain information and knowledge [1, 2]. Various video platforms make it more convenient for people to perform video queries [3, 4]. However, people who want to get direct instructions or tutorials from a video often need to browse the video content several times to locate the relevant parts, which takes time and effort [5]. Visual answer localization (VAL) is an emerging technology that addresses this problem [6], and it has received wide attention because of its practical value [7, 8]. As shown in Fig. 1(a), the task of VAL is to find a time clip that can answer a given question.
1All the experimental datasets and code are open-sourced at https://github.com/WENGSYX/MutualSL.
*: These authors contributed equally to this work.
§: Corresponding author.
Fig. 1. Task description of visual answer localization: (a) the task, with the question, the target answer, and the predictions of the textual and visual predictors; (b) the paradigm of the visual predictor; (c) the paradigm of the textual predictor; (d) our method, cross-modal mutual knowledge transfer with a one-way dynamic loss function.
For example, when inputting “How do you begin ‘automatic start’?”, one may need to find a clip according to the voice content (or the transcribed text subtitles) and the visual frames. VAL technology can not only recognize the video clips relevant to the text question but also return the target visual answer (1:32–2:18).
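To make the task input and output concrete, here is a minimal Python sketch of the VAL interface described above; the names (Subtitle, Video, localize_answer) are illustrative assumptions and are not part of MutualSL:

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

@dataclass
class Subtitle:
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # transcribed speech for this span

@dataclass
class Video:
    frames: List[Any]          # sampled visual frames (or frame features)
    subtitles: List[Subtitle]  # time-stamped transcript

def localize_answer(question: str, video: Video) -> Tuple[float, float]:
    """Return the (start, end) time clip of the video that answers the
    question. A real VAL model scores the question against the frames
    and/or subtitles; this stub only fixes the input/output contract."""
    raise NotImplementedError
```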
Existing VAL methods can be mainly divided into visual predictors and textual predictors according to their prediction contents. The paradigm of the visual predictor is shown in Fig. 1(b): frame features are first extracted from the video, and the frame features queried by the question are then used to predict the relevant time points [9, 10]. The paradigm of the textual predictor is shown in Fig. 1(c): it adopts a span-based method to model the cross-modal information, where the predicted span intervals on the subtitle timeline are used as the final results [8, 11], as sketched below.
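As a minimal illustration (not the paper's implementation), the following sketch shows how a span-based textual predictor converts a predicted subtitle span into a time clip; the function name and data layout are assumptions for the example:

```python
from typing import List, Tuple

def span_to_clip(subtitle_times: List[Tuple[float, float]],
                 start_idx: int, end_idx: int) -> Tuple[float, float]:
    """Map a predicted subtitle span [start_idx, end_idx] to a video
    time clip by reading the subtitles' own timeline."""
    clip_start = subtitle_times[start_idx][0]  # start of the first subtitle in the span
    clip_end = subtitle_times[end_idx][1]      # end of the last subtitle in the span
    return clip_start, clip_end

# Example with three timed subtitles (times in seconds):
# span_to_clip([(12.0, 24.0), (105.0, 113.0), (181.0, 197.0)], 1, 2)
# -> (105.0, 197.0)
```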
The textual predictor performs better than the visual one [7] because it uses the additional subtitle information and embeds the visual information into the text feature space as an auxiliary feature. However, as shown in Fig. 1(a), results from both predictors suffer from cross-modal knowledge deviations. For the textual pre-