
LEARNING TO LOCATE VISUAL ANSWER IN VIDEO CORPUS USING QUESTION
Bin Li*†, Yixuan Weng*‡, Bin Sun†, Shutao Li†?
†College of Electrical and Information Engineering, Hunan University
‡National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
ABSTRACT
We introduce a new task, named video corpus visual answer
localization (VCVAL), which aims to locate the visual an-
swer in a large collection of untrimmed instructional videos
using a natural language question. This task requires a range
of skills: the interaction between vision and language, video
retrieval, passage comprehension, and visual answer local-
ization. In this paper, we propose a cross-modal contrastive
global-span (CCGS) method for the VCVAL, jointly train-
ing the video corpus retrieval and visual answer localization
subtasks with the global-span matrix. We have reconstructed
a dataset named MedVidCQA, on which the VCVAL task is
benchmarked. Experimental results show that the proposed
method outperforms other competitive methods both in the
video corpus retrieval and visual answer localization sub-
tasks. Most importantly, we perform detailed analyses on
extensive experiments, paving a new path for understanding
the instructional videos, which ushers in further research1.
Index Terms—Video corpus, visual answer localization
1. INTRODUCTION
In recent years, the popularity of video platforms has enriched people's lives [1, 2]. People can easily use natural language to query videos, but when it comes to instructional or educational questions, more visual details are often needed to aid comprehension [3, 4]. Thus, an increasing number of users expect their queries to be answered directly with intuitive video clips. Different from traditional video question answering [5], visual answer localization (VAL), which aims to provide a visual answer directly, is more efficient and has received extensive attention from researchers [6, 7].
Medical video question answering (MedVidQA) [3] pioneered the VAL task within a single untrimmed video, employing medical experts to annotate the corpus manually.
1All the experimental datasets and codes are open-sourced on the website
https://github.com/WENGSYX/CCGS.
This work is supported by the National Natural Science Fund of China
(62221002, 62171183), the Hunan Provincial Natural Science Foundation of
China (2022JJ20017), and partially sponsored by CAAI-Huawei MindSpore
Open Fund.
*: These authors contributed equally to this work.
?: Corresponding author.
Fig. 1. Illustration of video corpus visual answer localization in a medical instructional video, where the visual answer with its subtitles is highlighted in the yellow box.
However, it is oriented towards a single video and presumes that the visual answer exists in the given video, which greatly limits human-machine interaction in large-scale video collections [8, 9]. Therefore, we extend this setting and introduce a new video corpus visual answer localization (VCVAL) task, shown in Fig. 1. Specifically, we reconstruct the MedVidQA dataset into the medical video corpus question answering (MedVidCQA) dataset to perform the VCVAL task. The goal of VCVAL is to find, for a given question, the matching visual answer span from a large-scale video corpus. The VCVAL task is more challenging than the original VAL task because it requires not only retrieving the target video from a large-scale video corpus, but also accurately locating the visual answer within the retrieved video. This task requires a range of skills: interaction between vision and language, video retrieval, passage comprehension, and visual answer localization. To complete the VCVAL task, two problems need to be addressed. The first is feature inconsistency in cross-video retrieval, where retrieval performance degrades because the features of the video contents differ from those of the given question [10]. The second is the semantic gap in cross-modal modeling between the retrieved video and the question before visual answer localization, which degrades downstream performance [11].
To solve the aforementioned two problems, we propose a cross-modal contrastive global-span (CCGS) method, jointly training the video corpus retrieval and visual answer localization subtasks in an end-to-end manner. Specifically, to alleviate the retrieval errors caused by feature inconsistency, we design the