
LEARNING TO LOCATE VISUAL ANSWER IN VIDEO CORPUS USING QUESTION
Bin Li*†, Yixuan Weng*‡, Bin Sun†, Shutao Li†?
†College of Electrical and Information Engineering, Hunan University
‡National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
ABSTRACT
We introduce a new task, named video corpus visual answer
localization (VCVAL), which aims to locate the visual an-
swer in a large collection of untrimmed instructional videos
using a natural language question. This task requires a range
of skills: the interaction between vision and language, video
retrieval, passage comprehension, and visual answer local-
ization. In this paper, we propose a cross-modal contrastive
global-span (CCGS) method for the VCVAL, jointly train-
ing the video corpus retrieval and visual answer localization
subtasks with the global-span matrix. We have reconstructed
a dataset named MedVidCQA, on which the VCVAL task is
benchmarked. Experimental results show that the proposed
method outperforms other competitive methods both in the
video corpus retrieval and visual answer localization sub-
tasks. Most importantly, we perform detailed analyses on
extensive experiments, paving a new path for understanding
the instructional videos, which ushers in further research1.
Index Terms—Video corpus, visual answer localization
1. INTRODUCTION
In recent years, the popularity of video platforms has enriched people's lives [1, 2]. People can easily use natural language to query videos, but when it comes to instructional or educational questions, more visual details are often needed to aid comprehension [3, 4]. Thus, an increasing number of users expect their queries to be answered directly with intuitive video clips. Different from traditional video question answering [5], visual answer localization (VAL), which aims to provide a visual answer directly, is more efficient and has received extensive attention from researchers [6, 7].
Medical video question answering (MedVidQA) [3] pioneered the VAL task within a single untrimmed video, employing medical experts to annotate the corpus manually.
1All the experimental datasets and codes are open-sourced on the website
https://github.com/WENGSYX/CCGS.
This work is supported by the National Natural Science Fund of China
(62221002, 62171183), the Hunan Provincial Natural Science Foundation of
China (2022JJ20017), and partially sponsored by CAAI-Huawei MindSpore
Open Fund.
*: These authors contributed equally to this work.
?: Corresponding author.
Fig. 1. Illustration of video corpus visual answer localization in a medical instructional video, where the visual answer with its subtitles is highlighted in the yellow box.
However, it is oriented towards a single video and presumes that the visual answer exists in the given video, which greatly limits human-machine interaction in large-scale video collections [8, 9]. Therefore, we extend this setting and introduce a new video corpus visual answer localization (VCVAL) task, shown in Fig. 1. Specifically, we reconstruct the MedVidQA dataset into the medical video corpus question answering (MedVidCQA) dataset to perform the VCVAL task. The goal of VCVAL is to find, for a given question, the matching visual answer span from a large-scale video corpus. The VCVAL task is more challenging than the original VAL task because it requires not only retrieving the target video from a large-scale video corpus, but also accurately locating the visual answer within the retrieved video. This task requires a range of skills: interaction between vision and language, video retrieval, passage comprehension, and visual answer localization. To complete the VCVAL task, two problems need to be addressed. The first is feature inconsistency in cross-video retrieval, where retrieval performance degrades because the features of the video contents differ from those of the given question [10]. The second is the semantic gap in cross-modal modeling between the retrieved video and the question before visual answer localization, which degrades downstream performance [11].
To solve the aforementioned two problems, we propose a cross-modal contrastive global-span (CCGS) method, jointly training the video corpus retrieval and visual answer localization subtasks in an end-to-end manner. Specifically, to alleviate the retrieval errors caused by feature inconsistency, we design the